Showing posts with label language. Show all posts

13 Jun 2012

Rails i18n translations in Yaml: translation tool support

With Rails 2.2 the i18n API was introduced with a new method for translations.  Instead of embracing the venerable gettext which had been the previous standard, the Rails team invented a new way using Yaml files.  The result is a particularly graceful, flexible and very Rubylike way of specifying translations.  It also is much more reliable than gettext, which had many inscrutable issues with locales and caching, and sometimes caused people to get things in the wrong language.  So: bravo, great job.

But to do this, they specified their own translation format, the very flexible Yaml file. There are already a lot of formats floating around, and translation tool vendors and open-source translation developers have been working for a long time on conversion tools between them.  The Translate Toolkit and Pootle emerged from South Africa (a country which groans beneath the weight of, or rather revels in the glory of, eleven official languages); together they provide an excellent web-based tool for collaboration, centered around gettext PO files.  However, poor little Pootle started a migration from Python to Django, and we all know how rewrites go.  [Halfway. Badly.]  But Translate Toolkit supported a lot of formats:

  • moz2po - Mozilla .properties and .dtd converter. Works with Firefox and Thunderbird
  • oo2po - OpenOffice.org SDF converter (See also oo2xliff).
  • odf2xliff - Convert OpenDocument (ODF) documents to XLIFF and vice-versa.
  • prop2po - Java property file (.properties) converter
  • php2po - PHP localisable string arrays converter.
  • sub2po - Converter for various subtitle files
  • txt2po - Plain text to PO converter
  • po2wordfast - Wordfast Translation Memory converter
  • po2tmx - TMX (Translation Memory Exchange) converter
  • pot2po - initialise PO Template files for translation
  • csv2po - Comma Separated Value (CSV) converter. Useful for doing translations using a spreadsheet.
  • csv2tbx - Create TBX (TermBase eXchange) files from Comma Separated Value (CSV) files
  • html2po - HTML converter
  • ical2po - iCalendar file converter
  • ini2po - Windows INI file converter
  • json2po - JSON file converter
  • web2py2po - web2py translation to PO converter
  • rc2po - Windows Resource .rc (C++ Resource Compiler) converter
  • symb2po - Symbian-style translation to PO converter
  • tiki2po - TikiWiki language.php converter
  • ts2po - Qt Linguist .ts converter
  • xliff2po - XLIFF (XML Localisation Interchange File Format) converter

On its heels, Google introduced the Google Translator Toolkit, which lets you use the Google Translate engine to suggest translations (based on its own databases or translation memories you provide).  It also does the core of what Pootle does (collaboration and access) but without the crashing and flakiness, and it works with a range of file formats.

But neither of them supports Yaml files.  Unfortunately, translation tools and support libraries have not embraced this format in the intervening two and a half years.  I did find one solution: i18n-translators-tools, which supports conversion between Yaml and gettext PO files, but it's somewhat idiosyncratic, and it turns out there's a good reason why there isn't a straightforward Yaml-to-PO converter: the PO format consists of name-value pairs with metadata, and the Yaml format is a tree.

English source Yaml file:

page_info:
  sales/credit_notes:
    date: "Date"
    title:
      default: "Sales Credit Note"
      new: "New Sales Credit Note"

Spanish Yaml file produced by i18n-translators-tools from a PO file:

page_info:
  sales/credit_notes:
    date: "Fecha"
    title:
      default:
        default: "Sales Credit Note"
        translation: "Crédito de venta"
      new:
        default: "New Sales Credit Note"
        translation: "New Sales Credit Note"


There are some interesting things going on here: the Spanish Yaml file provides fallbacks so untranslated strings don't come through as blank.  The intermediate gettext PO file keeps the tree structure in the msgctxt metadata, and looks like this:

msgctxt "page_info.sales/credit_notes.title.default"
msgid "Sales Credit Note"
msgstr "Crédito de venta"

msgctxt "page_info.sales/credit_notes.title.new"
msgid "New Sales Credit Note"
msgstr "New Sales Credit Note"
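To make the tree-versus-pairs mismatch concrete, here is a minimal Python sketch of the flattening step: walk the Yaml tree, join the keys with dots, and write each leaf as a PO entry with the dotted path in msgctxt. This is not how i18n-translators-tools is implemented; the function names (flatten, po_escape, to_po) are made up for illustration, the Yaml is written as a plain dict so the example needs no Yaml parser, and untranslated entries get an empty msgstr (the usual gettext convention) rather than a copy of the English string.

```python
def flatten(tree, prefix=""):
    """Yield (dotted_key, string) pairs for every leaf of a nested dict."""
    for key, value in tree.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value


def po_escape(text):
    """Escape backslashes, double quotes and newlines for a PO string."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def to_po(source, translations=None):
    """Render a source tree as PO entries, keeping the tree path in msgctxt.

    `translations` is a hypothetical {dotted_key: translated_string} mapping;
    keys missing from it are written with an empty msgstr.
    """
    translations = translations or {}
    entries = []
    for path, english in flatten(source):
        msgstr = translations.get(path, "")
        entries.append(
            f'msgctxt "{po_escape(path)}"\n'
            f'msgid "{po_escape(english)}"\n'
            f'msgstr "{po_escape(msgstr)}"\n'
        )
    return "\n".join(entries)


# The English sample from above, written as a dict instead of parsed Yaml.
english = {
    "page_info": {
        "sales/credit_notes": {
            "date": "Date",
            "title": {
                "default": "Sales Credit Note",
                "new": "New Sales Credit Note",
            },
        }
    }
}

spanish = {"page_info.sales/credit_notes.title.default": "Crédito de venta"}

po_text = to_po(english, spanish)
print(po_text)
```

Going the other way means splitting msgctxt on the dots and rebuilding the nested structure, which only works as long as no key contains a dot.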

So it's possible to use Google Translator Toolkit to translate your Rails Yaml files, provided you use the i18n-translators-tools library to do the conversions, and configure your Rails application to support fallbacks.
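The fallback idea itself is easy to sketch. Assuming the leaf layout shown in the Spanish sample (a small mapping with a "default" and a "translation" key), the Python below collapses each such leaf to its translation, or to the default when the translation is missing or empty. This is only an illustration: the names is_leaf and resolve are hypothetical, and it is not the code path Rails or i18n-translators-tools actually uses.

```python
def is_leaf(node):
    """True for a {"default": ..., "translation": ...} mapping of strings.

    Caveat: a plain subtree whose only key is "default" would also match,
    so this heuristic is an assumption that suits the sample data only.
    """
    return (
        isinstance(node, dict)
        and "default" in node
        and set(node) <= {"default", "translation"}
        and all(isinstance(value, str) for value in node.values())
    )


def resolve(node):
    """Collapse default/translation leaves, falling back to the default."""
    if not isinstance(node, dict):
        return node
    if is_leaf(node):
        # An empty or missing translation falls back to the source string.
        return node.get("translation") or node["default"]
    return {key: resolve(value) for key, value in node.items()}


spanish_tree = {
    "page_info": {
        "sales/credit_notes": {
            "date": "Fecha",
            "title": {
                "default": {
                    "default": "Sales Credit Note",
                    "translation": "Crédito de venta",
                },
                "new": {
                    "default": "New Sales Credit Note",
                    "translation": "",  # not translated yet
                },
            },
        }
    }
}

resolved = resolve(spanish_tree)
print(resolved["page_info"]["sales/credit_notes"]["title"])
```

Note that "title" is not mistaken for a leaf even though it has a "default" key, because its values are mappings rather than strings.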


4 Jul 2008

Sous le Grand Chapiteau

We went to see Cirque du Soleil's Corteo this afternoon. It was a very good show, of course: they do beautiful work.

The last time I saw Cirque du Soleil was in 1991: Nouvelle Expérience in Atlanta, to which my mother invited me. That show was a revelation to me. It was also a very different time for that organization: just a small circus company, not the "entertainment empire" as it is described today. Then, it was just one touring show; now it has fourteen touring companies and six resident shows.

It was also distinctly non-English then, with very few words spoken at all, and the French nature of the show very distinct. Today they sing and mumble in a eurotrash polyglot (which probably reflects the many nationalities involved) and speak in English. It somehow now feels safe where the show used to feel somewhat subversive: two nearly naked men doing a hand-to-hand show was practically a felony in Georgia, and although speaking French hadn't yet reached the depths of infamy it later did, it was certainly different for that place and time.

Inevitably, that which is good and cutting edge eventually becomes mainstream. Even when the quality remains the same (or gets better, to be perfectly honest), it often doesn't feel that way. After all, middle age is when nostalgia blossoms. But age has some consolations: at least I no longer affect a little black fez with gold embroidery (some photos will never be scanned).

23 May 2008

Tools and emergent complexity: exonerating Twitter and Rails

Twitter has had substantial downtime over the last several days, and this has prompted no end of commentary and analysis. Ruby on Rails was initially blamed for the problems a year ago, then exonerated, then blamed again (and exonerated again). But blaming the hammer for improperly driving a screw is not very illuminating; blaming a screwdriver for how it drives a nail even less so, and although using a hammer and screwdriver combination to drive a large number of finishing nails probably isn't the best solution, until a better machine is invented you wouldn't necessarily know that.

The reason Twitter is having difficulties is that it truly is a novel application. The rules are deceptively simple on the surface, but the emergent complexity is profound, especially as you start to build a massive database of users (which Twitter certainly is now doing). The sort of many-to-many relationships embodied in the way people follow one another, coupled with the different options on what sorts of tweets you want to see, and the different ways of interfacing – the website, instant messaging, text messaging, a raft of third party applications (Twhirl, gTwitter, FriendFeed, et cetera, etc, &c, ...), the ability to track specific terms...

All of this adds up to an extremely complex system that gets exponentially harder to manage as the user base grows. The telephone systems' switching rules are simple by comparison: they are simple, one-to-one connections that connect, persist a short time, and go away, leaving nothing but possibly a billing record (and definitely an entry in an NSA database). A tweet goes onto a user's own list, their friends' lists, possibly the lists of friends-of-friends, and the list of anyone who is tracking that term; it is sent out via SMS, instant messenger and the API, AND it persists forever. If the user then decides to delete it or make it private, it has to be removed from all of those lists. Simple, huh? Oh yeah, and it has to do all of that in realtime.
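A toy model shows how quickly the bookkeeping piles up. The Python sketch below is purely illustrative and makes no claim about how Twitter is actually built: the ToyTimeline class and its method names are invented, everything lives in memory, and it covers only followers and tracked terms (no friends-of-friends, SMS, instant messaging, API delivery, or privacy settings). Even so, every post has to be copied to many lists, and every delete has to find all of those copies again.

```python
from collections import defaultdict


class ToyTimeline:
    """In-memory sketch of copying each tweet to every interested list."""

    def __init__(self):
        self.followers = defaultdict(set)   # author -> users following them
        self.trackers = defaultdict(set)    # term -> users tracking that term
        self.timelines = defaultdict(list)  # user -> tweet ids they should see
        self.delivered = {}                 # tweet id -> users who received it
        self.tweets = {}                    # tweet id -> (author, text)
        self._next_id = 1

    def follow(self, follower, author):
        self.followers[author].add(follower)

    def track(self, user, term):
        self.trackers[term.lower()].add(user)

    def post(self, author, text):
        tweet_id = self._next_id
        self._next_id += 1
        self.tweets[tweet_id] = (author, text)

        # The author's own list, every follower, and anyone tracking a word.
        recipients = {author} | self.followers[author]
        for word in text.lower().split():
            recipients |= self.trackers.get(word, set())

        for user in recipients:
            self.timelines[user].append(tweet_id)
        self.delivered[tweet_id] = recipients
        return tweet_id

    def delete(self, tweet_id):
        # Undo the fan-out: remove the tweet from every list it reached.
        for user in self.delivered.pop(tweet_id, set()):
            self.timelines[user].remove(tweet_id)
        self.tweets.pop(tweet_id, None)


site = ToyTimeline()
site.follow("bob", "alice")
site.track("carol", "rails")

tweet_id = site.post("alice", "Scaling rails is hard")
after_post = {user: list(ids) for user, ids in site.timelines.items()}

site.delete(tweet_id)
after_delete = {user: list(ids) for user, ids in site.timelines.items()}

print(after_post)
print(after_delete)
```

One tweet from a user with one follower already touches three lists; the work per post grows with the number of followers and trackers, which is the part that gets painful at scale.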

Twitter is built on Ruby on Rails, which came from a simple project management application. Obviously a simple project management application isn't designed to robustly handle the type of complex operations outlined above. It turns out nothing is, which is why Twitter has no easy solutions at hand. Their difficulties in scaling would have likely happened with any existing platform, as not even airline reservation and telephone switching systems handle such a flood of interrelated and interdependent traffic coming from so many different sources – traffic that doubles in two months.

Evan Williams and company invented something new, and they shouldn't be blamed for not initially understanding the true potential and nature of the beast. Although it isn't profitable, it continues to attract investors; anything with this kind of growth and engagement is interesting to businesspeople. NTT invested for a reason, and it's not just because it is popular (and profitable) in Japan. This is an example of how next-generation communication is working: modern switching rules, attention-based networking – a step beyond instant messaging, a step beyond SMS and a step sideways from the phone system. The right tools for the job probably don't exist yet; maybe Erlang is a step in the right direction.

Lastly, I don't blame the Twitter staff for doing experiments on the site during the day. They live in the United States and there's no reason they should have to stay up all night. Besides, we should face the sobering conclusion that Japan's market and the rest of Asia might be more important to Twitter than the depressed, aging, and troubled North American market. From that standpoint, the US is a cheap, talented labour pool crafting clever mercantile goods to send to Asia in exchange for hard currency. Oh, how the worm turns.

13 Apr 2008

Speech synthesis on Ubuntu

Text-to-speech (TTS) has been around for a couple of decades, and it keeps getting better. There are a bunch of really fun untapped applications for it, combining RSS, filters (like Pipes), podcasts, telephony, and hidden speakers.

Under Linux there is a nice package available called Festival. To get started, grab an appropriate package, such as:
  • festvox-hi-nsk (Hindi male)
  • festvox-kallpc16k (American English male)
  • festvox-rablpc16k (British English male)
  • festvox-mr-nsk (Marathi male)
  • festvox-suopuhe-lj (Finnish female)
  • festvox-suopuhe-mv (Finnish male)
  • festvox-te-nsk (Telugu male)
Too bad you can't get a female speaker except in Finnish. (I had never heard of the Indian languages Marathi and Telugu, and I consider myself a language buff... sigh.)

The results are pretty good. Here's how to use it from the command line:
text2wave text-file.txt -o audio-file.wav
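
If you want to drive this from a script (say, to turn a batch of feed items into audio files), a thin wrapper is enough. The Python below is a sketch under a few assumptions: the function names are made up, the optional voice is passed with text2wave's -eval option as a Scheme call such as (voice_kal_diphone), and both that option and the voice name should be checked against your installed Festival before relying on them.

```python
import shutil
import subprocess


def text2wave_command(text_file, wav_file, voice=None):
    """Build the argument list for Festival's text2wave script.

    `voice` is an optional Festival voice function name, for example
    "voice_kal_diphone" (an assumption; installed voices vary).
    """
    cmd = ["text2wave"]
    if voice:
        cmd += ["-eval", f"({voice})"]
    cmd += [text_file, "-o", wav_file]
    return cmd


def synthesize(text_file, wav_file, voice=None):
    """Run text2wave, raising an error if it is missing or fails."""
    if shutil.which("text2wave") is None:
        raise RuntimeError("text2wave not found; install Festival first")
    subprocess.run(text2wave_command(text_file, wav_file, voice), check=True)


# Only print the command here; call synthesize() on a machine with Festival.
print(text2wave_command("text-file.txt", "audio-file.wav"))
```

Building the command as a list rather than a shell string avoids quoting problems with file names that contain spaces.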

For extra fun, use pidgin-festival to turn incoming instant messages into speech (use festival-gaim if you haven't made the jump to Hardy Heron yet).

7 May 2007

Unexpected vacation; media abdication

Check out a satirical comic about Maher Arar's extraordinary rendition by Tom Tomorrow today on Salon.

I was in the States listening to National Public Radio when the Canadian government apologized to Arar for its role in his kidnapping and torture. Anne Garrels read a brief story about the case, saying that he had been deported to Syria "where he was allegedly interrogated." I had a moment of radio rage. Allegedly interrogated? Is that all it was? See, I was under the impression that the "allegations" were of torture. I didn't think there was even any question that he had been interrogated.

But you see, that's the state of journalism today. Even NPR, which is supposedly so very liberal, independent and trustworthy, has deteriorated to the point where it abuses language to avoid reporting news that government and corporate masters would rather not be heard. It seems like NPR is reporting news, because it still calls itself news and NPR once did something resembling news, but it has become little more than a pack of politicized corporate cheerleaders.

If you have been listening continuously to NPR for the past ten years or so you might not have noticed the change. With a little bit of distance you notice the ever-lengthening advertisements ("sponsorships") for pharmaceutical companies, defense contractors, and agriculture conglomerates (they started out as five-word blurbs, but then they grew inexorably – in length and frequency). The subtle shift in language when dealing with political matters is harder to quantify, but is definitely there. Evidence of partisan skulduggery at NPR is well documented, but you don't have to be a researcher to notice the effects in the types of commentators now invited to voice their opinions. Where opinions were once balanced and questioned, they now reverberate unanswered, especially when they deal with United States foreign policy and international investment, legal and trade agreements.

Media conglomeration in the US has resulted in less diversity of opinion and less real commentary in the official mainstream press. At the same time, public radio has been dragged down to the point where it provides no meaningful competition to commercial media organs. None dare call it conspiracy, because there is nobody left standing to do so. Instead, corporate cheerleaders with airbrushed makeup and great hair read sanitized newsbytes without context, and bejowled father-figures terrify and scold, providing judgments without bothering to inform. In short, certain interests control the medium and the message, and don't bet they're doing it in your best interests.