Monthly Archives: October 2013

Google Translate and its possible strategies

The Google team under Och uses statistical approaches for the Google Translate service. As everyone in the NLP community knows, the larger the parallel corpus, the better the translations one can get – all other things being equal. Apparently, the Google system uses parallel corpora mostly from and into English. If you want to go from Russian to French, the system first translates from Russian into English and then from English into French. This is obviously going to produce more information loss: using any human language as an interlingua leads to an additional loss of specificity and a new layer of calculations that produce more translation errors. For some languages, such as Belarusian, yet another intermediate step is added.
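The information loss from pivoting can be shown with a toy sketch. Everything below is invented for illustration (the dictionaries and the `pivot_translate` function are not Google's actual pipeline): Russian distinguishes informal "ты" from formal "вы", but both collapse to "you" in the English pivot, so the French output cannot recover the distinction.

```python
# Toy phrase tables, invented for this sketch.
# Russian's informal/formal "you" both collapse to English "you".
RU_EN = {"ты": "you", "вы": "you"}
# Going out of the pivot, one form must be chosen arbitrarily.
EN_FR = {"you": "vous"}

def pivot_translate(word, into_pivot, out_of_pivot):
    """Translate by chaining two lookups through the pivot language."""
    return out_of_pivot[into_pivot[word]]

print(pivot_translate("ты", RU_EN, EN_FR))  # "vous", though "tu" was meant
print(pivot_translate("вы", RU_EN, EN_FR))  # "vous"
```

Real statistical systems work on phrases with context rather than single words, so the loss is softer in practice, but the structural problem is the same: any distinction the pivot language does not encode is gone by the second translation step.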

The first reason for having English as the interlingua for all other language pairs is the availability of parallel corpora for English. Still: is this the only reason? I do not know. The United Nations has translations that would allow training on all of those language pairs.
The European Union has the same. In the EU case, English is usually the source language for the human translators, but their translations should, in principle, be considered equally “natural”.

My guess is that the available corpora – those of the United Nations and the EU – are not diverse enough. They might be large, but as we know, it is not only the size of a corpus that counts but also its diversity. There are not many organisations that could give Google free access to parallel corpora in other language pairs.

Some questions I have:

  • Will Google change strategy in the medium term and drop the intermediate natural language?  
  • Are machine translation systems such as Google's making the production of human translations between other language pairs even less likely?

Google already provides a toolkit for translators. It would be interesting to know what kind of people use this framework, and for which language pairs. One thing I am certain of is that Google will use all the data it can get – in this case the translators' hard work – to keep training its translation systems. We know what that means, don't we? The question is when the training data will reach critical mass.