
Hapax Googlegomenon

I was reading a German article on Spiegel about China’s economy and thought the information could be interesting for a friend in Venezuela who usually reads only in Spanish. As I didn’t have time to translate the article myself, I thought I could let Google’s MT engine translate it for him, choosing Spanish as the target. As I mentioned in a previous post, Google’s engine seems to use English as an intermediate step when translating between other language pairs. In any case, for this sentence

“Der Zuwachs liegt zwar knapp über dem selbst gesteckten Wachstumsziel der Regierung von 7,5 Prozent, allerdings hatte sie in der Vergangenheit immer sehr vorsichtige Vorgaben gemacht, die am Ende meist deutlich übertroffen worden waren.”

the engine produced this:

“Although the growth is just above the self-imposed growth target of 7.5 percent of the government, but they had done in the past always very careful guidelines that had been surpassed at the end usually.  “

I only want to talk about one issue here: the “translation” of the German figure 7,5 as 7.5 in the Spanish output. That is wrong: standard Spanish uses commas for decimals and points for thousands, so it should be 7,5. Other decimal figures did get the correct Spanish form: 7.7 -> 7,7 and 7.8 -> 7,8.

I then made up some other examples with numbers whose translations Google’s engine could hardly have seen before (very long decimal fractions) and ran the same experiment between English and Spanish.

The problem seems to be that Google’s MT treats certain numbers as numbers and applies the necessary transformations, but falls back on prêt-à-porter forms in other cases.

Google MT is apparently using some sub-standard Spanish translations for training: there seems to be a Spanish training text coming from a translator who was influenced by English conventions, and the engine may have let that one example trump everything else. I think the company could do better than this; it wouldn’t be hard to implement a general solution for these cases.
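
Just to illustrate what such a general solution might look like, here is a minimal post-processing sketch in Python. The regex and function name are my own, purely hypothetical, and have nothing to do with Google’s actual pipeline: it simply rewrites English-style decimal points in the output as Spanish decimal commas.

```python
import re

# Naive post-processing sketch (my own illustration, not Google's method):
# rewrite English-style decimal numbers in MT output with the Spanish
# decimal comma. A real solution would also have to handle thousands
# separators and decide which tokens really are numbers.
DECIMAL_POINT = re.compile(r"\b(\d+)\.(\d+)\b")

def localize_decimals_es(text: str) -> str:
    """Replace 7.5-style decimals with the Spanish form 7,5."""
    return DECIMAL_POINT.sub(r"\1,\2", text)

print(localize_decimals_es("un crecimiento del 7.5 por ciento"))
# -> un crecimiento del 7,5 por ciento
```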


Google Translate and its possible strategies

The Google team under Franz Och uses statistical approaches for the Google Translate service. As everyone in the NLP community knows, the larger the parallel corpus, the better the translations one can get, all other things being equal. Apparently, the Google system uses parallel corpora mostly from and into English: if you want to go from Russian to French, the system first translates from Russian into English and then from English into French. This obviously produces more information loss, since using any human language as an interlingua adds a loss of specificity and an extra layer of computation that introduces further translation errors. For some languages, such as Belarusian, yet another intermediate step is added.
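
To make that loss concrete, here is a toy sketch with made-up dictionaries (it has nothing to do with Google’s real system): a distinction that exists in both the source and the target language disappears in the English pivot and cannot be recovered.

```python
# Toy illustration of pivot translation through English: French distinguishes
# the informal "tu" from the formal "vous", but both collapse to English
# "you", so the second step can no longer recover the distinction in Spanish.
FR_TO_EN = {"tu": "you", "vous": "you"}
EN_TO_ES = {"you": "tú"}  # the pivot step has to guess one form

def pivot_translate(french_word: str) -> str:
    """French -> English -> Spanish by table lookup, for illustration only."""
    return EN_TO_ES[FR_TO_EN[french_word]]

print(pivot_translate("tu"))    # tú  (correct, by luck)
print(pivot_translate("vous"))  # tú  (the formality was lost in English)
```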

The first reason for using English as an interlingua for all other language pairs is the availability of parallel corpora involving English. Still, is this the only reason? I do not know. The United Nations has translations that would allow training between all of its official language pairs, and the European Union has the same. In these institutions English is usually the source language for the human translators, but their translations should, in principle, be considered equally “natural”.

My guess is that the available corpora, those of the United Nations and the EU, are not diverse enough. They may be large, but as we know, it is not only the size of a corpus that counts but also its diversity. And there aren’t many organisations that could give Google free access to parallel corpora for other language pairs.

Some questions I have:

  • Will Google change strategy in the medium term and move to a model with no intermediate natural language?
  • Are machine translation systems such as Google’s making the production of human translations between other language pairs even less likely?

Google is already providing a toolkit for translators. It would be interesting to know what kind of people use this framework and for which language pairs. One thing I am certain about is that Google will use all the data it can get, in this case the translators’ hard work, to keep training its translation systems. We know what this means, don’t we? The question is when the training data will reach critical mass.