Monthly Archives: January 2014

Hapax Googlegomenon

I was reading a German article in Spiegel about China’s economy and thought the information could be interesting for a friend in Venezuela who usually reads only in Spanish. As I didn’t have the time to translate the article myself, I let Google’s MT engine translate it for him, choosing Spanish as the target language. As I mentioned in a previous post, Google’s engine seems to use English as an intermediate step when translating between other language pairs. In any case, for this sentence:

“Der Zuwachs liegt zwar knapp über dem selbst gesteckten Wachstumsziel der Regierung von 7,5 Prozent, allerdings hatte sie in der Vergangenheit immer sehr vorsichtige Vorgaben gemacht, die am Ende meist deutlich übertroffen worden waren.”

the engine produced this:

“Although the growth is just above the self-imposed growth target of 7.5 percent of the government, but they had done in the past always very careful guidelines that had been surpassed at the end usually.”

I only want to talk about one issue here: the “translation” of German 7,5 into Spanish 7.5. That is wrong: standard Spanish uses commas for decimals and points for thousands, so it should be 7,5. Other fractions in the article did get the correct Spanish form: 7.7 became 7,7 and 7.8 became 7,8.

I then wrote some other examples with numbers that Google’s engine is unlikely to have seen translations for (very long fractions) and ran the same experiment between English and Spanish.

The problem seems to be that Google’s MT treats certain numbers as numbers, applying the necessary transformations, but falls back on prêt-à-porter forms in other cases.

Google MT is apparently using some sub-standard Spanish translations for training: the parallel data seems to include Spanish text from a translator who was influenced by English conventions, and the engine let that example trump everything else. The company could do better than this; it wouldn’t be hard to build a general solution for these cases, along the lines of the sketch below.
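Just to make that concrete, here is a minimal sketch in Python of the kind of locale-aware post-processing rule I have in mind. The separator table and the function localize_decimals are my own placeholders, not anything Google exposes, and a real version would also have to handle thousands separators:

    import re

    # Decimal separator per target language; a tiny placeholder table.
    DECIMAL_SEPARATOR = {"en": ".", "de": ",", "es": ","}

    def localize_decimals(text, target_lang):
        """Rewrite English-style decimals like '7.5' with the target
        language's separator (e.g. '7,5' for Spanish)."""
        sep = DECIMAL_SEPARATOR.get(target_lang, ".")
        if sep == ".":
            return text  # nothing to do
        # Replace a point between two digits; crude, but it covers '7.5'.
        return re.sub(r"(?<=\d)\.(?=\d)", sep, text)

    print(localize_decimals("a growth target of 7.5 percent", "es"))
    # -> a growth target of 7,5 percent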


Usual Babel stuff

I didn’t know about this tool: the Great Language Game. It is just so cool! The first time I got 1000 points, then I went down to 900, but I will definitely visit the site again and see whether I can do better. I have enjoyed guessing what languages people are speaking, especially when walking through the crowds in large cities with many expats.

The other day I was thinking about developing a wee tool that lets a human guess the language of texts randomly extracted from the different Wikipedias, but I didn’t have the time. I will first check who has developed something similar, so as not to reinvent the wheel. Still, I think there are quite a few parameters to play around with so that trivial guesses are avoided.
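For what it is worth, here is a minimal sketch of such a quiz in Python, using the public MediaWiki API (the requests library is assumed). The language list, number of rounds, and scoring are arbitrary placeholders; a serious version would also filter out extracts that give the language away, such as place names or the language naming itself:

    import random
    import requests

    # Languages to quiz on; an arbitrary choice for this sketch.
    LANGS = ["de", "es", "fi", "tr", "pl", "nl"]

    def random_extract(lang):
        """Fetch the plain-text intro of a random article from one Wikipedia."""
        params = {
            "action": "query",
            "format": "json",
            "generator": "random",
            "grnnamespace": 0,   # main namespace: articles only
            "prop": "extracts",
            "exintro": 1,        # introduction only
            "explaintext": 1,    # strip wiki markup
        }
        url = f"https://{lang}.wikipedia.org/w/api.php"
        data = requests.get(url, params=params, timeout=10).json()
        page = next(iter(data["query"]["pages"].values()))
        return page.get("extract", "")

    def play_round():
        lang = random.choice(LANGS)
        print("\n" + random_extract(lang))
        guess = input(f"Which language is this ({', '.join(LANGS)})? ").strip()
        print("Correct!" if guess == lang else f"No, that was '{lang}'.")
        return guess == lang

    if __name__ == "__main__":
        score = sum(play_round() for _ in range(5))
        print(f"\nYou got {score}/5.")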


NLP in 2014

There are obvious areas in which natural language processing is set to move forward in 2014, especially:

  • improvements in statistically-based machine translation (the Google team, Microsoft, but also a couple of others, especially in the Asian market),
  • innovative ways of NLP-support for vertical semantic search,
  • a myriad of NLP-supported solutions for analysis of social networks,
  • as usual in the last few years, quite a few teams trying to push cross-lingual, “semantic”-aided search further. I don’t believe there will be a major breakthrough there unless there is a fundamental shift of focus: researchers need to tackle some fundamental questions about adaptability and scalability, and find more innovative approaches to linguistic and semantic representation and its link to statistics.

I also think some areas have been overlooked and a couple of approaches need more attention. I will be posting about them in the coming weeks.
