Category Archives: Natural language processing

Semantics revisited

Over the last few decades we have seen many “topics” or subdomains become more or less popular within the wide range of things considered “artificial intelligence”. Twenty years ago, even the term artificial intelligence was something of a term to avoid when presenting research proposals. Neural networks have a long history, and yet ten years ago most people had not heard of them.

Now we see a new comeback: that of semantic technologies. We used to talk a lot about semantics in the nineties and in the early part of this century. Still, the concrete approaches did not prove scalable, and other approaches to data management took hold.

Still, slowly but surely, semantics started to appear again, even if it was not so well understood by many of those who were supposed to apply it. In the last few years people in NLP and related domains started to use the word semantics in the context of word embeddings and document embeddings. Open data became more important, and suddenly more people started to realise that ontologies can be built or enriched with machine learning approaches.

Some of the more exciting things I have seen out there:

  • the improvement of graph databases such as Neo4j and Amazon Neptune (itself connected to other interesting services)
  • the spread of SPARQL and, to a lesser extent, Gremlin
  • the appearance of tools such as Owlready for working with ontologies from Python (see the short sketch after this list)
  • the maturity of resources such as DBPedia, Wikipedia and a plethora of projects connected to these
  • the improvement of algorithms for graph manipulation
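
Since the list mentions Owlready as the bridge to Python, here is a minimal sketch of what that interaction can look like with Owlready2. The ontology IRI and whatever it would contain are placeholders invented for the example; only the library calls themselves are real.

```python
# Minimal sketch with Owlready2 (pip install owlready2).
# The IRI below is a placeholder; point get_ontology() at a real OWL file or URL.
from owlready2 import get_ontology

onto = get_ontology("http://example.org/ontologies/stories.owl").load()

# Inspect what the ontology declares.
print(list(onto.classes()))            # all OWL classes
print(list(onto.object_properties()))  # all object properties

# Simple lookups by IRI pattern.
print(onto.search(iri="*Story*"))
```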

There are lots of interesting challenges that we need to tackle now in order to solve real-life problems: how to optimise the versioning and automatic growth of ontologies from external resources, how to protect personal data in these systems and how to represent ever more complex relations, especially those explaining “stories”.

I believe this is where we need to explore, first at a very abstract and then at a concrete level, what I will call high-order syntax. Linguists have worked for millennia on syntactic problems. Software specialists have worked on the syntax of programming languages for several decades now. Likewise, ontology experts have been developing ever more complex frameworks to express n-ary relationships. Now we need semantic theories that help us tell and manipulate stories with data. And then we will need to spread the knowledge.
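
To make the n-ary point a little more concrete, here is a rough sketch of how a story-like relation (“Alice sold a book to Bob in 2020”) can be reified as its own node in an RDF graph using Python’s rdflib. The namespace and the property names are invented for the illustration; they are not part of any standard vocabulary.

```python
# Sketch: representing an n-ary relation as an "event" node with rdflib
# (pip install rdflib). Everything in the EX namespace is made up.
from rdflib import Graph, Namespace, Literal, RDF

EX = Namespace("http://example.org/story#")
g = Graph()
g.bind("ex", EX)

# A single triple (Alice, sold, book) cannot carry the buyer and the year,
# so the sale becomes a node of its own and the participants hang off it.
sale = EX.sale_001
g.add((sale, RDF.type, EX.SaleEvent))
g.add((sale, EX.seller, EX.Alice))
g.add((sale, EX.item, EX.Book))
g.add((sale, EX.buyer, EX.Bob))
g.add((sale, EX.year, Literal(2020)))

print(g.serialize(format="turtle"))
```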

Currently there are lots of people working across the NLP spectrum who have no understanding of syntax in the linguistic sense and who also lack knowledge of syntax in the sense of semantic languages such as OWL. They talk about “language models” while they are experimenting with parameters to optimise transformers for this or that kind of text or image recognition. Computational linguists and semantic specialists are needed in order to develop more comprehensive frameworks so that digital systems can somehow tell or recognise stories and, more importantly, react upon them in a reliable way.

I recommend two newish books for those interested in semantic technologies: Knowledge Graphs: Fundamentals, Techniques, and Applications by Kejriwal et al. (2021) and, a little bit more mundane but still interesting, Ontologies with Python by Lamy Jean-Baptiste (also 2021).

Natural Language Processing with TensorFlow by Ganegedara

I actually started to read this book last year. I went through most of it and experimented a lot, but had no time to write about it. I finally read the last chapter, which I had somehow put off. All in all, it is a useful book on using TensorFlow for NLP.

It offers a good exploration of what Word2Vec is. It goes on to CNNs, first with image recognition and then with sentence classification, then gives good initial coverage of LSTMs and finally touches very briefly on some trends.
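
For readers who have not seen the kind of model those middle chapters build up to, here is a minimal sketch of an LSTM sentence classifier in TensorFlow/Keras. The vocabulary size, sequence length and layer sizes are arbitrary placeholders, not the book’s own settings.

```python
# Minimal sketch of an LSTM sentence classifier with TensorFlow/Keras.
# Hyperparameters are placeholders, not values taken from the book.
import tensorflow as tf

VOCAB_SIZE = 10_000   # number of distinct tokens kept
MAX_LEN = 100         # padded length of each token-id sequence
EMBED_DIM = 128

model = tf.keras.Sequential([
    tf.keras.Input(shape=(MAX_LEN,)),
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary sentence label
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()

# Training would then look like:
# model.fit(padded_token_ids, labels, epochs=3, validation_split=0.1)
```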

I think I would have skipped the NLP introduction, but then I have worked in NLP for almost all of my life. Ganegedara should not have gone on so long about WordNet and the like: as much as it was a widely used tool, and still might be, a paragraph would have been enough.

It was a pity that the last part, the one on trends that I had left until now, was so short, but then the technology, and even the basics of what deep learning is about, are changing so fast!

Another Pac(k)t

OK, terrible pun. I was not inspired. Anyway: I am going through Natural Language Processing with TensorFlow by Ganegedara, and although Packt books are often published in a bit of a rush, this one presents a neat introduction to TensorFlow as it relates to NLP. There are some simplifications on the NLP side, but as an introduction to TensorFlow in this area it works well.


Text to speech on demand

Over a decade ago I was working primarily in text-to-speech at Lernout and Hauspie. We were developing state-of-the-art technology and we were very excited about it, even if we were going through a horrible financial period that would lead to L&H’s ultimate bankruptcy.

I was recently analysing some demos from TTS companies and was surprised to see how little the field has advanced in the last ten years. Sure enough, there have been certain advances here and there, but fewer than I expected. Something has gone wrong with the R&D focus of the companies involved, and public research organisations have also had their financial limitations.

Now I see the efforts of certain major Internet corporations to use TTS technology, and I wonder what they are aiming at.

I think these companies might be exploring some interesting ideas, like offering in the medium term a framework for ordinary users to develop their own TTS systems. I do not see just yet, or not necessarily just yet, a completely personalised own-voice module, but rather something in between. This would require making users work for those corporations by training on their own data. The vast majority of users won’t be able or willing to perform tasks such as interpreting phonological mappings of their own speech or going too deep into the verification of speech segmentation. Still, I think it is perfectly possible to offer several methods through which users can be guided to provide input for machine learning processes that then deliver increasingly improved personal TTS services.

These companies will have the resources to support such frameworks, and at the same time to profit from the users’ interactions. In the medium term I see a reduction in the number of hours an Internet user will have to invest in order to get a free or nearly free TTS system.

And this would be a huge challenge for companies specialising in TTS technology. In a future post I will go deeper into that.

Hapax Googlegomenon

I was reading a German article in Spiegel about China’s economy and thought the information could be interesting for a friend in Venezuela. That friend usually reads in Spanish only. As I didn’t have the time to translate the article myself, I thought I could let Google’s MT engine translate it for him, with Spanish as the target language. As I mentioned in a previous post, Google’s engine seems to use English as an intermediate stage for translation between other language pairs. In any case, for this

“Der Zuwachs liegt zwar knapp über dem selbst gesteckten Wachstumsziel der Regierung von 7,5 Prozent, allerdings hatte sie in der Vergangenheit immer sehr vorsichtige Vorgaben gemacht, die am Ende meist deutlich übertroffen worden waren.”

the engine produced this:

“Although the growth is just above the self-imposed growth target of 7.5 percent of the government, but they had done in the past always very careful guidelines that had been surpassed at the end usually.  “

I only want to talk about one issue here: the “translation” of German 7,5 into Spanish 7.5. That is wrong: standard Spanish uses commas for decimals and points for thousands, so it should be 7,5. Other fractions did get the correct Spanish form: 7.7 -> 7,7 and 7.8 -> 7,8.

I started to write some other examples with numbers that Google’s engine might not have seen translations for (very long fractions) and then ran the same experiment between English and Spanish.

The problem seems to be that Google’s MT treats certain numbers as numbers and performs the necessary transformations, but falls back on prêt-à-porter forms in other cases.

Google MT is apparently using some sub-standard Spanish translations for training: there seems to be Spanish training text coming from a translator who was influenced by English, and the engine may have weighted that example above everything else. I think the company could do better than this: it wouldn’t be hard to get a general solution for these cases.
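
For what it is worth, such a general solution does not need to be complicated: locale-aware formatting libraries already know which decimal and thousands separators each language uses. A small sketch in Python with the Babel library, purely as an illustration of the kind of post-processing an MT pipeline could apply to numerals (this is not Google’s code):

```python
# Sketch: locale-aware number formatting with Babel (pip install babel).
# Only an illustration of a possible post-processing step for numerals.
from babel.numbers import format_decimal

value = 7.5
print(format_decimal(value, locale="de_DE"))  # 7,5  (German)
print(format_decimal(value, locale="es_ES"))  # 7,5  (Spanish)
print(format_decimal(value, locale="en_US"))  # 7.5  (English)

# Thousands separators are handled per locale as well:
print(format_decimal(1234567.5, locale="es_ES"))  # 1.234.567,5
```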


NLP in 2014

There are some obvious areas in which natural language processing is set to move forward in 2014, especially:

  • improvements in statistically-based machine translation (the Google team, Microsoft, but also a couple of others, especially in the Asian market),
  • innovative ways of supporting vertical semantic search with NLP,
  • a myriad of NLP-supported solutions for analysis of social networks,
  • as usual in the last few years, quite a few teams trying to work further on cross-lingual, “semantic”-aided search. I don’t believe there will be a major breakthrough there unless there is a fundamental shift of focus: researchers need to tackle some fundamental questions about adaptability and scalability, and pursue more innovative approaches to linguistic and semantic representation and their link to statistics.

I think there are some areas that have been overlooked and a couple of approaches that need more attention. I will be posting about them in the coming weeks.


Google translate and its possible strategies

The Google team under Och uses statistical approaches for the Google Translate service. As everyone in the NLP community knows, the larger the parallel corpus, the better the translations one can get, all other things being equal. Apparently, the Google system uses parallel corpora mostly from and into English. If you want to go from Russian to French, the system translates first from Russian into English and then from English into French. This is obviously going to produce more information loss: using any human language as an interlingua leads to an additional loss of specificity and a new layer of calculations that produces more translation errors. For some languages, such as Belarusian, yet another translation step is added.
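
As a toy illustration of what that pivoting means, here is a short Python sketch with a hypothetical translate(text, source, target) function. Both the function and its behaviour are invented for the example; this is not Google’s implementation.

```python
# Toy illustration of pivot translation through English.
# `translate` is a hypothetical stand-in for a real MT system call.
def translate(text: str, source: str, target: str) -> str:
    """Pretend MT call; in reality every hop introduces its own errors."""
    return f"[{source}->{target}]({text})"

def pivot_translate(text: str, source: str, target: str, pivot: str = "en") -> str:
    # Two chained calls instead of one direct model: errors compound.
    intermediate = translate(text, source, pivot)
    return translate(intermediate, pivot, target)

print(pivot_translate("Пример текста", "ru", "fr"))
# [en->fr]([ru->en](Пример текста))
```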

The first reason for having English as an interlingua for all other language pairs is the availability of parallel corpora for English. Still, is this the only reason? I do not know. The United Nations has translations that would allow training on all of those language pairs, and the European Union has the same. In the EU’s case English is usually the source for the human translators, but their translations should, in principle, be considered equally “natural”.

My guess is that the available corpora, those of the United Nations and the EU, are not diverse enough. They might be large, but, as we know, it is not only the size of a corpus that counts but also its diversity. There aren’t many organisations that might give Google free access to parallel corpora for other language pairs.

Some questions I have:

  • Will Google change strategy in the medium term and drop the intermediate natural-language model?
  • Are machine translation systems such as Google’s making the production of human translations between other language pairs even less likely?

Google is already providing a toolkit for translators. It would be interesting to know what kind of people use this framework and for which language pairs. One thing I am certain about is that Google will use all the data it can get, in this case the translators’ hard work, to keep training its translation systems. We know what this means, don’t we? The question is when the training data will reach critical mass.