Category Archives: data mining

A good general book about machine learning

Peter Flach wrote a very neat introduction to machine learning. The ISBN is 978-1107422223. It leans more towards theory than practice, but it also contains plenty of good advice on the concrete decisions professionals have to make about feature selection, testing and so on. Mathematical points are clearly explained. I usually don’t mention this, but the layout is also very pleasant, with lots of charts that help visualize the issues discussed.

I would have liked to read more on kernel methods. A bit more detail on the algorithmic side would also have been welcome. Still, it is a good reference book.

 

Analyzing graphs in Wikipedia (II)

Here I will call Wikipedia categories simply wikicategories. Wikicategories listed under a wikicategory article are simply subwikicategories.

We can gain a lot of semantically relevant information from the links between wikicategories if we identify the features that characterize networks built from useful semantic relationships between wikicategories and contrast them with networks produced by randomly following all the links between wikicategories. Those features can be identified with unsupervised algorithms, supervised algorithms or something in between. We believe the last option is the best: relevant relationships are domain-dependent, and the understanding of a knowledge expert can help establish where to look first. Let’s talk about some rules of thumb here.

As we have said, Wikipedia categories are not concepts but more akin to tags in a folksonomy. Subwikicategories can often be used to automatically derive possible is-a relationships for an ontology but the linking is far from trivial. In many cases subwikicategories refer to features or simply topics mildly related to the initial wikicategory.

We can assume that the wikicategories attached to a wikicategory, its wiki-parents, are potential references to parent concepts. In fact, a given wikicategory can have among its attached tags one or more wikicategories apt to be a parent.

Some of the strategies we can use to determine the most likely “parents” are:

  1. Morpho-syntactic relationship: evidently, Russian ambassador is likely to be a parent of Russian ambassador to Belgium. A shallow parser can help here. “Person nach Staatsangehörigkeit” or “people by nationality”
  2. If we add some semantic knowledge, we can make this method wiser. We can derive that the wikicategory European author can be used to derive a semantic parent for Norwegian author.
  3. We can train neural networks or use Bayesian statistics to filter out the less likely patterns, those where a substring still does not reflect a parent or where an adjective that might be a generalization of another is less reliable – Spanish
  4. Cross-validation through languages is a particularly fruitful approach. Given a wikicategory A in language L1, we list the paths produced by following the subwikicategories recursively, as long as they do not lead into a cycle. We then do the same for the wikicategory B in language L2 that is linked to wikicategory A in L1. We will often have competing paths in L1 that have equivalent paths in L2. We will sometimes have paths in L2 that are only partially similar. We will also have paths that stop being cross-linguistically linked for some time but later seem to reconnect, as in this image derived from an article about the current ambassador of Russia to Belgium. Of course, the parameters to consider here are legion. We can use one or more languages to validate the structures in language L1, we can use different weights for different languages, we can penalize partial paths differently and so on.
  5. A category with too many leaves is likely to be problematic. Take “1975 births”. Does this mean we need to exclude such categories automatically, or can we detect some pattern in the networks we can generate with other wikicategories? The category 20th-century births can be related to births by year, which in turn has as subcategories things like events by year and people by year. Evidently, these cannot be considered parents. One of the critical things we need to do is determine what kind of common features the subsequent paths have. A previously available meta-ontology can help a lot here, especially in conjunction with a semantic reasoner.

A Russian article and a possible wikicategory path and English equivalents
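
To make strategies 1 and 4 a bit more concrete, here is a minimal Python sketch. It is not the code I actually ran; it is a toy under stated assumptions. The function names (is_lexical_parent, acyclic_paths, cross_language_support), the category graphs and the interlanguage links are all hypothetical, the token-prefix test is only a crude stand-in for a shallow parser, and the score is a simple greedy in-order match with no language weights or penalties for partial paths.

```python
from typing import Dict, List

Graph = Dict[str, List[str]]  # wikicategory -> its subwikicategories


def is_lexical_parent(candidate: str, child: str) -> bool:
    """Strategy 1, crudely: candidate's tokens are a proper prefix of child's."""
    cand, kid = candidate.lower().split(), child.lower().split()
    return 0 < len(cand) < len(kid) and kid[:len(cand)] == cand


def acyclic_paths(graph: Graph, start: str) -> List[List[str]]:
    """All maximal paths from start that never revisit a wikicategory."""
    paths: List[List[str]] = []

    def walk(node: str, path: List[str]) -> None:
        # Skip any subwikicategory already on the path, so cycles are cut.
        children = [c for c in graph.get(node, []) if c not in path]
        if not children:
            paths.append(path)
            return
        for child in children:
            walk(child, path + [child])

    walk(start, [start])
    return paths


def cross_language_support(path_l1: List[str],
                           paths_l2: List[List[str]],
                           links: Dict[str, str]) -> float:
    """Fraction of path_l1 that reappears, in order, on the best L2 path."""
    translated = [links[c] for c in path_l1 if c in links]
    best = 0
    for p2 in paths_l2:
        pos, matched = 0, 0
        for cat in translated:
            try:
                pos = p2.index(cat, pos) + 1  # greedy in-order match
            except ValueError:
                continue  # a gap: the path may reconnect further down
            matched += 1
        best = max(best, matched)
    return best / len(path_l1)


# Toy data, invented for illustration only (not real Wikipedia structure).
en = {
    "Diplomats": ["Russian diplomats"],
    "Russian diplomats": ["Ambassadors of Russia"],
    "Ambassadors of Russia": ["Ambassadors of Russia to Belgium"],
    "Ambassadors of Russia to Belgium": ["Diplomats"],  # deliberate cycle
}
de = {
    "Diplomat": ["Russischer Diplomat"],
    "Russischer Diplomat": ["Russischer Botschafter in Belgien"],
}
en_to_de = {
    "Diplomats": "Diplomat",
    "Russian diplomats": "Russischer Diplomat",
    "Ambassadors of Russia to Belgium": "Russischer Botschafter in Belgien",
}

for path in acyclic_paths(en, "Diplomats"):
    score = cross_language_support(path, acyclic_paths(de, "Diplomat"), en_to_de)
    print(" > ".join(path), score)
```

In the toy data the English path has one wikicategory without a German counterpart, so it is only partially supported. That is the kind of path that drops out of the cross-linguistic linking for a while and then gets back.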

 

See this for a somewhat old discussion on Wikipedia mining for semantic relations etc.

 

Analyzing graphs in Wikipedia (I)

This is the start of a series of posts on Wikipedia, DBpedia and data mining.

I have tried, like many others, to analyse the free text and the metadata in Wikipedia and, like some, I have tested different approaches to finding the optimal weights of different metadata collections for judging the relevance of textual data and for validating text mining.

One interesting item I discussed once was how the formal analysis of the graph structures of metadata can help us validate and weight textual and semi-textual data. By semi-textual data I mean above all Wikipedia categories.

Checking out the completeness of graphs for different Wikipedia data is a good start

It turns out that if you let a robot follow the translation links of Wikipedia articles to produce graphs and analyse whether those graphs are complete, you can detect to some extent how mappable the Wikipedia categories of those entries are and how problematic they can be for generating ontologies or proto-ontologies.
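
As a rough illustration of that completeness check, here is a small Python sketch. It assumes the translation links have already been collected into a dictionary; the names (Page, reachable, missing_links) and the three pages in the example are hypothetical, and the real crawling step is left out.

```python
from itertools import permutations
from typing import Dict, List, Set, Tuple

Page = Tuple[str, str]         # (language code, article title)
Links = Dict[Page, Set[Page]]  # page -> pages its translation links point to


def reachable(links: Links, start: Page) -> Set[Page]:
    """Every page the robot reaches by following translation links from start."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in links.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def missing_links(links: Links, start: Page) -> List[Tuple[Page, Page]]:
    """Ordered pairs (a, b) in the graph where a has no translation link to b."""
    nodes = reachable(links, start)
    return sorted((a, b) for a, b in permutations(nodes, 2)
                  if b not in links.get(a, ()))


# Invented example: the Chinese page does not link back to the German one.
en, de, zh = ("en", "Ambassador"), ("de", "Botschafter"), ("zh", "大使")
toy_links = {en: {de, zh}, de: {en, zh}, zh: {en}}

gaps = missing_links(toy_links, en)
print("complete" if not gaps else f"incomplete, missing: {gaps}")
```

An empty result means the graph is complete. Each missing pair is a place where the mapping between languages is doubtful and worth a closer look.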

For some time now, though, Wikipedia has used a system whereby translation links are managed in a central place and all-to-all translations are almost forced upon articles. But we can also use Wikipedia categories and the graphs produced by linking parent and sibling categories and category translations. This is a more productive area. I will go into this in the following post with some examples of mining the German, Chinese and English Wikipedia projects.