Analyzing graphs in Wikipedia (II)

Here I will call Wikipedia categories simply wikicategories. Wikicategories listed under a wikicategory article are simply subwikicategories.

We can gain a lot of semantically relevant information from the links between wikicategories if we identify the features that characterize networks built from useful semantic relationships between wikicategories and contrast them with networks produced by randomly following all the links between wikicategories. Those features can be identified with unsupervised or supervised algorithms, or with something in between. We believe the last option is best: relevant relationships are domain-dependent, and the understanding of a knowledge expert can help establish where to look first. Let’s talk about some rules of thumb here.

As we have said, Wikipedia categories are not concepts but more akin to tags in a folksonomy. Subwikicategories can often be used to automatically derive possible is-a relationships for an ontology but the linking is far from trivial. In many cases subwikicategories refer to features or simply topics mildly related to the initial wikicategory.

We can assume that the wikicategories attached to a wikicategory, its wiki-parents, are potential references to parent concepts. In fact, a given wikicategory can have among its attached tags one or more wikicategories apt to be its parent.
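As a minimal sketch of this idea, assume the subwikicategory links have already been extracted into a plain dictionary. The category names and the `wiki_parents` helper below are hypothetical and only illustrate how the wiki-parents of a wikicategory can be collected by inverting those links.

```python
from collections import defaultdict

# Hypothetical toy data: wikicategory -> its subwikicategories.
SUBCATEGORIES = {
    "Ambassadors of Russia": ["Ambassadors of Russia to Belgium"],
    "Ambassadors to Belgium": ["Ambassadors of Russia to Belgium"],
    "Ambassadors of Russia to Belgium": [],
}


def wiki_parents(subcategories):
    """Invert the subcategory links: map each wikicategory to its wiki-parents."""
    parents = defaultdict(set)
    for parent, children in subcategories.items():
        for child in children:
            parents[child].add(parent)
    return dict(parents)


PARENTS = wiki_parents(SUBCATEGORIES)
# A wikicategory can have several wiki-parents; each one is only a candidate parent concept.
print(sorted(PARENTS["Ambassadors of Russia to Belgium"]))
```

Every entry in the result is a candidate only. Deciding which wiki-parents are real parent concepts is a separate step.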

Some of the strategies we can use to determine the most likely “parents” are:

  1. Morpho-syntactic relationship: evidently, Russian ambassador is likely to be a parent of Russian ambassador to Belgium. A shallow parser can help here. “Person nach Staatsangehörigkeit” or “people by nationality”
  2. If we add some semantic knowledge, we can make this method wiser. We can derive that the wikicategory European author can be used as a semantic parent for Norwegian author.
  3. We can train neural networks or use Bayesian statistics to filter out the less likely patterns: those where a substring still does not reflect a parent, or where an adjective that might be a generalization of another is less reliable – Spanish
  4. Cross-validation across languages is a particularly fruitful approach. Given a wikicategory A in language L1, we list the paths produced by following its subwikicategories recursively, as long as they do not lead into a cycle. We then do the same for the wikicategory B in language L2 that is linked to wikicategory A in L1. We will often have competing paths in L1 that have equivalent paths in L2. We will sometimes have paths in L2 that are only partially similar. We will also have paths that stop being cross-linguistically linked for a while but later reconnect, as in this image derived from an article about the current ambassador of Russia to Belgium. Of course, the parameters to consider here are legion: we can use one or more languages to validate the structures in language L1, we can use different weights for different languages, we can penalize partial paths differently, and so on.
  5. A category with too many leaves is likely to be problematic. Take “1975 births”. Does this mean we need to exclude such categories automatically, or can we detect some pattern in the networks we can generate with other wikicategories? The category 20th century births can be related to births by year, which itself has as subcategories things like events by year and people by year. Evidently, these cannot be considered parents. One of the critical things we need to do is determine what common features the subsequent paths have. A previously available meta-ontology can help a lot here, especially in conjunction with a semantic reasoner.
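The first strategy in the list can be roughly approximated even without a real shallow parser. The sketch below is an assumption, not the method used here: it swaps the parser for plain whitespace tokenization and treats a candidate as a likely parent when its tokens form a proper prefix of the category's tokens. The function names are hypothetical.

```python
def is_likely_parent(candidate, category):
    """True if the candidate's tokens are a proper prefix of the category's tokens."""
    cand = candidate.lower().split()
    cat = category.lower().split()
    return 0 < len(cand) < len(cat) and cat[:len(cand)] == cand


def rank_parents(category, candidates):
    """Keep the likely parents, most specific (longest) first."""
    likely = [c for c in candidates if is_likely_parent(c, category)]
    return sorted(likely, key=lambda c: len(c.split()), reverse=True)


print(rank_parents(
    "Russian ambassador to Belgium",
    ["Russian", "Russian ambassador", "Belgian ambassador", "1975 births"],
))
```

A prefix test like this cannot relate Norwegian author to European author. That case needs the semantic knowledge mentioned in the second strategy.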
A Russian article and a possible wikicategory path and English equivalents
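The cross-language comparison of paths can be sketched as follows. Everything here is illustrative: the two toy graphs, the `LINKS` table of interlanguage links, and the scoring rule (longest common subsequence divided by path length) are assumptions chosen for the example, and other weights or penalties are equally possible. A subsequence match is used so that a path which loses its cross-language link for a node and then reconnects still gets partial credit.

```python
def acyclic_paths(graph, start):
    """Yield every path from start that follows subcategory links without revisiting a node."""
    stack = [(start,)]
    while stack:
        path = stack.pop()
        yield path
        for child in graph.get(path[-1], ()):
            if child not in path:  # stop before entering a cycle
                stack.append(path + (child,))


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def support(path_l1, paths_l2, links):
    """Share of an L1 path that is matched, in order, by the best L2 path."""
    # Nodes without an interlanguage link translate to None and never match.
    translated = [links.get(node) for node in path_l1]
    best = max((lcs_length(translated, p) for p in paths_l2), default=0)
    return best / len(path_l1)


# Hypothetical toy graphs: category -> subcategories, one graph per language.
GRAPH_L1 = {"A": ["B"], "B": ["C", "A"], "C": ["D"]}  # B -> A would close a cycle
GRAPH_L2 = {"a": ["b"], "b": ["d"]}
LINKS = {"A": "a", "B": "b", "D": "d"}  # C has no counterpart in L2

paths_l2 = list(acyclic_paths(GRAPH_L2, LINKS["A"]))
for path in acyclic_paths(GRAPH_L1, "A"):
    print(path, support(path, paths_l2, LINKS))
```

In this toy data the path A, B, C, D loses its link at C and reconnects at D, so three of its four nodes are supported by the L2 path a, b, d. Using several validation languages would mean combining one such score per language, for instance with per-language weights.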


See this for a somewhat old discussion on Wikipedia mining for semantic relations etc.