Hierarchy of categories and classifying Wikipedia articles using XML dump

Ilia Reznik, Vladimir Shatalov, 3 Jun 2016, CPOL
A hierarchical object is built from the relationships between categories and their parents. It is used in a classifier that detects whether an article belongs to a given category, even when that category is a distant ancestor.


The value of Wikipedia as a unique source for data mining can hardly be overstated. XML dumps of its entire textual content let any researcher look for hidden relationships between objects and for patterns in history and society.

A necessary stage of this process is filtering the information of interest. In the simplest cases a keyword search may be used. However, such an approach is not very reliable: it ignores what a keyword means within a phrase, and it is rarely possible to state a set of keywords that describes the subject closely. Wikipedia has an intrinsic mechanism for organizing its information: categorization. Almost every page is required to belong to at least one category. However, classification based on this fact is not easy to use in practice.

In this article we:

  • Use the code and results of our previous work to build a fairly complicated object representing the hierarchy of categories from Wikipedia.
  • Develop a classifier to determine whether a page belongs to a category or to any of its children.
  • Provide a GUI application demonstrating the use of the hierarchy of categories and performing the classification.


It helps to understand how Wikipedia pages are created and categorized. A complete "pages-articles" XML dump file must be downloaded. It may be enwiki-20160305-pages-articles-multistream.xml.bz2 (an archive of about 12.7 GB containing a 52.5 GB file). Then the project from our previous article, "Parsing Wikipedia XML dump", is needed to extract the actual data.

The C# code uses various collections, some LINQ, and regular expressions.

Categorization in Wikipedia

To start, let's take a very brief look at how Wikipedia handles categorization.

Every Web page returned by Wikipedia shows a list of categories to which it belongs. They are listed in the "Categories" box at the bottom of each page. For instance, the page "Algorithm" belongs to the categories "Algorithms", "Mathematical logic", and "Theoretical computer science".

We call these categories the "immediate parents" of the page and, correspondingly, treat the page as their "immediate child". Some exceptions apply. For instance, the most basic categories, such as "Contents", have no parents. Actual articles are stored under the "Articles" category. Currently there are more than a million categories, not even counting the so-called "hidden" categories and categories with special uses.

Each immediate parent category usually has its own immediate parents. For instance, "Mathematical logic" is an immediate child of "Logic", "Fields of Mathematics" and "Philosophy of mathematics". In this way parent-child relationships may form long chains.

The hierarchy of categories generated by this approach is not a simple tree, because each page or category usually has several parents. The structure is much more complicated. Ideally, it should be a directed acyclic graph: there should be no cycles, and in particular no case in which a child appears in the list of its own parents. In reality there are many such cycles and two-way parent-child relationships. Some of them are quite short, others span multiple levels. Wikipedia suggests a tool for revealing them. Examples:

[Figure: graph of categories with the nodes Action (philosophy), Free will, History of science, History of ideas, Science, Science education, Science technology engineering and mathematics. Arrows denote the relationship from parent to child.]

Wikipedia changes over time, so some of these cases may disappear. But it is difficult to imagine that all of them will be eliminated. When developing an automatic classifier based on the hierarchy of categories, we have to handle this very carefully.
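As a rough illustration of the simplest check of this kind, the sketch below lists pairs of categories that name each other as immediate parents. The CycleCheck class and the shape of the parents dictionary are hypothetical and are not part of the article's code; longer cycles are not detected here and need a full graph walk.

```csharp
using System;
using System.Collections.Generic;

public static class CycleCheck
{
    // Returns pairs (a, b) where a lists b as a parent and b lists a as a parent.
    // Each pair is reported once, with a preceding b in ordinal order.
    public static List<Tuple<string, string>> MutualParents(
        Dictionary<string, HashSet<string>> parents)
    {
        var result = new List<Tuple<string, string>>();
        foreach (var entry in parents)
        {
            foreach (string parent in entry.Value)
            {
                HashSet<string> parentsOfParent;
                bool mutual = parents.TryGetValue(parent, out parentsOfParent)
                              && parentsOfParent.Contains(entry.Key);
                // Keep only the ordered variant so (a, b) and (b, a) are not both listed.
                if (mutual && string.CompareOrdinal(entry.Key, parent) < 0)
                    result.Add(Tuple.Create(entry.Key, parent));
            }
        }
        return result;
    }
}
```

A category that lists itself as its own parent is skipped by the ordering test; it would need a separate check.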

Another thing to take into account is that an error in the relationships of a category may have a much more significant effect than an error in a regular page, because a search may return a large set of false positives (when a wrong category is defined as a child) or false negatives (when a true child is missed). The choice of parents is almost entirely the responsibility of Wikipedia contributors. A couple of examples of related problems are given later. Researchers should therefore not rely only on a formal mathematical approach when querying the hierarchy. They must be ready to use common sense, run enough statistical spot checks, and choose the most reasonable queries.

Fortunately, the situation is not as bad as it may appear at first glance. Anyone can select several categories by typing "Category:[SomeCategory]" in the Wikipedia search box and walk through parents or children to check how reasonably the chain is built.

Researcher's goal

Building the hierarchy object is not an end in itself. The practical task is extracting information related to the subject of research. In many cases the subject may be described in terms of categories. Let's consider an example.

Suppose our study is focused on people in sport who are mentioned in Wikipedia. Filtering the list of categories, we find that a category "Sportspeople" exists. If pages with their parent categories have been extracted as we described earlier, then we have a lookup table like

Name              Born  Died  Age  Count  Parent categories
Moonlight Graham  1879  1965  86   45     ...|Sportspeople from Fayetteville, North Carolina|American physicians|...|New York Giants (NL) players|University of Maryland, Baltimore alumni|...
Lou Thesz         1916  2002  86   198    ..|American male professional wrestlers|American people of Hungarian descent|Professional Wrestling Hall of Fame and Museum|...
Marc Perrodon     1878  1939  61   9      ..|People from Vendôme|French male fencers|Olympic fencers of France|Fencers at the 1908 Summer Olympics|...

The first row explicitly mentions the word "Sportspeople" in the immediate parents, so classification seems obvious. However, it is not, because the actual parent is "Sportspeople from...", not "Sportspeople". A single word may mean anything within a whole category name. "Sportspeople from Fayetteville, North Carolina" is indeed a child category of "Sportspeople", but sharing a word is not credible evidence of a parent-child relationship in the general case.

Classification of the second and third rows is clear to a human, because we know that wrestlers and fencers (especially those who participated in the Olympic Games) are sportspeople. A computer does not know this a priori. That is why knowledge of all children of the category of interest is needed for automatic classification. A page is related to the subject category only if the category itself or one of its children is present, by exact name, among the page's immediate parents.
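The rule just stated can be sketched as follows. This is only a minimal illustration, not the article's classifier: the PageClassifier class and its parameters are hypothetical, and the set of names is assumed to have been collected beforehand from the hierarchy.

```csharp
using System;
using System.Collections.Generic;

public static class PageClassifier
{
    // parentsField: pipe-delimited immediate parents of a page,
    //               as in the "Parent categories" column of the lookup table.
    // targetAndChildren: names of the category of interest and of all its children.
    public static bool BelongsTo(string parentsField, HashSet<string> targetAndChildren)
    {
        string[] names = parentsField.Split(new[] { '|' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string name in names)
        {
            // Exact match on the whole category name; sharing a word is not enough.
            if (targetAndChildren.Contains(name.Trim()))
                return true;
        }
        return false;
    }
}
```

With a set containing "Sportspeople" and, say, "French male fencers", the third row of the table would be accepted, while a page whose only sport-related parent is an unlisted name containing the word "Sportspeople" would not.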

Let's build a graph of the category hierarchy and a classifier based on it.

Hierarchy class

Two dictionaries form the core of Hierarchy class:

/// <summary>CategoryData of category by name</summary>
public Dictionary<string, CategoryData> CategoriesByName;

/// <summary>Category name by index</summary>
public Dictionary<int, string> CategoriesByIndex; 

The first one allows quick access to category properties by name; the second one retrieves a category name by index. Category properties are held in the CategoryData class:

/// <summary>A class containing information about category</summary>
public class CategoryData
{
    /// <summary>Immediate parents for the category by index</summary>
    public HashSet<int> Parents = new HashSet<int>();
    /// <summary>Immediate children for the category by index</summary>
    public HashSet<int> Children = new HashSet<int>();

    /// <summary>Index of the category</summary>
    public int Index;
    /// <summary>Distance from the top category; -1 until it is set</summary>
    public int Level;

    /// <summary>Constructor</summary>
    public CategoryData(int index, int level = -1)
    {
        Index = index;
        Level = level;
    }
}

We introduce Level to measure how far a category is from the top category. Any category may serve as the top one. However, to include the whole content of Wikipedia, it should be set to "Contents" or "Articles". The level of the top category is zero by definition. The level of each of its immediate children is one, the level of their children is two, and so on. Before calculating levels, we need to load all categories with their immediate parents. For each category they are stored in the Parents member of CategoryData. HashSet<int> is used because the order of members does not matter and the set guarantees that there are no duplicates.

If category A is an immediate parent of category B, then B is an immediate child of A. Quick access to children is available through the Children member of CategoryData.
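Since only the parents are read from the input, the children have to be derived by inverting the parent links. A minimal sketch of that inversion is shown below; it works on plain dictionaries of indexes rather than on CategoryData, and the ChildrenBuilder name is hypothetical.

```csharp
using System.Collections.Generic;

public static class ChildrenBuilder
{
    // Inverts "category -> immediate parents" into "category -> immediate children".
    public static Dictionary<int, HashSet<int>> FromParents(
        Dictionary<int, HashSet<int>> parents)
    {
        var children = new Dictionary<int, HashSet<int>>();
        foreach (var entry in parents)
        {
            foreach (int parent in entry.Value)
            {
                HashSet<int> set;
                if (!children.TryGetValue(parent, out set))
                {
                    set = new HashSet<int>();
                    children[parent] = set;
                }
                set.Add(entry.Key); // entry.Key is an immediate child of parent
            }
        }
        return children;
    }
}
```

Categories without children simply get no entry in the returned dictionary.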

The constructor of the Hierarchy class reads a tab-delimited file containing category names and the names of their parents into these dictionaries. The file may contain an optional "Level" column:

/// <summary>
/// This constructor loads a tab-delimited file containing categories, 
/// their levels (optional), and immediate parents.
/// </summary>
/// <param name="categoryParentsPath">
/// Path to tab-delimited file with the following columns: 
/// Category  [Level] [Any columns] Parent1|Parent2|...
/// </param>
public Hierarchy(string categoryParentsPath) 
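To make the file format concrete, here is a minimal sketch of parsing a single line. It is not the article's constructor: the CategoryLine class is hypothetical, and the way the optional Level column is recognized (three or more columns with an integer in the second one) is an assumption made for this example.

```csharp
using System;
using System.Collections.Generic;

public class CategoryLine
{
    public string Name;
    public int Level = -1; // -1 means "not set", mirroring the CategoryData default
    public List<string> Parents = new List<string>();

    public static CategoryLine Parse(string line)
    {
        string[] fields = line.Split('\t');
        var result = new CategoryLine { Name = fields[0].Trim() };
        if (fields.Length < 2)
            return result; // a category with no parents listed

        // Assumption: with three or more columns, an integer second column is the Level.
        int level;
        if (fields.Length >= 3 && int.TryParse(fields[1], out level))
            result.Level = level;

        // The last column holds the immediate parents, separated by '|'.
        string last = fields[fields.Length - 1];
        foreach (string p in last.Split(new[] { '|' }, StringSplitOptions.RemoveEmptyEntries))
            result.Parents.Add(p.Trim());
        return result;
    }
}
```

A real loader would also have to assign indexes and fill both dictionaries; that part is omitted here.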

If the input file contains a Level column, the level of each category is set to the value from the file. Otherwise, levels are calculated at the end of this method relative to "Articles", if such a category exists. If not, subsequent use requires the caller to execute

/// <summary>Calculates levels relatively to the specified top category</summary>
/// <param name="topCategory">Name of top category</param>
/// <returns>None</returns>
public void SetLevelsFromTop(string topCategory)
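One plausible way to compute such levels is a breadth-first walk from the top category. The sketch below is an illustration under that assumption and is not the article's implementation: it assigns each category the length of the shortest chain of child links from the top, and the LevelCalculator name and dictionary-based graph are hypothetical.

```csharp
using System.Collections.Generic;

public static class LevelCalculator
{
    // Returns category index -> level. Categories unreachable from top are absent.
    public static Dictionary<int, int> FromTop(int top, Dictionary<int, HashSet<int>> children)
    {
        var levels = new Dictionary<int, int> { { top, 0 } };
        var queue = new Queue<int>();
        queue.Enqueue(top);
        while (queue.Count > 0)
        {
            int current = queue.Dequeue();
            HashSet<int> kids;
            if (!children.TryGetValue(current, out kids))
                continue; // no children recorded for this category
            foreach (int kid in kids)
            {
                if (levels.ContainsKey(kid))
                    continue; // already reached through a chain that is not longer
                levels[kid] = levels[current] + 1;
                queue.Enqueue(kid);
            }
        }
        return levels;
    }
}
```

Because visited categories are skipped, a cycle leading back to the top category does not change its level of zero.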

To save the Hierarchy object (or its subgraph, after a call to SetLevelsFromTop(...)), use Save(...):

/// <summary>
/// Saves the hierarchy in a tab-delimited file of format: Category Level Parent1|Parent2...
/// </summary>
/// <param name="path">Path to the output file</param>
/// <returns>None</returns>
public void Save(string path)

The saved file, being smaller, may be used as input when only a subgraph is needed.

To retrieve all parents or children (not only immediate ones), use AllRelatives(...):

/// <summary>Calculates all relatives of category</summary>
/// <param name="index">Index of category</param>
/// <param name="direction">true for children, false for parents</param>
/// <returns>HashSet<int> (Index and Level) </returns>
public HashSet<int> AllRelatives(int index, bool direction)

These calculations may take relatively long. To avoid freezing the GUI, the method is implemented as a wrapper around an asynchronous task that returns its result:

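private async Task<HashSet<int>> AllRelativesTask(int index, bool direction)
{
    return await Task.Run(() => GetAllRelatives(index, direction)).ConfigureAwait(false);
}

private HashSet<int> GetAllRelatives(int index, bool direction)
{
    HashSet<int> result = new HashSet<int>();
    HashSet<int> relatives = new HashSet<int> { index }; // assumed: the walk starts from the category itself
    while (relatives.Count > 0)
    {
        relatives = ValidRelatives(relatives, direction);
        relatives.ExceptWith(result);  // assumed: skip categories already collected, which breaks cycles
        result.UnionWith(relatives);   // assumed: accumulate the newly found relatives
    }
    return result;
}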

This method implements a non-recursive walk through the graph in the given direction. Using the HashSet class takes care of breaking cycles. The loop is finite because the number of categories is finite and result contains no duplicates.
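The same idea can be tried in isolation on a small graph. The sketch below is a self-contained variant written for illustration only: it works on a plain dictionary of links instead of the Hierarchy object and applies no level restrictions; the GraphWalk name is hypothetical.

```csharp
using System.Collections.Generic;

public static class GraphWalk
{
    // Collects every node reachable from start by following links.
    public static HashSet<int> AllReachable(int start, Dictionary<int, HashSet<int>> links)
    {
        var result = new HashSet<int>();
        var frontier = new HashSet<int> { start };
        while (frontier.Count > 0)
        {
            var next = new HashSet<int>();
            foreach (int node in frontier)
            {
                HashSet<int> neighbours;
                if (links.TryGetValue(node, out neighbours))
                    next.UnionWith(neighbours);
            }
            next.ExceptWith(result); // nodes seen before are dropped, so a cycle cannot loop forever
            result.UnionWith(next);
            frontier = next;
        }
        return result;
    }
}
```

Note that the starting node itself appears in the result only when a cycle leads back to it.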

ValidRelatives(...) collects the acceptable immediate relatives of all nodes in the HashSet<int> categories collection:

private HashSet<int> ValidRelatives(HashSet<int> categories, bool direction)
{
    HashSet<int> result = new HashSet<int>();
    foreach (int i in categories)
    {
        HashSet<int> validRelatives = ValidRelatives(i, direction);
        result.UnionWith(validRelatives); // assumed: merge into the combined set
    }
    return result;
}

The meaning of HashSet<int> ValidRelatives(int index, bool direction) will be explained after considering the following example of real-world child-parent relationships:

[Figure: chain of categories with the nodes Competition, Conceptual distinctions, Abstraction, Thought, Competitions, Difference, Sports competitions, Mind, Animal anatomy, Brain.]

Though there is no guarantee that the related content of Wikipedia will not change, it is worth tracing this chain, which is quite typical. To do so, open the Wikipedia page of each category and look at the parent categories listed at the bottom.

It makes no sense that "Sports competitions" appears as a child of "Abstraction" and even of "Animal anatomy". To understand why this happens, let's look at the levels of the categories.

The first questionable relation is Competition (Level 3) <- Difference (4), where the arrow points from parent to child. The second one is Thought (3) <- Mind (5) <- Brain (6). Normally, a parent should be located higher in the hierarchy and have a lower Level value than its child. The chain above violates this rule.

To minimize the risk of such meaningless results, we restrict the possible relations between parent and child levels using the ParentLevelAllowance property

public enum ParentLevelAllowanceType { LowerOnly = 0, SameOrLower = 1, Any = 2 };
private ParentLevelAllowanceType parentLevelAllowance = ParentLevelAllowanceType.SameOrLower; // Default

and calculate ValidRelatives this way:

public HashSet<int> ValidRelatives(int category, bool direction)
{
    int level = Level(category);
    switch (ParentLevelAllowance)
    {
        case ParentLevelAllowanceType.LowerOnly:
            return direction ? new HashSet<int>(Children(category).Where(x => Level(x) > level))
                             : new HashSet<int>(Parents(category).Where(x => Level(x) < level));
        case ParentLevelAllowanceType.SameOrLower:
            return direction ? new HashSet<int>(Children(category).Where(x => Level(x) >= level))
                             : new HashSet<int>(Parents(category).Where(x => Level(x) <= level));
        default: // ParentLevelAllowanceType.Any: no restriction on levels
            return direction ? new HashSet<int>(Children(category))
                             : new HashSet<int>(Parents(category));
    }
}

It is worth trying both LowerOnly and SameOrLower when studying categories of interest. In our experience, SameOrLower is the best default. LowerOnly lowers the number of false positives at the cost of reducing the total number of detections.

To illustrate the problem, let's assume that the research is focused on people in science. The "Science" category seems a good place to start. With LowerOnly it produces a list of about 40000 children. Though there may be exceptions, we did not find any non-scientific children (as far as we could judge). At the same time, it is easy to see that the list is not as complete as it should be. SameOrLower generates a much larger set with many bogus children, including things we do not want:

Chain from Science (2) to Prayers by meher baba (5): Science -> Scientific disciplines -> Social sciences -> Cognitive science -> Epistemology -> Epistemology of religion -> Religious behaviour and experience -> Prayer -> Prayers by meher baba

These observations lead to the conclusion that studying parent-child chains is important, and that the "Science" category is not a good choice for the research described above. A much better choice is the "Scientists" category. It brings very accurate results, especially when applied to a list of people. The GUI utility may be used to verify that.

GUI Utility

The utility was developed to illustrate the use of the Hierarchy class and allows some research to be carried out. It was used to produce all the examples above. It requires on input the file containing categories and their parents, extracted from the full Wikipedia dump.

The main window groups three tasks, each represented by its own tab. The first tab searches for category names matching a substring or regular expression. This search is useful when the researcher does not know the names of the categories that match the subject of interest most closely.

Sample output is as follows:

This grid and the ones shown below are sortable and searchable. Names of categories are clickable: clicking a cell in the first column opens the corresponding Wikipedia page.

The second tab lists the parents and children of the selected category:

Clicking the "Parents" button returns four parents:

Clicking the "Children" button brings, in this case, a list of 589 categories:

The "Explain" item of the popup menu reveals chains (possible paths in the graph, not necessarily the shortest) of parent-child relationships. For the selected categories they are:

Studying such chains is an important part of designing the most appropriate queries, which may, for instance, exclude some subcategories from the paths.
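How the utility builds these chains is not shown here. As one possible approach, the sketch below finds a chain between two categories with a breadth-first search that records, for every category, the one it was reached from. Unlike the chains described above, this variant returns a shortest chain. The ChainFinder name and the dictionary-based graph are hypothetical.

```csharp
using System.Collections.Generic;

public static class ChainFinder
{
    // Returns one chain of indexes from ancestor down to descendant, or null if none exists.
    public static List<int> FindChain(int ancestor, int descendant,
                                      Dictionary<int, HashSet<int>> children)
    {
        var cameFrom = new Dictionary<int, int> { { ancestor, ancestor } };
        var queue = new Queue<int>();
        queue.Enqueue(ancestor);
        while (queue.Count > 0)
        {
            int current = queue.Dequeue();
            if (current == descendant)
            {
                // Walk back through the recorded predecessors, then reverse.
                var chain = new List<int> { current };
                while (current != ancestor)
                {
                    current = cameFrom[current];
                    chain.Add(current);
                }
                chain.Reverse();
                return chain;
            }
            HashSet<int> kids;
            if (!children.TryGetValue(current, out kids))
                continue;
            foreach (int kid in kids)
            {
                if (cameFrom.ContainsKey(kid))
                    continue; // already reached by another path
                cameFrom[kid] = current;
                queue.Enqueue(kid);
            }
        }
        return null;
    }
}
```

Mapping the returned indexes through CategoriesByIndex would give a readable chain of category names.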

The GUI presented here works with queries containing just a single category. Programming queries that contain simple logical expressions on categories seems straightforward.
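For example, once the sets of children of two categories are available (such as the HashSet<int> values returned by AllRelatives), logical expressions reduce to set operations. The CategoryQuery helper below is a hypothetical sketch and is not part of the utility.

```csharp
using System.Collections.Generic;

public static class CategoryQuery
{
    // Categories in a OR b.
    public static HashSet<int> Or(HashSet<int> a, HashSet<int> b)
    {
        var result = new HashSet<int>(a);
        result.UnionWith(b);
        return result;
    }

    // Categories in a AND b.
    public static HashSet<int> And(HashSet<int> a, HashSet<int> b)
    {
        var result = new HashSet<int>(a);
        result.IntersectWith(b);
        return result;
    }

    // Categories in a AND NOT b, e.g. to exclude an unwanted subcategory with its children.
    public static HashSet<int> AndNot(HashSet<int> a, HashSet<int> b)
    {
        var result = new HashSet<int>(a);
        result.ExceptWith(b);
        return result;
    }
}
```

Each method copies its first argument, so the input sets are left unchanged and can be reused in further expressions.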

The last tab illustrates the classifier, which filters pages belonging to the specified category from a list of Wikipedia pages or from the biographical pages created in our previous work:

Here a random selection containing, say, 10% or even 0.1% of the input file may be used. This saves time when selecting random samples for manual statistical verification of the output:
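One simple way to draw such a selection is to keep each input line independently with the given probability. The sketch below illustrates that approach; it is not necessarily how the utility samples, and the Sampler name and seed parameter are assumptions of this example.

```csharp
using System;
using System.Collections.Generic;

public static class Sampler
{
    // Keeps each line with probability fraction (0.1 for about 10%, 0.001 for about 0.1%).
    // A fixed seed makes the selection repeatable.
    public static IEnumerable<string> Sample(IEnumerable<string> lines, double fraction, int seed)
    {
        var random = new Random(seed);
        foreach (string line in lines)
        {
            // NextDouble() returns a value in [0, 1), so fraction 1.0 keeps every line.
            if (random.NextDouble() < fraction)
                yield return line;
        }
    }
}
```

The size of the result is only approximately the requested fraction of the input, which is usually acceptable for manual spot checks.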

Happy Wikipedia mining!


June 1st, 2016: Original publication

June 3rd, 2016: Minor changes:

  • Fixed a couple of typos and spelling mistakes
  • Several screenshots resized to fit layout better.



This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Authors

Ilia Reznik
United States
Programmer, developer, and researcher. Worked for several software companies, including Microsoft.
PhD in Theoretical Physics, D.Sc. in Solid State Physics.
Enjoys discovering patterns in nature, history and society, which break public opinion or are hidden from it.

Vladimir Shatalov
Instructor / Trainer National Technical University of Ukraine "Kiev Pol
Ukraine
Professor Vladimir Shatalov works at the National Technical University of Ukraine 'Kyiv Polytechnic Institute', Slavutych branch, where he teaches Computer Science. Research interests include Data Mining, Artificial Intelligence, Theoretical Physics and Biophysics.
Research activity also covers the mechanisms by which non-thermal electromagnetic and acoustic fields affect bio-liquids, and the effects of irradiation on the physical and chemical properties of water.

Article Copyright 2016 by Ilia Reznik, Vladimir Shatalov