Posted 24 Jul 2005



Maximum Entropy Modeling Using SharpEntropy

Updated 9 May 2006 · 12 min read
Presents a Maximum Entropy modeling library, and discusses its usage, with the aid of two examples: a simple example of predicting outcomes, and an English language tokenizer.


This article presents a maximum entropy modeling library called SharpEntropy, and discusses its usage, first by way of a simple example of predicting outcomes, and secondly by presenting a way of splitting English sentences into their constituent tokens (useful for natural language processing). Please note that because most of the code is a conversion based on original Java libraries published under the LGPL license, the source code available for download with this article is also released under the LGPL license. This means it can be used freely in software released under any sort of license, but if you make changes to the library itself and those changes are not for your private use, you must release the source code for those changes.

A second article, Statistical parsing of English sentences, shows how SharpEntropy can be used to perform sophisticated natural language processing tasks. A small additional library, SharpEntropySqlite, downloadable with this article, provides a simple means of storing maximum entropy models created by SharpEntropy in a Sqlite database.


SharpEntropy is a C# port of the MaxEnt toolkit available from SourceForge. MaxEnt is a mature library written in Java, originally created by Jason Baldridge, Tom Morton, and Gann Bierner. SharpEntropy is based upon the latest version (2.3.0) of the MaxEnt toolkit, released in August 2004.

Maximum entropy modeling is a general-purpose machine learning technique originally developed for statistical physics, but which has been employed in a wide variety of fields, including computer vision and natural language processing. It can be usefully applied whenever there is a complex process whose internals are unknown but which must be modeled by the computer. Like any statistical modeling technique, it relies on the existence of a data sample that shows, for given sets of input, what output the process generates. This sample is analysed, and from it, a model is generated, encapsulating all the rules about the process that could be inferred from the sample. This model is then used to predict the output of the process, when supplied with sets of input not found in the sample data.

What technique should be used to generate the model from the sample data? The maximum entropy method follows the principle of “maximizing entropy” – that is, it chooses the model that takes account of all the facts available in the sample data but otherwise preserves as much uncertainty as possible. The principle could be stated as: "Don’t assume anything about your probability distribution other than what you have observed".

The more mathematically inclined may wish to follow some of the references at the end of the article to find out more about the maximum entropy method.

A Simple Example

Suppose you wish to predict whether on a given day, a certain man will leave the house with his umbrella or not. The first thing you do is gather some sample data by observing over a number of days what the man does:

Day 1        Warm        Dry        No_Umbrella
Day 2        Cold        Dry        No_Umbrella
Day 3        Cold        Rainy      Umbrella
Day 4        Cold        Dry        Umbrella
Day 5        Warm        Dry        No_Umbrella

After five days, you realise that there may be an additional factor at work: whether the man is late in the morning or not. So, you collect some more sample data:

Day 6        Cold        Dry        Early        Umbrella
Day 7        Cold        Rainy      Early        Umbrella
Day 8        Cold        Dry        Late         No_Umbrella
Day 9        Warm        Rainy      Late         No_Umbrella
Day 10       Warm        Dry        Late         No_Umbrella

Each of these rows of data represents a training event. Each training event has an outcome (in this example, either "Umbrella" or "No_Umbrella"), and a context, which consists of one or more predicates ("Cold", "Dry" etc.). The context is the combination of predicates that leads to the outcome of the event. Note that the order of the predicates is not significant.
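The event format is simple enough to sketch in a few lines. The following Python snippet is an illustration only, not part of SharpEntropy (the helper name read_events is made up): it parses lines of space-separated predicates where the last token on each line is the outcome.

```python
def read_events(path):
    """Read training events from a text file: one event per line,
    predicates separated by single spaces, outcome as the last token."""
    events = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            # (context predicates, outcome)
            events.append((parts[:-1], parts[-1]))
    return events
```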

The training data for our simple umbrella example can be found in the file named data.txt in the same folder as the SimpleExample executable. This is a text file with one training event per line. In each line, the contextual predicates and the outcome are separated by single spaces (the outcome is always assumed to be the last string on the line). This training data format is chosen because the SharpEntropy library already has a class called SharpEntropy.IO.PlainTextByLineDataReader that can read text files delimited in this way. The following code, taken from the source of the SimpleExample application, shows how this training data file can be loaded and used to generate a maximum entropy model:

System.IO.StreamReader trainingStreamReader = 
          new System.IO.StreamReader(trainingDataFile);
SharpEntropy.ITrainingEventReader eventReader = 
          new SharpEntropy.BasicEventReader(
          new SharpEntropy.IO.PlainTextByLineDataReader(trainingStreamReader));
SharpEntropy.GisTrainer trainer = new SharpEntropy.GisTrainer();
trainer.TrainModel(eventReader);
mModel = new SharpEntropy.GisModel(trainer);

The GisTrainer class trains a maximum entropy model using the GIS (Generalized Iterative Scaling) method. There are a number of other methods for training MaxEnt models, many of which outperform GIS. Currently, however, GIS is the only method that the Java MaxEnt library supports, and therefore the only method whose algorithm was converted for use in SharpEntropy. The trainer class expects to be fed with a stream of events provided by a class that implements the ITrainingEventReader interface. The PlainTextByLineDataReader and BasicEventReader classes together provide the means to supply this stream of events from the delimited text file, whose path is stored in the string variable trainingDataFile.
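To make the training step less of a black box, here is a deliberately tiny Python sketch of Generalized Iterative Scaling over the five-day umbrella sample. This is an illustration of the algorithm only, not the GisTrainer implementation: it creates one feature per (predicate, outcome) pair seen in the data, and omits the correction feature and smoothing that a production GIS trainer handles.

```python
import math
from collections import defaultdict

# The five observed training events: (context predicates, outcome).
events = [
    ({"Warm", "Dry"}, "No_Umbrella"),
    ({"Cold", "Dry"}, "No_Umbrella"),
    ({"Cold", "Rainy"}, "Umbrella"),
    ({"Cold", "Dry"}, "Umbrella"),
    ({"Warm", "Dry"}, "No_Umbrella"),
]
outcomes = sorted({o for _, o in events})
# One binary feature per (predicate, outcome) pair seen in the data.
feats = sorted({(p, o) for ctx, o in events for p in ctx})
lam = dict.fromkeys(feats, 0.0)  # feature weights, all zero initially

def evaluate(ctx):
    """Return {outcome: probability} for a context (a set of predicates)."""
    scores = {o: math.exp(sum(lam.get((p, o), 0.0) for p in ctx))
              for o in outcomes}
    z = sum(scores.values())
    return {o: s / z for o, s in scores.items()}

n = len(events)
C = max(len(ctx) for ctx, _ in events)  # GIS constant: max active features

# Empirical feature expectations, fixed by the training data.
empirical = defaultdict(float)
for ctx, o in events:
    for p in ctx:
        empirical[(p, o)] += 1.0 / n

for _ in range(100):  # GIS iterations
    # Model feature expectations under the current weights.
    expected = defaultdict(float)
    for ctx, _ in events:
        probabilities = evaluate(ctx)
        for o in outcomes:
            for p in ctx:
                expected[(p, o)] += probabilities[o] / n
    # Scale each weight toward the empirical expectation.
    for f in feats:
        lam[f] += math.log(empirical[f] / expected[f]) / C
```

After training, evaluate({"Cold", "Rainy"}) assigns most of the probability mass to "Umbrella", because "Rainy" only ever co-occurred with that outcome in the sample.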

The result is an instance of the GisModel class, the only class currently coded that implements the IMaximumEntropyModel interface. Once the model has been trained, it can be used to predict future outcomes. The SimpleExample application lets you choose a combination of predicates using dropdown lists:

SimpleExample application user interface

When the Take Umbrella? button is clicked, the chosen features are added to an array of strings and passed into the Evaluate method of the GisModel object:

private void btnOutcome_Click(object sender, System.EventArgs e)
{
    ArrayList context = new ArrayList();

    if (cboTemperature.Text != "Unknown")
    {
        context.Add(cboTemperature.Text);
    }
    if (cboPrecipitation.Text != "Unknown")
    {
        context.Add(cboPrecipitation.Text);
    }
    if (cboTiming.Text != "Unknown")
    {
        context.Add(cboTiming.Text);
    }

    double[] probabilities = mModel.Evaluate(
        (string[])context.ToArray(typeof(string)));

    lblOutcome.Text = probabilities[mUmbrellaOutcomeId].ToString("N5");
}

The Evaluate method returns an array of doubles containing the probabilities of all the possible outcomes. Together, these probabilities always add up to 1. In this simple example there are only two possible outcomes, Umbrella and No_Umbrella, so it is easy enough to display just the probability of taking the umbrella. The IMaximumEntropyModel interface contains a number of other useful methods for making sense of the results of a prediction, the details of which can be found in the NDoc-generated library help file. There is also an overloaded version of the Evaluate method that takes an array of doubles as a parameter: in more complicated scenarios, where Evaluate may be called many hundreds or thousands of times, performance is improved by recycling an array of doubles that is already sized to match the number of possible outcomes. That array of doubles can then be passed, if required, to the GetBestOutcome method, to obtain the name of the most likely outcome, or to the GetAllOutcomes method, which returns a formatted string listing all the possible outcomes together with their probabilities.
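The two helper methods are straightforward to picture. In Python terms (an illustration of the idea, not the C# API; the exact output format of GetAllOutcomes is an assumption), they amount to:

```python
def get_best_outcome(outcome_names, probabilities):
    """Pick the outcome with the highest probability, as GetBestOutcome
    does with the array returned by Evaluate."""
    best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
    return outcome_names[best_index]

def get_all_outcomes(outcome_names, probabilities):
    """List every outcome with its probability as one formatted string,
    in the spirit of GetAllOutcomes (format here is illustrative)."""
    return "  ".join(f"{name}[{p:.4f}]"
                     for name, p in zip(outcome_names, probabilities))
```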

Reading and Writing Models

In the simple example above, the model is built from the training data, each time it is run. This is acceptable with such a small training data set, but in a real world application, the training data file is likely to be much, much larger, and it would be unrealistic to require the overhead of rebuilding a model from training data at application load time in this way. Instead, the model would be built once, saved to persistent storage, and then read back from storage when needed. The SharpEntropy library provides three file formats in the SharpEntropy.IO namespace that may be used to store the model, and three sets of Reader and Writer classes to access the three formats:

  • Plain text format (PlainTextGisModelReader and PlainTextGisModelWriter). This format has the twin advantages that it is (more or less) human readable, so you can check the results of the data training, and it is compatible with the plain text model files generated by the Java MaxEnt library. It produces large file sizes, and would normally be used only during development, not in production.
  • "Java compatible" binary format (JavaBinaryGisModelReader and JavaBinaryGisModelWriter). The file sizes are smaller than plain text, and the binary files are interoperable with the Java MaxEnt library. That interoperability costs some performance, caused by the need to reverse the byte order of data items (the Java library preferring big-endian data formats) and by the fact that the MaxEnt format is less than optimal.
  • Binary format (BinaryGisModelReader and BinaryGisModelWriter). This is the format you would normally choose. The files produced are slightly smaller than the Java compatible format, and they are faster to read in.

The code to write a model to one of these file formats, and to read it back, is simple:

//write a model to disk
string modelDataFile = @"C:\model.nbin";
SharpEntropy.IO.BinaryGisModelWriter writer = 
          new SharpEntropy.IO.BinaryGisModelWriter();
writer.Persist(model, modelDataFile);
//read a model in from disk
SharpEntropy.IO.BinaryGisModelReader reader = 
       new SharpEntropy.IO.BinaryGisModelReader(modelDataFile);
SharpEntropy.IMaximumEntropyModel model = new SharpEntropy.GisModel(reader);

It is possible to implement new formats, and depending on the similarity of your chosen format to the existing ones, it may be useful to implement the IGisModelReader interface or to inherit from the GisModelReader and GisModelWriter classes. I have working implementations of Reader/Writer pairs that use SQL Server and Sqlite as backing stores. You can download the Sqlite implementation from the link at the top of the article. However, I am still looking for an optimally efficient read-only file format.

A More Complex Example - Tokenizing English Sentences

The Java MaxEnt library is used by another LGPL-licensed open source Java library, called OpenNLP, which provides a number of natural language processing tools based on maximum entropy models. A C# port of the OpenNLP library is available in my second article, Statistical parsing of English sentences, but to provide a more complex SharpEntropy example, as well as to give a flavor of the OpenNLP code, here is a small application to tokenize English sentences, based on a MaxEnt model created by the Java OpenNLP team.

English Tokenizer application user interface

Tokenizing occurs in the MaxentTokenizer.Tokenize method:

public string[] Tokenize(string input)
{
    ArrayList tokens = new ArrayList();

    string[] candidateTokens = input.Split(mWhitespaceChars);
    foreach (string candidateToken in candidateTokens)
    {
        if (candidateToken.Length < 2)
        {
            tokens.Add(candidateToken);
        }
        else if (mAlphaNumericRegex.IsMatch(candidateToken))
        {
            tokens.Add(candidateToken);
        }
        else
        {
            int startPos = 0;
            int endPos = candidateToken.Length;
            for (int currentPos = startPos + 1; 
                 currentPos < endPos; currentPos++)
            {
                double[] probabilities = 
                    mModel.Evaluate(GenerateContext(candidateToken, currentPos));
                string bestOutcome = mModel.GetBestOutcome(probabilities);
                if (bestOutcome == "T")
                {
                    tokens.Add(candidateToken.Substring(startPos, 
                                         currentPos - startPos));
                    startPos = currentPos;
                }
            }
            tokens.Add(candidateToken.Substring(startPos, 
                                     endPos - startPos));
        }
    }
    return (string[])tokens.ToArray(typeof(string));
}

It first splits the input at each whitespace character into candidate tokens. Each candidate token is then examined, and if it is less than two characters long, or consists only of alphanumeric characters, then it is accepted as a token. Otherwise, each character position in the candidate token is examined in turn to see if it should be split into more than one token at that position. This is done by building up a set of predicates related to the possible split position in the GenerateContext method. Various features, such as the characters before and after the split, whether the characters on each side are letters or numbers, and so on, are used to generate this set of predicates. This set of predicates is then evaluated against the MaxEnt model. The model has two possible outcomes, "T" for a split, and "F" for a non-split, so if the best outcome is "T", then the characters to the left of the split position are separated off into a new token.
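The control flow above can be sketched compactly in Python. This is an illustration of the tokenizer's logic only: the model is abstracted as a stand-in callable split_prob(token, pos) returning the probability of the "T" (split) outcome at a position, rather than a real maximum entropy model.

```python
import re

ALNUM = re.compile(r"^[A-Za-z0-9]+$")

def tokenize(text, split_prob):
    """Whitespace-split the input, then consult split_prob (a stand-in
    for the MaxEnt model) at each interior position of every candidate
    token that is not short or purely alphanumeric."""
    tokens = []
    for candidate in text.split():
        if len(candidate) < 2 or ALNUM.match(candidate):
            tokens.append(candidate)  # accepted whole
            continue
        start = 0
        for pos in range(1, len(candidate)):
            if split_prob(candidate, pos) > 0.5:  # best outcome is "T"
                tokens.append(candidate[start:pos])
                start = pos
        tokens.append(candidate[start:])  # remainder of the candidate
    return tokens
```

For instance, with a toy split_prob that splits wherever an alphanumeric character meets a non-alphanumeric one, "Hello, world!" tokenizes to ["Hello", ",", "world", "!"].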

The sample application simply allows you to type in an input sentence, tokenizes it, and lists the resulting tokens, each on a separate line, in a read-only textbox.

A Sqlite Model Reader/Writer

SharpEntropy, like its Java original, holds all of the maximum entropy model data in memory at once, and complex models (such as those used by OpenNLP) may easily consume hundreds of MB. I have been looking into the possibility of keeping most of the model in permanent storage and only accessing it when necessary. An early result of this is the SharpEntropySqlite library, which stores model data in a Sqlite database, and only accesses it on demand when the GIS model is queried. The ADO.NET Sqlite provider used by this library is the public domain one created by Robert Simpson and available on SourceForge.
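The on-demand idea can be sketched in a few lines of Python using the standard sqlite3 module. This is an illustration only (the schema, table name, and function names are made up; SharpEntropySqlite's actual classes differ): the model's weights are stored one row per (predicate, outcome) pair, and only the rows relevant to a given context are fetched at evaluation time.

```python
import sqlite3

def create_store(path, parameters):
    """Persist model parameters, one row per (predicate, outcome) weight.
    parameters: {(predicate, outcome): weight}."""
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE IF NOT EXISTS params "
                "(predicate TEXT, outcome TEXT, weight REAL, "
                "PRIMARY KEY (predicate, outcome))")
    con.executemany("INSERT OR REPLACE INTO params VALUES (?, ?, ?)",
                    [(p, o, w) for (p, o), w in parameters.items()])
    con.commit()
    return con

def weights_for(con, predicates):
    """Fetch only the rows needed to evaluate one context, instead of
    holding the whole model in memory."""
    marks = ",".join("?" * len(predicates))
    rows = con.execute(
        "SELECT predicate, outcome, weight FROM params "
        "WHERE predicate IN (%s)" % marks, list(predicates))
    return {(p, o): w for p, o, w in rows}
```

The trade-off is the usual one: memory use drops to roughly the size of one query's result set, at the cost of a database round-trip per evaluation.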

Notes On The Java Conversion

Lastly, I thought it may be interesting to provide some explanation of how the SharpEntropy library developed. I do not have a mathematical background, and my interest lies largely in the possibility of using maximum entropy for practical purposes, including statistical natural language processing.

Though I started off with a copy of the MaxEnt source code and Microsoft's Java Language Conversion Assistant (JLCA), the SharpEntropy code has subsequently been through many iterations to reach its current state. I have endeavored to make the Java origins of the code as unnoticeable as possible, rather than having the library resemble Java written in a foreign language. Some of the major phases of coding I went through (some revisited iteratively) were:

  • Conversion with the JLCA, and fixing of errors in the generated code. The result of this phase was compilable C# code with a heavy Java flavour. The MaxEnt library uses the open source Trove collections library to handle many of the MaxEnt model data structures, so this phase also involved dutifully converting a large chunk of Trove into C# as well.
  • Removal of Java-like constructs from the code and replacing them with C# versions, particularly where the C# version is known to be more efficient. Some Java-like remnants remain, but hopefully they are few in number.
  • Changing a large number of identifiers in the code, ranging all the way from variables with unclear names to classes and files, particularly those with endings that clash with .NET framework conventions (EventStream, for example).
  • Running FxCop and fixing as many of the issues raised as possible.
  • Tuning for performance. I did a lot of work trying to reduce SharpEntropy's memory footprint. My optimizations have now been back-ported into the Java MaxEnt library.



History

  • Third version (3rd May 2006). Added .NET 2.0 versions of the code, plus Sqlite Model Reader/Writer.
  • Second version (16th November 2005). Added links to OpenNLP article; minor edits to the text.
  • Initial version (24th July 2005).




About the Author

Richard Northedge
Web Developer
United Kingdom
Richard Northedge is a senior developer with a UK Microsoft Gold Partner company. He has a postgraduate degree in English Literature, has been programming professionally since 1998 and has been an MCSD since 2000.
