Skip to main content
Email Password   helpLost your password?

Overview

This article presents a maximum entropy modeling library called SharpEntropy, and discusses its usage, first by way of a simple example of predicting outcomes, and secondly, by presenting a way of splitting English sentences into constituent tokens (useful for natural language processing). Please note that because most of the code is a conversion based on original Java libraries published under the LGPL license, the source code available for download with this article is also released under the LGPL license. This means, it can freely be used in software that is released under any sort of license, but if you make changes to the library itself and those changes are not for your private use, you must release the source code to those changes.

A second article, Statistical parsing of English sentences, shows how SharpEntropy can be used to perform sophisticated natural language processing tasks. A small additional library, SharpEntropySqlite, downloadable with this article, provides a simple means of storing maximum entropy models created by SharpEntropy in a Sqlite database.

Introduction

SharpEntropy is a C# port of the MaxEnt toolkit available from SourceForge. MaxEnt is a mature library written in Java, originally created by Jason Baldridge, Tom Morton, and Gann Bierner. SharpEntropy is based upon the latest version (2.3.0) of the MaxEnt toolkit, released in August 2004.

Maximum entropy modeling is a general-purpose machine learning technique originally developed for statistical physics, but which has been employed in a wide variety of fields, including computer vision and natural language processing. It can be usefully applied whenever there is a complex process whose internals are unknown but which must be modeled by the computer. Like any statistical modeling technique, it relies on the existence of a data sample that shows, for given sets of input, what output the process generates. This sample is analysed, and from it, a model is generated, encapsulating all the rules about the process that could be inferred from the sample. This model is then used to predict the output of the process, when supplied with sets of input not found in the sample data.

What technique should be used to generate the model from the sample data? The maximum entropy method follows the principle of �maximizing entropy� � that is, it chooses the model that takes account of all the facts available in the sample data but otherwise preserves as much uncertainty as possible. The principle could be stated as: "Don�t assume anything about your probability distribution other than what you have observed".

The more mathematically inclined may wish to follow some of the references at the end of the article to find out more about the maximum entropy method.

A Simple Example

Suppose you wish to predict whether on a given day, a certain man will leave the house with his umbrella or not. The first thing you do is gather some sample data by observing over a number of days what the man does:

    Day 1        Warm        Dry        No_Umbrella
    Day 2        Cold        Dry        No_Umbrella
    Day 3        Cold        Rainy      Umbrella
    Day 4        Cold        Dry        Umbrella
    Day 5        Warm        Dry        No_Umbrella

After five days, you realise that there may be an additional factor at work: whether the man is late in the morning or not. So, you collect some more sample data:

    Day 6        Cold        Dry        Early        Umbrella
    Day 7        Cold        Rainy      Early        Umbrella
    Day 8        Cold        Dry        Late         No_Umbrella
    Day 9        Warm        Rainy      Late         No_Umbrella
    Day 10       Warm        Dry        Late         No_Umbrella

Each of these rows of data represents a training event. Each training event has an outcome (in this example, either "Umbrella" or "No_Umbrella"), and a context, which consists of one or more predicates ("Cold", "Dry" etc.). The context is the combination of predicates that lead to the outcome of the event. Note that the order of the predicates is not significant.

The training data for our simple umbrella example can be found in the file named data.txt in the same folder as the SimpleExample executable. This is a text file with one training event per line. In each line, the contextual predicates and the outcome are separated by single spaces (the outcome is always assumed to be the last string on the line). This training data format is chosen because the SharpEntropy library already has a class called SharpEntropy.IO.PlainTextByLineDataReader that can read text files delimited in this way. The following code, taken from the source of the SimpleExample application, shows how this training data file can be loaded and used to generate a maximum entropy model:

System.IO.StreamReader trainingStreamReader = 
          new System.IO.StreamReader(trainingDataFile);
SharpEntropy.ITrainingEventReader eventReader = 
   new SharpEntropy.BasicEventReader(new 
   SharpEntropy.PlainTextByLineDataReader(trainingStreamReader));
SharpEntropy.GisTrainer trainer = new SharpEntropy.GisTrainer();
trainer.TrainModel(eventReader);
mModel = new SharpEntropy.GisModel(trainer);

The GisTrainer class trains a maximum entropy model using the GIS (Generalized Iterative Scaling) method. There are a number of other methods for training MaxEnt models, many of which outperform GIS. Currently, however, GIS is the only method that the Java MaxEnt library supports, and therefore the only method whose algorithm was converted for use in SharpEntropy. The trainer class expects to be fed with a stream of events provided by a class that implements the ITrainingEventReader interface. The PlainTextByLineDataReader and BasicEventReader classes together provide the means to supply this stream of events from the delimited text file, whose path is stored in the string variable trainingDataFile.

The result is an instance of the GisModel class, the only class currently coded that implements the IMaximumEntropyModel interface. Once the model has been trained, it can be used to predict future outcomes. The SimpleExample application lets you choose a combination of predicates using dropdown lists:

SimpleExample application user interface

When the Take Umbrella? button is clicked, the chosen features are added to an array of strings and passed into the Evaluate method of the GisModel object:

private void btnOutcome_Click(object sender, System.EventArgs e)
{
    ArrayList context = new ArrayList();

    if (cboTemperature.Text != "Unknown")
    {
        context.Add(cboTemperature.Text);
    }
    if (cboPrecipitation.Text != "Unknown")
    {
        context.Add(cboPrecipitation.Text);
    }
    if (cboTiming.Text != "Unknown")
    {
        context.Add(cboTiming.Text);
    }

    double[] probabilities = mModel.Evaluate((string[])
                             context.ToArray(typeof(string)));

    lblOutcome.Text = probabilities[mUmbrellaOutcomeId].ToString("N5");
}

The Evaluate method returns an array of doubles containing the probabilities for all the possible outcomes. Together, these probabilities will always add up to 1. In this simple example, there are only two possible outcomes, Umbrella and No_Umbrella, so it is easy enough to just display the probability of taking the umbrella. The IMaximumEntropyModel interface contains a number of other useful methods for making sense of the results of a prediction, the details of which can be found in the NDoc-generated library help file. The reason why there is an overloaded version of the Evaluate method that takes in an array of doubles as a parameter, is that in more complicated scenarios, where the Evaluate method may be called many hundreds or thousands of times, performance is improved by continually recycling an array of doubles that is already pre-sized to match the number of possible outcomes. That array of doubles can then be passed, if required, to the GetBestOutcome method, to obtain the name of the most likely outcome, or to the GetAllOutcomes method, which will return a formatted string, listing all the possible outcomes, together with their probabilities.

Reading and Writing Models

In the simple example above, the model is built from the training data, each time it is run. This is acceptable with such a small training data set, but in a real world application, the training data file is likely to be much, much larger, and it would be unrealistic to require the overhead of rebuilding a model from training data at application load time in this way. Instead, the model would be built once, saved to persistent storage, and then read back from storage when needed. The SharpEntropy library provides three file formats in the SharpEntropy.IO namespace that may be used to store the model, and three sets of Reader and Writer classes to access the three formats:

The code to write a model to one of these file formats, and to read it back, is simple:

//write a model to disk

string modelDataFile = @"C:\model.nbin";
SharpEntropy.IO.BinaryGisModelWriter writer = 
          new SharpEntropy.IO.BinaryGisModelWriter();
writer.Persist(model, modelDataFile);
//read a model in from disk

SharpEntropy.IO.BinaryGisModelReader reader = 
       new SharpEntropy.IO.BinaryGisModelReader(modelDataFile);
SharpEntropy.IMaximumEntropyModel model = new SharpEntropy.GisModel(reader);

It is possible to implement new formats, and depending on the similarity of your chosen format to the existing ones, it may be useful to implement the IGisModelReader interface or to inherit from the GisModelReader and GisModelWriter classes. I have working implementations of Reader/Writer pairs that use SQL Server and Sqlite as backing stores. You can download the Sqlite implementation from the link at the top of the article. However, I am still looking for an optimally efficient read-only file format.

A More Complex Example - Tokenizing English Sentences

The Java MaxEnt library is used by another LGPL-licensed open source Java library, called OpenNLP, which provides a number of natural language processing tools based on maximum entropy models. A C# port of the OpenNLP library is available in my second article, Statistical parsing of English sentences, but to provide a more complex SharpEntropy example, as well as to give a flavor of the OpenNLP code, here is a small application to tokenize English sentences, based on a MaxEnt model created by the Java OpenNLP team.

English Tokenizer application user interface

Tokenizing occurs in the MaxentTokenizer.Tokenize method:

public string[] Tokenize(string input)
{
    ArrayList tokens = new ArrayList();

    string[] candidateTokens = input.Split(mWhitespaceChars);
    foreach (string candidateToken in candidateTokens)
    {
        if (candidateToken.Length < 2) 
        {
            tokens.Add(candidateToken);
        }
        else if (mAlphaNumericRegex.IsMatch(candidateToken))
        {
            tokens.Add(candidateToken);
        }
        else
        {
            int startPos = 0;
            int endPos = candidateToken.Length;
            for (int currentPos = startPos + 1; 
                 currentPos < endPos; currentPos++)
            {
                double[] probabilities = 
                  mModel.Evaluate(GenerateContext(candidateToken, 
                  currentPos));
                string bestOutcome = mModel.GetBestOutcome(probabilities);
                if (bestOutcome == "T")
                {
                    tokens.Add(candidateToken.Substring(startPos, 
                                         currentPos - startPos));
                    startPos = currentPos;
                }
            }
            tokens.Add(candidateToken.Substring(startPos, 
                                     endPos - startPos));
        }
    }
    return (string[])tokens.ToArray(typeof(string));
}

It first splits the input at each whitespace character into candidate tokens. Each candidate token is then examined, and if it is less than two characters long, or consists only of alphanumeric characters, then it is accepted as a token. Otherwise, each character position in the candidate token is examined in turn to see if it should be split into more than one token at that position. This is done by building up a set of predicates related to the possible split position in the GenerateContext method. Various features, such as the characters before and after the split, whether the characters on each side are letters or numbers, and so on, are used to generate this set of predicates. This set of predicates is then evaluated against the MaxEnt model. The model has two possible outcomes, "T" for a split, and "F" for a non-split, so if the best outcome is "T", then the characters to the left of the split position are separated off into a new token.

The sample application simply allows you to type in an input sentence, tokenizes it, and lists the resulting tokens, each on a separate line, in a read-only textbox.

A Sqlite Model Reader/Writer

SharpEntropy, like its Java original, holds all of the maximum entropy model data in memory at once, and complex models (such as those used by OpenNLP) may easily consume hundreds of MB. I have been looking into the possibility of keeping most of the model in permanent storage and only accessing it when necessary. An early result of this is the SharpEntropySqlite library, which stores model data in a Sqlite database, and only accesses it on demand when the GIS model is queried. The ADO.NET Sqlite provider used by this library is the public domain one created by Robert Simpson and available on SourceForge.

Notes On The Java Conversion

Lastly, I thought it may be interesting to provide some explanation of how the SharpEntropy library developed. I do not have a mathematical background, and my interest lies largely in the possibility of using maximum entropy for practical purposes, including statistical natural language processing.

Though I started off with a copy of the MaxEnt source code and Microsoft's Java Language Conversion Assistant (JLCA), the SharpEntropy code has subsequently been through many iterations to reach its current state. I have endeavored to make the Java origins of the code as unnoticeable as possible, rather than having the library resemble Java written in a foreign language. Some of the major phases of coding I went through (some revisited iteratively) were:

References

History

You must Sign In to use this message board.
 
 
Per page   
 FirstPrevNext
GeneralThank you!!! Pin
Phan Dung
8:28 13 Aug '09  
QuestionNERC Pin
gcurkov
20:55 6 Aug '09  
Questionkey not found in dictionary Pin
zink2040
20:46 18 Jan '09  
AnswerRe: key not found in dictionary Pin
Richard Northedge
10:26 20 Jan '09  
AnswerRe: key not found in dictionary Pin
zink2040
2:57 21 Jan '09  
GeneralRe: key not found in dictionary Pin
Richard Northedge
9:57 21 Jan '09  
GeneralRe: key not found in dictionary Pin
zink2040
18:28 21 Jan '09  
GeneralRe: key not found in dictionary Pin
Richard Northedge
9:58 22 Jan '09  
GeneralRe: key not found in dictionary Pin
zink2040
3:35 23 Jan '09  
QuestionParse Tree Pin
prethum
10:52 30 Oct '08  
AnswerRe: Parse Tree Pin
Richard Northedge
7:19 1 Nov '08  
GeneralObserved Expectation Pin
Hasan1479
5:31 21 Sep '08  
GeneralRe: Observed Expectation Pin
Richard Northedge
11:43 23 Sep '08  
GeneralWhere do I get the description of all the token type symbols Pin
Van P. Nguyen
7:21 21 Mar '08  
GeneralRe: Where do I get the description of all the token type symbols Pin
Richard Northedge
11:01 21 Mar '08  
GeneralRe: Where do I get the description of all the token type symbols Pin
Van P. Nguyen
6:55 22 Mar '08  
GeneralGrouping NNPS Pin
ambatisreedhar
23:31 5 Feb '08  
GeneralRe: Grouping NNPS Pin
Richard Northedge
9:28 12 Feb '08  
Questionplzzz i need help..help me please Pin
sabrysoft
15:36 23 Nov '07  
Generalfeature representation Pin
Jam Jong
2:11 24 Sep '07  
GeneralSharpEntropy Pin
john.hayne
8:53 4 Jun '07  
QuestionGramatical sentences Pin
michel_jhon
21:32 25 Sep '06  
AnswerRe: Gramatical sentences Pin
Richard Northedge
23:33 26 Sep '06  
QuestionRe: Gramatical sentences Pin
michel_jhon
21:10 27 Sep '06  
AnswerRe: Gramatical sentences Pin
Richard Northedge
23:15 1 Oct '06  


Last Updated 9 May 2006 | Advertise | Privacy | Terms of Use | Copyright © CodeProject, 1999-2009