# Maximum Entropy Modeling Using SharpEntropy

Posted 9 May 2006
Presents a Maximum Entropy modeling library, and discusses its usage, with the aid of two examples: a simple example of predicting outcomes, and an English language tokenizer.

## Overview

A second article, Statistical parsing of English sentences, shows how SharpEntropy can be used to perform sophisticated natural language processing tasks. A small additional library, SharpEntropySqlite, downloadable with this article, provides a simple means of storing maximum entropy models created by SharpEntropy in a Sqlite database.

## Introduction

SharpEntropy is a C# port of the MaxEnt toolkit available from SourceForge. MaxEnt is a mature library written in Java, originally created by Jason Baldridge, Tom Morton, and Gann Bierner. SharpEntropy is based upon the latest version (2.3.0) of the MaxEnt toolkit, released in August 2004.

Maximum entropy modeling is a general-purpose machine learning technique originally developed for statistical physics, but which has been employed in a wide variety of fields, including computer vision and natural language processing. It can be usefully applied whenever there is a complex process whose internals are unknown but which must be modeled by the computer. Like any statistical modeling technique, it relies on the existence of a data sample that shows, for given sets of input, what output the process generates. This sample is analysed, and from it, a model is generated, encapsulating all the rules about the process that could be inferred from the sample. This model is then used to predict the output of the process, when supplied with sets of input not found in the sample data.

What technique should be used to generate the model from the sample data? The maximum entropy method follows the principle of “maximizing entropy” – that is, it chooses the model that takes account of all the facts available in the sample data but otherwise preserves as much uncertainty as possible. The principle could be stated as: "Don’t assume anything about your probability distribution other than what you have observed".
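Stated a little more formally (in the standard formulation found in the references below): among all conditional distributions consistent with the feature expectations observed in the sample, the model chooses the one of maximum entropy, and that solution always takes a log-linear form whose weights GIS-style training estimates.

```latex
% Choose p to maximize the conditional entropy
H(p) = -\sum_{c,o} \tilde{p}(c)\, p(o \mid c) \log p(o \mid c)
% subject to the model matching each feature's observed expectation:
\sum_{c,o} \tilde{p}(c)\, p(o \mid c)\, f_j(c,o) \;=\; \sum_{c,o} \tilde{p}(c,o)\, f_j(c,o)
% The maximum entropy solution has the log-linear (Gibbs) form:
p(o \mid c) = \frac{1}{Z(c)} \exp\Big( \sum_j \lambda_j f_j(c,o) \Big),
\qquad Z(c) = \sum_{o'} \exp\Big( \sum_j \lambda_j f_j(c,o') \Big)
```

Here \(f_j(c,o)\) are indicator features built from the predicates (for example, "context contains Cold and outcome is Umbrella"), \(\tilde{p}\) is the empirical distribution of the sample, and \(Z(c)\) normalizes the probabilities so they sum to 1 for each context.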

The more mathematically inclined may wish to follow some of the references at the end of the article to find out more about the maximum entropy method.

## A Simple Example

Suppose you wish to predict whether on a given day, a certain man will leave the house with his umbrella or not. The first thing you do is gather some sample data by observing over a number of days what the man does:

```
Day 1        Warm        Dry        No_Umbrella
Day 2        Cold        Dry        No_Umbrella
Day 3        Cold        Rainy      Umbrella
Day 4        Cold        Dry        Umbrella
Day 5        Warm        Dry        No_Umbrella
```

After five days, you realise that there may be an additional factor at work: whether the man is late in the morning or not. So, you collect some more sample data:

```
Day 6        Cold        Dry        Early        Umbrella
Day 7        Cold        Rainy      Early        Umbrella
Day 8        Cold        Dry        Late         No_Umbrella
Day 9        Warm        Rainy      Late         No_Umbrella
Day 10       Warm        Dry        Late         No_Umbrella
```

Each of these rows of data represents a training event. Each training event has an outcome (in this example, either "Umbrella" or "No_Umbrella") and a context, which consists of one or more predicates ("Cold", "Dry", etc.). The context is the combination of predicates that leads to the outcome of the event. Note that the order of the predicates is not significant.

The training data for our simple umbrella example can be found in the file named data.txt in the same folder as the SimpleExample executable. This is a text file with one training event per line. In each line, the contextual predicates and the outcome are separated by single spaces (the outcome is always assumed to be the last string on the line). This training data format is chosen because the SharpEntropy library already has a class called `SharpEntropy.IO.PlainTextByLineDataReader` that can read text files delimited in this way. The following code, taken from the source of the SimpleExample application, shows how this training data file can be loaded and used to generate a maximum entropy model:

```
System.IO.StreamReader trainingStreamReader = new System.IO.StreamReader(trainingDataFile);
SharpEntropy.ITrainingEventReader eventReader = new SharpEntropy.BasicEventReader(
    new SharpEntropy.PlainTextByLineDataReader(trainingStreamReader));
SharpEntropy.GisTrainer trainer = new SharpEntropy.GisTrainer();
trainer.TrainModel(eventReader);
mModel = new SharpEntropy.GisModel(trainer);
```

The `GisTrainer` class trains a maximum entropy model using the GIS (Generalized Iterative Scaling) method. There are a number of other methods for training MaxEnt models, many of which outperform GIS. Currently, however, GIS is the only method that the Java MaxEnt library supports, and therefore the only method whose algorithm was converted for use in SharpEntropy. The trainer class expects to be fed with a stream of events provided by a class that implements the `ITrainingEventReader` interface. The `PlainTextByLineDataReader` and `BasicEventReader` classes together provide the means to supply this stream of events from the delimited text file, whose path is stored in the string variable `trainingDataFile`.
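The mapping that the reader pair performs on each line of the delimited file is simple: the last space-delimited string is the outcome, and everything before it is the context. A minimal stand-alone sketch of that mapping (an illustration of the file format only, not the library's own reader code):

```csharp
using System;

class TrainingLineDemo
{
    static void Main()
    {
        // One line of data.txt: predicates first, outcome last, space-delimited.
        string line = "Cold Rainy Umbrella"; // Day 3 from the sample data

        string[] parts = line.Split(' ');
        string outcome = parts[parts.Length - 1];          // last string = outcome
        string[] context = new string[parts.Length - 1];   // the rest = context
        Array.Copy(parts, context, parts.Length - 1);

        Console.WriteLine("outcome: " + outcome);              // outcome: Umbrella
        Console.WriteLine("context: " + String.Join(",", context)); // context: Cold,Rainy
    }
}
```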

The result is an instance of the `GisModel` class, the only class currently coded that implements the `IMaximumEntropyModel` interface. Once the model has been trained, it can be used to predict future outcomes. The SimpleExample application lets you choose a combination of predicates using dropdown lists:

When the Take Umbrella? button is clicked, the chosen features are added to an array of strings and passed into the `Evaluate` method of the `GisModel` object:

```
private void btnOutcome_Click(object sender, System.EventArgs e)
{
    ArrayList context = new ArrayList();

    if (cboTemperature.Text != "Unknown")
    {
        context.Add(cboTemperature.Text);
    }
    if (cboPrecipitation.Text != "Unknown")
    {
        context.Add(cboPrecipitation.Text);
    }
    if (cboTiming.Text != "Unknown")
    {
        context.Add(cboTiming.Text);
    }

    double[] probabilities = mModel.Evaluate(
        (string[])context.ToArray(typeof(string)));

    lblOutcome.Text = probabilities[mUmbrellaOutcomeId].ToString("N5");
}
```

The `Evaluate` method returns an array of `double`s containing the probabilities for all the possible outcomes; together, these probabilities always add up to 1. In this simple example, there are only two possible outcomes, `Umbrella` and `No_Umbrella`, so it is easy enough to display just the probability of taking the umbrella. The `IMaximumEntropyModel` interface contains a number of other useful methods for making sense of the results of a prediction; the details can be found in the NDoc-generated library help file. There is also an overloaded version of the `Evaluate` method that takes an array of `double`s as a parameter: in more complicated scenarios, where `Evaluate` may be called many hundreds or thousands of times, performance is improved by recycling an array of `double`s that is already pre-sized to match the number of possible outcomes. The resulting array can then be passed, if required, to the `GetBestOutcome` method, to obtain the name of the most likely outcome, or to the `GetAllOutcomes` method, which returns a formatted string listing all the possible outcomes together with their probabilities.
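Conceptually, picking the best outcome is just an argmax over the probability array that `Evaluate` returned. A minimal stand-alone sketch of that selection (the outcome labels here are a hypothetical stand-in for the model's own outcome list):

```csharp
using System;

class BestOutcomeDemo
{
    // Hypothetical stand-in for the model's outcome labels.
    static readonly string[] Outcomes = { "No_Umbrella", "Umbrella" };

    // Return the label whose probability is highest (argmax).
    static string BestOutcome(double[] probabilities)
    {
        int best = 0;
        for (int i = 1; i < probabilities.Length; i++)
        {
            if (probabilities[i] > probabilities[best])
            {
                best = i;
            }
        }
        return Outcomes[best];
    }

    static void Main()
    {
        // The probabilities always sum to 1 across all outcomes.
        double[] probabilities = { 0.23, 0.77 };
        Console.WriteLine(BestOutcome(probabilities)); // Umbrella
    }
}
```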

In the simple example above, the model is rebuilt from the training data each time the application runs. This is acceptable with such a small training data set, but in a real-world application the training data file is likely to be much larger, and the overhead of rebuilding the model at application load time would be unrealistic. Instead, the model would be built once, saved to persistent storage, and then read back from storage when needed. The SharpEntropy library provides three file formats in the `SharpEntropy.IO` namespace that may be used to store the model, and three sets of Reader and Writer classes to access them:

• Plain text format (`PlainTextGisModelReader` and `PlainTextGisModelWriter`). This format has the twin advantages that it is (more or less) human readable, so you can check the results of the data training, and it is compatible with the plain text model files generated by the Java MaxEnt library. It produces large file sizes, and would normally be used only during development, not in production.
• "Java compatible" binary format (`JavaBinaryGisModelReader` and `JavaBinaryGisModelWriter`). The file sizes are smaller than plain text, and the binary files are interoperable with the Java MaxEnt library. That interoperability comes with some performance sacrifices, caused by the need to reverse the byte order of data items (the Java library prefers big-endian formats) and by the fact that the MaxEnt format itself is less than optimal.
• Binary format (`BinaryGisModelReader` and `BinaryGisModelWriter`). This is the format you would normally choose. The files produced are slightly smaller than the Java compatible format, and they are faster to read in.
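The byte-order cost behind the "Java compatible" format is easy to see in isolation. A small sketch of reversing a big-endian 32-bit value (as written by Java's `DataOutputStream`) into the little-endian order that .NET expects on x86 (illustrative only; this is not the library's actual reader code):

```csharp
using System;

class ByteOrderDemo
{
    // Reverse the byte order of a 32-bit value, e.g. when reading a
    // big-endian value written by Java on a little-endian .NET platform.
    static uint ReverseBytes(uint value)
    {
        return (value & 0xFF) << 24
            | (value & 0xFF00) << 8
            | (value >> 8) & 0xFF00
            | (value >> 24) & 0xFF;
    }

    static void Main()
    {
        uint bigEndian = 0x12345678;
        Console.WriteLine(ReverseBytes(bigEndian).ToString("X8")); // 78563412
    }
}
```

Every multi-byte integer in a Java-written model file needs this treatment, which is why the native binary format is faster to read.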

The code to write a model to one of these file formats, and to read it back, is simple:

```
//write a model to disk
string modelDataFile = @"C:\model.nbin";
SharpEntropy.IO.BinaryGisModelWriter writer =
    new SharpEntropy.IO.BinaryGisModelWriter();
writer.Persist(model, modelDataFile);

//read a model in from disk
SharpEntropy.GisModel storedModel = new SharpEntropy.GisModel(
    new SharpEntropy.IO.BinaryGisModelReader(modelDataFile));
```

It is possible to implement new formats, and depending on the similarity of your chosen format to the existing ones, it may be useful to implement the `IGisModelReader` interface or to inherit from the `GisModelReader` and `GisModelWriter` classes. I have working implementations of Reader/Writer pairs that use SQL Server and Sqlite as backing stores. You can download the Sqlite implementation from the link at the top of the article. However, I am still looking for an optimally efficient read-only file format.

## A More Complex Example - Tokenizing English Sentences

The Java MaxEnt library is used by another LGPL-licensed open source Java library, called OpenNLP, which provides a number of natural language processing tools based on maximum entropy models. A C# port of the OpenNLP library is available in my second article, Statistical parsing of English sentences, but to provide a more complex SharpEntropy example, as well as to give a flavor of the OpenNLP code, here is a small application to tokenize English sentences, based on a MaxEnt model created by the Java OpenNLP team.

Tokenizing occurs in the `MaxentTokenizer.Tokenize` method:

```
public string[] Tokenize(string input)
{
    ArrayList tokens = new ArrayList();

    string[] candidateTokens = input.Split(mWhitespaceChars);
    foreach (string candidateToken in candidateTokens)
    {
        if (candidateToken.Length < 2)
        {
            tokens.Add(candidateToken);
        }
        else if (mAlphaNumericRegex.IsMatch(candidateToken))
        {
            tokens.Add(candidateToken);
        }
        else
        {
            int startPos = 0;
            int endPos = candidateToken.Length;
            for (int currentPos = startPos + 1;
                currentPos < endPos; currentPos++)
            {
                double[] probabilities =
                    mModel.Evaluate(GenerateContext(candidateToken,
                    currentPos));
                string bestOutcome = mModel.GetBestOutcome(probabilities);
                if (bestOutcome == "T")
                {
                    tokens.Add(candidateToken.Substring(startPos,
                        currentPos - startPos));
                    startPos = currentPos;
                }
            }
            tokens.Add(candidateToken.Substring(startPos,
                endPos - startPos));
        }
    }
    return (string[])tokens.ToArray(typeof(string));
}
```

It first splits the input at each whitespace character into candidate tokens. Each candidate token is then examined, and if it is less than two characters long, or consists only of alphanumeric characters, then it is accepted as a token. Otherwise, each character position in the candidate token is examined in turn to see if it should be split into more than one token at that position. This is done by building up a set of predicates related to the possible split position in the `GenerateContext` method. Various features, such as the characters before and after the split, whether the characters on each side are letters or numbers, and so on, are used to generate this set of predicates. This set of predicates is then evaluated against the MaxEnt model. The model has two possible outcomes, "T" for a split, and "F" for a non-split, so if the best outcome is "T", then the characters to the left of the split position are separated off into a new token.
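A hypothetical sketch of the kind of predicate set `GenerateContext` might produce for a candidate split position. The article does not reproduce the exact predicate names used by the OpenNLP model, so the strings below are illustrative only:

```csharp
using System;
using System.Collections;

class SplitContextDemo
{
    // Illustrative predicate generation for a possible split inside a token.
    // The predicate names are hypothetical, not the OpenNLP model's own.
    static string[] GenerateContext(string token, int pos)
    {
        ArrayList predicates = new ArrayList();
        char prev = token[pos - 1];  // character before the split
        char next = token[pos];      // character after the split

        predicates.Add("prev=" + prev);
        predicates.Add("next=" + next);
        if (char.IsLetter(prev)) predicates.Add("prevLetter");
        if (char.IsLetter(next)) predicates.Add("nextLetter");
        if (char.IsDigit(prev)) predicates.Add("prevDigit");
        if (char.IsDigit(next)) predicates.Add("nextDigit");

        return (string[])predicates.ToArray(typeof(string));
    }

    static void Main()
    {
        // Candidate split in "can't" between 'n' and the apostrophe.
        foreach (string predicate in GenerateContext("can't", 3))
        {
            Console.WriteLine(predicate);
        }
        // prev=n, next=', prevLetter
    }
}
```

A predicate set like this is what gets passed to `Evaluate`, which returns the probabilities of the "T" (split) and "F" (no split) outcomes.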

The sample application simply allows you to type in an input sentence, tokenizes it, and lists the resulting tokens, each on a separate line, in a read-only textbox.

SharpEntropy, like its Java original, holds all of the maximum entropy model data in memory at once, and complex models (such as those used by OpenNLP) may easily consume hundreds of MB. I have been looking into the possibility of keeping most of the model in permanent storage and only accessing it when necessary. An early result of this is the SharpEntropySqlite library, which stores model data in a Sqlite database, and only accesses it on demand when the GIS model is queried. The ADO.NET Sqlite provider used by this library is the public domain one created by Robert Simpson and available on SourceForge.

## Notes On The Java Conversion

Lastly, I thought it might be interesting to provide some explanation of how the SharpEntropy library developed. I do not have a mathematical background, and my interest lies largely in the possibility of using maximum entropy for practical purposes, including statistical natural language processing.

Though I started off with a copy of the MaxEnt source code and Microsoft's Java Language Conversion Assistant (JLCA), the SharpEntropy code has subsequently been through many iterations to reach its current state. I have endeavored to make the Java origins of the code as unnoticeable as possible, rather than having the library resemble Java written in a foreign language. Some of the major phases of coding I went through (some revisited iteratively) were:

• Conversion with the JLCA, and fixing of errors in the generated code. The result of this phase was compilable C# code with a heavy Java flavour. The MaxEnt library uses the open source Trove collections library to handle many of the MaxEnt model data structures, so this phase also involved dutifully converting a large chunk of Trove into C# as well.
• Removal of Java-like constructs from the code and replacing them with C# versions, particularly where the C# version is known to be more efficient. Some Java-like remnants remain, but hopefully they are few in number.
• Changing a large number of identifiers in the code, ranging all the way from variables with unclear names to classes and files, particularly those with endings that clash with .NET framework conventions (`EventStream`, for example).
• Running FxCop and fixing as many of the issues raised as possible.
• Tuning for performance. I did a lot of work trying to reduce SharpEntropy's memory footprint. My optimizations have now been back-ported into the Java MaxEnt library.

## History

• Third version (3rd May 2006). Added .NET 2.0 versions of the code, plus Sqlite Model Reader/Writer.
• Second version (16th November 2005). Added links to OpenNLP article; minor edits to the text.
• Initial version (24th July 2005).


## About the Author

Richard Northedge is a senior developer with a UK Microsoft Gold Partner company. He has a postgraduate degree in English Literature, has been programming professionally since 1998, and has been an MCSD since 2000.
