
Recipe Parsing with SharpEntropy

7 Oct 2013

Introduction 

I've been fascinated with natural language processing ever since I read about SharpNLP many years ago here on codeproject.com. Since then I've explored some of the other packages out there and started toying around with some code of my own based on what made sense to me. The code samples here are based on SharpNLP's entity extraction algorithm.

The code examples are available at https://github.com/foostub/Entity.Extraction 

Background    

Let's say you've been asked to build a system that will extract recipe information (amounts, units of measure, and ingredients) from a free-form block of text. A quick Google search returns several examples similar to:

  • 1 cup packed brown sugar
  • 1/3 cup butter, melted
  • 2 tablespoons light corn syrup 

I've always been a fan of trying the simplest solution first. The above examples follow a straightforward pattern - <amount (whole or fractional number)> <unit of measure> <ingredient> - that could be parsed using a handful of regular expressions and a dictionary of ingredients and units of measure.   
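
To make that concrete, here's a minimal sketch (not from the article's repository) of how the simple pattern could be matched with a single regular expression; the tiny unit list and the class name are mine, for illustration only:

using System;
using System.Text.RegularExpressions;

class SimpleRecipeParser
{
    // Illustrative only: a tiny hand-built list of units, not the article's code.
    private const string Units = "cups?|tablespoons?|teaspoons?";

    // <amount (whole or fractional number)> <unit of measure> <ingredient>
    private static readonly Regex Pattern = new Regex(
        @"^(?<amount>\d+(\s*/\s*\d+)?)\s+(?<uom>(" + Units + @"))\s+(?<ingredient>.+)$",
        RegexOptions.IgnoreCase);

    static void Main()
    {
        var match = Pattern.Match("1 cup packed brown sugar");
        if (match.Success)
        {
            Console.WriteLine("amount={0}, uom={1}, ingredient={2}",
                match.Groups["amount"].Value,
                match.Groups["uom"].Value,
                match.Groups["ingredient"].Value);
        }
    }
}

That works for the well-behaved cases. But look at these examples: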

  • 4 skinless, boneless chicken breast halves     
  • 1 (8 ounce) can Pillsbury® refrigerated crescent dinner rolls   

The same basic pattern is still there - <amount> <unit of measure> <ingredient> - but there is enough variation in the language to make solving the problem with hard-coded rules cumbersome. Chances are you will need to combine several moderately complex techniques (regex, string searching, sequence alignment, etc.) and rules to match all the possible variations. Again, it's possible but - at least in my opinion - error prone and difficult to maintain.

Parsing with SharpEntropy

A different approach is to use a statistical model, where the responsibility of the programmer shifts from coding rules to finding features that can be easily observed and encoded. I used SharpEntropy from the SharpNLP project due to its simplicity and ease of use.

Encoding the Model  

If you aren't familiar with SharpEntropy, you can imagine it as a way of encoding knowledge by mapping observed features to outcomes. Once you've trained your model, it can be queried to return the most likely outcome and its probability for a set of test features.
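
As a rough sketch of what that query looks like: 'model' here is a trained GisModel and 'outcomeNames' lists its outcomes in index order, mirroring the fields used in the Tagger code further below (how they're built is shown later):

// A rough sketch of querying a trained model with a set of test features.
string[] testFeatures = { "lower", "contains-letter" };

// Evaluate returns one probability per outcome, indexed the same way as outcomeNames.
double[] outcomeProbabilities = model.Evaluate(testFeatures);

// pick the outcome with the highest probability
var best = 0;
for (var i = 1; i < outcomeProbabilities.Length; i++)
{
    if (outcomeProbabilities[i] > outcomeProbabilities[best])
    {
        best = i;
    }
}

Console.WriteLine("most likely outcome: {0} ({1:P1})", outcomeNames[best], outcomeProbabilities[best]);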

For example, 1 cup packed brown sugar is translated into the following set of features (a short sketch of generating these features follows the list):

  1. 1:  { is-numeric }  
  2. cup: { lower, contains-letter } 
  3. packed: { lower, contains-letter, prefix-1: p, prefix-2: pa, prefix-3: pac, suffix-1: d, suffix-2: ed, suffix-3: ked }
  4. brown: { lower, contains-letter, prefix-1: b, prefix-2: br, prefix-3: bro, suffix-1: n, suffix-2: wn, suffix-3: own }  
  5. sugar:  { lower, contains-letter, prefix-1: s, prefix-2: su, prefix-3: sug, suffix-1: r, suffix-2: ar, suffix-3: gar }         
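
Generating those features per token is straightforward string work. Below is a minimal sketch of the idea; the repository's Annotator is the real implementation, and GetFeatures is just an illustrative name:

// Minimal sketch of the per-token feature generation described above.
// The repository's Annotator is the real implementation; this is only illustrative.
// (assumes using System.Collections.Generic and System.Linq)
static IEnumerable<string> GetFeatures(string token)
{
    if (token.All(char.IsDigit))
    {
        yield return "is-numeric";
        yield break;
    }

    if (token == token.ToLowerInvariant())
    {
        yield return "lower";
    }

    if (token.Any(char.IsLetter))
    {
        yield return "contains-letter";
    }

    // prefix/suffix features for longer tokens, mirroring the lists above
    if (token.Length > 3)
    {
        for (var n = 1; n <= 3; n++)
        {
            yield return "prefix-" + n + ": " + token.Substring(0, n);
            yield return "suffix-" + n + ": " + token.Substring(token.Length - n);
        }
    }
}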

The next step in the process is to map a set of features to a possible outcome. I wanted a simple, human-readable method of training my model. A simple training example is shown below, where _<label> marks the outcome; a short sketch for splitting such a line into token/label pairs follows it:

  • 2_amount teaspoons_uom crushed garlic_ingredient
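
Turning a line like that into token/label pairs is just string handling. Here's a small sketch, assuming unlabeled tokens default to other (as they do in the mapping that follows); ParseSample is a made-up name, not something from the repository:

// Sketch: split a training line such as "2_amount teaspoons_uom crushed garlic_ingredient"
// into token/label pairs. Tokens without a _<label> suffix default to "other".
// (assumes using System and System.Collections.Generic)
static IEnumerable<KeyValuePair<string, string>> ParseSample(string line)
{
    foreach (var word in line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
    {
        var separator = word.LastIndexOf('_');
        if (separator > 0)
        {
            yield return new KeyValuePair<string, string>(
                word.Substring(0, separator), word.Substring(separator + 1));
        }
        else
        {
            yield return new KeyValuePair<string, string>(word, "other");
        }
    }
}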

Now our 1 cup packed brown sugar becomes:

  1. 1:  { is-numeric }  -> amount 
  2. cup: { lower, contains-letter } -> uom 
  3. packed: { lower, contains-letter, prefix-1: p, prefix-2: pa, prefix-3: pac, suffix-1: d, suffix-2: ed, suffix-3: ked } -> other
  4. brown: { lower, contains-letter, prefix-1: b, prefix-2: br, prefix-3: bro, suffix-1: n, suffix-2: wn, suffix-3: own }  -> ingredient 
  5. sugar:  { lower, contains-letter, prefix-1: s, prefix-2: su, prefix-3: sug, suffix-1: r, suffix-2: ar, suffix-3: gar } -> ingredient         

Encoding the Sequence   

Now that you have a way of mapping individual items and their features to an outcome/label, it's necessary to encode the sequence into the model. I ended up encoding both context and boundary information by overlapping token features and mapping them to outcomes that mark whether a token starts or continues a labeled span. I wish I could claim this as my own idea, but it's how the entity extraction in SharpNLP works.

The example from above now becomes:    

  1. 1:  { is-numeric, start }  -> start.amount  
  2. cup: { lower, contains-letter, start } -> start.uom  
  3. packed: { lower, contains-letter, prefix-1: p, prefix-2: pa, prefix-3: pac, suffix-1: d, suffix-2: ed, suffix-3: ked, other } -> other
  4. brown: { lower, contains-letter, prefix-1: b, prefix-2: br, prefix-3: bro, suffix-1: n, suffix-2: wn, suffix-3: own, ingredient }  -> start.ingredient  
  5. sugar:  { lower, contains-letter, prefix-1: s, prefix-2: su, prefix-3: sug, suffix-1: r, suffix-2: ar, suffix-3: gar, ingredient } -> continue.ingredient        

The next step is to create overlapping sets of features that are mapped to an outcome (a simplified sketch of this context generation follows the example). So our example turns into:

  1. { 1:  is-numeric, amount } + { cup: lower, contains-letter, uom } -> start.amount 
  2. { cup: lower, contains-letter, uom } + { packed: lower, contains-letter, prefix-1: p, prefix-2: pa, prefix-3: pac, suffix-1: d, suffix-2: ed, suffix-3: ked, other } -> start.uom
  3. and so on... 
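
The repository's Context.Generate produces these overlapping feature sets at decode time (it's the call you'll see in the tagger code below). The following simplified sketch only illustrates the idea; the Token type, its Features property, the single-token window, and the feature names are all assumptions made for illustration:

// Simplified sketch of outcome-aware, overlapping feature generation.
// The repository's Context.Generate is the real implementation.
// (assumes using System.Collections.Generic)
static string[] Generate(Token[] tokens, int position, IList<string> outcomesSoFar)
{
    var features = new List<string>();

    // features of the current token
    features.AddRange(tokens[position].Features);

    if (position > 0)
    {
        // overlap with the previous token's features, prefixed so they stay distinguishable
        foreach (var feature in tokens[position - 1].Features)
        {
            features.Add("prev: " + feature);
        }

        // the outcome already assigned to the previous token becomes a feature as well
        features.Add("prev-outcome: " + outcomesSoFar[position - 1]);
    }

    return features.ToArray();
}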

Decoding the Model and Sequence 

The process of decoding is very similar to training the model. You need to generate the same features and n-grams for the input text and evaluate them against your model - essentially a beam search that keeps the ten most likely label sequences as it works through the tokens. Here's the code from Tagger.Tag that does the heavy lifting:

 // generate a sequence of features
var tokens = this.annotator.Annotate(this.tokenizer.Tokenize(text));

// holds the most likely sequences
var possibleSequences = new List<Sequence>();

// start with an empty sequence
possibleSequences.Add(new Sequence(tokens.Length));

// iterate through each token while trying to find the best combination of labels
for (var i = 0; i < tokens.Length; i++)
{
    // working set for the n best sequences
    var nextSequences = new List<Sequence>(outcomeCount * 10);

    // iterate through each possible outcome and evaluate
    // based on the most likely sequences so far
    for (var j = 0; j < outcomeNames.Length; j++)
    {
        var possibleOutcome = outcomeNames[j];

        for (var k = 0; k < possibleSequences.Count; k++)
        {
            var sequence = possibleSequences[k];

            // take a previous sequence, add this outcome, and then evaluate
            var nextSequence = 
              new Sequence(sequence.Outcomes, sequence.Scores, sequence.Score, tokens.Length);
            var possibleOutcomes = nextSequence.Outcomes;
            possibleOutcomes.Add(possibleOutcome);
            
            // ngram based feature generation
            var outcomes = model.Evaluate(context.Generate(tokens, i, possibleOutcomes));

            // remove the previously added outcome because it will be added below with the score
            possibleOutcomes.RemoveAt(possibleOutcomes.Count - 1);
            var outcomeIndex = outcomeIndexes[possibleOutcome]; 

            // add the possible outcome and keep track of the score
            nextSequence.AddOutcome(possibleOutcome, outcomes[outcomeIndex]);
            nextSequences.Add(nextSequence);
        }
    }

    // advance the n most likely sequences to be processed using the next token
    possibleSequences = nextSequences.OrderByDescending(S => S.Score).Take(10).ToList();
}

// find the highest scoring sequence and assign the labels
var bestSequence = possibleSequences.OrderByDescending(S => S.Score).First();
var bestOutcomes = bestSequence.Outcomes;
probabilities = bestSequence.Scores.ToArray();
for (var i = 0; i < tokens.Length; i++)
{
    // outcomes from the sharpentropy model are stored as <start|continue>.<label>
    tokens[i].Type = bestOutcomes[i].Split('.')[1];
}
return tokens;    

Here's the process from end to end; a short snippet for printing the tagged results follows the code:

var tokenizer = new Tokenizer();
var annotator = Annotator.Default();
var context = new Context(4);
var trainer = new Trainer(tokenizer, annotator, context);

// Add some training text. Words that will be labeled are added using _<label name>.
trainer.AddSample("1_amount (8_amount ounce) can_uom Pillsbury® refrigerated crescent dinner_ingredient rolls_ingredient\r\n");
trainer.AddSample("1_amount /_amount 4_amount cup_uom pizza_ingredient sauce_ingredient\r\n");
trainer.AddSample("3_amount /_amount 4_amount cup_uom shredded mozzarella_ingredient cheese_ingredient\r\n");
trainer.AddSample("1_amount /_amount 2_amount cup_uom sliced pepperoni_ingredient\r\n");
trainer.AddSample("2_amount /_amount 3_amount cup_uom sliced pepperoni_ingredient\r\n");
trainer.AddSample("7_amount /_amount 8_amount cup_uom sliced pepperoni_ingredient\r\n");
trainer.AddSample("11_amount teaspoon_uom grated Parmesan_ingredient cheese_ingredient\r\n");
trainer.AddSample("2_amount teaspoons_uom crushed garlic_ingredient\r\n");
trainer.AddSample("1_amount /_amount 4_amount cup_uom olive_ingredient oil_ingredient\r\n");
trainer.AddSample("4_amount skinless_preparation , boneless_preparation chicken_ingredient breast_ingredient halves\r\n");
trainer.AddSample("2_amount eggs_ingredient\r\n");
trainer.AddSample("3_amount eggs_ingredient\r\n");
trainer.AddSample("1_amount /_amount 3_amount cup_uom butter_ingredient , melted\r\n");
trainer.AddSample("1_amount cup_uom fresh orange_ingredient juice_ingredient\r\n");
trainer.AddSample("1_amount /_amount 2_amount cup_uom 2_ingredient %_ingredient milk_ingredient\r\n");
trainer.AddSample("This article has assumed that regular_programming expressions_programming are matched against an entire input string.");
trainer.AddSample("Modern regular expression implementations must deal with large non-ASCII_programming character sets such as Unicode. ");

// train the model
var gisTrainer = new GisTrainer();
gisTrainer.TrainModel(500, new TwoPassDataIndexer(trainer, 0));
var model = new GisModel(gisTrainer);

// initialize the tagger
var tagger = new Tagger(tokenizer, context, annotator, model);

foreach (var file in Directory.GetFiles("examples", "*.txt"))
{
    var text = File.ReadAllText(file);

    var probabilities = new double[0];
    var tokens = tagger.Tag(text, out probabilities);
} 
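
Inside that loop, something along these lines will dump each token, its label, and the probability the model assigned to that step of the winning sequence. Text is an assumed property name for the token's surface string, so check the repository's token type for the actual member:

// Dump each token, its assigned label, and the corresponding score.
// 'Text' is an assumed property name; the repository's token type may differ.
for (var i = 0; i < tokens.Length; i++)
{
    Console.WriteLine("{0} -> {1} ({2:P1})", tokens[i].Text, tokens[i].Type, probabilities[i]);
}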

History

  • v 0.1  
    • Initial release October 7, 2013.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
