Overview

In a previous article, I presented a maximum entropy modeling library called SharpEntropy, a C# port of a mature Java library called the MaxEnt toolkit. The Java MaxEnt library is used by another open source Java library, called OpenNLP, which provides a number of natural language processing tools based on maximum entropy models. This article shows you how to use my C# port of the OpenNLP library to generate parse trees for English language sentences, and explores some of the other features of the OpenNLP code.

Please note that because the original Java OpenNLP library is published under the LGPL license, the source code to the C# OpenNLP library available to download with this article is also released under the LGPL license. This means it can freely be used in software that is released under any sort of license, but if you make changes to the library itself and those changes are not for your private use, you must release the source code to those changes.

Introduction

OpenNLP is both the name of a group of open source projects related to natural language processing (NLP), and the name of a library of NLP tools written in Java by Jason Baldridge, Tom Morton, and Gann Bierner. My C# port is based upon the latest version (1.2.0) of the Java OpenNLP tools, released in April 2005. Development of the Java library is ongoing, and I hope to update the C# port as new developments occur.

Tools included in the C# port are: a sentence splitter, a tokenizer, a part-of-speech tagger, a chunker (used to "find non-recursive syntactic annotations such as noun phrase chunks"), a parser, and a name finder. The Java library also includes a tool for co-reference resolution, but the code for this feature is in flux and has not yet been ported to C#. All of these tools are driven by maximum entropy models processed by the SharpEntropy library.
Since this article was first written, the coreference tool has been ported to C# and is available, along with the latest version of the other tools, from the SharpNLP Project on CodePlex.

Setting up the OpenNLP library

Since this article was first written, the required binary data files have been made available for download from the SharpNLP Project on CodePlex. Instead of downloading the Java-compatible files from SourceForge and then converting them via the ModelConverter tool, you can download them directly in the required .nbin format.

The maximum entropy models that drive the OpenNLP library consist of a set of binary data files, totaling 123 MB. Because of their large size, it isn't possible to offer them for download from CodeProject. Unfortunately, this means that setting up the OpenNLP library on your machine requires more steps than simply downloading the Zip file, unpacking, and running the executables.

First, download the demo project Zip file and unzip its contents into a folder on your hard disk. Then, in your chosen folder, create a subfolder named "Models". Create two subfolders inside "Models", one called "Parser" and one called "NameFind".

Second, download the OpenNLP model files from the CVS repository belonging to the Java OpenNLP library project area on SourceForge. This can be done via a CVS client, or by using the web interface. Place the .bin files for the chunker (EnglishChunk.bin), the POS tagger (EnglishPOS.bin), the sentence splitter (EnglishSD.bin), and the tokenizer (EnglishTok.bin) in the Models folder you created in the first step. This screenshot shows the file arrangement required:
Place the .bin files for the name finder into the NameFind subfolder, like this:
Then, place the files required for the parser into the Parser subfolder. This includes the files called "tagdict" and "head_rules", as well as the four .bin files:
These models were created by the Java OpenNLP team in the original MaxEnt format. They must be converted into .NET format for them to work with the C# OpenNLP library. The article on SharpEntropy explains the different model formats understood by the SharpEntropy library and the reasons for using them. The command line program ModelConverter.exe is provided as part of the demo project download for the purpose of converting the model files. Run it from the command prompt, specifying the location of the "Models" folder, and it will take each of the .bin files and create a new .nbin file from it. This process will typically take some time - several minutes or more, depending on your hardware configuration.
(This screenshot, like the folder screenshots above it, is taken from the Windows 98 virtual machine I used for testing. Of course, the code works on newer operating systems as well - my main development machine is Windows XP.) Once the model converter has completed successfully, the demo executables should run correctly.

What does the demonstration project contain?

As well as the ModelConverter, the demonstration project provides two Windows Forms executables: ToolsExample.exe and ParseTree.exe. Both of these use OpenNLP.dll, which in turn relies on SharpEntropy.dll, the SharpEntropy library which I explored in my previous article. The Parse Tree demo also uses (a modified version of) the NetronProject's treeview control, called "Lithium", available from CodeProject.

The Tools Example provides a simple interface to showcase the various natural language processing tools provided by the OpenNLP library. The Parse Tree demo uses the modified Lithium control to provide a more graphical demonstration of the English sentence parsing achievable with OpenNLP.

Running the code in source

The source code is provided for the two Windows Forms executables, the ModelConverter program, and the OpenNLP library (which is LGPL licensed). Source code is also included for the modified Lithium control, though the changes to the original CodeProject version are minimal. Source code for the SharpEntropy library can be obtained from my SharpEntropy article.

The source code is written so that the EXEs look for the "Models" folder inside the folder they are running from. This means that if you are running the projects from the development environment, you will either need to place the "Models" subfolder inside the appropriate "bin" directory created when you compile the code, or change the source code to look for a different location. This is the relevant code:

    mModelPath = System.IO.Path.GetDirectoryName(
        System.Reflection.Assembly.GetExecutingAssembly().GetName().CodeBase);
    mModelPath = new System.Uri(mModelPath).LocalPath + @"\Models\";
This could be replaced with your own scheme for calculating the location of the Models folder.

A note on performance

The OpenNLP code is set up to use a [...]

Detecting the end of sentences

If we have a paragraph of text in a string variable, input, [...]

    Mr. Jones went shopping. His grocery bill came to $23.45.

[...] All of this functionality is packaged into the classes in the OpenNLP.Tools.SentenceDetect namespace:

    using OpenNLP.Tools.SentenceDetect;

    EnglishMaximumEntropySentenceDetector sentenceDetector =
        new EnglishMaximumEntropySentenceDetector(mModelPath + "EnglishSD.nbin");
    string[] sentences = sentenceDetector.SentenceDetect(input);

The Tools Example executable illustrates the sentence splitting capabilities of the OpenNLP library. Enter a paragraph of text into the top textbox, and click the "Split" button. The split sentences will appear in the lower textbox, each on a separate line.
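To see why sentence detection needs more than simple string handling, here is a minimal sketch that splits the sample paragraph on every period. It uses only the .NET base class library; the class name NaiveSplitDemo and the method SplitOnPeriods are hypothetical names made up for this illustration and are not part of the OpenNLP library.

```csharp
using System;

class NaiveSplitDemo
{
    // Hypothetical helper: treats every period as a sentence boundary.
    public static string[] SplitOnPeriods(string input)
    {
        return input.Split(new char[] { '.' }, StringSplitOptions.RemoveEmptyEntries);
    }

    static void Main()
    {
        // The sample paragraph used above: two sentences, four periods.
        string input = "Mr. Jones went shopping. His grocery bill came to $23.45.";

        string[] fragments = SplitOnPeriods(input);

        // Four fragments come back instead of two sentences, because the
        // periods in "Mr." and "$23.45" do not end a sentence.
        Console.WriteLine(fragments.Length);
        foreach (string fragment in fragments)
        {
            Console.WriteLine("[" + fragment.Trim() + "]");
        }
    }
}
```

The naive split breaks the text after "Mr" and inside the dollar amount, which is the kind of mistake a trained sentence detector is meant to avoid.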
Tokenizing sentences

Having isolated a sentence, we may wish to apply some NLP technique to it - part-of-speech tagging, or full parsing, perhaps. The first step in this process is to split the sentence into "tokens" - that is, words and punctuation marks. Again, the library provides a class for this, EnglishMaximumEntropyTokenizer:

    using OpenNLP.Tools.Tokenize;

    EnglishMaximumEntropyTokenizer tokenizer =
        new EnglishMaximumEntropyTokenizer(mModelPath + "EnglishTok.nbin");
    string[] tokens = tokenizer.Tokenize(sentence);
This tokenizer will split words that consist of contractions: for example, it will split "don't" into "do" and "n't", because it is designed to pass these tokens on to the other NLP tools, where "do" is recognized as a verb, and "n't" as a contraction of "not", an adverb modifying the preceding verb "do". The "Tokenize" button in the Tools Example splits text in the top textbox into sentences, then tokenizes each sentence. The output, in the lower textbox, places pipe characters between the tokens.
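As a small sketch of the pipe-separated display described above, the following joins an array of tokens with pipe characters. The token array is written by hand to mirror the "do" / "n't" split (it is not real tokenizer output), the exact spacing around the pipes is an assumption, and TokenDisplayDemo and JoinWithPipes are hypothetical names used only for this illustration.

```csharp
using System;

class TokenDisplayDemo
{
    // Hypothetical helper: places a pipe character between adjacent tokens.
    public static string JoinWithPipes(string[] tokens)
    {
        return string.Join(" | ", tokens);
    }

    static void Main()
    {
        // Hand-written tokens for "I don't know." (illustrative only).
        string[] tokens = { "I", "do", "n't", "know", "." };

        Console.WriteLine(JoinWithPipes(tokens)); // prints: I | do | n't | know | .
    }
}
```

In a real program, the array returned by tokenizer.Tokenize(sentence) would be passed to the helper in place of the hand-written one.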
Part-of-speech tagging

Part-of-speech tagging is the act of assigning a part of speech (sometimes abbreviated POS) to each word in a sentence. Having obtained an array of tokens from the tokenization process, we can feed that array to the part-of-speech tagger:

    using OpenNLP.Tools.PosTagger;

    EnglishMaximumEntropyPosTagger posTagger =
        new EnglishMaximumEntropyPosTagger(mModelPath + "EnglishPOS.nbin");
    string[] tags = posTagger.Tag(tokens);
The POS tags are returned in an array of the same length as the tokens array, where the tag at each index of the array matches the token found at the same index in the tokens array. The POS tags consist of coded abbreviations conforming to the scheme of the Penn Treebank, the linguistic corpus developed by the University of Pennsylvania. The list of possible tags can be obtained by calling the [...]

    CC    Coordinating conjunction              RP     Particle
    CD    Cardinal number                       SYM    Symbol
    DT    Determiner                            TO     to
    EX    Existential there                     UH     Interjection
    FW    Foreign word                          VB     Verb, base form
    IN    Preposition/subordinate conjunction   VBD    Verb, past tense
    JJ    Adjective                             VBG    Verb, gerund/present participle
    JJR   Adjective, comparative                VBN    Verb, past participle
    JJS   Adjective, superlative                VBP    Verb, non-3rd ps. sing. present
    LS    List item marker                      VBZ    Verb, 3rd ps. sing. present
    MD    Modal                                 WDT    wh-determiner
    NN    Noun, singular or mass                WP     wh-pronoun
    NNP   Proper noun, singular                 WP$    Possessive wh-pronoun
    NNPS  Proper noun, plural                   WRB    wh-adverb
    NNS   Noun, plural                          ``     Left open double quote
    PDT   Predeterminer                         ,      Comma
    POS   Possessive ending                     ''     Right close double quote
    PRP   Personal pronoun                      .      Sentence-final punctuation
    PRP$  Possessive pronoun                    :      Colon, semi-colon
    RB    Adverb                                $      Dollar sign
    RBR   Adverb, comparative                   #      Pound sign
    RBS   Adverb, superlative                   -LRB-  Left parenthesis *
                                                -RRB-  Right parenthesis *

    * The Penn Treebank uses the ( and ) symbols, but these are used elsewhere by the OpenNLP parser.
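Because the tags array lines up index-for-index with the tokens array, the two can be combined with a simple loop. The sketch below is only an illustration: the tags are assigned by hand from the table (they are not output from the model), the "token/TAG" display format is an assumption, and TagPairingDemo and PairTokensWithTags are hypothetical names that do not belong to the OpenNLP library.

```csharp
using System;
using System.Text;

class TagPairingDemo
{
    // Hypothetical helper: pairs tokens[i] with tags[i] as "token/TAG".
    public static string PairTokensWithTags(string[] tokens, string[] tags)
    {
        if (tokens.Length != tags.Length)
        {
            throw new ArgumentException("tokens and tags must have the same length");
        }

        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < tokens.Length; i++)
        {
            if (i > 0)
            {
                builder.Append(' ');
            }
            builder.Append(tokens[i]).Append('/').Append(tags[i]);
        }
        return builder.ToString();
    }

    static void Main()
    {
        // Hand-written example data, not real tagger output.
        string[] tokens = { "Mr.", "Jones", "went", "shopping", "." };
        string[] tags = { "NNP", "NNP", "VBD", "VBG", "." };

        Console.WriteLine(PairTokensWithTags(tokens, tags));
        // prints: Mr./NNP Jones/NNP went/VBD shopping/VBG ./.
    }
}
```

With the library loaded, the arrays returned by tokenizer.Tokenize(sentence) and posTagger.Tag(tokens) would take the place of the hand-written ones.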
The maximum entropy model used for the POS tagger was trained using text from the Wall Street Journal and the Brown Corpus. It is possible to further control the POS tagger by providing it with a POS lookup list. There are two alternative [...]

The Tools Example application splits an input paragraph into sentences, tokenizes each sentence, and then POS tags that sentence.
Finding phrases ("chunking")

The OpenNLP chunker tool will group the tokens of a sentence into larger chunks, each chunk corresponding to a syntactic unit such as a noun phrase or a verb phrase. This is the next step on the way to full parsing, but it could also be useful in itself when looking for units of meaning in a sentence larger than the individual words. To perform the chunking task, a POS tagged set of tokens is required. The chunk types are:

    ADJP   Adjective Phrase      PP    Prepositional Phrase
    ADVP   Adverb Phrase         PRT   Particle
    CONJP  Conjunction Phrase    SBAR  Clause introduced by a subordinating conjunction
    INTJ   Interjection          UCP   Unlike Coordinated Phrase
    LST    List marker           VP    Verb Phrase
    NP     Noun Phrase

The chunker is invoked like this:

    using OpenNLP.Tools.Chunker;

    EnglishTreebankChunker chunker =
        new EnglishTreebankChunker(mModelPath + "EnglishChunk.nbin");
    string formattedSentence = chunker.GetChunks(tokens, tags);
The Tools Example application uses the POS-tagging code to generate the string arrays of tokens and tags, and then passes them to the chunker. The result shows the POS tags indicated as before, but with the chunks shown by square-bracketed sections in the output sentences.
Full parsing

Producing a full parse tree is a task that builds on the NLP algorithms we have covered up until now, but which goes further in grouping the chunked phrases into a tree diagram that illustrates the structure of the sentence. The full parse algorithms implemented by the OpenNLP library use the sentence splitting and tokenizing steps, but perform the POS-tagging and chunking as part of a separate but related procedure driven by the models in the "Parser" subfolder of the "Models" folder. The full parse POS-tagging step uses a tag lookup list, found in the "tagdict" file.

The full parser is invoked by creating an object from the EnglishTreebankParser class and calling its DoParse method:

    using OpenNLP.Tools.Parser;

    EnglishTreebankParser parser =
        new EnglishTreebankParser(mModelPath, true, false);
    Parse sentenceParse = parser.DoParse(sentence);

There are many constructors for the EnglishTreebankParser class [...]

The Parse Tree demo application shows how this [...]

The Tools Example, meanwhile, uses the built-in [...]
Name finding

"Name finding" is the term used by the OpenNLP library to refer to the identification of classes of entities within the sentence - for example, people's names, locations, dates, and so on. The name finder can find up to seven different types of entities, represented by the seven maximum entropy model files in the NameFind subfolder - date, location, money, organization, percentage, person, and time. It would, of course, be possible to train new models using the SharpEntropy library, to find other classes of entities. Since this algorithm is dependent on the use of training data, and there are many, many tokens that might come into a category such as "person" or "location", it is far from foolproof.

The name finding function is invoked by first creating an object of type EnglishNameFinder, and then calling its GetNames method with the list of models to use and the sentence to search:

    using OpenNLP.Tools.NameFind;

    EnglishNameFinder nameFinder =
        new EnglishNameFinder(mModelPath + "namefind\\");
    string[] models = new string[] {"date", "location", "money",
        "organization", "percentage", "person", "time"};
    string formattedSentence = nameFinder.GetNames(models, sentence);
The result is a formatted sentence with XML-like tags indicating where entities have been found.
It is also possible to pass a [...]

Conclusion

My C# conversion of the OpenNLP library provides a set of tools that make some important natural language processing tasks simple to perform. The demo applications illustrate how easy it is to invoke the library's classes and get good results quickly.

The library does rely on holding large maximum entropy model data files in memory, so the more complicated NLP tasks (full parsing and name finding) are memory-intensive. On machines with plenty of memory, performance is impressive: a 3.4 GHz Pentium IV machine with 2 GB of RAM loaded the parse data into memory in 12 seconds. Once the model was loaded, passing sentence data to it produced almost instantaneous parse results.

Work on the Java OpenNLP library is ongoing. The C# version now has a coreference tool and its development is also active, at the SharpNLP Project on CodePlex. Investigations into speedy ways of retrieving MaxEnt model data from disk rather than holding data in memory also continue.