This code implements Term Frequency/Inverse Document Frequency (TF-IDF). TF-IDF is a statistics-based text-weighting technique that is widely used in search engines and information retrieval systems. I will deal with the document-similarity problem in the next section. For the underlying theory, please see the Wikipedia article on TF-IDF for more details.
Assume that you have a corpus of 1000 documents and your task is to compute the similarity between two given documents (or between a document and a query). The following steps describe how the similarity value is obtained:
Document pre-processing steps
- Tokenization: A document is treated as a string (or bag of words), and then partitioned into a list of tokens.
- Removing stop words: Stop words are frequently occurring, insignificant words. This step eliminates the stop words.
- Word stemming: This step is the process of conflating tokens to their root form (e.g., connection -> connect).
- We generate the N distinct words from the corpus and call them index terms (the vocabulary). Each document in the collection is then represented as an N-dimensional vector in term space.
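The pre-processing steps above can be sketched in a few lines of Python. This is illustrative only: the tiny stop-word list and the crude suffix-stripping rule below are stand-ins for a real stop list and for the Porter stemmer that the library actually uses.

```python
# Minimal sketch of the pre-processing pipeline (illustrative only).
STOP_WORDS = {"the", "is", "a", "of", "and", "to"}  # tiny example list

def preprocess(document):
    # 1. Tokenization: split the raw string into lowercase tokens.
    tokens = document.lower().split()
    # 2. Stop-word removal: drop frequent, insignificant words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 3. Naive stemming: conflate tokens to a root form
    #    (a real system would use the Porter algorithm instead).
    stemmed = []
    for t in tokens:
        for suffix in ("ions", "ion", "ing", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed

print(preprocess("The connection of the tokens"))
```

The distinct stemmed tokens collected over the whole corpus would then form the vocabulary of index terms.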
Computing Term weights
- Term Frequency.
- Inverse Document Frequency.
- Compute the TF-IDF weighting.
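As a sketch of how these three quantities combine, here is a minimal Python version that assumes the common formulation weight = tf * log(N / df); the library's exact weighting scheme may differ slightly.

```python
import math

def tfidf_weights(docs):
    """Compute one TF-IDF weight vector per document.

    docs is a list of token lists. Assumes the common formulation
    tf * log(N / df); the actual library may weight terms differently.
    """
    n_docs = len(docs)
    # Vocabulary: all distinct terms across the corpus, in sorted order.
    vocab = sorted({term for doc in docs for term in doc})
    # Document frequency: number of documents containing each term.
    df = {t: sum(1 for doc in docs if t in doc) for t in vocab}
    vectors = []
    for doc in docs:
        # Term frequency: raw count of each term in this document.
        tf = {t: doc.count(t) for t in vocab}
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors
```

Note that a term appearing in every document gets weight 0, since log(N/N) = 0: such a term carries no discriminating power.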
Measuring similarity between two documents
- We capture the similarity of two documents using cosine similarity measurement. The cosine similarity is calculated by measuring the cosine of the angle between two document vectors.
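A minimal implementation of that measurement might look like this (a generic sketch, not the library's code):

```python
import math

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention: a zero vector has no direction
    return dot / (norm_u * norm_v)
```

Two documents with proportional term-weight vectors score 1.0 (the angle between them is zero), while documents sharing no terms score 0.0.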
Using the code
The main class is TFIDFMeasure. This is the testing code:
void Test(string[] docs, int i, int j)
{
    StopWordsHandler stopWord = new StopWordsHandler();
    TFIDFMeasure tf = new TFIDFMeasure(docs);
    float simScore = tf.GetSimilarity(i, j);
}
This library also includes a stemming module (Martin Porter's algorithm) and an N-gram text-generation module. If the token-based approach does not work as expected, you can switch to the N-gram-based one. Instead of extracting a list of tokens from the document, we generate a list of N-grams, where N is a predefined number. That means we hash N-grams, rather than words (tokens), into a table to maintain their counters.
The extra N-gram-based similarities (bi-, tri-, quad-...gram) also let you compare the results of the statistics-based method with those of the N-gram-based method. Consider two documents as two flat texts and run the measurement to compare them.
Example of some N-grams for the word "TEXT":
- uni(1)-gram: T, E, X, T
- bi(2)-gram: *T, TE, EX, XT, T*
- tri(3)-gram: **T, *TE, TEX, EXT, XT*, T**
- quad(4)-gram: ***T, **TE, *TEX, TEXT, EXT*, XT**, T***
Here "*" is a padding character marking the start and end of the word. With this padding, a string of length k has k+1 bi-grams, k+2 tri-grams, k+3 quad-grams, and in general k+n-1 n-grams.
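Such padded N-grams can be generated with a short helper. This is a sketch; the "*" pad character is just a convention, and the library's own N-gram module may differ in detail.

```python
def ngrams(text, n, pad="*"):
    """Generate character n-grams with (n-1) pad characters on each
    side, so a string of length k yields k + n - 1 n-grams."""
    if n == 1:
        return list(text)
    padded = pad * (n - 1) + text + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]
```

For example, ngrams("TEXT", 2) reproduces the five bi-grams listed above, and ngrams("TEXT", 3) yields the six padded tri-grams.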
Point of interest
No complex technique was used; I only utilized hashtable indexing and binary search over arrays to solve this problem. The N-gram-based text similarity also gives us interesting results.
Articles worth reading