14,269,494 members

# Term frequency/Inverse document frequency implementation in C#

Rate this:
28 Oct 2005
Text statistical-based measuring of similarity between two documents in a corpora.

## Introduction

This code implements the Term Frequency/Inverse Document frequency (TF-IDF). The TF-IDF is a text statistical-based technique which has been widely used in many search engines and information retrieval systems. I will deal with the documents similarity problem in the next section. To understand the theory, please see this article: Wiki definition of TF/IDF for more details.

## Solution

Assume that you have a corpora of 1000 documents and your task is to compute the similarity between two given documents (or a document and a query). The following describes the steps of acquiring the similarity value:

Document pre-processing steps

• Tokenization: A document is treated as a string (or bag of words), and then partitioned into a list of tokens.
• Removing stop words: Stop words are frequently occurring, insignificant words. This step eliminates the stop words.
• Stemming word: This step is the process of conflating tokens to their root form (connection -> connect).

Document representation

• We generate N-distinct words from the corpora and call them as index terms (or the vocabulary). The document collection is then represented as a N-dimensional vector in term space.

Computing Term weights

• Term Frequency.
• Inverse Document Frequency.
• Compute the TF-IDF weighting.

Measuring similarity between two documents

• We capture the similarity of two documents using cosine similarity measurement. The cosine similarity is calculated by measuring the cosine of the angle between two document vectors.

## Using the code

The main class is `TFIDFMeasure`. This is the testing code:

```void Test (string[] docs, int i, int j)
// docs is collection of parsed documents
{
StopWordHandler stopWord=new StopWordsHandler() ;
TFIDFMeasure tf=new TFIDFMeasure(doc) ;
float simScore=tf.GetSimilarity( i, j);
// similarity of two given documents at the
// position i,j respectively
}```

## Extension

This library also includes stemming (Martin Porter algorithm), and N-gram text generation modules. If a token-based system did not work as expected, then you can make another choice with N-gram based. Thus, instead of expanding the list of tokens from the document, we will generate a list of N-grams, where N should be a predefined number. That means we will hash into a table to find the counter for the N-gram, but not words (or tokens).

The extra N-gram based similarities (bi, tri, quad...-gram) also help you compare the result of the statistical-based method with the N-gram based method. Let us consider two documents as two flat texts and then run the measurement to compare.

Example of some N-grams for the word "TEXT":

• uni(1)-gram: T, E, X, T
• bi(2)-gram: T, TE, EX, XT, T
• tri(3)-grams: TE, TEX, EXT, XT, T
• quad(4)-grams: TEX, TEXT, EXT, XT, T

A string of length k, will have k+1 bi-grams, k+1 tri-grams, k+1 quad-grams, and so on.

## Point of interest

No complex technique was used, I only utilized the hashtable indexing, and array binary search to solve this problem. The N-gram based text similarity also gives us interesting results.

A list of licenses authors might use can be found here

## Share

 Software Developer Vietnam
I'm still alive...but temporarily moved to work on mobile & web stuffs(j2me/brew/php/flash...something not M\$). things have just been very busy, and probably will continue...so don't have chance to maintain & respond. Hope will have time to try to write again, because many ideas with WPF &silver light are waiting. wish me luck

FYI:
- MESHSimPack project(c# library for measuring similarity among concepts of the MESH ontology):
http://sourceforge.net/projects/meshsimpack.

 FirstPrev Next
 Re: TDF/IDF Formula BHAVANA RAO7-Apr-07 1:19 BHAVANA RAO 7-Apr-07 1:19
 Can you pls give me a hand? chufangdavid2-Dec-05 8:26 chufangdavid 2-Dec-05 8:26
 Re: Can you pls give me a hand? Thanh Dao3-Dec-05 17:38 Thanh Dao 3-Dec-05 17:38
 Re: Can you pls give me a hand? namrevon4-Dec-05 8:25 namrevon 4-Dec-05 8:25
 Re: Can you pls give me a hand? Thanh Dao7-Dec-05 4:47 Thanh Dao 7-Dec-05 4:47
 Cosine similarity measurement rmcubedmm31-Oct-05 22:57 rmcubedmm 31-Oct-05 22:57
 Re: Cosine similarity measurement Thanh Dao31-Oct-05 23:42 Thanh Dao 31-Oct-05 23:42
 Re: Cosine similarity measurement thesky20918-Apr-12 11:52 thesky209 18-Apr-12 11:52
 sr but http://www.math.unipd.it/~fabseb60/Publications/ACMCS02.pdf[^] is no longer availble . Can u re-post a new one . Thanks you very much
 Five Stars as usual Grav-Vt29-Oct-05 9:08 Grav-Vt 29-Oct-05 9:08
 Re: Five Stars as usual Thanh Dao29-Oct-05 17:25 Thanh Dao 29-Oct-05 17:25
 Re: Five Stars as usual gowth0831-Jan-09 18:52 gowth08 31-Jan-09 18:52
 Last Visit: 21-Aug-19 22:58     Last Update: 21-Aug-19 22:58 Refresh « Prev12