14,331,160 members

# Term frequency/Inverse document frequency implementation in C#

Rate this:
28 Oct 2005
Text statistical-based measuring of similarity between two documents in a corpora.

## Introduction

This code implements the Term Frequency/Inverse Document frequency (TF-IDF). The TF-IDF is a text statistical-based technique which has been widely used in many search engines and information retrieval systems. I will deal with the documents similarity problem in the next section. To understand the theory, please see this article: Wiki definition of TF/IDF for more details.

## Solution

Assume that you have a corpora of 1000 documents and your task is to compute the similarity between two given documents (or a document and a query). The following describes the steps of acquiring the similarity value:

Document pre-processing steps

• Tokenization: A document is treated as a string (or bag of words), and then partitioned into a list of tokens.
• Removing stop words: Stop words are frequently occurring, insignificant words. This step eliminates the stop words.
• Stemming word: This step is the process of conflating tokens to their root form (connection -> connect).

Document representation

• We generate N-distinct words from the corpora and call them as index terms (or the vocabulary). The document collection is then represented as a N-dimensional vector in term space.

Computing Term weights

• Term Frequency.
• Inverse Document Frequency.
• Compute the TF-IDF weighting.

Measuring similarity between two documents

• We capture the similarity of two documents using cosine similarity measurement. The cosine similarity is calculated by measuring the cosine of the angle between two document vectors.

## Using the code

The main class is `TFIDFMeasure`. This is the testing code:

```void Test (string[] docs, int i, int j)
// docs is collection of parsed documents
{
StopWordHandler stopWord=new StopWordsHandler() ;
TFIDFMeasure tf=new TFIDFMeasure(doc) ;
float simScore=tf.GetSimilarity( i, j);
// similarity of two given documents at the
// position i,j respectively
}```

## Extension

This library also includes stemming (Martin Porter algorithm), and N-gram text generation modules. If a token-based system did not work as expected, then you can make another choice with N-gram based. Thus, instead of expanding the list of tokens from the document, we will generate a list of N-grams, where N should be a predefined number. That means we will hash into a table to find the counter for the N-gram, but not words (or tokens).

The extra N-gram based similarities (bi, tri, quad...-gram) also help you compare the result of the statistical-based method with the N-gram based method. Let us consider two documents as two flat texts and then run the measurement to compare.

Example of some N-grams for the word "TEXT":

• uni(1)-gram: T, E, X, T
• bi(2)-gram: T, TE, EX, XT, T
• tri(3)-grams: TE, TEX, EXT, XT, T
• quad(4)-grams: TEX, TEXT, EXT, XT, T

A string of length k, will have k+1 bi-grams, k+1 tri-grams, k+1 quad-grams, and so on.

## Point of interest

No complex technique was used, I only utilized the hashtable indexing, and array binary search to solve this problem. The N-gram based text similarity also gives us interesting results.

A list of licenses authors might use can be found here

## Share

 Software Developer Vietnam
I'm still alive...but temporarily moved to work on mobile & web stuffs(j2me/brew/php/flash...something not M\$). things have just been very busy, and probably will continue...so don't have chance to maintain & respond. Hope will have time to try to write again, because many ideas with WPF &silver light are waiting. wish me luck

FYI:
- MESHSimPack project(c# library for measuring similarity among concepts of the MESH ontology):
http://sourceforge.net/projects/meshsimpack.

 First PrevNext
 Vector space Member 1052649623-Jan-14 13:01 Member 10526496 23-Jan-14 13:01
 Alternative version of TF*IDF code kbomb98711-Sep-13 10:10 kbomb987 11-Sep-13 10:10
 help for run program mostafamosakhany15-Jul-13 20:27 mostafamosakhany 15-Jul-13 20:27
 text categorization arjuncent11-Feb-13 15:43 arjuncent 11-Feb-13 15:43
 Good program but complex structure falah ukm14-Apr-12 2:48 falah ukm 14-Apr-12 2:48
 stopword removal issue sianiparmaruli24-Jul-11 22:44 sianiparmaruli 24-Jul-11 22:44
 Re: stopword removal issue falah ukm7-May-12 3:45 falah ukm 7-May-12 3:45
 i have a question pls read this!!!! dyeingbreed8-Jul-11 21:22 dyeingbreed 8-Jul-11 21:22
 need help for visual basic coding for tf-idf fifi fadli6-Jul-11 23:42 fifi fadli 6-Jul-11 23:42
 My vote of 1 Libra85224-Oct-10 22:23 Libra852 24-Oct-10 22:23
 My vote of 3 Mustafa M. Gamal29-Aug-10 19:47 Mustafa M. Gamal 29-Aug-10 19:47
 "Term frequency/Inverse document frequency implementation in C#" Kasunmit9-May-10 16:15 Kasunmit 9-May-10 16:15
 How to run the code aurangzeb khan27-Sep-09 18:48 aurangzeb khan 27-Sep-09 18:48
 Performance tohi21-May-09 3:57 tohi 21-May-09 3:57
 How to use this code Nanomom24-Sep-08 21:11 Nanomom 24-Sep-08 21:11
 Sir very thanx for this code at code project but would you like to help me chancomsats21-May-08 9:35 chancomsats 21-May-08 9:35
 Still maintained? Marakai27-Dec-07 17:47 Marakai 27-Dec-07 17:47
 Re: Still maintained? daisy0001bc3-Jun-08 0:47 daisy0001bc 3-Jun-08 0:47
 Re: Still maintained? Marakai3-Jun-08 15:40 Marakai 3-Jun-08 15:40
 Re: Still maintained? Thanh Dao6-Jun-08 4:14 Thanh Dao 6-Jun-08 4:14
 Re: Still maintained? Marakai15-Jan-09 17:52 Marakai 15-Jan-09 17:52
 Hello! This has fallen off my radar - once I handed in my master's thesis I was exhausted and wanted to never even see code again. If I ever find the time I'll see if I can dig out the C# classes that I wrote and make them available.
 Re: Still maintained? me_project12-Jan-09 18:49 me_project 12-Jan-09 18:49
 Re: Still maintained? Marakai15-Jan-09 17:50 Marakai 15-Jan-09 17:50
 Re: Still maintained? me_project18-Jan-09 19:45 me_project 18-Jan-09 19:45
 Last Visit: 18-Oct-19 15:58     Last Update: 18-Oct-19 15:58 Refresh 12 Next »