12,547,806 members (47,148 online)
alternative version

148.3K views
58 bookmarked
Posted

# Term frequency/Inverse document frequency implementation in C#

, 28 Oct 2005
 Rate this:
Text statistical-based measuring of similarity between two documents in a corpora.

## Introduction

This code implements the Term Frequency/Inverse Document frequency (TF-IDF). The TF-IDF is a text statistical-based technique which has been widely used in many search engines and information retrieval systems. I will deal with the documents similarity problem in the next section. To understand the theory, please see this article: Wiki definition of TF/IDF for more details.

## Solution

Assume that you have a corpora of 1000 documents and your task is to compute the similarity between two given documents (or a document and a query). The following describes the steps of acquiring the similarity value:

Document pre-processing steps

• Tokenization: A document is treated as a string (or bag of words), and then partitioned into a list of tokens.
• Removing stop words: Stop words are frequently occurring, insignificant words. This step eliminates the stop words.
• Stemming word: This step is the process of conflating tokens to their root form (connection -> connect).

Document representation

• We generate N-distinct words from the corpora and call them as index terms (or the vocabulary). The document collection is then represented as a N-dimensional vector in term space.

Computing Term weights

• Term Frequency.
• Inverse Document Frequency.
• Compute the TF-IDF weighting.

Measuring similarity between two documents

• We capture the similarity of two documents using cosine similarity measurement. The cosine similarity is calculated by measuring the cosine of the angle between two document vectors.

## Using the code

The main class is `TFIDFMeasure`. This is the testing code:

```void Test (string[] docs, int i, int j)
// docs is collection of parsed documents
{
StopWordHandler stopWord=new StopWordsHandler() ;
TFIDFMeasure tf=new TFIDFMeasure(doc) ;
float simScore=tf.GetSimilarity( i, j);
// similarity of two given documents at the
// position i,j respectively
}```

## Extension

This library also includes stemming (Martin Porter algorithm), and N-gram text generation modules. If a token-based system did not work as expected, then you can make another choice with N-gram based. Thus, instead of expanding the list of tokens from the document, we will generate a list of N-grams, where N should be a predefined number. That means we will hash into a table to find the counter for the N-gram, but not words (or tokens).

The extra N-gram based similarities (bi, tri, quad...-gram) also help you compare the result of the statistical-based method with the N-gram based method. Let us consider two documents as two flat texts and then run the measurement to compare.

Example of some N-grams for the word "TEXT":

• uni(1)-gram: T, E, X, T
• bi(2)-gram: T, TE, EX, XT, T
• tri(3)-grams: TE, TEX, EXT, XT, T
• quad(4)-grams: TEX, TEXT, EXT, XT, T

A string of length k, will have k+1 bi-grams, k+1 tri-grams, k+1 quad-grams, and so on.

## Point of interest

No complex technique was used, I only utilized the hashtable indexing, and array binary search to solve this problem. The N-gram based text similarity also gives us interesting results.

A list of licenses authors might use can be found here

## Share

 Software Developer Vietnam
I'm still alive...but temporarily moved to work on mobile & web stuffs(j2me/brew/php/flash...something not M\$). things have just been very busy, and probably will continue...so don't have chance to maintain & respond. Hope will have time to try to write again, because many ideas with WPF &silver light are waiting. wish me luck

FYI:
- MESHSimPack project(c# library for measuring similarity among concepts of the MESH ontology):
http://sourceforge.net/projects/meshsimpack.

## You may also be interested in...

 Pro Pro

 First Prev Next
 Vector space Member 1052649623-Jan-14 13:01 Member 10526496 23-Jan-14 13:01
 Alternative version of TF*IDF code kbomb98711-Sep-13 10:10 kbomb987 11-Sep-13 10:10
 help for run program mostafamosakhany15-Jul-13 20:27 mostafamosakhany 15-Jul-13 20:27
 text categorization arjuncent11-Feb-13 15:43 arjuncent 11-Feb-13 15:43
 Good program but complex structure falah ukm14-Apr-12 2:48 falah ukm 14-Apr-12 2:48
 stopword removal issue sianiparmaruli24-Jul-11 22:44 sianiparmaruli 24-Jul-11 22:44
 Re: stopword removal issue falah ukm7-May-12 3:45 falah ukm 7-May-12 3:45
 i have a question pls read this!!!! dyeingbreed8-Jul-11 21:22 dyeingbreed 8-Jul-11 21:22
 need help for visual basic coding for tf-idf fifi fadli6-Jul-11 23:42 fifi fadli 6-Jul-11 23:42
 My vote of 1 naeem_libra24-Oct-10 22:23 naeem_libra 24-Oct-10 22:23
 My vote of 3 Mustafa M. Gamal29-Aug-10 19:47 Mustafa M. Gamal 29-Aug-10 19:47
 "Term frequency/Inverse document frequency implementation in C#" Kasunmit9-May-10 16:15 Kasunmit 9-May-10 16:15
 How to run the code aurangzeb khan27-Sep-09 18:48 aurangzeb khan 27-Sep-09 18:48
 Hi, The cosine similarity worth reading and its ussefull, still you may want to read some considerations about IDF in my article: http://www.codeproject.com/KB/IP/AnatomyOfASearchEngine1.aspx Cheers
 Performance tohi21-May-09 3:57 tohi 21-May-09 3:57
 How to use this code Nanomom24-Sep-08 21:11 Nanomom 24-Sep-08 21:11
 Sir very thanx for this code at code project but would you like to help me chancomsats21-May-08 9:35 chancomsats 21-May-08 9:35
 Still maintained? Marakai27-Dec-07 17:47 Marakai 27-Dec-07 17:47
 Re: Still maintained? daisy0001bc3-Jun-08 0:47 daisy0001bc 3-Jun-08 0:47
 Re: Still maintained? Marakai3-Jun-08 15:40 Marakai 3-Jun-08 15:40
 Re: Still maintained? Thanh Dao6-Jun-08 4:14 Thanh Dao 6-Jun-08 4:14
 Re: Still maintained? Marakai15-Jan-09 17:52 Marakai 15-Jan-09 17:52
 Re: Still maintained? me_project12-Jan-09 18:49 me_project 12-Jan-09 18:49
 Re: Still maintained? Marakai15-Jan-09 17:50 Marakai 15-Jan-09 17:50
 Re: Still maintained? me_project18-Jan-09 19:45 me_project 18-Jan-09 19:45
 What should I do if the vector length different? [modified] redeville18-Dec-07 23:37 redeville 18-Dec-07 23:37
 Re: What should I do if the vector length different? Marakai27-Dec-07 18:38 Marakai 27-Dec-07 18:38
 Need help Calvin Calton13-Nov-07 17:33 Calvin Calton 13-Nov-07 17:33
 About the code mdhakate16-Oct-07 21:54 mdhakate 16-Oct-07 21:54
 Generate N-distinct words mralfarra12-Aug-07 3:47 mralfarra 12-Aug-07 3:47
 urgent help shruthis914-May-07 21:13 shruthis9 14-May-07 21:13
 need immediate help BHAVANA RAO30-Apr-07 5:46 BHAVANA RAO 30-Apr-07 5:46
 run the code BHAVANA RAO7-Apr-07 1:23 BHAVANA RAO 7-Apr-07 1:23
 Re: run the code shruthis914-May-07 21:07 shruthis9 14-May-07 21:07
 TDF/IDF Formula [modified] albeg15-Jan-07 5:14 albeg 15-Jan-07 5:14
 Re: TDF/IDF Formula BHAVANA RAO7-Apr-07 1:19 BHAVANA RAO 7-Apr-07 1:19
 Can you pls give me a hand? chufangdavid2-Dec-05 8:26 chufangdavid 2-Dec-05 8:26
 Re: Can you pls give me a hand? Thanh Dao3-Dec-05 17:38 Thanh Dao 3-Dec-05 17:38
 Re: Can you pls give me a hand? namrevon4-Dec-05 8:25 namrevon 4-Dec-05 8:25
 Re: Can you pls give me a hand? Thanh Dao7-Dec-05 4:47 Thanh Dao 7-Dec-05 4:47
 Cosine similarity measurement rmcubedmm31-Oct-05 22:57 rmcubedmm 31-Oct-05 22:57
 Re: Cosine similarity measurement Thanh Dao31-Oct-05 23:42 Thanh Dao 31-Oct-05 23:42
 Re: Cosine similarity measurement thesky20918-Apr-12 11:52 thesky209 18-Apr-12 11:52
 Five Stars as usual Grav-Vt29-Oct-05 9:08 Grav-Vt 29-Oct-05 9:08
 Re: Five Stars as usual Thanh Dao29-Oct-05 17:25 Thanh Dao 29-Oct-05 17:25
 Re: Five Stars as usual gowth0831-Jan-09 18:52 gowth08 31-Jan-09 18:52
 Last Visit: 31-Dec-99 18:00     Last Update: 21-Oct-16 11:20 Refresh 1