13,055,490 members (53,068 online)
alternative version

Stats

162.4K views
58 bookmarked
Posted 28 Oct 2005

Term frequency/Inverse document frequency implementation in C#

, 28 Oct 2005
 Rate this:
Text statistical-based measuring of similarity between two documents in a corpora.

Introduction

This code implements the Term Frequency/Inverse Document frequency (TF-IDF). The TF-IDF is a text statistical-based technique which has been widely used in many search engines and information retrieval systems. I will deal with the documents similarity problem in the next section. To understand the theory, please see this article: Wiki definition of TF/IDF for more details.

Solution

Assume that you have a corpora of 1000 documents and your task is to compute the similarity between two given documents (or a document and a query). The following describes the steps of acquiring the similarity value:

Document pre-processing steps

• Tokenization: A document is treated as a string (or bag of words), and then partitioned into a list of tokens.
• Removing stop words: Stop words are frequently occurring, insignificant words. This step eliminates the stop words.
• Stemming word: This step is the process of conflating tokens to their root form (connection -> connect).

Document representation

• We generate N-distinct words from the corpora and call them as index terms (or the vocabulary). The document collection is then represented as a N-dimensional vector in term space.

Computing Term weights

• Term Frequency.
• Inverse Document Frequency.
• Compute the TF-IDF weighting.

Measuring similarity between two documents

• We capture the similarity of two documents using cosine similarity measurement. The cosine similarity is calculated by measuring the cosine of the angle between two document vectors.

Using the code

The main class is `TFIDFMeasure`. This is the testing code:

```void Test (string[] docs, int i, int j)
// docs is collection of parsed documents
{
StopWordHandler stopWord=new StopWordsHandler() ;
TFIDFMeasure tf=new TFIDFMeasure(doc) ;
float simScore=tf.GetSimilarity( i, j);
// similarity of two given documents at the
// position i,j respectively
}```

Extension

This library also includes stemming (Martin Porter algorithm), and N-gram text generation modules. If a token-based system did not work as expected, then you can make another choice with N-gram based. Thus, instead of expanding the list of tokens from the document, we will generate a list of N-grams, where N should be a predefined number. That means we will hash into a table to find the counter for the N-gram, but not words (or tokens).

The extra N-gram based similarities (bi, tri, quad...-gram) also help you compare the result of the statistical-based method with the N-gram based method. Let us consider two documents as two flat texts and then run the measurement to compare.

Example of some N-grams for the word "TEXT":

• uni(1)-gram: T, E, X, T
• bi(2)-gram: T, TE, EX, XT, T
• tri(3)-grams: TE, TEX, EXT, XT, T
• quad(4)-grams: TEX, TEXT, EXT, XT, T

A string of length k, will have k+1 bi-grams, k+1 tri-grams, k+1 quad-grams, and so on.

Point of interest

No complex technique was used, I only utilized the hashtable indexing, and array binary search to solve this problem. The N-gram based text similarity also gives us interesting results.

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here

Share

 Software Developer Vietnam
I'm still alive...but temporarily moved to work on mobile & web stuffs(j2me/brew/php/flash...something not M\$). things have just been very busy, and probably will continue...so don't have chance to maintain & respond. Hope will have time to try to write again, because many ideas with WPF &silver light are waiting. wish me luck

FYI:
- MESHSimPack project(c# library for measuring similarity among concepts of the MESH ontology):
http://sourceforge.net/projects/meshsimpack.

You may also be interested in...

 First Prev Next
 Vector space Member 1052649623-Jan-14 13:01 Member 10526496 23-Jan-14 13:01
 Alternative version of TF*IDF code kbomb98711-Sep-13 10:10 kbomb987 11-Sep-13 10:10
 help for run program mostafamosakhany15-Jul-13 20:27 mostafamosakhany 15-Jul-13 20:27
 text categorization arjuncent11-Feb-13 15:43 arjuncent 11-Feb-13 15:43
 Can you post code using vector space model to categorize the text..
 Good program but complex structure falah ukm14-Apr-12 2:48 falah ukm 14-Apr-12 2:48
 stopword removal issue sianiparmaruli24-Jul-11 22:44 sianiparmaruli 24-Jul-11 22:44
 Re: stopword removal issue falah ukm7-May-12 3:45 falah ukm 7-May-12 3:45
 i have a question pls read this!!!! dyeingbreed8-Jul-11 21:22 dyeingbreed 8-Jul-11 21:22
 need help for visual basic coding for tf-idf fifi fadli6-Jul-11 23:42 fifi fadli 6-Jul-11 23:42
 My vote of 1 naeem_libra24-Oct-10 22:23 naeem_libra 24-Oct-10 22:23
 My vote of 3 Mustafa M. Gamal29-Aug-10 19:47 Mustafa M. Gamal 29-Aug-10 19:47
 "Term frequency/Inverse document frequency implementation in C#" Kasunmit9-May-10 16:15 Kasunmit 9-May-10 16:15
 How to run the code aurangzeb khan27-Sep-09 18:48 aurangzeb khan 27-Sep-09 18:48
 Relevance article Adrian Pirvu4-Jun-09 21:13 Adrian Pirvu 4-Jun-09 21:13
 Performance tohi21-May-09 3:57 tohi 21-May-09 3:57
 How to use this code Nanomom24-Sep-08 21:11 Nanomom 24-Sep-08 21:11
 Sir very thanx for this code at code project but would you like to help me chancomsats21-May-08 9:35 chancomsats 21-May-08 9:35
 Still maintained? Marakai27-Dec-07 17:47 Marakai 27-Dec-07 17:47
 Re: Still maintained? daisy0001bc3-Jun-08 0:47 daisy0001bc 3-Jun-08 0:47
 Re: Still maintained? Marakai3-Jun-08 15:40 Marakai 3-Jun-08 15:40
 Re: Still maintained? Thanh Dao6-Jun-08 4:14 Thanh Dao 6-Jun-08 4:14
 Re: Still maintained? Marakai15-Jan-09 17:52 Marakai 15-Jan-09 17:52
 Re: Still maintained? me_project12-Jan-09 18:49 me_project 12-Jan-09 18:49
 Re: Still maintained? Marakai15-Jan-09 17:50 Marakai 15-Jan-09 17:50
 Re: Still maintained? me_project18-Jan-09 19:45 me_project 18-Jan-09 19:45
 What should I do if the vector length different? [modified] redeville18-Dec-07 23:37 redeville 18-Dec-07 23:37
 Re: What should I do if the vector length different? Marakai27-Dec-07 18:38 Marakai 27-Dec-07 18:38
 Need help Calvin Calton13-Nov-07 17:33 Calvin Calton 13-Nov-07 17:33
 About the code mdhakate16-Oct-07 21:54 mdhakate 16-Oct-07 21:54
 Generate N-distinct words mralfarra12-Aug-07 3:47 mralfarra 12-Aug-07 3:47
 urgent help shruthis914-May-07 21:13 shruthis9 14-May-07 21:13
 need immediate help BHAVANA RAO30-Apr-07 5:46 BHAVANA RAO 30-Apr-07 5:46
 run the code BHAVANA RAO7-Apr-07 1:23 BHAVANA RAO 7-Apr-07 1:23
 Re: run the code shruthis914-May-07 21:07 shruthis9 14-May-07 21:07
 TDF/IDF Formula [modified] albeg15-Jan-07 5:14 albeg 15-Jan-07 5:14
 Re: TDF/IDF Formula BHAVANA RAO7-Apr-07 1:19 BHAVANA RAO 7-Apr-07 1:19
 Can you pls give me a hand? chufangdavid2-Dec-05 8:26 chufangdavid 2-Dec-05 8:26
 Re: Can you pls give me a hand? Thanh Dao3-Dec-05 17:38 Thanh Dao 3-Dec-05 17:38
 Re: Can you pls give me a hand? namrevon4-Dec-05 8:25 namrevon 4-Dec-05 8:25
 Re: Can you pls give me a hand? Thanh Dao7-Dec-05 4:47 Thanh Dao 7-Dec-05 4:47
 Cosine similarity measurement rmcubedmm31-Oct-05 22:57 rmcubedmm 31-Oct-05 22:57
 Re: Cosine similarity measurement Thanh Dao31-Oct-05 23:42 Thanh Dao 31-Oct-05 23:42
 Re: Cosine similarity measurement thesky20918-Apr-12 11:52 thesky209 18-Apr-12 11:52
 Five Stars as usual Grav-Vt29-Oct-05 9:08 Grav-Vt 29-Oct-05 9:08
 Re: Five Stars as usual Thanh Dao29-Oct-05 17:25 Thanh Dao 29-Oct-05 17:25
 Re: Five Stars as usual gowth0831-Jan-09 18:52 gowth08 31-Jan-09 18:52
 Last Visit: 31-Dec-99 18:00     Last Update: 28-Jul-17 1:17 Refresh 1

General    News    Suggestion    Question    Bug    Answer    Joke    Praise    Rant    Admin

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.