Click here to Skip to main content
14,267,992 members

Term frequency/Inverse document frequency implementation in C#

Rate this:
4.87 (55 votes)
Please Sign up or sign in to vote.
4.87 (55 votes)
28 Oct 2005
Text statistical-based measuring of similarity between two documents in a corpora.

Introduction

This code implements the Term Frequency/Inverse Document frequency (TF-IDF). The TF-IDF is a text statistical-based technique which has been widely used in many search engines and information retrieval systems. I will deal with the documents similarity problem in the next section. To understand the theory, please see this article: Wiki definition of TF/IDF for more details.

Solution

Assume that you have a corpora of 1000 documents and your task is to compute the similarity between two given documents (or a document and a query). The following describes the steps of acquiring the similarity value:

Document pre-processing steps

  • Tokenization: A document is treated as a string (or bag of words), and then partitioned into a list of tokens.
  • Removing stop words: Stop words are frequently occurring, insignificant words. This step eliminates the stop words.
  • Stemming word: This step is the process of conflating tokens to their root form (connection -> connect).

Document representation

  • We generate N-distinct words from the corpora and call them as index terms (or the vocabulary). The document collection is then represented as a N-dimensional vector in term space.

Computing Term weights

  • Term Frequency.
  • Inverse Document Frequency.
  • Compute the TF-IDF weighting.

Measuring similarity between two documents

  • We capture the similarity of two documents using cosine similarity measurement. The cosine similarity is calculated by measuring the cosine of the angle between two document vectors.

Using the code

The main class is TFIDFMeasure. This is the testing code:

void Test (string[] docs, int i, int j)
// docs is collection of parsed documents
{
    StopWordHandler stopWord=new StopWordsHandler() ;
    TFIDFMeasure tf=new TFIDFMeasure(doc) ;
    float simScore=tf.GetSimilarity( i, j);
      // similarity of two given documents at the 
      // position i,j respectively
}

Extension

This library also includes stemming (Martin Porter algorithm), and N-gram text generation modules. If a token-based system did not work as expected, then you can make another choice with N-gram based. Thus, instead of expanding the list of tokens from the document, we will generate a list of N-grams, where N should be a predefined number. That means we will hash into a table to find the counter for the N-gram, but not words (or tokens).

The extra N-gram based similarities (bi, tri, quad...-gram) also help you compare the result of the statistical-based method with the N-gram based method. Let us consider two documents as two flat texts and then run the measurement to compare.

Example of some N-grams for the word "TEXT":

  • uni(1)-gram: T, E, X, T
  • bi(2)-gram: T, TE, EX, XT, T
  • tri(3)-grams: TE, TEX, EXT, XT, T
  • quad(4)-grams: TEX, TEXT, EXT, XT, T

A string of length k, will have k+1 bi-grams, k+1 tri-grams, k+1 quad-grams, and so on.

Point of interest

No complex technique was used, I only utilized the hashtable indexing, and array binary search to solve this problem. The N-gram based text similarity also gives us interesting results.

Articles worth reading

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here

Share

About the Author

Thanh Dao
Software Developer
Vietnam Vietnam
I'm still alive...but temporarily moved to work on mobile & web stuffs(j2me/brew/php/flash...something not M$). things have just been very busy, and probably will continue...so don't have chance to maintain & respond. Hope will have time to try to write again, because many ideas with WPF &silver light are waiting. wish me luck Smile | :)

FYI:
- MESHSimPack project(c# library for measuring similarity among concepts of the MESH ontology):
http://sourceforge.net/projects/meshsimpack.

Comments and Discussions

 
GeneralRe: Still maintained? Pin
me_project12-Jan-09 18:49
memberme_project12-Jan-09 18:49 
GeneralRe: Still maintained? Pin
Marakai15-Jan-09 17:50
memberMarakai15-Jan-09 17:50 
GeneralRe: Still maintained? Pin
me_project18-Jan-09 19:45
memberme_project18-Jan-09 19:45 
QuestionWhat should I do if the vector length different? [modified] Pin
redeville18-Dec-07 23:37
memberredeville18-Dec-07 23:37 
GeneralRe: What should I do if the vector length different? Pin
Marakai27-Dec-07 18:38
memberMarakai27-Dec-07 18:38 
GeneralNeed help Pin
Calvin Calton13-Nov-07 17:33
memberCalvin Calton13-Nov-07 17:33 
QuestionAbout the code Pin
mdhakate16-Oct-07 21:54
membermdhakate16-Oct-07 21:54 
QuestionGenerate N-distinct words Pin
mralfarra12-Aug-07 3:47
membermralfarra12-Aug-07 3:47 
hello ..
this is a great work but I think if this work contain an technical documentation, it will be more greater ...

please, I would like to aske about generateing N-distinct words, do you use a specific techniques to choose the terms ? or we choose all the distinct words ..
as if I have two documents, after preprocessing steps , the first one contain 50 words and the second 70 words , and they have 30 shared words then the dim. of the index array will be ...
(50+ 70) - 30 = 90 words ...
or we use any techniques to reduce this number of words ?

mahmoud Rafeek
Generalurgent help Pin
shruthis914-May-07 21:13
membershruthis914-May-07 21:13 
Questionneed immediate help Pin
BHAVANA RAO30-Apr-07 5:46
memberBHAVANA RAO30-Apr-07 5:46 
Questionrun the code Pin
BHAVANA RAO7-Apr-07 1:23
memberBHAVANA RAO7-Apr-07 1:23 
AnswerRe: run the code Pin
shruthis914-May-07 21:07
membershruthis914-May-07 21:07 
GeneralTDF/IDF Formula [modified] Pin
albeg15-Jan-07 5:14
memberalbeg15-Jan-07 5:14 
QuestionRe: TDF/IDF Formula Pin
BHAVANA RAO7-Apr-07 1:19
memberBHAVANA RAO7-Apr-07 1:19 
QuestionCan you pls give me a hand? Pin
chufangdavid2-Dec-05 8:26
memberchufangdavid2-Dec-05 8:26 
AnswerRe: Can you pls give me a hand? Pin
Thanh Dao3-Dec-05 17:38
memberThanh Dao3-Dec-05 17:38 
GeneralRe: Can you pls give me a hand? Pin
namrevon4-Dec-05 8:25
membernamrevon4-Dec-05 8:25 
GeneralRe: Can you pls give me a hand? Pin
Thanh Dao7-Dec-05 4:47
memberThanh Dao7-Dec-05 4:47 
GeneralCosine similarity measurement Pin
rmcubedmm31-Oct-05 22:57
memberrmcubedmm31-Oct-05 22:57 
GeneralRe: Cosine similarity measurement Pin
Thanh Dao31-Oct-05 23:42
memberThanh Dao31-Oct-05 23:42 
GeneralRe: Cosine similarity measurement Pin
thesky20918-Apr-12 11:52
memberthesky20918-Apr-12 11:52 
GeneralFive Stars as usual Pin
Grav-Vt29-Oct-05 9:08
memberGrav-Vt29-Oct-05 9:08 
GeneralRe: Five Stars as usual Pin
Thanh Dao29-Oct-05 17:25
memberThanh Dao29-Oct-05 17:25 
GeneralRe: Five Stars as usual Pin
gowth0831-Jan-09 18:52
groupgowth0831-Jan-09 18:52 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

Article
Posted 28 Oct 2005

Stats

183.3K views
6.7K downloads
59 bookmarked