## Introduction

This code implements the Term Frequency/Inverse Document Frequency (TF-IDF) weighting scheme. TF-IDF is a statistical text-weighting technique that is widely used in search engines and information retrieval systems. I will deal with the document-similarity problem in the next section. To understand the theory, please see the Wikipedia article on TF-IDF for more details.

## Solution

Assume that you have a corpus of 1000 documents and your task is to compute the similarity between two given documents (or between a document and a query). The following describes the steps for acquiring the similarity value:

**Document pre-processing steps**

- Tokenization: A document is treated as a string (or bag of words), and then partitioned into a list of tokens.
- Removing stop words: Stop words are frequently occurring, insignificant words. This step eliminates the stop words.
- Stemming: This step conflates tokens to their root form (connection -> connect).
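
The three pre-processing steps above can be sketched in Python. The stop-word list and the suffix stripper here are toy stand-ins for illustration only; the actual library uses the Porter stemmer:

```python
import re

# A minimal stop-word list for illustration; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "and", "of", "to", "in"}

def tokenize(document):
    """Treat the document as a string and partition it into lowercase tokens."""
    return re.findall(r"[a-z0-9]+", document.lower())

def remove_stop_words(tokens):
    """Eliminate frequently occurring, insignificant words."""
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Toy suffix-stripping stemmer (a stand-in for the Porter algorithm)."""
    for suffix in ("ion", "ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(document):
    """Run all three pre-processing steps in order."""
    return [stem(t) for t in remove_stop_words(tokenize(document))]
```

For example, `preprocess("The connection is connecting")` conflates both content words to the root `connect`, mirroring the connection -> connect example above.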

**Document representation**

- We generate the N distinct words from the corpus and call them index terms (the vocabulary). Each document in the collection is then represented as an N-dimensional vector in term space.
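
A minimal sketch of this representation, assuming the documents are already tokenized (the function names here are illustrative, not the library's API):

```python
def build_vocabulary(tokenized_docs):
    """Collect the N distinct index terms across the corpus, in sorted order."""
    return sorted({t for doc in tokenized_docs for t in doc})

def term_count_vector(tokens, vocabulary):
    """Represent one document as an N-dimensional vector of raw term counts."""
    index = {term: i for i, term in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for t in tokens:
        if t in index:
            vector[index[t]] += 1
    return vector
```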

**Computing Term weights**

- Term Frequency.
- Inverse Document Frequency.
- Compute the TF-IDF weighting.
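
These three sub-steps can be sketched as follows. Several TF and IDF variants exist; this sketch uses the raw term count for TF and log(D / df) for IDF, which is one common choice and is not necessarily the exact variant the library uses:

```python
import math

def tfidf_weights(tokenized_docs):
    """Compute a TF-IDF weight vector for each document.

    TF  = raw count of the term in the document.
    IDF = log(D / df), where D is the number of documents and
          df is the number of documents containing the term.
    """
    vocabulary = sorted({t for doc in tokenized_docs for t in doc})
    num_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each term appear?
    df = {term: sum(term in doc for doc in tokenized_docs) for term in vocabulary}
    weights = []
    for doc in tokenized_docs:
        row = [doc.count(term) * math.log(num_docs / df[term]) for term in vocabulary]
        weights.append(row)
    return vocabulary, weights
```

Note that a term appearing in every document gets IDF = log(1) = 0, so it contributes nothing to the similarity, which is exactly the intent of the IDF factor.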

**Measuring similarity between two documents**

- We capture the similarity of two documents using the cosine similarity measure: the cosine of the angle between the two document vectors.
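
The cosine measure itself is short enough to sketch directly:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two document vectors:
    dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; define similarity as 0
    return dot / (norm_a * norm_b)
```

Identical directions give 1.0, orthogonal vectors (no shared terms) give 0.0.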

## Using the code

The main class is `TFIDFMeasure`. This is the testing code:

```csharp
void Test(string[] docs, int i, int j)
{
    StopWordsHandler stopWord = new StopWordsHandler();
    TFIDFMeasure tf = new TFIDFMeasure(docs);
    float simScore = tf.GetSimilarity(i, j);
}
```

## Extension

This library also includes a stemmer (the Porter algorithm, by Martin Porter) and an N-gram text-generation module. If the token-based approach does not work as expected, you can switch to the N-gram-based one. Instead of extracting a list of tokens from the document, we generate a list of N-grams, where N is a predefined number. We then hash into a table to find the counter for each N-gram, rather than for each word (or token).

The extra N-gram-based similarities (bi-, tri-, quad-...gram) also let you compare the results of the statistical (token-based) method with the N-gram-based method. Treat the two documents as two flat texts and then run the measurement to compare them.

Example of some N-grams for the word "TEXT":

- uni(1)-grams: `T`, `E`, `X`, `T`
- bi(2)-grams: `_T`, `TE`, `EX`, `XT`, `T_`
- tri(3)-grams: `_TE`, `TEX`, `EXT`, `XT_`, `T__`
- quad(4)-grams: `_TEX`, `TEXT`, `EXT_`, `XT__`, `T___`

(Here `_` marks a padding blank: the word is padded with one blank in front and N-1 blanks at the end before the N-grams are taken.)

With this padding, a string of length k will have k+1 bi-grams, k+1 tri-grams, k+1 quad-grams, and so on.
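
A sketch of this padded N-gram generation, together with the hashtable counting described above (the padding scheme here, one blank before and N-1 after, is my reading of the example; the library's module may differ in detail):

```python
from collections import Counter

def ngrams(text, n, pad="_"):
    """Generate character N-grams with one pad character before the text
    and n-1 pads after it, so a string of length k yields k+1 N-grams
    (for n >= 2)."""
    padded = pad + text + pad * (n - 1)
    return [padded[i : i + n] for i in range(len(padded) - n + 1)]

# Hash the N-grams into a table of counters, as described above.
bigram_counts = Counter(ngrams("TEXT", 2))
```

For "TEXT" (k = 4) this yields the five bi-grams `_T`, `TE`, `EX`, `XT`, `T_` listed above, and likewise five tri-grams and five quad-grams.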

## Point of interest

No complex technique was used; I only utilized hashtable indexing and binary search over arrays to solve this problem. The N-gram-based text similarity also gives us interesting results.

## Articles worth reading