Download source code - 4.7 KB

Introduction

Model is an idealization or abstraction of an actual process. Information Retrieval models can describe the computational process.

For example

how documents are ranked
Note that how documents or indexes are stored is implemented.

The goal of information retrieval (IR) is to provide users with those documents that will satisfy their information need. Retrieval models can attempt to describe the human Process, such as the information need, interaction.

Retrieval model can be categorize as

Boolean retrieval model
Vector space model
Probabilistic model
Model based on belief net

The Boolean model of information retrieval is a classical information retrieval (IR) model and is the first and most adopted one. It is used by virtually all commercial IR systems today.

Exact vs Best match

In exact match a query specifies precise criteria. Each document either matches or fails to match the query. The results retrieved in exact match is a set of document (without ranking).

In best match a query describes good or best matching documents. In this case the result is a ranked list of document. The Boolean model here I’m going to deal with is the most common exact match model.

Basic Assumption of Boolean Model

An index term is either present(1) or absent(0) in the document
All index terms provide equal evidence with respect to information needs.
Queries are Boolean combinations of index terms.

X AND Y: represents doc that contains both X and Y
X OR Y: represents doc that contains either X or Y
NOT X: represents the doc that do not contain X

Boolean Queries Example

User information need: Interested to know about Everest and Nepal

User Boolean query: Everest AND Nepal

Implementation Part

Example of Input collection

Doc1= English tutorial and fast track

Doc2 = learning latent semantic indexing

Doc3 = Book on semantic indexing

Doc4 = Advance in structure and semantic indexing

Doc5 = Analysis of latent structures

Query problem: advance and structure AND NOT analysis

Boolean Model Index Construction

First we build the term-document incidence matrix which represents a list of all the distinct terms and their presence on each document (incidence vector). If the document contains the term than incidence vector is 1 otherwise 0.

Terms/doc	Doc1	Doc2	Doc3	Doc4	Doc5
English	1	0	0	0	0
Tutorial	1	0	0	0	0
Fast	1	0	0	0	0
Track	1	0	0	0	0
Books	0	1	0	0	0
Semantic	0	1	1	1	0
Analysis	0	1	0	0	1
Learning	0	0	1	0	0
Latent	0	0	1	0	1
Indexing	0	0	1	1	0
Advance	0	0	0	1	0
Structures	0	0	0	1	1

So now we have 0/1 vector for each term. To answer the query we take the vectors for advance, structure and analysis, complement the last, and then do a bitwise AND.

Doc1	Doc2	Doc3	Doc4	Doc5
0	0	0	1	0
0	0	0	1	1	(AND)
0	0	0	1	0
1	0	1	1	0	(NOT analysis)
0	0	0	1	0
			Doc4

Hence Doc4 is retrieved here.

Diving into the code

Document tokenization and indexing

Tokenization is the task of chopping the input character sequence into pieces, called tokens. Here I have used termsCollection to store the retrieve tokens from the input document and HasSet collection type distinctTermto create the index of tokens which ensures that the collection type contains only distinct terms .

//read all the text document on the specified directory
foreach (string file in Directory.EnumerateFiles("F:\\IR\\", "*.txt"))
{
    string contents = File.ReadAllText(file);
    String[] termsCollection = RemoveStopsWords(contents.ToUpper().Split(' '));
    foreach (string term in termsCollection)
    {
        //prepeare distinct terms collection
        //remove stop words
        if (!stopWords.Contains(term)) 
        {
            distinctTerm.Add(term);
        }
    }

    //add document and their terms collection
    documentCollection.Add(file, termsCollection.ToList());
    //add document and its content for displaying the search result
    documentContentList.Add(count, contents);
    count++;
}

Removing stop words

Most Search Engines do not consider extremely common words in order to save disk space or to speed up search results. These filtered words are known as 'Stop Words' here I have used generic collection stopWords which contains a list of stop word, to remove it from the document.

//removes all stop words
public static string[] RemoveStopsWords(string[] str)
{
    List<string> terms = new List<string>();
    foreach (string term in str)
    {
        if (!stopWords.Contains(term))
        {
            terms.Add(term);
        }
    }
    return terms.ToArray();
}

Term - Document Incidence matrix creation

If a term appears on a document than its term incidence vector is 1 else 0.

The GetTermDocumentIncidenceMatrix function returns the term document incidence matrix for all distinct terms which are stored on HashSet collection. I have used Dictionary collection type termDocumentIncidenceMatrix to store the term and its incidence vectors on each document.

//prepares Term Document Incidence Matrix
public static Dictionary<string, List<int>> GetTermDocumentIncidenceMatrix(HashSet<string> 
       distinctTerms,Dictionary<string,List<string>> documentCollection)
{
    Dictionary<string, List<int>> termDocumentIncidenceMatrix = 
                new Dictionary<string, List<int>>();
    List<int> incidenceVector = new List<int>();
    foreach (string term in distinctTerms)
    {
        //incidence vector for each terms
        incidenceVector = new List<int>();
        foreach(KeyValuePair<string ,List<string>> p in documentCollection)
        {
            
            if (p.Value.Contains(term))
            {
                //document contains the term
                incidenceVector.Add(1); 

            }
            else
            {
                //document do not contains the term
                incidenceVector.Add(0); 
            }
        }
        termDocumentIncidenceMatrix.Add(term, incidenceVector);

    }
    return termDocumentIncidenceMatrix;
}

Retrieving incidence vector of individual term

//returns term incidence vector
public static List<int> GetTermIncidenceVector(string term)
{
    return termDocumentIncidenceMatrix[term.ToUpper()];
}

Query Processor

The string variable bitWiseOp holds the Boolean operator that exists on the query. Generally the term but is treated as semantically equivalent with Boolean AND operator. The collection previousTermIncidenceV and nextTermIncidenceV is used to hold the term incidence vector of each term that appears before and after the Boolean operator on the query.

//process the boolean query
public static List<int> ProcessQuery(string query)
{
    //query boolean operator
    string bitWiseOp = string.Empty; 
    string[] queryTerm = RemoveStopsWords(query.ToUpper().Split(''));
    //remove query term that doesnot appears on document collection
    FilterQueryTerm(ref queryTerm);
    List<int> previousTermIncidenceV = null;
    List<int> nextTermsIncidenceV = null;
    //holds the bitwise operation result
    List<int> resultSet = null;
    //On query X AND Y, X is previousTerm term and Y is nextTerm
    Boolean hasPreviousTerm = false; 
    Boolean hasNotOperation = false;
    foreach (string term in queryTerm)
    {
        //is a term
        if (!booleanOperator.Contains(term)&&!term.Equals("BUT"))
        {
            //query case: structure AND NOT analysis
            if (hasNotOperation) 
            {
                
                if (hasPreviousTerm)
                {
                    nextTermsIncidenceV = ProcessBooleanOperator("NOT", 
                      GetTermIncidenc eVector(term), nextTermsIncidenceV);
                }
                //query case: eg.NOT analysis
                else 
                {
                    previousTermIncidenceV = ProcessBooleanOperator("NOT", 
                      GetTermIncid enceVector(term), nextTermsIncidenceV);
                    resultSet = previousTermIncidenceV; 
                }
                hasNotOperation = false;
            }
            else if (!hasPreviousTerm)
            {
                previousTermIncidenceV = GetTermIncidenceVector(term);
                resultSet = previousTermIncidenceV;
                hasPreviousTerm = true; //
            }
            else
            {
                
                nextTermsIncidenceV = GetTermIncidenceVector(term);
            }
        }
        else if (term.Equals("NOT"))
        {
            //indicates that the  term in the next iteration should be complemented.
            hasNotOperation = true;
        }
        else
        {
            //'BUT' also should be evaluated as AND eg. structure BUT
            //NOT semantic should be evaluated as structure AND NOT semantic
            if (term.Equals("BUT")) 
            {
                bitWiseOp = "AND";
            }
            else
            bitWiseOp = term;
        }

        if (nextTermsIncidenceV != null && !hasNotOperation)
        {
            resultSet = ProcessBooleanOperator(bitWiseOp, 
                             previousTermIncidenceV, nextT ermsIncidenceV);
            previousTermIncidenceV = resultSet;
            hasPreviousTerm = true;
            nextTermsIncidenceV = null;
        }
    }

    return resultSet;
}

Filter Query Term

private static void FilterQueryTerm(ref string[] str)
{
     List<string> _queryTerm = new List<string>();            
     foreach (string queryTerm in str) 
     {
         if (queryTerm.ToUpper().Equals("BUT") ||termDocumentIncidenceMatrix.ContainsKey(queryTerm.ToUpper()) ||booleanOperator.Contains(queryTerm))
            {
                    _queryTerm.Add(queryTerm);
                   
            }
      }
      str = _queryTerm.ToArray();
}

Processing Boolean operators

public static List<int> ProcessBooleanOperator(string op, 
          List<int> previousTermV,List<int> nextTermV)
{
    List<int> resultSet = new List<int>();
    if(op.Equals("NOT"))
    {
        foreach(int a in previousTermV)
        {
            if (a == 1)
            {
                resultSet.Add(0);
            }
            else
            {
                resultSet.Add(1);
            }
        }
    }
    else if (op.ToUpper().Equals("AND")) //bitwise AND operation
    {
        for (int a = 0; a < previousTermV.Count; a++)
        {
            if (previousTermV[a] == 1 && nextTermV[a] == 1)
            {
                resultSet.Add(1);
            }
            else
            {
                resultSet.Add(0);
            }
        }
    }
    else if (op.ToUpper().Equals("OR")) //bitwise OR operation
    {
        for (int a = 0; a < previousTermV.Count; a++)
        {
            if (previousTermV[a] == 0 && nextTermV[a] == 0)
            {
                resultSet.Add(0);
            }
            else
            {
                resultSet.Add(1);
            }
        }
    }
    return resultSet;
}