12,950,399 members (63,266 online)
alternative version

#### Stats

31K views
8 bookmarked
Posted 1 May 2012

# Boolean Retrieval Model

, 31 Jul 2012 CPOL
 Rate this:
Information retrieval using boolean retrieval model

## Introduction

Model is an idealization or abstraction of an actual process. Information Retrieval models can describe the computational process.

For example

1. how documents are ranked
2. Note that how documents or indexes are stored is implemented.

The goal of information retrieval (IR) is to provide users with those documents that will satisfy their information need. Retrieval models can attempt to describe the human Process, such as the information need, interaction.

Retrieval model can be categorize as

1. Boolean retrieval model
2. Vector space model
3. Probabilistic model
4. Model based on belief net

The Boolean model of information retrieval is a classical information retrieval (IR) model and is the first and most adopted one. It is used by virtually all commercial IR systems today.

##### Exact vs Best match

In exact match a query specifies precise criteria. Each document either matches or fails to match the query. The results retrieved in exact match is a set of document (without ranking).

In best match a query describes good or best matching documents. In this case the result is a ranked list of document. The Boolean model here I’m going to deal with is the most common exact match model.

##### Basic Assumption of Boolean Model
1. An index term is either present(1) or absent(0) in the document
2. All index terms provide equal evidence with respect to information needs.
3. Queries are Boolean combinations of index terms.
• X AND Y: represents doc that contains both X and Y
• X OR Y: represents doc that contains either X or Y
• NOT X: represents the doc that do not contain X
##### Boolean Queries Example

User information need: Interested to know about Everest and Nepal

User Boolean query: Everest AND Nepal

## Implementation Part

Example of Input collection

Doc1= English tutorial and fast track

Doc2 = learning latent semantic indexing

Doc3 = Book on semantic indexing

Doc4 = Advance in structure and semantic indexing

Doc5 = Analysis of latent structures

Query problem: advance and structure AND NOT analysis

##### Boolean Model Index Construction

First we build the term-document incidence matrix which represents a list of all the distinct terms and their presence on each document (incidence vector). If the document contains the term than incidence vector is 1 otherwise 0.

Terms/doc Doc1 Doc2 Doc3 Doc4 Doc5
English  1 0 0 0 0
Tutorial 1 0 0 0 0
Fast 1 0 0 0 0
Track 1 0 0 0 0
Books 0 1 0 0 0
Semantic 0 1 1 1 0
Analysis 0 1 0 0 1
Learning 0 0 1 0 0
Latent 0 0 1 0 1
Indexing 0 0 1 1 0
Advance 0 0 0 1 0
Structures 0 0 0 1 1

So now we have 0/1 vector for each term. To answer the query we take the vectors for advance, structure and analysis, complement the last, and then do a bitwise AND.

Doc1 Doc2 Doc3 Doc4 Doc5
00010
00011(AND)
0 0 0 1
10 1 1 0 (NOT analysis)
00 0 1 0
Doc4
Hence Doc4 is retrieved here.

## Diving into the code

##### Document tokenization and indexing

Tokenization is the task of chopping the input character sequence into pieces, called tokens. Here I have used `termsCollection `to store the retrieve tokens from the input document and HasSet collection type

`distinctTerm`
to create the index of tokens which ensures that the collection type contains only distinct terms .

```//read all the text document on the specified directory
foreach (string file in Directory.EnumerateFiles("F:\\IR\\", "*.txt"))
{
String[] termsCollection = RemoveStopsWords(contents.ToUpper().Split(' '));
foreach (string term in termsCollection)
{
//prepeare distinct terms collection
//remove stop words
if (!stopWords.Contains(term))
{
}
}

//add document and their terms collection
//add document and its content for displaying the search result
count++;
}```
##### Removing stop words

Most Search Engines do not consider extremely common words in order to save disk space or to speed up search results. These filtered words are known as 'Stop Words' here I have used generic collection `stopWords` which contains a list of stop word, to remove it from the document.

```//removes all stop words
public static string[] RemoveStopsWords(string[] str)
{
List<string> terms = new List<string>();
foreach (string term in str)
{
if (!stopWords.Contains(term))
{
}
}
return terms.ToArray();
}```
##### Term - Document Incidence matrix creation

If a term appears on a document than its term incidence vector is 1 else 0.

The `GetTermDocumentIncidenceMatrix` function returns the term document incidence matrix for all distinct terms which are stored on `HashSet` collection. I have used Dictionary collection type `termDocumentIncidenceMatrix` to store the term and its incidence vectors on each document.

```//prepares Term Document Incidence Matrix
public static Dictionary<string, List<int>> GetTermDocumentIncidenceMatrix(HashSet<string>
distinctTerms,Dictionary<string,List<string>> documentCollection)
{
Dictionary<string, List<int>> termDocumentIncidenceMatrix =
new Dictionary<string, List<int>>();
List<int> incidenceVector = new List<int>();
foreach (string term in distinctTerms)
{
//incidence vector for each terms
incidenceVector = new List<int>();
foreach(KeyValuePair<string ,List<string>> p in documentCollection)
{

if (p.Value.Contains(term))
{
//document contains the term

}
else
{
//document do not contains the term
}
}

}
return termDocumentIncidenceMatrix;
} ```
##### Retrieving incidence vector of individual term
```//returns term incidence vector
public static List<int> GetTermIncidenceVector(string term)
{
return termDocumentIncidenceMatrix[term.ToUpper()];
}```
##### Query Processor

The string variable `bitWiseOp` holds the Boolean operator that exists on the query. Generally the term but is treated as semantically equivalent with Boolean AND operator. The collection `previousTermIncidenceV` and `nextTermIncidenceV `is used to hold the term incidence vector of each term that appears before and after the Boolean operator on the query.

```//process the boolean query
public static List<int> ProcessQuery(string query)
{
//query boolean operator
string bitWiseOp = string.Empty;
string[] queryTerm = RemoveStopsWords(query.ToUpper().Split(''));
//remove query term that doesnot appears on document collection
FilterQueryTerm(ref queryTerm);
List<int> previousTermIncidenceV = null;
List<int> nextTermsIncidenceV = null;
//holds the bitwise operation result
List<int> resultSet = null;
//On query X AND Y, X is previousTerm term and Y is nextTerm
Boolean hasPreviousTerm = false;
Boolean hasNotOperation = false;
foreach (string term in queryTerm)
{
//is a term
if (!booleanOperator.Contains(term)&&!term.Equals("BUT"))
{
//query case: structure AND NOT analysis
if (hasNotOperation)
{

if (hasPreviousTerm)
{
nextTermsIncidenceV = ProcessBooleanOperator("NOT",
GetTermIncidenc eVector(term), nextTermsIncidenceV);
}
//query case: eg.NOT analysis
else
{
previousTermIncidenceV = ProcessBooleanOperator("NOT",
GetTermIncid enceVector(term), nextTermsIncidenceV);
resultSet = previousTermIncidenceV;
}
hasNotOperation = false;
}
else if (!hasPreviousTerm)
{
previousTermIncidenceV = GetTermIncidenceVector(term);
resultSet = previousTermIncidenceV;
hasPreviousTerm = true; //
}
else
{

nextTermsIncidenceV = GetTermIncidenceVector(term);
}
}
else if (term.Equals("NOT"))
{
//indicates that the  term in the next iteration should be complemented.
hasNotOperation = true;
}
else
{
//'BUT' also should be evaluated as AND eg. structure BUT
//NOT semantic should be evaluated as structure AND NOT semantic
if (term.Equals("BUT"))
{
bitWiseOp = "AND";
}
else
bitWiseOp = term;
}

if (nextTermsIncidenceV != null && !hasNotOperation)
{
resultSet = ProcessBooleanOperator(bitWiseOp,
previousTermIncidenceV, nextT ermsIncidenceV);
previousTermIncidenceV = resultSet;
hasPreviousTerm = true;
nextTermsIncidenceV = null;
}
}

return resultSet;
} ```
##### Filter Query Term
```private static void FilterQueryTerm(ref string[] str)
{
List<string> _queryTerm = new List<string>();
foreach (string queryTerm in str)
{
if (queryTerm.ToUpper().Equals("BUT") ||termDocumentIncidenceMatrix.ContainsKey(queryTerm.ToUpper()) ||booleanOperator.Contains(queryTerm))
{

}
}
str = _queryTerm.ToArray();
}```
##### Processing Boolean operators
```public static List<int> ProcessBooleanOperator(string op,
List<int> previousTermV,List<int> nextTermV)
{
List<int> resultSet = new List<int>();
if(op.Equals("NOT"))
{
foreach(int a in previousTermV)
{
if (a == 1)
{
}
else
{
}
}
}
else if (op.ToUpper().Equals("AND")) //bitwise AND operation
{
for (int a = 0; a < previousTermV.Count; a++)
{
if (previousTermV[a] == 1 && nextTermV[a] == 1)
{
}
else
{
}
}
}
else if (op.ToUpper().Equals("OR")) //bitwise OR operation
{
for (int a = 0; a < previousTermV.Count; a++)
{
if (previousTermV[a] == 0 && nextTermV[a] == 0)
{
}
else
{
}
}
}
return resultSet;
}```

## Share

 Software Developer United States
Interested in Information Retrieval, Artificial Intelligence
research and programming.Programming is my passion,I enjoy programming for the same reasons that I enjoy reading great books.What I love about writing code is the nitpicking and problem solving.

## You may also be interested in...

 Pro

 First Prev Next
 matrix suhair22-Feb-17 22:12 suhair 22-Feb-17 22:12
 vector space model Member 1286917923-Dec-16 7:57 Member 12869179 23-Dec-16 7:57
 Article Reference Marconi2619-May-16 3:36 Marconi26 19-May-16 3:36
 need for help Member 1092402316-Feb-15 4:26 Member 10924023 16-Feb-15 4:26
 How to run this code? Member 109169231-Jul-14 4:42 Member 10916923 1-Jul-14 4:42
 Re: Python code boolean retrieval model Juan Pablo Perez F26-Sep-13 14:35 Juan Pablo Perez F 26-Sep-13 14:35
 Help Member 929051830-Jul-12 18:16 Member 9290518 30-Jul-12 18:16
 Re: Help Samir Kunwar31-Jul-12 8:14 Samir Kunwar 31-Jul-12 8:14
 Last Visit: 31-Dec-99 18:00     Last Update: 25-May-17 20:43 Refresh 1