Boolean Retrieval Model






4.50/5 (3 votes)
Information retrieval using boolean retrieval model
Introduction
Model is an idealization or abstraction of an actual process. Information Retrieval models can describe the computational process.
For example
- how documents are ranked
- Note that how documents or indexes are stored is implemented.
The goal of information retrieval (IR) is to provide users with those documents that will satisfy their information need. Retrieval models can attempt to describe the human Process, such as the information need, interaction.
Retrieval model can be categorize as
- Boolean retrieval model
- Vector space model
- Probabilistic model
- Model based on belief net
The Boolean model of information retrieval is a classical information retrieval (IR) model and is the first and most adopted one. It is used by virtually all commercial IR systems today.
Exact vs Best match
In exact match a query specifies precise criteria. Each document either matches or fails to match the query. The results retrieved in exact match is a set of document (without ranking).
In best match a query describes good or best matching documents. In this case the result is a ranked list of document. The Boolean model here I’m going to deal with is the most common exact match model.
Basic Assumption of Boolean Model
- An index term is either present(1) or absent(0) in the document
- All index terms provide equal evidence with respect to information needs.
- Queries are Boolean combinations of index terms.
- X AND Y: represents doc that contains both X and Y
- X OR Y: represents doc that contains either X or Y
- NOT X: represents the doc that do not contain X
Boolean Queries Example
User information need: Interested to know about Everest and Nepal
User Boolean query: Everest AND Nepal
Implementation Part
Example of Input collection
Doc1= English tutorial and fast track
Doc2 = learning latent semantic indexing
Doc3 = Book on semantic indexing
Doc4 = Advance in structure and semantic indexing
Doc5 = Analysis of latent structures
Query problem: advance and structure AND NOT analysis
Boolean Model Index Construction
First we build the term-document incidence matrix which represents a list of all the distinct terms and their presence on each document (incidence vector). If the document contains the term than incidence vector is 1 otherwise 0.
Terms/doc | Doc1 | Doc2 | Doc3 | Doc4 | Doc5 |
---|---|---|---|---|---|
English | 1 | 0 | 0 | 0 | 0 |
Tutorial | 1 | 0 | 0 | 0 | 0 |
Fast | 1 | 0 | 0 | 0 | 0 |
Track | 1 | 0 | 0 | 0 | 0 |
Books | 0 | 1 | 0 | 0 | 0 |
Semantic | 0 | 1 | 1 | 1 | 0 |
Analysis | 0 | 1 | 0 | 0 | 1 |
Learning | 0 | 0 | 1 | 0 | 0 |
Latent | 0 | 0 | 1 | 0 | 1 |
Indexing | 0 | 0 | 1 | 1 | 0 |
Advance | 0 | 0 | 0 | 1 | 0 |
Structures | 0 | 0 | 0 | 1 | 1 |
So now we have 0/1 vector for each term. To answer the query we take the vectors for advance, structure and analysis, complement the last, and then do a bitwise AND.
Doc1 | Doc2 | Doc3 | Doc4 | Doc5 | |
---|---|---|---|---|---|
0 | 0 | 0 | 1 | 0 | |
0 | 0 | 0 | 1 | 1 | (AND) |
0 | 0 | 0 | 1 | 0 | |
1 | 0 | 1 | 1 | 0 | (NOT analysis) |
0 | 0 | 0 | 1 | 0 | |
Doc4 |
Diving into the code
Document tokenization and indexing
Tokenization is the task of chopping the input
character sequence into pieces, called tokens. Here I have used termsCollection
to store the retrieve tokens from the input
document and HasSet collection type distinctTerm
to create the index of tokens which ensures that the collection type contains
only distinct terms .
//read all the text document on the specified directory
foreach (string file in Directory.EnumerateFiles("F:\\IR\\", "*.txt"))
{
string contents = File.ReadAllText(file);
String[] termsCollection = RemoveStopsWords(contents.ToUpper().Split(' '));
foreach (string term in termsCollection)
{
//prepeare distinct terms collection
//remove stop words
if (!stopWords.Contains(term))
{
distinctTerm.Add(term);
}
}
//add document and their terms collection
documentCollection.Add(file, termsCollection.ToList());
//add document and its content for displaying the search result
documentContentList.Add(count, contents);
count++;
}
Removing stop words
Most Search Engines do not consider extremely common words in order to save disk space or to speed up search results.
These filtered words are known as 'Stop Words' here I have used generic collection stopWords
which contains a list of stop word, to remove it from the document.
//removes all stop words
public static string[] RemoveStopsWords(string[] str)
{
List<string> terms = new List<string>();
foreach (string term in str)
{
if (!stopWords.Contains(term))
{
terms.Add(term);
}
}
return terms.ToArray();
}
Term - Document Incidence matrix creation
If a term appears on a document than its term incidence vector is 1 else 0.
The GetTermDocumentIncidenceMatrix
function returns the term document incidence matrix for all distinct terms which are stored on
HashSet
collection. I have used Dictionary collection type termDocumentIncidenceMatrix
to store the term and its incidence vectors on each document.
//prepares Term Document Incidence Matrix
public static Dictionary<string, List<int>> GetTermDocumentIncidenceMatrix(HashSet<string>
distinctTerms,Dictionary<string,List<string>> documentCollection)
{
Dictionary<string, List<int>> termDocumentIncidenceMatrix =
new Dictionary<string, List<int>>();
List<int> incidenceVector = new List<int>();
foreach (string term in distinctTerms)
{
//incidence vector for each terms
incidenceVector = new List<int>();
foreach(KeyValuePair<string ,List<string>> p in documentCollection)
{
if (p.Value.Contains(term))
{
//document contains the term
incidenceVector.Add(1);
}
else
{
//document do not contains the term
incidenceVector.Add(0);
}
}
termDocumentIncidenceMatrix.Add(term, incidenceVector);
}
return termDocumentIncidenceMatrix;
}
Retrieving incidence vector of individual term
//returns term incidence vector
public static List<int> GetTermIncidenceVector(string term)
{
return termDocumentIncidenceMatrix[term.ToUpper()];
}
Query Processor
The string variable bitWiseOp
holds the Boolean operator that exists on the query. Generally the term but is treated as semantically
equivalent with Boolean AND operator. The collection previousTermIncidenceV
and nextTermIncidenceV
is used
to hold the term incidence vector of each term that appears before and after the Boolean operator on the query.
//process the boolean query
public static List<int> ProcessQuery(string query)
{
//query boolean operator
string bitWiseOp = string.Empty;
string[] queryTerm = RemoveStopsWords(query.ToUpper().Split(''));
//remove query term that doesnot appears on document collection
FilterQueryTerm(ref queryTerm);
List<int> previousTermIncidenceV = null;
List<int> nextTermsIncidenceV = null;
//holds the bitwise operation result
List<int> resultSet = null;
//On query X AND Y, X is previousTerm term and Y is nextTerm
Boolean hasPreviousTerm = false;
Boolean hasNotOperation = false;
foreach (string term in queryTerm)
{
//is a term
if (!booleanOperator.Contains(term)&&!term.Equals("BUT"))
{
//query case: structure AND NOT analysis
if (hasNotOperation)
{
if (hasPreviousTerm)
{
nextTermsIncidenceV = ProcessBooleanOperator("NOT",
GetTermIncidenc eVector(term), nextTermsIncidenceV);
}
//query case: eg.NOT analysis
else
{
previousTermIncidenceV = ProcessBooleanOperator("NOT",
GetTermIncid enceVector(term), nextTermsIncidenceV);
resultSet = previousTermIncidenceV;
}
hasNotOperation = false;
}
else if (!hasPreviousTerm)
{
previousTermIncidenceV = GetTermIncidenceVector(term);
resultSet = previousTermIncidenceV;
hasPreviousTerm = true; //
}
else
{
nextTermsIncidenceV = GetTermIncidenceVector(term);
}
}
else if (term.Equals("NOT"))
{
//indicates that the term in the next iteration should be complemented.
hasNotOperation = true;
}
else
{
//'BUT' also should be evaluated as AND eg. structure BUT
//NOT semantic should be evaluated as structure AND NOT semantic
if (term.Equals("BUT"))
{
bitWiseOp = "AND";
}
else
bitWiseOp = term;
}
if (nextTermsIncidenceV != null && !hasNotOperation)
{
resultSet = ProcessBooleanOperator(bitWiseOp,
previousTermIncidenceV, nextT ermsIncidenceV);
previousTermIncidenceV = resultSet;
hasPreviousTerm = true;
nextTermsIncidenceV = null;
}
}
return resultSet;
}
Filter Query Term
private static void FilterQueryTerm(ref string[] str)
{
List<string> _queryTerm = new List<string>();
foreach (string queryTerm in str)
{
if (queryTerm.ToUpper().Equals("BUT") ||termDocumentIncidenceMatrix.ContainsKey(queryTerm.ToUpper()) ||booleanOperator.Contains(queryTerm))
{
_queryTerm.Add(queryTerm);
}
}
str = _queryTerm.ToArray();
}
Processing Boolean operators
public static List<int> ProcessBooleanOperator(string op,
List<int> previousTermV,List<int> nextTermV)
{
List<int> resultSet = new List<int>();
if(op.Equals("NOT"))
{
foreach(int a in previousTermV)
{
if (a == 1)
{
resultSet.Add(0);
}
else
{
resultSet.Add(1);
}
}
}
else if (op.ToUpper().Equals("AND")) //bitwise AND operation
{
for (int a = 0; a < previousTermV.Count; a++)
{
if (previousTermV[a] == 1 && nextTermV[a] == 1)
{
resultSet.Add(1);
}
else
{
resultSet.Add(0);
}
}
}
else if (op.ToUpper().Equals("OR")) //bitwise OR operation
{
for (int a = 0; a < previousTermV.Count; a++)
{
if (previousTermV[a] == 0 && nextTermV[a] == 0)
{
resultSet.Add(0);
}
else
{
resultSet.Add(1);
}
}
}
return resultSet;
}