
Lucene Search Programming

21 Oct 2011
Lucene Search Programming using .NET

1. Why is Search so Important?

In the 1970s, information lived only in certain places, with certain people, in certain formats. In the 2010 era, thanks to the huge revolution in the IT industry, information is at your fingertips. When you have tons of data, search plays a vital role in extracting the right information at the right time.

Consider the two analyses below.

Analysis 1

66% of the people who responded state that they average fewer than 2 page views per visit on their developer-oriented blog. The remaining 33% state that they see more than 2 pages per visit (24% even more than 3), but this is probably due to different metrics or ways of identifying page views.

In a nutshell, developer-focused blogs don't retain readers for more than one or two pages.

Analysis 2

In the book The Two-Second Advantage, Vivek Ranadivé and Kevin Maney argue that the hockey players, musicians, and CEOs who can predict what will happen are the ones who win. Over the past fifteen years, scientists have found that what distinguishes the greatest musicians, athletes, and performers from the rest of us isn't just their motor skills or athletic abilities; it is the ability to anticipate events before they happen. A great musician knows how notes will sound before they're played; a great CEO can predict how a business decision will turn out before it's made.

The inferred message is that information and timing are keys to success, and so complex free-flow search plays a vital role in a fast-growing industry.

2. Performance Matters

Given fast-moving technology trends, Web performance really matters. Responsive sites can make the online experience effective, even enjoyable; a slow site can be unusable. To satisfy customers, a Web site must fulfill four distinct needs:

  • Availability: A site that's unreachable, for any reason, is useless.
  • Responsiveness: Having reached the site, pages that download slowly are likely to drive customers to try an alternate site.
  • Clarity: If the site is sufficiently responsive to keep the customer's attention, other design qualities come into play. It must be simple and natural to use – easy to learn, predictable, and consistent.
  • Utility: Last comes utility – does the site actually deliver the information or service the customer was looking for in the first place?

These four dimensions are not alternative functional goals, to be weighed against one another and prioritized.

[Image: the four needs of Web site performance]

3. What is Lucene?

Apache Lucene is a high-performance, full-featured text search engine library, written in Java and distributed as open source.

4. Architecture of Lucene

[Image: Lucene architecture]

In a nutshell, Lucene provides:
  • a simple document model
      • a document is a sequence of fields
      • fields are name/value pairs
      • values can be strings or Readers
      • clients must strip markup first
      • field names can be repeated
      • string-valued fields can be stored in the index
  • a plug-in analysis model
      • a grammar-based tokenizer is included
      • tokenizers are piped through filters: stop-list, lowercase, stemming, etc.
  • a storage API: named, random-access blobs
  • an index library
  • a search library
  • query parsers

5. Core Lucene APIs

Lucene has four core libraries (APIs), listed below:

  • org.apache.lucene.document
  • org.apache.lucene.analysis
  • org.apache.lucene.index
  • org.apache.lucene.search

document

  • A Document is a sequence of Fields.
  • A Field is a (name, value) pair: name is the name of the field, e.g., title, body, subject, date, etc.; value is the text.
  • Field values may be stored, indexed, or analyzed (and, now, vectored).
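
The document model above can be sketched in plain Java. This is a toy illustration of the idea, not the Lucene API; the class and method names here are my own:

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of Lucene's document model: a Document is an ordered
// sequence of name/value Fields, and field names may repeat.
class Field {
    final String name;
    final String value;
    Field(String name, String value) { this.name = name; this.value = value; }
}

class Document {
    private final List<Field> fields = new ArrayList<>();
    void add(Field f) { fields.add(f); }

    // Return every value stored under a given field name.
    List<String> getValues(String name) {
        List<String> values = new ArrayList<>();
        for (Field f : fields)
            if (f.name.equals(name)) values.add(f.value);
        return values;
    }
}

public class DocumentModelDemo {
    public static void main(String[] args) {
        Document doc = new Document();
        doc.add(new Field("title", "Lucene Search Programming"));
        doc.add(new Field("author", "A"));
        doc.add(new Field("author", "B")); // repeated field name is legal
        System.out.println(doc.getValues("author")); // [A, B]
    }
}
```

The point of the sketch is the "repeated field name" bullet: both author values survive and can be retrieved together.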

analysis

  • An Analyzer is a TokenStream factory.
  • A TokenStream is an iterator over Tokens; its input is a character iterator (a Reader).
  • A Token is a tuple of: text (e.g., “pisa”); type (e.g., “word”, “sent”, “para”); start and length offsets, in characters (e.g., <5,4>); and positionIncrement (normally 1).
  • Standard TokenStream implementations are Tokenizers, which divide characters into tokens, and TokenFilters, e.g., stop lists, stemmers, etc.
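
The tokenizer-plus-filters pipeline can be sketched in a few lines of plain Java. This is a simplified stand-in for Lucene's TokenStream chain (whitespace tokenizer, lowercase filter, stop-list filter); the names and stop list are my own, not the real API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AnalysisDemo {
    // Toy stop list; StandardAnalyzer's real English list is longer.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("the", "a", "of"));

    // Whitespace tokenizer piped through a lowercase filter and a
    // stop filter, mirroring Lucene's tokenizer -> filters chain.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("\\s+")) {          // tokenize
            String t = raw.toLowerCase();                // lowercase filter
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) // stop filter
                tokens.add(t);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Leaning Tower of Pisa"));
        // [leaning, tower, pisa]
    }
}
```

Each stage consumes the previous stage's token stream, which is why filters compose so cheaply in Lucene.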

index

  • batch-based: uses file-sorting algorithms (textbook); fastest to build, fastest to search, but slow to update
  • b-tree based: updates in place (http://lucene.sf.net/papers/sigir90.ps); fast to search, but update/build does not scale and the implementation is complex
  • segment based: lots of small indexes (Verity); fast to build, fast to update, but slower to search – this is the approach Lucene takes

[Image: Lucene indexing diagram]

The indexing diagram above reflects the following parameters:

  • M = 3
  • 11 documents indexed
  • stack has four indexes
  • grayed indexes have been deleted
  • 5 merges have occurred
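
The segment-merging idea behind that diagram can be simulated in plain Java. This is a toy under my own assumptions (each new document becomes a size-1 segment, and M equal-sized segments on top of the stack merge into one); the exact counts depend on when merges trigger, so it will not necessarily reproduce the diagram's numbers:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class SegmentMergeDemo {
    // Simulate segment-based indexing with merge factor M.
    // Returns {number of segments on the stack, number of merges}.
    static int[] simulate(int nDocs, int M) {
        Deque<Integer> stack = new ArrayDeque<>();
        int merges = 0;
        for (int d = 0; d < nDocs; d++) {
            stack.push(1); // each new document is a size-1 segment
            // Merge whenever M equal-sized segments sit on top;
            // merges can cascade (three 3s become a 9, and so on).
            while (topEqual(stack) >= M) {
                int size = stack.peek();
                for (int i = 0; i < M; i++) stack.pop();
                stack.push(size * M);
                merges++;
            }
        }
        return new int[] { stack.size(), merges };
    }

    // Count how many segments of equal size sit on top of the stack.
    private static int topEqual(Deque<Integer> stack) {
        int top = stack.peek(), n = 0;
        for (int s : stack) { if (s == top) n++; else break; }
        return n;
    }

    public static void main(String[] args) {
        int[] r = simulate(11, 3);
        System.out.println("segments: " + r[0] + ", merges: " + r[1]);
    }
}
```

With 11 documents and M = 3, this toy ends up with segments of sizes 9, 1, 1: most documents live in a few large merged segments, which is what keeps segment-based search tolerable.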

search

Primitive queries in Lucene search are listed as:

  • TermQuery: match docs containing a Term
  • PhraseQuery: match docs w/ sequence of Terms
  • BooleanQuery: match docs matching other queries. e.g., +path:pisa +content:“Doug Cutting” -path:nutch
  • newer: SpanQuery and its subclasses
  • derived queries: PrefixQuery, WildcardQuery, etc.
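
The semantics of TermQuery and BooleanQuery (the +required / -prohibited clauses in the example above) can be illustrated with a plain-Java evaluation over a toy inverted index. This is a sketch of the behavior, not Lucene's Query classes; all names and data here are made up:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class QueryDemo {
    // Toy inverted index: term -> set of matching document ids.
    static final Map<String, Set<Integer>> index = new HashMap<>();

    static void post(String term, Integer... docs) {
        index.put(term, new HashSet<>(Arrays.asList(docs)));
    }

    // TermQuery semantics: the docs containing a single term.
    static Set<Integer> term(String t) {
        return new HashSet<>(index.getOrDefault(t, new HashSet<>()));
    }

    // BooleanQuery semantics: intersect the +required terms, then
    // subtract the -prohibited terms (must be at least one required).
    static Set<Integer> bool(List<String> must, List<String> mustNot) {
        Set<Integer> result = null;
        for (String t : must) {
            Set<Integer> docs = term(t);
            if (result == null) result = docs;
            else result.retainAll(docs);
        }
        for (String t : mustNot) result.removeAll(term(t));
        return result;
    }

    public static void main(String[] args) {
        post("pisa", 1, 2, 3);
        post("cutting", 2, 3);
        post("nutch", 3);
        // +pisa +cutting -nutch matches only doc 2
        System.out.println(bool(Arrays.asList("pisa", "cutting"),
                                Arrays.asList("nutch")));
    }
}
```

PhraseQuery would additionally need term positions, and the derived queries (PrefixQuery, WildcardQuery) expand to many TermQuerys at search time.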

6. Methodology to Implement

[Image: Lucene implementation methodology]

7. Application Development

Initially, I wrote a simple console application which has two methods, namely:

  1. Indexer
  2. QuerySearcher

Application layers are drawn as:

[Image: application layers]

Indexer Component

Our Indexer module takes as input a flat file with 4 columns. The file contents are read and stored in Lucene 'document' records. A document is a single entity that is put into the index, and it contains many fields, which, like columns in a database, are the individual pieces of information that make up the document. Different fields can be indexed using different algorithms and analyzers.

The IndexWriter component is responsible for managing the indexes: it creates indexes, adds documents to them, and optimizes them. The analyzer contains the policy for extracting index terms from the text; StandardAnalyzer tokenizes the text based on European-language grammars, lowercases everything, and removes English stop words. The generated documents are then indexed and written, in Lucene's optimized proprietary format, to the directory named by the LUCENE_INDEX_DIRECTORY constant. Code snippet:

public void Indexer() throws Exception {
    
    IndexWriter writer = null;
    StandardAnalyzer analyzer = null;        
    File file = null;
    String strLine = "";
    
    File readFile = new File(
        "C:\\Dev\\POC\\Lucene\\ConsoleApp\\datafeed\\employee.csv");
    
    System.out.println("Start indexing");
    
    //get a reference to index directory file
    file = new File(LUCENE_INDEX_DIRECTORY);
                
    analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
    writer = new IndexWriter(FSDirectory.open(file), analyzer,
        true, IndexWriter.MaxFieldLength.LIMITED);        
    writer.setUseCompoundFile(false);    
    
    try {
        //iterate through result set
        FileInputStream fstream = new FileInputStream(readFile);
        DataInputStream in = new DataInputStream(fstream);
        BufferedReader br = new BufferedReader(new InputStreamReader(in));
            
        Field empIdField = null;    
        Field joinDateField = null;    
        Field empNameField = null;
        Field empTeamField = null;
        
        //create document object
        Document document = null;
        String[] array = null;
            
        //read comma separated file line by line            
        while( (strLine = br.readLine()) != null){                
            //break comma separated line using ","                
            document = new Document();
            array = strLine.split(",");
            
            //display csv values    
            String str0 = (array[0] !=null) ? array[0] :"";
            empIdField = new Field("empIdField",str0, 
            Field.Store.YES,Field.Index.NOT_ANALYZED);
            document.add(empIdField);
            
            String str1 = (array[1] !=null) ? array[1] :"";
            joinDateField = new Field("joinDateField",str1, 
            Field.Store.YES,Field.Index.NOT_ANALYZED);
            document.add(joinDateField);
            
            String str2 = (array[2] !=null) ? array[2] :"";
            empNameField = new Field("empNameField",str2, 
            Field.Store.YES,Field.Index.ANALYZED);
            document.add(empNameField);
            
            String str3 = (array[3] !=null) ? array[3] :"";
            empTeamField = new Field("empTeamField",str3, 
            Field.Store.YES,Field.Index.NOT_ANALYZED);
            document.add(empTeamField);
            
            writer.addDocument(document);                    
        }            
        //Close the input stream
        in.close();
    
        //optimize the index
        System.out.println("Optimizing index");
        writer.optimize();
        writer.close();        
    }
    catch (Exception e){//Catch exception if any
        e.printStackTrace();
        System.err.println("Error: " + e.getMessage());
    }
}

QuerySearcher Component

To serve the user's input query (held in the userQuery variable), let's create an IndexReader pointing at the LUCENE_INDEX_DIRECTORY constant. IndexReader scans the index files and returns results based on the supplied query.

A StandardAnalyzer object is created for use by the query parser. The query parser is responsible for parsing a string of text into a Query object; it evaluates Lucene query syntax and uses an analyzer (which should be the same one used to index the text) to tokenize the individual terms. On executing the searcher with the user query, the result set is stored in the 'hits' array, which is then iterated to fetch the expected result info. Code snippet:

public static void QuerySearcher() throws Exception {
    IndexReader reader = null;
    StandardAnalyzer analyzer = null;
    IndexSearcher searcher = null;
    TopScoreDocCollector collector = null;
    QueryParser parser = null;
    Query query = null;
    ScoreDoc[] hits = null;
    
    try{
        //store the parameter value in query variable
        //String userQuery = request.getParameter("query");
        String userQuery = "EmployeeName1000109";
        
        IndexBuilder builder = new IndexBuilder();
        builder.Indexer();
        
        //create standard analyzer object
        analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
        
        //create File object of our index directory
        File file = new File(LUCENE_INDEX_DIRECTORY);
        
        //create index reader object
        reader = IndexReader.open(FSDirectory.open(file),true);
        
        //create index searcher object
        searcher = new IndexSearcher(reader);
        
        //create topscore document collector
        collector = TopScoreDocCollector.create(1000, false);
        
        //create query parser object
        parser = new QueryParser(Version.LUCENE_CURRENT, "empNameField", analyzer);
        
        //parse the query and get reference to Query object
        query = parser.parse(userQuery);
        
        //search the query
        searcher.search(query, collector);
        
        hits = collector.topDocs().scoreDocs;
        
        //check whether the search returns any result
        //check whether the search returns any result
        if (hits.length > 0) {
            //iterate through the collection and display results
            for (int i = 0; i < hits.length; i++) {
                int scoreId = hits[i].doc;
                Document doc = searcher.doc(scoreId);
                System.out.println((i + 1) + ". " + doc.get("empNameField"));
            }
        }
    }
    catch (Exception e) {
        e.printStackTrace();
        System.err.println("Error: " + e.getMessage());
    }
    finally {
        if (searcher != null) searcher.close();
        if (reader != null) reader.close();
    }
}

8. Application Architecture

As a .NET developer, my goal is to perform the free-text search from a .NET client, so the application architecture below was developed.

[Image: application architecture]

The same Java console application is upgraded to a service-based system by exposing the functionality as a Java Web Service using Apache Axis. This newly developed Java web method can easily be consumed by adding a service reference in the .NET client application, since it follows the standard WSDL pattern. On submitting the user query from a .NET UI control, the code-behind calls the referenced Java web service, which in turn executes the Lucene search. The result set is returned in the response object and shown in the .NET UI page.

9. Points of Interest

Lucene.NET is a port of the Lucene search engine library, written in C# and targeted at .NET runtime users. The Lucene search library is based on an inverted index. Apache Lucene.Net is an effort undergoing incubation at the Apache Software Foundation (ASF), sponsored by the Apache Incubator. Still, this article demonstrates interoperability between a .NET front end and a Java back end.
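
As a tiny illustration of the inverted-index idea mentioned above, here is a plain-Java toy (my own sketch, not Lucene or Lucene.NET code) that maps each term to the sorted set of documents containing it:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

public class InvertedIndexDemo {
    // Build term -> sorted set of doc ids from an array of documents.
    // Real Lucene also stores frequencies and positions per posting.
    static Map<String, TreeSet<Integer>> build(String[] docs) {
        Map<String, TreeSet<Integer>> index = new HashMap<>();
        for (int id = 0; id < docs.length; id++)
            for (String term : docs[id].toLowerCase().split("\\s+"))
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
        return index;
    }

    public static void main(String[] args) {
        String[] docs = {
            "Lucene search programming",
            "Search engines use an inverted index",
        };
        Map<String, TreeSet<Integer>> idx = build(docs);
        System.out.println(idx.get("search")); // in both docs: [0, 1]
    }
}
```

Looking up a term is a hash probe instead of a scan over every document, which is what makes full-text search fast regardless of whether the caller is Java or a .NET port.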

History

  • Version 1.0 - Initial version

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
