DotLucene Indexer

22 Feb 2006 · CPOL · 2 min read
DotLucene Indexer is a handy tool that automatically generates full-text indexes.

Introduction

DotLucene has been getting quite a bit of attention recently. It is a full-text search library: you index text once and search it later. The approach typically has two parts. First, a process indexes the documents; second, a search runs against the index and retrieves the results. More information about this can be found here and here.

Scenario / Concept

After using DotLucene for a while, you will notice that the indexing code tends to look the same from project to project. In our case, we typically had to build each index from a simple SQL SELECT and add some of its columns to the index as fields. This article discusses automating that indexing process so that less code has to be written each time.

Approach

We will build an XML file that stores all our configuration. We need to support multiple indexes; each index has a name and is built into its own target index folder.

XML
<indexConfiguration>
  <!-- Multiple indexes -->
  <index name="TaskA" indexFolderUrl="C:\Documents and Settings\abi\Desktop\LuceneIndexer\LuceneIndexer\TestIndex">
    
  </index>
</indexConfiguration>

Each index has a set of fields. Each field has a name and three flags that say whether it should be stored, indexed and tokenised.

XML
<fields>
  <field name="ID" isStored="true" isIndexed="true" isTokenised="false" />
  <field name="Title" isStored="true" isIndexed="true" isTokenised="false" />
  <field name="ContentText" isStored="true" isIndexed="true" isTokenised="true" />
</fields>
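
As a rough illustration of how these three flags can be read, here is a minimal sketch that parses a `<fields>` fragment into plain objects. In Lucene terms, a stored field can be read back from a search hit, an indexed field can be searched, and a tokenised field is split into terms by the analyzer (so `ID` above is matched as one exact value). `FieldSetting` and `FieldSettingReader` are hypothetical names used only for this example; they are not part of the article's download.

```csharp
using System;
using System.Xml;

// Hypothetical helper type for this sketch: one entry per <field> element.
public sealed class FieldSetting
{
    public string Name;
    public bool IsStored;     // value can be retrieved from a search hit
    public bool IsIndexed;    // value can be searched
    public bool IsTokenised;  // value is split into terms by the analyzer
}

public static class FieldSettingReader
{
    // Parses an XML fragment whose root element is <fields>.
    public static FieldSetting[] Read(string fieldsXml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(fieldsXml);

        XmlNodeList nodes = doc.SelectNodes("/fields/field");
        FieldSetting[] result = new FieldSetting[nodes.Count];
        for (int i = 0; i < nodes.Count; i++)
        {
            XmlAttributeCollection attrs = nodes[i].Attributes;
            FieldSetting setting = new FieldSetting();
            setting.Name = attrs["name"].Value;
            // bool.Parse accepts "true"/"false" in any letter case.
            setting.IsStored = bool.Parse(attrs["isStored"].Value);
            setting.IsIndexed = bool.Parse(attrs["isIndexed"].Value);
            setting.IsTokenised = bool.Parse(attrs["isTokenised"].Value);
            result[i] = setting;
        }
        return result;
    }
}
```

A missing attribute makes this sketch fail with a `NullReferenceException`, so a real tool would probably want to check for that and report which field is incomplete.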

The data can come from anywhere (SQL Server, Oracle, the file system). For our scenario, let's say it comes from SQL Server. That is the only source I have written code for; you can extend the tool if you need others. SQL Server needs a SELECT statement and a connection string, which translates to the XML below:

XML
<sqlClient connectionString="Data Source=(local);Initial Catalog=Junk;Trusted_Connection=True">SELECT * FROM Content</sqlClient>

Putting it all together, this is what we get.

XML
<?xml version="1.0" encoding="utf-8"?>
<indexConfiguration>
  <index name="TaskA" indexFolderUrl="C:\Documents and Settings\abi\Desktop\LuceneIndexer\LuceneIndexer\TestIndex">
    <!-- Your connection string & Select statement goes here -->
    <sqlClient connectionString="Data Source=(local);Initial Catalog=Junk;Trusted_Connection=True">SELECT * FROM Content</sqlClient>
    <fields>
      <field name="ID" isStored="true" isIndexed="true" isTokenised="false" />
      <field name="Title" isStored="true" isIndexed="true" isTokenised="false" />
      <field name="ContentText" isStored="true" isIndexed="true" isTokenised="true" />
    </fields>
  </index>
  <index name="TaskB" indexFolderUrl="\\MyPC\TestIndex">
    <!-- Your connection string & Select statement goes here -->
    <sqlClient connectionString="Data Source=(local);Initial Catalog=Junk;Trusted_Connection=True">SELECT * FROM Content</sqlClient>
    <fields>
      <field name="ID" isStored="true" isIndexed="true" isTokenised="false" />
      <field name="Title" isStored="true" isIndexed="true" isTokenised="false" />
      <field name="ContentText" isStored="true" isIndexed="true" isTokenised="true" />
    </fields>
  </index>
</indexConfiguration>
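
Because the indexer reads every attribute directly, a typo in this file only shows up as an exception halfway through a run. The following is a minimal sketch of a pre-flight check, assuming the element and attribute names shown above. `IndexConfigValidator` is a hypothetical helper, not part of the original tool.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical pre-flight check for the configuration file.
public static class IndexConfigValidator
{
    // Returns human-readable problems; an empty list means every <index>
    // has the pieces the indexer reads.
    public static List<string> Validate(XmlDocument doc)
    {
        List<string> problems = new List<string>();
        XmlNodeList indexes = doc.SelectNodes("/indexConfiguration/index");
        if (indexes.Count == 0)
        {
            problems.Add("No <index> elements found.");
        }

        foreach (XmlNode index in indexes)
        {
            XmlAttribute nameAttr = index.Attributes["name"];
            string label = nameAttr != null ? nameAttr.Value : "(unnamed)";

            if (nameAttr == null)
                problems.Add(label + ": missing name attribute.");
            if (index.Attributes["indexFolderUrl"] == null)
                problems.Add(label + ": missing indexFolderUrl attribute.");

            XmlNode sql = index.SelectSingleNode("sqlClient");
            if (sql == null)
            {
                problems.Add(label + ": missing <sqlClient> element.");
            }
            else
            {
                if (sql.Attributes["connectionString"] == null)
                    problems.Add(label + ": <sqlClient> has no connectionString.");
                if (sql.InnerText.Trim().Length == 0)
                    problems.Add(label + ": <sqlClient> has no SELECT statement.");
            }

            if (index.SelectNodes("fields/field").Count == 0)
                problems.Add(label + ": no <field> elements.");
        }
        return problems;
    }
}
```

One way to use it would be to call `Validate` right after loading the file and stop with the list of problems before any index folder is touched.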

Because we want the tool to be extensible in future, we define an abstract class called Indexer; SqlClientIndexer inherits from it. This is the Indexer class:

Sample Image - DotLucene_Indexer.jpg

C#
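using System.Xml;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;

public abstract class Indexer
{
    // The <index> element from the configuration file
    protected XmlNode XmlNode;

    public Indexer(XmlNode xmlNode)
    {
        this.XmlNode = xmlNode;
    }

    public void Generate()
    {
        // Create the index; passing true rebuilds it from scratch
        string indexFolderUrl = XmlNode.Attributes["indexFolderUrl"].Value;
        IndexWriter writer = new IndexWriter(indexFolderUrl, new StandardAnalyzer(), true);
        try
        {
            IndexRecords(writer);
            writer.Optimize();
        }
        finally
        {
            // Close the writer even if indexing fails, so the index is not left locked
            writer.Close();
        }
    }

    // Subclasses add one Document per source record
    protected abstract void IndexRecords(IndexWriter writer);
}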

This is the code for SqlClientIndexer:

C#
using System.Data;
using System.Data.SqlClient;
using System.Xml;
using Lucene.Net.Documents;
using Lucene.Net.Index;

public class SqlClientIndexer : Indexer
{
    public SqlClientIndexer(XmlNode xmlNode) : base(xmlNode)
    {
    }

    protected override void IndexRecords(IndexWriter writer)
    {
        DataTable dt = GetData;
        // Index all records: one Document per row, one Field per configured column
        XmlNodeList fields = this.XmlNode.SelectNodes("fields/field");
        for (int i = 0; i < dt.Rows.Count; i++)
        {
            Document doc = new Document();
            for (int j = 0; j < fields.Count; j++)
            {
                XmlAttributeCollection attrs = fields[j].Attributes;
                // The field name must match a column returned by the SELECT
                string name = attrs["name"].Value;
                doc.Add(new Field(name, dt.Rows[i][name].ToString(),
                    bool.Parse(attrs["isStored"].Value),
                    bool.Parse(attrs["isIndexed"].Value),
                    bool.Parse(attrs["isTokenised"].Value)));
            }
            writer.AddDocument(doc);
        }
    }

    private DataTable GetData
    {
        get
        {
            // Get data using SQL
            string selectCommandText = XmlNode.SelectSingleNode("sqlClient").InnerText;
            string connectionString =
                XmlNode.SelectSingleNode("sqlClient/@connectionString").Value;
            DataTable dt = new DataTable();
            // Dispose the connection and adapter once the table is filled
            using (SqlConnection connection = new SqlConnection(connectionString))
            using (SqlDataAdapter da = new SqlDataAdapter(selectCommandText, connection))
            {
                da.Fill(dt);
            }
            return dt;
        }
    }
}

Automate

Now we write a console application that reads our index configuration file and builds the indexes it describes. We also want to be able to rebuild only selected indexes on a given run, so the first command line argument is used as an XPath condition when selecting the nodes.

C#
using System;
using System.Xml;

static class Program
{
    /// <summary>
    /// The main entry point for the application.
    /// </summary>
    [STAThread]
    static void Main(string[] args)
    {
        XmlDocument doc = new XmlDocument();
        doc.Load("Index.config");
        // An optional first argument is used as an XPath condition, e.g. @name='TaskA'
        string xpath = (args.Length > 0 ? "[" + args[0] + "]" : "");
        XmlNodeList nodes = doc.SelectNodes("/indexConfiguration/*" + xpath);
        for (int i = 0; i < nodes.Count; i++)
        {
            Indexer ind = new SqlClientIndexer(nodes[i]);
            ind.Generate();
        }
    }
}

Finally, I want 'TaskA' to be indexed every hour and 'TaskB' once every 10 minutes. This can be achieved using the Windows Task Scheduler, DTS or any other scheduling tool. On the command line, we pass "@name='TaskA'" for the first job and "@name='TaskB'" for the second. The argument is simply an XPath condition, and because name is an attribute in the configuration it needs the @ prefix. We can run a set of tasks in one go with a condition like "@name='Task1' or @name='Task2'".
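
To make the selection step concrete, here is a small sketch of how the command line argument ends up inside the XPath expression. It uses only `System.Xml`; `IndexSelector` is a hypothetical name for this example. Note that an attribute test in XPath is written with the `@` prefix, so a condition written as `name='TaskA'` would look for a child element called `name` and match nothing in this configuration.

```csharp
using System;
using System.Xml;

// Hypothetical helper that mirrors the node selection done in Main.
public static class IndexSelector
{
    // args[0], when present, is treated as an XPath condition such as
    // "@name='TaskA'" and wrapped in [ ] to form a predicate.
    public static XmlNodeList Select(XmlDocument doc, string[] args)
    {
        string predicate = args.Length > 0 ? "[" + args[0] + "]" : "";
        return doc.SelectNodes("/indexConfiguration/index" + predicate);
    }
}
```

With a configuration containing TaskA, TaskB and TaskC, no argument selects all three indexes, `@name='TaskA'` selects one, and `@name='TaskA' or @name='TaskB'` selects two. Since the argument is pasted into the expression as-is, this only suits a tool that is run by trusted schedulers or administrators.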

Extensions

This can be extended to index folders, Oracle, MySQL and so on.
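
One possible way to wire in further sources is to decide which Indexer subclass to create from the child element found under each `<index>`. The sketch below only shows the detection step, so it runs without Lucene. Of the element names listed, only `sqlClient` exists in the article; `folder` and `oracleClient` are hypothetical placeholders, as is the `SourceDetector` class itself.

```csharp
using System;
using System.Xml;

// Hypothetical dispatcher: finds out which data source an <index> declares.
public static class SourceDetector
{
    // Only "sqlClient" is defined by the article; the others are placeholders
    // for sources a future Indexer subclass might handle.
    private static readonly string[] KnownSources =
        { "sqlClient", "folder", "oracleClient" };

    // Returns the name of the first known source element under the node.
    public static string Detect(XmlNode indexNode)
    {
        foreach (string source in KnownSources)
        {
            if (indexNode.SelectSingleNode(source) != null)
            {
                return source;
            }
        }

        XmlAttribute nameAttr = indexNode.Attributes["name"];
        string label = nameAttr != null ? nameAttr.Value : "(unnamed)";
        throw new NotSupportedException(
            "Index '" + label + "' does not declare a known data source.");
    }
}
```

Main could then switch on the returned name and construct `SqlClientIndexer` or another subclass, instead of always creating a `SqlClientIndexer`.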

History

  • 22nd February, 2006: Initial post

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Architect
Australia
"Impossible" + "'" + " " = "I'm Possible"

Started programming as a kid in 1986, using BASIC on 286 computers and the Spectrum. A series of languages followed: Pascal, C, C++, Ada, Algol, Prolog, assembly, Java, C#, VB.NET and so on. Over the past 5 years my interest shifted to architecture, with Rational Suite and UML. Wrote some articles, was member of the month on some sites and top poster of the week (actually weeks; I only answer), won some books as prizes, and was rated 2nd in ASP.NET and ADO.NET in Australia.

There is simplicity in complexity

Comments and Discussions

 
News: SOLR extension
Abi Bellamkonda, 13-Mar-13 16:13
General: Record update
buatt, 29-Mar-07 20:26
Answer: Re: Record update
Abi Bellamkonda, 29-Mar-07 21:14
There are a few ways.
I think the simplest is to delete all the documents and add them all again (say you have fewer than 10,000 documents, as that is fast).
If you have more, you need to delete only the documents that have been updated and add them again. Having a Last Update DateTime field on the database record is enough: an SQL query can pull all records that have changed in, say, the past 24 hours.

Abi ( Abishek Bellamkonda )
My Blog: http://abibaby.blogspot.com
=(:*

General: Very nice
Bo B, 28-Mar-06 21:53
Answer: Re: Very nice
Abi Bellamkonda, 29-Mar-06 21:19
