DotLucene Indexer

Abi Bellamkonda

22 Feb 2006 (CPOL)

DotLucene Indexer is a handy tool that automatically generates full-text indexes from an XML configuration file.

Download demo project - 162.59 KB

Introduction

DotLucene has been getting quite a bit of attention recently. It is a full-text search library that lets you index text and search it later. There are typically two parts to this approach: a process that indexes the documents, and a search that queries the index and retrieves the results. More information about this can be found here and here.

Scenario / Concept

After using DotLucene for a while, you will notice that you tend to write very similar code for every indexing job. In our case, we typically had to build indexes from a simple SQL SELECT statement and add some of its columns to the index as fields. This article discusses automating that indexing process to reduce development effort.

Approach

We will build an XML file that stores all of our configuration. We need to support multiple indexes; each index has a name and is built into its own target index folder.

XML

<indexConfiguration>
  <!-- Multiple indexes -->
  <index name="TaskA"
         indexFolderUrl="C:\Documents and Settings\abi\Desktop\LuceneIndexer\LuceneIndexer\TestIndex">

  </index>
</indexConfiguration>

Each index has a set of fields. Each field has a name and three flags that control whether it is stored, indexed and tokenised.

XML

<fields>
	<field name="ID" isStored="true" isIndexed="true" isTokenised="false" />
	<field name="Title" isStored="true" isIndexed="true" isTokenised="false" />
	<field name="ContentText" isStored="true" 
			isIndexed="true" isTokenised="true" />
</fields>
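To make the three flags concrete, here is a minimal sketch of reading the field definitions into typed objects using only System.Xml. The FieldSpec and FieldSpecReader names are hypothetical helpers introduced for illustration; they are not part of the demo project, which reads the attributes inline while indexing.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper type (not in the article's download): one parsed <field> element.
public sealed class FieldSpec
{
    public readonly string Name;
    public readonly bool IsStored;
    public readonly bool IsIndexed;
    public readonly bool IsTokenised;

    public FieldSpec(string name, bool isStored, bool isIndexed, bool isTokenised)
    {
        Name = name;
        IsStored = isStored;
        IsIndexed = isIndexed;
        IsTokenised = isTokenised;
    }
}

public static class FieldSpecReader
{
    // Reads every <field> child of the given <fields> element.
    public static List<FieldSpec> Read(XmlNode fieldsNode)
    {
        List<FieldSpec> specs = new List<FieldSpec>();
        foreach (XmlNode node in fieldsNode.SelectNodes("field"))
        {
            specs.Add(new FieldSpec(
                node.Attributes["name"].Value,
                bool.Parse(node.Attributes["isStored"].Value),
                bool.Parse(node.Attributes["isIndexed"].Value),
                bool.Parse(node.Attributes["isTokenised"].Value)));
        }
        return specs;
    }

    public static void Main()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(
            "<fields>" +
            "<field name='ID' isStored='true' isIndexed='true' isTokenised='false' />" +
            "<field name='ContentText' isStored='true' isIndexed='true' isTokenised='true' />" +
            "</fields>");

        foreach (FieldSpec spec in Read(doc.DocumentElement))
        {
            Console.WriteLine("{0}: stored={1}, indexed={2}, tokenised={3}",
                spec.Name, spec.IsStored, spec.IsIndexed, spec.IsTokenised);
        }
    }
}
```

An identifier such as ID is left untokenised here so that it can be matched as a single exact term, while ContentText is tokenised so that individual words become searchable.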

The data can come from anywhere (SQL Server, Oracle, the file system). For our scenario, let's say we get the data from SQL Server (this is the source I developed code for; you can extend it if you need more). SQL Server needs a SELECT statement and a connection string, which translates to the XML below:

XML

<sqlClient connectionString="Data Source=(local);Initial Catalog=Junk;Trusted_Connection=True">SELECT * FROM Content</sqlClient>

Putting it all together, this is what we get.

XML

<?xml version="1.0" encoding="utf-8"?>
<indexConfiguration>
  <index name="TaskA"
         indexFolderUrl="C:\Documents and Settings\abi\Desktop\LuceneIndexer\LuceneIndexer\TestIndex">
    <!-- Your connection string and SELECT statement go here -->
    <sqlClient connectionString="Data Source=(local);Initial Catalog=Junk;Trusted_Connection=True">SELECT * FROM Content</sqlClient>
    <fields>
      <field name="ID" isStored="true" isIndexed="true" isTokenised="false" />
      <field name="Title" isStored="true" isIndexed="true" isTokenised="false" />
      <field name="ContentText" isStored="true" isIndexed="true" isTokenised="true" />
    </fields>
  </index>
  <index name="TaskB" indexFolderUrl="\\MyPC\TestIndex">
    <!-- Your connection string and SELECT statement go here -->
    <sqlClient connectionString="Data Source=(local);Initial Catalog=Junk;Trusted_Connection=True">SELECT * FROM Content</sqlClient>
    <fields>
      <field name="ID" isStored="true" isIndexed="true" isTokenised="false" />
      <field name="Title" isStored="true" isIndexed="true" isTokenised="false" />
      <field name="ContentText" isStored="true" isIndexed="true" isTokenised="true" />
    </fields>
  </index>
</indexConfiguration>
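Because the indexing code reads attributes such as indexFolderUrl directly, a missing attribute would surface only as a NullReferenceException at run time. The following is a minimal sketch of checking the configuration up front; ConfigChecker is a hypothetical helper that is not part of the demo project, and the inline XML and its C:\IndexA path are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper (not in the article's download): reports missing configuration parts.
public static class ConfigChecker
{
    public static List<string> Check(XmlDocument config)
    {
        List<string> problems = new List<string>();
        foreach (XmlNode index in config.SelectNodes("/indexConfiguration/index"))
        {
            XmlAttribute nameAttr = index.Attributes["name"];
            string label = nameAttr != null ? nameAttr.Value : "(unnamed)";

            if (nameAttr == null)
                problems.Add("An <index> element has no name attribute.");
            if (index.Attributes["indexFolderUrl"] == null)
                problems.Add(label + ": missing indexFolderUrl attribute.");
            if (index.SelectSingleNode("sqlClient/@connectionString") == null)
                problems.Add(label + ": missing sqlClient connectionString.");
            if (index.SelectNodes("fields/field").Count == 0)
                problems.Add(label + ": no fields defined.");
        }
        return problems;
    }

    public static void Main()
    {
        XmlDocument config = new XmlDocument();
        // Made-up sample: TaskA is complete, TaskB lacks a folder, a source and fields.
        config.LoadXml(
            "<indexConfiguration>" +
            "<index name='TaskA' indexFolderUrl='C:\\IndexA'>" +
            "<sqlClient connectionString='Data Source=(local)'>SELECT * FROM Content</sqlClient>" +
            "<fields><field name='ID' isStored='true' isIndexed='true' isTokenised='false' /></fields>" +
            "</index>" +
            "<index name='TaskB' />" +
            "</indexConfiguration>");

        foreach (string problem in Check(config))
        {
            Console.WriteLine(problem);
        }
    }
}
```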

Because we want this to be extensible in the future, we define an abstract class called Indexer, and SqlClientIndexer inherits from it. This is the Indexer class:

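using System.Xml;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;

public abstract class Indexer
{
    // The <index> element that configures this indexer
    protected XmlNode XmlNode;

    public Indexer(XmlNode xmlNode)
    {
        this.XmlNode = xmlNode;
    }

    public void Generate()
    {
        // Create the index from scratch (create = true replaces any existing index)
        string indexFolderUrl = XmlNode.Attributes["indexFolderUrl"].Value;
        IndexWriter writer = new IndexWriter(indexFolderUrl, new StandardAnalyzer(), true);
        try
        {
            IndexRecords(writer);
            writer.Optimize();
        }
        finally
        {
            // Always close the writer, even if indexing fails part-way
            writer.Close();
        }
    }

    // Implemented by each data source to add its documents to the index
    protected abstract void IndexRecords(IndexWriter writer);
}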

This is the code for SqlClientIndexer:

using System.Data;
using System.Data.SqlClient;
using System.Xml;
using Lucene.Net.Documents;
using Lucene.Net.Index;

public class SqlClientIndexer : Indexer
{
    public SqlClientIndexer(XmlNode xmlNode) : base(xmlNode)
    {
    }

    protected override void IndexRecords(IndexWriter writer)
    {
        DataTable dt = GetData();
        XmlNodeList fields = this.XmlNode.SelectNodes("fields/field");

        // Index all records: one Lucene document per row
        foreach (DataRow row in dt.Rows)
        {
            Document doc = new Document();
            foreach (XmlNode field in fields)
            {
                // The field name must match a column returned by the SELECT statement
                string name = field.Attributes["name"].Value;
                doc.Add(new Field(name, row[name].ToString(),
                    bool.Parse(field.Attributes["isStored"].Value),
                    bool.Parse(field.Attributes["isIndexed"].Value),
                    bool.Parse(field.Attributes["isTokenised"].Value)));
            }
            writer.AddDocument(doc);
        }
    }

    private DataTable GetData()
    {
        // Get the data using the configured SQL statement and connection string
        string selectCommandText = XmlNode.SelectSingleNode("sqlClient").InnerText;
        string connectionString = XmlNode.SelectSingleNode("sqlClient/@connectionString").Value;

        using (SqlConnection connection = new SqlConnection(connectionString))
        using (SqlDataAdapter da = new SqlDataAdapter(selectCommandText, connection))
        {
            DataTable dt = new DataTable();
            da.Fill(dt); // Fill opens and closes the connection itself
            return dt;
        }
    }
}

Automate

Now we write a console application that reads our index configuration file and builds the indexes it describes. We also want to be able to rebuild only selected indexes on a given run, so we use the command-line argument to build an XPath predicate when selecting the index nodes.

using System.Xml;

static class Program
{
    /// <summary>
    /// The main entry point for the application.
    /// </summary>
    static void Main(string[] args)
    {
        XmlDocument doc = new XmlDocument();
        doc.Load("Index.config");

        // An optional first argument is used as an XPath predicate, e.g. @name='TaskA'
        string xpath = (args.Length > 0 ? "[" + args[0] + "]" : "");
        XmlNodeList nodes = doc.SelectNodes("/indexConfiguration/index" + xpath);
        foreach (XmlNode node in nodes)
        {
            Indexer ind = new SqlClientIndexer(node);
            ind.Generate();
        }
    }
}

Finally, I want 'TaskA' to be indexed every hour and 'TaskB' every 10 minutes. This can be achieved using Windows Task Scheduler, DTS or any other scheduling tool. On the command line, we pass "@name='TaskA'" for TaskA and "@name='TaskB'" for TaskB (this is simply an XPath condition; the @ prefix is needed because name is an attribute of the index element). We can run a set of tasks with a condition such as "@name='Task1' or @name='Task2'".
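The effect of the command-line condition can be tried out in isolation. This is a small sketch using only System.Xml; IndexSelector is a hypothetical helper, and the inline configuration with its C:\IndexA and C:\IndexB folders is made up for the example. Note that XPath tests an attribute with the @ prefix: a bare name='TaskA' would look for a child element called name and match nothing in this configuration.

```csharp
using System;
using System.Xml;

// Hypothetical helper (not in the article's download) that isolates the node selection.
public static class IndexSelector
{
    // Mirrors the article's approach: an optional condition becomes an XPath predicate.
    public static XmlNodeList Select(XmlDocument config, string condition)
    {
        string predicate = string.IsNullOrEmpty(condition) ? "" : "[" + condition + "]";
        return config.SelectNodes("/indexConfiguration/index" + predicate);
    }

    public static void Main()
    {
        XmlDocument config = new XmlDocument();
        config.LoadXml(
            "<indexConfiguration>" +
            "<index name='TaskA' indexFolderUrl='C:\\IndexA' />" +
            "<index name='TaskB' indexFolderUrl='C:\\IndexB' />" +
            "</indexConfiguration>");

        // The empty condition selects every index; the last one omits '@' on purpose.
        string[] conditions = {
            "@name='TaskA'",
            "@name='TaskA' or @name='TaskB'",
            "",
            "name='TaskA'"
        };
        foreach (string condition in conditions)
        {
            Console.WriteLine("[{0}] selects {1} index node(s)",
                condition, Select(config, condition).Count);
        }
    }
}
```

Because the argument is pasted into the XPath expression unchanged, this approach is only suitable when the command line comes from a trusted source such as a scheduled task you configured yourself.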

Extensions

This can be extended to index folders, Oracle, MySQL and so on.
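One possible way to support several sources is to decide which Indexer subclass to create from the child element found under each index node. The sketch below only shows that detection step, using System.Xml alone. SourceDetector is a hypothetical helper, and the folder and oracleClient element names are assumptions made for illustration; only sqlClient exists in the configuration described above.

```csharp
using System;
using System.Xml;

// Hypothetical helper (not in the article's download).
public static class SourceDetector
{
    // Returns the name of the first recognised data source element under an <index> node.
    public static string Detect(XmlNode indexNode)
    {
        // "sqlClient" comes from the article; "folder" and "oracleClient" are assumed names.
        string[] knownSources = { "sqlClient", "folder", "oracleClient" };
        foreach (string source in knownSources)
        {
            if (indexNode.SelectSingleNode(source) != null)
            {
                return source;
            }
        }
        throw new NotSupportedException("No supported data source element found.");
    }

    public static void Main()
    {
        XmlDocument config = new XmlDocument();
        // Made-up sample using the assumed <folder> element.
        config.LoadXml("<index name='Docs'><folder path='C:\\Docs' /></index>");

        // A console application could switch on this value to construct the matching
        // Indexer subclass, for example "sqlClient" -> new SqlClientIndexer(node).
        Console.WriteLine(Detect(config.DocumentElement));
    }
}
```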

History

22nd February, 2006: Initial post

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Written By

Abi Bellamkonda

Architect

Australia

"Impossible" + "'" + " " = "I'm Possible"

Started programming as a kid in 1986, using BASIC on 286 computers and the Spectrum. A series of languages followed: Pascal, C, C++, Ada, Algol, Prolog, assembly, Java, C#, VB.NET and so on. Over the past 5 years, my interest shifted to architecture, with Rational Suite and UML. I have written some articles, been member of the month on some sites and top poster of the week (actually for several weeks; I only answer), won some books as prizes, and was rated 2nd in ASP.NET and ADO.NET in Australia.

There is simplicity in complexity
