DotLucene Indexer

Feb 23, 2006

CPOL

DotLucene Indexer is a handy tool that can be used to automatically generate indexes for full-text search.

Introduction

DotLucene has been getting quite a bit of attention recently. It is a full-text search library: you index your text once and search it later. The approach typically has two parts. The first is a process that indexes the documents; the second is a search that queries the index and retrieves the results. More information about this can be found here and here.

Scenario / Concept

After using DotLucene for a while, you will notice that you keep writing similar code to do the indexing. In our case, we typically had to build an index from a simple SQL SELECT statement and add some of the returned columns to the index as fields. This article discusses automating that indexing process to reduce development effort.

Approach

We will build an XML file that stores all our configuration. We need to support multiple indexes; each index has a name and is written to its own target index folder.

<indexConfiguration>
	<!-- Multiple indexes -->
  <index name="TaskA" indexFolderUrl=
	"C:\Documents and Settings\abi\Desktop\LuceneIndexer\LuceneIndexer\TestIndex">
    
  </index>
</indexConfiguration>

Each index has a set of fields. Each field has a name and three flags that say whether it should be stored, indexed and tokenised.

<fields>
	<field name="ID" isStored="true" isIndexed="true" isTokenised="false" />
	<field name="Title" isStored="true" isIndexed="true" isTokenised="false" />
	<field name="ContentText" isStored="true" 
			isIndexed="true" isTokenised="true" />
</fields>

The data can come from anywhere (SQL Server, Oracle, the file system). For our scenario, let's say we get the data from SQL Server; that is the only source I have written code for, but you can extend it to others. SQL Server needs a SELECT statement and a connection string, which translates to the XML below:

<sqlClient connectionString="Data Source=(local);
	Initial Catalog=Junk;Trusted_Connection=True">
	SELECT * FROM Content</sqlClient>

Putting it all together, this is what we get.

<?xml version="1.0" encoding="utf-8"?>
<indexConfiguration>
  <index name="TaskA" indexFolderUrl="C:\Documents and Settings\abi\Desktop\LuceneIndexer\LuceneIndexer\TestIndex">
    <!-- Your connection string & Select statement goes here -->
    <sqlClient connectionString="Data Source=(local);Initial Catalog=Junk;Trusted_Connection=True">SELECT * FROM Content</sqlClient>
    <fields>
      <field name="ID" isStored="true" isIndexed="true" isTokenised="false" />
      <field name="Title" isStored="true" isIndexed="true" isTokenised="false" />
      <field name="ContentText" isStored="true" isIndexed="true" isTokenised="true" />
    </fields>
  </index>
  <index name="TaskB" indexFolderUrl="\\MyPC\TestIndex">
    <!-- Your connection string & Select statement goes here -->
    <sqlClient connectionString="Data Source=(local);
	Initial Catalog=Junk;Trusted_Connection=True">SELECT * FROM Content</sqlClient>
    <fields>
      <field name="ID" isStored="true" isIndexed="true" isTokenised="false" />
      <field name="Title" isStored="true" isIndexed="true" isTokenised="false" />
      <field name="ContentText" isStored="true" isIndexed="true" isTokenised="true" />
    </fields>
  </index>
</indexConfiguration>
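
To check a configuration file before running the indexer against it, it can help to list what each index declares. The following is a minimal sketch that assumes only the element and attribute names shown above; the `IndexConfigReader` class and its `Describe` method are hypothetical helpers for illustration and are not part of the article's download. Tokenised fields are marked with an asterisk.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper for illustration; not part of the article's download.
public static class IndexConfigReader
{
    // Returns one summary line per <index> element:
    // "<name> -> <indexFolderUrl> [<field>, <field>*, ...]"
    public static List<string> Describe(string configXml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(configXml);

        List<string> lines = new List<string>();
        foreach (XmlNode index in doc.SelectNodes("/indexConfiguration/index"))
        {
            List<string> fieldNames = new List<string>();
            foreach (XmlNode field in index.SelectNodes("fields/field"))
            {
                string name = field.Attributes["name"].Value;
                // Only isTokenised is read here; the other flags parse the same way.
                bool tokenised = bool.Parse(field.Attributes["isTokenised"].Value);
                fieldNames.Add(tokenised ? name + "*" : name);
            }
            lines.Add(index.Attributes["name"].Value + " -> "
                + index.Attributes["indexFolderUrl"].Value
                + " [" + string.Join(", ", fieldNames) + "]");
        }
        return lines;
    }

    public static void Main()
    {
        // A cut-down configuration in the same shape as the one above.
        string xml = @"<indexConfiguration>
  <index name=""TaskB"" indexFolderUrl=""\\MyPC\TestIndex"">
    <sqlClient connectionString=""Data Source=(local)"">SELECT * FROM Content</sqlClient>
    <fields>
      <field name=""ID"" isStored=""true"" isIndexed=""true"" isTokenised=""false"" />
      <field name=""ContentText"" isStored=""true"" isIndexed=""true"" isTokenised=""true"" />
    </fields>
  </index>
</indexConfiguration>";

        foreach (string line in Describe(xml))
        {
            Console.WriteLine(line); // TaskB -> \\MyPC\TestIndex [ID, ContentText*]
        }
    }
}
```

A missing attribute surfaces here as a NullReferenceException, which is also how the indexer itself would fail, so running a check like this first gives earlier feedback.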

Because we want to be able to extend this in future, we introduce an abstract class called Indexer, and SqlClientIndexer inherits from it. This is the Indexer class:

Sample Image - DotLucene_Indexer.jpg

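using System.Xml;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;

public abstract class Indexer
{
    // The <index> element that configures this indexer.
    protected XmlNode XmlNode;

    protected Indexer(XmlNode xmlNode)
    {
        this.XmlNode = xmlNode;
    }

    public void Generate()
    {
        // Create the index; passing true overwrites any existing index in the folder.
        string indexFolderUrl = XmlNode.Attributes["indexFolderUrl"].Value;
        IndexWriter writer = new IndexWriter(indexFolderUrl, new StandardAnalyzer(), true);
        try
        {
            IndexRecords(writer);
            writer.Optimize();
        }
        finally
        {
            // Close the writer even if indexing fails, so the index lock is released.
            writer.Close();
        }
    }

    // Implemented by each data source to add its documents to the index.
    protected abstract void IndexRecords(IndexWriter writer);
}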

This is the code for SqlClientIndexer:

using System.Data;
using System.Data.SqlClient;
using System.Xml;
using Lucene.Net.Documents;
using Lucene.Net.Index;

public class SqlClientIndexer : Indexer
{
    public SqlClientIndexer(XmlNode xmlNode) : base(xmlNode)
    {
    }

    protected override void IndexRecords(IndexWriter writer)
    {
        DataTable dt = GetData();
        // Index all records: one document per row, one field per configured <field>.
        XmlNodeList fields = this.XmlNode.SelectNodes("fields/field");
        foreach (DataRow row in dt.Rows)
        {
            Document doc = new Document();
            foreach (XmlNode field in fields)
            {
                // The field name must match a column returned by the SELECT statement.
                string name = field.Attributes["name"].Value;
                doc.Add(new Field(name, row[name].ToString(),
                    bool.Parse(field.Attributes["isStored"].Value),
                    bool.Parse(field.Attributes["isIndexed"].Value),
                    bool.Parse(field.Attributes["isTokenised"].Value)));
            }
            writer.AddDocument(doc);
        }
    }

    private DataTable GetData()
    {
        // The SELECT statement is the text of <sqlClient>;
        // the connection string is its connectionString attribute.
        string selectCommandText = XmlNode.SelectSingleNode("sqlClient").InnerText;
        string connectionString =
            XmlNode.SelectSingleNode("sqlClient/@connectionString").Value;
        using (SqlConnection connection = new SqlConnection(connectionString))
        using (SqlDataAdapter da = new SqlDataAdapter(selectCommandText, connection))
        {
            DataTable dt = new DataTable();
            da.Fill(dt); // Fill opens and closes the connection itself.
            return dt;
        }
    }
}

Automate

Now we write a console application that reads our index configuration file and builds the indexes it describes. We also want to be able to rebuild only selected indexes on a given run, so we use the command-line argument to build the XPath predicate that selects the index nodes.

using System;
using System.Xml;

static class Program
{
    /// <summary>
    /// The main entry point for the application.
    /// </summary>
    [STAThread]
    static void Main(string[] args)
    {
        // Index.config is resolved relative to the current working directory.
        XmlDocument doc = new XmlDocument();
        doc.Load("Index.config");

        // An optional first argument is used as an XPath predicate;
        // with no argument, every index in the file is rebuilt.
        string xpath = (args.Length > 0 ? "[" + args[0] + "]" : "");
        XmlNodeList nodes = doc.SelectNodes("/indexConfiguration/*" + xpath);
        for (int i = 0; i < nodes.Count; i++)
        {
            Indexer ind = new SqlClientIndexer(nodes[i]);
            ind.Generate();
        }
    }
}

Finally, I want 'TaskA' to be indexed every hour and 'TaskB' every 10 minutes. This can be achieved using Windows Task Scheduler, DTS or any other scheduling tool. On the command line, we pass "@name='TaskA'" for TaskA and "@name='TaskB'" for TaskB. The argument is simply an XPath condition, and because name is an attribute of the index element, it needs the @ prefix. A set of tasks can be selected with a condition such as "@name='TaskA' or @name='TaskB'".
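
The effect of a given command-line condition can be checked without touching any index. The sketch below repeats the XPath construction from Main against a small in-memory configuration; the `IndexSelector` class and the third index, TaskC, are made up for illustration. It also shows why the @ prefix matters: without it, XPath looks for a child element called name rather than the name attribute, so nothing is selected.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper for illustration; it mirrors the XPath built in Main.
public static class IndexSelector
{
    // Returns the names of the index elements matched by the optional predicate.
    public static List<string> SelectIndexNames(XmlDocument doc, string predicate)
    {
        string xpath = "/indexConfiguration/*"
            + (string.IsNullOrEmpty(predicate) ? "" : "[" + predicate + "]");

        List<string> names = new List<string>();
        foreach (XmlNode node in doc.SelectNodes(xpath))
        {
            names.Add(node.Attributes["name"].Value);
        }
        return names;
    }

    public static void Main()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml("<indexConfiguration>"
            + "<index name='TaskA' /><index name='TaskB' /><index name='TaskC' />"
            + "</indexConfiguration>");

        // Attribute test: selects only TaskA.
        Console.WriteLine(string.Join(",", SelectIndexNames(doc, "@name='TaskA'")));
        // Combined condition: selects TaskA and TaskB.
        Console.WriteLine(string.Join(",",
            SelectIndexNames(doc, "@name='TaskA' or @name='TaskB'")));
        // No predicate: selects all three indexes.
        Console.WriteLine(string.Join(",", SelectIndexNames(doc, "")));
        // Without '@' the test is on a child element <name>, so nothing matches.
        Console.WriteLine(SelectIndexNames(doc, "name='TaskA'").Count); // 0
    }
}
```

When scheduling the task, remember that the whole condition has to reach the program as a single argument, so it normally needs to be wrapped in double quotes on the command line.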

Extensions

This design can be extended to folder indexing, Oracle, MySQL and so on by adding further classes that inherit from Indexer.
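
One thing an extension has to deal with is that Main always creates a SqlClientIndexer. A possible way around that, sketched below under stated assumptions, is to pick the indexer from the child element found under each index node. The `folder` element, its `path` attribute, the `FolderIndexer` name and the `IndexerKindResolver` class are all hypothetical; only `sqlClient` and `SqlClientIndexer` come from the article. To keep the sketch runnable without DotLucene, it returns the class name instead of constructing the indexer.

```csharp
using System;
using System.Xml;

// Hypothetical dispatch helper; only the sqlClient branch exists in the article.
public static class IndexerKindResolver
{
    // Returns the name of the Indexer subclass that would handle this <index> node.
    public static string Resolve(XmlNode indexNode)
    {
        if (indexNode.SelectSingleNode("sqlClient") != null)
        {
            return "SqlClientIndexer";
        }
        if (indexNode.SelectSingleNode("folder") != null)
        {
            return "FolderIndexer"; // assumed future subclass, not implemented here
        }
        throw new NotSupportedException(
            "No known data source under index '"
            + indexNode.Attributes["name"].Value + "'.");
    }

    public static void Main()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml("<indexConfiguration>"
            + "<index name='TaskA'><sqlClient connectionString='...'>SELECT 1</sqlClient></index>"
            + "<index name='Docs'><folder path='C:\\Docs' /></index>"
            + "</indexConfiguration>");

        foreach (XmlNode node in doc.SelectNodes("/indexConfiguration/index"))
        {
            // Prints "TaskA: SqlClientIndexer" then "Docs: FolderIndexer".
            Console.WriteLine(node.Attributes["name"].Value + ": " + Resolve(node));
        }
    }
}
```

In a real extension, each branch would return a new instance of the matching Indexer subclass, and the loop in Main would call Generate on whatever it gets back.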

History

  • 22nd February, 2006: Initial post