Introduction
DotLucene has been getting quite a bit of attention recently. It is a full-text search library that can be used to index text and search it later. Typically there are two parts to this approach: first, a process indexes the documents; second, a search is run against the index to retrieve the results. More information about this can be found here and here.
Scenario / Concept
After using DotLucene for a while, you will notice that you tend to write very similar code for each indexing job. Typically, we had a scenario where we had to build our indexes from a simple SQL SELECT and add some fields to the index. This article discusses automating the indexing process, so that we can reduce development effort.
Approach
We will build an XML file that stores all our configuration. We need to support multiple indexes; each index has a name and is written to its own target index folder.
<indexConfiguration>
<index name="TaskA" indexFolderUrl=
"C:\Documents and Settings\abi\Desktop\LuceneIndexer\LuceneIndexer\TestIndex">
</index>
</indexConfiguration>
We have a set of fields. Each field has a name and three flags that say whether it is stored, indexed and tokenised.
<fields>
<field name="ID" isStored="true" isIndexed="true" isTokenised="false" />
<field name="Title" isStored="true" isIndexed="true" isTokenised="false" />
<field name="ContentText" isStored="true"
isIndexed="true" isTokenised="true" />
</fields>
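Since every flag is read from an attribute, a missing or misspelled attribute only shows up at indexing time. As a minimal sketch, assuming we want to validate a field element up front, the hypothetical FieldConfig helper below (not part of the article's code) parses one field element and fails with a descriptive message when an attribute is absent.

```csharp
using System;
using System.Xml;

// Hypothetical helper type, introduced only for illustration.
public class FieldConfig
{
    public string Name;
    public bool IsStored;
    public bool IsIndexed;
    public bool IsTokenised;

    // Reads the name and the three flags from a <field> element.
    public static FieldConfig FromXml(XmlNode fieldNode)
    {
        FieldConfig config = new FieldConfig();
        config.Name = Required(fieldNode, "name");
        config.IsStored = bool.Parse(Required(fieldNode, "isStored"));
        config.IsIndexed = bool.Parse(Required(fieldNode, "isIndexed"));
        config.IsTokenised = bool.Parse(Required(fieldNode, "isTokenised"));
        return config;
    }

    // Returns the attribute value, or throws if the attribute is missing.
    private static string Required(XmlNode node, string attribute)
    {
        XmlAttribute attr = node.Attributes[attribute];
        if (attr == null)
        {
            throw new ArgumentException(
                "Field element is missing the '" + attribute + "' attribute.");
        }
        return attr.Value;
    }
}
```

Calling FieldConfig.FromXml on each node returned by SelectNodes("fields/field") before indexing starts would surface configuration mistakes early.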
We can get the data from anywhere (SQL Server, Oracle, the file system). For our scenario, let's say we get the data from SQL Server (I developed code for this; you can extend it if you need more). SQL Server needs a SELECT statement and a connection string, which translates to the XML below:
<sqlClient connectionString="Data Source=(local);
Initial Catalog=Junk;Trusted_Connection=True">
SELECT * FROM Content</sqlClient>
Putting it all together, this is what we get.
<?xml version="1.0" encoding="utf-8"?>
<indexConfiguration>
<index name="TaskA" indexFolderUrl=
"C:\Documents and Settings\abi\Desktop\LuceneIndexer\LuceneIndexer\TestIndex">
<sqlClient connectionString="Data Source=(local);
Initial Catalog=Junk;Trusted_Connection=True">
SELECT * FROM Content</sqlClient>
<fields>
<field name="ID" isStored="true" isIndexed="true" isTokenised="false" />
<field name="Title" isStored="true" isIndexed="true" isTokenised="false" />
<field name="ContentText" isStored="true" isIndexed="true" isTokenised="true" />
</fields>
</index>
<index name="TaskB" indexFolderUrl="\\MyPC\TestIndex">
<sqlClient connectionString="Data Source=(local);
Initial Catalog=Junk;Trusted_Connection=True">SELECT * FROM Content</sqlClient>
<fields>
<field name="ID" isStored="true" isIndexed="true" isTokenised="false" />
<field name="Title" isStored="true" isIndexed="true" isTokenised="false" />
<field name="ContentText" isStored="true" isIndexed="true" isTokenised="true" />
</fields>
</index>
</indexConfiguration>
Because we want this to be extensible in the future, we define an abstract class called Indexer, and SqlClientIndexer inherits from it. This is the Indexer class:
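using System.Xml;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;

public abstract class Indexer
{
    // The <index> element that configures this indexer.
    protected XmlNode XmlNode;

    public Indexer(XmlNode xmlNode)
    {
        this.XmlNode = xmlNode;
    }

    // Rebuilds the index from scratch in the configured folder.
    public void Generate()
    {
        string indexFolderUrl = XmlNode.Attributes["indexFolderUrl"].Value;
        // 'true' creates a new index, overwriting any existing one.
        IndexWriter writer = new IndexWriter(indexFolderUrl,
            new StandardAnalyzer(), true);
        try
        {
            IndexRecords(writer);
            writer.Optimize();
        }
        finally
        {
            // Close the writer even if indexing fails.
            writer.Close();
        }
    }

    // Implemented by subclasses to add one document per source record.
    protected abstract void IndexRecords(IndexWriter writer);
}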
This is the code for SqlClientIndexer:
using System.Data;
using System.Data.SqlClient;
using System.Xml;
using Lucene.Net.Documents;
using Lucene.Net.Index;

public class SqlClientIndexer : Indexer
{
    public SqlClientIndexer(XmlNode xmlNode) : base(xmlNode)
    {
    }

    // Adds one Lucene document per row returned by the configured query.
    protected override void IndexRecords(IndexWriter writer)
    {
        DataTable dt = GetData;
        XmlNodeList fields = this.XmlNode.SelectNodes("fields/field");
        for (int i = 0; i < dt.Rows.Count; i++)
        {
            Document doc = new Document();
            for (int j = 0; j < fields.Count; j++)
            {
                // Each configured field name must match a column in the result set.
                string name = fields[j].Attributes["name"].Value;
                doc.Add(new Field(name, dt.Rows[i][name].ToString(),
                    bool.Parse(fields[j].Attributes["isStored"].Value),
                    bool.Parse(fields[j].Attributes["isIndexed"].Value),
                    bool.Parse(fields[j].Attributes["isTokenised"].Value)));
            }
            writer.AddDocument(doc);
        }
    }

    // Runs the SELECT statement from the <sqlClient> element and returns its rows.
    private DataTable GetData
    {
        get
        {
            string selectCommandText =
                XmlNode.SelectSingleNode("sqlClient").InnerText;
            string connectionString =
                XmlNode.SelectSingleNode("sqlClient/@connectionString").Value;
            using (SqlConnection connection = new SqlConnection(connectionString))
            using (SqlDataAdapter da =
                new SqlDataAdapter(selectCommandText, connection))
            {
                DataTable dt = new DataTable();
                // Fill opens and closes the connection itself.
                da.Fill(dt);
                return dt;
            }
        }
    }
}
Automate
Now we write a console application that reads our index configuration file and builds the indexes it describes. We also want to be able to rebuild only selected indexes on a given run, so the first command-line argument, if present, is used as an XPath condition when selecting the index nodes.
using System;
using System.Xml;

static class Program
{
    [STAThread]
    static void Main(string[] args)
    {
        XmlDocument doc = new XmlDocument();
        doc.Load("Index.config");
        // Optional XPath condition, e.g. @name='TaskA'; no argument selects every index.
        string xpath = (args.Length > 0 ? "[" + args[0] + "]" : "");
        XmlNodeList nodes = doc.SelectNodes("/indexConfiguration/*" + xpath);
        for (int i = 0; i < nodes.Count; i++)
        {
            // Only SQL Server sources are supported so far.
            Indexer ind = new SqlClientIndexer(nodes[i]);
            ind.Generate();
        }
    }
}
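The node selection in Main can be tried in isolation, without touching Lucene or a database. The sketch below is a hypothetical helper (IndexSelector is not part of the article's code) that applies the same "/indexConfiguration/*" path plus an optional condition to an XML string. Note that name is an attribute of the index element, so a condition has to address it as @name; a bare name would look for a child element called name and match nothing.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper, introduced only to illustrate the XPath selection.
public static class IndexSelector
{
    // Returns the name attribute of every index element that satisfies the
    // optional XPath condition (the same string Main receives as args[0]).
    public static List<string> SelectIndexNames(string configXml, string condition)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(configXml);

        // No condition means "select every index".
        string predicate = string.IsNullOrEmpty(condition) ? "" : "[" + condition + "]";
        XmlNodeList nodes = doc.SelectNodes("/indexConfiguration/*" + predicate);

        List<string> names = new List<string>();
        foreach (XmlNode node in nodes)
        {
            names.Add(node.Attributes["name"].Value);
        }
        return names;
    }
}
```

With a configuration that contains TaskA and TaskB, the condition @name='TaskA' selects only the first index, while @name='TaskA' or @name='TaskB' selects both.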
Finally, I want 'TaskA' to be indexed every hour and 'TaskB' once every 10 minutes. This can be achieved using Windows Task Scheduler, DTS or any other scheduling tool. On the command line, we pass "@name='TaskA'" for TaskA and "@name='TaskB'" for TaskB (this is simply an XPath condition; the @ is needed because name is an attribute). We can select a set of tasks with a condition such as "@name='Task1' or @name='Task2'".
Extensions
This can be extended to folder indexing, Oracle, MySQL and so on by adding further classes that inherit from Indexer.
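One way to prepare for such extensions is to let the configuration decide which indexer handles each index, instead of always creating a SqlClientIndexer in Main. The sketch below is only an assumption about how that could look: the hypothetical DataSourceDetector reports the name of the data-source element under an index node, and the folder element mentioned afterwards is likewise hypothetical, not something the code in this article supports.

```csharp
using System;
using System.Xml;

// Hypothetical helper, not part of the original code.
public static class DataSourceDetector
{
    // Returns the name of the first child element other than <fields>,
    // for example "sqlClient", or null when no data source is declared.
    public static string Detect(XmlNode indexNode)
    {
        foreach (XmlNode child in indexNode.ChildNodes)
        {
            if (child.NodeType == XmlNodeType.Element && child.Name != "fields")
            {
                return child.Name;
            }
        }
        return null;
    }
}
```

Main could then switch on the returned name, creating a SqlClientIndexer for "sqlClient" and, say, a folder-based indexer for a "folder" element.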
History
- 22nd February, 2006: Initial post