Click here to Skip to main content
Click here to Skip to main content

Full-Text Indexing Files with Microsoft SQL Server

, 2 Feb 2007
Rate this:
Please Sign up or sign in to vote.
Indexing Word, Excel and other types of documents is easy with SQL Server

Introduction

This article is about the full-text search capability of SQL Server 2000 (2005). It is an easy to use, very fast and extensible solution to index and search in various types of documents' content. For example in Word, Excel, Adobe portable document format (PDF) and HTML files.

Prerequisites

To go through with this example, you will need Microsoft SQL Server 2000 Server (at least) access, a database with DB owner right, and of course the client tools.

Creating a Table

In order to index files stored in a database table, we have to create two table fields. In the first field, we will store the content of the document in binary format, in the second we will store the extension of the file, for example ".doc" or ".xls".
It's a good idea to store the full name and the size information of the files, because probably you'll need these in real situations, but you won't need them for full-text indexing.

Storing the sizes of the files can be optional, because you can query them with the DataLength function, but this could take a long time, much longer than reading these values from a field.

So our create table script can be this:

CREATE TABLE [dbo].[Doc] (
 [ID]  uniqueidentifier ROWGUIDCOL  NOT NULL ,
 [Extension] [varchar] (10) NOT NULL ,
 [Content] [image] NOT NULL ,
 [FileSize] [int] NOT NULL ,
 [FileName] [nvarchar] (500) NOT NULL ,
 [Stamp] [timestamp] NOT NULL 
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO

ALTER TABLE [dbo].[Doc] WITH NOCHECK ADD 
 CONSTRAINT [PK_Doc] PRIMARY KEY  CLUSTERED 
 (
  [ID]
 )  ON [PRIMARY] 
GO

ALTER TABLE [dbo].[Doc] ADD 
 CONSTRAINT [DF_Doc_ID] DEFAULT (newid()) FOR [ID]
GO

A simple idea is to use a unique identifier type as the primary key of the document table. It can be useful in many cases if you are a web developer. Maybe you would like to use this id in the URL (query string) when you make a download page, and you don't want to give the chance to your users to download all the documents without seeing all your advertisements, etc. by simply increasing a parameter at the end of the URL.

Please notice that a primary key has been created by the script. It is always very important because this can guarantee that you can't insert two records with the same id into the database (a table is a set of records, you can't make the difference between records with the same values, except with cursors), and a good clustered index can really improve the performance of the SELECT statements.

Also, there is a timestamp field – called stamp – on the table. It will be necessary for us and we will talk about it later.

Creating Full-text Index

I think nobody knows the proper format of the full-text index creation statement by head. So, we will create this index from Microsoft SQL Server Enterprise Manager.
First choose your server, then your database. Choose tables then on the right, search for the table you've created (Doc).
After right clicking on the table, choose "Define Full-Text indexing on a table" submenu from "Full-Text Index table" menu.

After these, a wizard will appear.
In the "Select an index step", choose your primary key. Later, when we query the table, we will get back this (unique) index's values from the full-text search engine as a result.

In the "Select Table Columns", step mark the "Content" column, and in that row in the document type column, choose the extension field. Then step away to another row, because the Next button will be activated only after you change the selection.

In this step, we gave the indexable column to the server and also defined the types of each document.

In the "Select a Catalog" step, you can create a new full-text catalog or choose an existing one. I offer that you make a new catalog for these data, because the collected indexes can be huge if the count of the documents increase.

In the "Select or Create Population Schedules", you can schedule the incremental and the full population of the catalog or the table. You can leave it empty in this case, because we will use the change tracking feature of the SQL Server. It means, we tell the SQL server to update its full-text catalog "immediately" after an insert or an update occurs.
In real situations, it is generally offered to make a full population every week or daily and an incremental daily or more often, but it depends on your server utilization.
The cause is that the change tracking and incremental update feature doesn't recognize if someone uses WriteText or UpdateText statements.

When we designed the table, we've put a timestamp field into it. The timestamp field is needed for incremental index updates, without that field all the incremental updates do full updates. The full-text index engine can find the updates on this property change. This is because the WriteText and UpdateText statements are not noticed by the full-text engine. These statements don't update the timestamp field like normal Insert and Update do.

So finish the wizard, by pressing the Next button few times and then Finish.

After the full-text index has been defined, we have to turn the change tracking functionality on. We can do it by right clicking on the table name in Enterprise Manager and then choose Change Tracking then the "Update index in background" from Full-Text Index Table menu. From this time, the full-text index engine will watch our table for changes and update its index if necessary.

An interesting result of our index creation:

if (select DATABASEPROPERTY(DB_NAME(), N'IsFullTextEnabled')) <> 1 
exec sp_fulltext_database N'enable' 
GO

if not exists (select * from dbo.sysfulltextcatalogs where name = N'Doc')
exec sp_fulltext_catalog N'Doc', N'create' 
GO

exec sp_fulltext_table N'[dbo].[Doc]', N'create', N'Doc', N'PK_Doc'
GO

exec sp_fulltext_column N'[dbo].[Doc]', N'Content', N'add', 1033, N'Extension' 
GO

exec sp_fulltext_table N'[dbo].[Doc]', N'activate'  
GO

Testing

Now we insert some records and then we create a few select queries. Let's run this insert script in Query Analyzer:

INSERT INTO [DOC] ([Extension], [Content], [FileSize], [FileName]) 
    VALUES ('.txt', 'Hello John! It''s me: Garfield!', 30, 'Cartoon1.txt')

INSERT INTO [Doc] ([Extension], [Content], [FileSize], [FileName]) 
    VALUES ('.txt', 'Oh my god!', 30, 'Shout.txt')

INSERT INTO [DOC] ([Extension], [Content], [FileSize], [FileName]) 
    VALUES ('.txt', 'NOWAN's web site: <a href="%22http://www.nowan.hu/'%22">http://www.nowan.hu/'</a>, 30, 'nowan.txt')

These are simple inserts, you can examine them yourself. After you inserted the records, you can run a SELECT statement to check the table:

SELECT * FROM [Doc]

Full-Text Search

To create a full-text query, you have to get closer with the next statements:

The first two statements have two parameters. The first is the column name and the second is the searched string. These functions give back Boolean values.

The second two statements are more interesting. These functions return tables that have two columns: Key and Rank. It means that we can get back the unique id of the searched record or records and also we can get back the hit probability (Rank):

SELECT * FROM ContainsTable([doc], Content, '"nowan"')

Another good trait of containstable and freetexttable is that you can give difficult expressions as searched string. For example, you can use "OR" and "AND" logical terms:

Of course, the result table of these statements can be joined to real tables. So if we would like to get the original data row from the original table, we could use a select statement like this:

SELECT Doc.* FROM [Doc] 
    INNER JOIN ContainsTable([doc], Content, '"nowan"') AS FT 
    ON Doc.ID = FT.[Key]

Indexed File Types

The SQL Server can create indexes typically from text files and Microsoft Office files. This type list is extensible by iFilters designed for the old Indexing Service. You can get an iFilter from Adobe too for indexing PDF files.

Hungarian Version

You can find the Hungarian version of this article here.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author


Comments and Discussions

 
QuestionDo you have an example on how to search on Word document? Pinmemberjdavidmtzc2-Aug-12 12:18 
QuestionHow to use the Transact-SQL functions CONTAINSTABLE and CONTAINS to do a search on a full-text enabled table Pingroupelizas23-Mar-10 23:38 
QuestionHow to use the Transact-SQL functions CONTAINSTABLE and CONTAINS to do a search on a full-text enabled table Pingroupelizas23-Mar-10 23:35 
GeneralFew questions PinmemberPratik.Patel9-Feb-09 14:44 
GeneralRe: Few questions PinmemberBoci5-Dec-11 10:09 
Generalfull-text index tab is not active PinmemberMember 220985825-May-08 22:33 
AnswerRe: full-text index tab is not active Pinmemberboci26-May-08 9:36 
QuestionDatabase size .. is it an issue for full text catalog ?? Pinmemberreej26-Mar-08 0:30 
AnswerRe: Database size .. is it an issue for full text catalog ?? Pinmemberboci26-Mar-08 13:52 
QuestionCan you explain how to search on Word document? Pinmemberdadvir11-Jul-07 22:02 
AnswerRe: Can you explain how to search on Word document? Pinmemberboci11-Jul-07 23:12 
GeneralDotLucene is better PinmemberJan Seda3-Feb-07 3:09 
GeneralRe: DotLucene is better Pinmemberdfgdfd24-Jun-07 6:59 
Questionwhy Full-Text search not support thai language? Pinmemberkookai29-Aug-06 21:12 
AnswerRe: why Full-Text search not support thai language? [modified] Pinmemberboci30-Aug-06 5:04 
GeneralRe: why Full-Text search not support thai language? Pinmemberkookai30-Aug-06 16:46 
AnswerRe: why Full-Text search not support thai language? Pinmembervolkan.ozcelik8-Sep-06 21:34 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

| Advertise | Privacy | Mobile
Web02 | 2.8.140721.1 | Last Updated 3 Feb 2007
Article Copyright 2006 by István Kovács (HU)
Everything else Copyright © CodeProject, 1999-2014
Terms of Service
Layout: fixed | fluid