Introduction

This article will show a method for breaking input Unicode text into separate words. It also will show an effective method of dealing with languages such as Chinese, Korean, Japanese etc. that do not have a concept of "words" without resorting to complex dictionary / linguistic analysis methods.

Finally some tips for using this in a real world application will be presented.

Background

I am currently working on a business application that must provide a way for users to quickly search for and retrieve information that they have previously entered. Because our application is fully internationalized and globalized, the document indexing and retrieval must support globalization as well.

In our application there are two tables in a SQL Server database, one contains a dictionary of unique words indexed to date and the other contains a cross reference between the word and the document it was originally indexed from. When a new record is entered, it is indexed immediately and the dictionary and index are updated immediately by the same stored procedure that saves the record. This means a user can search immediately at one workstation for something a user had just entered at another.

The input search term is broken apart into words using the same method as the original documents were broken to create the dictionary and index. A SQL query then looks for all documents that contain all the search terms and returns a list of matching documents so the user can open the original document.

I had originally considered the Indexing service built into SQL Server 2005 however it is very limited when working with multiple languages at once, and can not be indexed on the fly as in our application so has to schedule lengthy indexing operations which means users can not search for data just entered previously etc. I then considered using the Microsoft indexing service itself since it's built into all modern versions of Windows. However, it isn't ideal as it also involves some work to ensure it works with every specific language the user needs to support, and there were some questions as to the legality of using it in our own application despite parts of it being the Windows base service API and after a month of not hearing back from Microsoft Legal I have given up on that avenue. What I needed is something simple and effective enough to do the job.

The problem

On the surface it seems easy and you may have implemented this yourself when dealing with English language text: just go over the input text, split out the words by finding the boundaries between them (whitespace, punctuation etc.).

There are several problems with this approach when indexing non-English text:

Some languages do not have a concept of a "word" hence no whitespace or punctuation exists to determine word boundaries (Chinese, Japanese, Korean, etc.).
In many languages, one or more Unicode characters may make up what the user thinks of as a character or basic unit of the language.
Some languages are entered from right to left and must be handled accordingly.

The problem seemed overwhelmingly complex and time consuming: Just the sort of thing that 3^rd party components were invented for! However after doing some research, it appears that there is no solution that is small, flexible and not prohibitively expensive. Most solutions revolve around complex word stemming and dictionary matching methods which means that you have to force your users to install a language specific component of quite a large size. It also means that quite a significant investment needs to be made in every language you intend to sell in. This wouldn't work for our application because it is intended to be inexpensive and multi-lingual in a dynamic manner meaning more than one user can be using the same database in different languages at the same time.

I have written a solution that appears to do the job. In the ideal world, it would be perfect, but this world is not ideal so any feedback and suggestions would be gratefully appreciated.

The solution

I won't show any examples of the WordBreaker class source code here because the code is simple, straightforward and well commented, however the techniques are interesting and useful to know:

The right to left issue and the issue with more than one Unicode character being used to represent what the user sees as a single character are both easily solved by the TextElementEnumerator in the System.Globalization class. This handy class will properly iterate through the source text returning each text element. It returns text elements in the correct order depending upon whether the source text is right to left or left to right and it understands the concept of a series of Unicode characters representing a single visual grapheme character as the reader sees it.

The most insurmountable obstacle to making our own solution appeared to be the languages without word boundaries, however after doing some research on the subject, it appears that there is a solution called n-gram or overlapping indexing that is in use and works effectively.

The premise is that since you can't identify the boundaries between Chinese "words" (for example), what you can do is index every second character in an overlapping scheme. For example: if the input text is "C1C2C3C4" (c1 being one Chinese character, c2 another etc.) you break it apart as: "C1C2", "C2C3","C3C4" etc. This works because when the user enters their search term you break it apart similarly. The downside is of course that your index of words is going to be much larger, however you do not need to provide a dictionary and word stemming / linguistic analysis, so in the end the tradeoff may not be bad at all.

There is a good paper on the Microsoft Research website (a lot of good stuff there): "On the use of Words and N-grams for Chinese Information Retrieval" on the n-gram method I use as well as the more advanced methods. The paper indicates that the n-gram method I'm using is as effective as other methods with the downside of the larger time it takes to index (they are basing their paper on an app that indexes all the documents all at once), however since in our app we index on the fly during a save or update, the time delay is negligible. The other downside is of course the size of the dictionary being much larger than indexing Latin text, however in our app there are typically not enough documents with unique words to make much difference either way.

This n-gram breaking is simply accomplished by scrolling a moving "window" over the text in question. The only tricky parts are spotting a boundary between Latin text and CJK text -- it is common for CJK text to contain Latin text within it and we want to ensure that we break apart the Latin text using the more appropriate process of spotting word boundaries. This is solved by looking for characters in the sub 256 decimal Unicode range. This ensures that we catch not only standard Latin characters but extended Latin as well to cover languages other than English that use the extended Latin-1 Unicode block (Spanish, French, Croatian etc.) for Latin characters with Acute, Circumflex, Grave etc. accents.

Using the code

The word breaker is implemented as a static function Break in the class WordBreaker:

public static string Break(bool breakStyleCJK, 
           bool returnAsXml, params string[] text)

Parameters

breakStyleCJK - uses rolling overlap 2 n-gram method of breaking the text. If Latin characters are encountered they will be broken using punctuation and spacing as full words.
returnAsXml - returns the resulting list of unique words as a fragment of XML suitable for passing to a SQL Server stored procedure or whatever you have. If false it will return words as comma delimited strings.
text - a series of 1 to * strings to be broken.
Returns - a string containing either a fragment of XML or a series of comma delimited strings.

If the return as XML method is chosen the XML looks like this:

<Items>
  <i w="hello" />
  <i w="break" />
  <i w="text" />
</Items>

Real world implementation

I thought it might be helpful to include some bits and pieces that tie this into a real world application. As stated previously, we use a SQL Server database for persistent storage.

There are two tables used for the searching and indexing: a dictionary table and a key table. The dictionary table contains two columns, one for each word indexed with a unique constraint on it so that no word will be stored more than once and a second column containing a unique ID number for that word.

The index table is the cross reference between the dictionary and the documents that are indexed. In our case, it contains three fields, one for the word ID, one for the source object ID and a third to indicate the type of source object for easy retrieval later on.

Here is an example of a stored procedure that can take the XML generated by the word breaker and insert those words into a dictionary / index, it is called by the stored procedure that saves / updates a business object and the business object level code generates the keyword XML by passing all string fields to the word breaker and then passing on the result of the wordbreak as a ntext parameter. The business object saving / updating stored procedure in turn calls this procedure passing off the keyword XML and the info that uniquely identifies the source object:

/*
Created XXXXXX
Coded by: John
Object: _ProcessKeywords utility sp.
Purpose: Storing keywords for search
Last update: XXXXX

Takes as input an XML fragment formatted like this:

<Items>
  <i w="firstword" />
  <i w="secondword" />
  <i w="etcetc" />
</Items>

Extracts the words and inserts them into the search dictionary table
A constraint on the SearchDictionary table ensures only new words are added

Then the RootObject ID and Type along with the words dictionary ID value
are inserted into the SEarchKey table for later retrieval during a search

RootobjectID and type are used to uniquely identify the source document.
*/
ALTER PROCEDURE dbo._ProcessKeywords

    (
        @KWXML ntext,
        @RootObjectID uniqueidentifier,
        @RootObjectType smallint,
        @ClearExistingIndex smallint = 0
    )

AS

    SET NOCOUNT ON
    IF DATALENGTH(@KWXML)<1 
        RETURN
    --Clear out any previous indexes for this object id and type in the 
    --SearchKey table
    --This ensures any words that no longer appear in the source document 
    --are not left hanging around in an old index of that document
    IF @ClearExistingIndex = 1
        DELETE FROM dbo._SearchKey WHERE (_SourceObjectID=@RootObjectID AND
                         _SourceObjectType=@RootObjectType)
    
        
     DECLARE @hDoc int --Handle to xml parser document
     DECLARE @word nvarchar(255) --Max 255 characters are indexed
     DECLARE @_srchid uniqueidentifier
     
     EXEC dbo.sp_xml_preparedocument @hDoc output, @KWXML
     
     --Declare a cursor that will be used to loop through the
     --wordlist xml items one by one
     DECLARE wordlist CURSOR
     FOR SELECT * FROM OPENXML(@hDoc,'/Items/i',1) WITH (w nvarchar(255))
     FOR READ ONLY
     
     OPEN wordlist
     
     FETCH wordlist into @word
     
     --Loop through the XML word list
     while(@@FETCH_STATUS=0) 
     BEGIN
        --Insert the word into the dictionary
        INSERT dbo._SearchDictionary (Word) values (@word)    
    
        --Get the word's ID value from the dictionary table
        SET @_srchid = (SELECT ID FROM dbo._SearchDictionary 
                        WHERE (Word = @word))
    
        --Update the search key table with the source object id and word id 
        INSERT dbo._SearchKey (_WordID, _SourceObjectID, _SourceObjectType)
            VALUES (@_srchid, @RootObjectID, @RootObjectType)
     
        --Get the next word    
        FETCH wordlist into @word
     END
     
     --Cleanup cursor
     CLOSE wordlist
     DEALLOCATE wordlist     
     
     EXEC dbo.sp_xml_removedocument @hDoc

This gives us a dictionary full of words and an index. To search for documents, the user's search term is broken using the exact same method as the original document, then it's passed to a query used to find all documents that contain all the keywords (there are other ways of doing this, but for our purposes we only want to return matches that contain all words).

Let's say the user is searching for the phrase "almond pear avocado", that query would look like this:

SELECT dbo._SearchKey._SourceObjectID, dbo._SearchKey._SourceObjectType
FROM   dbo._SearchDictionary INNER JOIN
       dbo._SearchKey ON dbo._SearchDictionary.ID = dbo._SearchKey._WordID
WHERE  (dbo._SearchDictionary.Word = N'almond') OR
       (dbo._SearchDictionary.Word = N'pear') OR
       (dbo._SearchDictionary.Word = N'avocado')
GROUP BY dbo._SearchKey._SourceObjectID, dbo._SearchKey._SourceObjectType
HAVING (COUNT(*) = 3)

Since the query is grouped by the source object's ID and type and OR is used with the keywords and we know that each keyword is indexed only once per source document, it will return a single result for each document that contains at least one occurrence of each of the keywords searched for. In this case there are three keywords, so there should be three results in each grouping of each document.

The resulting list of ID numbers can then be used to extract an excerpt to display to the user (coming in another article).

Remarks

Hopefully this will be of some use to others and I would appreciate any feedback towards improving it.

History

21^st April, 2005: initially posted.