C# search engine: refactored to search Word, PDF and more

Refactor Searcharoo C# search engine code to make it easier to extend, and add ability to index/search Microsoft Office documents, PDFs, etc.

Search Word, PDF, Text and more (as well as HTML)

Background

This article follows on from the previous three Searcharoo samples:

Searcharoo Version 1 describes building a simple search engine that crawls the file system from a specified folder, and indexes all HTML (and other known types of) documents. A basic design and object model was developed to support simple, single-word searches, whose results were displayed in a rudimentary query/results page.

Searcharoo Version 2 focused on adding a 'spider' to find data to index by following web links (rather than just looking at directory listings in the file system). This means downloading files via HTTP, parsing the HTML to find more links and ensuring we don't get into a recursive loop, because many web pages refer to each other. This article also discusses how the results for multiple search words are combined into a single set of 'matches'.

Searcharoo Version 3 implemented a 'save to disk' function for the catalog, so it could be reloaded across IIS application restarts without having to be generated each time. It also spidered FRAMESETs and added Stop words, Go words and Stemming to the indexer. A number of bugs reported via CodeProject were also fixed.

Introduction to version 4

Version 4 of Searcharoo has changed in the following ways (often prompted by CodeProject members):

  1. It can now index/search Word, Powerpoint, PDF and many other file types, thanks to the excellent Using IFilter in C# article by Eyal Post. This is probably the coolest bit of the whole project - but all credit for it goes to Eyal.
  2. It parses and obeys your robots.txt file (in addition to the robots META tag, which it already understood) (cool263).
  3. You can 'mark' regions of your html to be ignored during indexing (xbit45).
  4. There is a rudimentary effort to follow links hiding in javascript (ckohler).
  5. You can run the Spider locally via a command-line application, then upload the Catalog file to your server (useful if your server doesn't have all the IFilters installed to parse the documents you want indexed).
  6. The code has been significantly refactored (thanks to encouragement from mrhassell and j105 Rob). I hope this makes it easier for people to read/understand and edit to add the stuff they need.

Some things to note

  • You need Visual Studio 2005 to work with this code. In previous versions I tried to keep the code in a small number of files, and structure it so it'd be easy to open/run in Visual WebDev Express (heck, the first version was written in WebMatrix), but it's just getting too big. As far as I know, it's still possible to shoehorn the code into VWD (with App_Code directory and assemblies from the ZIP file) if you want to give it a try...
  • I've included two projects from other authors: Eyal's IFilter code (from CodeProject and his blog on bypassing COM) and the Mono.GetOptions code (nice way to handle Command Line arguments). I do NOT take credit for these projects - but thank the authors for the hard work that went into them, and for making the source available.
  • The UI (Search.aspx) hasn't really changed at all (except for class name changes as a result of refactoring) - I have a whole list of ideas & suggestions to improve it, but they will have to wait for another day.

Design & Refactoring

The Catalog-File-Word design that supports searching the Catalog remains basically unchanged (from Version 1!); however, there has been a total reorganization of the classes used to generate the Catalog.

In version 3, all the code to download a file, parse the html, extract the links, extract the words, add them to the catalog and save the catalog was crammed into two classes (Spider and HtmlDocument, shown at right).

Notice that the StripHtml() method is in the Spider class - doesn't make sense, does it?

Object Model for Searcharoo3: Spider and HtmlDocument

This made it difficult to add the new functionality required for supporting IFilter (or any other document types we might like to add) that don't have the same attributes as an Html page.

To 'fix' this design flaw, I pulled out all the Html-specific code from Spider and put it into HtmlDocument. Then I took all the 'generic' document attributes (Title, Length, Uri, ...) and pushed them into a superclass Document, from which HtmlDocument inherits. To allow Spider to deal (polymorphically) with any type of Document, I moved the object creation code into the static DocumentFactory so there is a single place where Document subclasses get created (so it's easy to extend later). DocumentFactory uses the MimeType from the HttpResponse header to decide which class to instantiate.
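
As a rough sketch (the method and constructor signatures here are illustrative assumptions, not an exact copy of the Searcharoo source), the factory does little more than map MIME types to Document subclasses:

C#
// Sketch: choose a Document subclass from the MIME type in the response header
public static class DocumentFactory
{
    public static Document New(Uri uri, System.Net.HttpWebResponse response)
    {
        string mime = response.ContentType.ToLower();
        if (mime.StartsWith("text/html"))  return new HtmlDocument(uri);
        if (mime.StartsWith("text/plain")) return new TextDocument(uri);
        // anything else (Word, PDF, Powerpoint, ...) is handed to IFilter
        return new FilterDocument(uri);
    }
}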

Document and subclasses including HtmlDocument

You can see how much neater the Spider and HtmlDocument classes are (well OK, that's because I hid the Fields compartment). To give you an idea of how the code 'moved around': Spider went from 680 lines to 420, HtmlDocument from 165 to 450, and the Document base became 135 lines - the total line count has increased (as has the functionality) but what's important is the way relevant functions are encapsulated inside each class.

The new Document class can then form the basis of any downloadable file type: it is an abstract class so any subclass must at least implement the GetResponse() and Parse() methods:

  • GetResponse() controls how the class gets the data out of the stream from the remote server (e.g. Text and Html are read into memory, while Word/PDF/etc. are written to a temporary disk location) and text is extracted.
  • Parse() performs any additional work required on the file's contents (e.g. remove Html tags, parse links, etc.).
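
Put together, the base class looks roughly like this (member names are representative of the description above rather than a verbatim copy of the source):

C#
// Sketch: attributes common to every downloadable file type, plus the
// two methods every subclass must provide
public abstract class Document
{
    public Uri Uri;          // where the document was downloaded from
    public string Title;     // display title for the results page
    public long Length;      // size in bytes
    public string All;       // the extracted text that gets indexed

    // read the data out of the web response stream
    public abstract bool GetResponse(System.Net.HttpWebResponse webresponse);

    // any additional processing (strip tags, extract links, ...)
    public abstract void Parse();
}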

The first 'new' class is TextDocument, which is a much simpler version of HtmlDocument: it doesn't handle any encodings (it assumes ASCII) and doesn't parse out links or Html, so the two abstract methods are very simple! From there it was relatively easy to build the FilterDocument class to wrap the IFilter calls, which allow many different file types to be read.
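
To illustrate just how little a simple subclass needs, here is a sketch along the lines of TextDocument (it assumes the Document members sketched above and is illustrative rather than a verbatim copy of the source):

C#
// Sketch: plain text document - no encodings, no links, no markup
public class TextDocument : Document
{
    public override bool GetResponse(System.Net.HttpWebResponse webresponse)
    {
        this.Uri = webresponse.ResponseUri;
        this.Title = System.IO.Path.GetFileName(this.Uri.LocalPath);
        using (System.IO.StreamReader reader = new System.IO.StreamReader(
               webresponse.GetResponseStream(), System.Text.Encoding.ASCII))
        {
            this.All = reader.ReadToEnd();   // the whole file is indexable text
        }
        this.Length = this.All.Length;
        return true;
    }

    public override void Parse()
    {
        // nothing to do - plain text has no tags to strip or links to follow
    }
}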

To demonstrate just how easy it was to extend this design to support IFilter, the FilterDocument class inherits pretty much everything from Document and only needs to add a touch of code (below; most of which is to download binary data, plus three lines courtesy of Eyal's IFilter sample). Points to note:

  • BinaryReader is used to read the webresponse for these files (in HtmlDocument we use StreamReader, which is intended for use with Text/Encodings)
  • The stream is actually saved to disk (NOTE: you need to specify the temp folder in *.config, and ensure your process has write permission there).
  • The saved file location is what's passed to IFilter
  • The saved file is deleted at the end of the method
C#
public override void Parse()
{
    // no parsing (for now). 
}
public override bool GetResponse (System.Net.HttpWebResponse webresponse)
{
    System.IO.Stream filestream = webresponse.GetResponseStream();
    this.Uri = webresponse.ResponseUri;
    string filename = System.IO.Path.Combine(Preferences.DownloadedTempFilePath
                      , (System.IO.Path.GetFileName(this.Uri.LocalPath)));
    this.Title = System.IO.Path.GetFileNameWithoutExtension(filename);
    using (System.IO.BinaryReader reader = new System.IO.BinaryReader(filestream))
    {   // we must use BinaryReader to avoid corrupting the data
        using (System.IO.FileStream iofilestream 
               = new System.IO.FileStream(filename, System.IO.FileMode.Create))
        {   // we must save the stream to disk in order to use IFilter
            int BUFFER_SIZE = 1024;
            byte[] buf = new byte[BUFFER_SIZE];
            int n = reader.Read(buf, 0, BUFFER_SIZE);
            while (n > 0)
            {
                iofilestream.Write(buf, 0, n);
                n = reader.Read(buf, 0, BUFFER_SIZE);
            }
            this.Length = iofilestream.Length;
        }   // the using blocks Close and Dispose both streams for us
    }
    bool success = false;
    try
    {
        EPocalipse.IFilter.FilterReader ifil 
                              = new EPocalipse.IFilter.FilterReader(filename);
        this.All = ifil.ReadToEnd();    // IFilter does the text extraction
        ifil.Close();
        success = true;
    }
    catch { /* no IFilter available for this file type, or it failed */ }
    System.IO.File.Delete(filename);    // clean up the temporary file
    return success;
}

And there you have it - indexing and searching of Word, Excel, Powerpoint, PDF and more in one easy class... all the indexing and search-results display code works as before, unmodified!

"Rest of the Code" Structure

The refactoring extended way beyond the HtmlDocument class. The 31 or so files are now organised into five (5!) projects in the solution:

  • EPocalipse.IFilter - unmodified from the Using IFilter in C# CodeProject article.
  • Mono.GetOptions - wrapped in a Visual Studio project file, but otherwise unmodified from a Mono source repository.
  • Searcharoo - all Searcharoo code now lives in this project, in three folders: /Common/, /Engine/ and /Indexer/.
  • Searcharoo.Indexer - NEW console application that allows the Catalog file to be built on a local PC (more likely to have a wide variety of IFilters installed), then copied to your website for searching. You could also create a scheduled task to regularly re-index your site (it's also great for debugging).
  • WebApplication - the ASPX files used to run Searcharoo. They have been renamed to Search.aspx, SearchControl.ascx and SearchSpider.aspx. Add these files to your website, merge the web.config settings (update whatever you need to), ensure Searcharoo.DLL is added to your /bin/ folder AND make sure your website user account (ASPNET) has write permission to the web root.

New features & bug fixes

I, robots.txt

Previous versions of Searcharoo only looked in Html Meta tags for robot directives - the robots.txt file was ignored. Now that we can index non-Html files, however, we need the added flexibility of disallowing search in certain places. robotstxt.org has further reading on how the scheme works.
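
For reference, a typical robots.txt looks something like this (an illustrative example, not taken from any real site):

# robots.txt lives in the web root; comments start with a hash
User-agent: *
Disallow: /private/
Disallow: /temp/
Disallow: /cgi-bin/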

The Searcharoo.Indexer.RobotsTxt class has two main functions:

  1. Check for, and if present, download and parse the robots.txt file on the site
  2. Provide an interface for the Spider to check each Url against the robots.txt rules

Function 1 is accomplished in the RobotsTxt class constructor - it reads through every line in the file (if found), discards comments (indicated by a hash '#') and builds an Array of 'url fragments' that are to be disallowed.
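
A simplified sketch of that constructor logic is below (the collection type and constructor signature are assumptions, and for brevity it ignores User-agent: sections, treating every Disallow: line as applying to the Searcharoo spider); the _DenyUrls field it fills is the one tested in Allowed() further down:

C#
// Sketch: fetch /robots.txt and collect the disallowed url fragments
public class RobotsTxt
{
    private System.Collections.Generic.List<string> _DenyUrls
          = new System.Collections.Generic.List<string>();

    public RobotsTxt(Uri startPageUri)
    {
        try
        {
            System.Net.WebClient client = new System.Net.WebClient();
            string robots = client.DownloadString(
                new Uri(startPageUri, "/robots.txt").ToString());
            foreach (string rawLine in robots.Split('\n'))
            {
                string line = rawLine;
                int hash = line.IndexOf('#');
                if (hash >= 0) line = line.Substring(0, hash);  // discard comments
                line = line.Trim();
                if (line.ToLower().StartsWith("disallow:"))
                {
                    string fragment =
                        line.Substring("disallow:".Length).Trim().ToLower();
                    if (fragment.Length > 0) _DenyUrls.Add(fragment);
                }
            }
        }
        catch (System.Net.WebException)
        {
            // no robots.txt was found (404) - nothing is disallowed
        }
    }
    // the Allowed() method shown below tests each Url against _DenyUrls
}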

Function 2 is exposed by the Allowed() method below

C#
public bool Allowed (Uri uri)
{
    if (_DenyUrls.Count == 0) return true;

    string url = uri.AbsolutePath.ToLower();
    foreach (string denyUrlFragment in _DenyUrls)
    {
        if (url.Length >= denyUrlFragment.Length)
        {
            if (url.Substring(0, denyUrlFragment.Length) == denyUrlFragment)
            {
                return false;
            } // else not a match
        } // else url is shorter than fragment, therefore cannot be a 'match'
    }
    if (url == "/robots.txt") return false;
    // no disallows were found, so allow
    return true;
}

There is no explicit parsing of Allow: directives in the robots.txt file - so there's still a little more work to do.

Ignoring a NOSEARCHREGION

In HtmlDocument.StripHtml(), this new clause (along with the relevant settings in .config) will cause the indexer to skip over parts of an Html file surrounded by Html comments of the (default) form <!--SEARCHAROONOINDEX-->text not indexed<!--/SEARCHAROONOINDEX-->

C#
if (Preferences.IgnoreRegions)
{
    string noSearchStartTag = "<!--" + Preferences.IgnoreRegionTagNoIndex + 
                              "-->";
    string noSearchEndTag = "<!--/" + Preferences.IgnoreRegionTagNoIndex + 
                            "-->";
    string ignoreregex = noSearchStartTag + @"[\s\S]*?" + noSearchEndTag;
    System.Text.RegularExpressions.Regex ignores = 
         new System.Text.RegularExpressions.Regex(ignoreregex
          , RegexOptions.IgnoreCase | RegexOptions.Multiline | 
            RegexOptions.ExplicitCapture);
    ignoreless = ignores.Replace(styleless, " "); 
    // replaces the whole commented region with a space
}

Links inside the region are still followed - to stop the Spider searching specific links, use robots.txt.
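
For example, a navigation menu could be marked up like this (illustrative markup; the links inside would still be followed, but the menu text would not be indexed):

HTML
<!--SEARCHAROONOINDEX-->
<div class="menu">
    <a href="Default.aspx">Home</a> |
    <a href="About.aspx">About</a>
    <!-- this text is not indexed, but these links are still followed -->
</div>
<!--/SEARCHAROONOINDEX-->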

Follow Javascript 'links'

In HtmlDocument.Parse(), the following code has been added inside the loop that matches anchor tags. It's a very rough piece of code, which looks for the first apostrophe-quoted string inside an onclick="" attribute (eg. onclick="window.location='top.htm'") and treats it as a link.

C#
if ("onclick" == submatch.Groups[1].ToString().ToLower())
{   // maybe try to parse some javascript in here
    string jscript = submatch.Groups[2].ToString();
    // some code here to extract a filename/link to follow from the 
    // onclick="_____"
    int firstApos = jscript.IndexOf("'");
    int secondApos = jscript.IndexOf("'", firstApos + 1);
    if (secondApos > firstApos)
    {
        link = jscript.Substring(firstApos + 1, secondApos - firstApos - 1);
    }
}

It would be almost impossible to predict the infinite variety of javascript links in use, but this code should provide a basis for people to modify to suit their own site (most likely where tricky menu image rollovers or similar bypass the regular href behaviour). At worst it will extract something that isn't a real page and get a 404 error...

Multilingual 'option'

Culture note: in the last version I was really focused on reducing the index size (and therefore the size of the Catalog on disk and in memory). To that end, I hardcoded the statement Regex.Replace(word, @"[^a-z0-9,.]", ""), which aggressively removes 'unindexable' characters from words. Unfortunately, if you are using Searcharoo in any language other than English, this Regex is so aggressive that it will delete a lot (if not ALL) of your content, leaving only numbers and spaces!

I've tried to improve the usability of that a bit, by making it an option in the .config:

HTML
<add key="Searcharoo_AssumeAllWordsAreEnglish" value="true" />
which governs this method in the Spider:
C#
private void RemovePunctuation(ref string word)
{   // this stuff is a bit 'English-language-centric'
    if (Preferences.AssumeAllWordsAreEnglish)
    {   // if all words are english, this strict parse to remove all 
        // punctuation ensures words are reduced to their least 
        // unique form before indexing
        word = System.Text.RegularExpressions.Regex.Replace(word, 
             @"[^a-z0-9,.]", "", 
              System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    }
    else 
    {   // by stripping out this specific list of punctuation only,  
        // there is potential to leave lots of cruft in the word 
        // before indexing BUT this will allow any language to be indexed
        word = word.Trim
         (' ','?','\"',',','\'',';',':','.','(',')','[',']','%','*','$','-'); 
    }
}

In future I'd like to make Searcharoo more language aware, but for now hopefully this will at least make it possible to use the code in a non-English-language environment.

Searcharoo.Indexer.EXE

The console application is a wrapper that performs the exact same function as SearchSpider.aspx (now that all the code has been refactored out of the ASPX and into the Searcharoo 'common' project). The actual console program code is extremely simple:

C#
clip = new CommandLinePreferences();
clip.ProcessArgs(args);
Spider spider = new Spider();
spider.SpiderProgressEvent += new SpiderProgressEventHandler(OnProgressEvent);
Catalog catalog = spider.BuildCatalog(new Uri(Preferences.StartPage));

That's almost identical to the SearchSpider.aspx web-based indexer interface.

The other code you'll find in the Searcharoo.Indexer project relates to parsing the command line arguments using Mono.GetOptions, which turns the attribute-adorned class shown below into a well-behaved console application with hardly an additional line of code.

CommandLinePreferences using attributes

Command line application (help) Searcharoo.Indexer.exe

What it actually does when it's running looks like this:

Command line application (running) Searcharoo.Indexer.exe -v:1

Just as with SearchSpider.aspx you'll see the output as it follows links and indexes text from each page in your website. The verbosity setting allows you to control how much 'debug' information is presented:

-v:0 None: totally silent (no console output)
-v:1 Minimal: just page names and word counts
-v:2 Informational: some error information (eg. 403, 404)
-v:3 Detailed: more exception and other info (eg. cookie error)
-v:4 VeryDetailed: still more (eg. robot Meta exclusions)
-v:5 Verbose: outputs the extracted words from each document - VERY VERBOSE
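
The OnProgressEvent handler wired up in the console snippet above is where that filtering happens. A minimal sketch follows (the member names ProgressEventArgs, Level, Message and Preferences.DebugVerbosity are assumptions for illustration, not necessarily the exact Searcharoo API):

C#
// Sketch: only echo spider progress messages at or below the chosen verbosity
private static void OnProgressEvent(object source, ProgressEventArgs pea)
{
    if (pea.Level <= Preferences.DebugVerbosity)    // eg. the -v: value
    {
        // indent by level so detailed messages stand out from page names
        Console.WriteLine(new string(' ', pea.Level) + pea.Message);
    }
}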

NOTE: the exe has its own Searcharoo.Indexer.exe.config file, which would normally contain exactly the same settings as your web.config. You may want to consider using the Indexer if your website contains lots of IFilter documents (Word, Powerpoint, PDF) and you get errors when running SearchSpider.aspx on the server because it does not have the IFilters available. The catalog output file (searcharoo.dat, or whatever your .config says) can be FTPed to your server, where it will be loaded and searched!

References

There's a lot to read about IFilter and how it works (or doesn't work, as the case may be). Start with Using IFilter in C# and its references, including Using IFilter in C# by bypassing COM, for pointers to LoadIFilter, IFilter.org and IFilter Explorer. dotLucene also has file-parsing references.

Searcharoo now has its own site - searcharoo.net - where you can actually try a working demo, and possibly find small fixes and enhancements that aren't groundbreaking enough to justify a new CodeProject article...

Wrap-up

Hopefully you find the new features useful and the article relevant. Thanks again to the authors of the other open-source projects used in Searcharoo.

History

  • 2004-06-30: Version 1 on CodeProject
  • 2004-07-03: Version 2 on CodeProject
  • 2006-05-24: Version 3 on CodeProject
  • 2007-03-18: Version 4 (this page) on CodeProject

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

