Adding features to a C# search engine/web spider

craigd

4.74/5 (22 votes)

May 24, 2006

CPOL

17 min read

268998

3572

Adding advanced search-engine features (and persistent catalog) to Searcharoo project

Sample image of Searcharoo: shows both the 'initial page' (bottom right) and the results view

Background

This article follows on from the previous two Searcharoo samples:

Searcharoo Version 1 describes building a simple search engine that crawls the file system from a specified folder, and indexes all HTML (or other known types) of document. A basic design and object model was developed to support simple, single-word searches, whose results were displayed ina rudimentary query/results page.

Searcharoo Version 2 focused on adding a 'spider' to find data to index by following web links (rather than just looking at directory listings in the file system). This means downloading files via HTTP, parsing the HTML to find more links and ensuring we don't get into a recursive loop because many web pages refer to each other. This article also discusses how multiple search words results are combined into a single set of 'matches'.

Introduction

This article (version 3 of Searcharoo) covers three main areas:

Implementing a 'save to disk' function for the catalog
Feature suggestions, bug fixes and incorporation of code contributed by others on previous articles (mostly via CodeProject - thankyou!)
Improving the code itself (adding comments, moving classes, improving readability and hopefully making it easier to modify & re-use)

New 'features' include:

Saving the catalog (which resides in memory for fast searching) to disk
Making the Spider recognise and follow pages referenced in FRAMESETs and IFRAMEs (suggested by le_mo_mo)
Paging results rather than just listing them all on one page (submitted by Jim Harkins)
Normalising words and numbers (removing punctuation, etc)
(Optional) stemming of English words to reduce catalog size (suggested by Chris Taylor and Trickster)
(Optional) use of Stop words to reduce catalog size
(Optional) creation of a Go word list, to specifically catalog domain-specific words like "C#", which might otherwise be ignored

The bug fixes include:

Correctly parsing <TITLE> tags that may have additional attributes eg. an ID= attribute in an ASP.NET environment. (submitted by xenomouse)
Handling Cookies if the server has set them to track a 'session' (submitted by Simon Jones)
Checking the 'final' URL after redirects to ensure the right page is indexed and linked (submitted by Simon Jones)
Correctly parsing (and obeying!) the ROBOTS meta tag (I found this bug myself).

Code layout improvements included:

The Spider code that was a bit of a mess in SearcharooSpider.aspx being moved into a proper C# class (and implementing an EventHandler to allow monitoring of progress)
Encapsulation of Preferences into a single static class
Layout of Searcharoo.cs using #regions (easy to read if you have VS.NET)
User control (Searcharoo.ASCX) created for search box - if you want to re-brand it you only have to modify in one place.
Paging implementation using PagedDataSource means you can easily alter the 'template' for the results (eg link size/color/layout) in Searcharoo3.aspx

Design

The fundamental Catalog-File-Word design remains unchanged (from Version 1), however there are quite a few extra classes implemented in this version.

Object Model for Searcharoo3

To build the catalog, SearcharooSpider.aspx calls Spider.BuildCatalog() which:

Accesses Preferences static object to read settings
Creates empty Catalog
Creates IGoWord, IStopper and IStemming implementations (based on Preferences)
Processes startPageUri (with a WebRequest)
Creates HtmlDocument, populates properties including Link collections
Parses the content of the page, creating Word and File objects as required
Recursively applies steps 4 through 6 for each LocalLink
BinarySerializes the Catalog to disk using CatalogBinder
Adds the Catalog to Application.Cache[], for use by Searcharoo3.aspx for searching!

Code Structure

These are the files used in this version (and contained in the download).

web.config	14 settings that control how the spider and the search page behave. They are all 'optional' (ie the spider and search page will run if no config settings are provided) but I recommend at least providing `<add key="Searcharoo_VirtualRoot" value="http://localhost/content/" />`
Searcharoo.cs	Most code for the application is in this file. Many classes that were in ASPX files in version 2 have been moved into this file (such as Spider and HtmlDocument) because it's easier to read and maintain. New version 3 features (Stop, Go, Stemming) all added here.
Searcharoo3.aspx	Search page (input and results). Checks the Application-Cache for a Catalog, and if none exists, creates one (deserialize OR run SearcharooSpider.aspx)
Searcharoo.ascx	NEW user control that contains two asp:Panels: the 'blank' search box (when page is first loaded, defaults to yellow background) the populated search box (when results are displayed, defaults to blue background) (see the screenshot at the top of the article)
SearcharooSpider.aspx	The main page (Searcharoo3.aspx) does a Server.Transfer to this page to create a new Catalog (if required). Almost ALL of the code that was in this page in version 2 has been migrated to Searcharoo.cs - OnProgressEvent() allows it to still display 'progress' messages as the spidering is taking place.

Saving the Catalog to Disk

There are a couple of reasons why saving the catalog to disk is useful:

It can be built on a different server to the website (for smaller sites, where the code may not have permission to write to disk on the webserver)
If the server Application restarts, the catalog can be re-loaded rather than rebuilt entirely
You can finally 'see' what information is stored in the catalog - useful for debugging!

There are two types of Serialization (Xml and Binary) available in the Framework, and since the Xml is 'human readable', that seemed the logical one to try. The code required to serialize the Catalog is very simple - the code below is from the Catalog.Save() method, so the reference to this is the Catalog object.

XmlSerializer serializerXml = new XmlSerializer( typeof( Catalog ) );
System.IO.TextWriter writer 
    = new System.IO.StreamWriter( Preferences.CatalogFileName+".xml" );
serializerXml.Serialize( writer, this );
writer.Close();

The 'test dataset' I've mostly used is the CIA World Factbook (download) which is about 52.6 Mb on disk for the HTML only (not including images and non-searchable data) - so imagine my "surprise" when the Xml-Serialized-Catalog itself three times the size at 156 Mb (yes, megabytes!). Couldn't even open it easily, except by 'type'ing it from the Command Prompt.

OUCH - what a waste of space! And worse, this was the first time I'd noticed the fields defined in the File class were declared public and not private (see the elements beginning with underscors). Firstly, let's get rid of the serialized duplicates (fields that should be private, and their public property counterparts) -- rather than change the visibility (and pontentially break code), the [XmlIgnore] attribute can be added to the definition. To further reduce the amount of repeated text, the element names are compressed to single letters using the [XmlElement] attribute, and to reduce the number of <> some of the properties are marked to be serialized as [XmlAttribute]s.

[Serializable]
public class Word 
{
    [XmlElement("t")]   public string Text;
    [XmlElement("fs")]  public File[] Files
...
[Serializable]
public class File
{
    [XmlIgnore]            public string _Url;
...
    [XmlAttribute("u")]    public string Url { ...
    [XmlAttribute("t")]    public string Title { ...
    [XmlElement("d")]      public string Description { ...
    [XmlAttribute("d")]    public DateTime CrawledDate { ...
    [XmlAttribute("s")]    public long  Size { ...
...

The Xml file is now a teeny (not!) 49 Mb in size, still too large for notepad but easily viewed via cmd. As you can see below, the 'compression' of the Xml certainly saved some space - at least the Catalog is now smaller than the source data!

Xml Serialization is slightly less verbose - 49 Mb

Even with the smaller output, 49 Mb is of Xml is still a little too verbose to be practical (hardly a surprise really, Xml often is!) so let's serialize the index to a Binary format (again, the Framework classes make it really simple).

System.IO.Stream stream = new System.IO.FileStream
       (Preferences.CatalogFileName+".dat" , System.IO.FileMode.Create );
System.Runtime.Serialization.IFormatter formatter = 
        new System.Runtime.Serialization.Formatters.Binary.BinaryFormatter();
formatter.Serialize (stream, this);
stream.Close();

The results of changing to Binary Serialization were dramatic - the same catalog data was 4.6 Mb rather than 150! That's about 3% of the Xml size, definitely the way to go.

Now that I had the Catalog being saved successfully to disk, it seemed like it would be a simple matter to re-load it back into memory & the Application Cache...

Loading the Catalog from Disk

Unfortunately, it was NOT that simple. Whenever the Application restarted (say web.config or Searcharoo.cs was changed), the code could not de-serialize the file but instead threw this cryptic error:

Cannot find the assembly h4octhiw, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null

At first I was stumped - I didn't have any assembly named h4octhiw, so it wasn't immediately apparent why it could not be found. There are a couple of hints though:

The 'not found ' assembly appears to have a randomly generated name... and what do we know uses randomly generated assembly names? The \Temporary ASP.NET Files\ directory where dynamically compiled assemblies (from src="" and ASPX) are saved.
The error line references only 'object' and 'stream' types - surely they aren't causing the problem
Reading through the Stack Trace (click on the image) from the bottom, up (as always), you can infer that the Deserialize method creates a BinaryParser that creates an ObjectMap with an array of MemberNames which in turn request ObjectReader.GetType() which triggers the GetAssembly() method... but it fails!. Hmm - sounds like it might be looking for the Types that have been serialized - why can't it find them?

If your Google skills are honed, rather than the dozens of useless links returned when you search for ASP.NET "Cannot find the assembly" you'll be lucky and stumble across this CodeProject article on Serialization where you will learn a very interesting fact:

Type information is also serialized while the class is serialized enabling the class to be deserialized using the type information. Type information consists of namespace, class name, assembly name, culture information, assembly version, and public key token. As long as your deserialized class and the class that is serialized reside in the same assembly it does not cause any problem. But if the serializer is in a separate assembly, .NET cannot find your class' type hence cannot deserialize it.

But what does it mean? Every time the web/IIS 'Application' restarts, all your ASPX and src="" code is recompiled to a NEW, RANDOMLY NAMED assembly in \Temporary ASP.NET Files\. So although the Catalog class is based on the same code, its Type Information (namespace, class name, assembly name, culture information, assembly version, and public key token) is DIFFERENT!

And, importantly, when a class is binary serialized, its Type Information is stored along with it (aside: this doesn't happen with Xml Serialization, so we probably would have been OK if we'd stuck with that).

The upshot: after every recompile (whatever triggered it: web.config change, code change, IIS restart, machine reboot, etc) our Catalog class has different Type info - and when it tries to load the serialized version we saved earlier, it doesn't match and the Framework can't find the assembly where the previous Catalog Type is defined (since it was only Temporary and has been deleted when the recompile took place).

Custom Formatter implementation

Sounds complex? It is, kinda, but the whole 'temporary assemblies' thing is something that happens invisibly and most developers don't need to know or care much about it. Thankfully we don't have to worry too much either, because the CodeProject article on Serialization also contains the solution: a helper class that 'tricks' the Binary Deserializer into using the 'current' Catalog type.

public class CatalogBinder: System.Runtime.Serialization.SerializationBinder
{
    public override Type BindToType (string assemblyName, string typeName) 
    { 
        // get the 'fully qualified (ie inc namespace) type name' into an 
        // array
        string[] typeInfo = typeName.Split('.');
        // because the last item is the class name, which we're going to 
        // 'look for' in *this* namespace/assembly
        string className=typeInfo[typeInfo.Length -1];
        if (className.Equals("Catalog"))
        {
            return typeof (Catalog);
        }
        else if (className.Equals("Word"))
        {
            return typeof (Word);
        }
        if (className.Equals("File"))
        {
            return typeof (File);
        }
        else
        {    // pass back exactly what was passed in!
            return Type.GetType(string.Format( "{0}, {1}", typeName, 
                                assemblyName));
        }
    } 
}

Et Voila! Now that the Catalog can be saved/loaded, the search engine is much more robust than before. You can save/back-up the Catalog, turn on debugging to see its contents, and even generate it on a different machine (say a local PC) and upload to your web server!

Using the 'debug' Xml serialized files, for the first time I could the contents of the Catalog, and I found lots of 'garbage' was being stored that was both wasteful in terms of memory/disk, but also useless/unsearchable. With the major task for this release complete, it seemed appropriate to do some bugfixes and add some "real search engine" features to clean up the Catalog's contents.

New features & bug fixes

FRAME and IFRAME support

CodeProject member le_mo_mo pointed out that the spider did not follow (and index) framed content. This was a minor change to the regex that finds links - previously A and AREA tags were supported, so it was simple enough to add FRAME and IFRAME to the pattern.

foreach (Match match in Regex.Matches(htmlData
   , @"(?<anchor><\s*(a|area|frame|iframe)\" + 
     @"s*(?:(?:\b\w+\b\s*(?:=\s*(?:""[^""]*""|'[^']" + 
     @"*'|[^""'<> ]+)\s*)?)*)?\s*>)"
   , RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture))
{

Stop words

Let's start with Google's definition of Stop Words:

Google ignores common words and characters, such as "where" and "how," as well as certain single digits and single letters. These terms rarely help narrow a search and can slow search results. We call them "stop words."

The basic premise is that we don't want to waste space in the catalog storing data will never be used, the 'Stop Words' assumption is that you'll never search for words like "a in at I" because they appear on almost every page, and therefore don't actually help you find anything!

Here's a basic definition from MIT and some interesting statistics and Stop Word thoughts including the 'classic' Stop Word conundrum: should users be able to search for Hamlet's soliloquy "to be or not to be"?

The Stop Word code supplied with Searcharoo3 is pretty basic - it strips out ALL one and two letter words, plus

the, and, that, you, this, for, but, with, are, have, was, out, not

A more complex implementation is left for others to contribute (or a future version, whichever comes first).

Word normalization

I had noticed words were often being stored with any punctuation that was adjacent to them in the source text. For example, the Catalog contained Files with Word instances for

"People

people

people*

people

This prevented the pages containing those words from ever being returned in a search, unless the user had typed the exact punctuation as well - in the above example a search for people would only return one page, when you would expect it to return all four pages.
The previous version of Searcharoo did have a 'black list' of punctuation [,./?;:()-=etc] but that wasn't sufficient as I could not predict/foresee all possible punctuation characters. Also, it was implemented with the Trim() method which was not parsing out punctuation within words [aside: the handling of parenthesised words is still not satisfactory in version 3]. The following 'white list' of characters that are allowed to be indexed ensures that NO punctuation is accidentally stored as part of a word.

key = System.Text.RegularExpressions.Regex.Replace(key, @"[^a-z0-9,.]"
    , ""
    , System.Text.RegularExpressions.RegexOptions.IgnoreCase);

Culture note: this "white list" method of removing punctuation is VERY English-language centric, as it will remove at least some characters from most European languages, and it will strip ALL content from most Asian-language content.
If you want to use Searcharoo with non-English character sets, you should find the above line of code and REPLACE it with this "black list" from Version 2. While it allows more characters to be searched, the results are more likely to be polluted by punctuation which could reduce searchability.

key = word.Trim
  (' ','?','\"',',','\'',';',':','.','(',')','[',']','%','*','$','-').ToLower();

Number normalization

Numbers are a special case of word normalization: some punctuation is required to interpret the number (eg decimal point), then convert it to a proper number.
Although not perfect, this means phone numbers written as 0412-345-678 or (04)123-45678 would both be Catalogued as 0412345678 and therefore searching for either 0412-345-678 or (04)123-45678 would match both source documents.

private bool IsNumber (ref string word)
{
    try
    {
        long number = Convert.ToInt64(word); //;int.Parse(word);
        word = number.ToString();
        return (word!=String.Empty);//true;
    }
    catch
    {
        return false;
    }
}

Go words

After reading the Word Normalization section above you can see how cataloging and searching for a technical term/phrase (like C# or C++) is impossible - the non-alphanumeric characters are filtered out before they have a chance to be catalogued.

To avoid this, Searcharoo allows a 'Go words' list to be created. A 'Go word' is the opposite of a 'Stop word': instead of being blocked from cataloguing, it is given a free-pass into the catalog, bypassing the Normalization and Stemming code.

The weakness in this approach is that you must know ahead of time all the different Go words that your users might search for. In future, you might want to store each unsuccessful search term for later analysis and expansion of your Go word list. The Go word implementation is very simple:

public bool IsGoWord (string word)
{
    switch (word.ToLower())
    {
        case "c#":
        case "vb.net":
        case "asp.net":
            return true;
            break;
    }
    return false;
}

Stemming

The most basic explanation of 'stemming' is that it attempts to identify 'related' words and return them in response to a query. The simplest example is plurals: searching for "field" should also find instances of "fields" and vice versa. More complex examples are "realize" and "realization", "populate" and "population" - the

This page on How a Search Engine Works contains a brief explanation of Stemming and some of the other techniques described above.

The Porter Stemming Algorithm already existed as a C# class, so was utilized 'as is' in Searcharoo3 (credit and thanks to Martin Porter).

Affect on Catalog size

The Stop Words, Stemming, and Normalization steps above were all developed to 'tidy up' the Catalog and hopefully reduce its size/increase search speed. The results are listed below for our CIA World Factbook:

source: 800 files 52.6 Mb	Raw *	+ Stop words	+ Stemming	+'white list' normalization
Unique Words	30,415	30,068	26,560	26,050
Xml Serialized	156 Mb ^	149 Mb	138 Mb	136 Mb
Binary Serialized	4.6 Mb	4.5 Mb	4.1 Mb	4.0 Mb
Binary % of source	8.75%	8.55%	7.79%%	7.60%

* black list normalization, which is commented out in the code, and mentioned in the 'culture note'
^ 49 Mb after 'compressing' the Xml output with [Attributes]

The result was a 14% reduction in the number of words and a 13% decrease in Binary file size (mostly due to the addition of Stemming). Because the whole Catalog stays in memory (in the Application Cache) keeping the size small is important - maybe a future version will be able to persist some 'working copy' of the data to disk and enable spidering of really large sites, but for now the catalog seems to take less than 10% of the source data size.

...but what about the UI?

The search user interface also had some improvements:

Moving the search inputs into the Searcharoo.ascx User Control
Adding the same Stemming, Stop and Go word parsing to the search term that is applied during spidering
Generating the result list using the new ResultFile class to construct a DataSource to bind to a Repeater control
Adding PagedDataSource and custom paging links rather than one long list of results (thanks to Jim Harkin's feedback/code and uberasp.net)

ResultFile and SortedList

In version 2, outputting the results was very crude: the code was littered with Response.Write calls making it difficult to reformat the output. Jim Harkins posted some Visual Basic code which is converted to C# below.

// build each result row
foreach (object foundInFile in finalResultsArray.Keys)
{
    // Create a ResultFile with it's own Rank
    infile      = new ResultFile ((File)foundInFile);
    infile.Rank = (int)((DictionaryEntry)finalResultsArray[foundInFile]).Value;
    sortrank = infile.Rank * -1000;        // Assume not 'thousands' of results
    if (output.Contains(sortrank) )
    { // rank exists - drop key index one number until it fits
        for (int i = 1; i < 999; i++)
        {
            sortrank++;
            if (!output.Contains (sortrank))
            {
                output.Add (sortrank, infile);
                break;
            }
        }
    } else {
        output.Add(sortrank, infile);
    }
    sortrank = 0;  // reset for next pass
}

Jim's code does some trickery with a new 'sortrank' variable to try and keep the files in 'Searcharoo rank' order, but with unique keys in the output SortedList. If thousands of results were returned, you might run into trouble...

PagedDataSource

Once the results are in the SortedList, assigned to a PagedDataSource which is then bound to a Repeater control on Searcharoo3.aspx.

SortedList output = 
    new SortedList (finalResultsArray.Count); // empty sorted list
...
pg.DataSource       = output.GetValueList();
pg.AllowPaging      = true;
pg.PageSize         = Preferences.ResultsPerPage; // defaults to 10 10;
pg.CurrentPageIndex = Request.QueryString["page"]==null?0:
    Convert.ToInt32(Request.QueryString["page"])-1;

SearchResults.DataSource = pg;
SearchResults.DataBind();

making it a LOT easier to reformat the results list however you like!

<asp:Repeater id="SearchResults" runat="server">
<HeaderTemplate>
    <p><%=NumberOfMatches%> results for <%=Matches%> took 
       <%=DisplayTime%></p>
</HeaderTemplate>
<ItemTemplate>
    <a href="<%# DataBinder.Eval(Container.DataItem, "Url") %>"><b>
    <%# DataBinder.Eval(Container.DataItem, "Title") %></b></a>
    <a href="<%# DataBinder.Eval(Container.DataItem, "Url") %>" 
        target=\"_blank\" title="open in new window" 
        style="font-size:x-small">↑</a>
    <font color=gray>(<%# DataBinder.Eval(Container.DataItem, "Rank") %>)
    </font>
    <br><%# DataBinder.Eval(Container.DataItem, "Description") %>...
    <br><font color=green><%# DataBinder.Eval(Container.DataItem, "Url") %>
    - <%# DataBinder.Eval(Container.DataItem, "Size") %>
    bytes</font>
    <font color=gray>- 
       <%# DataBinder.Eval(Container.DataItem, "CrawledDate") %></font><p>
</ItemTemplate>
<FooterTemplate>
    <p><%=CreatePagerLinks(pg, Request.Url.ToString() )%></p>
</FooterTemplate>
</asp:Repeater>

Unfortunately the page links are generated via embedded Response.Write calls in CreatePagerLinks... maybe this will be templated in a future version...

The Future...

If you check the dates below, you'll notice there was almost one and a half years between version 2 and 3, so it might sound optimistic to discuss another 'future' version - but you never know...

Unfortunately many of the new features above are English-language specific (although they can be disabled to ensure Searcharoo can still be used on other language websites). However in a future version I'd like to try making the code can be a little more intelligent about handling European, Asian and other languages.

It would also be nice if the user could type boolean OR searches, or group terms with quotes " " like Google, Yahoo, etc.

And finally, indexing of document types besides Html (mainly other web-types like PDF) would be useful for many sites.

ASP.NET 2.0

Searcharoo3 runs on ASP.NET 2.0 pretty much unmodified - just remove src="Searcharoo.cs" from the @Page attribute, and move the Searcharoo.cs file into the App_Code directory.

Visual Studio.NET internal web server warning: the Searcharoo_VirtualRoot setting (where the spider starts looking for pages to index) defaults to http://localhost/. VS.NET's internal web server chooses a random port to run on, so if you're using it to test Searcharoo, you may need to set this web.config value accordingly.

History

2004-06-30: Version 1 on CodeProject
2004-07-03: Version 2 on CodeProject
2006-05-24: Version 3 (this page) on CodeProject