|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Announcements
Want a new Job?
Chapters
Services
Feature Zones
|
BackgroundThis article follows on from the previous two Searcharoo samples: Searcharoo Version 1 describes building a simple search engine that crawls the file system from a specified folder, and indexes all HTML (or other known types) of document. A basic design and object model was developed to support simple, single-word searches, whose results were displayed ina rudimentary query/results page. Searcharoo Version 2 focused on adding a 'spider' to find data to index by following web links (rather than just looking at directory listings in the file system). This means downloading files via HTTP, parsing the HTML to find more links and ensuring we don't get into a recursive loop because many web pages refer to each other. This article also discusses how multiple search words results are combined into a single set of 'matches'. IntroductionThis article (version 3 of Searcharoo) covers three main areas:
New 'features' include:
The bug fixes include:
Code layout improvements included:
DesignThe fundamental Catalog-File-Word design remains unchanged (from Version 1), however there are quite a few extra classes implemented in this version.
To build the catalog, SearcharooSpider.aspx calls Spider.BuildCatalog() which:
Code StructureThese are the files used in this version (and contained in the download).
Saving the Catalog to DiskThere are a couple of reasons why saving the catalog to disk is useful:
There are two types of Serialization (Xml and Binary) available in the Framework, and since the Xml is 'human readable', that seemed the logical one to try. The code required to serialize the Catalog is very simple - the code below is from the Catalog.Save() method, so the reference to this is the Catalog object. XmlSerializer serializerXml = new XmlSerializer( typeof( Catalog ) );
System.IO.TextWriter writer
= new System.IO.StreamWriter( Preferences.CatalogFileName+".xml" );
serializerXml.Serialize( writer, this );
writer.Close();
The 'test dataset' I've mostly used is the CIA World Factbook (download) which is about 52.6 Mb on disk for the HTML only (not including images and non-searchable data) - so imagine my "surprise" when the Xml-Serialized-Catalog itself three times the size at 156 Mb (yes, megabytes!). Couldn't even open it easily, except by 'type'ing it from the Command Prompt.
OUCH - what a waste of space! And worse, this was the first time I'd noticed the fields defined in the File class were declared public and not private (see the elements beginning with underscors). Firstly, let's get rid of the serialized duplicates (fields that should be private, and their public property counterparts) -- rather than change the visibility (and pontentially break code), the [XmlIgnore] attribute can be added to the definition. To further reduce the amount of repeated text, the element names are compressed to single letters using the [XmlElement] attribute, and to reduce the number of <> some of the properties are marked to be serialized as [XmlAttribute]s. [Serializable]
public class Word
{
[XmlElement("t")] public string Text;
[XmlElement("fs")] public File[] Files
...
[Serializable]
public class File
{
[XmlIgnore] public string _Url;
...
[XmlAttribute("u")] public string Url { ...
[XmlAttribute("t")] public string Title { ...
[XmlElement("d")] public string Description { ...
[XmlAttribute("d")] public DateTime CrawledDate { ...
[XmlAttribute("s")] public long Size { ...
...
The Xml file is now a teeny (not!) 49 Mb in size, still too large for notepad but easily viewed via cmd. As you can see below, the 'compression' of the Xml certainly saved some space - at least the Catalog is now smaller than the source data!
Even with the smaller output, 49 Mb is of Xml is still a little too verbose to be practical (hardly a surprise really, Xml often is!) so let's serialize the index to a Binary format (again, the Framework classes make it really simple). System.IO.Stream stream = new System.IO.FileStream
(Preferences.CatalogFileName+".dat" , System.IO.FileMode.Create );
System.Runtime.Serialization.IFormatter formatter =
new System.Runtime.Serialization.Formatters.Binary.BinaryFormatter();
formatter.Serialize (stream, this);
stream.Close();
The results of changing to Binary Serialization were dramatic - the same catalog data was 4.6 Mb rather than 150! That's about 3% of the Xml size, definitely the way to go. Now that I had the Catalog being saved successfully to disk, it seemed like it would be a simple matter to re-load it back into memory & the Application Cache... Loading the Catalog from DiskUnfortunately, it was NOT that simple. Whenever the Application restarted (say web.config or Searcharoo.cs was changed), the code could not de-serialize the file but instead threw this cryptic error: Cannot find the assembly h4octhiw, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null At first I was stumped - I didn't have any assembly named h4octhiw, so it wasn't immediately apparent why it could not be found. There are a couple of hints though:
If your Google skills are honed, rather than the dozens of useless links returned when you search for ASP.NET "Cannot find the assembly" you'll be lucky and stumble across this CodeProject article on Serialization where you will learn a very interesting fact: Type information is also serialized while the class is serialized enabling the class to be deserialized using the type information. Type information consists of namespace, class name, assembly name, culture information, assembly version, and public key token. As long as your deserialized class and the class that is serialized reside in the same assembly it does not cause any problem. But if the serializer is in a separate assembly, .NET cannot find your class' type hence cannot deserialize it. But what does it mean? Every time the web/IIS 'Application' restarts, all your ASPX and src="" code is recompiled to a NEW, RANDOMLY NAMED assembly in \Temporary ASP.NET Files\. So although the Catalog class is based on the same code, its Type Information (namespace, class name, assembly name, culture information, assembly version, and public key token) is DIFFERENT! And, importantly, when a class is binary serialized, its Type Information is stored along with it (aside: this doesn't happen with Xml Serialization, so we probably would have been OK if we'd stuck with that). The upshot: after every recompile (whatever triggered it: web.config change, code change, IIS restart, machine reboot, etc) our Catalog class has different Type info - and when it tries to load the serialized version we saved earlier, it doesn't match and the Framework can't find the assembly where the previous Catalog Type is defined (since it was only Temporary and has been deleted when the recompile took place). Custom Formatter implementationSounds complex? It is, kinda, but the whole 'temporary assemblies' thing is something that happens invisibly and most developers don't need to know or care much about it. Thankfully we don't have to worry too much either, because the CodeProject article on Serialization also contains the solution: a helper class that 'tricks' the Binary Deserializer into using the 'current' Catalog type. public class CatalogBinder: System.Runtime.Serialization.SerializationBinder
{
public override Type BindToType (string assemblyName, string typeName)
{
// get the 'fully qualified (ie inc namespace) type name' into an
// array
string[] typeInfo = typeName.Split('.');
// because the last item is the class name, which we're going to
// 'look for' in *this* namespace/assembly
string className=typeInfo[typeInfo.Length -1];
if (className.Equals("Catalog"))
{
return typeof (Catalog);
}
else if (className.Equals("Word"))
{
return typeof (Word);
}
if (className.Equals("File"))
{
return typeof (File);
}
else
{ // pass back exactly what was passed in!
return Type.GetType(string.Format( "{0}, {1}", typeName,
assemblyName));
}
}
}
Et Voila! Now that the Catalog can be saved/loaded, the search engine is much more robust than before. You can save/back-up the Catalog, turn on debugging to see its contents, and even generate it on a different machine (say a local PC) and upload to your web server! Using the 'debug' Xml serialized files, for the first time I could the contents of the Catalog, and I found lots of 'garbage' was being stored that was both wasteful in terms of memory/disk, but also useless/unsearchable. With the major task for this release complete, it seemed appropriate to do some bugfixes and add some "real search engine" features to clean up the Catalog's contents. New features & bug fixesFRAME and IFRAME supportCodeProject member le_mo_mo pointed out that the spider did not follow (and index) framed content. This was a minor change to the regex that finds links - previously foreach (Match match in Regex.Matches(htmlData
, @"(?<anchor><\s*(a|area|frame|iframe)\" +
@"s*(?:(?:\b\w+\b\s*(?:=\s*(?:""[^""]*""|'[^']" +
@"*'|[^""'<> ]+)\s*)?)*)?\s*>)"
, RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture))
{
Stop wordsLet's start with Google's definition of Stop Words: Google ignores common words and characters, such as "where" and "how," as well as certain single digits and single letters. These terms rarely help narrow a search and can slow search results. We call them "stop words." The basic premise is that we don't want to waste space in the catalog storing data will never be used, the 'Stop Words' assumption is that you'll never search for words like "a in at I" because they appear on almost every page, and therefore don't actually help you find anything! Here's a basic definition from MIT and some interesting statistics and Stop Word thoughts including the 'classic' Stop Word conundrum: should users be able to search for Hamlet's soliloquy "to be or not to be"? The Stop Word code supplied with Searcharoo3 is pretty basic - it strips out ALL one and two letter words, plus the, and, that, you, this, for, but, with, are, have, was, out, not
A more complex implementation is left for others to contribute (or a future version, whichever comes first). Word normalizationI had noticed words were often being stored with any punctuation that was adjacent to them in the source text. For example, the Catalog contained Files with Word instances for
This prevented the pages containing those words from ever being returned in a search, unless the user had typed the exact punctuation as well - in the above example a search for people would only return one page, when you would expect it to return all four pages. key = System.Text.RegularExpressions.Regex.Replace(key, @"[^a-z0-9,.]"
, ""
, System.Text.RegularExpressions.RegexOptions.IgnoreCase);
Culture note: this "white list" method of removing punctuation is VERY English-language centric, as it will remove at least some characters from most European languages, and it will strip ALL content from most Asian-language content. key = word.Trim
(' ','?','\"',',','\'',';',':','.','(',')','[',']','%','*','$','-').ToLower();
Number normalizationNumbers are a special case of word normalization: some punctuation is required to interpret the number (eg decimal point), then convert it to a proper number. private bool IsNumber (ref string word)
{
try
{
long number = Convert.ToInt64(word); //;int.Parse(word);
word = number.ToString();
return (word!=String.Empty);//true;
}
catch
{
return false;
}
}
Go wordsAfter reading the Word Normalization section above you can see how cataloging and searching for a technical term/phrase (like C# or C++) is impossible - the non-alphanumeric characters are filtered out before they have a chance to be catalogued. To avoid this, Searcharoo allows a 'Go words' list to be created. A 'Go word' is the opposite of a 'Stop word': instead of being blocked from cataloguing, it is given a free-pass into the catalog, bypassing the Normalization and Stemming code. The weakness in this approach is that you must know ahead of time all the different Go words that your users might search for. In future, you might want to store each unsuccessful search term for later analysis and expansion of your Go word list. The Go word implementation is very simple: public bool IsGoWord (string word)
{
switch (word.ToLower())
{
case "c#":
case "vb.net":
case "asp.net":
return true;
break;
}
return false;
}
StemmingThe most basic explanation of 'stemming' is that it attempts to identify 'related' words and return them in response to a query. The simplest example is plurals: searching for "field" should also find instances of "fields" and vice versa. More complex examples are "realize" and "realization", "populate" and "population" - the This page on How a Search Engine Works contains a brief explanation of Stemming and some of the other techniques described above. The Porter Stemming Algorithm already existed as a C# class, so was utilized 'as is' in Searcharoo3 (credit and thanks to Martin Porter). Affect on Catalog sizeThe Stop Words, Stemming, and Normalization steps above were all developed to 'tidy up' the Catalog and hopefully reduce its size/increase search speed. The results are listed below for our CIA World Factbook:
* black list normalization, which is commented out in the code, and mentioned in the 'culture note' The result was a 14% reduction in the number of words and a 13% decrease in Binary file size (mostly due to the addition of Stemming). Because the whole Catalog stays in memory (in the Application Cache) keeping the size small is important - maybe a future version will be able to persist some 'working copy' of the data to disk and enable spidering of really large sites, but for now the catalog seems to take less than 10% of the source data size. ...but what about the UI?The search user interface also had some improvements:
ResultFile and SortedListIn version 2, outputting the results was very crude: the code was littered with // build each result row
foreach (object foundInFile in finalResultsArray.Keys)
{
// Create a ResultFile with it's own Rank
infile = new ResultFile ((File)foundInFile);
infile.Rank = (int)((DictionaryEntry)finalResultsArray[foundInFile]).Value;
sortrank = infile.Rank * -1000; // Assume not 'thousands' of results
if (output.Contains(sortrank) )
{ // rank exists - drop key index one number until it fits
for (int i = 1; i < 999; i++)
{
sortrank++;
if (!output.Contains (sortrank))
{
output.Add (sortrank, infile);
break;
}
}
} else {
output.Add(sortrank, infile);
}
sortrank = 0; // reset for next pass
}
Jim's code does some trickery with a new 'sortrank' variable to try and keep the files in 'Searcharoo rank' order, but with unique keys in the output PagedDataSourceOnce the results are in the SortedList, assigned to a SortedList output =
new SortedList (finalResultsArray.Count); // empty sorted list
...
pg.DataSource = output.GetValueList();
pg.AllowPaging = true;
pg.PageSize = Preferences.ResultsPerPage; // defaults to 10 10;
pg.CurrentPageIndex = Request.QueryString["page"]==null?0:
Convert.ToInt32(Request.QueryString["page"])-1;
SearchResults.DataSource = pg;
SearchResults.DataBind();
making it a LOT easier to reformat the results list however you like! <asp:Repeater id="SearchResults" runat="server">
<HeaderTemplate>
<p><%=NumberOfMatches%> results for <%=Matches%> took
<%=DisplayTime%></p>
</HeaderTemplate>
<ItemTemplate>
<a href="<%# DataBinder.Eval(Container.DataItem, "Url") %>"><b>
<%# DataBinder.Eval(Container.DataItem, "Title") %></b></a>
<a href="<%# DataBinder.Eval(Container.DataItem, "Url") %>"
target=\"_blank\" title="open in new window"
style="font-size:x-small">↑</a>
<font color=gray>(<%# DataBinder.Eval(Container.DataItem, "Rank") %>)
</font>
<br><%# DataBinder.Eval(Container.DataItem, "Description") %>...
<br><font color=green><%# DataBinder.Eval(Container.DataItem, "Url") %>
- <%# DataBinder.Eval(Container.DataItem, "Size") %>
bytes</font>
<font color=gray>-
<%# DataBinder.Eval(Container.DataItem, "CrawledDate") %></font><p>
</ItemTemplate>
<FooterTemplate>
<p><%=CreatePagerLinks(pg, Request.Url.ToString() )%></p>
</FooterTemplate>
</asp:Repeater>
Unfortunately the page links are generated via embedded The Future...If you check the dates below, you'll notice there was almost one and a half years between version 2 and 3, so it might sound optimistic to discuss another 'future' version - but you never know... Unfortunately many of the new features above are English-language specific (although they can be disabled to ensure Searcharoo can still be used on other language websites). However in a future version I'd like to try making the code can be a little more intelligent about handling European, Asian and other languages. It would also be nice if the user could type boolean OR searches, or group terms with quotes " " like Google, Yahoo, etc. And finally, indexing of document types besides Html (mainly other web-types like PDF) would be useful for many sites. ASP.NET 2.0Searcharoo3 runs on ASP.NET 2.0 pretty much unmodified - just remove Visual Studio.NET internal web server warning: the Searcharoo_VirtualRoot setting (where the spider starts looking for pages to index) defaults to http://localhost/. VS.NET's internal web server chooses a random port to run on, so if you're using it to test Searcharoo, you may need to set this web.config value accordingly. History
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||