Background
This article follows on from the previous three Searcharoo samples:
Searcharoo Version 1 describes building a simple search engine that crawls the file system from a specified folder and indexes all HTML documents (or other known types). A basic design and object model was developed to support simple, single-word searches, whose results were displayed in a rudimentary query/results page.
Searcharoo Version 2 focused on adding a 'spider' to find data to index by following web links (rather than just looking at directory listings in the file system). This means downloading files via HTTP, parsing the HTML to find more links and ensuring we don't get into a recursive loop because many web pages refer to each other. That article also discusses how the results for multiple search words are combined into a single set of 'matches'.
Searcharoo Version 3 implemented a 'save to disk' function for the catalog, so it could be reloaded across IIS application restarts without having to be regenerated each time. It also spidered FRAMESETs and added Stop words, Go words and Stemming to the indexer. A number of bugs reported via CodeProject were also fixed.
Introduction to version 4
Version 4 of Searcharoo has changed in the following ways (often prompted by CodeProject members):
Some things to note
Design & Refactoring
This made it difficult to add the new functionality required for supporting IFilter (or any other document types we might like to add) that don't have the same attributes as an Html page. To 'fix' this design flaw, I pulled out all the Html-specific code from
You can see how much neater the The new
The first 'new' class is To demonstrate just how easy it was to extend this design to support IFilter, the FilterDocument class inherits pretty much everything from
public override void Parse()
{
// no parsing (for now).
}
public override bool GetResponse (System.Net.HttpWebResponse webresponse)
{
System.IO.Stream filestream = webresponse.GetResponseStream();
this.Uri = webresponse.ResponseUri;
string filename = System.IO.Path.Combine(Preferences.DownloadedTempFilePath
, (System.IO.Path.GetFileName(this.Uri.LocalPath)));
this.Title = System.IO.Path.GetFileNameWithoutExtension(filename);
using (System.IO.BinaryReader reader = new System.IO.BinaryReader(filestream))
{ // we must use BinaryReader to avoid corrupting the data
using (System.IO.FileStream iofilestream
= new System.IO.FileStream(filename, System.IO.FileMode.Create))
{ // we must save the stream to disk in order to use IFilter
int BUFFER_SIZE = 1024;
byte[] buf = new byte
And there you have it - indexing and searching of Word, Excel, Powerpoint, PDF and more in one easy class... all the indexing and search results display work as before, unmodified!
"Rest of the Code" Structure
The refactoring extended way beyond the HtmlDocument class. The 31 or so files are now organised into five (5!) projects in the solution:
New features & bug fixes
I, robots.txt
Previous versions of Searcharoo only looked in Html Meta tags for robot directives - the robots.txt file was ignored. Now that we can index non-Html files, however, we need the added flexibility of disallowing search in certain places. robotstxt.org has further reading on how the scheme works. The
Function 1 is accomplished in the Function 2 is exposed by the
public bool Allowed (Uri uri)
{
if (_DenyUrls.Count == 0) return true;
string url = uri.AbsolutePath.ToLower();
foreach (string denyUrlFragment in _DenyUrls)
{
if (url.Length >= denyUrlFragment.Length)
{
if (url.Substring(0, denyUrlFragment.Length) == denyUrlFragment)
{
return false;
} // else not a match
} // else url is shorter than fragment, therefore cannot be a 'match'
}
if (url == "/robots.txt") return false;
// no disallows were found, so allow
return true;
}
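The method above only consumes a list of deny fragments; how that list gets filled is not shown here. As a rough sketch (not Searcharoo's actual code), the following assumes a hypothetical helper, RobotsTxtSketch.ParseDenyUrls, that collects the Disallow paths listed under "User-agent: *" and lower-cases them so they can be compared the same way Allowed does. It deliberately ignores grouped User-agent lines, Allow rules and wildcards.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper for illustration only; the class and method names
// are assumptions and do not come from the Searcharoo source.
public static class RobotsTxtSketch
{
    // Collects Disallow paths that apply to every robot ("User-agent: *").
    public static List<string> ParseDenyUrls(string robotsTxt)
    {
        var denyUrls = new List<string>();
        bool appliesToAll = false;
        using (var reader = new StringReader(robotsTxt))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // strip trailing comments and surrounding whitespace
                int hash = line.IndexOf('#');
                if (hash >= 0) line = line.Substring(0, hash);
                line = line.Trim();

                int colon = line.IndexOf(':');
                if (colon < 0) continue; // blank or malformed line

                string field = line.Substring(0, colon).Trim().ToLowerInvariant();
                string value = line.Substring(colon + 1).Trim();

                if (field == "user-agent")
                {
                    appliesToAll = (value == "*");
                }
                else if (field == "disallow" && appliesToAll && value.Length > 0)
                {
                    // an empty Disallow means "nothing is disallowed", so skip it
                    denyUrls.Add(value.ToLowerInvariant());
                }
            }
        }
        return denyUrls;
    }

    // Same prefix test as the article's Allowed method, taking the list as a parameter.
    public static bool Allowed(Uri uri, List<string> denyUrls)
    {
        string url = uri.AbsolutePath.ToLowerInvariant();
        if (url == "/robots.txt") return false;
        foreach (string fragment in denyUrls)
        {
            if (url.StartsWith(fragment, StringComparison.Ordinal)) return false;
        }
        return true;
    }
}
```

With a robots.txt containing "User-agent: *" followed by "Disallow: /private/", a Uri such as http://example.com/private/a.html would be rejected, while http://example.com/public/a.html would be allowed.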
There is no explicit parsing of
Ignoring a NOSEARCHREGION
In
if (Preferences.IgnoreRegions)
{
string noSearchStartTag = "<!--" + Preferences.IgnoreRegionTagNoIndex +
"-->";
string noSearchEndTag = "<!--/" + Preferences.IgnoreRegionTagNoIndex +
"-->";
string ignoreregex = noSearchStartTag + @"[\s\S]*?" + noSearchEndTag;
System.Text.RegularExpressions.Regex ignores =
new System.Text.RegularExpressions.Regex(ignoreregex
, RegexOptions.IgnoreCase | RegexOptions.Multiline |
RegexOptions.ExplicitCapture);
ignoreless = ignores.Replace(styleless, " ");
// replaces the whole commented region with a space
}
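To see what the replacement does to a page, here is a small standalone sketch of the same idea. The class name IgnoreRegionSketch and the tag name passed in are assumptions for illustration; the real tag comes from Preferences.IgnoreRegionTagNoIndex, whose value is not shown here. Unlike the snippet above, this version passes the tags through Regex.Escape, a defensive addition in case the configured tag name ever contains regex metacharacters.

```csharp
using System.Text.RegularExpressions;

// Illustrative only: the class name and parameters are assumptions.
public static class IgnoreRegionSketch
{
    // Replaces every <!--TAG--> ... <!--/TAG--> region with a single space.
    public static string StripRegions(string html, string tagName)
    {
        string startTag = Regex.Escape("<!--" + tagName + "-->");
        string endTag = Regex.Escape("<!--/" + tagName + "-->");

        // [\s\S]*? matches any characters, including newlines, as lazily as
        // possible, so two separate regions are not merged into one big match
        var ignores = new Regex(startTag + @"[\s\S]*?" + endTag,
                                RegexOptions.IgnoreCase);
        return ignores.Replace(html, " ");
    }
}
```

For example, assuming the tag is NOSEARCHREGION, the input `<p>keep</p><!--NOSEARCHREGION--><ul>menu</ul><!--/NOSEARCHREGION--><p>also</p>` comes back as `<p>keep</p> <p>also</p>`, so the menu text never reaches the indexer.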
Links inside the region are still followed - to stop the Spider searching specific links, use robots.txt.
Follow Javascript 'links'
In
if ("onclick" == submatch.Groups[1].ToString().ToLower())
{ // maybe try to parse some javascript in here
string jscript = submatch.Groups[2].ToString();
// some code here to extract a filename/link to follow from the
// onclick="_____"
int firstApos = jscript.IndexOf("'");
int secondApos = jscript.IndexOf("'", firstApos + 1);
if (secondApos > firstApos)
{
link = jscript.Substring(firstApos + 1, secondApos - firstApos - 1);
}
}
It would be almost impossible to predict the infinite variety of javascript links being used, but this code should hopefully provide a basis for people to modify to suit their own site (most likely if tricky menu image rollovers or something similar bypass the regular href behaviour). At worst it will extract something that isn't a real page and get a 404 error...
Multilingual 'option'
Culture note: in the last version I was really focussed on reducing the index size (and therefore the size of the Catalog on disk and in memory). To that end, I hardcoded the following
I've tried to improve the 'usability' of that a bit, by making it an option in the .config
<add key="Searcharoo_AssumeAllWordsAreEnglish" value="true" />
which governs this method in the Spider:
private void RemovePunctuation(ref string word)
{ // this stuff is a bit 'English-language-centric'
if (Preferences.AssumeAllWordsAreEnglish)
{ // if all words are english, this strict parse to remove all
// punctuation ensures words are reduced to their least
// unique form before indexing
word = System.Text.RegularExpressions.Regex.Replace(word,
@"[^a-z0-9,.]", "",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
}
else
{ // by stripping out this specific list of punctuation only,
// there is potential to leave lots of cruft in the word
// before indexing BUT this will allow any language to be indexed
word = word.Trim
(' ','?','\"',',','\'',';',':','.','(',')','[',']','%','*','$','-');
}
}
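The difference between the two branches is easiest to see on a word containing a non-English letter. The sketch below restates the method as a pure function so it can be tried in isolation; the class name PunctuationSketch and the boolean parameter (standing in for Preferences.AssumeAllWordsAreEnglish) are assumptions for illustration.

```csharp
using System.Text.RegularExpressions;

// Illustrative restatement of the two branches; names are assumptions.
public static class PunctuationSketch
{
    public static string RemovePunctuation(string word, bool assumeAllWordsAreEnglish)
    {
        if (assumeAllWordsAreEnglish)
        {
            // strict: drop every character that is not a-z, 0-9, comma or
            // full stop, wherever it appears in the word
            return Regex.Replace(word, @"[^a-z0-9,.]", "", RegexOptions.IgnoreCase);
        }
        // lenient: only trim the listed punctuation from both ends,
        // leaving everything inside the word untouched
        return word.Trim(' ', '?', '"', ',', '\'', ';', ':', '.',
                         '(', ')', '[', ']', '%', '*', '$', '-');
    }
}
```

Given the input `(café)`, the strict branch yields `caf` (the accented letter is discarded along with the brackets), while the lenient branch yields `café`. That loss of accented letters is why the option needs to be switched off for non-English content.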
In future I'd like to make Searcharoo more language aware, but for now hopefully this will at least make it possible to use the code in a non-English-language environment.
Searcharoo.Indexer.EXE
The console application is a wrapper that performs the exact same function as
clip = new CommandLinePreferences();
clip.ProcessArgs(args);
Spider spider = new Spider();
spider.SpiderProgressEvent += new SpiderProgressEventHandler(OnProgressEvent);
Catalog catalog = spider.BuildCatalog(new Uri(Preferences.StartPage));
That's almost identical to the The other code you'll find in the Searcharoo.Indexer project relates to parsing the command line arguments using the
What it actually does when it's running looks like this:
Just as with
NOTE: the exe has its own
References
There's a lot to read about IFilter and how it works (or doesn't work, as the case may be). Start with Using IFilter in C#, and its references: Using IFilter in C# by bypassing COM for references to LoadIFilter, IFilter.org and IFilter Explorer.
Searcharoo now has its own site - searcharoo.net - where you can actually try a working demo, and possibly find small fixes and enhancements that aren't groundbreaking enough to justify a new CodeProject article...
Wrap-up
Hopefully you find the new features useful and the article relevant. Thanks again to the authors of the other open-source projects used in Searcharoo.
History