Introduction
This article describes a simple, free, easy-to-install search page written in C#. The goal is to build a search tool that can be installed by placing three files on a website, and that can be easily extended with the features you might need for a local-site search engine. There are two main parts to a search engine:
This code 'builds' the catalog by traversing the file system from a starting directory; it does not request web pages via HTTP or parse the pages for internal links. That means it's only suitable for static websites.
Design
A Catalog contains a collection of Words, and each Word contains a reference to every File that it appears in.
The first step was to think about how to implement the catalog objects; in order to keep things simple,
You can see that some assumptions have been made in this model. Firstly, we store limited information about the
The
Lastly, the
There are two important assumptions which aren't immediately apparent from the model - there should only be one
Code Structure
Object Model [Searcharoo.cs]
This file contains the C# code that defines the object model for our catalog, including the methods to add and search words. These objects are used by both the crawler and the search page.
namespace Searcharoo.Net {
    public class Catalog {
        private System.Collections.Hashtable index;
        public Catalog () {}
        public bool Add (string word, File infile, int position){}
        public Hashtable Search (string searchWord) {}
    }
    public class Word {
        public string Text;
        private System.Collections.Hashtable fileCollection;
        public Word (string text, File infile, int position) {}
        public void Add (File infile, int position) {}
        public Hashtable InFiles () {}
    }
    public class File {
        public string Url;
        public string Title;
        public string Description;
        public DateTime CrawledDate;
        public long Size;
        public File (string url, string title, string description,
                     DateTime datecrawl, long length) {}
    }
}
Listing 1 - Overview of the object model (interfaces only - implementation code has been removed).
Build the Crawler [SearcharooCrawler.aspx]
Now that we have a model and structure, what next? In the interests of 'getting something working', the first build task is to simulate how our 'build' process is going to find the files we want to search. There are two ways we can look for files.
The big search engines - Yahoo, Google, MSN - all spider the Internet to build their search catalogs. However, following links to find documents requires us to write an HTML parser that can find and interpret the links, and then follow them! That's a little too much for one article, so we're going to start with some simple file crawling code to populate our catalog. The great thing about our object model is that it doesn't really care if it is populated by Spidering or Crawling - it will work for either method; only the code that populates it will change.
Here is a simple method that we can use to locate the files we want to search by traversing the file system:
private void CrawlPath (string root, string path) {
    System.IO.DirectoryInfo m_dir = new System.IO.DirectoryInfo (path);
    // ### Look for matching files to summarise what will be catalogued ###
    foreach (System.IO.FileInfo f in m_dir.GetFiles(m_filter)) {
        Response.Write (path.Substring(root.Length) + @"\" + f.Name + "<br>");
    } // foreach
    // ### Recurse into each subdirectory ###
    foreach (System.IO.DirectoryInfo d in m_dir.GetDirectories()) {
        CrawlPath (root, path + @"\" + d.Name);
    } // foreach
}
Listing 2 - Crawling the file system
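The same traversal can also be tried outside an ASP.NET page, where there is no Response object to write to. The following is only a sketch under stated assumptions, not the Searcharoo code: the class FileWalker and its methods ListUrls and ToUrl are hypothetical names, and recursion is delegated to Directory.GetFiles with SearchOption.AllDirectories instead of the hand-written recursive call. Each matching file is returned as a URL below a base address.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper (not part of Searcharoo): lists the files under a
// root folder and maps each physical path to a URL under a base address.
public static class FileWalker
{
    // Returns a URL for every file below 'root' that matches 'filter'
    // (for example "*.html"), searching all subdirectories.
    public static List<string> ListUrls(string root, string filter, string baseUrl)
    {
        List<string> urls = new List<string>();
        string fullRoot = Path.GetFullPath(root);
        foreach (string file in Directory.GetFiles(fullRoot, filter, SearchOption.AllDirectories))
        {
            urls.Add(ToUrl(fullRoot, file, baseUrl));
        }
        urls.Sort(StringComparer.Ordinal); // stable, predictable order
        return urls;
    }

    // Converts a physical path below 'root' into a URL below 'baseUrl'.
    public static string ToUrl(string root, string file, string baseUrl)
    {
        string relative = file.Substring(root.Length)
            .TrimStart(Path.DirectorySeparatorChar, Path.AltDirectorySeparatorChar)
            .Replace('\\', '/');
        return baseUrl.TrimEnd('/') + "/" + relative;
    }
}
```

Calling FileWalker.ListUrls with a physical folder, the filter "*.html" and a base address such as "http://localhost/" would give the list of addresses to catalog. Returning a list rather than writing output directly keeps the traversal separate from the page that reports progress.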
Screenshot 1 - To test the file crawler, we downloaded the HTML from the CIA World FactBook.
Now that we are confident we can access the files, we need to process each one in order to populate the catalog. Firstly, let's be clear about what that process is:
There are three different coding tasks to do:
Getting (a) working was easy:
System.IO.DirectoryInfo m_dir = new System.IO.DirectoryInfo (path);
// Look for matching files
foreach (System.IO.FileInfo f in m_dir.GetFiles(m_filter)) {
    Response.Write (DateTime.Now.ToString("t") + " "
        + path.Substring(root.Length) + @"\"
        + f.Name );
    Response.Flush();
    fileurl = m_url +
        path.Substring(root.Length).Replace(@"\", "/")
        + "/" + f.Name;
    System.IO.StreamReader reader =
        System.IO.File.OpenText (path + @"\" + f.Name);
    fileContents = reader.ReadToEnd();
    reader.Close();
    // now use the fileContents to build the catalog...
Listing 3 - Opening the files
A quick Google helped find a solution to (b).
// ### Grab the <TITLE> ###
Match TitleMatch = Regex.Match( fileContents,
    "<title>([^<]*)</title>",
    RegexOptions.IgnoreCase | RegexOptions.Multiline );
filetitle = TitleMatch.Groups[1].Value;
// ### Parse out META data ###
Match DescriptionMatch = Regex.Match( fileContents,
    "<META NAME=\"DESCRIPTION\" CONTENT=\"([^<]*)\">",
    RegexOptions.IgnoreCase | RegexOptions.Multiline );
filedesc = DescriptionMatch.Groups[1].Value;
// ### Get the file SIZE (in characters) ###
filesize = fileContents.Length;
// ### Now remove HTML, convert to array,
// clean up words and index them ###
fileContents = stripHtml (fileContents);
Regex r = new Regex(@"\s+"); // matches any run of whitespace
string wordsOnly = stripHtml(fileContents);
// ### If no META DESC, grab start of file text ###
if (null==filedesc || String.Empty==filedesc) {
    if (wordsOnly.Length > 350)
        filedesc = wordsOnly.Substring(0, 350);
    else if (wordsOnly.Length > 100)
        filedesc = wordsOnly.Substring(0, 100);
    else
        filedesc = wordsOnly; // file is only short!
}
Listing 4 - Massage the file contents
And finally (c) involved a very simple Regular Expression or two, and suddenly we have the document as an array of words.
protected string stripHtml(string strHtml) {
    // Strips the HTML tags from strHtml
    System.Text.RegularExpressions.Regex objRegExp
        = new System.Text.RegularExpressions.Regex("<(.|\n)+?>");
    // Replace all tags with a space, otherwise words either side
    // of a tag might be concatenated
    string strOutput = objRegExp.Replace(strHtml, " ");
    // Replace any remaining < and > with &lt; and &gt;
    strOutput = strOutput.Replace("<", "&lt;");
    strOutput = strOutput.Replace(">", "&gt;");
    return strOutput;
}
Listing 5 - Remove HTML
Regex r = new Regex(@"\s+"); // matches any run of whitespace
wordsOnly = r.Replace(wordsOnly, " "); // compress all whitespace to one space
string [] wordsOnlyA = wordsOnly.Split(' '); // results in an array of words
Listing 6 - Remove unnecessary whitespace
To recap - we have the code that, given a starting directory, will crawl through it (and its subdirectories), opening each HTML file, removing the HTML tags, and putting the words into an array of strings. Now that we can parse each document into words, we can populate our Catalog!
Build the Catalog
All the hard work has been done in parsing the file. Building the catalog is as simple as adding the word, file, and position using our Catalog.Add method.
// ### Loop through words in the file ###
int i = 0; // Position of the word in the file (starts at zero)
string key = ""; // the 'word' itself
// Now loop through the words and add to the catalog
foreach (string word in wordsOnlyA) {
    key = word.Trim(' ', '?','\"', ',', '\'',
        ';', ':', '.', '(', ')').ToLower();
    m_catalog.Add (key, infile, i);
    i++;
} // foreach word in the file
Listing 7 - Add words to the catalog
As each file is processed, a line is written to the browser to indicate the catalog build progress, showing the
Screenshot 2 - Processing the CIA World FactBook - it contains 40,056 words according to our code.
After the last file is processed, the
Build the Search
The finished
/// <summary>Returns all the Files which
/// contain the searchWord</summary>
/// <returns>Hashtable</returns>
public Hashtable Search (string searchWord) {
    // apply the same 'trim' as when we're building the catalog
    searchWord = searchWord.Trim('?','\"', ',', '\'',
        ';', ':', '.', '(', ')').ToLower();
    Hashtable retval = null;
    if (index.ContainsKey (searchWord) ) { // does all the work !!!
        Word thematch = (Word)index[searchWord];
        retval = thematch.InFiles(); // return the collection of File objects
    }
    return retval;
}
Listing 8 - the Search method
The key point is how simple the
Obviously, there are a number of enhancements we could make here, starting with multiple word searches (finding the intersection of the
Build the Results [Searcharoo.aspx]
Searcharoo.aspx initially displays an HTML form to allow the user to enter the search term.
Screenshot 3 - Enter the search term
When this form is submitted, we look for the Word in the index. The display process has been broken into a few steps below.
Firstly, we call the Search method:
// Do the search
Hashtable searchResultsArray = m_catalog.Search(searchterm);
// Format the results
if (null != searchResultsArray) {
Listing 9 - The actual search is the easy bit
// intermediate data-structure for 'ranked' result HTML
SortedList output = new SortedList(searchResultsArray.Count);
// empty sorted list
DictionaryEntry fo;
File infile;
string result="";
// build each result row
foreach (object foundInFile in searchResultsArray) {
// build the HTML output in the sorted list, so the 'unsorted'
// searchResults are 'sorted' as they're added to the SortedList
fo = (DictionaryEntry)foundInFile;
infile = (File)fo.Key;
int rank = (int)fo.Value;
Listing 10 - Processing the results
// Create the formatted output HTML
result = ("<a href=" + infile.Url + ">");
result += ("<b>" + infile.Title + "</b></a>");
result += (" <a href=" + infile.Url + " target=\"_TOP\" ");
result += ("title=\"open in new window\" style=\"font-size:xx-small\">↑</a>");
result += (" <font color=gray>("+rank+")</FONT>");
result += ("<br>" + infile.Description + "..." ) ;
result += ("<br><font color=green>"
+ infile.Url + " - " + infile.Size);
result += ("bytes</font> <font color=gray>- "
+ infile.CrawledDate + "</font><p>" ) ;
Listing 11 - Pure formatting
Before we can output the results, we need to get them in some order. We'll use a SortedList, keyed on the rank:
// multiply by -1 so larger score goes to the top
int sortrank = (rank * -1);
if (output.Contains(sortrank) )
{
// rank exists; concatenate same-rank output strings
output[sortrank] = ((string)output[sortrank]) + result;
}
else
{
output.Add(sortrank, result);
}
result = ""; // clear string for next loop
Listing 12 - Sorting the results by rank
To make sure the highest rank appears at the top of the list, the rank is multiplied by -1. Now, all we have to do is write the sorted rows to the Response:
// Now output to the HTML Response
foreach (object rows in output) { // Already sorted!
Response.Write ( (string)((DictionaryEntry)rows).Value );
}
Response.Write("<p>Matches: " + searchResultsArray.Count);
} else {
Response.Write("<p>Matches: 0");
}
Response.Write ("<p><a href=#top>? top");
Response.End(); // Stop here
Listing 13 - Output the results
The output should look familiar to any web search engine user. We've implemented a simple ranking mechanism (a word count, shown in parentheses after the Title/URL); however, it doesn't support paging.
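To see the word-count ranking in isolation, the sketch below reduces the catalog to its bare essentials. It is an illustration under assumptions rather than the Searcharoo implementation: MiniCatalog is a hypothetical class, files are identified by URL only, generic dictionaries stand in for the Hashtable collections, and the results are ordered with an explicit sort instead of the negated-rank SortedList used in the listings above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified stand-in for the Catalog/Word/File classes:
// a file is just its URL, and the rank is the word count in that file.
public class MiniCatalog
{
    // word -> (file URL -> number of occurrences)
    private readonly Dictionary<string, Dictionary<string, int>> index =
        new Dictionary<string, Dictionary<string, int>>();

    // Same characters that the article's code trims from each word.
    private static readonly char[] Punctuation =
        { ' ', '?', '"', ',', '\'', ';', ':', '.', '(', ')' };

    private static string Normalize(string word)
    {
        return word.Trim(Punctuation).ToLowerInvariant();
    }

    // Records one occurrence of 'word' in the file at 'url'.
    public void Add(string word, string url)
    {
        string key = Normalize(word);
        if (key.Length == 0) return; // token was only punctuation
        Dictionary<string, int> files;
        if (!index.TryGetValue(key, out files))
        {
            files = new Dictionary<string, int>();
            index[key] = files;
        }
        int count;
        files.TryGetValue(url, out count); // count stays 0 for a new file
        files[url] = count + 1;
    }

    // Returns the files containing 'word', highest count first;
    // ties are ordered by URL so the output is deterministic.
    public List<KeyValuePair<string, int>> Search(string word)
    {
        List<KeyValuePair<string, int>> results = new List<KeyValuePair<string, int>>();
        Dictionary<string, int> files;
        if (index.TryGetValue(Normalize(word), out files))
        {
            results.AddRange(files);
            results.Sort((a, b) =>
            {
                int byRank = b.Value.CompareTo(a.Value);
                return byRank != 0 ? byRank : string.CompareOrdinal(a.Key, b.Key);
            });
        }
        return results;
    }
}
```

Because Search returns an ordered list, a paging feature could, in principle, be layered on top by taking a slice of that list per page; the version here returns an empty list rather than null when nothing matches.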
Screenshot 4 - Search results contain a familiar amount of information, and the word-count-rank value. Clicking a link opens the local copy of the HTML file (the ↑ opens in a new window).
Using the sample code
The goal of this article was to build a simple search engine that you can install just by placing some files on your website; so you can copy Searcharoo.cs, SearcharooSpider.aspx and Searcharoo.aspx to your web root and away you go! However, that means you accept all the default settings, such as only searching .HTML files, and the search starting from the location of the Searcharoo files.
To change those defaults, you need to add some settings to web.config:
<appSettings>
<!--physical location of files-->
<add key="Searcharoo_PhysicalPath" value="c:\Inetpub\wwwroot\" />
<!--base Url to build links-->
<add key="Searcharoo_VirtualRoot" value="http://localhost/" />
<!--allowed file extension-->
<add key="Searcharoo_FileFilter" value="*.html"/>
</appSettings>
Listing 14 - web.config
Then simply navigate to http://localhost/Searcharoo.aspx (or wherever you put the Searcharoo files) and it will build the catalog for the first time. If your application restarts for any reason (e.g., you compile code into the /bin/ folder, or change web.config settings), the catalog will need to be rebuilt - the next user who performs a search will trigger the catalog build. This is accomplished by checking if the Cache contains a valid
Future
In the real world, most ASP.NET websites probably have more than just HTML pages, including links to DOC, PDF or other external files, and ASPX dynamic/database-generated pages. The other issue you might have is storing a large blob of data in your Application Cache. For most websites, the size of this object will be manageable - but if you've got a lot of content, you might not want that in memory all the time. The good news is, the code above can be easily extended to cope with these additional scenarios (including spidering web links, and using a database to store the catalog)... Check back here at CodeProject, or ConceptDevelopment.NET for future articles.
History