|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Announcements
Chapters
Services
Feature Zones
|
IntroductionArticle I describes building a simple search engine that crawls the file system from a specified folder, and indexing all HTML (or other types) of documents. A basic design and object model was developed as well as a query/results page which you can see here. This second article in the series discusses replacing the 'file system crawler' with a 'web spider' to search and catalog a website by following the links in the HTML. The challenges involved include:
DesignThe design from Article I remains unchanged... A Catalog contains a collection of Words, and each Word contains a reference to every File that it appears in. ... the object model is the same too...
What has changed is the way the Catalog is populated. Instead of looping through folders in the file system to look for files to open, the code requires the URL of a start page which it will load, index, and then attempt to follow every link within that page, indexing those pages too. To prevent the code from indexing the entire Internet (in this version), it only attempts to download pages on the same server as the start page. Code StructureSome of the code from Article I will be referenced again, but we've added a new page - SearcharooSpider.aspx - that does the HTTP access and HTML link parsing [making the code that walks directories in the file system - SearcharooCrawler.aspx -obsolete]. We've also changed the name of the search page to SearcharooToo.aspx so you can use it side-by-side with the old one.
There are three fundamental tasks for a search spider:
The big search engines - Yahoo, Google, MSN - all 'spider' the Internet to build their search catalogs. Following links to find documents requires us to write an HTML parser that can find and interpret the links, and then follow them! This includes being able to follow HTTP-302 redirects, recognizing the type of document that has been returned, determining what character set/encoding was used (for Text and HTML documents), etc. - basically a mini-browser! We'll start small, and attempt to build a passable spider using C#... Build the Spider [SearcharooSpider_alpha.aspx]Getting Started - Downloading a PageTo get something working quickly, let's just try to download the 'start page' - say the root page of the local machine (i.e., Step 2 - downloading pages). Here is the simplest possible code to get the contents of an HTML page from a website (localhost in this case): using System.Net;
/*...*/
string url = "http://localhost/"; // just for testing
WebClient browser = new WebClient();
UTF8Encoding enc = new UTF8Encoding();
string fileContents = enc.GetString(browser.DownloadData(url));
Listing 1 - Simplest way to download an HTML document The first thing to notice is the inclusion of the Despite those problems, we now have the full text of the 'start page' in a variable. That means, we can begin to work on the code for Step 1 - finding pages to index. Parsing the pageThere are two options (OK, probably more, but two main options) for parsing the links (and other data) out of HTML:
Although I suspect "commercial" search engines might use option 1 (building a DOM), it's much simpler to use Regular Expressions. Because my initial test website had very-well-formed HTMl, I could get away with this code: // Create ArrayLists to hold the links we find...
ArrayList linkLocal = new ArrayList();
ArrayList linkExternal = new ArrayList();
// Dodgy Regex will find *some* links
foreach (Match match in Regex.Matches(htmlData
, @"(?<=<(a|area)\s+href="").*?(?=""\s*/?>)"
, RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture)) {
// Regex matches from opening "quote
link = match.Value;
// find first space (ie no spaces in Url)
int spacePos = link.IndexOf(' ');
// or first closing quote (NO single quotes)
int quotePos = link.IndexOf('"');
int chopPos = (quotePos<spacePos?quotePos:spacePos);
if (chopPos > 0) {
// chopPos if quote or space first the at URL end
link = link.Substring(0,chopPos);
}
if ( (link.Length > 8) &&
(link.Substring(0, 7).ToLower() == "http://") ) {
// Assumes all links beginning with http:// are _external_
linkExternal.Add(link) ;
} else {
// otherwise they're "relative"/internal links
// so we concatenate the base URL
link = startingUrl + link;
linkLocal.Add(link);
}
} // end looping through Matches of the 'link' pattern in the HTML data
Listing 2 - Simplest way to find links in a page As with the first cut of page-downloading, there are a number of problems with this code. Firstly, the Regular Expression used to find the links is *very* restrictive, i.e., it will find - <a href="News.htm">News</a>
<area href="News.htm" shape="rect" coords="0,0,110,20">
- because the <a href='News.htm'>News</a>
<a href=News.htm>News</a>
<a class="cssLink" href="News.htm">News</a>
<area shape="rect" coords="0,0,110,20" href="News.htm">
<area href='News.htm' shape="rect" coords="0,0,110,20">
It will also attempt to use 'internal page links' (beginning with Downloading More PagesThe basic code is shown below - comments show where additional code is required, either from the listings above or in Article I. protected void Page_Load (object sender, System.EventArgs e) {
/* The initial function call */
startingPageUrl = "http://localhost/"; // Get from web.config
parseUrl (startingPageUrl, new UTF8Encoding(), new WebClient() );
}
/* This is called recursively for EVERY link we find */
public void parseUrl (string url, UTF8Encoding enc, WebClient browser) {
if (visited.Contains(url)) {
// Url already spidered, skip and go to next link
Response.Write ("<br><font size=-2>
"+ url +" already spidered</font>");
} else {
// Add this URL to the 'visited' list, so we'll
// skip it if we come across it again
visited.Add(url);
string fileContents =
enc.GetString (browser.DownloadData(url)); // from Listing 1
// ### Pseudo-code ###
// 1. Find links in the downloaded page
// (add to linkLocal ArrayList - code in Listing 2)
// 2. Extract <TITLE> and <META> Description,
// Keywords (as Version 1 Listing 4)
// 3. Remove all HTML and whitespace (as Version 1)
// 4. Convert words to string array,
// and add to catalog (as Version 1 Listing 7)
// 5. If any links were found, recursively call this page
if (null != pmd.LocalLinks)
foreach (object link in pmd.LocalLinks) {
parseUrl (Convert.ToString(link), enc, browser);
}
}
}
Listing 3 - Combining the link parsing and page downloading code. Review the three fundamental tasks for a search spider, and you can see we've developed enough code to build it:
Although the example above is picky about what links it will find, it will work to 'spider' and then search a website! FYI, the 'alpha version' of the code is available in the ZIP file along with the completed code for this article. The remainder of this article discusses the changes required to fix all the "problems" in the alpha version. Fix the Spider [SearcharooSpider.aspx]Problem 1 - Correctly parsing relative linksThe alpha code fails to follow 'relative' and 'absolute' links (e.g., "../../News/Page.htm" and "/News/Page2.htm" respectively) partly because it does not 'remember' what folder/subdirectory it is parsing. My first instinct was to build a new '
Solution: Uri classThe first lesson to learn when you have a class library at your disposal is look before you code. It was almost by accident that I stumbled across the new Uri (baseUri, relativeUri)
that does exactly what I need. No re-inventing the wheel! Problem 2 - Following redirectsFollowing relative links is made even more difficult because the Solution: HttpWebRequest & HttpWebResponse classesThe
which are set in the code to help us get the pages we want. Problem 3 - Using the correct character-set when downloading filesAssuming all web pages use ASCII will result in many pages being 'indexed' as garbage, because the bytes will be converted into 'random'-looking characters rather than the text they actually represent. Solution: HttpWebResponse and the Encoding namespaceThe if (webresponse.ContentEncoding != String.Empty) {
// Use the HttpHeader Content-Type
// in preference to the one set in META
htmldoc.Encoding = webresponse.ContentEncoding;
} else if (htmldoc.Encoding == String.Empty) {
// TODO: if still no encoding determined,
// try to readline the stream until we
// find either * META Content-Type
// or * </head> (ie. stop looking for META)
htmldoc.Encoding = "utf-8"; // default
}
//http://www.c-sharpcorner.com/Code/2003/Dec/ReadingWebPageSources.asp
System.IO.StreamReader stream = new System.IO.StreamReader
(webresponse.GetResponseStream(),
Encoding.GetEncoding(htmldoc.Encoding) );
// we *may* have been redirected... and we want the *final* URL
htmldoc.Uri = webresponse.ResponseUri;
htmldoc.Length = webresponse.ContentLength;
htmldoc.All = stream.ReadToEnd ();
stream.Close();
Listing 4 - Check the HTTP Content Encoding and use the correct Encoding class to process the Elsewhere in the code, we use the ContentType to parse out the MIME-type of the data, so that we can ignore images and stylesheets (and, for this version, Word, PDF, ZIP and other file types). Problem 4 - Does not recognize many valid link formatsWhen building the alpha code, I implemented the simplest Regular Expression I could find to locate links in a string - (?<=<(a|area)\s+href=").*?(?="\s*/?>). The problem is that it is far too dumb to find the majority of links. Solution: Smarter Regular ExpressionsRegular Expressions can be very powerful, and clearly a more complex expression was required. Not being an expert in this area, I turned to Google and eventually Matt Bourne who posted a couple of very useful Regex patterns, which resulted in the following code: // http://msdn.microsoft.com/library/en-us/script56/html/js56jsgrpregexpsyntax.asp
// Original Regex, just found <a href=""> links; and was "broken"
// by spaces, out-of-order, etc
// @"(?<=<a\s+href="").*?(?=""\s*/?>)"
foreach (Match match in Regex.Matches(htmlData,
@"(?<anchor><\s*(a|area)\s*(?:(?:\b\" +
@"w+\b\s*(?:=\s*(?:""[^""]*""|'[^']*'|[^""'<> ]+)\s*)?)*)?\s*>)"
, RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture)) {
// Parse ALL attributes from within tags...
// IMPORTANT when they're out of order!!
// in addition to the 'href' attribute, there might
// also be 'alt', 'class', 'style', 'area', etc...
// there might also be 'spaces' between the attributes
// and they may be ", ', or unquoted
link=String.Empty;
foreach (Match submatch in Regex.Matches(match.Value.ToString(),
@"(?<name>\b\w+\b)\s*=\s*(""(?<value" +
@">[^""]*)""|'(?<value>[^']*)'|(?<value>" +
@"[^""'<> \s]+)\s*)+",
RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture)) {
// we're only interested in the href attribute
// (although in future maybe index the 'alt'/'title'?)
if ("href" == submatch.Groups[1].ToString().ToLower() ) {
link = submatch.Groups[2].ToString();
break;
}
}
/* check for internal/external link
and supported scheme, then add to ArrayList */
} // foreach
Listing 5 - More powerful Regex matching
The combination of these two Regular Expressions makes the link parsing a lot more robust. Problem 5 - Poor META-tag handlingThe alpha has very rudimentary
Solution: Smarter Regular Expressions and support for more tagsUsing a variation of the Regular Expressions from Problem 4, the code parses out the string metaKey = String.Empty, metaValue = String.Empty;
foreach (Match metamatch in Regex.Matches (htmlData,
@"<meta\s*(?:(?:\b(\w|-)+\b\s*(?:=\s*(?:""[^""]*""|'" +
@"[^']*'|[^""'<> ]+)\s*)?)*)/?\s*>",
RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture)) {
metaKey = String.Empty;
metaValue = String.Empty;
// Loop through the attribute/value pairs inside the tag
foreach (Match submetamatch in Regex.Matches(metamatch.Value.ToString(),
@"(?<name>\b(\w|-)+\b)\s*=\s*(""(?<value>" +
@"[^""]*)""|'(?<value>[^']*)'|(?<value>[^""'<> ]+)\s*)+",
RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture)) {
if ("http-equiv" == submetamatch.Groups[1].ToString().ToLower() ) {
metaKey = submetamatch.Groups[2].ToString();
}
if ( ("name" == submetamatch.Groups[1].ToString().ToLower() )
&& (metaKey == String.Empty) ) {
// if already set, HTTP-EQUIV overrides NAME
metaKey = submetamatch.Groups[2].ToString();
}
if ("content" == submetamatch.Groups[1].ToString().ToLower() ) {
metaValue = submetamatch.Groups[2].ToString();
}
}
switch (metaKey.ToLower()) {
case "description":
htmldoc.Description = metaValue;
break;
case "keywords":
case "keyword":
htmldoc.Keywords = metaValue;
break;
case "robots":
case "robot":
htmldoc.SetRobotDirective (metaValue);
break;
}
}
Listing 6 - Parsing It also obeys the Spidering the web!When you load the SearcharooSpider.aspx page, it immediately begins spidering, starting with either:
Screenshot 1 - The title of each page is displayed as it is spidered. We're using the CIA World FactBook as test data. Once the catalog is built, you are ready to search. Performing the SearchAll the hard work was done in Article 1 - this code is repeated for your information... /// <summary>Returns all the Files which
/// contain the searchWord</summary>
/// <returns>Hashtable </returns>
public Hashtable Search (string searchWord) {
// apply the same 'trim' as when we're building the catalog
searchWord =
searchWord.Trim('?','\"', ',', '\'', ';', ':', '.', '(', ')').ToLower();
Hashtable retval = null;
if (index.ContainsKey (searchWord) ) { // does all the work !!!
Word thematch = (Word)index[searchWord];
retval = thematch.InFiles(); // return the collection of File objects
}
return retval;
}
Article 1 Listing 8 - the We have not modified any of the Search objects in the diagram at the start of this article, in an effort to show how data encapsulation allows you to change both the way you collect data (i.e., from file system crawling to website spidering) and the way you present data (i.e., updating the search results page), without affecting your data tier. In article 3, we'll examine if it's possible to convert the Search objects to use a database back-end without affecting the collection and presentation classes... Improving the Results [SearcharooToo.aspx]These are the changes we will make to the results page:
The first change to support searching on multiple terms is to 'parse' the query typed by the user. This means: trimming whitespace from around the query, and compressing whitespace between the query terms. We then searchterm = Request.QueryString["searchfor"].ToString().Trim(' ');
Regex r = new Regex(@"\s+"); //remove all whitespace
searchterm = r.Replace(searchterm, " "); // to a single space
searchTermA = searchterm.Split(' '); // then split
for (int i = 0; i < searchTermA.Length; i++) { // array of search terms
searchTermA[i] = searchTermA[i].Trim
(' ', '?','\"', ',', '\'', ';', ':', '.', '(', ')').ToLower();
}
Listing 7 - the Now that we have an // Array of arrays of results that match ONE of the search criteria
Hashtable[] searchResultsArrayArray = new Hashtable[searchTermA.Length];
// finalResultsArray is populated with pages
// that *match* ALL the search criteria
HybridDictionary finalResultsArray = new HybridDictionary();
// Html output string
string matches="";
bool botherToFindMatches = true;
int indexOfShortestResultSet = -1, lengthOfShortestResultSet = -1;
for (int i = 0; i < searchTermA.Length; i++) {
// ##### THE SEARCH #####
searchResultsArrayArray[i] = m_catalog.Search (searchTermA[i].ToString());
if (null == searchResultsArrayArray[i]) {
matches += searchTermA[i]
+ " <font color=gray " +
"style='font-size:xx-small'>(not found)</font> ";
// if *any one* of the terms isn't found,
// there won't be a 'set' of matches
botherToFindMatches = false;
} else {
int resultsInThisSet = searchResultsArrayArray[i].Count;
matches += "<a href=\"?searchfor="+searchTermA[i]+"\">"
+ searchTermA[i]
+ "</a> <font color=gray style='font-size:xx-small'>("
+ resultsInThisSet + ")</font> ";
if ( (lengthOfShortestResultSet == -1) ||
(lengthOfShortestResultSet > resultsInThisSet) ) {
indexOfShortestResultSet = i;
lengthOfShortestResultSet = resultsInThisSet;
}
}
}
Listing 8 - Find the results for each of the terms individually Describing how we find the documents that match all words in the query is easiest with an example, so imagine we're searching for the query "snow cold weather" in the CIA World FactBook. Listing 8 found the array of documents matching each word, and placed them inside another array. "snow" has 10 matching documents, "cold" has 43 matching documents, and "weather" has 22 matching documents. Obviously, the maximum possible number of overall matches is 10 (the smallest result set), and the minimum is zero -- maybe there are no documents that appear in all three collections. Both of these possibilities catered for -
Diagram 1 - Finding the intersection of the result sets for each word involves traversing the 'array of arrays' Listing 9 shows how we approached this problem. It may not be the most efficient way to do it, but it works! Basically, we choose the smallest resultset and loop through its matching files, looping through the Imagine, referring to the diagram above, that we begin with [0][0] file Counter ' The next problem is, how will we remember that this file exists in both sets? I chose a very simple solution - count the number of sets we're comparing After the looping has completed, " The looping will continue with ' After a lot of looping, the code will discover that only files // Find the common files from the array of arrays of documents
// matching ONE of the criteria
if (botherToFindMatches) {
// all words have *some* matches
// loop through the *shortest* resultset
int c = indexOfShortestResultSet;
Hashtable searchResultsArray = searchResultsArrayArray[c];
if (null != searchResultsArray)
foreach (object foundInFile in searchResultsArray) {
// for each file in the *shortest* result set
DictionaryEntry fo = (DictionaryEntry)foundInFile;
// find matching files in the other resultsets
int matchcount=0, totalcount=0, weight=0;
for (int cx = 0; cx < searchResultsArrayArray.Length; cx++) {
// keep track, so we can compare at the end (if term is in ALL)
totalcount+=(cx+1);
if (cx == c) {
// current resultset
// implicitly matches in the current resultset
matchcount += (cx+1);
// sum the weighting
weight += (int)fo.Value;
} else {
Hashtable searchResultsArrayx = searchResultsArrayArray[cx];
if (null != searchResultsArrayx)
foreach (object foundInFilex in searchResultsArrayx) {
// for each file in the result set
DictionaryEntry fox = (DictionaryEntry)foundInFilex;
if (fo.Key == fox.Key) {
// see if it matches
// and if it matches, track the matchcount
matchcount += (cx+1);
// and weighting; then break out of loop, since
weight += (int)fox.Value;
break;
// no need to keep looking through this resultset
}
} // foreach
} // if
} // for
if ( (matchcount>0) && (matchcount == totalcount) ) {
// was matched in each Array
// set the final 'weight'
// to the sum of individual document matches
fo.Value = weight;
if ( !finalResultsArray.Contains (fo.Key) )
finalResultsArray.Add ( fo.Key, fo);
} // if
} // foreach
} // if
Listing 9 - Finding the sub-set of documents that contain every word in the query. There're three nested loops in there - I never said this was efficient! The algorithm described above is performing a boolean AND query on all the words in the query, i.e., the example is searching for "snow AND cold AND weather". If we wished to build an OR query, we could simply loop through all the files and filter out duplicates. OR queries aren't that useful unless you can combine them with AND clauses, such as "snow AND (cold OR weather)" - but this is NOT supported in Version 2! By the way, the variables in that code which I've called " The Finished Result
Screenshot 2 - The Search input page has minor changes, including the filename to SearcharooToo.aspx!
Screenshot 3 - You can refine your search, see the number of matches for each search term, view the time taken to perform the search and, most importantly, see the documents containing all the words in your query! Using the sample codeThe goal of this article was to build a simple search engine that you can install just by placing some files on your website; so you can copy Searcharoo.cs, SearcharooSpider.aspx and SearcharooToo.aspx to your web root and away you go! However, that means you accept all the default settings, such as crawling from the website root, and a five second timeout when downloading pages. To change those defaults, you need to add some settings to web.config: <appSettings>
<!--website to spider-->
<add key="Searcharoo_VirtualRoot" value="http://localhost/" />
<!--5 second timeout when downloading-->
<add key="Searcharoo_RequestTimeout" value="5" />
<!--Max pages to index-->
<add key="Searcharoo_RecursionLimit" value="200" />
</appSettings>
Listing 14 - web.config Then, simply navigate to http://localhost/SearcharooToo.aspx (or wherever you put the Searcharoo files) and it will build the catalog for the first time. If your application re-starts for any reason (i.e., you compile code into the /bin/ folder, or change web.config settings), the catalog will need to be rebuilt - the next user who performs a search will trigger the catalog build. This is accomplished by checking if the Cache contains a valid FutureSearcharooSpider.aspx greatly increases the utility of Searcharoo, because you can now index your static and dynamic (e.g., database generated) pages to allow visitors to search your site. That means you could use it with products like Microsoft Content Management Server (CMS) which does not expose its content-database directly. The two remaining (major) problems with Searcharoo are:
The next articles in this series will (hopefully) examine these two problems in more detail on CodeProject and at ConceptDevelopment.NET... Glossary
Postscript : What about code-behind and Visual Studio .NET? (from Article I)You'll notice the two ASPX pages use the The advantage of this approach is that you can place these three files in any ASP.NET website and they'll 'just work'. There are no other dependencies (although they work better if you set some web.config settings) and no DLLs to worry about. However, this also means these pages can't be edited in Visual Studio .NET, because it does not support the Links
History
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||