ASP.NET C# Search Engine (Highlighting, JSON, jQuery & Silverlight)

By craigd, 8 Mar 2009
 
Overview of version 7: search term highlighting in doc summary

Background

This article follows on from the previous six Searcharoo samples:

Searcharoo 1 was a simple search engine that crawled the file system. Very rough.

Searcharoo 2 added a 'spider' to index web links and then search for multiple words.

Searcharoo 3 saved the catalog to reload as required; spidered FRAMESETs and added Stop words, Go words and Stemming.

Searcharoo 4 added non-text filetypes (e.g. Word, PDF and Powerpoint), better robots.txt support and a remote-indexing console app.

Searcharoo 5 runs in Medium Trust and refactored FilterDocument into DownloadDocument and its subclasses for indexing Office 2007 files.

Searcharoo 6 adds indexing of photos/images and geographic coordinates; and displaying search results on a map.

Introduction to Version 7

The following additions have been made:

  1. Store the entire 'content' of each indexed document so the results page can show an excerpt of the text with search keywords highlighted.
  2. PDF indexing has been enhanced using iTextSharp to extract the document Title from metadata rather than just display the filename in results, and also to attempt to 'manually' index the PDF file even when the IFilter fails (possibly due to Acrobat installation problems).
  3. Handling 'default document' settings correctly, to prevent duplicate results where a 'page' has multiple accessible URLs because it is configured as the "default document" on a web server (e.g. default.htm or default.aspx in IIS, or index.html on many UNIX servers).
  4. Add a JSON result 'service' (similar to the Kml output in version 6).
  5. Add a jQuery-driven AJAX/HTML page that uses the JSON to provide a nice, easily skinnable results page.
  6. Add a Silverlight 2.0 client that uses the JSON to provide a richer search experience.
  7. Bug fixes including: 
    • brad1213 found (and fixed) a bug where links in HTML comments were still followed.
    • brad1213 suggested a fix that adds a URL to the 'visited' collection after it has been redirected.

Storing the Complete Document Text During Indexing

Back in October '08, SMeledath asked how the description shown in the results could be taken from the page itself. I proposed an approach but did not have time to implement it until now.

In previous versions of Searcharoo, the index contained only a 'link' between each word and the URLs of the documents that contain it. How many times the word appears, and where, was lost during indexing (see version 5 for discussion of the old catalog structure). This made it impossible to display a relevant 'excerpt' on the results page: the only document text kept was the first 350 characters (or the META description tag), mainly because that was much easier to program.

Version 7 significantly alters the 'structure' of the index to store more data: for each word-document pairing, we also store the positions of that word in the source document. For example: after parsing out punctuation and whitespace, each word is assigned an index, with the first word given position zero and each subsequent word adding one. We also store the complete text of the document and can therefore extract any given part of the text.
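To make the word-position idea concrete, here is a minimal sketch of such a structure. It is not the Searcharoo catalog code: the class name PositionalIndex, its methods and the separator list are all made up for illustration, and the only thing taken from the description above is that the first word gets position zero and each later word adds one.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified stand-in for the version 7 catalog:
// word -> (document URL -> positions of that word in the document).
public class PositionalIndex
{
    private readonly Dictionary<string, Dictionary<string, List<int>>> _words =
        new Dictionary<string, Dictionary<string, List<int>>>(StringComparer.OrdinalIgnoreCase);

    // Splits on whitespace and common punctuation; the first word is position 0.
    public static string[] Tokenize(string text)
    {
        char[] separators = { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?', '(', ')', '"' };
        return text.Split(separators, StringSplitOptions.RemoveEmptyEntries);
    }

    // Records every word of the document together with its position.
    public void AddDocument(string url, string text)
    {
        string[] words = Tokenize(text);
        for (int position = 0; position < words.Length; position++)
        {
            Dictionary<string, List<int>> documents;
            if (!_words.TryGetValue(words[position], out documents))
            {
                documents = new Dictionary<string, List<int>>();
                _words[words[position]] = documents;
            }
            List<int> positions;
            if (!documents.TryGetValue(url, out positions))
            {
                positions = new List<int>();
                documents[url] = positions;
            }
            positions.Add(position);
        }
    }

    // Returns the positions of a word in one document (empty if absent).
    public IList<int> GetPositions(string word, string url)
    {
        Dictionary<string, List<int>> documents;
        List<int> positions;
        if (_words.TryGetValue(word, out documents) && documents.TryGetValue(url, out positions))
            return positions;
        return new List<int>();
    }
}
```

With positions stored like this, a results page can jump straight to the first occurrence of a search term in the cached text instead of scanning the whole document.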

The key differences between the old and new catalog serialized file (called z_searcharoo.xml by default) are:

BUT there's more - there is a NEW file called z_searcharoo-cache.xml that contains the complete text of each document (including punctuation) which will enable us to display any part of the document text on the results page:

Highlighting Matches in Results

The majority of the code ignores the z_searcharoo-cache.xml file, since it is not required to perform the actual search. The cache is used only in the Search.cs GetResults() method, after the results list has been constructed, to generate the document 'descriptions' with highlighted keywords.

Once we've loaded the file contents from the cache (into an array), we loop through it with some funky positioning to find the first matching word in the content, grab around 100 words around it, then loop through those 100 words and highlight ALL matches.
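The loop described above can be sketched roughly as follows. This is not the GetResults() code: ExcerptHighlighter, Build and the windowSize parameter are hypothetical names, bold tags are an assumed choice of markup, and HTML-encoding of the words is left out to keep the sketch short.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class ExcerptHighlighter
{
    // Finds the first word that matches any search term, takes a window of
    // roughly windowSize words around it, and wraps every match in <b> tags.
    public static string Build(string[] words, ICollection<string> terms, int windowSize)
    {
        int first = -1;
        for (int i = 0; i < words.Length && first < 0; i++)
            if (IsMatch(words[i], terms)) first = i;
        if (first < 0) return string.Empty;   // no match, so no excerpt

        int start = Math.Max(0, first - windowSize / 2);
        int end = Math.Min(words.Length, start + windowSize);

        var excerpt = new StringBuilder();
        for (int i = start; i < end; i++)
        {
            if (excerpt.Length > 0) excerpt.Append(' ');
            if (IsMatch(words[i], terms))
                excerpt.Append("<b>").Append(words[i]).Append("</b>");
            else
                excerpt.Append(words[i]);
        }
        return excerpt.ToString();
    }

    // Case-insensitive comparison that ignores surrounding punctuation.
    private static bool IsMatch(string word, ICollection<string> terms)
    {
        string bare = word.Trim('.', ',', ';', ':', '!', '?', '"', '(', ')');
        foreach (string term in terms)
            if (string.Equals(bare, term, StringComparison.OrdinalIgnoreCase)) return true;
        return false;
    }
}
```

Passing a windowSize of about 100 would correspond to the excerpt length mentioned above; a real results page would also need to HTML-encode each word before adding the tags.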

If it sounds like a hack: it is (kinda). Google results often identify multiple parts of the document where matches appear, and display more than one (separated by an ellipsis...) - but I will leave that for a future version (or someone else to try)...

Enhanced PDF Indexing

CodeProject user inspire90 asked about displaying the PDF 'title' in search results, but I didn't have a solution straight away. Another user, brad1213, provided a working code snippet using iTextSharp. brad1213's code was added directly to Spider.cs.

Incorporating this behaviour into the object model required some refactoring of the PDF indexing process so that PDF documents are treated a little differently from other file types that require the IFilter interface. Previously the spidering process did not differentiate between PDFs and any other file it could not 'parse' natively - it just handed off to the IFilterDocument.cs class.

Version 7 now has a PdfDocument that inherits from FilterDocument so that we can add the iTextSharp parsing to the GetResponse method.

There was a minor problem with this new subclass however - FilterDocument was not designed for extension... the FilterDocument.GetResponse() method did everything in a tightly coupled mess!

bad version 6 code

I can't believe I wrote that! To subclass this would basically require re-implementing GetResponse from scratch, because there are no 'hooks' to help the implementor 'inherit' any behaviour. I'm sure there are better approaches, but I chose to move most of the 'functionality' into a couple of *Core methods...

better version 7 code

... so the PdfDocument could use them but do additional iTextSharp processing in the middle (using the same temporary file originally created just for passing to IFilter).

new Pdf class

Although it's not perfect, the refactored code does allow the subclass to take advantage of FilterDocument's code to download and save a temporary copy of the file (and delete it afterwards), while still performing its own operations (using iTextSharp). I'm pretty confident there's a better pattern for this type of class relationship - if I find it, I will update the article.
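The shape of that refactoring might look something like the sketch below. It is an illustration only, not the Searcharoo source: the class names, the *Core method names and the byte-array parameter are invented, plain text files stand in for IFilter extraction, and reading the first line of the file stands in for the iTextSharp metadata lookup.

```csharp
using System;
using System.IO;

// Hypothetical names throughout; only the shape of the refactoring matters.
public class FilterDocumentSketch
{
    public string Text { get; protected set; }
    public string Title { get; protected set; }

    // Base behaviour: save a temporary copy, extract the text, clean up.
    public virtual void GetResponse(byte[] downloadedBytes)
    {
        string tempFile = SaveTempFileCore(downloadedBytes);
        try { ExtractTextCore(tempFile); }
        finally { File.Delete(tempFile); }
    }

    // Writes the downloaded bytes to a temporary file and returns its path.
    protected string SaveTempFileCore(byte[] downloadedBytes)
    {
        string path = Path.GetTempFileName();
        File.WriteAllBytes(path, downloadedBytes);
        return path;
    }

    // Stand-in for the IFilter call: just reads the file as text.
    protected void ExtractTextCore(string path)
    {
        Text = File.ReadAllText(path);
    }
}

public class PdfDocumentSketch : FilterDocumentSketch
{
    // Reuses the *Core steps but does its own work on the temp file in between.
    public override void GetResponse(byte[] downloadedBytes)
    {
        string tempFile = SaveTempFileCore(downloadedBytes);
        try
        {
            Title = ReadTitle(tempFile);   // stand-in for the iTextSharp step
            ExtractTextCore(tempFile);
        }
        finally { File.Delete(tempFile); }
    }

    // Placeholder: treats the first line of the file as the title.
    private static string ReadTitle(string path)
    {
        using (var reader = new StreamReader(path))
        {
            return (reader.ReadLine() ?? string.Empty).Trim();
        }
    }
}
```

The subclass still repeats the try/finally clean-up, which is the kind of duplication a dedicated virtual 'hook' method in the base class could remove.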

'Default' Document Handling

Patrick Stuart asked about a problem he was having with 'duplicate' results. It turned out that /default.aspx (or whatever your 'default' document is) was being indexed multiple times, for example when the URL ended with '/' or with '/default.aspx'.

To fix this problem, additional code has been added to manipulate the 'already visited' list - when a URL matches one of the 'default document' patterns, we add ALL possible 'default document' combinations to the _Visited collection. The three patterns that are handled are:

  • http://searcharoo.net/SearcharooV7/ - default page with trailing slash
  • http://searcharoo.net/SearcharooV7 - default page without slash or page name specified
  • http://searcharoo.net/SearcharooV7/default.aspx - default page specified ("default.aspx" set in Searcharoo config)

As indexing progresses, any variation of the URL is treated as 'already visited', thus preventing duplication in the catalog (and the results).

The updated code looks like this (notice the three different "conditions" where a different URL can be pointing to the same 'default' page):

Set the default document for your website in app.config so that Indexer.exe handles these URLs correctly.

<!-- Default document filename: served in folder roots [v7] -->
<add key="Searcharoo_DefaultDocument" value="default.aspx" /> 
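As a rough illustration of the three URL patterns listed earlier, the sketch below expands one URL into all of its assumed 'default document' equivalents and adds them to a visited set. The names DefaultDocumentSketch, GetEquivalentUrls and MarkVisited are hypothetical and this is not the Spider code; it also ignores query strings and treats any last path segment without a dot as a folder, which is a simplification.

```csharp
using System;
using System.Collections.Generic;

public static class DefaultDocumentSketch
{
    // Returns every URL assumed to resolve to the same default page, or just
    // the original URL when it does not look like a default page.
    public static List<string> GetEquivalentUrls(Uri url, string defaultDocument)
    {
        string full = url.GetLeftPart(UriPartial.Path);   // drops query and fragment
        string folder;                                    // folder URL, no trailing slash

        if (full.EndsWith("/" + defaultDocument, StringComparison.OrdinalIgnoreCase))
            folder = full.Substring(0, full.Length - defaultDocument.Length - 1);
        else if (full.EndsWith("/"))
            folder = full.TrimEnd('/');
        else if (full.Substring(full.LastIndexOf('/') + 1).IndexOf('.') < 0)
            folder = full;   // no file extension, so treat it as a folder
        else
            return new List<string> { full };

        return new List<string> { folder + "/", folder, folder + "/" + defaultDocument };
    }

    // Adds all variants to the visited set; returns true if the page is new.
    public static bool MarkVisited(HashSet<string> visited, Uri url, string defaultDocument)
    {
        bool isNew = false;
        foreach (string variant in GetEquivalentUrls(url, defaultDocument))
            if (visited.Add(variant)) isNew = true;
        return isNew;
    }
}
```

In this sketch the defaultDocument argument would come from the Searcharoo_DefaultDocument setting, and the visited set would be created with a case-insensitive comparer if the web server treats paths that way.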

A future/further enhancement could be for the code to be on the lookout for ANY case where a particular page has the exact same content as another page and do some automatic de-duplication... but for now, this URL comparison seems to fix the most common bug.

JSON Results 'service'

I saw this article about Silverlight-enabled Live Search and decided to try and enable Searcharoo in the same way. Unlike the article, I decided to try using JSON so I could build a jQuery front-end as well.

JSON (or JavaScript Object Notation) is a mechanism to represent data (like a serialized object graph) using just the JavaScript 'object literal' notation: it looks like a simple set of key-value pairs (with nesting and 'collections' grouped in []). Transforming the ResultFile class (used on the regular Results page) into JSON will look like this:

[
{"name":"CIA - The World Factbook -- United Kingdom"
,"description":"Tower Hamlets**, Trafford, Wakefield***
, Walsall, Waltham Forest**, Wandsworth**, Warrington
, Warwickshire*, West Berkshire****, Westminster***
, West Sussex*, Wigan, Wiltshire*, Windsor and Maidenhead******
, Wirral, Wokingham****, Wolverhampton, Worcestershire*
, York*****; Northern Ireland - 24 districts
, 2 cities*, 6 counties**; Antrim, County Antrim**
, Ards, Armagh, County Armagh**, Ballymena, Ballymoney
, Banbridge, Belfast*, Carrickfergus, Castlereagh, Coleraine, "
,"url":"http://localhost:3359/content/uk.html"
,"tags":""
,"size":"57299"
,"date":"10/18/2008 3:02:49 PM"
,"rank":6
,"gps":"0,0"
},
{"name":"kilimanjaro"
,"description":"to pay US$40 Departure tax. 
Check with your travel agent. Tanzania - Australian passport holders US$50
, British passport holders US$50, Canadian passport holders US$50
, New Zealand passport holder US$50 
Medical Information and Vaccinations: Vaccinations: 
You must have an International Certificate of Yellow Fever 
Vaccination if crossing borders within "
,"url":"http://localhost:3359/content/kilimanjaro.pdf"
,"tags":""
,"size":"182794"
,"date":"10/18/2008 3:01:53 PM"
,"rank":2
,"gps":"0,0"
}]

To create this output, we can use the same SearchPageBase base class as the KML output in version 6 - creating the JSON output is as simple as modifying the ASPX markup to emit {}, : and "" instead of XML.

JSON ASP.NET template
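One thing to watch when emitting JSON from a template is that string values have to be escaped, otherwise a double quote or line break inside a description produces output that strict parsers reject. The helper below is a minimal sketch of that step, not part of the Searcharoo download: JsonSketch, Escape and ToJson are hypothetical names, and only four of the fields shown earlier are written.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class JsonSketch
{
    // Escapes a string so it can sit between double quotes in JSON output.
    public static string Escape(string value)
    {
        if (value == null) return string.Empty;
        var sb = new StringBuilder(value.Length);
        foreach (char c in value)
        {
            switch (c)
            {
                case '"': sb.Append("\\\""); break;
                case '\\': sb.Append("\\\\"); break;
                case '\n': sb.Append("\\n"); break;
                case '\r': sb.Append("\\r"); break;
                case '\t': sb.Append("\\t"); break;
                default:
                    // Remaining control characters must be written as \uXXXX
                    if (c < ' ') sb.Append("\\u").Append(((int)c).ToString("x4"));
                    else sb.Append(c);
                    break;
            }
        }
        return sb.ToString();
    }

    // Writes one result as a JSON object literal.
    public static string ToJson(string name, string description, string url, int rank)
    {
        return "{\"name\":\"" + Escape(name) + "\""
             + ",\"description\":\"" + Escape(description) + "\""
             + ",\"url\":\"" + Escape(url) + "\""
             + ",\"rank\":" + rank.ToString(CultureInfo.InvariantCulture)
             + "}";
    }
}
```

A template could call something like Escape() around each value it writes, so the markup itself only has to supply the braces, colons and quotes.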

jQuery JSON 'client'

Given that JSON output (accessible via a simple URL, like /SearchJson/New%20York.js or /SearchJson.aspx?searchfor=New%20York), we can now easily access the results using JavaScript, or the excellent jQuery library (now 'supported' by Microsoft). The HTML page below can consume the JSON (using jQuery): there is a text input and button which captures the search term and builds a URL, the jQuery $.getJSON() method retrieves the data, evals it into objects and the remaining code outputs HTML to the div on the page.

jSearcharoo - search page driven by JSON and jQuery

The result below might look similar to the 'standard' ASPX page - but as you can see from the HTML above, the page is almost entirely generated by jQuery using the JSON results. Look for the jSearcharoo.html file in the Web.UI project in the download.

jSearcharoo - results page

Silverlight 2.0 JSON 'client'

The JSON 'service' can also supply results to a Silverlight 2.0 application, using the JsonArray and JsonObject classes described on MSDN. First, we'll design a simple XAML user-interface using a simple Grid with a TextBox, Button and ListBox to contain the results. We will be binding a class to the ListBox that looks very similar (if not identical) to the JSON format shown above, so the ListBox.ItemTemplate DataTemplate consists of simple controls in a StackPanel, databound to the same field names (url, name, description).

Silverlightaroo Xaml

The C# code is shown below. The important elements are:

  • Constructing the JSON URL with the query text
  • Using WebClient to start an asynchronous request for the results
  • Using JsonArray to parse the JSON and loop through array to populate our SearchResult objects
  • Binding the SearchResults to the UI via ItemsSource - the DataTemplate takes care of the formatting for us.

(Note: You need to manually Add References to System.Json, System.Runtime.Serialization, System.Runtime.Serialization.Json.)

// Namespaces needed at the top of the code-behind file:
// System, System.Collections.Generic, System.IO, System.Json, System.Net, System.Windows

/// <summary>
/// Start async request for JSON
/// http://msdn.microsoft.com/en-us/library/cc197953(VS.95).aspx
/// </summary>
private void Search_Click(object sender, RoutedEventArgs e)
{
    // Build the service address from the host that served the XAP file
    string host = Application.Current.Host.Source.Host;
    if (Application.Current.Host.Source.Port != 80)
        host = host + ":" + Application.Current.Host.Source.Port;
    //host = "localhost:3359";
    // Escape the query so spaces and characters like & or # survive in the URL
    Uri serviceUri = new Uri("http://" + host + "/SearchJson.aspx?searchfor="
                             + Uri.EscapeDataString(query.Text));
    WebClient downloader = new WebClient();
    downloader.OpenReadCompleted += 
         new OpenReadCompletedEventHandler(downloader_OpenReadCompleted);
    downloader.OpenReadAsync(serviceUri);
}

/// <summary>
/// Receive JSON stream, parse into objects and bind to ListBox
/// http://msdn.microsoft.com/en-us/library/cc197957(VS.95).aspx
/// </summary>
void downloader_OpenReadCompleted(object sender, OpenReadCompletedEventArgs e)
{
    if (e.Error == null)
    {
        using (Stream responseStream = e.Result)
        {
            // The service returns a JSON array with one object per result
            JsonArray results = (JsonArray)JsonArray.Load(responseStream);
            List<SearchResult> list = new List<SearchResult>();
            foreach (JsonObject r in results)
            {
                // Each JsonValue converts implicitly to string
                var result = new SearchResult
                {
                    name = r["name"], description = r["description"],
                    url = r["url"], size = r["size"], date = r["date"]
                };
                list.Add(result);
            }
            resultList.ItemsSource = list;
        }
    }
}

And this is what the resulting Silverlight 2.0 application looks like (showing the results of a search for dollar). Because we used the Silverlight HyperlinkButton, the document titles are clickable links to each result's page.

Silverlightaroo result display

The Silverlight 2.0 project is a separate download that can be opened with Visual Web Developer 2008 Express (the rest of the Searcharoo code is still .NET 2.0 and can be opened in Visual Studio or Express 2005). Look for the Silverlight.html and Silverlightaroo.XAP files in the Web.UI project in the download.

Bug Fixes

Possible Duplicate Indexing When Page is Redirected

brad1213 (who has contributed to Searcharoo a couple of times) helped out with an additional 'error condition' related to the _Visited handling discussed above - when a page redirects to another location, the resulting HTML is indexed BUT only the 'original' URL is marked as 'visited' (possibly leading to duplicates in the catalog). His solution is simply to add the URL to the _Visited list after redirects have been followed.

Follows Links in HTML that have been Commented Out

brad1213 also identified a solution to the problem of links inside HTML comments (i.e. within <!-- -->) that probably should be ignored. The fix is to add this regular expression replacement in HtmlDocument (line 295):

htmlData = Regex.Replace(htmlData , @"<!--.*?[^" +
Preferences.IgnoreRegionTagNoIndex + "]-->" , "" ,
RegexOptions.IgnoreCase | RegexOptions.Singleline);
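Note that a negated character class such as [^...] tests a single character rather than the whole tag name. An alternative way to express "remove every comment except the no-index markers" is a negative lookahead, sketched below. This is a hedged variant, not the code in HtmlDocument: CommentStripperSketch is a made-up name, and it assumes the marker comments contain only the tag name, optionally preceded by a slash for the closing marker.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommentStripperSketch
{
    // Removes HTML comments, except comments that consist only of the
    // no-index marker (opening form, or closing form with a leading slash).
    public static string StripComments(string html, string noIndexMarker)
    {
        string pattern = @"<!--(?!\s*/?" + Regex.Escape(noIndexMarker) + @"\s*-->).*?-->";
        return Regex.Replace(html, pattern, "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }
}
```

Here noIndexMarker plays the role of Preferences.IgnoreRegionTagNoIndex; links inside ordinary comments disappear before link extraction, while the marker comments remain for the ignore-region handling.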

Surrogate Pair Error (PDF Indexing)

Member 4130814 reported an error serializing the catalog after indexing PDFs. I was able to reproduce it and (I think) fix it with this simple statement to remove 'nulls' from the string.

this.All += sb.ToString().Replace('\0', ' '); 

Not 100% sure why those nulls were creeping into the searched text though.

Conclusion

This article has been a mix of 'requested features' (keyword highlighting, duplicate removal) and 'new toys' (JSON, jQuery and Silverlight). You can learn more about jQuery, and why JSON is an alternative to XML on the web.

Updates

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

craigd
Web Developer
Australia
-- ooo ---
www.conceptdevelopment.net
conceptdev.blogspot.com
www.searcharoo.net
www.recipenow.net
www.racereplay.net
www.silverlightearth.com

Last Updated 9 Mar 2009
Article Copyright 2009 by craigd