Click here to Skip to main content
13,354,132 members (60,744 online)
Click here to Skip to main content
Add your own
alternative version


74 bookmarked
Posted 11 Jun 2008

Image and Location Search(aroo)

, 11 Jun 2008
Rate this:
Please Sign up or sign in to vote.
Index your website (including images and geographic data), search and display results in Google Earth.
Overview of version 6: Image search, Google Earth and Google Maps


This article follows on from the previous five Searcharoo samples:

Searcharoo 1 was a simple search engine that crawled the file system. Very rough.

Searcharoo 2 added a 'spider' to index web links and then search for multiple words.

Searcharoo 3 saved the catalog to reload as required; spidered FRAMESETs and added Stop words, Go words and Stemming.

Searcharoo 4 added non-text filetypes (eg Word, PDF and Powerpoint), better robots.txt support and a remote-indexing console app.

Searcharoo 5 runs in Medium Trust and refactored FilterDocument into DownloadDocument and its subclasses for indexing Office 2007 files.

Introduction to version 6

The following additions have been made:

  1. Extend the DownloadDocument object hierarchy introduced in v5 to index Jpeg images (yes, the metadata in image files such as photos) and the new Microsoft XPS (Xml Paper Specification) format (if you are compiling in .NET 3.5).
  2. Add Latitude/Longitude (Gps) location data to the base classes, and index geographic location data embedded in Jpeg images and the META tags of Html documents. Include the location in the search results, wtih a link to show the location on Google Maps and Google Earth.
  3. Allowing Searcharoo to run on websites where the ASP.NET application is restricted to "No" Trust. The remote-indexing console app in v4 and the changes in v5 were intended to addrsess this issue - but I came across a situation where NO File OR Web access was allowed. Now the code allows for the Catalog to be compiled-in to an assembly (remotely) and uploaded in ANY Trust scenario.
  4. Minor improvements: addition of COLOR to the Indexer.EXE Console Application, allowing multiple "start pages" to be set (so you can index your website, your blog, your hosted forum, whatever).
  5. Bug fixes including: <TITLE> tag parsing, correctly identifying external site links, and escaped & ampersands.

NOTE: This version of Searcharoo only displays the location (Latitude, Longitude) data - it doesn't "search by location". To read how to search "nearby" using location data, see the Store Locator: Help customers find you with Google Maps article.

Image Indexing (reading Jpeg metadata)

In Searcharoo 5 the DownloadDocument class was added to handle IFilter and Office 2007 files - files that needed to be downloaded to disk before additional code could be run against them to extract and index their textual content.
Adding support for indexing Jpeg image metadata followed the same simple steps - subclass DownloadDocument and incorporate code for processing IPTC/EXIF and XMP to read out whatever information has been embedded in a Jpeg file.

Key changes to the object model for Jpeg indexing support

The simple changes to the object model are shown above - the actual JpegDocument class was cloned from DocxDocument, with parsing code from EXIFextractor by Asim Goheer and Reading XMP Metadata from a JPEG using C# by Omar Shahine.

The two key parts of the code are shown below - the first uses System.Drawing.Imaging to extract PropertyItems which are then parsed against a set of known, hardcoded hex values; the second extracts an 'island' of Xml from within the Jpeg binary data. Both methods return differing (sometimes overlapping) metadata key/value pairs - from which we use only a small subset (Title, Description, Keywords, GPS Location, Camera Make & Model). Dozens of other values are present including focal length, flash settings - whatever the camera supports - but I will leave you to parse additional fields as you need them.

System.Drawing.Imaging and EXIFextractor

The EXIF data is stored in a 'binary' format - opening a tagged image in Notepad2 shows recognisable data with binary markers:

The binary structures are recognised by the .NET System.Drawing.Imaging code (below), but you must know the hexadecimal code and data-type for each piece of data you want to extract.

using System.Collections; // DictionaryEntry
using System.Drawing.Imaging;
// ...
public static PropertyItem[] GetExifProperties(string fileName)
    using (FileStream stream = 
            new FileStream(fileName, FileMode.Open, FileAccess.Read))
        using (System.Drawing.Image image = 
            /* useEmbeddedColorManagement = */ true,
            /* validateImageData = */ false))
            return image.PropertyItems;

Searcharoo uses the EXIFextractor code so that it's easy to review and change - there is also an alternative method to access the EXIF data by incorporating the Exiv2.dll library into your code, but I'll leave that up to you.

XMP via XML island

Unlike the EXIF data, XMP is basically 'human readable' within the JPG file as you can see below - there is an 'island' of pure XML in the binary image data. Importantly, XMP is the only way to get the Title and Description information which is really useful for searching.

public static string GetXmpXmlDocFromImage(string filename)
    using (System.IO.StreamReader sr = new System.IO.StreamReader(filename))
        contents = sr.ReadToEnd();
    beginPos = contents.IndexOf("<rdf:RDF", 0);
    endPos = contents.IndexOf("</rdf:RDF>", 0);
// ... then get title
xmlNode = doc.SelectSingleNode(
        "/rdf:RDF/rdf:Description/dc:title/rdf:Alt", NamespaceManager);
// ... then get description
xmlNode = doc.SelectSingleNode(
        "/rdf:RDF/rdf:Description/dc:description/rdf:Alt", NamespaceManager);
// ... etc.

It's worthwhile noting that if you are working with .NET 3.0 or later, extracting XMP metadata is much more 'scientific' than the above example - using WIC - Windows Imaging Component to access photo metadata. In order to keep this version of Searcharoo compatible with .NET 2.0, utilizing those newer features has been avoided (for now).

If you want to be able to add metadata to your images, try iTag: photo tagging software (recommended). iTag was used to tag many of the photos used during testing.

Indexing Geographic Location (Latitude/Longitude)

Image metadata & Html Meta tags
Only two document types have 'standard' ways of being geo-tagged: Html documents via the <META > tag and Jpeg images in their metadata. The location for images is extracted along with the rest of the metadata (explained above), but for the HtmlDocument class we needed to add code to parse either of the following two 'standard' geo-tags (only ONE tag is required, not both):

<meta name="ICBM" content="50.167958, -97.133185">
<meta name="geo.position" content="50.167958;-97.133185">

The code currently parses out description, keyword and robot META tags using the following Regular Expression/for loops:

Adding support for ICBM and geo.position merely required a couple of additional case clauses:

Once we have the longitude and latitude for an Html or Jpeg, it is set in the base Document class property; where it is available for copying across to Catalog File objects (which are then persisted for later searching).

Additional metadata: File type and Keywords (tags)

Since we needed to add properties to the base Document class for location, it seemed like a logical time to start storing keywords and file type to improve the search accuracy and user-experience. In both cases a new property was added to the base Document class, and relevant subclasses (that know how to "read" that data) were updated to populate the property.

Only Html and Jpeg classes currently support Keyword parsing, but almost all the subclasses correctly set the Extension property that indicates file-type.

These additional pieces of information will provide more feedback to the user when viewing results, and in future may be used for: (a) alternate search result navigation (eg. a tag cloud) and/or (b) changes to the ranking algorithm when a keyword is 'matched'.

"No" trust Catalog access

This 'problem' continues on from the "Medium" Trust issue discussed in Searcharoo 5 so it might be worthwhile reading that article again.

Basically, the new problem is that NOT EVEN WebClient permission is allowed, so the Search.aspx code cannot load the Catalog from ANY FILE (either via local file access or a Url). Searcharoo needs a way to load the Catalog into memory on the web server WITHOUT requiring any special permission... and the simplest way to accomplish that seems to be compiling the Catalog into the code!

The steps required are:

  1. Run Searcharoo.Indexer.EXE with the correct configuration to remotely index your website
  2. Copy the resulting z_searcharoo.xml catalog file (or whatever you have called it in the .config) to the special WebAppCatalogResource Project
  3. Ensure the Xml file Build Action: Embedded Resource
  4. Ensure that is the ONLY resource in that Project
  5. Compile the Solution - WebAppCatalogResource.DLL will be copied into the WebApplication \bin\ directory
  6. Deploy the WebAppCatalogResource.DLL to your server
If the code fails to load a .DAT or .XML file under Full or Medium Trust, its fallback behaviour is to use the first resource in that assembly (using the few simple lines of code below):

System.Reflection.Assembly a = 
string[] resNames = a.GetManifestResourceNames();
Catalog c2 = Kelvin<Catalog>.FromResource(a, resNames[0]);

One final note: rather than remove the Binary or Xml Serialization features that run in "Full" and "Medium" Trust, all methods are still available. Whether Binary or Xml is controlled by the web.config/app.config setting (for your Website and Indexer Console application).

  <add key="Searcharoo_InMediumTrust" value="True" />
If set to True, the Catalog will be saved as a *.XML file, if set to False it will be written as *.DAT. Only if the code cannot load EITHER of these files will the resource DLL be used (an easy way to force it would be to delete all .DAT and .XML catalog files).

Presenting the newly indexed data

The aforementioned changes mainly focus on the addition of indexing functionality: finding new data (latitude, longitude, keyword, file-type), cataloging it and allowing the Catalog to be accessed. Presenting the search results with this additional data required some changes to the File and ResultFile classes which are used by the Search.aspx page when it does a search and shows the results.

The new properties added to Document (above) are mirrored in File, and where required 'wrapped' in ResultFile by display-friendly read-only properties. It is the ResultFile class that is bound to an asp:Repeater to produce the following output (note the coordinates next to some results, which link to the specific location at

Searching for 'newyork' tag


The presence of location data (latitude and longitude coordinates) doesn't just allow us to 'link' out to a single location - it allows for a whole new way to view the search results! When one or more result links are found to have location data, a new view in Google Earth link is displayed. It links to the SearchKml.aspx page (which inherits from the same "code-behind" as Search.aspx) but instead of Html it displays the results in KML (Keyhole Markup Language, used by Google Earth)!
The KML output looks very different to the HTML from Search.aspx:

However, the link to SearchKml.aspx is formatted more like a 'file reference': for example takes you to the screenshots below:

How (and why) does the link use the SearchKml.aspx page, you may wonder?

The reason for the url format (using a .kml extension and embedding the search term) is to enable browsers to open Google Earth based on the file extension (when Google Earth is installed, it registers .KML and .KMZ as known file types). Because the link "looks" like it refers to a newyork.kml file (and not an ASPX page), most browsers/operating systems will automatically open it in Google Earth (or other program registered for that file type).

The link syntax is accomplished with a 404 custom error handler (which must also be setup in either web.config or IIS Custom Errors tab):

NOTE: the same 'behaviour' is possible using a custom HttpHandler, however it requires "mapping" the .KML extension to the .NET Framework in IIS - something that isn't always possible on cheaper hosting providers [which will usually still allow you to setup a custom 404 URL]. It would be even easier using the new URL Routing Framework being introduced in .NET 3.5 SP1. For now, Searcharoo uses the simplest approach - 404.aspx.

Minor enhancements

Color-coded Indexer.EXE

This is purely a cosmetic change to make using the Searcharoo.Indexer.EXE easier; and uses Philip Fitzsimons' article Putting colour/color to work on the console. Each different "logging level" (or 'verbosity') is output in a different color to make reading easier.
'Verbosity' is set in the Searcharoo.Indexer.EXE.config file (app.config in Visual Studio).

And looks like this when running:

Multiple start Urls

Previous versions of Searcharoo only allowed a single 'start Url', and any links away from that Url were ignored. Version 6 now allows multiple 'start Urls' to be specified - they will all be indexed and added to the same Catalog for searching. Specify multiple subdomains in the .config file seperated by a comma or semicolon (eg. maybe you have forums.* and www.* domain names; or you wish to index your blog and your photosharing sites together).

WARNING: indexing takes time and uses network bandwidth - DON'T index lots of sites without being aware of how long it will take. If you stop indexing half-way-through, the Catalog does NOT get saved!

Recognising 'fully qualified' local links

A 'bug' that was reported by a few people (without resolution) has been addressed - if your site has "fully qualified" links (eg. my blog has ALL the anchor tags specified like this "") these links were marked as "External" and not crawled (HtmlDocument class, around line 360).

Spider.cs has been updated to add these links to the LocalLinks ArrayList.

NOTE: the code does a 'starts with' comparison, so if you specified a subdirectory for your "Start Url" (eg. then a fully qualified link to a different subdirectory will STILL not be indexed (eg. lt;a href="</strong />SearcharooV2</strong />/SearcharooSpider_alpha.html"> would NOT be followed).

Bug fix (honor roll)

Many thanks to the following CodeProject readers/contributors:

<TITLE> tag parsing

Erick Brown [work] identified the problem of CRLFs in the <TITLE> tag causing it to not be indexed... and provided a new Regex to fix it.

Correctly identifying external site links

mike-j-g (and later hitman17) correctly pointed out that the matching of links (in HtmlDocument) was case-sensitive, and provided a simple fix.

Handling escaped & ampersands in the querystring

Thanks to Erick Brown [work] again for highlighting a problem (and providing a fix) for badly handled & ampersands in querystrings.

Proxy support

stephenlane80 provided code for downloading via a proxy server (his change was added to Spider.Download() and RobotsTxt.ctor()). A new .config setting has been added to store the Proxy Server Url (if required):

Parsing robots.txt from Unix servers

maaguirr suggested a change to the RobotsTxt class so that it correctly processes robots.txt files on Unix servers (or wherever they might have different 'line endings' to the standard Windows CRLF).

Try it out

In order to 'try out' Searcharoo without having to download and set-up the code, there is now a set of test files on which you can search here. The test files are an assortment of purpose-written files (eg. to test Frames, IFrames, META tags, etc) plus some geotagged photos from various holidays.


Obviously the biggest change in this version is the ability to 'index' images using the metadata available in the JPG format. Other metadata (keywords, filetype) has also been added, and a foundation created to index and store even more information if you wish.
These changes allow for a number of new possibilities in future, including: a "Search Near" feature to order results by distance from a given point or search result; much more sophisticated tag/keyword display and processing; additional metadata parsing across other document types (eg. Office documents have a Keyword field in their Properties); and whatever else you can think of.


This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

Web Developer
Australia Australia
-- ooo ---

You may also be interested in...

Comments and Discussions

GeneralRe: Ranking? Pin
craigd31-Aug-08 16:38
membercraigd31-Aug-08 16:38 
QuestionRe: Ranking? [modified] Pin
HaydenMcL5-Sep-08 20:33
memberHaydenMcL5-Sep-08 20:33 
GeneralRe: Ranking? Pin
craigd25-Sep-08 13:53
membercraigd25-Sep-08 13:53 
QuestionPDF title does not show in search results Pin
inspire9018-Aug-08 4:09
memberinspire9018-Aug-08 4:09 
AnswerRe: PDF title does not show in search results Pin
craigd24-Aug-08 16:09
membercraigd24-Aug-08 16:09 
GeneralRe: PDF title does not show in search results Pin
inspire9025-Aug-08 23:27
memberinspire9025-Aug-08 23:27 
QuestionGreat work...any advice on multiple catalogues? [modified] Pin
HaydenMcL15-Aug-08 22:46
memberHaydenMcL15-Aug-08 22:46 
AnswerRe: Great work...any advice on multiple catalogues? Pin
craigd16-Aug-08 19:02
membercraigd16-Aug-08 19:02 
Searching for 'secure' documents is something that has been considered - but it's actually quite a difficult problem (particularly to solve "generically" when so many different security mechanisms are available). Because there are so many permutations it would be difficult to add 'one solution fits all' code in the Searcharoo project.

To build a simple solution (using a single catalog), I guess there are two main areas you need to address:

1. Indexing documents "in" a secure area
The Indexer needs to be able to 'login' (or bypass security) to be able to access your "members only" documents. Searcharoo is "intended" to support cookies, although I have never really tested it... if you were able to create a 'starting url' such as or something along those lines which would allow the Indexer to get a session cookie for accessing other secure pages, it could then at least index every document.
Alternatively, you could let the Indexer bypass security somehow (IP restrictions, hardcoding a header in the WebClient requests, whatever you can code into your security checks)

2. Restricting search results based on security/permissions
Assuming you can index ALL documents (unsecured and protected), you then want to check that only 'permissable' results are shown - particularly to NOT logged-in users. This would be easiest if your site is setup such that secure data is in specific subdirectories -- you could add some additional code in the results display to 'filter out' matches in those directories.

As I said, I'm not 100% that the 'login' idea above will work since I've never checked Searcharoo's cookie handling (ie. its ability to retain a logged-in session); but I'm certain it is possible to do with the right code.

I have thought about other approaches, such as adding a META tag (both to pages, and to the catalog; similar to the latitude/longitude ones added in v6) so that pages can be included in one (or more) "sections" for restricted searching, although this doesn't work so well for non-HTML content. Another idea was to NOT to login once, but force each secure document to return a 401 (or whatever Unauthorized HTTP code) and code Searcharoo to login and re-request every time - that way it could "know" which documents required login and which didn't.

Anyway, that's a few ideas for now - I'd be interested to hear how you go with the functionality. It's a good idea that might make it into a future version (probably using a directory-structure-dependent approach which would secure both HTML and other file types).

GeneralRe: Great work...any advice on multiple catalogues? Pin
HaydenMcL16-Aug-08 19:43
memberHaydenMcL16-Aug-08 19:43 
Questionwhy .dat file is larger than .xml file? Pin
Member 457173812-Aug-08 19:43
memberMember 457173812-Aug-08 19:43 
AnswerRe: why .dat file is larger than .xml file? Pin
craigd12-Aug-08 21:20
membercraigd12-Aug-08 21:20 
GeneralRe: why .dat file is larger than .xml file? [modified] Pin
Member 457173813-Aug-08 3:44
memberMember 457173813-Aug-08 3:44 
AnswerRe: why .dat file is larger than .xml file? Pin
craigd13-Aug-08 14:12
membercraigd13-Aug-08 14:12 
Generaljust simple but important question Pin
xmindx11-Aug-08 11:30
memberxmindx11-Aug-08 11:30 
AnswerRe: just simple but important question Pin
craigd13-Aug-08 14:08
membercraigd13-Aug-08 14:08 
Generalcould you help me with more details Pin
xmindx16-Aug-08 9:20
memberxmindx16-Aug-08 9:20 
GeneralReally great article. Pin
Rajib Ahmed1-Jul-08 19:58
memberRajib Ahmed1-Jul-08 19:58 
GeneralExcellent job Pin
Rajesh Naik Ponda Goa16-Jun-08 21:59
memberRajesh Naik Ponda Goa16-Jun-08 21:59 
Generalexcellent project ,nice job Pin
27185324113-Jun-08 17:55
member27185324113-Jun-08 17:55 
GeneralThe great Craig! He's back.. Pin
Ruchit Surati11-Jun-08 19:52
memberRuchit Surati11-Jun-08 19:52 
GeneralRe: The great Craig! He's back.. Pin
wangyanpeng28-Jun-08 1:24
memberwangyanpeng28-Jun-08 1:24 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

Permalink | Advertise | Privacy | Terms of Use | Mobile
Web02 | 2.8.180111.1 | Last Updated 11 Jun 2008
Article Copyright 2008 by craigd
Everything else Copyright © CodeProject, 1999-2018
Layout: fixed | fluid