
Background
This article follows on from the previous two Searcharoo samples:
Searcharoo Version 1 describes building a simple search engine that crawls the file system from a specified folder and indexes all HTML documents (or other known document types). A basic design and object model was developed to support simple, single-word searches, whose results were displayed in a rudimentary query/results page.
Searcharoo Version 2 focused on adding a 'spider' to find data to index by following web links (rather than just looking at directory listings in the file system). This means downloading files via HTTP, parsing the HTML to find more links and ensuring we don't get into a recursive loop because many web pages refer to each other. That article also discusses how the results for multiple search words are combined into a single set of 'matches'.
Introduction
This article (version 3 of Searcharoo) covers three main areas:
- Implementing a 'save to disk' function for the catalog
- Feature suggestions, bug fixes and incorporation of code contributed by others on previous articles (mostly via CodeProject - thank you!)
- Improving the code itself (adding comments, moving classes, improving readability and hopefully making it easier to modify & re-use)
New 'features' include:
- Saving the catalog (which resides in memory for fast searching) to disk
- Making the Spider recognise and follow pages referenced in FRAMESETs and IFRAMEs (suggested by le_mo_mo)
- Paging results rather than just listing them all on one page (submitted by Jim Harkins)
- Normalising words and numbers (removing punctuation, etc)
- (Optional) stemming of English words to reduce catalog size (suggested by Chris Taylor and Trickster)
- (Optional) use of Stop words to reduce catalog size
- (Optional) creation of a Go word list, to specifically catalog domain-specific words like "C#", which might otherwise be ignored
The bug fixes include:
- Correctly parsing <TITLE> tags that may have additional attributes eg. an ID= attribute in an ASP.NET environment. (submitted by xenomouse)
- Handling Cookies if the server has set them to track a 'session' (submitted by Simon Jones)
- Checking the 'final' URL after redirects to ensure the right page is indexed and linked (submitted by Simon Jones)
- Correctly parsing (and obeying!) the ROBOTS meta tag (I found this bug myself).
Code layout improvements included:
- The Spider code that was a bit of a mess in SearcharooSpider.aspx being moved into a proper C# class (and implementing an EventHandler to allow monitoring of progress)
- Encapsulation of Preferences into a single static class
- Layout of Searcharoo.cs using #regions (easy to read if you have VS.NET)
- User control (Searcharoo.ASCX) created for search box - if you want to re-brand it you only have to modify in one place.
- Paging implementation using PagedDataSource means you can easily alter the 'template' for the results (eg link size/color/layout) in Searcharoo3.aspx
Design
The fundamental Catalog-File-Word design remains unchanged (from Version 1), however there are quite a few extra classes implemented in this version.

To build the catalog, SearcharooSpider.aspx calls Spider.BuildCatalog() which:
1. Accesses Preferences static object to read settings
2. Creates empty Catalog
3. Creates IGoWord, IStopper and IStemming implementations (based on Preferences)
4. Processes startPageUri (with a WebRequest)
5. Creates HtmlDocument, populates properties including Link collections
6. Parses the content of the page, creating Word and File objects as required
7. Recursively applies steps 4 through 6 for each LocalLink
8. BinarySerializes the Catalog to disk using CatalogBinder
9. Adds the Catalog to Application.Cache[], for use by Searcharoo3.aspx for searching!
Code Structure
These are the files used in this version (and contained in the download).
| web.config | 14 settings that control how the spider and the search page behave. They are all 'optional' (ie the spider and search page will run if no config settings are provided) but I recommend at least providing <add key="Searcharoo_VirtualRoot" value="http://localhost/content/" /> |
| Searcharoo.cs | Most code for the application is in this file. Many classes that were in ASPX files in version 2 have been moved into this file (such as Spider and HtmlDocument) because it's easier to read and maintain. New version 3 features (Stop, Go, Stemming) all added here. |
| Searcharoo3.aspx | Search page (input and results). Checks the Application-Cache for a Catalog, and if none exists, creates one (deserialize OR run SearcharooSpider.aspx) |
| Searcharoo.ascx | NEW user control that contains two asp:Panels: the 'blank' search box (when page is first loaded, defaults to yellow background) and the populated search box (when results are displayed, defaults to blue background). See the screenshot at the top of the article. |
| SearcharooSpider.aspx | The main page (Searcharoo3.aspx) does a Server.Transfer to this page to create a new Catalog (if required). Almost ALL of the code that was in this page in version 2 has been migrated to Searcharoo.cs - OnProgressEvent() allows it to still display 'progress' messages as the spidering is taking place. |
Saving the Catalog to Disk
There are a couple of reasons why saving the catalog to disk is useful:
- It can be built on a different server to the website (for smaller sites, where the code may not have permission to write to disk on the webserver)
- If the server Application restarts, the catalog can be re-loaded rather than rebuilt entirely
- You can finally 'see' what information is stored in the catalog - useful for debugging!
There are two types of Serialization (Xml and Binary) available in the Framework, and since the Xml is 'human readable', that seemed the logical one to try. The code required to serialize the Catalog is very simple - the code below is from the Catalog.Save() method, so the reference to this is the Catalog object.
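// Inside Catalog.Save(): 'this' is the Catalog being written out
XmlSerializer serializerXml = new XmlSerializer( typeof( Catalog ) );
using (System.IO.TextWriter writer
    = new System.IO.StreamWriter( Preferences.CatalogFileName+".xml" ))
{
    serializerXml.Serialize( writer, this );
}   // disposing the writer closes the file, even if Serialize throws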
The 'test dataset' I've mostly used is the CIA World Factbook (download) which is about 52.6 Mb on disk for the HTML only (not including images and non-searchable data) - so imagine my "surprise" when the Xml-Serialized-Catalog itself was three times the size at 156 Mb (yes, megabytes!). I couldn't even open it easily, except by 'type'ing it from the Command Prompt.
OUCH - what a waste of space! And worse, this was the first time I'd noticed the fields defined in the File class were declared public and not private (see the elements beginning with underscores). Firstly, let's get rid of the serialized duplicates (fields that should be private, and their public property counterparts) - rather than change the visibility (and potentially break code), the [XmlIgnore] attribute can be added to the definition. To further reduce the amount of repeated text, the element names are compressed to single letters using the [XmlElement] attribute, and to reduce the number of <> some of the properties are marked to be serialized as [XmlAttribute]s.
[Serializable]
public class Word
{
[XmlElement("t")] public string Text;
[XmlElement("fs")] public File[] Files
...
[Serializable]
public class File
{
[XmlIgnore] public string _Url;
...
[XmlAttribute("u")] public string Url { ...
[XmlAttribute("t")] public string Title { ...
[XmlElement("d")] public string Description { ...
[XmlAttribute("d")] public DateTime CrawledDate { ...
[XmlAttribute("s")] public long Size { ...
...
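To see the effect of these attributes in isolation, here is a minimal sketch that serializes the same data twice, once with default element names and once with single-letter attributes. VerboseFile, CompactFile and XmlSizeDemo are hypothetical stand-ins, not the Searcharoo classes, and the sample values are made up.

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

// Hypothetical stand-in: default mapping, every field becomes a full element.
public class VerboseFile
{
    public string Url = "";
    public string Title = "";
    public long Size;
}

// Hypothetical stand-in: same data, single-letter attribute names.
public class CompactFile
{
    [XmlAttribute("u")] public string Url = "";
    [XmlAttribute("t")] public string Title = "";
    [XmlAttribute("s")] public long Size;
}

public static class XmlSizeDemo
{
    // Serializes any public type to an Xml string with XmlSerializer.
    public static string ToXml(object value)
    {
        XmlSerializer serializer = new XmlSerializer(value.GetType());
        using (StringWriter writer = new StringWriter())
        {
            serializer.Serialize(writer, value);
            return writer.ToString();
        }
    }

    public static void Main()
    {
        string verbose = ToXml(new VerboseFile { Url = "http://localhost/a.html", Title = "A", Size = 1024 });
        string compact = ToXml(new CompactFile { Url = "http://localhost/a.html", Title = "A", Size = 1024 });
        Console.WriteLine("verbose: {0} chars", verbose.Length);
        Console.WriteLine("compact: {0} chars", compact.Length);
    }
}
```

The saving per object is small, but it is repeated for every File under every Word, which is why the overall file shrinks so much.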
The Xml file is now a teeny (not!) 49 Mb in size, still too large for notepad but easily viewed via cmd. As you can see below, the 'compression' of the Xml certainly saved some space - at least the Catalog is now smaller than the source data!
Even with the smaller output, 49 Mb of Xml is still a little too verbose to be practical (hardly a surprise really, Xml often is!) so let's serialize the index to a Binary format (again, the Framework classes make it really simple).
using (System.IO.Stream stream = new System.IO.FileStream
    (Preferences.CatalogFileName+".dat" , System.IO.FileMode.Create ))
{
    System.Runtime.Serialization.IFormatter formatter =
        new System.Runtime.Serialization.Formatters.Binary.BinaryFormatter();
    formatter.Serialize (stream, this);   // 'this' is the Catalog
}   // the stream is closed here, even if Serialize throws
The results of changing to Binary Serialization were dramatic - the same catalog data was 4.6 Mb rather than 150! That's about 3% of the Xml size, definitely the way to go.
Now that I had the Catalog being saved successfully to disk, it seemed like it would be a simple matter to re-load it back into memory & the Application Cache...
Loading the Catalog from Disk
Unfortunately, it was NOT that simple. Whenever the Application restarted (say web.config or Searcharoo.cs was changed), the code could not de-serialize the file but instead threw this cryptic error:
Cannot find the assembly h4octhiw, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null

At first I was stumped - I didn't have any assembly named h4octhiw, so it wasn't immediately apparent why it could not be found. There are a couple of hints though:
- The 'not found' assembly appears to have a randomly generated name... and what do we know uses randomly generated assembly names? The \Temporary ASP.NET Files\ directory where dynamically compiled assemblies (from src="" and ASPX) are saved.
- The error line references only 'object' and 'stream' types - surely they aren't causing the problem
- Reading through the Stack Trace (click on the image) from the bottom, up (as always), you can infer that the Deserialize method creates a BinaryParser that creates an ObjectMap with an array of MemberNames which in turn request ObjectReader.GetType() which triggers the GetAssembly() method... but it fails! Hmm - sounds like it might be looking for the Types that have been serialized - why can't it find them?
If your Google skills are honed, rather than the dozens of useless links returned when you search for ASP.NET "Cannot find the assembly" you'll be lucky and stumble across this CodeProject article on Serialization where you will learn a very interesting fact:
Type information is also serialized while the class is serialized enabling the class to be deserialized using the type information. Type information consists of namespace, class name, assembly name, culture information, assembly version, and public key token. As long as your deserialized class and the class that is serialized reside in the same assembly it does not cause any problem. But if the serializer is in a separate assembly, .NET cannot find your class' type hence cannot deserialize it.
But what does it mean? Every time the web/IIS 'Application' restarts, all your ASPX and src="" code is recompiled to a NEW, RANDOMLY NAMED assembly in \Temporary ASP.NET Files\. So although the Catalog class is based on the same code, its Type Information (namespace, class name, assembly name, culture information, assembly version, and public key token) is DIFFERENT!
And, importantly, when a class is binary serialized, its Type Information is stored along with it (aside: this doesn't happen with Xml Serialization, so we probably would have been OK if we'd stuck with that).
The upshot: after every recompile (whatever triggered it: web.config change, code change, IIS restart, machine reboot, etc) our Catalog class has different Type info - and when it tries to load the serialized version we saved earlier, it doesn't match and the Framework can't find the assembly where the previous Catalog Type is defined (since it was only Temporary and has been deleted when the recompile took place).
Custom Formatter implementation
Sounds complex? It is, kinda, but the whole 'temporary assemblies' thing is something that happens invisibly and most developers don't need to know or care much about it. Thankfully we don't have to worry too much either, because the CodeProject article on Serialization also contains the solution: a helper class that 'tricks' the Binary Deserializer into using the 'current' Catalog type.
public class CatalogBinder: System.Runtime.Serialization.SerializationBinder
{
    // Called by the formatter for each type name found in the serialized data
    public override Type BindToType (string assemblyName, string typeName)
    {
        // Ignore the (stale) assembly name and namespace; match on class name only
        string[] typeInfo = typeName.Split('.');
        string className=typeInfo[typeInfo.Length -1];
        if (className.Equals("Catalog"))
        {
            return typeof (Catalog);
        }
        else if (className.Equals("Word"))
        {
            return typeof (Word);
        }
        else if (className.Equals("File"))
        {
            return typeof (File);
        }
        else
        {   // anything else (eg Framework types) is resolved the normal way
            return Type.GetType(string.Format( "{0}, {1}", typeName,
                assemblyName));
        }
    }
}
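As a rough illustration of what the binder does, the sketch below calls BindToType directly with the stale assembly name from the error message. The empty Catalog, Word and File classes and the DemoCatalogBinder name are hypothetical stand-ins so the example compiles on its own; the type name Searcharoo.Net.Catalog is taken from the stack trace quoted in the comments further down.

```csharp
using System;
using System.Runtime.Serialization;

// Hypothetical, empty stand-ins for the article's classes.
public class Catalog { }
public class Word { }
public class File { }

// Same mapping idea as CatalogBinder, condensed for the demo.
public class DemoCatalogBinder : SerializationBinder
{
    public override Type BindToType(string assemblyName, string typeName)
    {
        // Keep only the class name; the namespace and assembly are ignored.
        string className = typeName.Substring(typeName.LastIndexOf('.') + 1);
        switch (className)
        {
            case "Catalog": return typeof(Catalog);
            case "Word": return typeof(Word);
            case "File": return typeof(File);
            default: return Type.GetType(string.Format("{0}, {1}", typeName, assemblyName));
        }
    }
}

public static class BinderDemo
{
    public static void Main()
    {
        SerializationBinder binder = new DemoCatalogBinder();
        // The assembly recorded in the old file no longer exists...
        string staleAssembly = "h4octhiw, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null";
        // ...but the binder still hands back the Catalog type compiled right now.
        Type resolved = binder.BindToType(staleAssembly, "Searcharoo.Net.Catalog");
        Console.WriteLine(resolved == typeof(Catalog));

        // On the .NET Framework the binder is attached before loading, along the lines of:
        //   formatter.Binder = new CatalogBinder();
        //   Catalog catalog = (Catalog)formatter.Deserialize(stream);
    }
}
```

The Deserialize call is left as a comment because BinaryFormatter is obsolete in current .NET releases; on the Framework versions this article targets it is the normal way to use a binder.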
Et Voila! Now that the Catalog can be saved/loaded, the search engine is much more robust than before. You can save/back-up the Catalog, turn on debugging to see its contents, and even generate it on a different machine (say a local PC) and upload to your web server!
Using the 'debug' Xml serialized files, for the first time I could see the contents of the Catalog, and I found lots of 'garbage' was being stored that was both wasteful in terms of memory/disk and useless/unsearchable. With the major task for this release complete, it seemed appropriate to do some bugfixes and add some "real search engine" features to clean up the Catalog's contents.
New features & bug fixes
FRAME and IFRAME support
CodeProject member le_mo_mo pointed out that the spider did not follow (and index) framed content. This was a minor change to the regex that finds links - previously A and AREA tags were supported, so it was simple enough to add FRAME and IFRAME to the pattern.
foreach (Match match in Regex.Matches(htmlData
, @"(?<anchor><\s*(a|area|frame|iframe)\" +
@"s*(?:(?:\b\w+\b\s*(?:=\s*(?:""[^""]*""|'[^']" +
@"*'|[^""'<> ]+)\s*)?)*)?\s*>)"
, RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture))
{
Stop words
Let's start with Google's definition of Stop Words:
Google ignores common words and characters, such as "where" and "how," as well as certain single digits and single letters. These terms rarely help narrow a search and can slow search results. We call them "stop words."
The basic premise is that we don't want to waste space in the catalog storing data that will never be used. The 'Stop Words' assumption is that you'll never search for words like "a in at I" because they appear on almost every page, and therefore don't actually help you find anything!
Here's a basic definition from MIT and some interesting statistics and Stop Word thoughts including the 'classic' Stop Word conundrum: should users be able to search for Hamlet's soliloquy "to be or not to be"?
The Stop Word code supplied with Searcharoo3 is pretty basic - it strips out ALL one and two letter words, plus
the, and, that, you, this, for, but, with, are, have, was, out, not
A more complex implementation is left for others to contribute (or a future version, whichever comes first).
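The rule described above (drop every one and two letter word, plus the short list) can be sketched as follows. This is an illustration only, not the code shipped in Searcharoo.cs; the StopWordDemo class and its member names are made up for the example.

```csharp
using System;
using System.Collections.Generic;

public static class StopWordDemo
{
    // The stop list quoted in the article.
    static readonly HashSet<string> StopList = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
    {
        "the", "and", "that", "you", "this", "for", "but", "with", "are", "have", "was", "out", "not"
    };

    // True when the word should be kept out of the catalog.
    public static bool IsStopWord(string word)
    {
        return word.Length <= 2 || StopList.Contains(word);
    }

    public static void Main()
    {
        // The 'classic' conundrum: almost nothing of the phrase survives.
        foreach (string word in "to be or not to be that is the question".Split(' '))
        {
            if (!IsStopWord(word)) Console.WriteLine(word);   // only "question" is printed
        }
    }
}
```

Running the Hamlet phrase through it shows the conundrum directly: every word of "to be or not to be" is discarded, so that query could never match anything.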
Word normalization
I had noticed words were often being stored with any punctuation that was adjacent to them in the source text. For example, the Catalog contained Files with Word instances for
| "People | people | people* | people |
This prevented the pages containing those words from ever being returned in a search, unless the user had typed the exact punctuation as well - in the above example a search for people would only return one page, when you would expect it to return all four pages.
The previous version of Searcharoo did have a 'black list' of punctuation [,./?;:()-=etc] but that wasn't sufficient as I could not predict/foresee all possible punctuation characters. Also, it was implemented with the Trim() method which was not parsing out punctuation within words [aside: the handling of parenthesised words is still not satisfactory in version 3]. The following 'white list' of characters that are allowed to be indexed ensures that NO punctuation is accidentally stored as part of a word.
key = System.Text.RegularExpressions.Regex.Replace(key, @"[^a-z0-9,.]"
, ""
, System.Text.RegularExpressions.RegexOptions.IgnoreCase);
Culture note: this "white list" method of removing punctuation is VERY English-language centric, as it will remove at least some characters from most European languages, and it will strip ALL content from most Asian-language content.
If you want to use Searcharoo with non-English character sets, you should find the above line of code and REPLACE it with this "black list" from Version 2. While it allows more characters to be searched, the results are more likely to be polluted by punctuation which could reduce searchability.
key = word.Trim
(' ','?','\"',',','\'',';',':','.','(',')','[',']','%','*','$','-').ToLower();
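To compare the two approaches side by side, here is a small sketch that runs a few sample words through both. The NormalizeDemo class is hypothetical, and lower-casing the white list result is an assumption added so that the two outputs are comparable.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NormalizeDemo
{
    // Version 3 'white list': anything that is not a-z, 0-9, comma or full stop is removed.
    // (The ToLower() call is an assumption for this demo.)
    public static string WhiteList(string word)
    {
        return Regex.Replace(word, @"[^a-z0-9,.]", "", RegexOptions.IgnoreCase).ToLower();
    }

    // Version 2 'black list': listed characters are trimmed from the ends only.
    public static string BlackList(string word)
    {
        return word.Trim(' ', '?', '\"', ',', '\'', ';', ':', '.', '(', ')', '[', ']', '%', '*', '$', '-').ToLower();
    }

    public static void Main()
    {
        string[] samples = { "\"People", "people*", "peo-ple", "caf\u00e9" };
        foreach (string sample in samples)
        {
            Console.WriteLine("{0} -> white: {1} | black: {2}", sample, WhiteList(sample), BlackList(sample));
        }
    }
}
```

The last two samples show the trade-off: the white list removes the hyphen inside "peo-ple" where Trim() cannot, but it also cuts the accented letter from "café", which is the problem the culture note warns about.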
Number normalization
Numbers are a special case of word normalization: some punctuation is required to interpret the number (eg decimal point), then convert it to a proper number.
Although not perfect, this means phone numbers written as 0412-345-678 or (04)123-45678 would both be Catalogued as 0412345678 and therefore searching for either 0412-345-678 or (04)123-45678 would match both source documents.
private bool IsNumber (ref string word)
{
try
{
long number = Convert.ToInt64(word);
word = number.ToString();
return (word!=String.Empty);
}
catch
{
return false;
}
}
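The phone number example can be sketched end to end as below. This is not the article's code: stripping the punctuation inside the same method, and using long.TryParse in place of the try/catch, are both choices made for this illustration, and NumberDemo is a made-up name.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NumberDemo
{
    // Returns true and a normalized key when the word is made of digits and
    // punctuation only; words containing letters are left alone.
    public static bool TryNormalizeNumber(string word, out string key)
    {
        key = word;
        if (Regex.IsMatch(word, "[A-Za-z]")) return false;

        string digits = Regex.Replace(word, "[^0-9]", "");
        long number;
        if (digits.Length == 0 || !long.TryParse(digits, out number)) return false;

        // Note: round-tripping through Int64 drops any leading zeros.
        key = number.ToString();
        return true;
    }

    public static void Main()
    {
        string first, second;
        TryNormalizeNumber("0412-345-678", out first);
        TryNormalizeNumber("(04)123-45678", out second);
        Console.WriteLine("{0} / {1} / same key: {2}", first, second, first == second);
    }
}
```

Both spellings end up with the same key, which is all the search needs. In this sketch the conversion to a number also removes the leading zero, so the same normalization has to be applied to the search term.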
Go words
After reading the Word Normalization section above you can see how cataloging and searching for a technical term/phrase (like C# or C++) is impossible - the non-alphanumeric characters are filtered out before they have a chance to be catalogued.
To avoid this, Searcharoo allows a 'Go words' list to be created. A 'Go word' is the opposite of a 'Stop word': instead of being blocked from cataloguing, it is given a free-pass into the catalog, bypassing the Normalization and Stemming code.
The weakness in this approach is that you must know ahead of time all the different Go words that your users might search for. In future, you might want to store each unsuccessful search term for later analysis and expansion of your Go word list. The Go word implementation is very simple:
public bool IsGoWord (string word)
{
    switch (word.ToLower())
    {
        case "c#":
        case "vb.net":
        case "asp.net":
            return true;   // listed terms bypass Normalization and Stemming
    }
    return false;
}
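To show why the check has to happen before normalization, here is a minimal sketch of the ordering. GoWordDemo and ToCatalogKey are hypothetical names, and the HashSet is just an alternative to the switch statement above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GoWordDemo
{
    static readonly HashSet<string> GoWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "c#", "vb.net", "asp.net" };

    // Go words skip the white list; everything else is normalized as usual.
    public static string ToCatalogKey(string word)
    {
        if (GoWords.Contains(word)) return word.ToLowerInvariant();
        return Regex.Replace(word, @"[^a-z0-9,.]", "", RegexOptions.IgnoreCase).ToLowerInvariant();
    }

    public static void Main()
    {
        Console.WriteLine(ToCatalogKey("C#"));    // on the list, kept intact
        Console.WriteLine(ToCatalogKey("C++"));   // not on the list, reduced to a single letter
    }
}
```

"C++" is not in the list, so it collapses to "c" and would then be discarded as a one-letter stop word, which is exactly the weakness described above.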
Stemming
The most basic explanation of 'stemming' is that it attempts to identify 'related' words and return them in response to a query. The simplest example is plurals: searching for "field" should also find instances of "fields" and vice versa. More complex examples are "realize" and "realization", "populate" and "population" - the
This page on How a Search Engine Works contains a brief explanation of Stemming and some of the other techniques described above.
The Porter Stemming Algorithm already existed as a C# class, so was utilized 'as is' in Searcharoo3 (credit and thanks to Martin Porter).
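To make the idea concrete, here is a deliberately naive sketch that handles only the plural example. It is NOT the Porter algorithm used by Searcharoo3 (which applies many more suffix rules); ToyStemmerDemo is a made-up class for illustration.

```csharp
using System;

public static class ToyStemmerDemo
{
    // Strips a single trailing "s" from longer words; nothing more.
    public static string Stem(string word)
    {
        string w = word.ToLowerInvariant();
        bool looksPlural = w.Length > 3
            && w.EndsWith("s", StringComparison.Ordinal)
            && !w.EndsWith("ss", StringComparison.Ordinal);
        return looksPlural ? w.Substring(0, w.Length - 1) : w;
    }

    public static void Main()
    {
        // Both forms produce the same catalog key, so either search finds both.
        Console.WriteLine(Stem("field") == Stem("fields"));
    }
}
```

Because both forms are stored under one key, the catalog needs fewer Word entries, which is where the size reduction comes from.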
Effect on Catalog size
The Stop Words, Stemming, and Normalization steps above were all developed to 'tidy up' the Catalog and hopefully reduce its size/increase search speed. The results are listed below for our CIA World Factbook:
| source: 800 files 52.6 Mb | Raw * | + Stop words | + Stemming | + 'white list' normalization |
| Unique Words | 30,415 | 30,068 | 26,560 | 26,050 |
| Xml Serialized | 156 Mb ^ | 149 Mb | 138 Mb | 136 Mb |
| Binary Serialized | 4.6 Mb | 4.5 Mb | 4.1 Mb | 4.0 Mb |
| Binary % of source | 8.75% | 8.55% | 7.79% | 7.60% |
* black list normalization, which is commented out in the code, and mentioned in the 'culture note'
^ 49 Mb after 'compressing' the Xml output with [Attributes]
The result was a 14% reduction in the number of words and a 13% decrease in Binary file size (mostly due to the addition of Stemming). Because the whole Catalog stays in memory (in the Application Cache) keeping the size small is important - maybe a future version will be able to persist some 'working copy' of the data to disk and enable spidering of really large sites, but for now the catalog seems to take less than 10% of the source data size.
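The percentages quoted above follow directly from the table; a quick sketch of the arithmetic, using only the figures already given (CatalogSizeDemo is a made-up name):

```csharp
using System;

public static class CatalogSizeDemo
{
    // part as a percentage of whole
    public static double PercentOf(double part, double whole)
    {
        return part / whole * 100.0;
    }

    // how much smaller 'after' is than 'before', as a percentage of 'before'
    public static double PercentReduction(double before, double after)
    {
        return (before - after) / before * 100.0;
    }

    public static void Main()
    {
        Console.WriteLine("Binary % of source (raw):   {0:F2}", PercentOf(4.6, 52.6));
        Console.WriteLine("Binary % of source (final): {0:F2}", PercentOf(4.0, 52.6));
        Console.WriteLine("Word count reduction:       {0:F1}", PercentReduction(30415, 26050));
        Console.WriteLine("Binary size reduction:      {0:F1}", PercentReduction(4.6, 4.0));
    }
}
```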
...but what about the UI?
The search user interface also had some improvements:
- Moving the search inputs into the Searcharoo.ascx User Control
- Adding the same Stemming, Stop and Go word parsing to the search term that is applied during spidering
- Generating the result list using the new ResultFile class to construct a DataSource to bind to a Repeater control
- Adding PagedDataSource and custom paging links rather than one long list of results (thanks to Jim Harkins' feedback/code and uberasp.net)
ResultFile and SortedList
In version 2, outputting the results was very crude: the code was littered with Response.Write calls making it difficult to reformat the output. Jim Harkins posted some Visual Basic code which is converted to C# below.
foreach (object foundInFile in finalResultsArray.Keys)
{
infile = new ResultFile ((File)foundInFile);
infile.Rank = (int)((DictionaryEntry)finalResultsArray[foundInFile]).Value;
sortrank = infile.Rank * -1000;
if (output.Contains(sortrank) )
{
for (int i = 1; i < 999; i++)
{
sortrank++;
if (!output.Contains (sortrank))
{
output.Add (sortrank, infile);
break;
}
}
} else {
output.Add(sortrank, infile);
}
sortrank = 0;
}
Jim's code does some trickery with a new 'sortrank' variable to try and keep the files in 'Searcharoo rank' order, but with unique keys in the output SortedList. If thousands of results were returned, you might run into trouble...
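The trouble is that the inner loop gives up after 998 attempts, so once roughly a thousand files share the same rank the extra ones are silently left out of the list. One possible alternative, shown here only as a sketch and not as what Searcharoo does, is to sort a plain list instead of hunting for unique keys. It assumes LINQ is available (.NET 3.5 or later, so newer than the Framework versions this article targets), and the ResultFile class below is a hypothetical cut-down stand-in.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, cut-down stand-in for the article's ResultFile class.
public class ResultFile
{
    public string Url;
    public int Rank;
    public ResultFile(string url, int rank) { Url = url; Rank = rank; }
}

public static class RankSortDemo
{
    // Highest rank first. OrderByDescending is a stable sort, so files with
    // equal rank keep their original relative order and none are dropped.
    public static List<ResultFile> SortByRank(IEnumerable<ResultFile> results)
    {
        return results.OrderByDescending(r => r.Rank).ToList();
    }

    public static void Main()
    {
        List<ResultFile> found = new List<ResultFile>();
        for (int i = 0; i < 1500; i++)
        {
            found.Add(new ResultFile("http://localhost/page" + i + ".html", 3));
        }
        found.Add(new ResultFile("http://localhost/best.html", 7));

        List<ResultFile> sorted = SortByRank(found);
        Console.WriteLine("{0} results, first is {1}", sorted.Count, sorted[0].Url);
    }
}
```

All 1501 results survive here, with the highest ranked page first.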
PagedDataSource
Once the results are in the SortedList, they are assigned to a PagedDataSource, which is then bound to a Repeater control on Searcharoo3.aspx.
SortedList output =
new SortedList (finalResultsArray.Count);
...
pg.DataSource = output.GetValueList();
pg.AllowPaging = true;
pg.PageSize = Preferences.ResultsPerPage;
pg.CurrentPageIndex = Request.QueryString["page"]==null?0:
Convert.ToInt32(Request.QueryString["page"])-1;
SearchResults.DataSource = pg;
SearchResults.DataBind();
This makes it a LOT easier to reformat the results list however you like!
<asp:Repeater id="SearchResults" runat="server">
<HeaderTemplate>
<p><%=NumberOfMatches%> results for <%=Matches%> took
<%=DisplayTime%></p>
</HeaderTemplate>
<ItemTemplate>
<a href="<%# DataBinder.Eval(Container.DataItem, "Url") %>"><b>
<%# DataBinder.Eval(Container.DataItem, "Title") %></b></a>
<a href="<%# DataBinder.Eval(Container.DataItem, "Url") %>"
target="_blank" title="open in new window"
style="font-size:x-small">↑</a>
<font color=gray>(<%# DataBinder.Eval(Container.DataItem, "Rank") %>)
</font>
<br><%# DataBinder.Eval(Container.DataItem, "Description") %>...
<br><font color=green><%# DataBinder.Eval(Container.DataItem, "Url") %>
- <%# DataBinder.Eval(Container.DataItem, "Size") %>
bytes</font>
<font color=gray>-
<%# DataBinder.Eval(Container.DataItem, "CrawledDate") %></font><p>
</ItemTemplate>
<FooterTemplate>
<p><%=CreatePagerLinks(pg, Request.Url.ToString() )%></p>
</FooterTemplate>
</asp:Repeater>
Unfortunately the page links are generated via embedded Response.Write calls in CreatePagerLinks... maybe this will be templated in a future version...
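As a rough idea of the arithmetic behind the paging, the sketch below works out the page count, turns the 1-based page query value into the 0-based index that CurrentPageIndex expects, and builds the link targets as plain strings rather than writing them to the response. It does not use System.Web, PagingDemo and its methods are made-up names, and the searchfor query parameter in the sample URL is an assumption.

```csharp
using System;
using System.Collections.Generic;

public static class PagingDemo
{
    // Number of pages needed for the results (ceiling division).
    public static int PageCount(int totalResults, int pageSize)
    {
        return (totalResults + pageSize - 1) / pageSize;
    }

    // Converts the 1-based "page" query value into a 0-based index,
    // falling back to the first page when it is missing or not a number.
    public static int ParsePageIndex(string pageParam, int pageCount)
    {
        int page;
        if (!int.TryParse(pageParam, out page) || page < 1) return 0;
        if (page > pageCount) return Math.Max(pageCount - 1, 0);
        return page - 1;
    }

    // One link target per page; baseUrl is assumed to already contain a query string.
    public static List<string> PagerLinks(string baseUrl, int pageCount)
    {
        List<string> links = new List<string>();
        for (int p = 1; p <= pageCount; p++)
        {
            links.Add(baseUrl + "&page=" + p);
        }
        return links;
    }

    public static void Main()
    {
        int pages = PageCount(25, 10);
        Console.WriteLine("pages: {0}, index for page=2: {1}", pages, ParsePageIndex("2", pages));
        foreach (string link in PagerLinks("Searcharoo3.aspx?searchfor=people", pages))
        {
            Console.WriteLine(link);
        }
    }
}
```

Unlike the Convert.ToInt32 call shown earlier, TryParse does not throw when someone edits the page value in the address bar.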
The Future...
If you check the dates below, you'll notice there was almost one and a half years between version 2 and 3, so it might sound optimistic to discuss another 'future' version - but you never know...
Unfortunately many of the new features above are English-language specific (although they can be disabled to ensure Searcharoo can still be used on other language websites). However in a future version I'd like to try making the code a little more intelligent about handling European, Asian and other languages.
It would also be nice if the user could type boolean OR searches, or group terms with quotes " " like Google, Yahoo, etc.
And finally, indexing of document types besides Html (mainly other web-types like PDF) would be useful for many sites.
ASP.NET 2.0
Searcharoo3 runs on ASP.NET 2.0 pretty much unmodified - just remove src="Searcharoo.cs" from the @Page attribute, and move the Searcharoo.cs file into the App_Code directory.

Visual Studio.NET internal web server warning: the Searcharoo_VirtualRoot setting (where the spider starts looking for pages to index) defaults to http://localhost/. VS.NET's internal web server chooses a random port to run on, so if you're using it to test Searcharoo, you may need to set this web.config value accordingly.
History
- 2004-06-30: Version 1 on CodeProject
- 2004-07-03: Version 2 on CodeProject
- 2006-05-24: Version 3 (this page) on CodeProject
Dear author, thank you for providing such great work! Currently I am researching vertical search engines and looking for an efficient topic-focused spider (e.g. one oriented towards blogs or journal literature). Can you give me some suggestions about this? Thank you in advance for your reply!
http://www.ibaima.com(Online Knowledge Question/Answer Platform)
Error | Devil's Eye | 6:48 2 Aug '07
I am continually getting an error in the lines:
return System.Configuration.ConfigurationSettings.AppSettings[appSetting] == null?defaultValue:Convert.ToInt32(System.Configuration.ConfigurationSettings.AppSettings[appSetting]);
The error states that: Input string was not in a correct format.
I am totally unable to spot the error. Please help.
Firstly, version 5 of Searcharoo is available on CodeProject[^], you may want to upgrade to that.
Secondly, for your question, my guess is that one of the web.config values is not numeric when it should be - could also be an empty string? If you step through the code (or add a Trace line to output appSetting) it shouldn't be hard to see which setting is causing the problem.
HTH Craig
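For what it's worth, one way to make that kind of lookup tolerate a blank or non-numeric value is sketched below. It is only an illustration: a Dictionary stands in for ConfigurationSettings.AppSettings so the example runs on its own, and the setting names are made up.

```csharp
using System;
using System.Collections.Generic;

public static class SettingsDemo
{
    // Returns the setting as an int, or defaultValue when the key is missing,
    // empty or not a valid number (instead of throwing a FormatException).
    public static int GetInt(IDictionary<string, string> settings, string key, int defaultValue)
    {
        string raw;
        int value;
        if (settings.TryGetValue(key, out raw) && int.TryParse(raw, out value))
        {
            return value;
        }
        return defaultValue;
    }

    public static void Main()
    {
        // Hypothetical settings; the second one is the kind of value that
        // makes Convert.ToInt32 fail with "Input string was not in a correct format".
        Dictionary<string, string> settings = new Dictionary<string, string>();
        settings["Example_GoodNumber"] = "25";
        settings["Example_BadNumber"] = "";

        Console.WriteLine(GetInt(settings, "Example_GoodNumber", 10));
        Console.WriteLine(GetInt(settings, "Example_BadNumber", 10));
        Console.WriteLine(GetInt(settings, "Example_Missing", 10));
    }
}
```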
What that means is the Searcharoo.Net.File class is declared twice somewhere in your web application. Typical scenarios this occurs:
1) in ASP.NET 1.1 or 2.0, your SearcharooSpider.aspx page has src="Searcharoo.cs" at the top of the page, AND you've also copied a Searcharoo.dll (assembly) into your /bin/ directory, or
2) in ASP.NET 2.0, the problem could occur if your SearcharooSpider.aspx page has src="Searcharoo.cs" at the top of the page, AND you've got Searcharoo.cs in the /App_Code/ directory.
3) in either version, you may have src="Searcharoo.cs" at the top of a page, and somehow compiled the Searcharoo.cs file into your web application DLL (assembly) by including it in your project.
Can you post (a) what version of .NET you are using; (b) where you have placed any Searcharoo-related pages and code (.aspx, .cs, .dll) files; (c) any other settings you think might be relevant. As a quick-fix, you could try deleting src="Searcharoo.cs" from the top of the page (if it is there) and see if that works.
p.s. version 4 is now available[^]
Hi there. I have a question about using Searcharoo 3. My site has links to external sites, but I want the spider to crawl only the URLs inside my site. How can I do that?
Thank you so much
Firstly, version 4 is now available[^].
Secondly, Searcharoo only indexes your site. Inside the parsing code, Searcharoo builds two arrays, InternalLinks and ExternalLinks, but it only follows the InternalLinks.
The way the Catalog works doesn't really scale to searching external sites (or the whole internet!).
Uh OK, but I have a page that does a Response.Redirect() to an external site, and perhaps because of this Searcharoo goes to the external site, since I can search on content placed on that site. How can I avoid this? Or how can I exclude the pages that do the redirect from the spider?
Thank you
D'oh! | craigd | 12:22 29 Mar '07
Ah ok, I've not heard of that before; but it does sound like it's a symptom of a known bug that I will have to fix.
Within the Download method, the code that loads each Url does this:
req.AllowAutoRedirect = true;
req.MaximumAutomaticRedirections = 3;
which basically bypasses two important checks: whether the page being redirected to is already in the _Visited collection AND whether it's on the local site... hence your problem.
Just OTTOMH, you could try a couple of fixes:
1) Change req.MaximumAutomaticRedirections = 0; or req.AllowAutoRedirect = false; and test whether the rest of your site still gets indexed.
2) Change your redirecting page to detect Request.UserAgent and don't redirect to external sites if it's a robot (Searcharoo's UserAgent string is set in the .config file).
3) Put the URL to the redirecting page/s in robots.txt (you could use the User-agent: * section as well)
User-agent: Searcharoo
Disallow: /SomeDir/PageThatRedirects.aspx
Disallow: /AnotherDir/AnotherRedirectPage.aspx
The proper fix will be to encapsulate the link-processing code from HtmlDocument.Parse() into a separate object so that the Spider.Download() can utilise it to check whether the URL resulting from redirects is still local (and not already in _Visited). That should appear in the next (not yet released) version.
Hope that helps
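The two checks mentioned above could look something like the sketch below. This is not the fix that went into Searcharoo; RedirectCheckDemo and ShouldIndex are made-up names, and a plain collection of strings stands in for the _Visited collection.

```csharp
using System;
using System.Collections.Generic;

public static class RedirectCheckDemo
{
    // finalUri is the address actually reached after any redirects
    // (for an HttpWebResponse that would be its ResponseUri property).
    public static bool ShouldIndex(Uri siteRoot, Uri finalUri, ICollection<string> visited)
    {
        if (!siteRoot.IsBaseOf(finalUri)) return false;             // ended up off-site
        if (visited.Contains(finalUri.AbsoluteUri)) return false;   // already indexed
        return true;
    }

    public static void Main()
    {
        Uri root = new Uri("http://localhost/content/");
        List<string> visited = new List<string>();
        visited.Add("http://localhost/content/seen.aspx");

        Console.WriteLine(ShouldIndex(root, new Uri("http://localhost/content/about.aspx"), visited));
        Console.WriteLine(ShouldIndex(root, new Uri("http://www.example.com/"), visited));
        Console.WriteLine(ShouldIndex(root, new Uri("http://localhost/content/seen.aspx"), visited));
    }
}
```

Only the first address passes both checks; the external site and the already visited page are rejected.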
I built a sample using Searcharoo3 here in my development environment and everything went fine. But once I uploaded it to the hosting production environment, I keep getting the following error message when attempting to make a simple search.
I also tried version 4 and got a security exception as well.
has anyone got a clue about what this might be...? Thank you.!
Security Exception Description: The application attempted to perform an operation not allowed by the security policy. To grant this application the required permission please contact your system administrator or change the application's trust level in the configuration file.
Exception Details: System.Security.SecurityException: Request for the permission of type 'System.Security.Permissions.SecurityPermission, mscorlib, Version=2.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089' failed.
Source Error:
Line 72: // No catalog 'in memory', so let's look for one
Line 73: // First, for a serialized version on disk
Line 74: m_catalog = Catalog.Load(); // returns null if not found
Line 75:
Line 76: // Still no Catalog, so we have to start building a new one
Source File: c:\domains\transwl.com\wwwroot\UIA\Searcharoo3.aspx Line: 74
Stack Trace:
[SecurityException: Request for the permission of type 'System.Security.Permissions.SecurityPermission, mscorlib, Version=2.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089' failed.]
System.Runtime.Serialization.Formatters.Binary.ObjectReader.CheckSecurity(ParseRecord pr) +1644244
System.Runtime.Serialization.Formatters.Binary.ObjectReader.ParseObject(ParseRecord pr) +363
System.Runtime.Serialization.Formatters.Binary.ObjectReader.Parse(ParseRecord pr) +64
System.Runtime.Serialization.Formatters.Binary.__BinaryParser.ReadObjectWithMapTyped(BinaryObjectWithMapTyped record) +1050
System.Runtime.Serialization.Formatters.Binary.__BinaryParser.ReadObjectWithMapTyped(BinaryHeaderEnum binaryHeaderEnum) +62
System.Runtime.Serialization.Formatters.Binary.__BinaryParser.Run() +144
System.Runtime.Serialization.Formatters.Binary.ObjectReader.Deserialize(HeaderHandler handler, __BinaryParser serParser, Boolean fCheck, Boolean isCrossAppDomain, IMethodCallMessage methodCallMessage) +183
System.Runtime.Serialization.Formatters.Binary.BinaryFormatter.Deserialize(Stream serializationStream, HeaderHandler handler, Boolean fCheck, Boolean isCrossAppDomain, IMethodCallMessage methodCallMessage) +190
System.Runtime.Serialization.Formatters.Binary.BinaryFormatter.Deserialize(Stream serializationStream) +12
Searcharoo.Net.Catalog.Load() +122
ASP.searcharoo3_aspx.Page_Load() in c:\domains\transwl.com\wwwroot\UIA\Searcharoo3.aspx:74
System.Web.Util.CalliHelper.ArglessFunctionCaller(IntPtr fp, Object o) +5
System.Web.Util.CalliEventHandlerDelegateProxy.Callback(Object sender, EventArgs e) +784015
System.Web.UI.Control.OnLoad(EventArgs e) +99
System.Web.UI.Control.LoadRecursive() +47
System.Web.UI.Page.ProcessRequestMain(Boolean includeStagesBeforeAsyncPoint, Boolean includeStagesAfterAsyncPoint) +6953
System.Web.UI.Page.ProcessRequest(Boolean includeStagesBeforeAsyncPoint, Boolean includeStagesAfterAsyncPoint) +154
System.Web.UI.Page.ProcessRequest() +86
System.Web.UI.Page.ProcessRequestWithNoAssert(HttpContext context) +18
System.Web.UI.Page.ProcessRequest(HttpContext context) +49
ASP.searcharoo3_aspx.ProcessRequest(HttpContext context) in App_Web_searcharoo3.aspx.cdcab7d2.x9da3rev.0.cs
System.Web.CallHandlerExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute() +154
System.Web.HttpApplication.ExecuteStep(IExecutionStep step, Boolean& completedSynchronously) +64
Hi there,
I haven't seen this error before - it's a security permission failing on deserializing the catalog/index from disk. I suspect this is because your ISP/webhost-provider has your website set to Medium Trust.
I can reproduce the error by adding <system.web> <trust level="Medium" originUrl="" /> </system.web> to my web.config. The error "Additional information: Request for the permission of type 'System.Security.Permissions.SecurityPermission, mscorlib, Version=2.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089' failed." is then thrown on the line deserializedCatalogObject = formatter.Deserialize(stream); in Catalog.cs, inside public static Catalog Load(). Changing the level to "Full" (which is the default if the setting is omitted) fixes it again, but I doubt that will work on an ISP server where the trust level is set to Medium higher up the chain (at machine.config level).
Here are some references about altering trust level (you'll need to get your ISP to co-operate)...
Scroll down to Joe Brinkman's comments on this post[^]: "Reflection is restricted but not eliminated in medium trust. The real restriction is that you cannot reflect non-public members of a type. So as long as you limit your reflection to public members then reflection is permissible. This is why you can serialize objects with the XMLSerializer and not the Binary Serializer. XML only serializes public members while binary serialization serializes all state including private variables, hence it will not work in medium trust."
That's the reason why it's failing, although it doesn't help much with a solution.
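To illustrate the distinction in that quote, here is a minimal sketch of saving and loading a catalog-like object with XmlSerializer, which only touches public members. The WordEntry, SimpleCatalog and CatalogXml types are hypothetical stand-ins invented for this example; they are not Searcharoo's actual Catalog class, and this has not been verified on a hosted Medium Trust server.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

// Hypothetical stand-in for one catalog entry: a word and the pages it appears on.
// Everything that should be persisted is a public member, because
// XmlSerializer ignores private state.
public class WordEntry
{
    public string Word;
    public List<string> Urls = new List<string>();
}

// Hypothetical stand-in for the catalog itself (NOT Searcharoo's Catalog class).
public class SimpleCatalog
{
    public List<WordEntry> Words = new List<WordEntry>();
}

public static class CatalogXml
{
    // Serialize the catalog to an XML string.
    public static string Save(SimpleCatalog catalog)
    {
        XmlSerializer serializer = new XmlSerializer(typeof(SimpleCatalog));
        using (StringWriter writer = new StringWriter())
        {
            serializer.Serialize(writer, catalog);
            return writer.ToString();
        }
    }

    // Rebuild the catalog from the XML produced by Save().
    public static SimpleCatalog Load(string xml)
    {
        XmlSerializer serializer = new XmlSerializer(typeof(SimpleCatalog));
        using (StringReader reader = new StringReader(xml))
        {
            return (SimpleCatalog)serializer.Deserialize(reader);
        }
    }
}
```

Calling CatalogXml.Load(CatalogXml.Save(catalog)) should round-trip the public fields. A list is used here rather than a Hashtable because XmlSerializer does not serialize IDictionary types, so a real catalog would probably need reshaping before this approach could apply.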
Rick's post Running ASP.NET in Medium Trust[^] describes how he resolved a similar problem - it may be that you need to beg your ISP to update the machine.config with <SecurityClass Name="ReflectionPermission" Description="System.Security.Permissions.ReflectionPermission, mscorlib, Version=2.0.0.0, Culture=neutral,PublicKeyToken=b77a5c561934e089"/> and <IPermission class="ReflectionPermission" version="1" Unrestricted="true" />.
Using Enterprise Library in ASP.NET 2.0 Partial Trust Mode[^] also deals with Medium Trust issues (that's a good article to read too). Maybe your ISP can duplicate their web_mediumtrust.config for you with the additional permissions you need, and you could point to it like the article describes.
That's all I can think of for now. Good luck! Craig
Searcharoo "2007" (version 5)[^] has been tested with trust level="Medium" - which I think will fix this problem.
NOTE: you MUST generate the Catalog file on another PC or server (using the supplied Indexer Console application) and UPLOAD it to use Searcharoo under Medium Trust
Hope that helps Craig
I want to index HTML-based help files, but they are in a frame-based htm page with a toc.htm (table of contents) file that is full of javascript firing calls to htm files through a function. Searcharoo isn't adding any of these "links" to the catalog. Do you have any other suggestions?
I also had a few issues that I had to resolve for my environment, as follows.
1. Had to add location exceptions to forms authentication in the root app web.config.
2. Needed to change the path to the Catalog in the searcharoo.cs file (personal preference).
3. Had to add write permissions to the searcharoo directory in IIS. Otherwise the catalog wouldn't be created.
4. Had to add anonymous access rights to the directory being spidered. Otherwise I got a 401 error.
It would be helpful to document any special setup requirements such as these for different types of installations.
Thanks, Chris
The problem with parsing javascript 'links' to follow is that they could take almost any form. If you could post an example of what your <a tags look like maybe I can post a usable code sample.
Basically what I'd do is find the line
    if ("href" == submatch.Groups[1].ToString().ToLower())
in the Spider code. In that same code block you must delete the "break;" statement and add another if statement after it, like this:
    if ("onclick" == submatch.Groups[1].ToString().ToLower())
    {
        // take the text between the first pair of apostrophes in the onclick script
        string jscript = submatch.Groups[2].ToString();
        int firstApos = jscript.IndexOf("'");
        int secondApos = jscript.IndexOf("'", firstApos+1);
        link = jscript.Substring(firstApos + 1, secondApos - firstApos - 1);
    }
I've just tested this and it works (on a basic level). It makes some assumptions (eg. <a href="#" onclick="window.location='content/kilimanjaro.pdf'"> will work, whereas <a onclick="window.location='content/kilimanjaro.pdf'" href="#"> won't because of the loop/ordering) so it needs a little more work, but you get the idea (and could code it to fit your scenario).
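One way the ordering assumption might be avoided is to pull the quoted target out of the onclick attribute with its own regular expression, independent of where href appears in the tag. The sketch below is only an illustration: the OnclickLinks class and its Extract method are hypothetical names, not part of the Searcharoo Spider, and it assumes a double-quoted onclick value whose first single-quoted string is the link.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of Searcharoo): finds the first single-quoted
// string inside a double-quoted onclick="..." attribute.
public static class OnclickLinks
{
    private static readonly Regex OnclickUrl = new Regex(
        @"onclick\s*=\s*""[^""]*?'([^']+)'",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns the quoted target, or null when the tag has no matching onclick.
    public static string Extract(string anchorTag)
    {
        Match m = OnclickUrl.Match(anchorTag);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

With this, both <a href="#" onclick="window.location='content/kilimanjaro.pdf'"> and the reversed attribute order yield content/kilimanjaro.pdf. Scripts that build the URL from variables, or that use double quotes inside the script, would still need separate handling.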
Thanks for the other comments - I'll add some install notes to the next version. HTH Craig
Craig,
I created a workaround for this issue by creating an aspx directory browser page that dynamically creates hyperlinks for the directory/subdirs of choice, and I pointed Searcharoo to that file as the target in the web.config rather than the main.htm or toc.htm files. This is a better choice than enabling directory browsing because you can customize the code by mime type, extension, etc., but not as good as spidering the toc. So...
Here is a sample of the main page main.htm
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN" "http://www.w3.org/TR/html4/frameset.dtd"> <html>
<head> <meta name="generator" content="HelpNDoc v1.7 Free"> <title>PETS User Guide</title> </head>
<frameset rows="*" cols="200,*"> <frame src="files/toc.htm" name="FrameTOC"> <frame src="files/{1FB2C430-47C9-4DA6-AC84-DC365449D1AB}.htm" name="FrameMain"> </frameset> </html>
Here is a sample of the left pane table of contents (javascript menu) toc.htm
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN"> <html>
<head> <meta name="generator" content="HelpNDoc v1.7 Free"> <link rel="StyleSheet" href="dtree.css" type="text/css" /> <script type="text/javascript" src="dtree.js"></script> </head>
<body>
<script type="text/javascript"><!--
d = new dTree('d');
d.config.target = 'FrameMain';
d.add(0,-1,'User Guide','javascript:void(0);');
d.add(1, 0,'Search','http://fcdsi/pets//HelpSearch/default.aspx');
d.add(2, 0,'Getting Started','{1FB2C430-47C9-4DA6-AC84-DC365449D1AB}.htm');
d.add(3, 2,'Overview','{DE00FAE9-5378-4837-885C-CD77D466453F}.htm');
d.add(4, 2,'Required Tasks and Timelines','{330989D4-B1A8-4638-A332-A2B6EB548F23}.htm');
d.add(5, 2,'Login','{74D28760-00B1-4AD0-AC6D-D15FF336EBB9}.htm');
d.add(6, 0,'Profile','{EA93D051-C4A3-471C-9F74-2AE286174746}.htm');
d.add(7, 6,'Employee Information','{99B88ED8-C643-454A-A15C-A4A0BE61A70C}.htm');
d.add(8, 6,'Job Plan and Evaluation Listings','{90531FEB-BEEF-4842-9A15-1B6D19521FA9}.htm');
d.add(9, 0,'My Team','{C6D869FD-92DB-4962-9E48-9206CB2BC8F7}.htm');
d.add(10, 9,'Filters','{9DAF00C8-A797-47B5-A3C1-AD1911B1344B}.htm');
d.add(11, 9,'Employee Listing','{8DDB856A-5128-430D-958F-190FC446379F}.htm');
d.add(12, 0,'Signatures','{173769F0-FC53-42E7-B827-8213A16FCDC2}.htm');
d.add(13, 12,'When Viewing','{0D5DD6CA-F4AA-4121-B8B1-CD9D6DEB7B44}.htm');
d.add(14, 12,'Job Plans on Creation/Edit','{49AC9157-7210-4227-A65B-8B42363D3A31}.htm');
d.add(15, 12,'Evaluations on Creation/Edit','{388157EB-D759-41D7-B28C-102EBA399764}.htm');
d.add(16, 15,'Interim','{CA61E4AE-E791-4613-9D48-61E6C19245B1}.htm');
d.add(17, 15,'Probationary','{D6112B14-7016-4B03-97FC-8DBDB8F4E3F2}.htm');
d.add(18, 15,'Annual','{327478CC-4B5A-47D9-B9B5-CEE54E35D6FE}.htm');
d.add(19, 0,'Job Plans','{0357CD29-1A03-4159-975F-9830E347A87D}.htm');
d.add(20, 19,'Viewing a Job Plan','{C39A4B3F-F549-46BD-AA33-B77C995ACC1F}.htm');
d.add(21, 19,'Creating and Editing a Job Plan','{3EC4EDAF-4E26-4168-BA5D-0B5658C04ECF}.htm');
d.add(22, 21,'Creating a Job Plan','{DB5F3D2F-7667-455F-98DC-9F3CD8B1D88F}.htm');
d.add(23, 21,'Editing a Job Plan','{034308B3-FA90-4758-BF7E-E03AC9762B7A}.htm');
d.add(24, 21,'Editing Responsibilities','{9F3CDEF3-771A-44C3-B60D-04C95F4DC234}.htm');
d.add(25, 21,'EditingTasks','{64C34547-6D29-47A2-BB49-B414AE7BB48A}.htm');
d.add(26, 0,'Evaluations','{EFB298F2-2DB0-4BFC-AE9D-9C96D99B53B5}.htm');
d.add(27, 26,'Viewing an Evaluation - UPDATED','{FA85E38E-08ED-4EBC-8550-AAECD0081049}.htm');
d.add(28, 26,'Creating and Editing Evaluations','{79051DAA-8A7F-40BE-8981-6732B43F51CF}.htm');
d.add(29, 28,'Interim','{D84D82A3-21AC-405C-AA49-5D93EFB32F20}.htm');
d.add(30, 28,'Probationary','{9E88427D-F07F-4F3A-B30E-503AAD372452}.htm');
d.add(31, 28,'Annual - UPDATED','{2A828C43-E660-4A9A-83F7-CD2CD050676B}.htm');
d.add(32, 0,'Administration','{9D63C033-6068-45F6-815C-52421199EB80}.htm');
d.add(33, 32,'Viewing User Information','{7C2EEB05-B1AB-4B80-9616-58ED3DA1B6F0}.htm');
d.add(34, 0,'Personnel','{BFE5A2C9-C369-4476-8EE4-32627A62B9E3}.htm');
d.add(35, 34,'Approving Job Plans','{0EB126C2-63C9-4B2C-9B3C-A98FF3417D13}.htm');
d.add(36, 34,'Approving Evaluations','{F8335B69-2718-4EBA-AEB8-0830ADF968C5}.htm');
document.write(d);
//--></script>
</body>
</html>
Here is part of the javascript file that toc.htm is using. I would post the whole thing but it is 350 lines. I'll e-mail it if you like.
/*--------------------------------------------------|
| dTree 2.05 | www.destroydrop.com/javascript/tree/ |
|---------------------------------------------------|
| Copyright (c) 2002-2003 Geir Landrö               |
|                                                   |
| This script can be used freely as long as all     |
| copyright messages are intact.                    |
|                                                   |
| Updated: 17.04.2003                               |
|--------------------------------------------------*/
// Node object
function Node(id, pid, name, url, title, target, icon, iconOpen, open) {
    this.id = id;
    this.pid = pid;
    this.name = name;
    this.url = url;
    this.title = title;
    this.target = target;
    this.icon = icon;
    this.iconOpen = iconOpen;
    this._io = open || false;
    this._is = false;
    this._ls = false;
    this._hc = false;
    this._ai = 0;
    this._p;
};
<snip>
Now jumping down a few hundred lines
// Creates the node icon, url and text
dTree.prototype.node = function(node, nodeId) {
    var str = '<div class="dTreeNode">' + this.indent(node, nodeId);
    if (this.config.useIcons) {
        if (!node.icon) node.icon = (this.root.id == node.pid) ? this.icon.root : ((node._hc) ? this.icon.folder : this.icon.node);
        if (!node.iconOpen) node.iconOpen = (node._hc) ? this.icon.folderOpen : this.icon.node;
        if (this.root.id == node.pid) {
            node.icon = this.icon.root;
            node.iconOpen = this.icon.root;
        }
        str += '<img id="i' + this.obj + nodeId + '" src="' + ((node._io) ? node.iconOpen : node.icon) + '" alt="" />';
    }
    if (node.url) {
        str += '<a id="s' + this.obj + nodeId + '" class="' + ((this.config.useSelection) ? ((node._is ? 'nodeSel' : 'node')) : 'node') + '" href="' + node.url + '"';
        if (node.title) str += ' title="' + node.title + '"';
        if (node.target) str += ' target="' + node.target + '"';
        if (this.config.useStatusText) str += ' onmouseover="window.status=\'' + node.name + '\';return true;" onmouseout="window.status=\'\';return true;" ';
        if (this.config.useSelection && ((node._hc && this.config.folderLinks) || !node._hc))
            str += ' onclick="javascript: ' + this.obj + '.s(' + nodeId + ');"';
        str += '>';
    }
    else if ((!this.config.folderLinks || !node.url) && node._hc && node.pid != this.root.id)
        str += '<a href="javascript: ' + this.obj + '.o(' + nodeId + ');" class="node">';
    str += node.name;
    if (node.url || ((!this.config.folderLinks || !node.url) && node._hc)) str += '</a>';
    str += '</div>';
    if (node._hc) {
        str += '<div id="d' + this.obj + nodeId + '" class="clip" style="display:' + ((this.root.id == node.pid || node._io) ? 'block' : 'none') + ';">';
        str += this.addNode(node);
        str += '</div>';
    }
    this.aIndent.pop();
    return str;
};
There is a bunch more js after this...
Thanks for the quick feedback and great work on this code. Chris
-- modified at 1:27 Thursday 15th March, 2007
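For a table of contents like the toc.htm sample above, where each page URL is the fourth argument of a d.add(...) call, one possible approach is to scan the downloaded page text for those calls instead of looking for <a> tags. The following is a minimal sketch under that assumption; DTreeLinks and Extract are hypothetical names, not part of Searcharoo, and the pattern only handles the simple id, parent id, 'name', 'url' form shown in the sample.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of Searcharoo): collects the url argument
// from dTree-style calls of the form  x.add(id, pid, 'name', 'url' ...).
public static class DTreeLinks
{
    private static readonly Regex AddCall = new Regex(
        @"\.add\(\s*\d+\s*,\s*-?\d+\s*,\s*'[^']*'\s*,\s*'([^']*)'",
        RegexOptions.Compiled);

    public static List<string> Extract(string pageText)
    {
        List<string> links = new List<string>();
        foreach (Match m in AddCall.Matches(pageText))
        {
            string url = m.Groups[1].Value;
            // Skip empty targets and pseudo-links such as javascript:void(0);
            if (url.Length == 0) continue;
            if (url.StartsWith("javascript:", StringComparison.OrdinalIgnoreCase)) continue;
            links.Add(url);
        }
        return links;
    }
}
```

The links returned are whatever the script contains, so relative ones (the {GUID}.htm files here) would still need to be resolved against the URL of toc.htm before being queued for spidering. Node names containing escaped apostrophes are not handled by this pattern.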
I got this error: The remote server returned an error: (403) Forbidden. System.Net.HttpWebRequest.GetResponse() at Searcharoo.Net.Spider.Download(HtmlDocument htmldoc) ...
What should I do?
The web.config has an entry <add key="Searcharoo_VirtualRoot" value="http://localhost/" />. You must ensure this url points to a valid page.
Warnings: 1) If you are using the Visual Studio 2005 or VWD Express 'built in server', it will choose some random port, so you need to change this entry (maybe it will be http://localhost:3345/ or http://localhost:2969/ - check the ASP.NET Development Server icon in your task tray).
2) There must be a valid document there, eg. if your url is "http://localhost/" make sure you have a default.aspx there, or else make the url "http://localhost/yourpage.htm" to point to a page you know exists.
3) Searcharoo doesn't support login, so make sure no authentication is turned on, eg Windows authentication or Forms authentication.
If you don't think any of these things are the problem, please post more information about your development environment, server and settings/config.
Hi. First of all, thanks for the code, it works great.
I'm a novice at ASP so it took me a while to figure out how it works.
The thing is, I want to be able to search deeper into .doc files to see if they contain any keywords. I thought the spider would do that but it doesn't. Is there anything that does?
The files are on an internal website and are referencing a SharePoint website. I use
<iframe src="http://docshare/....doc" width=100% style="height: 550px"></iframe>
on my ASP page.
Although the title comes up, the words in the document do not.
Please let me know if I can access these in any other way.
Rick
See this response to a similar question[^].
There have been a few other 'products' mentioned in these comments - Lucene.NET and Nutch for example - which might be of interest. I've not used them so can't really give a recommendation.
Sorry 'bout that, maybe in a future release...
Craig