 |
|
 |
Hi, firstly, great article, very helpful.
Now to the question, i'm not sure anyone will have tried this yet. But i'm trying to implement an ASP.NET 2 website using the 404 page described here. The 404 page is found very nicely in the local web server created by VS2005. However, when i place the website within IIS 7. The 404 page is never used.
Does anyone have any experiance of this?
|
|
|
|
 |
|
 |
When I was going through sample site, I decided to test http://www.iserc.ie/projects/associatedresearchareasandprojects/grammaticalevolution.html for cases where my application used to give default 404 page.
In first case, I mis-spelled page (by removing "e") as grammaticalvolution.html and it gave me suggestion for right page but when I changed file extension (grammaticalevolution.htm), it displayed default 404 page.
|
|
|
|
 |
|
 |
hi,
yes, the way my site works is that all the urls are virtual. none of the .html pages exist. i have an ISAPI filter that intercepts all .html requests and dynamically generates the right page from a database. unfortunately this means .htm pages (and other file extensions) are outside the scope of my application and don't get included in the 404 routine.
other normal sites wouldn't suffer from this side-effect.
hope this helps
tim
|
|
|
|
 |
|
 |
Dear Tim,
it possible to see the source code of the WinForm Demo?
Thank you for your help!
PS: you do a great job.
|
|
|
|
 |
|
 |
certainly, no problem.
you can download it here
enjoy!
tim
|
|
|
|
 |
|
|
 |
|
 |
Here's a quick tip that will improve performance about threefold.
In the Edit Distance method, replace
cost = (t.Substring(j - 1, 1) == s.Substring(i - 1, 1) ? 0 : 1);
with
cost = t[j - 1] == s[i - 1] ? 0 : 1;
The Substring() method is very costly.
|
|
|
|
 |
|
 |
hi robert,
thanks a million for sharing your expertise, i've just implemented the improvement in my cms and certainly seems to go a lot faster.
cheers
tim
|
|
|
|
 |
|
 |
Great article. I finally got around to implementing this in my site and made some changes to let it work with non-ASP.NET resources such as regular HTML pages that can't be found. For that to occur, you have to point IIS at 404.aspx as well and replace the GetRequestedUrl code with the following so that it knows how to get the offending resource name when called by IIS:
this.requestedUrl = Request.QueryString["aspxerrorpath"];
if(this.requestedUrl == null && Request.QueryString.Count != 0)
{
string[] strParts = Request.QueryString[0].Split(';');
if(strParts.Length > 1)
this.requestedUrl = strParts[1];
}
if(this.requestedUrl != null && this.requestedUrl.Length != 0)
{
try
{
this.requestedUrl = Path.GetFileNameWithoutExtension(
this.requestedUrl);
}
catch
{
}
}
else
this.requestedUrl = String.Empty;
|
|
|
|
 |
|
 |
hi eric,
nide addition with the IIS add-in, that makes it much more useful in a generic situation.
my cms uses an ISAPI filter to handle html requests as .net requests but this is definitely a useful enhancement.
cheers
tim
|
|
|
|
 |
|
 |
I use the Levenshtein algorithm in a similar method to search for product codes and wanted to offer the method I use for normalization.
I calculate the minimum edit distance between two strings by simple finding the absolute value of the difference of their lengths. It is essentially the minimum 'delete' distance. I subtract this distance from the calculated edit distance. I refer to this new term as the change distance. This technique removes effect of the difference in lengths, resulting in better semantic matches than no normalization at all. It would be interesting to see a comparison of the two techniques over a large sample as judged by a user.
Using the example in the article:
Request: Hello
Page: hello_new_version
DeleteDistance = 12
EditDistance = 12
ChangeDifference = abs(12-12) = 0;
Page: abcde
DeleteDistance = 0
EditDistance = 5
ChangeDifference = abs(5-0) = 5;
|
|
|
|
 |
|
 |
I forgot to add that you don't have use the absolute value on the delete distance calculation (difference between string lengths). A negative number would indicate an 'Add' distance. For instance:
Request: lo
Page: Hello
DeleteDifference = -3
EditDistance = 3
ChangeDifference = abs(3 - abs(-3)) = 0;
The ChangeDifference is 0 in this case. The negative DeleteDifference indicates though that the Request value is a substring of the Page value. You may want to eliminate the Page value as a match in a case like this, or based on some ratio of DeleteDifference to Page value length (when DeleteDifference is neg.).
|
|
|
|
 |
|
 |
hi Michael,
thanks for chipping in with an alternative to normalisation. i tried implementing your suggestion, and i either did it wrong or it isn't working so well!
here's the code i added to the end of the levenshtein function:
int deleteDistance = Math.Max(s.Length, t.Length) - Math.Min(s.Length, t.Length); // difference of string lenghts, will always be positive
int changeDistance = Math.Abs(editDistance - deleteDistance);
return changeDistance;
if you run the demo app with the input string as 'Purpose' and the compare string as 'Purpose And Approach', then you get 0.0 difference as expected because its a substring. however, if you change it to 'Puprose' which is one typo, then it gets 2.0 difference. having done some further experiementation it seems harder to find an accurate threshhold for which to discard irrelevant results.
michael maybe you can tell me if i have coded your suggestion incorrectly? thanks. tim
|
|
|
|
 |
|
 |
I think the problem is in the delete distance calculation. Try deleteDistance = Math.Abs(s.Length - t.Length). I am using the raw ChangeDistance, but you may want to add another function in which you Calculate a threshhold or something. You can dump the Math.Abs from deleteDistance and if deleteDistance is negative then the comparison string is smaller. If the comparison string is smaller then you may want to calc a ratio with a threshhold. For instance, if the (changedistance/(base string length) > .75 then it matches -> that is if the comparison string is %75 correct, etc. If the comparison string is larger then you may just want to use the ChangeDistance as an indicator of a match.
The core of the code I use is below. Watch for line wrap. I initilize the class with the search/base string (i.e. the incorrent page name) and then call the calc methods with comparison strings (i.e. existing page names). It cuts down on array initialization and creation. I reuse the same array too since the edges of the matrix don't change.
Hope this helps.
-Michael
///
/// Levenshtein distance
///
public class EditDistance
{
string _base;
int _baselength;
int[,] _dist;
int _initialdimsize = 33;
public EditDistance(string basevalue)
{
_base = basevalue;
_baselength = _base.Length;
}
public EditDistance(string basevalue, int initialdimsize)
{
_base = basevalue;
_baselength = _base.Length;
_initialdimsize = initialdimsize + 1;
}
public int CalculateDeleteDistance(string data)
{
//TODO null check
return _base.Length - data.Length;
}
private void CheckDistArray(int s2)
{
s2++; //length of string + 1
if (_dist != null && s2 <= _dist.GetLength(1)) { return; }
if (_dist == null) { s2 = (s2 < _initialdimsize ? _initialdimsize:s2); }
//only re allocate array if too small - array edges don't change
_dist = new int[_baselength + 1, s2]; // matrix
for(int i = 0; i <= _baselength; _dist[i, 0] = i++);
for(int j = 0; j <= s2-1; _dist[0, j] = j++);
}
public int CalculateChangeDistance(string data)
{
return Math.Abs(CalculateDistance(data) - Math.Abs(CalculateDeleteDistance(data)));
}
public int CalculateDistance(string data)
{
int datalength = data.Length;
int cost;
if (_baselength == 0) { return datalength; }
if (datalength == 0) { return _baselength; }
CheckDistArray(datalength);
for(int i = 1; i <= _baselength;i++)
{
for(int j = 1; j <= datalength;j++)
{
cost = (data[j-1] == _base[i-1]?0:1);
_dist[i, j] = System.Math.Min(System.Math.Min(_dist[i - 1, j] + 1, _dist[i, j - 1] + 1),
_dist[i - 1, j - 1] + cost);
}
}
return _dist[_baselength, datalength];
}
}
|
|
|
|
 |
|
 |
I left a step out of my ratio calculation above. Here is a more explicit example. Try this.
Request: Puprose (misspelled)
Page: Purpose And Approach
DeleteDistance = 13
EditDistance = 15
ChangeDistance = abs(15-13) = 2;
Threshold:
ChangeDistance/length(Puprose) = .29
29% of Puprose has to be changed to match some portion of the 'Purpose And Approach'.
If that 29% change is made then the request matches 35% of page (length(Puprose)/length(Purpose And Approach)).
So the approach would be a two step process. You would probably want to keep the first threshhold (% of request changed) to something like 1/3 -> 33% (i.e. change less than 33% of request). If that threshhold is satisfied then you would want to check what percent of the Page does the Request match. I would probably use something like 33% there too, although experimentation would be best.
I use different threshholds and a slightly different approach in my situation since I'm comparing product codes and not words. I'm comparing product codes like #185bg7 to #195cg7.
It's a little extra work but it's working great in my situation.
|
|
|
|
 |
|
 |
One more idea. It's one in the morning here and I'm not sure if these are good ideas or if I'm just tired....
You might calculate the first threshhold - percent of request that must be changed - and if that threshhold is satisfied use the actual length of the Request to determine whether or not there is a valid match. In the example in the previous message you would have to change 29% of Puprose (2/7) to get 'Purpose'. However, that would yield a 7 character match (length of Puprose). If you compare 7 to some constant for average word length (5 or 6 characters) that would indicate a match of over 1 word which is probably pretty good in a page request situation.
|
|
|
|
 |
|
 |
hi michael,
thanks for the example + explanation. i'll give it a try. cheers!
tim
|
|
|
|
 |
|
 |
The 'correct' approach as used in the evaluation of mutation rates in DNA is to assign variable costs to each change pair and add/delete For instance, A <-> C = 0.8, A <-> T = 0.9, C -> G = 1.0, G -> C = 1.1 Add = 1.0, Delete = 0.95 to reflect the relative chances of each type of mutation occuring. For instance, in the case of page names, you might want a cost of 0 for paces and other punctuation (/_, etc), 0.8 for any letter, and 0.2 for any insert or delete. In turn you could bias the weightings if you know that for instance old pages generally had shorter name, then additions should have a lower cost because they were more likely, and deletions are less likely, so their costs should increase. Common would be different costs for vowels vs constonants, perhaps common word ending, case sensitivity, etc. Roger Willcocks Software Engineer MCP, MCSD .NET http://www.l-space-design.com/
|
|
|
|
 |
|
 |
hi roger, thanks for the suggestion. sounds like it might be hard to fine-tune the weightings to get an accurate threshhold that works for a wide range of inputs? i have the above implementation working well and tested over lots of inputs.
would you have any comment on how efficient the genetic mutation algorithm would be compared to levenshteins edit distance? performance is obviously important because for one 404 request, the web server could be doing comparisons for all 1000 pages in the site, if the site was that big.
cheers
tim
|
|
|
|
 |
|
 |
There are a variety of improvements that can be made to the basic Levenshtein algorithm at the cost of performance. Many of them don't really help much in my situation (product code matching) and many of them may be a little overkill for page lookups. In our case, matching strings in a web page, performance outweighs the need for a high degree of accuracy, as would probably be the opposite in detecting DNA mutations.
Inside your nested for loop, rather than simply comparing characters and assign a cost of 0 if they match and 1 if they don't, you determine the type of mismatch and apply a weighted cost.
One common check is a 'swap' check. If the current character, a, doesn't match the comparison character, a1, look at the next, b1. If b1 matches a then you you have found a swap which typically gets a weight less than that of an insert or delete. An example would be 'Hello' and 'Hlelo'.
Another feature would be to add soundex or metaphone type comparisons. For instance, if the current character is an F, and the current comparison character is a P, you might use a lookahead to determine if the next comparison character is an H. F and PH match phonetically, so you would assign a cost to the operation of switching phonetically matched characters, typically less than an insert or delete.
-Michael
|
|
|
|
 |
|
 |
A couple of months ago I posted here in CP a simplified implementation of the Levenshtein distance algorithm in C:
http://www.codeproject.com/string/SimilarStrings.asp
I also sugested in the article to use it for the purpose you describe here. But your article is more explicit and clear.
Congratulations,
Rui A. Rebelo
Wanna do a fast program? Use C.
Wanna do fast a program? Use Perl.
Want to avoid both? Use Java.
|
|
|
|
 |
|
 |
Hi Rui,
before i posted the article, i checked codeproject to see if there was a similar one there already and i actually came across your one and read it, but i thought it was still worth posting mine since it does provide a 404 page implementation. i knew that the article was nothing revolutionary, just tying in bits and pieces from other peoples work into a workable solution that nobody had expressed yet. i guess this is the fulfillment of your suggestion for 404 pages then, even though i had it done before i read your article!
i'll add a link to your article as it does a good job of explaining the algorithm and problem domain.
all the best,
tim
|
|
|
|
 |
|
 |
It works like a charm now after fixing automatic file searching on the server and making the file extension and its folder context to come up in the links
This is how I did it:
private void SetUpSiteUrls()
{
this.validUrls = new ArrayList();
ArrayList directory = GetFilesInDir(Server.MapPath("~/"), true, "*.html|*.aspx|*.jpg|*.gif|*.wav|*.mp3");
validUrls.AddRange(directory);
}
private ArrayList GetFilesInDir(string path, bool recursive, string pattern)
{
ArrayList files = new ArrayList();
if(pattern.Contains("|"))
{
string[] multipattern = pattern.Split('|');
foreach (string subpattern in multipattern)
files.AddRange(System.IO.Directory.GetFiles(path, subpattern));
}
else
files.AddRange(System.IO.Directory.GetFiles(path, pattern));
if (recursive)
{
string[] directories = System.IO.Directory.GetDirectories(path);
foreach (string directory in directories)
files.AddRange(GetFilesInDir(directory, true, pattern));
}
return files;
}
Then I changed the ComputeResults function to accept the fact that validUrls now contained the whole path and extension of each file/page:
private void ComputeResults()
{
results = new ArrayList(); foreach (string s in validUrls)
{
string filename = System.IO.Path.GetFileNameWithoutExtension(s).ToLower();
if (filename == "") continue; double distance = Levenshtein.CalcEditDistance(filename, this.requestedUrl); double meanDistancePerLetter = (distance / filename.Length);
if (meanDistancePerLetter <= 0.60D) {
results.Add(new DictionaryEntry(meanDistancePerLetter, "<a href='" + s.Replace(MapPath("~\\"), "").Replace("\\","/") + "'>" + System.IO.Path.GetFileName(s) + "</a>")); // use dictionary entries because we want to store the score and the hyperlink. can't use sortedlist because they don't allow duplicate keys and we have 2 hyperlinks with the same edit distance.
}
}
results.Sort(new ArrayListKeyAscend());
}
Okay! That was my first comment with code, so I hope that I did it right with the code and pre tags.
Hope this could be useful for someone and please comment if you think the code looks like a horror movie or something. I'm here to learn!
Geat stuff anyway!
I will use this one for sure
|
|
|
|
 |
|
 |
hey nice work. none of my CMS applications ever used physical files so i never got around to implementing a physical file search. looks like great code.
thanks for posting
tim
|
|
|
|
 |
|
 |
Thanks for an interesting and well explained article. It was worth it simply for the "fuzzy" algorithm.
New sites making for broken URLs can be a frustrating problem to deal with. Clients will rarely agree to sit through mapping old URLs to new ones as I have often suggested.
regards,
Paul Watson
South Africa
Michael Dunn wrote:
"except the sod who voted this a 1, NO SOUP FOR YOU"
Crikey! ain't life grand?
|
|
|
|
 |