Here is my idea:
The algorithm should accept three things on input: the two strings to be compared and the meta-data. The meta-data would contain the parameters for comparison: weight factors to be used, character sets, maybe even dictionaries; please see the notes below on which of these are optional. Basically, you would need to play with the set of parameters and experiment with real data samples to determine the best set of parameters, which can also vary depending on culture, industry, etc.
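Just to make the discussion more concrete, here is a minimal sketch of what such meta-data could look like. The type and member names (ComparisonMetadata, CaseSensitiveWeight, etc.) are purely hypothetical illustrations of the parameters mentioned above, and the default values are arbitrary placeholders to be tuned on real data:

class ComparisonMetadata
{
    public char[] Delimiters; // delimiter character set, see below
    public double CaseSensitiveWeight = 2.0; // weight of an exact (case-sensitive) lexeme match
    public double CaseInsensitiveWeight = 1.0; // weight of a case-insensitive lexeme match
    public double LowPriorityFactor = 0.3; // multiplier (< 1) for articles, prepositions, etc.
    public double UniqueWordFactor = 1.5; // multiplier (> 1) for words not found in a vast dictionary
    public System.Collections.Generic.HashSet<string> LowPriorityWords; // optional dictionary, see below
}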
The algorithm should return the similarity factor as a floating-point number: the greater the factor, the more similar the strings. For certainty, let's assume this is System.Double. As the very first step, the strings are compared for being 100% identical; in that case the algorithm returns positive infinity (double.PositiveInfinity). This value is very convenient because it avoids messing up the normalization of the distribution. It is also convenient because infinite values can be correctly compared with "regular" non-NaN values using the '>=', '<=', '>', '<' or '==' operators.
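A minimal sketch of the entry point, just to fix the signature; the name Similarity is my own placeholder, not anything definitive:

static double Similarity(string left, string right, ComparisonMetadata metadata)
{
    if (left == right)
        return double.PositiveInfinity; // 100% identical strings
    // ...the lexeme-based comparison described below goes here...
    return 0; // placeholder
}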
Let's define the set of delimiter characters in the form of an array of characters. This is a delicate decision, and I'll try to discuss it later. For a first approximation, let's include all punctuation (including Unicode punctuation like «, », ', —, –, “, ”, etc.), also some symbols like © or ™, etc., and — importantly — a blank space in all its forms. Let's assume we have this set as a char[] array named delimiters.
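For illustration only (the exact contents of this set are an assumption to be tuned with real data), it could look like this:

char[] delimiters = new char[]
{
    ' ', '\t', '\u00A0', // blank space in its various forms: regular, tab, non-breaking
    '.', ',', ';', ':', '!', '?', '-', '(', ')', '[', ']', '"', '\'',
    '«', '»', '—', '–', '“', '”', // Unicode punctuation
    '©', '™' // symbols
};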
In the next step, let's split each input string into lexemes:
string[] lexemes = input.Split(delimiters, System.StringSplitOptions.RemoveEmptyEntries);
The option System.StringSplitOptions.RemoveEmptyEntries is very important.
This is a very important step. First of all, it expresses the following: we unify all delimiters (down to the minimal information: one or more delimiters at a certain place), merge consecutive delimiters together and give them the least priority; in other words, we shall not consider differences between those delimiters. From this point on, the delimiters go out of consideration.
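For example (the sample string is mine, just to show the effect of merging consecutive delimiters):

string[] lexemes = "Hello,  world — again!".Split(delimiters, System.StringSplitOptions.RemoveEmptyEntries);
// lexemes is { "Hello", "world", "again" }; all delimiters, including consecutive ones, are gone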
Now, the core of the algorithm: count the number of matches between the lexemes without taking their order into account, taking each match with some weight factor depending on the length of the lexeme. This is the first step where we use the meta-data. I won't discuss here what exactly a "match" is. For a first approximation, the match can be a case-insensitive string comparison. It could also be two comparisons, case-sensitive and case-insensitive, with a 100% case-sensitive match getting a higher weight factor.
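Here is a sketch of how this counting step could look; it is only one possible reading of the idea above, and it reuses the hypothetical ComparisonMetadata fields from the sketch earlier:

static double CountWeightedMatches(string[] left, string[] right, ComparisonMetadata metadata)
{
    double total = 0;
    var unmatched = new System.Collections.Generic.List<string>(right);
    foreach (string lexeme in left)
    {
        // look for a case-insensitive match, ignoring the order of lexemes
        int index = unmatched.FindIndex(candidate =>
            string.Equals(candidate, lexeme, System.StringComparison.OrdinalIgnoreCase));
        if (index < 0) continue; // no match for this lexeme
        bool exact = unmatched[index] == lexeme; // was it also a 100% case-sensitive match?
        total += lexeme.Length *
            (exact ? metadata.CaseSensitiveWeight : metadata.CaseInsensitiveWeight);
        unmatched.RemoveAt(index); // each lexeme on the right side can be matched only once
    }
    return total;
}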
Optional improvement of lexeme match algorithm #1: in the case-insensitive match, count the number of case-sensitive matches character-by-character and modify the weight factor of the match depending on the percentage of case-sensitive character matches.
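A possible sketch of that percentage calculation, assuming the two lexemes have already matched case-insensitively:

static double CaseMatchFraction(string a, string b)
{
    int exact = 0;
    int length = System.Math.Min(a.Length, b.Length);
    for (int i = 0; i < length; ++i)
        if (a[i] == b[i]) ++exact; // the character case also matches at this position
    // the lexeme's weight factor can then be scaled by this fraction
    return length > 0 ? (double)exact / length : 1.0;
}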
Optional improvement of lexeme match algorithm #2: compare lexemes using the Gestalt-approach string comparison suggested by Andi.
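If I understand the suggestion correctly, the "Gestalt approach" is Ratcliff/Obershelp pattern matching; a compact sketch of that similarity measure could look like this (this is my own reading, not necessarily the exact variant Andi had in mind):

static int MatchingCharacters(string a, string b)
{
    // find the longest common substring, then recurse on the pieces to its left and right
    int bestA = 0, bestB = 0, bestLength = 0;
    for (int i = 0; i < a.Length; ++i)
        for (int j = 0; j < b.Length; ++j)
        {
            int k = 0;
            while (i + k < a.Length && j + k < b.Length && a[i + k] == b[j + k]) ++k;
            if (k > bestLength) { bestA = i; bestB = j; bestLength = k; }
        }
    if (bestLength == 0) return 0;
    return bestLength
        + MatchingCharacters(a.Substring(0, bestA), b.Substring(0, bestB))
        + MatchingCharacters(a.Substring(bestA + bestLength), b.Substring(bestB + bestLength));
}

static double GestaltSimilarity(string a, string b)
{
    // 1.0 means identical, 0.0 means nothing in common
    return (a.Length + b.Length) == 0
        ? 1.0
        : 2.0 * MatchingCharacters(a, b) / (a.Length + b.Length);
}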
Basically, that's it. There can be different options on top of it. For example, the core algorithm could be improved by adding positional comparison of the lexemes: a match can be given an elevated weight factor if the match happens at the same or a close position.
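Such a positional bonus could be as simple as this; the bonus of 1.5 and the tolerance of one position are arbitrary assumptions to be tuned:

static double PositionalFactor(int leftIndex, int rightIndex)
{
    return System.Math.Abs(leftIndex - rightIndex) <= 1 ? 1.5 : 1.0;
}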
Now, about the dictionaries. You could have a dictionary of low-priority words, such as articles and prepositions. If a lexeme match is found and the lexeme is in that dictionary, the weight factor could be multiplied by the low (< 1) factor of a "low-priority word". This method is widely used these days for Web search. I must note that this method could work well in the case of 1-3 different languages, mostly from Germanic/Romance-based cultures, but might have a negative impact in the case of very different cultures or many different languages. On the other hand, a vast dictionary could work in the following way: if a match is found, and the word is not found in the dictionary, it can be given an elevated weight factor, as a "unique word".
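A sketch of how both dictionary rules could be applied per matched lexeme; vastDictionary and the factors are the hypothetical parameters from the meta-data sketch above:

static double DictionaryFactor(string lexeme, ComparisonMetadata metadata,
    System.Collections.Generic.HashSet<string> vastDictionary)
{
    string key = lexeme.ToLowerInvariant();
    if (metadata.LowPriorityWords != null && metadata.LowPriorityWords.Contains(key))
        return metadata.LowPriorityFactor; // article, preposition, etc.: lower the weight
    if (vastDictionary != null && !vastDictionary.Contains(key))
        return metadata.UniqueWordFactor; // a "unique word" not found in the dictionary: raise the weight
    return 1.0;
}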
Again, every approach we discussed on this page requires extensive research on real data samples.
If you decide to lead this work to some reasonable working result, it would be very nice if you published the results of your research. I personally would be very interested to learn about the results. In case you do, I would expect you to notify Andi and myself.