An autocompletion algorithm with use of “popularity factor”






4.40/5 (4 votes)
The algorithm uses a factor called “popularity factor” to calculate the “relevancy rank” of the autocompletion list items
Introduction
The article evolves the algorithm described in my previous article "An algorithm of "phrases similarity calculation" and its usage in "autocompletion" feature"
The new algorithm described below uses another factor called "popularity factor" to calculate the "relevancy rank" of the autocompletion list items.
The article has a Demo application with an implementation of the algorithm attached. The Demo is implemented on C# in form of a Silverlight Application. The Silverlight application is already deployed on http://files.rsdn.ru/44022/AutocompleteTextBox2.html; you can work with it there without the downloading and compiling the attached sources.
Audience
The possible audience of the article are developers that want to implement a "smart" autocompletion algorithm in their applications. The word "smart" means here the following: an algorithm that calculates the "relevancy rank" for the matched "auto-suggestion strings", and orders the found results by decreasing of the "rank".
A good example of such an algorithm is the well-known movies database www.imdb.com :

As you see on the screenshot above, the www.imdb.com not just displays "all the results that contain the entered text ‘go’ as a sub-string". It displays the "most relevant" found results. One of the criterions of the "relevancy" here is popularity of a found movie/person. Apparently the www.imdb.com database contains a lot of movies that have a title starting with "go". But the "The Godfather" displayed above is a very popular movie. This is why its "relevancy rank" is considered to be "high enough" to be displayed in the begin of the "autocompletion list".
Disclaimers and restrictions
It is necessary to read the previous "An algorithm of "phrases similarity calculation" and its usage in "autocompletion" feature" article prior to reading the given article.
The Demo attached is not a ready-to-use "control" you can "drag and drop" into your application. It is rather an example how to implement the algorithm described in the article.
The attached Demo is a Silverlight application; but of course this approach may be used in practically any type of GUI applications – on a Web Page (with JavaScript usage), in a desktop application etc.
The algorithm implementation in the Demo has several "hardcoded" facts that are specific for English language only (a list of the "2nd class words" like "the"). So the Demo should be used with English queries only. If you want to use the algorithm for other languages, then populate the list with the "2nd class words" specific for your language.
Background
In the previous article "An algorithm of "phrases similarity calculation" and its usage in "autocompletion" feature" I introduced the algorithm that uses a "degree of similarity between phrases" to calculate the "Rank" of the autocompletion list items.
In this article I improve the algorithm by introducing another factor that contributes to the Rank – a "popularity factor". The factor says – "how popular" was the given "autocompletion list item" among the user(s) of the system. The more "popular" the item was, the more its Rank is.
A NOTE: the new "popularity factor" does not cancel the "phrases similarity factor" described in the previous article. No – they both contribute to the Rank.
Inputs, outputs and usage scenario of the algorithm
The algorithm supposes there is an application with a text field (a "textbox" etc) in its GUI. The field has a list of "possible values" for the auto-completion feature. The application users issue some kind of search queries. The attached Demo is an example of such application that issues the search queries to a database of movies:
But the nature of the queries does not matter to the algorithm, of course.
When a User of the application starts typing a text in such a field, the algorithm matches the entered text (we call the text as Query) against the Possible Values. The algorithm returns a list of the "matched" Possible Values. The list is sorted in descending order by "relevance". The "relevance" here is a number that displays a degree of relevance between a Possible Value and given Query.
The application is supposed to display the returned list as the "auto-completion list" for the given Query:

We suppose the application has some kind of a "persistent storage".
The Storage is used to save the queries issued by the current User. The Storage
can save up to StorageMaxSize
count of the queries per a User. The
StorageMaxSize
value should be large enough to allow the Algorithm
building a "representative sampling of a User’s activity". Let’s say, the
StorageMaxSize
should be 10000 at least.
The word "persistent" in the "persistent storage" above means: the Storage should save its content after the current User’s session is finished and/or the application is "closed". The Storage may be implemented as a database, or a plain text file, or an XML file – it does not matter to the algorithm.
The Main Ideas of the algorithm
Usage of both the "similarity factor" and "popularity factor"
The main idea of the algorithm is the following: it considers BOTH the "similarity factor" and "popularity factor" while calculating the "relevancy Rank".
The "similarity factor" is exactly the same as one described in the article "An algorithm of "phrases similarity calculation" and its usage in "autocompletion" feature".
Let’s explain what the "popularity factor" term means. Suppose a Query is "Sal"
and the algorithm has matched 2 PossibleValues
by the Query: "Sal"
and "Sally". The "Sal" is "more similar" to the Query than the
"Sally" – the previous article explains why.
Let’s also suppose the PossibleValue
"Sal" was issued by a User
only once – say, 1 year ago. Whereas the PossibleValue
"Sally" was
issued by the same User many times – say, 100 times only during last week. So we
can say the PossibleValue
"Sally" is definitely "more popular"
than the PossibleValue
"Sal". This is an example of the "popularity
factor".
And now let’s put these things together. The algorithm takes into account the
both factors; and uses them simultaneously. In other words, the algorithm does not
use a principle like "the similarity factor is primary one and the popularity factor
is considered only when two PossibleValues
have exactly the same similarity
factor". No – none of the factors is a "primary" one. Considering the example above,
the algorithm may decide the PossibleValue
"Sally" should have a Rank
larger than the PossibleValue
"Sal" – because of the "Sally" query
was "much more popular" than the "Sal" was.
"Similarity factor"
Calculation of the number called SimilarityRank
(the number that
shows the "similarity factor") is described in the previous article "An algorithm
of "phrases similarity calculation" and its usage in "autocompletion" feature"
(http://www.codeproject.com/Articles/638280/An-algorithm-of-phrases-similarity-calculation-and)
"Popularity factor"
Calculation of a number called PopularityRank
(the number
that shows the "popularity factor") uses the following ideas:
- The Storage saves up to
StorageMaxSize
(see its definition above) "usages". A "Usage" is a fact that a Query has been issued by the current User.
For example:- At 10:12:23 30-12-2012 the User issues a query "The Dark Knight"
- At 23:59:59 30-12-2012 the User issues a query "The Hangover"
- At 11:15:40 31-12-2012 the User issues a query "The Dark Knight" again
Then the Storage will contain the following Usages (pairs of
{PossibleValue, Date}
) ordered by decrease of the Date:{PossibleValue: "The Dark Knight", Date: 11:15:40 31-12-2012}
{PossibleValue: "The Hangover", Date: 23:59:59 30-12-2012}
{PossibleValue: "The Dark Knight", Date: 10:12:23 30-12-2012}
- While calculating
PopularityRank
for aPossibleValue
:- We take into account all the
PossibleValue
Usages found in the Storage.
For example: if we calculate
PopularityRank
for aPossibleValue
"The Dark Knight" in the Storage example above, we take into account the both Usages: the{PossibleValue: "The Dark Knight", Date: 11:15:40 31-12-2012}
and the{PossibleValue: "The Dark Knight", Date: 10:12:23 30-12-2012}
- the later Usage Date is, the more the given Usage brings to the
PopularityRank
calculated.
For example: we have 2 usages:
{PossibleValue: "The Dark Knight", Date: 11:15:40 31-12-2012}
and{PossibleValue: "The Dark Knight", Date: 10:12:23 30-12-2012}
. The 1st Usage will bring more to thePopularityRank
than the second Usage. The reason is: the Date 11:15:40 31-12-2012 is later than the 10:12:23 30-12-2012. - We take into account all the
Details of the algorithm
In the previous section I described the main ideas of the algorithm. In this section I will give the full description.
How the Storage handles the overflows
As we said earlier, the Storage saves up to StorageMaxSize
count
of Usages. But what to do when an "overflow" occurs: a User issues another query
(in other words – new Usage is reported to the Storage), but the Storage already
contains StorageMaxSize
of Usages?
In this situation we remove the oldest Usage – i.e., a Usage with the oldest Date.
As a consequence of that, the Storage should have some kind of ordering of the Usages by Date – to effectively implement "the oldest Usage search" operation.
How we look for PossibleValues and calculate their Ranks
Let’s consider how we look for a list of PossibleValues
and their
Ranks for a Query specified:
- Get a list of (
PossibleValue
,SimilarityRank
) pairs using the previous algorithm "An algorithm of "phrases similarity calculation" and its usage in "autocompletion" feature". As this is specified in the previous article, all theSimilarityRank
values in the result list will be > 0. In other words, PossibleValues that are "not similar at all" to the Query will not be included into the list. Let’s call the list asSimilarityResults
. - Calculate
PopularityRank
for everySimilarityResults[i].PossibleValue
found on the previous step. See the details of the calculation below - Now we have
SimilarityRank
andPopularityRank
values for every for everySimilarityResults[i].PossibleValue
. Calculate the [final] Rank of everySimilarityResults[i].PossibleValue
as the following:Rank = SimilarityRank * PopularityRank
How we calculate PopularityRank for a list of PossibleValues
The input data here is the list of PossibleValues.
- Define the following values:
Min2NdRank = 1.0 Max2NdRank = 100.0 MinFinalRank = 1.0 MaxFinalRank = 6.0
- Calculate a value called "
1stRank
" for every of thePossibleValues
. See the details below - Find
min1stRank
– a minimum of all the1stRank[]
values - Find
max1stRank
– a maximum of all the1stRank[]
values - For every of the
1stRank[i]
values calculate a value called "2ndRank
" as the following:If (min1stRank == max1stRank) Then 2ndRank[i] = Min2NdRank; Else 2ndRank[i] = Min(1stRank[i] / min1stRank, Max2NdRank);
- Now we have the list
2ndRank[]
where every2ndRank[i]
is in[Min2NdRank; Max2NdRank]
range.Map every
2ndRank[i]
value into [MinFinalRank
,MaxFinalRank
] range as the following:delta = (2ndRank[i] - Min2NdRank) / (Max2NdRank - Min2NdRank); PopularityRank[i] = MinFinalRank + delta*(MaxFinalRank - MinFinalRank);
- Eventually we have the list
PopularityRank[]
where everyPopularityRank[i]
value corresponds to an inputPossibleValues[i];
and is in [MinFinalRank
,MaxFinalRank
] range.
How we calculate "1stRank" for a PossibleValue
The input data here is a PossibleValue
and a date/time point called
LatestTime
. The LatestTime
is a maximum of all the
PossibleValues[].Usages[].Date
that are considered.
AN EXAMPLE: suppose we are considering a list of 2 PossibleValues
:
"The Dark Knight" and "The Hangover". 1st one has 2 Usages:
{PossibleValue: "The Dark Knight", Date: 11:15:40 31-12-2012}
{PossibleValue: "The Dark Knight", Date: 10:12:23 30-12-2012}
The 2nd one has the only usage:
{PossibleValue: "The Hangover", Date: 23:59:59 30-12-2012}
Then, the LatestTime
here will be (the same for both the "The Dark
Knight" and "The Hangover" PossibleValues
) the "11:15:40 31-12-2012".
Let’s also define the following constant:
- TimePortion
. This is a time period equal to 7 days (7 x 24 hours)
Then the 1stRank will be calculated as the following (below is a C#-like code):
double 1stRank = 0.0;
for (int i = 0; i < PossibleValue.Usages.Count; ++i)
{
var timeLentgth = LatestTime - PossibleValue.Usages[i].Date;
var timePortionsCount = (int)(timeLentgth.TotalMilliseconds / TimePortion.TotalMilliseconds);
var currUsageRank = 1.0 / (1.0 + timePortionsCount);
1stRank += currUsageRank;
}
The implementation
The Demo attached is a simple implementation of the algorithm on C#.
The folder Similarity comprises all the logic for the Similarity factor.
The folder Popularity comprises all the logic for the Popularity factor.
The class AutocompletionEngine.AutocompletionProvider
joins the
Similarity factor with the Popularity factor and makes the output autocompletion
list.
The Storage in the Popularity folder is, in fact, not "permanent" and exists only while the Silverlight application is loaded. At all, the Storage is implemented quite simple - this is just a demo.
Several places in the sources are marked by "TO BE IMPROVED" comment. These are places where performance could be improved if this code was used in a real application.
The static list AutocompletionEngine.Processor.Word.ArticlesEtc
contains the hardcoded list of the "second class" words (see the Similarity factor).
If you want the algorithm to work correctly with non-English queries, populate the
list with "second-class words" of your language.
The project UnitTest contains a set of NUnit-based automated tests of the
AutocompletionProvider
class. This is why the project requires
nunit.framework.dll library.
Possible improvements
We may introduce a limit for the count of Usages
per a PossibleValue
in the Storage. Say – to save not more than 500
latest Usages per a certain PossibleValue
. When a PossibleValue
already has 500 Usages stored, and a new Usage arrives – the oldest Usage will be
deleted to give room to the new one.