An autocompletion algorithm with use of “popularity factor”

VictorSotnikov

4.40/5 (4 votes)

Dec 9, 2013

CPOL

10 min read

14741

152

The algorithm uses a factor called “popularity factor” to calculate the “relevancy rank” of the autocompletion list items

Download source

Introduction

The article evolves the algorithm described in my previous article "An algorithm of "phrases similarity calculation" and its usage in "autocompletion" feature"

The new algorithm described below uses another factor called "popularity factor" to calculate the "relevancy rank" of the autocompletion list items.

The article has a Demo application with an implementation of the algorithm attached. The Demo is implemented on C# in form of a Silverlight Application. The Silverlight application is already deployed on http://files.rsdn.ru/44022/AutocompleteTextBox2.html; you can work with it there without the downloading and compiling the attached sources.

Audience

The possible audience of the article are developers that want to implement a "smart" autocompletion algorithm in their applications. The word "smart" means here the following: an algorithm that calculates the "relevancy rank" for the matched "auto-suggestion strings", and orders the found results by decreasing of the "rank".

A good example of such an algorithm is the well-known movies database www.imdb.com :

As you see on the screenshot above, the www.imdb.com not just displays "all the results that contain the entered text ‘go’ as a sub-string". It displays the "most relevant" found results. One of the criterions of the "relevancy" here is popularity of a found movie/person. Apparently the www.imdb.com database contains a lot of movies that have a title starting with "go". But the "The Godfather" displayed above is a very popular movie. This is why its "relevancy rank" is considered to be "high enough" to be displayed in the begin of the "autocompletion list".

Disclaimers and restrictions

It is necessary to read the previous "An algorithm of "phrases similarity calculation" and its usage in "autocompletion" feature" article prior to reading the given article.

The Demo attached is not a ready-to-use "control" you can "drag and drop" into your application. It is rather an example how to implement the algorithm described in the article.

The attached Demo is a Silverlight application; but of course this approach may be used in practically any type of GUI applications – on a Web Page (with JavaScript usage), in a desktop application etc.

The algorithm implementation in the Demo has several "hardcoded" facts that are specific for English language only (a list of the "2nd class words" like "the"). So the Demo should be used with English queries only. If you want to use the algorithm for other languages, then populate the list with the "2nd class words" specific for your language.

Background

In the previous article "An algorithm of "phrases similarity calculation" and its usage in "autocompletion" feature" I introduced the algorithm that uses a "degree of similarity between phrases" to calculate the "Rank" of the autocompletion list items.

In this article I improve the algorithm by introducing another factor that contributes to the Rank – a "popularity factor". The factor says – "how popular" was the given "autocompletion list item" among the user(s) of the system. The more "popular" the item was, the more its Rank is.

A NOTE: the new "popularity factor" does not cancel the "phrases similarity factor" described in the previous article. No – they both contribute to the Rank.

Inputs, outputs and usage scenario of the algorithm

The algorithm supposes there is an application with a text field (a "textbox" etc) in its GUI. The field has a list of "possible values" for the auto-completion feature. The application users issue some kind of search queries. The attached Demo is an example of such application that issues the search queries to a database of movies:

But the nature of the queries does not matter to the algorithm, of course.

When a User of the application starts typing a text in such a field, the algorithm matches the entered text (we call the text as Query) against the Possible Values. The algorithm returns a list of the "matched" Possible Values. The list is sorted in descending order by "relevance". The "relevance" here is a number that displays a degree of relevance between a Possible Value and given Query.

The application is supposed to display the returned list as the "auto-completion list" for the given Query:

We suppose the application has some kind of a "persistent storage". The Storage is used to save the queries issued by the current User. The Storage can save up to StorageMaxSize count of the queries per a User. The StorageMaxSize value should be large enough to allow the Algorithm building a "representative sampling of a User’s activity". Let’s say, the StorageMaxSize should be 10000 at least.

The word "persistent" in the "persistent storage" above means: the Storage should save its content after the current User’s session is finished and/or the application is "closed". The Storage may be implemented as a database, or a plain text file, or an XML file – it does not matter to the algorithm.

The Main Ideas of the algorithm

Usage of both the "similarity factor" and "popularity factor"

The main idea of the algorithm is the following: it considers BOTH the "similarity factor" and "popularity factor" while calculating the "relevancy Rank".

The "similarity factor" is exactly the same as one described in the article "An algorithm of "phrases similarity calculation" and its usage in "autocompletion" feature".

Let’s explain what the "popularity factor" term means. Suppose a Query is "Sal" and the algorithm has matched 2 PossibleValues by the Query: "Sal" and "Sally". The "Sal" is "more similar" to the Query than the "Sally" – the previous article explains why.

Let’s also suppose the PossibleValue "Sal" was issued by a User only once – say, 1 year ago. Whereas the PossibleValue "Sally" was issued by the same User many times – say, 100 times only during last week. So we can say the PossibleValue "Sally" is definitely "more popular" than the PossibleValue "Sal". This is an example of the "popularity factor".

And now let’s put these things together. The algorithm takes into account the both factors; and uses them simultaneously. In other words, the algorithm does not use a principle like "the similarity factor is primary one and the popularity factor is considered only when two PossibleValues have exactly the same similarity factor". No – none of the factors is a "primary" one. Considering the example above, the algorithm may decide the PossibleValue "Sally" should have a Rank larger than the PossibleValue "Sal" – because of the "Sally" query was "much more popular" than the "Sal" was.

"Similarity factor"

Calculation of the number called SimilarityRank (the number that shows the "similarity factor") is described in the previous article "An algorithm of "phrases similarity calculation" and its usage in "autocompletion" feature" (http://www.codeproject.com/Articles/638280/An-algorithm-of-phrases-similarity-calculation-and)

"Popularity factor"

Calculation of a number called PopularityRank (the number that shows the "popularity factor") uses the following ideas:

The Storage saves up to StorageMaxSize (see its definition above) "usages". A "Usage" is a fact that a Query has been issued by the current User.

For example:
- At 10:12:23 30-12-2012 the User issues a query "The Dark Knight"
- At 23:59:59 30-12-2012 the User issues a query "The Hangover"
- At 11:15:40 31-12-2012 the User issues a query "The Dark Knight" again
Then the Storage will contain the following Usages (pairs of {PossibleValue, Date}) ordered by decrease of the Date:

{PossibleValue: "The Dark Knight", Date: 11:15:40 31-12-2012}

{PossibleValue: "The Hangover", Date: 23:59:59 30-12-2012}

{PossibleValue: "The Dark Knight", Date: 10:12:23 30-12-2012}
While calculating PopularityRank for a PossibleValue:
- We take into account all the PossibleValue Usages found in the Storage.
For example: if we calculate PopularityRank for a PossibleValue "The Dark Knight" in the Storage example above, we take into account the both Usages: the {PossibleValue: "The Dark Knight", Date: 11:15:40 31-12-2012} and the {PossibleValue: "The Dark Knight", Date: 10:12:23 30-12-2012}
- the later Usage Date is, the more the given Usage brings to the PopularityRank calculated.
For example: we have 2 usages: {PossibleValue: "The Dark Knight", Date: 11:15:40 31-12-2012} and {PossibleValue: "The Dark Knight", Date: 10:12:23 30-12-2012}. The 1st Usage will bring more to the PopularityRank than the second Usage. The reason is: the Date 11:15:40 31-12-2012 is later than the 10:12:23 30-12-2012.

Details of the algorithm

In the previous section I described the main ideas of the algorithm. In this section I will give the full description.

How the Storage handles the overflows

As we said earlier, the Storage saves up to StorageMaxSize count of Usages. But what to do when an "overflow" occurs: a User issues another query (in other words – new Usage is reported to the Storage), but the Storage already contains StorageMaxSize of Usages?

In this situation we remove the oldest Usage – i.e., a Usage with the oldest Date.

As a consequence of that, the Storage should have some kind of ordering of the Usages by Date – to effectively implement "the oldest Usage search" operation.

How we look for PossibleValues and calculate their Ranks

Let’s consider how we look for a list of PossibleValues and their Ranks for a Query specified:

Get a list of (PossibleValue, SimilarityRank) pairs using the previous algorithm "An algorithm of "phrases similarity calculation" and its usage in "autocompletion" feature". As this is specified in the previous article, all the SimilarityRank values in the result list will be > 0. In other words, PossibleValues that are "not similar at all" to the Query will not be included into the list. Let’s call the list as SimilarityResults.
Calculate PopularityRank for every SimilarityResults[i].PossibleValue found on the previous step. See the details of the calculation below
Now we have SimilarityRank and PopularityRank values for every for every SimilarityResults[i].PossibleValue. Calculate the [final] Rank of every SimilarityResults[i].PossibleValue as the following:
```
Rank = SimilarityRank * PopularityRank
```

How we calculate PopularityRank for a list of PossibleValues

The input data here is the list of PossibleValues.

Define the following values:

Min2NdRank   = 1.0
Max2NdRank   = 100.0
MinFinalRank = 1.0
MaxFinalRank = 6.0

Calculate a value called "1stRank" for every of the PossibleValues. See the details below
Find min1stRank – a minimum of all the 1stRank[] values
Find max1stRank – a maximum of all the 1stRank[] values

For every of the 1stRank[i] values calculate a value called "2ndRank" as the following:

If (min1stRank == max1stRank)

Then

    2ndRank[i] = Min2NdRank;

Else

    2ndRank[i] = Min(1stRank[i] / min1stRank, Max2NdRank);

Now we have the list 2ndRank[] where every 2ndRank[i] is in [Min2NdRank; Max2NdRank] range.
Map every 2ndRank[i] value into [MinFinalRank, MaxFinalRank] range as the following:
```
    delta = (2ndRank[i]  - Min2NdRank) / (Max2NdRank - Min2NdRank);
    PopularityRank[i] = MinFinalRank + delta*(MaxFinalRank - MinFinalRank);
```
Eventually we have the list PopularityRank[] where every PopularityRank[i] value corresponds to an input PossibleValues[i]; and is in [MinFinalRank, MaxFinalRank] range.

How we calculate "1stRank" for a PossibleValue

The input data here is a PossibleValue and a date/time point called LatestTime. The LatestTime is a maximum of all the PossibleValues[].Usages[].Date that are considered.

AN EXAMPLE: suppose we are considering a list of 2 PossibleValues: "The Dark Knight" and "The Hangover". 1st one has 2 Usages:

{PossibleValue: "The Dark Knight", Date: 11:15:40 31-12-2012}

{PossibleValue: "The Dark Knight", Date: 10:12:23 30-12-2012}

The 2nd one has the only usage:

{PossibleValue: "The Hangover", Date: 23:59:59 30-12-2012}

Then, the LatestTime here will be (the same for both the "The Dark Knight" and "The Hangover" PossibleValues) the "11:15:40 31-12-2012".

Let’s also define the following constant:

- TimePortion. This is a time period equal to 7 days (7 x 24 hours)

Then the 1stRank will be calculated as the following (below is a C#-like code):

double 1stRank = 0.0;
for (int i = 0; i < PossibleValue.Usages.Count; ++i)
{
    var timeLentgth = LatestTime - PossibleValue.Usages[i].Date;
    var timePortionsCount = (int)(timeLentgth.TotalMilliseconds / TimePortion.TotalMilliseconds);
    var currUsageRank = 1.0 / (1.0 + timePortionsCount);
    1stRank += currUsageRank;
}

The implementation

The Demo attached is a simple implementation of the algorithm on C#.

The folder Similarity comprises all the logic for the Similarity factor.

The folder Popularity comprises all the logic for the Popularity factor.

The class AutocompletionEngine.AutocompletionProvider joins the Similarity factor with the Popularity factor and makes the output autocompletion list.

The Storage in the Popularity folder is, in fact, not "permanent" and exists only while the Silverlight application is loaded. At all, the Storage is implemented quite simple - this is just a demo.

Several places in the sources are marked by "TO BE IMPROVED" comment. These are places where performance could be improved if this code was used in a real application.

The static list AutocompletionEngine.Processor.Word.ArticlesEtc contains the hardcoded list of the "second class" words (see the Similarity factor). If you want the algorithm to work correctly with non-English queries, populate the list with "second-class words" of your language.

The project UnitTest contains a set of NUnit-based automated tests of the AutocompletionProvider class. This is why the project requires nunit.framework.dll library.

Possible improvements

We may introduce a limit for the count of Usages per a PossibleValue in the Storage. Say – to save not more than 500 latest Usages per a certain PossibleValue. When a PossibleValue already has 500 Usages stored, and a new Usage arrives – the oldest Usage will be deleted to give room to the new one.