Recently I was confronted with the following problem: how do you perform high-speed matching of a huge number of items against a complex criterion?
My real problem domain was different, but just to give you an idea, consider matching the customer population of a large warehouse against customer profiles. A customer profile could, for instance, be: “has bought this product, but no product of that category”. The matching criterion is determined by the profile and becomes more complex as the profile contains more ANDs and ORs…
One option would be to store all customer and sale information in a database and perform the matching with regular SQL queries. For small populations, this may be all right, but for large populations you will soon see considerable performance degradation as database engines are not especially optimized for this specific type of situation. Hierarchies of product categories, for instance, need to be translated into a number of joins depending on the depth of the categories tree.
Data warehouses may be a better option. They are optimized for very large amounts of data and structure it in a way that ensures simple query execution plans.
Index-based matching
I chose a different path: implementing the matching using lightweight indexes. Such an index is simply a list of integers and can be represented as an IEnumerable<int>, with the particularity that the values in the index are sorted.
For instance, I could create an index containing all customers that visited the store in the last 6 months. This index would be a sorted list of integers (customer IDs).
Because the indexes are IEnumerables, we can use LINQ to run queries against them. And indeed, an OR can be represented using the Union() operation, an AND can be represented using the Intersect() operation, and the Except() operation provides support for NOT operations.
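As a quick illustration (with made-up values), the LINQ set operations map onto the logical operators as follows. Note that plain LINQ does not guarantee a sorted result even when both inputs are sorted:

```csharp
using System;
using System.Linq;

class SetOpsDemo
{
    static void Main()
    {
        int[] x = { 1, 2, 5 }; // a small sorted index
        int[] y = { 2, 3, 5 }; // another sorted index

        // OR: union of the two indexes
        Console.WriteLine(string.Join(",", x.Union(y)));     // 1,2,5,3
        // AND: intersection
        Console.WriteLine(string.Join(",", x.Intersect(y))); // 2,5
        // NOT: exclusion
        Console.WriteLine(string.Join(",", x.Except(y)));    // 1
    }
}
```

The Union output (1,2,5,3) shows that plain LINQ yields the first sequence's distinct values first and then appends the unseen values of the second; sortedness is lost when the values interleave.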
In LINQ we can, for instance, combine indexes as follows. With:
- a being the list of customers (or customer IDs) that visited the store in the last 6 months,
- b the list of customers that bought baby milk,
- c the customers that bought baby food, and
- d the customers who bought diapers,
combining these indexes with Intersect() and Union() gives us the list of customers that could be interested in a change of our diapers assortment.
LINQ excels at handling enumerables. It’s a language feature I would never want to miss again. But it is an all-round solution for handling all kinds of enumerables, not one optimized for sorted enumerables. So I created a few classes containing algorithms for AND, OR, and EXCLUDE (or NOT) that are optimized for sorted value lists.
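Reconstructed from the description above, the plain-LINQ query might look like the following sketch. The sample data and the exact way the four indexes are combined are my assumptions; the original article's code may differ:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class QuerySketch
{
    static void Main()
    {
        // Hypothetical sorted customer IDs:
        IEnumerable<int> a = new[] { 1, 2, 3, 4, 5, 6 }; // visited in the last 6 months
        IEnumerable<int> b = new[] { 2, 4, 9 };          // bought baby milk
        IEnumerable<int> c = new[] { 3, 4, 7 };          // bought baby food
        IEnumerable<int> d = new[] { 2, 3, 4, 8 };       // bought diapers

        // One plausible combination: a AND (b OR c) AND d
        var result = a.Intersect(b.Union(c)).Intersect(d);

        Console.WriteLine(string.Join(",", result)); // 2,3,4
    }
}
```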
These classes are themselves IEnumerable<int>, and the algorithm can be found in their GetEnumerator() method. For the OR handling, I used an algorithm based on merge sort, which I got from the following link. The AND and EXCLUDE algorithms are homemade and may even allow further optimization.
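To give an idea of what such an algorithm looks like, here is a minimal sketch of an AND over two sorted, duplicate-free indexes. This is my own illustration, not the actual class from the download:

```csharp
using System;
using System.Collections.Generic;

static class SortedOps
{
    // AND as a linear merge walk: advance whichever side currently holds
    // the smaller value, and yield only when both sides agree.
    // Runs in O(n + m) with O(1) extra memory, provided both inputs are sorted.
    public static IEnumerable<int> SortedIntersect(IEnumerable<int> left, IEnumerable<int> right)
    {
        using var l = left.GetEnumerator();
        using var r = right.GetEnumerator();
        bool hasL = l.MoveNext(), hasR = r.MoveNext();
        while (hasL && hasR)
        {
            if (l.Current < r.Current) hasL = l.MoveNext();
            else if (l.Current > r.Current) hasR = r.MoveNext();
            else
            {
                yield return l.Current;
                hasL = l.MoveNext();
                hasR = r.MoveNext();
            }
        }
    }
}

class Program
{
    static void Main()
    {
        var a = new[] { 1, 3, 5, 7 };
        var b = new[] { 3, 4, 5, 8 };
        Console.WriteLine(string.Join(",", SortedOps.SortedIntersect(a, b))); // 3,5
    }
}
```

Unlike LINQ's Intersect(), this walk never buffers either input, and its output is sorted by construction.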
The same query as above can now be written using these classes or, more fluently, using the extension methods on IEnumerable<int> expressions that I also provide.
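Assuming extension method names like And, Or, and Exclude (the actual names in the download may differ), the extension-method form of a query could look like this sketch. For brevity the sketch delegates to plain LINQ, where the real methods would delegate to the optimized matching classes:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class MatchingExtensions
{
    // Hypothetical names; the real extension methods in the download
    // wrap the optimized matching classes instead of plain LINQ.
    public static IEnumerable<int> And(this IEnumerable<int> x, IEnumerable<int> y) => x.Intersect(y);
    public static IEnumerable<int> Or(this IEnumerable<int> x, IEnumerable<int> y) => x.Union(y);
    public static IEnumerable<int> Exclude(this IEnumerable<int> x, IEnumerable<int> y) => x.Except(y);
}

class Program
{
    static void Main()
    {
        var a = new[] { 1, 2, 3, 4 };
        var b = new[] { 2, 9 };
        var c = new[] { 3, 7 };
        var d = new[] { 2, 3, 8 };

        // a AND (b OR c) AND d
        Console.WriteLine(string.Join(",", a.And(b.Or(c)).And(d))); // 2,3
    }
}
```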
Performance and memory gain
The downloadable code contains the implementation of these classes, as well as a sample application that can be used to compare the performance with regular LINQ methods. The first time the sample is run, it will create some index files on disk with random (but sorted) content.
One of the index files is created with the following line of code. It creates an index file (only if it does not yet exist) containing 12% of the population. For a population of 2 million (the current default), this results in a file containing 240,000 integers, thus a little under 940 KB.
var a12 = new Index(EnsureIndexFileExists(@"..\..\Indexes\A012.idx", population, 12));
The ‘Index’ class, of which the above line of code seems to create an instance, does not really exist. A using alias at the top of the code file defines which class is actually used. The FileIndex class is an index whose values are stored on disk and are loaded into memory one by one, only when needed. PreloadedFileIndex will first load the whole file into memory and therefore runs faster.
using Index = Matching.FileIndex;
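A disk-backed index of this kind can be sketched as follows. This is not the actual FileIndex from the download; it assumes the file is a raw sequence of 32-bit integers (machine byte order) and streams them one at a time, so only the current value is held in memory per enumerator:

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Minimal sketch of a disk-backed sorted index (illustration only).
class StreamingFileIndex : IEnumerable<int>
{
    private readonly string path;
    public StreamingFileIndex(string path) { this.path = path; }

    public IEnumerator<int> GetEnumerator()
    {
        // Read one Int32 at a time; the file is never loaded whole.
        using var reader = new BinaryReader(File.OpenRead(path));
        while (reader.BaseStream.Position < reader.BaseStream.Length)
            yield return reader.ReadInt32();
    }

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}

class Program
{
    static void Main()
    {
        var path = Path.GetTempFileName();
        File.WriteAllBytes(path, new[] { 3, 8, 21 }.SelectMany(BitConverter.GetBytes).ToArray());

        var index = new StreamingFileIndex(path);
        Console.WriteLine(string.Join(",", index)); // 3,8,21
        File.Delete(path);
    }
}
```

A preloaded variant would simply read all values into a list up front and enumerate that list, trading memory for speed, which matches the FileIndex versus PreloadedFileIndex trade-off described above.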
When comparing performance, we see that the two ‘Matching’-based runs perform better than the LINQ ones. The file-based runs (using a FileIndex) are obviously slower than the preloaded runs. The following chart shows the results for four runs (FileIndex+LINQ, PreloadedFileIndex+LINQ, FileIndex+matching, PreloadedFileIndex+matching):
The preloaded matching run is 3.4 times faster than the preloaded LINQ-based run. That the file-based matching run is only twice as fast as the file-based LINQ run can be explained by the fact that both runs get the same amount of time added: the time needed to read the files.
Using a 4-year-old Intel Core 2 Quad CPU at 2.4 GHz, a population of 2 million, and a query involving 12 indexes, it takes a little over 0.7 seconds to perform the LINQ query, while the query with the matching classes (using preloaded indexes) takes 0.2 seconds. The query returned 108,552 matches (as the index files are generated randomly, expect a different result).
When we look at the memory consumption, we see that the LINQ implementation requires a huge amount of memory! The preloaded LINQ run required an amount of memory close to 30 (!) times the size of the index files. The indexes are 19 MB altogether, while the run required 530 MB.
I guess the LINQ implementation, which cannot assume the data is sorted, has to use different buffers and therefore copies the data several times.
In conclusion, I would be tempted to say that using algorithms optimized for matching sorted lists, as the matching classes do, gives a performance benefit; but most importantly it benefits memory consumption, and hence scalability too. When many huge indexes are involved, LINQ might well exceed the memory available to your application.
Usage scenarios and limitations
The simple concept of matching by sorted indexes has two major limitations:
- Indexes must exist for every matching criterion, and all queries must be expressed in terms of those indexes.
- Ranking is not supported; all indexes are sorted by customer ID. The matching query can only tell whether a customer is matched or not.
The first issue means you have to foresee all the indexes you will need. Also, since an index indicates a match to a criterion, the criteria must be predefined in terms of match or no match, true or false. So if one of my match criteria is “is over 18 years of age”, I need to maintain an index containing those over 18 years of age, and I probably have daily mutations to that index that I have to manage.
Another example: if I want to know who bought chocolate pudding in the last 60 days, I need to maintain exactly that index, with a content that changes every day. I can’t combine an index “bought chocolate pudding” with a “visited the store in the last 60 days” because the result would be different. Knowing who bought chocolate pudding in the last 61 days would require a different index to be maintained.
Nor can I rank the results by who bought the most chocolate pudding in the last 60 days, or who bought chocolate pudding most recently.
Both limitations can be overcome when the matching process is seen as only a single step within a process of querying the system. Additional steps can perform further filtering (when match criteria are too broad because no exact index match could be found), sorting, and/or ranking.
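As an illustration of such a follow-up step (the data structures here are made up), the IDs coming out of the matching step can be filtered exactly and then ranked against detailed sales data:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    static void Main()
    {
        // Result of the matching step: candidate customer IDs.
        var matches = new[] { 11, 17, 23, 42 };

        // Hypothetical detail data: pudding units bought in the last 60 days.
        var unitsLast60Days = new Dictionary<int, int>
        {
            [11] = 0, [17] = 5, [23] = 2, [42] = 9,
        };

        // Follow-up step: exact filtering, then ranking.
        var ranked = matches
            .Where(id => unitsLast60Days[id] > 0)
            .OrderByDescending(id => unitsLast60Days[id])
            .ToList();

        Console.WriteLine(string.Join(",", ranked)); // 42,17,23
    }
}
```

The matching step keeps the candidate set small and cheap to compute; the follow-up step then only has to touch detail data for the candidates, not for the whole population.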
The sample solution
The sample solution that you can download with this article contains three projects:
- The Matching project contains the matching expression classes (such as ExcludeMatchingExpression) that perform the logical operations optimized for sorted indexes. The project also contains the index classes (such as ListIndex, plus the FileIndex and PreloadedFileIndex implementations used above) and extension methods.
- The Matching.Test project contains unit tests for the Matching project.
- The MatchingSample project is a command line application that demonstrates the matching process as described in this article.