Introduction
Recently, I was confronted with the following problem: how to perform high-speed matching of a huge number of items against a complex criterion?
My real problem domain was different, but just to give you an idea, consider matching the customer population of a large warehouse to customer profiles. A customer profile could for instance be: “has bought this product, but no product of that category”. The matching criterion is determined by the profile and becomes more complex when the profile contains more ANDs and ORs…
One option would be to store all customer and sale information in a database and perform the matching with regular SQL queries. For small populations, this may be all right, but for large populations you will soon see considerable performance degradation as database engines are not especially optimized for this specific type of situation. Hierarchies of product categories, for instance, need to be translated into a number of joins depending on the depth of the categories tree.
Data warehouses may be a better option. They are optimized for very large amounts of data and structure it in a way that ensures simple query execution plans.
Index Based Matching
I chose a different path: implementing the matching using lightweight indexes. Such an index is simply a list of integers and can be represented using an IEnumerable<int>, with the particularity that the values in the index are sorted.
For instance, I could create an index containing all customers that visited the store in the last 6 months. This index would be a sorted list of integers (customer IDs).
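To make the idea concrete, here is a minimal sketch (my own code, with hypothetical names, not part of the article's library) of building such an index. However the customer IDs are collected, the only invariant that matters is that the resulting sequence is distinct and sorted ascending:

```csharp
using System.Collections.Generic;
using System.Linq;

public class IndexBuildingSketch
{
    // Build a lightweight index from a raw stream of customer IDs
    // (e.g. everyone who visited the store in the last 6 months).
    public static IEnumerable<int> BuildIndex(IEnumerable<int> recentVisitorIds)
    {
        return recentVisitorIds
            .Distinct()
            .OrderBy(id => id)   // the essential invariant of a lightweight index
            .ToList();
    }
}
```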
Having IEnumerables means we can use LINQ to run queries against the indexes. And indeed, an OR can be represented using the Union() operation, while an AND can be represented using the Intersect() operation. The Except() operation provides support for NOT operations.
In LINQ, we can, for instance, write:
a.Intersect(b.Union(c)).Except(d)
Here, a is the list of customers (or customer IDs) that visited the store in the last 6 months, b the list of customers that bought baby milk, c the customers that bought baby food, and d the customers who bought diapers. The query gives us the list of customers that could be interested in a change to our diapers assortment.
LINQ excels at handling enumerables; it's a language feature I would never want to miss again. But it is an all-round solution for handling all kinds of enumerables, not one optimized for sorted enumerables. So I created a few classes containing algorithms for AND, OR, and EXCLUDE (or NOT) that are optimized for sorted value lists:
These classes are themselves IEnumerable<int>, and the algorithm can be found in their GetEnumerator() method. For the OR handling, I used an algorithm based on merge sort which I got from the following link. The AND and EXCLUDE algorithms are homemade and may well allow further optimization.
The same query as above can now be written as:
new ExcludeMatchingExpression(
    new AndMatchingExpression(
        a,
        new OrMatchingExpression(b, c)
    ),
    d
);
Or, using the IEnumerable<int> extension methods that I also provide:
a.MatchAnd(b.MatchOr(c)).MatchExcept(d)
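The extension methods can be imagined as thin wrappers over the expression classes; the sketch below shows what they might look like (the constructor signatures are my assumption, the actual code is in the download):

```csharp
using System.Collections.Generic;

public static class MatchingExtensionsSketch
{
    // Assumed shape: each expression class takes its operands in the
    // constructor and is itself an IEnumerable<int>.
    public static IEnumerable<int> MatchOr(this IEnumerable<int> first, IEnumerable<int> second)
    {
        return new OrMatchingExpression(first, second);
    }

    public static IEnumerable<int> MatchAnd(this IEnumerable<int> first, IEnumerable<int> second)
    {
        return new AndMatchingExpression(first, second);
    }

    public static IEnumerable<int> MatchExcept(this IEnumerable<int> source, IEnumerable<int> excluded)
    {
        return new ExcludeMatchingExpression(source, excluded);
    }
}
```

Because each method just returns another IEnumerable<int>, the calls compose and stay lazy, exactly like their LINQ counterparts.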
Performance and Memory Gain
The downloadable code contains the implementation of these classes, as well as a sample application that can be used to compare the performance with regular LINQ methods. The first time the sample is run, it will create some index files on disk with random (but sorted) content.
One of the index files is created with the following line of code. It creates an index file (only if it does not yet exist) containing 12% of the population. For a population of 2 million (the current default), this results in a file containing 240,000 integers, thus a little under 940 KB.
var a12 = new Index(EnsureIndexFileExists(@"..\..\Indexes\A012.idx", population, 12));
The ‘Index’ class, of which the above line of code seems to create an instance, does not really exist. A using statement at the top of the code defines which class is actually used. The FileIndex class is an index whose values are stored on disk and are only loaded into memory one at a time, as needed. PreloadedFileIndex first loads the whole file into memory and therefore runs faster.
using Index = Matching.FileIndex;
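A FileIndex-style class might stream its values roughly as sketched below (this is my own illustration, not the article's actual implementation, which is in the download):

```csharp
using System.Collections;
using System.Collections.Generic;
using System.IO;

// Streams 32-bit integers from a file one at a time, never holding the
// whole index in memory. A PreloadedFileIndex would instead read the
// entire file into an int[] up front and enumerate that.
public class FileIndexSketch : IEnumerable<int>
{
    private readonly string path;

    public FileIndexSketch(string path)
    {
        this.path = path;
    }

    public IEnumerator<int> GetEnumerator()
    {
        using (var reader = new BinaryReader(File.OpenRead(path)))
        {
            long count = reader.BaseStream.Length / sizeof(int);
            for (long i = 0; i < count; i++)
                yield return reader.ReadInt32();   // values were written in sorted order
        }
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}
```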
When comparing performance, we see that the two ‘Matching’ based runs perform better than the LINQ ones. The file based runs (using a FileIndex) are obviously slower than the preloaded runs. The following chart shows the results for four runs (FileIndex+LINQ, PreloadedFileIndex+LINQ, FileIndex+matching, PreloadedFileIndex+matching):
The preloaded matching run is 3.4 times faster than the preloaded LINQ based run. That the file based matching run is only twice as fast as the file based LINQ run can be explained by the fact that both runs get the same amount of time added (the time needed to read the files).
Using a 4 year old Intel Quad Core2 CPU at 2.4GHz, a population of 2 million, and a query involving 12 indexes, it takes a little over 0.7 seconds to perform a LINQ query, while it takes 0.2 seconds for the query with the matching classes (using preloaded indexes). The query returned 108,552 matches (as the index files are generated randomly, expect a different result).
When we look at the memory consumption, we see that the LINQ implementation requires a huge amount of memory! The preloaded LINQ run required an amount of memory close to 30 (!) times the size of the index files. The indexes are 19 MB altogether, while the run required 530 MB.
I suspect the LINQ implementation, which cannot assume the data is sorted, has to buffer the data in hash sets and therefore copies it several times.
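The contrast can be illustrated with a toy example (my own sketch, not the article's benchmark). Enumerable.Except must see every element of its second operand before it can emit anything, so it materializes that operand into a hash set first; a merge over sorted inputs never needs more than one current value per input:

```csharp
using System.Linq;

public class MemorySketch
{
    public static void Main()
    {
        int[] a = { 1, 3, 5, 7 };
        int[] d = { 3, 7 };

        // LINQ: Except loads 'd' into an internal hash set before yielding
        // anything from 'a' -- extra memory proportional to the operand size.
        var viaLinq = a.Except(d).ToList();   // { 1, 5 }

        // A sorted merge would instead walk both arrays with one cursor each,
        // comparing front values and advancing the smaller: O(1) extra memory.
    }
}
```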
In conclusion, I would be tempted to say that using algorithms optimized for matching sorted lists, as the matching classes do, gives a performance benefit, but most importantly it benefits memory consumption and hence scalability too. When many huge indexes are involved, LINQ might well exceed the memory available to your application.
Usage Scenarios and Limitations
The simple concept of matching by sorted indexes has two major limitations:
- Indexes must exist for every matching criterion, and all queries must be expressed in terms of those indexes.
- Ranking is not supported; all indexes are sorted by customer ID, so the matching query can only tell whether a customer matches or not.
The first issue means you have to foresee all the indexes you will need. Also, since an index indicates a match to a criterion, the criteria must be predefined in terms of match or no match, true or false. So if one of my match criteria is “is over 18 years of age”, I need to maintain an index containing those over 18 years of age, and I probably have daily mutations to that index to manage.
Another example: if I want to know who bought chocolate pudding in the last 60 days, I need to maintain exactly that index, with content that changes every day. I can’t combine an index “bought chocolate pudding” with an index “visited the store in the last 60 days”, because the result would be different. Knowing who bought chocolate pudding in the last 61 days would require yet another index to be maintained.
Nor can I rank the results by who bought the most chocolate pudding in the last 60 days, or who bought chocolate pudding most recently.
Both limitations can be overcome when the matching process is seen as only a single step within a process of querying the system. Additional steps can perform further filtering (when match criteria are too broad because no exact index match could be found), sorting, and/or ranking.
The Sample Solution
The sample solution that you can download with this article contains three projects:
- The Matching project contains the And-, Or-, and ExcludeMatchingExpression classes that perform the logical operations optimized for sorted indexes. It also contains the FileIndex and ListIndex classes (two usable implementations of a sorted index) and Extension Methods.
- The Matching.Test project contains unit tests for the Matching project.
- The MatchingSample project is a command line application that demonstrates the matching process as described in this article.
History
- 1st November, 2011: Initial version