CollectionRegex -- Regular Expressions for Collections
Find patterns in a sequence of objects using a familiar language.
Prerequirements
A reader of this article is supposed to know basics of regular expressions. The article does not cover explaining regular expressions. The Regular Expression .info website is a fine regex resource, both as a starting point and a reference.
Introduction
This article presents a simple, strong-typed class library which provides methods
for finding patterns in sequences of objects. Patterns are described with regular
expressions, and a sequence is any instance of a class which implements the IEnumerable<T>
generic interface [msdn]. This includes well known .NET framework classes like a
List<T>
, HashSet<T>
or a Dictionary<K,V>
and objects returned by
LINQ queries.
Using the Code
A functionality of the library can be accessed by invoking a CollectionRegex.Match
method. Both instance and static version of this method are availible. To make them
usable, the following information is required:
- A pattern expressed as a regular expression,
- A relation between characters which appear in a pattern and collection items,
- (optionally) options to pass to the .NET regex engine.
When all of the above parameters are known, a Match
method can be invoked
and results acquired. The Match
method returns an enumeration of all
matches, that is, an IEnumerable<CollectionMatch<T>>
. An
implementation uses a yield return
[msdn]
construct, so operations are performed on-demand. The library supports both named
and unnamed capturing groups.
Example 1
The following code analyses a sequence of numbers to find places where at least
three consecutive numbers exceed a specified range. Each number is classified into
one of three categories: "too low" (a
), "ok" (b
)
or "too high" (c
). A regular expression which will be used
is:
[^b]{3,}
b
.
In terms of a collection, it matches every three or more consecutive numbers that
are not in a category "ok". Since defined categories
are a, b and c, the expression is equivalent to [ac]{3,}
. A required
.NET code is very straightfoward. Of course, a relation between characters in a
pattern and categories needs to be defined. In this example it is done with
lambda expressions. They are anymous methods which take every number from
a collection and return a boolean value, which is true when a number falls into a particular
category.// create a regex object
var reg = new CollectionRegex<double>(@"[^b]{3,}");
double low = 3,
high = 7;
// define predicates for "low", "ok" and "high"
// with lambda expressions
reg.AddPredicate(s => s <= low, 'a');
reg.AddPredicate(s => s > low && s < high, 'b');
reg.AddPredicate(s => s >= high, 'c');
Testing code:
// create a test sequence var collection = new List<double> { 4, 5, 9, 6, 7, 8, 9, 8, 6, 4, 3, 2, 4, 2, 2, 3, 3, 5, 5, 5, 3, 2, 7, 5 }; // run the regex var actual = reg.Match(collection).ToList(); // check if results are as expected Assert.AreEqual(3, actual.Count()); Assert.AreEqual(4, actual[0].Index); Assert.AreEqual(4, actual[0].Count); Assert.AreEqual(13, actual[1].Index); Assert.AreEqual(4, actual[1].Count); Assert.AreEqual(20, actual[2].Index); Assert.AreEqual(3, actual[2].Count);
A test collection is internally represented by a string "bbcb cccc bbaa baaa
abbb aacb
" (without spaces). The regex matches three subsequences:
{7, 8, 9, 8} (cccc
), {2, 2, 3, 3} (aaaa
) and {3, 2, 7}
(aac
).
LINQ alternative
For those who think in C#, it can be easier to understand the idea by egzamining an implementation which does not use the CollectionRegex library.List<IEnumerable<double>> results = new List<IEnumerable<double>>(); for (int i = 0; i < collection.Count; i++) { IEnumerable<double> enu = collection.Skip(i); enu = enu.TakeWhile(x => x <= low || x >= high); if (enu.Count() >= 3) { results.Add(enu); i += enu.Count() - 1; } }
Example 2
The following code finds requests in a production process which weren't finished
successfully. A collection contains objects of type ProductionEvent
,
which represent a request for producing an item (r
), a success report
(s
) or a failure report (f
). If an item couldn't be produced,
a failure report is put into a collection. An item can be successfuly produced even
if it failed a few times. The regex is supposed to find items which wasn't produced
at all.
(?<item>r)f+(?=r|$)
(?<item>r)
matches a request for an item and defines a named groupitem
.f+
matches one or more failure report.- Finally, the last part
(?=r|$)
does a positive look-ahead for either another request (r
) or the end of a collection ($
), to ensure that there wasn't any success report after the analysed request.
For example, in a sequence rs rfff rffffs rff
, it finds the second
and the fourth request (rfff
and rff
). However, it does
not match the third one (rffffs
) because it was finished successfully
eventually, despite there were four failures on the way. The source code assumes
that there is a class ProductionEvent
which exposes two properties:
Type
and ItemName
.
var reg = new CollectionRegex<ProductionEvent>(@"(?<item>r)f+(?=r|$)");
reg.AddPredicate(s => s.Type == ProductionEventTypes.Success, 's');
reg.AddPredicate(s => s.Type == ProductionEventTypes.ItemRequest, 'r');
reg.AddPredicate(s => s.Type == ProductionEventTypes.Failure, 'f');
And a testing code:
var collection = new List<ProductionEvent> { new ProductionEvent { Type = ProductionEventTypes.ItemRequest, ItemName="chocolade" }, new ProductionEvent { Type= ProductionEventTypes.Success }, new ProductionEvent { Type = ProductionEventTypes.ItemRequest, ItemName="impossible1" }, new ProductionEvent { Type= ProductionEventTypes.Failure }, new ProductionEvent { Type= ProductionEventTypes.Failure }, new ProductionEvent { Type= ProductionEventTypes.Failure }, new ProductionEvent { Type = ProductionEventTypes.ItemRequest, ItemName="problematic" }, new ProductionEvent { Type= ProductionEventTypes.Failure }, new ProductionEvent { Type= ProductionEventTypes.Failure }, new ProductionEvent { Type= ProductionEventTypes.Failure }, new ProductionEvent { Type= ProductionEventTypes.Failure }, new ProductionEvent { Type= ProductionEventTypes.Success }, new ProductionEvent { Type = ProductionEventTypes.ItemRequest, ItemName="impossible2" }, new ProductionEvent { Type= ProductionEventTypes.Failure }, }; var actual = reg.Match(collection).ToList(); foreach (CollectionMatch<ProductionEvent> e in actual) { // access a named group "item" Assert.IsTrue(e.Groups["item"].Single().ItemName.StartsWith("impossible")); }
LINQ alternative
In the below snippet, a requests
list simulates a capture group "item".
List<IEnumerable<ProductionEvent>> results = new List<IEnumerable<ProductionEvent>>(); List<ProductionEvent> requests = new List<ProductionEvent>(); for (int i = 0; i < collection.Count; i++) { IEnumerable<ProductionEvent> begin = collection.Skip(i); IEnumerable<ProductionEvent> enu = begin; if (enu.Count() >= 2) { var request = enu.First(); if (request.Type == ProductionEventTypes.ItemRequest) { enu = enu.Skip(1); var x = enu.First(); if (x.Type == ProductionEventTypes.Failure) { enu = enu.SkipWhile(p => p.Type == ProductionEventTypes.Failure); if (enu.Count() == 0) { results.Add(begin); requests.Add(request); i += begin.Count(); } else if (enu.First().Type == ProductionEventTypes.ItemRequest) { // (dirty) var result = begin.TakeWhile(p => !object.ReferenceEquals(p, enu.First())); results.Add(result); requests.Add(request); i += result.Count(); } } } } }
Remarks
Predicates must be mutually exclusive. If any item matches more than one predicate,
then the Match
method would throw an ArgumentException.
Predicate is an instance of a class which inherits from an abstract class CollectionRegexPredicate
.
If you want to provide a custom mechanism of checking if an objects falls into a
category, all you have to do is to implement a method IsMatch(T obj)
.
Items which wasn't matched by any predicate are represented by a comma (','
)
in a regular expression. Therefore, a negated group like [^ab]
will
match these items. A comma was selected after a long meditation and no other character
could be accurate here. (a first printable, not-problematic and easily countable
by a human character in an ASCII table).
Along with a regular class library, I have written an extension method for IEnumerable
class, which allows using constructs like this:
var predicates = new CollectionRegexDelegatePredicate<double>[] {
new CollectionRegexDelegatePredicate<double>(s => s <= low, 'a'),
new CollectionRegexDelegatePredicate<double>(s => s > low && s < high, 'b'),
new CollectionRegexDelegatePredicate<double>(s => s >= high, 'c'),
};
var actual = collection.MatchRegex(predicates, @"[^b]{3,}")s;
Please use a message board below the article to report bugs or post feature requests.
History
- 2012-04-14 -- article and API update
- 2012-04-11 -- the original version posted