Recently, I wanted to extract calls to an external system from log files and do some LINQ to XML processing on the obtained data. Here’s a sample log line (simplified; the real log was way more complicated, but that doesn’t matter for this post):
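Call:<getName seqNo="56789"><id>123</id></getName> Result:<getName seqNo="56789">John Smith</getName>
I was interested in the XML data of the call:
<getName seqNo="56789">
  <id>123</id>
</getName>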
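Quick tip: a super-easy way to get such nicely formatted XML in .NET 3.5 or later is to invoke the ToString method on an XElement object:
var xml = System.Xml.Linq.XElement.Parse(someUglyXmlString);
string formatted = xml.ToString(); // XElement.ToString() indents the output by default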
As for the log, some things were certain:
- the call’s XML will be logged after the “Call:” text at the beginning of the line
- the call’s root element name will contain only alphanumeric characters or underscores
- there will be no line breaks in the call’s data
- the call’s root element name may also appear in the “Result:” section
Getting to the proper information was quite easy, thanks to the Regex class:
Regex regex = new Regex(@"(?<=^Call:)<(\w+).*?</\1>");
string call = regex.Match(logLine).Value;
This short regular expression has a couple of interesting parts. It may not be perfect but proved really helpful in log analysis. If this regex is not entirely clear to you - read on, you will need to use something similar sooner or later.
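To tie the regex back to the LINQ to XML goal, here is a minimal end-to-end sketch. It assumes the sample log line is the one assembled from the fragments quoted in this post; the CallExtractor class and the ExtractCall method are hypothetical names used only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class CallExtractor
{
    // Same pattern as in the post: lookbehind for "Call:" at line start,
    // capture the root tag name, lazily match up to the first matching close tag.
    private static readonly Regex CallRegex =
        new Regex(@"(?<=^Call:)<(\w+).*?</\1>");

    // Returns the call's XML as an XElement, or null when the line has no call.
    public static XElement ExtractCall(string logLine)
    {
        Match match = CallRegex.Match(logLine);
        return match.Success ? XElement.Parse(match.Value) : null;
    }

    public static void Main()
    {
        // Assumed sample line (simplified, as in the post).
        string logLine =
            "Call:<getName seqNo=\"56789\"><id>123</id></getName> " +
            "Result:<getName seqNo=\"56789\">John Smith</getName>";

        XElement call = ExtractCall(logLine);
        Console.WriteLine(call.Name.LocalName);             // root element name
        Console.WriteLine((string)call.Attribute("seqNo")); // attribute value
        Console.WriteLine((string)call.Element("id"));      // child element value
    }
}
```

Returning null for lines without a call keeps the caller free to skip them, for example with a Where filter over all log lines.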
Here’s the same regex with comments (RegexOptions.IgnorePatternWhitespace is required to process an expression commented this way):
string pattern = @"(?<=^Call:) # Positive lookbehind for call marker
<(\w+) # Capturing group for opening tag name
.*? # Lazy wildcard (everything in between)
</\1> # Backreference to opening tag name";
Regex regex = new Regex(pattern, RegexOptions.IgnorePatternWhitespace);
string call = regex.Match(logLine).Value;
(?<=^Call:) is a lookaround, or more precisely a positive lookbehind. It’s a zero-width assertion that lets us check whether some text is preceded by another text. Here “Call:” at the beginning of the line is the preceding text we are looking for.
(?<=something) denotes a positive lookbehind. There is also a negative lookbehind, expressed by (?<!something). With a negative lookbehind, we can match text that doesn’t have a particular string before it. A lookaround checks a fragment of the text but doesn’t become part of the match value. So when a lookbehind requires some prefix before the digits 123, the match value will be "123" rather than the prefix followed by "123".
The .NET regex engine has lookaheads too. Check this awesome page if you want to learn more about lookarounds.
Note: In some cases (like in our log examination example), instead of using positive lookaround we may use non-capturing group...
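The lookaround variants can be compared side by side in a small sketch. The inputs ("id=123", "$100 and 200", "took 250ms") and the LookaroundDemo class are made up for illustration. CallViaGroup shows one possible reading of the note above: consume “Call:” with a non-capturing group and read the XML from a capturing group instead of using a lookbehind.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Positive lookbehind: digits that directly follow "id=".
    public static string DigitsAfterId(string input)
    {
        return Regex.Match(input, @"(?<=id=)\d+").Value;
    }

    // Negative lookbehind: a whole number that is NOT directly preceded by "$".
    public static string NumberWithoutDollar(string input)
    {
        return Regex.Match(input, @"(?<!\$)\b\d+\b").Value;
    }

    // Positive lookahead: digits that are directly followed by "ms".
    public static string DigitsBeforeMs(string input)
    {
        return Regex.Match(input, @"\d+(?=ms)").Value;
    }

    // No lookbehind: "Call:" is consumed by a non-capturing group, so the XML
    // has to be read from group 1; the tag name becomes group 2 (hence \2).
    public static string CallViaGroup(string logLine)
    {
        return Regex.Match(logLine, @"^(?:Call:)(<(\w+).*?</\2>)").Groups[1].Value;
    }

    public static void Main()
    {
        Console.WriteLine(DigitsAfterId("id=123"));             // 123
        Console.WriteLine(NumberWithoutDollar("$100 and 200")); // 200
        Console.WriteLine(DigitsBeforeMs("took 250ms"));        // 250
        Console.WriteLine(CallViaGroup(
            "Call:<getName seqNo=\"56789\"><id>123</id></getName> " +
            "Result:<getName seqNo=\"56789\">John Smith</getName>"));
    }
}
```

The trade-off in CallViaGroup is that Match.Value now includes the “Call:” prefix, so the interesting text has to be taken from the Groups collection.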
<(\w+) will match a less-than sign followed by one or more characters from the \w class (letters, digits or underscores). The \w+ part is surrounded with parentheses to create a group containing the XML root name (getName for the sample log line). We later use this group to find the closing tag by means of a backreference. (\w+) is a capturing group, which means that the text matched by this group is added to the Groups collection of the Match object. If you want to put part of the expression into a group but you don’t want to push results into the Groups collection, you may use a non-capturing group by adding a question mark and colon after the opening parenthesis, like this: (?:something)
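The effect on the Groups collection can be seen in a short sketch. The input fragment is taken from the sample log line; the GroupDemo class and its GroupCount helper are hypothetical names. Keep in mind that Groups always contains the whole match at index 0.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupDemo
{
    // Number of entries in Match.Groups for a given pattern and input.
    public static int GroupCount(string pattern, string input)
    {
        return Regex.Match(input, pattern).Groups.Count;
    }

    public static void Main()
    {
        string input = "<getName seqNo=\"56789\">";

        // Capturing group: Groups holds the whole match plus group 1.
        Match capturing = Regex.Match(input, @"<(\w+)");
        Console.WriteLine(capturing.Groups.Count);    // 2
        Console.WriteLine(capturing.Groups[1].Value); // getName

        // Non-capturing group: only the whole match is stored.
        Match nonCapturing = Regex.Match(input, @"<(?:\w+)");
        Console.WriteLine(nonCapturing.Groups.Count); // 1
        Console.WriteLine(nonCapturing.Value);        // <getName
    }
}
```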
.*? matches all characters except newline (because we are not using RegexOptions.Singleline) in lazy (or non-greedy) mode, thanks to the question mark after the asterisk. By default the * quantifier is greedy, which means that the regex engine will try to match as much text as possible. In our case, the default mode would result in too much text being matched:
<getName seqNo="56789"><id>123</id></getName> Result:<getName seqNo="56789">John Smith</getName>
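A quick way to see the difference is to run both variants against the same line. This is a minimal sketch, assuming the sample log line assembled from the fragments in this post; the QuantifierDemo class and its methods are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Lazy: stops at the first closing tag that matches the root name.
    public static string LazyMatch(string logLine)
    {
        return Regex.Match(logLine, @"(?<=^Call:)<(\w+).*?</\1>").Value;
    }

    // Greedy: runs to the LAST matching closing tag on the line.
    public static string GreedyMatch(string logLine)
    {
        return Regex.Match(logLine, @"(?<=^Call:)<(\w+).*</\1>").Value;
    }

    public static void Main()
    {
        string logLine =
            "Call:<getName seqNo=\"56789\"><id>123</id></getName> " +
            "Result:<getName seqNo=\"56789\">John Smith</getName>";

        Console.WriteLine(LazyMatch(logLine));   // only the call's XML
        Console.WriteLine(GreedyMatch(logLine)); // call's XML plus the Result: part
    }
}
```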
</\1> matches the XML closing tag, where the element’s name is provided by the \1 backreference. Remember the (\w+) group? This group has number 1, and by using the \1 syntax we are referencing the text matched by this group. So for our sample log, </\1> gives us </getName>. If a regex is complex, it may be a good idea to ditch numbered references and use named references instead. You can name a group with the (?<name>something) or (?'name'something) syntax and reference it by using \k<name> or \k'name'. So, taking tag as the group name, your expression could look like this:
@"(?<=^Call:)<(?<tag>\w+).*?</\k<tag>>"
or like this:
@"(?<=^Call:)<(?'tag'\w+).*?</\k'tag'>"
The latter version is better for our purpose. Using < > signs while matching XML is confusing. In this case, the regex engine will do just fine with the < > version, but keep in mind that source code is written for humans…
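Named groups can also be read back by name from the Groups collection. The sketch below assumes tag as the group name (an arbitrary choice for illustration) and uses a hypothetical NamedGroupDemo class. Both naming syntaxes should produce the same result.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // Quote syntax: easier to read next to XML angle brackets.
    public const string QuotePattern = @"(?<=^Call:)<(?'tag'\w+).*?</\k'tag'>";

    // Angle-bracket syntax: equivalent, but visually noisy here.
    public const string AnglePattern = @"(?<=^Call:)<(?<tag>\w+).*?</\k<tag>>";

    // Returns the root element name captured by the named group.
    public static string RootName(string pattern, string logLine)
    {
        return Regex.Match(logLine, pattern).Groups["tag"].Value;
    }

    public static void Main()
    {
        string logLine =
            "Call:<getName seqNo=\"56789\"><id>123</id></getName> " +
            "Result:<getName seqNo=\"56789\">John Smith</getName>";

        Console.WriteLine(RootName(QuotePattern, logLine)); // getName
        Console.WriteLine(RootName(AnglePattern, logLine)); // getName
    }
}
```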
Regular expressions look intimidating, but do yourself a favor and spend a few hours practicing them. They are extremely useful (not only for quick log analysis)!