Introduction

A cool feature of the .NET RegEx-engine is the ability to match nested constructions, for example nested parenthesis. I will describe this feature somewhat in depth in this article.

In Part II the balancing group is explained in depth and it is applied to a couple of concrete examples.

If you are an experienced RegEx developer, please feel free to go forward to the part "The Push-down Automata."

Background

Regular expressions are a very cool feature for pattern recognition in strings. Mathematically speaking regular expressions are parsed through a "finite state machine". As the name implies, such a machine has only a finite number of states, and it has no external memory attached. This means that the finite state machine cannot match nested constructions such as: (9 + (5 + 3)). Or a more specific task: "Tell me if and where the string str has a number of correct nested parenthesis."

Though, if the finite state machine is supplied with an external stack, the mathematics state, this would be possible - and that's what has happened in the .NET RegEx engine. Through smart use of external stacks, it is now possible to match nested constructions. So let's see how it gets the job done.

The Beginning: Captures

(Feel free to skip this part if you already have a solid knowledge of regular expressions).

In a regular expression, you can always use parenthesis to capture a specific part of the recognized pattern. A quick example:

Source text
mymail@mycompany.com

Regular expression
([^@]+)@([^.]+)\.(\w{2,4})

This is a (too) simple expression which attempts to capture a mail address. Let's run through the parenthesis:

```
( [^@]+ )
```
The first parenthesis matches any number of characters which is not a @-symbol. The expression is encapsulated in a parenthesis, and therefore the RegEx engine will capture this part of the source into a group numbered 1. This group can be accessed at runtime like this:
```
String pattern = @"([^@]+)@([^.]+)\.(\w{2,4})";
String source = "mymail@mycompany.com";
Match match = Regex.Match(pattern, source);
match.Groups[1].Value; // returns "mymail"

match.Groups[1].Index; // returns 0;
```
```
( [^.]+ )
```
Likewise, after the @ in the email address, the above expression will capture any text until it reaches a period. This will be captured into group number 2.
```
( \w{2,4} )
```
Finally this will capture the rest of the expression - the top level domain - into group number 3.

The Fundamentals: Named Captures

Even though this is not a new feature, named captures are fundamental for understanding the nested constructions in .NET regular expressions.

In the previous chapter, parenthesis were used to capture a part of the source into a group. The groups were named with successive integers beginning with 1 (by convention Groups[0] captures the whole match).

But it is also possible to capture parts of the source into named groups, with the following simple syntax:

(?<user>[^@]+) @ (?<company>[^.]+) \. (?<top_level_domain>\w{2,4})

In the code, the groups can now be accessed like this:

String pattern = @"([^@]+)@([^.]+)\.(\w{2,4})";
String source = "mymail@mycompany.com";
Match match = Regex.Match(pattern, source);
// returns "mymail":

match.Groups["user"].Value;
// returns "mycompany":

match.Groups["company"].Value
// returns "com":

match.Groups["top_level_domain"].Value

This is not a new part of a regular expression engine - you would find the same to exist with almost the same syntax in languages like Python and PHP.

But the .NET regular expression engine uses the principle of named capturing to provide features for the developer to create a regular expression which is capable of matching nested constructions. Now, let's check that out!

The Push-down Automata

As stated in the beginning of this article, a finite state machine is not capable of matching nested constructions. This is actually the reason that Chomsky in the 1960's argued that a finite state machine cannot recognize a natural language such as English.

A push-down automata is a finite state machine with an external memory - a stack - attached. This Framework provides a basis for understanding how the .NET RegEx engine is capable of matching nested constructions. Let's see some code.

Now, let's say that the task is to match nested <div>'s in HTML code.

String pattern = @"
(?# line 01) <div>
(?# line 02) (?>
(?# line 03)   <div> (?<DEPTH>)
(?# line 04)   |
(?# line 05)   </div> (?<-DEPTH>)
(?# line 06)   |
(?# line 07)   .?
(?# line 08) )*
(?# line 09) (?(DEPTH)(?!))
(?# line 10) </div>
";

source = "<div>asdf<div>asdf</div>asdffds</div>";
Match match = Regex.Match(source, pattern,
  RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase);
Console.WriteLine(match.Success.ToString());

First (line 1) this expression matches an opening div. This is mirrored by an ending closing tag </div> in line 10.

In line 2, the first group begins. It is an unnamed atomic group. "Atomic" means roughly that once something is matched, the RegEx-engine won't give it up again. This might sound a bit strange at first, but it can be important to performance. To keep focus in this article I won't elaborate further on it. But notice that this parenthesis is ended in line 8 with a * meaning: repeat as many times as possible.

Now the important stuff begins. Line 3 to 7 is an alternation with three possibilities. Either it should match a <div> or a </div> or a single character .?. When alternation occurs, the .NET RegEx engine will try out the matches one at a time accepting the first match - even if this is not the longest match. (This is as far as I know opposed to a deterministic finite automaton).

When the RegEx engine matches a <div> (line 3), it is at the same time told to match something else: (?<DEPTH>). Remember the stuff about named capturing? This actually is a capture with the name DEPTH. It is empty, though! So while the group doesn't actually capture anything from the source, it does something else, which is very important - it creates a new stack, let's pretend it's called DEPTH, and it puts the capture on that stack!

This means that we now have a new stack called DEPTH with one element on it containing an empty string. This is quite interesting - so let's dig a bit deeper into it.

Check this out:

(
(?<A>a)
|
(?<B>b)
)*

Here we've also used named capturing, but we don't just capture an empty string, we are capturing the letter a onto the stack A and the letter b onto the stack B. As seen before we can request this using the code: match.Groups["A"].Value;.

Now, let's change the code a bit:

(
(?<A>a)
|
(?<A>b)
)*

Now both groups are named A. What happens this time? If we take this source: ab, the RegEx engine will first match the a. By doing this it creates a new stack called A and it pushes the a on the stack. Next it matches the b. It already has created a stack named A, so it just pushes a new element on the stack with the capture b. So now our stack looks like this:

Stack 'A'
________
|      |
|   b  |
|   a  |
|______|

If you use the same code to request what's captured in group A (match.Groups["A"].Value) you would get the string b - the Groups object simply peeks the top element on the stack.

This stack invention is quite cool - but what's even cooler is, that it lets you pop elements off the stack inside the Regex pattern, which is what happens in this example:

(
(?<A>a)
|
(?<A>b) (?<-A>)
)*

Again, consider the input string ab. First the RegEx engine matches the a, creates a new stack and pushes an a on it. Next, the engine matches a b and pushes it on the stack. Now the syntax changes: (?<-A>). This empty capturing parenthesis actually pops the top element off the A stack. To ensure this actually happens try the code once again: match.Groups["A"].Value. Now this code returns the string a even though the last character matched was b. The reason is that the b was popped off the stack immediately after it was pushed on. So the stack contains only one element, i.e. the a.

Okay - back to the DIV's in our main example.

Whenever the RegEx engine matches a <div> it pushes an empty string on the stack. On the other hand: whenever it matches a </div> it pops the stack. Doing this - the stack would end up empty if and only if the RegEx engine discovers a correct nested construction of DIV's.

This is what the final expression is testing (line 9):

(?(DEPTH)(?!))

This is actually a conditional test. A conditional test is of the well known form if-then-else in the syntax:

(? (if) then | else)

The if-part tests if the named group was matched and stored. If it was matched the then-part of the expression is applied, else the else-part is applied. Note that the else-part is optional.

What the last expression (?(DEPTH) (?!)) does then, is that it tests if the named group DEPTH was captured and stored. If it was, the then-part is applied (?!) forcing a negative lookahead with no expression, which will always fail. Hence, the whole expression fails if the number of (?<DEPTH>) doesn't match the number of (?<-DEPTH>) applied in a correct nested order. You can test this last part yourself with an expression like this:

Given the source aba this expression will not succeed. First it matches the a and pushes it on the stack, then the b (without pushing) and pops the stack - now the stack is empty. Then it checks to see if the stack has been matched and stored - it hasn't, so it will try to match a b where the source has an a. On the other hand, the source abb will succeed with a match.

Points of Interest

The invention of an auxiliary stack in the .NET RegEx engine has made it possible to match nested constructions and keep track of the matches bringing even more power to regular expressions. And it lets you push, pop and to some extent peek the stack from within the RegEx engine.

Also, it is possible to create multiple stacks with different names keeping track of more complex expressions.

In Part II peeking and the balancing group are explained in depth and applied to a couple of concrete examples.