14,635,443 members
Articles » General Programming » Algorithms & Recipes » Regular Expressions
Article
Posted 28 Oct 2007

77K views
62 bookmarked

# In Depth with RegEx Matching Nested Constructions

Rate this:
5 Nov 2007
A cool feature of the .NET RegEx-engine is the ability to match nested constructions, for example nested parenthesis. I will describe it somewhat in depth in this article.

## Introduction

A cool feature of the .NET `RegEx`-engine is the ability to match nested constructions, for example nested parenthesis. I will describe this feature somewhat in depth in this article.

In Part II the balancing group is explained in depth and it is applied to a couple of concrete examples.

If you are an experienced `RegEx `developer, please feel free to go forward to the part "The Push-down Automata."

## Background

Regular expressions are a very cool feature for pattern recognition in strings. Mathematically speaking regular expressions are parsed through a "finite state machine". As the name implies, such a machine has only a finite number of states, and it has no external memory attached. This means that the finite state machine cannot match nested constructions such as: `(9 + (5 + 3))`. Or a more specific task: "Tell me if and where the string str has a number of correct nested parenthesis."

Though, if the finite state machine is supplied with an external stack, the mathematics state, this would be possible - and that's what has happened in the .NET `RegEx `engine. Through smart use of external stacks, it is now possible to match nested constructions. So let's see how it gets the job done.

## The Beginning: Captures

(Feel free to skip this part if you already have a solid knowledge of regular expressions).

In a regular expression, you can always use parenthesis to capture a specific part of the recognized pattern. A quick example:

Source text
mymail@mycompany.com

Regular expression
`([^@]+)@([^.]+)\.(\w{2,4})`

This is a (too) simple expression which attempts to capture a mail address. Let's run through the parenthesis:

1. `( [^@]+ )`

The first parenthesis matches any number of characters which is not a `@`-symbol. The expression is encapsulated in a parenthesis, and therefore the `RegEx `engine will capture this part of the source into a group numbered 1. This group can be accessed at runtime like this:

```String pattern = @"([^@]+)@([^.]+)\.(\w{2,4})";
String source = "mymail@mycompany.com";
Match match = Regex.Match(pattern, source);
match.Groups[1].Value; // returns "mymail"
match.Groups[1].Index; // returns 0;```
2. `( [^.]+ )`

Likewise, after the `@` in the email address, the above expression will capture any text until it reaches a period. This will be captured into group number 2.

3. `( \w{2,4} )`

Finally this will capture the rest of the expression - the top level domain - into group number 3.

## The Fundamentals: Named Captures

Even though this is not a new feature, named captures are fundamental for understanding the nested constructions in .NET regular expressions.

In the previous chapter, parenthesis were used to capture a part of the source into a group. The groups were named with successive integers beginning with 1 (by convention `Groups[0]` captures the whole match).

But it is also possible to capture parts of the source into named groups, with the following simple syntax:

`(?<user>[^@]+) @ (?<company>[^.]+) \. (?<top_level_domain>\w{2,4})`

In the code, the groups can now be accessed like this:

```String pattern = @"([^@]+)@([^.]+)\.(\w{2,4})";
String source = "mymail@mycompany.com";
Match match = Regex.Match(pattern, source);
// returns "mymail":
match.Groups["user"].Value;
// returns "mycompany":
match.Groups["company"].Value
// returns "com":
match.Groups["top_level_domain"].Value ```

This is not a new part of a regular expression engine - you would find the same to exist with almost the same syntax in languages like Python and PHP.

But the .NET regular expression engine uses the principle of named capturing to provide features for the developer to create a regular expression which is capable of matching nested constructions. Now, let's check that out!

## The Push-down Automata

As stated in the beginning of this article, a finite state machine is not capable of matching nested constructions. This is actually the reason that Chomsky in the 1960's argued that a finite state machine cannot recognize a natural language such as English.

A push-down automata is a finite state machine with an external memory - a stack - attached. This Framework provides a basis for understanding how the .NET `RegEx `engine is capable of matching nested constructions. Let's see some code.

Now, let's say that the task is to match nested `<div>`'s in HTML code.

```String pattern = @"
(?# line 01) <div>
(?# line 02) (?>
(?# line 03)   <div> (?<DEPTH>)
(?# line 04)   |
(?# line 05)   </div> (?<-DEPTH>)
(?# line 06)   |
(?# line 07)   .?
(?# line 08) )*
(?# line 09) (?(DEPTH)(?!))
(?# line 10) </div>
";

source = "<div>asdf<div>asdf</div>asdffds</div>";
Match match = Regex.Match(source, pattern,
RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase);
Console.WriteLine(match.Success.ToString());```

First (line 1) this expression matches an opening `div`. This is mirrored by an ending closing tag `</div>` in line 10.

In line 2, the first group begins. It is an unnamed atomic group. "Atomic" means roughly that once something is matched, the `RegEx`-engine won't give it up again. This might sound a bit strange at first, but it can be important to performance. To keep focus in this article I won't elaborate further on it. But notice that this parenthesis is ended in line 8 with a `*` meaning: repeat as many times as possible.

Now the important stuff begins. Line 3 to 7 is an alternation with three possibilities. Either it should match a `<div>` or a `</div>` or a single character `.?`. When alternation occurs, the .NET `RegEx `engine will try out the matches one at a time accepting the first match - even if this is not the longest match. (This is as far as I know opposed to a deterministic finite automaton).

When the `RegEx `engine matches a `<div>` (line 3), it is at the same time told to match something else: `(?<DEPTH>)`. Remember the stuff about named capturing? This actually is a capture with the name `DEPTH`. It is empty, though! So while the group doesn't actually capture anything from the source, it does something else, which is very important - it creates a new stack, let's pretend it's called `DEPTH`, and it puts the capture on that stack!

This means that we now have a new stack called `DEPTH `with one element on it containing an empty string. This is quite interesting - so let's dig a bit deeper into it.

Check this out:

```(
(?<A>a)
|
(?<B>b)
)*```

Here we've also used named capturing, but we don't just capture an empty string, we are capturing the letter `a` onto the stack `A` and the letter `b `onto the stack `B`. As seen before we can request this using the code: `match.Groups["A"].Value;`.

Now, let's change the code a bit:

```(
(?<A>a)
|
(?<A>b)
)*```

Now both groups are named `A`. What happens this time? If we take this source: `ab`, the `RegEx `engine will first match the `a`. By doing this it creates a new stack called `A `and it pushes the `a `on the stack. Next it matches the `b`. It already has created a stack named `A`, so it just pushes a new element on the stack with the capture `b`. So now our stack looks like this:

```Stack 'A'
________
|      |
|   b  |
|   a  |
|______| ```

If you use the same code to request what's captured in group A (`match.Groups["A"].Value`) you would get the string `b` - the `Groups `object simply peeks the top element on the stack.

This stack invention is quite cool - but what's even cooler is, that it lets you pop elements off the stack inside the `Regex `pattern, which is what happens in this example:

```(
(?<A>a)
|
(?<A>b) (?<-A>)
)*```

Again, consider the input string `ab`. First the `RegEx `engine matches the `a`, creates a new stack and pushes an `a `on it. Next, the engine matches a `b` and pushes it on the stack. Now the syntax changes: `(?<-A>)`. This empty capturing parenthesis actually pops the top element off the `A` stack. To ensure this actually happens try the code once again: `match.Groups["A"].Value`. Now this code returns the string `a` even though the last character matched was `b`. The reason is that the `b` was popped off the stack immediately after it was pushed on. So the stack contains only one element, i.e. the `a`.

Okay - back to the `DIV`'s in our main example.

Whenever the `RegEx `engine matches a `<div>` it pushes an empty string on the stack. On the other hand: whenever it matches a `</div>` it pops the stack. Doing this - the stack would end up empty if and only if the `RegEx `engine discovers a correct nested construction of `DIV`'s.

This is what the final expression is testing (line 9):

`(?(DEPTH)(?!))`

This is actually a conditional test. A conditional test is of the well known form `if-then-else `in the syntax:

`(? (if) then | else)`

The `if`-part tests if the named group was matched and stored. If it was matched the `then`-part of the expression is applied, else the `else`-part is applied. Note that the `else`-part is optional.

What the last expression `(?(DEPTH) (?!)) `does then, is that it tests if the named group `DEPTH `was captured and stored. If it was, the `then`-part is applied (?!) forcing a negative lookahead with no expression, which will always fail. Hence, the whole expression fails if the number of `(?<DEPTH>)` doesn't match the number of `(?<-DEPTH>)` applied in a correct nested order. You can test this last part yourself with an expression like this:

Given the source `aba` this expression will not succeed. First it matches the `a` and pushes it on the stack, then the `b` (without pushing) and pops the stack - now the stack is empty. Then it checks to see if the stack has been matched and stored - it hasn't, so it will try to match a `b` where the source has an `a`. On the other hand, the source `abb` will succeed with a match.

## Points of Interest

The invention of an auxiliary stack in the .NET `RegEx `engine has made it possible to match nested constructions and keep track of the matches bringing even more power to regular expressions. And it lets you push, pop and to some extent peek the stack from within the `RegEx `engine.

Also, it is possible to create multiple stacks with different names keeping track of more complex expressions.

In Part II peeking and the balancing group are explained in depth and applied to a couple of concrete examples.

A list of licenses authors might use can be found here

## Share

 Product Manager FOA Denmark
Morten is a linguistics nerd and .NET developer. He is currently working as an Product Manager at Configit (http://www.configit.com).

 First Prev Next
 My vote of 5 GaryCamp10-Sep-13 5:00 GaryCamp 10-Sep-13 5:00
 Help needed for nested tags NikolaGrk5-Aug-13 0:50 NikolaGrk 5-Aug-13 0:50
 First readable explanation I've seen darrellp1-Oct-12 18:07 darrellp 1-Oct-12 18:07
 There's a new version of the RegEx Tester Tool ! Pablo Osés1-Mar-08 23:19 Pablo Osés 1-Mar-08 23:19
 Is it ok if I translate these two excellent article into chinese? deerchao16-Nov-07 4:16 deerchao 16-Nov-07 4:16
 Re: Is it ok if I translate these two excellent article into chinese? Morten Holk Maate16-Nov-07 5:03 Morten Holk Maate 16-Nov-07 5:03
 Useful, but Johan Fourie13-Nov-07 12:42 Johan Fourie 13-Nov-07 12:42
 Re: This is great [modified] Morten Holk Maate6-Nov-07 22:41 Morten Holk Maate 6-Nov-07 22:41
 Excellent merlin9815-Nov-07 7:33 merlin981 5-Nov-07 7:33
 Matching identical constructs (double-quotes) spambuster29-Oct-07 12:37 spambuster 29-Oct-07 12:37
 Re: Matching identical constructs (double-quotes) [modified] ToAoM29-Oct-07 13:39 ToAoM 29-Oct-07 13:39
 Re: Matching identical constructs (double-quotes) Morten Holk Maate30-Oct-07 3:10 Morten Holk Maate 30-Oct-07 3:10
 Some help. Paw Jershauge29-Oct-07 3:33 Paw Jershauge 29-Oct-07 3:33
 Re: Some help. Morten Holk Maate29-Oct-07 8:06 Morten Holk Maate 29-Oct-07 8:06
 Re: Some help. Paw Jershauge29-Oct-07 8:15 Paw Jershauge 29-Oct-07 8:15
 Re: Some help. Morten Holk Maate29-Oct-07 9:24 Morten Holk Maate 29-Oct-07 9:24
 Last Visit: 20-Sep-20 8:47     Last Update: 20-Sep-20 8:47 Refresh 1