Create Regex Objects using a Kind of "meta-variables" - Quicker and Easier

10 Jan 2007, CPOL
This article describes a class, VarRegex, that lets you reuse parts of regular expressions.

Introduction

Regexps (Perl-compatible regular expressions) are great, no doubt (refer to this wonderful article for a tutorial). But there is one little problem: every regular expression's pattern has to be written as a single string.

For example, suppose we want to specify a pattern for a phone number with the following rules:

  • Digits come in groups of 1 or more
  • Spaces and minus signs are used as separators
  • At least one group of digits must be present

How would the appropriate pattern look? Something like this:

// the @ sign is used in C# to prevent parsing \ as escape sequence
@"(\d+[\s\-])*\d+"

This means that we have groups of digits (\d+), each followed by a separator, either a minus sign or a whitespace character ([\s\-]). Such groups can occur any number of times (maybe 0), but at least one group of digits must be present (the final \d+). Well, not very difficult, but not very nice at the same time.
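To see the pattern at work, here is a minimal sketch that checks a few inputs against it. The class name PhonePatternDemo and the ^ and $ anchors are additions for this sketch (the anchors force the whole input to be a phone number); they are not part of the pattern above.

```csharp
using System.Text.RegularExpressions;

public static class PhonePatternDemo
{
    // The pattern from the text, anchored so that the whole input must match.
    public const string Pattern = @"^(\d+[\s\-])*\d+$";

    // Returns true when the input consists of digit groups
    // separated by single whitespace or minus characters.
    public static bool IsPhoneNumber(string input)
    {
        return Regex.IsMatch(input, Pattern);
    }
}
```

With these assumptions, "123 568-99" and "42" are accepted, while "123-" (trailing separator) and "12--34" (doubled separator) are rejected.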

Assume that, at some point, the customer says the number may include capital letters (like 1-800-GO-TO-THE-HELL-NOW). We have to change our digit group specification twice, once for each \d+.
And what if we need a regex for, say, a real number in exponential format? Something like this...

@"[1-9][\d]*[.,][\d]*[1-9][Ee][+\-](0|[1-9][\d]*)"

... and that covers only one (full) form of the number, like 123.456E+120. But the integer or the fractional part can be omitted, and then our regex becomes really complex:

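// line breaks are for readability only: join the parts into one line
// or use RegexOptions.IgnorePatternWhitespace
@"([1-9][\d]*[.,][\d]*[1-9][Ee](0|([+\-]?[1-9][\d]*)))
|
([.,][\d]*[1-9][Ee](0|([+\-]?[1-9][\d]*)))
|
([1-9][\d]*[.,][Ee](0|([+\-]?[1-9][\d]*)))"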

Brrr, really?

A Dream

For a long time, I had a dream (:-). A dream to write something like this:

SIGNIFICANT_DIGIT = @"[1-9]";
DIGIT = @"[0-9]";
// the ` quote is one of the few special characters
// that has no meaning of its own in regex syntax
// (spaces between the parts are for readability only)
INT_PART = @"`SIGNIFICANT_DIGIT` `DIGIT`*";
FLOAT_PART = @"`DIGIT`* `SIGNIFICANT_DIGIT`";
EXP_PART = @"[Ee](0|[+\-]?`INT_PART`)";
FULL_EXP = @"`INT_PART` [.,] `FLOAT_PART` `EXP_PART`";
NO_INT_EXP = @"[.,] `FLOAT_PART` `EXP_PART`";
NO_FLOAT_EXP = @"`INT_PART` [.,] `EXP_PART`";

// and finally
PATTERN = @"`FULL_EXP` | `NO_INT_EXP` | `NO_FLOAT_EXP`";

Well, many more lines of code, but:

  1. Each group of symbols is defined once and then reused, so no group is duplicated in different parts of the pattern.
  2. Each line is much shorter and contains named literals, which makes the expression easier to understand.

This article describes a class that brings a similar syntax to C# programs. It handles such expressions and returns a Regex object created from the expanded pattern.
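To make the target of the dream definitions concrete, the sketch below spells out by hand what they should expand to. Plain constant concatenation stands in for the dream syntax, the readability spaces are dropped, and the alternatives are wrapped in a non-capturing group and anchored with ^ and $. The class name ExpNumberDemo, the anchors and the (?:...) groups are assumptions of this sketch, not part of VarRegex.

```csharp
using System.Text.RegularExpressions;

public static class ExpNumberDemo
{
    // Hand-written equivalents of the "dream" definitions.
    const string SignificantDigit = "[1-9]";
    const string Digit = "[0-9]";
    const string IntPart = SignificantDigit + Digit + "*";
    const string FloatPart = Digit + "*" + SignificantDigit;
    const string ExpPart = @"[Ee](?:0|[+\-]?" + IntPart + ")";
    const string FullExp = IntPart + "[.,]" + FloatPart + ExpPart;
    const string NoIntExp = "[.,]" + FloatPart + ExpPart;
    const string NoFloatExp = IntPart + "[.,]" + ExpPart;

    // The final pattern: any of the three forms, matching the whole input.
    public const string Pattern =
        "^(?:" + FullExp + "|" + NoIntExp + "|" + NoFloatExp + ")$";

    public static bool IsExpNumber(string input)
    {
        return Regex.IsMatch(input, Pattern);
    }
}
```

Under these definitions "123.456E+120", ".5E0" and "12.E-3" are accepted, while "123.450E1" is rejected because the fractional part must end with a significant digit.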

Idea

OK, the idea is as simple as possible. We create a class that allows adding "variables". Each variable can be a plain regular expression or a regex-like expression with references to previously added variables. The pattern is then set in the same form. After that, we receive a ready Regex object and use it as we like.

We use the ` quote to mark variables. If we want to use the quote itself (maybe someone still needs it :), we can write "\`".

Implementation Details

The class is called VarRegex. It has a nested enumerable class, VariablesCollection, built around a Dictionary<String, String>. This class allows adding and modifying variables through an indexer property, retrieving their Count, clearing the variable list with Clear, and enumerating the values. The main method of VariablesCollection is called Expand. It receives a string to be "expanded", looks for occurrences of variable names and replaces each variable reference with the variable's expanded value.

The method is implemented in the following way:

C#
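public String Expand(String pattern)
{
    if (String.IsNullOrEmpty(pattern))
        return "";

    // Hide escaped quotes (\`) behind a placeholder character
    // so they are not mistaken for variable delimiters.
    string p = pattern.Replace("\\`", "" + (char)1);

    // Find every `name` reference in the pattern.
    Regex r = new Regex("`([^`]+)`");
    MatchCollection ms = r.Matches(p);

    foreach (Match match in ms)
    {
        string t = match.Groups[1].Value;
        // Replace the reference with the recursively expanded value.
        // The dictionary indexer throws KeyNotFoundException
        // if the variable t was never added.
        p = p.Replace("`" + t + "`", Expand(variables[t]));
    }

    // Restore the escaped quotes as plain ` characters.
    p = p.Replace("" + (char)1, "`");

    return p;
}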

First, we hide the escaped ("fake") quotes: every \` sequence is replaced with a placeholder character. Then we look for all quoted variable names and replace each reference with the variable's expanded value. Finally, we put the escaped quotes back as plain ` characters (without the slashes). Well, rather easy. Each time a variable or the pattern changes, the Regex object inside our VarRegex object is recreated. The class VariablesCollection also uses a nested enumerable class, ExpandedVariablesCollection, which lets you enumerate the expanded variable values or retrieve them by name.
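The expansion step can also be tried outside of VarRegex. The following is a minimal standalone sketch, not the code of the class itself: the name VarExpander, the single-pass Regex.Replace with a callback, and the guard against self-references are all assumptions of this sketch. The guard keeps a set of the variables currently being expanded and throws when one of them is referenced again, instead of recursing forever.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class VarExpander
{
    // Matches either an escaped quote (\`) or a `name` reference.
    static readonly Regex Token = new Regex(@"\\`|`([^`]+)`");

    public static string Expand(string pattern, IDictionary<string, string> variables)
    {
        return Expand(pattern, variables, new HashSet<string>());
    }

    static string Expand(string pattern, IDictionary<string, string> variables,
                         HashSet<string> inProgress)
    {
        return Token.Replace(pattern, delegate(Match m)
        {
            // Group 1 is empty for the escaped quote: emit a plain quote.
            if (!m.Groups[1].Success)
                return "`";

            string name = m.Groups[1].Value;
            string value;
            if (!variables.TryGetValue(name, out value))
                throw new KeyNotFoundException("Unknown variable: " + name);

            // Add returns false if the name is already being expanded.
            if (!inProgress.Add(name))
                throw new InvalidOperationException("Circular reference: " + name);

            string expanded = Expand(value, variables, inProgress);
            inProgress.Remove(name);
            return expanded;
        });
    }
}
```

Like the method above, this is pure text substitution: the expanded value is inserted as is, without adding any grouping around it.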

Using

Now the code that generates the regex for the phone number from the introduction looks like this:

C#
VarRegex vr = new VarRegex();
vr.Variables["int"] = @"\d+";
vr.Variables["sep"] = @"[\s\-]";
// Expansion is plain text substitution, so the group is needed
// for the * in the pattern to apply to the whole `int``sep` pair.
vr.Variables["gr"] = @"(`int``sep`)";
vr.Pattern = @"`gr`*`int`";
vr.Options = RegexOptions.IgnoreCase;

string str = @"123 568-99";
Match m = vr.Regex.Match(str);
Console.WriteLine("Result for string {1}: {0}\n", m.Success, str);
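One thing to keep in mind: assuming expansion is the plain text substitution performed by Expand, a quantifier written after a variable reference binds only to the last element of the substituted text, unless the variable's value is wrapped in a group. The sketch below compares the two expanded forms of "`gr`*`int`" written out by hand. The class name GroupingDemo and the ^ and $ anchors are additions for this sketch.

```csharp
using System.Text.RegularExpressions;

public static class GroupingDemo
{
    // "`gr`*`int`" expanded when gr is "`int``sep`" (no group):
    // the * applies only to the separator class.
    public const string Ungrouped = @"^\d+[\s\-]*\d+$";

    // The same pattern expanded when gr is "(?:`int``sep`)":
    // the * applies to the whole digits-plus-separator pair.
    public const string Grouped = @"^(?:\d+[\s\-])*\d+$";

    public static bool Matches(string pattern, string input)
    {
        return Regex.IsMatch(input, pattern);
    }
}
```

For the input "123 568-99", the grouped form matches the whole string, while the ungrouped form does not, because it allows only two runs of digits.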

Limitations

The main limitation is that variables must be added in the order in which they are referenced: a variable should be added to the VarRegex only after all the variables it references have been added.

History

  • 10th January, 2007: Initial post

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer, Belarus

Comments and Discussions

 
Nice work
Light Walker, 18-Jan-07 3:22

Re: Nice work
Eugene Mirotin (Guard), 18-Jan-07 3:37
Ah, yes. Of course it will die if you try to use self-references. But what is the reason to use them? ;) It is like writing a self-calling function without any stopping condition. The only thing I can do here is call this function with a stack of references and check whether we encounter a self-reference.

Re: Nice work
Light Walker, 18-Jan-07 3:50

Re: Nice work
Eugene Mirotin (Guard), 18-Jan-07 3:57

Looks like an interesting technique
Garth J Lancaster, 10-Jan-07 17:17

Re: Looks like an interesting technique
Eugene Mirotin (Guard), 10-Jan-07 22:16

Re: Looks like an interesting technique
Pete Goodsall, 15-Jan-07 11:47
