
Create Regex Objects using a Kind of "meta-variables" - Quicker and Easier

Eugene Mirotin (Guard), 10 Jan 2007, CPOL
This article describes a class VarRegex allowing you to reuse parts of regular expressions

Introduction

Regexps (Perl-compatible regular expressions) are great, no doubt (refer to this wonderful article for a tutorial). The one small problem is that every regular expression pattern has to be written as a single string.

For example, suppose we want to specify a pattern for phone number with the following rules:

  • Digits are in groups of 1 or more
  • Spaces and minus sign are used as separators
  • At least one group of digits should be present

How would the appropriate pattern look? Something like this:

// the @ sign is used in C# to prevent parsing \ as escape sequence
@"(\d+[\s\-])*\d+"

This means that we have groups of digits (\d+), each followed by a separator, either a minus sign or a whitespace character ([\s\-]). Such groups can occur any number of times (possibly zero), but at least one group of digits must be present (the final \d+). Not very difficult, but not very nice either.
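The behaviour described above can be checked with a few lines of C#. This is only a sketch: the class name PhonePatternDemo and the ^...$ anchors are additions for the example (the anchors force the whole string to match, which the pattern in the article does not require).

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for trying out the phone number pattern.
static class PhonePatternDemo
{
    // Pattern from the article, wrapped in ^...$ (an assumption of this
    // sketch) so that the whole input has to match.
    public const string Pattern = @"^(\d+[\s\-])*\d+$";

    public static bool IsPhone(string s)
    {
        return Regex.IsMatch(s, Pattern);
    }
}
```

With this helper, IsPhone("123 568-99") and IsPhone("42") succeed, while IsPhone("123-") fails because the final group of digits is missing.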

Now assume that at some point the customer says the number may include capital letters (like 1-800-GO-TO-THE-HELL-NOW). We have to change our digit group specification in two places.
And what if we need a regex for, say, a real number in exponential format? Something like this...

@"[1-9][\d]*[.,][\d]*[1-9][Ee][+\-](0|[1-9][\d]*)"

... covers only one (full) form of the number, like 123.456E+120. But the integer or the fractional part can be omitted, and then our regex becomes really complex:

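// line breaks are added for readability only; each alternative is now closed by its own parenthesis
@"([1-9][\d]*[.,][\d]*[1-9][Ee](0|([+\-]?[1-9][\d]*)))
|
([.,][\d]*[1-9][Ee](0|([+\-]?[1-9][\d]*)))
|
([1-9][\d]*[.,][Ee](0|([+\-]?[1-9][\d]*)))"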

Brrr, really?

A Dream

For a long time, I had a dream :). A dream to write something like this:

    SIGNIFICANT_DIGIT = @"[1-9]";
    DIGIT = @"[0-9]";
    // the ` quote is one of the few special characters
    // that have no meaning of their own in regex syntax
    INT_PART = @"`SIGNIFICANT_DIGIT` `DIGIT`*";
    FLOAT_PART = @"`DIGIT`* `SIGNIFICANT_DIGIT`";
    EXP_PART = @"[Ee](0|[+\-]?`INT_PART`)";
    FULL_EXP = @"`INT_PART` [.,] `FLOAT_PART` `EXP_PART`";
    NO_INT_EXP = @"[.,] `FLOAT_PART` `EXP_PART`";
    NO_FLOAT_EXP = @"`INT_PART` [.,] `EXP_PART`";

    // and finally
    PATTERN = @"`FULL_EXP` | `NO_INT_EXP` | `NO_FLOAT_EXP`";

Well, many more lines of code, but:

  1. Each group of symbols is defined once and then reused, so nothing is duplicated across different parts of the pattern.
  2. Each line is much shorter and contains named literals, which makes the expression easier to understand.

This article describes a class that brings a similar syntax to C# programs. It processes such expressions and returns a Regex object created from the expanded pattern.
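To see what the expanded pattern amounts to, the same pieces can be glued together with ordinary C# constants. This is a comparison sketch, not the VarRegex class: the class name ExpNumberDemo is made up, the spaces from the listing above are dropped, and the ^(...)$ wrapper is an addition so that the whole input must match.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo: the "dream" decomposition written as plain constants.
static class ExpNumberDemo
{
    const string SignificantDigit = "[1-9]";
    const string Digit = "[0-9]";
    const string IntPart = SignificantDigit + Digit + "*";
    const string FloatPart = Digit + "*" + SignificantDigit;
    const string ExpPart = @"[Ee](0|[+\-]?" + IntPart + ")";
    const string FullExp = IntPart + "[.,]" + FloatPart + ExpPart;
    const string NoIntExp = "[.,]" + FloatPart + ExpPart;
    const string NoFloatExp = IntPart + "[.,]" + ExpPart;

    // The anchors and the outer group are assumptions of this sketch.
    public const string Pattern =
        "^(" + FullExp + "|" + NoIntExp + "|" + NoFloatExp + ")$";

    public static bool IsExpNumber(string s)
    {
        return Regex.IsMatch(s, Pattern);
    }
}
```

Each piece is still defined once, but the references are scattered between string literals and + signs, which is exactly the noise the quoted variable syntax is meant to remove.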

Idea

OK, the idea is as simple as possible. We create a class that allows adding "variables". Each variable is either a plain regular expression or a regex-like expression with references to previously added variables. The pattern is then set in the same form. After that, we get a ready Regex object and use it as we like.

We use the ` quote to mark variables. If we need the quote character itself (maybe someone still needs it :) ), we can write "\`".

Implementation Details

The class is called VarRegex. It has a nested enumerable class, VariablesCollection, built around a Dictionary<String, String>. This class supports adding and modifying variables through an indexer property, retrieving their Count, clearing the variable list with Clear, and enumerating the values. The main method of VariablesCollection is Expand. It receives a string to be "expanded", looks for occurrences of variable names, and replaces each variable reference with its expanded value.

The method is implemented in the following way:

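public String Expand(String pattern)
{
    if (String.IsNullOrEmpty(pattern))
        return "";

    // hide escaped quotes (\`) behind a placeholder character
    string p = pattern.Replace("\\`", "" + (char)1);

    // find every `name` reference
    Regex r = new Regex("`([^`]+)`");
    MatchCollection ms = r.Matches(p);

    foreach (Match match in ms)
    {
        // replace the reference with the recursively expanded variable value
        string t = match.Groups[1].Value;
        p = p.Replace("`" + t + "`", Expand(variables[t]));
    }

    // restore the escaped quotes as plain ` characters
    p = p.Replace("" + (char)1, "`");

    return p;
}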

First, we replace escaped quotes (a backslash followed by `) with a placeholder character so that they are not mistaken for variable delimiters. Then we look for all quoted variable names and replace each reference with the expanded value of that variable. Finally, we put the escaped quotes back as plain quotes (without the backslash). Rather easy. Each time the variables or the pattern change, the Regex object inside our VarRegex object is recreated. The class VariablesCollection also has a nested enumerable class, ExpandedVariablesCollection, which allows enumerating the expanded values of the variables or retrieving them by name.
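The three steps can be tried in isolation with a small self-contained class. This is a sketch under assumptions, not the VarRegex source: the name MiniVariables is hypothetical, and it uses Regex.Replace with a match evaluator instead of the Matches loop shown above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical stand-in for VarRegex.VariablesCollection.
sealed class MiniVariables
{
    // Placeholder for escaped quotes, same idea as (char)1 in the article.
    private const string Placeholder = "\u0001";
    private static readonly Regex Reference = new Regex("`([^`]+)`");
    private readonly Dictionary<string, string> variables =
        new Dictionary<string, string>();

    public string this[string name]
    {
        get { return variables[name]; }
        set { variables[name] = value; }
    }

    public string Expand(string pattern)
    {
        if (string.IsNullOrEmpty(pattern))
            return "";

        // Step 1: hide escaped quotes (\`) behind the placeholder.
        string p = pattern.Replace("\\`", Placeholder);

        // Step 2: replace every `name` with the recursively expanded value.
        // An unknown name throws KeyNotFoundException.
        p = Reference.Replace(p, m => Expand(variables[m.Groups[1].Value]));

        // Step 3: put the escaped quotes back as plain quotes.
        return p.Replace(Placeholder, "`");
    }
}
```

For example, with "int" set to @"\d+", "sep" set to @"[\s\-]" and "gr" set to "(`int``sep`)", expanding "`gr`*`int`" gives (\d+[\s\-])*\d+, and expanding @"a\`b" gives a`b. This sketch does not guard against variables that reference each other in a cycle.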

Using

Now the code that generates the regex for the phone number from the introduction looks like this:

VarRegex vr = new VarRegex();
vr.Variables["int"] = @"\d+";
vr.Variables["sep"] = @"[\s\-]";
// the parentheses make the * in the pattern apply to the whole group, not only to `sep`
vr.Variables["gr"] = @"(`int``sep`)";
vr.Pattern = @"`gr`*`int`";
vr.Options = RegexOptions.IgnoreCase;

string str = @"123 568-99";
Match m = vr.Regex.Match(str);
Console.WriteLine("Result for string {1}: {0}\n", m.Success, str);
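Because expansion is plain text substitution, a quantifier written right after a variable reference binds only to the last element of the substituted text unless the value of the variable is wrapped in parentheses. The sketch below shows the difference with two hand-expanded patterns and the plain Regex class; the class name GroupingDemo and the ^...$ anchors are additions for the example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo of quantifier binding after textual expansion.
static class GroupingDemo
{
    // "digits + separator" followed by * without a group:
    // the * repeats only the separator class.
    public const string Ungrouped = @"^\d+[\s\-]*\d+$";

    // The same text wrapped in parentheses:
    // the * repeats the whole "digits + separator" unit.
    public const string Grouped = @"^(\d+[\s\-])*\d+$";

    public static bool Matches(string pattern, string input)
    {
        return Regex.IsMatch(input, pattern);
    }
}
```

Matches(GroupingDemo.Grouped, "123 568-99") succeeds, while Matches(GroupingDemo.Ungrouped, "123 568-99") fails, because the ungrouped form allows only one run of separators. The ungrouped form also rejects a single digit such as "7", since it needs two digit runs.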

Limitations

The main limitation is that variables must be added in the order in which they are referenced: a variable can be added to the VarRegex only after all the variables it references have been added.

History

  • 10th January, 2007: Initial post

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

Eugene Mirotin (Guard)
Software Developer
Belarus
No Biography provided

Comments and Discussions

 
Looks like an interesting technique - Garth J Lancaster, 10-Jan-07 18:17
I have done something similar, a 'dynamic regular expression' - for example

The user wishes to filter on (Australian) postcodes. We allow the user to pass a 'State' flag on the command line, e.g. -state=TV .. here the postcodes will be checked to see if they are in (T)asmania or (V)ictoria. In the resource table for the application, there's a definition string that looks like A=(^20:digit:{2})@T=(^7:digit:{3})@V=(^3:digit:{3}) (and so on, using '@' as the separator, knowing it won't occur as part of the regex itself)

When the app starts, it builds a map such that

A -> "(^20:digit:{2})"
T -> "(^7:digit:{3})"
V -> "(^3:digit:{3})"

After obtaining the parameter from the '-state' tag, it then builds a regular expression by starting with a '(', adding each 'State/Postcode Test' from the map, with a '|' in between them, then finally ')', so we would end up with something like

"((^7:digit:{3})|(^3:digit:{3}))"

and checking to see if that's a valid Regex... if it is, the compiled form is used to filter the postcodes

It's 'clunky' in some ways and that's a simplification of the actual sub regex strings, because they are a bit more complex, but it beats coding either one huge regex or trying to code every possible state combo as a separate regex ...

I'll definitely look at your technique more
 
'g'

Re: Looks like an interesting technique - Eugene Mirotin (Guard), 10-Jan-07 23:16
Re: Looks like an interesting technique - Clevedon_Peanut, 15-Jan-07 12:47

Article Copyright 2007 by Eugene Mirotin (Guard)