What is the best way to create Regular Expressions?

Alex Perepletov, 26 Sep 2008
A convenient way to document the intent of each part of a regex.

Regular Expressions are notorious for being confusing to read and understand. The longer the Regular Expression, the higher the chance of making a mistake in it, and the more difficult it is to debug or modify. Even if every Regular Expression were commented thoroughly, it would still suffer from being a single long line of characters.

Consider this Regular Expression that is found at http://regexlib.com/REDetails.aspx?regexp_id=731:

(?s)( class=\w+(?=([^<]*>)))|(<!--\[if.*?<!\[endif\]-->)|
  (<!\[if !\w+\]>)|(<!\[endif\]>)|(<o:p>[^<]*</o:p>)|
  (<span[^>]*>)|(</span>)|
  (font-family:[^>]*[;'])|(font-size:[^>]*[;'])(?-s)

There's nothing wrong with the expression itself. Unfortunately, no matter how thoroughly we document it, we cannot easily associate a comment, visually, with the part of the Regular Expression string that it describes.

The real problem is that a single long Regular Expression line does not allow a developer to show the intent of each significant part of it. Each part of a Regular Expression must scream its purpose. If a Regular Expression is several lines long and does not work properly, the developer will have a hard time locating the part that is responsible for the failure.

The solution is really simple. I have not seen a similar technique used anywhere, so this feels like a good example to share. Instead of entering a Regular Expression as a single long cryptic string, the string is built at run time by concatenating very short cryptic strings. Each short piece of the Regular Expression is commented separately.

For example, the following class creates a regex to validate a Canadian postal code:

using System.Text;

public class CanadianPostalCodeRegex
{
    /// <summary>
    /// Canadian postal code regular expression pattern.
    /// </summary>
    private readonly string _strPattern;

    /// <summary>
    /// Singleton access. The single instance, and with it the pattern, is created only once.
    /// </summary>
    private static readonly CanadianPostalCodeRegex Instance = new CanadianPostalCodeRegex();

    private CanadianPostalCodeRegex()
    {
        StringBuilder patternBuilder = new StringBuilder();

        // Pattern description:
        // Start of string.
        patternBuilder.Append(@"^");
        // Start the FSA (Forward Sortation Area) group
        patternBuilder.Append(@"(?<FSA>");
        // FSA group consists of ANA, where A is a letter and N is a digit
        patternBuilder.Append(@"\p{L}\d\p{L}");
        // End the FSA group
        patternBuilder.Append(@")");
        // An optional single white space
        patternBuilder.Append(@"\s?");
        // Start the LDU (Local Delivery Unit) group
        patternBuilder.Append(@"(?<LDU>");
        // LDU group consists of NAN, where A is a letter and N is a digit
        patternBuilder.Append(@"\d\p{L}\d");
        // End the LDU group
        patternBuilder.Append(@")");
        // End of string.
        patternBuilder.Append(@"$");

        _strPattern = patternBuilder.ToString();
    }

    /// <summary>
    /// Gets the Canadian postal code regex pattern.
    /// </summary>
    public static string Pattern
    {
        get { return Instance._strPattern; }
    }
}

The Regular Expression is created piece by piece. Each smallest meaningful unit is thoroughly commented. The intention of each part is crystal clear, which is a huge help when one needs to fix or modify the regex. At any moment, we only deal with a fairly small regex string instead of an unwieldy cryptic monster.
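As a rough sketch of how the resulting pattern might be consumed, the named groups can be read back from a match. The names `PostalCodeDemo` and `TryParse` are hypothetical and not part of the class above. So that the sketch compiles on its own, the pattern is inlined as the string that `CanadianPostalCodeRegex.Pattern` evaluates to; real code would reference the property instead.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PostalCodeDemo
{
    // The string that CanadianPostalCodeRegex.Pattern evaluates to, inlined
    // here only so that this sketch is self-contained.
    private const string Pattern = @"^(?<FSA>\p{L}\d\p{L})\s?(?<LDU>\d\p{L}\d)$";

    // Hypothetical helper: returns true and fills in the two named groups
    // when the input matches; otherwise returns false and empty strings.
    public static bool TryParse(string input, out string fsa, out string ldu)
    {
        Match match = Regex.Match(input, Pattern);
        fsa = match.Success ? match.Groups["FSA"].Value : string.Empty;
        ldu = match.Success ? match.Groups["LDU"].Value : string.Empty;
        return match.Success;
    }

    public static void Main()
    {
        string fsa, ldu;
        if (TryParse("K1A 0B1", out fsa, out ldu))
        {
            // Prints: FSA = K1A, LDU = 0B1
            Console.WriteLine("FSA = " + fsa + ", LDU = " + ldu);
        }
    }
}
```

Because the groups are named in the pattern, the calling code refers to `FSA` and `LDU` rather than to positional indexes, which keeps it readable if the pattern is later rearranged.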

This technique also promotes the syntactic correctness of the Regular Expression. For example, a group construct can be entered first, making sure the parentheses match.

// Start the LDU group
patternBuilder.Append(@"(?<LDU>");
// End the LDU group
patternBuilder.Append(@")");

Next, the group's pattern is entered.

// Start the LDU group
patternBuilder.Append(@"(?<LDU>");
// LDU group consists of NAN, where A is a letter and N is a digit
patternBuilder.Append(@"\d\p{L}\d");
// End the LDU group
patternBuilder.Append(@")");
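One way to check the skeleton while building it, sketched here as an assumption rather than as part of the original technique, is to hand the partial pattern to the `Regex` constructor, which throws an `ArgumentException` when the pattern cannot be parsed. The names `SkeletonCheck` and `IsValidPattern` are hypothetical.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class SkeletonCheck
{
    // Hypothetical helper: true if the .NET regex parser accepts the pattern.
    public static bool IsValidPattern(string pattern)
    {
        try
        {
            new Regex(pattern);
            return true;
        }
        catch (ArgumentException)
        {
            // Thrown for malformed patterns, e.g. an unclosed group.
            return false;
        }
    }

    public static void Main()
    {
        StringBuilder patternBuilder = new StringBuilder();

        // Start the LDU group
        patternBuilder.Append(@"(?<LDU>");
        // The group is still open here, so the pattern is rejected.
        Console.WriteLine(IsValidPattern(patternBuilder.ToString()));

        // End the LDU group
        patternBuilder.Append(@")");
        // The parentheses now match, so the (empty) group is accepted.
        Console.WriteLine(IsValidPattern(patternBuilder.ToString()));
    }
}
```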

Because the class is a Singleton, the expression is built only once, so there is virtually no performance penalty. Readability and maintainability improve significantly.
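The same approach can be sketched for the long expression quoted at the start of the article. This is an illustration under two assumptions: the line breaks in that listing are only wrapping, so the pieces are joined without whitespace, and the comments are one reading of each alternative, not the original author's documentation. For brevity, the hypothetical `WordMarkupRegex` class uses a static readonly field instead of the Singleton shown above.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class WordMarkupRegex
{
    // Built once, when the class is first used.
    public static readonly string Pattern = Build();

    private static string Build()
    {
        StringBuilder b = new StringBuilder();

        // Turn on single-line mode: '.' also matches newlines
        b.Append(@"(?s)");
        // A class attribute, provided a '>' follows before the next '<'
        b.Append(@"( class=\w+(?=([^<]*>)))");
        b.Append(@"|");
        // A conditional comment block, from <!--[if to <![endif]-->
        b.Append(@"(<!--\[if.*?<!\[endif\]-->)");
        b.Append(@"|");
        // An opening conditional such as <![if !supportLists]>
        b.Append(@"(<!\[if !\w+\]>)");
        b.Append(@"|");
        // The closing conditional <![endif]>
        b.Append(@"(<!\[endif\]>)");
        b.Append(@"|");
        // An <o:p> element together with its text content
        b.Append(@"(<o:p>[^<]*</o:p>)");
        b.Append(@"|");
        // An opening span tag with any attributes
        b.Append(@"(<span[^>]*>)");
        b.Append(@"|");
        // A closing span tag
        b.Append(@"(</span>)");
        b.Append(@"|");
        // A font-family declaration, up to a ';' or a quote
        b.Append(@"(font-family:[^>]*[;'])");
        b.Append(@"|");
        // A font-size declaration, up to a ';' or a quote
        b.Append(@"(font-size:[^>]*[;'])");
        // Turn single-line mode off again
        b.Append(@"(?-s)");

        return b.ToString();
    }

    public static void Main()
    {
        string html = "<p class=MsoNormal><span lang=EN>Hello</span><o:p></o:p></p>";
        // Prints: <p>Hello</p>
        Console.WriteLine(Regex.Replace(html, Pattern, string.Empty));
    }
}
```

With one alternative per `Append` call, a failing case can be narrowed down by commenting out alternatives one at a time.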

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Alex Perepletov
Software Developer (Senior)
Canada

Article Copyright 2008 by Alex Perepletov