Click here to Skip to main content
15,881,709 members
Articles / Programming Languages / Visual Basic

Visual Studio Add-in for testing regular expressions

Rate me:
Please Sign up or sign in to vote.
4.94/5 (17 votes)
1 Dec 2010CPOL4 min read 39.6K   474   36   8
An add-in for debugging and creating regular expression directly in Visual Studio

Introduction

Writing regular expressions can be frustrating and hard to get right. Having a tool for testing them in your development environment would be nice but I haven't found a free Visual Studio add-in for testing regular expressions, which actually works the way I want it to, so I have made one myself.

It has the following features:

  • Allows you to edit your existing .NET regular expressions directly from Visual Studio
  • Handles character escaping
  • Matches as you type and highlights matches and groups
  • Displays any errors in your regular expression
  • Detects when evaluating a regular expression may be going exponential in execution time and allows you to abort evaluation
  • Allows you to save your regular expressions for later use
  • Works for C# and VB

Using the Code

The solution has two projects: The add-in itself and an installer project. The easiest way to install it is to build the solution and run the MSI.

Once installed, you can select either a string containing a regular expression or the full set of parameters for constructing a regular expression (including parentheses) or for calling Regex.Replace or Regex.Match and start the regular expression tester from the right-click menu or by pressing Ctrl r, Ctrl x.

How It Works

The project contains the following main elements:

  • Parsing regular expressions, replace patterns and options from source code
  • Building a result tree from regular expression matches and showing the results in an easily readable format
  • Handling out-of-control regular expression evaluations
  • Saving regular expressions to disk

Parsing Code

Reading existing regular expressions and options from code is done using regular expressions. The regular expressions for this are encapsulated so support for new languages can be added by simply implementing new versions of the two interfaces IRegexFindResults and IRegexFormatProvider. Currently C# and VB are supported. For example, the following regular expressions are used for C#:

match string verbatim or otherwise: (?(@)@"(?(")"|.*?[^"]"(?!"))|"(?(")"|.*?[^\\]"))
match ORed list of RegexOptions: 
	(?<options>(\s*\|?\s*(?<{namespace}>System\s*\.\s*Text\s*\.\
	s*RegularExpressions\s*\.\s*|Text\s*.\s*RegularExpressions\s*\.\s*
	|RegularExpressions\s*\.\s*)?RegexOptions\s*\.\s*(?<option>[a-zA-Z]+))+)

Building Result Tree

Regex.Matches return results as a collection of matches which in turn each contain a number of groups and captures. The way the add-in handles this information is by transforming it into a tree which can then easily be traversed using the visitor pattern to display it in readable form. The data structure contains 4 types of nodes: RootNode, MatchNode, GroupNode and LiteralNode. Using this data structure, the regular expression z(?<groupa>a(?<groupb>b(?<groupc>c)*b)a) matched to "xyzabcbazabba" results in the following tree:

RootNode
|-LiteralNode(xy)
|-MatchNode(zabcba)
  |-LiteralNode(z)
  |-GroupNode(groupa, abcba)
    |-LiteralNode(a)
    |-GroupNode(groupb, bcb)
      |-LiteralNode(b)
      |-GroupNode(groupc, c)
        |-LiteralNode(c)
      |-LiteralNode(b)
    |-LiteralNode(a)
|-MatchNode(zabba)
  |-LiteralNode(z)
  |-GroupNode(groupa, abba)
    |-LiteralNode(a)
    |-GroupNode(groupb)
      |-LiteralNode(bb)
    |-LiteralNode(a)

The corresponding C# data structure is seen below. Children of nodes are ordered in a sorted dictionary by covered interval. This is convenient when adding nodes while building the tree.

C#
public interface IRegexResultTreeNodeVisitor
{
    void Visit(RootNode node, bool beforeChildren);
    void Visit(MatchNode node, bool beforeChildren);
    void Visit(GroupNode node, bool beforeChildren);
    void Visit(LiteralNode node);
}

public abstract class RegexResultTreeNode : IComparable
{
    private SortedDictionary<RegexResultTreeNode, 
	RegexResultTreeNode> children = new SortedDictionary
		<RegexResultTreeNode, RegexResultTreeNode>();

    public RegexResultTreeNode(int startIndex, int endIndex)
    {
        StartIndex = startIndex;
        EndIndex = endIndex;
    }

    public void AddChildNode(RegexResultTreeNode child)
    {
        children.Add(child, child);
    }

    public void RemoveChildNode(RegexResultTreeNode child)
    {
        children.Remove(child);
    }

    public IEnumerable<RegexResultTreeNode> ChildNodes { get { return children.Values; } }

    public abstract void Accept(IRegexResultTreeNodeVisitor visitor);

    public int StartIndex { get; private set; }
    public int EndIndex { get; private set; }

    public int CompareTo(object obj)
    {
        RegexResultTreeNode other = obj as RegexResultTreeNode;
        if (this.StartIndex < other.StartIndex) { return -1; }
        if (other.StartIndex < this.StartIndex) { return 1; }
        if (this.EndIndex < other.EndIndex) { return -1; }
        if (other.EndIndex < this.EndIndex) { return 1; }
        return 0;
    }

    public bool IsChildOf(RegexResultTreeNode other)
    {
        return (this.StartIndex >= other.StartIndex && this.EndIndex <= other.EndIndex);
    }
}

public class RootNode : RegexResultTreeNode
{
    public RootNode(int startIndex, int endIndex) : base(startIndex, endIndex) { }

    public override void Accept(IRegexResultTreeNodeVisitor visitor)
    {
        visitor.Visit(this, true);
        foreach (RegexResultTreeNode child in this.ChildNodes)
        {
            child.Accept(visitor);
        }
        visitor.Visit(this, false);
    }
}

public class MatchNode : RegexResultTreeNode
{
    public MatchNode(int startIndex, int endIndex) : base(startIndex, endIndex) { }

    public override void Accept(IRegexResultTreeNodeVisitor visitor)
    {
        visitor.Visit(this, true);
        foreach (RegexResultTreeNode child in this.ChildNodes)
        {
            child.Accept(visitor);
        }
        visitor.Visit(this, false);
    }
}

public class GroupNode : RegexResultTreeNode
{
    public GroupNode(string groupName, int startIndex, int endIndex) : 
					base(startIndex, endIndex)
    {
        this.GroupName = groupName;
    }

    public string GroupName { get; private set; }

    public override void Accept(IRegexResultTreeNodeVisitor visitor)
    {
        visitor.Visit(this, true);
        foreach (RegexResultTreeNode child in this.ChildNodes)
        {
            child.Accept(visitor);
        }
        visitor.Visit(this, false);
    }
}

public class LiteralNode : RegexResultTreeNode
{
    public LiteralNode(string literal, int startIndex, int endIndex) : 
						base(startIndex, endIndex)
    {
        this.Literal = literal;
    }

    public string Literal { get; private set; }

    public override void Accept(IRegexResultTreeNodeVisitor visitor)
    {
        visitor.Visit(this);
    }
}

The algorithm for building the tree just traverses all groups and inserts each one in the tree using the IsChildOf method to determine parent/children relationships.

C#
private void AddRegexResultTreeNode
	(RegexResultTreeNode parent, RegexResultTreeNode child)
{
    List<regexresulttreenode> toRemove = new List<regexresulttreenode>();
    foreach (RegexResultTreeNode otherNode in parent.ChildNodes)
    {
        if (child.IsChildOf(otherNode))
        {
            AddRegexResultTreeNode(otherNode, child);
            return;
        }
        else if (otherNode.IsChildOf(child))
        {
            toRemove.Add(otherNode);
            child.AddChildNode(otherNode);
        }
    }
    foreach (RegexResultTreeNode node in toRemove)
    {
        parent.RemoveChildNode(node);
    }
    parent.AddChildNode(child);
}

private void AddLiteralNode(List<literalnode> nodes, 
	int startIndex, int endIndex, string toMatch)
{
    if (startIndex < endIndex)
    {
        string literal = toMatch.Substring(startIndex, endIndex - startIndex);
        nodes.Add(new LiteralNode(literal, startIndex, endIndex));
    }
}

private void AddLiteralNodes(RegexResultTreeNode regexResultTreeNode, string toMatch)
{
    int currentIndex = regexResultTreeNode.StartIndex;
    List<literalnode> literalNodes = new List<literalnode>();
    foreach (RegexResultTreeNode child in regexResultTreeNode.ChildNodes)
    {
        AddLiteralNode(literalNodes, currentIndex, child.StartIndex, toMatch);
        AddLiteralNodes(child, toMatch);
        currentIndex = child.EndIndex;
    }
    AddLiteralNode(literalNodes, currentIndex, regexResultTreeNode.EndIndex, toMatch);
    foreach (LiteralNode literalNode in literalNodes)
    {
        regexResultTreeNode.AddChildNode(literalNode);
    }
}

private RegexResultTreeNode BuildRegexResultTree(Regex regularExpression, string toMatch)
{
    RootNode root = new RootNode(0, toMatch.Length);
    MatchCollection matches = regularExpression.Matches(toMatch);
    foreach (Match match in matches)
    {
        MatchNode matchNode = new MatchNode(match.Index, match.Index + match.Length);
        root.AddChildNode(matchNode);
        int groupIndex = 0;
        foreach (Group group in match.Groups)
        {
            if (groupIndex != 0 && group.Success)
            {
                string groupName = regularExpression.GroupNameFromNumber(groupIndex);
                foreach (Capture capture in group.Captures)
                {
                    AddRegexResultTreeNode(matchNode, new GroupNode
			(groupName, capture.Index, capture.Index + capture.Length));
                }
            }
            groupIndex++;
        }
    }
    AddLiteralNodes(root, toMatch);
    return root;
}

Regular Expression Editor

The regular expression editor is a RichTextBox which has been extended with highlighting of matching parentheses and decorated with two user controls: one for showing a red or green border around the control to indicate validity of the tested regular expression, and one for showing an error message if the regular expression doesn't compile. To match parentheses in a regular expression, you have to take into account escaped parentheses - a parenthesis is escaped if it has an uneven number of backslashes in front of it. The int GetMatchingParentheses(string value, int index) method is used to find the index of a matching parenthesis given the string to search and the current index.

C#
private Dictionary<char, char> matchingEndParentheses =
    new Dictionary<char, char>{
        {'(', ')'},
        {'[', ']'},
        {'{', '}'}
    };

private Dictionary<char, char> matchingStartParentheses =
    new Dictionary<char, char> {
        {')', '('},
        {']', '['},
        {'}', '{'}
    };

private IDictionary<int, int> GetMatchingParentheses(string value)
{
    IDictionary<int, int> result = new Dictionary<int, int>();
    Stack<KeyValuePair<char, int>> stack = new Stack<KeyValuePair<char, int>>();
    int escapeCount = 0;
    char[] chars = value.ToCharArray();
    for (int i = 0; i < chars.Length; i++)
    {
        if (escapeCount % 2 == 0)
        {
            if (matchingEndParentheses.ContainsKey(chars[i]))
            {
                stack.Push(new KeyValuePair<char, int>(chars[i], i));
            }
            else if (matchingStartParentheses.ContainsKey(chars[i]))
            {
                if (stack.Count > 0 && 
		matchingEndParentheses[stack.Peek().Key] == chars[i])
                {
                    result.Add(stack.Pop().Value, i);
                }
                else
                {
                    return result;
                }
            }
        }
        escapeCount = chars[i] == '\\' ? escapeCount + 1 : 0;
    }
    return result;
}

public int GetMatchingParentheses(string value, int index)
{
    char[] chars = value.ToCharArray();
    IDictionary<int, int> result = GetMatchingParentheses(value);
    foreach (KeyValuePair<int, int> pair in result)
    {
        if (pair.Value == index - 1)
        {
            return pair.Key;
        }
        if (pair.Key == index)
        {
            return pair.Value;
        }
    }
    return -1;
}

Handling Out-of-control Regular Expression Evaluations

In order for the .NET Regex library to allow capture of groups, it needs to evaluate regular expressions by simulating an NFA (Nondeterministic Finite Automaton). This means using backtracking and can potentially result in the match-process being exponential in the length of the string to match. As a simple example, consider the regular expression (a+|c(?!c))+b which matches any number of a's with single c's interspersed ended by a b. This will work fine on strings that don't match at all like xyzzzzzzz and on strings that match like aaacaaab. However, if you run it on aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa it will keep running forever. What goes wrong is that the nested +-operators mean every grouping of a's has to be tested before the evaluator can conclude that there is no match. This example can be fixed by removing the one plus to get (a|c(?!c))+b which matches exactly the same set of strings. The point is that it's easy to fall into the trap of writing a simple regular expression which seemingly does what you want until someday your program apparently gets stuck in an infinite loop.

The add-in addresses this by using two threads: one for evaluating the regular expressions and one for monitoring the working thread. If the worker thread takes more than 2 seconds on a single evaluation, the monitor thread will display a warning in the GUI, and the user can click a Reset button to kill the worker thread and reset the match string. The code for the worker thread and the monitor thread is as follows:

C#
private void UpdateFormThreadStart()
{
    int maxIndex = -1;
    while (!IsDisposed)
    {
        bool updated = false;
        lock (testFormUpdateIndexLock)
        {
            if (TestFormUpdateIndex > maxIndex)
            {
                maxIndex = TestFormUpdateIndex;
                updated = true;
            }
        }
        if (updated)
        {
            monitorWatch.Start();
            UpdateTestForm();
            monitorWatch.Reset();
        }
        else
        {
            Thread.Sleep(100);
        }
    }
}

private void MonitorUpdateThreadStart()
{
    while (!IsDisposed)
    {
        if (labelWarningEvaluationTime.Visible != monitorWatch.Elapsed.Seconds > 2)
        {
            CallGUIFromOtherThread(() => { labelWarningEvaluationTime.Visible = 
					!labelWarningEvaluationTime.Visible; });
        }
        Thread.Sleep(100);
    }
}

private void regexParameters_Changed(object sender, EventArgs e)
{
    lock (testFormUpdateIndexLock)
    {
        TestFormUpdateIndex++;
    }
}

private void CallGUIFromOtherThread(Action a)
{
    if (this.InvokeRequired)
    {
        this.Invoke(new InvokeDelegate(a));
    }
    else
    {
        a();
    }
}

Saving Regular Expressions

Regular expressions are saved in an XML file in user isolated storage. This allows each user to have his own set of saved regular expressions and requires few user privileges.

History

  • 2010-11-27: Initial version uploaded
  • 2010-11-30: Updated source code

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer (Senior)
Denmark Denmark
.NET developer. I wanted to be first an astronaut, then a jet pilot, but when I got a Commodore 64 for Christmas I never looked back. Also I would never have qualified for the first two things and everybody knows computer programmers get all the girls.

Comments and Discussions

 
GeneralMy vote of 5 Pin
Marcelo Ricardo de Oliveira3-Jan-11 4:46
Marcelo Ricardo de Oliveira3-Jan-11 4:46 
Generalthanks for sharing Pin
Pranay Rana19-Dec-10 18:34
professionalPranay Rana19-Dec-10 18:34 
GeneralExcellent article and right on target! 5 Pin
DrABELL16-Dec-10 18:48
DrABELL16-Dec-10 18:48 
GeneralMy vote of 5 Pin
Abhinav S11-Dec-10 4:45
Abhinav S11-Dec-10 4:45 
GeneralMy Vote of 5 Pin
rspercy653-Dec-10 11:06
rspercy653-Dec-10 11:06 
GeneralMy vote of 5 Pin
Kanasz Robert1-Dec-10 10:54
professionalKanasz Robert1-Dec-10 10:54 
GeneralMy vote of 5 Pin
Slacker0071-Dec-10 0:28
professionalSlacker0071-Dec-10 0:28 
GeneralRe: My vote of 5 Pin
Andreas Andersen1-Dec-10 6:25
Andreas Andersen1-Dec-10 6:25 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.