RegexToXml regular expression to XML converter

Andrew Tweddle

13 Oct 2006

RegexToXml is a command line utility which applies a regular expression to input text and returns the results as an XML document.

Download source - 30.2 Kb

Introduction
Why the utility was written
Some code samples to whet your appetite
Using the code
Using RegexToXml to parse SQL output files for errors
Using RegexToXml to analyse Delphi forms
Using RegexToXml to download board game ratings
Limitations of the utility
A better design?
- Extend the -d switch to specify extra options
- Customize the output using an XML template
Other points of interest
- Using NUnit as an alternative to a console test app
Acknowledgements
History

Introduction

RegexToXml is a command line utility which applies a regular expression to an input document and saves the results to an XML document.

Capture groups are mapped to XML elements and can be nested to create a deep XML hierarchy.

The utility was motivated by the need to extract errors from SQL output files as part of a daily build process. A sequence of unit tests is provided which demonstrates the incremental construction of a regular expression to extract these errors.

RegexToXml is the first of three complementary command line utilities which enable the reporting of SQL errors in our daily build process. By writing general purpose utilities, I have been able to reuse the utilities to solve other problems. These other uses are discussed in the article.

I hope RegexToXml will prove useful to you too!

Why the utility was written

Some months back a colleague created a daily build process using FinalBuilder. This builds the executable and tests the SQL deployment scripts on a recent copy of each customer's database.

Testing the deployment scripts was a great idea. The company I work for produces agricultural management software for large sugar cane estates, some of which are in remote parts of Africa. Online connections to these estates can be slow and unreliable, and time spent on site is very precious. So unexpected errors in deployment scripts can be costly and frustrating.

But there was an important practical obstacle to making the script testing process truly useful. Errors needed to be detected in the SQL output files and sent to only the developers who had submitted the faulty change scripts (so as not to bombard all developers with irrelevant information).

I decided to tackle this problem as a hobby project. My solution was to write 3 general purpose command line utilities, then use these to generate and e-mail a personalized error report to each affected developer.

The three utilities were:

RegexToXml: to parse the SQL output files for errors and warnings and output the results as a separate XML file for each customer database.
TransformXml: a wrapper around .NET's XslCompiledTransform class. XSL transforms were written to:
- Generate various batch files to piece together the entire process (e.g., batch files with a command per customer or developer in central Customers.xml and Developers.xml files.)
- Concatenate the XML files for the various customer databases.
- Re-order the XML nodes by developer, then by customer, then by change script.
- Generate an HTML file per developer with details of the errors in that developer's change scripts (grouped by customer).
SendSMTP: used to e-mail the HTML file (as the body of the e-mail) to each affected developer.

This article is about RegexToXml. I hope to write articles on the other two utilities in the near future.

Some code samples to whet your appetite

The following is an extract from the batch file to test SQL deployment scripts (Nchalo is the name of a sugar estate in Malawi):

..\Tools\RegexToXml -d Matches=Nchalo Match=ChangeScript -p- -m+ -w- -c- 
  -r "ChangeScriptErrorsRegex.txt" -i "..\Output\NchaloResults.txt" 
  -o "..\Output\NchaloChangeScriptErrors.xml"

The core functionality is contained in a separate DLL. It is possible to use this DLL from your own applications.

The following code snippet is from the first of the unit tests. It is a simple illustration of the type of output produced by the conversion library.

    [TestFixture]
    public class SimpleTests : TestsBase
    {
        [Test]
        public void TestSimpleMatchUsingDefaults()
        {
            string input = "1. Man\r\n2. Woman\r\nText to ignore\r\n3.Child";
            string pattern = @"(?<Number>\d+)\.\s*(?<Item>(?:(?!\r|\n).)*)";
            string expectedXml
                = @"<?xml version=""1.0"" encoding=""utf-16""?>
<Matches>
    <Match>
        <Number>1</Number>
        <Item>Man</Item>
    </Match>
    <Match>
        <Number>2</Number>
        <Item>Woman</Item>
    </Match>
    <Match>
        <Number>3</Number>
        <Item>Child</Item>
    </Match>
</Matches>";

            CompareToExpectedXml(input, pattern, expectedXml);
        }
        
        // ...
    }

The CompareToExpectedXml method is defined as follows:

public void CompareToExpectedXml(string input, string pattern,
    string expectedXml, RegexToXmlOptions options, string notes)
{
    WriteTestHeadingToConsole();

    if ((notes != null) && (notes != String.Empty))
    {
        Console.WriteLine("Notes:");
        Console.WriteLine("------");
        Console.WriteLine();
        Console.WriteLine(notes);
        Console.WriteLine();
    }

    Console.WriteLine("Input:");
    Console.WriteLine("------");
    Console.WriteLine();
    Console.WriteLine(input);
    Console.WriteLine();

    Console.WriteLine("Regex:");
    Console.WriteLine("------");
    Console.WriteLine();
    Console.WriteLine(pattern);
    Console.WriteLine();

    Console.WriteLine("Xml output:");
    Console.WriteLine("-----------");
    Console.WriteLine();

    XmlWriterSettings xwSettings = new XmlWriterSettings();
    xwSettings.Indent = true;
    xwSettings.IndentChars = "    ";

    string xmlOutput
        = RegexToXmlConverter.ConvertToXmlString(input, pattern,
            options, xwSettings);

    Console.WriteLine(xmlOutput);
    Console.WriteLine();

    Assert.IsTrue(String.Compare(xmlOutput, expectedXml,
        true /*ignoreCase*/) == 0);
}
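For readers without the download to hand, the flat case in the test above can be approximated with the base class library alone. The following is only a rough sketch of the idea (each match becomes a Match element and each named group becomes a child element). FlatRegexToXmlSketch and ToXml are hypothetical names, not part of the RegexToXml library, and the sketch ignores nesting, options and unnamed groups.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;
using System.Xml;

// Illustrative sketch only: FlatRegexToXmlSketch is a hypothetical name,
// not part of the RegexToXml download.
public static class FlatRegexToXmlSketch
{
    public static string ToXml(string input, string pattern)
    {
        Regex regex = new Regex(pattern);

        XmlWriterSettings settings = new XmlWriterSettings();
        settings.Indent = true;
        settings.IndentChars = "    ";

        StringWriter sw = new StringWriter();
        using (XmlWriter xw = XmlWriter.Create(sw, settings))
        {
            xw.WriteStartElement("Matches");
            foreach (Match match in regex.Matches(input))
            {
                xw.WriteStartElement("Match");
                foreach (string groupName in regex.GetGroupNames())
                {
                    // Numbered groups (including group 0, the whole match)
                    // are skipped; only named groups become elements.
                    int unused;
                    if (int.TryParse(groupName, out unused))
                        continue;

                    Group group = match.Groups[groupName];
                    if (group.Success)
                        xw.WriteElementString(groupName, group.Value);
                }
                xw.WriteEndElement(); // Match
            }
            xw.WriteEndElement(); // Matches
        }
        return sw.ToString();
    }

    public static void Main()
    {
        string input = "1. Man\r\n2. Woman\r\nText to ignore\r\n3.Child";
        string pattern = @"(?<Number>\d+)\.\s*(?<Item>(?:(?!\r|\n).)*)";
        Console.WriteLine(ToXml(input, pattern));
    }
}
```

Writing to a StringWriter is what produces the utf-16 encoding in the XML declaration, since .NET strings are UTF-16.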

Using the code

Opening the solution

The code was written using Visual Studio 2005. I used NUnit 2.3.0 for unit testing.

If you don't have NUnit installed, then you will receive error messages when opening the solution because the references to the NUnit assemblies can't be resolved. Ignore these and consider removing the unit testing project from the solution.

Better yet, just download NUnit. And consider downloading TestDriven.Net too. This is a very useful Visual Studio extension which integrates NUnit and other testing frameworks into the IDE.

An FxCop project is also included. I have followed some, but not all, of the FxCop advice. If you need to, you can use the FxCop project to see what still needs to be done (e.g. globalization).

The structure of the application

Two assemblies make up the application:

AndrewTweddle.Tools.RegexToXml.Core.dll contains the core functionality.
RegexToXml.exe is the console application.

In keeping with good design principles, RegexToXml has been designed in a multi-layered fashion.

The user interface code is contained entirely within RegexToXml.exe. A list of its command line switches is provided later in this article.

The core functionality is contained within the DLL. This assembly is what would sometimes be called the "business logic layer". I dislike this term though, because it implies a restricted context of business applications only. "Core layer" is more generic terminology ("logic layer" would also be fine).

As is customary, the DLL contains no code to access external resources (such as the file system or a database), making it a suitable target for unit testing.

The main usage scenario is to run RegexToXml directly. But of course there is no reason why you can't include the DLL in your own applications. This option is discussed in the next section.

Using the core assembly

The DLL contains a static class, RegexToXmlConverter.

This has a number of overloaded Convert() methods, each of which takes an XmlWriter as its first parameter. There are also overloaded ConvertToXmlString() methods which return the XML output as a string:

public static class RegexToXmlConverter
{
    public static void Convert(XmlWriter xw, string input,
        string regexPattern, RegexToXmlOptions options)
    {
      // ...
    }

    public static void Convert(XmlWriter xw, string input,
        string regexPattern)
    {
      // ...
    }

    public static string ConvertToXmlString(string input,
        string regexPattern, RegexToXmlOptions options,
        XmlWriterSettings settings)
    {
      // ...
    }

    public static string ConvertToXmlString(string input,
        string regexPattern, XmlWriterSettings settings)
    {
      // ...
    }

    public static string ConvertToXmlString(string input,
        string regexPattern)
    {
      // ...
    }

    public static string ConvertToXmlString(string input,
        string regexPattern, RegexToXmlOptions options)
    {
      // ...
    }

    // ...
}

Command line switches

The command line utility uses a hyphen followed by a single character to specify a command line switch. The related setting can either be appended to the switch, or passed as the next command line argument. For example:

-ic:\temp\InputFile.txt

-i c:\temp\InputFile.txt
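The two forms above could be handled by a parser along the following lines. This is a minimal sketch under stated assumptions, not the parser in Program.cs: SwitchParserSketch is a hypothetical name, each switch is assumed to take at most one value, and the multi-valued -d switch is deliberately not handled.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch; the utility's real argument parsing may differ.
public static class SwitchParserSketch
{
    private static bool IsSwitch(string arg)
    {
        return arg.Length >= 2 && arg[0] == '-';
    }

    public static Dictionary<char, string> Parse(string[] args)
    {
        Dictionary<char, string> switches = new Dictionary<char, string>();
        for (int i = 0; i < args.Length; i++)
        {
            string arg = args[i];
            if (!IsSwitch(arg))
                continue; // stray values are ignored in this sketch

            char name = arg[1]; // case-sensitive, so -i and -I differ
            string value;
            if (arg.Length > 2)
            {
                // Setting appended to the switch: -ic:\temp\InputFile.txt
                value = arg.Substring(2);
            }
            else if (i + 1 < args.Length && !IsSwitch(args[i + 1]))
            {
                // Setting passed as the next argument: -i c:\temp\InputFile.txt
                value = args[++i];
            }
            else
            {
                value = String.Empty;
            }
            switches[name] = value;
        }
        return switches;
    }

    public static void Main()
    {
        Dictionary<char, string> parsed = Parse(
            new string[] { "-i", @"c:\temp\InputFile.txt", "-p-" });
        Console.WriteLine(parsed['i']);
        Console.WriteLine(parsed['p']);
    }
}
```

On/off switches such as -p- fall out of the first branch: the trailing + or - is simply the appended value.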

To view the list of command line options, run RegexToXml without any parameters or with the -? switch. Below is a list of all the command line switches:

Switch	Value	Notes
-?	Display the command line switches.
-i	File name or URI with the input text
-I	Input text	-I takes precedence over -i if both are specified. If neither -i nor -I is specified, then the input text is read from the console's input stream.
-t	Input encoding
-r	File containing regular expression
-R	The regular expression	-R takes precedence over -r if both are specified.
-o	XML output file	If omitted, the XML text will be sent to the console's output stream.
-e	Error log output file
-d	One or more parameters in the format: GroupName=ElementName	GroupName can be the named capture group's name, or GroupN where N is the index of the unnamed capture group, or Matches for the root element, or Match for each match found. e.g. -d Matches=Root Match=Leaf
-c	+ or - (i.e. on or off)	Case sensitive regular expressions. Off by default.
-m	+ or -	Multi-line regular expressions (^ and $ match the start/end of a line, not the start/end of the entire string). On by default.
-s	+ or -	Single line regular expression (so that . also matches \n). Off by default.
-w	+ or -	Match pattern whitespace in the regular expression. By default whitespace is ignored.
-n	+ or -	Use culture-invariant regular expression. Off by default.
-x	+ or -	Explicit capture groups in regular expression. Off by default.
-y	+ or -	Compiled regular expression. Off by default.
-z	+ or -	Parse the regular expression from right to left. Off by default.
-S	+ or -	Add a StartIndex attribute to each element giving the starting position of the capture in the input string.
-E	+ or -	Add an EndIndex attribute to each element giving the ending position of the capture in the input string. The end position will be 1 less than the start position if the capture has a length of zero.
-L	+ or -	Add a Length attribute to each element giving the number of characters in the captured text.
-k	+ or -	Skip unnamed capture groups. Off by default. When they are not skipped, unnamed capture groups are given element names of GroupN, where N is the index of the capture group in the regular expression.
-f	+ or -	Write XML fragment only. Off by default. This omits the XML header.
-a	+ or -	Append the XML to the output file. Off by default. This is useful in conjunction with -f+ as it allows the results of multiple input files to be concatenated into a single XML file.
-v	+ or -	Verbose mode. On by default. Shows extra progress information. This is ignored if the XML output is being written to the console instead of to a file.
-p	+ or -	Prompt to exit. Off by default. This is useful when running the utility in debug mode, as it gives you a chance to see the results before the console window disappears.

Using RegexToXml to parse SQL output files for errors

SQR Software (the company I work for) uses an in-house tool to generate SQL deployment scripts. Developers copy appropriately named change scripts into their own sub-folder of the change scripts folder. The tool then concatenates the change scripts in the correct sequence.

The tool also precedes each change script with a print statement giving the sub-folder and file name of the change script:

/*** Andrew\Aug06\Aug01_RPGetProdStdsRuleKey.sql ****/
print 'Andrew\Aug06\Aug01_RPGetProdStdsRuleKey.sql'

This makes it possible for the regular expression to extract the developer name and file name from the output file.

I'm not going to describe the process of building up the regular expression. Instead I refer you to the sequence of unit tests in SQLErrorParsingTests.cs. This demonstrates each step of the process.

Here's a typical extract from a generated XML file:

XML

<?xml version="1.0" encoding="utf-16"?>
<Nchalo>
    <ChangeScript>
        <FilePath>
            <Text>Murray\Feb06\Feb10a_EMEmployeeHearings.sql</Text>
            <Developer>Murray</Developer>
        </FilePath>
        <ErrorOutput>
            <Text>Msg 15600, Level 15, State 1, Server D28, Procedure 
                     sp_addextendedproperty, Line 42 An invalid parameter 
                     or option was specified for procedure 
                     'sp_addextendedproperty'.
            </Text>
            <ErrorHeader>Msg 15600, Level 15, State 1, Server D28, 
                      Procedure sp_addextendedproperty, Line 42</ErrorHeader>
            <ErrorText>An invalid parameter or option was specified 
                   for procedure 'sp_addextendedproperty'.
            </ErrorText>
        </ErrorOutput>
        <ErrorOutput>
            <Text>Msg 15600, Level 15, State 1, Server D28, 
                 Procedure sp_addextendedproperty, Line 42
                 An invalid parameter or option was specified 
                 for procedure 'sp_addextendedproperty'.
            </Text>
            <ErrorHeader>Msg 15600, Level 15, State 1, Server D28, 
                Procedure sp_addextendedproperty, Line 42</ErrorHeader>
            <ErrorText>An invalid parameter or option was 
                 specified for procedure 'sp_addextendedproperty'.
            </ErrorText>
        </ErrorOutput>
    </ChangeScript>
    <ChangeScript>
        <FilePath>
            <Text>Andrew\Feb06\Feb17a_PopulateBAMenu
                  ItemsWithAndrewsMenuItems.sql</Text>
            <Developer>Andrew</Developer>
        </FilePath>
        <PreErrorOutput>(0 rows affected)
(0 rows affected)
(0 rows affected)
(0 rows affected)
(0 rows affected)
(4 rows affected)
(1 row affected)
(10 rows affected)
(1 row affected)
(4 rows affected)
(1 row affected)
</PreErrorOutput>
        <ErrorOutput>
            <Text>Msg 208, Level 16, State 1, Server D28, Line 1
                     Invalid object name 'itm'.
                     (1 row affected)
                     (1 row affected)
            </Text>
            <ErrorHeader>Msg 208, Level 16, State 1, 
                    Server D28, Line 1</ErrorHeader>
            <ErrorText>Invalid object name 'itm'.
                          (1 row affected)
                          (1 row affected)
                          </ErrorText>
        </ErrorOutput>
    </ChangeScript>
    ...

Using RegexToXml to analyse Delphi forms

The need

A while back SQR standardised on using stored procedures instead of dynamic SQL in the UI of our main product. All data would be retrieved and modified using stored procedures. Over time we would also try to replace dynamic SQL with stored procedures in our existing code.

I recently used RegexToXml to search for all the database components in our forms which still use dynamic SQL.

The structure of Delphi DFM files

SQR's CanePro application is currently written in Delphi. Delphi stores form definitions in files with a dfm extension. Below is some sample text from a DFM file:

Delphi

object CostAllocRulesDataSet: TBetterADODataSet
  Connection = DM.ADOConnection1
  BeforeOpen = CostAllocRulesDataSetBeforeOpen
  CommandText =
    'select SarKey,'#13#10'  RuleName,'#13#10'
     RuleDesc'#13#10'  from BDSing' +
    'leAllocRateRule'#13#10' where ScenarioKey
     is null'#13#10' or ScenarioKey' +
    ' = :ScenarioKey'#13#10
  Parameters = <
    item
      Name = 'ScenarioKey'
      Attributes = [paSigned, paNullable]
      DataType = ftInteger
      Precision = 10
      Size = 4
      Value = 11
    end>
  IndexDefs = <>
  Left = 498
  Top = 279
  object CostAllocRulesDataSetSarKey: TAutoIncField
    FieldName = 'SarKey'
    ReadOnly = True
  end
  object CostAllocRulesDataSetRuleName: TWideStringField
    FieldName = 'RuleName'
    Required = True
    Size = 50
  end
  object CostAllocRulesDataSetRuleDesc: TWideStringField
    FieldName = 'RuleDesc'
    Size = 250
  end
end
object MaxTreeLevelCommand: TADOCommand
  CommandText = 'dbo.ULMaxLocBudgetCentreTreeLevel'
  CommandType = cmdStoredProc
  Connection = DM.ADOConnection1
  Parameters = <
    item
      Name = '@RETURN_VALUE'
      Attributes = [paNullable]
      DataType = ftInteger
      Direction = pdReturnValue
      Precision = 10
      Value = 5
    end
    item
      Name = '@RootLocKey'
      Attributes = [paNullable]
      DataType = ftInteger
      Precision = 10
      Value = Null
    end
    item
      Name = '@IncludeSelf'
      Attributes = [paNullable]
      DataType = ftBoolean
      Value = True
    end>
  Left = 37
  Top = 261
end

Complications

Note that the MaxTreeLevelCommand has a CommandType property of cmdStoredProc (an enumeration value). But the CostAllocRulesDataSet doesn't have a CommandType property in the DFM file. This is because the default value for CommandType is cmdText, and Delphi only saves non-default values to the DFM file.

So the first complication is that the regular expression pattern must look for an object which has a CommandText property but which doesn't have a CommandType property. You'll see further down how zero-width negative look-ahead groups can be used to solve this problem.

A second complication is that objects can be nested within other objects. For example, there are field objects nested within the CostAllocRulesDataSet component. This makes it harder to detect the end of the component. One can't simply look for an "end" statement, as it could belong to a nested object!

My solution to this problem was to capture the number of indented spaces in front of the initial object statement...

^(?<ComponentIndentation>\s+?)Object\s+
(?<ComponentName>\w+)\s*:\s*
(?<Type>TBetterADODataSet|TADODataSet|TADOQuery|TADOCommand)[ \t\f]*\r?\n

... then to use a named backreference (\k<ComponentIndentation>) to find the first end statement preceded by exactly this indentation.

\k<ComponentIndentation>end

(Although you can't see it in these snippets, in the complete regular expression each of them is guaranteed to be at the start of a new line.)
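The indentation trick can be tried in isolation on a tiny, made-up DFM fragment. The sketch below is a simplified variant of the idea rather than the pattern used in this article: IndentationSketch, ExtractObject and the sample text are all hypothetical, and the pattern assumes indentation consists of spaces only.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper showing how \k<Indent> pairs an "object" line with
// the "end" line that has exactly the same indentation.
public static class IndentationSketch
{
    public const string Sample =
        "object Outer: TPanel\n" +
        "  object Inner: TLabel\n" +
        "    Caption = 'x'\n" +
        "  end\n" +
        "  Left = 1\n" +
        "end\n";

    // Returns the text of the named object, or null if it is not found.
    public static string ExtractObject(string dfmText, string objectName)
    {
        string pattern =
            @"^(?<Indent>[ ]*)object\s+" + Regex.Escape(objectName) + @"\s*:" +
            @".*?" +                 // lazily consume the object's body...
            @"^\k<Indent>end\b";     // ...up to an end at the same indentation

        Match match = Regex.Match(dfmText, pattern,
            RegexOptions.Multiline | RegexOptions.Singleline);
        return match.Success ? match.Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(ExtractObject(Sample, "Inner"));
        Console.WriteLine("----");
        Console.WriteLine(ExtractObject(Sample, "Outer"));
    }
}
```

For Outer the captured indentation is empty, so the nested "  end" line is rejected (it starts with spaces, not with "end") and the match runs on to the final end.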

A third complication was extracting the values of the CommandText property. I used the same trick again, but capturing the indentation in front of the CommandText property this time.

If you look at the text fragment above, you will see that Delphi intersperses strings with tokens such as #13, #10, #9 and so on (representing characters such as \r, \n and \t). I could have tried to parse these individual lines of text and the intervening tokens. But it was far simpler to capture the entire property value instead.

The complete regular expression

The complete regular expression is given below:

# Look for an ADO component...
^(?<ComponentIndentation>\s+?)Object\s+
(?<ComponentName>\w+)\s*:\s*
(?<Type>TBetterADODataSet|TADODataSet|TADOQuery|TADOCommand)[ \t\f]*\r?\n

# Keep consuming characters until the CommandText property is found,
# or the end of the component is reached...
(?:
    (?!^\k<ComponentIndentation>end)
    (?!^[ ]*CommandText)
    (?>.|\r|\n)
)*

# Read and store the property indentation before the CommandText property:
^(?<PropertyIndentation>[ ]*)(?=\S)

CommandText[ ]*=\s*(?=\S)

  # Keep consuming characters until the start of the next property
  # or the end of the component is reached...
  (?<CommandText>
    (?:
      (?![ \t\f]*\r?\n\k<PropertyIndentation>(?=\S))
      (?![ \t\f]*\r?\n\k<ComponentIndentation>end)
      (?>.|\r|\n)
    )*
  )

# Keep consuming characters until a CommandType property is found, 
# or the end of the component is reached...
(?:
  (?![ \t\f]*\r?\n\k<PropertyIndentation>CommandType)
  (?![ \t\f]*\r?\n\k<ComponentIndentation>end)
  (?:.|\r|\n)
)*[ \t\f]*\r?\n

(?:
  \k<PropertyIndentation>CommandType\s*=\s*cmdText
  |
  \k<ComponentIndentation>end
)

Two points of interest...

The (?! ) construct is a zero-width negative look-ahead assertion (it is not a capture group, since it consumes no characters). It is useful for stepping through characters until a particular pattern is encountered.
The (?> ) construct is a non-backtracking (atomic) group. It was vital for improving performance.

Improving performance

Before I used the non-backtracking group, the regular expression would work fine for a number of matches then suddenly grind to a halt. This would usually happen on a component with a large number of characters (such as bitmaps with their large blocks of text-encoded binary data).

Regular expression performance problems are common when groups with quantifiers (such as * or +) are nested within groups with quantifiers. This can result in a running time which is exponential in the number of characters (because of the combinatorial number of ways in which the regular expression can backtrack).

You can read more about this issue on RegexAdvice.com, on MSDN and on the Base Class Library team's blog.

I had similar performance problems because I had 2 successive quantified groups which each consumed characters one at a time.

I wanted the groups to consume as many characters as possible until one of the negative look-ahead groups hit a match. So there was no need for the regular expression to backtrack. Changing to a non-backtracking group solved the performance issues.
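The "step one character at a time until a look-ahead fires" loop can be sketched on its own. The code below is illustrative only (ConsumeUntilSketch and ConsumeUntil are hypothetical names). One plausible contributor to the backtracking, offered as an observation rather than a diagnosis: in .NET the dot matches every character except \n, so it also matches \r, and a plain (?:.|\r|\n) therefore has two ways to match each carriage return. The atomic group keeps only the first.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sketch of the consume-until loop used in this article.
public static class ConsumeUntilSketch
{
    // Returns the text from the start of the input up to (but excluding)
    // the first place where stopPattern matches, or the whole input if it
    // never matches. stopPattern is inserted as a regular expression.
    public static string ConsumeUntil(string input, string stopPattern)
    {
        string pattern =
            @"\A(?<Body>" +
            @"(?:" +
                @"(?!" + stopPattern + @")" + // stop when the marker is next
                @"(?>.|\r|\n)" +              // else take one char, atomically
            @")*" +
            @")";

        Match match = Regex.Match(input, pattern);
        return match.Groups["Body"].Value;
    }

    public static void Main()
    {
        string text = "line one\r\nline two\r\nSTOP here";
        Console.WriteLine(ConsumeUntil(text, "STOP"));
    }
}
```

Because the look-ahead is tested before every character, the loop never needs to give characters back, which is why forbidding backtracking loses nothing here.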

Using RegexToXml to concatenate XML fragments

The application contains hundreds of forms. But I wanted to generate a single XML file, not hundreds of output files.

The solution was to:

Use the -f+ command line switch to generate XML fragments.
Use the -a+ switch to append the results to the XML file.
Use the -d switch to remap the Matches element to the name of the particular DFM file.

The contents of the batch file to call RegexToXml are as follows:

del NonSPCommandTexts.xml
copy NonSPCommandTextsTemplate.xml NonSPCommandTexts.xml
for /R %1 %%d in (*.dfm) do RegexToXml -i "%%d" -d "Matches=%%~nd" 
  Match=Component -r NonSPCommandTextsRegex.txt -p- 
  -f+ -a+ -o NonSPCommandTexts.xml
echo ^</Forms^> >> NonSPCommandTexts.xml

[Note that I've split the line with the "for" command over 3 lines for improved readability. It's a single line in the batch file.]

The batch file takes a single parameter (i.e. %1) which is the root folder of the application's source code.

NonSPCommandTextsTemplate.xml has the following contents:

XML

<?xml version="1.0" encoding="utf-8"?>
<Forms>
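To see why the pieces add up to a well-formed document, the same assembly can be imitated in memory. This is a sketch under assumptions, not the utility's code: XmlFragmentSketch and its methods are hypothetical, and using OmitXmlDeclaration with ConformanceLevel.Fragment is simply one way an XmlWriter can be made to emit a header-less fragment, which may or may not be how -f+ is implemented.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical sketch: template header + appended fragments + closing tag.
public static class XmlFragmentSketch
{
    // Writes one form's element without an XML declaration.
    public static string WriteFormFragment(string formName,
        string[] componentNames)
    {
        XmlWriterSettings settings = new XmlWriterSettings();
        settings.OmitXmlDeclaration = true;
        settings.ConformanceLevel = ConformanceLevel.Fragment;
        settings.Indent = true;

        StringWriter sw = new StringWriter();
        using (XmlWriter xw = XmlWriter.Create(sw, settings))
        {
            xw.WriteStartElement(formName);
            foreach (string componentName in componentNames)
            {
                xw.WriteStartElement("Component");
                xw.WriteElementString("ComponentName", componentName);
                xw.WriteEndElement();
            }
            xw.WriteEndElement();
        }
        return sw.ToString();
    }

    public static string BuildDocument()
    {
        StringBuilder sb = new StringBuilder();
        // The template: header and opening root element only.
        sb.AppendLine("<?xml version=\"1.0\" encoding=\"utf-8\"?>");
        sb.AppendLine("<Forms>");
        // One appended fragment per form.
        sb.AppendLine(WriteFormFragment("DailyContractorReport",
            new string[0]));
        sb.AppendLine(WriteFormFragment("DayWorkContracts",
            new string[] { "LocationsListQ" }));
        // The closing tag which the batch file echoes at the end.
        sb.AppendLine("</Forms>");
        return sb.ToString();
    }

    public static void Main()
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(BuildDocument()); // throws if the result is malformed
        Console.WriteLine(doc.DocumentElement.ChildNodes.Count);
    }
}
```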

Sample output

The following is an extract from the XML file that RegexToXml generates:

XML

...
<DailyContractorReport />
<DailyContractSummary />
<DayWorkContracts>
    <Component>
        <ComponentIndentation>  </ComponentIndentation>
        <ComponentName>LocationsListQ</ComponentName>
        <Type>TBetterADODataSet</Type>
        <PropertyIndentation>    </PropertyIndentation>
        <CommandText>'select locKey, locName, code'#13#10'  
                        from dbo.locations'</CommandText>
    </Component>
</DayWorkContracts>
<EditFrameBase />
<EditFrameTabbedBase />
<BDAccountEntriesReport>
    <Component>
        <ComponentIndentation>  </ComponentIndentation>
        <ComponentName>AccountDataSet</ComponentName>
        <Type>TBetterADODataSet</Type>
        <PropertyIndentation>    </PropertyIndentation>
        <CommandText>'select AccKey, AccCode, 
             AccName'#13#10'  from Account'</CommandText>
    </Component>
    <Component>
        <ComponentIndentation>  </ComponentIndentation>
        <ComponentName>CostAllocRulesDataSet</ComponentName>
        <Type>TBetterADODataSet</Type>
        <PropertyIndentation>    </PropertyIndentation>
        <CommandText>'select SarKey,'#13#10'       
            RuleName,'#13#10'       RuleDesc'#13#10'  from BDSing' +
            'leAllocRateRule'#13#10' where ScenarioKey 
            is null'#13#10'     or ScenarioKey' +
            ' = :ScenarioKey'#13#10</CommandText>
    </Component>
    <Component>
        <ComponentIndentation>  </ComponentIndentation>
        <ComponentName>FinPeriodTypeDataSet</ComponentName>
        <Type>TBetterADODataSet</Type>
        <PropertyIndentation>    </PropertyIndentation>
        <CommandText>'select FPTKey, PeriodNo, 
             PeriodTypeCode, PeriodTypeName'#13#10'  from ' +
             'FinPeriodTypeTree'#13#10' where 
             CategoryKey = 4'#13#10'order by PeriodNo'</CommandText>
    </Component>
    ...

Note that the unwanted ComponentIndentation and PropertyIndentation groups are also included in the XML file. This and other limitations are discussed in a later section.

Other code analysis utilities

A variety of other similar utilities could be written in the same way.

For example, I have written a similar (but somewhat more complex) regular expression which lists all stored procedures referenced by ADO components within the DFM files.

This information could be put to a variety of uses, such as documentation (e.g. to assist new developers), clustering related UI and SQL code, generating a hierarchy of SQL database objects based on function, and so on.

The added complexity arises because the contents of the CommandText property must be parsed (looking either for "exec dbo.<StoredProcName>..." or just "dbo.<StoredProcName>").

This produces outputs that look like this...

XML

<?xml version="1.0" encoding="utf-8"?>
<Forms>
<DataModule>
    <Component>
        <ComponentIndentation>  </ComponentIndentation>
        <ComponentName>CalendarDeliveriesCmd</ComponentName>
        <Type>TADOCommand</Type>
        <PropertyIndentation>    </PropertyIndentation>
        <DatabaseObjectName>BACalendarDeliveries</DatabaseObjectName>
    </Component>
    <Component>
        <ComponentIndentation>  </ComponentIndentation>
        <ComponentName>CalendarCalcFuelUsedCmd</ComponentName>
        <Type>TADOCommand</Type>
        <PropertyIndentation>    </PropertyIndentation>
        <DatabaseObjectName>BACalendarFuelUsed</DatabaseObjectName>
    </Component>
    <Component>
        <ComponentIndentation>  </ComponentIndentation>
        <ComponentName>CalendarCalcNumPaidCmd</ComponentName>
        <Type>TADOCommand</Type>
        <PropertyIndentation>    </PropertyIndentation>
        <DatabaseObjectName>BACalendarCalcNumPaid</DatabaseObjectName>
    </Component>
    ...

This can be useful for quickly determining where a particular stored procedure is used.

Using RegexToXml to download board game ratings

My hobby

My hobby is collecting and playing board games, particularly the very elegant "German-style" games such as Puerto Rico and The Settlers of Catan.

[It seems I'm in good company! Martin Fowler writes about them in his blog, and Scott Berkun mentions Settlers in his excellent book, The Art of Project Management.]

But South Africa is not the ideal country for this hobby due to the high cost of shipping and small penetration of the hobby. I order games online, so there is no opportunity to "try before you buy". Instead I use the superb BoardGameGeek web site to access ratings and comments on thousands of board games, so that I can make better game purchases.

One thing you notice is that games tend to be over-hyped initially, rising rapidly in the ranks then often falling again as the gloss wears off and their true merits become apparent. To reduce the risk of buying an over-hyped game, I usually wait a while for the ratings to stabilise. But how long should one wait?

A while back I wrote a utility to:

Connect to the BoardGameGeek web site.
Retrieve the number of pages of ratings.
Open each page in turn.
Use a regular expression to extract all ratings from the page.
Save the ratings to a database.

I was running this utility on a weekly schedule to collect enough statistics to answer my question. But in the end the need for my utility fell away, because the Board Game Geek administrators changed the Bayesian weighting formula to reduce the problem of over-hyped games rising too rapidly in the rankings. This invalidated the stats I had been gathering, but partially addressed the problem that had prompted the utility!

I have since used RegexToXml to extract game ratings from the web site as well.

I'm not going to give details of the regular expression here. For one thing, it's already out of date. For another, I'm sure that the administrators wouldn't thank me for directing your data extraction programs to their web site!

Extracting ratings allowed me to test RegexToXml with URLs. But it also gave me insight into 2 very different development "philosophies"...

Layered development versus the Unix small sharp tools approach

The original utility was written in a multi-layered fashion with a data layer, logic layer and user interface layer.

On the other hand, using RegexToXml to extract ratings is much closer to the Unix philosophy of "small sharp tools".

Since I didn't have a "small sharp tool" to update the database, I couldn't really compare the 2 approaches directly. But my gut feel was that the right approach depended on the scale of the problem. Don't ask me to create an enterprise level application using small sharp tools. But if long term maintainability is not an issue, then this approach is great.

On the other hand, the multi-layered approach felt more complex. But it turned out to be very easy to maintain and extend.

I had originally used an open source C# component to convert the HTML page to an XML format. When this started giving bugs I was able to rip it out and replace it with a regular expression based solution. No changes were required to the data layer and only a single line needed to be changed in the user interface! This vindicated the theory behind multi-layered applications.

I think this is very similar to the trade-off one must make between dynamic scripting languages and strongly-typed languages, and between spreadsheet and database solutions.

Anyway, back to RegexToXml...

Overcoming HTTP protocol violation errors

One problem I encountered is that .NET throws a WebException when connecting to the web site. Apparently the header returned by the site is not fully compliant with the HTTP protocol. The workaround was to add the following section to the App.config file:

XML

<configuration>
  <system.net>
    <settings>
      <httpWebRequest useUnsafeHeaderParsing="true" />
    </settings>
  </system.net>
</configuration>

For further details see this bug report at the Microsoft site.

Connecting to web pages or local files

The -i command line switch has the flexibility to accept a URL, a full file path or a relative file path.

To make the -i switch work this way required a bit of a hack though.

The Uri constructor is happy to take a full file path instead of a URI, and WebRequest.Create() accepts the resulting file URI, so URIs and full file paths could both be supported very simply. But the Uri constructor isn't happy with relative file paths.

My solution was to first check whether the URI was well-formed. If not, I assumed that it was a relative file path and tried to convert it to a full file path. If an exception was thrown, I would swallow the exception and treat the path as a URL to get a more meaningful error message. Swallowing exceptions like this is something I find hard to, um, swallow! Sure it works, and I'll do it if I have to - but it just feels wrong to me.

Here's the code to do this (from Program.cs). If a reader knows of a better way, please drop a line in the comments section to let me know how...

private static string ReadInputTextFromUri(string inputFileNameOrUri,
        Encoding inputEncoding)
{
    /* If this is a relative file path, then convert it to the
     * full path otherwise the Uri constructor will throw an error:
     */
    if (!Uri.IsWellFormedUriString(inputFileNameOrUri,
        UriKind.Absolute))
    {
        try
        {
            inputFileNameOrUri = Path.GetFullPath(inputFileNameOrUri);
        }
        catch
        {
            /* Swallow the exception, since a more meaningful
             * exception will be thrown when the Uri is created.
             */
        }
    }

    Uri inputUri = new Uri(inputFileNameOrUri);

    /* Read the input text from the Uri: */
    WebRequest wrq = WebRequest.Create(inputUri);
    WebResponse wresp = wrq.GetResponse();

    using (wresp)
    {
        Stream respStream = wresp.GetResponseStream();
        using (respStream)
        {
            StreamReader sr
                = new StreamReader(respStream, inputEncoding);

            return sr.ReadToEnd();
        }
    }
}
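One possible alternative, offered only as a sketch: Uri.TryCreate reports failure through its return value, so the relative-path case can be detected without a try/catch block. The InputLocation class and its Resolve method are hypothetical names and are not part of RegexToXml. Note also that Uri.TryCreate is more lenient than Uri.IsWellFormedUriString, so the two approaches may disagree on unusual inputs.

```csharp
using System;
using System.IO;

public static class InputLocation
{
    /* Hypothetical helper (not part of RegexToXml): turn a URL, a full
     * file path or a relative file path into an absolute Uri.
     */
    public static Uri Resolve(string inputFileNameOrUri)
    {
        Uri uri;

        /* URLs and full file paths both parse as absolute URIs: */
        if (Uri.TryCreate(inputFileNameOrUri, UriKind.Absolute, out uri))
            return uri;

        /* Anything else is assumed to be a relative file path. If the
         * path is invalid, GetFullPath throws and the exception is
         * allowed to propagate with its own message:
         */
        return new Uri(Path.GetFullPath(inputFileNameOrUri));
    }

    public static void Main()
    {
        Console.WriteLine(Resolve("http://www.boardgamegeek.com/"));
        Console.WriteLine(Resolve(Path.Combine("data", "input.txt")));
    }
}
```

The resulting Uri could then be passed to WebRequest.Create exactly as in the method above. Nothing is swallowed here, at the cost of a less uniform error message for invalid paths.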

Limitations of the utility

There are a number of limitations which you need to be aware of...

Potential problems with nested groups

Regular expressions are completely capable of having capture groups nested within other capture groups. The .NET regular expression engine parses these groups correctly. But the problem is that the object model is flat, not hierarchical. The Regex class exposes an array of Matches. Each Match contains an array of Groups. And each group contains an array of Captures. But the Capture class does not expose an array of nested captures. This made it challenging to reconstruct the hierarchy correctly.

My solution was to identify nested captures using the positions of each Capture within the input string. The RegexCaptureNode class encapsulates each Capture instance, but is able to contain child nodes. If the start and end index of a capture is contained within the start and end index of a capture node, then the capture is added as a child capture node.
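The containment idea can be sketched as follows. This is not the RegexCaptureNode class from the download; CaptureNode and CaptureTree are hypothetical stand-ins, and the sketch assumes that sorting captures by start index (longest first on ties) is an acceptable way to decide which captures are outermost.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

/* Hypothetical stand-in for the article's RegexCaptureNode class. */
public class CaptureNode
{
    public string GroupName;
    public int Start;
    public int End;   /* exclusive */
    public List<CaptureNode> Children = new List<CaptureNode>();
}

public static class CaptureTree
{
    public static CaptureNode Build(Regex regex, Match match)
    {
        var root = new CaptureNode
        {
            GroupName = "Match",
            Start = match.Index,
            End = match.Index + match.Length
        };

        /* Flatten every capture of every group (except group 0, the
         * whole match) into one list. Outer captures sort first:
         * earlier start, then later end.
         */
        var nodes = regex.GetGroupNames()
            .Where(name => name != "0")
            .SelectMany(name => match.Groups[name].Captures.Cast<Capture>()
                .Select(c => new CaptureNode
                {
                    GroupName = name,
                    Start = c.Index,
                    End = c.Index + c.Length
                }))
            .OrderBy(n => n.Start)
            .ThenByDescending(n => n.End)
            .ToList();

        /* The stack holds the chain of currently open ancestors. */
        var open = new Stack<CaptureNode>();
        open.Push(root);

        foreach (var node in nodes)
        {
            /* Close ancestors whose span does not contain this capture: */
            while (open.Count > 1 &&
                !(open.Peek().Start <= node.Start && node.End <= open.Peek().End))
            {
                open.Pop();
            }

            open.Peek().Children.Add(node);
            open.Push(node);
        }

        return root;
    }

    public static void Print(CaptureNode node, string input, int depth)
    {
        Console.WriteLine("{0}{1}: {2}", new string(' ', depth * 2),
            node.GroupName, input.Substring(node.Start, node.End - node.Start));

        foreach (var child in node.Children)
            Print(child, input, depth + 1);
    }

    public static void Main()
    {
        const string input = "a=1";
        var regex = new Regex(@"(?<Pair>(?<Key>\w+)=(?<Value>\w+))");
        Print(Build(regex, regex.Match(input)), input, 0);
    }
}
```

For the input "a=1" this places Key and Value beneath Pair, and Pair beneath the match. The sketch resolves ties arbitrarily: a capture with the same span as another simply becomes its child, and a zero length capture on a boundary is treated as contained.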

In general this works very well. However it does lead to two limitations of the utility:

- A zero length capture may fall on the start or end boundary of another capture. How does one know whether to add it as a child node or a sibling node?
- A capture could have the same start and end index as another capture. How does one decide which capture should be the parent and which the child?

The pragmatic solution to this problem is to design the regular expression so that neither of these situations can occur. For example, one could change a zero-or-more quantifier (*) in the contents of a capture group to a one-or-more quantifier (+), and make the entire group optional using a zero-or-one quantifier (?). Then no capture is created when there are zero characters to match, and an incorrect hierarchy is avoided.
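A small illustration of the difference, using made-up patterns and input (the Name and Args groups are not taken from the article's examples):

```csharp
using System;
using System.Text.RegularExpressions;

public static class ZeroLengthCaptureDemo
{
    /* Number of captures recorded for the (illustrative) Args group. */
    public static int CountArgsCaptures(string pattern, string input)
    {
        Match match = Regex.Match(input, pattern);
        return match.Groups["Args"].Captures.Count;
    }

    public static void Main()
    {
        const string input = "Run()";

        /* Zero-or-more inside the group: an empty capture is recorded. */
        const string zeroOrMore = @"(?<Name>\w+)\((?<Args>[^)]*)\)";

        /* One-or-more inside an optional group: no capture when empty. */
        const string optionalGroup = @"(?<Name>\w+)\((?<Args>[^)]+)?\)";

        Console.WriteLine(CountArgsCaptures(zeroOrMore, input));     /* prints 1 */
        Console.WriteLine(CountArgsCaptures(optionalGroup, input));  /* prints 0 */
    }
}
```

Both patterns match "Run()", but only the first leaves a zero length Args capture sitting on the boundary between the two parentheses.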

There is no option to suppress the display of a capture group's text

A capture which has multiple child captures will store its captured text in a <Text> child element. In many cases this is what one wants. But in some cases one is only interested in the text in the child captures, not in the parent capture's text.

There is no option to exclude particular capture groups from the results

A related issue is that sometimes one doesn't want the capture to appear at all.

In the Code Analysis example earlier in this article I used a named capture group to store the number of indented spaces at the start of a line. I then used a named backreference (\k<GroupName>) to match that same level of indentation further down in the file. The indentation node still gets written to the XML output, even though it is of no further interest.

Captures can only be displayed as elements, not as attributes

Each capture is saved as an XML element. But in some cases it might be preferable to save the capture as an attribute (and its text as the attribute's value).

It would be particularly useful if one could pass in an attribute value to the root element, instead of only being able to remap the Matches element name.

For example, the SQL errors XML file takes the name of the customer as the root element name:

XML

<?xml version="1.0" encoding="utf-16"?>
<Nchalo>
  <ChangeScript>
    ...

It would be more elegant and useful (e.g. for XSLT transformations generally and XAML HierarchicalDataTemplate bindings in particular) to have it look like this instead:

XML

<?xml version="1.0" encoding="utf-16"?>
<Customer name="Nchalo">
  <ChangeScript>
...
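In the meantime, a similar result could be obtained by post-processing the output file. The following is only a sketch: it uses LINQ to XML (System.Xml.Linq, which requires .NET 3.5 or later), the RootRewriter class is a hypothetical name, and the element content in Main is a placeholder rather than real output.

```csharp
using System;
using System.Xml.Linq;

public static class RootRewriter
{
    /* Hypothetical post-processing step (not part of RegexToXml):
     * replace the root element with one called newRootName, and keep
     * the old root name as the value of an attribute.
     * Attributes on the original root, if any, are not copied.
     */
    public static XElement MoveRootNameToAttribute(XElement root,
        string newRootName, string attributeName)
    {
        return new XElement(newRootName,
            new XAttribute(attributeName, root.Name.LocalName),
            root.Nodes());
    }

    public static void Main()
    {
        /* Placeholder document; "example" is not real output. */
        XElement original = XElement.Parse(
            "<Nchalo><ChangeScript>example</ChangeScript></Nchalo>");

        XElement rewritten
            = MoveRootNameToAttribute(original, "Customer", "name");

        Console.WriteLine(rewritten);
    }
}
```

The child nodes are copied into the new root, so the original document is left unchanged.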

A better design?

For most practical purposes the limitations in the previous section are not significant. But it's still fun to try to devise a better design! Here are a few options...

Extend the -d switch to specify extra options

It wouldn't be too difficult to modify the parameters to the -d switch so that one could specify extra options for the groups. One could even specify the level of each capture group in the hierarchy.

This would all but eliminate the problem of zero length capture groups. (There could still be a problem deciding the sequence of sibling nodes which both have zero length, although the sequence is unlikely to be of importance.)

For example, one could use the following parameters:

-d 0!:Matches=Customer 1!:Match=ChangeScriptErrors 2!:ChangeScript 
  3@:Developer 3@:FilePath 3:PreErrorOutput 3!:Error 4:ErrorHeader 
  4:ErrorText -:Indentation

Here N!: means "add the group's captures at level N and don't write its text". N@: means "add the group's captures at level N but add it as an attribute of its parent element". -: would mean "don't save the group name to the XML file".
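To give a feel for the effort involved, here is a sketch of how such parameters might be parsed. The syntax is only the proposal above, nothing in the current utility works this way, and the GroupOption and GroupOptionParser names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

/* Hypothetical settings for one capture group. */
public class GroupOption
{
    public string GroupName;
    public string ElementName;   /* alias after '=', else the group name */
    public int? Level;           /* null when no level was given */
    public bool Exclude;         /* "-:"  */
    public bool SuppressText;    /* "N!:" */
    public bool AsAttribute;     /* "N@:" */
}

public static class GroupOptionParser
{
    private static readonly Regex Syntax = new Regex(
        @"^(?:(?<Exclude>-)|(?<Level>[0-9]+)(?<Flag>[!@])?)" +
        @":(?<Group>\w+)(?:=(?<Alias>\w+))?$");

    public static GroupOption Parse(string argument)
    {
        Match m = Syntax.Match(argument);

        if (!m.Success)
            throw new FormatException(
                "Unrecognised group option: " + argument);

        var option = new GroupOption();
        option.GroupName = m.Groups["Group"].Value;
        option.ElementName = m.Groups["Alias"].Success
            ? m.Groups["Alias"].Value
            : option.GroupName;
        option.Exclude = m.Groups["Exclude"].Success;

        if (m.Groups["Level"].Success)
            option.Level = int.Parse(m.Groups["Level"].Value);

        /* An unmatched Flag group has an empty Value: */
        option.SuppressText = m.Groups["Flag"].Value == "!";
        option.AsAttribute = m.Groups["Flag"].Value == "@";

        return option;
    }

    public static void Main()
    {
        string[] arguments =
        {
            "0!:Matches=Customer", "3@:Developer",
            "3:PreErrorOutput", "-:Indentation"
        };

        foreach (string argument in arguments)
        {
            GroupOption o = Parse(argument);
            Console.WriteLine(
                "{0} -> {1} level={2} exclude={3} noText={4} attribute={5}",
                o.GroupName, o.ElementName, o.Level, o.Exclude,
                o.SuppressText, o.AsAttribute);
        }
    }
}
```

Parsing is the easy part. The options would still have to be applied when the capture hierarchy is built and written out.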

This solution would be quite simple to implement. But it feels ugly and messy, and is certainly not intuitive to learn.

Customize the output using an XML template

A more elegant alternative would be to allow an XML template file to be passed as an input parameter.

A namespace could be defined for RegexToXml (possibly with an alias such as rgx). This could be used to identify elements to be transformed using the regular expression matches.

RegexToXml could possibly traverse the XML file replacing nodes such as...

XML

<rgx:Capture GroupName="ChangeScript" rgx:DisplayText="false">
  <rgx:Capture GroupName="Developer" rgx:DisplayAsAttribute="true"/>
  <rgx:Capture GroupName="FilePath" rgx:DisplayAsAttribute="true"/>
</rgx:Capture>

...with the groups' captures.

This would be quite a fun utility to write. But for now RegexToXml is good enough for my purposes. So there is no justification for spending time on these enhancements.

Other points of interest

Using NUnit as an alternative to a console test app

NUnit traps and displays the contents of the console output and error streams. This makes it very convenient to call Console.WriteLine from within one's unit tests, rather than writing a separate command line test program.

Because of this, multiple tests will often all be writing their output to the Console. It is useful to be able to tell them apart in NUnit's console window. The following methods (found in the TestsBase class) can be used to separate the different test outputs:

/* Requires: using System.Diagnostics; using System.Reflection;
 * using NUnit.Framework;
 */
public string GetTestMethodName()
{
    StackTrace st = new StackTrace();
    MethodBase testMethod = null;

    /* Find the most recent method in the call stack 
     * which has an NUnit [Test] attribute:
     */
    foreach (StackFrame sf in st.GetFrames())
    {
        MethodBase currMethod = sf.GetMethod();

        if (currMethod.GetCustomAttributes(typeof(TestAttribute), true)
            .Length > 0)
        {
            testMethod = currMethod;
            break;
        }
    }

    if (testMethod == null)
        return "<Unknown calling method>";
    else
        return testMethod.Name;
}

public void WriteTestHeadingToConsole()
{
    string methodName = GetTestMethodName();
    Console.WriteLine("--------------------------------------------");
    Console.WriteLine("*** {0} ***", methodName);
    Console.WriteLine();
}

Acknowledgements

Thanks to Egmont Goedeke of SQR Software for permission to include extracts from actual input and output files.

History

13 October 2006
- Initial version submitted.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


Written By

Andrew Tweddle

Architect Dariel Solutions

South Africa

Andrew Tweddle started his career as an Operations Researcher, but made the switch to programming in 1997. His current programming passions are Powershell and WPF.

He has worked for one of the "big 4" banks in South Africa as a software team lead and an architect, at a Dynamics CRM consultancy and is currently an architect at Dariel Solutions working on software for a leading private hospital network.

Before that he spent 7 years at SQR Software in Pietermaritzburg, where he was responsible for the resource planning and budgeting module in CanePro, their flagship product for the sugar industry.

He enjoys writing utilities to streamline the software development and deployment process. He believes Powershell is a killer app for doing this.

Andrew is a board game geek (see www.boardgamegeek.com) with a collection of over 190 games! He also enjoys digital photography, camping and solving puzzles - especially Mathematics problems.

His Myers-Briggs personality profile is INTJ.

He lives with his wife, Claire and his daughters Lauren and Catherine in Johannesburg, South Africa.
