Google Docs version

Prologue

It was August the 18^th 2011. The article was http://blogs.msdn.com/b/netmfteam/archive/2011/08/18/netmf-4-2-regular-expressions.aspx.

The Regular Expression implementation for the .NET Micro Framework has just been publicly mentioned. The pressure was on for the community to find bugs and sure enough after some time with hungry developers using the library a bug was eventually found however not in the RegEx engine, with the StringBuilder (which was also contributed by myself) http://forums.netduino.com/index.php?/topic/2717-string-manipulation-in-net-41-micro/.

After promptly fixing the issue in less than 24 hours immediately tested other supported expressions and even utilized my new feature of cycle timing to ensure I was not wandering around into the unmanaged void and creating an access violations on accident.

The main purpose of this article is to decide if the missing methods have any impact on the developers utilizing the library and if further development is even warranted based on this article.

The latter purpose is to ensure that the differences as well as similarities are appropriately highlighted and that the cycle timing feature is explained in way which makes it easy for existing Regex users to understand and use in the same way and last but not least to provide a few laughs at the expense of time alone..

Introduction

With respect to the developers of the Regular Expression syntax and by virtue of the inventors of the Unix Operating System, not to mention by merit the creator of the language which powered it, I will say that Regular Expressions are nothing more than a mythical black box.

It was with great pride that I learned how the great theory of Automata was formed and implemented. I was filled with the odes of joy when I learned there were even two separate possible representations which reflected what I like to envision as biased and unbiased, Deterministic Finite Automata and Nondeterministic Finite Automata. It was even a greater feeling when I learned that you could convert between the two and that there was really no difference, save performance or instructions required to complete the determination.

It was with great sadness and .... or ‘Murderous Intent’ that I learned via the history of time through research that both schools of thought were never unified and battled endlessly, such as the physicists did which inspired some of their reasoning. The disease of difference was forever engrained into the minds of programmers and the symptoms were that each one would be naive and utilize DFA minimization or other equivalence principles until they woke up and realized that the problem could and should be solved by combining compiled and evolving the features into RegExNext or such. Even after most realized this, they rather implemented various interpreters into their language implementations and utilized them in an attempted conspiracy to force the failure or Moore’s Law prematurely and also allow the notion of high capacity hard disks to take off. While initially successful, this plot was thwarted by a flood and newer solid state technology which actually then benefited the campaign by forcing a rise in cost as well ensuring that solid state technology had the required money for research.

Without exposing my CIA handler and the methods in which I have confirmed these various operations, I know that my story may seem highly suspicious, however, what I can say is that I employ you to look into Einstein and how he just so happened to want to escape to America and how he then came up with General Relativity.

I will also say that a proper use of common sense combined with good design and optimization can always outperform a Regular Expression engine (hands down/up).

Now with that out of the way, I will move onto the meat of the article which I hope is what you came here looking for rather than my delusional ramblings which only have a slight possibility of being related and a even smaller margin of actually being true.

Please be active and leave feedback if you care about kittens.

Article

Users typically utilize Regular Expressions as a way to exploit what seems like a short cut through some seemingly complex string handling task. But before we begin to get into the main point of how to use the library and how it differs from the full framework implementation, I will ask that we take a second to realize other possible utilizations for Regular Expressions.

A developer can use Regular Expressions to actually do quite a bit of work on binary data as well. Once you learn a bit about string encoding and you get into more complex Regular Expressions (herein known as Regex(‘s)), or if you already do, then you can skip down to the main points and examples.

You can use Regex’s to match sections of logic in a code analysis as well as use them to get relevant or similar portions of an image from another image among many other variations of the string matching technique which is still applicable since a char is a byte and vice versa and there is no loss or even a conversion which is done. There is more like a typedef or an alias which takes place to provide these constructs to a developer.

Now possibly with your mind even more confused than what it was before or possibly enlightened or even argumentative, I will ask that we let this go for now and move onto the main points.

The new Timing mechanism a.k.a. RegexOptions.Timed is used to ensure that a match operation does not take longer than the given amount of time defined in ticks.

This was easily achieved by ensuring that the given amount of ticks elapsed since the initial execution was compared via the following code:

//If we are keeping time and the time we started matching at is more then MaxTick 
//ticks ago something went wrong allow for a break - JRF

if (timed && (DateTime.Now.Ticks - startTicks) >= MaxTick)
    throw new RegexExecutionTimeException(idxStart); 
else startTicks = DateTime.Now.Ticks;

It ensures that expressions which include possible endless results such as '\w+' o0r '\b+\' are safely executed and the system is not wasting time needlessly attempting to match.

I also pondered the possibility of adding a Regex.Lazy flag which would optionally sleep for the given ticks and then possibly increase the allocated ticks until exception.

The RegexExecutionTimeException class takes the position in the match stack and then allows the catcher to restart the matching process either over or later by providing the index last evaluated and keeping the stack registers and the class state in the same state as before the exception. Below you will find a meaningless example:

int catchCount = 0;

//Should match for a long time and throw the Timing Excpetion
//This could be changed up a bit to show partial matching

string pattern = @"\b(?:(\w+))\s+(\1)\b+";
Regex rx = new Regex(pattern, RegexOptions.IgnoreCase |
RegexOptions.Timed);
rx.MaxTick = ushort.MaxValue * 2;

// Define a test string. 
string text = "The the quick brown fox fox jumped over the lazy dog dog.";
int start = 0;

TryToMatch:

try
{ 
    // Find matches.
    MatchCollection matches = rx.Matches(text, start);
    //Log.Comment("Found Matches:" + matches.Count);
    //If we get here you have a really fast processor and timing is not working as expected
    return MFTestResults.Fail;
}
catch (RegexExecutionTimeException ex)
{ 
    if (++catchCount > 5)
        return MFTestResults.Pass;
    else
    {
        //Try to match again at the index
        //Could check index or access the Groups
        //up the Index here but in this example we never matched. 
        start = ex.Index; 
        goto TryToMatch;
    } 
} 
catch 
{
    return; 
}

A user can find a better value for the MaxTick value by performing matches on strings with known results to obtain a statistic which may be applied in an algorithm or lack thereof to ensure stricter operation. The default value is long.MaxValue / 2.

I thought of adding a Profile method which was internal, however, I could not get feedback on the implementation let alone did I want to start adding even more nonsense to a seemingly happy theory and science which did not accept change without what seemed like great loss in both dignity and pride which is of obviously the worst philosophical sin in my humble opinion(s) at the current time of this writing.

I am not sure of other implementations which have this feature let alone humans who have also expressed such concerns or ideas such as but not limited to the ones I have listed in this writing nor am I sure that it is a aspect relevant to this article...

You will see from the following Test example that the syntax with respect to the perspective of the C# compiler from the full framework has been kept as close as possible and no such deviation has been applied to either it or the StringBuilder class.

//MSDN Test from http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.match.aspx 
string input = "int[] values = { 1, 2, 3 };\n" + 
  "for (int ctr = values.GetLowerBound(1); ctr <= values.GetUpperBound(1); ctr++)\n" + 
  "{\n" + " Console.Write(values[ctr]);\n" + " if (ctr < values.GetUpperBound(1))\n" + 
  " Console.Write(\", \");\n" + "}\n" + "Console.WriteLine();\n";

string pattern = "Console.Write(Line)?"; 
Match match = Regex.Match(input, pattern); 
int matchCount = 0; 
string expected0 = null; 
string expected1 = null; 
while (match.Success) 
{
    Log.Comment("'" + match.Value + "' found in the source code at position " + match.Index + "."); 
    if (matchCount == 0) expected0 = match.ToString(); 
    else if (matchCount == 2) expected1 = match.ToString(); 
    match = match.NextMatch(); 
    ++matchCount;
} 

return matchCount == 3 && expected0 == "Console.Write" && 
  expected1 == "Console.WriteLine" ? MFTestResults.Pass : MFTestResults.Fail;

You will also see that MatchCollection is present.

    // Define a regular expression for repeated words. 
    Regex rx = new Regex(@"\b(?<word>\w+)\s+(\k<word>)\b", 
       RegexOptions.Compiled | RegexOptions.IgnoreCase);

    // Define a test string. 
    string text = "The the quick brown fox fox jumped over the lazy dog dog.";

    // Find matches. 
    MatchCollection matches = rx.Matches(text);

    // Report the number of matches found. 
    Log.Comment(matches.Count + " matches found in:\n " + text);

    // Report on each match. 
    foreach (Match match in matches) 
    {
        GroupCollection groups = match.Groups; 
        Log.Comment("'" + groups["word"].Value + "'" + 
          " repeated at positions " + groups[0].Index + " and " + groups[1].Index); 
    }

    //Required Named Capture
    return MFTestResults.Pass; 
} 
catch 
{
    //Known to Fail until Named Capture is implemented 
    //See http://www.regular-expressions.info/named.html 
    return MFTestResults.KnownFailure;
}

We will now begin to explore the differences with respect to the Jakarta library from which the code was ported and the full framework implementation.

    // Define a regular expression for repeated words.

    //Was trying to add ?: to the second group to force its presence
    //atleast once but you cant use back references when you do this with this engine, 
    //there must be another way!

    //Has correct Index's but extra matches!!! 
    //@"\b(?:\w+)\s+(\1)\b" 
    //pattern = @"\b(?:\w+)\s+((\s+(\1))\b)";

    ////Only 1 Match 
    //pattern = @"\b(\w+)\s+\w+(\k\1)\b";

    //Almost perfect 
    //pattern = @"\b(?:\w+)\s+(\w+)\1\b";

    //Correct Positions 
    //pattern = @"\b(?:\w+)\s+(\w+\1)\b";

    //Correct Positions extra matches 
    //pattern = @"\b(\w+)\b" 
    //pattern = @"\b\w+\s+\b(\1)"

    //These both work in Jakarta Applet... check for bugs 
    //http://jakarta.apache.org/regexp/applet.html 
    string pattern = @"\b(?:(\w+))\s+(\1)\b"; 
    //string pattern = @"\b(\w+)\s+\1\b"; 
    Regex rx = new Regex(pattern, RegexOptions.IgnoreCase);

    // Define a test string. 
    string text = "The the quick brown fox fox jumped over the lazy dog dog.";

    //Need dumpProgram to determine if my compiler compiled the same... I 
    //think the problem lies in the Match object... I am not passing something
    //correclty to the concstructor....

    // Find matches. 
    MatchCollection matches = rx.Matches(text);

    TestTestsHelper.ShowParens(ref rx);

    // Report the number of matches found. 
    Log.Comment(matches.Count + " matches found in:\n " + text);

    // Report on each match. 
    foreach (Match match in matches) 
    {
        GroupCollection groups = match.Groups; 
        Log.Comment("'" + groups[0].Value + "'" + " repeated at positions " + 
          groups[0].Index + " and " + groups[1].Index); 
    } 

    //May be bugs in engine but is likely due to differences in Regex 
    //engines because I have no implemented Transfer Caputre,
    //this is known to Fail for now so long as no exception is thrown.

    return MFTestResults.KnownFailure; 
} 
catch 
{
    return MFTestResults.Fail; 
}

From the code example and comments above you will see that the changes to the engine are giving some results which may not be expected at first with respect to either of the other implementations. This is a classic problem in Regex implementations and great care is needed to ensure that the theory is followed as closely as possible to ensure compatiblity.

Overall this was the hardest part because of the notion of UTF-8 encoding in the micro framework and the unicode in the full framework. I was even tempted to use shorts rather than char’s for the opcodes for this very reason however it turned out that it would be okay in the end without such drastic changes.

The bottom line is that the results are inline with what the Full Framework will return under mostly all circumstances when using syntax that is supported.

For example, Named capture is NOT supported however it would be trivial to add to the current implementation as the members are already present and the engine just needs small changes to make it actually work.. (Which may end up being another article in this series depending on the feedback) the same with various other parts of the syntax which could and may very well be added depending on how the feedback and community discussion coalesce combined with how my work load is dealt with.

The implementation also keeps a Cache of recently compiled regular expressions (just like the full framework implementation). The cache can be cleared by setting the static member Regex.CacheSize to -1 which clears the cache and prevents newly compiled expressions from being added. The default cache size is 25 instances at which point they will be recycled on a FILO basis from the cache.

What’s Missing or Different?

In the 4.2 implementation there are some members and methods missing from the Regex class e.g. those for operating on a Capture. I will elaborate further and you will hopefully decide if this impacts development....

For example take the previously stated issue on UTF-8 vs Unicode characters and the following code example.

/*
// The example displays the following output:
//        : Microsoft
//        : Excel
//        : Access
//        : Outlook
//        : PowerPoint
//        : Silverlight
// Found 6 trademarks or registered trademarks.
*/
bool result = true;
int expectedCount = 6;

string pattern = @"\b(\w+?)([  ])";
string input = "Microsoft  Office Professional Edition combines several office " +
    "productivity products, including Word, Excel , Access , Outlook , " +
    "PowerPoint , and several others. Some guidelines for creating " +
    "corporate documents using these productivity tools are available " +
    "from the documents created using Silverlight  on the corporate " +
    "intranet site.";
Regex test = new Regex(pattern);
MatchCollection matches = test.Matches(input);
foreach (Match match in matches)
{
    GroupCollection groups = match.Groups;
    Log.Comment(groups[2] + ": " + groups[1]);
}

Log.Comment("Found "+matches.Count+" trademarks or registered trademarks.");

if (matches.Count != expectedCount)
{
    result = false;
}

if (matches[0].ToString() != "Microsoft ")
{
    result = false;
}
else if (matches[1].ToString() != "Excel ")
{
    result = false;
}
else if (matches[2].ToString() != "Access ")
{
    result = false;
}
else if (matches[3].ToString() != "Outlook ")
{
    result = false;
}
else if (matches[4].ToString() != "PowerPoint ")
{
    result = false;
}
else if (matches[5].ToString() != "Silverlight ")
{
    result = false;
}

return result ? MFTestResults.Pass : MFTestResults.Fail;

The weird character is supposed to be either ©, ™, ℠, or ® however with the UTF-8 encoding at use in the framework this was not a ‘showstopper’ as people would basically just have to understand this platform difference, I could have ended up doing something weird or fancy to try and solve this however I also see the framework growing and supporting various different locales so I will have to wait for feedback to determine how much of a pain in the ass this really is, I suspect since that this can also technically happen on the full framework if you are not careful and that it is not a big as deal as I am making it however I will also say that the character literals may not been the best way to match and that there might be a need for specific codes to be given as well as encoding in future expression syntax parsers, either that or at least an Encoding member of the Regex class itself to ensure that it can optionally matching against alternate encoding.

The entire process of porting the code took all of 4 hours if that. I was compiling expression strings into complete RegexProgram’s and all of the tests passed! I was very happy with myself because I did not end up using a code tool to do the port, I did it by hand and the results spoke for themselves. Because of this I was able to jump in and out of the code as if it was my own.

Some of the comments were lacking but I did my best to ensure that a proper description was given to methods and that each line which was even remotely hard to read has some type of comment ensuring future developers would understand why I was using the variables in the series I was et al;
I basically then started top down with MSDN examples which I knew had to produce consistent results across both implementations.

This is where the headaches started, the matching was as per the theory in the engine however the names of the members were totally different that in the full framework implementations. The API was also vastly different.

This took about another 2 or 3 days to get completely right with no tests failing from the ported libraries unit test code.

The code reviewer I was working with had some unfortunate problems so this delayed the review process for my assignment about 2 whole months. After Microsoft saw I was so bored that I did a complete WebSocket implementation in less than 48 hours they emailed me and the code reviewer to find out why I was now working on WebSockets... long story short Lorenzo directly reviewed and helped me make some decisions such as to redo the StringBuilder from the full framework code and then to also implement the minimal set of methods which would enable the MSDN examples to pass using the same code.

This sounded like it would be easy at first however I quickly learned that this is where the engines were most different and the the full framework implementation utilizes various specialized matching algorithms based on the expression given and the theories behind each of the matching algorithms.

Without getting too caught up in NFA and DFA or BoyerMoore and Thompson I just made the properties named as per the full framework implementation and this was easy to do, I even promoted and demoted a few fields and properties and things still worked but some of the MSDN examples required the Match, MatchCollection, Capture, and Group classes.

So I began porting them directly over from the full framework to ensure the API was the same and that the results returned from the examples would match.

I was not allowed to take the method such as Pop and Crawl from the full framework without first demonstrating that I understood the code and that I did have a conforming implementation which required the full framework methods not because I wanted to be lazy but because I wanted to ensure that the results returned from my class were consistent with the full framework and I also wanted to ensure the Reflected calls were similar to what was on the Full Framework.

This notion eventually caught on when I was able to demonstrate within 1 week that there were only some very small issues and that I had everything save a few small features missing.

We will now take some time to explore what is missing or different about the implementation and even give some interesting ideas for future implementations.

The main theory of the ported code is to create a RegexProgram from the given expression. This program is essentially a state machine which allows it to have certain functions e.g. IsMatch which allow the program to return meaningful results such as true or false or the index of the match inter alia.

There is a Regex Compiler class which is responsible for taking the given expression and compiling it into a RegexProgram instance.

There is a DebugRegexCompiler which exposes a few exceptions which are typically used for debugging and bug testing but can be used in runtime with a dynamically created regular expressions.

using System;
using System.Text;
using System.Collections;

namespace System.Text.RegularExpressions
{
#if DEBUG

   /// <summary>
   /// A subclass of RECompiler which can dump a regular expression program for debugging purposes.
   /// </summary>
   public class RegexDebugCompiler : RegexCompiler
   {
       /// <summary>
       /// Mapping from opcodes to descriptive strings
       /// </summary>
       static Hashtable hashOpcode = new Hashtable()
   {
       {OpCode.ReluctantStar,    "Star"},
       {OpCode.ReluctantPlus,    "ReluctantPlus"},
       {OpCode.ReluctantMaybe,   "ReluctantMaybe"},
       {OpCode.EndProgram,              "EndProgram"},
       {OpCode.BeginOfLine,              "BeginOfLine"},
       {OpCode.EndOfLine,              "EndOfLine"},
       {OpCode.Any,              "Any"},
       {OpCode.AnyOf,            "AnyOf"},
       {OpCode.Branch,           "Branch"},
       {OpCode.Atom,             "Atom"},
       {OpCode.Star,             "Star"},
       {OpCode.Plus,             "Plus"},
       {OpCode.Maybe,            "Maybe"},
       {OpCode.Nothing,          "Nothing"},
       {OpCode.GoTo,             "GoTo"},
       {OpCode.Continue,         "Continue"},
       {OpCode.Escape,           "Escape"},
       {OpCode.Open,             "Open"},
       {OpCode.Close,            "Close"},
       {OpCode.BackRef,          "BackRef"},
       {OpCode.PosixClass,       "PosixClass"},
       {OpCode.OpenCluster,     "OpenCluster"},
       {OpCode.CloseCluster,    "CloseCluster"}
   };

       /// <summary>
       /// Returns a descriptive string for an opcode.
       /// </summary>
       /// <param name="opcode">Opcode to convert to a string</param>
       /// <returns> Description of opcode</returns>
       String OpcodeToString(char opcode)
       {
           // Get string for opcode
           String ret = (String)hashOpcode[opcode];

           // Just in case we have a corrupt program
           if (ret == null)
           {
               ret = "UNKNOWN_OPCODE";
           }
           return ret;
       }

       /// <summary>
       /// Return a string describing a (possibly unprintable) character.
       /// </summary>
       /// <param name="c">Character to convert to a printable representation</param>
       /// <returns>String representation of character</returns>
       String CharToString(char c)
       {
           // If it's unprintable, convert to '\###'
           if (c < ' ' || c > 127)
           {
               return "\\" + (int)c;
           }

           // Return the character as a string
           return c.ToString();
       }

       /// <summary>
       /// Returns a descriptive string for a node in a regular expression program.
       /// </summary>
       /// <param name="node">Node to describe</param>
       /// <returns>Description of node</returns>
       String NodeToString(int node)
       {
           // Get opcode and opdata for node
           char opcode = Instructions[node /* + RE.offsetOpcode */];
           int opdata = (int)Instructions[node + Regex.offsetOpdata];

           // Return opcode as a string and opdata value
           return OpcodeToString(opcode) + ", opdata = " + opdata;
       }


       /// <summary>
       /// Dumps the current program to a {@link PrintStream}.
       /// </summary>
       /// <param name="p">PrintStream for program dump output</param>
       public void DumpProgram(System.IO.TextWriter p)
       {
           // Loop through the whole program
           for (int i = 0, e = Instructions.Length; i < e; )
           {
               // Get opcode, opdata and next fields of current program node
               char opcode = Instructions[i /* + RE.offsetOpcode */];
               char opdata = Instructions[i + Regex.offsetOpdata];
               int next = (short)Instructions[i + Regex.offsetNext];

               // Display the current program node
               p.Write(i + ". " + NodeToString(i) + ", next = ");

               // If there's no next, say 'none', otherwise give absolute index of next node
               if (next == 0)
               {
                   p.Write("none");
               }
               else
               {
                   p.Write(i + next);
               }

               // Move past node
               i += Regex.nodeSize;

               // If character class
               if (opcode == OpCode.AnyOf)
               {
                   // Opening bracket for start of char class
                   p.Write(", [");

                   // Show each range in the char class
                   // int rangeCount = opdata;
                   for (int r = 0; r < opdata; r++)
                   {
                       // Get first and last chars in range
                       char charFirst = Instructions[i++];
                       char charLast = Instructions[i++];

                       // Print range as X-Y, unless range encompasses only one char
                       if (charFirst == charLast)
                       {
                           p.Write(CharToString(charFirst));
                       }
                       else
                       {
                           p.Write(CharToString(charFirst) + "-" + CharToString(charLast));
                       }
                   }

                   // Annotate the end of the char class
                   p.Write("]");
               }

               // If atom
               if (opcode == OpCode.Atom)
               {
                   // Open quote
                   p.Write(", \"");

                   // Print each character in the atom
                   for (int len = opdata; len-- != 0; )
                   {
                       p.Write(CharToString(Instructions[i++]));
                   }

                   // Close quote
                   p.Write("\"");
               }

               // Print a newline
               p.WriteLine("");
           }
       }

       /// <summary>
       /// Dumps the current program to a <code>System.out</code>.
       /// </summary>
       public void dumpProgram()
       {
           //PrintStream w = new PrintStream(System.out);
           //dumpProgram(w);
           //w.flush();
       }
   }

#endif
}

You can even compile regular expressions into RegexProgram instance’s and then either serialize them or store their bytecode on a SD card using reflection and then set it back once you instantiate the Regex Class after a reboot. There is a class RegexPrecompiler which can be used either under the emulator or on the hardware which is specially designed for this purpose.

using System;
using Microsoft.SPOT;

namespace System.Text.RegularExpressions
{
   /// <summary>
   /// Class for precompiling regular expressions for later use
   /// </summary>
   public class RegexPrecompiler
   {
       /// <summary>
       /// Main application entrypoint.
       /// Might make this have methods and be a class rathern the a program...
       /// Then the class can Serialise and Deserialiaze the Regexps
       /// </summary>
       /// <param name="arg">Command line arguments</param>
       static public void Main(String[] arg)
       {
           // Create a compiler object
           RegexCompiler r = new RegexCompiler();

           // Print usage if arguments are incorrect
           if (arg.Length <= 0 || arg.Length % 2 != 0)
           {
               Debug.Print("Usage: recompile <patternname> <pattern>");
               return;
           }

           // Loop through arguments, compiling each
           for (int i = 0, end = arg.Length; i < end; i += 2)
           {
               try
               {
                   // Compile regular expression
                   String name = arg[i];
                   String pattern = arg[i + 1];
                   String instructions = name + "Instructions";

                   // Output program as a nice, formatted char array
                   Debug.Print("\n    // Pre-compiled regular expression '" + pattern + "'\n"
                                    + "    private static  char[] " + instructions + " = \n    {");

                   // Compile program for pattern
                   RegexProgram program = r.Compile(pattern);

                   // Number of columns in output
                   int numColumns = 7;

                   // Loop through program
                   char[] p = program.Instructions;
                   for (int j = 0; j < p.Length; j++)
                   {
                       // End of column?
                       if ((j % numColumns) == 0) Debug.Print("\n        ");

                       // Print char as padded hex number                    
                       String hex = (0).ToHexString();
                       
                       while (hex.Length < 4) hex = "0" + hex;
                       
                       Debug.Print("0x" + hex + ", ");
                   }

                   // End of program block
                   Debug.Print("\n    };");
                   Debug.Print("\n    private static  REProgram " + name + " = new REProgram(" + instructions + ");");
               }
               catch (RegexpSyntaxException e)
               {
                   Debug.Print("Syntax error in expression \"" + arg[i] + "\": " + e.ToString());
               }
               catch (Exception e)
               {
                   Debug.Print("Unexpected exception: " + e.ToString());
               }
           }
       }
   }
}

All in all the assignment took about 4 months from first copy / paste to the final submittal to Microsoft. (And mind you this was with no other documentation or help (by anyone) other than that you [The Reader] would have if you did this yourself).

Fortunately everyone was pleased and apparently there were no bugs which increased my confidence in my skills and caused my reputation to start to go to my head.

In the end If I could do it again I probably would take more of an initiative into getting the syntax and engine the way I wanted it while adding methods for special localized matching in a expression.

E.g. something ruff like this:

//In less than 65535 ticks match the given expression ‘(\.?\)’, pass the results into the function present and return the result.

“\ \In{65535} (\.?\) \-> EvaluateMatch\” //Where EvaluateMatch is a MatchEvaluator or similar

//In less than 65535 ticks match the given expression ‘(\b+)’, pass the results into the function which sleeps for 100 milliseconds and returns the match.
“\ \In{100} (\b+) \-> { Sleep(_GivenTime_); return $1; }\” //_GivenTime_ would implicitly be 100 because it has been given to the In construct.

I would also change how some of the compiling is done by making it use more of the CLR string methods where possible rather than compiling my own type but this may be too much for the micro framework to perform dynamically at this time although it technically is possible.

I imagine if language implementations in general utilized this syntax then reflection and scripting would be something built into every language after there was a conforming regular expression engine.

Just imagine how it would be to perform such a task such filtering raw data on a ethernet adapter if you would be able to write a regular expression to look for data and this expression had the ability to invoke delegates or if you were writing a virus scanner or memory monitor... you could basically use the matching engine to find out when code changes and last but not least imagine how it would be to be able to pop down to assembly / MSIL / bytecode in an expression and have it create code dynamically or patch segments intelligently or to have it call other functions to do so or a combination of both.

I honestly feel that given this opportunity developers would even be able to achieve better interoperability with interpreted code as it would be given directly to the Regular Expression layer and converted into the machine language and executed allowing a much higher degree of security and standardization then various different API’s which exist and classic pointer passing. Just remember it is easy to pass a pointer to another region of memory but it is almost always invalid to pass a pointer across machine boundaries, with this new notion developers would be able to pass Regular Expressions across a machine boundary and have the same expectation of code execution as per the system they were running on; offsets and pointers become localized to the expression and they again have some meaning and it also means they can be transported across machine boundaries with little or no trouble at all just so long as the receiver is parsing them according to the same standard constructs as per all network communication.

I think this is what the original implementer's had in mind when developing ‘Regular Expressions’ however everyone got caught up in the math and then we remembers it was math with strings and forgot that strings are just bytes as well...

Closing

In closing I would also have liked to see more methods related to binary matching however to each implementation their own, a developer could easily wrap the Regex class and still create it with ease.

If I could change the full framework implementation the only changes I make would be a reduction in the amount of classes that actually perform the various types of matches and state machine conversions.

A single superclass would be created and it would analyze the expression taking into account the string given to match. It would determine the branches to use and build up a RegexProgram which contained branches which were specially optimized for the given string rather than defaulting to a single algorithm based on a few cheesy checks. This would allow all types of node searching logics to be applied dynamically in a single match without using any more overhead than is currently utilized in the implementation.

You can see the two original MSDN articles by Collin Miller here:

Here’s cool article which has nothing to do with the Micro Framework but does explain how to convert Regular Expressions to NFA -> DFA works and does have illustrations and is written by academic sources: http://lambda.uta.edu/cse5317/notes/node9.html.

[Cites / Thanks]
Myself,
CIA,
The ghosts of Dennis Ritchie, Boyer, Moore and Thompson, Fravia and Christmas Past
and last but not least Marijuana

[Shoutz]
Anonymous,
LulzSec,
AntiSec,
Identology - MYiDONE.com, (0d fa real and isocdftpz)
Federation Studios, Marsh, Gaijin, t4w2, Gracktov, Gellpak and the others @ Eclipse
iDenGodz (BTW I hacked you),
iDEN Insider, tommy and crazysim
PLA Blue,
Zi Hackademy,
MIT EECS

[Pokes]
Area 69, - All of you especially NexVision and his lackeys
howardforums, - All of you n00b’s who wish you could have been in Area 69 (especially yoDude)
iDENGodz, - M00$3 iDEN Insider, AaronASB & Rickster