Xmi CodeDom Library, Part 2 - Using dynamic types to increase performance

Dustin Metzgar

May 30, 2006

CPOL

A .NET 2.0 library that converts XMI into CodeDom. Part 2 shows how CodeDom was used to create dynamic types that outperform the reflection-based parser.

Download source files - 97 Kb

Introduction

In the previous article, I introduced a library I created to convert XMI to CodeDom. This library used heavy reflection to perform its parsing. While the reflection-based parser is decently capable of handling the XMI quickly, I was curious if it could be done faster. Using CodeDom, I created a new parser that generates dynamic types to perform the parsing. This article details what I did, some of the problems I encountered, and my results.

Thanks

I would like to once again thank Diana Mohan for providing me with a large and complex XMI document. It helped greatly in pointing out performance problems and parsing errors.

The Basic Technique

As mentioned in the previous article, the classes were created basically as data containers and the parser matched XMI nodes to their respective data containers. The job of the parser was to read elements and attributes and store those values in the properties of the data container objects. Reflection was used to determine which properties took which information.

In order to replace reflection, I took a hint from Microsoft. When you use the XmlSerializer class, it uses the attributes in your code to create a dynamic assembly that can handle XML serialization of your types. They probably use Reflection.Emit and write the IL instructions themselves. But the technique is to basically replace reflection by figuring out upfront how the serialization happens and writing dynamically-generated types to do it.

First, I created an interface called IXmiParser. This allows me to create a dynamic type that implements that interface:

internal static Type CreateCodeDom(Dictionary<string, Type> dictParseClasses)
{
    // Create the parser type and everything needed to support it.
    CodeCompileUnit ccu = new CodeCompileUnit();
    CodeNamespace cn = new CodeNamespace("NsCodeDomXmiParser");
    ccu.Namespaces.Add(cn);
    CodeTypeDeclaration ctdParser = new CodeTypeDeclaration("CodeDomXmiParser");
    cn.Types.Add(ctdParser);
    ctdParser.BaseTypes.Add(typeof(IXmiParser));
    ...
    // Lots of stuff happens here
    ...
    // Compile the assembly and return a reference to the XMI parser type.
    CSharpCodeProvider provider = new CSharpCodeProvider();
    CompilerParameters comParams = new CompilerParameters(new string[] { "System.dll", 
        "System.Xml.dll", typeof(CodeDomXmiParser).Assembly.Location });
    comParams.GenerateInMemory = true;
    comParams.IncludeDebugInformation = false;
    comParams.CompilerOptions = "/optimize";
    comParams.TempFiles.KeepFiles = true;
    CompilerResults results = provider.CompileAssemblyFromDom(comParams, 
        new CodeCompileUnit[] { ccu });
    // Errors also holds warnings, so test HasErrors rather than Count.
    if (results.Errors.HasErrors)
        throw new Exception("Compilation errors!");
    return results.CompiledAssembly.GetType("NsCodeDomXmiParser.CodeDomXmiParser", true,
        true);
}
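
Once the type is compiled, the caller needs reflection only one more time, to create an instance. Every later call goes through the interface like ordinary code. Here is a minimal sketch of that hand-off. The interface IParserSketch, its Describe member, and the StandInParser class are hypothetical stand-ins, since the real members of IXmiParser are not shown in this article, and an ordinary compiled class takes the place of the type returned by CreateCodeDom:

```csharp
using System;

// Hypothetical stand-in for IXmiParser; the real interface's members
// are not shown in the article.
public interface IParserSketch
{
    string Describe(string elementName);
}

// Stand-in for the dynamically generated type.
public class StandInParser : IParserSketch
{
    public string Describe(string elementName)
    {
        return "parsed:" + elementName;
    }
}

public static class ParserLoader
{
    // Reflection is used once, to instantiate the type. The cast to the
    // interface lets every later call be a normal interface call.
    public static IParserSketch Load(Type parserType)
    {
        if (!typeof(IParserSketch).IsAssignableFrom(parserType))
            throw new ArgumentException("Type does not implement IParserSketch.");
        return (IParserSketch)Activator.CreateInstance(parserType);
    }
}
```

With the real library, the argument to Load would be the Type returned by CreateCodeDom rather than typeof(StandInParser).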

After this is the challenge of expanding out everything that the reflection-based code does into normal code. This involves an awful lot of CodeDom. I will spare you the code dump and just point out some of the things I noticed about using CodeDom.

Loops and Conditions

If there's anything I found particularly lacking about CodeDom, it was in the realm of loops and conditions. The for loop construct just sucks. If you only put in a test condition, I would expect it to just act like a while loop. Instead it throws a null reference exception when compiling the CodeCompileUnit. There is also no break or continue available. So, I just did it the old-fashioned way and used labels and gotos. A loop like this:

while (condition1) {
    // Do stuff
    if (something)
        break;
    if (somethingElse)
        continue;
    // Do more stuff
}

Turns into this:

BeginWhileLoop:
if (condition1 == false)
    goto EndWhileLoop;
// Do stuff
if (something)
    goto EndWhileLoop;
if (somethingElse)
    goto BeginWhileLoop;
// Do more stuff
goto BeginWhileLoop;
EndWhileLoop:
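
In CodeDom terms, the label-and-goto loop is built from CodeLabeledStatement and CodeGotoStatement. The following is only a sketch of the idea, not the code from the library: the method name CountDown and the class GotoLoopSketch are made up for illustration, and the example generates C# source text rather than compiling it.

```csharp
using System;
using System.CodeDom;
using System.CodeDom.Compiler;
using System.IO;
using Microsoft.CSharp;

public static class GotoLoopSketch
{
    // Builds a method that decrements n in a goto-based loop and returns it.
    public static CodeMemberMethod BuildLoopMethod()
    {
        CodeMemberMethod method = new CodeMemberMethod();
        method.Name = "CountDown"; // hypothetical method name
        method.Attributes = MemberAttributes.Public | MemberAttributes.Static;
        method.ReturnType = new CodeTypeReference(typeof(int));
        method.Parameters.Add(
            new CodeParameterDeclarationExpression(typeof(int), "n"));
        CodeArgumentReferenceExpression n = new CodeArgumentReferenceExpression("n");

        // BeginWhileLoop: if (n <= 0) goto EndWhileLoop;
        CodeConditionStatement exitTest = new CodeConditionStatement(
            new CodeBinaryOperatorExpression(
                n, CodeBinaryOperatorType.LessThanOrEqual, new CodePrimitiveExpression(0)),
            new CodeGotoStatement("EndWhileLoop"));
        method.Statements.Add(new CodeLabeledStatement("BeginWhileLoop", exitTest));

        // Loop body: n = n - 1;
        method.Statements.Add(new CodeAssignStatement(n,
            new CodeBinaryOperatorExpression(
                n, CodeBinaryOperatorType.Subtract, new CodePrimitiveExpression(1))));

        // Jump back to the test. A "continue" would be this same goto.
        method.Statements.Add(new CodeGotoStatement("BeginWhileLoop"));

        // A label must be attached to a statement, so the end label
        // carries the return.
        method.Statements.Add(
            new CodeLabeledStatement("EndWhileLoop", new CodeMethodReturnStatement(n)));
        return method;
    }

    // Renders the method as C# source so the result can be inspected.
    public static string GenerateSource()
    {
        using (CSharpCodeProvider provider = new CSharpCodeProvider())
        using (StringWriter writer = new StringWriter())
        {
            provider.GenerateCodeFromMember(
                BuildLoopMethod(), writer, new CodeGeneratorOptions());
            return writer.ToString();
        }
    }
}
```

Printing GotoLoopSketch.GenerateSource() is a quick way to check that the labels and jumps land where you expect before handing the tree to the compiler.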

Conditions were also annoying. There's no switch or else if. One way around this is to use nested if/else statements. A switch like this:

switch (foo) {
    case a: ... break;
    case b: ... break;
    case c: ... break;
    ...
}

Turns into this:

if (foo == a) {}
else {
    if (foo == b) {}
    else {
        if (foo == c) {}
        else {
            ...
        }
    }
}

This kind of code gets really hard to manage. So, I wrote it with labels and gotos instead:

if (foo == a) {
    // do stuff
    goto endofswitch;
}
if (foo == b) {
    // do stuff
    goto endofswitch;
}
if (foo == c) {
    // do stuff
    goto endofswitch;
}
...
endofswitch:
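
The flat chain is also easy to produce in a loop, which is part of why it is easier to manage than the nested form. Here is a rough sketch under my own assumptions: the helper BuildChain, the class GotoSwitchSketch, and the placeholder comment bodies are hypothetical, and the cases are compared as strings.

```csharp
using System;
using System.CodeDom;
using System.CodeDom.Compiler;
using System.IO;
using Microsoft.CSharp;

public static class GotoSwitchSketch
{
    // Emits one "if (subject == value) { body; goto endLabel; }" per case,
    // followed by the end label. Bodies here are placeholder comments.
    public static CodeStatementCollection BuildChain(
        string subjectName, string[] caseValues, string endLabel)
    {
        CodeStatementCollection statements = new CodeStatementCollection();
        CodeVariableReferenceExpression subject =
            new CodeVariableReferenceExpression(subjectName);
        foreach (string value in caseValues)
        {
            CodeConditionStatement test = new CodeConditionStatement(
                new CodeBinaryOperatorExpression(
                    subject,
                    CodeBinaryOperatorType.ValueEquality,
                    new CodePrimitiveExpression(value)));
            test.TrueStatements.Add(new CodeCommentStatement("handle " + value));
            test.TrueStatements.Add(new CodeGotoStatement(endLabel));
            statements.Add(test);
        }
        // The label needs a statement to attach to; a bare return is used here.
        statements.Add(
            new CodeLabeledStatement(endLabel, new CodeMethodReturnStatement()));
        return statements;
    }

    // Renders the chain as C# source so the result can be inspected.
    public static string GenerateSource(string[] caseValues)
    {
        using (CSharpCodeProvider provider = new CSharpCodeProvider())
        using (StringWriter writer = new StringWriter())
        {
            CodeStatementCollection chain =
                BuildChain("propName", caseValues, "endofswitch");
            foreach (CodeStatement statement in chain)
                provider.GenerateCodeFromStatement(
                    statement, writer, new CodeGeneratorOptions());
            return writer.ToString();
        }
    }
}
```

Because every case is a sibling statement rather than a child of the previous else, adding or removing a case never changes the nesting depth.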

I'm not trying to say that everyone should go out and use gotos. There's a reason that modern languages try to hide it and discourage its use. I think when you start writing dynamic types in CodeDom, it's a good time to stop being ignorant about goto and what really goes on behind the scenes.

Switch versus If

While looking for information to help me complete the CodeDom work, I remember coming upon some interesting conversations here at Code Project. People were using the IL disassembler and found out that the C# compiler was actually optimizing switch statements using certain techniques like binary searching and hash tables.

I was interested to see if this would provide any performance improvement in my generated code. My technique of using labels and gotos instead of nested if/else might be improved with some kind of hash or binary search algorithm. So, I replaced one of my if/else blocks with a switch and called it the CodeDomSwitchParser. In case you're wondering, you can create the switch statement as text, put it into a CodeSnippetStatement, and then compile with a C# compiler:

CodeConditionStatement stmtIfAttr = new CodeConditionStatement();
...
StringBuilder sb = new StringBuilder();
sb.Append("switch (propName) {");
sb.Append(Environment.NewLine);
foreach (PropertyInfo pi in type.GetProperties(
   BindingFlags.FlattenHierarchy | BindingFlags.Public 
   | BindingFlags.Instance)) {
   if (pi.CanWrite && pi.PropertyType != typeof(object) 
      && pi.PropertyType.Namespace == "System") {
      sb.Append("case \"");
      sb.Append(pi.Name.ToLower());
      sb.Append("\":");
      ...
   }
   ...
}
sb.Append("}");
stmtIfAttr.TrueStatements.Add(new CodeSnippetStatement(sb.ToString()));

But, as you can see from the final results, the switch's impact on performance was minor and not always helpful.

Regular Expression Goof-up

My initial reason for writing a CodeDom-based parser as opposed to a reflection-based parser was that the parsing was very slow. This was due to a mistake I made when using some regular expressions. Although it is kind of embarrassing, I will describe what I did.

First of all, when I ran my parser against the small document, it took 2.782 seconds. And when I ran it against the large document, it took 586.158 seconds. Yikes!! No user is going to sit there for 10 minutes to wait for their document to load. Well, unless they've used Visio before, then they're used to that sort of thing. I thought, I have to speed this up, and it might be Reflection that's the problem.

So a few days and a lot of difficult coding later, I had come up with a reliable CodeDom-based parser that generated dynamic types to do the parsing. “This should be much faster,” I thought to myself. I ran my tests and the results for that parser on the small and large documents were 3.079 and 580.357 seconds respectively. Huh?

When I saw this I figured I had a performance bottleneck somewhere other than reflection. That's when I found this:

public static string GetSafeName(string s) 
{
    Regex re1 = new Regex(@"[\s\-]+", RegexOptions.Compiled);
    s = re1.Replace(s, "_");
    Regex re2 = new Regex(@"\W", RegexOptions.Compiled);
    s = re2.Replace(s, string.Empty);
    Regex reFirst = new Regex(@"^[^a-zA-Z_]+", RegexOptions.Compiled);
    s = reFirst.Replace(s, string.Empty);
    return s;            
}

The purpose of this code was to convert whatever names came through the XMI into names that could be compiled, which meant stripping out illegal characters and such.

When I wrote the code, I thought that I should turn the RegexOptions.Compiled flag on since that will improve the performance of the regular expressions by pre-compiling them. What I did not realize was that every time this method was being called, a new regular expression was being compiled. This means that a dynamic type is being emitted based on that regular expression. Because of the scope of the variables, I had killed my performance. So, I made them static:

public static string GetSafeName(string s) 
{
    s = _Re1.Replace(s, "_");
    s = _Re2.Replace(s, string.Empty);
    s = _ReFirst.Replace(s, string.Empty);
    return s;            
}
        
private static Regex _Re1 = new Regex(@"[\s\-]+", RegexOptions.Compiled);
private static Regex _Re2 = new Regex(@"\W", RegexOptions.Compiled);
private static Regex _ReFirst = new Regex(@"^[^a-zA-Z_]+",
    RegexOptions.Compiled);

Simply doing this caused my small document parsing time for reflection to go from 2.782 seconds down to 0.0875 seconds! My large document parsing time for reflection went from a horrible 586.158 seconds down to 4.412 seconds! Take this lesson from me: use a code profiler. It will show you which methods take the longest and help you locate stupid stuff like this quickly.

The Parsers

Reflection
This is the original parser. It serves as a benchmark to compare the other parsers against for accuracy.
CodeDom
The first CodeDom parser. It contains no optimizations besides creating the dynamic types.
CodeDom w/ Switch
Same as the CodeDom parser except that one if/else block was replaced with a switch to see how the C# compiler would optimize it.
CodeDom optimized
The CodeDom parser with some optimizations regarding knowing which properties are specified as attributes and more intelligently doing type conversions. Also, some of the checks were removed. There are definitely more optimizations that could be performed though.

Final Results

Here are the timing results for each parser from a bunch of test runs. Each run parses the document 10 times and divides the total by 10 to get an average. All results are expressed in seconds. The tests were run against a debug build but without the debugger attached. The test machine is a 3 GHz Pentium 4 with 1 GB of RAM.
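
The averaging itself takes only a few lines. The sketch below shows one way to do it with Stopwatch. It is an assumption about the shape of the harness, not a copy of the NUnit test in the download, and the names TimingSketch, ParseRun, and AverageSeconds are made up for illustration.

```csharp
using System;
using System.Diagnostics;

// Hypothetical delegate for "parse the document once".
public delegate void ParseRun();

public static class TimingSketch
{
    // Runs the delegate the given number of times and returns the mean
    // wall-clock time per run, in seconds.
    public static double AverageSeconds(ParseRun run, int iterations)
    {
        if (iterations <= 0)
            throw new ArgumentOutOfRangeException("iterations");
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            run();
        watch.Stop();
        return watch.Elapsed.TotalSeconds / iterations;
    }
}
```

A call such as TimingSketch.AverageSeconds(delegate { /* parse here */ }, 10) would then give one cell of the tables below. Note that the first iteration also pays any one-time costs, such as generating the dynamic types, so those are spread across the average.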

Small Document - 70 KB, Version 1.0

Parsing Type	Run 1	Run 2	Run 3	Run 4	Run 5
Reflection	0.065608	0.028126	0.028140	0.053126	0.051563
CodeDom	0.028117	0.028126	0.031267	0.026563	0.028125
CodeDom w/ switch	0.031242	0.029689	0.029704	0.028125	0.028125
CodeDom optimized	0.020307	0.020313	0.020323	0.018750	0.018750

Large Document - 8571 KB, Version 1.2

Parsing Type	Run 1	Run 2	Run 3	Run 4	Run 5
Reflection	2.675894	2.689200	2.729644	2.609441	2.618783
CodeDom	1.184079	1.150058	1.178780	1.067214	1.092201
CodeDom w/ switch	1.168458	1.170372	1.185034	1.082840	1.096889
CodeDom optimized	1.118470	1.118807	1.119372	1.029713	1.051575

Analysis

Why is the reflection on the small document just as fast or faster than the CodeDom version in runs 2 & 3?

The first three runs were done through SharpDevelop, which is an open-source .NET IDE. NUnit is integrated directly into SharpDevelop. My guess is that because I did not recompile, the AppDomain being used by NUnit was still open. Microsoft has put some effort into caching Reflection data, and this is where that shows.

What happened in test runs 4 & 5?

After the first three test runs, I left SharpDevelop and shut off a bunch of programs. I then ran the tests directly through the NUnit GUI.

The tests were run against a file on the hard drive. Windows probably caches it, but disk access could still be a factor, since another program could want the hard drive during the tests. Loading the document into a MemoryStream first did lower the times and make them more stable, but I felt the results were less “real-world”.
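
For anyone who wants to repeat the in-memory variant, a minimal sketch looks like this. The class and method names are hypothetical. The idea is simply to read the bytes once, outside the timed region, and hand each run a fresh reader over them.

```csharp
using System;
using System.IO;
using System.Xml;

public static class InMemorySketch
{
    // Reads the whole file up front so disk access is finished
    // before timing starts.
    public static byte[] LoadBytes(string path)
    {
        return File.ReadAllBytes(path);
    }

    // Wraps the bytes in a fresh reader. Call this once per timed run,
    // since an XmlReader is forward-only and cannot be rewound.
    public static XmlReader OpenReader(byte[] documentBytes)
    {
        return XmlReader.Create(new MemoryStream(documentBytes, false));
    }
}
```

The reader returned by OpenReader can be passed to XmiRoot.Parse in place of the XmlTextReader shown in the next section.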

Switching from Reflection to CodeDom-generated dynamic types did yield a performance advantage. The other optimizations could improve performance in certain situations, but may cause harm in others.

Using the API

There's nothing different about the API since the last article, but I will reiterate. All you have to do is use the XmiRoot class and provide it with an XmlReader. For example, let's grab an XmlTextReader on a file:

XmlTextReader xtr = new XmlTextReader("MyDiagram.xmi");

Now create an instance of XmiRoot and invoke its Parse method:

XmiRoot root = new XmiRoot();
root.Parse(xtr, XmiParserType.Reflection);

The default parser is set in the XmiRoot class as being the Reflection parser. But there are a number of other options:

public enum XmiParserType
{
   Reflection,
   CodeDom,
   CodeDomSwitch,
   CodeDomOptimized,
}

Now, the XmiRoot object contains a representation of the XMI model and can convert that into a CodeCompileUnit:

CodeCompileUnit ccu = root.GetCodeCompileUnit();

Or you can directly create code:

string s = root.GetParsedCode(CodeOutputTypes.CSharp);

As you can see, the API is straightforward. Parsing errors are simply bubbled up as exceptions. If there are nodes that the parser does not recognize, it will skip over them.

In the solution, you'll find a project set up with an NUnit test harness. It has one simple test in it that runs the parsers against version 1.0 and 1.2 documents and times them. The small document mentioned in the article is graphics.xmi which is a version 1.0 document. The large document was not included with the code because it is not mine to give. So, I actually included a version 1.2 document generated from the same CodeCompileUnit that is produced by the CodeDom-based parser. Writing XMI from the library is something I will cover in the next article.

Summary

Reflection does have some performance penalty. Many programmers simply know that there is a penalty, but are not really familiar with how much they can gain by avoiding reflection. In instances where you need to squeeze every little bit of speed out of some code, be wary of using reflection. But also be careful, because replacing reflection could require dynamic types created in script, CodeDom, or Reflection.Emit. These take a lot more time to create, are difficult to debug, and may not provide the huge performance improvements that you're looking for.

The speed advantages I gained by using dynamically created types are significant because they affect the user experience. When everything works fast, the user won't really notice at all. A lot of effort went into making this library capable of parsing an 8 MB XMI document in about a second. The user will never appreciate this. However, they will definitely notice if the response time is greater than 3 seconds. If you think you want to optimize your code for performance, remember three things:

Optimization typically requires a lot of time and effort.
Performance improvements are usually (frustratingly) small.
Users are less appreciative of performance improvements than feature enhancements. Good performance is expected.

History

0.1 : 2006-05-23 : Initial version
XMI versions 1.0 and 1.2 are handled for one namespace with classes, data types, generalizations, associations, multiplicities, class attributes, and class operations. The system uses a Reflection-based parser.
0.2 : 2006-05-30 : Version 0.2
A new set of parsers has been added to improve parsing performance. Particularly, the CodeDom-optimized parser runs the fastest. The reflection-based parser is still needed because the dynamic types are very difficult to debug.