Originally posted on: http://oofgeek.com/archive/2014/01/20/complex-xml-schemas-how-to-simplify.aspx
[Sample code is here: HIPAA Schema Simplification for the BizTalk Server application]
XML Schemas are used for two main tasks:
- processing XML documents (validating and transforming them);
- defining domain-specific standards.
XML Schemas and Domain Standards
Let's talk about the domain standards: EDI, RosettaNet, NIEM, ebXML, Global Justice XML Data Model, SWIFT, OpenTravel, Maritime Data Standards, HIPAA, HL7, and so on. If we look at those standards, we see that the schemas capture the domain knowledge in a form that can be formally and officially validated. [I discussed those standards in more detail in this article: Domain Standards and Integration Architecture] XML Schemas are very helpful for such tasks.
Compare standards defined as XML Schemas with standards defined as text documents. If the standard is defined only in a text document, it is almost impossible to verify whether the data satisfies it. If the standard is defined as XML Schemas, the data can be validated, and validated automatically.
Domain specialists use XML Schemas to define standards in an unambiguous, machine-verifiable format.
Those schemas tend to be huge and very detailed, and for very good reasons.
But when we use XML Schemas for the first task, processing XML documents in our programs, we need something different: small schemas. In system integration we need small schemas.
I want to emphasize this. If you work with a hundred values and use a schema with a thousand nodes, something is completely wrong. It smells, and it contaminates all your code. You don't want to know what programmers call this type of code.
Most applications don't need the full abundance of the HIPAA schemas. We need only the small portion of the schema that validates or transforms the part of the document that matters to the application.
Instead, we load megabyte-sized schemas and build maps for them, and it takes forever and consumes huge amounts of CPU and memory.
In most integration projects we don't want to validate that the data satisfies the standard. We want to transfer data between systems as fast as possible with minimal development effort.
How do we work with those rich schemas? How do we make our integration fast, both at run time and in development?
First we have to decide whether our application requires the whole schemas.
If the answer is "No", read on for the solution.
How to Simplify?
The solution is to simplify the schema: cut out all unused parts of it.
The first step in our simplification is to decide which parts of the original schema we want to pass on, that is, map to another schema. We keep these parts unchanged and simplify all the other, unnecessary parts.
The second step is to find out whether the target system validates its input data. A good system usually does. Validation includes the data format (is this field an integer or a date, does it match a regular expression?), the data range (is this string too long, is this integer too big?), the encoding (does this value belong to the code table?), and so on.
If the target system performs this validation, it makes no sense for us to repeat it on the integration layer. We just pass the data to the target system without any validation. Let that system validate the data and decide what to do with errors: send them back to the source system, try to repair the data, or something else. In fact, it is poor architecture for an intermediary (our integration system) to make such validations and decisions. It spreads business logic across systems, with the target system delegating its data validation logic to the intermediary. The integration system should deal with data validation only when it is needed.
Example: HIPAA Schema Simplification in the BizTalk Server
Now let's get more technical. The following example uses BizTalk Server and the HIPAA schemas, but you can apply the same principles to other systems and standards.
The first step in the schema simplification is the structural modification. It is pretty simple: we replace the unused schema parts with <any> elements [http://www.w3.org/TR/xmlschema-0/#any]. If we still want to map such a schema part, but without any details, we can use the Mass Copy functoid.
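The structural step can also be scripted. The following is only a minimal sketch in Python, not the procedure used in the sample project: the schema fragment and the element names (Message, Claim, Provider) are hypothetical, and real BizTalk schemas carry annotations that this sketch does not preserve. It replaces the content model of one unused element with a single <xs:any> wildcard.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xs", XS)  # keep the familiar xs: prefix on output

# Hypothetical, heavily shortened schema: assume only "Claim" is needed downstream.
SCHEMA = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Message">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Claim" type="xs:string"/>
        <xs:element name="Provider">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="Name" type="xs:string"/>
              <xs:element name="Address" type="xs:string"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""


def collapse_to_any(root, element_name):
    """Replace the content of every xs:element with this name by one xs:any."""
    count = 0
    # Take a snapshot of the matches first, because the loop modifies the tree.
    for el in list(root.iter(f"{{{XS}}}element")):
        if el.get("name") != element_name:
            continue
        for child in list(el):
            el.remove(child)          # drop the detailed content model
        el.attrib.pop("type", None)   # drop a named type reference, if any
        complex_type = ET.SubElement(el, f"{{{XS}}}complexType")
        sequence = ET.SubElement(complex_type, f"{{{XS}}}sequence")
        ET.SubElement(sequence, f"{{{XS}}}any", {
            "processContents": "skip",  # do not validate the wildcard content
            "minOccurs": "0",
            "maxOccurs": "unbounded",
        })
        count += 1
    return count


root = ET.fromstring(SCHEMA)
replaced = collapse_to_any(root, "Provider")
simplified = ET.tostring(root, encoding="unicode")
print(simplified)
```

With processContents="skip" the content under Provider is accepted without validation, which matches the idea of passing that part through untouched.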
The second part of the schema simplification is the type simplification.
For the HIPAA schemas I use these regex replacements:
Open your schema in XML (Text) Editor mode.
Press Ctrl-Shift-H (Find and Replace in Files) and check the “Use Regular Expressions” option.
Make two replacements:
- type="X12_.*" --> type="xs:string"
- <xs:restriction base="X12_.*">.*\n.*\n.*\n.*</xs:restriction> --> <xs:restriction base="xs:string"/>
Save and close.
Open the schema again in the Schema Editor, make any small change, and undo it. The Editor recalculates the type information and pops up the Clean Up Global Data Types window. Check all types and click OK.
This cleans up all unused Global Data Types.
Earlier we replaced every reference to those types with the “xs:string” type, so they are not used anymore.
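The two replacements can also be applied outside Visual Studio. Here is a minimal sketch in Python, assuming a schema fragment whose type names (X12_ID_98, X12_AN) are made up for illustration. Note two deliberate differences from the patterns listed above: it uses [^"]* instead of .* so that a match cannot run past the closing quote, and it matches a restriction body of any length instead of exactly four lines. It does not remove the now unused global types; that cleanup still happens in the Schema Editor.

```python
import re

# Hypothetical fragment in the style of a HIPAA schema; the type names are made up.
SCHEMA = '''\
<xs:element name="NM101" type="X12_ID_98" />
<xs:element name="NM103">
  <xs:simpleType>
    <xs:restriction base="X12_AN">
      <xs:minLength value="1" />
      <xs:maxLength value="60" />
    </xs:restriction>
  </xs:simpleType>
</xs:element>
<xs:element name="Count" type="xs:int" />
'''

# Replacement 1: any reference to an X12_* type becomes xs:string.
TYPE_ATTR = re.compile(r'type="X12_[^"]*"')

# Replacement 2: a restriction based on an X12_* type, together with its
# facets, becomes an empty restriction based on xs:string.
RESTRICTION = re.compile(
    r'<xs:restriction base="X12_[^"]*">.*?</xs:restriction>',
    re.DOTALL,  # let "." cross line breaks; ".*?" stops at the first closing tag
)


def simplify_types(text):
    """Apply both type-simplification replacements to the schema text."""
    text = TYPE_ATTR.sub('type="xs:string"', text)
    text = RESTRICTION.sub('<xs:restriction base="xs:string"/>', text)
    return text


simplified = simplify_types(SCHEMA)
print(simplified)
```

As with the manual replacement, the length facets disappear together with the X12 base type, so the simplified schema no longer checks field lengths.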
This replacement takes 5 minutes. What is the result?
The modified schema is half the size of the original.
- is the dll size with original schema.
- is dll size with modified schema.
The assembly for the modified schema is also half the size.
Not a bad result for a 5-minute job.
How do these simplified schemas change performance?
All projects with schemas and maps compile in Visual Studio noticeably faster. As a developer, I like this improvement.
How about the run-time performance?
I made a simple proof-of-concept project to check the performance changes.
The solution consists of two BizTalk applications and two BizTalk Visual Studio projects. Do not do this in production projects! One Visual Studio solution should hold one and exactly one BizTalk application.
Each project contains one HIPAA schema, one very simple schema, one “complex” map (HIPAA schema to HIPAA schema), and one simple map (HIPAA schema to the very simple schema).
The first project works with the original HIPAA schema and the second with the simplified HIPAA schema.
Build and deploy one project.
Each BizTalk application consists of a file receive location and a file send port. The receive location uses the EdiReceive pipeline to convert the text EDI documents into XML documents, so we need to add a reference to the “BizTalk EDI Application”.
After deployment, import the binding file, which you will find in the project folder. Create the In and Out folders and grant the necessary permissions on them. Change the folder paths in the receive location and the send port to point to your folders.
There is also a UnitTests project with several unit tests. Change the folder paths in the test code.
Then delete the application, deploy the second BizTalk project, and run the tests again.
Do not deploy both projects side by side.
Note: Before each test, run the Backup BizTalk job to clean up the MessageBox a little.
Tests with 1, 10, and 100 messages did not show a visible difference. In my environment the difference became noticeable in the 1000-message and 3K-message batch tests. The table above shows the results of the 3K batch tests.
The performance gain is about 10%. It is not breathtaking, but it is not bad for 5 minutes of effort.
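The unit tests from the sample project are not reproduced here. As a rough illustration only, a batch timing of this kind could be sketched in Python as below. The function name time_batch, the file naming, and the folder paths are hypothetical, and the sketch assumes that the send port writes exactly one file to the Out folder for each file dropped into the In folder.

```python
import shutil
import time
from pathlib import Path


def count_files(folder):
    """Number of regular files currently in the folder."""
    return sum(1 for p in folder.iterdir() if p.is_file())


def time_batch(sample, in_dir, out_dir, count, timeout=600.0, poll=0.5):
    """Drop `count` copies of `sample` into in_dir and return the seconds
    elapsed until out_dir holds `count` more files than before."""
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    before = count_files(out_dir)
    start = time.perf_counter()
    for i in range(count):
        shutil.copy(sample, in_dir / f"msg_{i:05d}.txt")
    while True:
        done = count_files(out_dir) - before
        if done >= count:
            return time.perf_counter() - start
        if time.perf_counter() - start > timeout:
            raise TimeoutError(f"only {done} of {count} files arrived")
        time.sleep(poll)  # poll the Out folder instead of busy-waiting


# Example call with hypothetical paths:
# elapsed = time_batch(r"C:\Tests\sample.txt", r"C:\Tests\In", r"C:\Tests\Out", count=3000)
```

Running the same call against each deployed application, with the same sample file and batch size, gives two elapsed times that can be compared directly.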
Conclusion: The schema type simplification is worth doing if the application expects sustained high loads or high peak loads, and anywhere you want the best possible performance.