Legacy file formats, such as UN-EDIFACT with a record per line and fixed-length fields, still exist and are widely used for B2B transactions. A tool that can convert legacy files to human-readable XML might come in handy. The tool I present here converts files that are similar, but not identical, to UN-EDIFACT. The file format in question is used by Payment Business Services (PBS) in Denmark, see http://www.pbs.dk/en/. The tool might not be terribly relevant outside Denmark, but it does show how to validate, search and convert legacy files larger than 100 megabytes to XML in a fairly general manner. So I have decided to place it on CodeProject in spite of the strong local coupling to PBS in Denmark. This tool uses the Arguments class from the article C#/.NET Command Line Arguments Parser, thanks to R. LOPES.
Using the Tool
The tool works like this:
pbs2Xml.exe -s InfoService.xml -i Leverance.xml -o Leverance.xml -f "John Schmidt"
-s is the specification file, which must follow the schema in PbsSpecification.xsd.
-i is the input file in legacy format.
-o is the output file in XML format. It is optional; leave it out when all you want is to validate the legacy file.
-f is a search filter. It is optional and can be handy when dealing with very large files: if you are looking for information regarding a specific SSN, use this option to convert only the records containing that SSN.
- Information service. Information types 100, 150: Pension and 700: LetLøn
- Payment Service Invoicing: 601, section 112
- Payment Service Invoicing: 601, section 117
- Payment Service Payments: 602
Using the Code
I needed a tool to validate files used for business transactions in banking, pension and life insurance and convert them to XML. I also needed a general approach because the business rules for validating data were unclear. Basically I wanted a general parser that could read a legacy file with a record per line, fixed-length fields and a hierarchical record structure like the one in UN-EDIFACT documents. The parser must not know the specifics of the records, fields and validation rules. The specifics must be provided in a specification file so that changing parsing details does not require code changes, but merely changes to an XML file containing the parsing rules.
pbs2xml is just a parser, and a parser of a specific B2B legacy file format, which is only used in Denmark. This sounds like application-specific code, not suited for CodeProject!
Well, maybe not. It does however demonstrate an interesting technique: pulling out all of the business rules for parsing and validating a specific file format from the code and into an XML specification file.
The specification file must follow some ground rules that are common for all B2B files used by Payment Business Services (PBS); these rules are represented by the schema in PbsSpecification.xsd. The overall format is similar to UN-EDIFACT: one record per line with fixed-length fields and a hierarchy of record types.
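To make the idea concrete, here is a minimal sketch of specification-driven parsing of one fixed-length record. It is not the pbs2xml code: the class names FieldSpec and SpecDrivenParser, the element name field, its attributes name, position, length and rule, and the use of a regular expression as the validation rule are all assumptions made for illustration. The real layout is defined by PbsSpecification.xsd.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Xml;

// Hypothetical field definition; the real schema lives in PbsSpecification.xsd.
public sealed class FieldSpec
{
    public string Name { get; }
    public int Position { get; }   // zero-based offset into the line (assumption)
    public int Length { get; }
    public Regex Rule { get; }     // validation rule as a regular expression (assumption)

    public FieldSpec(string name, int position, int length, string pattern)
    {
        Name = name;
        Position = position;
        Length = length;
        Rule = new Regex("^(?:" + pattern + ")$");
    }
}

public static class SpecDrivenParser
{
    // Reads every <field name="" position="" length="" rule=""/> element
    // (illustrative element and attribute names).
    public static List<FieldSpec> LoadFields(string specXml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(specXml);
        var fields = new List<FieldSpec>();
        foreach (XmlElement e in doc.SelectNodes("//field"))
        {
            fields.Add(new FieldSpec(
                e.GetAttribute("name"),
                int.Parse(e.GetAttribute("position")),
                int.Parse(e.GetAttribute("length")),
                e.GetAttribute("rule")));
        }
        return fields;
    }

    // Slices one line according to the field definitions and appends a
    // message to 'errors' for every field that is missing or breaks its rule.
    public static Dictionary<string, string> ParseLine(
        string line, IEnumerable<FieldSpec> fields, List<string> errors)
    {
        var values = new Dictionary<string, string>();
        foreach (var f in fields)
        {
            if (f.Position + f.Length > line.Length)
            {
                errors.Add(f.Name + ": line too short");
                continue;
            }
            string value = line.Substring(f.Position, f.Length);
            if (!f.Rule.IsMatch(value))
                errors.Add(f.Name + ": '" + value + "' violates rule");
            values[f.Name] = value;
        }
        return values;
    }
}
```

The parser code never mentions a concrete record type or field; changing a position, length or rule means editing the XML only, which is the point of the technique.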
The following classes model the entities in the specification schema:
- The field class specifies the position, length and validation rule of a field in a record of fixed-length fields. It carries two flags: one is true if the field is part of what identifies the record, the other is true if the field is not always supplied in the input.
- fileRecord – Contains fields
- Section – Contains a start record, some data records and an end record
- Leverance – Contains Sections
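The start record / data records / end record shape of a section can be checked with a very small state machine. The sketch below is an illustration only, not the pbs2xml implementation: RecordKind and SectionStructure are hypothetical names, and it throws the framework's FormatException where the tool itself has its own exception type.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical classification of a line once its record type has been identified.
public enum RecordKind { Start, Data, End }

public static class SectionStructure
{
    // Returns the number of complete sections. Throws FormatException when the
    // sequence breaks the pattern: start record, data records, end record.
    public static int CountSections(IEnumerable<RecordKind> records)
    {
        bool inSection = false;
        int sections = 0;
        int lineNo = 0;
        foreach (var kind in records)
        {
            lineNo++;
            switch (kind)
            {
                case RecordKind.Start:
                    if (inSection)
                        throw new FormatException("Line " + lineNo + ": nested section start");
                    inSection = true;
                    break;
                case RecordKind.Data:
                    if (!inSection)
                        throw new FormatException("Line " + lineNo + ": data record outside a section");
                    break;
                case RecordKind.End:
                    if (!inSection)
                        throw new FormatException("Line " + lineNo + ": section end without start");
                    inSection = false;
                    sections++;
                    break;
            }
        }
        if (inSection)
            throw new FormatException("Input ended inside a section");
        return sections;
    }
}
```

Because the check only looks at one record at a time, it works for an arbitrary number of sections and does not need the whole file in memory.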
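PbsReader can read and validate an input file given a valid specification:
XmlDocument spec = new XmlDocument();
spec.Load("InfoService.xml"); // assumed here: the specification file passed with -s
Leverance leverance = new Leverance(spec);
PbsReader target = new PbsReader();
// ... read the input file with target ...
foreach (Error error in target.Errors)
    Console.WriteLine(error); // report each collected format error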
If the input file does not honor the ground rules, a PbsFormatException is thrown. Fields with format errors are counted in PbsReader.ErrorCount, and the first 100 errors are accumulated in the collection PbsReader.Errors.
PbsReader is inherited by PbsWriter, which can convert the input file to XML.
PbsReader is inherited by PbsSearcher, which converts a selection of records to XML based on a search filter.
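Two of the behaviours described above are easy to sketch in isolation: counting every error while keeping only the first N, and passing on only the records that contain the search text. The names CappedErrorLog and RecordFilter are hypothetical, and the plain substring match is an assumption about how the filter works; the actual PbsReader and PbsSearcher code may differ.

```csharp
using System;
using System.Collections.Generic;

// Counts every error but stores only the first 'capacity' messages,
// so a badly broken 100 MB file cannot exhaust memory with error text.
public sealed class CappedErrorLog
{
    private readonly int _capacity;
    private readonly List<string> _errors = new List<string>();

    public CappedErrorLog(int capacity) { _capacity = capacity; }

    public int ErrorCount { get; private set; }      // total number of errors seen
    public IReadOnlyList<string> Errors => _errors;  // only the first ones are kept

    public void Add(string message)
    {
        ErrorCount++;
        if (_errors.Count < _capacity) _errors.Add(message);
    }
}

public static class RecordFilter
{
    // Lazily yields the lines containing the filter text.
    // A null or empty filter lets every line through.
    public static IEnumerable<string> Matching(IEnumerable<string> lines, string filter)
    {
        foreach (var line in lines)
        {
            if (string.IsNullOrEmpty(filter) || line.Contains(filter))
                yield return line;
        }
    }
}
```

Fed with File.ReadLines, which reads lazily, RecordFilter.Matching streams through the input one line at a time, so a large file never has to be loaded as a whole.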
Points of Interest
This tool was developed by my colleague Lotte Jensen and me during a programming course with Kent Beck. I learned at least two important things during that course:
- I used to write tests after coding for a while, waiting for the design to stabilize. Now I start by writing the tests before writing the code.
- Curly braces go on a new line after the method name, not on the same line. This is according to Kent Beck's principle of symmetry. I wish he would take his own medicine!
History
- March 2008: Version 1.0
- June 2009: Bug fix - Introduced support for reading an arbitrary number of sections