FileUtilities - A Library for Reading Flat Files into POCOs






4.92/5 (14 votes)
A utility to help turn flat files (.csv, tab separated, etc.) into type-safe CLR objects with validation using attributes
Introduction
Whilst many modern applications exchange data in well defined standard ways (such as XML or JSON), the reality is that a great deal of data is still packaged in flat files (by which I mean comma separated values, or fixed width text files).
This library is concerned with turning these files into arrays of type-safe objects and using semantic meaning in those objects to validate the files concerned.
Background
The use of attributes to tag properties of your classes is quite common in both Entity Framework and also in System.Runtime.Serialization
. This library very much follows that model but if you are not familiar with it then I recommend reading this article to start with.
Using the Code
For a worked example, we'll start with a flat file which contains historical price information about Microsoft shares from Yahoo:
What is desired is to turn this into an IEnumerable
of a class called StockPrice
that can be declared thus:
Public Class YahooStockPrice
Public Property PriceDate As Date
Public Property OpenPrice As Nullable(Of Decimal)
Public Property HighPrice As Nullable(Of Decimal)
Public Property LowPrice As Nullable(Of Decimal)
Public Property ClosingPrice As Nullable(Of Decimal)
Public Property Volume As Nullable(Of Decimal)
Public Property AdjustedClose As Nullable(Of Decimal)
End Class
Field Separator and Field Definition
You specify the field separator at the class level with the RecordFieldSeperator
attribute. There are four possible standard separators (comma, tab, pipe or semicolon) or you can specify your own custom separator as an array of characters.
Then each field that you want to read is specified by the RecordColumn attribute. This takes a column header name and the ordinal position of the field. You do not have to specify every field but (obviously) any attempt to specify two properties at the same column position will throw an error.
Imports FileUtilities
<RecordFieldSeperator(FileUtilities.RecordFieldSeperatorAttribute.StandardSeparators.Comma)>
Public Class YahooStockPrice
<RecordColumn("Date", 0)>
Public Property PriceDate As Date
<RecordColumn("Open", 1)>
Public Property OpenPrice As Nullable(Of Decimal)
<RecordColumn("High", 2)>
Public Property HighPrice As Nullable(Of Decimal)
<RecordColumn("Low", 3)>
Public Property LowPrice As Nullable(Of Decimal)
<RecordColumn("Closing", 4)>
Public Property ClosingPrice As Nullable(Of Decimal)
<RecordColumn("Volume", 5)>
Public Property Volume As Nullable(Of Decimal)
<RecordColumn("Adj Close", 6)>
Public Property AdjustedClose As Nullable(Of Decimal)
End Class
Fixed-width Files
Another possibility for text files is that they don't have any record separator but instead have implied fields of a given fixed width. (This is often the case for records coming from mainframe / COBOL style applications.) In this case, you do not need to specify a record separator but each field needs to have the additional information of the length of the field and the field start location specified.
For example, if your account number field is defined as the 5 letters that start from character position # 6, it is specified thus:
<RecordColumn("Account Number", 2, False, 5, 6)>
Excel Files
In version 2, there is a reader for Excel 2007 files. This depends on the ClosedXML library from CodePlex (so I have included the version separately - if you don't need Excel, use version 1).
To specify that a record should be read from a specific worksheet within the file, mark the class with a RecordExcelWorksheet
attribute:
<RecordExcelSourceSheet("House Summary")>
If no worksheet is specified, then the code will use the first worksheet in the file. You can specify that each property of your class comes from an explicit cell or from a particular column in the Excel sheet:
<RecordsSpecificExcelCell("Business Date:", RecordColumnExcelCellAttribute.ExcelColumn.I, 4)>
Semantic Meaning
When initially read in, every field is just considered to be a string
. In order to turn it into the correct type and apply any data format rules, you need to add the appropriate semantic meaning tag. These are RecordColumnDate
, RecordColumnNumber
, etc.
Imports FileUtilities
<RecordFieldSeperator(FileUtilities.RecordFieldSeperatorAttribute.StandardSeparators.Comma)>
Public Class YahooStockPrice
<RecordColumn("Date", 0)>
<RecordColumnDate("M/d/yyyy")>
Public Property PriceDate As Date
<RecordColumnNumber()>
<RecordColumn("Open", 1)>
Public Property OpenPrice As Nullable(Of Decimal)
<RecordColumnNumber()>
<RecordColumn("High", 2)>
Public Property HighPrice As Nullable(Of Decimal)
<RecordColumnNumber()>
<RecordColumn("Low", 3)>
Public Property LowPrice As Nullable(Of Decimal)
<RecordColumnNumber()>
<RecordColumn("Closing", 4)>
Public Property ClosingPrice As Nullable(Of Decimal)
<RecordColumnNumber()>
<RecordColumn("Volume", 5)>
Public Property Volume As Nullable(Of Decimal)
<RecordColumnNumber()>
<RecordColumn("Adj Close", 6)>
Public Property AdjustedClose As Nullable(Of Decimal)
End Class
Validation
Now that we have some semantic meaning attributes attached to our class, we can also specify validation rules to skip over any invalid rows. These are specified using the RecordColumnValidation
attribute, and there are a number of validations built in:
Public Enum ValidationRule
''' <summary>
''' No validation is applied to this column
''' </summary>
NoValidation = 0
''' <summary>
''' Record field must not be empty
''' </summary>
NotBlank = 1
''' <summary>
''' Column data does not contain the column name
''' </summary>
NotColumnName = 2
''' <summary>
''' Column must contain a number that maps to its RecordColumnNumberAttribute settings
''' </summary>
NumberValidation = 3
''' <summary>
''' Column must contain a date that maps to its RecordColumnDateAttribute settings
''' </summary>
DateValidation = 4
''' <summary>
''' Record must equal the CompareTo property
''' </summary>
MustEqual = 5
''' <summary>
''' Record must not equal the CompareTo property
''' </summary>
MustNotEqual = 6
''' <summary>
''' The value must match the regular expression setting in the CompareTo property
''' </summary>
RegularEpressionMatch = 7
''' <summary>
''' Test the value against the assigned instrument identifier attributes
''' </summary>
InstrumentIdentifier = 8
End Enum
In the case of the Yahoo file, we need to validate that the fields do not match the column name and that they are valid numeric or date data.
In version 2, there is also the concept of marking a validation as "fatal" by setting the optional parameter in the validation attribute constructor. For example, if a certain field must be set to the capital letter M
only, you could use the following validation to enforce that:
<RecordColumnValidation(RecordColumnValidationAttribute.ValidationRule.MustEqual, "M", True)>
Unusual Formats
As well as the standard number and date formats provided in the .NET framework, there are some unusual cases that can be sent (again usually as output from mainframe systems). For example, the number might have a trailing sign and might have an implied number of decimals. For these cases, the RecordColumnCustomNumber
attribute can be used to effectively preprocess the number before converting it.
These special formats are expressed as a Flags
based enumerated type as they can sometimes be combined together in one field.
<Flags()>
Public Enum CustomNumberFormatFlags
''' <summary>
''' There are no custom transforms for this number
''' </summary>
StandardNumber = &H0
''' <summary>
''' There is an implied decimal precision that is fixed for this field
''' </summary>
FixedDecimalPoint = &H1
''' <summary>
''' The sign comes at the end of the number (e.g. 12345- means -12345)
''' </summary>
''' <remarks>
''' If no - then assume number is positive
''' </remarks>
TrailingSign = &H2
''' <summary>
''' Negative numbers are in braces
''' </summary>
''' <remarks>
''' e.g. (123.45) means -123.45
''' </remarks>
NegativeBracketed = &H4
''' <summary>
''' Number can be followed by a fraction - e.g. "37 1/2" or even "1.85 3/8"
''' </summary>
FractionalSuffix = &H8
'additional madness can be added here if files contain it
End Enum
Modifying Values
Sometimes, the data passed in from the file needs to be modified in order to work. For example, some files might contain "NULL
" or "n/a
" in a field when that field should be specified as blank or a field needs to be trimmed as it is read in. This is done with the attributes such as ValueSubstitution
and ValueTrim
.
Culture
If you are receiving a file from a different culture (i.e., for example if it was written by a machine in France and is being imported by a machine in the US), you may need to explicitly denote the culture the file comes from in order to convert numeric and date values. This is done with the ValidCulture
attribute.
Reading the Data
The hard work of applying all these attributes and turning out the results is done by the ClassStreamReader
class, which is a generic class that takes the POCO type specified with the attributes as input. It can either read one record at a time with the ReadNext
method, or return the complete set of record with the ReadToEnd
method.
For example, to iterate through the Yahoo file specified, you could do:
Dim reader As New ClassStreamReader(GetType(YahooStockPrice), _
New System.IO.FileInfo("table.csv"))
For Each rowdata As YahooStockPrice In reader.ReadToEnd()
Debug.WriteLine(rowdata.PriceDate.ToShortDateString() & _
" closed at " & rowdata.AdjustedClose.ToString())
Next
If your data comes from an Excel file (per version 2 of the code), use a ClassExcelReader
in place of a ClassStreamReader
.
Making Use of the Semantic Meaning
One of the powerful things that comes out of defining the semantic meaning of the fields in your record is that you can validate a given row against that record and - in some special circumstances - even fix up "defective" input files.
The validation is performed by the method GetValidationWarnings
of the class ClassColumnMapping
. This tries the passed in string
against the decorated .NET class and returns a list of 0 or more warnings about the data.
History
- 2015-03-26: Initial version
- 2015-03-30: Additional details on how to do fixed width fields and the more unusual data formats
- 2015-11-26: Added an Excel reader and the concept of fatal/non fatal validations