Click here to Skip to main content
12,395,645 members (81,341 online)
Click here to Skip to main content
Add your own
alternative version

Stats

11.1K views
635 downloads
37 bookmarked
Posted

FileUtilities - a library for reading flat files into POCOs

, 26 Nov 2015 CPOL
Rate this:
Please Sign up or sign in to vote.
A utility to help turn flat files (.csv, tab separated etc.) into type-safe CLR objects with validation using attributes
 

Introduction

Whilst many modern applications exchange data in well defined standard ways (such as XML or JSON) the reality is that a great deal of data is still packaged in flat files (by which I mean comma separated values, or fixed width text files). 

This library is concerned with turning these files into arrays of type-safe objects and using semantic meaning in those objects to validate the files concerned.

Background

The use of attributes to tag properties of your classes is quite common in both Entity Framework and also in System.Runtime.Serialization.  This library very much follows that model but if you are not familiar with it then I recommend reading this article to start with.

Using the code

For a worked example, we'll start with a flat file which contains historical price information about Microsoft shares from Yahoo:

What is desired is to turn this into an IEnumerable of a class called StockPrice that can be declared thus:-

Public Class YahooStockPrice

    Public Property PriceDate As Date

    Public Property OpenPrice As Nullable(Of Decimal)

    Public Property HighPrice As Nullable(Of Decimal)

    Public Property LowPrice As Nullable(Of Decimal)

    Public Property ClosingPrice As Nullable(Of Decimal)

    Public Property Volume As Nullable(Of Decimal)

    Public Property AdjustedClose As Nullable(Of Decimal)

End Class

Field separator and field definition

You specify the field separator at the class level with the RecordFieldSeperator attribute.  There are four possible standard separators (comma, tab, pipe or semicolon) or you can specify your own custom separator as an array of characters. 

Then each field that you want to read is specified by the RecordColumn attribute.  This takes a column header name and the ordinal position of the field.  You do not have to specify every field but (obviously) any attempt to specify two properties at the same column position will throw an error.

Imports FileUtilities

<RecordFieldSeperator(FileUtilities.RecordFieldSeperatorAttribute.StandardSeparators.Comma)>
Public Class YahooStockPrice

  <RecordColumn("Date", 0)>
  Public Property PriceDate As Date

  <RecordColumn("Open", 1)>
  Public Property OpenPrice As Nullable(Of Decimal)

  <RecordColumn("High", 2)>
  Public Property HighPrice As Nullable(Of Decimal)

  <RecordColumn("Low", 3)>
  Public Property LowPrice As Nullable(Of Decimal)

  <RecordColumn("Closing", 4)>
  Public Property ClosingPrice As Nullable(Of Decimal)

  <RecordColumn("Volume", 5)>
  Public Property Volume As Nullable(Of Decimal)

  <RecordColumn("Adj Close", 6)>
  Public Property AdjustedClose As Nullable(Of Decimal)

End Class

Fixed-width files

Another possibility for text files is that they don't have any record separator but instead have implied fields of a given fixed width.  (This is often the case for records coming from mainframe / COBOL style applications.)  In this case you do not need to specify a record separator but each field needs to have the additional information of the length of the field and the field start location specified.

For example if your account number field is defined as teh 5 letters that start from character position # 6 it is specified thus:

<RecordColumn("Account Number", 2, False, 5, 6)>

Excel files

In version 2 there is a reader for Excel 2007 files.  This depends on the ClosedXML library from CodePlex (so I have included the version separately - if you don't need Excel, use version 1)

To specify that a record should be read from a specific worksheet within the file mark the class with a RecordExcelWorksheet attribute:-

<RecordExcelSourceSheet("House Summary")>

If no worksheet is specified then the code will use the first worksheet in the file.  You can specify that each property of your class comes from an explicit cell or from a particular column in the excel sheet:-

<RecordsSpecificExcelCell("Business Date:", RecordColumnExcelCellAttribute.ExcelColumn.I, 4)>

Semantic meaning

When initially read in, every field is just considered to be a string.  In order to turn it into the correct type and apply any data format rules you need to add the appropriate semantic meaning tag.  These are RecordColumnDate, RecordColumnNumber etc.

Imports FileUtilities

<RecordFieldSeperator(FileUtilities.RecordFieldSeperatorAttribute.StandardSeparators.Comma)>
Public Class YahooStockPrice

  <RecordColumn("Date", 0)>
  <RecordColumnDate("M/d/yyyy")>
  Public Property PriceDate As Date

  <RecordColumnNumber()>
  <RecordColumn("Open", 1)>
  Public Property OpenPrice As Nullable(Of Decimal)

  <RecordColumnNumber()>
  <RecordColumn("High", 2)>
  Public Property HighPrice As Nullable(Of Decimal)

  <RecordColumnNumber()>
  <RecordColumn("Low", 3)>
  Public Property LowPrice As Nullable(Of Decimal)

  <RecordColumnNumber()>
  <RecordColumn("Closing", 4)>
  Public Property ClosingPrice As Nullable(Of Decimal)

  <RecordColumnNumber()>
  <RecordColumn("Volume", 5)>
  Public Property Volume As Nullable(Of Decimal)

  <RecordColumnNumber()>
  <RecordColumn("Adj Close", 6)>
  Public Property AdjustedClose As Nullable(Of Decimal)

End Class

Validation

Now that we have some semantic meaning attributes attached to our class we can also specify validation rules to skip over any invalid rows.  These are specified using the RecordColumnValidation attribute, and there are a number of validations built in:

Public Enum ValidationRule
  ''' <summary>
  ''' No validation is applied to this column
  ''' </summary>
  NoValidation = 0
  ''' <summary>
  ''' Record field must not be empty
  ''' </summary>
  NotBlank = 1
  ''' <summary>
  ''' Column data does not contain the column name
  ''' </summary>
  NotColumnName = 2
  ''' <summary>
  ''' Column must contain a number that maps to its RecordColumnNumberAttribute settings
  ''' </summary>
  NumberValidation = 3
  ''' <summary>
  ''' Column must contain a date that maps to its RecordColumnDateAttribute settings
  ''' </summary>
  DateValidation = 4
  ''' <summary>
  ''' Record must equal the CompareTo property
  ''' </summary>
  MustEqual = 5
  ''' <summary>
  ''' Record must not equal the CompareTo property
  ''' </summary>
  MustNotEqual = 6
  ''' <summary>
  ''' The value must match the regular expression setting in the CompareTo property
  ''' </summary>
  RegularEpressionMatch = 7
  ''' <summary>
  ''' Test the value against the assigned instrument identifier attributes
  ''' </summary>
  InstrumentIdentifier = 8
End Enum

 

In the case of the Yahoo file we need to validate that the fields do not match the column name and that they are valid numeric or date data.

In version 2 there is also the concept of marking a validation as "fatal" by setting the optional parameter in the validation attribute constructor.  For exampel if a certain field must be set to the capital letter M only you could use the following validation to enforce that:

<RecordColumnValidation(RecordColumnValidationAttribute.ValidationRule.MustEqual, "M", True)>

Unusual formats

As well as the standard number and date formats provided in the .NET framework there are some unusual cases that can be sent (again suually as output from mainframe systems).  For example the number might have a trailing sign and might have an implied number of decimals.  For these cases the RecordColumnCustomNumber attribute can be used to effectively preprocess the number before converting it.

These special formats are expressed as a Flags based enumerated type as they can sometimes be combined together in one field.

<Flags()>
Public Enum CustomNumberFormatFlags
    ''' <summary>
    ''' There are no custom transforms for this number
    ''' </summary>
    StandardNumber = &H0
    ''' <summary>
    ''' There is an implied decimal precision that is fixed for this field
    ''' </summary>
    FixedDecimalPoint = &H1
    ''' <summary>
    ''' The sign comes at the end of the number (e.g. 12345- means -12345)
    ''' </summary>
    ''' <remarks>
    ''' If no - then assume number is positive
    ''' </remarks>
    TrailingSign = &H2
    ''' <summary>
    ''' Negative numbers are in braces
    ''' </summary>
    ''' <remarks>
    ''' e.g. (123.45) means -123.45
    ''' </remarks>
    NegativeBracketed = &H4
    ''' <summary>
    ''' Number can be followed by a fraction - e.g. "37 1/2" or even "1.85 3/8"
    ''' </summary>
    FractionalSuffix = &H8
    'additional madness can be added here if files contain it
End Enum

Modifying values

Sometimes the data passed in from the file needs to be modified in order to work.  For example some files might contain "NULL" or "n/a" in a field when that field should be specified as blank or a field needs to be trimmed as it is read in.  This is done with the attributes such as ValueSubstitution and ValueTrim.

Culture

If you are receiving a file from a different culture (i.e. for example if it was written by a machine in France and is being imported by a machine in the US) you may need to explicitly denote the culture the file comes from in order to convert numeric and date values.  This is done with the ValidCulture attribute.

Reading the data

The hard work of applying all these attributes and turning out the results is done by the ClassStreamReader class, which is a generic class that takes the POCO type specified with the attributes as input.  It can either read one record at a time with the ReadNext method, or return the complete set of record with the ReadToEnd method.

For example to iterate through the Yahoo file specified you could do:-

Dim reader As New ClassStreamReader(GetType(YahooStockPrice), _
             New System.IO.FileInfo("table.csv"))

For Each rowdata As YahooStockPrice In reader.ReadToEnd()
   Debug.WriteLine(rowdata.PriceDate.ToShortDateString() & _
       " closed at " & rowdata.AdjustedClose.ToString())
Next

If your data comes from an Excel file (per vserion 2 of the code) use a ClassExcelReader in place of a ClassStreamReader.

Making use of the semantic meaning

One of the powerful things that comes out of defining the semantic meaning of the fields in your record is that you can validate a given row against that record and - in some special circumstances - even fix up "defective" input files.

The validation is performed by the method GetValidationWarnings of the class  ClassColumnMapping.  This tries the passed in string against the decorated .NET class and returns a list of 0 or more warnings about the data.

History

2015-03-26 Initial version

2015-03-30 Additional detail on how to do fixed width fields and the more unusual data formats

2015-11-26 Added an Excel reader and the concept of fatal/non fatal validations

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Share

About the Author

Duncan Edwards Jones
Software Developer (Senior)
Ireland Ireland
C# / SQL Server developer
Microsoft MVP 2006, 2007
Visual Basic .NET

You may also be interested in...

Comments and Discussions

 
QuestionAh! The piece I was missing. Pin
warren brown13-May-15 11:35
memberwarren brown13-May-15 11:35 
AnswerRe: Ah! The piece I was missing. Pin
Duncan Edwards Jones13-May-15 11:37
professionalDuncan Edwards Jones13-May-15 11:37 
QuestionRight, I had tried that, but ... Pin
warren brown13-May-15 11:27
memberwarren brown13-May-15 11:27 
AnswerRe: Right, I had tried that, but ... Pin
Duncan Edwards Jones13-May-15 11:31
professionalDuncan Edwards Jones13-May-15 11:31 
QuestionVery interesting, but having problems with constructor? Pin
warren brown13-May-15 7:22
memberwarren brown13-May-15 7:22 
AnswerRe: Very interesting, but having problems with constructor? Pin
Duncan Edwards Jones13-May-15 11:01
professionalDuncan Edwards Jones13-May-15 11:01 
QuestionNice concept Pin
Member 114240806-Apr-15 11:01
memberMember 114240806-Apr-15 11:01 
AnswerRe: Nice concept Pin
Duncan Edwards Jones6-Apr-15 20:43
professionalDuncan Edwards Jones6-Apr-15 20:43 
GeneralRe: Nice concept Pin
Member 114240807-Apr-15 14:53
memberMember 114240807-Apr-15 14:53 
QuestionThanks! Pin
Carl Edwards In SA27-Mar-15 18:15
memberCarl Edwards In SA27-Mar-15 18:15 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

| Advertise | Privacy | Terms of Use | Mobile
Web02 | 2.8.160721.1 | Last Updated 26 Nov 2015
Article Copyright 2015 by Duncan Edwards Jones
Everything else Copyright © CodeProject, 1999-2016
Layout: fixed | fluid