Click here to Skip to main content
13,795,978 members
Click here to Skip to main content
Add your own
alternative version

Tagged as

Stats

16K views
50 downloads
12 bookmarked
Posted 1 May 2015
Licenced CPOL

A Robust CSV Reader

, 1 May 2015
Rate this:
Please Sign up or sign in to vote.
The routines in this library can parse any string that I can throw at it, including Common Name strings read from X.509 Digital Certificates.

Introduction

In the course of writing a program to generate a report about the properties of the contents of a Windows certificate store, I discovered that none of the CSV readers that I had at my disposal could process the unusual CSV strings found in many digital certificates. Turning to The Code Project, I found A Fast CSV Reader, which I eagerly downloaded, built, and tested.

Regretfully, I quickly discovered that it couldn't handle these types of strings, either, and set about to roll yet another CSV reader of my own.

Background

The type of string that broke every CSV reader at which I threw it looks like the following actual example from a production store of Trusted Root Certificates.

CN=RapidSSL CA, O="GeoTrust, Inc.", C=US

Specifically, the middle substring, O="GeoTrust, Inc.", is nonstandard, because the opening quotation mark that is intended to protect the comma in the organization name, GeoTrust, Inc. is the fourth character in the substring. This is by design, since the Common Name string is a series of name-value pairs, as are many other certificate properties.

Although every other CSV library in The Code Project and elsewhere that I reviewed mimics a base class, such as System.IO.StreamReader, I chose instead to adhere to the Single Responsibility Principle, and focus entirely on parsing strings, leaving you free to determine how to acquire and manage them.

Using the Code

The working code is in a class library, which I built against Microsoft .NET Framework 2.0, and it can be used in projects that target any newer version of the framework. For your convenience, I included both debug and release builds of the library and the test program.

The package includes a test program, ConsoleUtility1.exe, which is built from a single source file, Program.cs, and a set of string resources. Since it was intended to prove the robustness of the algorithm, the test program confines itself to the static methods.

Class library WizardWrx.AnyCSV.dll exposes a single class, Parser, which has nine constructor overloads, all of which are optional, and one overloaded static Parse method that does the real work. Since the real worker method is static, and everything it needs can be either passed into the method or allowed to take default values, the only reason to use an instance is when you want to specify non-default values for one or more of the arguments that have corresponding properties on the class in advance, so that you can omit them from the argument list when you call the Parse method in a loop to process a set of strings.

The Parse method has four overloads, all of which return a string [ ] containing the substrings. I shall next summarize the overloads, starting with the fourth, and simplest.

  1. The simplest overload takes two arguments, the string to be parsed and the delimiter, a char<font color="#111111" face="Segoe UI">, pchrDelimiter</font>.
  2. Adding a tad of complexity, the third overload adds a second char argument, pchrProtector, which specifies the character that protects a pchrDelimiter that appears in the middle of a substring, and must be ignored.
  3. A third overload adds penmGuardDisposition, which uses a member of the GuardDisposition enumeration to govern disposition of guard characters. Its valid values are straightforward and self-explanatory: Keep and Strip.
  4. The most complex overload adds the final parameter, penmTrimWhiteSpace, which uses the TrimWhiteSpace enumeration to specify four possible options to dispose of leading and trailing blanks in a substring.
Value Outcome
TrimWhiteSpace
Leave Trim leading white space. This is designed specifically for use with Issuer and Subject fields of X.509 digital certificates.
TrimLeading Trim leading white space. This is designed specifically for use with Issuer and Subject fields of X.509 digital certificates.
TrimTrailing Trim trailing white space. This option is especially useful with CSV files generated by Microsoft Excel, which often have long runs of meaningless white space, especially when a worksheet has blank rows or columns in its UsedRange.
TrimBoth Given that TrimLeading and TrimTrailing are required use cases, trimming both ends is essentially free. This flag is implemented such that it can be logically processed as TrimLeading | TrimTrailing.

One final static method, StandardCSVParse, is dedicated to parsing a true Comma Separated Values string. As such, it has a single string argument, pstrAnyCSV. The delimiter character is a comma, and the guard character is the double quotation mark.

Points of Interest

Everything that counts happens in a single method, the most complex of the four static Parse methods. Everything else, including the instance Parse method, calls upon it, specifying default values for omitted arguments or, in the case of the instance method, the corresponding instance properties.

The Parse method gets its robustness from a simple state machine that uses a pair of simple Boolean variables, fInProgress and fProtectDelimiters, as state variables. The key point is that when fProtectDelimiters is TRUE, indicating that a guard character has been found, but its mate has yet to be found, delimiter characters are ignored. Strings are assembled by appending characters to a StringBuilder that is initialized with a size sufficient to hold a degenerate case string - one that is devoid of delimiters. The objective of reserving such a large amount of memory is that the StringBuilder never needs to expand; hence, it need never move a partially built string to a bigger buffer.

To simplify routine use, the class exposes common delimiter and guard characters as public constants. If you prefer to avoid using raw characters, you can specify them in terms of a pair of enumerations, DelimiterChar and GuardChar. The constants and enumerations are especially useful for specifying guard characters, since quotation mark literals are rather fussy.

Translating the DelimiterChar and GuardChar enumerations is aided by arrays of DelimiterMap and GuardMap structures, which are populated by a static Parser constructor.

The instance properties use s_objSyncLock, a private generic object, to make themselves thread safe. I borrowed this proven concept from other classes that use such an object as part of their implementation of the Singleton design pattern.

Room for Improvement

I anticipate that others will find numerous ways to improve this library. One that occurs to me is the addition of a static StringBuilder that grows, but never shrinks during its lifetime. At a minimum, this would require the Parse method to test for its existence and size, and use the s_objSyncLock object to make itself thread-safe.

History

  • Friday, 01 May 2015 - Initial publication

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Share

About the Author

David A. Gray
Software Developer (Senior)
United States United States
I deliver robust, clean, adaptable, future-ready applications that are properly documented for users and maintainers. I have deep knowledge in multiple technologies and broad familiarity with computer and software technologies of yesterday, today, and tomorrow.

While it isn't perceived as sexy, my focus has always been the back end of the application stack, where data arrives from a multitude of sources, and is converted into reports that express my interpretation of The Fundamental Principle of Tabular Reporting, and are the most visible aspect of the system to senior executives who approve the projects and sign the checks.

While I can design a front end, I prefer to work at the back end, getting data into the system from outside sources, such as other computers, electronic sensors, and so forth, and getting it out of the system, as reports to IDENTIFY and SOLVE problems.

When presented with a problem, I focus on identifying and solving the root problem for the long term.

Specialties: Design: Relational data base design, focusing on reporting; organization and presentation of large document collections such as MSDS libraries

Development: Powerful, imaginative utility programs and scripts for automated systems management and maintenance

Industries: Property management, Employee Health and Safety, Services

Languages: C#, C++, C, VBA, Visual Basic, Perl, WinBatch, SQL, XML, HTML, Javascript

Outside Interests: Great music (mostly, but by no means limited to, classical), viewing and photographing sunsets and clouds, traveling by car on small country roads, attending museum exhibits (fine art, history, science, technology), long walks, especially where there is little or no motor traffic

You may also be interested in...

Comments and Discussions

 
QuestionDownload source Pin
lafffer_11918-May-15 19:54
memberlafffer_11918-May-15 19:54 
AnswerRe: Download source Pin
soho14-Jan-16 0:33
membersoho14-Jan-16 0:33 
QuestionCleanup the code ZIP Pin
Matt T Heffron4-May-15 14:20
professionalMatt T Heffron4-May-15 14:20 
QuestionCertificates store Pin
Dr Gadgit2-May-15 4:54
memberDr Gadgit2-May-15 4:54 
AnswerRe: Certificates store Pin
David A. Gray1-Jun-15 20:05
memberDavid A. Gray1-Jun-15 20:05 
GeneralRe: Certificates store Pin
Dr Gadgit4-Jul-15 5:12
memberDr Gadgit4-Jul-15 5:12 
GeneralRe: Certificates store Pin
Dr Gadgit19-Jul-15 5:19
memberDr Gadgit19-Jul-15 5:19 
AnswerRe: Certificates store Pin
David A. Gray1-Jun-15 20:05
memberDavid A. Gray1-Jun-15 20:05 
QuestionNo Source or Sample? Pin
Gary Noble2-May-15 2:30
memberGary Noble2-May-15 2:30 
AnswerRe: No Source or Sample? Pin
Matt T Heffron4-May-15 14:20
professionalMatt T Heffron4-May-15 14:20 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

Permalink | Advertise | Privacy | Cookies | Terms of Use | Mobile
Web06 | 2.8.181207.3 | Last Updated 1 May 2015
Article Copyright 2015 by David A. Gray
Everything else Copyright © CodeProject, 1999-2018
Layout: fixed | fluid