A powerful CSV document wrapper library





5.00/5 (2 votes)
I need a tool for exchange the data between my .NET program and R program, yeah this powerful library makes things better!
Download DocumentFormat.Csv-noexe.zip
Download DocumentFormat.Csv.zip
Why chose csv?
- First, csv document just a formatted plant text file that can be easily read and write by any program.
- Secondly, if we just want a data file but not a well formatted Excel file with the extract style data, then the csv document is the best choice for the cross-platform software as the MS Office Excel will not works on LINUX and the .NET program on LINUX mono environment is not easily using the Excel library C++ interface from the OpenOffice without any modification of our source code.
- Third, as a biological researcher, the R program is our most use experiment data statics software, and when we develop a .NET program interact with R on LINUX/Windows via RDotNET, the csv document is the best choice for the data exchange between the R and the tools that I’ve develop. It’s convenient using Visual Basic operate the R to create a complex object in R using RDotNET via exchange the data in csv:
Call DataCollection.SaveTo(CsvPath)
R = R << Push(String.Format(“ExperimentData <- read.csv(“”{0}””)”, CsvPath))
R operation statements, balabalabala....
Picture1. The advantages of csv document
Important Components in this library
Implementation of Csv file object
CsvFile Object
The CsvFile object instance stands for a CSV file in the file system. Create a csv file object in your code just easy:
Dim CsvFile As Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.File = Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.File.Load(Path:=CsvPath)
The Load function in the Csv.File class define shows below:
Public Shared Function Load(Path As String, Optional encoding As System.Text.Encoding = Nothing) As File If encoding Is Nothing Then encoding = System.Text.Encoding.Default End If Dim Csv = CType(System.IO.File.ReadAllLines(Path, encoding:=encoding), File) Csv.File = Path Return Csv End Function
The save method for save the data in the csv object into the filesystem.
''' <summary>
'''
''' </summary>
''' <param name="Path"></param>
''' <param name="LazySaved">Optional, this is for the consideration of performance and memory consumption.
''' When a data file is very large, then you may encounter a out of memory exception on a 32 bit platform,
''' then you should set this parameter to True to avoid this problem. Defualt is False for have a better
''' performance.
''' (当估计到文件的数据量很大的时候,请使用本参数,以避免内存溢出致使应用程序崩溃,默认为False,不开启缓存)
''' </param>
''' <remarks>当目标保存路径不存在的时候,会自动创建文件夹</remarks>
Public Sub Save(Optional Path As String = "", Optional LazySaved As Boolean = False, Optional encoding As System.Text.Encoding = Nothing)
There is an encoding optional parameter; this is for the consideration of avoid the messy code for Chinese character on different operation system, as an example is that Windows is using ANSI as default but the LINUX is using Unicode as default.
Here is the major data structure definition of the Csv.File class:
''' <summary>
''' A comma character seperate table file that can be read and write in the EXCEL.(一个能够被Excel程序所读取的表格文件)
''' </summary>
''' <remarks></remarks>
Public Class File : Implements Generic.IEnumerable(Of Csv.File.Row)
Implements Generic.IList(Of DocumentFormat.Csv.File.Row)
''' <summary>
''' First line in the table is the column name definition line.
''' </summary>
''' <remarks></remarks>
Protected Friend _InnerListTable As List(Of Row) = New List(Of Row)
And some useful method which are defined in this Csv.File class for data manipulation:
''' <summary>
''' Add a data row collection into this Csv file object instance and then return the total row number after the add operation.
''' (向CSV文件之中批量添加行记录,之后返回当前所打开的文件在添加纪录之后的总行数)
''' </summary>
''' <param name="RowCollection"></param>
''' <returns></returns>
''' <remarks></remarks>
Public Function AppendRange(RowCollection As Generic.IEnumerable(Of Row)) As Long
''' <summary>
''' Get a data row in the specific row number, when a row is not exists in current csv file then the function will return a empty row.
''' (当目标对象不存在的时候,会返回一个空行对象)
''' </summary>
''' <param name="line"></param>
''' <returns></returns>
''' <remarks></remarks>
Public Function Get_DataRow(line As Integer) As Row
''' <summary>
''' Using the content in a specific column as the target for search using a specific keyword, and then return all of the rows that have the query keyword.
''' (以指定的列中的内容搜索关键词,并返回检索成功的行的集合)
''' </summary>
''' <param name="KeyWord"></param>
''' <param name="Column"></param>
''' <returns></returns>
''' <remarks></remarks>
Public Function FindAtColumn(KeyWord As String, Column As Integer) As Row()
''' <summary>
''' Get and set the string content in a specific table cell.(设置或者获取某一个指定的单元格中的字符串内容)
''' </summary>
''' <param name="X"></param>
''' <param name="Y"></param>
''' <value></value>
''' <returns></returns>
''' <remarks></remarks>
Public Property Cell(X As Integer, Y As Integer) As String
''' <summary>
''' Delete all of the row that meet the delete condition.(将所有满足条件的行进行删除)
''' </summary>
''' <param name="Condition"></param>
''' <remarks></remarks>
Public Function Remove(Condition As Func(Of Row, Boolean)) As Row()
''' <summary>
''' Remove the item in a specific row collection.
''' </summary>
''' <param name="RowCollection"></param>
''' <remarks></remarks>
Public Sub RemoveRange(RowCollection As Generic.IEnumerable(Of Row))
''' <summary>
''' Insert a new empty line of row data before the specific row number.(在指定列标号的列之前插入一行空列)
''' </summary>
''' <param name="column"></param>
''' <remarks></remarks>
Public Function InsertEmptyColumnBefore(column As Integer) As Integer
Row Object
''' <summary>
''' A line of data in the csv file.(Csv表格文件之中的一行)
''' </summary>
''' <remarks></remarks>
Public Class Row : Implements Generic.IEnumerable(Of String)
Implements Generic.IList(Of System.String)
''' <summary>
''' 本行对象中的所有的单元格的数据集合
''' </summary>
''' <remarks></remarks>
Protected Friend _InnerColumnList As List(Of String) = New List(Of String)
How to parse the data string into a row object
As we can see, the row object is major consist of a string collection, so that we just need to split the string into a string collection, and a magical regular expression makes this work easy:
''' <summary>
''' A regex expression string that use for split the line text.
''' </summary>
''' <remarks></remarks>
Protected Friend Const SplitRegxExpression As String = "[" & vbTab & ",](?=(?:[^""]|""[^""]*"")*$)"
And the detail of the code implementation of the parsing job:
''' <summary>
''' Row parsing into column tokens
''' </summary>
''' <param name="Line"></param>
''' <returns></returns>
''' <remarks></remarks>
Public Shared Widening Operator CType(Line As String) As Row
Dim Row = Regex.Split(Line, SplitRegxExpression)
For i As Integer = 0 To Row.Count - 1
If Not String.IsNullOrEmpty(Row(i)) Then
If Row(i).First = """"c AndAlso Row(i).Last = """"c Then
Row(i) = Mid(Row(i), 2, Len(Row(i)) - 2)
End If
End If
Next
Return Row
End Operator
And some major method for the row data manipulation:
''' <summary>
''' Get the cell data in a specific column number. if the column is not exists in this row then will return a empty string.
''' (获取某一列中的数据,若该列不存在则返回空字符串)
''' </summary>
''' <param name="Index"></param>
''' <returns></returns>
''' <remarks></remarks>
Default Public Property Column(Index As Integer) As String Implements IList(Of String).Item
''' <summary>
''' is this row object contains any data?
''' </summary>
''' <value></value>
''' <returns></returns>
''' <remarks></remarks>
Public ReadOnly Property IsNullOrEmpty As Boolean
''' <summary>
''' insert the data into a spercific column
''' </summary>
''' <param name="value"></param>
''' <param name="column"></param>
''' <returns>仅为LINQ查询使用的一个无意义的值</returns>
''' <remarks></remarks>
Public Function InsertAt(value As String, column As Integer) As Integer
''' <summary>
''' Takes the data in the specific number of columns, if columns is not exists in this row object, then a part of returned data will be the empty string.
''' </summary>
''' <param name="Count"></param>
''' <returns></returns>
''' <remarks></remarks>
Public Function Takes(Count As Integer) As String()
''' <summary>
''' Takes the data in the specific column index collection, if the column is not exists in the row object, then a part of the returned data will be the empty string.
''' </summary>
''' <param name="Cols"></param>
''' <param name="retNullable">(当不存在数据的时候是否返回空字符串,默认返回空字符串)</param>
''' <returns></returns>
''' <remarks></remarks>
Public Function Takes(Cols As Integer(), Optional retNullable As Boolean = True) As String()
Public Function AddRange(values As Generic.IEnumerable(Of String)) As Integer
Public Sub Add(columnValue As String) Implements ICollection(Of String).Add
Imports operations
As the csv file format is using the comma character to separate the cell data, but some text file is not using the comma as delimiter, it may use TAB or several space characters, then we can imports the data in the target text file to create a csv file object, just like the Excel application data imports operation does.
There are two types of data imports operation are define in the DataImports module:
''' <summary>
''' Imports the data in a well formatted text file using a specific delimiter, default delimiter is comma character.
''' </summary>
''' <param name="txtPath">The file path for the data imports text file.(将要进行数据导入的文本文件)</param>
''' <returns></returns>
''' <remarks></remarks>
Public Function [Imports](txtPath As String, Optional delimiter As Char = ","c, Optional encoding As System.Text.Encoding = Nothing) As Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.File
Dim Expression As String = String.Format(SplitRegxExpression, delimiter)
If encoding Is Nothing Then
encoding = System.Text.Encoding.Default
End If
Dim Lines As String() = IO.File.ReadAllLines(txtPath, encoding)
Dim LQuery = (From line As String In Lines Select RowParsing(line, Expression)).ToArray
Return LQuery
End Function
''' <summary>
''' Imports the data in a well formatted text file using the fix length as the data separate method.
''' </summary>
''' <param name="txtPath"></param>
''' <param name="length">固定的列字符数的宽度</param>
''' <returns></returns>
''' <remarks></remarks>
Public Function FixLengthImports(txtPath As String, Optional length As Integer = 10, Optional encoding As System.Text.Encoding = Nothing) As Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.File
If encoding Is Nothing Then
encoding = System.Text.Encoding.Default
End If
Dim Lines As String() = IO.File.ReadAllLines(txtPath, encoding)
Dim LQuery = (From line As String In Lines Select RowParsing(line, length:=length)).ToArray
Return LQuery
End Function
The amazing Reflection Wrapper
This part of my job in this library is the most powerful and amazing works! It makes my coding job happier and I believe it will do the same on you! Let’s see an example of code on it! All of these wrapper operations are define in the namespace Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.Reflection
How it works?
The reflection wrapper operations are based on the reflection operation in .NET. There three step to implement this function:
Picture2. Workflow for the wrapper operation
First, Create Data Schema
Required custom attribute:
Namespace Csv.Reflection
<AttributeUsage(AttributeTargets.Property, allowmultiple:=True, inherited:=False)>
Public Class ColumnAttribute : Inherits Attribute
Protected Friend _ElementDataType As Type = Type.String
Protected Friend _Name As String
Protected Friend _bindProperty As System.Reflection.PropertyInfo
Sub New(Name As String, Optional Type As Type = Type.String)
''' <summary>
''' 并不建议使用本Csv属性来储存大量的文本字符串,极容易出错
''' </summary>
''' <remarks></remarks>
<AttributeUsage(AttributeTargets.Property, allowmultiple:=True, inherited:=False)>
Public Class ArrayColumn : Inherits Attribute
Protected Friend _Delimiter As String
Protected Friend _ElementDataType As ColumnAttribute.Type
Protected Friend _Name As String
Protected Friend _bindProperty As System.Reflection.PropertyInfo
Sub New(Name As String, Optional Delimiter As String = "; ", Optional ElementDataType As ColumnAttribute.Type = ColumnAttribute.Type.String)
Tow function for creating the data schema:
Private Shared Function GetArrayColumns(ItemTypeProperties As System.Reflection.PropertyInfo()) As Reflection.ArrayColumn()
Dim LQuery = (From [PropertyInfo] As System.Reflection.PropertyInfo
In ItemTypeProperties
Let attrs As Object() = [PropertyInfo].GetCustomAttributes(Reflection.ArrayColumn.TypeInfo, inherit:=False)
Where Not attrs.IsNullOrEmpty
Select DirectCast(attrs.First, Reflection.ArrayColumn).Bind(PropertyInfo)).ToArray
Return LQuery
End Function
Private Shared Function GetColumns(Type As System.Type) As Reflection.ColumnAttribute()
Dim ItemTypeProperties = Type.GetProperties(BindingFlags.Public Or BindingFlags.Instance)
Dim [Property] As PropertyInfo
Dim ColumnList As List(Of Reflection.ColumnAttribute) = New List(Of ColumnAttribute)
For i As Integer = 0 To ItemTypeProperties.Length - 1
[Property] = ItemTypeProperties(i)
Dim ColumnAttrList = (From attr In [Property].GetCustomAttributes(ColumnAttribute.TypeInfo, inherit:=False) Select DirectCast(attr, ColumnAttribute)).ToArray
If Not ColumnAttrList.IsNullOrEmpty Then
For idx As Integer = 0 To ColumnAttrList.Count - 1
Dim columnAttr = ColumnAttrList(idx)
If Not Len(columnAttr._Name) > 0 Then
columnAttr._Name = [Property].Name
End If
ColumnList.Add(columnAttr.Bind([Property]))
Next
End If
Next
Return ColumnList.ToArray
End Function
As just some basic data type can be convert into string or convert string into the target type directly, then I define a type enums to specific the dat type of target property, and this code make the program performance better than we directly get the type information from the property using reflection.
Public Enum Type
[Object] = -1
[Integer] = 0
[Long]
[String]
[Double]
[DateTime]
[Bool]
End Enum
Secondary, Data Convection
After we create a data schema from the reflection operation above, then we are able to make a data type convention between the data object and the row object.
Here is the data convert method:
Friend Shared ReadOnly CTypers As System.Func(Of String, Object)() = {
Function(s As String) CInt(Val(s)),
Function(s As String) CType(Val(s), Long),
Function(s As String) s,
Function(s As String) Val(s),
Function(s As String) CType(s, DateTime),
Function(s As String) CType(s, Boolean)}
Public Shared Function Convert(s As String, DataType As Type) As Object
Return CTypers(DataType)(s)
End Function
There is another problem that the wrapper operation can convert a data array object, so that I create this method for array convert into a row string and convert the array string into an object array:
Public Function CreateObject(cellData As String) As String()
Dim Tokens As String() = Split(cellData, _Delimiter)
Return Tokens
End Function
Private ReadOnly Property DelimiterLength As Integer
Get
Return Len(_Delimiter)
End Get
End Property
Public Function CreateObject(Of T)(DataCollection As Generic.IEnumerable(Of T)) As String
Dim Type As System.Type = GetType(T)
Dim sBuilder As StringBuilder = New StringBuilder(1024)
If Not (Type.IsValueType OrElse Type = GetType(String)) Then
Call Console.WriteLine("[WARNNING] DataType ""{0}"" is not a validated value type, trying to get string data from its Object.ToString() method!", Type.FullName)
End If
Dim StringCollection As String() = (From item In DataCollection Let _create = Function() item.ToString Select _create()).ToArray
For Each item In StringCollection
Call sBuilder.AppendFormat("{0}{1}", item, _Delimiter)
Next
Call sBuilder.Remove(sBuilder.Length - DelimiterLength, DelimiterLength)
Return sBuilder.ToString
End Function
There is a problem that we passing the collection data into the Property.SetValue method, as we parse a string array load from the text file but the data type of the target property may be a Double array,then If we directly pass this object collection parameter into the Property.SetValue method then we will get a data type miss match exception, so that I using a lambda expression to make the thing work.
Protected Friend Shared ReadOnly FillObjects As Dictionary(Of Reflection.ColumnAttribute.Type, System.Action(Of PropertyInfo, Object, String())) =
New Dictionary(Of ColumnAttribute.Type, Action(Of PropertyInfo, Object, String()))
From {
{ColumnAttribute.Type.Bool, Sub(bindProperty As System.Reflection.PropertyInfo, FilledObject As Object, strValues As String()) _
Call bindProperty.SetValue(FilledObject, (From strData As String In strValues Select CType(strData, Boolean)).ToArray, Nothing)},
{ColumnAttribute.Type.DateTime, Sub(bindProperty As System.Reflection.PropertyInfo, FilledObject As Object, strValues As String()) _
Call bindProperty.SetValue(FilledObject, (From strData As String In strValues Select CType(strData, DateTime)).ToArray, Nothing)},
{ColumnAttribute.Type.Double, Sub(bindProperty As System.Reflection.PropertyInfo, FilledObject As Object, strValues As String()) _
Call bindProperty.SetValue(FilledObject, (From strData As String In strValues Select CType(strData, Double)).ToArray, Nothing)},
{ColumnAttribute.Type.Integer, Sub(bindProperty As System.Reflection.PropertyInfo, FilledObject As Object, strValues As String()) _
Call bindProperty.SetValue(FilledObject, (From strData As String In strValues Select CType(strData, Integer)).ToArray, Nothing)},
{ColumnAttribute.Type.Long, Sub(bindProperty As System.Reflection.PropertyInfo, FilledObject As Object, strValues As String()) _
Call bindProperty.SetValue(FilledObject, (From strData As String In strValues Select CType(strData, Long)).ToArray, Nothing)},
{ColumnAttribute.Type.String, Sub(bindProperty As System.Reflection.PropertyInfo, FilledObject As Object, strValues As String()) _
Call bindProperty.SetValue(FilledObject, strValues, Nothing)}}
All of the wrapper operation code can be found at Csv.Reflection.Reflector class:
''' <summary>
''' Method for load a csv data file into a specific type of object collection.
''' </summary>
''' <typeparam name="ItemType"></typeparam>
''' <param name="Explicit">当本参数值为False的时候,所有的简单属性值都将被解析出来,而忽略掉其是否带有<see cref="Csv.Reflection.ColumnAttribute"></see>自定义属性</param>
''' <param name="Path"></param>
''' <returns></returns>
''' <remarks></remarks>
Public Shared Function Load(Of ItemType)(Path As String, Optional Explicit As Boolean = True, Optional encoding As System.Text.Encoding = Nothing) As List(Of ItemType)
''' <summary>
''' Save the specifc type object collection into the csv data file
''' </summary>
''' <typeparam name="ItemType"></typeparam>
''' <param name="Collection"></param>
''' <param name="Explicit"></param>
''' <returns></returns>
''' <remarks></remarks>
Public Shared Function Save(Of ItemType)(Collection As Generic.IEnumerable(Of ItemType), Optional Explicit As Boolean = True) As Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.File
Those two methods above have a optional parameter Explicit, which means True for just parse the property have the column custom attribute and False for parse all property even they have no column custom attribute, I just want makes thing better and easier control. This is just like that when you do the xml serialization all of the property without xml serialization custom attribute also can be serialize into the xml file.
And I have created two extension methods for makes the IO operation more easily:
''' <summary>
''' Load a csv data file document using a specific object type.(将某一个Csv数据文件加载仅一个特定类型的对象集合中)
''' </summary>
''' <typeparam name="T"></typeparam>
''' <param name="Path"></param>
''' <param name="explicit"></param>
''' <param name="encoding"></param>
''' <returns></returns>
''' <remarks></remarks>
<Extension> Public Function LoadCsv(Of T)(Path As String, Optional explicit As Boolean = True, Optional encoding As System.Text.Encoding = Nothing) As List(Of T)
Return Csv.Reflection.Reflector.Load(Of T)(Path, explicit, encoding)
End Function
<Extension> Public Sub SaveTo(Of T)(Collection As Generic.IEnumerable(Of T), path As String, Optional explicit As Boolean = True, Optional encoding As System.Text.Encoding = Nothing)
Call Csv.Reflection.Reflector.Save(Collection, explicit).Save(path, False, encoding)
End Sub
A Simple Example: How to use?
If we have a collection of experiment data in such data structure, and after we preprocessing in our .NET program and then passing this data package to the R program and create a complex structure object in R, how we goanna do? The most amazing thing is using the reflection wrapper in this library to complete this job:
As we have such an example data structure in the .NET program, and the custom attribute have already define on the target property:
Public Class ExampleExperimentData
Public Property Id As String
<Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.Reflection.Column("RPKM", DataVisualization.DocumentFormat.Csv.Reflection.ColumnAttribute.Type.Double)>
Public Property ExpressionRPKM As Double
<Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.Reflection.ArrayColumn("tags")>
Public Property Tags As String()
End Class
Then the target csv data file were load into a specific type of data collection object! Things just simple and quite smart!
Dim CsvPath As String = "csv/data/file/path"
Dim DataCollection = CsvPath.LoadCsv(Of ExampleExperimentData)(explicit:=True)
You see, As the same as read data, save the data in the data collection object is simple too!
Call DataCollection.SaveTo(CsvPath, explicit:=True)
A simple test example
Imports Microsoft.VisualBasic.DataVisualization.DocumentFormat.Extensions
Module TestMain
Sub Main()
Dim DataCollection As ExampleExperimentData() = New ExampleExperimentData() {
New ExampleExperimentData With {.Id = "GeneId_0001", .ExpressionRPKM = 0, .Tags = New String() {"Up", "Regulator"}},
New ExampleExperimentData With {.Id = "GeneId_0002", .ExpressionRPKM = 1, .Tags = New String() {"Up", "PathwayA"}},
New ExampleExperimentData With {.Id = "GeneId_0003", .ExpressionRPKM = 2, .Tags = New String() {"Down", "Virulence"}}}
Dim CsvPath As String = "./TestData.csv"
Call DataCollection.SaveTo(CsvPath, explicit:=False)
DataCollection = Nothing
DataCollection = CsvPath.LoadCsv(Of ExampleExperimentData)(explicit:=False).ToArray
Dim File = Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.Reflection.Reflector.Save(Of ExampleExperimentData)(DataCollection, Explicit:=False)
For Each row In File
Call Console.WriteLine(row.ToString)
Next
Console.WriteLine("Press any key to continute...")
Console.Read()
End Sub
Public Class ExampleExperimentData
Public Property Id As String
<Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.Reflection.Column("RPKM", DataVisualization.DocumentFormat.Csv.Reflection.ColumnAttribute.Type.Double)>
Public Property ExpressionRPKM As Double
<Microsoft.VisualBasic.DataVisualization.DocumentFormat.Csv.Reflection.ArrayColumn("tags")>
Public Property Tags As String()
End Class
End Module
Picture3. Test example console output and auto generate csv document