For the last 2 days, I thought of unleashing few of the hidden facts of Data Storage. It is true that the introduction of the .NET Framework without introducing Generic in the first place is one of the biggest mistakes. Because of Non-Generic data structure, a large amount of data is being unnecessarily boxed and unboxed without any reason as such. After Generics was introduced with .NET 2.0, most of these classes which made you do type conversions is either been totally depreciated or few of them rarely used as is there for backward compatibility. But for every programmer, the most common class called
DataTable still exists without any change. In this post, we will try to see the actual implementation of
DataTable as a whole, so that we could understand what is going on when we store data into it.
DataTable is an application object that maps to a
Database Table. It holds a collection of
DataRow represents the actual Data while the
DataColumns holds Integration rules. Basically, we all know the basics of
DataTable and often use it for our regular work.
DataSet holds a list of
DataTable as being said is the storage of Data in the form of a Table with Row and Columns, the
DataRelation holds the Relationship rules between two tables in a
DataSet. Hence ADO.NET has been structured in such a way that you can have the flavour of the entire database from your application end.
But, one thing that you must keep in mind is that these objects are created way back in time of .NET 1.0 release or even before when there was no Generics or even the concept of State Machines. Generics allows you to pass type of an object during actual object creation and hence will allow you to create Type information based on the object which we create in real time, as a result, it eliminates unnecessary runtime boxing and unboxing. Keeping this in mind, I thought I should check how these classes are made in Framework. Let's look at each of them, one by one in this post and let's identify the pros and cons.
DataSet is the application object that can hold the entire structure of the database schema. Basically, internally it is a serializable object which holds two sets of collection,
DataRelationCollection where the former is an implementation of an
DataTables and the later is a list of
DataSet can be Cloned (produces another object with New tables from existing schema) or Copied (Produces another object with same tables).
Some of the important benefits of DataSet
- Easily Serializable, even could be written to XML File using
WriteXML or can be read to the Object.
- Allows you to Merge other
DataSet into it easily.
- Can hold Relation and Tables into it.
Problems & best implementation of DataSet
- Can only hold a collection of
DataTable, and it does not allow us to define our own data structures and produce custom
DataRelation among them.
- No Generic support, use of
ArrayList means each element has to be upcast to object when being stored and again to actual element (
DataTable) when being retrieved.
DataTable is the primary object which holds the
Data. It holds a collection of
DataRow. Each of the
DataRow holds the data element while
DataColumn holds the rules between each columns. The
DataRowCollection internally represents an
DataColumnCollection is on
To hold the
DataTable maintains an internal sealed class called
RecordManager. Everytime you create a
DataRow element inside a
DataTable class, the object is stored into
DataRowCollection holds a list of
RBTree class implemented as
Generic list of
DataRow objects is used.
Some important benefits of DataTable
DataTable can store any
Table even if the data is complex
- Serializable, so can be ported in communication medium easily
- Keeps track of indexed field
Problems of DataTable
- Most of the objects are non-generic, and hence leading to conversion of types
- Very bulky
DataRow on the other hand holds the actual data. It has an indexer to the object which lets you
set data to it.
Internals to ADO.NET Continued....
Now let us take a look at how the data is being stored into these data structures and later retrieved from it. Say you have opened a connection to database and used
Adapter.Fill to fill data into a
DataTable. Now let's examine what internally happens to store Data into the
DataTable and also retrieving the same when required. Say the code I write to retrieve the Data is as below:
SqlDataAdapter adpt = new SqlDataAdapter("Select * from Table", conn);
Now when you call
Fill method, it actually invokes the external SQL server process to load the Data into memory and parses the loaded memory into the existing data structure. If you have already read my last post on internals of ADO.NET, you should be clear how the
DataRows are created using
DataReader and put into
DataRowCollection. In this post, I will cover from there, how the internal
DataRow is parsed.
DataReader loads each
DataRow into memory and takes you through with calling
LoadDataRow method of
SchemaMapping. If you see into Reflector, the code for
LoadDataRow looks like:
Here the initial call to
dataReader.GetValues is actually getting the data from memory into
DataRow, which is then Loaded into
DataTable and finally to the
DataSet. Now if you try to look into the definition of
GetValues, you need to look into the actual implementation of it in the provider specific classes. For
SqlClient, the data is loaded using
SqlBuffer class. The Buffer reads each individual entry and puts it as an array. So basically for a
DataRow object loading Data for SQL Server, it puts the actual Data into an array of
SqlBuffer. So if you can see the implementation of
SqlBuffer, you would be clear about the actual data storage of
So if you look into the implementation of
SqlDataReader, you can see it basically maintains the collection of
SqlBuffer is the actual unit of data storage. It maintains a structure Storage in it which has member to store almost all data basic data types The Storage stores the actual data in
Type and the
SqlBuffer holds the exact type information.
Here inside the
SqlBuffer, you can see the member
StorageType on the other hand is an
Enum which sets the actual type of
Data associated in the
SqlBuffer also exposes properties, each of which can
set the value for the object. The exact type information is also held into
DataColumnCollection and according to that, the
Type of the
Data is parsed.
Hence, this means the
DataRow is actually stored internally into proper types and hence does not have any performance implication, right? No, there is.
Actually the performance implication lies above these facts when you try to retrieve the value or store it into the
DataRow object. We generally use indexer to fetch data from the
DataRow. The indexer to the
DataRow is actually of type object.
public object this
This is the major problem to
DataRow class. So eventually, every time you try to retrieve data from
DataRow, even though proper type is maintained underneath, the indexer box to object type. Also, upon using setter, the proper type maintained from our code is also being boxed to the setter and which again is unboxed to proper data type.
I think these classes could be generalized more to induce better performance to the API. I am also hoping to have an enumeration of generic objects in the whole data structure of
DataColumns (may be using
DataCell<t>) in future.
I hope you like the post.
I don't know if I did made things clear to you. I have to put only those which are important to know. I would suggest to use Reflector to see the implementation in detail to know more.
Thank you for reading. Looking forward to your feedback.