Using SqlBulkCopy for High Performance Inserts

20 Feb 2015
How to BulkLoad a file with millions of records

Introduction

We use SqlBulkCopy to transfer data into SQL Server. A DataTable works fine if you have a few hundred thousand rows. However, a different approach is needed when dealing with big data files, i.e. millions of rows.

How to Run the Solution

  1. Restore the included .bak file to your local SQL Server instance. We need the database Numbers, which contains the RandomNumbers table.
  2. The solution consists of two console applications. Set FileGenerator as the StartUp project and run it to generate the data file.
  3. In the SqlBulkLoader project, modify the ConnectionString so that it points to your local database.
  4. Run the second application, SqlBulkLoader.
  5. While it is running, bring up the Task Manager and monitor the memory usage.
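The FileGenerator source is not listed in this article, so here is a minimal sketch of what it might look like, under the assumption that it writes comma-separated pairs (which matches the two-column reader used later). It streams lines to disk one at a time, so the generator itself never holds the whole data set in memory:

```csharp
using System;
using System.IO;

public static class FileGeneratorSketch
{
    // Writes "index,randomNumber" lines to the target file, one at a time.
    public static void Generate(string path, int rows)
    {
        var random = new Random();
        using (var writer = new StreamWriter(path))
        {
            for (var i = 1; i <= rows; i++)
            {
                // Two comma-separated columns, matching the reader below.
                writer.WriteLine($"{i},{random.Next()}");
            }
        }
    }
}
```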

Background

DataTables are easy to use; however, they buffer every row in memory, which makes them one of the worst choices for loading massive data sets. If we use them to load massive data, we will experience an OutOfMemoryException, which means the common language runtime cannot allocate enough contiguous memory to complete the operation.
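For contrast, here is a minimal sketch of the DataTable approach just described (the column names Id and Value are illustrative, not taken from the article). Note how every row is buffered before a single byte reaches the server:

```csharp
using System.Data;
using System.Data.SqlClient;
using System.IO;

public static class DataTableLoader
{
    // Reads the whole file into a DataTable. Every row stays in memory,
    // which is exactly why this approach breaks down at millions of rows.
    public static DataTable BuildTable(string file)
    {
        var table = new DataTable();
        table.Columns.Add("Id", typeof(string));
        table.Columns.Add("Value", typeof(string));

        foreach (var line in File.ReadLines(file))
        {
            var parts = line.Split(',');
            table.Rows.Add(parts[0], parts[1]);
        }

        return table;
    }

    public static void Load(string file, string connectionString)
    {
        using (var bulkCopy = new SqlBulkCopy(connectionString))
        {
            bulkCopy.DestinationTableName = "RandomNumbers";
            bulkCopy.WriteToServer(BuildTable(file)); // all rows buffered up front
        }
    }
}
```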

WriteToServerAsync(DataTable)

The asynchronous version of WriteToServer, which copies all rows in the supplied DataTable to a destination table specified by the DestinationTableName property of the SqlBulkCopy object.

WriteToServerAsync(IDataReader)

Copies all rows in the supplied IDataReader to a destination table specified by the DestinationTableName property of the SqlBulkCopy object.

In other words, WriteToServerAsync is overloaded. To make it work with big data, we need the second overload, WriteToServerAsync(IDataReader).

The first step is to create a class that implements IDataReader. You will notice that this interface includes many methods, but we won't need to implement all of them. If every concrete reader implemented IDataReader directly, the code would be massive. A better solution is to isolate only the methods we need in an intermediate base class: declare those as abstract, and give the rest stub implementations that throw NotImplementedException (marked with the new modifier in the code below). This way, concrete readers derived from that refined class only override the methods they actually need.

Additionally, there are two things to consider:

SqlBulkCopy.BulkCopyTimeout Property

When working with big data, you will need to update this property:

bulkCopy.BulkCopyTimeout = 0;

If you fail to do that, the operation will time out, as the default value is 30 seconds. A value of 0 indicates no limit, so the bulk copy will wait indefinitely.

SqlBulkCopy.BatchSize Property

Again, with big data, failing to specify BatchSize means it keeps its default value, which is 0. In that case, SqlBulkCopy attempts to send all rows to the server in a single batch, and the operation is likely to fail. If we want to control the number of rows inserted in each batch, we need to specify a value. In our example, we have to process 10,000,000 records, so if you decide to play with the code, comment out the following line and see what happens:

bulkCopy.BatchSize = 5000;

The copy will fail.
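Putting the two properties together, a helper along these lines streams any IDataReader with both settings applied. This is a sketch, not the article's BulkLoader class; the method and parameter names are my own:

```csharp
using System.Data;
using System.Data.SqlClient;
using System.Threading.Tasks;

public static class BulkCopySettings
{
    // Copies rows from any IDataReader with both properties tuned for big data.
    public static async Task CopyAsync(IDataReader reader, string connectionString, string table)
    {
        using (var bulkCopy = new SqlBulkCopy(connectionString))
        {
            bulkCopy.DestinationTableName = table;
            bulkCopy.BulkCopyTimeout = 0; // 0 = no time limit for the whole operation
            bulkCopy.BatchSize = 5000;    // commit rows in 5,000-row batches
            await bulkCopy.WriteToServerAsync(reader);
        }
    }
}
```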

Using the Code

As you can see, IDataReader comes with many methods. In my ObjectDataReader below, I have declared as abstract the methods I want to use, and marked with new the methods I need to hide:

public abstract class ObjectDataReader : IDataReader
   {
       public new int Depth { get { throw new NotImplementedException(); } }

       public new int RecordsAffected
       {
           get { throw new NotImplementedException();}
       }

       public abstract int FieldCount
       {
           get;
       }

       public new bool IsClosed
       {
           get { throw new NotImplementedException(); }
       }

       public abstract object this[int i]
       {
           get;
       }

       public new object this[string name]
       {
           get { throw new NotImplementedException(); }
       }

       public abstract void Close();

       public new DataTable GetSchemaTable()
       {
           throw new NotImplementedException();
       }

       public abstract bool NextResult();

       public abstract bool Read();

       public new IDataReader GetData(int i)
       {
           throw new NotImplementedException();
       }

       public new string GetDataTypeName(int i)
       {
           throw new NotImplementedException();
       }

       public new string GetName(int i)
       {
           throw new NotImplementedException();
       }

       public new string GetString(int i)
       {
           throw new NotImplementedException();
       }

       public abstract object GetValue(int i);

       public new int GetValues(object[] values)
       {
           throw new NotImplementedException();
       }

       public abstract void Dispose();

       public new bool GetBoolean(int i)
       {
           throw new NotImplementedException();
       }

       public new byte GetByte(int i)
       {
           throw new NotImplementedException();
       }

       public new long GetBytes(int i, long fieldOffset, byte[] buffer, int bufferoffset, int length)
       {
           throw new NotImplementedException();
       }

       public new char GetChar(int i)
       {
           throw new NotImplementedException();
       }

       public new long GetChars(int i, long fieldoffset, char[] buffer, int bufferoffset, int length)
       {
           throw new NotImplementedException();
       }

       public new DateTime GetDateTime(int i)
       {
           throw new NotImplementedException();
       }

       public new decimal GetDecimal(int i)
       {
           throw new NotImplementedException();
       }

       public new double GetDouble(int i)
       {
           throw new NotImplementedException();
       }

       public new Type GetFieldType(int i)
       {
           throw new NotImplementedException();
       }

       public new float GetFloat(int i)
       {
           throw new NotImplementedException();
       }

       public new Guid GetGuid(int i)
       {
           throw new NotImplementedException();
       }

       public new short GetInt16(int i)
       {
           throw new NotImplementedException();
       }

       public new int GetInt32(int i)
       {
           throw new NotImplementedException();
       }

       public new long GetInt64(int i)
       {
           throw new NotImplementedException();
       }

       public new int GetOrdinal(string name)
       {
           throw new NotImplementedException();
       }

       public new bool IsDBNull(int i)
       {
           throw new NotImplementedException();
       }
   }

So instead of using ObjectDataReader directly, I will use it as the base class of my DataReader class. This way, I can debug DataReader more easily, and, as you will shortly see, the code is more readable. I only override the set of methods I need, instead of having a massive class full of methods that are irrelevant to the current scenario.

public class DataReader : ObjectDataReader
   {
       private object[] _values;

       private StreamReader _stream;

       public DataReader(string fileName)
       {
           _stream = new StreamReader(fileName);
           _values = new object[FieldCount];
       }

       public override int FieldCount
       {
           get
           {
               return GetColumns();
           }
       }

       protected bool Eof { get; set; }

       protected string CurrentRecord { get; set; }

       protected int CurrentIndex { get; set; }

       public override object this[int i]
       {
           get { return _values[i]; }
       }

       public override bool NextResult()
       {
           return false;
       }

       public override void Close()
       {
           Array.Clear(_values, 0, _values.Length);
           _stream.Close();
           _stream.Dispose();
       }

       public override bool Read()
       {
           CurrentIndex++;
           CurrentRecord = _stream.ReadLine();
           Eof = String.IsNullOrEmpty(CurrentRecord);

           // Skip empty lines until a record is found or the file ends.
           while (Eof && !_stream.EndOfStream)
           {
               CurrentRecord = _stream.ReadLine();
               Eof = String.IsNullOrEmpty(CurrentRecord);
           }

           if (!Eof)
           {
               Fill(_values);
           }

           return !Eof;
       }

       public override object GetValue(int i)
       {
           return _values[i];
       }

       public override void Dispose()
       {
           _stream.Close();
           _stream.Dispose();
       }

       private void Fill(object[] values)
       {
           // Split the line once instead of once per column.
           var columns = CurrentRecord.Split(',');
           values[0] = columns[0];
           values[1] = columns[1];
       }

       private int GetColumns()
       {
           return 2;
       }
   }

The client will only need to do the following:

public static void Main(string[] args)
{
     var bulkloader = new BulkLoader();
     var file = ConfigurationManager.AppSettings["FileLocation"];
     // Dispose the reader (and its underlying stream) once the copy completes.
     using (var streamReader = new DataReader(file))
     {
         Task.Run(async () => await bulkloader.BulkLoadTableAsync(streamReader, "RandomNumbers")).Wait();
     }
}
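The BulkLoader class itself is not shown in the article. A minimal sketch of what BulkLoadTableAsync might look like follows; the connection string name "Default" is an assumption, as is reading it via ConfigurationManager:

```csharp
using System.Configuration;
using System.Data;
using System.Data.SqlClient;
using System.Threading.Tasks;

public class BulkLoader
{
    // Streams rows from the supplied reader straight into the destination table.
    public async Task BulkLoadTableAsync(IDataReader reader, string tableName)
    {
        var connectionString =
            ConfigurationManager.ConnectionStrings["Default"].ConnectionString;

        using (var bulkCopy = new SqlBulkCopy(connectionString))
        {
            bulkCopy.DestinationTableName = tableName;
            bulkCopy.BulkCopyTimeout = 0; // no time limit for the whole copy
            bulkCopy.BatchSize = 5000;    // commit in 5,000-row batches

            // Rows are pulled from the reader one at a time, so memory stays flat.
            await bulkCopy.WriteToServerAsync(reader);
        }
    }
}
```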

The moment you start bulk importing, it's worth monitoring the memory in Task Manager. You will notice that while the import runs there are no spikes, just one steady line, which shows that rows are streamed rather than buffered and memory usage stays flat. To see the difference, do the same using DataTables: you will see memory climbing steadily instead of one steady line... and shortly the catastrophic OutOfMemoryException.

To summarize, use DataTables only when you know the number of rows won't exceed a few hundred thousand. If you are dealing with millions of rows, a streaming solution like the one above is the way to bulk import vast amounts of data.

Points of Interest

Memory Leaks, Performance, Task Manager

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Veronica S. Zotali
Software Developer (Senior)
United Kingdom United Kingdom
