Fast Excel file reader with basic functionality

Messir

Rate me:

4.88/5 (104 votes)

1 Jul 20063 min read

1.6M

14K

211

173

A set of managed classes for native reading of data from MS Excel files

Introduction

When I was writing one of the projects (multi-format data parser) I had to deal with Microsoft Excel files.

First of all I tried to read data via COM interop, but this way was very slow (and sometimes input files have up to 100,000 rows and even more) and required Excel to be installed on the target machine.

Then I tried to read data via Excel's OleDb driver. When DataReader ran through data rows without reading them (i.e. while (rd.Read());) it was quite fast, but retrieving data was still very slow.

When I had tried out one of commercial products (TMS FlexCel Studio), it showed rather good results (several seconds for loading 17 MB file, 2 sheets, ~ 80 000 rows), but access to them was not very fast (at least not suitable for displaying data using DataGridView in virtual mode).

So, I decided to write my own reader for Excel files.

Background

First of all I decided that the reader will produce a DataSet object, with sheets represented as DataTables within it. Access to DataTable is rather fast, and its TableName property is suitable for sheet title.

To read an Excel file, we need to get BIFF stream from xls OLE2 container. Working with OLE2 container is implemented in the following classes:

XlsHeader (file header representation)
XlsFat (support for file system in OLE2 container)
XlsRootDirectory (directory tree representation)
XlsDirectoryEntry, and
XlsStream (file stream)

Note that implementation of OLE2 container object model is limited (e.g. I didn't write any MiniFAT support code) and Excel-specific (XlsHeader supports only MS Excel header field values, any other will produce InvalidHeaderException).

Now we can parse BIFF stream with spreadsheet data. For handy reading data from stream we have a XlsBiffStream class derived from XlsStream. It provides a set of methods for reading XlsBiffRecord common structures. And then, each structure is converted to the appropriate type. Note that only a few record types are supported. All other structures are read as basic XlsBiffRecord and are ignored.

My implementation of reader ignores any text formatting, graphics, embedded objects and charts. Only cell values are read. When a cell contains a formula, the last calculated formula value is used as the cell value. Also, only indexed sheets are read (i.e. sheets with index data) - I didn't write not-indexed sheets support due to my laziness (I didn't come across any file without an index yet).

Using the code

Now, it's very easy to use the parser. All you need is a Stream with an Excel file.

FileStream fs = new FileStream(@"c:\file.xls", 
        FileMode.Open, FileAccess.Read);
ExcelDataReader rd = new ExcelDataReader(fs);
fs.Close();

// Now we can access data:
DataSet data = rd.WorkbookData;

Points Of Interest

Latest versions of Excel use SST (Shared String Table) to store string information, and text cells only reference string by index. Also, the size of one BIFF section is limited, and SST can be rather big, so it uses the so-called CONTINUE sections. Every string in SST can be saved as Unicode (2 byte) or as ANSI (single byte per symbol). But when the string is broken with the continue section, it can change its encoding from one to another! This took me a lot of debugging, and I wish nobody has to do the same again...

Results

This implementation of reader is rather nonoptimal (I was pressed for time), but it showed a good speed. It surely beats any interop or OLE method and even some commercial analogs.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here

Written By

Messir

Web Developer

Russian Federation

Alex
.NET Developer
Russian Federation
rakemaker@gmail.com

Comments and Discussions

Re: One row to much in every datatable

Kelevr44-Nov-08 2:53

Kelevr4

4-Nov-08 2:53

Hi! I'm not interested in first row (index 0). Second row (index 1) replicates the first row. Third row (index 2) doesn't exist of course... Anyway I've lost a row!

There's always a better way

Error when reading negative integer in limit case

flops4225-Jan-07 7:42

flops42

25-Jan-07 7:42

Hi,

I found this error with value -18721441, in MULRK case.
I found that the value, coded in uint, has it's last bit set, as it's a negative value.
But when converting it (double right shift), the last bit is lost.
The line of code is :
num = (double)(int)(rk >> 2);
And should be instead :
num = (double)((int)rk >> 2);
Here, num is first converted to int, then, when shifting twice, the last bit is kept.

This should correct the issue.

The line can be found in Core.cs, class XlsBiffRKCell, method NumFromRK(uint rk), line 1898.

Best regards for this great work Cool | :cool:

Christian

How can I use this on a web page

ppara25-Jan-07 3:12

ppara

25-Jan-07 3:12

I'd like to read excel files that are uploaded to my website. How can I use your software to do this?

How can I use this on a web page

ppara25-Jan-07 3:09

ppara

25-Jan-07 3:09

I want my web page to read uploaded excel files. How can I do this?

Non-indexed worksheets?

coopsf112-Jan-07 5:55

coopsf1

12-Jan-07 5:55

I am uploading similar excel files and sometimes it works and sometimes it doesn't. The key problem seems to be the line of code in the method "ReadWorksheet":

<br />
XlsBiffIndex idx = m_stream.Read() as XlsBiffIndex;<br />

returns a null value when the read fails to return data and a value when it works.
Could this be because worksheets are not Indexed? You mention these will be ignored in your instructions. If that is indeed the case how do I make the worksheets indexed?

Thanks in advance,

Martin.

Re: Non-indexed worksheets?

coopsf112-Jan-07 6:43

coopsf1

12-Jan-07 6:43

I think I fixed my own issue. If I select all the cells in a non-working sheet and select Edit > Clear > Formats and save and reupload it all seems to work.

Martin.

Re: Non-indexed worksheets?

mindatlarge29-Mar-07 4:26

mindatlarge

29-Mar-07 4:26

I am having this exact same problem, but unfortunately clearing the format and saving the file is not an option. I am trying to get this to work in a Windows service that pulls an Excel file from an FTP site. Has anyone else had this same issue with indexing of worksheets?

Re: Non-indexed worksheets?

Bob van Steijnen9-Nov-08 10:39

Bob van Steijnen

9-Nov-08 10:39

Like mindatlarge, I also need to upload Excel files (to a web application) which may sometimes have non-indexed worksheets. I think this may have to do with Excel files being generated (or "exported") by (some) applications other than MS Excel. Opening and just saving such a file in the MS Excel application itself seems to solve the non-indexed problem: worksheets will be indexed (and the file size will increase), whether any formatting was cleared or not. In my web application, I could have also opted to store the uploaded file temporarily and open it via OLEDB (no problem with non-indexed worksheets there); however, like mindatlarge, I would prefer to be able to read it directly from the stream (it's so much faster!) so I wanted to solve the non-indexed issue.

I followed an intuitive approach that seems to work OK (for me up to now anyway): in the case of a non-indexed worksheet, I cannot use the INDEX record's DbCellAddresses array to quickly find the data cells I need to read, so instead I just read the stream (in particular the part assigned to the worksheet) to find them. In my copy of the ExcelDataReader code I made a few changes to accomplish this.

NOTE: The ExcelDataReader code I use is based on a recent download from the CodePlex website, but IMO this is remarkably similar code to the code on this site by Alex Skalozub aka Messir, the main ExcelDataReader.cs files differing mainly in formatting and some code having been organized a bit more.

In the main ExcelDataReader() code:
- I added a parameter double_decimals to be able to set a custom number of decimals when formatting double values (m_double_values is a global, used in the FormatNumber()) method;
- I call the ReadWorksheet() method with an extra parameter pointing to the worksheet following the current one (if available), to have its offset available in case of the current worksheet turning out to be non-indexed (could have solved this differently as m_sheets is a global variable).

public ExcelDataReader(Stream fileStream, bool promoteToColumns, int double_decimals)
{
	m_PromoteToColumns = promoteToColumns;
	m_double_decimals = double_decimals;

	using (m_file = fileStream)
	{
		m_hdr = XlsHeader.ReadHeader(m_file);
		XlsRootDirectory dir = new XlsRootDirectory(m_hdr);
		XlsDirectoryEntry workbookEntry = dir.FindEntry(WORKBOOK) ?? dir.FindEntry(BOOK);

		if (workbookEntry == null)
			throw new FileNotFoundException(Errors.ErrorStreamWorkbookNotFound);
		if (workbookEntry.EntryType != STGTY.STGTY_STREAM)
			throw new FormatException(Errors.ErrorWorkbookIsNotStream);

		m_stream = new XlsBiffStream(m_hdr, workbookEntry.StreamFirstSector);

		ReadWorkbookGlobals();

		m_workbookData = new DataSet();

		for (int i = 0; i < m_sheets.Count; i++)
		{
			if (ReadWorksheet(m_sheets[i], (i + 1 < m_sheets.Count ? m_sheets[i + 1] : null)))
				m_workbookData.Tables.Add(m_sheets[i].Data);
		}

		m_globals.SST = null;
		m_globals = null;
		m_sheets = null;
		m_stream = null;
		m_hdr = null;

		GC.Collect();
		GC.SuppressFinalize(this);
	}
}

Then the ReadWorksheet() method:
- I split up the statement XlsBiffIndex idx = m_stream.Read() as XlsBiffIndex; to first get the record, then try and cast it to an INDEX record;
- if the cast to an INDEX record fails (idx == null), I continue to read the stream until I hit the DIMENSIONS record which also holds the number of rows etc. found in the INDEX record
- I need either a valid INDEX record or a valid DIMENSIONS record to continue and read the cell data; for a non-indexed worksheet, I read the stream to find these - otherwise the code uses the INDEX record's DbCellAddresses array;
- I put the code block that actually gets the cell data in a separate method to have no duplicate code.

private bool ReadWorksheet(XlsWorksheet sheet, XlsWorksheet next_sheet)
{
	m_stream.Seek((int)sheet.DataOffset, SeekOrigin.Begin);

	XlsBiffBOF bof = m_stream.Read() as XlsBiffBOF;
	if (bof == null || bof.Type != BIFFTYPE.Worksheet)
		return false;

	XlsBiffRecord trec;
	XlsBiffIndex idx;

	// next statement fails if this is a non-indexed worksheet
	// XlsBiffIndex idx = m_stream.Read() as XlsBiffIndex;
	// split it up, so we keep a reference to the record itself
	trec = m_stream.Read();
	idx = trec as XlsBiffIndex;
	if (idx == null)
	{
		try
		{
			idx = null;
			do
			{
				trec = m_stream.Read();
				idx = trec as XlsBiffIndex;
			}
			while (idx == null && trec.ID != BIFFRECORDTYPE.DIMENSIONS);
		}
		catch
		{
			trec = null;
			idx = null;
		}
	}
	bool isV8 = (m_version >= 0x600);

	if (idx != null || (trec != null && trec.ID == BIFFRECORDTYPE.DIMENSIONS))
	{
		DataTable dt = new DataTable(sheet.Name);

		XlsBiffDimensions dims = null;
		if (idx == null)
		{
			dims = (XlsBiffDimensions)trec;
		}
		else
		{
			idx.IsV8 = isV8;

			do
			{
				trec = m_stream.Read();
				if (trec.ID == BIFFRECORDTYPE.DIMENSIONS)
				{
					dims = (XlsBiffDimensions)trec;
					break;
				}

			} while (trec != null && trec.ID != BIFFRECORDTYPE.ROW);
		}

		int maxCol = 256;

		if (dims != null)
		{
			dims.IsV8 = isV8;
			maxCol = dims.LastColumn - 1;
			sheet.Dimensions = dims;
		}

		InitializeColumns(ref dt, maxCol);

		sheet.Data = dt;

		uint maxRow = (idx == null ? dims.LastRow : idx.LastExistingRow);
		uint minRow = (idx == null ? dims.FirstRow : idx.FirstExistingRow);
		if (maxRow <= minRow)
		{
			return true;
		}

		dt.BeginLoadData();

		// add (empty) rows - changed init value for i to minRow instead of 0
		for (int i = (int) minRow; i < maxRow; i++)
		{
			dt.Rows.Add(dt.NewRow());
		}

		if (idx == null)
		{
			// the long way: just follow the stream for this worksheet looking for valid XlsBiffBlankCell elements
			int offs = trec.Offset;
			int maxOffset = (next_sheet == null ? trec.Bytes.Length : (int)next_sheet.DataOffset) - 1;
			m_stream.Seek(offs, SeekOrigin.Begin);
			
			while (offs < maxOffset && LoadCellValues(ref offs, ref dt, ref maxCol, ref maxRow))
			{ }
		}
		else
		{
			uint[] dbCellAddrs = idx.DbCellAddresses;

			for (int i = 0; i < dbCellAddrs.Length; i++)
			{
				XlsBiffDbCell dbCell = (XlsBiffDbCell)m_stream.ReadAt((int) dbCellAddrs[i]);
				XlsBiffRow row = null;
				int offs = dbCell.RowAddress;

				do
				{
					row = m_stream.ReadAt(offs) as XlsBiffRow;
					if (row == null) break;

					offs += row.Size;

				} while (null != row);

				while (LoadCellValues(ref offs, ref dt, ref maxCol, ref maxRow))
				{ }
			}
		}

		dt.EndLoadData();

		if (m_PromoteToColumns)
		{
			RemapColumnsNames(ref dt, dt.Rows[0].ItemArray);
			dt.Rows.RemoveAt(0);
			dt.AcceptChanges();
		}

	}
	else
	{
		return false;
	}

	return true;
}

private bool LoadCellValues(ref int offs, ref DataTable dt, ref int maxCol, ref uint maxRow)
{
	XlsBiffRecord rec = m_stream.ReadAt(offs);
	offs += rec.Size;
	if (rec is XlsBiffDbCell) return false;
	if (rec is XlsBiffEOF) return false;
	XlsBiffBlankCell cell = rec as XlsBiffBlankCell;

	if (cell == null) return true;
	if (cell.ColumnIndex >= maxCol) return true;
	if (cell.RowIndex > maxRow) return true;

	switch (cell.ID)
	{
		case BIFFRECORDTYPE.INTEGER:
		case BIFFRECORDTYPE.INTEGER_OLD:
			dt.Rows[cell.RowIndex][cell.ColumnIndex] = ((XlsBiffIntegerCell)cell).Value.ToString();
			break;
		case BIFFRECORDTYPE.NUMBER:
		case BIFFRECORDTYPE.NUMBER_OLD:
			dt.Rows[cell.RowIndex][cell.ColumnIndex] = FormatNumber(((XlsBiffNumberCell)cell).Value);
			break;
		case BIFFRECORDTYPE.LABEL:
		case BIFFRECORDTYPE.LABEL_OLD:
		case BIFFRECORDTYPE.RSTRING:
			dt.Rows[cell.RowIndex][cell.ColumnIndex] = ((XlsBiffLabelCell)cell).Value;
			break;
		case BIFFRECORDTYPE.LABELSST:
			string tmp = m_globals.SST.GetString(((XlsBiffLabelSSTCell)cell).SSTIndex);
			dt.Rows[cell.RowIndex][cell.ColumnIndex] = tmp;
			break;
		case BIFFRECORDTYPE.RK:
			dt.Rows[cell.RowIndex][cell.ColumnIndex] = FormatNumber(((XlsBiffRKCell)cell).Value);
			break;
		case BIFFRECORDTYPE.MULRK:
			for (ushort j = cell.ColumnIndex; j <= ((XlsBiffMulRKCell)cell).LastColumnIndex; j++)
			{
				dt.Rows[cell.RowIndex][j] = FormatNumber(((XlsBiffMulRKCell)cell).GetValue(j));
			}
			break;
		case BIFFRECORDTYPE.BLANK:
		case BIFFRECORDTYPE.BLANK_OLD:
		case BIFFRECORDTYPE.MULBLANK:
			// Skip blank cells

			break;
		case BIFFRECORDTYPE.FORMULA:
		case BIFFRECORDTYPE.FORMULA_OLD:
			((XlsBiffFormulaCell)cell).UseEncoding = m_encoding;
			object val = ((XlsBiffFormulaCell)cell).Value;
			if (val == null)
				val = string.Empty;
			else if (val is FORMULAERROR)
				val = "#" + ((FORMULAERROR)val);
			else if (val is double)
				val = FormatNumber((double)val);
			dt.Rows[cell.RowIndex][cell.ColumnIndex] = val.ToString();
			break;
		case BIFFRECORDTYPE.BOOLERR:
			dt.Rows[cell.RowIndex][cell.ColumnIndex] = ((XlsBiffBoolErrCell)cell).Value;
			break;
		default:
			break;
	}
	return true;
}

I hope this may be of use to anyone. The dataset I get for non-indexed worksheets using the code above is identical to the result of the OLEDB route.

Re: Non-indexed worksheets?

Chris Kuper25-Jun-08 2:42

Chris Kuper

25-Jun-08 2:42

I am not an expert on this so my findings are based on trial and error. It seem to be that the index might be at a different position. I found the follwoing workaround:

After the line of code

XlsBiffIndex idx = m_stream.Read() as XlsBiffIndex;

I added some code to give the fetching of the index a second chance:

if (idx == null)<br />
{<br />
   idx = m_stream.Read() as XlsBiffIndex;<br />
}

Not to nice but it worked for me. An alternative solution would be to loop until a valid index is found - replacing the original line of code with the following:

XlsBiffIndex idx = null;<br />
do<br />
  {<br />
     idx = m_stream.Read() as XlsBiffIndex;<br />
  }<br />
while (idx == null);

Hope this triggers the development of a robust solution for the issue.

Regards

Chris

Great library... but how to open password protected files

Jorge Varas5-Jan-07 9:51

Jorge Varas

5-Jan-07 9:51

Congratulations, this is a great library. Any ideas on how to implement it so it can open password protected files?

NOT FOUND!

Eric China18-Dec-06 16:40

Eric China

18-Dec-06 16:40

Thanks for ur perfect job.
now i hv a problem here. when I read the excel file which include asia charactors. it turns out that some of cells display as "NOT FOUND #28","NOT FOUND #31".., and other cells are correct. i don't know why. these cells are all filled by charactors. i really need ur help. thks!

Perfect.

Eugen Wiebe6-Dec-06 22:58

Eugen Wiebe

6-Dec-06 22:58

Only 16 columns

Adrian Rosca27-Nov-06 23:40

Adrian Rosca

27-Nov-06 23:40

Hi,

Great work here, a quite impressive effort. I have a small problem though, the reader can only find 16 columns in my Excel-document. All other columns are ignored and I get a "Cannot find column 16." error when I try to access the DataRow. I don't have any strange values in my cells, just plain text and numbers.

Any ideas, anyone? I have temporarily fixed the issue by removing the "maxCol = dims.LastColumn;" row inside the ExcelDataReader so I always get 256 cols, but that's an ugly fix and not very correct.

Adrian Rosca

Re: Only 16 columns

luppo17-Dec-06 2:45

luppo

17-Dec-06 2:45

Messir already fix that bug, this is the post

Dimensions for Biff8 worksheets were calculated incorrectly.
To fix this, you need to edit XlsBiffDimensions class in Core.cs.
Find its LastColumn property and replace
get { return (isV8) ? base.ReadUInt16(0x10) : base.ReadUInt16(0x6); }
with
get { return (isV8) ? base.ReadUInt16(0xA) : base.ReadUInt16(0x6); }

Sorry for stupid mistake .
Thanks to Yannick Morin-Rivest for bugreporting.

Great library.

Murat YILMAZ23-Nov-06 21:02

Murat YILMAZ

23-Nov-06 21:02

This is a great library. Thanks for advancade.

Best Regards,
Murat YILMAZ

Problems with Boolean

pleury12-Nov-06 16:57

pleury

12-Nov-06 16:57

I'm not able to retrieve boolean values like TRUE or FALSE. Any advice?

Regards,

pleury

Re: Problems with Boolean

pleury12-Nov-06 17:41

pleury

12-Nov-06 17:41

I found the solution.

In "public static XlsBiffRecord GetRecord(byte[] bytes, uint offset)" I added this case statement:

case BIFFRECORDTYPE.BOOLERR:<br />
   return new XlsBiffBoolErrCell(bytes, offset);<br />
   break;<br />

Then in "private bool ReadWorksheet(XlsWorksheet sheet)" I added this case statement:

case BIFFRECORDTYPE.BOOLERR:<br />
   dt.Rows[cell.RowIndex][cell.ColumnIndex] = ((XlsBiffBoolErrCell)cell).Value;<br />
   break;

Works like a charm. Big Grin | :-D

pleury

My compliments !

J-D9-Nov-06 5:57

J-D

9-Nov-06 5:57

Hello,
first of all : GREAT job ! You've got my 5

I wanted to use your code to read data from an excel file on Pocket PC.

So I recomile your code under CF.

Only several changes have to be done to let the library compile under CF :

- InvalidDataException does not exists in CF, I replace it with simple IOException

- BitConverter.Int64BitsToDouble does not exists in CF, I replace it by
BitConverter.ToDouble(BitConverter.GetBytes(XXX), 0)

- Encoding.IsSingleByte does not exists in CF, I replace it with true
--> I'm sure there's a way to know if the actual encoding is single
or more bytes. Perhaps can you give me the solution.

So yet I got a simple Excel reader for PocketPC !!!!! Great !

Perhaps can you modify your code so it can be compiled both for desktop an POcketPC applications.

Re: My compliments !

Eric China18-Dec-06 16:22

Eric China

18-Dec-06 16:22

Hi, buddy, Great Job.
and then i hv the same problem, you know, The method 'IsSingleByte()' doesn't exist in .net framework1.1, but i hv to find a solution to solve it.
can u help me. thanks a lot!

.NET 1.1

yw_edin24-Oct-06 1:42

yw_edin

24-Oct-06 1:42

Do you have the ddl for .NET 1.1? I tried to compile your code with VS2003, but as far as I can see there the compiler complains about this line:

private List<xlsworksheet> m_sheets = new List<xlsworksheet>();

Thanks.

Dr. Nikos Migas

Re: .NET 1.1

bennyahahady28-Mar-07 2:21

bennyahahady

28-Mar-07 2:21

I had succeed to downgrade the code, and it is working in .net 1.1.
I will send the code to Alex Skalozub, and he can upload it .

Re: .NET 1.1

phatmike1126-Jun-07 16:29

phatmike11

26-Jun-07 16:29

Anyone have a copy of the .net 1.1 version?

Re: .NET 1.1

satya chhikara30-Nov-09 21:13