Click here to Skip to main content
15,860,972 members
Articles / Programming Languages / C#

Read and write Open XML files (MS Office 2007)

Rate me:
Please Sign up or sign in to vote.
4.92/5 (144 votes)
31 Jul 2011CPOL8 min read 392.9K   13.7K   300   19
Learn how to read and write Open XML files (MS Office 2007).

Introduction

With Office 2007, Microsoft decided to change the default application formats from old, proprietary, closed formats (DOC, XLS, PPT) to new, open, and standardized XML formats (DOCX, XLSX, and PPTX). The new formats share some similarities with the old Office XML formats (WordML, SpreadsheetML) and some similarities with the competing OpenOffice.org OpenDocument format, but there are many differences. Since the new formats will be default in Office 2007 and Microsoft Office is the most predominant Office suite, these formats are destined to be popular and you will probably have to deal with them sooner or later.

This article will explain the basics of the Open XML file format, and specifically the XLSX format, the new format for Excel 2007. Presented is a demo application which writes / reads tabular data to / from XLSX files. The application is written in C# using Visual Studio 2010. The created XLSX files can be opened using Excel 2007 or greater.

Microsoft Open XML format

Every Open XML file is essentially a Zip archive containing many other files. Office-specific data is stored in multiple XML files inside that archive. This is in direct contrast with the old WordML and SpreadsheetML formats which were single, non-compressed XML files. Although more complex, the new approach offers a few benefits:

  • You don't need to process the entire file in order to extract specific data.
  • Images and multimedia are now encoded in native format, not as text streams.
  • Files are smaller as a result of compression and native multimedia storage.

In Microsoft's terminology, an Open XML Zip file is called a package. Files inside that package are called parts. It is important to know that every part has a defined content type and there are no default type presumptions based on the file extension. The content type can describe anything; application XML, user XML, images, sounds, video, or any other binary object. Every part must be connected to some other part using a relationship. Inside the package are special XML files with a ".rels" extension which define the relationship between the parts. There is also a start part (sometimes called "root", which is a bit misleading because the graph containing all parts doesn't have to be a tree structure), so the entire structure looks like in picture 1.

Picture 1: Parts and relations inside XLSX file.

Picture 1: Parts and relations inside an XLSX file.

To cut a long story short, in order to read the data from an Open XML file, you need to:

  1. Open package as a Zip archive - any standard Zip library will do.
  2. Find parts that contain data you want to read - you can navigate through a relationship graph (more complex), or you can presume that certain parts have a defined name and path (Microsoft can change that in the future).
  3. Read parts you are interested in - using the standard XML library (if they are XML), or some other method (if they are images, sounds, or some other type).

On the other side, if you want to create a new Open XML file, you need to:

  1. Create/get all the necessary parts - by using some standard XML library (if they are XML), by copying them, or by using some other method.
  2. Create all relationships - create ".rels" files.
  3. Create content types - create a "[Content_Types].xml" file.
  4. Package everything into a Zip file with the appropriate extension (DOCX, XLSX, or PPTX) - any standard Zip library will do.

The whole story about packages, parts, content types, and relations is the same for all Open XML documents (regardless of the originating application), and Microsoft refers to it as Open Packaging Conventions.

Excel 2007 Open XML specifics

Excel 2007 extends on the basis of Open Packaging Conventions by adding its own application-specific XML types. Reference schemas for all XML files used in Office can be downloaded from MSDN, but note that some things are still open to change until the final Excel 2007 release.

We just want to write / read worksheet data, so we need to look in the folder "\xl\worksheets" inside the XLSX file where all the worksheets are located. For every worksheet, there is a separate XML file; "sheet1.xml", "sheet2.xml", and so on. When you open such a file, you will notice that all of the sheet data is inside the <sheetData> element. For every row, there is a <row> element; for every cell, there is a <c> element. Finally, the value of the cell is stored in a <v> element.

However, real world XML is never simple as schoolbook XML. You will notice that numbers get encoded as numbers inside the <v> element:

XML
<c r="A1">
    <v>100</v>
</c>

However, a string value (like "John") also gets encoded as number:

XML
<c r="B1" t="s">
    <v>0</v>
</c>

That is because MS Excel uses an internal table of unique strings (for performance reasons). Zero is an index of that string in an internal table of strings, and the attribute t="s" tells us that the underlying type is a string, not a number. So where is the table of unique strings located? It is in the "\xl\sharedStrings.xml" XML file, and contains all strings used in the entire workbook, not just the specific worksheet.

This approach is used for many other things; cell styles, borders, charts, number formats, etc. In fact, that becomes the major programming problem when working with XLSX files - updating and maintaining various tables of some unique Excel objects. In this article, we will just write / read data values, but if you require some complex formatting, you should probably be better using some commercial component which does all the tedious work for you.

Implementation

Our demo is a Windows Presentation Foundation application (see picture 2), written in C#, using Visual Studio 2010. Since using Microsoft Packaging API for zipping and unzipping multiple files is cumbersome and tedious, our demo uses an Open-Source Zip library called SharpZipLib (available at: http://www.icsharpcode.net/OpenSource/SharpZipLib). For demonstration purposes, we will extract the entire Zip files to a TEMP folder so we can examine the contents of that folder and files while debugging the demo application. In a real world application, you may want to avoid extracting to the temporary folder and just read to / write from the Zip file directly.

Application will also read / write data to a DataTable so it can be used as a reference for Excel to DataTable export / import scenarios.

For XML processing, the choice is simple. For reading XML files, we use the XmlReader class, and for writing, we use the XmlWriter class. Both come with the .NET Framework, but you can also use any other XML processing library.

Picture 2: Demo application in action.

Picture 2: Demo application in action.

Data reading

We want to read a simple "In.xlsx" file (in the "Input" folder) and copy its contents to the DataTable. That file contains a list of people with their first and last names (text values) and their IDs (number values). When the "Read input .xlsx file" button is clicked, the following code is executed:

C#
private void ReadInput(object sender, RoutedEventArgs e)
{
    // Get the input file name from the text box.
    var fileName = this.inputTextBox.Text;
 
    // Delete contents of the temporary directory.
    XlsxRW.DeleteDirectoryContents(tempDir);
 
    // Unzip input XLSX file to the temporary directory.
    XlsxRW.UnzipFile(fileName, tempDir);
 
    IList<string> stringTable;
    // Open XML file with table of all unique strings used in the workbook..
    using (var stream = new FileStream(Path.Combine(tempDir, @"xl\sharedStrings.xml"),
        FileMode.Open, FileAccess.Read))
        // ..and call helper method that parses that XML and returns an array of strings.
        stringTable = XlsxRW.ReadStringTable(stream);
 
    // Open XML file with worksheet data..
    using (var stream = new FileStream(Path.Combine(tempDir, @"xl\worksheets\sheet1.xml"),
        FileMode.Open, FileAccess.Read))
        // ..and call helper method that parses that XML and fills DataTable with values.
        XlsxRW.ReadWorksheet(stream, stringTable, this.data);
}

Nothing unusual happens here. The XLSX file is unzipped to the TEMP folder and then the necessary XML parts (now files) are processed. The file "sharedStrings.xml" contains a global table of unique strings while the file "sheet1.xml" contains data for the first sheet. Helper methods are pretty straightforward XML reading code -- you can download the demo application code to examine them in more detail.

If everything is OK, after the button click, all data will show up in the DataGrid.

Data writing

Now we want to write data from a DataTable to the "Out.xlsx" file in the "Output" folder. You can change some data, or add some new rows in the DataGrid. When the "Write output .xlsx file" button is clicked, the following code is executed:

C#
private void WriteOutput(object sender, RoutedEventArgs e)
{
    // Get the output file name from the text box.
    string fileName = this.outputTextBox.Text;
 
    // Delete contents of the temporary directory.
    XlsxRW.DeleteDirectoryContents(tempDir);
 
    // Unzip template XLSX file to the temporary directory.
    XlsxRW.UnzipFile(templateFile, tempDir);
 
    // We will need two string tables; a lookup
    // IDictionary<string, int> for fast searching and
    // an ordinary IList<string> where items are sorted by their index.
    IDictionary<string, int> lookupTable;
 
    // Call helper methods which creates both tables from input data.
    var stringTable = XlsxRW.CreateStringTables(this.data, out lookupTable);
 
    // Create XML file..
    using (var stream = new FileStream(Path.Combine(tempDir, @"xl\sharedStrings.xml"),
        FileMode.Create))
        // ..and fill it with unique strings used in the workbook
        XlsxRW.WriteStringTable(stream, stringTable);
 
    // Create XML file..
    using (var stream = new FileStream(Path.Combine(tempDir, @"xl\worksheets\sheet1.xml"),
        FileMode.Create))
        // ..and fill it with rows and columns of the DataTable.
        XlsxRW.WriteWorksheet(stream, this.data, lookupTable);
 
    // ZIP temporary directory to the XLSX file.
    XlsxRW.ZipDirectory(tempDir, fileName);
 
    // If checkbox is checked, show XLSX file in Microsoft Excel.
    if (this.openFileCheckBox.IsChecked == true)
        System.Diagnostics.Process.Start(fileName);
}

This time the code is a bit more complicated. In order not to generate all the necessary parts needed for the XLSX file, we decide to use a template file. We extract the template file to a temporary folder and then just change the XML parts containing the shared string table and worksheet data. All other parts, relationships, and content types stay the same - so we don't need to generate any of that. Note that we use two string tables: a lookup IDictionary<string, int> for fast searching, and an ordinary IList<string> where items are sorted by their index. We could pull it out only with IList<string> but then we would need to search the entire IList<string> every time we add a new string (to check if it is already there). The CreateStringTables() helper method builds both string tables, the WriteStringTable() helper method writes the string table XML, and the WriteWorksheet() helper method writes the worksheet data XML.

Again, download the demo application code to examine the helper methods in more detail.

Download links

You can download the latest version of the demo application (together with the C# source code) from here.

Alternative ways

As always in programming, there is more than one method to achieve the same thing.

You could use Excel automation to start an instance of Excel 2007 (or any other Office application) and then use interop calls to create a document and save it. However, using automation has some drawbacks I have already written about (see why is GemBox.Spreadsheet better than automation).

Open Packaging API, released with .NET 3.5, has support for Open Packaging Conventions (package handling and navigating the relationships) but there is no support for accessing application specific data so you will still need to process XML parts manually.

As another option, you could use some third party Excel C# / VB.NET component which will come with support for the Open XML format. This will probably cost you some money, but has the advantage that usually more than one format (for example, XLS, XLSX, CSV) are supported within the same API, so your application will be able to target different file formats using the same code.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
President GemBox Ltd
United Kingdom United Kingdom
Zeljko Svedic, the owner of a component vendor company called GemBox Software.
GemBox Software provides .NET developers a GemBox components that enable easy and efficient processing of office files:

  • GemBox.Spreadsheet, a .NET component for reading, writing, converting and printing spreadsheet files (XLSX, XLS, ODS, CSV, HTML, PDF and XPS).
  • GemBox.Document, a .NET component for reading, writing, converting and printing document files (DOCX, DOC, PDF, HTML, XPS, RTF and TXT).
  • GemBox.Presentation, a .NET component for reading, writing, converting and printing presentation files (PPTX, PPT, PDF and XPS).
  • GemBox.Email, a .NET component for reading and writing email files (MSG, EML, MHTML), and composing, receiving and sending email messages using IMAP, POP, and SMTP.
  • GemBox.Pdf, a .NET component for reading, writing, creating and updating PDF files.

Comments and Discussions

 
Questiontable values do not always correspond to shared string index Pin
Member 435337114-Apr-13 15:44
Member 435337114-Apr-13 15:44 
Hi ZeljkoS,

thanks for this. I have an odd issue where the reference in the worksheet does not always correspond to the shared string value. Roughly, I'd say about 50/50 which renders it largely unusable Frown | :(

Any thoughts as to why this would be the case?

thanks

Matt
AnswerRe: table values do not always correspond to shared string index Pin
Feence8-May-16 22:16
Feence8-May-16 22:16 
GeneralMy vote of 5 Pin
tewuapple4-Sep-12 15:26
tewuapple4-Sep-12 15:26 
Questionnice Pin
wzhyuki5-Apr-12 16:16
wzhyuki5-Apr-12 16:16 
QuestionNice article! Pin
aron.sinoai29-Dec-11 9:28
aron.sinoai29-Dec-11 9:28 
GeneralMy vote of 3 Pin
itmagesh11-Oct-11 1:08
itmagesh11-Oct-11 1:08 
GeneralMy vote of 4 Pin
blvramprasad5-Oct-11 0:54
blvramprasad5-Oct-11 0:54 
SuggestionImproved way to create the DataTable Pin
AlejoBrz3-Aug-11 2:37
AlejoBrz3-Aug-11 2:37 
GeneralMy vote of 5 Pin
AlejoBrz3-Aug-11 2:31
AlejoBrz3-Aug-11 2:31 
QuestionOpenXML Toolkit Pin
peteSJ1-Aug-11 8:13
peteSJ1-Aug-11 8:13 
SuggestionMy contribution to avoid unzipping to disk Pin
AlejoBrz28-Jul-11 1:40
AlejoBrz28-Jul-11 1:40 
GeneralExcelent! Pin
Member 30944882-Jan-10 1:23
Member 30944882-Jan-10 1:23 
GeneralExcelent Work! Pin
ElManoSanta14-Aug-09 3:54
ElManoSanta14-Aug-09 3:54 
QuestionWill this work when office is not installed on the system ? Pin
thisisjaiswal3-Aug-09 4:42
thisisjaiswal3-Aug-09 4:42 
GeneralThank you. Pin
srimals16-Apr-09 5:53
srimals16-Apr-09 5:53 
GeneralGzip vs Zip Pin
Member 287297817-Mar-09 11:17
Member 287297817-Mar-09 11:17 
GeneralOpen Xml SDK made it simple ! Pin
PuneWala4-Mar-08 1:25
PuneWala4-Mar-08 1:25 
GeneralDocumentation Pin
kuerbis21-Oct-06 3:37
kuerbis21-Oct-06 3:37 
GeneralWrite data xml inside excel zip file [modified] Pin
Apoorv Bhargava1-Oct-06 1:15
Apoorv Bhargava1-Oct-06 1:15 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.