In this tip, I will show you how to extract embedded files from the most commonly used Microsoft Office documents (Word, Excel and PowerPoint) without using the original programs. You are probably wondering why I don’t use the original programs. Well, not everybody has Microsoft Office because it can be expensive, and you will almost never have Microsoft Office installed on a server.
We are going to cover the range from Office 97 until Office 2013.
So Are There Any Caveats in Covering this Range?
Yes, there are. From Office 97 until Office 2003, the file format is a binary format called a “compound document”. Technically speaking, this file format is like a file system inside a file. Microsoft invented this format because computers were “slow” in the days of Office 97. Another advantage of this format was that everything (e.g. the images and the text) could be stored in a single file. Inside the compound file, you have storages and streams. A storage is a container for streams and other storages, and a stream contains the actual data.
So How Do They Store It Binary?
When you save a binary Word document, all the text is placed in the 1Table stream. When you embed a file in a binary Word document, a storage called “ObjectPool” is created. Inside this storage, another storage is created for each file you embed. The names of these storages always start with an underscore followed by a 10-digit number.
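The naming rule above can be checked with a few lines of code. This is only a minimal sketch and not the code from the download: the names `ObjectPoolNames` and `IsEmbeddedObjectStorage` are made up for illustration, and the pattern assumes nothing more than what is described above (an underscore followed by ten digits).

```csharp
using System.Text.RegularExpressions;

public static class ObjectPoolNames
{
    // Underscore followed by exactly ten digits, as described in the text.
    // \z anchors at the very end so a trailing newline is not accepted.
    private static readonly Regex StorageName =
        new Regex(@"^_[0-9]{10}\z", RegexOptions.CultureInvariant);

    // Hypothetical helper: true when a storage name inside "ObjectPool"
    // looks like the storage of an embedded file.
    public static bool IsEmbeddedObjectStorage(string name)
    {
        return name != null && StorageName.IsMatch(name);
    }
}
```

A compound document reader would call this for every child storage of “ObjectPool” and only descend into the ones that match.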
When you save a binary Excel document, all the text and formulas in the workbook are placed inside a stream called Workbook. When you embed files in the workbook, these files are stored in storages called MBD<random number>. When the embedded file is another binary Office document, its storages and streams are placed as nodes inside the MBD storage. When the file is, for example, an Open XML document, it is placed in a stream called Package. This makes it somewhat complex to get the embedded files out of all the variations Microsoft invented (probably with good reason).
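To make the two Excel cases a bit more concrete, here is a small sketch of the decision. It is an illustration under assumptions, not the code from the download: `EmbeddedKind`, `ExcelEmbeddings` and `Classify` are hypothetical names, the stream names are assumed to be supplied by whatever compound document reader you use, and only the two cases described above are handled.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum EmbeddedKind { NotEmbedded, PackageStream, NestedCompoundDocument }

public static class ExcelEmbeddings
{
    // storageName:      name of a storage found in the binary Excel file
    // childStreamNames: names of the streams directly inside that storage
    public static EmbeddedKind Classify(string storageName, IEnumerable<string> childStreamNames)
    {
        // Embedded files live in storages whose name starts with "MBD".
        if (storageName == null || !storageName.StartsWith("MBD", StringComparison.Ordinal))
            return EmbeddedKind.NotEmbedded;

        // An Open XML document is stored as a single stream called "Package".
        bool hasPackage = childStreamNames != null &&
            childStreamNames.Any(n => string.Equals(n, "Package", StringComparison.OrdinalIgnoreCase));

        // Otherwise assume the storage holds the nodes of a binary Office document.
        return hasPackage ? EmbeddedKind.PackageStream : EmbeddedKind.NestedCompoundDocument;
    }
}
```

In the `PackageStream` case the stream content can be written out as is. In the `NestedCompoundDocument` case the storage has to be copied into a new compound document before it can be opened as a file.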
PowerPoint makes it even easier to extract embedded files… NOT. When you save a binary PowerPoint file, all the text from the presentation is placed inside the “PowerPoint Document” stream and all the images used are placed inside the Pictures stream. Embedded files, however, are not placed in separate storages; instead, they are merged into the “PowerPoint Document” stream. Sometimes the embedded files are compressed and sometimes they are not.
Covering all the different variations would make this tip very long, so I’m not going to describe them here. It is probably better to just download the code and see what it does.
So How Do They Store It New Style (Open XML Document Format)?
Luckily, it is much easier to extract embedded files from the new format, which is the native format of Office 2007 and higher. Technically speaking, this new format (docx, xlsx, pptx) is just a ZIP file with a lot of XML inside it. Most of the time, the embedded files are stored inside the “ZIP” in their native format, so we don’t need to decode anything.
In Word, the embedded files can be found in the folder structure “/word/embeddings”, in Excel “/xl/embeddings” and in PowerPoint “/ppt/embeddings”.
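To see what is inside these folders, you can simply list the ZIP entries. The following is a rough sketch and not the code from the download: it uses `System.IO.Compression.ZipArchive` only because that keeps the listing short, and `EmbeddingLister` and `ListEmbeddedEntries` are names made up for this example. Note that ZIP entry names have no leading slash, so the folder names are written without it.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;

public static class EmbeddingLister
{
    // Folder names from the text, without the leading slash.
    private static readonly string[] EmbeddingFolders =
        { "word/embeddings/", "xl/embeddings/", "ppt/embeddings/" };

    // Returns the full names of all entries stored in one of the embeddings folders.
    public static List<string> ListEmbeddedEntries(Stream officeFile)
    {
        // leaveOpen: true, so the caller stays in charge of the stream.
        using (var zip = new ZipArchive(officeFile, ZipArchiveMode.Read, leaveOpen: true))
        {
            return zip.Entries
                .Select(entry => entry.FullName)
                .Where(name => EmbeddingFolders.Any(folder =>
                    name.Length > folder.Length && // skip the folder entry itself
                    name.StartsWith(folder, StringComparison.OrdinalIgnoreCase)))
                .ToList();
        }
    }
}
```

This only tells you which entries exist. Whether an entry is directly usable or still wrapped in a binary structure depends on what kind of file was embedded.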
I hear you thinking: well, just write an unzip method and get everything out. Most of the time this will do the job, but not always. When you embed normal Office files then yes, you could do it this way, but when you embed other kinds of files (e.g., a text file), these files are placed inside so-called OLE structures and again… this is a binary format.
It is also better to use the standard System.IO.Packaging.Package class that is already present to read the “new Office” format.
If you only want to extract normal Office Documents, you can do it this way:
var package = Package.Open(inputFileMemoryStream);
foreach (var packagePart in package.GetParts())
{
    using (var packagePartStream = packagePart.GetStream())
    using (var packagePartMemoryStream = new MemoryStream())
    {
        var fileName = outputFolder +
        fileName = FileManager.FileExistsMakeNew(fileName);
    }
}
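The fragment above is incomplete as printed, so here is a self-contained sketch of the same idea. Treat it as my reading of the approach, not as the code from the download. The names `PackageEmbeddingExtractor` and `Extract` are made up, the filter on “/embeddings/” and the choice to reuse the part's own file name are assumptions, and existing files are simply overwritten instead of being renamed the way `FileManager.FileExistsMakeNew` suggests. `System.IO.Packaging` lives in WindowsBase on the .NET Framework and in the System.IO.Packaging package on newer .NET versions.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Packaging;

public static class PackageEmbeddingExtractor
{
    // Writes every part found under an "embeddings" folder to outputFolder
    // and returns the paths of the files that were written.
    public static List<string> Extract(Stream inputFile, string outputFolder)
    {
        var written = new List<string>();
        Directory.CreateDirectory(outputFolder);

        using (var package = Package.Open(inputFile))
        {
            foreach (var part in package.GetParts())
            {
                // Part URIs are relative and start with a slash, e.g. "/word/embeddings/x.bin".
                var uri = part.Uri.OriginalString;

                // Assumption: only parts inside an embeddings folder are of interest.
                if (uri.IndexOf("/embeddings/", StringComparison.OrdinalIgnoreCase) < 0)
                    continue;

                // Assumption: keep the part's own file name; existing files are overwritten.
                var fileName = Path.Combine(outputFolder, Path.GetFileName(uri));

                using (var source = part.GetStream(FileMode.Open, FileAccess.Read))
                using (var target = File.Create(fileName))
                {
                    source.CopyTo(target);
                }

                written.Add(fileName);
            }
        }

        return written;
    }
}
```

Calling `PackageEmbeddingExtractor.Extract(stream, folder)` on a docx, xlsx or pptx stream should leave you with the embedded Office documents in their native format, plus any binary OLE wrappers that still need separate decoding.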
Any Caveats in Office 2007 and Higher?
Yes, of course… otherwise it would be too easy ;-). When you password-protect a Word, Excel or PowerPoint document, the easy-to-read “ZIP” file is gone and we are thrown back into the age of compound structures. Instead of a “ZIP” file, the file is written in binary form again. Inside the compound document, we have a stream called “
Because my intention is to get all the embedded files without human intervention, the code does not cover password-protected files. In this case, it will just raise an exception and skip the file.
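One cheap way to notice that a docx, xlsx or pptx file is not a “ZIP” file is to look at its first bytes. The sketch below is an illustration under assumptions and not how the download necessarily does it: `OfficeContainer`, `ContainerSniffer` and `Detect` are made-up names. It relies on two well-known signatures: ZIP files start with `50 4B 03 04` and compound documents start with `D0 CF 11 E0 A1 B1 1A E1`. A compound signature on such a file is only a hint that it may be password protected, because it could also be an old binary file with the wrong extension.

```csharp
using System;
using System.IO;

public enum OfficeContainer { Unknown, Zip, CompoundDocument }

public static class ContainerSniffer
{
    private static readonly byte[] ZipMagic = { 0x50, 0x4B, 0x03, 0x04 };
    private static readonly byte[] CompoundMagic = { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    // Reads up to eight bytes from the current position and compares them
    // with the two signatures. The stream position is not restored.
    public static OfficeContainer Detect(Stream stream)
    {
        var header = new byte[8];
        int read = 0;
        while (read < header.Length)
        {
            int n = stream.Read(header, read, header.Length - read);
            if (n == 0) break; // end of stream
            read += n;
        }

        if (StartsWith(header, read, CompoundMagic)) return OfficeContainer.CompoundDocument;
        if (StartsWith(header, read, ZipMagic)) return OfficeContainer.Zip;
        return OfficeContainer.Unknown;
    }

    private static bool StartsWith(byte[] data, int length, byte[] magic)
    {
        if (length < magic.Length) return false;
        for (int i = 0; i < magic.Length; i++)
        {
            if (data[i] != magic[i]) return false;
        }
        return true;
    }
}
```

A caller could use the result to decide up front whether to skip a file, instead of waiting for the ZIP reader to fail.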
Using the Code
Instead of dealing with all the different ways embedded files are stored inside Office documents, you can simply call OfficeExtractor.dll, which does all the work for you. First add a reference to DocumentServices.Modules.Extractors.OfficeExtractor.dll, then add the following two lines:
var extractor = new DocumentServices.Modules.Extractors.OfficeExtractor.Extractor();
var files = extractor.ExtractToFolder(<file to extract from>, <folder where to write the extracted files>);
Points of Interest
I always wanted to learn how the file formats that Microsoft uses work. I never expected that there would be so many differences between Word, Excel and PowerPoint.
2014-06-10 Version 1.0
- Extracts embedded files from binary Office files (Office 97 - 2003)
- Extracts embedded files from Office Open XML files (Office 2007 - 2013)
- Automatically sets hidden workbooks in Excel files visible
- Will detect if the files are password protected
- Unit tests for the most commonly used file types
- 100% native .NET code, no P/Invoke