Extract Embedded Files from Microsoft Office Documents

Kees van Spelde, 11 Jun 2014, CPOL
How to extract embedded files from Microsoft Office documents

Introduction

In this tip, I will show you how to get embedded files out of the most commonly used Microsoft Office documents (Word, Excel and PowerPoint) without using the original programs. You are probably wondering why I don’t use the original programs. Well, not everybody has Microsoft Office, because it can be expensive, and you will almost never run Microsoft Office on a server.

We are going to cover the range from Office 97 up to and including Office 2013.

So Are There Any Caveats in Covering this Range?

Yes, there are. From Office 97 through Office 2003, the file format is a binary format called a “compound document”. Technically speaking, this format is like a file system inside a file. Microsoft designed it this way because, in the days of Office 97, computers were “slow”. Another advantage was that everything (e.g. the images and the text) could be stored in a single file. Inside the compound file, you have storages and streams. A storage is comparable to a folder: it holds streams and other storages. A stream is comparable to a file: it holds the actual data.

So How Is It Stored in the Binary Format?
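To make the storage and stream idea a bit more concrete, here is a toy in-memory model. It is only a sketch of the hierarchy: the ToyStorage class and its members are made up for this illustration, and it does not read or write the real compound document format.

```csharp
using System;
using System.Collections.Generic;

// Toy model only: it mirrors the storage/stream hierarchy in memory
// and knows nothing about the real on-disk compound document layout.
public sealed class ToyStorage
{
    public string Name { get; }

    // Child storages, comparable to sub folders.
    public Dictionary<string, ToyStorage> Storages { get; } = new Dictionary<string, ToyStorage>();

    // Streams, comparable to files; the byte array is the stream content.
    public Dictionary<string, byte[]> Streams { get; } = new Dictionary<string, byte[]>();

    public ToyStorage(string name)
    {
        Name = name;
    }

    // Lists every stream as a path, the way files are listed in a folder tree.
    public IEnumerable<string> ListStreamPaths(string parentPath = "")
    {
        var path = parentPath + "/" + Name;

        foreach (var streamName in Streams.Keys)
            yield return path + "/" + streamName;

        foreach (var child in Storages.Values)
            foreach (var childPath in child.ListStreamPaths(path))
                yield return childPath;
    }
}
```

A storage named Root with a stream Summary and a child storage Sub that holds a stream Data would list the paths /Root/Summary and /Root/Sub/Data.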

Word

When you save a binary Word document, the text is placed in the WordDocument stream, and the 1Table (or 0Table) stream holds the tables that describe it. When you embed a file in a binary Word document, a storage called “ObjectPool” is created, and inside this storage another storage is created for each file you embed. The names of these storages always start with an underscore followed by a 10-digit number.
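The naming rule for the storages inside “ObjectPool” can be checked with a small regular expression. This is only a sketch: the WordObjectPool class and the IsEmbeddedObjectStorage method are names I made up for the illustration, and the code assumes you already got the storage name from some compound document reader.

```csharp
using System.Text.RegularExpressions;

public static class WordObjectPool
{
    // An underscore followed by exactly ten digits, as described above.
    // \z is used instead of $ so that a trailing newline is not accepted.
    private static readonly Regex EmbeddedStorageName =
        new Regex(@"^_[0-9]{10}\z", RegexOptions.CultureInvariant);

    public static bool IsEmbeddedObjectStorage(string storageName)
    {
        return storageName != null && EmbeddedStorageName.IsMatch(storageName);
    }
}
```

With this helper, a name like _1234567890 is accepted, while ObjectPool itself or a shorter name like _123 is rejected.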

Excel

When you save a binary Excel document, all the text and formulas in the workbook are placed inside a stream called Workbook. When you embed files in the workbook, they are stored in storages called MBD<random number>. When the embedded file is another binary Office document, the storages and streams inside it are placed as nodes in the MBD storage. When the file is, for example, an Open XML document, it is placed in a stream called Package. This makes it somewhat complex to get all the embedded files out of all the variations Microsoft invented (probably with good reason).
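The two Excel cases described above could be told apart roughly like this. It is a simplified sketch: the EmbeddedKind enum and the ExcelEmbeddings class are hypothetical names, the child stream names are assumed to come from a compound document reader of your choice, and other variations that exist in real files are not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum EmbeddedKind
{
    OpenXmlPackage, // the whole embedded file sits in a stream called "Package"
    NestedBinary    // the embedded document's own storages and streams are child nodes
}

public static class ExcelEmbeddings
{
    // True for storage names that start with "MBD", the prefix described above.
    public static bool IsEmbeddingStorage(string storageName)
    {
        return storageName != null &&
               storageName.StartsWith("MBD", StringComparison.OrdinalIgnoreCase);
    }

    // Assumption of this sketch: anything without a "Package" stream is
    // treated as a nested binary document.
    public static EmbeddedKind Classify(IEnumerable<string> childStreamNames)
    {
        var hasPackage = childStreamNames.Any(
            name => string.Equals(name, "Package", StringComparison.OrdinalIgnoreCase));

        return hasPackage ? EmbeddedKind.OpenXmlPackage : EmbeddedKind.NestedBinary;
    }
}
```

For an Open XML embedding you would then write the content of the Package stream to disk, and for a nested binary document you would have to copy the child nodes into a new compound file.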

PowerPoint

PowerPoint makes it even easier to extract embedded files… NOT. When you save a binary PowerPoint file, all the text from the presentation is placed inside the “PowerPoint Document” stream, and all the images used are placed inside the Pictures stream. Embedded files, however, are not placed in separate storages; instead, they are merged into the “PowerPoint Document” stream. Sometimes the embedded files are compressed and sometimes they are not.

Covering all the different variations would make this tip very long, so I am not going to describe them here. It is probably better to just download the code and see what it does.

So How Is It Stored in the New Style (Open XML Document Format)?

Luckily, it is much easier to extract embedded files from the new format, which is the native format of Office 2007 and higher. Technically speaking, a file in this format (docx, xlsx, pptx) is just a ZIP file with a lot of XML inside it. Most of the time, the embedded files are inside the “ZIP” in their native format, so we don’t need to decode anything.

In Word, the embedded files can be found in the folder structure “/word/embeddings”, in Excel “/xl/embeddings” and in PowerPoint “/ppt/embeddings”.
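The folder names above can be looked up from the file extension. The sketch below assumes that the extension reliably tells you the document type; the EmbeddingFolders class and its ForFile method are made-up names, and only the three extensions mentioned in this tip are covered.

```csharp
using System;
using System.IO;

public static class EmbeddingFolders
{
    // Returns the folder inside the package where embedded files are stored.
    public static string ForFile(string fileName)
    {
        if (fileName == null)
            throw new ArgumentNullException("fileName");

        var extension = Path.GetExtension(fileName).ToLowerInvariant();

        switch (extension)
        {
            case ".docx": return "/word/embeddings";
            case ".xlsx": return "/xl/embeddings";
            case ".pptx": return "/ppt/embeddings";
            default:
                throw new NotSupportedException(
                    "No known embeddings folder for '" + extension + "'.");
        }
    }
}
```

Calling EmbeddingFolders.ForFile("report.docx") gives “/word/embeddings”; any other extension ends in a NotSupportedException so that unknown file types are not silently ignored.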

I hear you thinking: well, just write an unzip method and get everything out. Most of the time this will do the job, but not always. When you embed ordinary Office files then yes, you could do it this way, but when you embed other kinds of files (e.g. a text file), these files are placed inside so-called OLE structures and, again, these are in binary format.

It is also better to use the standard System.IO.Packaging.Package class, which is already present in .NET, to read the “new Office” format.

If you only want to extract normal Office documents, you can do it this way:

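// Requires: using System; using System.IO; using System.IO.Packaging;
// embeddingPartString is the embeddings folder of the format (see the folder names above).
// outputFolder, result and FileManager.FileExistsMakeNew come from the downloadable code.
using (var package = Package.Open(inputFileMemoryStream))
{
    // Walk through all the parts and keep only those inside the embeddings folder.
    foreach (var packagePart in package.GetParts())
    {
        var partUri = packagePart.Uri.ToString();
        if (!partUri.StartsWith(embeddingPartString, StringComparison.Ordinal))
            continue;

        using (var packagePartStream = packagePart.GetStream())
        using (var packagePartMemoryStream = new MemoryStream())
        {
            packagePartStream.CopyTo(packagePartMemoryStream);

            // Strip the folder prefix so that only the embedded file's name remains.
            var fileName = outputFolder + partUri.Remove(0, embeddingPartString.Length);

            // Pick a new name when a file with this name already exists.
            fileName = FileManager.FileExistsMakeNew(fileName);
            File.WriteAllBytes(fileName, packagePartMemoryStream.ToArray());
            result.Add(fileName);
        }
    }
}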

Any Caveats in Office 2007 and Higher?

Yes, of course… otherwise it would be too easy ;-). When you password protect a Word, Excel or PowerPoint document, the easy-to-read “ZIP” file is gone and we are thrown back into the compound document age. Instead of a “ZIP” file, the file is written in binary form again: a compound document that contains a stream called “EncryptedPackage”.

Because my intention is to get all the embedded files without human intervention, the code does not cover password-protected files. In this case, it just raises an exception and skips the file.
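A quick way to notice this situation is to look at the first bytes of the file before handing it to the Package class. The sketch below compares them with the well-known signatures of a compound document (D0 CF 11 E0 A1 B1 1A E1) and of a ZIP archive (“PK” followed by 03 04). The ContainerKind enum and the ContainerSniffer class are hypothetical names, and the check only tells you which container you have; to be sure the file is password protected, you would still have to look for the “EncryptedPackage” stream with a compound document reader.

```csharp
using System;
using System.IO;

public enum ContainerKind { Zip, CompoundDocument, Unknown }

public static class ContainerSniffer
{
    // Signature at the start of a compound document.
    private static readonly byte[] CompoundSignature =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    // Local file header signature at the start of a ZIP archive.
    private static readonly byte[] ZipSignature = { 0x50, 0x4B, 0x03, 0x04 };

    // Reads up to eight bytes from the current position of the stream.
    public static ContainerKind Detect(Stream stream)
    {
        var header = new byte[8];
        var read = 0;

        // Stream.Read may return fewer bytes than requested, so keep reading.
        while (read < header.Length)
        {
            var count = stream.Read(header, read, header.Length - read);
            if (count == 0) break;
            read += count;
        }

        if (StartsWith(header, read, CompoundSignature)) return ContainerKind.CompoundDocument;
        if (StartsWith(header, read, ZipSignature)) return ContainerKind.Zip;
        return ContainerKind.Unknown;
    }

    private static bool StartsWith(byte[] buffer, int length, byte[] signature)
    {
        if (length < signature.Length) return false;

        for (var i = 0; i < signature.Length; i++)
            if (buffer[i] != signature[i]) return false;

        return true;
    }
}
```

When a file with a docx, xlsx or pptx extension turns out to be a CompoundDocument, it is probably password protected and can be skipped early. Remember to rewind the stream (when it supports seeking) before you pass it on.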

Using the Code

Instead of dealing with all the different ways embedded files are stored inside Office documents, you can simply call OfficeExtractor.dll, which does all the work for you. First add a reference to DocumentServices.Modules.Extractors.OfficeExtractor.dll, then add the following two lines:

var extractor = new DocumentServices.Modules.Extractors.OfficeExtractor.Extractor();
var files = extractor.ExtractToFolder(<file to extract from>, <folder where to write the extracted files>);

Points of Interest

I always wanted to learn how the file formats that Microsoft uses work. I never expected that there would be so many differences between Word, Excel and PowerPoint.

History

  • 2014-06-10 Version 1.0

    • Extracts embedded files from binary Office files (Office 97 - 2003)
    • Extracts embedded files from Office Open XML files (Office 2007 - 2013)
    • Automatically sets hidden workbooks in Excel files visible
    • Will detect if the files are password protected
    • Unit tests for the most commonly used file types
    • 100% native .NET code, no P/Invoke

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Kees van Spelde
Software Developer (Senior)
Netherlands
Programming since I was a kid. Started on the Commodore 64 with BASIC. Since then I used programming languages like Turbo Pascal, Delphi, C++ and Visual Basic 6.

Nowadays I do a lot of programming in C# with underlying databases like MS SQL.

Last Updated 11 Jun 2014
Article Copyright 2014 by Kees van Spelde