Read Document Text Directly from Microsoft Word File

By SteveLi-Cellbi, 6 Jan 2008

Introduction

In this article, we'll take a brief look at the Microsoft Word binary file format and present a simple way to obtain document text from *.doc files.

Links:

  • Cellbi.DocFramework - Read/Write Microsoft Word documents, complex formatting, sections and tables are supported.
  • GetWordTextSrc - The complete source code for this article.

You may also take a look at the project we are currently working on.

The most popular file format for rich text representation is the *.doc file format created by Microsoft. OLE (Object Linking and Embedding) is the most popular way to work with Microsoft Word documents programmatically, but this method has several disadvantages: it is slow, it requires Microsoft Office to be installed, and its document model is inconvenient.

Another way to manipulate Microsoft Word files is to read and write them directly, which avoids the disadvantages described above. Direct manipulation is much faster and lets you use your own document model. The most significant difficulty is that it requires knowledge of the binary file format.

OLE Structured Storage

A Word file is structured as a file system within a file. This technology, called OLE structured storage, allows multiple kinds of objects to be stored in a single document. A structured storage file is a collection of two object types: storages, which act like directories, and streams, which act like files.

The StgOpenStorage WinAPI function provides access to the root storage object in a structured storage file. Here is the declaration:
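Before opening a file as structured storage, it can be useful to check that it really is a compound file. The following is a minimal sketch, not part of the article's source code: it compares the first eight bytes of a file against the well-known compound file signature. The class and method names (CompoundFileCheck, HasCompoundFileSignature, IsCompoundFile) are hypothetical.

```csharp
using System;
using System.IO;

public static class CompoundFileCheck
{
    // Signature bytes found at the start of an OLE compound
    // (structured storage) file.
    static readonly byte[] Signature =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    // Returns true when the buffer starts with the compound file signature.
    public static bool HasCompoundFileSignature(byte[] header)
    {
        if (header == null || header.Length < Signature.Length)
            return false;

        for (int i = 0; i < Signature.Length; i++)
        {
            if (header[i] != Signature[i])
                return false;
        }
        return true;
    }

    // Convenience wrapper: reads the first eight bytes of a file and checks them.
    public static bool IsCompoundFile(string path)
    {
        using (FileStream stream = File.OpenRead(path))
        {
            byte[] header = new byte[Signature.Length];
            int read = stream.Read(header, 0, header.Length);
            return read == header.Length && HasCompoundFileSignature(header);
        }
    }
}
```

A *.docx file is a ZIP package and starts with the bytes "PK", so this check fails for it, and so does StgOpenStorage. The technique in this article applies only to the binary *.doc format.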

[DllImport(Ole32Dll, CharSet = CharSet.Unicode)]
public static extern int StgOpenStorage(string pwcsName,
    IStorage pstgPriority,
    int grfMode,
    IntPtr snbExclude,
    int reserved,
    out IStorage ppstgOpen);

In this declaration we'll mainly use pwcsName, the path of the file that contains the storage object to open, and ppstgOpen, which receives the IStorage implementation used to work with the root storage. Here is example code illustrating StgOpenStorage usage:

const int _DefaultFlags = (int)(STGMFlags.STGM_READWRITE |
                                STGMFlags.STGM_SHARE_EXCLUSIVE);

public static IStorage OpenRootStorage(string path)
{
  IStorage storage;
  int result = NativeMethods.StgOpenStorage(
      path,
      null,
      _DefaultFlags,
      IntPtr.Zero,
      0,
      out storage);

  if (result != 0)
    return null;
  return storage;
}

The IStorage interface translation and the implementation of some necessary structures can be found in the article's source code.

Reading Document Streams

The root Word file storage contains document and table streams, which we'll use to read document text. The document stream contains the Word file header (FIB – File Information Block), document text and formatting information. The document text and its formatting are stored as a set of pieces, and each piece has an offset into the document stream. The table stream contains information about text location represented as a collection of piece descriptors.
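As an illustration of how the FIB can be inspected once the document stream is in memory, here is a minimal sketch that is not taken from the article's source code. It assumes the Word 97-2003 FIB layout: a 16-bit identifier 0xA5EC at offset 0 and a 16-bit flag field at offset 10, where bit 0x0200 (fWhichTblStm) selects between the "0Table" and "1Table" streams. The names FibHeader, IsWordDocumentStream and GetTableStreamName are hypothetical.

```csharp
using System;

public static class FibHeader
{
    const ushort WordIdentifier = 0xA5EC;  // wIdent of a Word binary file (assumed layout)
    const int FlagsOffset = 10;            // 16-bit flag field in the FIB
    const ushort WhichTableStream = 0x0200; // fWhichTblStm bit

    // Reads a little-endian 16-bit value regardless of host byte order.
    static ushort ReadUInt16(byte[] data, int offset)
    {
        return (ushort)(data[offset] | (data[offset + 1] << 8));
    }

    // Returns true when the stream starts with the Word identifier.
    public static bool IsWordDocumentStream(byte[] documentStream)
    {
        return documentStream != null
            && documentStream.Length >= FlagsOffset + 2
            && ReadUInt16(documentStream, 0) == WordIdentifier;
    }

    // Returns the name of the table stream the FIB points to.
    public static string GetTableStreamName(byte[] documentStream)
    {
        if (!IsWordDocumentStream(documentStream))
            throw new ArgumentException("Not a Word document stream.");

        ushort flags = ReadUInt16(documentStream, FlagsOffset);
        return (flags & WhichTableStream) != 0 ? "1Table" : "0Table";
    }
}
```

The returned name could be passed to the stream-opening code instead of a hard-coded stream name, which matters for documents whose current table stream is "1Table".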

To get access to any stream in the root storage, we'll use the OpenStream method of the IStorage interface.

  int OpenStream(string pwcsName,
          IntPtr reserved1,
          int grfMode,
          int reserved2,
          out UCOMIStream ppstm);

In this declaration pwcsName is the name of the stream to open and ppstm receives the resulting stream. To get access to the document and table streams, we'll use the following code:

const int _DefaultFlags = (int)(STGMFlags.STGM_READWRITE |
                                STGMFlags.STGM_SHARE_EXCLUSIVE);

UCOMIStream OpenStream(IStorage storage, string name)
{
  UCOMIStream stream;
  int result = storage.OpenStream(name, IntPtr.Zero, _DefaultFlags, 0, out stream);
  if (result != 0)
    return null;
  return stream;
}

byte[] GetStreamData(UCOMIStream stream)
{
  STATSTG stat;
  stream.Stat(out stat, 0);
  long size = stat.cbSize;
  byte[] buffer = new byte[size];

  stream.Read(buffer, (int)size, IntPtr.Zero);
  return buffer;
}

BinaryReader GetStreamReader(UCOMIStream stream)
{
  if (stream == null)
    return null;

  byte[] streamData = GetStreamData(stream);
  MemoryStream memoryStream = new MemoryStream(streamData);
  return new BinaryReader(memoryStream);
}

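void GetStreamsData(string path, out BinaryReader documentStreamReader,
        out BinaryReader tableStreamReader)
{
  documentStreamReader = null;
  tableStreamReader = null;

  IStorage rootStorage = OpenRootStorage(path);
  if (rootStorage == null)
    return; // not a structured storage file, or it could not be opened

  UCOMIStream documentStream = OpenStream(rootStorage, "WordDocument");
  // The table stream is named "0Table" or "1Table"; the FIB flag
  // fWhichTblStm tells which one is current. This sample assumes "0Table".
  UCOMIStream tableStream = OpenStream(rootStorage, "0Table");

  documentStreamReader = GetStreamReader(documentStream);
  tableStreamReader = GetStreamReader(tableStream);
}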

Reading Document Text

Now that we have access to the main file streams, we can read information about the document text location. The piece descriptors live in the table stream, but their position (fib.clxOffset) and size (fib.clxLength) are stored in the FIB at the start of the document stream, at byte offset 418:

void GetDataFromFib(BinaryReader documentStreamReader, out int pieceCollOffset,
    out uint pieceCollLength)
{
  // Offset 418 (0x1A2) in the FIB holds clxOffset, followed by clxLength.
  documentStreamReader.BaseStream.Seek(418, SeekOrigin.Begin);
  pieceCollOffset = documentStreamReader.ReadInt32();
  pieceCollLength = documentStreamReader.ReadUInt32();
}
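The constant 418 in GetDataFromFib can look arbitrary. The sketch below shows one way to derive it, assuming the Word 97-2003 FIB layout, in which an array of 8-byte offset/length pairs starts at byte 154 of the document stream and the pair describing the piece table is the 34th entry (index 33). FibClx and ReadClxLocation are hypothetical names; the sketch works on an in-memory copy of the document stream.

```csharp
using System;
using System.IO;

public static class FibClx
{
    const int PairArrayStart = 154; // start of the offset/length pair array (assumed layout)
    const int ClxPairIndex = 33;    // zero-based index of the piece table pair
    const int PairSize = 8;         // 4-byte offset + 4-byte length

    // 154 + 33 * 8 = 418
    public const int ClxFieldOffset = PairArrayStart + ClxPairIndex * PairSize;

    // Reads the piece table offset and length from the FIB.
    public static void ReadClxLocation(byte[] documentStream,
        out int clxOffset, out uint clxLength)
    {
        if (documentStream == null
            || documentStream.Length < ClxFieldOffset + PairSize)
            throw new ArgumentException("Stream is too short to contain the FIB.");

        // BinaryReader always reads little-endian values, which matches the file format.
        using (BinaryReader reader = new BinaryReader(new MemoryStream(documentStream)))
        {
            reader.BaseStream.Seek(ClxFieldOffset, SeekOrigin.Begin);
            clxOffset = reader.ReadInt32();
            clxLength = reader.ReadUInt32();
        }
    }
}
```

The two values locate the piece descriptors inside the table stream, not inside the document stream they were read from.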

With this information in place, we can read all piece descriptors from the table stream. Each piece descriptor describes one part of the text stored in the document. Here is the code illustrating how to do this:

PieceDescriptorCollection GetPieceDescriptors(BinaryReader tableStreamReader,
    int pieceCollOffset, uint pieceCollLength)
{
  PieceDescriptorCollection result =
        new PieceDescriptorCollection(pieceCollOffset, pieceCollLength);
  result.Read(tableStreamReader);
  return result;
}

Note that all the work of reading piece descriptors is done inside the PieceDescriptorCollection class. See this article's source code for the complete implementation.
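The PieceDescriptorCollection class is not reproduced here. As a rough sketch of what reading the piece table can involve, the code below parses it from an in-memory table stream. It assumes the Word 97-2003 layout: optional property blocks (type byte 1) followed by one piece table block (type byte 2) that holds n + 1 character positions and n eight-byte descriptors, where bit 30 of a descriptor's file offset marks 8-bit text stored at half the recorded offset. PieceBounds and PieceTableSketch are hypothetical names, not the article's classes.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public struct PieceBounds
{
    public uint FileStart;  // byte offset of the piece in the document stream
    public uint FileEnd;    // byte offset just past the piece
    public bool IsUnicode;  // true for 16-bit text, false for 8-bit text
}

public static class PieceTableSketch
{
    public static List<PieceBounds> ReadPieces(byte[] tableStream,
        int clxOffset, uint clxLength)
    {
        List<PieceBounds> pieces = new List<PieceBounds>();
        using (BinaryReader reader = new BinaryReader(new MemoryStream(tableStream)))
        {
            long end = clxOffset + (long)clxLength;
            reader.BaseStream.Seek(clxOffset, SeekOrigin.Begin);

            while (reader.BaseStream.Position < end)
            {
                byte blockType = reader.ReadByte();
                if (blockType == 1)
                {
                    // Property block: skip its payload.
                    short size = reader.ReadInt16();
                    reader.BaseStream.Seek(size, SeekOrigin.Current);
                    continue;
                }
                if (blockType != 2)
                    throw new InvalidDataException("Unexpected block type in the piece table.");

                // n + 1 positions (4 bytes each) plus n descriptors (8 bytes each).
                uint tableSize = reader.ReadUInt32();
                int count = (int)((tableSize - 4) / 12);

                uint[] positions = new uint[count + 1];
                for (int i = 0; i <= count; i++)
                    positions[i] = reader.ReadUInt32();

                for (int i = 0; i < count; i++)
                {
                    reader.ReadUInt16();           // descriptor flags, unused here
                    uint fc = reader.ReadUInt32(); // file offset plus encoding bit
                    reader.ReadUInt16();           // property modifier, unused here

                    bool compressed = (fc & 0x40000000) != 0;
                    uint start = fc & 0x3FFFFFFF;
                    uint chars = positions[i + 1] - positions[i];

                    PieceBounds piece = new PieceBounds();
                    piece.IsUnicode = !compressed;
                    piece.FileStart = compressed ? start / 2 : start;
                    piece.FileEnd = piece.FileStart + (compressed ? chars : chars * 2);
                    pieces.Add(piece);
                }
                break; // the piece table is the last block
            }
        }
        return pieces;
    }
}
```

Each PieceBounds value carries the same information that GetPieceFileBounds returns in the article's code: a start offset, an end offset and a flag telling whether the piece is stored as 16-bit text.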

The last step is to read the document text. Here is how to do this:

// Requires: using System.Text;
string LoadText(BinaryReader documentReader, PieceDescriptorCollection pieces)
{
  if (documentReader == null || pieces == null)
    return string.Empty;

  // StringBuilder avoids re-allocating the string for every piece.
  StringBuilder text = new StringBuilder();
  int count = pieces.Count;
  for (int i = 0; i < count; i++)
  {
    uint pieceStart;
    uint pieceEnd;
    bool isUnicode = pieces.GetPieceFileBounds(i, out pieceStart, out pieceEnd);

    documentReader.BaseStream.Seek(pieceStart, SeekOrigin.Begin);
    text.Append(ReadString(documentReader, pieceEnd - pieceStart, isUnicode));
  }
  return text.ToString();
}

The LoadText method iterates over all document pieces, gets each piece's bounds and reads the piece text from the document stream. The ReadString method is simple:

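string ReadString(BinaryReader reader, uint length, bool isUnicode)
{
  if (length == 0)
    return string.Empty;

  // length is in bytes; Unicode pieces use two bytes per character.
  if (isUnicode)
    length = length / 2;

  StringBuilder result = new StringBuilder((int)length);
  for (uint i = 0; i < length; i++)
  {
    // Cast each value directly: unboxing a boxed short or byte to char
    // throws InvalidCastException.
    char ch = isUnicode ? (char)reader.ReadUInt16() : (char)reader.ReadByte();
    result.Append(ch);
  }
  return result.ToString();
}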
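Casting a byte straight to char works for plain ASCII, but punctuation such as curly apostrophes in 8-bit pieces may come out as control characters. The following sketch is one possible way to handle this, assuming that bytes in the 0x80-0x9F range of 8-bit pieces follow Windows-1252 values. The mapping table is deliberately partial, and the names PieceTextDecoder and Decode are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class PieceTextDecoder
{
    // Partial table (assumption: Windows-1252 values); extend as needed.
    static readonly Dictionary<byte, char> Punctuation = new Dictionary<byte, char>
    {
        { 0x85, '\u2026' }, // horizontal ellipsis
        { 0x91, '\u2018' }, // left single quotation mark
        { 0x92, '\u2019' }, // right single quotation mark (apostrophe)
        { 0x93, '\u201C' }, // left double quotation mark
        { 0x94, '\u201D' }, // right double quotation mark
        { 0x95, '\u2022' }, // bullet
        { 0x96, '\u2013' }, // en dash
        { 0x97, '\u2014' }  // em dash
    };

    // Decodes the raw bytes of one piece into a string.
    public static string Decode(byte[] data, bool isUnicode)
    {
        if (data == null || data.Length == 0)
            return string.Empty;

        // 16-bit pieces are UTF-16 little-endian; ignore a trailing odd byte.
        if (isUnicode)
            return Encoding.Unicode.GetString(data, 0, data.Length & ~1);

        StringBuilder result = new StringBuilder(data.Length);
        foreach (byte b in data)
        {
            char mapped;
            result.Append(Punctuation.TryGetValue(b, out mapped) ? mapped : (char)b);
        }
        return result.ToString();
    }
}
```

To use it in place of the character loop, read the piece with reader.ReadBytes and pass the resulting array together with the isUnicode flag.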

That's all. Thanks for your attention. Hope this article will be useful. Please let me know if there are any problems.

History

  • January 7th, 2008: Initial release

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

SteveLi-Cellbi
Web Developer
United States
I've been excited about computers and programming since my school days. I have a master's degree in software development and at the moment I'm a software developer at Cellbi Software.

Last Updated 7 Jan 2008
Article Copyright 2008 by SteveLi-Cellbi