
Read Document Text Directly from Microsoft Word File

By SteveLi-Cellbi, 6 Jan 2008
 

Introduction

In this article, we'll take a brief look at the Microsoft Word binary file format and present a simple way to extract document text from *.doc files.

Links:

  • Cellbi.DocFramework - Read/Write Microsoft Word documents, complex formatting, sections and tables are supported.
  • GetWordTextSrc - The complete source code for this article.

You may also take a look at the project we are currently working on.

The *.doc file format created by Microsoft is one of the most popular formats for rich text. The most common way to work with Microsoft Word documents programmatically is OLE (Object Linking and Embedding) automation, but this method has several disadvantages: it is slow, it requires Microsoft Office to be installed, and it forces you to use Word's own, rather inconvenient, document model.

Another way to manipulate Microsoft Word files is to read and write them directly, which avoids the disadvantages described above. Direct manipulation is much faster and lets you use your own document model. The main difficulty is that it requires knowledge of the binary file format.

OLE Structured Storage

A Word file is organized as a file system within a file. This technology, called OLE structured storage, allows multiple kinds of objects to be stored in a single document. A structured storage is a collection of two object types: storages, which play the role of directories, and streams, which play the role of files.
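Before opening a file as a structured storage, it can help to confirm that it really is one. The following is a minimal sketch, not part of the article's source code: it checks the 8-byte signature that compound (structured storage) files start with. The class and method names are hypothetical and chosen only for illustration.

```csharp
using System;

public static class CompoundFileCheck
{
    // Signature found at the start of an OLE structured storage (compound) file.
    private static readonly byte[] Signature =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    // Returns true when 'header' (the first bytes of a file) starts with
    // the compound file signature. At least 8 bytes are required.
    public static bool LooksLikeCompoundFile(byte[] header)
    {
        if (header == null || header.Length < Signature.Length)
            return false;

        for (int i = 0; i < Signature.Length; i++)
        {
            if (header[i] != Signature[i])
                return false;
        }
        return true;
    }
}
```

A *.docx file is a ZIP archive that starts with the bytes "PK", so this check returns false for it. That is one way to detect early that the direct *.doc reading approach described here does not apply to a given file.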

The StgOpenStorage WinAPI function provides access to the root storage object in a structured storage system. Here is the declaration:

[DllImport(Ole32Dll, CharSet = CharSet.Unicode)]
public static extern int StgOpenStorage(string pwcsName,
    IStorage pstgPriority,
    int grfMode,
    IntPtr snbExclude,
    int reserved,
    out IStorage ppstgOpen);

In this declaration, we'll mainly use two parameters: pwcsName, the path of the file that contains the storage object to open, and ppstgOpen, which receives the IStorage implementation used to work with the root storage. Here is an example illustrating how StgOpenStorage is used:

const int _DefaultFlags = (int)(STGMFlags.STGM_READWRITE |
                                STGMFlags.STGM_SHARE_EXCLUSIVE);

public static IStorage OpenRootStorage(string path)
{
  IStorage storage;
  int result = NativeMethods.StgOpenStorage(
      path,
      null,
      _DefaultFlags,
      IntPtr.Zero,
      0,
      out storage);

  if (result != 0)
    return null;
  return storage;
}

IStorage interface translation and implementation of some necessary structures can be found in the article's source code.

Reading Document Streams

The root Word file storage contains document and table streams, which we'll use to read document text. The document stream contains the Word file header (FIB – File Information Block), document text and formatting information. The document text and its formatting are stored as a set of pieces, and each piece has an offset into the document stream. The table stream contains information about text location represented as a collection of piece descriptors.

To get access to any stream in the root storage, we'll use the OpenStream method of the IStorage interface.

  int OpenStream(string pwcsName,
          IntPtr reserved1,
          int grfMode,
          int reserved2,
          out UCOMIStream ppstm);

In this declaration, pwcsName is the name of the stream to open and ppstm receives the resulting stream. To get access to the document and table streams, we'll use the following code:

const int _DefaultFlags = (int)(STGMFlags.STGM_READWRITE |
                                STGMFlags.STGM_SHARE_EXCLUSIVE);

UCOMIStream OpenStream(IStorage storage, string name)
{
  UCOMIStream stream;
  int result = storage.OpenStream(name, IntPtr.Zero, _DefaultFlags, 0, out stream);
  if (result != 0)
    return null;
  return stream;
}

byte[] GetStreamData(UCOMIStream stream)
{
  STATSTG stat;
  stream.Stat(out stat, 0);
  long size = stat.cbSize;
  byte[] buffer = new byte[size];

  stream.Read(buffer, (int)size, IntPtr.Zero);
  return buffer;
}

BinaryReader GetStreamReader(UCOMIStream stream)
{
  if (stream == null)
    return null;

  byte[] streamData = GetStreamData(stream);
  MemoryStream memoryStream = new MemoryStream(streamData);
  return new BinaryReader(memoryStream);
}

void GetStreamsData(string path, out BinaryReader documentStreamReader,
        out BinaryReader tableStreamReader)
{
  IStorage rootStorage = OpenRootStorage(path);

  UCOMIStream documentStream = OpenStream(rootStorage, "WordDocument");
  UCOMIStream tableStream = OpenStream(rootStorage, "0Table");

  documentStreamReader = GetStreamReader(documentStream);
  tableStreamReader = GetStreamReader(tableStream);
}
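The code above always opens the stream named "0Table". In the Word 97-2003 binary format, a document may keep its table data in either "0Table" or "1Table", and a bit in the FIB flags word tells which one is current. The following is a minimal sketch, not part of the article's source code, that picks the name from the raw bytes of the WordDocument stream. It assumes the flags word sits at byte offset 10 of the FIB and that bit 0x0200 selects "1Table"; the class and method names are hypothetical.

```csharp
using System;

public static class TableStreamSelector
{
    // Assumed layout: 16-bit flags word at byte offset 10 of the FIB.
    private const int FlagsOffset = 10;

    // Assumed meaning: when this bit is set, the table stream is "1Table".
    private const int WhichTableStreamBit = 0x0200;

    // Returns the name of the table stream that belongs to the document,
    // given the raw bytes of the "WordDocument" stream.
    public static string GetTableStreamName(byte[] documentStreamData)
    {
        if (documentStreamData == null ||
            documentStreamData.Length < FlagsOffset + 2)
        {
            throw new ArgumentException(
                "The WordDocument stream is too short to contain a FIB.");
        }

        // Assemble the little-endian 16-bit value by hand so the result
        // does not depend on the byte order of the machine.
        int flags = documentStreamData[FlagsOffset] |
                    (documentStreamData[FlagsOffset + 1] << 8);

        return (flags & WhichTableStreamBit) != 0 ? "1Table" : "0Table";
    }
}
```

With a helper like this, GetStreamsData could read the document stream first and pass the returned name to OpenStream instead of the hard-coded "0Table".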

Reading Document Text

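Now that we have access to the main file streams, we can read the information about where the document text is located. The piece descriptors are stored in the table stream, starting at fib.clxOffset and occupying fib.clxLength bytes. Both values are read from the FIB, which is located at the start of the document stream:

void GetDataFromFib(BinaryReader documentStreamReader, out int pieceCollOffset,
    out uint pieceCollLength)
{
  // fib.clxOffset and fib.clxLength are stored at byte offset 418 of the FIB.
  documentStreamReader.BaseStream.Seek(418, SeekOrigin.Begin);
  pieceCollOffset = documentStreamReader.ReadInt32();
  pieceCollLength = documentStreamReader.ReadUInt32();
}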
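Seeking to offset 418 only makes sense if the stream really begins with a Word FIB. The following is a minimal sketch, not part of the article's source code, that validates the stream before reading the two values. It assumes the FIB starts with the 16-bit identifier 0xA5EC used by Word binary files; the class and method names are hypothetical.

```csharp
using System;
using System.IO;

public static class FibReader
{
    // Assumed identifier stored in the first two bytes of a Word FIB.
    private const ushort WordIdentifier = 0xA5EC;

    // Offset of the piece table location fields, as used in the article.
    private const int ClxFieldsOffset = 418;

    // Reads the piece table offset and length from the raw bytes of the
    // "WordDocument" stream. Returns false when the data does not look
    // like a Word document stream.
    public static bool TryReadPieceTableLocation(
        byte[] documentStreamData, out int offset, out uint length)
    {
        offset = 0;
        length = 0;

        if (documentStreamData == null ||
            documentStreamData.Length < ClxFieldsOffset + 8)
        {
            return false;
        }

        // BinaryReader always reads little-endian values.
        using (var reader = new BinaryReader(new MemoryStream(documentStreamData)))
        {
            if (reader.ReadUInt16() != WordIdentifier)
                return false;

            reader.BaseStream.Seek(ClxFieldsOffset, SeekOrigin.Begin);
            offset = reader.ReadInt32();
            length = reader.ReadUInt32();
        }
        return true;
    }
}
```

Returning false instead of throwing lets the caller report "not a Word document" without treating it as an exceptional condition.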

Having this information in place we may read all piece descriptors from the table stream. Each piece descriptor contains information about the text part stored in the document. Here is the code illustrating how to do this:

PieceDescriptorCollection GetPieceDescriptors(BinaryReader tableStreamReader,
    int pieceCollOffset, uint pieceCollLength)
{
  PieceDescriptorCollection result =
        new PieceDescriptorCollection(pieceCollOffset, pieceCollLength);
  result.Read(tableStreamReader);
  return result;
}

Note that all work to read piece descriptors is done inside the PieceDescriptorCollection class. See this article's source code for complete implementation.
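Since the PieceDescriptorCollection class is not listed here, the following sketch shows roughly what reading the piece table involves. It is not the article's implementation. It assumes the commonly documented Word 97-2003 layout: optional property blocks (type byte 1) followed by the piece table block (type byte 2), which holds n + 1 character positions and then n 8-byte descriptors. It also assumes that bit 30 of a descriptor's offset field marks 8-bit text whose real offset is half the stored value. All type and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public struct TextPiece
{
    public uint FileOffset;  // byte offset of the text in the document stream
    public uint ByteLength;  // length of the text in bytes
    public bool IsUnicode;   // true: 2 bytes per character, false: 1 byte
}

public static class PieceTableSketch
{
    public static List<TextPiece> Parse(byte[] tableStreamData,
                                        int clxOffset, uint clxLength)
    {
        var pieces = new List<TextPiece>();
        using (var reader = new BinaryReader(new MemoryStream(tableStreamData)))
        {
            long end = clxOffset + (long)clxLength;
            reader.BaseStream.Seek(clxOffset, SeekOrigin.Begin);

            while (reader.BaseStream.Position < end)
            {
                byte blockType = reader.ReadByte();
                if (blockType == 1)
                {
                    // Property block: skip its payload.
                    short size = reader.ReadInt16();
                    reader.BaseStream.Seek(size, SeekOrigin.Current);
                    continue;
                }
                if (blockType != 2)
                    throw new InvalidDataException("Unexpected block type.");

                uint tableSize = reader.ReadUInt32();
                if (tableSize < 4)
                    throw new InvalidDataException("Piece table is too small.");

                // n + 1 positions (4 bytes each) and n descriptors (8 bytes each).
                int count = (int)((tableSize - 4) / 12);

                uint[] positions = new uint[count + 1];
                for (int i = 0; i <= count; i++)
                    positions[i] = reader.ReadUInt32();

                for (int i = 0; i < count; i++)
                {
                    reader.ReadUInt16();            // descriptor flags, unused here
                    uint fc = reader.ReadUInt32();  // offset plus compression bit
                    reader.ReadUInt16();            // property modifier, unused here

                    bool compressed = (fc & 0x40000000) != 0;
                    uint rawOffset = fc & 0x3FFFFFFF;
                    uint charCount = positions[i + 1] - positions[i];

                    pieces.Add(new TextPiece
                    {
                        IsUnicode = !compressed,
                        FileOffset = compressed ? rawOffset / 2 : rawOffset,
                        ByteLength = compressed ? charCount : charCount * 2
                    });
                }
                break; // the piece table is the last block
            }
        }
        return pieces;
    }
}
```

Each TextPiece carries the same information that GetPieceFileBounds returns in the code below: where a piece starts, where it ends (FileOffset + ByteLength), and whether it is stored as Unicode.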

The last step is to read the document text. Here is how to do this:

// Requires: using System.Text;
string LoadText(BinaryReader documentReader, PieceDescriptorCollection pieces)
{
  if (documentReader == null || pieces == null)
    return string.Empty;

  // StringBuilder avoids creating a new string for every piece.
  StringBuilder text = new StringBuilder();
  int count = pieces.Count;
  for (int i = 0; i < count; i++)
  {
    uint pieceStart;
    uint pieceEnd;
    bool isUnicode = pieces.GetPieceFileBounds(i, out pieceStart, out pieceEnd);

    // Piece bounds are byte offsets into the document stream.
    documentReader.BaseStream.Seek(pieceStart, SeekOrigin.Begin);
    text.Append(ReadString(documentReader, pieceEnd - pieceStart, isUnicode));
  }
  return text.ToString();
}

The LoadText method iterates over all document pieces, gets each piece's bounds and reads the text between them. The ReadString method is simple:

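// Requires: using System.Text;
string ReadString(BinaryReader reader, uint length, bool isUnicode)
{
  if (length == 0)
    return string.Empty;

  // length is given in bytes; Unicode pieces use two bytes per character.
  if (isUnicode)
    length = length / 2;

  StringBuilder result = new StringBuilder((int)length);
  for (uint i = 0; i < length; i++)
  {
    // Cast each branch to char directly: unboxing a boxed short or byte
    // to char throws InvalidCastException at run time.
    char ch = isUnicode ? (char)reader.ReadUInt16() : (char)reader.ReadByte();
    result.Append(ch);
  }
  return result.ToString();
}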
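ReadString converts each byte of a non-Unicode piece straight to a character, which works for plain ASCII but can mangle typographic characters such as curly apostrophes. The following is a minimal sketch, not part of the article's source code, of one way to handle this. It assumes that 8-bit pieces use Windows-1252 style values for these characters, and it maps only a handful of them, so the table is deliberately partial. The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CompressedTextDecoder
{
    // Partial table (an assumption, not a complete list): byte values in
    // 8-bit pieces that stand for typographic characters.
    private static readonly Dictionary<byte, char> SpecialCharacters =
        new Dictionary<byte, char>
        {
            { 0x85, '\u2026' }, // horizontal ellipsis
            { 0x91, '\u2018' }, // left single quotation mark
            { 0x92, '\u2019' }, // right single quotation mark (apostrophe)
            { 0x93, '\u201C' }, // left double quotation mark
            { 0x94, '\u201D' }, // right double quotation mark
            { 0x96, '\u2013' }, // en dash
            { 0x97, '\u2014' }  // em dash
        };

    // Converts the bytes of an 8-bit text piece to a string, replacing
    // the byte values listed above with their Unicode characters.
    public static string Decode(byte[] bytes)
    {
        if (bytes == null)
            return string.Empty;

        var result = new StringBuilder(bytes.Length);
        foreach (byte b in bytes)
        {
            char mapped;
            result.Append(SpecialCharacters.TryGetValue(b, out mapped)
                ? mapped
                : (char)b);
        }
        return result.ToString();
    }
}
```

For example, the bytes 0x49 0x74 0x92 0x73 decode to "It’s" with a proper right single quotation mark instead of an unprintable control character.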

That's all. Thanks for your attention. I hope this article will be useful. Please let me know if there are any problems.

History

  • January 7th, 2008: Initial release

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

SteveLi-Cellbi
Web Developer
United States
I've been excited about computers and programming since my school days. I have a master's degree in software development, and at the moment I'm a software developer at Cellbi Software.

Last Updated 7 Jan 2008
Article Copyright 2008 by SteveLi-Cellbi