Read Document Text Directly from Microsoft Word File

6 Jan 2008 | CPOL | 3 min read
A simple way to obtain the document text from a *.doc file.

Introduction

In this article, we'll take a brief look at the Microsoft Word binary file format and present a simple way to obtain the document text from *.doc files.

Links:

  • Cellbi.DocFramework - Read/Write Microsoft Word documents, complex formatting, sections and tables are supported.
  • GetWordTextSrc - The complete source code for this article.

You may also take a look at the project we are currently working on.

One of the most widely used file formats for rich text is the *.doc format created by Microsoft. The most common way to work with Microsoft Word documents programmatically is OLE (Object Linking and Embedding) automation, but this method has several disadvantages: it is slow, it requires Microsoft Office to be installed, and its document model is inconvenient.

Another way to manipulate Microsoft Word files is to read and write them directly, which avoids the disadvantages described above. Direct manipulation is much faster and lets you use your own document model. The main difficulty is that it requires knowledge of the binary file format.

OLE Structured Storage

A Word file is structured as a file system within a file. This technology, called OLE structured storage, allows multiple kinds of objects to be stored in a single document. A structured storage file is a collection of two object types: storages, which act like directories, and streams, which act like files.
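
Before calling any storage API, it can be handy to check whether a file is a structured storage file at all. The sketch below is not part of the article's source; the class and method names are hypothetical. It relies only on the 8-byte signature that structured storage (compound) files start with.

```csharp
using System;
using System.IO;

public static class CompoundFileCheck
{
  // Signature found in the first 8 bytes of an OLE structured storage file.
  static readonly byte[] Signature =
      { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

  // Returns true when the stream starts with the structured storage signature.
  // Hypothetical helper: it only inspects the header, it does not validate the file.
  public static bool LooksLikeCompoundFile(Stream input)
  {
    byte[] header = new byte[Signature.Length];
    int read = 0;
    while (read < header.Length)
    {
      int n = input.Read(header, read, header.Length - read);
      if (n == 0)
        return false; // stream is shorter than the signature
      read += n;
    }

    for (int i = 0; i < Signature.Length; i++)
    {
      if (header[i] != Signature[i])
        return false;
    }
    return true;
  }
}
```

A *.docx file is a ZIP package that starts with the bytes "PK", so it fails this check and cannot be read with the approach described in this article.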

The StgOpenStorage WinAPI function provides access to the root storage object of a structured storage file. Here is its declaration:

C#
// StgOpenStorage is exported by ole32.dll.
[DllImport("ole32.dll", CharSet = CharSet.Unicode)]
public static extern int StgOpenStorage(string pwcsName,
    IStorage pstgPriority,
    int grfMode,
    IntPtr snbExclude,
    int reserved,
    out IStorage ppstgOpen);

In this declaration, we'll mainly use pwcsName, the path of the file that contains the storage object to open, and ppstgOpen, which receives the IStorage implementation used to work with the root storage. Here is example code illustrating how StgOpenStorage is used:

C#
const int _DefaultFlags = (int)(STGMFlags.STGM_READWRITE |
                                STGMFlags.STGM_SHARE_EXCLUSIVE);

public static IStorage OpenRootStorage(string path)
{
  IStorage storage;
  int result = NativeMethods.StgOpenStorage(
      path,
      null,
      _DefaultFlags,
      IntPtr.Zero,
      0,
      out storage);

  if (result != 0)
    return null;
  return storage;
}
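
The STGMFlags type used above comes from the article's source code and is not shown here. As a rough guide, the sketch below lists the access and sharing values as defined in the Windows headers, under hypothetical names (StgmMode, StorageModes) chosen for this illustration.

```csharp
using System;

// Hypothetical enum mirroring the STGM_* constants from the Windows headers.
// Share modes are distinct values, not independent bits.
public enum StgmMode
{
  Read           = 0x00000000, // STGM_READ
  Write          = 0x00000001, // STGM_WRITE
  ReadWrite      = 0x00000002, // STGM_READWRITE
  ShareExclusive = 0x00000010, // STGM_SHARE_EXCLUSIVE
  ShareDenyWrite = 0x00000020, // STGM_SHARE_DENY_WRITE
  ShareDenyRead  = 0x00000030, // STGM_SHARE_DENY_READ
  ShareDenyNone  = 0x00000040  // STGM_SHARE_DENY_NONE
}

public static class StorageModes
{
  // The combination used by this article.
  public const int ReadWriteExclusive =
      (int)(StgmMode.ReadWrite | StgmMode.ShareExclusive);

  // A read-only alternative for callers that never modify the file.
  public const int ReadOnly =
      (int)(StgmMode.Read | StgmMode.ShareDenyWrite);
}
```

Since this article only reads text, the read-only combination may be enough; that is a suggestion to try, not something the original code does.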

The C# translation of the IStorage interface and the implementations of some necessary structures can be found in the article's source code.

Reading Document Streams

The root Word file storage contains document and table streams, which we'll use to read document text. The document stream contains the Word file header (FIB – File Information Block), document text and formatting information. The document text and its formatting are stored as a set of pieces, and each piece has an offset into the document stream. The table stream contains information about text location represented as a collection of piece descriptors.

To access any stream in the root storage, we'll use the OpenStream method of the IStorage interface.

C#
int OpenStream(string pwcsName,
        IntPtr reserved1,
        int grfMode,
        int reserved2,
        out UCOMIStream ppstm);

In this declaration, pwcsName is the name of the stream to open and ppstm receives the resulting stream. To get access to the document and table streams, we'll use the following code:

C#
const int _DefaultFlags = (int)(STGMFlags.STGM_READWRITE |
                                STGMFlags.STGM_SHARE_EXCLUSIVE);

UCOMIStream OpenStream(IStorage storage, string name)
{
  UCOMIStream stream;
  int result = storage.OpenStream(name, IntPtr.Zero, _DefaultFlags, 0, out stream);
  if (result != 0)
    return null;
  return stream;
}

byte[] GetStreamData(UCOMIStream stream)
{
  STATSTG stat;
  stream.Stat(out stat, 0);
  long size = stat.cbSize;
  byte[] buffer = new byte[size];

  stream.Read(buffer, (int)size, IntPtr.Zero);
  return buffer;
}

BinaryReader GetStreamReader(UCOMIStream stream)
{
  if (stream == null)
    return null;

  byte[] streamData = GetStreamData(stream);
  MemoryStream memoryStream = new MemoryStream(streamData);
  return new BinaryReader(memoryStream);
}

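void GetStreamsData(string path, out BinaryReader documentStreamReader,
        out BinaryReader tableStreamReader)
{
  documentStreamReader = null;
  tableStreamReader = null;

  IStorage rootStorage = OpenRootStorage(path);
  if (rootStorage == null)
    return; // not a structured storage file, or it could not be opened

  UCOMIStream documentStream = OpenStream(rootStorage, "WordDocument");
  // The table stream is named either "0Table" or "1Table"; a flag in the FIB
  // (fWhichTblStm) says which one is in use. This sample assumes "0Table".
  UCOMIStream tableStream = OpenStream(rootStorage, "0Table");

  documentStreamReader = GetStreamReader(documentStream);
  tableStreamReader = GetStreamReader(tableStream);
}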
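
The code above always opens "0Table", but a Word file may keep its table data in "1Table" instead. A minimal sketch of how the name could be chosen from the FIB follows. It is not taken from the article's source: the class and method names are hypothetical, and the offsets and bit masks are the ones documented for the Word 97-2003 binary format (wIdent at offset 0, the flags word at offset 0x0A).

```csharp
using System;
using System.IO;

public static class FibHeader
{
  const ushort WordIdent = 0xA5EC;      // FIB.wIdent of a Word binary file
  const ushort EncryptedFlag = 0x0100;  // fEncrypted
  const ushort WhichTableFlag = 0x0200; // fWhichTblStm

  // Hypothetical helper: inspects the start of the WordDocument stream and
  // returns "0Table" or "1Table", or null if the data is not a Word FIB.
  public static string GetTableStreamName(BinaryReader documentReader,
      out bool isEncrypted)
  {
    isEncrypted = false;
    Stream stream = documentReader.BaseStream;
    if (stream.Length < 12)
      return null; // too short to hold the fields read below

    stream.Seek(0, SeekOrigin.Begin);
    if (documentReader.ReadUInt16() != WordIdent)
      return null;

    stream.Seek(0x0A, SeekOrigin.Begin);
    ushort flags = documentReader.ReadUInt16();
    isEncrypted = (flags & EncryptedFlag) != 0;
    return (flags & WhichTableFlag) != 0 ? "1Table" : "0Table";
  }
}
```

The returned name could be passed to OpenStream in place of the hard-coded "0Table". The text of an encrypted document is not stored in readable form, so a caller may want to stop early when isEncrypted is true.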

Reading Document Text

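Now that we have access to the main file streams, we can read the location of the document text. The piece descriptors are stored in the table stream at offset fib.clxOffset and occupy fib.clxLength bytes. Both values are fields of the FIB, so they are read from the document stream, at byte offset 418:

C#
void GetDataFromFib(BinaryReader documentStreamReader, out int pieceCollOffset,
    out uint pieceCollLength)
{
  // The FIB sits at the start of the document stream; fib.clxOffset and
  // fib.clxLength are stored at byte offset 418.
  documentStreamReader.BaseStream.Seek(418, SeekOrigin.Begin);
  pieceCollOffset = documentStreamReader.ReadInt32();
  pieceCollLength = documentStreamReader.ReadUInt32();
}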

Having this information in place we may read all piece descriptors from the table stream. Each piece descriptor contains information about the text part stored in the document. Here is the code illustrating how to do this:

C#
PieceDescriptorCollection GetPieceDescriptors(BinaryReader tableStreamReader,
    int pieceCollOffset, uint pieceCollLength)
{
  PieceDescriptorCollection result =
        new PieceDescriptorCollection(pieceCollOffset, pieceCollLength);
  result.Read(tableStreamReader);
  return result;
}

Note that all the work of reading piece descriptors is done inside the PieceDescriptorCollection class. See this article's source code for the complete implementation.
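
For readers without the source at hand, here is a rough sketch of what reading the piece table involves. It is not the author's PieceDescriptorCollection: the Piece and PieceTable names are hypothetical, and the layout follows the documented Word 97-2003 format (optional property blocks marked 0x01, then a piece table marked 0x02 that holds n+1 character positions followed by n 8-byte descriptors).

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public struct Piece
{
  public uint FileStart;  // byte offset of the piece in the document stream
  public uint FileEnd;    // byte offset just past the piece
  public bool IsUnicode;  // true: 2 bytes per character, false: 1 byte
}

public static class PieceTable
{
  // Hypothetical reader for the piece table stored in the table stream.
  public static List<Piece> Read(BinaryReader table, int clxOffset, uint clxLength)
  {
    List<Piece> pieces = new List<Piece>();
    Stream stream = table.BaseStream;
    long end = clxOffset + (long)clxLength;
    stream.Seek(clxOffset, SeekOrigin.Begin);

    while (stream.Position < end)
    {
      byte marker = table.ReadByte();
      if (marker == 0x01)
      {
        // Property data that precedes the piece table; skip it.
        ushort size = table.ReadUInt16();
        stream.Seek(size, SeekOrigin.Current);
        continue;
      }
      if (marker != 0x02)
        break; // unknown marker: stop rather than guess

      uint tableSize = table.ReadUInt32();
      if (tableSize < 4)
        break;
      // (count + 1) * 4 bytes of positions + count * 8 bytes of descriptors
      int count = (int)((tableSize - 4) / 12);

      uint[] cp = new uint[count + 1];
      for (int i = 0; i <= count; i++)
        cp[i] = table.ReadUInt32();

      for (int i = 0; i < count; i++)
      {
        table.ReadUInt16();           // descriptor flags, unused here
        uint fc = table.ReadUInt32(); // file offset plus compression bit
        table.ReadUInt16();           // property modifier, unused here

        bool compressed = (fc & 0x40000000) != 0;
        uint start = compressed ? (fc & 0x3FFFFFFF) / 2 : fc;
        uint chars = cp[i + 1] - cp[i];

        Piece piece = new Piece();
        piece.FileStart = start;
        piece.FileEnd = start + (compressed ? chars : chars * 2);
        piece.IsUnicode = !compressed;
        pieces.Add(piece);
      }
      break; // the piece table is the last element
    }
    return pieces;
  }
}
```

Each resulting Piece carries the same three values that GetPieceFileBounds returns in the article's code: a start offset, an end offset, and a flag telling whether the text is stored as 16-bit characters.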

The last step is to read the document text. Here is how to do this:

C#
string LoadText(BinaryReader documentReader, PieceDescriptorCollection pieces)
{
  if (documentReader == null || pieces == null)
    return string.Empty;

  // StringBuilder avoids copying the whole text again for every piece.
  System.Text.StringBuilder text = new System.Text.StringBuilder();
  int count = pieces.Count;
  for (int i = 0; i < count; i++)
  {
    uint pieceStart;
    uint pieceEnd;
    bool isUnicode = pieces.GetPieceFileBounds(i, out pieceStart, out pieceEnd);

    documentReader.BaseStream.Seek(pieceStart, SeekOrigin.Begin);
    text.Append(ReadString(documentReader, pieceEnd - pieceStart, isUnicode));
  }
  return text.ToString();
}

The LoadText method iterates over all document pieces, gets the bounds of each piece, and reads its text. The ReadString method is simple:

C#
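string ReadString(BinaryReader reader, uint length, bool isUnicode)
{
  if (length == 0)
    return string.Empty;

  // length is in bytes; Unicode pieces use two bytes per character.
  if (isUnicode)
    length = length / 2;

  System.Text.StringBuilder result = new System.Text.StringBuilder((int)length);
  for (uint i = 0; i < length; i++)
  {
    // Cast each branch directly: boxing the value as object and unboxing it
    // as char throws InvalidCastException at run time.
    char ch = isUnicode ? (char)reader.ReadUInt16() : (char)reader.ReadByte();
    result.Append(ch);
  }
  return result.ToString();
}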
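
Casting a single byte to char works for plain ASCII, but characters such as curly quotes and dashes in 8-bit pieces come out wrong. The sketch below shows one possible refinement, assuming the 8-bit text is Windows-1252. The AnsiText class is hypothetical and the mapping table is deliberately partial.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class AnsiText
{
  // A few Windows-1252 bytes from the 0x80-0x9F range and their Unicode
  // equivalents. Partial on purpose; extend it as needed.
  static readonly Dictionary<byte, char> Cp1252 = new Dictionary<byte, char>
  {
    { 0x85, '\u2026' }, // horizontal ellipsis
    { 0x91, '\u2018' }, // left single quotation mark
    { 0x92, '\u2019' }, // right single quotation mark (apostrophe)
    { 0x93, '\u201C' }, // left double quotation mark
    { 0x94, '\u201D' }, // right double quotation mark
    { 0x96, '\u2013' }, // en dash
    { 0x97, '\u2014' }  // em dash
  };

  // Converts 8-bit text to a string, remapping the bytes listed above and
  // passing every other byte through unchanged.
  public static string Decode(byte[] bytes)
  {
    StringBuilder result = new StringBuilder(bytes.Length);
    foreach (byte b in bytes)
    {
      char mapped;
      result.Append(Cp1252.TryGetValue(b, out mapped) ? mapped : (char)b);
    }
    return result.ToString();
  }
}
```

For example, the bytes 0x49 0x74 0x92 0x73 decode to "It’s" with a proper right single quotation mark, where a plain cast would produce an invisible control character in place of the apostrophe.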

That's all. Thanks for your attention. I hope this article is useful. Please let me know if you run into any problems.

History

  • January 7th, 2008: Initial release

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Web Developer
United States
I've been excited about computers and programming since my school days. I have a master's degree in software development and currently work as a software developer at Cellbi Software.
