Introduction

In this article, we'll take a brief look at the Microsoft Word binary file format and present a simple way to obtain the document text from *.doc files.

One of the most popular file formats for rich text is the *.doc format created by Microsoft. OLE Automation is the most common way to work with Microsoft Word documents programmatically, but it has several disadvantages: it is slow, it requires Microsoft Office to be installed, and its document model is inconvenient. Another way to manipulate Microsoft Word files is to read and write them directly. This avoids the disadvantages described above: direct manipulation is much faster and lets you use your own document model. The main difficulty is that it requires knowledge of the binary file format.

OLE Structured Storage

A Word file is organized as a file system within a file. This technology, called OLE structured storage, allows multiple kinds of objects to be stored in a single document. OLE structured storage is a collection of two object types: storages and streams. A storage plays the role of a directory, and a stream plays the role of a file.

To open the root storage of a file, we'll use the StgOpenStorage function, declared as follows:

    [DllImport(Ole32Dll, CharSet = CharSet.Unicode)]
    public static extern int StgOpenStorage(string pwcsName,
                                            IStorage pstgPriority,
                                            int grfMode,
                                            IntPtr snbExclude,
                                            int reserved,
                                            out IStorage ppstgOpen);

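In this declaration we'll mainly use pwcsName (the path of the file), grfMode (the access mode) and ppstgOpen (which receives the opened root storage). The following helper wraps the call:

    const int _DefaultFlags = (int)(STGMFlags.STGM_READWRITE |
                                    STGMFlags.STGM_SHARE_EXCLUSIVE);

    public static IStorage OpenRootStorage(string path)
    {
        IStorage storage;
        int result = NativeMethods.StgOpenStorage(
            path,
            null,
            _DefaultFlags,
            IntPtr.Zero,
            0,
            out storage);
        // Anything other than S_OK (0) is treated as a failure.
        if (result != 0)
            return null;
        return storage;
    }
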
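Before handing a path to StgOpenStorage, it can be useful to confirm that the file is a structured storage file at all. The following is a minimal sketch, assuming the well-known 8-byte compound file signature (D0 CF 11 E0 A1 B1 1A E1) at the start of the file. The class and method names are hypothetical and are not part of the article's sources.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of the article's sources.
public static class CompoundFileProbe
{
    // Signature found at the start of OLE structured storage (compound) files.
    private static readonly byte[] Signature =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    // Returns true when the buffer begins with the compound file signature.
    public static bool HasSignature(byte[] header)
    {
        if (header == null || header.Length < Signature.Length)
            return false;
        for (int i = 0; i < Signature.Length; i++)
        {
            if (header[i] != Signature[i])
                return false;
        }
        return true;
    }

    // Reads the first eight bytes of the file and checks them.
    public static bool LooksLikeStructuredStorage(string path)
    {
        byte[] header = new byte[Signature.Length];
        using (FileStream file = File.OpenRead(path))
        {
            int read = 0;
            while (read < header.Length)
            {
                int n = file.Read(header, read, header.Length - read);
                if (n == 0)
                    return false; // file is shorter than the signature
                read += n;
            }
        }
        return HasSignature(header);
    }
}
```

A caller could run LooksLikeStructuredStorage(path) first and only call OpenRootStorage when it returns true, which separates "not a *.doc file" from other open failures such as sharing violations.
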
Reading Document Streams

The root storage of a Word file contains the document and table streams, which we'll use to read the document text. The document stream contains the Word file header (FIB, File Information Block), the document text and formatting information. The document text and its formatting are stored as a set of pieces, and each piece has an offset into the document stream. The table stream contains information about the text location, represented as a collection of piece descriptors.

To get access to any stream in the root storage, we'll use the OpenStream method of IStorage:

    int OpenStream(string pwcsName,
                   IntPtr reserved1,
                   int grfMode,
                   int reserved2,
                   out UCOMIStream ppstm);

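In this declaration, pwcsName is the stream name, grfMode is the access mode, and ppstm receives the opened stream. The helper below wraps the call:

    const int _DefaultFlags = (int)(STGMFlags.STGM_READWRITE |
                                    STGMFlags.STGM_SHARE_EXCLUSIVE);

    UCOMIStream OpenStream(IStorage storage, string name)
    {
        UCOMIStream stream;
        int result = storage.OpenStream(name, IntPtr.Zero, _DefaultFlags, 0, out stream);
        // Anything other than S_OK (0) is treated as a failure.
        if (result != 0)
            return null;
        return stream;
    }
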
    // Reads the whole stream into memory.
    byte[] GetStreamData(UCOMIStream stream)
    {
        STATSTG stat;
        stream.Stat(out stat, 0);
        long size = stat.cbSize;
        byte[] buffer = new byte[size];
        stream.Read(buffer, (int)size, IntPtr.Zero);
        return buffer;
    }

    // Wraps the stream contents in a BinaryReader for random access.
    BinaryReader GetStreamReader(UCOMIStream stream)
    {
        if (stream == null)
            return null;
        byte[] streamData = GetStreamData(stream);
        MemoryStream memoryStream = new MemoryStream(streamData);
        return new BinaryReader(memoryStream);
    }

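    void GetStreamsData(string path, out BinaryReader documentStreamReader,
                        out BinaryReader tableStreamReader)
    {
        documentStreamReader = null;
        tableStreamReader = null;
        IStorage rootStorage = OpenRootStorage(path);
        // OpenRootStorage returns null when the file is not a structured storage.
        if (rootStorage == null)
            return;
        UCOMIStream documentStream = OpenStream(rootStorage, "WordDocument");
        // Note: this always opens "0Table"; some documents keep their tables in "1Table".
        UCOMIStream tableStream = OpenStream(rootStorage, "0Table");
        documentStreamReader = GetStreamReader(documentStream);
        tableStreamReader = GetStreamReader(tableStream);
    }
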
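The helper above always opens the "0Table" stream, but a Word file may keep its tables in "1Table" instead. The sketch below shows one way to pick the name, assuming the layout described in the published Word binary format specification: the FIB starts with the 16-bit identifier 0xA5EC, and bit 0x0200 of the 16-bit flags field at offset 0x0A selects the table stream. The class and method names are hypothetical.

```csharp
using System;

// Hypothetical helper, not part of the article's sources.
public static class FibFlags
{
    private const int FlagsOffset = 0x0A;               // 16-bit flags field in the FIB
    private const ushort WhichTableStreamMask = 0x0200; // set: "1Table", clear: "0Table"
    private const ushort WordIdentifier = 0xA5EC;       // first two bytes of the FIB

    // Takes the first bytes of the "WordDocument" stream and returns
    // the name of the table stream the document refers to.
    public static string GetTableStreamName(byte[] fibStart)
    {
        if (fibStart == null || fibStart.Length < FlagsOffset + 2)
            throw new ArgumentException("At least 12 bytes of the FIB are required.");

        // All values in the file are little-endian.
        ushort identifier = (ushort)(fibStart[0] | (fibStart[1] << 8));
        if (identifier != WordIdentifier)
            throw new InvalidOperationException("Not a Word binary document stream.");

        ushort flags = (ushort)(fibStart[FlagsOffset] | (fibStart[FlagsOffset + 1] << 8));
        return (flags & WhichTableStreamMask) != 0 ? "1Table" : "0Table";
    }
}
```

With this in place, the document stream would be read first and the returned name passed to OpenStream in place of the hard-coded "0Table".
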
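Reading Document Text

Now that we have access to the main file streams, we can find out where the document text is located. The piece descriptors live in the table stream; their offset and length are stored in the FIB, at offset 418 of the document stream:

    // The FIB is the header of the document stream, so the reader passed in
    // must be the one created for "WordDocument".
    void GetDataFromFib(BinaryReader documentStreamReader, out int pieceCollOffset,
                        out uint pieceCollLength)
    {
        documentStreamReader.BaseStream.Seek(418, SeekOrigin.Begin);
        pieceCollOffset = documentStreamReader.ReadInt32();
        pieceCollLength = documentStreamReader.ReadUInt32();
    }
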
With this information in place, we can read all piece descriptors from the table stream. Each piece descriptor describes one part of the text stored in the document. Here is the code illustrating how to do this:

    PieceDescriptorCollection GetPieceDescriptors(BinaryReader tableStreamReader,
                                                  int pieceCollOffset, uint pieceCollLength)
    {
        PieceDescriptorCollection result =
            new PieceDescriptorCollection(pieceCollOffset, pieceCollLength);
        result.Read(tableStreamReader);
        return result;
    }

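The PieceDescriptorCollection class itself is not listed here. As a rough illustration of what such a reader has to do, the sketch below parses a piece table from a byte buffer. It assumes the layout given in the published Word binary format specification: optional property blocks (marker 0x01, 16-bit size), then the piece table (marker 0x02, 32-bit length), which holds n+1 character positions of 4 bytes each followed by n descriptors of 8 bytes each. Bit 30 of a descriptor's 32-bit offset field marks 8-bit text stored at half the stated offset. All type and member names are hypothetical, and BitConverter is assumed to run on a little-endian machine.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical types, not the article's PieceDescriptorCollection.
public struct Piece
{
    public int CpStart;      // first character position covered by the piece
    public int CpEnd;        // character position just past the piece
    public uint FileOffset;  // byte offset of the text in the document stream
    public bool IsUnicode;   // true: 2 bytes per character, false: 1 byte
}

public static class PieceTableSketch
{
    // clx: the bytes found in the table stream at the offset and length read from the FIB.
    public static List<Piece> Parse(byte[] clx)
    {
        int pos = 0;
        // Skip property blocks that may precede the piece table.
        while (pos < clx.Length && clx[pos] == 0x01)
        {
            if (pos + 3 > clx.Length)
                throw new FormatException("Truncated property block.");
            pos += 3 + BitConverter.ToUInt16(clx, pos + 1);
        }
        if (pos + 5 > clx.Length || clx[pos] != 0x02)
            throw new FormatException("Piece table marker not found.");

        int length = BitConverter.ToInt32(clx, pos + 1);
        pos += 5;
        if (length < 4 || pos + length > clx.Length)
            throw new FormatException("Piece table length is out of range.");

        int count = (length - 4) / 12;        // 4 bytes per position + 8 per descriptor
        int descriptors = pos + (count + 1) * 4;
        List<Piece> pieces = new List<Piece>(count);
        for (int i = 0; i < count; i++)
        {
            // The offset field follows two bytes of flags inside each descriptor.
            uint fc = BitConverter.ToUInt32(clx, descriptors + i * 8 + 2);
            bool compressed = (fc & 0x40000000u) != 0;
            uint offset = fc & 0x3FFFFFFFu;
            Piece piece = new Piece();
            piece.CpStart = BitConverter.ToInt32(clx, pos + i * 4);
            piece.CpEnd = BitConverter.ToInt32(clx, pos + (i + 1) * 4);
            piece.FileOffset = compressed ? offset / 2 : offset;
            piece.IsUnicode = !compressed;
            pieces.Add(piece);
        }
        return pieces;
    }
}
```

The byte length of a piece then follows from its character count: (CpEnd - CpStart) bytes for 8-bit text and twice that for Unicode text.
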
Note that all the work of reading piece descriptors is done inside the Read method of PieceDescriptorCollection.

The last step is to read the document text. Here is how to do this:

    string LoadText(BinaryReader documentReader, PieceDescriptorCollection pieces)
    {
        if (documentReader == null || pieces == null)
            return string.Empty;
        // StringBuilder avoids re-copying the whole text for every piece.
        System.Text.StringBuilder text = new System.Text.StringBuilder();
        int count = pieces.Count;
        for (int i = 0; i < count; i++)
        {
            uint pieceStart;
            uint pieceEnd;
            bool isUnicode = pieces.GetPieceFileBounds(i, out pieceStart, out pieceEnd);
            documentReader.BaseStream.Seek(pieceStart, SeekOrigin.Begin);
            text.Append(ReadString(documentReader, pieceEnd - pieceStart, isUnicode));
        }
        return text.ToString();
    }

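The ReadString method used above is shown below:

    string ReadString(BinaryReader reader, uint length, bool isUnicode)
    {
        if (length == 0)
            return string.Empty;
        // length is given in bytes; Unicode text uses two bytes per character.
        if (isUnicode)
            length = length / 2;
        char[] chars = new char[length];
        for (int i = 0; i < length; i++)
        {
            // Cast each branch directly: unboxing a boxed short to char would throw.
            chars[i] = isUnicode ? (char)reader.ReadUInt16() : (char)reader.ReadByte();
        }
        return new string(chars);
    }
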
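Instead of converting one character at a time, the bytes of a piece can be decoded in a single call. This is a minimal sketch, assuming the piece bytes have already been read into a buffer. It treats 8-bit text as ISO-8859-1 (code page 28591), which matches the byte-to-char cast used above but is only an approximation of how Word maps some 8-bit values. The class name is hypothetical.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the article's sources.
public static class PieceTextDecoder
{
    // Decodes the raw bytes of one piece.
    // isUnicode: true for 16-bit little-endian text, false for 8-bit text.
    public static string Decode(byte[] bytes, bool isUnicode)
    {
        if (bytes == null || bytes.Length == 0)
            return string.Empty;
        Encoding encoding = isUnicode
            ? Encoding.Unicode               // UTF-16, little-endian
            : Encoding.GetEncoding(28591);   // ISO-8859-1: each byte maps to the same code point
        return encoding.GetString(bytes);
    }
}
```

For example, Decode(reader.ReadBytes((int)(pieceEnd - pieceStart)), isUnicode) could stand in for the ReadString call inside LoadText.
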
That's all. Thanks for your attention. I hope this article is useful. Please let me know if you run into any problems.