Introduction
At last, Microsoft has turned to an XML-based format for storing document content. At the same time, this created a small problem for developers who need to index and search Microsoft Word *.docx files. It is not a problem on a computer with Microsoft Office 2007 installed, but what can you do if your application runs on a server without Office and still needs to get text from Word files? Well, there are three options:
- Install Microsoft Office 2007 and use its DLLs.
- Use a third-party library such as "Office Open XML C# Library."
- Write your own code.
In fact, there is another option: use the DocxToText class described below.
DocxToText Class
This class performs only one function: it extracts text from a given *.docx file. However, before we dig into the code, I'll remind you that a Microsoft Word *.docx file is an Open XML document that combines text, styles, graphics and so on into a single ZIP archive. Therefore we have to "unpack" the *.docx file to get to its guts. If you work with .NET Framework 3.0, you can use the Package class in the System.IO.Packaging namespace. However, since I was working with .NET Framework 2.0, I used the open-source ZIP library SharpZipLib.
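For readers on .NET Framework 3.0 or later, the Package route mentioned above might look roughly like the sketch below. It is not part of the DocxToText class; the class name PackageSketch and the method name ReadMainDocumentXml are made up for illustration, and the code assumes a reference to the assembly that provides System.IO.Packaging (WindowsBase.dll on .NET Framework). It follows the package-level "officeDocument" relationship to the main document part and returns that part's raw XML.

```csharp
using System;
using System.IO;
using System.IO.Packaging; // assumption: WindowsBase.dll (or an equivalent package) is referenced

// Hypothetical helper, not part of the original article's class.
public static class PackageSketch
{
    // Relationship type that links the package root to the main document part.
    private const string OfficeDocumentRelationship =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";

    // Returns the raw XML of the main document part, or null if the
    // package does not declare one. The stream must be readable and seekable.
    public static string ReadMainDocumentXml(Stream docxStream)
    {
        using (Package package = Package.Open(docxStream, FileMode.Open, FileAccess.Read))
        {
            foreach (PackageRelationship relationship in
                     package.GetRelationshipsByType(OfficeDocumentRelationship))
            {
                // Package-level relationships are resolved against the root URI "/".
                Uri partUri = PackUriHelper.ResolvePartUri(
                    new Uri("/", UriKind.Relative), relationship.TargetUri);

                PackagePart part = package.GetPart(partUri);
                using (StreamReader reader = new StreamReader(
                           part.GetStream(FileMode.Open, FileAccess.Read)))
                {
                    return reader.ReadToEnd();
                }
            }
        }
        return null;
    }
}
```

To read a file on disk, pass in a stream such as File.OpenRead(path). The returned XML still has to be walked to pull out the text, exactly as with the SharpZipLib approach.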
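If you rename your *.docx file to *.zip and open it in your archiver, you will see the list of packed files: for a document saved by Microsoft Word, typically [Content_Types].xml in the root plus the _rels, docProps and word folders.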
First of all, we have to read the [Content_Types].xml file and find the location of the document.xml file. Microsoft Word usually places it in the /word sub-directory, but it can be anywhere if the file was not created by Microsoft Word. Then we have to parse the document.xml file and extract text from it. The ReadNode() method does all the dirty work: it pulls out text strings, paragraphs, tabs and carriage returns, and concatenates them into the final text.
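To make the first step concrete, here is a minimal sketch of the [Content_Types].xml lookup, run against an in-memory sample instead of a real archive. The sample XML is a trimmed-down, hypothetical file, and the names ContentTypesSketch and FindMainDocumentPart are illustrative only; the namespace and content type strings are the same ones the DocxToText class uses.

```csharp
using System;
using System.Xml;

// Hypothetical helper that isolates the [Content_Types].xml lookup.
public static class ContentTypesSketch
{
    private const string ContentTypeNamespace =
        "http://schemas.openxmlformats.org/package/2006/content-types";

    private const string MainDocumentContentType =
        "application/vnd.openxmlformats-officedocument." +
        "wordprocessingml.document.main+xml";

    // Trimmed-down sample; a real file lists more parts.
    public const string SampleContentTypes =
        "<Types xmlns='http://schemas.openxmlformats.org/package/2006/content-types'>" +
        "<Default Extension='rels' " +
        "ContentType='application/vnd.openxmlformats-package.relationships+xml'/>" +
        "<Default Extension='xml' ContentType='application/xml'/>" +
        "<Override PartName='/word/document.xml' " +
        "ContentType='application/vnd.openxmlformats-officedocument." +
        "wordprocessingml.document.main+xml'/>" +
        "</Types>";

    // Returns the archive-relative path of the main document part
    // (for example "word/document.xml"), or null if none is declared.
    public static string FindMainDocumentPart(string contentTypesXml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(contentTypesXml);

        XmlNamespaceManager ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("t", ContentTypeNamespace);

        XmlNode node = doc.SelectSingleNode(
            "/t:Types/t:Override[@ContentType='" + MainDocumentContentType + "']", ns);
        if (node == null)
            return null;

        // PartName starts with '/', while ZIP entry names do not.
        return ((XmlElement)node).GetAttribute("PartName").TrimStart('/');
    }
}
```

Calling ContentTypesSketch.FindMainDocumentPart(ContentTypesSketch.SampleContentTypes) yields the entry name to look up in the archive. Stripping the leading slash matters because ZIP entry names are stored without it.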
Full text of the DocxToText class:
using System;
using System.IO;
using System.Text;
using System.Xml;
using ICSharpCode.SharpZipLib.Zip;

public class DocxToText
{
    private const string ContentTypeNamespace =
        @"http://schemas.openxmlformats.org/package/2006/content-types";

    private const string WordprocessingMlNamespace =
        @"http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Selects the <Override> entry that declares the main document part.
    private const string DocumentXmlXPath =
        "/t:Types/t:Override[@ContentType=\"" +
        "application/vnd.openxmlformats-officedocument." +
        "wordprocessingml.document.main+xml\"]";

    private const string BodyXPath = "/w:document/w:body";

    private string docxFile = "";
    private string docxFileLocation = "";

    public DocxToText(string fileName)
    {
        docxFile = fileName;
    }

    #region ExtractText()
    public string ExtractText()
    {
        if (string.IsNullOrEmpty(docxFile))
            throw new Exception("Input file not specified.");

        // Usually it is "word/document.xml".
        docxFileLocation = FindDocumentXmlLocation();

        if (string.IsNullOrEmpty(docxFileLocation))
            throw new Exception("It is not a valid Docx file.");

        return ReadDocumentXml();
    }
    #endregion

    #region FindDocumentXmlLocation()
    private string FindDocumentXmlLocation()
    {
        ZipFile zip = new ZipFile(docxFile);
        try
        {
            foreach (ZipEntry entry in zip)
            {
                // Look for the [Content_Types].xml entry (case-insensitive).
                if (string.Compare(entry.Name, "[Content_Types].xml", true) != 0)
                    continue;

                Stream contentTypes = zip.GetInputStream(entry);
                XmlDocument xmlDoc = new XmlDocument();
                xmlDoc.PreserveWhitespace = true;
                xmlDoc.Load(contentTypes);
                contentTypes.Close();

                XmlNamespaceManager nsmgr =
                    new XmlNamespaceManager(xmlDoc.NameTable);
                nsmgr.AddNamespace("t", ContentTypeNamespace);

                XmlNode node = xmlDoc.DocumentElement.SelectSingleNode(
                    DocumentXmlXPath, nsmgr);
                if (node != null)
                {
                    string location =
                        ((XmlElement) node).GetAttribute("PartName");
                    // ZIP entry names have no leading slash.
                    return location.TrimStart(new char[] {'/'});
                }
                break;
            }
            return null;
        }
        finally
        {
            // Close the archive on every path, including the early return.
            zip.Close();
        }
    }
    #endregion

    #region ReadDocumentXml()
    private string ReadDocumentXml()
    {
        StringBuilder sb = new StringBuilder();

        ZipFile zip = new ZipFile(docxFile);
        try
        {
            foreach (ZipEntry entry in zip)
            {
                if (string.Compare(entry.Name, docxFileLocation, true) != 0)
                    continue;

                Stream documentXml = zip.GetInputStream(entry);
                XmlDocument xmlDoc = new XmlDocument();
                xmlDoc.PreserveWhitespace = true;
                xmlDoc.Load(documentXml);
                documentXml.Close();

                XmlNamespaceManager nsmgr =
                    new XmlNamespaceManager(xmlDoc.NameTable);
                nsmgr.AddNamespace("w", WordprocessingMlNamespace);

                XmlNode node =
                    xmlDoc.DocumentElement.SelectSingleNode(BodyXPath, nsmgr);
                if (node == null)
                    return string.Empty;

                sb.Append(ReadNode(node));
                break;
            }
        }
        finally
        {
            // Close the archive even when returning early or on an exception.
            zip.Close();
        }
        return sb.ToString();
    }
    #endregion

    #region ReadNode()
    private string ReadNode(XmlNode node)
    {
        if (node == null || node.NodeType != XmlNodeType.Element)
            return string.Empty;

        StringBuilder sb = new StringBuilder();
        foreach (XmlNode child in node.ChildNodes)
        {
            if (child.NodeType != XmlNodeType.Element) continue;

            switch (child.LocalName)
            {
                case "t": // Text run
                    string space =
                        ((XmlElement)child).GetAttribute("xml:space");
                    if (space == "preserve")
                        // Keep the run exactly as stored, so that no extra
                        // space is added after runs with leading whitespace.
                        sb.Append(child.InnerText);
                    else
                        sb.Append(child.InnerText.TrimEnd());
                    break;

                case "cr": // Carriage return
                case "br": // Line or page break
                    sb.Append(Environment.NewLine);
                    break;

                case "tab": // Tab
                    sb.Append("\t");
                    break;

                case "p": // Paragraph: read its runs, then add a blank line
                    sb.Append(ReadNode(child));
                    sb.Append(Environment.NewLine);
                    sb.Append(Environment.NewLine);
                    break;

                default: // Any other element: descend into its children
                    sb.Append(ReadNode(child));
                    break;
            }
        }
        return sb.ToString();
    }
    #endregion
}
To extract text from a *.docx file using the DocxToText class, you need a few lines of code:
DocxToText dtt = new DocxToText(docxFileName);
string text = dtt.ExtractText();
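For indexing, the same call is usually made over many files. The following is only a sketch of one way to do that: BatchExtractSketch and ExtractAll are hypothetical names, and the extractor is passed in as a delegate so that the block compiles without SharpZipLib or the DocxToText class. Because ExtractText() throws a plain Exception for invalid files, the loop catches Exception per file and moves on.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical batch helper; not part of the original class.
public static class BatchExtractSketch
{
    // Runs 'extract' on every *.docx file directly inside 'folder' and
    // returns a map from file path to extracted text. Files that fail
    // are reported on standard error and left out of the result.
    public static Dictionary<string, string> ExtractAll(
        string folder, Converter<string, string> extract)
    {
        Dictionary<string, string> result = new Dictionary<string, string>();

        foreach (string path in Directory.GetFiles(folder, "*.docx"))
        {
            try
            {
                result[path] = extract(path);
            }
            catch (Exception ex)
            {
                // One broken document should not stop the whole run.
                Console.Error.WriteLine("Skipped " + path + ": " + ex.Message);
            }
        }
        return result;
    }
}
```

With the class from this article, the delegate would be something like delegate(string p) { return new DocxToText(p).ExtractText(); }.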
Conclusion
The class is a bit primitive, but it does its one job: extracting text. That was quite enough to implement indexing and full-text search in *.docx files in my document storage and management system, Heliocode Doc@Hand. The class does not extract page headers and footers, it does not process numbering or custom XML, and it knows nothing about the data binding used in documents. If you improve the class, I'll be glad to hear about it.
History
September 17, 2007 - Initial release