Using DocxToText to Extract Text from DOCX Files

Jevgenij Pankov, 17 Sep 2007, CPOL
This article explains how to extract text from DOCX files without Microsoft Office libraries.
DocxToText demo application

Introduction

At last, Microsoft has turned to an XML-based format for storing document content. At the same time, this created a small problem for developers who need to index and search Microsoft Word *.docx files. It's not a problem on a computer with Microsoft Office 2007 installed, but what can you do if your application runs on a server without Office and still needs to get text from Word files? Well, there are three options:

  • Install Microsoft Office 2007 and use its DLLs.
  • Use a third-party library such as "Office Open XML C# Library".
  • Write your own code.

In fact, there is another option: use the DocxToText class described below.

DocxToText Class

This class performs only one function: it extracts text from a given *.docx file. However, before we dig into the code, I'll remind you that a Microsoft Word *.docx file is an Open XML document that combines text, styles, graphics and so on into a single ZIP archive. Therefore we have to "unpack" the *.docx file to get to its guts. If you work with .NET Framework 3.0, you can use the Package class in the System.IO.Packaging namespace. However, since I was working with .NET Framework 2.0, I used the open-source ZIP library SharpZipLib.

If you rename your *.docx file to *.zip and open it in your archiver, you will see a list of packed files like this:

[Screenshot: contents of a *.docx file opened as a ZIP archive]
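If you prefer to look at the archive from code rather than from an archiver, the listing can be reproduced in a few lines. This is only a sketch, and it makes an assumption the article does not: it uses the ZipArchive class from System.IO.Compression (available from .NET Framework 4.5 on) instead of SharpZipLib, so that it has no third-party dependency. The class and method names (DocxEntryLister, ListEntries) are made up for the illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

// Hypothetical helper, not part of DocxToText.
public static class DocxEntryLister
{
    // Returns the names of all entries stored in the ZIP archive
    // that is read from the given (seekable) stream.
    public static List<string> ListEntries(Stream docxStream)
    {
        List<string> names = new List<string>();

        // leaveOpen = true: the caller owns the stream and closes it.
        using (ZipArchive archive =
            new ZipArchive(docxStream, ZipArchiveMode.Read, true))
        {
            foreach (ZipArchiveEntry entry in archive.Entries)
                names.Add(entry.FullName);
        }
        return names;
    }

    public static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.WriteLine("Usage: DocxEntryLister <file.docx>");
            return;
        }

        using (FileStream fs = File.OpenRead(args[0]))
        {
            foreach (string name in ListEntries(fs))
                Console.WriteLine(name);
        }
    }
}
```

For a document saved by Microsoft Word, the output should include [Content_Types].xml and word/document.xml among other entries.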

First of all, we have to read the [Content_Types].xml file and find the location of the document.xml file. Microsoft Word usually puts it in the /word sub-directory, but it can be anywhere if the file was not created by Microsoft Word. Then we have to parse the document.xml file and extract the text from it. The ReadNode() method does all the dirty work: it pulls out text strings, paragraphs, tabs and carriage returns, and concatenates them into the final text.
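To make the first step more concrete, here is a small sketch of the lookup on its own, run against a hand-written [Content_Types].xml instead of a real archive. The sample XML is an assumption for illustration (a real file lists many more parts), and the names ContentTypesLookup, SampleContentTypes and FindMainDocument are hypothetical. The namespace, the content type string and the PartName attribute are the same ones the class uses.

```csharp
using System;
using System.Xml;

// Hypothetical stand-alone demo of the [Content_Types].xml lookup.
public static class ContentTypesLookup
{
    private const string ContentTypeNamespace =
        "http://schemas.openxmlformats.org/package/2006/content-types";

    private const string MainDocumentContentType =
        "application/vnd.openxmlformats-officedocument." +
        "wordprocessingml.document.main+xml";

    // Hand-written, shortened sample; not taken from a real document.
    public const string SampleContentTypes =
        "<Types xmlns=\"" + ContentTypeNamespace + "\">" +
        "<Default Extension=\"xml\" ContentType=\"application/xml\"/>" +
        "<Override PartName=\"/word/document.xml\" ContentType=\"" +
        MainDocumentContentType + "\"/>" +
        "</Types>";

    // Returns the archive path of the main document part,
    // or null when no matching Override element exists.
    public static string FindMainDocument(string contentTypesXml)
    {
        XmlDocument xmlDoc = new XmlDocument();
        xmlDoc.LoadXml(contentTypesXml);

        // The elements live in a default namespace, so XPath needs a prefix.
        XmlNamespaceManager nsmgr = new XmlNamespaceManager(xmlDoc.NameTable);
        nsmgr.AddNamespace("t", ContentTypeNamespace);

        string xpath = "/t:Types/t:Override[@ContentType='" +
                       MainDocumentContentType + "']";
        XmlElement node = xmlDoc.SelectSingleNode(xpath, nsmgr) as XmlElement;

        if (node == null)
            return null;

        // PartName starts with '/', while ZIP entry names do not.
        return node.GetAttribute("PartName").TrimStart('/');
    }

    public static void Main()
    {
        Console.WriteLine(FindMainDocument(SampleContentTypes));
        // prints "word/document.xml"
    }
}
```

The leading slash is trimmed because the part name is written as an absolute path, whereas the ZIP entry it refers to is stored as word/document.xml.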

Full text of the DocxToText class:

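using System;
using System.IO;
using System.Text;
using System.Xml;
using ICSharpCode.SharpZipLib.Zip;

public class DocxToText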
{
    private const string ContentTypeNamespace =
        @"http://schemas.openxmlformats.org/package/2006/content-types";

    private const string WordprocessingMlNamespace =
        @"http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    private const string DocumentXmlXPath =
        "/t:Types/t:Override[@ContentType=\"" +
        "application/vnd.openxmlformats-officedocument." +
        "wordprocessingml.document.main+xml\"]";

    private const string BodyXPath = "/w:document/w:body";

    private string docxFile = "";
    private string docxFileLocation = "";

    public DocxToText(string fileName)
    {
        docxFile = fileName;
    }

    #region ExtractText()
    /// <summary>
    /// Extracts text from the Docx file.
    /// </summary>
    /// <returns>Extracted text.</returns>
    public string ExtractText()
    {
        if (string.IsNullOrEmpty(docxFile))
            throw new Exception("Input file not specified.");

        // Usually it is "/word/document.xml"

        docxFileLocation = FindDocumentXmlLocation();

        if (string.IsNullOrEmpty(docxFileLocation))
            throw new Exception("It is not a valid Docx file.");

        return ReadDocumentXml();
    }
    #endregion

    #region FindDocumentXmlLocation()
    /// <summary>
    /// Gets location of the "document.xml" zip entry.
    /// </summary>
    /// <returns>Location of the "document.xml".</returns>
    private string FindDocumentXmlLocation()
    {
        ZipFile zip = new ZipFile(docxFile);
        try
        {
            foreach (ZipEntry entry in zip)
            {
                // Find "[Content_Types].xml" zip entry

                if (string.Compare(entry.Name, "[Content_Types].xml", true) == 0)
                {
                    Stream contentTypes = zip.GetInputStream(entry);

                    XmlDocument xmlDoc = new XmlDocument();
                    xmlDoc.PreserveWhitespace = true;
                    xmlDoc.Load(contentTypes);
                    contentTypes.Close();

                    // Create an XmlNamespaceManager for resolving namespaces

                    XmlNamespaceManager nsmgr = 
                        new XmlNamespaceManager(xmlDoc.NameTable);
                    nsmgr.AddNamespace("t", ContentTypeNamespace);

                    // Find location of "document.xml"

                    XmlNode node = xmlDoc.DocumentElement.SelectSingleNode(
                        DocumentXmlXPath, nsmgr);

                    if (node != null)
                    {
                        string location = 
                            ((XmlElement) node).GetAttribute("PartName");
                        return location.TrimStart(new char[] {'/'});
                    }
                    break;
                }
            }
        }
        finally
        {
            // Release the file even on an early return or an exception
            zip.Close();
        }
        return null;
    }
    #endregion

    #region ReadDocumentXml()
    /// <summary>
    /// Reads "document.xml" zip entry.
    /// </summary>
    /// <returns>Text contained in the document.</returns>
    private string ReadDocumentXml()
    {
        StringBuilder sb = new StringBuilder();

        ZipFile zip = new ZipFile(docxFile);
        try
        {
            foreach (ZipEntry entry in zip)
            {
                if (string.Compare(entry.Name, docxFileLocation, true) == 0)
                {
                    Stream documentXml = zip.GetInputStream(entry);

                    XmlDocument xmlDoc = new XmlDocument();
                    xmlDoc.PreserveWhitespace = true;
                    xmlDoc.Load(documentXml);
                    documentXml.Close();

                    XmlNamespaceManager nsmgr = 
                        new XmlNamespaceManager(xmlDoc.NameTable);
                    nsmgr.AddNamespace("w", WordprocessingMlNamespace);

                    XmlNode node = 
                        xmlDoc.DocumentElement.SelectSingleNode(BodyXPath, nsmgr);

                    if (node == null)
                        return string.Empty;

                    sb.Append(ReadNode(node));

                    break;
                }
            }
        }
        finally
        {
            // Release the file even on an early return or an exception
            zip.Close();
        }
        return sb.ToString();
    }
    #endregion

    #region ReadNode()
    /// <summary>
    /// Reads content of the node and its nested children.
    /// </summary>
    /// <param name="node">XmlNode.</param>
    /// <returns>Text contained in the node.</returns>
    private string ReadNode(XmlNode node)
    {
        if (node == null || node.NodeType != XmlNodeType.Element)
            return string.Empty;

        StringBuilder sb = new StringBuilder();
        foreach (XmlNode child in node.ChildNodes)
        {
            if (child.NodeType != XmlNodeType.Element) continue;

            switch (child.LocalName)
            {
                case "t":                           // Text
                    // InnerText holds the run's text exactly as stored,
                    // including any leading or trailing spaces, so it is
                    // appended as-is instead of being trimmed and re-padded.
                    sb.Append(child.InnerText);
                    break;

                case "cr":                          // Carriage return
                case "br":                          // Break (line, page or column)
                    sb.Append(Environment.NewLine);
                    break;

                case "tab":                         // Tab
                    sb.Append("\t");
                    break;

                case "p":                           // Paragraph
                    sb.Append(ReadNode(child));
                    sb.Append(Environment.NewLine);
                    sb.Append(Environment.NewLine);
                    break;

                default:
                    sb.Append(ReadNode(child));
                    break;
            }
        }
        return sb.ToString();
    }
    #endregion
}
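The traversal rules in ReadNode() can be tried out without any ZIP file at all. The sketch below is not the class itself: it is a cut-down, stand-alone copy of the same switch, fed with a hand-written WordprocessingML fragment. The fragment and the names ReadNodeDemo, SampleDocument and ExtractText are assumptions made for the illustration, and the text of each w:t element is appended as-is.

```csharp
using System;
using System.Text;
using System.Xml;

// Hypothetical stand-alone demo of the ReadNode() traversal.
public static class ReadNodeDemo
{
    private const string W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hand-written fragment: two paragraphs, one bold run, one tab.
    public const string SampleDocument =
        "<w:document xmlns:w=\"" + W + "\"><w:body>" +
        "<w:p><w:r><w:t xml:space=\"preserve\">Hello </w:t></w:r>" +
        "<w:r><w:rPr><w:b/></w:rPr><w:t>world</w:t></w:r></w:p>" +
        "<w:p><w:r><w:t>A</w:t><w:tab/><w:t>B</w:t></w:r></w:p>" +
        "</w:body></w:document>";

    // Walks the child elements and collects text, tabs and line breaks.
    public static string ReadNode(XmlNode node)
    {
        StringBuilder sb = new StringBuilder();
        foreach (XmlNode child in node.ChildNodes)
        {
            if (child.NodeType != XmlNodeType.Element) continue;

            switch (child.LocalName)
            {
                case "t":                       // Text of a run
                    sb.Append(child.InnerText);
                    break;
                case "cr":
                case "br":                      // Any kind of break
                    sb.Append(Environment.NewLine);
                    break;
                case "tab":
                    sb.Append('\t');
                    break;
                case "p":                       // Paragraph plus blank line
                    sb.Append(ReadNode(child));
                    sb.Append(Environment.NewLine);
                    sb.Append(Environment.NewLine);
                    break;
                default:                        // Runs, properties, tables...
                    sb.Append(ReadNode(child));
                    break;
            }
        }
        return sb.ToString();
    }

    // Loads the XML string and extracts the text below w:body.
    public static string ExtractText(string documentXml)
    {
        XmlDocument xmlDoc = new XmlDocument();
        xmlDoc.PreserveWhitespace = true;
        xmlDoc.LoadXml(documentXml);

        XmlNamespaceManager nsmgr = new XmlNamespaceManager(xmlDoc.NameTable);
        nsmgr.AddNamespace("w", W);

        XmlNode body = xmlDoc.SelectSingleNode("/w:document/w:body", nsmgr);
        return body == null ? string.Empty : ReadNode(body);
    }

    public static void Main()
    {
        Console.Write(ExtractText(SampleDocument));
    }
}
```

Run against the sample, this writes "Hello world", an empty line, and then "A" and "B" separated by a tab. Formatting elements such as w:rPr fall into the default branch and contribute nothing, which is why the bold run comes out as plain text.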

To extract text from a *.docx file using the DocxToText class, you need a few lines of code:

DocxToText dtt = new DocxToText(docxFileName);
string text = dtt.ExtractText();

Conclusion

The class is a bit primitive, but it performs its main function: to just extract text. It was quite enough to implement indexing and full-text search in *.docx files in my document storage and management system Heliocode Doc@Hand. The class does not extract page headers and footers; it does not process numbering and custom XML; similarly, it knows nothing about the data binding used in documents. If you improve the class, I'll be glad to hear about it.

History

September 17, 2007 - Initial release

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Jevgenij lives in Riga, Latvia. He started his programming career in 1983, developing software for radio equipment CAD systems. He has created computer graphics for TV and developed Internet credit card processing systems for banks. He is now a System Analyst at Accenture.

Last Updated 17 Sep 2007
Article Copyright 2007 by Jevgenij Pankov