Find Text in Word Documents

9 Jun 2014 | CPOL | 2 min read
A small WinForms application for reading multiple DOCX files and retrieving their text content

Introduction

This tip shows how to perform a string or regex search on multiple DOCX files in a specified directory.

The accompanying application demonstrates how to read DOCX files, convert them to text, and search that text for a specific string or regex. It is based on the article Show Word file in WPF, which explains the DOCX file format and implements the DOCX reader used in this tip, so I recommend reading that article first.

Implementation

We will use the same DocxReader class from the article mentioned above to unzip the DOCX files and to read the main document part (document.xml) with XmlReader. We will also implement a converter (DocxToStringConverter) that converts specific XML elements (or their content) from document.xml to strings.

DocxToStringConverter

This class inherits from DocxReader and overrides its virtual reading methods to build the text as follows:

  • When DocxReader starts reading the document element (<document>), we create a new StringBuilder to which all of the DOCX text content will be appended:
    C#
    protected override void ReadDocument(XmlReader reader)
    {
        this.text = new StringBuilder();
        base.ReadDocument(reader);
    }
  • After DocxReader has read a paragraph element (<p>), we end the current line and append an empty line to the StringBuilder, so that paragraphs are separated by a blank line:
    C#
    protected override void ReadParagraph(XmlReader reader)
    {
        base.ReadParagraph(reader);
        this.text.AppendLine().AppendLine();
    }
  • When DocxReader reads a text element (<t>), we append the content of that element to the StringBuilder:
    C#
    protected override void ReadText(XmlReader reader)
    {
        this.text.Append(reader.ReadString());
    }
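
The DocxReader base class is not reproduced in this tip, so the three overrides above cannot be run on their own. As a rough, standalone sketch of the same idea, the code below reads word/document.xml directly with ZipArchive and XmlReader. It is not the author's DocxReader: the DocxPlainText class and its Extract method are hypothetical names, it assumes .NET Framework 4.5 or later (where System.IO.Compression.ZipArchive is available), and it only handles <w:t> and <w:p> elements.

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Xml;

// Hypothetical helper, not part of the original DocxReader code.
public static class DocxPlainText
{
    // WordprocessingML main namespace used by <w:document>, <w:p> and <w:t>.
    private const string W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    public static string Extract(Stream docxStream)
    {
        var text = new StringBuilder();

        // A DOCX file is a ZIP package; the main part lives at word/document.xml.
        using (var zip = new ZipArchive(docxStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml was not found in the package.");

            using (Stream part = entry.Open())
            using (XmlReader reader = XmlReader.Create(part))
            {
                while (!reader.EOF)
                {
                    bool inMainNamespace = reader.NamespaceURI == W;

                    if (inMainNamespace && reader.NodeType == XmlNodeType.Element && reader.LocalName == "t")
                    {
                        // ReadElementContentAsString already moves past </w:t>,
                        // so we must not call Read() again in this iteration.
                        text.Append(reader.ReadElementContentAsString());
                        continue;
                    }

                    // A paragraph ends either at </w:p> or at an empty <w:p/> element.
                    bool paragraphEnded = inMainNamespace && reader.LocalName == "p" &&
                        (reader.NodeType == XmlNodeType.EndElement ||
                         (reader.NodeType == XmlNodeType.Element && reader.IsEmptyElement));

                    if (paragraphEnded)
                        text.AppendLine().AppendLine();

                    reader.Read();
                }
            }
        }

        return text.ToString();
    }
}
```

Calling DocxPlainText.Extract(stream) on an open DOCX stream returns the text of every run, with a blank line after each paragraph, which mirrors what the overrides above append to the StringBuilder.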

MainForm

Image 1

This simple Windows Forms user interface lets you search DOCX files in a specified directory (and, optionally, its subdirectories) and shows the search results in a ListView control using the code below:

C#
private void btnSearch_Click(object sender, EventArgs e)
{
    // ...

    foreach (var filePath in Search(this.txtDirectory.Text, this.txtSearch.Text,
        this.cBoxUseSubdirectories.Checked, this.cBoxCaseSensitive.Checked, this.rBtnRegex.Checked))
    {
        var file = new FileInfo(filePath);
        this.resultListView.Items.Add(new ListViewItem(new string[]
            { file.Name, string.Format("{0:0.0}", file.Length / 1024d), file.FullName }));
    }
}

Depending on the user's choice, we perform either a regex or a plain string search on the current DOCX file. To accomplish this, we use the Predicate<T> delegate to implement the two search options, as in the following code:

C#
var isMatch = useRegex ?
    new Predicate<string>(
        x => Regex.IsMatch(x, searchString,
            caseSensitive ? RegexOptions.None : RegexOptions.IgnoreCase)) :
    new Predicate<string>(
        x => x.IndexOf(searchString,
            caseSensitive ? StringComparison.Ordinal : StringComparison.OrdinalIgnoreCase) >= 0);
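
To try the two search modes outside the form, the same selection can be wrapped in a small factory method. This is only a sketch: the TextMatcher class and its Create method are hypothetical names that do not appear in the downloadable source. Unlike the snippet above, it constructs the Regex once up front, so an invalid pattern throws an ArgumentException before any file is opened.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper that packages the regex/string choice shown above.
public static class TextMatcher
{
    public static Predicate<string> Create(string searchString, bool caseSensitive, bool useRegex)
    {
        if (useRegex)
        {
            // Constructing the Regex here validates the pattern immediately.
            var regex = new Regex(
                searchString,
                caseSensitive ? RegexOptions.None : RegexOptions.IgnoreCase);

            return text => regex.IsMatch(text);
        }

        // Ordinal comparison treats the search string as literal characters.
        StringComparison comparison = caseSensitive
            ? StringComparison.Ordinal
            : StringComparison.OrdinalIgnoreCase;

        return text => text.IndexOf(searchString, comparison) >= 0;
    }
}
```

For example, TextMatcher.Create("invoice", false, false) returns a predicate that accepts any text containing "invoice" in any casing, while TextMatcher.Create(@"\d{3}", true, true) accepts text that contains three consecutive digits.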

The isMatch delegate is used in a method that iterates over all DOCX files in the specified directory, converts each one to text, and returns the path of every DOCX file that satisfies isMatch. It does so with a C# iterator (yield return statement), as in the following code:

C#
foreach (var filePath in Directory.GetFiles(directory, "*.docx",
    searchSubdirectories ? SearchOption.AllDirectories : SearchOption.TopDirectoryOnly))
{
    string docxText;

    using (var stream = File.Open(filePath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        docxText = new DocxToStringConverter(stream).Convert();

    if (isMatch(docxText))
        yield return filePath;
}
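
The loop above is only the body of the iterator method. A minimal sketch of a complete method might look like the following. The DocxSearch class, its parameter list, and the extractText delegate are assumptions made so the example compiles on its own; extractText stands in for new DocxToStringConverter(stream).Convert(). Skipping files that cannot be read is also an assumption rather than the behavior of the original application, but it is worth considering because Word's temporary "~$" owner files match "*.docx" too and are not valid packages.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical wrapper around the loop shown above.
public static class DocxSearch
{
    public static IEnumerable<string> Search(
        string directory,
        bool searchSubdirectories,
        Func<Stream, string> extractText, // stand-in for DocxToStringConverter
        Predicate<string> isMatch)
    {
        SearchOption option = searchSubdirectories
            ? SearchOption.AllDirectories
            : SearchOption.TopDirectoryOnly;

        foreach (string filePath in Directory.GetFiles(directory, "*.docx", option))
        {
            string docxText;

            try
            {
                using (var stream = File.Open(filePath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
                    docxText = extractText(stream);
            }
            catch (Exception ex) when (ex is IOException || ex is InvalidDataException || ex is XmlException)
            {
                // Assumption: unreadable or malformed files are skipped, not reported.
                continue;
            }

            // yield return sits outside the try/catch, as iterator blocks require.
            if (isMatch(docxText))
                yield return filePath;
        }
    }
}
```

Because the method is an iterator, no file is opened until the caller starts enumerating, and each matching path is handed back as soon as it is found. That is why the foreach loop in btnSearch_Click can add items to the ListView while the search is still in progress.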

The DOCX files listed in the ListView control can be activated to open them in your default DOCX viewer (usually Microsoft Word):

C#
private void resultListView_ItemActivate(object sender, EventArgs e)
{
    string filePath = ((ListView)sender).SelectedItems[0].SubItems[2].Text;
    if (File.Exists(filePath))
        Process.Start(filePath);
} 
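
One caveat if you port this handler to .NET Core or .NET 5 and later: ProcessStartInfo.UseShellExecute defaults to false there (it defaults to true on .NET Framework), so Process.Start(filePath) no longer resolves the file association for a document. The sketch below shows one way to make the intent explicit; the ShellOpen class and its CreateStartInfo method are hypothetical names, not part of the original application.

```csharp
using System.Diagnostics;

// Hypothetical helper for opening a document with its associated application.
public static class ShellOpen
{
    public static ProcessStartInfo CreateStartInfo(string filePath)
    {
        return new ProcessStartInfo(filePath)
        {
            // Ask the operating system shell to open the file with its
            // registered viewer instead of treating the path as an executable.
            UseShellExecute = true
        };
    }
}
```

In the event handler, Process.Start(ShellOpen.CreateStartInfo(filePath)) would then replace Process.Start(filePath).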

Conclusion

The Show Word file in WPF article demonstrated how to convert DOCX to WPF's FlowDocument, and this tip demonstrated how to convert DOCX to plain text using the same DOCX reading code. By combining the two approaches, you could, for example, convert DOCX to HTML. Hopefully, this tip has shown you the basics of reading DOCX files and how the same DOCX reading code can be reused to convert DOCX to other representations.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer, GemBox Ltd.
Croatia
I'm a developer at GemBox Software, working on:

  • GemBox.Spreadsheet - Read, write, convert, and print XLSX, XLS, XLSB, CSV, HTML, and ODS spreadsheets from .NET applications.
  • GemBox.Document - Read, write, convert, and print DOCX, DOC, PDF, RTF, HTML, and ODT documents from .NET applications.
  • GemBox.Pdf - Read, write, edit, and print PDF files from .NET applications.
  • GemBox.Presentation - Read, write, convert, and print PPTX, PPT, and PPSX presentations from .NET applications.
  • GemBox.Email - Read, write, and convert MSG, EML, and MHTML email files, or send and receive email messages using POP, IMAP, SMTP, and EWS from .NET applications.
  • GemBox.Imaging - Read, convert, and transform PNG, JPEG, and GIF images from .NET applications.
