Sentence Breaker using Microsoft Word

Jan 9, 2006

Sentence breaking as a preprocessing step in text analysis.

Introduction

This article presents a simple way to do language-independent sentence breaking using Microsoft Word 2003, which internally breaks text into sentences. It is aimed at readers interested in text processing, text mining from the Internet, or other fields related to NLP (natural-language processing).

Background

In the NLP field, most techniques operate on sentences: word breaking, POS tagging, and syntactic parsing, for example. The largest text resource, the Internet, is organized as documents. Documents therefore have to be converted into sentences first, and that is the problem we want to solve here.

The objective of sentence breaking is to split a document into sentences, and the difficulty lies in recognizing the sentence boundaries. There are several popular algorithms for this, but this article offers a low-cost alternative for anyone who already has Microsoft Word installed.
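
To see why recognizing boundaries is not trivial, here is a minimal sketch of a naive rule-based splitter. It is not what Word does internally, and the names NaiveSplitDemo and NaiveSplit are our own illustrative choices. The rule simply splits after '.', '!' or '?' when whitespace follows, so it wrongly breaks after abbreviations such as "Dr.".

    Imports System
    Imports System.Text.RegularExpressions

    Module NaiveSplitDemo

        ' Hypothetical helper: split wherever whitespace follows '.', '!' or '?'.
        ' The lookbehind keeps the punctuation attached to the preceding piece.
        Function NaiveSplit(ByVal text As String) As String()
            Return Regex.Split(text.Trim(), "(?<=[.!?])\s+")
        End Function

        Sub Main()
            ' The abbreviation "Dr." is mistaken for the end of a sentence,
            ' so this prints three pieces instead of two:
            '   Dr.
            '   Smith arrived.
            '   He sat down.
            For Each piece As String In NaiveSplit("Dr. Smith arrived. He sat down.")
                Console.WriteLine(piece)
            Next
        End Sub

    End Module

Handling such cases by hand means maintaining abbreviation lists and per-language rules, which is the work we would rather delegate.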

Internally, Word breaks a loaded document into sentences for its own parsing, so we can extract those sentences for our own purposes.

Using the code

The implementation is very simple: just a few code snippets that drive Word through automation.

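'/// Prerequisites: a project reference to the Microsoft Word Object Library.
'/// The "Word" alias below assumes the Office primary interop assembly is used.

    Imports System.Collections.Specialized       ' StringCollection
    Imports Word = Microsoft.Office.Interop.Word

'/// Step 1: start Word as the sentence-breaking engine

    Dim oWord As Word.Application
    Sub New()
        ' Launch a hidden Word instance through COM automation.
        oWord = CType(CreateObject("Word.Application"), Word.Application)
        oWord.Visible = False
    End Sub


'/// Step 2: return the sentences of the given document, in order

    Public Function VomitSentences(ByVal file As String) As StringCollection
        Dim vecSent As New StringCollection
        Dim thisDoc As Word.Document = Nothing

        Try
            ' Open read-only so the source document is never modified.
            thisDoc = oWord.Documents.Open(file, ReadOnly:=True)

            ' Word collections are 1-based.
            Dim i As Int32
            For i = 1 To thisDoc.Sentences.Count
                vecSent.Add(String.Format("-- Sentence {0} --", i))
                vecSent.Add(thisDoc.Sentences(i).Text.Trim())
            Next

        Catch ex As Exception
            ' Log the error and return whatever was collected so far.
            Debug.WriteLine(ex.Message)
        Finally
            ' Close the document even if extraction failed part-way.
            If Not thisDoc Is Nothing Then
                thisDoc.Close(Word.WdSaveOptions.wdDoNotSaveChanges)
            End If
        End Try

        Return vecSent
    End Function

'/// Step 3: finalize

    Protected Overrides Sub Finalize()
        Try
            ' Shut down the hidden Word instance so no WINWORD.EXE process lingers.
            If Not oWord Is Nothing Then oWord.Quit()
        Catch ex As Exception
            ' The COM wrapper may already have been released at finalization time.
        Finally
            MyBase.Finalize()
        End Try
    End Sub 'Finalize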
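
As a usage sketch, the snippets above could be called as follows. The article does not show the enclosing class, so SentenceBreaker is a hypothetical name for it, and the file path is only a placeholder. Running this requires Word to be installed on the machine.

    Imports System
    Imports System.Collections.Specialized

    Module Demo

        Sub Main()
            ' SentenceBreaker is a hypothetical name for the class that
            ' contains the constructor, VomitSentences and Finalize above.
            Dim breaker As New SentenceBreaker()

            ' Placeholder path; a full path is the safer choice, since Word
            ' may resolve relative paths against its own working folder.
            Dim lines As StringCollection = _
                breaker.VomitSentences("C:\docs\sample.doc")

            ' The collection alternates "-- Sentence n --" marker lines
            ' with the trimmed text of each sentence.
            For Each line As String In lines
                Console.WriteLine(line)
            Next
        End Sub

    End Module

If the document cannot be opened, VomitSentences logs the error and returns an empty collection, so the loop simply prints nothing.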

Points of Interest

What is the purpose of extracting sentences?

  • English sentences can serve as writing aids, especially for non-native English speakers;
  • Bilingual sentence pairs can help translators;
  • High-quality example sentences can be used in language teaching.

If you are interested in sentence searching, you can try this professional sentence search engine. Chinese users can visit this website for bilingual sentence search.

History

  • 2006-1-8 created.