
Sentence Breaker using Microsoft Word

9 Jan 2006
Sentence breaking as a preprocessing step for text analysis.


Introduction

This article presents a simple way to do language-independent sentence breaking using Microsoft Word 2003, which internally breaks text into sentences. The target audience is anyone interested in text processing, text mining from the Internet, or fields related to NLP (natural-language processing).

Background

In the NLP field, most techniques, such as word breaking, POS tagging, and syntactic parsing, operate on sentences, while the largest text resource, the Internet, is organized as documents. Documents therefore have to be converted into sentences first. That is the problem we want to solve here.

The objective of sentence breaking is to split a document into sentences; the difficulty lies in recognizing the sentence boundaries. There are several popular algorithms for this, but this article describes a low-cost alternative that works if you have Microsoft Word installed on your computer.
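To see why recognizing boundaries is not trivial, here is a minimal sketch of a naive splitter that breaks after ".", "!" or "?" whenever whitespace follows. It is not one of the algorithms mentioned above and not part of this article's method; the names NaiveSplitterDemo and NaiveSplit are made up for illustration.

```vbnet
Imports System
Imports System.Collections.Specialized
Imports System.Text.RegularExpressions

Module NaiveSplitterDemo
    ' Hypothetical helper: split after '.', '!' or '?' when whitespace follows.
    Function NaiveSplit(ByVal text As String) As StringCollection
        Dim result As New StringCollection
        ' The lookbehind keeps the punctuation attached to the preceding piece.
        For Each part As String In Regex.Split(text, "(?<=[.!?])\s+")
            If part.Trim().Length > 0 Then
                result.Add(part.Trim())
            End If
        Next
        Return result
    End Function

    Sub Main()
        Dim sample As String = "Dr. Smith arrived at 5 p.m. today. He was late!"
        For Each s As String In NaiveSplit(sample)
            Console.WriteLine(s)
        Next
    End Sub
End Module
```

The sample text contains two sentences, but this splitter returns four fragments ("Dr.", "Smith arrived at 5 p.m.", "today.", "He was late!") because it cannot tell an abbreviation from the end of a sentence.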

Internally, Word breaks a loaded document into sentences for its own parsing and exposes them through the document's Sentences collection. We can extract those sentences for our own purposes.

Using the code

The implementation is very simple: just a few code snippets that drive Word.

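'/// Step 1: start Word as the sentence-breaking engine
'/// Needs a project reference to the Microsoft Word 11.0 Object Library.
'/// Assumed at the top of the file: Imports Microsoft.Office.Interop (for Word.*)
'/// and Imports System.Collections.Specialized (for StringCollection).

    Dim oWord As Word.Application
    Sub New()
        ' Launch a hidden Word instance that lives as long as this object
        oWord = CreateObject("Word.Application")
        oWord.Visible = False
    End Sub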


'/// Step 2: return the sentences of the given document, in order

    Public Function VomitSentences(ByVal file As String) As StringCollection
        Dim vecSent As StringCollection = New StringCollection
        Dim thisDoc As Word.Document = Nothing

        Try
            ' Open read-only so the source file is never modified
            thisDoc = oWord.Documents.Open(file, ReadOnly:=True)

            ' Word collections are 1-based
            Dim i As Int32
            For i = 1 To thisDoc.Sentences.Count
                vecSent.Add(String.Format("-- Sentence {0} --", i))
                vecSent.Add(thisDoc.Sentences(i).Text.Trim())
            Next

        Catch ex As Exception
            ' Log the error and return whatever was collected so far
            Debug.WriteLine(ex.Message)
        Finally
            ' Close the document even if extraction failed part-way
            If Not thisDoc Is Nothing Then
                thisDoc.Close(Word.WdSaveOptions.wdDoNotSaveChanges)
            End If
        End Try

        Return vecSent
    End Function
    
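'/// Step 3: finalize

    Protected Overrides Sub Finalize()
        Try
            ' Quit the hidden Word instance so WINWORD.EXE does not linger
            oWord.Quit()
        Catch ex As Exception
            ' Word may already be gone when the finalizer runs
        Finally
            MyBase.Finalize()
        End Try
    End Sub 'Finalize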
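
VomitSentences returns marker lines ("-- Sentence n --") interleaved with the sentence texts. If only the sentences are wanted, a small post-processing step can drop the markers. The sketch below is an assumption about how one might consume the result; SentenceOutputDemo and SentencesOnly are hypothetical names, and the sample list stands in for a real call so that the code runs without Word.

```vbnet
Imports System
Imports System.Collections.Specialized

Module SentenceOutputDemo
    ' Hypothetical helper: keep only the sentence texts from the
    ' alternating marker/sentence list that VomitSentences returns.
    Function SentencesOnly(ByVal vecSent As StringCollection) As StringCollection
        Dim result As New StringCollection
        ' Entries 0, 2, 4, ... are markers; entries 1, 3, 5, ... are sentence text.
        For i As Integer = 1 To vecSent.Count - 1 Step 2
            result.Add(vecSent(i))
        Next
        Return result
    End Function

    Sub Main()
        ' Stand-in for the value returned by VomitSentences.
        Dim sample As New StringCollection
        sample.Add("-- Sentence 1 --")
        sample.Add("This is the first sentence.")
        sample.Add("-- Sentence 2 --")
        sample.Add("This is the second one.")

        For Each s As String In SentencesOnly(sample)
            Console.WriteLine(s)
        Next
    End Sub
End Module
```

In real use, the list would come from calling VomitSentences on an instance of whatever class holds the three snippets above (the article does not name that class), passing the path of the document to break.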

Points of Interest

What is the purpose of extracting sentences?

  • English sentences can be used as writing aids, especially for non-native English speakers;
  • Bilingual sentences can help translators;
  • Exemplary sentences can be used in language teaching.

If you are interested in sentence searching, you can try this professional sentence search engine. Chinese users can visit this website to search bilingual sentences.

History

  • 2006-1-8 created.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


About the Author

engooAgent
Instructor/Trainer
China
Developer in China

Comments and Discussions

 
Question: How can the data from Vb.Net pass to the Text Box in Microsoft Word? (GeoffreyOng, 3-Sep-06 15:50)

Last Updated 9 Jan 2006
Article Copyright 2006 by engooAgent