Introduction
This article presents a simple way to do language-independent sentence breaking with Microsoft Word 2003, which already splits text into sentences internally. It is aimed at readers interested in text processing, text mining from the Internet, or other fields related to natural-language processing (NLP).
Background
In the NLP field, most techniques, such as word breaking, POS tagging, and syntactic parsing, operate on sentences, while the largest text resource, the Internet, is organized as documents. Documents therefore have to be converted into sentences first. That is the problem we want to solve here.
The objective of sentence breaking is to split a document into sentences, and the difficulty lies in recognizing the sentence boundaries. There are several popular algorithms for this, but this article offers a low-cost alternative for anyone who has Microsoft Word installed on their computer.
Internally, Word breaks a loaded document into sentences for its own parsing and exposes them through the document's Sentences collection, so we can extract those sentences for our own purposes.
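To see why boundary recognition is the hard part, here is a minimal sketch that is not part of the original article. It applies one naive rule, chosen purely for illustration: a sentence ends at '.', '!' or '?' followed by whitespace. The names NaiveSplitDemo and NaiveSplit are hypothetical.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module NaiveSplitDemo
    ' Naive rule (an assumption for illustration only): split at whitespace
    ' that directly follows '.', '!' or '?'.
    Function NaiveSplit(ByVal text As String) As String()
        Return Regex.Split(text.Trim(), "(?<=[.!?])\s+")
    End Function

    Sub Main()
        Dim sample As String = "Dr. Smith arrived at 5 p.m. yesterday. He left early."
        For Each piece As String In NaiveSplit(sample)
            Console.WriteLine(piece)
        Next
        ' Prints four pieces, although a human reader sees two sentences:
        '   Dr.
        '   Smith arrived at 5 p.m.
        '   yesterday.
        '   He left early.
    End Sub
End Module
```

Abbreviations such as "Dr." and "p.m." defeat the rule, and a list of exceptions would have to be maintained for every language. That maintenance cost is what delegating the job to an existing sentence breaker avoids.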
Using the code
The implementation is very simple: a few lines of code that start Word, open the document, and read back its sentences. The project needs a reference to the Microsoft Word object library.
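Imports System.Collections.Specialized   ' StringCollection
Imports System.Diagnostics               ' Debug
Imports Microsoft.Office.Interop         ' Word namespace when the Office PIAs are referenced;
                                         ' omit if your COM reference generates a top-level Word namespace

' The members below belong inside a class of your own.
Dim oWord As Word.Application

Sub New()
    ' Start a hidden Word instance through COM automation.
    oWord = CType(CreateObject("Word.Application"), Word.Application)
    oWord.Visible = False
End Sub

Public Function VomitSentences(ByVal file As String) As StringCollection
    Dim vecSent As New StringCollection()
    Dim thisDoc As Word.Document = Nothing
    Try
        thisDoc = oWord.Documents.Open(file, ReadOnly:=True)
        ' Word collections are 1-based.
        For i As Integer = 1 To thisDoc.Sentences.Count
            vecSent.Add(String.Format("-- Sentence {0} --", i))
            vecSent.Add(thisDoc.Sentences(i).Text.Trim())
        Next
    Catch ex As Exception
        ' On failure, log the error and return whatever was collected so far.
        Debug.WriteLine(ex.Message)
    Finally
        ' Close the document even if an exception interrupted the loop.
        If Not thisDoc Is Nothing Then
            thisDoc.Close(Word.WdSaveOptions.wdDoNotSaveChanges)
        End If
    End Try
    Return vecSent
End Function

Protected Overrides Sub Finalize()
    ' Finalizers run at an unpredictable time; shut Word down, then let the base class finish.
    Try
        If Not oWord Is Nothing Then oWord.Quit()
    Finally
        MyBase.Finalize()
    End Try
End Sub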
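For readers who want to try the idea without wrapping the members above in a class, the following is a self-contained sketch of the same technique as a console program. It is a variation, not the author's code: it assumes late binding (Option Strict Off, so no interop reference is required), takes the document path from the command line, and shuts Word down deterministically rather than in a finalizer. The module name SentenceDump is hypothetical.

```vbnet
Option Strict Off   ' late binding: Word members are resolved at run time

Imports System
Imports System.IO
Imports System.Runtime.InteropServices
Imports Microsoft.VisualBasic

Module SentenceDump
    Sub Main(ByVal args As String())
        If args.Length = 0 Then
            Console.WriteLine("Usage: SentenceDump <document path>")
            Return
        End If

        ' Word runs in its own process, so pass an absolute path.
        Dim path As String = System.IO.Path.GetFullPath(args(0))

        Dim word As Object = CreateObject("Word.Application")
        Try
            word.Visible = False
            Dim doc As Object = word.Documents.Open(path, ReadOnly:=True)
            Try
                ' Word collections are 1-based.
                For i As Integer = 1 To doc.Sentences.Count
                    Dim sentence As String = CStr(doc.Sentences.Item(i).Text).Trim()
                    Console.WriteLine("{0}: {1}", i, sentence)
                Next
            Finally
                doc.Close(0)   ' 0 = wdDoNotSaveChanges
            End Try
        Finally
            ' Quit Word even if opening or reading the document failed.
            word.Quit()
            Marshal.ReleaseComObject(word)
        End Try
    End Sub
End Module
```

Closing the document and quitting Word in Finally blocks keeps a hidden WINWORD.EXE process from lingering when an exception occurs. Word must be installed on the machine for CreateObject to succeed.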
Points of Interest
What is the purpose of extracting sentences?
- English sentences can be used as a writing aid, especially for non-native speakers of English;
- Bilingual sentences can be used to help translators;
- Well-written example sentences can play a role in language teaching.
If you are interested in sentence searching, you can try this professional sentence search engine. Chinese users can visit this website for bilingual sentence searching.
History