Posted 3 Jan 2014

How to Remove Extra Spaces and Add Missing Spaces Between Sentences in .docx files using C# and the DocX Library

B. Clay Shannon, 3 Jan 2014, CPOL
Make sure sentences have one and only one space between them

Introduction

This tip is sort of an add-on to the one here.

As with that tip, you need to download and reference the DocX library for this one; the instructions for doing so are in that tip.

The code is simple, looping through all the letters in the alphabet to search for sentences that have no space between the period in the preceding sentence and the first letter in the next one.

Presumably, the next letter (the first letter in a sentence) would always be capitalized, and so we would only need to look for A..Z. I am also looking for a..z here, though, since, as Tony Randall in his role as Felix Unger in "The Odd Couple" so famously popularized, when you assume, you make an "ass" out of "u" and "me."

Nevertheless, I am assuming that those are the only characters that will begin a sentence; IOW, that a sentence will never begin with a digit (0..9), a punctuation mark, or an odd symbol such as the one for TAFKAP.

Here is the code to do that (which assumes (yikes!) that you are passing it the name of the *.docx file to open - specifics on how to get that are shown in the last snippet below):

// This will change sentences like this: "I scream.You scream.
// We all scream for ice cream." ...to this: "I scream. You scream. 
// We all scream for ice cream."
private void SpacifySardinizedLetters(string filename)
{
    // 65..90 are A..Z; 97..122 are a..z
    const int FIRST_CAP_POS = 65;
    const int LAST_CAP_POS = 90;
    const int FIRST_LOWER_POS = 97;
    const int LAST_LOWER_POS = 122;
    using (DocX document = DocX.Load(filename))
    {
        for (int i = FIRST_CAP_POS; i <= LAST_CAP_POS; i++)
        {
            char c = (char)i;
            string originalStr = string.Format(".{0}", c);
            string newStr = string.Format(". {0}", c);
            document.ReplaceText(originalStr, newStr);
        }
        for (int i = FIRST_LOWER_POS; i <= LAST_LOWER_POS; i++)
        {
            char c = (char)i;
            string originalStr = string.Format(".{0}", c);
            string newStr = string.Format(". {0}", c);
            document.ReplaceText(originalStr, newStr);
        }
        document.Save();
    }
}
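For comparison, the same "period immediately followed by a letter" rule can be written as a single regular expression on an ordinary string. This is only a sketch, not part of the DocX code above: the names `SpacingSketch` and `AddMissingSpaces` are made up for illustration, and it works on plain strings rather than on a .docx file. It also makes the cost of the letter-only assumption easy to see, since abbreviations and dotted names get split too.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; operates on plain strings, not on a DocX document.
public static class SpacingSketch
{
    // Matches a period only when the very next character is A..Z or a..z.
    // The lookahead (?=...) checks for the letter without consuming it.
    private static readonly Regex PeriodThenLetter = new Regex(@"\.(?=[A-Za-z])");

    public static string AddMissingSpaces(string text)
    {
        // Only the period itself is replaced, so the letter after it is left in place.
        return PeriodThenLetter.Replace(text, ". ");
    }
}
```

Calling `SpacingSketch.AddMissingSpaces("I scream.You scream.")` gives "I scream. You scream.", while "3.14" is left alone because a digit follows the period. On the other hand, "example.com" becomes "example. com", so text containing URLs, file names, or abbreviations such as "e.g." may need to be checked by hand.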

Similarly, you may be living in siglo XXI (that's twenty-first century for you non-Spanish understanders), in which case you will also want to reduce sentences which begin with two spaces down to the thoroughly modern call for just one. The code is almost exactly the same for that scenario. Instead of looking for a period followed by a letter, you look for a period followed by two spaces and then a letter. In both cases, you replace those with a period, a space, and a letter. So here is that code.

// This will change sentences like this: "I scream.  You scream.  
// We all scream for ice cream." ...to this: "I scream. You scream. We all scream for ice cream."
private void SnuggifyLooseyGooseySentenceEndings(string filename)
{
    const int FIRST_CAP_POS = 65;
    const int LAST_CAP_POS = 90;
    const int FIRST_LOWER_POS = 97;
    const int LAST_LOWER_POS = 122;
    using (DocX document = DocX.Load(filename))
    {
        for (int i = FIRST_CAP_POS; i <= LAST_CAP_POS; i++)
        {
            char c = (char)i;
            string originalStr = string.Format(".  {0}", c);
            string newStr = string.Format(". {0}", c);
            document.ReplaceText(originalStr, newStr);
        }
        for (int i = FIRST_LOWER_POS; i <= LAST_LOWER_POS; i++)
        {
            char c = (char)i;
            string originalStr = string.Format(".  {0}", c);
            string newStr = string.Format(". {0}", c);
            document.ReplaceText(originalStr, newStr);
        }
        document.Save();
    }
}

Note: If there are more than two space characters between sentences (three, four, seventeen, forty-two, whatever), you will need code to specifically address that situation, since the "Snuggify" helper method only looks for exactly two spaces. You could solve that problem with code like this:

private void RemoveSuperfluousSpaces(string filename)
{
    // Requires: using System.Collections.Generic; (for List<int>)
    bool superfluousSpacesFound = true;
    using (DocX document = DocX.Load(filename))
    {
        List<int> multipleSpacesLocs;
        // Keep collapsing pairs of spaces until no pair remains
        while (superfluousSpacesFound)
        {
            document.ReplaceText("  ", " ");
            multipleSpacesLocs = document.FindAll("  ");
            superfluousSpacesFound = multipleSpacesLocs.Count > 0;
        }
        document.Save();
    }
}
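To see why a loop is needed at all, here is the same idea on a plain string. It is a sketch under assumptions: `SpaceRunSketch` and `CollapseSpaces` are illustrative names, nothing here touches DocX, and it assumes (without verifying) that a single replace pass over the document handles non-overlapping matches the way `string.Replace` does. One pass that swaps each pair of spaces for a single space only shortens a run, so four spaces become two; repeating until no pair remains gets every run down to one.

```csharp
using System;

// Hypothetical helper for illustration; operates on plain strings, not on a DocX document.
public static class SpaceRunSketch
{
    public static string CollapseSpaces(string text)
    {
        // string.Replace makes one left-to-right pass over non-overlapping matches,
        // so a long run of spaces needs several passes before it shrinks to one.
        while (text.Contains("  "))
        {
            text = text.Replace("  ", " ");
        }
        return text;
    }
}
```

With this, `SpaceRunSketch.CollapseSpaces("I scream.    You scream.")` returns "I scream. You scream." regardless of how many spaces the run started with.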

So just to be clear, you can select a *.docx file and call those three helper methods this way:

string filename; // declared here so the snippet stands on its own
DialogResult result = openFileDialog1.ShowDialog();
if (result == DialogResult.OK)
{
    filename = openFileDialog1.FileName;
}
else
{
    MessageBox.Show("No file selected - sayonara!");
    return;
}
SpacifySardinizedLetters(filename);
SnuggifyLooseyGooseySentenceEndings(filename);
RemoveSuperfluousSpaces(filename);

And actually, if you use RemoveSuperfluousSpaces(), you can do without SnuggifyLooseyGooseySentenceEndings(), as the former will do everything the latter does, and more.

If you want to be a wise guy/fancy pants, you can move the const ints up to class level (C# has no true globals), so they only need to be declared once, instead of violating the DRY principle by having them in two helper methods.
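One possible shape for that refactoring, sketched under assumptions: the class name `SentenceSpacing` and the method `BuildReplacements` are hypothetical, not something from the DocX library. The constants are declared once at class level, and a single method builds the search/replace pairs for whichever gap (no space or two spaces) you want to normalize.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical class for illustration; it only builds the strings and never touches a document.
public static class SentenceSpacing
{
    // Declared once at class level; 65..90 are A..Z, 97..122 are a..z.
    private const int FIRST_CAP_POS = 65;
    private const int LAST_CAP_POS = 90;
    private const int FIRST_LOWER_POS = 97;
    private const int LAST_LOWER_POS = 122;

    // gap is whatever currently sits between the period and the letter: "" or "  ".
    public static List<KeyValuePair<string, string>> BuildReplacements(string gap)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        AddLetterRange(pairs, gap, FIRST_CAP_POS, LAST_CAP_POS);
        AddLetterRange(pairs, gap, FIRST_LOWER_POS, LAST_LOWER_POS);
        return pairs;
    }

    private static void AddLetterRange(List<KeyValuePair<string, string>> pairs,
                                       string gap, int first, int last)
    {
        for (int i = first; i <= last; i++)
        {
            char c = (char)i;
            // Key is the text to search for, Value is the single-space replacement.
            pairs.Add(new KeyValuePair<string, string>("." + gap + c, ". " + c));
        }
    }
}
```

Inside either helper, the two loops would then shrink to something like `foreach (var pair in SentenceSpacing.BuildReplacements("")) document.ReplaceText(pair.Key, pair.Value);`, using the same two-argument `ReplaceText` call as the code above.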

If this tip helps you, give your pet a treat, whether it be a cat, dog, llama, duckbilled platypus, or what have you.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

B. Clay Shannon
Founder Across Time & Space
United States
I am in the process of morphing from a software developer into a portrayer of Mark Twain. My monologue (or one-man play, entitled "The Adventures of Mark Twain: As Told By Himself" and set in 1896) features Twain giving an overview of his life up till then. The performance includes the relating of interesting experiences and humorous anecdotes from Twain's boyhood and youth, his time as a riverboat pilot, his wild and woolly adventures in the Territory of Nevada and California, and experiences as a writer and world traveler, including recollections of meetings with many of the famous and powerful of the 19th century - royalty, business magnates, fellow authors, as well as intimate glimpses into his home life (his parents, siblings, wife, and children).

Peripatetic and picaresque, I have lived in eight states; specifically, besides my native California (where I was born and where I now again reside) in chronological order: New York, Montana, Alaska, Oklahoma, Wisconsin, Idaho, and Missouri.

I am also a writer of both fiction (for which I use a nom de plume, "Blackbird Crow Raven", as a nod to my Native American heritage - I am "½ Cowboy, ½ Indian") and nonfiction, including a two-volume social and cultural history of the U.S. which covers important events from 1620-2006: http://www.lulu.com/spotlight/blackbirdcraven


Comments and Discussions

 
Very good article!
Volynsky Alex, 4-Jan-14 2:39

Last Updated 3 Jan 2014
Article Copyright 2014 by B. Clay Shannon