Repair Broken Text

radumi, 7 Mar 2007
An article to demonstrate how to repair broken text.

Introduction

This article demonstrates a way to repair broken text. It brings together a few things that are tedious to find if you are not already aware of them, e.g., how to get characters from their Unicode representation and use them to replace characters in a string.

Background

What is meant by broken text? For example, French text in which Cyrillic characters appear in place of the French-specific ones, such as "sйchйs, gвteaux" instead of the correct "séchés, gâteaux". This pattern typically appears when bytes written in a Western European code page are read back using a Cyrillic one. How do you get into such trouble? In my case, it was the result of a legacy data migration.

Using the code

To use the code, download the sample. The idea is simple: replace each wrong character with the correct one. The character map is defined in the XML file MapLiterals.xml, which gives, in hexadecimal, the Unicode code of each character to replace and of its replacement. The code reads the file BrokenText.txt, repairs the text according to MapLiterals.xml, and saves the repaired text in the file RepairedText.txt.

using System.Globalization;
using System.IO;
using System.Xml;

// Read the broken text (OpenText reads the file as UTF-8).
string text;
FileInfo file = new FileInfo("BrokenText.txt");
using (TextReader textRead = file.OpenText())
{
    text = textRead.ReadToEnd();
}

// Load the character map.
XmlDocument literalsMapXml = new XmlDocument();
literalsMapXml.Load("MapLiterals.xml");

// Replace each wrong character with the correct one.
XmlNodeList nodes =
    literalsMapXml.SelectNodes("/MapLiterals/MapLiteral");
foreach (XmlNode node in nodes)
{
    // The attributes hold character codes as hex, e.g. "0439".
    char from = (char) ushort.Parse(node.Attributes["from"].Value,
                NumberStyles.HexNumber);
    char to = (char) ushort.Parse(node.Attributes["to"].Value,
              NumberStyles.HexNumber);

    text = text.Replace(from, to);
}

// Write the repaired text to a file.
file = new FileInfo("RepairedText.txt");
using (StreamWriter writer = file.CreateText())
{
    writer.Write(text);
}
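
To try the idea without any files, the same replacement loop can be run against an in-memory map. The following is only a sketch: the class name TextRepairSketch, the method Repair, and the constant SampleMapXml are hypothetical names introduced for illustration. The element and attribute names are taken from the XPath and attribute lookups in the code above, and the two mappings are inferred from the example in the Background section; the actual MapLiterals.xml shipped with the sample may contain more entries.

```csharp
using System.Globalization;
using System.Xml;

public static class TextRepairSketch
{
    // Hypothetical map content; the layout follows the XPath
    // "/MapLiterals/MapLiteral" and the "from"/"to" attributes used above.
    public const string SampleMapXml =
        "<MapLiterals>" +
        "<MapLiteral from=\"0439\" to=\"00E9\" />" +  // й to é
        "<MapLiteral from=\"0432\" to=\"00E2\" />" +  // в to â
        "</MapLiterals>";

    // Applies every mapping found in mapXml to text and returns the result.
    public static string Repair(string text, string mapXml)
    {
        XmlDocument map = new XmlDocument();
        map.LoadXml(mapXml);

        foreach (XmlNode node in map.SelectNodes("/MapLiterals/MapLiteral"))
        {
            // Parse the hex codes into characters.
            char from = (char) ushort.Parse(node.Attributes["from"].Value,
                                            NumberStyles.HexNumber);
            char to = (char) ushort.Parse(node.Attributes["to"].Value,
                                          NumberStyles.HexNumber);

            text = text.Replace(from, to);
        }

        return text;
    }
}
```

With this map, TextRepairSketch.Repair("sйchйs, gвteaux", TextRepairSketch.SampleMapXml) yields "séchés, gâteaux".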

Points of interest

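Each call to string.Replace can allocate a new copy of the whole text. For large files or long maps, memory allocation can be improved by using a StringBuilder, whose Replace method works in place.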
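
A minimal sketch of the StringBuilder variant follows. The class name StringBuilderRepairSketch and the use of a dictionary to hold the map are assumptions made for illustration; in the sample, the pairs would come from MapLiterals.xml instead.

```csharp
using System.Collections.Generic;
using System.Text;

public static class StringBuilderRepairSketch
{
    // Applies all character mappings in a single buffer,
    // instead of creating a new string for every mapping.
    public static string Repair(string text, IDictionary<char, char> map)
    {
        StringBuilder builder = new StringBuilder(text);

        foreach (KeyValuePair<char, char> pair in map)
        {
            // StringBuilder.Replace(char, char) modifies the buffer in place.
            builder.Replace(pair.Key, pair.Value);
        }

        return builder.ToString();
    }
}
```

The text is copied once into the builder and once more by ToString, regardless of how many mappings are applied.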

MapLiterals.xml can contain any number of character mappings, and the mappings can be written in several ways:

  • use hex codes (without '\u'), like from="0439" to="00E9" (the current implementation)
  • use explicit characters, like from="й" to="é"; in this case, the code can be simplified to:
// Read and repair text
using (TextReader textRead = file.OpenText())
{
    ...
    string from, to, buffer;
    ...
    foreach (XmlNode node in nodes)
    {
        from = node.Attributes["from"].Value;
        to = node.Attributes["to"].Value;

        buffer = text.Replace(from, to);

        text = buffer;
    }
}

// Write repaired text in a file.
...

Hope this helps.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

radumi
Software Developer
New Zealand
Coder
Article Copyright 2006 by radumi