ReplaceOmatic

PIEBALDconsult

4.95/5 (11 votes)

Nov 27, 2016

CPOL

5 min read

13182

206

Performs replacements within strings

Download ReplaceOmatic.zip - 2.5 KB

Background

This code is in response to the Coding challenge: bad word filter[^] posed by Chris.

Introduction

This code performs replacements within strings as specified via configuration string pairs. You may review the specifications on the Code Challenge's page. Most entries so far have used Regular Expressions and this one does too, but in a fairly limited capacity. Because Regular Expressions can be inefficient when presented with such a non-regular request, rather than use Regular Expressions to scan the whole string (perhaps several times), this code splits the string into parts and then uses Regular Expressions on the parts. In the interest of performance, it avoids Replace, Substring, Split, and ToString as much as possible.

Using the code

A ReplaceOmatic can be instantiated and used like this:

PIEBALD.Type.ReplaceOmatic replacer = 
new PIEBALD.Type.ReplaceOmatic
(
  new System.Tuple<string,string> ( "PHB!"        , "boss!"  ) 
,
  new System.Tuple<string,string> ( "gotten"      , "become" ) 
,
  new System.Tuple<string,string> ( "*p[o0]{2}p*" , "p**p"   ) 
) ;

string intext = "My PHB is such a poophead. It's gotten worse since his promotion. My PHB has started his new blog phblog.com. He's SUCH A MISBEGOTTEN POOPHEAD!" ;
 
string outtext = intext ;

int total = replacer.Replace ( ref outtext ) ;

System.Console.WriteLine ( "{0}\n{1}\n{2}" , intext , total , outtext ) ;

This class is not intended to be instantiated, used once, and then disposed. I recommend keeping one long-term. This allows the class to gather statistics about which Regular Expressions encounter the most matches. With that data, the List of ReplacementSpecs can be sorted so the ones with the most hits can be tried first.

StringSlice

This class is used to define a section of a string. The goal is to avoid using Substring unless truly necessary. ToString will call Substring on the parent string, so we want to do that as seldom as possible.

    private sealed partial class StringSlice
    {
      private string source ;

      public int Start  { get ; private set ; }
      public int Length { get ; private set ; }

      private StringSlice
      (
        string Source 
      ,
        int    Start
      ,
        int    Length
      )
      {
        this.source = Source ;

        this.Start  = Start  ;
        this.Length = Length ;

        return ;
      }
 
      public override string
      ToString
      (
      )
      {
        return ( this.source.Substring ( this.Start , this.Length ) ) ;
      }

TrimPunctuation

This method assists with defining a section of a string that has neither leading nor trailing punctuation characters. For the current Challenge, I must allow punctuation within a section, such as sh!t and b(.)(.)bs.

      private static void
      TrimPunctuation
      (
        string  Source
      ,
        ref int Start
      ,
        ref int Length
      )
      {
        /* Trim leading punctuation. */
        while ( ( Length > 0 ) && System.Char.IsPunctuation ( Source [ Start ] ) )
        {
          Start++  ;
          Length-- ;
        }

        /* Trim trailing punctuation. */
        while ( ( Length > 0 ) && System.Char.IsPunctuation ( Source [ Start + Length - 1 ] ) )
        {
          Length-- ;
        }

        return ;
      }

Tokenize

This method is the only way to create instances of StringSlice. Each StringSlice will define a non-empty section of the input string. Sections are delimited by whitespace and may or may not include leading and trailing punctuation. For the current Challenge, I include neither leading nor trailing punctuation characters. Similarly, when punctuation trimming is performed, a StringSlice will never contain only punctuation.

The method is an Iterator to reduce resource usage. Rather than scanning the entire string at once and producing many StringSlices, one section is produced at a time, so the caller can process it and throw it out.

      public static System.Collections.Generic.IEnumerable<StringSlice>
      Tokenize
      (
        string Source
      ,
        int    Start
      ,
        int    Length
      ,
        bool   Trim
      )
      {
        int i   = 0 ;
        int len ;

        while ( i < Length )
        {
          if ( System.Char.IsWhiteSpace ( Source [ i ] ) )
          {
            len = i - Start ;

            if ( Trim )
            {
              TrimPunctuation ( Source , ref Start , ref len ) ;
            }

            if ( len > 0 )
            {
              yield return ( new StringSlice ( Source , Start , len ) ) ;
            }

            Start = i + 1 ;
          }

          i++ ;
        }

        len = i - Start ;

        if ( Trim )
        {
          TrimPunctuation ( Source , ref Start , ref len ) ;
        }

        if ( len > 0 )
        {
          yield return ( new StringSlice ( Source , Start , len ) ) ;
        }

        yield break ;
      }
    }

The tokens from my test string are:

My PHB is such a poophead. It's gotten worse since his promotion. My PHB has started his new blog phblog.com. He's SUCH A MISBEGOTTEN POOPHEAD!

String.Split can't do this reliably, plus you lose the information about where the token was, and I need that later.

ReplacementSpec

This class holds the information required for each desired replacement.

    private sealed partial class ReplacementSpec : System.IComparable<ReplacementSpec>
    {
      public static System.Text.RegularExpressions.RegexOptions DefaultRegexOptions = 
        System.Text.RegularExpressions.RegexOptions.Compiled 
        |                         
        System.Text.RegularExpressions.RegexOptions.ExplicitCapture ;

      private System.Text.RegularExpressions.Regex regex ;
    
      private string replacement ;

      private bool duplicatecasing ;

      /* How many replacements have been made. Can be used to sort a list of these. */
      public int HitCount { get ; private set ; } 

      private                                       ReplacementSpec
      (
        string                                      Target
      ,
        string                                      Replacement
      ,
        System.Text.RegularExpressions.RegexOptions Option  
      ,
        bool                                        DuplicateCasing
      )
      {
        this.regex = new System.Text.RegularExpressions.Regex
        (
          Target
        ,
          Option
        ) ;

        this.replacement = Replacement ;

        this.duplicatecasing = DuplicateCasing ;

        this.HitCount = 0 ;

        return ;
      }

      public int
      CompareTo
      (
        ReplacementSpec Op
      )
      {
        return ( this.HitCount.CompareTo ( Op.HitCount ) ) ;
      }

Create

This method is used to process the specifications as defined in the Challenge to produce a ReplacementSpec instance.

The Target may include one or more prefix or suffix characters that specify some details of the search and replace operation. These characters will be removed once they have been interpreted as instructions.

If the Target does not end with an exclamation point (!), then add the Regex Option to IgnoreCase.
If the Target does not end with an asterisk (*), then add a dollar sign ($) at the end of the Regex.
If the Target does not start with an asterisk (*), then add a caret (^) at the beginning of the Regex.
If the Replacement does not end with an exclamation point (!), then the replace operation needs to duplicate the casing of the source. (This is my own addition to the spec.)

Other than the specified prefix and suffix characters above, the Target must be a valid Regular Expression. When specifying the Target, remember that it will be matched against strings that contain no whitespace and neither leading nor trailing punctuation. Try to keep the Regular Expressions simple and efficient; avoid expressions that may lead to backtracking.

Furthermore, beware of unintentionally including "special" characters in the Target, as it might break the Regular Expression. For instance, to add a Target for b(.)(.)bs, you must remember to escape -- prefix with a backslash (\) -- the dots and parentheses, for example: @"b$\.$$\.$bs".

      public static ReplacementSpec
      Create
      (
        string Target
      ,
        string Replacement
      )
      {
        System.Text.RegularExpressions.RegexOptions opt = DefaultRegexOptions ;

        if ( Target.EndsWith ( "!" ) )
        {
          Target = Target.TrimEnd ( new char[] { '!' } ) ;
        }
        else
        {
          opt |= System.Text.RegularExpressions.RegexOptions.IgnoreCase ;
        }

        if ( Target.EndsWith ( "*" ) )
        {
          Target = Target.TrimEnd ( new char[] { '*' } ) ;
        }
        else
        {
          Target += "$" ;
        }

        if ( Target.StartsWith ( "*" ) )
        {
          Target = Target.TrimStart ( new char[] { '*' } ) ;
        }
        else
        {
          Target = "^" + Target ;
        }

        bool dup ;

        if ( Replacement.EndsWith ( "!" ) )
        {
          Replacement = Replacement.TrimEnd ( new char[] { '!' } ) ;

          dup = false ;
        }
        else
        {
          dup = true ;
        }

        return ( new ReplacementSpec ( Target , Replacement , opt , dup ) ) ;
      }

Replace

This is where the magic happens. Given a Candidate string, which is a section of the string being searched, use the Regular Expression to find matches.

If any matches are found, then replace each with the specified replacement text in accordance with whether or not to duplicate the source string's casing. A match may be preceded by non-matched text which must be copied. The final match may be followed by non-matched text which must be copied.

If no matches are found, then no other processing occurs.

      public int
      Replace
      (
        ref string Candidate
      )
      {
        int result = 0 ;

        System.Text.RegularExpressions.MatchCollection mat = this.regex.Matches ( Candidate ) ;

        if ( mat.Count > 0 )
        {
          int offset = 0 ;

          System.Text.StringBuilder sb = new System.Text.StringBuilder ( Candidate.Length ) ;

          for ( int i = 0 ; i < mat.Count ; i++ )
          {
            /* Copy characters after the previous match and before the current match. */
            sb.Append ( Candidate , offset , mat [ i ].Index - offset ) ;

            /* Copy the replacement text. */
            if ( this.duplicatecasing )
            {
              int j = 0 ;

              for ( ; ( j < mat [ i ].Length ) && ( j < this.replacement.Length ) ; j++ )
              {
                if ( System.Char.IsUpper ( mat [ i ].Value , j ) )
                {
                  sb.Append ( System.Char.ToUpper ( this.replacement [ j ] ) ) ;
                }
                else
                {
                  sb.Append ( System.Char.ToLower ( this.replacement [ j ] ) ) ;
                }
              }

              /* If the replacement is longer than the matched text... */
              if ( j < this.replacement.Length )
              {
                /* Use the casing of the last character. */
                CasingDelegate casing = 
                  System.Char.IsUpper ( mat [ i ].Value , j - 1 ) 
                  ? (CasingDelegate) System.Char.ToUpper 
                  : (CasingDelegate) System.Char.ToLower ;

                for ( ; j < this.replacement.Length ; j++ )
                {
                  sb.Append ( casing ( this.replacement [ j ] ) ) ;
                }
              }
            }
            else
            {
              sb.Append ( this.replacement , 0 , this.replacement.Length ) ;
            }

            offset = mat [ i ].Index + mat [ i ].Length ;

             /* Track how many characters were replaced. */
            result += mat [ i ].Length ;
          }

          /* Copy any remaining characters. */
          sb.Append ( Candidate , offset , Candidate.Length - offset ) ;
  
          /* Call StringBuilder.ToString only if there were replacements. */
          Candidate = sb.ToString() ;

          this.HitCount += mat.Count ;
        }

        /* Result is the total number of characters replaced. */
        return ( result ) ;
      }
    }

ReplaceOmatic

This class holds a List of ReplacementSpec instances to apply to a string. The constructor accepts pairs of strings and simply passes each to ReplacementSpec.Create .

namespace PIEBALD.Type
{
  public sealed partial class ReplaceOmatic : System.IDisposable
  {
    private System.Collections.Generic.List<ReplacementSpec> rep ;

    public ReplaceOmatic
    (
      params System.Tuple<string,string>[] Context
    )
    {
      this.rep = new System.Collections.Generic.List<ReplacementSpec> ( Context.Length ) ;

      for ( int i = 0 ; i < Context.Length ; i++ )
      {
        this.rep.Add ( ReplacementSpec.Create ( Context [ i ].Item1 , Context [ i ].Item2 ) ) ;
      }

      return ;
    }

    public void
    Dispose
    (
    )
    {
      this.rep.Clear() ;

      return ;
    }

    public void
    Sort
    (
    )
    {
      /* This will sort in descending order. */
      this.rep.Sort ( ( x , y ) => y.CompareTo ( x ) ) ;

      return ;
    }

Replace

This method iterates the sections of the input string, applies each ReplacementSpec to each section, and produces a new string if (and only if) any replacements were made.

    public int
    Replace
    (
      ref string Candidate
    )
    {
      int result = 0 ;

      int offset = 0 ;

      System.Text.StringBuilder sb = new System.Text.StringBuilder ( Candidate.Length ) ;

      foreach 
      ( 
        StringSlice sub 
      in 
        StringSlice.Tokenize ( Candidate , 0 , Candidate.Length , true ) 
      )
      {
        string temp = sub.ToString() ; /* Performs Substring on the source string. */

        int r = 0 ; 

        /* Apply ReplacementSpecs to the section until we run out of them or we've replaced the entire section. */
        for ( int i = 0 ; ( i < this.rep.Count ) && ( r < Candidate.Length ) ; i++ )
        {
          /* Track how many characters have been replaced so far. */
          r += this.rep [ i ].Replace ( ref temp ) ;
        }

        /* If at least one character was replaced... */
        if ( r > 0 )
        {
          /* Copy characters after the previous replacement and before the current slice. */
          sb.Append ( Candidate , offset , sub.Start - offset ) ;
            
          sb.Append ( temp ) ; 

          offset = sub.Start + sub.Length ;

          /* Track how many characters were replaced. */
          result += r ;
        }
      }

      if ( result > 0 )
      {
        /* Copy any remaining characters. */
        sb.Append ( Candidate , offset , Candidate.Length - offset ) ;
        
        /* Call StringBuilder.ToString only if there were replacements. */
        Candidate = sb.ToString() ;
      }

      /* Result is the total number of characters replaced. */
      return ( result ) ;
    }
  }
}

Conclusion

Testing with the provided input and expected output values, plus some other tests of my own, seems to validate the correctness of the algorithm.

My PHB is such a poophead. It's gotten worse since his promotion. My PHB has started his new blog phblog.com. He's SUCH A MISBEGOTTEN POOPHEAD!
5
My boss is such a p**phead. It's become worse since his promotion. My boss has started his new blog phblog.com. He's SUCH A MISBEGOTTEN P**PHEAD!

Points of Interest

At first I was upset about the absence of an overload for RegEx.Matches ( string , int , int ), but it turned out not to matter and the algorithm is actually better without it.

History

2016-11-27 First published.

2016-11-29 Updates: No longer uses a separate method to copy to the StringBuilder. Stops trying to apply edits if the entire section has already been replaced. Returns the number of characters replaced.

2016-11-30 Updates: Made ReplacementSpec IComparable and added Sort to ReplaceOmatic.