Background
This code is in response to the Coding challenge: bad word filter[^] posed by Chris.
Introduction
This code performs replacements within strings as specified via configuration string pairs. You may review the specifications on the Code Challenge's page. Most entries so far have used Regular Expressions and this one does too, but in a fairly limited capacity. Because Regular Expressions can be inefficient when presented with such a non-regular request, rather than use Regular Expressions to scan the whole string (perhaps several times), this code splits the string into parts and then uses Regular Expressions on the parts. In the interest of performance, it avoids Replace, Substring, Split, and ToString as much as possible.
Using the code
A ReplaceOmatic can be instantiated and used like this:
PIEBALD.Type.ReplaceOmatic replacer =
new PIEBALD.Type.ReplaceOmatic
(
new System.Tuple<string,string> ( "PHB!" , "boss!" )
,
new System.Tuple<string,string> ( "gotten" , "become" )
,
new System.Tuple<string,string> ( "*p[o0]{2}p*" , "p**p" )
) ;
string intext = "My PHB is such a poophead. It's gotten worse since his promotion. My PHB has started his new blog phblog.com. He's SUCH A MISBEGOTTEN POOPHEAD!" ;
string outtext = intext ;
int total = replacer.Replace ( ref outtext ) ;
System.Console.WriteLine ( "{0}\n{1}\n{2}" , intext , total , outtext ) ;
This class is not intended to be instantiated, used once, and then disposed. I recommend keeping one long-term. This allows the class to gather statistics about which Regular Expressions encounter the most matches. With that data, the List of ReplacementSpecs can be sorted so the ones with the most hits can be tried first.
StringSlice
This class is used to define a section of a string. The goal is to avoid using Substring unless truly necessary. ToString will call Substring on the parent string, so we want to do that as seldom as possible.
private sealed partial class StringSlice
{
private string source ;
public int Start { get ; private set ; }
public int Length { get ; private set ; }
private StringSlice
(
string Source
,
int Start
,
int Length
)
{
this.source = Source ;
this.Start = Start ;
this.Length = Length ;
return ;
}
public override string
ToString
(
)
{
return ( this.source.Substring ( this.Start , this.Length ) ) ;
}
TrimPunctuation
This method assists with defining a section of a string that has neither leading nor trailing punctuation characters. For the current Challenge, I must allow punctuation within a section, such as sh!t
and b(.)(.)bs
.
private static void
TrimPunctuation
(
string Source
,
ref int Start
,
ref int Length
)
{
while ( ( Length > 0 ) && System.Char.IsPunctuation ( Source [ Start ] ) )
{
Start++ ;
Length-- ;
}
while ( ( Length > 0 ) && System.Char.IsPunctuation ( Source [ Start + Length - 1 ] ) )
{
Length-- ;
}
return ;
}
Tokenize
This method is the only way to create instances of StringSlice. Each StringSlice will define a non-empty section of the input string. Sections are delimited by whitespace and may or may not include leading and trailing punctuation. For the current Challenge, I include neither leading nor trailing punctuation characters. Similarly, when punctuation trimming is performed, a StringSlice will never contain only punctuation.
The method is an Iterator to reduce resource usage. Rather than scanning the entire string at once and producing many StringSlices, one section is produced at a time, so the caller can process it and throw it out.
public static System.Collections.Generic.IEnumerable<StringSlice>
Tokenize
(
string Source
,
int Start
,
int Length
,
bool Trim
)
{
int i = 0 ;
int len ;
while ( i < Length )
{
if ( System.Char.IsWhiteSpace ( Source [ i ] ) )
{
len = i - Start ;
if ( Trim )
{
TrimPunctuation ( Source , ref Start , ref len ) ;
}
if ( len > 0 )
{
yield return ( new StringSlice ( Source , Start , len ) ) ;
}
Start = i + 1 ;
}
i++ ;
}
len = i - Start ;
if ( Trim )
{
TrimPunctuation ( Source , ref Start , ref len ) ;
}
if ( len > 0 )
{
yield return ( new StringSlice ( Source , Start , len ) ) ;
}
yield break ;
}
}
The tokens from my test string are:
My PHB is such a poophead. It's gotten worse since his promotion. My PHB has started his new blog phblog.com. He's SUCH A MISBEGOTTEN POOPHEAD!
String.Split can't do this reliably, plus you lose the information about where the token was, and I need that later.
ReplacementSpec
This class holds the information required for each desired replacement.
private sealed partial class ReplacementSpec : System.IComparable<ReplacementSpec>
{
public static System.Text.RegularExpressions.RegexOptions DefaultRegexOptions =
System.Text.RegularExpressions.RegexOptions.Compiled
|
System.Text.RegularExpressions.RegexOptions.ExplicitCapture ;
private System.Text.RegularExpressions.Regex regex ;
private string replacement ;
private bool duplicatecasing ;
public int HitCount { get ; private set ; }
private ReplacementSpec
(
string Target
,
string Replacement
,
System.Text.RegularExpressions.RegexOptions Option
,
bool DuplicateCasing
)
{
this.regex = new System.Text.RegularExpressions.Regex
(
Target
,
Option
) ;
this.replacement = Replacement ;
this.duplicatecasing = DuplicateCasing ;
this.HitCount = 0 ;
return ;
}
public int
CompareTo
(
ReplacementSpec Op
)
{
return ( this.HitCount.CompareTo ( Op.HitCount ) ) ;
}
Create
This method is used to process the specifications as defined in the Challenge to produce a ReplacementSpec instance.
The Target may include one or more prefix or suffix characters that specify some details of the search and replace operation. These characters will be removed once they have been interpreted as instructions.
- If the Target does not end with an exclamation point (!), then add the Regex Option to IgnoreCase.
- If the Target does not end with an asterisk (*), then add a dollar sign ($) at the end of the Regex.
- If the Target does not start with an asterisk (*), then add a caret (^) at the beginning of the Regex.
- If the Replacement does not end with an exclamation point (!), then the replace operation needs to duplicate the casing of the source. (This is my own addition to the spec.)
Other than the specified prefix and suffix characters above, the Target must be a valid Regular Expression. When specifying the Target, remember that it will be matched against strings that contain no whitespace and neither leading nor trailing punctuation. Try to keep the Regular Expressions simple and efficient; avoid expressions that may lead to backtracking.
Furthermore, beware of unintentionally including "special" characters in the Target, as it might break the Regular Expression. For instance, to add a Target for b(.)(.)bs
, you must remember to escape -- prefix with a backslash (\) -- the dots and parentheses, for example: @"b\(\.\)\(\.\)bs"
.
public static ReplacementSpec
Create
(
string Target
,
string Replacement
)
{
System.Text.RegularExpressions.RegexOptions opt = DefaultRegexOptions ;
if ( Target.EndsWith ( "!" ) )
{
Target = Target.TrimEnd ( new char[] { '!' } ) ;
}
else
{
opt |= System.Text.RegularExpressions.RegexOptions.IgnoreCase ;
}
if ( Target.EndsWith ( "*" ) )
{
Target = Target.TrimEnd ( new char[] { '*' } ) ;
}
else
{
Target += "$" ;
}
if ( Target.StartsWith ( "*" ) )
{
Target = Target.TrimStart ( new char[] { '*' } ) ;
}
else
{
Target = "^" + Target ;
}
bool dup ;
if ( Replacement.EndsWith ( "!" ) )
{
Replacement = Replacement.TrimEnd ( new char[] { '!' } ) ;
dup = false ;
}
else
{
dup = true ;
}
return ( new ReplacementSpec ( Target , Replacement , opt , dup ) ) ;
}
Replace
This is where the magic happens. Given a Candidate string, which is a section of the string being searched, use the Regular Expression to find matches.
If any matches are found, then replace each with the specified replacement text in accordance with whether or not to duplicate the source string's casing. A match may be preceded by non-matched text which must be copied. The final match may be followed by non-matched text which must be copied.
If no matches are found, then no other processing occurs.
public int
Replace
(
ref string Candidate
)
{
int result = 0 ;
System.Text.RegularExpressions.MatchCollection mat = this.regex.Matches ( Candidate ) ;
if ( mat.Count > 0 )
{
int offset = 0 ;
System.Text.StringBuilder sb = new System.Text.StringBuilder ( Candidate.Length ) ;
for ( int i = 0 ; i < mat.Count ; i++ )
{
sb.Append ( Candidate , offset , mat [ i ].Index - offset ) ;
if ( this.duplicatecasing )
{
int j = 0 ;
for ( ; ( j < mat [ i ].Length ) && ( j < this.replacement.Length ) ; j++ )
{
if ( System.Char.IsUpper ( mat [ i ].Value , j ) )
{
sb.Append ( System.Char.ToUpper ( this.replacement [ j ] ) ) ;
}
else
{
sb.Append ( System.Char.ToLower ( this.replacement [ j ] ) ) ;
}
}
if ( j < this.replacement.Length )
{
CasingDelegate casing =
System.Char.IsUpper ( mat [ i ].Value , j - 1 )
? (CasingDelegate) System.Char.ToUpper
: (CasingDelegate) System.Char.ToLower ;
for ( ; j < this.replacement.Length ; j++ )
{
sb.Append ( casing ( this.replacement [ j ] ) ) ;
}
}
}
else
{
sb.Append ( this.replacement , 0 , this.replacement.Length ) ;
}
offset = mat [ i ].Index + mat [ i ].Length ;
result += mat [ i ].Length ;
}
sb.Append ( Candidate , offset , Candidate.Length - offset ) ;
Candidate = sb.ToString() ;
this.HitCount += mat.Count ;
}
return ( result ) ;
}
}
ReplaceOmatic
This class holds a List of ReplacementSpec instances to apply to a string. The constructor accepts pairs of strings and simply passes each to ReplacementSpec.Create .
namespace PIEBALD.Type
{
public sealed partial class ReplaceOmatic : System.IDisposable
{
private System.Collections.Generic.List<ReplacementSpec> rep ;
public ReplaceOmatic
(
params System.Tuple<string,string>[] Context
)
{
this.rep = new System.Collections.Generic.List<ReplacementSpec> ( Context.Length ) ;
for ( int i = 0 ; i < Context.Length ; i++ )
{
this.rep.Add ( ReplacementSpec.Create ( Context [ i ].Item1 , Context [ i ].Item2 ) ) ;
}
return ;
}
public void
Dispose
(
)
{
this.rep.Clear() ;
return ;
}
public void
Sort
(
)
{
this.rep.Sort ( ( x , y ) => y.CompareTo ( x ) ) ;
return ;
}
Replace
This method iterates the sections of the input string, applies each ReplacementSpec to each section, and produces a new string if (and only if) any replacements were made.
public int
Replace
(
ref string Candidate
)
{
int result = 0 ;
int offset = 0 ;
System.Text.StringBuilder sb = new System.Text.StringBuilder ( Candidate.Length ) ;
foreach
(
StringSlice sub
in
StringSlice.Tokenize ( Candidate , 0 , Candidate.Length , true )
)
{
string temp = sub.ToString() ;
int r = 0 ;
for ( int i = 0 ; ( i < this.rep.Count ) && ( r < Candidate.Length ) ; i++ )
{
r += this.rep [ i ].Replace ( ref temp ) ;
}
if ( r > 0 )
{
sb.Append ( Candidate , offset , sub.Start - offset ) ;
sb.Append ( temp ) ;
offset = sub.Start + sub.Length ;
result += r ;
}
}
if ( result > 0 )
{
sb.Append ( Candidate , offset , Candidate.Length - offset ) ;
Candidate = sb.ToString() ;
}
return ( result ) ;
}
}
}
Conclusion
Testing with the provided input and expected output values, plus some other tests of my own, seems to validate the correctness of the algorithm.
My PHB is such a poophead. It's gotten worse since his promotion. My PHB has started his new blog phblog.com. He's SUCH A MISBEGOTTEN POOPHEAD!
5
My boss is such a p**phead. It's become worse since his promotion. My boss has started his new blog phblog.com. He's SUCH A MISBEGOTTEN P**PHEAD!
Points of Interest
At first I was upset about the absence of an overload for RegEx.Matches ( string , int , int )
, but it turned out not to matter and the algorithm is actually better without it.
History
2016-11-27 First published.
2016-11-29 Updates: No longer uses a separate method to copy to the StringBuilder. Stops trying to apply edits if the entire section has already been replaced. Returns the number of characters replaced.
2016-11-30 Updates: Made ReplacementSpec IComparable and added Sort to ReplaceOmatic.