Hello,

I cannot find a valid regular expression pattern for my needs.

I have a sample string like this:

I have four child of seven years each, [seven] years ago I had no child, because I was fourteen

Now I want to match and then substitute the words "four" and "[seven]".

So I have used a pattern like:

\bfour\b|\b\[seven\]\b

(searches using word boundaries to match exact words. Square brackets are escaped to match them literally)

but only "four" is matched and substituted.

If I change the pattern to:

four|\[seven\]

"four" and "[seven]" are both matched. But because I have removed the word boundary anchor "\b", partial word matches can now happen ("four" inside "fourteen", for example), and this is not what I want.

It seems that "\b" is responsible for this strange behaviour, but I don't know why or how to solve it.

Any help is appreciated. Thanks.
Posted
Updated 1-Jul-11 6:13am
v2
Comments
Manfred Rudolf Bihy 1-Jul-11 12:57pm
   
I like your question! 5+
Please also see my answer for an explanation.
thatraja 1-Jul-11 13:15pm
   
"I like your question! 5+"
Manfred, I think you clicked Vote 1 instead of Vote 5 :)
vlad781 1-Jul-11 17:07pm
   
I clicked 5

The brackets in [seven] are not word characters (\w), so \b cannot match between a space and "[" or between "]" and a space. Try \bfour\b|\[seven\] instead.
See here for details

I recommend downloading Expresso and experimenting with it.
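The point about the brackets can be checked with a few calls to Regex.IsMatch. This is only a minimal sketch: the class BoundaryDemo and its helper Matches are made-up names for illustration, and the test strings are fragments chosen to mirror the sample sentence.

```csharp
using System;
using System.Text.RegularExpressions;

static class BoundaryDemo
{
    // Hypothetical helper: true if the pattern matches anywhere in the input.
    public static bool Matches(string input, string pattern)
    {
        return Regex.IsMatch(input, pattern);
    }

    static void Main()
    {
        // A space precedes '[' and follows ']': both neighbours are
        // non-word characters, so there is no word boundary there.
        Console.WriteLine(Matches("each, [seven] years", @"\b\[seven\]\b")); // False

        // With word characters directly outside the brackets, \b does match.
        Console.WriteLine(Matches("x[seven]y", @"\b\[seven\]\b"));           // True

        // Without \b around the brackets, the literal text is found.
        Console.WriteLine(Matches("each, [seven] years", @"\[seven\]"));     // True
    }
}
```

In other words, \b next to a bracket asks for a word character on the other side of it, which is the opposite of what the question intended.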
   
Comments
Nyarlatotep 1-Jul-11 12:44pm
   
Uhm yes, now that you have pointed it out, it seems clear that this is the cause.
Thanks, it seems I need to study the regular expression universe further :)
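Let me elaborate a bit on what Catalin already said. \w is the class of word characters, roughly "[A-Za-z0-9_]" (by default .NET also counts Unicode letters and digits; the class is exactly this ASCII set under RegexOptions.ECMAScript). A word boundary \b can occur only between a \w character and a non-\w character, or between a \w character and the start or end of the string. The code below illustrates this quite nicely: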

C#
using System;
using System.Text.RegularExpressions;

namespace TestSupportService
{
    class Program
    {
        static void Main(string[] args)
        {
            string example = "I have four child of seven years each, [seven] years ago I had no child, because I was fourteen";

            // \b sits next to word characters ("s" and "n"), so both alternatives can match.
            Regex rexWillDo = new Regex(@"\bfour\b|\[\bseven\b\]");
            // \b sits outside the brackets, where the neighbours are spaces, so "[seven]" never matches.
            Regex rexWontDo = new Regex(@"\bfour\b|\b\[seven\]\b");

            // Prints "four" and "[seven]".
            Console.WriteLine("Now you see it!");
            foreach (Match match in rexWillDo.Matches(example))
            {
                Console.WriteLine(match.Value);
            }

            // Prints only "four".
            Console.WriteLine("\nAnd now you don't!");
            foreach (Match match in rexWontDo.Matches(example))
            {
                Console.WriteLine(match.Value);
            }
            Console.ReadLine();
        }
    }
}


So by moving the word boundary anchors next to (real) word characters, the expression works. I admit that I did not expect this behavior either. Regular expressions usually work quite nicely for me, but once in a while the exact definition of \b rears its ugly head and bites us. :( Note that this is how \b is defined in general, not a quirk of MS's implementation.
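If the words to replace are not known in advance, placing \b by hand gets awkward. One alternative, offered here as a sketch rather than as the approach used above, is to replace \b with the lookarounds (?<!\w) and (?!\w), which only require that no word character touches the match. The class WholeTokenReplace, the helper BuildPattern and the placeholder "X" are assumptions made for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class WholeTokenReplace
{
    // Hypothetical helper: builds a pattern matching any of the given literal
    // tokens, but only where no word character touches the match on either side.
    public static string BuildPattern(params string[] tokens)
    {
        // Regex.Escape makes metacharacters such as '[' literal.
        string alternatives = string.Join("|", tokens.Select(t => Regex.Escape(t)));
        // (?<!\w) = not preceded by a word character,
        // (?!\w)  = not followed by a word character.
        // Unlike \b, both also succeed next to '[' or ']'.
        return @"(?<!\w)(?:" + alternatives + @")(?!\w)";
    }

    static void Main()
    {
        string example = "I have four child of seven years each, [seven] years ago I had no child, because I was fourteen";
        string pattern = BuildPattern("four", "[seven]");

        // "fourteen" and the bare "seven" are left alone.
        Console.WriteLine(Regex.Replace(example, pattern, "X"));
        // prints: I have X child of seven years each, X years ago I had no child, because I was fourteen
    }
}
```

One consequence of this choice: a token glued to a word character, such as "x[seven]", is not replaced, whereas the pattern \[\bseven\b\] above would still match inside it.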

Cheers!

—MRB
   
Comments
Nyarlatotep 1-Jul-11 13:03pm
   
I did not expect it either. But it seems that the real \b behavior is what Catalin described. I want to try the same pattern in other languages (PHP, incidentally) and see how it behaves, but I think it will be the same.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



