# Regular Expressions – Big daddy o’ string manipulation

Posted 2 Feb 2012, CPOL

Regular expressions are an amazing way to go. A regular expression (regex or regexp for short) is a special text string that describes a search pattern. You can think of regular expressions as wildcards on steroids. Almost all languages support them, and with a little understanding you can get good at them. What’s more, they can condense dozens of lines of string-handling logic into a couple of lines. I’ve used regular expressions both in the .NET Framework and in JavaScript and found them immensely helpful. This blog post is mostly about them, and here’s a little outline of how I’ll approach the topic.

• A problem – We engineers are problem solvers, so we start with a simple problem and see how regular expressions solve it in an even simpler way
• Dissecting the solution – A walk through that simpler solution
• Links to raw regular expression resources – Unless you can write some of these expressions yourself, this language feature never gets easier to use
• `Regex` class – The power of `Regex` in the .NET Framework and what you can achieve with it

Starting with this, followed by that, then this and then a dozen of that – A Problem!

I was given the problem statement below and a laptop to code a solution on. Developer instincts are hard to ignore, and I soon started scribbling an elegant algorithm, or so I thought.

Domestic passport numbers either begin with the letters TW followed by 12 digits, or begin with 6 digits followed by 4 other characters and end with the letters OFTW.

Pseudo logic:

1. Is Starting With TW?
   a. Are the remaining 12 characters – digits?
      i. Return “Domestic”
2. Is Ending with OFTW?
   a. Does it begin with 6 digits?
      i. Return “Domestic”
3. If all above fails
   a. Return “Foreigner”

So with all the logic and intricacies figured out, I translated the pseudo code directly into C#:

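```
// Needs: using System; using System.Globalization;
public bool IsDomestic(string PassportNumber)
{
    // Both valid forms are exactly 14 characters long; checking this first
    // also keeps Substring from throwing on short input.
    if (PassportNumber == null || PassportNumber.Length != 14)
    {
        return false;
    }
    if (PassportNumber.StartsWith("TW", StringComparison.Ordinal))
    {
        // The 12 characters after "TW" must all be digits.
        string Rest = PassportNumber.Substring(2, 12);
        Int64 Result;
        // NumberStyles.None rejects signs and whitespace that plain TryParse allows.
        if (Int64.TryParse(Rest, NumberStyles.None, CultureInfo.InvariantCulture, out Result))
        {
            return true;
        }
    }
    else if (PassportNumber.EndsWith("OFTW", StringComparison.Ordinal))
    {
        // The first 6 characters must be digits; the 4 in between are not inspected.
        string First6 = PassportNumber.Substring(0, 6);
        Int64 Result;
        if (Int64.TryParse(First6, NumberStyles.None, CultureInfo.InvariantCulture, out Result))
        {
            return true;
        }
    }
    return false;
}
```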

That’s when I was asked: don’t you keep up with all the language features? Especially something called regular expressions?

Well, I don’t claim to be an all-knowing buff, but I knew what regular expressions were and I knew perfectly well how to use them. Surprisingly, they just never struck me. Given a problem, my first instinct was to approach it analytically and solve it step by step. The regular expression way was the exact opposite, and it was elegant; I kicked myself for not thinking of it.

The Solution – Simpler one

So here’s the code, had I used `Regex`:

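```
// Needs: using System.Text.RegularExpressions;
public bool IsDomestic(string PassportNumber)
{
    // ^ and $ anchor each alternative to the whole string.
    return Regex.Match(PassportNumber,
        "^(TW)(\\d{12})($)|^(\\d{6})([A-Za-z0-9]{4})(OFTW)($)").Success;
}
```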

Surprisingly fewer lines of code, isn’t it?

Let’s dissect the solution

Without going into much depth, let me explain what the pattern (yes, I called it a pattern) means:

`^(TW)(\\d{12})($)|^(\\d{6})([A-Za-z0-9]{4})(OFTW)($)`

(The doubled backslash in `\\d` is only C# string escaping; the regex engine sees `\d`.)

You see the pipe (|) symbol in the middle? Let me split the pattern at that point.

Left part: ^(TW)(\\d{12})($)

• ^ - The match has to begin at the start of the `string`
• (TW) - Simply means that the two initial characters are TW
• (\\d{12}) - Says that 12 digits follow
• ($) - The `string` should end at this point, no more characters

Right part: ^(\\d{6})([A-Za-z0-9]{4})(OFTW)($)

• ^ - Again, the match has to begin at the start of the `string`
• (\\d{6}) – The first 6 characters are digits
• ([A-Za-z0-9]{4}) - 4 alphanumeric (A-Z or a-z or 0-9) characters follow
• (OFTW) – The next four characters are OFTW
• ($) - The `string` should end at this point, no more characters, as we saw earlier

Did I miss the pipe (|)? You know what it is; it’s just an OR condition. The `string` in question, say `TW012345678901`, can match either the left part of the pattern or the right part.

This should have given you a basic idea of how to formulate a simple regex pattern; if you need further help, you can visit http://www.regular-expressions.info/reference.html. It gives you a good overview of regex syntax.

Or if you want a quick one-stop solution/pattern, visit Mark’s http://txt2re.com/; it’s an amazing site.
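To see the two alternatives at work, here is a minimal sketch that runs a few made-up passport numbers through an anchored form of the pattern. The class name `PassportPatternDemo`, the `Describe` helper, the `^` anchors and the sample values are all assumptions added for illustration, and the capture groups are dropped because nothing here reads them.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PassportPatternDemo
{
    // Assumption: each alternative is anchored with ^ and $ so the whole
    // string has to match, not just a part of it.
    private const string Pattern = @"^TW\d{12}$|^\d{6}[A-Za-z0-9]{4}OFTW$";

    public static bool IsDomestic(string passportNumber)
    {
        return Regex.IsMatch(passportNumber, Pattern);
    }

    // Hypothetical helper that labels a value the way the pseudo logic does.
    public static string Describe(string passportNumber)
    {
        return IsDomestic(passportNumber) ? "Domestic" : "Foreigner";
    }
}
```

With these assumptions, `Describe("TW012345678901")` and `Describe("123456AB12OFTW")` come back as "Domestic", while a truncated value such as `"TW0123"` or one with extra leading characters such as `"XXTW012345678901"` comes back as "Foreigner".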

The almighty Regex class of .NET

Let’s step back a bit and focus on the `Regex` class in the .NET Framework, and see how much better we can use it!

Check if a string is looking as it should

The above solution does exactly that. The `Match` and `IsMatch` methods of the `Regex` class try to fit the given `string` into a pattern. `IsMatch` returns `true` if the `string` matches the pattern and `false` if it looks nothing like it; `Match` returns a `Match` object whose `Success` property carries the same answer. The simplest form of either method asks only for the `string` and the pattern.
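The difference between the two methods can be sketched as follows: `IsMatch` answers only yes or no, while the object returned by `Match` also lets you read what each group captured. The class `MatchVersusIsMatch`, its method names and the sample pattern are hypothetical and exist only for this illustration.

```csharp
using System.Text.RegularExpressions;

public static class MatchVersusIsMatch
{
    // Hypothetical pattern: "TW" followed by exactly 12 digits.
    private const string Pattern = @"^(TW)(\d{12})$";

    // IsMatch returns a plain bool.
    public static bool LooksValid(string input)
    {
        return Regex.IsMatch(input, Pattern);
    }

    // Match returns a Match object; group 2 holds the digits.
    // An empty string is returned when the input does not match.
    public static string DigitsPart(string input)
    {
        Match m = Regex.Match(input, Pattern);
        return m.Success ? m.Groups[2].Value : string.Empty;
    }
}
```

If you only need a yes/no answer, `IsMatch` is the shorter call; reach for `Match` when you also want the captured pieces.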

Replace a questionable part in a string

Take, for instance, the exclaimer. He gets too excited over nothing and you just need to dial his excitement down. How?

```
string SampleText = "This is Outrageous!!!!!!!! Regex can’t solve all my problems!!!! " +
    "What if it can!!!!!";
```

You can never predict how many exclamation marks (!) he will use, nor does he use the same number every time. The `Replace` method of `Regex` helps you there.

```
SampleText = Regex.Replace(SampleText, "(!)+", "!");
Console.WriteLine(SampleText);
```

P.S.: The plus (+) in the pattern says that the (!) can appear one or more times in a row in the `string`. The last argument is the replacement: every part of the `string` that matches the pattern is replaced with it.

This gives me an elegant output like this:

`This is Outrageous! Regex can’t solve all my problems! What if it can!`
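The same idea stretches to other repeated punctuation. As a sketch, assuming you also want to tame runs of question marks, a backreference (`\1`) means "the same character again" and `$1` in the replacement writes a single copy back. The class name `ExcitementFilter` and the method `CalmDown` are made up for this example.

```csharp
using System.Text.RegularExpressions;

public static class ExcitementFilter
{
    // ([!?]) captures one '!' or '?'; \1+ matches one or more repeats of
    // that same character; "$1" writes the captured character back once.
    public static string CalmDown(string text)
    {
        return Regex.Replace(text, @"([!?])\1+", "$1");
    }
}
```

Because the backreference insists on the same character, a mixed run such as `?!` is left alone and only true repeats are collapsed.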

Pattern matching, only simpler:

Imagine how hard it is to match patterns by hand, or to count the substrings that match a pattern you’re looking for. `Regex` offers you an elegant way to do the same, using `Regex.Matches`. Let’s take the statement below:

“Are you working on any special projects at work? I am not reading any books right now. Aren’t you teaching at the university now?”

I need to fish out all the present-continuous verb forms in it, i.e., the words ending with “`ing`”. Normally this would take a lot of code. But with `Regex`, we can do it faster and more simply using the pattern `string` “`(\\b)(\\w+)(ing)(\\b)`”.

This says – in sequence:

• \\b – a word boundary (the position where a word starts or ends, not a blank space itself)
• \\w+ – one or more word characters
• ing – the specific set of characters “ing”
• \\b – a word boundary

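```
// Needs: using System; using System.Text.RegularExpressions;
string TestString = "Are you working on any special projects at work? " +
    "I am not reading any books right now. Aren’t you teaching at the university now?";
MatchCollection collObj = Regex.Matches(TestString, "(\\b)(\\w+)(ing)(\\b)");
foreach (Match item in collObj)
{
    Console.WriteLine(item.Value);
}
```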

This prints out all occurrences of the pattern, which are:

```
working
reading
teaching
```
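Since counting matches was mentioned above as well, here is a small sketch that returns both the matched words and their count. `IngWords`, `Find` and `Count` are hypothetical names, and the pattern is written as a verbatim string without the capture groups, which nothing here needs.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class IngWords
{
    // \b = word boundary, \w+ = one or more word characters, then "ing".
    private const string Pattern = @"\b\w+ing\b";

    // Collects every matched word in the order it appears.
    public static List<string> Find(string text)
    {
        var words = new List<string>();
        foreach (Match m in Regex.Matches(text, Pattern))
        {
            words.Add(m.Value);
        }
        return words;
    }

    // MatchCollection.Count gives the number of matches directly.
    public static int Count(string text)
    {
        return Regex.Matches(text, Pattern).Count;
    }
}
```

Keep in mind that the pattern knows nothing about grammar: any word ending in "ing", such as "ring" or "anything", is matched too, so treat the result as a list of candidates.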

Summary

Through this post, I’ve summarized most of the key uses of `Regex`, with special emphasis on .NET code. Beyond a certain point, it is up to your creativity how you tailor a `Regex` solution to your problem. You can do away with ugly `for` loops and unnecessary `if`-`else` constructs in code.

