Splitting Pascal/Camel Case with RegEx Enhancements

3 Jul 2012 | CPOL

In Jon Galloway’s Splitting Camel Case with RegEx blog post, he introduced a simple regular expression replacement which can split “ThisIsInPascalCase” into “This Is In Pascal Case”. Here’s the original code:

output = System.Text.RegularExpressions.Regex.Replace(
    input,
    "([A-Z])",
    " $1",
    System.Text.RegularExpressions.RegexOptions.Compiled).Trim();

Simple and effective: it matches every capital letter and inserts a space before it. But there’s room for improvement. First, the call to String.Trim() is only there to remove the leading space that gets added when the first letter is uppercase. That can be handled instead with a “match if prefix is absent” group containing the start-of-input anchor ^ (without RegexOptions.Multiline, ^ matches only at the very beginning of the string). This prevents a match on the first character, which eliminates the need for the String.Trim() call. The formal name for this grouping construct is “zero-width negative lookbehind assertion”, but just think of it as “if you see what’s in here, don’t match the next thing”.

(?<!^)([A-Z])
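To see what the lookbehind changes, here is a minimal sketch that runs both patterns without the Trim() call. The class and method names (LeadingSpaceDemo, SplitWithoutLookbehind, SplitWithLookbehind) are made up for illustration and do not come from the original post.

```csharp
using System.Text.RegularExpressions;

public static class LeadingSpaceDemo
{
    // Original pattern: a space is inserted before every capital letter,
    // including one at position 0, so the result starts with a space.
    public static string SplitWithoutLookbehind(string input) =>
        Regex.Replace(input, "([A-Z])", " $1");

    // (?<!^) rejects a match at the very start of the input,
    // so no leading space is produced and Trim() is not needed.
    public static string SplitWithLookbehind(string input) =>
        Regex.Replace(input, "(?<!^)([A-Z])", " $1");
}
```

For “ThisIsInPascalCase”, the first method returns “ This Is In Pascal Case” with a leading space, while the second returns “This Is In Pascal Case”.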

Next, there’s a potential issue with how acronyms are handled. Given this fictional book title, “WCFForNoobs”, the split occurs on each uppercase letter, resulting in “W C F For Noobs”. The fix is simple, though: require that the uppercase letter be followed by a lowercase letter:

(?<!^)([A-Z][a-z])
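A small sketch of the difference on the acronym example, assuming the same " $1" replacement as before. The names AcronymDemo, SplitEveryCapital and SplitCapitalBeforeLowercase are hypothetical and only serve this comparison.

```csharp
using System.Text.RegularExpressions;

public static class AcronymDemo
{
    // Splits before every capital except the first character,
    // which breaks acronyms into single letters.
    public static string SplitEveryCapital(string input) =>
        Regex.Replace(input, "(?<!^)([A-Z])", " $1");

    // Splits only before a capital that is followed by a lowercase letter,
    // so a run of capitals stays together.
    public static string SplitCapitalBeforeLowercase(string input) =>
        Regex.Replace(input, "(?<!^)([A-Z][a-z])", " $1");
}
```

For “WCFForNoobs”, SplitEveryCapital gives “W C F For Noobs” and SplitCapitalBeforeLowercase gives “WCF For Noobs”.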

Now the result is “WCF For Noobs” (aren’t we all!). But the pattern no longer adds a space before an acronym: for “LearnWCFInSixEasyMonths”, the result is “LearnWCF In Six Easy Months”. No problem: add an alternate match for an uppercase letter that comes right after a lowercase letter. The replacement pattern makes this trickier, because we don’t want the space to go before the lowercase letter; we want it between the lowercase letter and the first capital of the acronym. RegEx can handle this with another lookbehind group, “match prefix but exclude it”: (?<=). The lookbehind requires the lowercase letter to be present but does not include it in the match, so only the uppercase letter is captured, and the replacement inserts the space between the two letters. By itself, that looks like this:

((?<=[a-z])[A-Z])

Great! But this needs to be combined with the previous expression. That’s easily accomplished with an either/or match using the vertical bar “or” construct:

(?<!^)([A-Z][a-z]|(?<=[a-z])[A-Z])
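One way to package the combined expression is as a small helper, sketched below. PascalCaseSplitter, WordBoundary and Split are illustrative names, not part of the original code; holding the Regex in a static field is just one common way to reuse it across calls.

```csharp
using System.Text.RegularExpressions;

public static class PascalCaseSplitter
{
    // Either a capital followed by a lowercase letter, or a capital
    // preceded by a lowercase letter; never at the start of the input.
    private static readonly Regex WordBoundary = new Regex(
        "(?<!^)([A-Z][a-z]|(?<=[a-z])[A-Z])",
        RegexOptions.Compiled);

    // Inserts a space before each match; no Trim() call is required.
    public static string Split(string input) =>
        WordBoundary.Replace(input, " $1");
}
```

With this sketch, PascalCaseSplitter.Split("WCFForNoobs") returns “WCF For Noobs”, and the original “ThisIsInPascalCase” input still comes out as “This Is In Pascal Case”.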

The example “LearnWCFInSixEasyMonths” will now be split into “Learn WCF In Six Easy Months”. These same techniques can be used for additional splits – perhaps on numbers or underscores. More generally, lookbehind and lookahead are great tools to have in your RegEx toolbelt.
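As a rough sketch of how those additional splits might look, the pattern below adds two assumptions of my own, not taken from the article: a digit is split off when it follows a letter, and a capital is split off when it follows a digit. Underscores are handled in a second pass that turns each run into a single space, and (?<!_) keeps the first pass from adding a second space right after an underscore. ExtendedSplitter and its members are hypothetical names.

```csharp
using System.Text.RegularExpressions;

public static class ExtendedSplitter
{
    // Assumed extension of the article's pattern:
    //   (?<!_)                 no split directly after an underscore
    //   (?<=[a-z0-9])[A-Z]     capital after a lowercase letter or digit
    //   (?<=[A-Za-z])[0-9]     first digit after a letter
    private static readonly Regex WordBoundary = new Regex(
        "(?<!^)(?<!_)([A-Z][a-z]|(?<=[a-z0-9])[A-Z]|(?<=[A-Za-z])[0-9])",
        RegexOptions.Compiled);

    // Runs of underscores become a single space in a second pass.
    private static readonly Regex Underscores = new Regex("_+", RegexOptions.Compiled);

    public static string Split(string input)
    {
        string spaced = WordBoundary.Replace(input, " $1");
        return Underscores.Replace(spaced, " ");
    }
}
```

Under these assumptions, “Top10Lists” becomes “Top 10 Lists”, “Win8SDK” becomes “Win 8 SDK”, and “user_IdValue” becomes “user Id Value”. Whether a digit should start a new word at all depends on the naming convention in use, so treat the digit rules as one option rather than the expected behavior.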

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

