
Using Regular Expressions in MFC

Sam NG, 18 Jun 2006
CAtlRegExp: a regular expression class that ships with Visual C++.

Introduction

Most people think of Boost.Regex or PCRE when they need regular expressions in a C++ project. Microsoft, however, ships its own regular expression implementation as part of ATL Server: the CAtlRegExp class. As a bonus, CAtlRegExp supports not only ASCII and Unicode but also MBCS.

Supported Regular Expression Syntax

The following tables are copied from MSDN. Note that the syntax is not exactly the same as Perl's. For example, a match (capture) group is written with {} rather than (), which in CAtlRegExp only groups without capturing, and there is no {n} quantifier (match exactly n times) as there is in Perl.
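Because {n} is not available, one possible workaround is to spell the repetition out while building the pattern string. The sketch below uses a hypothetical helper, RepeatAtom, which is not part of ATL; the string it returns would then be passed to CAtlRegExp::Parse as usual.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of ATL): emulates the Perl quantifier {n},
// which CAtlRegExp lacks, by writing the sub-pattern out n times.
// The atom must be a single unit, such as "\\d", "[0-9]" or "(ab)".
std::string RepeatAtom(const std::string& atom, int n)
{
    assert(n >= 0);
    std::string pattern;
    for (int i = 0; i < n; ++i)
        pattern += atom;
    return pattern;
}
```

For example, RepeatAtom("\\d", 4) returns the pattern \d\d\d\d, which plays the role of Perl's \d{4}.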

Metacharacter  Meaning
.              Matches any single character.
[ ]            Indicates a character class. Matches any character inside the brackets (for example, [abc] matches "a", "b", and "c").
^              If this metacharacter occurs at the start of a character class, it negates the character class. A negated character class matches any character except those inside the brackets (for example, [^abc] matches all characters except "a", "b", and "c").
               If ^ is at the beginning of the regular expression, it matches the beginning of the input (for example, ^[abc] will only match input that begins with "a", "b", or "c").
-              In a character class, indicates a range of characters (for example, [0-9] matches any of the digits "0" through "9").
?              Indicates that the preceding expression is optional: it matches once or not at all (for example, [0-9][0-9]? matches "2" and "12").
+              Indicates that the preceding expression matches one or more times (for example, [0-9]+ matches "1", "13", "666", and so on).
*              Indicates that the preceding expression matches zero or more times.
??, +?, *?     Non-greedy versions of ?, +, and *. These match as little as possible, unlike the greedy versions which match as much as possible. Example: given the input "<abc><def>", <.*?> matches "<abc>" while <.*> matches "<abc><def>".
( )            Grouping operator. Example: (\d+,)*\d+ matches a list of numbers separated by commas (such as "1" or "1,23,456").
{ }            Indicates a match group. The actual text in the input that matches the expression inside the braces can be retrieved through the CAtlREMatchContext object.
\              Escape character: interpret the next character literally (for example, [0-9]+ matches one or more digits, but [0-9]\+ matches a digit followed by a plus character). Also used for abbreviations (such as \a for any alphanumeric character; see table below).
               If \ is followed by a number n, it matches the nth match group (starting from 0). Example: <{.*?}>.*?</\0> matches "<head>Contents</head>".
               Note that in C++ string literals, two backslashes must be used: "\\+", "\\a", "<{.*?}>.*?</\\0>".
$              At the end of a regular expression, this character matches the end of the input. Example: [0-9]$ matches a digit at the end of the input.
|              Alternation operator: separates two expressions, exactly one of which matches (for example, T|the matches "The" or "the").
!              Negation operator: the expression following ! does not match the input. Example: a!b matches "a" not followed by "b".
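The note about doubling backslashes in C++ string literals can be illustrated with the back-reference pattern from the table. This is only a sketch: the constant names are made up for illustration, and the raw string literal form assumes a C++11 (or later) compiler, which the Visual C++ versions that shipped ATL Server were not.

```cpp
#include <cassert>
#include <cstring>

// Escaped form: every backslash meant for the regex engine is doubled.
// Writing "\0" with a single backslash would embed a NUL character instead
// and silently cut the pattern short.
const char* const kTagPatternEscaped = "<{.*?}>.*?</\\0>";

// Raw string literal (C++11 and later): backslashes are written only once.
const char* const kTagPatternRaw = R"(<{.*?}>.*?</\0>)";

// Sanity check: both spellings produce the same 15-character pattern.
void CheckTagPatterns()
{
    assert(std::strcmp(kTagPatternEscaped, kTagPatternRaw) == 0);
    assert(std::strlen(kTagPatternRaw) == 15);
}
```

Either constant holds the text <{.*?}>.*?</\0>, which is what the regular expression parser needs to see.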

CAtlRegExp can handle abbreviations, such as \d instead of [0-9]. The abbreviations are provided by the character traits class passed in the CharTraits parameter. The predefined character traits classes provide the following abbreviations:

Abbreviation  Matches
\a            Any alphanumeric character: ([a-zA-Z0-9])
\b            White space (blank): ([ \\t])
\c            Any alphabetic character: ([a-zA-Z])
\d            Any decimal digit: ([0-9])
\h            Any hexadecimal digit: ([0-9a-fA-F])
\n            Newline: (\r|(\r?\n))
\q            A quoted string: (\"[^\"]*\")|(\'[^\']*\')
\w            A simple word: ([a-zA-Z]+)
\z            An integer: ([0-9]+)
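The expansions in the table above can be treated as plain data, which may help when porting a pattern to another engine. The sketch below is an assumption-laden illustration: ExpansionFor and MatchesAbbreviation are hypothetical helpers, not part of ATL, and std::regex (ECMAScript grammar) is used as a stand-in engine to try the expansions out, so CAtlRegExp itself is not exercised.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical lookup (not part of ATL): returns the expansion listed in the
// table above for a CAtlRegExp abbreviation letter, or "" if it is unknown.
std::string ExpansionFor(char abbreviation)
{
    switch (abbreviation) {
        case 'a': return "([a-zA-Z0-9])";
        case 'b': return "([ \\t])";
        case 'c': return "([a-zA-Z])";
        case 'd': return "([0-9])";
        case 'h': return "([0-9a-fA-F])";
        case 'n': return "(\\r|(\\r?\\n))";
        case 'q': return "(\"[^\"]*\")|('[^']*')";
        case 'w': return "([a-zA-Z]+)";
        case 'z': return "([0-9]+)";
        default:  return "";
    }
}

// Checks whether the whole text matches the expansion of an abbreviation,
// using std::regex as a stand-in for the real engine.
bool MatchesAbbreviation(char abbreviation, const std::string& text)
{
    const std::string expansion = ExpansionFor(abbreviation);
    assert(!expansion.empty());
    return std::regex_match(text, std::regex(expansion));
}
```

For instance, MatchesAbbreviation('h', "F") is true and MatchesAbbreviation('h', "g") is false, in line with the \h row of the table.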

Using the code

Although CAtlRegExp is part of the ATL Server classes, your project does not have to be an ATL project to use it; adding #include "atlrx.h" is enough.

I have written a simple dialog-based program to test and demonstrate CAtlRegExp. The core of the program is as follows:

// parse (compile) the regular expression pattern
CAtlRegExp<> regex;
REParseError status = regex.Parse(m_szRegex, m_bCaseSensitive);

if (REPARSE_ERROR_OK != status) {
  // invalid pattern, show the parse error
  m_szStatus = TEXT("Parser Error: ");
  m_szStatus += REError2String(status);
} else {
  // valid pattern, now try to match the input
  CAtlREMatchContext<> mc;
  if (!regex.Match(m_szInput, &mc)) {
    // input does not match
    m_szStatus = TEXT("No match");
  } else {
    // input matches, list the text captured by each match group
    m_szStatus = TEXT("Match succeeded");
    for (UINT nGroupIndex = 0; nGroupIndex < mc.m_uNumGroups;
         ++nGroupIndex) {
      const CAtlREMatchContext<>::RECHAR* szStart = 0;
      const CAtlREMatchContext<>::RECHAR* szEnd = 0;
      mc.GetMatch(nGroupIndex, &szStart, &szEnd);
      // GetMatch yields a [szStart, szEnd) pointer pair, not a terminated string
      int nLength = static_cast<int>(szEnd - szStart);
      CString text(szStart, nLength);
      m_ctrlListBox.AddString(text);
    }
  }
}
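GetMatch reports each group as a pair of pointers into the input buffer, so turning a group into an offset, a length, or a string is plain pointer arithmetic. The following is a minimal sketch of that arithmetic in portable C++; GroupSpan, ToSpan and ToString are hypothetical names, and plain char is assumed in place of RECHAR.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Stand-in for one group reported by GetMatch: a [start, end) pointer pair
// into the input buffer, expressed as an offset and a length.
struct GroupSpan {
    std::size_t offset;  // index of the first matched character in the input
    std::size_t length;  // number of matched characters
};

// Converts the pointer pair into integers relative to the start of the input.
GroupSpan ToSpan(const char* input, const char* start, const char* end)
{
    assert(input <= start && start <= end);
    GroupSpan span;
    span.offset = static_cast<std::size_t>(start - input);
    span.length = static_cast<std::size_t>(end - start);
    return span;
}

// Copies the characters in [start, end) into an owning string.
std::string ToString(const char* start, const char* end)
{
    return std::string(start, end);
}
```

With the input "id=1234;" and a group spanning the digits, ToSpan gives offset 3 and length 4, and ToString gives "1234".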

The function REError2String is as follows:

// refer to REParseError for more information
CString CMfcRegexDlg::REError2String(REParseError status)
{
  switch (status) {
    case REPARSE_ERROR_OK:
         return TEXT("No error occurred");
    case REPARSE_ERROR_OUTOFMEMORY:
         return TEXT("Out of memory");
    case REPARSE_ERROR_BRACE_EXPECTED:
         return TEXT("A closing brace was expected");
    case REPARSE_ERROR_PAREN_EXPECTED:
         return TEXT("A closing parenthesis was expected");
    case REPARSE_ERROR_BRACKET_EXPECTED:
         return TEXT("A closing bracket was expected");
    case REPARSE_ERROR_UNEXPECTED:
         return TEXT("An unspecified fatal error occurred");
    case REPARSE_ERROR_EMPTY_RANGE:
         return TEXT("A range expression was empty");
    case REPARSE_ERROR_INVALID_GROUP:
         return TEXT("A back reference was made to a group")
                TEXT(" that did not exist");
    case REPARSE_ERROR_INVALID_RANGE:
         return TEXT("An invalid range was specified");
    case REPARSE_ERROR_EMPTY_REPEATOP:
         return TEXT("A repeat operator (* or +) was applied")
                TEXT(" to an expression that could be empty");
    case REPARSE_ERROR_INVALID_INPUT:
         return TEXT("The input string was invalid");
    default: return TEXT("Unknown error");
  }
}

Special note about MBCS

By default, CAtlRegExp uses CAtlRECharTraits, which is CAtlRECharTraitsA in non-Unicode builds. Unless your text is strictly pure ASCII, you should use CAtlRECharTraitsMB instead, that is, declare CAtlRegExp<CAtlRECharTraitsMB> and CAtlREMatchContext<CAtlRECharTraitsMB>; otherwise, you may get unexpected results with non-ASCII text. For example, the Chinese character meaning "word" is encoded in Big5 as the two bytes 0xA6 0x72, and the second byte is the ASCII code for 'r', so a matcher that works byte by byte can mistake half of that character for the letter "r".
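The Big5 pitfall can be reproduced without ATL. The sketch below compares a byte-wise search with a lead-byte-aware one; both functions are hypothetical illustrations, and the lead-byte range 0x81 to 0xFE is an assumption made for this example rather than something taken from CAtlRECharTraitsMB.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Assumption for this sketch: Big5 lead bytes lie in 0x81..0xFE and are
// always followed by exactly one trail byte.
bool IsBig5LeadByte(unsigned char b) { return b >= 0x81 && b <= 0xFE; }

// Byte-wise search: treats every byte as a character of its own.
std::size_t FindByteNaive(const std::string& text, char wanted)
{
    return text.find(wanted);
}

// Lead-byte-aware search: skips the trail byte of each two-byte character,
// so a trail byte can never be mistaken for an ASCII character.
std::size_t FindCharBig5(const std::string& text, char wanted)
{
    assert(!IsBig5LeadByte(static_cast<unsigned char>(wanted)));
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (IsBig5LeadByte(static_cast<unsigned char>(text[i]))) {
            ++i;        // step over the trail byte as well
            continue;
        }
        if (text[i] == wanted)
            return i;
    }
    return std::string::npos;
}
```

For the four bytes 0xA6 0x72 0x21 0x72 (the two-byte character followed by "!r"), the byte-wise search reports an 'r' at index 1, inside the Chinese character, while the lead-byte-aware search reports the real one at index 3.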

History

  • 6th March 2006: Initial version uploaded.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

About the Author

Sam NG

Hong Kong

Comments and Discussions

 
ATL Server not in VC++ 2008 (bob16972, 20-Jun-08 3:23)
My interest in ATL was short-lived, as I've noticed Microsoft no longer includes the ATL Server library (except for a few data encoding/decoding classes) in VC++ 2008. Unfortunately, CAtlRegExp was not one of the few classes they kept.

Microsoft no longer maintains or ships ATL Server with VC++ and has released it as shared source on CodePlex.

See also: Visual C++ 2008 ATL breaking changes.

Just thought I'd pass this on.

Either way, thanks again for posting your article, as it helped me navigate through all the details in a short time.
Last Updated 18 Jun 2006
Article Copyright 2006 by Sam NG