Introduction

Regular expressions (sometimes known as regexps) describe text patterns in a concise form. They are useful whenever you need to apply such pattern matching: input validation, lightweight lexing, parsing email addresses, and so on. Many scripting languages (such as Perl and Python) have a built-in regular expression engine. One is also provided with the .NET framework.

For C++ programmers, there are several regular expression libraries available "in the wild" already. They are faster and more up to date than this one. They also understand more complex regular expressions (such as Perl regexps or POSIX regexps).

Why provide this one, then?

One answer: code footprint. This library is based on Henry Spencer's early public-domain regular expression implementation. When compiled with the test program under a Linux/ELF32 system, it weighs around 20 KB. Under Win32, you get a bit more than 19 KB in release mode. Most full POSIX-compliant regular expression engines are larger.

Code footprint may not seem to be a big deal in these days of 100 GB hard drives and multi-megabyte applications. However, there may be reasons you need a small footprint: you may want to use regular expressions on a Pocket PC, for instance, or put them in a downloadable ActiveX control. In addition, sometimes you don't need the full POSIX regular expressions; simple regexes will do fine. Or you may want simple code so you can understand the implementation more easily. In those cases, this small engine is just the thing.

Background

To use this library, you need to know the basic regular expression syntax. A good introduction can be found at this page. This particular implementation is a superset of the "extended regular expression" dialect. Essentially, the following are available:
In other words, it does not support:
In general, consider the engine to support the extended, pre-POSIX regular expression syntax, without the Perl extensions, and with character classes added "on top" by me as a somewhat ugly hack.

As mentioned before, the original regexp engine was written by Henry Spencer; I found it at ftp://ftp.zoo.toronto.edu/ and modified it to support wide characters with the appropriate preprocessor definition. I also extended it to work with an arbitrary number of subexpressions (the original code was limited to 9). The resulting interface is close to the POSIX regex interface, but not quite the same, unfortunately. The new interface is re-entrant (the original definitely was not). I basically tailored it so it would be easy to make it work with dynamic memory allocation.

Using the code

The C interface

You have three ways of using that code. The basic C interface (found in the regexp.h header) is shown in the following example:

#include "regexp.h"
#include <cstdio>   // printf
#include <string>
int parse_email(const std::string& to_match,
                std::string& user_name,
                std::string& host_and_domain,
                std::string& domain_suffix)
{
    regexp* compiled; // line A
    int retval = re_comp(&compiled,
                         "^([A-Za-z0-9]+)@(.+)\\.(\\a+)$"); // line B
    if(retval < 0)
        return retval; // line C
    regmatch* matches = new regmatch[re_nsubexp(compiled)]; // line D
    retval = re_exec(compiled,
                     to_match.c_str(),
                     re_nsubexp(compiled),
                     &matches[0]); // line E
    re_free(compiled); // line F
    if(retval < 1) // line G
    {
        delete[] matches;
        return retval;
    }
    user_name = std::string(to_match.begin() + matches[1].begin,
                            to_match.begin() + matches[1].end); // line H
    host_and_domain = std::string(to_match.begin() + matches[2].begin,
                                  to_match.begin() + matches[2].end);
    domain_suffix = std::string(to_match.begin() + matches[3].begin,
                                to_match.begin() + matches[3].end);
    delete[] matches;
    return 1;
}

int main(int argc, char* argv[])
{
    if(argc >= 2)
    {
        std::string user_name, host_and_domain, domain_suffix;
        if(parse_email(argv[1], user_name,
                       host_and_domain, domain_suffix) < 1)
        {
            printf("Not an email address\n");
            return 1;
        }
        printf("User name: %s\nHost/domain: %s\nDomain suffix: %s\n",
               user_name.c_str(),
               host_and_domain.c_str(),
               domain_suffix.c_str());
        return 0;
    }
    printf("Usage: %s <email address>\n", argv[0]);
    return 1;
}
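The program above contains a cheap email parser (the regular expression is not compliant with any RFC; it just seems to work for the two test addresses I gave it). It splits the address into user name, host and domain, and domain suffix (the domain suffix is, say, the "com" at the end of an address). Here are explanations of the interesting lines in function parse_email():

Line A: declares the pointer that receives the compiled regular expression.
Line B: re_comp() compiles the pattern. The backslashes are doubled because the pattern is a C++ string literal.
Line C: a negative return value from re_comp() is passed back to the caller as an error.
Line D: allocates as many regmatch entries as re_nsubexp() reports for the compiled expression.
Line E: re_exec() runs the compiled expression against the string and fills in the matches array.
Line F: re_free() releases the compiled expression, which is no longer needed.
Line G: a return value below 1 is treated as "no match or error"; the matches array is freed and the value is returned.
Line H: each regmatch entry holds begin and end offsets into the matched string, which are used to copy out the subexpression.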
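To see how the three groups divide an address without building the library, the same split can be reproduced with the standard library's std::regex (C++11). This is only a comparison sketch, not this library's API: the name split_email is hypothetical, and [A-Za-z] stands in for \a on the assumption that \a matches alphabetic characters.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Comparison sketch using std::regex (C++11), NOT the regexp.h API.
// Assumption: the library's \a class means "alphabetic", written here
// as [A-Za-z].
bool split_email(const std::string& to_match,
                 std::string& user_name,
                 std::string& host_and_domain,
                 std::string& domain_suffix)
{
    static const std::regex pattern("^([A-Za-z0-9]+)@(.+)\\.([A-Za-z]+)$");
    std::smatch m;
    if (!std::regex_match(to_match, m, pattern))
        return false;
    // Index 0 is the whole match; indices 1 to 3 are the three groups.
    assert(m.size() == 4);
    user_name = m[1].str();
    host_and_domain = m[2].str();
    domain_suffix = m[3].str();
    return true;
}
```

In this sketch, the greedy (.+) takes everything up to the last dot, so joe@mail.example.com splits into joe, mail.example and com.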
With the REGEXP_UNICODE preprocessor symbol defined, you get wide character versions of the re_comp() and re_exec() routines, named re_comp_w() and re_exec_w(). They work exactly like their non-"_w" versions, except that they take wide character strings instead of multibyte strings. Here's the example function rewritten for wide characters:

#define REGEXP_UNICODE
#include "regexp.h"
#include <string>
#include <vector>
int parse_email(const std::wstring& to_match,
                std::wstring& user_name,
                std::wstring& host_and_domain,
                std::wstring& domain_suffix)
{
    regexp* compiled; // line A
    int retval = re_comp_w(&compiled,
                           L"^([A-Za-z0-9]+)@(.+)\\.(\\a+)$"); // line B
    if(retval < 0)
        return retval; // line C
    regmatch* matches = new regmatch[re_nsubexp(compiled)]; // line D
    retval = re_exec_w(compiled,
                       to_match.c_str(),
                       re_nsubexp(compiled),
                       &matches[0]); // line E
    re_free(compiled); // line F
    if(retval < 1) // line G
    {
        delete[] matches;
        return retval;
    }
    user_name = std::wstring(to_match.begin() + matches[1].begin,
                             to_match.begin() + matches[1].end); // line H
    host_and_domain = std::wstring(to_match.begin() + matches[2].begin,
                                   to_match.begin() + matches[2].end);
    domain_suffix = std::wstring(to_match.begin() + matches[3].begin,
                                 to_match.begin() + matches[3].end);
    delete[] matches;
    return 1;
}
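Line H repeats the same iterator arithmetic for every subexpression, in both the plain and the wide character versions. A small helper could factor it out. The sketch below is not part of the library: submatch and match_span are hypothetical names, and match_span is only a stand-in for regmatch so that the code compiles on its own. It assumes the begin and end members are integer offsets, as the examples above use them. The range check is purely defensive, since the text does not say how an unmatched subexpression is reported.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Stand-in for the library's regmatch, used only so this sketch compiles
// on its own; the real definition lives in regexp.h.
struct match_span
{
    int begin;
    int end;
};

// Copies the characters between m.begin and m.end out of text.
// Works for std::string and std::wstring, and for any match type that
// exposes integer begin/end offsets.
template <class String, class Match>
String submatch(const String& text, const Match& m)
{
    // Defensive check (an assumption, not documented library behaviour):
    // offsets outside the string yield an empty result.
    if (m.begin < 0 || m.end < m.begin ||
        static_cast<std::size_t>(m.end) > text.size())
        return String();
    return String(text.begin() + m.begin, text.begin() + m.end);
}
```

With such a helper, line H would shrink to something like user_name = submatch(to_match, matches[1]);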
The Class Interface

As an example, and also for convenience, the demo code contains a CRegExp wrapper class (see the wtl/CRegExp.* files). Here's the class declaration:

class CRegExpException
{
public:
    CRegExpException(int nError);
    int GetError() const;
    CString GetErrorString() const;
};

class CRegExp
{
public:
    CRegExp(LPCTSTR pszPattern);
    ~CRegExp();
    BOOL Exec(const CString& pszMatch);
    BOOL IsMatched(int nSubExp = 0) const;
    int GetMatchStart(int nSubExp = 0) const;
    int GetMatchEnd(int nSubExp = 0) const;
    CString GetMatch(int nSubExp = 0) const;
    int GetNumberOfMatches() const;
};
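An instance of the CRegExp class is built from a pattern string; Exec() runs it against a string, and the remaining methods query the match and its subexpressions.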
Here is the C example (from the C interface section) reworked for wrapper use:

BOOL ParseEMail(const CString& sToMatch,
                CString& sUserName,
                CString& sHostAndDomain,
                CString& sDomainSuffix)
{
    CRegExp reEMailExpr(_T("^([A-Za-z0-9]+)@(.+)\\.(\\a+)$"));
    if(reEMailExpr.Exec(sToMatch) == FALSE)
        return FALSE;
    // the regular expression's format should ensure that all
    // three expressions match, or the expression doesn't match
    // at all.
    ATLASSERT(reEMailExpr.IsMatched(1) &&
              reEMailExpr.IsMatched(2) &&
              reEMailExpr.IsMatched(3));
    sUserName = reEMailExpr.GetMatch(1);
    sHostAndDomain = reEMailExpr.GetMatch(2);
    sDomainSuffix = reEMailExpr.GetMatch(3);
    return TRUE;
}
Standard C++ Wrapper

For those of you who do not have access to WTL, the demo code also contains a wrapper built on the standard C++ library (see the stl/stdregexp.* files). Here's the class declaration:

class regular_expression_error : public std::runtime_error
{
public:
    regular_expression_error(int error_code, regexp* re);
    int code() const;
    const char* message() const;
};

class regular_expression
{
public:
#ifdef REGEXP_UNICODE
    typedef wchar_t CharT;
    typedef std::wstring string_type;
#else
    typedef char CharT;
    typedef std::string string_type;
#endif
    // no "typename" here: the class is not a template
    typedef string_type::size_type size_type;
    typedef string_type::const_iterator const_iterator;
    regular_expression(const CharT* pattern);
    regular_expression(const string_type& pattern);
    bool exec(const CharT* match);
    bool exec(const string_type& match);
    bool matched(size_type sub_exp = 0) const;
    const_iterator begin(size_type sub_exp = 0) const;
    const_iterator end(size_type sub_exp = 0) const;
    string_type operator[](size_type sub_exp) const;
    size_type size() const;
};
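An instance of the regular_expression class is built from a pattern; exec() runs it against a string, and the remaining members report whether a subexpression matched and where. The character and string types are selected at compile time by REGEXP_UNICODE. Given that the underlying C code is not templatized with the character type, I had little choice in the matter.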
Here is the C example (which should look very familiar by now, otherwise see the C interface section) reworked in all its standard library glory:

bool parse_email(const std::string& to_match,
                 std::string& user_name,
                 std::string& host_and_domain,
                 std::string& domain_suffix)
{
    regular_expression email_expr("^([A-Za-z0-9]+)@(.+)\\.(\\a+)$");
    if(!email_expr.exec(to_match))
        return false;
    // the regular expression's format should ensure that all three
    // expressions match, or the expression doesn't match at all.
    assert(email_expr.matched(1) &&
           email_expr.matched(2) &&
           email_expr.matched(3));
    user_name = email_expr[1];
    host_and_domain = email_expr[2];
    domain_suffix = email_expr[3];
    return true;
}
How to include the library in your project

I haven't provided a project file for the library itself because you probably won't want to make a DLL out of this library. Its code footprint is small enough to link it statically. The following files serve as the "core" of the library:
The following files are optional and are only needed if you want to use the "old-style" interface and the substitution functions:
Finally, the following files are unit tests inherited from the original source files; you probably don't need them, but they are provided together with the rest of the library for consistency purposes:
Depending on which wrapper you want to use, you can also add the wtl/CRegExp.* or the stl/stdregexp.* files to your project. The demo projects show how this can be done.

Customization

Libraries such as this one tend to be used in a variety of contexts. Hence, I've tried to isolate the dependencies on runtime library routines so they can be easily overridden. By default, the library allocates memory through the runtime library. To override this, provide your own versions of:

extern "C" void* re_malloc(size_t sz);
extern "C" void re_cfree(void* p);

Those should have similar semantics to their C runtime library counterparts. In addition, when compiling with REGEXP_UNICODE, the following conversion routines are used:

extern "C" wchar_t* re_ansi_to_unicode(const char* s);
extern "C" char* re_unicode_to_ansi(const wchar_t* s);
Finally, the regular expression library calls a function to report internal errors in a more fine-grained manner than through the return value:

extern "C" void re_report(const char* error);

(Note that the error is always provided in plain chars.) You could provide, for instance:

extern "C" void re_report(const char* error)
{
    char buffer[1024]; // wsprintfA() output is limited to 1024 bytes
    ::wsprintfA(buffer, "REGEXP ERROR: %s\n", error);
    ::OutputDebugStringA(buffer);
}
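The example above is Windows-specific. On other platforms, the same hooks could be filled in with nothing but the C runtime library. The sketch below assumes that re_malloc() and re_cfree() are expected to behave like malloc() and free(), which the text only hints at. The allocation counter and the live_regexp_blocks() function are hypothetical additions for leak checking and are not part of the library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Hypothetical bookkeeping, not part of the library: counts blocks that
// were handed out by re_malloc() and not yet returned to re_cfree().
static std::size_t g_live_blocks = 0;

// Assumption: malloc()-like semantics (returns NULL on failure).
extern "C" void* re_malloc(std::size_t sz)
{
    void* p = std::malloc(sz);
    if (p != NULL)
        ++g_live_blocks;
    return p;
}

// Assumption: free()-like semantics (a NULL pointer is ignored).
extern "C" void re_cfree(void* p)
{
    if (p != NULL)
    {
        assert(g_live_blocks > 0);
        --g_live_blocks;
    }
    std::free(p);
}

// Portable error reporter: writes the message to stderr.
extern "C" void re_report(const char* error)
{
    std::fprintf(stderr, "REGEXP ERROR: %s\n", error);
}

// Lets a test check that every block was released.
std::size_t live_regexp_blocks()
{
    return g_live_blocks;
}
```

A unit test could then check that live_regexp_blocks() returns 0 once every compiled expression has been freed. The counter is not thread-safe; a multi-threaded program would need an atomic counter or a lock.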
The default implementation is in file report.c. Simply provide your own implementation and don't link with report.c if you don't like the default implementation.

The Demo Projects

The first demo program (re2demo.exe) is a simple WTL dialog application which allows you to explore different regular expressions. The top field should contain the regular expression; the middle field should contain the string to match. Once you press the "Try It" button, the matches will be placed in the bottom combo box (open the combo box to see all the submatches). If there was an error, or the string didn't match the regexp, the error will be printed as the first (and only) entry of the combo box.

Note that the demo program is not meant to be a demonstration of clean WTL style. It's mostly an example of how to integrate the regular expression engine in your own programs. (Actually, I really should confess that I only wrote it to make sure the CRegExp class works properly.)

The second demo program (try.exe) is a simple unit testing program which was provided in the original source code archive. It is provided in pre-compiled form simply as a convenience.

Future Directions

A few additional utilities, such as a string substitution routine and a global match routine, could be added with relatively little trouble. I've not done this yet in the interest of posting this code quickly.

In addition, I'm pretty sure it would be possible to support the Also, the fact that the plain char versions don't work right when

Finally, I seem to recall some modified version of this regexp engine which optimized some common cases to yield better performance. I may eventually hunt it down and apply those modifications to the version provided here.

Related Work

There is another CRegExp class which exists in the wild. The author of this class is to thank for the inspiration of my implementation; that class made me aware of the availability of Mr. Spencer's code.
However, one thing I didn't do was merge the C routines inside my own class.

Before I ended up with this specific implementation, I tried to extract Mr. Spencer's latest implementation (which is buried somewhere in the Tcl/Tk code). I managed to extract it, but was disappointed by the rather large code footprint.

I also considered PCRE; unfortunately, its Unicode support was still experimental at the time, and it is based on UTF-8 encoded strings rather than the UCS-2 wide character strings I needed. It's unfortunate because it is relatively small and it's supposed to be a very fast library. Oh, well.

Finally, there are articles on CodeProject about the same subject. They provide tutorials for different libraries. You may want to look at those for alternative solutions.

Conclusion

We have seen a short tutorial on how to use the regular expression package provided with this article. Also, we've seen how to use the two C++ classes provided as example wrappers around the package. From the comments on the regular expression syntax, it should be clear by now that this is not the most complete, nor the fastest, library available. However, it's simple, easy to understand, portable, and small. If you value any of those qualities over completeness and speed, this library will fit your needs better. I hope you'll enjoy using this as much as I enjoyed tweaking its code.

History