Click here to Skip to main content
15,892,059 members
Articles / Desktop Programming / WTL

Henry Spencer's Regexp Engine Revisited

Rate me:
Please Sign up or sign in to vote.
4.88/5 (19 votes)
2 Jul 200317 min read 177.8K   3.4K   67  
A small, Unicode-aware regular expression engine based on Henry Spencer's early work
.TH REGEXP 3 "5 Sept 1996"
.SH NAME
regcomp, regexec, regsub, regerror \- regular expression handler
.SH SYNOPSIS
.ft B
.nf
#include <regexp.h>

int re_comp_w(regexp** rpp, const wchar_t* exp)

int re_nsubexp(const regexp* rp);

int re_exec_w(const regexp* rp, const wchar_t* string, 
              size_t nmatch, regmatch pmatch[]);

int re_sub_w(const regexp* rp, const wchar_t* s, 
             const wchar_t* src, char** dest);

void re_free(void* p);
void re_error(int errcode, const regexp* re, wchar_t* buffer, 
	      size_t bufsize)
.SH DESCRIPTION
These functions implement
.IR egrep (1)-style
regular expressions and supporting facilities.
.PP
.I re_comp_w
compiles a regular expression into a structure of type
.IR regexp ,
and puts the pointer to it in
.IR rpp .
Once done with the expression, you must free it using
.IR re_free.
.PP
.I re_exec_w
matches a NUL-terminated \fIstring\fR against the compiled regular expression
in \fIrp\fR.
It returns 1 for match and 0 when there was no match.  It will attempt
to set the matching ranges into the \fIpmatch\fR array, up to
\fInmatch\fR ranges (see below).
.PP
The
.I regmatch
typedef is declared as follows:
.PP
.nf
typedef struct regmatch
{
    int begin;
    int end;
} regmatch;
.fi
.I begin
is the start offset of the match (relative to \fIstring\fR) and
.I end
is the offset of the character one past the end of the match.  If
there is no such subexpression (i.e., \fInmatch\fR was greater than
the number of subexpressions in the regexp plus one), \fIbegin\fR and
\fIend\fR will both be -1.
.PP
The first match (i.e., \fIpmatch[0]\fR) is always for the whole
expression.  The following indices are for the parenthesized
expressions, in order of appearance.
.PP
.I re_nsubexp
is used to figure out how many potential expressions are available in
the compiled expression \fIrp\fR.  This is useful when one wants to
allocate the \fIpmatch\fR array on the heap.
.PP
.I re_sub_w
executes the compiled regular expression \fIrp\fR on \fIs\fR.  It then
uses the result of the match to build a string from the specification
in \fIsrc\fR, allocates a new string, and puts the string address in
\fI*sub\fR.
Each instance of `&' in \fIsrc\fR is replaced by the whole string that
matched the compiled regular expression.
Each instance of `\e\fIn\fR', where \fIn\fR is a digit, is replaced by
the substring indicated by each subexpression.  For instance, `\e1'
will be replaced by the part of \fIs\fR that matched the first
subexpression of \fIrp\fR, and so on.
To get a literal `&' or `\e\fIn\fR' into \fI*sub\fR, prefix it with `\e';
to get a literal `\e' preceding `&' or `\e\fIn\fR', prefix it with
another `\e'.
.PP
.I re_free
should be called to free any memory allocated by the regexp routines.
For instance, the \fIregexp\fR object returned by \fIre_comp_w\fR, or
the string returned by \fIre_sub_w\fR should be freed using this
routine.
.PP
.I re_error
can be used to translate error codes returned by the regexp functions
into a text string.  The error code should be provided in
\fIerrcode\fR, and the regexp that caused it in \fIre\fR; the error
message will be placed in \fIbuffer\fR.  \fIbuffer\fR should be at
least \fIbufsize\fR characters long.
.SH "REGULAR CHARACTER VERSIONS"
Every function ending in \fI_w\fR also has a regular character
version, which takes plain \fIchar\fR instead of \fIwchar_t\fR
parameters.  Note that \fIre_exec\fR is a bit problematic, since the
matches are indices in the wide character string.  Therefore, if any
of the wide characters translates to multiple characters in a plain
char string, the offsets will be wrong.
.PP
The regular character versions are provided for convenience in writing
tests that deal mostly with the latin-1 character set; they are not
meant for production code.
.SH "REGULAR EXPRESSION SYNTAX"
A regular expression is zero or more \fIbranches\fR, separated by `|'.
It matches anything that matches one of the branches.
.PP
A branch is zero or more \fIpieces\fR, concatenated.
It matches a match for the first, followed by a match for the second, etc.
.PP
A piece is an \fIatom\fR possibly followed by `*', `+', or `?'.
An atom followed by `*' matches a sequence of 0 or more matches of the atom.
An atom followed by `+' matches a sequence of 1 or more matches of the atom.
An atom followed by `?' matches a match of the atom, or the null string.
.PP
An atom is a regular expression in parentheses (matching a match for the
regular expression), a \fIrange\fR (see below), `.'
(matching any single character), `^' (matching the null string at the
beginning of the input string), `$' (matching the null string at the
end of the input string), a `\e' followed by a single character not
part of the charater class or word boundary specification (matching
that character), or a single character with no other significance
(matching that character).
.PP
A character class can be specified with a `\e' followed by one of the
following letters:
.RS
.nf
m       alnum   d       digit   n       punct
a       alpha   g       graph   s       space
b       blank   l       lower   u       upper
c       cntrl   p       print   x       xdigit
w       word    <       SOW     >       EOW
.fi
.RE
.PP
Theses stands for the character classes defined in
.IR ctype (3).
A character class may not be used as the endpoint of a range;
otherwise, it can be used anywhere a normal character can be used.
A `\e' followed by any character not in the above table matches that
character.
.PP
A \fIrange\fR is a sequence of characters enclosed in `[]'.
It normally matches any single character from the sequence.
If the sequence begins with `^',
it matches any single character \fInot\fR from the rest of the sequence.
If two characters in the sequence are separated by `\-', this is shorthand
for the full list of ASCII characters between them
(e.g. `[0-9]' matches any decimal digit).
To include a literal `]' in the sequence, make it the first character
(following a possible `^').
To include a literal `\-', make it the first or last character.
.SH AMBIGUITY
If a regular expression could match two different parts of the input string,
it will match the one which begins earliest.
If both begin in the same place but match different lengths, or match
the same length in different ways, life gets messier, as follows.
.PP
In general, the possibilities in a list of branches are considered in
left-to-right order, the possibilities for `*', `+', and `?' are
considered longest-first, nested constructs are considered from the
outermost in, and concatenated constructs are considered leftmost-first.
The match that will be chosen is the one that uses the earliest
possibility in the first choice that has to be made.
If there is more than one choice, the next will be made in the same manner
(earliest possibility) subject to the decision on the first choice.
And so forth.
.PP
For example, `(ab|a)b*c' could match `abc' in one of two ways.
The first choice is between `ab' and `a'; since `ab' is earlier, and does
lead to a successful overall match, it is chosen.
Since the `b' is already spoken for,
the `b*' must match its last possibility\(emthe empty string\(emsince
it must respect the earlier choice.
.PP
In the particular case where the regular expression does not use `|'
and does not apply `*', `+', or `?' to parenthesized subexpressions,
the net effect is that the longest possible
match will be chosen.
So `ab*', presented with `xabbbby', will match `abbbb'.
Note that if `ab*' is tried against `xabyabbbz', it
will match `ab' just after `x', due to the begins-earliest rule.
(In effect, the decision on where to start the match is the first choice
to be made, hence subsequent choices must respect it even if this leads them
to less-preferred alternatives.)
.SH SEE ALSO
egrep(1), expr(1)
.SH RETURN VALUE
All functions (except \fIre_error\fR and \fIre_free\fR) return an
integer.  That integer is negative if an error was detected.  The
following error codes may be returned:
.PP
.nf
.ta \w'REG_EESCAPE'u+3n
REGEXP_BADARG	Bad argument
REGEXP_ESIZE	Regexp was too big
REGEXP_ESPACE	Out of memory while processing
REGEXP_EPAREN	Unmatched `)'
REGEXP_ERANGE	Invalid `[]' range
REGEXP_EBRACK	Unmatched `]'
REGEXP_BADRPT	Invalid use of repeat character
REGEXP_EESCAPE	Trailing backslash
REGEXP_EEND	Unspecified internal error
.fi
.PP
In addition, \fIre_exec_w\fR returns 0 if the regexp was not matched,
and 1 if the regexp was matched successfully.  Other functions return
0 on success.
.SH HISTORY
This is a revised version.  Both code and manual page were originally
written by Henry Spencer at University of Toronto.  They are intended
to be compatible with the Bell V8 \fIregexp\fR(3), but are not derived
from Bell code.
.PP
It was further revised by Benoit Goudreault-Emond of Silanis
Technology to give it a slightly extended syntax.  This ain't the
POSIX regular expressions yet, but it provides much more convenience
than the basic unmodified implementation, in a much smaller footprint
than the POSIX implementation (about 20KB versus 55KB!)
.SH BUGS
Empty branches and empty regular expressions are not portable
to other, otherwise-similar, implementations.
.PP
The ban on applying `*' or `+' to a possibly-null operand is an
artifact of the simplistic implementation.
.PP
The match-choice rules are complex.  A simple ``longest match'' rule
would be preferable, but is harder to implement.
.PP
Although there is a general similarity to POSIX.2 ``extended'' regular
expressions, neither the regular-expression syntax nor the programming
interface is an exact match.
.PP
Due to emphasis on compactness and simplicity, it's not strikingly
fast.  It does give some attention to handling simple cases quickly.

By viewing downloads associated with this article you agree to the Terms of Service and the article's licence.

If a file you wish to view isn't highlighted, and is a text file (not binary), please let us know and we'll add colourisation support for it.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here


Written By
Web Developer
Canada Canada
I'm a senior software developer, working at Silanis Technology (http://www.silanis.com). I've acquired quite a bit of experience (usually the hard way!) in Win32 and raw COM programming on the job. In my spare time, I like to monkey around with POSIX code.

I'm mostly interested in portable C++ libraries. I'm happiest when I develop portable C++ code--C++ being such a powerful language as long as one keeps clear of the rather nasty subtleties of the language.

I hope the articles I contribute will be of some help to someone. If even one person gains a few hours through use of that code, I'll be very happy.

When not coding, I like to listen to Anime and try to learn Japanese. It's not working too well so far, unfortunately. :{)

Comments and Discussions