License: The Code Project Open License (CPOL)
An Introduction to Regular Expressions
By Uwe Keim
This document describes the theory behind regular expressions (RE) as well as their practical usage.
Regular expressions are a way to search for substrings ("matches") in strings. This is done by searching with "patterns" through the string.
You probably know the '*' and '?' characters used in the dir command on the
DOS command line. The '*' character means "zero or more arbitrary
characters" and the '?' means "one arbitrary character".
When using a pattern like "text?.*", it will find files like
- textf.txt
- text1.asp
- text9.html

But it will not find files like
- text.txt
- text.asp
- text.html

This is the basic idea behind REs. While '*' and '?' are a very
limited means of pattern matching, REs offer a much broader spectrum for describing
patterns. (Note that inside an RE, '*' and '?' have a different meaning than on the
command line; see the quantifiers described later in this document.)
Example usages could be:
Basically you can do the following operations on a string with REs:
REs are one of the foundations of the Perl programming language and are therefore built into the language itself. Many other languages can use REs through third-party libraries or add-ons.
Following are some other languages for which RE libraries exist:
Although they differ slightly in use (because of the design of each language), all are quite similar to Perl's implementation of REs. I therefore use Perl code snippets in this document for the examples.
The RE syntax is not completely standardized. POSIX defines a syntax for REs (in a basic and an extended variant), but Perl's RE implementation is much more flexible than POSIX's, so a library that is as Perl-compatible as possible is normally what you want.
The syntax can also differ between languages and libraries. For example, one library may implement only a subset of the POSIX RE syntax, while another implements nearly all of the Perl RE syntax.
As stated, all examples are in Perl. Here is a quick overview of the most common ways to execute a regular expression in Perl.
expression =~ m/pattern/[switches]
Searches the string "expression" for the occurrence(s) of a substring that matches
'pattern' and returns the recognized subexpressions ($1, $2,
$3, ...). "m" stands for "match".
For example:
$test = "this is just one test"; $test =~ m/(o.e)/
would store "one" in $1.
expression =~ s/pattern/new text/[switches]
Searches the string "expression" for the occurrence(s) of a substring that matches
'pattern' and replaces the found substrings with "new text". "s" stands for "substitute".
For example:
$test = "this is just one test"; $test =~ s/one/my/
would replace "one" with "my", resulting in the string "this
is just my test", stored in $test.
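The two operators described above can be tried out with a small script. This is only a minimal sketch: the variable names $captured and $replaced are chosen for this illustration and are not part of the original examples.

```perl
use strict;
use warnings;

# Match: capture the first substring consisting of "o", any character, "e".
my $test = "this is just one test";
my $captured = "";
if ($test =~ m/(o.e)/) {
    $captured = $1;    # $1 holds the text matched by the first pair of parentheses
}
print "captured: $captured\n";

# Substitute: work on a copy so that $test itself stays unchanged.
my $replaced = $test;
$replaced =~ s/one/my/;    # replaces the first occurrence of "one" with "my"
print "replaced: $replaced\n";
```

Checking the result of the match before reading $1 is a common precaution, since $1 is only meaningful after a successful match.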
This chapter is not trying to be a reference of all characters that can be used inside a RE pattern. There are other documents that do this quite well. Instead the basic meta characters are shown and explained.
Meta characters that you want to use literally must be
escaped with a backslash, just as in C++ strings. E.g. to use the square
bracket [ literally, write \[. (Remember that this applies
to Perl and can be different in other languages.)
Following are the most important meta characters, taken from the chapter "Regular Expression Syntax" on MSDN:
| Character | Description |
|---|---|
| \ | Marks the next character as either a special character, a literal, a backreference, or an octal escape. For example, 'n' matches the character "n". '\n' matches a newline character. The sequence '\\' matches "\" and "\(" matches "(". |
| . | Matches any single character except "\n". To match any character including '\n', use a pattern such as '[\s\S]' (inside square brackets a '.' is just a literal dot) or, in Perl, the 's' switch. |
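The behaviour of '.' with respect to newlines, and of the backslash as an escape, can be checked with a short script. This is a sketch that assumes Perl semantics, where a dot inside square brackets is a literal dot; the test strings and variable names are made up for the illustration.

```perl
use strict;
use warnings;

my $text = "a\nb";

# By default '.' does not match a newline, so "a.b" does not match here.
my $default_match = ($text =~ m/a.b/) ? 1 : 0;

# With the 's' switch, '.' also matches "\n".
my $single_line_match = ($text =~ m/a.b/s) ? 1 : 0;

# A class such as [\s\S] (whitespace or non-whitespace) matches any character.
my $class_match = ($text =~ m/a[\s\S]b/) ? 1 : 0;

# A backslash makes a meta character literal: '\.' matches only a real dot.
my $literal_dot = ("file.txt" =~ m/file\.txt/) ? 1 : 0;
my $not_a_dot   = ("fileXtxt" =~ m/file\.txt/) ? 1 : 0;

print "$default_match $single_line_match $class_match $literal_dot $not_a_dot\n";
```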
A character class is a group of one or more characters, written
in square brackets '[...]'. E.g. the construct "B[iu]rma" matches
"Birma"
or "Burma", i.e. a "B" followed by either an
"i" or a "u",
followed by "rma".
In other words, a character class means "match any single character of that class".
There is also the opposite of a character class, the negated character
class, which means "match any single character that is not in
the class". E.g. '[^1-6]' matches any single character except the digits "1" to "6".
See more examples at "Character Matching" on MSDN.
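A character class and its negated counterpart can be tried out as follows. The word list (including "Barma") and the digit string are hypothetical inputs chosen only for this sketch.

```perl
use strict;
use warnings;

# Hypothetical word list; only the first two contain "i" or "u" after the "B".
my @words = ("Birma", "Burma", "Barma");
my @class_hits = grep { m/B[iu]rma/ } @words;
print "class hits: @class_hits\n";

# Negated class: collect every single character that is NOT a digit from 1 to 6.
my $digits = "0123456789";
my $remaining = join "", $digits =~ m/[^1-6]/g;
print "remaining: $remaining\n";
```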
If you don't know exactly how many characters to expect, you can use
quantifiers to specify the number of times a character can occur. E.g. "Hel+o"
means "He" followed by one or more "l"s, followed by an "o".
More quantifiers, from the chapter "Quantifiers" on MSDN:
| Character | Description |
|---|---|
| * | Matches the preceding subexpression zero or more times. For example, 'zo*' matches "z" and "zoo". '*' is equivalent to '{0,}'. |
| + | Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". '+' is equivalent to '{1,}'. |
| ? | Matches the preceding subexpression zero or one time. For example, 'do(es)?' matches the "do" in "do" or "does". '?' is equivalent to '{0,1}'. |
| {n} | n is a nonnegative integer. Matches exactly n times. For example, 'o{2}' does not match the "o" in "Bob", but matches the two "o"s in "food". |
| {n,} | n is a nonnegative integer. Matches at least n times. For example, 'o{2,}' does not match the "o" in "Bob" and matches all the "o"s in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'. |
| {n,m} | m and n are nonnegative integers, where n <= m. Matches at least n and at most m times. For example, 'o{1,3}' matches the first three "o"s in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that you cannot put a space between the comma and the numbers. |
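The quantifier examples from the table can be verified with a small helper. This is a sketch only: first_match is a hypothetical helper written for this illustration, not a Perl built-in.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first substring of $text matched by
# $pattern, or undef if there is no match at all.
sub first_match {
    my ($pattern, $text) = @_;
    return $text =~ m/($pattern)/ ? $1 : undef;
}

# Pattern / input pairs taken from the table above.
my @cases = (
    ['zo*',     "z"],
    ['zo+',     "z"],
    ['zo+',     "zoo"],
    ['do(es)?', "does"],
    ['o{2}',    "Bob"],
    ['o{2}',    "food"],
    ['o{2,}',   "foooood"],
    ['o{1,3}',  "fooooood"],
);

my %result;
for my $case (@cases) {
    my ($pattern, $text) = @$case;
    my $match = first_match($pattern, $text);
    $result{"$pattern on $text"} = $match;
    printf "%-8s on %-9s -> %s\n", $pattern, $text,
        defined $match ? "\"$match\"" : "no match";
}
```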
An important fact about quantifiers is that '*' and '+' are "greedy", i.e.
they match as much as they can, not as little. E.g.
$test = "hello out there, how are you"; $test =~ m/h.*o/
means "find an 'h', followed by any number of arbitrary characters, followed by
an 'o'". The author probably expected it to match "hello", but it actually
matches "hello out there, how are yo", since the RE is greedy and searches up to
the last "o", which is the "o" in
"you".
You can explicitly make a quantifier "ungreedy" by
appending a '?'. E.g.
$test = "hello out there, how are you"; $test =~ m/h.*?o/
would actually find "hello", as intended, since it now means "find an
'h', followed by as few arbitrary characters as possible, followed by
the first occurrence of 'o'".
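The difference between the greedy and the ungreedy form can be made visible by capturing what each pattern matched. A minimal sketch; the parentheses and the variable names $greedy and $lazy are additions for this illustration.

```perl
use strict;
use warnings;

my $test = "hello out there, how are you";

# Greedy: '.*' grabs as much as possible, so the match ends at the last "o".
my ($greedy) = $test =~ m/(h.*o)/;

# Ungreedy: '.*?' grabs as little as possible, so the match ends at the first "o".
my ($lazy) = $test =~ m/(h.*?o)/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```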
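To check for the beginning or the end of a line (or string), use the meta
characters ^ and $. E.g. "^thing"
matches a line starting with "thing", and "thing$"
matches a line ending with "thing". (In Perl, they refer to the beginning and end of the whole string by default; with the 'm' switch they also match at the start and end of each line.)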
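The anchors can be checked with a few sample strings. This sketch assumes Perl's default behaviour, where ^ and $ refer to the whole string unless the 'm' switch is given; the sample strings are made up for the illustration.

```perl
use strict;
use warnings;

# "^thing" only matches when the string starts with "thing".
my $starts     = ("thing one" =~ m/^thing/) ? 1 : 0;
my $not_starts = ("something" =~ m/^thing/) ? 1 : 0;

# "thing$" only matches when the string ends with "thing".
my $ends = ("something" =~ m/thing$/) ? 1 : 0;

# In a multi-line string, ^ matches after a newline only with the 'm' switch.
my $text      = "first line\nthing two";
my $without_m = ($text =~ m/^thing/)  ? 1 : 0;
my $with_m    = ($text =~ m/^thing/m) ? 1 : 0;

print "$starts $not_starts $ends $without_m $with_m\n";
```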
The meta characters '\b' and '\B' are used for testing word boundaries and
non-word boundaries. E.g.
$test =~ m/out/
would match not only the "out" in "speak out loud" but also the
"out" in "please
don't shout at me". To avoid this, you can precede the pattern with
a word boundary anchor:
$test =~ m/\bout/
Now it only finds "out" if it starts at a word boundary, not inside a word.
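The effect of the word boundary anchor can be checked with the two sentences from the text. The third input, "go outside", is an additional hypothetical example showing that a trailing '\b' is needed to match "out" only as a whole word.

```perl
use strict;
use warnings;

my $loud  = "speak out loud";
my $shout = "please don't shout at me";

# Without an anchor, "out" is also found inside "shout".
my $plain_in_shout = ($shout =~ m/out/) ? 1 : 0;

# With a leading \b, "out" must start at a word boundary.
my $anchored_in_loud  = ($loud  =~ m/\bout/) ? 1 : 0;
my $anchored_in_shout = ($shout =~ m/\bout/) ? 1 : 0;

# A leading \b alone still matches the start of a longer word.
my $prefix_match = ("go outside" =~ m/\bout/)   ? 1 : 0;
my $whole_word   = ("go outside" =~ m/\bout\b/) ? 1 : 0;

print "$plain_in_shout $anchored_in_loud $anchored_in_shout $prefix_match $whole_word\n";
```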
Alternation uses the '|' character to allow a choice between two or
more alternatives. Together with parentheses, '(...|...|...)', it
lets you group the alternatives.
Parentheses themselves are used for "capturing" substrings for later
use; these are stored in the Perl built-in variables $1, $2, ..., $9. (See Backreferences,
below).
E.g.
$test = "I like apples a lot"; $test =~ m/like (apples|pines|bananas)/
will match, since "apples" is one of the three alternatives to match and
therefore "like apples" is found. The parentheses also "capture"
"apples"
as a backreference in $1.
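The following sketch shows the capture from the example above and why the grouping parentheses matter. The second input, "I hate bananas", is a hypothetical sentence added for the comparison.

```perl
use strict;
use warnings;

my $test  = "I like apples a lot";
my $fruit = "";
if ($test =~ m/like (apples|pines|bananas)/) {
    $fruit = $1;    # the alternative that actually matched
}
print "fruit: $fruit\n";

# Without parentheses the alternation spans the whole pattern, i.e.
# "like apples" OR "pines" OR "bananas", so this sentence matches too.
my $other     = "I hate bananas";
my $ungrouped = ($other =~ m/like apples|pines|bananas/)   ? 1 : 0;
my $grouped   = ($other =~ m/like (apples|pines|bananas)/) ? 1 : 0;
print "ungrouped: $ungrouped, grouped: $grouped\n";
```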
One of the most important features of REs is the ability to store
("capture") a part of
the matched substring for later reuse. This is done by placing the substring in
parentheses (...). The captured parts are stored in the Perl built-in variables $1, $2, ..., $9.
If you don't want to capture a substring but need parentheses to group the
substring, use the '?:' operator to avoid capturing.
E.g.
$test = "Today is monday the 18th."; $test =~ m/([0-9]+)th/
will store "18" in $1, whereas
$test = "Today is monday the 18th."; $test =~ m/[0-9]+th/
will store nothing in $1, since the parentheses are not present.
$test = "Today is monday the 18th."; $test =~ m/(?:[0-9]+)th/
will store nothing in $1 either, since the parentheses are used
with the '?:' operator. Another example, showing the direct use in a
replace operation:
$test = "Today is monday the 18th."; $test =~ s/ the ([0-9]+)th/, and the day is $1/
will result in $test being "Today is monday, and the day is 18.".
You can also refer back, inside the pattern itself, to previously found substrings by using \1, \2, ..., \9.
E.g. the following RE will remove duplicate words:
$test = "the house is is big"; $test =~ s/\b(\S+)\b(\s+\1\b)+/$1/
will result in $test being "the house is big".
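The capturing examples above can be run together in one script. A minimal sketch: the variable names are chosen for this illustration, and each substitution works on a copy so that the original string stays available.

```perl
use strict;
use warnings;

my $test = "Today is monday the 18th.";

# Capturing parentheses: the digits end up in $1.
my $with_group = "";
if ($test =~ m/([0-9]+)th/) {
    $with_group = $1;
}

# Non-capturing parentheses (?:...): the match succeeds, but $1 stays undefined.
my $without_group = "";
if ($test =~ m/(?:[0-9]+)th/) {
    $without_group = defined $1 ? $1 : "undefined";
}

# $1 used in the replacement text of a substitution.
(my $rewritten = $test) =~ s/ the ([0-9]+)th/, and the day is $1/;

# \1 used inside the pattern to find a repeated word.
(my $deduplicated = "the house is is big") =~ s/\b(\S+)\b(\s+\1\b)+/$1/;

print "$with_group\n$without_group\n$rewritten\n$deduplicated\n";
```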
Sometimes it is necessary to say "match this, but only if it is not preceded
by that" or "match this, but only if it is not followed by
that". When just single characters are concerned, you can use the negated
character class [^...].
But when it comes to more than a single character, you need to use the so-called lookahead condition or lookbehind condition. There are four possible types:
(?=re) matches only if followed by 're'.
(?!re) matches only if not followed by 're'.
(?<=re) matches only if preceded by 're'.
(?<!re) matches only if not preceded by 're'.
Examples:
$test = "HTML is a document description-language and not a programming-language"; $test =~ m/(?<=description-)language/
will match the first "language" ("description-language"), since it is preceded by
"description-", whereas
$test = "HTML is a document description-language and not a programming-language"; $test =~ m/(?<!description-)language/
will match the second "language" ("programming-language"), since it is not preceded
by "description-".
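To see which occurrence each condition selects, the matched word can be replaced by an upper-case marker. This is only a sketch: the replacement text "LANGUAGE", the lookahead example, and the variable names are additions made for the illustration.

```perl
use strict;
use warnings;

my $test = "HTML is a document description-language and not a programming-language";

# Positive lookbehind: only the "language" preceded by "description-" is replaced.
(my $behind = $test) =~ s/(?<=description-)language/LANGUAGE/;

# Negative lookbehind: the first "language" NOT preceded by "description-" is replaced.
(my $not_behind = $test) =~ s/(?<!description-)language/LANGUAGE/;

# Positive lookahead: collect the words that are followed by "-language".
# The lookahead itself is not part of the match and is not consumed.
my @kinds = $test =~ m/(\w+)(?=-language)/g;

print "$behind\n$not_behind\n@kinds\n";
```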
Here are some more real-world examples from the last chapter of the RE section of [3]. These more advanced REs can be used as a starting point for your own REs, or simply as examples to study in more detail.
Swap the first two words:
s/(\S+)(\s+)(\S+)/$3$2$1/
Find name=value pairs:
m/(\w+)\s*=\s*(.*?)\s*$/
Now name is in $1, value is in $2.
Read a date in the form YYYY-MM-DD:
m/(\d{4})-(\d\d)-(\d\d)/
Now YYYY is in $1, MM is in $2, DD
is in $3.
Remove the leading path from a filename:
s/^.*\///
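The four patterns above can be applied to sample data as follows. All input strings here are hypothetical and were chosen only to illustrate the patterns.

```perl
use strict;
use warnings;

# Swap the first two words (the whitespace between them is kept).
(my $swapped = "hello world again") =~ s/(\S+)(\s+)(\S+)/$3$2$1/;

# Find a name=value pair; trailing whitespace is not part of the value.
my ($name, $value) = "color = dark blue  " =~ m/(\w+)\s*=\s*(.*?)\s*$/;

# Read a date in the form YYYY-MM-DD.
my ($year, $month, $day) = "Released on 2001-01-25" =~ m/(\d{4})-(\d\d)-(\d\d)/;

# Remove the leading path from a filename (greedy '.*' runs up to the last "/").
(my $file = "/usr/local/bin/perl") =~ s/^.*\///;

print "$swapped\n$name=$value\n$year/$month/$day\n$file\n";
```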
This document tried to give you a brief introduction to what REs are and where and how to use them.
Although it is straightforward to get started with REs, there are quite a lot of traps and errors you will probably meet in "real life". It is highly recommended to refer to additional literature and examples to understand and use the full power of REs. In particular, [4] is a very valuable (if somewhat fastidious) resource you should read.
Topics that were not covered in this document include:
For these and many others, please take a look at the resources below.
Last Updated: 25 Jan 2001. Editor: Chris Maunder
Copyright 2001 by Uwe Keim. Everything else Copyright © CodeProject, 1999-2009.