An Introduction to Regular Expressions

Uwe Keim

Jan 26, 2001

CPOL

This document describes the theory behind regular expressions (RE) as well as their practical usage.

Table of Contents

What are Regular Expressions?
Why would you use Regular Expressions
Where to use Regular Expressions
How to use Regular Expressions from Perl
Regular Expression's Syntax Basics
More Examples
Summary
Literature Resources

What are Regular Expressions?

Regular expressions are a way to search for substrings ("matches") in strings. This is done by searching with "patterns" through the string.

You probably know the '*' and '?' characters used in the dir command on the DOS command line. The '*' character means "zero or more arbitrary characters" and the '?' means "one arbitrary character".

When using a pattern like "text?.*", it will find files like

textf.txt
text1.asp
text9.html

But it will not find files like

text.txt
text.asp
text.html

This is exactly the way REs work. While '*' and '?' are a very limited way of pattern matching, REs offer a much broader spectrum for describing patterns.
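
The parallel between wildcards and REs can be sketched in Perl, the language used for the examples later in this document. The RE below is a hand-made translation of "text?.*" and is meant only as an illustration; the variable names are made up for this sketch.

```perl
use strict;
use warnings;

# Illustrative translation of the wildcard pattern "text?.*" into an RE:
#   ?  becomes  .   (exactly one arbitrary character)
#   .  becomes  \.  (a literal dot)
#   *  becomes  .*  (zero or more arbitrary characters)
my $pattern = qr/^text.\..*$/;

my @files = qw(textf.txt text1.asp text9.html text.txt text.asp text.html);

# Keep only the file names that the pattern matches.
my @found = grep { $_ =~ $pattern } @files;

print "$_\n" for @found;
```

Only the first three names are printed, matching the two lists above.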

Why would you use Regular Expressions

Example usages could be:

Remove all occurrences of a specific tag from an HTML file
Check whether an e-mail address is well-formed

Standard Regular Expression Operations

Basically you can do the following operations on a string with REs:

Test for a pattern
I.e. search through a string and check whether a pattern matches a substring, returning true or false.
Extract a substring
I.e. search for a substring and return that substring.
Replace a substring
I.e. search for a substring that matches a pattern and replace it by another string.
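
The three operations can be sketched in Perl, whose syntax is explained later in this document. The sample string and the variable names are assumptions made up for this illustration.

```perl
use strict;
use warnings;

my $text = "Contact: someone\@example.com";

# 1. Test for a pattern: the match yields true or false.
my $has_at = ($text =~ m/\@/) ? 1 : 0;

# 2. Extract a substring: the parentheses capture the part after the "@".
my ($domain) = $text =~ m/\@(\S+)/;

# 3. Replace a substring: s/// changes the copy stored in $masked.
(my $masked = $text) =~ s/\@\S+/\@hidden/;

print "$has_at\n$domain\n$masked\n";
```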

Where to use Regular Expressions

REs are one of the foundations of the Perl programming language and are therefore built into the language itself. Many other languages can use REs through third-party libraries or add-ons.

Following are some other languages for which RE libraries exist:

VBScript (Version 5.x and above) through the RegExp object.
JScript (Version 5.x and above) through the RegExp object, too.
C++ through the Regex++ library and the PCRE (Perl Compatible Regular Expressions) library.
Java with ORO (Perl 5 compatible) from the Apache team, RegExp, Rex or gnu.regexp.
Microsoft's .NET Framework (including C#) has built-in RE support through the System.Text.RegularExpressions namespace.
PHP with its built-in Perl-compatible RE functions or POSIX extended RE functions.

Although they are used slightly differently (because of the design of each language), all are quite similar to Perl's implementation of REs. I therefore use Perl code snippets for the examples in this document.

The RE syntax is not completely standardized. POSIX defines a basic and an extended RE syntax, but Perl's RE implementation is much more flexible than POSIX's, so a library that is as Perl-compatible as possible is normally what you want.

The syntax can therefore differ between languages. E.g. one library may implement only a subset of the POSIX RE syntax, while another implements nearly all of the Perl RE syntax.

How to use Regular Expressions from Perl

As stated, all examples are in Perl. Here is a quick overview of the most common ways to execute a regular expression in Perl.

Search a string for a pattern

expression =~ m/pattern/[switches]

Searches the string expression for the occurrence(s) of a substring that matches 'pattern' and returns the recognized subexpressions ($1, $2, $3, ...). "m" stands for "match".

For example

$test = "this is just one test";
$test =~ m/(o.e)/

Would return "one" in $1.
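
A slightly fuller sketch of the same match, showing two common ways to get at the captured text. The second pattern and the variable names are made up for this illustration.

```perl
use strict;
use warnings;

my $test = "this is just one test";

# In a condition, the match is true when the pattern was found;
# $1 then holds what the first pair of parentheses captured.
my $first = '';
if ($test =~ m/(o.e)/) {
    $first = $1;
}

# In list context, the match returns the captured substrings directly.
my ($word, $next) = $test =~ m/(j\w+) (\w+)/;

print "$first\n$word $next\n";
```

Checking the result of the match before reading $1 avoids picking up a stale value from an earlier successful match.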

Replace a substring

expression =~ s/pattern/new text/[switches]

Searches the string "expression" for the occurrence(s) of a substring that matches 'pattern' and replaces the found substrings with "new text". "s" stands for "substitute".

For example

$test = "this is just one test";
$test =~ s/one/my/

Would replace "one" with "my", resulting in the string "this is just my test", stored in $test.

Regular Expression's Syntax Basics

This chapter is not meant to be a reference for all characters that can be used inside an RE pattern. Other documents do this quite well. Instead, the basic meta characters are shown and explained.

Meta characters that you want to use literally must be escaped with a backslash, just as in C++ strings. E.g. to use the square bracket [ literally, write \[. (Remember that this applies to Perl and can be different in other languages.)
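
A minimal sketch of escaping in practice. The sample strings and variable names are assumptions for this illustration; quotemeta() is a Perl built-in that escapes meta characters in a string for you.

```perl
use strict;
use warnings;

my $test = "array[0] = 42";

# "\[" and "\]" match literal square brackets; unescaped, they would
# start and end a character class.
my ($index) = $test =~ m/\[([0-9]+)\]/;

# quotemeta() escapes the dot, so the result matches "a.b" literally.
my $literal = quotemeta("a.b");
my $dot_is_literal = ("a.b" =~ m/^$literal$/) ? 1 : 0;
my $dot_is_any     = ("axb" =~ m/^$literal$/) ? 1 : 0;

print "$index $dot_is_literal $dot_is_any\n";
```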

Important Meta Characters

Following are the most important meta characters, taken from the chapter "Regular Expression Syntax" on MSDN:

Character	Description
`\`	Marks the next character as either a special character, a literal, a backreference, or an octal escape. For example, '`n`' matches the character "`n`". '`\n`' matches a newline character. The sequence '`\\`' matches "`\`" and "`\(`" matches "`(`".
`.`	Matches any single character except "`\n`". To match any character including the '`\n`', use a pattern such as '`[\s\S]`'. (Inside a character class the dot is literal, so '`[.\n]`' matches only a dot or a newline.)
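
How the dot treats newlines can be checked with a short sketch. The test string and variable names are made up for this illustration.

```perl
use strict;
use warnings;

my $test = "first\nsecond";

# '.' does not match the newline, so this pattern cannot span both lines.
my $dot_spans = ($test =~ m/first.second/) ? 1 : 0;

# The class [\s\S] ("whitespace or non-whitespace") matches any character.
my $class_spans = ($test =~ m/first[\s\S]second/) ? 1 : 0;

# Inside a character class the dot is literal, so [.] matches only ".".
my $dot_in_class = ("a.c" =~ m/a[.]c/ && "abc" !~ m/a[.]c/) ? 1 : 0;

print "$dot_spans $class_spans $dot_in_class\n";
```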

Character Classes

A character class is a group of one or multiple characters. These are written in square brackets '[...]'. E.g. the construct "B[iu]rma" matches "Birma" or "Burma", i.e. a "B" followed by either an "i" or an "u", followed by "rma".

In other words a character class means "match any single character of that class".

There is also the opposite of a character class, the negated character class, which means "match any single character that is not in the class". E.g. '[^1-6]' matches any single character except the digits "1" to "6".

See more examples at "Character Matching" on MSDN.
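
Both kinds of character class can be tried out with a small sketch. The word list, the sample string, and the variable names are assumptions for this illustration; the g switch used in the substitution repeats it for every match.

```perl
use strict;
use warnings;

# [iu] matches exactly one character: either "i" or "u".
my @names   = qw(Birma Burma Barma);
my @matched = grep { m/B[iu]rma/ } @names;

# [^1-6] matches one character that is NOT a digit from 1 to 6.
# Removing every such character leaves only the digits 1 to 6.
(my $digits = "a1b7c3-9") =~ s/[^1-6]//g;

print "@matched\n$digits\n";
```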

Quantifiers

If you don't know exactly how many characters are coming, you can use quantifiers to specify the number of times a character can occur. E.g. you can say "Hel+o" which means "He" followed by one or multiple "l"'s followed by an "o".

More quantifiers, as listed in the chapter "Quantifiers" on MSDN, are:

Character	Description
`*`	Matches the preceding subexpression zero or more times. For example, '`zo*`' matches "`z`" and "`zoo`". '`*`' is equivalent to '`{0,}`'.
`+`	Matches the preceding subexpression one or more times. For example, "`zo+`" matches "`zo`" and "`zoo`", but not "`z`". '`+`' is equivalent to '`{1,}`'.
`?`	Matches the preceding subexpression zero or one time. For example, '`do(es)?`' matches the "`do`" in "`do`" or "`does`". '`?`' is equivalent to '`{0,1}`'.
`{n}`	`n` is a nonnegative integer. Matches exactly `n` times. For example, '`o{2}`' does not match the "`o`" in "`Bob`", but matches the two "`o`"'s in "`food`".
`{n,}`	`n` is a nonnegative integer. Matches at least `n` times. For example, '`o{2,}`' does not match the "`o`" in "`Bob`" and matches all the "`o`"'s in "`foooood`". '`o{1,}`' is equivalent to '`o+`'. '`o{0,}`' is equivalent to '`o*`'.
`{n,m}`	`m` and `n` are nonnegative integers, where `n` <= `m`. Matches at least `n` and at most `m` times. For example, '`o{1,3}`' matches the first three "`o`"'s in "`fooooood`". '`o{0,1}`' is equivalent to '`o?`'. Note that you cannot put a space between the comma and the numbers.
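
The table rows can be checked with a small sketch. The helper matched_text is a hypothetical name introduced here; it returns the text a pattern matched (taken from Perl's $& variable) or undef when there is no match.

```perl
use strict;
use warnings;

# Hypothetical helper: return the matched text, or undef if no match.
sub matched_text {
    my ($pattern, $input) = @_;
    return $input =~ /$pattern/ ? $& : undef;
}

# Pattern and input pairs taken from the examples in the table.
my @cases = (
    [ 'zo*',     'z'        ],
    [ 'zo+',     'z'        ],
    [ 'zo+',     'zoo'      ],
    [ 'do(es)?', 'does'     ],
    [ 'o{2}',    'Bob'      ],
    [ 'o{2}',    'food'     ],
    [ 'o{1,3}',  'fooooood' ],
);

for my $case (@cases) {
    my ($pattern, $input) = @$case;
    my $text = matched_text($pattern, $input);
    printf "%-8s on %-9s: %s\n", $pattern, $input,
        defined $text ? "matched '$text'" : "no match";
}
```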

Greedy

An important fact about quantifiers is that '*' and '+' are "greedy", i.e. they match as much as they can, not as little. E.g.

$test = "hello out there, how are you";
$test =~ m/h.*o/

means "find an 'h', followed by zero or more arbitrary characters, followed by an 'o'". You might expect it to match "hello", but it actually matches "hello out there, how are yo", since the RE is greedy and searches until the last "o", which is the "o" in "you".

You can explicitly say that a quantifier should be "ungreedy" by appending a '?'. E.g.

$test = "hello out there, how are you";
$test =~ m/h.*?o/

This actually finds "hello", as intended, since it now means "find an 'h', followed by as few arbitrary characters as possible, followed by the first occurrence of 'o'".
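
The difference can be made visible by capturing what each pattern matched. This is a minimal sketch; the added parentheses and the variable names are not part of the original snippets.

```perl
use strict;
use warnings;

my $test = "hello out there, how are you";

# Greedy: ".*" first grabs as much as possible, then gives characters
# back until an "o" can follow.
my ($greedy) = $test =~ m/(h.*o)/;

# Non-greedy: ".*?" takes as little as possible.
my ($lazy) = $test =~ m/(h.*?o)/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```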

Anchors

Line Beginnings and Line Ends

To check for the beginning or the end of a line (or string), use the meta characters ^ and $. E.g. "^thing" matches a line starting with "thing", and "thing$" matches a line ending with "thing".
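
A short sketch of both anchors applied to a few sample lines. The lines and variable names are made up for this illustration.

```perl
use strict;
use warnings;

my @lines = ("thing one", "one thing", "nothing much");

# ^thing matches only where "thing" starts the line.
my @starts = grep { m/^thing/ } @lines;

# thing$ matches only where "thing" ends the line.
my @ends = grep { m/thing$/ } @lines;

print "starts with: @starts\n";
print "ends with:   @ends\n";
```

The third line contains "thing" in the middle of the string, so neither anchored pattern selects it.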

Word Boundaries

The meta characters '\b' and '\B' are used for testing word boundaries and non-word boundaries. E.g.

$test =~ m/out/

would match not only the "out" in "speak out loud" but also the "out" in "please don't shout at me". To avoid this, you can precede the pattern with a word boundary anchor:

$test =~ m/\bout/

Now, it only finds "out" if it starts at a word boundary, not inside a word.
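
The effect of '\b' and its counterpart '\B' can be sketched with the two sentences from the text. The variable names are assumptions for this illustration.

```perl
use strict;
use warnings;

my @tests = ("speak out loud", "please don't shout at me");

# Plain "out" is found in both sentences.
my @plain = grep { m/out/ } @tests;

# \bout requires a word boundary directly before the "o".
my @at_boundary = grep { m/\bout/ } @tests;

# \Bout requires that there is NO word boundary before the "o",
# i.e. the "out" must start inside a word.
my @inside_word = grep { m/\Bout/ } @tests;

print scalar(@plain), "\n";
print "@at_boundary\n";
print "@inside_word\n";
```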

Alternation and Grouping

Alternation uses the '|' character to allow a choice between two or more alternatives. Together with parentheses, '(...|...|...)', it allows you to group the alternatives.

Parentheses themselves are used for "capturing" substrings for later use; the captures are stored in the Perl built-in variables $1, $2, ..., $9. (See Backreferences, below.)

E.g.

$test = "I like apples a lot";
$test =~ m/like (apples|pines|bananas)/

This will match, since "apples" is one of the three alternatives to match and therefore "like apples" is found. The parentheses also "capture" "apples" as a backreference in $1.
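
A short sketch of why the grouping matters. The second test string and the variable names are made up for this illustration.

```perl
use strict;
use warnings;

my $test = "I like apples a lot";

# Grouped: "like " must be followed by one of the three alternatives.
my $fruit = ($test =~ m/like (apples|pines|bananas)/) ? $1 : undef;

# Ungrouped, the '|' splits the whole pattern into
# "like apples" OR "pines" OR "bananas".
my $other     = "I hate bananas";
my $ungrouped = ($other =~ m/like apples|pines|bananas/)   ? 1 : 0;
my $grouped   = ($other =~ m/like (apples|pines|bananas)/) ? 1 : 0;

print "$fruit $ungrouped $grouped\n";
```

Without the parentheses, "I hate bananas" matches because "bananas" alone is a complete alternative.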

Backreferences, Lookahead- and Lookbehind-Conditions

Backreferences

One of the most important features of REs is the ability to store ("capture") a part of the matched substring for later reuse. This is done by placing the part of the pattern that matches it in parentheses (...). The captures are stored in the Perl built-in variables $1, $2, ..., $9.

If you don't want to capture a substring but need parentheses for grouping, use the '?:' operator to avoid capturing.

E.g.

$test = "Today is monday the 18th.";
$test =~ m/([0-9]+)th/

will store "18" in $1, whereas

$test = "Today is monday the 18th.";
$test =~ m/[0-9]+th/

will store nothing in $1, since the parentheses are not present.

$test = "Today is monday the 18th.";
$test =~ m/(?:[0-9]+)th/

will store nothing in $1 either, since the parentheses are used with the '?:' operator. Another example shows the direct use in a replace operation:

$test = "Today is monday the 18th.";
$test =~ s/ the ([0-9]+)th/, and the day is $1/

will result in $test being "Today is monday, and the day is 18.".

You can also refer back, inside the pattern itself, to previously captured substrings by using \1, \2, ..., \9. E.g. the following RE removes a duplicated word:

$test = "the house is is big";
$test =~ s/\b(\S+)\b(\s+\1\b)+/$1/

Will result in $test being "the house is big".
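
The capturing rules can be summarized in one runnable sketch. The pattern with the extra "(th)" group and the variable names are assumptions added for illustration.

```perl
use strict;
use warnings;

my $test = "Today is monday the 18th.";

# A capturing group: the digits end up in $1.
my ($day) = $test =~ m/([0-9]+)th/;

# A (?:...) group does not get a number, so the following
# capturing group is the one stored in $1.
my ($suffix) = $test =~ m/(?:[0-9]+)(th)/;

# \1 inside the pattern refers to what group 1 matched so far.
(my $fixed = "the house is is big") =~ s/\b(\S+)\b(\s+\1\b)+/$1/;

print "$day\n$suffix\n$fixed\n";
```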

Lookahead- and Lookbehind-Conditions

Sometimes it is necessary to say "match this, but only if it is not preceded by that" or "match this, but only if it is not followed by that". When just single characters are concerned, you can use the negated character class [^...].

But when it comes to more than just a single character, you need to use the so-called lookahead condition or lookbehind condition. There are four possible types:

Positive lookahead-condition '(?=re)'
Match only when followed by the RE re.
Negative lookahead-condition '(?!re)'
Match only when not followed by the RE re.
Positive lookbehind-condition '(?<=re)'
Match only when preceded by the RE re.
Negative lookbehind-condition '(?<!re)'
Match only when not preceded by the RE re.

Examples:

$test = "HTML is a document description-language and not a programming-language";
$test =~ m/(?<=description-)language/

Will match the first "language" ("description-language"), since it is preceded by "description-", whereas

$test = "HTML is a document description-language and not a programming-language";
$test =~ m/(?<!description-)language/

Will match the second "language" ("programming-language"), since it is not preceded by "description-".
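
Which occurrence was matched can be verified by comparing match offsets. This is a sketch under the assumption that Perl's @- array (start offsets of the last match) is used for the comparison; the lookahead pattern and the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $test = "HTML is a document description-language and not a programming-language";

# Offsets of the first and the last "language" in the string.
my $first_pos  = index($test, "language");
my $second_pos = rindex($test, "language");

# $-[0] is the offset at which the last successful match started.
my $behind_pos     = ($test =~ m/(?<=description-)language/) ? $-[0] : -1;
my $not_behind_pos = ($test =~ m/(?<!description-)language/) ? $-[0] : -1;

# Positive lookahead: a word that is followed by "-language".
# The lookahead itself is not part of the captured text.
my ($kind) = $test =~ m/(\w+)(?=-language)/;

print "lookbehind matched the first one\n"           if $behind_pos == $first_pos;
print "negative lookbehind matched the second one\n" if $not_behind_pos == $second_pos;
print "$kind\n";
```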

More Examples

Here are some more real-world examples from the last chapter of the RE section of [3]. These more advanced REs can be used as a starting point for your own REs, or simply as examples to study in more detail.

Swap the first two words:

s/(\S+)(\s+)(\S+)/$3$2$1/

Find name=value pairs:

m/(\w+)\s*=\s*(.*?)\s*$/

Now name is in $1, value is in $2.

Read a date in the form YYYY-MM-DD:

m/(\d{4})-(\d\d)-(\d\d)/

Now YYYY is in $1, MM is in $2, DD is in $3.

Remove the leading path from a filename:

s/^.*\///
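
The four patterns above can be tried out as follows. The input strings and variable names are assumptions chosen for this sketch.

```perl
use strict;
use warnings;

# Swap the first two words.
(my $swapped = "hello world again") =~ s/(\S+)(\s+)(\S+)/$3$2$1/;

# Find a name=value pair; trailing whitespace is not part of the value.
my ($name, $value) = "color = dark blue  " =~ m/(\w+)\s*=\s*(.*?)\s*$/;

# Read a date in the form YYYY-MM-DD.
my ($year, $month, $day) = "released 2001-01-26" =~ m/(\d{4})-(\d\d)-(\d\d)/;

# Remove the leading path from a filename.
(my $file = "/usr/local/bin/perl") =~ s/^.*\///;

print "$swapped\n$name=$value\n$year $month $day\n$file\n";
```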

Summary

This document tried to give you a brief introductory overview of what REs are and where and how to use them.

Although it is straightforward to get started with REs, there are quite a lot of traps and errors you will probably meet in "real life". It is highly recommended to refer to additional literature and examples to understand and use the full power of REs. Especially [4] is a very valuable (but somewhat demanding) resource you should read.

Topics that were not covered in this document include:

Modifiers for REs (also known as "switches")
These can be used for setting things like case sensitivity, single-line and multi-line mode, extended mode for better readability, etc.
Internals of an RE engine
Different types of RE engines (namely NFA and DFA) behave differently.
Using REs in languages other than Perl
There are language-specific details that differ from Perl's RE implementation.
Optimizations
There is always more than one way of writing an RE. Some are faster, others are easier to read.

For these and many others, please take a look at the resources below.

Literature Resources

[1] Learning Perl (2nd Edition)
Randal L. Schwartz, Tom Christiansen, Larry Wall (Foreword)
[2] Programming Perl (3rd Edition)
Larry Wall, Tom Christiansen, Jon Orwant
[3] Perl Cookbook
Tom Christiansen, Nathan Torkington, Larry Wall
[4] Mastering Regular Expressions: Powerful Techniques for Perl and Other Tools (O'Reilly Nutshell)
Jeffrey E. Friedl (Editor), Andy Oram (Editor)
[5] Introduction to Regular Expressions
Microsoft Developer Network (MSDN), Microsoft Corporation
[6] Perl 5 Pocket Reference, 3rd Edition: Programming Tools (O'Reilly Perl)
Johan Vromans, Larry Wall, Linda Mui