Introducing Regular Expressions

Vasant Raj

7 Apr 2005

Basic Regular Expressions with detailed examples.

Introduction

Regular expressions are defined as "formal descriptions of text patterns that allow extremely powerful text matching operations". If you search for text within files, or perform basic programming or database operations, you are likely to work with regular expressions sooner or later. A regular expression (RE) is a pattern that describes a set of strings. The name comes from the mathematical theory of regular languages on which they are based. REs are used in searches and substitutions.

To start with, consider:

\b[A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z]{2,4}\b

It describes a series of letters, digits, dots, underscores, percent signs and hyphens, followed by an at sign, followed by another such series, finally followed by a single dot and between two and four letters. In other words, this pattern describes an email address. Because the character classes list only upper-case letters, the pattern is meant to be used with case-insensitive matching turned on.

With the above RE pattern, you can search through a text file to find email addresses, or verify if a given string looks like an email address. The term "string" or "character string" is used by programmers to indicate a sequence of characters. In practice, you can use REs with whatever data you can access using the application or programming language you are working with.
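
As a rough illustration, the pattern above could be applied in Python as follows. This is a minimal sketch: the sample text and the names EMAIL_RE and find_emails are made up for the example, and the re.IGNORECASE flag is an assumption added because the pattern lists only upper-case letters.

```python
import re

# The pattern from the text; IGNORECASE lets [A-Z] match lower-case letters too.
EMAIL_RE = re.compile(r"\b[A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z]{2,4}\b", re.IGNORECASE)

def find_emails(text):
    """Return every substring of text that looks like an email address."""
    return EMAIL_RE.findall(text)

sample = "Contact alice@example.com or bob.smith@mail.example.org for details."
print(find_emails(sample))
```

The same compiled pattern can be used to verify a single string by calling EMAIL_RE.fullmatch(candidate) instead of findall.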

Background

Literals, Metacharacter and Escape Sequence

A 'literal' is any character we use in a search or matching expression. E.g., to find ind in windows, the ind is a 'literal' string - each character plays a part in the search, it is literally the string we want to find.

A metacharacter is one or more special characters that have a unique meaning and are NOT used as literals in the search expression. E.g., the character ^ (circumflex) is a metacharacter.

An escape sequence is a way of indicating that we want to use one of our metacharacters as a literal. In a RE, an escape sequence is formed by placing the metacharacter \ (backslash) in front of the metacharacter that we want to use as a literal. E.g., if we want to find ^ind in w^indow then we use the search string \^ind, and if we want to find \\file in the string c:\\file then we need to use the search string \\\\file (each literal \ we want to search for is preceded by an escaping \).
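
The two escaping examples can be tried out in Python. This is only a sketch: the helper name contains is invented for the illustration, and raw strings (the r"..." form) are used so that Python itself does not interpret the backslashes before the RE engine sees them.

```python
import re

def contains(pattern, text):
    """True if pattern matches anywhere in text."""
    return re.search(pattern, text) is not None

# \^ matches a literal circumflex instead of anchoring to the start.
found_caret = contains(r"\^ind", "w^indow")

# Each literal backslash in the target needs its own escaping backslash.
found_backslashes = contains(r"\\\\file", r"c:\\file")

# re.escape builds the escaped form of a literal string automatically.
escaped = re.escape("^ind")
```

Using re.escape is usually the safer choice when the text to search for comes from user input.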

Brackets, Ranges and Negation

Bracket expressions introduce our first metacharacters, the square brackets, which allow us to define a list of things to test for rather than the single characters we have been checking up until now.

Metacharacter	Meaning
[ ]	Match anything inside the square brackets for one character position once and only once. E.g. [12] means match the target to either 1 or 2 while [0123456789] means match to any character in the range 0 to 9.
-	The - (dash) inside square brackets is the 'range separator' and allows us to define a range, in our example above of [0123456789] we could rewrite it as [0-9]. You can define more than one range inside a list e.g. [0-9A-C] means check for 0 to 9 and A to C. Note: To test for - inside brackets (as a literal) it must come first or last i.e. [-0-9] will test for - and 0 to 9.
^	The ^ (circumflex) inside square brackets negates the expression (we will see an alternate use for the circumflex outside square brackets later) e.g. [^Ff] means anything except upper or lower case F. Note: Spaces inside square brackets are significant. A space placed between ranges is treated as one more literal character to match (or, after ^, to exclude), so leave it out unless that is what you want.
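
The bracket expressions in the table can be checked with a few lines of Python. This is a sketch only; the helper chars_matching and the sample strings are made up for the illustration.

```python
import re

def chars_matching(char_class, text):
    """Return the characters of text matched one at a time by a bracket expression."""
    return re.findall(char_class, text)

either_digit = chars_matching(r"[12]", "3211")        # 1 or 2 only
two_ranges = chars_matching(r"[0-9A-C]", "B7 z D")    # digits and A to C
with_dash = chars_matching(r"[-0-9]", "3-4")          # a leading - is a literal
not_f = chars_matching(r"[^Ff]", "Fifa")              # anything except F or f
```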

Iteration 'Metacharacters'

The following is a set of metacharacters that can control the number of times a character or string is found in our searches:

Metacharacter	Meaning
?	The ? (question mark) matches the preceding character 0 or 1 times only. E.g. colou?r will find both color and colour.
*	The * (asterisk or star) matches the preceding character 0 or more times. E.g. tre* will find tree and tread and trough.
+	The + (plus) matches the previous character 1 or more times. E.g. tre+ will find tree and tread but not trough.
{n}	Matches the preceding character n times exactly. E.g. to find a local phone number we could use [0-9]{3}-[0-9]{4} which would find any number of the form 123-4567. Note: The - (dash) in this case, because it is outside the square brackets, is a 'literal'.
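{n,m}	Matches the preceding character at least n times but not more than m times, e.g. a{2,3} will find the aa in 'baab' and the aaa in 'baaab' but nothing in 'bab'. In 'baaaab' it still finds the first three a's; to reject such a string, surround the pattern with context, e.g. ba{2,3}b.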
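
A short Python sketch of the iteration metacharacters follows. The helper first_match is a hypothetical name used only here; it returns the part of the text that the RE actually matched, which makes it easy to see how far each quantifier reaches.

```python
import re

def first_match(pattern, text):
    """Return the first substring of text matched by pattern, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

optional_u = first_match(r"colou?r", "colour")
star_in_trough = first_match(r"tre*", "trough")   # zero e's are allowed
plus_in_trough = first_match(r"tre+", "trough")   # at least one e is required
phone = first_match(r"[0-9]{3}-[0-9]{4}", "call 123-4567 now")
two_a = first_match(r"a{2,3}", "baab")
one_a = first_match(r"a{2,3}", "bab")
```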

Additional 'Metacharacters'

The following is a set of additional 'metacharacters' that provide additional power to our searches:

Metacharacter	Meaning
()	The ( (open parenthesis) and ) (close parenthesis) may be used to group parts of our search expression together.
|	The | (vertical bar or pipe) is called alternation and means: 'find the left-hand OR the right-hand value'. E.g. gr(a|e)y will find 'gray' or 'grey'.
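
Grouping and alternation can be sketched in Python, where alternation is written as a bare vertical bar. The sample text and variable names are invented for the example. The parentheses both limit the alternation to the single letter and capture which alternative was used.

```python
import re

text = "gray, grey and gruy"

# group(0) is the whole match, group(1) is what the parentheses captured.
spellings = [m.group(0) for m in re.finditer(r"gr(a|e)y", text)]
vowels = [m.group(1) for m in re.finditer(r"gr(a|e)y", text)]
```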

Matching single characters

The '.' character is used to match any single character (in most RE engines, any character except a line break).

For example: 
/p.t/ - matches 'p' and 't' separated by a single character.
 e.g. 'pit', 'put', 'pot', etc.

Sets of characters

The expression /[RE]/ is used to match any one of a set of characters in a single character position.

For example:
/x[ab2X]y/ - matches any of the following: xay, xby, x2y, xXy

In the expression /[RE]/ a range of characters can be specified.

For example:
    [a-z] - matches any single lower case letter
    [0-9] - matches any single digit
    [0-57] - matches any one of the following: 0 1 2 3 4 5 7 i.e. 0-5 and 7.

Sets of characters can be combined:
    [a-d5-8X-Z] matches any one of the following: a b c d 5 6 7 8 X Y Z

It is possible to specify a set of characters which are not to be matched in the RE.
For example:
    [^0-9] - matches any single character which is not a digit

Anchors

A few other metacharacters are almost as simple, but they don't actually match characters in the target string; rather, they match a position in the target string. This includes string / line boundaries (caret and dollar), as well as word boundaries \<, \b, and such. The tests are simple because, for the most part, they simply compare two adjacent characters in the target string. An anchor is used to match a RE found at a particular position.

For example:
    /^RE/ - matches RE at the start of a line
    /RE$/ - matches RE at the end of a line
    /^RE$/ - matches RE as the whole line

Note that there are two separate uses of the '^' operator. One is as the start of line anchor, and the other as the 'logical not' operator. The latter function only applies inside square brackets.
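
The anchors can be tried in Python as sketched below. One assumption to note: in Python's re module, ^ and $ match at the start and end of every line only when the re.MULTILINE flag is given; without it they refer to the start and end of the whole string. The sample text is made up.

```python
import re

text = "cat\nconcat\ncat"   # three lines

starts = re.findall(r"^cat", text, re.MULTILINE)        # lines starting with cat
ends = re.findall(r"cat$", text, re.MULTILINE)          # lines ending with cat
whole_lines = re.findall(r"^cat$", text, re.MULTILINE)  # lines that are exactly cat
whole_string = re.findall(r"^cat$", text)               # no flag: whole string only
```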

Repetitions

Multiple occurrences of REs can be specified.

For example:
    a* - matches 0 or more occurrences of 'a'
    aa* - matches 1 or more occurrences of 'a'
    .* - matches any string of characters

Summary of special characters

Special characters in the search string
    ^ start of line anchor (or NOT operator inside [])
    $ end of line anchor
    . any character
    * preceding character repeated any number of times (including zero)
    \ escape character
    [ ] contains range of characters

Special characters in the replacement string
    & string matched in search string
    \ escape character
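
The & notation for "the matched text" belongs to editor-style tools; not every RE flavour uses it. As a sketch of the same idea in Python, where & has no special meaning in the replacement string, the whole match is written as \g<0>. The function name bracket_numbers is made up for the example.

```python
import re

def bracket_numbers(text):
    """Wrap every run of digits in angle brackets, reusing the matched text."""
    return re.sub(r"[0-9]+", r"<\g<0>>", text)

wrapped = bracket_numbers("a1b22")

# In Python an ampersand in the replacement is just a literal character.
literal_amp = re.sub(r"[0-9]+", "&", "a1")
```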

Examples

Numbers

A positive integer number.	\b\d+\b
A positive integer number which allows for a sign.	[-+]?\b\d+\b
A C-style hexadecimal number.	\b0[xX][0-9a-fA-F]+\b
An integer number as well as a floating point number with optional integer part.	((\b[0-9]+)?\.)?[0-9]+\b
A floating point number with optional integer as well as optional fractional part, but does not match an integer number.	(\b[0-9]+\.([0-9]+\b)?|\.[0-9]+\b)
A number in scientific notation. The mantissa can be an integer or floating point number with optional integer part. The exponent is optional.	((\b[0-9]+)?\.)?\b[0-9]+([eE][-+]?[0-9]+)?\b
A number in scientific notation. The difference with the previous example is that if the mantissa is a floating point number, the integer part is mandatory.	\b[0-9]+(\.[0-9]+)?(e[+-]?[0-9]+)?\b
An integer that does not begin with a zero and has an optional sign.	(\+|-)?[1-9][0-9]*

Floating Point Numbers

RE to match an optional sign, that is either followed by zero or more digits followed by a dot and one or more digits (a floating point number with optional integer part), or followed by one or more digits (an integer).

[-+]?([0-9]*\.[0-9]+|[0-9]+)

This is a robust definition: any match will include at least one digit, because there is no way around the [0-9]+ part. Matches without any digits, which a pattern made up entirely of optional parts would allow, are excluded.

We can optimize this RE as: [-+]?([0-9]*\.)?[0-9]+

If you also want to match numbers with exponents, you can use:

[-+]?([0-9]*\.)?[0-9]+([eE][-+]?[0-9]+)?
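
To see what the RE with the optional exponent accepts, it can be wrapped in a small Python check. This is a sketch under the assumption that the whole string must be a number, which is why fullmatch is used; the names FLOAT_RE and is_number are invented for the example.

```python
import re

FLOAT_RE = re.compile(r"[-+]?([0-9]*\.)?[0-9]+([eE][-+]?[0-9]+)?")

def is_number(s):
    """True if the entire string is an integer or floating point number."""
    return FLOAT_RE.fullmatch(s) is not None

accepted = [s for s in ["3.14", "-.5", "1e10", "6.02E+23", "42"] if is_number(s)]
rejected = [s for s in ["", ".", "1.", "abc", "e5"] if not is_number(s)]
```

Note that a trailing dot as in "1." is rejected, because the RE requires at least one digit after the decimal point.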

A real number:

(\+|-)?[1-9][0-9]*(\.[0-9]*)?
(\+|-)?[0-9]+(\.[0-9]*)? (also allows a leading zero, as in 0.111)

A real number in normalized scientific (E) notation, with a single non-zero digit before the decimal point:

(\+|-)?[1-9]\.[0-9]*E(\+|-)?[0-9]+

Strings

"[^"\r\n]*" matches a single-line string that does not allow the quote character to appear inside the string. Using the negated character class is more efficient than using a lazy dot.

"[^"]*" allows the string to span across multiple lines.

"[^"\\\r\n]*(\\.[^"\\\r\n]*)*" matches a single-line string in which the quote character can appear if it is escaped by a backslash. Though this RE may seem more complicated than it needs to be, it is much faster than simpler solutions which can cause a whole lot of backtracking in case a double quote appears somewhere all by itself rather than part of a string.

"[^"\\]*(\\.[^"\\]*)*" allows the string to span multiple lines.

You can adapt the above REs to match any sequence delimited by two (possibly different) characters.

If we use b for the starting character, e for the end character, and x as the escape character, the version without escape becomes b[^e\r\n]*e, and the version with escape becomes b[^ex\r\n]*(x.[^ex\r\n]*)*e.
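
The single-line RE that allows escaped quotes can be exercised in Python as sketched here. The names STRING_RE and quoted_strings and the sample text are made up; the pattern is kept in a raw string so that its backslashes reach the RE engine unchanged.

```python
import re

# Single-line quoted string in which \" may appear inside the quotes.
STRING_RE = re.compile(r'"[^"\\\r\n]*(\\.[^"\\\r\n]*)*"')

def quoted_strings(text):
    """Return every double-quoted string found in text, quotes included."""
    return [m.group(0) for m in STRING_RE.finditer(text)]

sample = r'say "hello \"world\"" and "bye"'
found = quoted_strings(sample)
```

finditer is used rather than findall because the pattern contains a capturing group, and findall would return the contents of that group instead of the whole match.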

Valid Dates

RE to match a date in yyyy-mm-dd format between 1900-01-01 and 2099-12-31, with a choice of four separators (hyphen, space, slash or dot).

(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])

The year is matched by (19|20)\d\d. Alternation is used to allow the first two digits to be 19 or 20. The round brackets are mandatory. If they are omitted, the RE engine would go looking for 19 or the remainder of the RE, which matches a date between 2000-01-01 and 2099-12-31. Round brackets are the only way to stop the vertical bar from splitting up the entire RE into two options.

The month is matched by 0[1-9]|1[012], again enclosed by round brackets to keep the two options together. By using character classes, the first option matches a number between 01 and 09, and the second matches 10, 11 or 12.

The last part of the RE consists of three options. The first matches the numbers 01 through 09, the second 10 through 29, and the third matches 30 or 31.

If you are validating the user's input of a date in a script, it is probably easier to do certain checks outside of the RE. For example, excluding February 29th when the year is not a leap year is far easier to do in a scripting language. It is far easier to check if a year is divisible by 4 (and not divisible by 100 unless divisible by 400) using simple arithmetic than using REs.
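
The division of labour described above, with the RE checking the format and ordinary code checking the calendar, could look like this in Python. It is a sketch, not a definitive validator: the names DATE_RE and is_valid_date are invented, both separators are written as [- /.], and the calendar check is delegated to the standard datetime.date constructor, which raises ValueError for impossible dates.

```python
import re
from datetime import date

DATE_RE = re.compile(r"(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])")

def is_valid_date(s):
    """True if s has the yyyy-mm-dd shape and names a real calendar day."""
    m = DATE_RE.fullmatch(s)
    if m is None:
        return False
    year = int(s[:4])           # the RE guarantees four leading digits
    month = int(m.group(2))
    day = int(m.group(3))
    try:
        date(year, month, day)  # rejects 30 February, 31 April, bad leap days
    except ValueError:
        return False
    return True
```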

To match a date in particular format:

For mm/dd/yyyy:

(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])[- /.](19|20)\d\d

For dd-mm-yyyy:

(0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.](19|20)\d\d

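To replace dates of the form mm/dd/yy with dates of the form dd-mm-yy, search for the following RE, which uses named groups (written here in the syntax used by .NET):

\b(?<month>\d{1,2})/(?<day>\d{1,2})/(?<year>\d{2,4})\b

and replace each match with:

${day}-${month}-${year}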

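The same date replacement can be sketched in Python, whose syntax for named groups differs slightly: groups are written (?P<name>...) in the pattern and \g<name> in the replacement. The names SWAP_RE and month_first_to_day_first are made up for the example.

```python
import re

SWAP_RE = re.compile(r"\b(?P<month>\d{1,2})/(?P<day>\d{1,2})/(?P<year>\d{2,4})\b")

def month_first_to_day_first(text):
    """Rewrite every mm/dd/yy date in text as dd-mm-yy."""
    return SWAP_RE.sub(r"\g<day>-\g<month>-\g<year>", text)

converted = month_first_to_day_first("Due 04/07/05 and 12/31/2005.")
```
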
Time

Matching a time can be taken to varying levels of strictness. Something like

[0-9]?[0-9]:[0-9][0-9] (am|pm)

picks up both 9:17 am and 12:30 pm, but also allows 99:99 pm. Looking at the hour, we realize that if it is a two-digit number, the first digit must be a one. But 1?[0-9] still allows an hour of 19 (and also an hour of 0), so it is better to break the hour part into two possibilities: 1[012] for two-digit hours and [1-9] for single-digit hours. The result is (1[012]|[1-9]).

The minute part is easier. The first digit should be [0-5]. For the second, we can stick with the current [0-9]. This gives (1[012]|[1-9]):[0-5][0-9] (am|pm) when we put it all together.

Using the same logic, we can extend this to handle 24-hour time with hours from 0 through 23, allowing for a leading zero on single-digit hours. There are various solutions, but we can use similar logic as before. Break the task into groups: Morning (hours 00 through 09, with the leading zero being optional), Daytime (hours 10 through 19) and Evening (hours 20 through 23).

The resulting RE for the hour is: 0?[0-9]|1[0-9]|2[0-3]

Actually, we can combine the first two alternatives, resulting in the shorter [01]?[0-9]|2[0-3]
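
Both time formats can be checked in Python as sketched below. The names are invented for the example. One detail is an assumption of the sketch rather than something stated above: when the 24-hour alternation is followed by the minutes, it has to be wrapped in parentheses, otherwise the vertical bar would split the whole RE in two.

```python
import re

TIME_12H = re.compile(r"(1[012]|[1-9]):[0-5][0-9] (am|pm)")
TIME_24H = re.compile(r"([01]?[0-9]|2[0-3]):[0-5][0-9]")

def is_12h(s):
    """True if s is a 12-hour time such as 9:17 am."""
    return TIME_12H.fullmatch(s) is not None

def is_24h(s):
    """True if s is a 24-hour time such as 09:17 or 23:59."""
    return TIME_24H.fullmatch(s) is not None
```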

Variable names

Many programming languages have identifiers (variable names and such) that are allowed to contain only alphanumeric characters and underscores, but which may not begin with a digit. This can be expressed with the following RE:

[a-zA-Z_][a-zA-Z_0-9]*

The first class matches what the first character can be, the second (with its accompanying star) allows the rest of the identifier. If there is a limit on the length of an identifier, say 32 characters, you might replace the star with {0,31} if the {min,max} notation is supported.
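
A quick Python sketch of the length-limited variant, assuming the 32-character limit used as an example above. The names IDENT_RE and is_identifier are made up.

```python
import re

# One leading letter or underscore, then up to 31 more identifier characters.
IDENT_RE = re.compile(r"[a-zA-Z_][a-zA-Z_0-9]{0,31}")

def is_identifier(s):
    """True if the whole string is a valid identifier of at most 32 characters."""
    return IDENT_RE.fullmatch(s) is not None
```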

Dollar amount

One approach is:

\$[0-9]+(\.[0-9][0-9])?

From a top-level perspective, this is a simple RE with three parts: \$ and …+ and (…)?, which might be loosely paraphrased as "a literal dollar sign, a number, and optionally an other-thing." In this case, "other-thing" is the combination of a decimal point followed by two digits. If, however, you need to find lines that contain just a price, and nothing else, you can wrap the expression with ^…$.

Removing White space

You can easily trim unnecessary white space from the start and the end of a string or the lines in a text file by doing a RE search-and-replace. Search for ^[ \t]+ and replace with nothing to delete leading white space (spaces and tabs). Search for [ \t]+$ to trim trailing white space. Both the above can be combined into single RE:

^[ \t]+|[ \t]+$

Instead of [ \t], which matches a space or a tab, you can expand the character class into [ \t\r\n] if you also want to strip line breaks.
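
The combined search-and-replace could be written in Python as follows. This is a sketch: the function name trim_lines is invented, and the re.MULTILINE flag is an assumption needed so that ^ and $ apply to every line rather than only to the whole string.

```python
import re

def trim_lines(text):
    """Delete leading and trailing spaces and tabs from every line of text."""
    return re.sub(r"^[ \t]+|[ \t]+$", "", text, flags=re.MULTILINE)

trimmed = trim_lines("  hello \t\n\tworld  ")
```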

HTML Tags

RE to match the opening and closing pair of a specific HTML tag.

<TAG[^>]*>(.*?)</TAG>

Anything between the tags is captured into the first back reference. The question mark makes the star lazy, so the match stops at the first closing tag rather than at the last, as a greedy star would. This RE will not properly match tags nested inside themselves, as in <TAG>one<TAG>two</TAG>one</TAG>.

RE to match the opening and closing pair of any HTML tag.

<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>

Be sure to turn off case sensitivity. The key in this solution is the use of the back reference \1 in the RE. Anything between the tags is captured into the second back reference. This solution will also not match tags nested in themselves.
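
A Python sketch of the back-reference version follows. The names TAG_RE and tag_pairs and the sample markup are invented; re.IGNORECASE plays the role of turning off case sensitivity. As noted above, this is only suitable for simple, non-nested markup.

```python
import re

TAG_RE = re.compile(r"<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>", re.IGNORECASE)

def tag_pairs(html):
    """Return (tag name, inner text) for every matching open/close pair."""
    return TAG_RE.findall(html)

pairs = tag_pairs('<b>bold</b> and <a href="x.html">link</a>')
mismatched = tag_pairs("<b>text</i>")
```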

IP Addresses

Matching an IP address is another good example of a trade-off between RE complexity and exactness. The following RE will match any IP address just fine, but will also match 999.999.999.999 as if it were a valid IP address.

\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b

To restrict all the four numbers in the IP address to 0..255, you can use this complex RE (everything on a single line).

\b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.
   (25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b

The long RE stores each of the four numbers of the IP address into a capturing group. You can use these groups to further process the IP number.

If you don't need access to the individual numbers, you can shorten the RE with a quantifier to:

\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b

Similarly, the simpler RE that does not check the 0..255 range can be shortened to:

\b(?:\d{1,3}\.){3}\d{1,3}\b
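
The difference between the strict and the loose RE can be seen in a short Python sketch. The names OCTET, STRICT_IP_RE, LOOSE_IP_RE and the two helper functions are made up; building the strict pattern from a single OCTET string is just a convenience to avoid repeating the long alternation.

```python
import re

# One number in the range 0..255, as a non-capturing group.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"

STRICT_IP_RE = re.compile(r"\b(?:" + OCTET + r"\.){3}" + OCTET + r"\b")
LOOSE_IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def is_ip_strict(s):
    """True if s is four dot-separated numbers, each between 0 and 255."""
    return STRICT_IP_RE.fullmatch(s) is not None

def is_ip_loose(s):
    """True if s is four dot-separated groups of one to three digits."""
    return LOOSE_IP_RE.fullmatch(s) is not None
```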

Valid E-mail

E-mail addresses can also be checked with this RE (everything on a single line):

^([\w.-]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([\w-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$

A Tweaker...

RE to match a number, either an integer or floating-point. As this expression is constructed, such a number has an optional leading minus sign, any number of digits, an optional decimal point, and any number of digits that follow.

-?[0-9]*\.?[0-9]*

Indeed, this matches such examples as 1, -272.37, 129238843., .191919, and even something like -.0.

Looking at the RE closely, we find that everything in it is optional. If a number is there, and if it is at the beginning of the string, it will be matched, but nothing is required, so the RE also matches a lone minus sign, a lone dot, or nothing at all.

A better RE is: -?[0-9]+(\.[0-9]*)?

This still doesn't allow something like '.007', since the RE requires at least one digit before the decimal point. The solution is to add an alternative which allows for the uncovered situation. The following RE allows just a decimal point followed by (this time not optional) digits.

-?[0-9]+(\.[0-9]*)?|-?\.[0-9]+

The optional leading minus in the second alternative is also required. You could also bring the -? out of the alternation.

-?([0-9]+(\.[0-9]*)?|\.[0-9]+)
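
The difference between the all-optional RE and the final one, -?([0-9]+(\.[0-9]*)?|\.[0-9]+), is easy to demonstrate in Python. This is a sketch with invented names; fullmatch is used so that each test string has to be matched in its entirety.

```python
import re

LOOSE_RE = re.compile(r"-?[0-9]*\.?[0-9]*")                # everything optional
TIGHT_RE = re.compile(r"-?([0-9]+(\.[0-9]*)?|\.[0-9]+)")   # at least one digit

def accepts(regex, s):
    """True if the whole string s is matched by the compiled regex."""
    return regex.fullmatch(s) is not None

numbers = ["1", "-272.37", "129238843.", ".191919", "-.0", ".007"]
junk = ["", "-", ".", "-."]

loose_accepts_junk = all(accepts(LOOSE_RE, s) for s in junk)
tight_accepts_numbers = all(accepts(TIGHT_RE, s) for s in numbers)
tight_rejects_junk = not any(accepts(TIGHT_RE, s) for s in junk)
```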

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

Written By

Vasant Raj

Web Developer

India

Software engineer and currently working in SQL Server 2005 Reporting Services. I have done M.C.A. from M.S. University, Baroda.
