Learning .NET Regular Expressions with Expresso

Did you ever wonder what regular expressions are all about and want to gain a basic understanding quickly? My goal is to get you up and running with a basic understanding of regular expressions within 30 minutes. The reality is that regular expressions aren't as complex as they look. The best way to learn is to start writing and experimenting. After your first half hour, you should know a few of the basic constructs and be able to design and use regular expressions in your programs or web pages. For those of you who get hooked, there are many excellent resources available to further your education.

What the Heck is a Regular Expression Anyway?

I'm sure you are familiar with the use of "wildcard" characters for pattern matching. For example, if you want to find all the Microsoft Word files in a Windows directory, you search for "

In writing programs or web pages that manipulate text, it is frequently necessary to locate strings that match complex patterns. Regular expressions were invented to describe such patterns. Thus, a regular expression is just a shorthand code for a pattern. For example, the pattern "\w+" is a concise way to say "match any non-empty string of alphanumeric characters". The .NET framework provides a powerful class library that makes it easy to include regular expressions in your applications. With this library, you can readily search and replace text, decode complex headers, parse languages, or validate text.

A good way to learn the arcane syntax of regular expressions is by starting with examples and then experimenting with your own creations. This tutorial introduces the basics of regular expressions, giving many examples that are included in an Expresso library file. Expresso can be used to try out the examples and to experiment with your own regular expressions.

Let's get started!
Some Simple Examples

Searching for Elvis

Suppose you spend all your free time scanning documents looking for evidence that Elvis is still alive. You could search with the following regular expression:

1. elvis    Find elvis
This is a perfectly valid regular expression that searches for an exact sequence of characters. In .NET, you can easily set options to ignore the case of characters, so this expression will match "Elvis", "ELVIS", or "eLvIs". Unfortunately, it will also match the last five letters of the word "pelvis". We can improve the expression as follows:

2.

Now things are getting a little more interesting. The "

Suppose you want to find all lines in which the word "elvis" is followed by the word "alive." The period or dot "

3.

With just a few special characters we are beginning to build powerful regular expressions, and they are already becoming hard for us humans to read.

Let's try another example.

Determining the Validity of Phone Numbers

Suppose your web page collects a customer's seven-digit phone number and you want to verify that the phone number is in the correct format, "xxx-xxxx", where each "x" is a digit. The following expression will search through text looking for such a string:

4.

Each "

5.

The "

Let's learn how to test this expression.

Expresso

If you don't find regular expressions hard to read, you are probably an idiot savant or a visitor from another planet. The syntax can be imposing for anyone, including those who use regular expressions frequently. This makes errors common and creates a need for a simple tool for building and testing expressions. Many such tools exist, but I'm partial to my own, Expresso, originally launched on the CodeProject. Version 2.0 is shown here. For later versions, check the Ultrapico website.

To get started, install Expresso and select the Tutorial from the Windows Program menu. Each example can be selected using the tab labeled "Expression Library".
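The Elvis search can also be tried outside Expresso. The sketch below is an illustration only: it uses Python's `re` module as a stand-in for the .NET library (the two agree on these simple constructs), the sample text is invented, and `\belvis\b` is an assumed form of the improved expression, built on the word-boundary anchor `\b`.

```python
import re

# Hypothetical sample text; not taken from the Expresso library file.
text = "Elvis lives! ELVIS was seen near the pelvis exhibit."

# Example 1: the literal pattern, with case ignored.
plain = re.findall(r"elvis", text, flags=re.IGNORECASE)

# Assumed refinement: \b requires a word boundary on both sides,
# so the "elvis" inside "pelvis" is no longer matched.
whole_word = re.findall(r"\belvis\b", text, flags=re.IGNORECASE)

print(plain)       # ['Elvis', 'ELVIS', 'elvis']
print(whole_word)  # ['Elvis', 'ELVIS']
```

The third item in the first list is the tail of "pelvis", which is exactly the false positive described above.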
Figure 1. Expresso running example 5

Start by selecting the first example, "1. Find Elvis". Click Run Match and look at the TreeView on the right. Note there are several matches. Click on each to show the location of the match in the sample text. Run the second and third examples, noting that the word "pelvis" no longer matches. Finally, run the fourth and fifth examples; both should match the same numbers in the text. Try removing the initial "

Basics of .NET Regular Expressions

Let's explore some of the basics of regular expressions in .NET.

Special Characters

You should get to know a few characters with special meaning. You already met "

Let's try a few more examples:

6.

This works by searching for the beginning of a word (\b), then the letter "a", then any number of repetitions of alphanumeric characters (\w*), then the end of a word (\b).

7.

Here, the "+" is similar to "*", except it requires at least one repetition.

8.

Try these in Expresso and start experimenting by inventing your own expressions. Here is a table of some of the characters with special meaning:
Table 1. Commonly used special characters for regular expressions

In the beginning

The special characters "

9.

This is the same as example (5), but forced to fill the whole text string, with nothing else before or after the matched text. By setting the "Multiline" option in .NET, "

Escaped characters

A problem occurs if you actually want to match one of the special characters, like "

Repetitions

You've seen that "
Table 2. Commonly used quantifiers

Let's try a few more examples:

10.

11.

12.

13.

Try the last example with and without setting the "Multiline" option, which changes the meaning of "

Character Classes

It is simple to find alphanumerics, digits, and whitespace, but what if we want to find anything from some other set of characters? This is easily done by listing the desired characters within square brackets. Thus, "

Let's try a more complicated expression that searches for telephone numbers.

14.

This expression will find phone numbers in several formats, like "(800) 325-3535" or "650 555 1212". The "

Negation

Sometimes we need to search for a character that is NOT a member of an easily defined class of characters. The following table shows how this can be specified.
Table 3. How to specify what you don't want

15.

Later, we'll see how to use "lookahead" and "lookbehind" to search for the absence of more complex patterns.

Alternatives

To select between several alternatives, allowing a match if either one is satisfied, use the pipe "

16.

When using alternatives, the order is important since the matching algorithm will attempt to match the leftmost alternative first. If the order is reversed in this example, the expression will only find the 5 digit Zip Codes and fail to find the 9 digit ones. We can use alternatives to improve the expression for ten digit phone numbers, allowing the area code to appear either delimited by whitespace or enclosed in parentheses:

17.

Grouping

Parentheses may be used to delimit a subexpression to allow repetition or other special treatment. For example:

18.

The first part of the expression searches for a one to three digit number followed by a literal period "

Unfortunately, this example allows IP addresses with arbitrary one, two, or three digit numbers separated by periods even though a valid IP address cannot have numbers larger than 255. It would be nice to arithmetically compare a captured number N to enforce N<256, but this is not possible with regular expressions alone. The next example tests various alternatives based on the starting digits to guarantee the limited range of numbers by pattern matching. This shows that an expression can become cumbersome even when looking for a pattern that is simple to describe.

19.

Expresso Analyzer View
Figure 2. Expresso's analyzer view showing example 17

Expresso has a feature that diagrams expressions in a Tree structure, explaining what each piece means. When debugging an expression, this can help zoom in on the part that is causing trouble. Try this by selecting example (17) and then using the Analyze button. Select nodes in the tree and expand them to explore the structure of this regular expression as shown in the figure. After highlighting a node, you can also use the Partial Match or Exclude Match buttons to run a match using just the highlighted portion of the regular expression or using the regular expression with the highlighted portion excluded.

When subexpressions are grouped with parentheses, the text that matches the subexpression is available for further processing in a computer program or within the regular expression itself. By default, groups are numbered sequentially as encountered in reading from left to right, starting with 1. This automatic numbering can be seen in Expresso's skeleton view or in the results shown after a successful match.

A "backreference" is used to search for a recurrence of previously matched text that has been captured by a group. For example, "

20.

This works by capturing a string of at least one alphanumeric character within group 1 "

It is possible to override the automatic numbering of groups by specifying an explicit name or number. In the above example, instead of writing the group as "

21.

Test this in Expresso and expand the match results to see the contents of the named group.

Using parentheses, there are many special purpose syntax elements available. Some of the most common are summarized in this table:
Table 4. Commonly used Group Constructs

We've already talked about the first two. The third "

Positive Lookaround

The next four are so-called lookahead or lookbehind assertions. They look for things that go before or after the current match without including them in the match. It is important to understand that these expressions match a position like "

"

22.

"

23.

Here is an example that could be used repeatedly to insert commas into numbers in groups of three digits:

24.

Here is an example that looks for both a prefix and a suffix:

25.

Negative Lookaround

Earlier, I showed how to search for a character that is not a specific character or a member of a character class. What if we simply want to verify that a character is not present, but don't want to match anything? For example, what if we are searching for words in which the letter "q" is not followed by the letter "u"? We could try:

26.

Run the example and you will see that it fails when "q" is the last letter of a word, as in "Iraq". This is because "[^u]" always matches a character. If "q" is the last character of the word, it will match the whitespace character that follows, so in the example the expression ends up matching two whole words. Negative lookaround solves this problem because it matches a position and does not consume any text. As with positive lookaround, it can also be used to match the position of an arbitrarily complex subexpression, rather than just a single character. We can now do a better job:

27.

We used the "zero-width negative lookahead assertion", "(?!exp)", which succeeds only if the suffix "exp" is not present. Here is another example:

28.

Similarly, we can use "

29.

Here is one more example using lookaround:

30.

This searches for an HTML tag using lookbehind and the corresponding closing tag using lookahead, thus capturing the intervening text but excluding both tags.

Comments please

Another use of parentheses is to include comments using the "(?#comment)" syntax.
A better method is to set the "Ignore Pattern Whitespace" option, which allows whitespace to be inserted in the expression and then ignored when the expression is used. With this option set, anything following a number sign "#" at the end of each line of text is ignored. For example, we can format the preceding example like this:

31. Text between HTML tags, with comments
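As a rough illustration of a commented pattern, here is a sketch in Python, whose `re.VERBOSE` flag is assumed here to play the role of .NET's "Ignore Pattern Whitespace" option. It uses the simpler seven-digit phone number format "xxx-xxxx" from earlier rather than the HTML example; the pattern and the sample text are illustrative assumptions.

```python
import re

# With re.VERBOSE, whitespace in the pattern is ignored and everything
# after "#" on a line is treated as a comment.
phone = re.compile(
    r"""
    \b        # start at a word boundary
    \d{3}     # three digits
    -         # a literal hyphen
    \d{4}     # four digits
    \b        # end at a word boundary
    """,
    re.VERBOSE,
)

# Hypothetical sample text.
text = "Call 555-1212 or 325-3535, not 12-34567."
found = phone.findall(text)
print(found)  # ['555-1212', '325-3535']
```

Spreading the expression over several lines costs nothing at match time and makes each piece much easier to review.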
Greedy and Lazy

When a regular expression has a quantifier that can accept a range of repetitions (like "

32.

If this is used to search the string "aabab", it will match the entire string "aabab". This is called "greedy" matching. Sometimes, we prefer "lazy" matching in which a match using the minimum number of repetitions is found. All the quantifiers in Table 2 can be turned into "lazy" quantifiers by adding a question mark "

33.

If we apply this to the same string "aabab" it will first match "aab" and then "ab".
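The difference is easy to check in code. This sketch again uses Python's `re` module as a stand-in for the .NET library, and it assumes the two expressions are `a.*b` (greedy) and `a.*?b` (lazy), which reproduce the matches described above for "aabab".

```python
import re

text = "aabab"

# Greedy: .* takes as many characters as it can while still
# allowing the final "b" to match.
greedy = re.findall(r"a.*b", text)

# Lazy: .*? takes as few characters as it can, so each match
# stops at the first available "b".
lazy = re.findall(r"a.*?b", text)

print(greedy)  # ['aabab']
print(lazy)    # ['aab', 'ab']
```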
Table 5. Lazy quantifiers

What did we leave out?

I've described a rich set of elements with which to begin building regular expressions, but I left out a few things that are summarized in the following table. Many of these are illustrated with additional examples in the project file. The example number is shown in the left-hand column of this table.
Table 6. Everything we left out. The left-hand column shows the number of an example in the project file that illustrates this construct.

Conclusion

We've given many examples to illustrate the essential features of .NET regular expressions, emphasizing the use of a tool like Expresso to test, experiment, and learn by example. If you get hooked, there are many online resources available to help you go further. You can start your search at the Ultrapico web site. If you want to read a book, I suggest the latest edition of Mastering Regular Expressions, by Jeffrey Friedl. There are also a number of nice articles on The Code Project including the following tutorials: