Introduction

What is a regular expression? In a nutshell, regular expressions provide a simple way to transform raw data into something usable. In the preface of Mastering Regular Expressions (O'Reilly & Associates), Jeffrey Friedl writes:

"There's a good reason that regular expressions are found in so many diverse applications: they are extremely powerful. At a low level, a regular expression describes a chunk of text. You might use it to verify a user's input, or perhaps to sift through large amounts of data. On a higher level, regular expressions allow you to master your data. Control it. Put it to work for you. To master regular expressions is to master your data."

You may not know this, but regular expressions are also built into the Microsoft Visual Studio text search tool, which gives you a very powerful way to search for complex patterns in your code (or any text file, for that matter). Here are a few links on the web to help you get started with regular expressions if you've never used them before.
Getting Started

Regular expressions, while seemingly difficult to learn, are one of the most powerful tools in a programmer’s arsenal, yet many programmers never take advantage of them. You can certainly write your own text parsers that will get the job done, but doing it that way takes more time, is far more error prone, and is nowhere near as fun (IMHO).

Regex++ is a regular expression library available from http://www.boost.org. Boost provides free, peer-reviewed, portable C++ source libraries. Take a look at the website to learn more. We are only concerned with Regex++ for our purposes, but you may find many of their libraries useful. The original Regex++ author's website is http://ourworld.compuserve.com/homepages/John_Maddock/

Installing Regex++

Note: The following instructions will only work if you have Visual Studio 6 or 7 installed.

To install Regex++, complete the following steps (detailed instructions are also available in the Regex++ download itself):
Now that your library is built and in place, it is ready to use. The project that I've included above is intended to demonstrate how simply you can parse HTML. All you need to do now is open the project and ensure that the project settings point to the appropriate Regex++ lib and include directories. But first, a short discussion.

Note: To add the Regex++ library to your project select Project | Settings.... In the ensuing dialog, select the C/C++ tab. In the Category drop down list, select Preprocessor. In the Additional include directories: edit box enter C:\Regex++. Now select the Link tab. In the Category drop down list, select Input. In the Additional library path: edit box enter C:\Regex++.

Parsing HTML

HTML parsers are nothing new. There is really no reason someone should have to write their own (that I can think of, at least) since the wheel has already been invented. That being said, the example we are going to be using does just that--parses HTML. I do this because parsing HTML provides a good pedagogical example. Specifically, the example parses form elements in an HTML document. This is a fairly complex task; using regular expressions, however, makes it simple. We want our parser to be generic enough to parse what amount to key/value pairs in any given input field. For instance, in the HTML:
<input type="text" name="address" size=30 maxlength = "100">
we would like to supply just the key name (e.g. type, name, size, etc.) and have the regex return that key's corresponding value (e.g. text, address, 30, etc.). Notice that some values have quotes and some don't. Some use white space around the equals sign and others don't. These are things we're going to have to account for in our regular expression. We also have to account for the parameters appearing in a different order. For instance this:
<input type="text" name="address" size=30 maxlength = "100">
is the same as this:
<input name="address" type="text" maxlength="100" size="30">
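The idea of looking a value up by key, whatever the attribute order or quoting, can be sketched in a few lines. This is only an illustration under stated assumptions: it uses the standard library's std::regex (C++11) rather than Regex++, the helper name attributeValue is hypothetical and not part of the sample project, and the key is assumed to contain no regex metacharacters.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not from the article's project): looks up the value
// of attribute `key` inside a single HTML tag. Returns true and fills
// `value` on success, false when the attribute is not present.
bool attributeValue(const std::string& tag, const std::string& key,
                    std::string& value)
{
    // \b keeps "name" from matching inside a longer attribute name.
    // Group 1 captures a double-quoted value, group 2 an unquoted one.
    // Assumption: `key` contains no regex metacharacters.
    const std::regex pattern(
        "\\b" + key + R"re(\s*=\s*(?:"([^"]*)"|([^\s"<>]+)))re",
        std::regex::ECMAScript | std::regex::icase);

    std::smatch match;
    if (!std::regex_search(tag, match, pattern))
        return false;

    value = match[1].matched ? match[1].str() : match[2].str();
    return true;
}
```

Calling attributeValue with the key size on either of the two tags above should yield 30, whether the value is quoted or not and wherever the attribute sits in the tag.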
In the sample application I build a single string from the HTML input file (the whole file is read into a CString variable). While this may cause problems on very large files, for our purposes we'll assume that the file is fairly small. We'll need the whole string in order to match across line barriers--but more on that later.

ParseFile Method

In the ParseFile method we:
Note: The code snippets in this article contain regular expressions that use escape characters. Because these are C/C++ string literals, the escape characters have to be escaped twice. That is, the regex whitespace escape character (\s) will actually look like this: \\s

BOOL CRegexTestDlg::ParseFile(CString filename)
{
    if (filename.IsEmpty())
    {
        return FALSE;
    }
    CString finalstring;
    this->m_mainEdit.SetWindowText("");
    CStdioFile htmlfile;
    CString inString = "";
    CString wholeFileString = "";
    std::string wholeFileStr = "";
    // Read entire file into a string.
    try
    {
        if (htmlfile.Open(filename, CFile::modeRead |
            CFile::typeText, NULL))
        {
            while (htmlfile.ReadString(inString))
            {
                wholeFileString += inString;
                // ReadString strips the line terminator; put it back so
                // attributes on adjacent lines do not run together.
                wholeFileString += "\n";
            }
            htmlfile.Close();
        }
    }
    // MFC throws file exceptions as pointers, so catch a pointer
    // and release it with Delete().
    catch (CFileException* e)
    {
        e->Delete();
        MessageBox("The file " + filename +
            " could not be opened for reading",
            "File Open Failed",
            MB_ICONERROR);
        return FALSE;
    }
    // Need to convert string to a STL string for use in RegEx
    // (assumes a non-Unicode build, as the rest of the sample does)
    wholeFileStr = (LPCTSTR)wholeFileString;
    // Create our regular expression object
    // TRUE means that we want a match to be case-insensitive
    boost::RegEx expr("(<\\s*(textarea|input|select)\\s*[^>]+>[^<>]*(</(select|textarea)>)?)",
        TRUE);
    // Create a vector to hold all matches
    std::vector<std::string> v;
    // Pass the vector and the STL string that holds the file contents
    // to the RegEx.Grep method.
    expr.Grep(v, wholeFileStr);
    // Create char array to hold actual type (e.g. input, select, textarea).
    char actualType[100];
    // vector v is now full of all matches. We iterate through them.
    for (size_t i = 0; i < v.size(); i++)
    {
        // Get the match at the current index
        const std::string& line = v[i];
        // Get a pointer to the beginning of the character array
        const char *buf = line.c_str();
        // Create some temporary storage variables
        char name[100];
        char typeName[100];
        // Build output string.
        finalstring += "input, textarea, select?: ";
        GetActualType(buf, actualType);
        finalstring += actualType;
        finalstring += " -- ";
        GetValue("name", buf, name);
        finalstring += "name: ";
        finalstring += name;
        finalstring += " -- ";
        finalstring += "input type is: ";
        // If it's an input, get the type of input
        // (e.g. text, password, checkbox, radio, etc.)
        if (_stricmp("input", actualType) == 0)
        {
            GetValue("type", buf, typeName);
            finalstring += typeName;
        }
        // Otherwise, it doesn't apply.
        else
        {
            finalstring += "N/A";
        }
        finalstring += "\r\n";
    }
    // Populate text field with items
    this->m_mainEdit.SetWindowText(finalstring);
    return TRUE;
}
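The double escaping described in the note before ParseFile can be checked directly. The sketch below is an assumption-laden illustration, not code from the sample project: it uses std::regex instead of Regex++, and it relies on C++11 raw string literals, which Visual Studio 6 and 7 do not support. The names escapedTwice, rawLiteral and containsFormTag are hypothetical.

```cpp
#include <regex>
#include <string>

// The same regular expression written two ways. In the first, every regex
// backslash is doubled because the C++ compiler consumes one level of
// escaping. In the second, a raw string literal (C++11, an assumption here)
// passes the text through untouched.
const std::string escapedTwice = "<\\s*(textarea|input|select)\\s*[^>]+>";
const std::string rawLiteral   = R"(<\s*(textarea|input|select)\s*[^>]+>)";

// Returns true when `text` contains an opening form-element tag.
bool containsFormTag(const std::string& text)
{
    static const std::regex pattern(
        escapedTwice, std::regex::ECMAScript | std::regex::icase);
    return std::regex_search(text, pattern);
}
```

Both strings hold exactly the same characters at run time, so either spelling produces the same regular expression. Forgetting one of the doubled backslashes, as in "\s", makes the compiler treat \s as an unrecognized C++ escape rather than passing it to the regex engine.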
In this method notice specifically the lines:

// Create our regular expression object
boost::RegEx expr("(<\\s*(textarea|input|select)\\s*[^>]+>[^<>]*(</(select|textarea)>)?)", TRUE);
// Create a vector to hold all matches
std::vector<std::string> v;
// Pass the vector and the STL string that holds the file contents
// to the RegEx.Grep method.
expr.Grep(v, wholeFileStr);

The expr object gets constructed with a pattern. I will break down the pattern as follows:

(<\s*                       // Match an opening "<" and zero or
                            // more white space characters
(textarea|input|select)     // 1. Match either textarea, input, or select
\s*                         // 2. Look for zero or more white space characters next
[^>]+>                      // 3. Match one or more characters that
                            //    are not a ">" until we find the closing ">"
[^<>]*                      // Match zero or more characters that are not
                            // "<" or ">"
(</(select|textarea)>)?)    // Match an end tag "</" followed by either select
                            // or textarea. The question mark means that everything
                            // inside the parentheses is optional (i.e. 0 or 1 occurrences).

Note: In the description above, escape characters are not escaped twice. This is the way the actual regular expression would look if you printed it out.

Just as a reminder the regex operators above mean:
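To experiment with the pattern outside the MFC project, something like the following could be used. It is a rough stand-in only: std::regex with a std::sregex_iterator loop takes the place of RegEx::Grep, and the function name findFormElements is hypothetical. The pattern itself is the one from the article.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical stand-in for the Grep call, written with std::regex:
// returns the full text of every form element matched in `html`.
std::vector<std::string> findFormElements(const std::string& html)
{
    // The article's pattern, as a raw string literal (no double escaping).
    static const std::regex pattern(
        R"((<\s*(textarea|input|select)\s*[^>]+>[^<>]*(</(select|textarea)>)?))",
        std::regex::ECMAScript | std::regex::icase);

    std::vector<std::string> matches;
    for (std::sregex_iterator it(html.begin(), html.end(), pattern), end;
         it != end; ++it)
    {
        matches.push_back(it->str());   // the whole match
    }
    return matches;
}
```

Because [^>] and [^<>] accept any character other than the angle brackets, a tag whose attributes are spread over several lines is still returned as a single match. Any text between the tag and the next "<" is swept into the match by [^<>]* as well.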
The Grep method takes a reference to the vector created above it. After the Grep call, the vector will contain all matches found. Using Grep() as opposed to Search() (which is another useful method) will allow you to match across line barriers. This is important for a file you read in--especially HTML files that allow for a fairly loose format. For instance this:
<input type="text" name="name">
is the same as this:
<input type="text"
name="name">
in any web browser. We need to account for this.

If you are wondering about case-sensitivity, look at the instantiation of the RegEx object. The second parameter is a boolean that indicates whether you would like matching to be case-insensitive--which we do in the example code. If you would like further information about the boost Regex++ library API, take a look at:

GetActualType Method

In the GetActualType method we extract the type of input field we're dealing with on the current line. Remember that in the ParseFile method we made sure that there was at least one input type of some sort, so this line is pretty much guaranteed to have one. Here is the method implementation:

BOOL CRegexTestDlg::GetActualType(const char *line, char *type)
{
    // Create a pattern to look for.
    const char* pattern = "<\\s*((input|textarea|select))\\s*";
    // Create RegEx object with pattern. Should be case-insensitive
    RegEx exp(pattern, TRUE);
    // Search for the pattern. Use Search, not Grep since we have a single line.
    if (exp.Search(line))
    {
        // If found, copy the text of the first expression match to the
        // type variable.
        strcpy(type, exp[1].c_str());
        return TRUE;
    }
    // We didn't find anything. Just copy an empty string.
    strcpy(type, "");
    return FALSE;
}

Take a look at the pattern itself:

const char* pattern = "<\\s*((input|textarea|select))\\s*";

Here we are saying: look for an opening angle bracket "<" and possibly some white space. Then look for either "input", "textarea", or "select". Then there may be some more white space. Notice the two sets of parentheses around the alternation (input|textarea|select). The outer set is the first capture group, so the matched tag name is available as exp[1], which is what the method copies into type. For example, given the line:
<input type= "text" name="email" size="20">
exp[1] contains input.
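The nested parentheses in the GetActualType pattern can be explored with a small sketch. As before, this is an illustration under assumptions rather than the project's code: std::regex stands in for Regex++, the function name actualType is hypothetical, and a std::string return value replaces the fixed-size char buffer.

```cpp
#include <regex>
#include <string>

// Hypothetical std::regex counterpart of GetActualType. Returns the tag
// name exactly as it appears in `line`, or an empty string if none is found.
std::string actualType(const std::string& line)
{
    static const std::regex pattern(
        R"(<\s*((input|textarea|select))\s*)",
        std::regex::ECMAScript | std::regex::icase);

    std::smatch match;
    if (std::regex_search(line, match, pattern))
    {
        // match[1] is the outer set of parentheses; match[2] is the inner
        // set and holds the same text in this pattern.
        return match[1].str();
    }
    return std::string();
}
```

Returning a std::string avoids having to guess a safe buffer size for strcpy. Since the match is case-insensitive, the result keeps whatever capitalization the HTML used, which is why ParseFile compares it with _stricmp.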
GetValue Method

In the GetValue method we pass in a key to look for and a pointer to the variable we want to populate with the value.

void CRegexTestDlg::GetValue(char *key, const char *str, char *val)
{
    const char* tmpStr = "\\s*=\\s*\\\"?([^\"<>]+)\\\"?";
    char final[100];
    // We need to build the string so we know exactly what we're looking for.
    strcpy(final, key);
    strcat(final, tmpStr);
    // Create the RegEx object with the pattern.
    RegEx exp(final);
    // Search for the pattern.
    if (exp.Search(str))
    {
        // If found, copy what we found.
        strcpy(val, exp[1].c_str());
    }
    else
    {
        // Otherwise copy the string "no<key>", where <key> is the key passed in.
        sprintf(val, "no%s", key);
    }
}

Take a look at this expression:

const char* tmpStr = "\\s*=\\s*\\\"?([^\"<>]+)\\\"?";

This is our most complex pattern yet. First we look for some possible whitespace, an equals sign, and some more possible whitespace. Then we look for an opening quote. The question mark means 0 or 1 of the previous expression, so if the HTML didn't include an opening quote, we are accounting for that. That is, if the line looked like either of the following (notice the quotation marks), it would still find a match:
<input type="text" name="email">
<input type=text name=email>
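One way to see how the optional quotes behave is to run the same value pattern against both lines. The sketch below assumes std::regex in place of Regex++, and valueWithArticlePattern is a hypothetical name. The quote is written without a backslash in the raw string, which describes the same expression. The key is assumed to contain no regex metacharacters.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: applies the article's value pattern to `tag` and
// returns capture group 1, or "no<key>" when nothing matches.
// Assumption: `key` contains no regex metacharacters.
std::string valueWithArticlePattern(const std::string& key,
                                    const std::string& tag)
{
    // Same expression as tmpStr, written as a raw string literal.
    const std::regex pattern(key + R"re(\s*=\s*"?([^"<>]+)"?)re");

    std::smatch match;
    if (std::regex_search(tag, match, pattern))
        return match[1].str();
    return "no" + key;
}
```

With the key name, both lines yield email. With the key type, the quoted line yields text, but the unquoted line yields text name=email: the character class does not exclude white space, so an unquoted value runs on until the next quote or angle bracket. Adding \s to the excluded characters would stop an unquoted value at the next space, at the cost of truncating quoted values that contain spaces.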
Next we're looking for one or more characters other than a quotation mark ("), an opening angle bracket (<), or a closing angle bracket (>). This is our value. Notice that there are parentheses around this part of the pattern because we want to capture the value into exp[1]. Finally we look for an optional closing quotation mark.

This is the end of our need for regular expressions. We now have the value we were looking for and can format it and output it in the list box. What you do with the values is up to you, but now you have all you need to parse HTML accurately and effectively. The example code may need some tweaking, but in general it gets the job done.

Running The Example

The example application I've included parses an HTML file that contains a form. For convenience's sake, I've included an HTML form file in the project. The filename is contact_form.html and it can be found in the root directory of the project. When you run the application, simply click the "Browse..." button and select this file. Then click "Try It!"

Conclusion

While we could have built our parser using strtok or other tokenizers, these are not ideal for HTML since HTML can be so free form (e.g. a space here, quotes there, but not there, line wrap, etc.). Regular expressions are well suited to just this sort of text parsing. Regex++ is a very robust regular expression library that you will find very useful in your applications. Take a look at the example project and familiarize yourself with regular expression syntax. This will give you the ability to create powerful text parsers with minimal coding and will enable you to "master your data".