|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Announcements
Chapters
Services
Feature Zones
|
IntroductionI wanted to know how a regular expression parser works. So I did some Googling and found some cool articles that describe the process of how regular expressions finds a match. I listed the articles in the reference section of this article. I have implemented this parser based on my research. I will not go too much into describing the process and the theory behind the regular expression, since the articles in reference section covers this very well (the topic of regular expressions is huge and will require a book to explain thoroughly). In this article, I will simply show an implementation of a simple Regular Expression parser (or Mini Regular Expression Parser). I will go on using the terms Automata, NFA, DFA, Minimum DFA, state, transitions, and epsilon transition. If you do not understand these terms, I highly recommend you to read up on some the articles in reference section before continuing. So you ask, "why this one?" This implementation is done step-by-step, so makes it easy for someone wanting to learn how regular expressions work. Other features:
So I wanted to share the implementation with CP users. BTW: This is my first submission to CP, hope you like it. FeaturesTable below shows the symbols the parser supports.
Some parsers use a dot(.) to denote any single character, instead of _(underscore). If you want, you can change this in the source code by simply changing the value of the constant that is defined for this. You can control greediness of the parser by setting the The parser validates the input pattern string for its correctness. If encounters error in the syntax, it will report error with details information (i.e., error position, length, and type of error) accurately. Using the CodeThe algorithm in this parser follows the lecture notes of Mr. Mike Clark. Unfortunately, he had taken the notes offline since I downloaded them. So I am making them available here.
I find these notes rather simple and very easy to understand. And I structured my code according to the steps mentioned in these notes. If sometimes you find it difficult to understand what my code is doing (I hope you won't), please read one of these relevant notes. This parser is written in C# using Visual Studio 2005. Below is the partial class diagram with key classes of the component.
The
You will find a description of methods of the classes in the source code as comments. Sequence diagram below shows how to use the parser. The code snippet below shows how you can use the private void TestMatch()
{
RegEx re = new RegEx();
string sPattern = "a_*p";
ErrorCode errCode = re.Compile(sPattern);
if (errCode != ErrorCode.ERR_SUCCESS)
{
string sErrSubstring = sPattern.Substring(re.GetLastErrorPosition(),
re.GetLastErrorLength());
string sFormat =
"Error occured during compilation.\nCode: {0}\nAt: {1}\nSubstring: {2}";
sFormat = String.Format(sFormat, errCode.ToString(),
m_regEx.GetLastErrorPosition(), sErrSubstring);
MessageBox.Show(sFormat);
return;
}
int nFoundStart = -1;
int nFoundEnd = -1;
string sToSearch = "appleandpotato";
bool bFound = m_regEx.FindMatch(sToSearch, 0, sToSearch.Length - 1,
ref nFoundStart, ref nFoundEnd);
if (bFound)
{
int nMatchLength = nFoundEnd - nFoundStart + 1;
if (nMatchLength == 0)
{
MessageBox.Show("Matched an empty string at position " +
nFoundStart.ToString() + ".");
}
else
{
string sMatchString = sToSearch.Substring(nFoundStart, nMatchLength);
MessageBox.Show("Match found at: " + nFoundStart.ToString() + "\n" +
sMatchString);
}
}
else
{
MessageBox.Show("No match found.");
}
}
Points of InterestThe NFA models for quantifiers *, + , and ? can be found in the articles I mentioned. When I was implementing the parser, I had a lot of trouble with a couple of transitions:
I did not find information regarding these transitions during my Googling. After much trial and error, I came up with the NFA models that work fine. Using these models, you do not have to modify the original algorithm at all. This "AnyChar" transition is handled in the The complement of character set uses a "AnyChar" transition and a "Dummy" transition. If the current state uses a transition that is forbidden (i.e., A in [^A] ), it ends up in a state that has only one transition going away from it — that is the "Dummy" transition and that state is not an accepting state. A "Dummy" transition is NEVER used in the actual process and thus the parser reaches a dead-end state, effectively resulting in a mismatch. If the current state does not have any transition over the current input symbol, it uses the "AnyChar" transition and ends up in accepting state. Effectively matching the correct sub/string. Also, the public bool FindMatch(string sSearchIn,
int nSearchStartAt,
int nSearchEndAt,
ref int nFoundBeginAt,
ref int nFoundEndAt)
{
if (m_stateStartDfaM == null)
{
return false;
}
if (nSearchStartAt < 0)
{
return false;
}
State stateStart = m_stateStartDfaM;
nFoundBeginAt = -1;
nFoundEndAt = -1;
bool bAccepted = false;
State toState = null;
State stateCurr = stateStart;
while (nSearchStartAt <= nSearchEndAt)
{
char chInputSymbol = sSearchIn[nSearchStartAt];
toState = stateCurr.GetSingleTransition(chInputSymbol.ToString());
if (toState == null)
{
toState = stateCurr.GetSingleTransition(MetaSymbol.ANY_ONE_CHAR_TRANS);
}
if (toState != null)
{
if (nFoundBeginAt == -1)
{
nFoundBeginAt = nSearchStartAt;
}
if (toState.AcceptingState)
{
bAccepted = true;
nFoundEndAt = nSearchStartAt;
}
stateCurr = toState;
nSearchStartAt++;
}
else
{
if (!bAccepted) // we reset everything
{
nSearchStartAt = (
nFoundBeginAt != -1 ? nFoundBeginAt + 1 : nSearchStartAt + 1);
nFoundBeginAt = -1;
nFoundEndAt = -1;
stateCurr = stateStart; // start from begining
}
else
{
break;
}
}
} // end of while..loop
return bAccepted;
}
But with those features added, this method gets little more logic oriented. The current implementation looks like this: public bool FindMatch(string sSearchIn,
int nSearchStartAt,
int nSearchEndAt,
ref int nFoundBeginAt,
ref int nFoundEndAt)
{
if (m_stateStartDfaM == null)
{
return false;
}
if (nSearchStartAt < 0)
{
return false;
}
State stateStart = m_stateStartDfaM;
nFoundBeginAt = -1;
nFoundEndAt = -1;
bool bAccepted = false;
State toState = null;
State stateCurr = stateStart;
int nIndex = nSearchStartAt;
int nSearchUpTo = nSearchEndAt;
while (nIndex <= nSearchUpTo)
{
if (m_bGreedy && IsWildCard(stateCurr) == true)
{
if (nFoundBeginAt == -1)
{
nFoundBeginAt = nIndex;
}
ProcessWildCard(stateCurr, sSearchIn, ref nIndex, nSearchUpTo);
}
char chInputSymbol = sSearchIn[nIndex];
toState = stateCurr.GetSingleTransition(chInputSymbol.ToString());
if (toState == null)
{
toState = stateCurr.GetSingleTransition(MetaSymbol.ANY_ONE_CHAR_TRANS);
}
if (toState != null)
{
if (nFoundBeginAt == -1)
{
nFoundBeginAt = nIndex;
}
if (toState.AcceptingState)
{
// then we ignore the accepting state
if (m_bMatchAtEnd && nIndex != nSearchUpTo)
{
//toState = stateStart ;
}
else
{
bAccepted = true;
nFoundEndAt = nIndex;
if (m_bGreedy == false)
{
break;
}
}
}
stateCurr = toState;
nIndex++;
}
else
{
if (!m_bMatchAtStart && !bAccepted ) // we reset everything
{
nIndex = (nFoundBeginAt != -1 ? nFoundBeginAt + 1 : nIndex + 1);
nFoundBeginAt = -1;
nFoundEndAt = -1;
//nIndex++;
stateCurr = stateStart; // start from begining
}
else
{
break;
}
}
} // end of while..loop
if (!bAccepted)
{
if (stateStart.AcceptingState == false)
{
return false;
}
else // matched an empty string
{
nFoundBeginAt = nSearchStartAt;
nFoundEndAt = nFoundBeginAt - 1;
return true;
}
}
return true;
}
Initially I fought myself trying not to add any additional logic in the Final WordThis is a simple implementation. More features can be added to this demo code if you want. If you do add more features to this demo code presented here, I would very much like to have a copy of that code if it is OK with you. You can use the code anyway you like under the CP license (CPOL). I don't think it would be too difficult to port this demo code to C++ if you want. References
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||