|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Announcements
Want a new Job?
Chapters
Services
Feature Zones
|
IntroductionHave you ever wondered how regular expressions work in detail? If the answer is yes then this article is right for you. I will try to guide you step by step on how to create your own mini regular expression library. The purpose of the article is NOT to give you a highly optimized, tested and great regular expression library, but to explain principals behind the pattern matching in text files. If you only want a good library and don't care how it works, you probably want Boost regex library, which you can find here. I used a couple of different regular expression libraries but I must say that I am happiest with Boost regex. Obviously you decide it for yourself. There is another regular expressions article here on CP which could be found here. I must admit, I did not read the article completely but it seems to me that the article focuses mostly on how to use the regular expression library provided by the author. In this article, the author uses similar technique (I could be wrong though) to create a more sophisticated library. The article you are reading right now, could be seen as a prerequisite for the article I just mentioned above. Regular expressions are part of MS .NET library or Java SDK (if you write code in Java). As you can see, regular expressions are available in many different programming languages and technologies. Actually the article does not focus on writing the library in a specific language. I wrote the code in C++, using STL primarily because it is my favorite language/library, but the principles from the article could be applied to any language (obviously). I will try to be as language independent as possible, using pseudo code where ever possible. If you want the code in Java, please send me an email. The code provided here in this article is free (obviously) but if you like it and use it in your application, it would be great if you would give me the credit for what I deserve. Also please email me, so I can show off to my peers and/or potential employers. OverviewSo how are we going to do it? Well before we start with coding, it is necessary to explain the mathematical background needed to fully understand the method used here in this article. I would strongly recommend to read and understand the math behind, because once we overcome the math part, the rest will be very simple. Note however that I will not have any mathematical proofs. If you are interested in proofs, please check out the references, which could be found in References section of this article. Additionally, note that the regular expression parser, which we will create here will support these three operations:
However many additional operators can be simulated by combining these three operators. For instance:
The reason for implementing only three operators is simplicity. When I started planning to write this article, I quickly recognized that I had to limit myself in many ways. The topic is so large, that it would require a book to explain every little detail (maybe I will write it someday). As I stated above, the purpose of the article is not to equip you with a library but to introduce you to principles behind regular expressions. If you want to know more on how to use regular expressions, then you can check out the book: Mastering Regular Expressions - O'Reilly. Following is the overview of the article:
What is NFA?NFA stands for nondeterministic finite-state automata. NFA can be seen as a special kind of final state machine, which is in a sense an abstract model of a machine with a primitive internal memory. If you want to know more about finite-state machines, please refer to References section below. Let us look at the mathematical definition of NFA. An NFA
If we would explain the above definition to a 12 year old, we could say that an NFA is a set
An equivalent graph would be:
Looking at the table/graph above, we can see that there are special transitions called Epsilon transitions, which is one of the special features of NFA. A transition is an event of going from one state to another. Epsilon transition is a transition from one state to another on an empty string. In other words, we are going from one state to another on no character input. For example, as we can see from table/graph above, we can go on no input from In an NFA, like the mathematical definition defines, there is always a starting state. I used Graphvis, a great tool for drawing different kinds of graphs (see References section), for drawing of NFAs and DFAs (see later). Because Graphvis is laying out the nodes of a graph on its own, it seems to me that a starting state is always the state drawn at the top of the graph. So we will follow that convention. What is DFA?DFA stands for deterministic finite state automata. DFA is very closely related to NFA. As a matter of fact, the mathematical definition of NFA works for DFA too. But obviously there are differences, from which we will take advantage. One big difference is that in a DFA there are no Epsilon transitions. Additionally, for each state all transitions leading away from this state are done on different input characters. In other words, if we are in state Now that we understand NFAs and DFAs, we can proceed by saying that given any NFA, there is an equivalent DFA (You are going to have to trust me on this because I don't think it is appropriate to give you the mathematical proof of this statement). As humans, generally it is easier for us to construct an NFA, as well as interpret what language an NFA accepts. But why do we need the DFAs then? Well if we think about computers, it is very hard to "teach" them to do very well educated guesses (sometimes even we can't make smart educated guesses). And this is exactly what the computer needs to do, when traversing an NFA. If we would write an algorithm, which would use an NFA to check for an accepting combination of characters, it would involve backtracking to check for choices that it did not make previously. Obviously, there are regular expression parsers which work using NFAs, but they are generally slower than those that use DFAs. This is due to the fact that a DFA has a unique path for each accepting string, so no backtracking is involved. Hence, we are going to use a DFA to check if a combination of input characters is accepted by an automata. Note: If you really want to understand both NFAs and DFAs, I would recommend to do further reading on these topics. It is useful as an exercise to convert from one to another, to fully understand the difference and the algorithm used here to convert NFA to DFA (see later). Thompson's AlgorithmNow that we have all the mathematical background that we need to understand regular expressions, we need to start thinking about what is our goal. As a first step, we realize that we need a way of going from a regular expression (like
Using the basic elements above, we will construct three operations, which we would like to implement in our regular expression parser like the following:
But how do we go from something like
Here
As we can see, it is very similar to the evaluation of arithmetic expressions. The difference is that in regular expressions the star operation pops only one element from the stack and evaluates the star operator. Additionally, the concatenation operation is not denoted by any symbol, so we would have to detect it. The code provided with the article simplifies the problem by pre-processing the regular expression and inserting a character ASCII code 0x8 whenever a concatenation is detected. Obviously it is possible to do this "on the fly", during the evaluation of the regular expression, but I wanted to simplify the evaluation as much as possible. The pre-processing does nothing else but detects a combination of symbols that would result in concatenation, like for example:
void CAG_RegEx::Push(char chInput) { // Create 2 new states on the heap CAG_State *s0 = new CAG_State(++m_nNextStateID); CAG_State *s1 = new CAG_State(++m_nNextStateID); // Add the transition from s0->s1 on input character s0->AddTransition(chInput, s1); // Create a NFA from these 2 states FSA_TABLE NFATable; NFATable.push_back(s0); NFATable.push_back(s1); // push it onto the operand stack m_OperandStack.push(NFATable); // Add this character to the input character set m_InputSet.insert(chInput); } As we can see, the character is converted to a simple NFA and then the resulting NFA is added to the stack. Now back to the conversion from regular expression to NFA. Now that we know how to push the NFA onto the stack, the pop operation is trivial. Just retrieve the NFA from the stack and that's it. As I said earlier, a NFA table is defined to be a double ended queue (STL container BOOL CAG_RegEx::Concat()
{
// Pop 2 elements
FSA_TABLE A, B;
if(!Pop(B) || !Pop(A))
return FALSE;
// Now evaluate AB
// Basically take the last state from A
// and add an epsilon transition to the
// first state of B. Store the result into
// new NFA_TABLE and push it onto the stack
A[A.size()-1]->AddTransition(0, B[0]);
A.insert(A.end(), B.begin(), B.end());
// Push the result onto the stack
m_OperandStack.push(A);
return TRUE;
}
As we can see, the concatenation is popping two NFAs from the stack. First NFA is changed, so that it is now new NFA, which is then pushed on the stack. Note that we first pop second operand. This is the case because in regular expressions, the order of operands is of importance because AB != BA (not commutative). BOOL CAG_RegEx::Star()
{
// Pop 1 element
FSA_TABLE A, B;
if(!Pop(A))
return FALSE;
// Now evaluate A*
// Create 2 new states which will be inserted
// at each end of deque. Also take A and make
// a epsilon transition from last to the first
// state in the queue. Add epsilon transition
// between two new states so that the one inserted
// at the begin will be the source and the one
// inserted at the end will be the destination
CAG_State *pStartState = new CAG_State(++m_nNextStateID);
CAG_State *pEndState = new CAG_State(++m_nNextStateID);
pStartState->AddTransition(0, pEndState);
// add epsilon transition from start state to the first state of A
pStartState->AddTransition(0, A[0]);
// add epsilon transition from A last state to end state
A[A.size()-1]->AddTransition(0, pEndState);
// From A last to A first state
A[A.size()-1]->AddTransition(0, A[0]);
// construct new DFA and store it onto the stack
A.push_back(pEndState);
A.push_front(pStartState);
// Push the result onto the stack
m_OperandStack.push(A);
return TRUE;
}
Star operator pops a single element from the stack, changes it according to the Thompson's rule (see above) and then pushes it on the stack. BOOL CAG_RegEx::Union()
{
// Pop 2 elements
FSA_TABLE A, B;
if(!Pop(B) || !Pop(A))
return FALSE;
// Now evaluate A|B
// Create 2 new states, a start state and
// a end state. Create epsilon transition from
// start state to the start states of A and B
// Create epsilon transition from the end
// states of A and B to the new end state
CAG_State *pStartState = new CAG_State(++m_nNextStateID);
CAG_State *pEndState = new CAG_State(++m_nNextStateID);
pStartState->AddTransition(0, A[0]);
pStartState->AddTransition(0, B[0]);
A[A.size()-1]->AddTransition(0, pEndState);
B
Finally, the union pops two elements, makes the transformation and pushes the result on the stack. Note that here we have to watch for the order of the operation. Finally, we are now able to evaluate the regular expression. If everything goes well, we will have a single NFA on the stack, which will be our resulting NFA. Here is the code, which utilizes the above functions. BOOL CAG_RegEx::CreateNFA(string strRegEx) { // Parse regular expresion using similar // method to evaluate arithmetic expressions // But first we will detect concatenation and // insert char(8) at the position where // concatenation needs to occur strRegEx = ConcatExpand(strRegEx); for(int i=0; i<strRegEx.size(); ++i) { // get the charcter char c = strRegEx[i]; if(IsInput(c)) Push(c); else if(m_OperatorStack.empty()) m_OperatorStack.push(c); else if(IsLeftParanthesis(c)) m_OperatorStack.push(c); else if(IsRightParanthesis(c)) { // Evaluate everyting in paranthesis while(!IsLeftParanthesis(m_OperatorStack.top())) if(!Eval()) return FALSE; // Remove left paranthesis after the evaluation m_OperatorStack.pop(); } else { while(!m_OperatorStack.empty() && Presedence(c, m_OperatorStack.top())) if(!Eval()) return FALSE; m_OperatorStack.push(c); } } // Evaluate the rest of operators while(!m_OperatorStack.empty()) if(!Eval()) return FALSE; // Pop the result from the stack if(!Pop(m_NFATable)) return FALSE; // Last NFA state is always accepting state m_NFATable[m_NFATable.size()-1]->m_bAcceptingState = TRUE; return TRUE; } Function Subset Construction AlgorithmNow that we know how to convert any regular expression to an NFA, the next step is to convert NFA to DFA. At first, this process seems to be very challenging. We have a graph with zero or more Epsilon transitions, and multiple transitions on single character and we need an equivalent graph with no Epsilon transitions and a unique path for each accepted sequence of input characters. Like I said, it seems to be very challenging, but it is really not. Mathematicians actually already solved that problem for us, and then using the results, computer scientists created the Subset Construction Algorithm. I am not sure whom to give credit here but the Subset Construction Algorithm goes like this: First, let us define 2 functions:
Now using these 2 functions, we can perform the transformation:
Simple enough? If not, then read further. Following is the pseudo code found on the pages 118-121 of the book "Compilers - Principles, Techniques and Tools" by Aho, Sethi and Ullman. The algorithm below is the equivalent to the algorithm above but expressed in a different way. First, let's define the S EpsilonClosure(T)
{
push all states in T onto the stack
initialize result to T
while stack is not empty
{
pop t from the stack
for each state u with an edge from t to u labeled epsilon
{
if u is not in EpsilonClosure(T)
{
add u to result
push u onto the stack
}
}
}
return result
}
Basically, what this function does is, goes through all the states in
If we would call Epsilon transition on a set of states Now that we know how the Epsilon transition works, let us look at the pseudo code to transform an NFA to a DFA: D-States = EpsilonClosure(NFA Start State) and it is unmarked while there are any unmarked states in D-States { mark T for each input symbol a { U = EpsilonClosure(Move(T, a)); if U is not in D-States { add U as an unmarked state to D-States } DTran[T,a] = U } } Finally the Before we go to the next step, let us convert an NFA to a DFA by hand, using this process. If you want to master this process, I would strongly suggest that you perform more similar transformations using this method. Let's convert the following NFA to its corresponding DFA using subset construction algorithm:
Using the subset construction algorithm, we would do following (Each newly created state will be bolded):
There are no new states, so we are done. Following is the drawing of the DFA:
The starting state is DFA OptimizationNow that we have all the knowledge to convert a regular expression into an NFA and then convert NFA to an equivalent DFA, we actually could stop at this point and use it for patterns matching. Originally when I planned to write this article, in order to keep it as simple as possible showing only principles, DFA optimization was not taken into account. But then it occurred to me that, first of all for large regular expressions, we are creating very large NFAs (by the nature of Thompson's algorithm), which in turn occasionally creates complex DFAs. If we would search for patterns, this might slow us considerably down, so I decided to include the optimization as a part of the regular expression parser. The optimization here is not a complicated one. So let's look at the following example:
If we look at this DFA, we notice that state
So the result is following:
The DFA above, definitely seems to be smaller than the the previous one. I will still call this a DFA, despite the fact that it is not really a DFA. Using the results from parts aboveFinally we are ready to use all of the parts from above, to match some text patterns. Once we have the DFA, all we need to do is to take an input character and run it against the starting state. Here is the pseudo code to do that: Find(string s) { for each character c in s { for each active state s { if there is transition from s on c { go to the next state t if t is accepting state { record the pattern } mark t as active } mark s as inactive } if there is transition from starting state on c { go to the next state s if s is accepting state { record the pattern } mark s as active } } } The code above can be found in the At this point, for the completeness of the article, I must note that there is a way of converting a regular expression directly into a DFA. This method is not explained here yet, but if time permits, it will be in future articles (or article updates). Additionally, there are different ways of constructing an NFA from regular expressions. Final WordsWell, that's it! I hope you enjoyed reading the article as much as I enjoyed writing it. Please use the demo code in any kind of applications, but give me the credit where deserved. If you want to build a more complete library, using the demo code presented here, please send me a copy of your additions. Note: The class References & Tools Used
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||