Click here to Skip to main content
Click here to Skip to main content

Fast regular expressions

By , 29 Oct 2000
 

Sample Image - RexSearch.jpg

Introduction

Regular expressions are a well recognized way for describing string patterns. The following regular expression defines a floating point number with a (possibly empty) integer part, a non empty fractional part and an optional exponent:

[0-9]* \.[0-9]+ ([Ee](\+|-)?[0-9]+)?

The rules for interpreting and constructing such regular expressions are explained below. A regular expression parser takes a regular expression and a source string as arguments and returns the source position of the first match. Regular expression parsers either interpret the search pattern at runtime or they compile the regular expression into an efficient internal form (known as deterministic finite automaton). The regular expression parser described here belongs to the second category. Besides being quite fast, it also supports dictionaries of regular expressions. With the definitions $Int= [0-9], $Frac= \.[0-9]+ and $Exp= ([Ee](\+|-)?[0-9]+), the above regular expression for a floating point number can be abbreviated to $Int* $Frac $Exp?.

Interface

I separated algorithmic from interface issues. The files RexAlgorithm.h and RexAlgorithm.cpp implement the regular expression parser using only standard C++ (relying on STL), whereas the file RexInterface.h and RexInterface.cpp contain the interfaces for the end user. Currently there is only one interface, implemented in the class REXI_Search. Interfaces for replace functionality and for programming language scanners are planned for future releases.

struct REXI_DefErr{
        enum{eNoErr,eErrInName,eErrInRegExp} eErrCode;
        string  strErrMsg;
        int     nErrOffset;
};
class REXI_Search : public REXI_Base
{ 
public:
    REXI_Search(char cEos='\0');

    REXI_DefErr
            AddRegDef   (string strName,string strRegExp);
    inline  REXI_DefErr  
            SetRegexp  (string strRegExp);
    bool    MatchHere   (const char*& rpcszSrc, int& nMatchLen,bool& bEos);
    bool    Find        (const char*& rpcszSrc, int& nMatchLen,bool& bEos);
private:
    bool    MatchHereImpl();
    int     m_nIdAnswer;
};

Example usage

int main(int argc, char* argv[])
{
    const char szTestSrc[]= "3.1415 is the same as 31415e-4";
    const int ncOk= REXI_DefErr::eNoErr;
    
    REXI_Search rexs; 
    REXI_DefErr err;
    err= rexs.AddRegDef("$Int","[0-9]+");  assert(err.eErrCode==ncOk);
    err= rexs.AddRegDef("$Frac","\\.[0-9]+"); assert(err.eErrCode==ncOk);
    err= rexs.AddRegDef("$Exp","([Ee](\\+|-)?[0-9]+)"); 
                                assert(err.eErrCode==ncOk);
    err= rexs.SetRegexp("($Int? $Frac $Exp?|$Int \\. $Exp?|$Int $Exp)[fFlL]?");
                                assert(err.eErrCode==ncOk);

    const char*     pCur= szTestSrc;
    int             nMatchLen;
    bool            bEosFound= false;
    cout    <<  "Source text is: \""    <<  szTestSrc   << "\"" <<  endl;
    while(rexs.Find(pCur,nMatchLen,bEosFound)){
           cout <<  "Floating point number found  at position "       
                <<  ((pCur-szTestSrc)-nMatchLen)    
                <<  " having length "  <<  nMatchLen  <<  endl;
    }
    int i;
    cin >> i;
    return 0;
}

Performance issues

A call to the member function REXI_Search::SetRegexp(strRegExp)involves quite a lot of computing. The regular expression strRegExp is analyzed and after several steps transformed into a compiled form. Because of this preprocessing work, which is not needed in the case of an interpreting regular expression parser, this regular expression parser shows its efficiency only when you apply it to large input strings or if you are searching again and again for the same regular expression. A typical application which profits from the preprocessing needed by this parser is a utility which searches all files in a directory.

Limitations

Currently Unicode is not supported. There is no fundamental reason for this limitation and I think that a later release will correct this. I just did not yet find an efficient representation of a compiled regular expression which supports Unicode.

Constructing regular expressions

Regular expressions can be built from characters and special symbols. There are some similarities between regular expressions and arithmetic expressions. The most basic elements of arithmetic expressions are numbers and expressions enclosed in parens ( ). The most basic elements of regular expressions are characters, regular expressions enclosed in parens ( ) and character sets. On the next higher level, arithmetic expressions have '*' and '/' operators, whereas regular expressions have operators indicating the multiplicity of the preceding element.

Most basic elements of regular expressions

  • Individual characters. e.g. "h" is a regular expression. In the string "this home" it matches the beginning of 'home'. For non printable characters, one has to use either the notation \xhh where h means a hexadecimal digit or one of the escape sequences \n \r \t \v known from "C". Because the characters * + ? . | [ ] ( ) - $ ^ have a special meaning in regular expressions, escape sequences must also be used to specify these characters literally: \*  \+  \?  \.  \|  \[  \]  \(  \)  \-  \$  \^ . Furthermore, use '\ ' to indicate a space, because this implementation skips spaces in order to support a more readable style.
  • Character sets enclosed in square brackets [ ]. e.g. "[A-Za-z_$]" matches any alphabetic character, the underscore and the dollar sign (the dash (-) indicates a range), e.g. [A-Za-z$_] matches "B", "b", "_", "$" and so on. A ^ immediately following the [ of a character set means 'form the inverse character set'. e.g. "[^0-9A-Za-z]" matches non-alphanumeric characters.
  • Expressions enclosed in round parens ( ). Any regular expression can be used on the lowest level by enclosing it in round brackets.
  • the dot . It means 'match any character'.
  • an identifier prefixed by a $. It refers to an already defined regular expression. e.g. "$Ident" stands for a user defined regular expression previously defined. Think of it as a regular expression enclosed in round parens, which has a name.

Operators indicating the multiplicity of the preceding element

Any of the above five basic regular expressions can be followed by one of the special characters * + ? /i

  • * meaning repetition (possibly zero times); e.g. "[0-9]*" not only matches "8" but also "87576" and even the empty string "".
  • + meaning at least one occurrence; e.g. "[0-9]+" matches "8", "9185278", but not the empty string.
  • ? meaning at most one occurrence; e.g. "[$_A-Z]?" matches "_", "U", "$", .. and ""
  • \i meaning ignore case

Catenation of regular expressions

The regular expressions described above can be catenated to form longer regular expressions. E.g. "[_A-Za-z][_A-Za-z0-9]*" is a regular expression which matches any identifier of the programming language "C", namely the first character must be alphabetic or an underscore and the following characters must be alphanumeric or an underscore. "[0-9]*\.[0-9]+" describes a floating point number with an arbitrary number of digits before the decimal point and at least one digit following the decimal point. (The decimal point must be preceded by a backslash, otherwise the dot would mean 'accept any character at this place'). "(Hallo (,how are you\?)?)\i" matches "Hallo" as well as "Hallo, how are you?" in a case insensitive way.

Alternative regular expressions

Finally - on the top level - regular expressions can be separated by the | character. The two regular expressions on the left and right side of the | are alternatives, meaning that either the left expression or the right expression should match the source text. E.g. "[0-9]+ | [A-Za-z_][A-Za-z_0-9]*" matches either an integer or a "C"-identifier.

A complex example

The programming language "C" defines a floating point constant in the following way: A floating point constant has the following parts: An integer part, a decimal point, a fraction, an exponential part beginning with e or E followed by an optional sign and digits and an optional type suffix formed by one the characters f, F, l, L. Either the integer part or the fractional part can be absent (but not both). Either the decimal point or the exponential part can be absent (but not both).

The corresponding regular expression is quite complex, but it can be simplified by using the following definitions:

$Int = "[0-9]+."
$Frac= "\.[0-9]+".
$Exp = "([Ee](\+|-)?[0-9]+)".

So we get the following expression for a floating point constant:

($Int? $Frac $Exp?|$Int \. $Exp?|$Int $Exp)[fFlL]?

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here

About the Author

Martin Holzherr
Switzerland Switzerland
Member
No Biography provided

Sign Up to vote   Poor Excellent
Add a reason or comment to your vote: x
Votes of 3 or less require a comment

Comments and Discussions

 
You must Sign In to use this message board.
Search this forum  
    Spacing  Noise  Layout  Per page   
GeneralBug in ignore case- only works if expression is uppermembermpietrasa4 Jan '11 - 4:12 
change ::MakeIgnoreCase from
     if( (*it)->m_transitions[i].m_transChars[j] && isupper(j) ){
        (*it)->m_transitions[i].m_transChars[(unsigned char)(tolower(j))]= true;
     }
to
     if( (*it)->m_transitions[i].m_transChars[j] ){
        if(isupper(j))
           (*it)->m_transitions[i].m_transChars[(unsigned char)(tolower(j))]= true;
        else if(islower(j))
           (*it)->m_transitions[i].m_transChars[(unsigned char)(toupper(j))]= true;
     }

GeneralBug in REXA_Scanner::GetCharmemberMember 274598416 Nov '10 - 2:04 
There is a bug in the trasformation from hex code to char in:
 
char               REXA_Scanner::GetChar(const char*& rpFrom,bool& rbIsEscape)
 
you have to change
hexChar+= isdigit(rpFrom[i])?(rpFrom[i]-'0'):(toupper(rpFrom[i])-'A'+16);
in
hexChar+= isdigit(rpFrom[i])?(rpFrom[i]-'0'):(toupper(rpFrom[i])-'A'+10);
QuestionWhat licence is this code released under?memberRay9510527 Mar '09 - 8:22 
Please state explicitly, for the peace of mind of people using this code. Smile | :)
AnswerRe: What licence is this code released under?memberMartin.Holzherr29 Mar '09 - 20:42 
You can assume the CPOL licence as described in Licenses section of the code project
GeneralProblem runningmemberMizan Rahman4 Nov '08 - 23:40 
Hi,
 
I have several question for you Smile | :) .
 
#1
I am trying to run this demo in VS2005. I am getting compile time error.
Would it be possible for you to upgrade your source to VS2005? I would very much appriciate it.
 
#2
when using a named regex, do you actually replace the pattern string with value of the named? For example,
err= rexs.AddRegDef("$Def1","A|B");
err= rexs.AddRegDef("$Def2","C|D";
rexs.SetRegexp("$Def1|$Def2")

In this case, do you actually re-construct the string "$Def1|$Def2" to "A|B|C|D" ? or how do you do this?
 
#3
In the following scnerio, will it match the entire input string?
 
    const char szTestSrc[]= "ABC";
    
    REXI_Search rexs; 
 
    rexs.AddRegDef("$Def1","BC");
    rexs.AddRegDef("$Def2","A(B|$Def1)C"; 
    rexs.SetRegexp("$Def2");
 
    const char*     pCur= szTestSrc;
    int             nMatchLen;
    bool            bEosFound= false;
    cout    <<  "Source text is: \""    <<  szTestSrc   << "\"" <<  endl;
    while(rexs.Find(pCur,nMatchLen,bEosFound)){
           cout <<  "Floating point number found  at position "       
                <<  ((pCur-szTestSrc)-nMatchLen)    
                <<  " having length "  <<  nMatchLen  <<  endl;
If matches the entire source string, then how does it decide which alternate to pick (i.e., B or $Def1 in $Def2)?
 
Thank you for your time.
Mizan
GeneralRe: Problem runningmemberMartin.Holzherr5 Nov '08 - 4:26 
Answer to your questions:
#1) I will not update it to VS2005. The main reason being, that there is a known error when concatenating regexpes (internally). It is too long a time ago that I looked into the code, so a fix would take me too much time.
#2) Yes "$Def1|$Def2" is equivalent to "(A|B)|(C|D)"
#3) It should match the input, but the decision which alternate to pick is based on set union and other set operations (quite complex).
===========
Bottom Line: Efficient regular epression interpretation is based on complex set operations.
There is no simple algorithmic direct approach for interpreting regular expressions which is also efficient. The only way for an efficient interpretation of a regular epxressions uses the transformation from an ndfa to a dfa (definite finite automaton) as is done in this project (but with at least one serious error which has not been corrected).
GeneralRe: Problem runningmemberMizan Rahman5 Nov '08 - 21:56 
Hi again,
 
Thank you for your response.
 
As to #2, so you actually re-construct the pattern string and then use the reconstructed string to create DFA, am I correct here?
 
Is this mean that when the following code is executed, it just stores the pattern in a key-value paired in memory for later use and does NOT actually construct the DFA right away?
rexs.AddRegDef("$Def1","A|B");
 
Regards,
Mizan
GeneralRe: Problem runningmemberMartin.Holzherr5 Nov '08 - 23:26 
Yes I think so. Perhaps an intermediate representation of the state machine is already generated. But the result should be the same as a text replacement.
GeneralRe: Problem runningmemberMizan Rahman6 Nov '08 - 5:02 
Yes, I understand the fact that the text replacement will give me the same result.
 
But I am looking for a different solution than text replacment.
 
What about the idea of having seperate DFA model for each (relatively) smaller pattern in memory and being able to use the model from another pattern's model and thus creating a complex language recognizer? How dose that work? Like EBNF or BNF.
 
For example (psuedo):
 
Rule1 = "A|B|C|D"
Rule2 = "<" Rule1+ ">"
Rule3 = Rule2 (Rule1)*
 
(...I think you got the idea)
 
Here if I do text replacement it could become a huge regular expression, escentilly running into perfomance issue.
 
I would think that it should be possible to construct DFA for each rule seperatly and use them during the match (some how Smile | :) ).
 
What are your thoughts on this?
 
Regards,
Mizan
GeneralRe: Problem runningmemberMizan Rahman6 Nov '08 - 10:52 
Nevermind, now I have figured it out.
 
Thanks.
Mizan
GeneralRe: Problem runningmemberMartin.Holzherr6 Nov '08 - 20:55 
You wrote: if I do text replacement it could become a huge regular expression, escentilly running into perfomance issue.
This is true for compilation performance. A more complex expressions uses more compilation resources in order to translate it from a dfa to a nfda.
But runtime performance should still be very good because the generated ndfa never backtracks.
But In any case, text replacement cannot support recursiveness and therfore cannot support grammar rules as in your example above. Furthermore the theoretical model of a pure regexp exclused recursion.
Your proposal, which seems to support recursiveness (normal grammar rules)
would result in a parser framework where individual grammar rules would follow the regexp idea but where full recursiveness would be supported between rules. This is more or less what PERL6 has realized. In PERL 6 it will be possible to write grammars as part of the language and each grammar rule would allow the possiblitities of a regular expression.
Furthermore the same idea has been implemented in boost as a library from Erich Niebler and is called Xpressive
GeneralRe: Problem runningmemberMartin.Holzherr6 Nov '08 - 21:15 
Your idea above would result in a parser framework based on regular expressions.
As I wrote in my last answer this is possible and has been done in Pearl 6 and the boost library XPressive.
In my opinion a better approach to implement a parser framework which gives some regexp feeling are Parsing Expression Grammars.
The PEG framework supports lean, efficient and very powerful parsers, which can cope with any grammar. A PEG parser e.g. can parse C# or C++. The dominant parsing framework currently used is based on LALR grammars. LALR grammars can parse most of C# and C++ but not all. Writing a C# or C++ parser with an LALR parser requires break outs from the parser or modifications of the scanner in order to cope with grammar elements breaking the LALR framework.
A PEG parser on the other hand can peek forward (operators & and !) and is therefore very powerful. Any language grammar in use today can easily be covered by a PEG parser.
See also at (http://www.codeproject.com/KB/recipes/grammar_support_1.aspx[^])
or at (http://en.wikipedia.org/wiki/Parsing_expression_grammar[^])
GeneralRe: Problem runningmemberMizan Rahman6 Nov '08 - 22:20 
Thank you so much for the info.
BTW I like your new article Smile | :) . I will follow it.
 
/Mizan
GeneralC# implementationmemberMizan Rahman21 Aug '08 - 10:30 
Hi,
 
I have recently posted an article "An implemenation of regurlar expression in C#" in code project at
http://www.codeproject.com/KB/recipes/re_expression_parser.aspx[^]
Thanks,
Mizan
www.vds.us.com
NewsThere's a new version of the RegEx Tester Tool !memberBucanerO_Slacker1 Mar '08 - 23:40 
I have released a new version of the RegEx Tester tool. You can download it free from http://www.codeproject.com/KB/string/regextester.aspx and http://sourceforge.net/projects/regextester
 
With RegEx Tester you can fully develop and test your regular expression against a target text. It's UI is designed to aid you in the RegEx developing. It uses and supports ALL of the features available in the .NET RegEx Class.
Generaltrouble with charactermemberChrisG411 Jul '07 - 4:52 
hi a followup to a post I made before... i need to enhance my regular expression to include
 
a href or &lt; a href
 
does anyone know how to do this? i've tried:
 
(<|&lt;\\ *a\\ *href\\ *\\=\\ *[\"'][^\"']*)
 
([<|&lt;]\\ *a\\ *href\\ *\\=\\ *[\"'][^\"']*)
 
and even (&lt;\\ *a\\ *href\\ *\\=\\ *[\"'][^\"']*), just to make sure the < was being detected, but it was not.
 
Can anyone help??
 


Generalignore casememberChrisG411 Jul '07 - 3:54 
possibly a dumb question but I have constructed this regular expression:
 
(<\\ *a\\ *href\\ *\\=\\ *[\"'][^\"']*)
 
I use it to find all occurances of a href=....... in a html page. however I need to set the flag to ignore the case of the "a" and "href" where would I place the \\i??
 
Thanks in advance,
 
Chris.
QuestionHow about this problemmemberwaldermort23 Sep '06 - 10:21 
Is it possible to find a sequence of increasing values? Lets say I have a string "B1B2-T6T7T8-W9W9", I need to be able to find the instance of "T6T7T8" but only if the numerical values are in sequence.
 
Taking the same example string above above, is it possible to find one or the other, but not the third? For example 'B* OR T* NOT W*' OR 'B* OR W* NOT T*'.
 
I'm sorry if I can't explain exactly what I want to do clear enough, it's difficult enough to grasp the concept in my own head. I'm creating a game (similar to a card game) for which a winning hand compromises of 14 cards. 4 sets of 3(either sequence or identical) and 1 pair. There are regional variations aswell as alternate scoring schemes for this game, so I would like to place a regex string in a .xml file to allow for easy manipulation of the rules/scoring. Each set, depending on the cards it contains carries a different score. Different combinations of sets produce different scores.
 
I would really appreciate any help getting this scheme up and running, though I have a terrible feeling I am going to have to produce my own regex parser (something I'm not looking forward to).
General^$ once againmemberChemBuddy13 Sep '06 - 12:48 
I have seen someone already asked about these (beginning/end of line). Any chance that they will be implemented? Otherwise REXI_Search fits perfectly my needs - but I need ^$ badly Frown | :(
 
See me at www.chembuddy.com.
GeneralRe: ^$ once againmemberMartin.Holzherr13 Sep '06 - 22:28 
he end of line can easily be searched with [\r]?\n
 
REXI_Search generally searches a pattern even if it crosses line boundaries.
 
The advantage of REXI_Search compared to other regular expression parsers is its speed at the coast of quite a long initial processing time.
But there is also a known serious error in concatenating states, which leads to missing matches in some complex patterns.
The correction of this known errors seems to me more important than the support for anchors.

Generalthat serious error in concatenating states corrected?memberweiqiangconnie20 Mar '11 - 4:16 
Hi,Martin.
first,i want to thank you about the project and article, from which i learned a lot about C++ language and regular expression.
i got a problem when i used it . i used the regex "1.*5|12" to match "12345",and i only got one match result which seemed the "12" regex part was not work correctly.
i am working on this bug now. But if you have already fixed it, please let me know.
GeneralRe: that serious error in concatenating states corrected?memberMartin.Holzherr20 Mar '11 - 21:50 
Sorry, I did not work anymore on this. But, as far as I remember, the incorrect state concatenation shouldn't be difficult to fix.
GeneralDev c++memberf_randy4 Jun '06 - 11:11 
I am not able to compile this thing in Dev C++ Frown | :(
Generalplease help me!!!!sussfowadkhan8 Feb '05 - 22:45 
1.construct a program for regular language
such that
submission=[ a b c 1 2 3 ]
operator=[ + * |]
 
2.construct a regular expression
3.excecute the domain; such that domain length =100 words
length word=5 to 10 char
4. use regular expression keys given in operator domain
5. collect the selected words in file

GeneralHelp Neededsussnoogui29 Oct '04 - 10:36 
Can anyeone tell me how can what is the regular expression for integers in C Programming language ( including octal ,decimal and hexadecimal values regardless of the boundaries in C for the int )
 
Also what is the regular expression for the float ?
 
And how can we turn it into NFA or DFA ?
 

GeneralBugsmemberlucky_by2 Jun '04 - 2:19 
Nice article.
 
However I have found the following bugs in the examples:
 
transforming hex digits to char
RexAlgorithm.cpp line 344:
< hexChar+= isdigit(rpFrom[i])?(rpFrom[i]-'0')toupper(rpFrom[i])-'A'+16);
> hexChar+= isdigit(rpFrom[i])?(rpFrom[i]-'0')toupper(rpFrom[i])-'A'+10);
 
signed/unsigned issue:
RexAlgorithm.cpp line 299:
 
< const char* pCur= m_pTokEnd;
< while(isspace(*pCur))
< pCur++;
< return pCur;
 
> unsigned const char* pCur= (unsigned const char*)m_pTokEnd;
> while(isspace(*pCur))
> pCur++;
> return (const char*)pCur;
 
struct REXA_NDFATransition::m_bIsMute is used uninitialized (What this member is really intended for?)
 
And finally concatenation of 2 NFAs (REXA_ParserImpl::Concat) is wrong.
For example: a*b* will never match empty chain, and a*b*c will never match single "c"
 
Regards, Al (Lucky)

QuestionIs this gift Unicode-aware?memberRocom15 May '04 - 23:24 
Who can help me to improve this to support Unicode?
I look forward ...
 
hjsoft@sohu.com
 
Rocom Omen
GeneralExample doesn't find single filesmemberJon11 Nov '03 - 5:24 
Looks like a small bug in FindInThisDirectory().
 
I changed the loop to:
bool bWorking = true;
while (bWorking){
bWorking = findFile.FindNextFile();
 
and then it picked up the single file in the directory ok.
I think it's because FindNextFile returns zero on the last found occurance.
 
Jon

GeneralRe: Example doesn't find single filesmemberRocom15 May '04 - 19:24 
I also have found the bug. In FileFindRecursive.h, I change
the function like below:
void FindInThisDirectory(CString strDir)
{
CFileFind findFile;
BOOL bWorking = findFile.FindFile(strDir+"\\"+m_strWildCard);
while(bWorking){
bWorking = findFile.FindNextFile();
if( !findFile.IsDirectory() ){
m_bCanceled= !m_pFuncProcessFile(findFile.GetFilePath(),
findFile.GetFileName(),m_pData);
if( m_bCanceled )
return;
}
}

Recurse(strDir);
}
GeneralBug in REXA_Scanner::operator()()sussJerome Herry28 Aug '03 - 3:27 
Hi all,
 
I found a bug in the function REXA_ESymbols REXA_Scanner::operator()()
When looking for a rule identifier this test is performed:
 
while( isalpha(*m_pTokEnd) || isdigit(*m_pTokEnd) || m_pTokEnd[1]=='_' )
{
m_strLiteral+= *m_pTokEnd++;
}
 
It must be
 
while( isalpha(*m_pTokEnd) || isdigit(*m_pTokEnd) || *m_pTokEnd=='_' )
{
m_strLiteral += *m_pTokEnd++;
}
 
Exemple:
Let's say that you have a rule named "$r01"
If you look for this pattern "$r01 __FILE__" you will have an error because it will look for the rule "$r01 " which does not exist (because of the space)
 

Jerome.

GeneralAlgorithmsussnachiappan12 Aug '03 - 8:01 
Good day!can u plz give me the algorithm for ur fast regular expression project.
 
nach
GeneralRe: AlgorithmmemberMartin Holzherr12 Aug '03 - 20:34 
Hi,
am I right, that you are
interested in the algorithm, not the code.

If so, I can give you a detailed description
in the next days (the exact description is
taken from a book on theoretical computer science).
The basic theory lying behind the code is
the transformation from a nondeterministic to
a deterministic finite machine.
A nondeterministic machine is easy to understand
and implement, because it is nothing more than
a set of nodes (in a graph) which go - at each node -
to the neighbouring node reachable by reading one of the next characters.
A nondeterministic machine - as the name implies - does not
give you a unique path, but - normally - at each step (reading the next character)
you can go decide to go one of many possible branches. But a wrong decision
can force you to backtrack - meaning that you must undo a decision and go back
to a previous state.
This means in the very end, that nonderministic machines are inefficient (but many
implementations use them).
The good news is, that the transformation is not difficult if you use sets
( I used std::set).
Furthermore this implementation gains much of its speed by using a
very simple and fast - but also space consuming - representation.
More about the algorithm in the next days.
Regards Martin

GeneralRe: Algorithmsussnachiap19 Aug '03 - 5:32 
thanks for ur reply,kindly post ur algorithm as soon as possible
GeneralRe: AlgorithmmemberCuick27 Aug '03 - 20:39 
Hi!
Can u send me the detailed description??
Thanks!
GeneralRe: AlgorithmmemberAmer Gerzic12 Nov '03 - 12:46 
Maybe this will help:
http://www.codeproject.com/useritems/OwnRegExpressionsParser.asp
 
I am not sure though if the algorithm is the same, but this one should be very fast!
 
Amer
GeneralRe: AlgorithmmemberRocom15 May '04 - 18:55 
I'm eager for your detailed description, too.
Thanks, your code are nice for a certainty!
 
My E-Mail:hjsoft@sohu.com
 
Rocom Omen
GeneralProblem with some characters on UNIXsussJerome Herry28 Jul '03 - 21:56 
Hi,
 
I just want to tell you about this problem I met on SUN Solaris with gcc 2.95. Spaces are skiped with the "isspace" function which seems to also skip some other characters like "Ã" (\195). Maybe you should only compare to the ' ' character...
 
Regards,
Jerome.
GeneralValidating Datemembergrscot31 Dec '02 - 4:42 
Hi
I would like to validate a date in the format:mm dd yyyy in C++.
I have no idea of regular expressions. Could someone help me?
The date is entered as a string.
 
Regards,
Chris
 

GeneralQuick questionmemberHockey20 Oct '02 - 11:10 
Regular expression parsers either interpret the search pattern at runtime or they compile the regular expression into an efficient internal form (known as deterministic finite automaton). The regular expression parser described here belongs to the second category
 
Does this mean I can change the string i've fed the regex and assume the expression will not have to be re-evaluated...?
 
I have a buffer which is updated every second or so and I must strip HTML tags each time.
 

Cheers!
 
"An expert is someone who has made all the mistakes in his or her field" - Niels Bohr
GeneralRe: Quick questionmemberMartin Holzherr12 Aug '03 - 20:43 
The regular expression has to be evaluated each
time you change it. But you can use the same expression
for an unlimited number of input files.
 
If you only have a few expressions, you can
use more than one instance of the regular expression
class.
Regards Martin
 

GeneralBackreferencesmembermst7 Aug '02 - 0:20 
Hi there!
 
I want to find strings like
A123A
B123B
but not
A123B
using the expression \([A-Z]\)[1-9]+\1 which uses so called backreferences (works in TextPad, a real cool editor).
I Wonder if REXI_Search is able to process expressions like that?

 
regards
mst
GeneralBug or Feature using '*' and '+'membermst5 Aug '02 - 5:01 
Hi there!
 
First of all: REXI_Search is great and realy fast! Thanx a lot.
 
One problem left:
Whenever I use an expression like "ABC.*XYZ" on a text like
 

ABCIIIIIIXYZIIIIABCIIIIIXYZIIII
^-------------------------^

the occurance found is the on marked above. In my opinion it should be
 

ABCIIIIIIXYZIIIIXYZIIIIIXYZIIII
^----------^

isn't it?
 
regards
Markus
 

GeneralRe: Bug or Feature using '*' and '+'memberMartin Holzherr5 Aug '02 - 5:42 
Hi Markus,
there are different solutions to your matching problem.
REXI_Search always searches for the longest possible match not just for the first found. In most cases this is the behavour wanted,
e.g. if you use a regular expression parser for scanning the tokens of a programming language, then this is just what you expect.
To search for a C-like identifier you can use the following expression
[_A-Za-z][_A-Za-z0-9]*
to match identifiers like
totalTaxValue nofDigits _myHiddenVariable
 
Regards Martin

GeneralRe: Bug or Feature using '*' and '+'memberCodebender17 Dec '04 - 13:11 
'*' is greedy by default. What you want is '.*?', which is the non-greedy version of '*'. That is, it matches as few characters as possible. Unfortunately, this RegEx processor does not appear to support non-greedy matching.
Generalcompiled versionmemberJanie624 Jul '02 - 12:52 
Could someone send me the compiled version of this application. It looks pretty useful, but I am unable to compile it using VC 5++
 
If you wish to send the application, you can email it to
umjaco6@hotmail.com
 
Thanks
JanieRose | [Rose]
Generalcan't compile in VC++ 5sussAnonymous10 Jul '02 - 8:29 
Is this project supposed to be able to compile in Visual C++ 5.0?
 
I can't seem to get it working.
GeneralRe: can't compile in VC++ 5memberMartin Holzherr10 Jul '02 - 20:24 
I didn't compile it under VC++5 only under VC++6,
perhaps VC++5 has problems with templates which
are heavily used by this project and which are known
to be badly supported by older C++ compilers.
Regards Martin Holzherr

GeneralRe: can't compile in VC++ 5sussAnonymous19 Jul '02 - 7:53 
Could you send me a compiled version of this application to test it?
 

Janie
GeneralRe: can't compile in VC++ 5sussAnonymous19 Jul '02 - 8:59 
Hi,
if you are willing to send me a compiled application please send it to the email janie@webwwworld.com
 
Thanks
 
Janie
GeneralJust what I was looking formemberHockeyDude4 Apr '02 - 22:31 
I've only quickly broswed the article and like what I see so far.
 
Only a quick question before I d/l the source and use it...
 
Using your regex utility is posssible to check the validity of form data like you would use perl to check form data in a HTML form...?
 
I'm a total virgin with regex, but would like to learn and use this class. Storing the regex for different form data
 
ie: Email, Credit cards, phone numbers..
 
Is it possible to use this class for this purpose...?
 
Thanx Smile | :)
 
"An expert is someone who has made all the mistakes in his or her field" - Niels Bohr

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Permalink | Advertise | Privacy | Mobile
Web02 | 2.6.130516.1 | Last Updated 30 Oct 2000
Article Copyright 2000 by Martin Holzherr
Everything else Copyright © CodeProject, 1999-2013
Terms of Use
Layout: fixed | fluid