"I have no idea what I did, but I'm taking full credit for it." - ThisOldTony
"Common sense is so rare these days, it should be classified as a super power" - Random T-shirt
AntiTwitter: @DalekDave is now a follower!
(I'm ignoring backtracking regex here because it's dirty, and algorithmically less useful except for making it easier for the user to match text)
Anyway, it's just a tiny functional programming language with only four explicit operators, ( ) | ? *, and one implicit one.
Representing the regex programming language as code: any regex is mathematically equivalent to the DFA state machine it represents, and the two can be converted algorithmically in both directions. Perfect compilation/decompilation.
So you can use them to match text (boring!)
Or you can use them to generate code for state machines (less boring!)
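To make the regex/state machine equivalence concrete, here is a minimal sketch in Python. The pattern, the state numbering and the name dfa_matches are all made up for illustration; the table was written by hand rather than produced by any particular tool.

```python
import re

# Hypothetical example pattern: 'a', then any number of 'b' or 'c', then an optional 'd'.
PATTERN = "a(b|c)*d?"

# Hand-built DFA for the same language: state -> {input char -> next state}.
TRANSITIONS = {
    0: {"a": 1},
    1: {"b": 1, "c": 1, "d": 2},
    2: {},
}
ACCEPTING = {1, 2}


def dfa_matches(text):
    """Run the DFA over text; accept only if we end in an accepting state."""
    state = 0
    for ch in text:
        state = TRANSITIONS[state].get(ch)
        if state is None:  # no transition for this character: reject
            return False
    return state in ACCEPTING


if __name__ == "__main__":
    for sample in ["a", "abcbd", "ad", "", "add", "abx"]:
        by_regex = re.fullmatch(PATTERN, sample) is not None
        print(repr(sample), dfa_matches(sample), by_regex)
```

Both columns should agree for every input, which is the whole point: the table is just the regex in another form.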
And yet I've met a lot of programmers that either loathe them, are intimidated by them, or both.
They're wonderful little things, with interesting mathematical properties, but more importantly, they're useful for everything quick and dirty.
If you do it right, regex facilitates rather than hinders code generation, but Microsoft's engine is unfortunately limited in that regard. It does code generation, but it doesn't generate C# code, for example. It could. It just doesn't.
Like I said in the OP, a regex is a state machine is a regex.
A state machine is code.
It's code all the way down.
ETA: If you stick to the basic operations and common syntactic sugar and avoid backtracking and other nonsense, most of the regex stuff is the same regardless of implementation.
I have used some regexes in my article Translitera - Phonetic Typing in Some Indian Languages, which is a tool for transliterating from English to some Indian languages. The program uses regexes to identify patterns in each word, and hence split each word into manageable parts.
Of course, there are some situations which are not handled; there is always scope for improvement.
I use them for input validation. Things like dates and times are straightforward. Names are not, even when imposing cultural restrictions (two capitalised names and some fussing around the edges). Patrick O'Reilly-Smythe and Ian McDonald are about as complex as I allowed for members of our Rural Fire Brigade. I can't remember whether Giulio d'Angelo would pass or fail. If he joins up, I'll revisit the code.
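For anyone wondering what that kind of name check might look like, here is a rough sketch in Python. It is not the pattern from the brigade code (which isn't shown here); NAME_PART, FULL_NAME and is_valid_name are made-up names, and the O'/Mc prefixes are just my guess at "fussing around the edges".

```python
import re

# Assumption: a name part is a capitalised word, optionally prefixed by O' or Mc.
NAME_PART = r"(?:O'|Mc)?[A-Z][a-z]+"
# Assumption: parts may be joined with hyphens (e.g. O'Reilly-Smythe).
NAME = rf"{NAME_PART}(?:-{NAME_PART})*"
# Exactly two names separated by a single space.
FULL_NAME = re.compile(rf"{NAME} {NAME}")


def is_valid_name(text):
    """Return True if text is two capitalised names under the rules above."""
    return FULL_NAME.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ["Patrick O'Reilly-Smythe", "Ian McDonald", "Giulio d'Angelo"]:
        print(candidate, is_valid_name(candidate))
```

Under these particular rules the lowercase d' prefix is rejected, so a pattern like this would need another alternative in the prefix group before Giulio could sign up.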
My other major use is in (often throwaway) sed scripts, or of course grep. Things like extracting the word after "Invalid user" in security logs.
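The "Invalid user" extraction translates to Python roughly as follows. This is a sketch, not the original sed script; the log line layout and the function name invalid_users are assumptions.

```python
import re

# Capture the first run of non-whitespace characters after "Invalid user ".
INVALID_USER = re.compile(r"Invalid user (\S+)")


def invalid_users(lines):
    """Collect the user name from every line that mentions an invalid user."""
    found = []
    for line in lines:
        match = INVALID_USER.search(line)
        if match:
            found.append(match.group(1))
    return found


if __name__ == "__main__":
    # Made-up log lines in a typical sshd style.
    sample = [
        "Jan  1 00:00:01 host sshd[123]: Invalid user admin from 192.0.2.1 port 22",
        "Jan  1 00:00:02 host sshd[123]: Accepted publickey for alice from 192.0.2.2",
    ]
    print(invalid_users(sample))
```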
Another one I used recently was to reconstruct words that were hyphenated across lines in an OCR'd manual. Google Translate barfs on the fragments of hyphenated words.
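A sketch of the hyphenation repair in Python, assuming the break is always a word character, a hyphen, a newline, then another word character. The function name dehyphenate is made up, and the approach will also strip the hyphen from a genuine compound that happens to break at a line end.

```python
import re

# A word character, a hyphen at end of line, then a word character on the next line.
BROKEN_WORD = re.compile(r"(\w)-\n(\w)")


def dehyphenate(text):
    """Join words that were split with a hyphen across a line break."""
    return BROKEN_WORD.sub(r"\1\2", text)


if __name__ == "__main__":
    print(dehyphenate("The config-\nuration file"))
```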
Software rusts. Simon Stephenson, ca 1994. So does this signature. me, 2012
Forget backtracking regular expressions, as they don't have the same fancy mathematical properties as their non-backtracking counterparts. Use the non-backtracking operators and there are only five operations to remember: concatenation, alternation, parentheses, zero-or-one match and Kleene star (looping *, zero or more matches). Concatenation is implicit. Non-backtracking regex is:
1. Simpler to understand
2. Faster to execute
3. Weirdly mathy but in a cool way
4. The same across almost all regular expression engines
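Here is one tiny pattern per operation, as a sketch using Python's re module. The patterns and test strings are made up, and the helper name behaves_as_expected is mine; the same patterns should behave the same way in most other engines.

```python
import re

# operation -> (pattern, strings that should match, strings that should not)
EXAMPLES = {
    "concatenation": ("ab", ["ab"], ["a", "b", "ba"]),
    "alternation": ("a|b", ["a", "b"], ["ab", ""]),
    "parentheses": ("a(b|c)", ["ab", "ac"], ["a", "abc"]),
    "zero or one": ("ab?", ["a", "ab"], ["abb"]),
    "Kleene star": ("ab*", ["a", "ab", "abbb"], ["b", ""]),
}


def behaves_as_expected(pattern, accepted, rejected):
    """Check that pattern matches every accepted string in full and no rejected one."""
    matches_all = all(re.fullmatch(pattern, s) for s in accepted)
    matches_none = not any(re.fullmatch(pattern, s) for s in rejected)
    return matches_all and matches_none


if __name__ == "__main__":
    for name, spec in EXAMPLES.items():
        print(name, spec[0], behaves_as_expected(*spec))
```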
I give a primer at the end of this article. I taught them to my computer, and trust me - it's not very smart, but then I also taught it C in that article.
I enjoyed the series right up until they were able to insert a little card into their existing engine that allowed them to travel anywhere immediately. I know that FTL is imaginary, but I'd think that "blink" might have required a completely different set of physics requiring a new engine or something. My "willing suspension of disbelief" became unwilling at that point.
Outside of a dog, a book is a man's best friend; inside of a dog, it's too dark to read. -- Groucho Marx
I would rather use whatever language I am working with to perform the parse. As you just stated... regex is technically another small programming language.
I am not sure if you know this... but you can take a regular expression and use the Ragel state machine compiler to convert it to C/C++, D, Go, Java, Ruby and even Objective-C. Interestingly... I do not see C# support.