Ha! I had that very comic in mind when I was attacking this.
Real programmers use butterflies
I'm motivated to make things better, not the same.
"Before entering on an understanding, I have meditated for a long time, and have foreseen what might happen. It is not genius which reveals to me suddenly, secretly, what I have to say or to do in a circumstance unexpected by other people; it is reflection, it is meditation." - Napoleon I
Oh but I've already done that.
XBNF is a description language that can describe both Chomsky type-2 and type-3 languages using simple compositions of logical constructs: a repeat construct {}, alternation |, (implicit) concatenation, and so on.
Regex is for the benefit of people that don't want to learn XBNF. It's also for the benefit of leveraging all of the existing regular expressions out there, and this is where it gets important.
If I mimic what's out there it means you can use my tools with the content *you already have* saving you time.
That is an improvement, no?
There's no real reason to "improve" regex. I'd argue that too many people already have, and that's exactly what led to the present situation.
I'm not trying to add to that pile. I'm simply trying to make the present situation as painless for users of my code as possible.
Real programmers use butterflies
I am not really good at regex, but what about this:
Keep your own syntax and offer some sort of "converter" - basically a function that takes a known syntax (POSIX or .NET or whatever) and returns the "translated" regex in your syntax.
This way people choose which syntax they want to use (perhaps based on knowledge or use case).
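The converter idea can be sketched in a few lines. Below is a minimal, hypothetical Python sketch (the table and function name are mine, not from any tool discussed here) that rewrites a couple of POSIX bracket expressions into Perl/.NET-style shorthands:

```python
import re

# A minimal sketch of the "converter" idea: translate a couple of
# POSIX bracket expressions into Perl/.NET-style shorthands. Real
# dialects differ in many more ways; this table is illustrative only.
POSIX_TO_PERL = {
    "[[:digit:]]": r"\d",
    "[[:space:]]": r"\s",
    "[[:alpha:]]": "[a-zA-Z]",
}

def translate(pattern: str) -> str:
    """Rewrite known POSIX classes into the target dialect."""
    for posix, perl in POSIX_TO_PERL.items():
        pattern = pattern.replace(posix, perl)
    return pattern

# The translated pattern works in the target engine:
assert re.fullmatch(translate("[[:digit:]]+"), "123")
```

A real converter would need a proper parser rather than string replacement (to handle classes nested inside brackets, escapes, and so on), which is exactly the build-and-test burden discussed below.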
I've considered that, but there are some issues with it, like being able to scan reams of existing regexes (for example, in lengthy lexer specifications). It is possible to run each one through such a function, but it's very difficult to build and test exactly to a single syntax, versus making one syntax loose enough to accept most constructs from most languages. The latter easily satisfies the 80/20 rule, so I think that's the way to go.
Real programmers use butterflies
'as painless ... as possible' is well worth the effort imo.
I wish I could help, but alas, all I can do is encourage. I look forward to seeing what you wind up with.
I have used the Regex support provided in various applications and noticed that some require backslashes in front of operators while others require that they not be used. There seems to be no common ground, so the expressions must be written differently depending on which app one uses. I would have to keep a different set of rules, with examples, for each application.
I handle escapes by allowing anything to be escaped, but only requiring it where necessary. Most regular expression engines are *supposed to* work that way.
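As a rough illustration of that escaping policy (a sketch, not the actual engine code), an escaper can add backslashes only in front of real metacharacters, while the engine still tolerates redundant escapes on punctuation:

```python
import re

# Sketch of "allow anything to be escaped, require it only where
# necessary": escape only characters that are actual metacharacters.
METACHARS = set(".^$*+?()[]{}|\\")

def escape_where_needed(text: str) -> str:
    return "".join("\\" + c if c in METACHARS else c for c in text)

# The escaped literal matches itself...
assert re.fullmatch(escape_where_needed("a.b*c"), "a.b*c")
# ...and most engines also tolerate a redundant escape like \- :
assert re.fullmatch(r"a\-b", "a-b")
```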
Real programmers use butterflies
- Decide which constructs/features it makes the most sense to support and do those well (as compliant as possible, without sacrificing performance on your most critical expressions).
- Fail cleanly whenever possible with as much information as possible (e.g., provide a warning about 'regex "x" may not work as expected because feature y is not supported').
- If users say they really need something you don't think you can/should support, then provide a fallback approach (e.g., switch to an off-the-shelf package and let the user know about the hit they're taking and why).
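The "fail cleanly with a warning" bullet might look something like this sketch (the unsupported-feature list and function name are invented for illustration):

```python
import re
import warnings

# Sketch of "fail cleanly with as much information as possible": scan a
# pattern for constructs a hypothetical engine does not support and warn
# before falling back. The unsupported list here is only an example.
UNSUPPORTED = {
    r"\\[1-9]": "backreferences",
    r"\(\?<?[=!]": "lookaround",
}

def check_pattern(pattern: str) -> list:
    problems = [name for probe, name in UNSUPPORTED.items()
                if re.search(probe, pattern)]
    for name in problems:
        warnings.warn(f'regex "{pattern}" may not work as expected '
                      f'because {name} are not supported')
    return problems

assert check_pattern(r"(a)\1") == ["backreferences"]
assert check_pattern(r"(?=x)y") == ["lookaround"]
```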
FWIW, I totally get it. I've become a big fan of RegEx over the years (despite initial reluctance). While they can be quite complicated and obtuse, simple ones are very easy to write, and most people adapt to them well since most simple searches just work. At the same time, I worry I'm sometimes giving up too much performance processing regexes, and compiling them hasn't always delivered the performance I'm looking for, so I've been very tempted to write an optimization wrapper that handles simple cases using language intrinsics (e.g., substring, begins-with, and exact match).
I've not gotten around to the optimizations; however, I did (unfortunately) implement a couple of simple 'extensions' to RegEx as it doesn't support a couple of basic operations I frequently need:
- Use a leading "!" to mean NOT a match (where the rest of the input is the regex).
- Use a leading >, <, >=, or <= to treat the string as a numeric (or date) comparison (very handy for things like searching for numbers >10).
While it's possible to code those in RegEx, the complexity is sufficient to reduce the value to near zero (especially since I use this for quick ad hoc database column searches).
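The two extensions described above could be sketched like this (my guess at the semantics, not the poster's actual implementation):

```python
import re

# Sketch of the two extensions: a leading "!" negates the regex match,
# and a leading comparison operator switches to numeric comparison.
def extended_match(spec: str, value: str) -> bool:
    if spec.startswith("!"):
        return re.search(spec[1:], value) is None
    for op, cmp in ((">=", float.__ge__), ("<=", float.__le__),
                    (">", float.__gt__), ("<", float.__lt__)):
        if spec.startswith(op):
            try:
                return cmp(float(value), float(spec[len(op):]))
            except ValueError:
                return False  # non-numeric input never matches
    return re.search(spec, value) is not None

assert extended_match("!foo", "bar")   # no "foo" anywhere
assert extended_match(">10", "42")     # numeric, not regex
assert not extended_match(">10", "banana")
```

Note that the two-character operators must be tested before the one-character ones, or ">=" would be misread as ">" followed by a stray "=". A fuller version would also parse dates, as the poster describes.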
Honestly? You could just use a DFA regex. Even one written in C# will be up to 3 times faster than Microsoft's. The downside is no backtracking, but oh well.
Real programmers use butterflies
Thanks. Did you mean that I should replace the .NET RegEx with a 3rd party (DFA based) RegEx, or that I should have people use DFAs instead of RegEx?
Backtracking's not really an issue. Most of the time I'm just looking for one of the following:
- String Contains Pattern (e.g., "Pattern")
- String Contains Pattern A or Pattern B or ... (e.g., "PatternA|PatternB|...")
- String Starts with Pattern (e.g., "^Pattern") or Ends with Pattern (e.g., "Pattern$")
- String Exactly matches Pattern (e.g., "^Pattern$")
Sometimes I use more complex options:
- String Contains Pattern A or Pattern B (e.g., "Pattern(A|B)")
- String Matches product code (e.g., "P\d+(-\d+)?")
And the ones that RegEx doesn't support:
- String doesn't match one of the above (implemented by me as "!...")
- Numeric or Date comparison (implemented by me as "<10" or ">=1/1/2000" ...)
- Within range (currently not implemented except via regex starts with).
That's about as complex as it gets.
The main performance issue is that I'm using it for ad hoc live filtering of up to 3-4,000 records (filter changes as every character is typed) with the potential for filters on multiple fields, and I'm trying to keep it responsive (< 2 seconds worst case, preferably < 1/10 second).
So far, the performance is reasonable, if not ideal (using the native .NET RegEx), so I've not been highly motivated to change. A 3rd-party drop-in engine might work (my thoughts were more along the lines of recognizing the simple cases and hard-coding them; it's hard to beat string.IndexOf and other string intrinsics, which can easily handle three of the first four cases).
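The "recognize the simple cases and hard-code them" idea can be sketched as a wrapper that dispatches literal patterns to string intrinsics (shown in Python for brevity; a .NET version would use string.IndexOf/StartsWith/EndsWith):

```python
import re

# Sketch of the optimization-wrapper idea: recognize the four simple
# pattern shapes and dispatch to string intrinsics instead of the regex
# engine. Only patterns that are pure literals (after stripping anchors)
# qualify; everything else falls through to re.search.
def fast_match(pattern: str, text: str) -> bool:
    starts = pattern.startswith("^")
    ends = pattern.endswith("$")
    core = pattern[1 if starts else 0:len(pattern) - (1 if ends else 0)]
    if core == re.escape(core):  # no metacharacters -> pure literal
        if starts and ends:
            return text == core           # exact match
        if starts:
            return text.startswith(core)  # starts with
        if ends:
            return text.endswith(core)    # ends with
        return core in text               # contains
    return re.search(pattern, text) is not None

assert fast_match("^Pattern$", "Pattern")
assert fast_match("Pat", "xxPatxx")
assert not fast_match("^Pat", "xxPat")
```

The `core == re.escape(core)` test is a cheap (if conservative) way to detect "no metacharacters present"; anything it rejects simply takes the slower but correct regex path.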
I also use RegEx's for backend filtering (before it gets to the UI) and there I'm limited to what the database engine supports. Performance is generally pretty good; however, I wonder if there would be value in my detecting simple cases up front and converting them to different operations before sending to the backend. For example:
- Instead of 'Field matches regex "Pattern"' generate 'Field contains "Pattern"'.
- Instead of 'Field matches regex "^Pattern$"' generate 'Field == "Pattern"'.
These patterns are also typically entered by users, but not live (they have to 'submit' the query). I already do some simple pattern manipulation, mostly adding a (?i) to the front as the engine is case sensitive by default and I'd rather it not be. This would be a bit more complex as I'd have to manipulate the operation, not just the pattern.
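That front-end translation might look like the following sketch. The =~, contains, and matches regex operators are real KQL (=~ is case-insensitive equality, contains is case-insensitive substring), but the quoting and dispatch logic here are illustrative only, not a tested Kusto integration:

```python
import re

# Sketch of translating trivially anchored literal patterns onto cheaper
# backend operations, forcing case-insensitivity on anything that stays
# a regex by prepending (?i).
def to_backend_filter(field: str, pattern: str) -> str:
    core = pattern.removeprefix("^").removesuffix("$")
    if core == re.escape(core):  # pure literal
        if pattern.startswith("^") and pattern.endswith("$"):
            return f'{field} =~ "{core}"'
        if not pattern.startswith("^") and not pattern.endswith("$"):
            return f'{field} contains "{core}"'
    # everything else stays a regex, forced case-insensitive:
    return f'{field} matches regex "(?i){pattern}"'

assert to_backend_filter("Name", "^Widget$") == 'Name =~ "Widget"'
assert to_backend_filter("Name", "Wid") == 'Name contains "Wid"'
```

A production version would also need to escape quotes inside the literal before embedding it in the query string.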
David On Life wrote: Did you mean that I should replace the .NET RegEx with a 3rd party (DFA based) RegEx
Yes, that.
However, given what you're telling me - are those records coming from a database? If so you might get more mileage using LIKE from within SQL itself, at least for the simple stuff. That will be orders of magnitude faster than anything you could do on the client side.
Real programmers use butterflies
Yes. However, the database is Kusto (aka Azure Data Explorer) which has native RegEx support. I use a two-stage approach.
The first stage allows the input of parameters which are passed to Kusto to select a small subset of relevant data (typically 1 to 1,000 records, sometimes more). Parameters may be RegEx, equals, list of matches, contains, startswith, or any other Kusto comparison operation (determined as part of the parameter setup, not by the user). They are not live but processed as part of the query (just like you're suggesting, except there's no 'like' operator in Kusto).
The second stage is local filtering once the data is already on the client. That's the live component. Since the data is already on the client at that point, local filtering is typically faster than requerying. I currently give users the option of either RegEx or simple Contains, but I'm not sure the Contains option is that meaningful.
A typical use case would be using the first stage to pull all storage performance test results in the last month for project x using configuration y. Then use client-side filtering to look for issues (e.g., performance < 90% of expected) and/or further filter on specific test setups (different storage types or different computer types).
Alright, well, unless your users are connected to your servers via a SAN, speeding up your regex isn't even going to touch the part of your app that should be taking the longest to execute (downloading the data to the client, no?).
Find out for sure what takes the time. Optimize there. I doubt it has anything to do with regex.
You will get up to a 3x speed improvement over a straight NFA regex search through text - but that speedup applies only to the text-searching part itself.
That's probably not where your time is being spent, just from what you are telling me.
Of course, I don't *know* any of this. This is me spitballing based on one comment. However, if I were in your shoes, I'd profile and find out, during a typical run, what percentage of the total execution time is spent doing what.
From there, I'd attack the things that take the largest percentage.
If the regex is anywhere even near the top of that list, I'll eat my hat.
Real programmers use butterflies
honey the codewitch wrote: 90% of this has to do with what is allowed to appear inside [] braces.
Presumably that just ends up being converted to a character mapper. The specials in there only involve shortcuts to ranges - for example, character classes for Unicode.
Basic character classes have existed for decades, so just start with those and add a couple.
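As an illustration of the character-mapper point (a sketch, not any particular engine's code), a simple bracket-expression body can be compiled into a plain membership set checked per input character:

```python
# Sketch of "converted to a character mapper": compile a simple bracket-
# expression body (ranges and single characters only; no negation,
# escapes, or POSIX classes) into a membership set.
def compile_class(body: str) -> frozenset:
    chars, i = set(), 0
    while i < len(body):
        if i + 2 < len(body) and body[i + 1] == "-":
            lo, hi = body[i], body[i + 2]  # a range like "a-f"
            chars.update(chr(c) for c in range(ord(lo), ord(hi) + 1))
            i += 3
        else:
            chars.add(body[i])
            i += 1
    return frozenset(chars)

hexdigits = compile_class("0-9a-fA-F")
assert "b" in hexdigits and "g" not in hexdigits
```

Shortcuts like \d or \p{L} then just become predefined tables plugged into the same mapper.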
honey the codewitch wrote: There are 3 or 4 major regex syntax varieties out there. POSIX, Perl, JS, .NET etc.
Not sure I agree with that as stated.
The following all use the same regex syntax as Perl:
- .NET
- JavaScript
- Java
Given how much those three languages are used, I would say that the Perl syntax is the most standard.
Differences from Perl are usually outside the regex itself. Variations in regex itself are probably pretty esoteric.
None of those languages support some of the POSIX ranges, but they do support other escapes that are equivalent. That means users of those languages are unlikely to be familiar with the POSIX ones anyway.
Aside from the variations between GNU, Perl, and POSIX, there are also de facto ones, like the POSIX-ish syntax used by FLEX and its variants.
This is where I'm getting most of my information (this site, but here's the page on char classes)
Regexp Tutorial - Character Classes or Character Sets[^]
Real programmers use butterflies
I heard the Hubble took that same first image...
Hubble had a problem with the mirror warping in zero gravity (if I remember it right), so all pictures were blurry until they went up and fixed it via a shuttle launch.
I remember a Nasa guy saying at the time "This is why you should never name a project something that rhymes with trouble."
Sincerely,
-Mark
mamiller@mhemail.org
Do you use any software / web site / service to :
1) Save URLs
2) categorize those URLs
3) maybe even provide a little note as to why it is interesting (to remind yourself later)
for later reading?
Or, do you just use the browser's favs? -- I find browser favs a bit limiting.
I often come upon material I want to organize into folders for reference and also just keep a _current_ reading list, but haven't found anything very good for that.
Any suggestions?
OneNote. It also has a clipping app/add-on to capture the content.
Mircea
Looks like that is something I have to pay for though[^].
I should've added that I'm a cheapskate -- I thought software was free.
Yes, I make my living from Software Dev & I'm mostly kidding, but $99 / yr feels quite expensive.
Thanks for your input.
I write it off as a business expense and also get to use Word, Excel, and the 1TB of OneDrive. It's a pretty good deal IMHO.
My family on the other hand are a bunch of cheapskates who enjoy it for free
Mircea
Very cool.
And I'm also embarrassed now, because I guess I could use Google Keep (similar to OneNote) and Google Docs to do something like this. Should'a thought of that.