|
Based on your description, you want to extract the fourth field from each line:
^([^;]*;){3}([^;]+); Demo[^]
However, depending on the source of the data, you may need to consider how it would "escape" a semicolon embedded in one of the field values.
For example, given a display name of j;smith , would that end up as j\;smith ? j;;smith ? Something else? Or would it just corrupt the entire line?
Once you start having to account for "escaped" separators, parsing the line becomes much harder.
"These people looked deep within my soul and assigned me a number based on the order in which I joined."
- Homer
|
|
|
|
|
If you know that you do not have any embedded semi-colons in your input text, then maybe a simple split would work for you instead of a regex. e.g fields[] = split(line, ';') (or however your base language does that). This is far simpler, and should be much quicker that applying a regex and extracting a match. However, as Richard points out, if you do have embedded semi-colons you'll need to know how they're escaped in the string. In which case it is probably still faster to write a parser that will extract the fields to an array or as struct or class of some sort.
In a related note, you might be tempted to apply a regex to validate the email address, but that is not as simple and straight forward as it might seem. See this discussion from stack-overflow : https://stackoverflow.com/a/201378 The next response on that SO page may also be useful, if you're using C#, which refers to the MailAddress class.
Keep Calm and Carry On
|
|
|
|
|
Kardock wrote: each line looks like this:
Which suggests that it is CSV data. Although 'CSV' stands for 'comma separated value' in general usage the separator can be other types including a semi-colon.
So the best solution is to find a CSV library and use that rather than attempting to roll your own. You should look to see how the library handles bad data (ill formed CSV).
|
|
|
|
|
you're right but that gave me a chance to try to understand regex.
|
|
|
|
|
Make sure to learn one lesson about regex: Don't overuse it.
I've seen numerous regex problems where solving the task using an algorithmic language (such as C#) would be straightforward and simple - and flexible enough to handle with ease all the exceptions and special cases that really can give you a headache trying to do it as a regex.
And there is Geek&Poke: Yesterdays regex[^]
Disclaimer: The only pattern matching language I liked was SNOBOL, but I haven't seen it is use for a few decades now. SNOBOL is (/was) sort of a crossover between predicates and algorithmic programming - you could see it as a different kind of bool expression evaluation, in an otherwise algorithmic programming language. Especially the predicates were written in a way more readable format than in traditional regex. (I am not holding my breath waiting for SNOBOL to raise to a new stardom, though!)
|
|
|
|
|
So given that you just want to mess around with regex.
Kardock wrote: but it fails.
Presumably you mean it runs but it does not successfully match.
The problem is '\w' is not an expression that could ever match an email. So you need to look up what it does match.
The other problem that you will find is that attempting to actually match a valid email is very difficult. The regex to do it is about 1000 characters long. You can google that both to see what a long regex looks like and to educate yourself what a 'valid' email actually is. (I do it every couple of years to remind myself especially when someone says they want to 'validate' an email.)
However you don't need to match an email. What you need to match is the fourth value in the list. So the way to match that is the following
[^;]+
You should probably in fact match all of the columns that way.
So you should study that expression to figure out what it does. And then answer for yourself why the other posters comment about embedded semi-colons being a problem.
|
|
|
|
|
I have a MySQL column with underscores and between those underscores a domain name (which is anything with a dot, except number.number), and I like to use REGEXP_SUBSTR to extract that domain, but only if there are at least 2 preceding underscores before it.
Examples:
1. Something_Else_Type_XY_Z_26.04.23_website.com => website.com
2. Something_Else_Type_XY_Z_website.com_26.04.23 => website.com
3. Something_Else_Type_XY_Z_26.04.23_website.com_Comment => website.com
4. Something_website.com_Else_Type_XY_Z_26.04.23_Comment => NULL (due to not having 2 preceding underscores first)
5. Something_Else_Type_XY_Z_26.04.23_Comment => NULL (due to not having something dot something that isn't number.number)
6. Something_Else_Type_XY_dfd6869_3_21.12.22_website.com_ZU => website.com
Which regex could do so instead of ... in SELECT REGEXP_SUBSTR(full_name, ...) as domain_name?
|
|
|
|
|
One of these then:
\b([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}\b
^([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}$
\b((xn--)?[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}\b
\b((?=[a-z0-9-]{1,63}\.)(xn--)?[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,63}\b
\b((?=[a-z0-9-]{1,63}\.)[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,63}\b
Footnote: these make no sense of leading BLANK as your requirement states but I throw them out there anyway for your testing pleasure.
modified 14-Sep-23 14:11pm.
|
|
|
|
|
Unfortunately none of them worked at all.
REGEXP_SUBSTR(field_name, '^([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}$') immediately timed out (#3699 - Timeout exceeded in regular expression match ), and the rest just returned NULL for everything.
Any idea why?
|
|
|
|
|
Well, first of all those regex were std::regx not MySQL.
What exactly have timing out and error #3699 to do with this "SELECT" query? I'm not sure why none of them work because I'm not sure how you're leveraging this database. REGEX_SUBSTR? Is that a function?
Probably Oracle, he said.
I know nothing about Oracle, sorry.
|
|
|
|
|
|
I just tried a quick test at RegExr: Learn, Build, & Test RegEx[^] and the following regex finds all the web addresses in the set below (shown in bold). The only thing I cannot correct is that it includes the leading underscore. However it may be a starting point to help you (I have not actually tried it with MySQL).
_[a-zA-Z]*\.[a-zA-Z]*|\.[a-zA-Z].
Something_Else_Type_XY_Z_26.04.23_website.com
Something_Else_Type_XY_Z_website.co.uk_26.04.23
Something_Else_Type_XY_Z_26.04.23_website.com_Comment
Something_website.com_Else_Type_XY_Z_26.04.23_Comment
Something_Else_Type_XY_Z_26.04.23_Comment
Something_Else_Type_XY_dfd6869_3_21.12.22_website.com_ZU
|
|
|
|
|
Unfortunately REGEXP_SUBSTR(field_name, '_[a-zA-Z]*\.[a-zA-Z]*|\.[a-zA-Z]') just gives me the first 2 letters of the field_name ("So" in this case).
|
|
|
|
|
Yes, I got similar results, which does not make a lot of sense. However, on reading the documentation (https://dev.mysql.com/doc/refman/8.0/en/regexp.html#function_regexp-substr[^]) closely I discovered why. Hidden down near the bottom of the page is the following:
Quote: To use a literal instance of a special character in a regular expression, precede it by two backslash (\) characters. The MySQL parser interprets one of the backslashes, and the regular expression library interprets the other. For example, to match the string 1+2 that contains the special + character, only the last of the following regular expressions is the correct one:
So I tried with that, and a slight modification to my original pattern, and successfully extracted the fields as shown in the following results:
mysql> SELECT REGEXP_SUBSTR(Title, '_[a-zA-Z]+(\\.[a-zA-Z]+)+') from Books;
+---------------------------------------------------+
| REGEXP_SUBSTR(Title, '_[a-zA-Z]+(\\.[a-zA-Z]+)+') |
+---------------------------------------------------+
| NULL |
| NULL |
| NULL |
| NULL |
| _website.com |
| NULL |
| _website.com |
| _website.com |
| _website.co.uk |
| _website.com |
| NULL |
| NULL |
+---------------------------------------------------+
The NULL entries are records that do not contain a web address.
[EDIT]
I just tried without the leading underscore in the regular expression and added some numbers into the website names and got the following:
mysql> SELECT REGEXP_SUBSTR(Title, '[a-zA-Z][[:alnum:]]+(\\.[a-zA-Z]+)+') from Books;
+-------------------------------------------------------------+
| REGEXP_SUBSTR(Title, '[a-zA-Z][[:alnum:]]+(\\.[a-zA-Z]+)+') |
+-------------------------------------------------------------+
| NULL |
| NULL |
| NULL |
| NULL |
| website.com |
| NULL |
| google.co.it |
| website99.com |
| Bet365.co.uk |
| website.com |
| NULL |
| NULL |
+-------------------------------------------------------------+
12 rows in set (0.01 sec)
So it would appear to be simpler than I thought.
[/EDIT]
modified 20-Sep-23 4:18am.
|
|
|
|
|
Did you test according to my examples? What about:
Something_website.com_Else_Type_XY_Z_26.04.23_Comment => NULL (due to not having 2 preceding underscores first)
Your regex returns website.com there instead of NULL.
|
|
|
|
|
No, I am afraid I did not work that one out; I leave that as an exercise for you.
|
|
|
|
|
Concur here, using RegexBuddy and this string:
[a-zA-Z][[:alnum:]]*\.[a-zA-Z]*|\.[a-zA-Z]
Nothing but net (colloquial expression eliding formal disdain without turning away in disgust from self-realized talentlessness). Thanks for this.
modified 21-Sep-23 17:54pm.
|
|
|
|
|
MySQL's RE engine has one or two quirks of its own(after all, why make it easy?). The only part I could not figure out is how to ensure at least two underscores in the earlier part of the string. I bet @OriginalGriff or @RichardDeeming could do it with their eyes closed.
|
|
|
|
|
Using RegexBuddy (yeah, I guess I am falling into a hole here) the POSIX notation is perfectly acceptable, hence your [:alnum:]. I can substitute [:alpha:] there as well and I suspect others in the list are good actors. My brain infarct occurs though when I use Herr Gevertz' table of translations and try to replace the POSIX with his ASCII ([a-zA-Z0-9]) or for that matter UNICODE ([\p{L}\p{Nl}\p{Nd}]), neither of which work in his interface. Grand tool, RegexBuddy ... it's one of those feature-rich interfaces which swim with example/sample in the help volume.
My only complaint is that searching through documentation for words has no highlight of the found set so I've got to print out the page and use my pdf search to locate all instances!
|
|
|
|
|
Ok, (didn't read the doc link but thanks for that). Using your new regex:
_[a-zA-Z]*\.[a-zA-Z]*|\.[a-zA-Z]
on the sample of your OP:
1. Something_Else_Type_XY_Z_26.04.23_website.com => _website.com
2. Something_Else_Type_XY_Z_website.com_26.04.23 => _website.com
3. Something_Else_Type_XY_Z_26.04.23_website.com_Comment => _website.com
4. Something_website.com_Else_Type_XY_Z_26.04.23_Comment => NULL
5. Something_Else_Type_XY_Z_26.04.23_Comment => NULL
6. Something_Else_Type_XY_dfd6869_3_21.12.22_website.com_ZU = _website.com
Which isn't quite what you wanted (because of that leading "_"). And as for RM's take on it, as-written (without any _[a-zA-Z]*\.) THAT string returns no matches here either (I use a tool called RegexBuddy and after yesterday's experience, with MySql selected as the input language not std::regex).
So although I can confirm today's discovery is close it'll take me some more sleuthing to run down that leading character ascii 95 ...
{EDIT]
I'm seeing a warning when I try to tackle that underscore using "shorthand character classes" \w to the tune of "MySql doesn't support blah blah blah" so ... this could be a while.
[END EDIT]
modified 19-Sep-23 14:52pm.
|
|
|
|
|
See my update to yesterdays answer above.
|
|
|
|
|
You should keep in mind that this is going to be really slow when compared to other types of searches.
So if you are using large volumes of data and/or many searches you should keep that in mind.
You should also keep in mind that there are valid domain names that will have more than one dot (period).
|
|
|
|
|
hi
I am trying to mask 3 to 8 digit numbers alone exclusively, by excluding already masked data with another regex pattern.
For example it should not match the masked patterns like
[xX\-\*\d{1,4}] , but match the pattern \d{3,8}
Can you please help with the regex pattern to achieve this?
I have tried something like this, but its not matching for the pattern \d{3,8}
(?<=(?<!x\-\*)\d{4})\d{3,8}
Input :
1234
x123412345678
-123412345678
*123412345678
123
1234
12345
123456
1234567
12345678
Expected masking
****
x1234****
-1234****
*1234****
****
****
****
****
****
****
|
|
|
|
|
If I were to use Notepad++ then the find regex would be:
([x\-*]\d{1,4})?(\d{3,8})
So it ties the 1-4 numbers immediately following a x (or X as Notepad++ normally case insensitive), a - (this is a meta character so needs the escape char to make it a normal char) or a *. Having them in a group with a ? allows for non capture if the line doesn't start with the x, - or *. The second group captures between 3 and 8 numbers.
Now this all works as expected in Notepad++, however sometimes these don't translate well across the various regular expression engines. As you haven't provided that info I can't give you an answer specific to your need.
Also I note you seem to want to "mask" the 3-8 digits. As it stands my 2nd group will just support replacing with a fixed number of "*", so some adjustment would be needed.
Hopefully this has helped you though to continue with a final solution.
Terry
|
|
|
|
|
hey Terry,
Thanks for the suggestion. I am trying to achieve this patter replacement in java
This one I thought of it but it is not working out for the input
*1234
x1234
-1234
these kind of inputs should not be masked and output should be as it is *1234 / x1234 / -1234
whatever you have given works fine for
*12345678
x12345678
-12345678
|
|
|
|