|
Unfortunately none of them worked at all.
REGEXP_SUBSTR(field_name, '^([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}$') immediately timed out (#3699 - Timeout exceeded in regular expression match ), and the rest just returned NULL for everything.
Any idea why?
|
|
|
|
|
Well, first of all those regex were std::regx not MySQL.
What exactly have timing out and error #3699 to do with this "SELECT" query? I'm not sure why none of them work because I'm not sure how you're leveraging this database. REGEX_SUBSTR? Is that a function?
Probably Oracle, he said.
I know nothing about Oracle, sorry.
|
|
|
|
|
|
I just tried a quick test at RegExr: Learn, Build, & Test RegEx[^] and the following regex finds all the web addresses in the set below (shown in bold). The only thing I cannot correct is that it includes the leading underscore. However it may be a starting point to help you (I have not actually tried it with MySQL).
_[a-zA-Z]*\.[a-zA-Z]*|\.[a-zA-Z].
Something_Else_Type_XY_Z_26.04.23_website.com
Something_Else_Type_XY_Z_website.co.uk_26.04.23
Something_Else_Type_XY_Z_26.04.23_website.com_Comment
Something_website.com_Else_Type_XY_Z_26.04.23_Comment
Something_Else_Type_XY_Z_26.04.23_Comment
Something_Else_Type_XY_dfd6869_3_21.12.22_website.com_ZU
|
|
|
|
|
Unfortunately REGEXP_SUBSTR(field_name, '_[a-zA-Z]*\.[a-zA-Z]*|\.[a-zA-Z]') just gives me the first 2 letters of the field_name ("So" in this case).
|
|
|
|
|
Yes, I got similar results, which does not make a lot of sense. However, on reading the documentation (https://dev.mysql.com/doc/refman/8.0/en/regexp.html#function_regexp-substr[^]) closely I discovered why. Hidden down near the bottom of the page is the following:
Quote: To use a literal instance of a special character in a regular expression, precede it by two backslash (\) characters. The MySQL parser interprets one of the backslashes, and the regular expression library interprets the other. For example, to match the string 1+2 that contains the special + character, only the last of the following regular expressions is the correct one:
So I tried with that, and a slight modification to my original pattern, and successfully extracted the fields as shown in the following results:
mysql> SELECT REGEXP_SUBSTR(Title, '_[a-zA-Z]+(\\.[a-zA-Z]+)+') from Books;
+---------------------------------------------------+
| REGEXP_SUBSTR(Title, '_[a-zA-Z]+(\\.[a-zA-Z]+)+') |
+---------------------------------------------------+
| NULL |
| NULL |
| NULL |
| NULL |
| _website.com |
| NULL |
| _website.com |
| _website.com |
| _website.co.uk |
| _website.com |
| NULL |
| NULL |
+---------------------------------------------------+
The NULL entries are records that do not contain a web address.
[EDIT]
I just tried without the leading underscore in the regular expression and added some numbers into the website names and got the following:
mysql> SELECT REGEXP_SUBSTR(Title, '[a-zA-Z][[:alnum:]]+(\\.[a-zA-Z]+)+') from Books;
+-------------------------------------------------------------+
| REGEXP_SUBSTR(Title, '[a-zA-Z][[:alnum:]]+(\\.[a-zA-Z]+)+') |
+-------------------------------------------------------------+
| NULL |
| NULL |
| NULL |
| NULL |
| website.com |
| NULL |
| google.co.it |
| website99.com |
| Bet365.co.uk |
| website.com |
| NULL |
| NULL |
+-------------------------------------------------------------+
12 rows in set (0.01 sec)
So it would appear to be simpler than I thought.
[/EDIT]
modified 20-Sep-23 4:18am.
|
|
|
|
|
Did you test according to my examples? What about:
Something_website.com_Else_Type_XY_Z_26.04.23_Comment => NULL (due to not having 2 preceding underscores first)
Your regex returns website.com there instead of NULL.
|
|
|
|
|
No, I am afraid I did not work that one out; I leave that as an exercise for you.
|
|
|
|
|
Concur here, using RegexBuddy and this string:
[a-zA-Z][[:alnum:]]*\.[a-zA-Z]*|\.[a-zA-Z]
Nothing but net (colloquial expression eliding formal disdain without turning away in disgust from self-realized talentlessness). Thanks for this.
modified 21-Sep-23 17:54pm.
|
|
|
|
|
MySQL's RE engine has one or two quirks of its own(after all, why make it easy?). The only part I could not figure out is how to ensure at least two underscores in the earlier part of the string. I bet @OriginalGriff or @RichardDeeming could do it with their eyes closed.
|
|
|
|
|
Using RegexBuddy (yeah, I guess I am falling into a hole here) the POSIX notation is perfectly acceptable, hence your [:alnum:]. I can substitute [:alpha:] there as well and I suspect others in the list are good actors. My brain infarct occurs though when I use Herr Gevertz' table of translations and try to replace the POSIX with his ASCII ([a-zA-Z0-9]) or for that matter UNICODE ([\p{L}\p{Nl}\p{Nd}]), neither of which work in his interface. Grand tool, RegexBuddy ... it's one of those feature-rich interfaces which swim with example/sample in the help volume.
My only complaint is that searching through documentation for words has no highlight of the found set so I've got to print out the page and use my pdf search to locate all instances!
|
|
|
|
|
Ok, (didn't read the doc link but thanks for that). Using your new regex:
_[a-zA-Z]*\.[a-zA-Z]*|\.[a-zA-Z]
on the sample of your OP:
1. Something_Else_Type_XY_Z_26.04.23_website.com => _website.com
2. Something_Else_Type_XY_Z_website.com_26.04.23 => _website.com
3. Something_Else_Type_XY_Z_26.04.23_website.com_Comment => _website.com
4. Something_website.com_Else_Type_XY_Z_26.04.23_Comment => NULL
5. Something_Else_Type_XY_Z_26.04.23_Comment => NULL
6. Something_Else_Type_XY_dfd6869_3_21.12.22_website.com_ZU = _website.com
Which isn't quite what you wanted (because of that leading "_"). And as for RM's take on it, as-written (without any _[a-zA-Z]*\.) THAT string returns no matches here either (I use a tool called RegexBuddy and after yesterday's experience, with MySql selected as the input language not std::regex).
So although I can confirm today's discovery is close it'll take me some more sleuthing to run down that leading character ascii 95 ...
{EDIT]
I'm seeing a warning when I try to tackle that underscore using "shorthand character classes" \w to the tune of "MySql doesn't support blah blah blah" so ... this could be a while.
[END EDIT]
modified 19-Sep-23 14:52pm.
|
|
|
|
|
See my update to yesterdays answer above.
|
|
|
|
|
You should keep in mind that this is going to be really slow when compared to other types of searches.
So if you are using large volumes of data and/or many searches you should keep that in mind.
You should also keep in mind that there are valid domain names that will have more than one dot (period).
|
|
|
|
|
hi
I am trying to mask 3 to 8 digit numbers alone exclusively, by excluding already masked data with another regex pattern.
For example it should not match the masked patterns like
[xX\-\*\d{1,4}] , but match the pattern \d{3,8}
Can you please help with the regex pattern to achieve this?
I have tried something like this, but its not matching for the pattern \d{3,8}
(?<=(?<!x\-\*)\d{4})\d{3,8}
Input :
1234
x123412345678
-123412345678
*123412345678
123
1234
12345
123456
1234567
12345678
Expected masking
****
x1234****
-1234****
*1234****
****
****
****
****
****
****
|
|
|
|
|
If I were to use Notepad++ then the find regex would be:
([x\-*]\d{1,4})?(\d{3,8})
So it ties the 1-4 numbers immediately following a x (or X as Notepad++ normally case insensitive), a - (this is a meta character so needs the escape char to make it a normal char) or a *. Having them in a group with a ? allows for non capture if the line doesn't start with the x, - or *. The second group captures between 3 and 8 numbers.
Now this all works as expected in Notepad++, however sometimes these don't translate well across the various regular expression engines. As you haven't provided that info I can't give you an answer specific to your need.
Also I note you seem to want to "mask" the 3-8 digits. As it stands my 2nd group will just support replacing with a fixed number of "*", so some adjustment would be needed.
Hopefully this has helped you though to continue with a final solution.
Terry
|
|
|
|
|
hey Terry,
Thanks for the suggestion. I am trying to achieve this patter replacement in java
This one I thought of it but it is not working out for the input
*1234
x1234
-1234
these kind of inputs should not be masked and output should be as it is *1234 / x1234 / -1234
whatever you have given works fine for
*12345678
x12345678
-12345678
|
|
|
|
|
Hi, not good at RegEx and trying to remove some duplicate values.
data is in CSV, part of it looks like this:
"/content/7/66345/images/590009.jpg , "/content/7/66345/images/590009.jpg , "/attachments/fe519c1e91c5e4983a70a2512fd5788b.jpg , "/content/7/66345/images/590009.jpg , "/content/7/66345/images/590009.jpg , "/attachments/4956e4fe56b59135c086605c9gyye.png
"/content/1/3968663/images/856609.jpg , "/attachments/086605c7c6e4fe56b59135c11b.jpg , "/content/1/3968663/images/856609.jpg , "/attachments/086605c7c6e4fe56b59135c11b.jpg
"/content/1/1458767/images/856657.jpg
"/content/1/1448511/images/856373.jpg
I am trying the following using Notepad++:
\w+\.+jpg|\w+\.+png(?:^|\G)(\b\w+\b),?(?=.*\1)
it does select one image by one image when clicking find, but I am not sure how to delete duplicates,
when I replace it with empty, it removes all, I want to let the first image remain and delete the duplicates, Anyone can help me with the code, please?
I don't want to remove the line, because it could mess with the CSV file, removing the extension of the duplicate image is OK,
Thanks for your help.
|
|
|
|
|
Hi, requesting some clarification:
Given the sample input, please post the expected output, that's to avoid any misunderstanding by me. I could guess but prefer not too.
Also, must this be done on Notepad++ (if yes - why ?), and on which OS ?.
thks
|
|
|
|
|
The regex you provided is close, but there are a few modifications needed to achieve your desired result. Here's the correct regex and how you can use it in Notepad++ -
("\/.*?\.(?:jpg|png))\s*,\s*(?=.*\1)
I have tested this regex in Notepad++ using the following steps -
1) Open your CSV file in Notepad++.
2) Press Ctrl + H to open the "Find" dialog.
3) In the "Find what" field, enter the regex: ("\/.*?\.(?:jpg|png))\s*,\s*(?=.*\1).
4) Leave the "Replace with" field empty.
5) In the "Search Mode" section, select "Regular expression".
6) Click on "Replace All".
Make sure to have a backup of your data before performing any find and replace operations, just in case...
|
|
|
|
|
Presumably your expectation is the following
1. The entire row is duplicated.
2. The duplicated row immediately follows the first row.
Otherwise I doubt regex is the way to go.
|
|
|
|
|
If the ordering of the rows is insignificant you can simply sort the lines to collect the duplicate rows together. (And if necessary, sort again on a key field after you have completed.)
But if you want to remove entire duplicated rows (lines), are you serious about using a regex to compare entire text lines for being identical? That can't be! But from the OP's first post, I cannot see what he intends to compare, and what he intends to remove.
|
|
|
|
|
trønderen wrote: are you serious about using a regex to compare entire text lines for being identical
Myself?
No I would not have attempted it with regex at all. I probably would have created a one shot perl script, not for the regex capabilities, but rather because reading files is easier to set up. And running it for iteration testing is easier also.
And I would note that the editor I use does have a fairly decent regex. So the lack of that would not have impacted my decision.
|
|
|
|
|
I have a regexp that works, in my software I search for timestamps with this:
[01]?[0-9]:[0-5][0-9] and a macro replaces the carriage return with a tab and then I proceed from there. But it's very time-consuming when the timestamps go over 10 minutes as then 2 tabs are required (it's a weird thing but that's how it goes).
1. So, to outline, from 0:00 to 9:99 timestamps, one tab is needed afterwards.
2. But from 10:00 and up, i.e., timestamps like 22:46 and 1:35:05 for example, require 2 tabs aftewards.
If it's any help, here is what my script looks like that goes through the entire document and deletes the carriage return and puts one tab after the timestamp (but where the timestamp needs one tab only between 0:00 and 9:99, then 2 tabs for larger timestamp times.
document.selection.Find("[01]?[0-9]:[0-5][0-9]",eeFindNext | eeFindReplaceRegExp);
document.selection.EndOfLine(false,eeLineView);
document.selection.Text="\x09";
document.selection.Delete(1);
Thank you!
|
|
|
|
|
Member 14835146 wrote: But it's very time-consuming
That is not specific. As in it takes 10 seconds? Or 10 hours?
Regexes meet specific needs but speed is not necessarily one of them. For starters a regex is always interpreted in the process. Even 'compiled' ones still end up in a form that is at best halfway to an actual compiled solution.
And your problem is in fact something that likely could be solved by real code. So that is likely something that would be faster.
But other than that it appears you might be attempting to do a regex solution for an entire file ('document') rather than doing it line by line. If you do in fact have lines which have a fixed number of timestamps then looping might provide a better solution especially if you can anchor the regex.
|
|
|
|