Regular Expressions Discussion Boards

Regular Expression for a repeating pattern?

Les Stockton 6-Oct-23 7:50

I've got some data input by a user.
It looks like <span>&nbsp; &nbsp; </span>, and there can be any number of these non-breaking spaces up until the close of the span tag, with nothing else.
I'm trying to figure out a reasonable way to detect an occurrence of a span like this. It could be the span followed by 30 occurrences of the non-breaking space, or 300 occurrences, or any number in between.
I was hoping there'd be a way to detect this repeating pattern with a regular expression.

XML

<span>&nbsp; &nbsp; </span>

Re: Regular Expression for a repeating pattern?

PIEBALDconsult 6-Oct-23 7:59

You might need to test for the Unicode value.
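To illustrate the Unicode suggestion: if the markup has already been decoded, each `&nbsp;` entity becomes the single character U+00A0, and the pattern has to look for that character rather than the entity text. A minimal Python sketch, assuming the input may arrive in either form (the names `NBSP_SPAN` and `is_blank_span` are made up for this example):

```python
import html
import re

# A span whose content is one or more non-breaking spaces (U+00A0),
# each optionally followed by ordinary spaces.
NBSP_SPAN = re.compile(r"<span>(?:\u00a0 *)+</span>")

def is_blank_span(markup: str) -> bool:
    # html.unescape turns the entity "&nbsp;" into the character U+00A0,
    # so the encoded and the decoded form are handled the same way.
    return NBSP_SPAN.fullmatch(html.unescape(markup)) is not None
```

Decoding first means one pattern covers both spellings; whether that is appropriate depends on where the data comes from.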

Re: Regular Expression for a repeating pattern?

RedDk 6-Oct-23 8:08

Scarf this down:
Regex Quantifier Tutorial: Greedy, Lazy, Possessive

modified 25-Oct-23 2:06am.

Re: Regular Expression for a repeating pattern?

jschell 9-Oct-23 5:26

The specifics of where the regex is used, and with which regex engine, matter.

But in general

{code}
\s*( \s*)+
{code}

Les Stockton wrote:
until the close of the span tag

In valid XML looking for the closing tag is pointless. But you can add it if you want.

Les Stockton wrote:
XML

Just noting that using regexes to parse XML is not a good idea. Primarily this comes down to blocks embedded in other blocks: you cannot parse that with a regex. There are other complex issues as well that would require hideous (which means slow) regexes.

Also there can be other variances in what you posted.
1. Multiline
2. Spaces in the tags
3. Attributes in the tag.
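The three variations listed above can be tolerated to a degree by loosening the pattern. The following Python sketch is only an illustration: the names `BLANK_SPAN` and `find_blank_spans` are hypothetical, it assumes no attribute value contains a `>` character, and it still cannot cope with nested spans.

```python
import re

# <span ...> with optional attributes, then one or more &nbsp; entities
# separated by any whitespace (including newlines), then </span>.
BLANK_SPAN = re.compile(
    r"<span\b[^>]*>(?:\s*&nbsp;)+\s*</span\s*>",
    re.IGNORECASE,
)

def find_blank_spans(markup: str) -> list[str]:
    # Return every span that holds nothing but &nbsp; and whitespace.
    return [m.group(0) for m in BLANK_SPAN.finditer(markup)]
```

For anything beyond this narrow case, an XML or HTML parser is the safer route, as the post says.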

Re: Regular Expression for a repeating pattern?

k5054 9-Oct-23 6:12

In general, trying to parse XML (or HTML) with regex is not a good idea, and almost certainly doomed to failure. However, to match this specific case you might try:

RegEx

<span>(&nbsp; *)+</span>

That's an extended POSIX regex, and seems to do the job. It matches any of the following:

XML

<span>&nbsp;</span>
<span>&nbsp; </span>
<span>&nbsp; &nbsp; </span>
<span>&nbsp; &nbsp; &nbsp; </span>
<span>&nbsp;&nbsp; &nbsp;</span>

If you need to accept any white space, you might try using (&nbsp;[[:space:]]*) as the sub-pattern.
If the span text may contain line breaks, you may need to tell your regex engine not to treat them as end-of-text markers.
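As a quick check of the pattern above, here is a Python translation. Python's `re` module does not support POSIX bracket classes such as `[[:space:]]`, so this sketch substitutes `\s`; the names `NBSP_SPAN` and `SAMPLES` are assumptions for the example.

```python
import re

# Same shape as the extended POSIX pattern, with \s standing in for
# [[:space:]] so that tabs and line breaks are accepted as well.
NBSP_SPAN = re.compile(r"<span>(&nbsp;\s*)+</span>")

SAMPLES = [
    "<span>&nbsp;</span>",
    "<span>&nbsp; </span>",
    "<span>&nbsp; &nbsp; </span>",
    "<span>&nbsp; &nbsp; &nbsp; </span>",
    "<span>&nbsp;&nbsp; &nbsp;</span>",
    "<span>&nbsp;\n&nbsp;</span>",  # line break inside the span
]

results = [NBSP_SPAN.fullmatch(s) is not None for s in SAMPLES]
```

In Python the string is matched as a whole, so the embedded line break needs no special flag; line-oriented tools such as grep behave differently.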

Keep Calm and Carry On

Exclude Uppercase for conjoined names

Plastmannen 2-Oct-23 22:08

Hello! I'm new to learning regex and I can't seem to figure out how to exclude certain characters from my regex.
In my example I want to structure names as FIRST_NAME, LAST_NAME. Meaning "Rog er , Green" becomes "Roger, Green".
I have accomplished this with the RegEx (?=[A-Z]). My current issue is that in Sweden, where I reside, conjoined names such as "Lars-Erik" are rather common, and with my current RegEx that becomes "Lars- Erik", which I don't want.
It also does not take into account Nordic uppercase letters such as Ö etc.
Is there a way of excluding uppercase letters that have a hyphen prefix, as well as including more than US characters?

Re: Exclude Uppercase for conjoined names

OriginalGriff 2-Oct-23 22:20

When you use square brackets, you are limiting the match to just the characters (and ranges) you include within the brackets: [A-Z] includes only the characters 'A' to 'Z'. If you want to extend that to accented characters, you either need to add them to the brackets: [A-Za-zÄÀÉÏÔÕÖÜßàâäèéêëîôõöûüçÇ] or use the less specific "alphanumeric character" code \w instead - this has the disadvantage of including '0' to '9' (and the underscore) as well, but ... it's a lot easier to read ... :-D

You can also include the hyphen in your "permitted characters" list to allow for hyphenated names: [\w-]
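For the original question (skip uppercase letters that follow a hyphen, and accept Nordic capitals), a lookbehind can be combined with an extended character class. This Python sketch assumes the workflow is "remove all whitespace, then insert a space before each uppercase letter", which is a guess at the poster's setup; `UPPER`, `SPLIT_POINT` and `tidy_name` are hypothetical names. Engines with Unicode categories (.NET, for instance) could use `\p{Lu}` instead of listing letters, but Python's built-in `re` module does not support that.

```python
import re

# Uppercase letters that may start a name part. The Nordic additions are
# an assumption for this example; extend the class as needed.
UPPER = "A-ZÅÄÖÆØ"

# Zero-width position before an uppercase letter, but only when a
# preceding character exists and is not a hyphen.
SPLIT_POINT = re.compile(rf"(?<=[^-])(?=[{UPPER}])")

def tidy_name(raw: str) -> str:
    compact = re.sub(r"\s+", "", raw)  # drop all whitespace first
    return SPLIT_POINT.sub(" ", compact)
```

With this sketch, `tidy_name("Rog er , Green")` gives `"Roger, Green"`, and `"Lars- Erik , Öberg"` becomes `"Lars-Erik, Öberg"`.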

If you are going to use regular expressions, you need a helper tool. Get a copy of Expresso - it's free, and it examines and generates regular expressions.

"I have no idea what I did, but I'm taking full credit for it." - ThisOldTony
"Common sense is so rare these days, it should be classified as a super power" - Random T-shirt
AntiTwitter: @DalekDave is now a follower!

Re: Exclude Uppercase for conjoined names

Plastmannen 2-Oct-23 23:25

I see, \w would match all characters, right? Not just uppercase letters. So in my case I would need to individually add all uppercase variants of accented characters into my brackets after A-Z?

Re: Exclude Uppercase for conjoined names

OriginalGriff 3-Oct-23 0:42

Yes.

Re: Exclude Uppercase for conjoined names

Plastmannen 3-Oct-23 1:04

Alright thank you!

Re: Exclude Uppercase for conjoined names

Richard Deeming 2-Oct-23 23:53

It sounds like you're building a great way to thoroughly annoy your users. :doh:

Even ignoring the famous "Falsehoods Programmers Believe About Names" list, if Mr Green wants to be known as "Rog er", why should he be forced to change his name to "Rog-Er" or "Roger" just to fit in with your system's rules?

I understand that you're probably just trying to prevent typos. But you're building a system that can't cope with anything even slightly unusual "just in case" someone can't type their own name.

"These people looked deep within my soul and assigned me a number based on the order in which I joined."
- Homer

Re: Exclude Uppercase for conjoined names

Plastmannen 3-Oct-23 1:26

I see your concern, but this RegEx is to be implemented with an OCR engine which sometimes (unfortunately) adds whitespace where there is none.
So in my example the user has actually entered "Roger" and the OCR interprets it as two parts, "Rog er". I want to eliminate those cases; if an actual "Rog er" appears, then my customer is informed that they have to look for "Roger".
"Rog-er / Rog-Er" is accounted for and allowed!

Re: Exclude Uppercase for conjoined names

trønderen 3-Oct-23 8:08

Norwegian names frequently omit the hyphen: In my school class there were both Per Erik, Hans Petter, Gunn Marit and Marit Irene (all first names, not first+family names). Others had double first names, but used only one of them except at formal occasions. Both my parents had double, un-hyphenated first names (and my father's second first name was used so rarely that I didn't know of it until my mid-teens!).

(In Sweden, you are quite likely to get in contact with people with names Norwegian style.)

The list of false assumptions in the article linked by Richard Dennings is great!

Re: Exclude Uppercase for conjoined names

Richard Deeming 3-Oct-23 21:36

trønderen wrote:
Richard Dennings

Who?

"These people looked deep within my soul and assigned me a number based on the order in which I joined."
- Homer

Re: Exclude Uppercase for conjoined names

trønderen 4-Oct-23 8:40

Sorry. My mistake. When you open a reply window, the list of messages is by default hidden, and I took the name from my (incorrect) memory of the author of the message no longer in view.

I hope you were not terribly offended. I am sorry for my mistake anyway.

Re: Exclude Uppercase for conjoined names

jschell 4-Oct-23 4:58

So the real problem is the OCR solution.

The reality is that you are unlikely to be able to deal with all of the error cases. Your solution might introduce more errors.

So it is a trade off.

If error reduction is considered a significant issue, then perhaps better to look into getting a different OCR solution and use both of them. Then compare the output from both and only apply fixes when there is a difference.

If as I said it is a significant problem then any additional cost should not be a problem. But if the cost is a problem then perhaps it isn't as significant as thought.

Re: Exclude Uppercase for conjoined names

trønderen 4-Oct-23 9:02

Doing this in a language such as C# would have been a trivial task.

It would have given you a lot more flexibility in handling e.g. standard name parts that are not capitalized, such as Ludvigvan Beethoven, Charlesde Gaulle or Bengtaf Klintberg. Lots of other special cases and variations could be handled in a much more maintainable way.
I have linked to this several times earlier, but it cannot be repeated too often: Geek & Poke: Yesterday's regex

You cannot expect your name matching to be perfect on the first try. Or second. Or third. E.g. a list of prepositions such as "van", "de", "af" ... will grow and grow. Adding them to a C# list is far easier than updating your regex.
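The list-based approach described here might look roughly like the sketch below. It is written in Python rather than C# purely for illustration, and the merging rule is an assumption, not the poster's design: a fragment that starts with a lowercase letter is glued onto the previous word unless it appears in the (hypothetical) `PARTICLES` set.

```python
# Name particles that are legitimately lowercase and stand on their own.
# This set is illustrative only and would grow over time.
PARTICLES = {"van", "de", "af"}

def merge_fragments(text: str) -> str:
    merged: list[str] = []
    for token in text.split():
        # A lowercase-initial fragment that is not a known particle is
        # treated as the tail of the previous word (an OCR split).
        if merged and token[0].islower() and token not in PARTICLES:
            merged[-1] += token
        else:
            merged.append(token)
    return " ".join(merged)
```

Here `merge_fragments("Rog er Green")` yields `"Roger Green"`, while `"Ludvig van Beet hoven"` becomes `"Ludvig van Beethoven"`. Adding a new particle is a one-word change to the set, which is the maintainability point being made.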

Re: Exclude Uppercase for conjoined names

OriginalGriff 12-Oct-23 1:02

There are also English (and Welsh) surnames that start with "ff": it indicated "son of" in Middle Age English and was a single letter - literally, an uppercase "F" was written as "ff". Until the end of the Middle Ages, the initial capitalization of any name wasn't a thing - names were all written in lowercase. Some rich people* have kept the lowercase starter to this day (and can get very shirty if you use uppercase!)

* Who mostly were the only ones with surnames anyway, they didn't become common practice until the aftermath of the Black Death.

learning regex isn't easy :-)

Kardock 18-Sep-23 4:22

hi all,

so, I'm new to regex, trying to understand and i admit i'm lost.

here's what i need right now;

i have a list of string where i wish to extract the email address of users, each line looks like this:

DisplayName;Surname;Givenname;Mail;Company

which gives me something like:

$line = 'jsmith;john;smith;john.smith@someemail.com;acme'

since I'm new and not sure how this works, i ran these tests to learn; the results are shown next to each one. now i'm trying to understand why the last 2 shown here are failing.

$line -match '\w+' = True
$line -match '\w+;' = true
$line -match '\w+;\w+;' = true
$line -match '\w+;\w+;\w+' = true
$line -match '\w+;\w+;\w+;' = false
$line -match '\w+;\w+;\w+;\.*' = false

at first i thought that this regex would give me the email but it fails.

$regex = '\w+;\w+;\w+;(\w+@\w+);\w+'

thanks for helping me.

Re: learning regex isn't easy :-)

Richard Deeming 18-Sep-23 5:02

Based on your description, you want to extract the fourth field from each line:

RegEx

^([^;]*;){3}([^;]+);

Demo[^]
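For readers without access to the demo link, the pattern can be tried in Python as below. This is only a sketch using the sample line from the question; the names `FOURTH_FIELD`, `line` and `mail` are introduced here for illustration.

```python
import re

# Three fields (each possibly empty) followed by ';', then the fourth
# field captured in group 2, which must be followed by another ';'.
FOURTH_FIELD = re.compile(r"^([^;]*;){3}([^;]+);")

line = "jsmith;john;smith;john.smith@someemail.com;acme"

match = FOURTH_FIELD.search(line)
mail = match.group(2) if match else None
```

Note that the trailing `;` in the pattern means the fourth field cannot be the last one on the line, which fits the five-column layout described in the question.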

However, depending on the source of the data, you may need to consider how it would "escape" a semicolon embedded in one of the field values.

For example, given a display name of j;smith, would that end up as j\;smith? j;;smith? Something else? Or would it just corrupt the entire line?

Once you start having to account for "escaped" separators, parsing the line becomes much harder.

"These people looked deep within my soul and assigned me a number based on the order in which I joined."
- Homer

Re: learning regex isn't easy :-)

k5054 18-Sep-23 5:33

If you know that you do not have any embedded semicolons in your input text, then maybe a simple split would work for you instead of a regex, e.g. fields[] = split(line, ';') (or however your base language does that). This is far simpler, and should be much quicker than applying a regex and extracting a match. However, as Richard points out, if you do have embedded semicolons you'll need to know how they're escaped in the string. In that case it is probably still faster to write a parser that will extract the fields to an array, or to a struct or class of some sort.
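A minimal Python sketch of the split approach, assuming five semicolon-separated columns with no embedded or escaped semicolons (the function name `mail_from_line` is made up for the example):

```python
from typing import Optional

def mail_from_line(line: str) -> Optional[str]:
    # DisplayName;Surname;Givenname;Mail;Company -> Mail is index 3.
    fields = line.split(";")
    if len(fields) != 5:
        # Wrong column count: possibly an embedded semicolon or bad data.
        return None
    return fields[3]
```

Checking the field count is a cheap way to notice lines where the "no embedded semicolons" assumption does not hold.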

On a related note, you might be tempted to apply a regex to validate the email address, but that is not as simple and straightforward as it might seem. See this discussion from Stack Overflow: https://stackoverflow.com/a/201378 The next response on that SO page, which refers to the MailAddress class, may also be useful if you're using C#.


Re: learning regex isn't easy :-)

jschell 18-Sep-23 5:42

Kardock wrote:
each line looks like this:

Which suggests that it is CSV data. Although 'CSV' stands for 'comma separated value' in general usage the separator can be other types including a semi-colon.

So the best solution is to find a CSV library and use that rather than attempting to roll your own. You should look to see how the library handles bad data (ill formed CSV).
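As one example of the library route, Python's standard `csv` module accepts a custom delimiter. The sketch below assumes the file has the header row shown in the question and that embedded semicolons, if any, are protected by double quotes; both are assumptions about the data, and `SAMPLE` and `read_mails` are hypothetical names.

```python
import csv
import io

SAMPLE = (
    "DisplayName;Surname;Givenname;Mail;Company\n"
    "jsmith;john;smith;john.smith@someemail.com;acme\n"
    '"j;smith";john;smith;j.smith@someemail.com;acme\n'
)

def read_mails(text: str) -> list[str]:
    # DictReader uses the first row as column names, so the Mail column
    # is looked up by name rather than by position.
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    return [row["Mail"] for row in reader]
```

The quoted `"j;smith"` field in the third line is parsed as a single value, which is the case a hand-written split or regex would get wrong.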

Re: learning regex isn't easy :-)

Kardock 18-Sep-23 6:07

you're right but that gave me a chance to try to understand regex.

Re: learning regex isn't easy :-)

trønderen 18-Sep-23 6:04

Make sure to learn one lesson about regex: Don't overuse it.

I've seen numerous regex problems where solving the task using an algorithmic language (such as C#) would be straightforward and simple - and flexible enough to handle with ease all the exceptions and special cases that really can give you a headache trying to do it as a regex.

And there is Geek & Poke: Yesterday's regex

Disclaimer: The only pattern matching language I liked was SNOBOL, but I haven't seen it in use for a few decades now. SNOBOL is (/was) sort of a crossover between predicates and algorithmic programming - you could see it as a different kind of bool expression evaluation, in an otherwise algorithmic programming language. In particular, the predicates were written in a far more readable format than in traditional regex. (I am not holding my breath waiting for SNOBOL to rise to new stardom, though!)

Re: learning regex isn't easy :-)

jschell 19-Sep-23 5:29

So given that you just want to mess around with regex.

Kardock wrote:
but it fails.

Presumably you mean it runs but it does not successfully match.

The problem is '\w' is not an expression that could ever match an email. So you need to look up what it does match.
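To see what `\w` does match, it helps to run it against the email on its own. The Python sketch below uses the sample line from the question; the variable names are introduced here for illustration, and PowerShell's `-match` uses the .NET engine, so details could differ.

```python
import re

line = "jsmith;john;smith;john.smith@someemail.com;acme"
mail = "john.smith@someemail.com"

# \w+ stops at '.' and '@', so the address is cut into four word runs.
word_runs = re.findall(r"\w+", mail)

# The first three fields contain only word characters, so this matches.
prefix = re.search(r"\w+;\w+;\w+;", line)

# The original attempt fails: after "john", \w+ cannot cross the '.'.
original = re.search(r"\w+;\w+;\w+;(\w+@\w+);\w+", line)
```

With the sample line exactly as posted, `prefix` does match in Python, so the failures reported in the question may depend on the real data (for example a display name containing a space or a dot); that is a guess, not something the thread confirms.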

The other problem that you will find is that attempting to actually match a valid email is very difficult. The regex to do it is about 1000 characters long. You can google that both to see what a long regex looks like and to educate yourself what a 'valid' email actually is. (I do it every couple of years to remind myself especially when someone says they want to 'validate' an email.)

However you don't need to match an email. What you need to match is the fourth value in the list. So the way to match that is the following

[^;]+

You should probably in fact match all of the columns that way.

So you should study that expression to figure out what it does. And then answer for yourself why the other posters commented that embedded semicolons are a problem.
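A small Python experiment with `[^;]+` may help with that study. It is a sketch only: `fields` and `broken` are names made up here, and the second input is an invented line with a semicolon inside the display name.

```python
import re

line = "jsmith;john;smith;john.smith@someemail.com;acme"

# [^;]+ means "one or more characters that are not a semicolon",
# so each match is one complete field.
fields = re.findall(r"[^;]+", line)

# A display name of "j;smith" is indistinguishable from two fields,
# which shifts every later column one position to the right.
broken = re.findall(r"[^;]+", "j;smith;john;smith;j.smith@someemail.com;acme")
```

In the first case `fields[3]` is the email address; in the second, index 3 holds `"smith"` and the list has six entries instead of five.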
