REGEX: capture multiple, multi-line strings of text between two different strings

Question

I'm attempting to use Regex to parse downloaded recipe text. The ultimate goal is to feed each recipe into a database component by component, so that recipes can later be randomly selected for, say, a weekly menu, and an overall ingredients list (a shopping list) can be generated by summing the quantities of like ingredients across all the selected recipes.

My specific problem is capturing blocks of text from a file containing one or more recipes, so that each block can be extracted for further parsing.

These blocks would be things like the Title, Meal Type (e.g. Breakfast, Snack), Nutritional Qualities (e.g. Calories, Carbs, etc.), Directions and, of course, Ingredients. Starting from the very beginning of a set of recipes in a downloaded 7-day diet menu, the first block of text I want to find is between "Day" and "XXX". I know this specific word search will have to be improved later, but if I can get this example working, I believe I'll be able to handle any such issues that follow.

I've been researching the web for a couple of days now, since I'm always convinced someone has probably asked the same or a similar question before, but I just can't find anything that gets me all the way there.

I've used Regex before on smaller tasks, but I don't consider myself fluent (or even "good") with it. I've also done the requisite study on a couple of very good Regex reference/tutorial web sites, but without success. Please note, I'm using VB.Net.

What I have tried:

Here's an example of an input string. This obviously is greatly shortened for use here, but it is representative of the overall problem.

"
Day 1 3479 calories • 42g carbs (15g fiber) • 272g fat • 216g protein
BREAKFAST1144 calories • 3g carbs (0g fiber) • 98g fat • 60g protein
Bun-less Egg Sandwich
Ingredients:
    1 1/2 tbsp Butter (21 g)
    3 Egg (150 g)
III
Directions:
    Heat a nonstick pan over low heat and brush butter around the pan.
    Bla Bla Bla
DDD
XXX

Day 2 3560 calories • 48g carbs (13g fiber) • 269g fat • 239g protein
BREAKFAST1107 calories • 11g carbs (2g fiber) • 90g fat • 61g protein
Mushroom and Cheddar Omelette
Ingredients:
    1 1/2 tbsp Butter (21 g)
	2 Slices Cheddar Cheese
    3 Egg (150 g)
III
Directions:
    Heat a nonstick pan over medium heat and let butter melt until bubbling.
    Bla Bla Bla
DDD
XXX
"

The desired result from Regex would be two matches: one containing all of Day 1 and a second containing all of Day 2. Note that I've manually inserted "XXX" as a delimiter, both for clarity in the example and because the format, block order, terminology, etc. of other downloaded recipes will likely be indeterminate, or at least very hard to specify. You'll also see "DDD" and "III" inserted as future helper delimiters, which may be avoided with more elegant programming (TBD).

Here are several of the unsuccessful regexes I've tried.

Day[\u0000-\uFFFF.*?]+XXX - 1 Match
\ADay[\u0000-\uFFFF.*?]+XXX - 1 Match
\ADay[\u0000-\uFFFF.*?]+\ZXXX - No Match

I've also tried every combination of Multiline, Singleline and Global mode, on and off. I did this out of frustration, and it isn't clear that any of them helped or hurt the result.

The best I've come up with is just one match that captures everything from "Day" to the last "XXX". In other words, it matches the entire string/file, skipping over the "XXX" in the interior of the string (this is what I can't solve) instead of ending its first match there and then moving on to find the next "Day". I've tried everything that seemed helpful, but I get either "No Match" or 1 match.

BTW, I use a Unicode range in the 'capture everything' token because of the " • " character, and because I'll never know what characters may be contained in other recipes I may download. After more elemental parsing I may yet have to deal with this in a different way, but again, first things first.

Any Help would be greatly appreciated and hopefully instructive to me and others.
Thanks in advance

Posted 13-Jan-21 3:02am

brownpeteg

Updated 13-Jan-21 6:07am

Maciej Los

3 solutions

Solution 2

Just a few interesting links to help with building and debugging RegEx.
Here is a link to RegEx documentation:
perlre - perldoc.perl.org
Here are links to tools that help build and debug RegEx:
.NET Regex Tester - Regex Storm
Expresso Regular Expression Tool
RegExr: Learn, Build, & Test RegEx
Online regex tester and debugger: PHP, PCRE, Python, Golang and JavaScript
This one shows the RegEx as a nice graph, which is really helpful for understanding what a RegEx is doing: Debuggex: Online visual regex tester. JavaScript, Python, and PCRE.
This site also shows the RegEx as a nice graph, but it can't test what the RegEx matches: Regexper

Posted 13-Jan-21 5:50am

Patrice T

Solution 3

If you would like to get only the text between the Day and XXX "tags", use this pattern: (?<=Day)(.*?)(?=XXX) with these options: RegexOptions.Multiline Or RegexOptions.Singleline

For details, see this: mulitline text between tags
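
The pattern above can be tried in a small console program. This is only a sketch: the module name, the variable names and the shortened sample text are placeholders rather than part of the original answer, and only RegexOptions.Singleline is passed because the pattern contains no ^ or $ for Multiline to act on.

```vb
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module LookaroundDemo
    Sub Main()
        ' Shortened stand-in for the recipe text from the question (hypothetical sample).
        Dim inputText As String =
            "Day 1 3479 calories" & vbLf &
            "Bun-less Egg Sandwich" & vbLf &
            "XXX" & vbLf &
            "Day 2 3560 calories" & vbLf &
            "Mushroom and Cheddar Omelette" & vbLf &
            "XXX"

        ' Singleline lets "." cross line breaks; the lazy "*?" stops at the nearest XXX.
        Dim pattern As String = "(?<=Day)(.*?)(?=XXX)"
        Dim matches As MatchCollection = Regex.Matches(inputText, pattern, RegexOptions.Singleline)

        Console.WriteLine($"{matches.Count} match(es)")
        For Each m As Match In matches
            ' The lookarounds exclude "Day" and "XXX", so only the text between them is printed.
            Console.WriteLine("---")
            Console.WriteLine(m.Value.Trim())
        Next
    End Sub
End Module
```

Because the lookbehind and lookahead are zero-width, each match starts just after "Day" and ends just before "XXX", which is why the sample output begins with " 1 3479 calories" rather than "Day 1".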

Sample output:

2 match(es)
---

 1 3479 calories • 42g carbs (15g fiber) • 272g fat • 216g protein
BREAKFAST1144 calories • 3g carbs (0g fiber) • 98g fat • 60g protein
Bun-less Egg Sandwich
Ingredients:
    1 1/2 tbsp Butter (21 g)
    3 Egg (150 g)
III
Directions:
    Heat a nonstick pan over low heat and brush butter around the pan.
    Bla Bla Bla
DDD

---
 2 3560 calories • 48g carbs (13g fiber) • 269g fat • 239g protein
BREAKFAST1107 calories • 11g carbs (2g fiber) • 90g fat • 61g protein
Mushroom and Cheddar Omelette
Ingredients:
    1 1/2 tbsp Butter (21 g)
  2 Slices Cheddar Cheese
    3 Egg (150 g)
III
Directions:
    Heat a nonstick pan over medium heat and let butter melt until bubbling.
    Bla Bla Bla
DDD

---

Posted 13-Jan-21 6:07am

Maciej Los

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

OriginalGriff · Accepted Answer · 2021-01-13T04:07:00

Solution 1

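Just use a basic Regex:

Day \d+.*?XXX

Each Day block will then end up in its own Match:

VB

' Singleline lets "." match line breaks; the lazy ".*?" stops at the first "XXX".
Public Shared regex As Regex = New Regex("Day \d+.*?XXX", RegexOptions.Multiline Or RegexOptions.Singleline Or RegexOptions.CultureInvariant Or RegexOptions.Compiled)

...
        Dim ms As MatchCollection = regex.Matches(InputText)

        ' Print each match wrapped in double quotes.
        For Each m As Match In ms
            ' VB escapes a double quote inside a string by doubling it, not with a backslash.
            Console.WriteLine($"""{m.Value}""")
        Next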
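
The same lazy-match idea could plausibly be applied a second time inside each Day block, using the "III" helper delimiter from the question. The sketch below assumes that layout; the module name, the group name "items" and the shortened block are illustrative choices, not something from the original post.

```vb
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module IngredientsDemo
    Sub Main()
        ' One Day block, as a first-pass match might return it (shortened, hypothetical).
        Dim dayBlock As String =
            "Day 1 3479 calories" & vbLf &
            "Bun-less Egg Sandwich" & vbLf &
            "Ingredients:" & vbLf &
            "    1 1/2 tbsp Butter (21 g)" & vbLf &
            "    3 Egg (150 g)" & vbLf &
            "III" & vbLf &
            "Directions:" & vbLf &
            "    Heat a nonstick pan over low heat." & vbLf &
            "DDD" & vbLf &
            "XXX"

        ' The named group "items" holds everything between "Ingredients:" and "III".
        Dim ingredientsRegex As New Regex("Ingredients:(?<items>.*?)III", RegexOptions.Singleline)
        Dim m As Match = ingredientsRegex.Match(dayBlock)

        If m.Success Then
            ' Split the captured text on line breaks and drop empty entries.
            Dim separators As String() = New String() {vbCrLf, vbLf}
            Dim items As String() = m.Groups("items").Value.Split(separators, StringSplitOptions.RemoveEmptyEntries)
            For Each item As String In items
                ' Trim removes the leading indentation of each ingredient line.
                Console.WriteLine(item.Trim())
            Next
        End If
    End Sub
End Module
```

Each printed line is then a single ingredient entry, ready for a further pass that separates quantity, unit and name.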

Posted 13-Jan-21 4:07am

OriginalGriff

Comments

brownpeteg 13-Jan-21 11:25am

Wow. I have been making mountains out of molehills!! Very simple solutions. Thanks! And I think I've been misinterpreting the descriptions of several Regex elements. For example, one Regex cheat sheet described .* as matching any character except newline (\n), but your solution clearly looks past that because you've set the Multiline options. I set the Multiline option too, but it didn't work... So, would you please explain why you included all of the options that you did, and why your solution worked and mine didn't?

OriginalGriff 13-Jan-21 11:50am

See here: https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-options
It explains Multiline and Singleline.
Singleline is the option you need - it allows "." to match newlines as well as other characters.

If you want to use Regular Expressions, then get a copy of Expresso - it's free, and it examines and generates Regular Expressions.
I use it a lot - it saves huge amounts of effort and frustration.
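
To make the difference between the options concrete, here is a small sketch that counts matches for the same pattern under different options, and for the character-class pattern from the question. The module name and the four-line sample string are made up for illustration.

```vb
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module OptionsDemo
    Sub Main()
        ' Hypothetical four-line sample with two Day blocks.
        Dim inputText As String =
            "Day 1 • 42g carbs" & vbLf & "XXX" & vbLf &
            "Day 2 • 48g carbs" & vbLf & "XXX"

        ' No options: "." stops at each line break, so the match never reaches "XXX".
        Console.WriteLine(Regex.Matches(inputText, "Day \d+.*?XXX").Count)

        ' Multiline only changes ^ and $; this pattern has neither, so nothing changes.
        Console.WriteLine(Regex.Matches(inputText, "Day \d+.*?XXX", RegexOptions.Multiline).Count)

        ' Singleline lets "." match line breaks; the lazy "*?" stops at the first XXX.
        Console.WriteLine(Regex.Matches(inputText, "Day \d+.*?XXX", RegexOptions.Singleline).Count)

        ' Inside [...] the characters . * ? are literals, and the "+" after the class is greedy,
        ' so the match runs on to the last XXX.
        Console.WriteLine(Regex.Matches(inputText, "Day[\u0000-\uFFFF.*?]+XXX").Count)

        ' Moving the "?" outside the class makes the quantifier lazy.
        Console.WriteLine(Regex.Matches(inputText, "Day[\u0000-\uFFFF]+?XXX").Count)
    End Sub
End Module
```

On that sample the five counts should come out as 0, 0, 2, 1 and 2. The fourth line reproduces the single match described in the question: the "?" only makes a quantifier lazy when it follows the quantifier, not when it sits inside the brackets.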

Maciej Los 13-Jan-21 12:12pm

5ed!

brownpeteg 13-Jan-21 12:27pm

Thanks Again!

OriginalGriff 13-Jan-21 12:50pm

You're welcome!