Basically, as I said last time (How web scrape HTML in Python?), you are much, much better off not using regex at all: HTML is inconsistent, impractical and - in essence - a mess.
While it is possible to use a regex to do it, it's a truly horrible regex you end up with, and it will be very difficult to maintain.
As an exercise, just think about the line-break tag, which can be represented two ways, as <br> or <br />, but never as <br></br>; and about how to identify nested paragraphs.
Then look at "real world" sites and see how many contain malformed HTML with missing close tags ...
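To make the exercise concrete, here is a minimal sketch comparing a naive regex with Python's standard-library `html.parser` on a few spellings of the line-break tag. The sample string and the `BreakCounter` class are made up for illustration; they are not from the original question.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample input: the same line break written three ways.
SAMPLE = "one<br>two<br />three<BR/>four"

# A naive pattern that only matches the literal lowercase "<br>".
naive_hits = re.findall(r"<br>", SAMPLE)


class BreakCounter(HTMLParser):
    """Counts <br> tags however they are spelled."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names before calling this method.
        # Self-closing forms such as "<br />" are routed here too, because
        # the default handle_startendtag calls handle_starttag.
        if tag == "br":
            self.count += 1


counter = BreakCounter()
counter.feed(SAMPLE)
counter.close()

print(len(naive_hits))  # 1
print(counter.count)    # 3
```

The regex could of course be widened to cover these three cases, but each new variant (attributes, odd whitespace, mixed case) means another patch to the pattern, which is exactly the maintenance problem described above.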
Use an HTML parser: you are making your whole project much, much harder than it needs to be!
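As a rough illustration of what "use a parser" can look like, the sketch below pulls link targets out of deliberately malformed markup using only the standard library. The `PAGE` string and the `LinkCollector` class are hypothetical; `html.parser` is used here just to keep the example dependency-free, and a tree-building parser such as Beautiful Soup or lxml is usually more convenient for real scraping.

```python
from html.parser import HTMLParser

# Hypothetical, deliberately malformed page: unclosed <p>, <li> and <a> tags,
# mixed quoting and mixed case.
PAGE = """
<p>First paragraph
<p>Second with <a href="/a">a link
<ul><li><a href='/b'>B</a><li><A HREF="/c">C</a></ul>
"""


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; values arrive unquoted.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


collector = LinkCollector()
collector.feed(PAGE)
collector.close()

print(collector.links)  # ['/a', '/b', '/c']
```

Note that the missing close tags do not matter here: the parser reports start tags as it encounters them, so nothing has to be balanced for the links to be found.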