Hi, would someone be so kind as to help with a regex question? I have an HTML file that I need to search for all URLs contained within href attributes that are not in a list I have specified.

I need to return every line that contains a URL that is neither https://www.xyz.com/* nor http://www.abc.com/start.html. There are many other URLs I want to exclude, but I have removed them for simplicity.


For example, my html file contains:
<title>hello world</title>
<body>
<a href="http://www.xyz.com/start.aspx">Homepage</a>
</body>


Thank you so much in advance

What I have tried:

This works perfectly, as it returns all lines with href attributes containing URLs:

(<a\ href=")((ht|f)tp(s?)\:\/\/.*)

However, I need to exclude https://www.xyz.com/* and http://www.abc.com/start.html, so I attempted this, but it still returns the same lines and doesn't exclude my list:

(<a\ href=")(((ht|f)tp(s?)\:\/\/.*)(?!https\:\/\/www.xyz.com\/|http\:\/\/www.abc.com\/start.html.*))
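A likely reason the second pattern excludes nothing: the negative lookahead sits after the greedy `.*`, so by the time it is tested the URL has already been consumed, and the regex engine only needs to find some later position where the excluded text does not follow. Moving the lookahead to directly after `href="` makes it test the start of the URL itself. Below is a minimal sketch of that idea in Python, not a definitive solution. The function name `lines_with_unlisted_urls` is hypothetical, only the two URLs from the question are listed, and it assumes double-quoted `href` values written exactly as `<a href="...">`.

```python
import re

# Negative lookahead placed right after href=" so it inspects the URL itself.
# Assumption: only the two exclusions named in the question are listed here.
PATTERN = re.compile(
    r'<a href="'
    r'(?!https://www\.xyz\.com/|http://www\.abc\.com/start\.html")'
    r'((?:ht|f)tps?://[^"]*)'
)


def lines_with_unlisted_urls(text):
    """Return every line holding at least one href URL that is not excluded."""
    return [line for line in text.splitlines() if PATTERN.search(line)]


if __name__ == "__main__":
    sample = "\n".join([
        "<title>hello world</title>",
        '<a href="http://www.xyz.com/start.aspx">Homepage</a>',
        '<a href="https://www.xyz.com/page.html">Excluded by prefix</a>',
        '<a href="http://www.abc.com/start.html">Excluded exactly</a>',
    ])
    for line in lines_with_unlisted_urls(sample):
        print(line)
```

Note that the example line from the question uses `http://www.xyz.com/`, not `https://`, so it is still reported under the exclusion list as written. The trailing `"` after `start\.html` keeps that exclusion an exact match instead of a prefix match.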
Updated 11-Oct-21 10:28am
Comments
Richard Deeming 12-Oct-21 4:32am    
Are you sure your files will never contain an href using single quotes, or one without quotes? Both could be valid HTML, but wouldn't be picked up by your current regex.

It would probably be better to use AngleSharp to parse the HTML files.
rjb911 12-Oct-21 7:35am    
Thanks for the reply. No, I cannot confirm that it will never be enclosed in single quotes or have no quotes at all. I just need to get all URLs not matching my list.
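Since single-quoted and unquoted href values cannot be ruled out, a parser-based approach along the lines suggested above may be more robust than a regex. AngleSharp is a .NET library; the sketch below swaps in Python's standard `html.parser` module instead, purely as an illustration of the same idea. The names `HrefCollector`, `is_excluded` and `unlisted_hrefs` are hypothetical, and the exclusion list contains only the two URLs from the question.

```python
from html.parser import HTMLParser

# Assumption: the exclusion list from the question; extend as needed.
EXCLUDED_PREFIXES = ("https://www.xyz.com/",)
EXCLUDED_EXACT = {"http://www.abc.com/start.html"}
# Same schemes the original regex accepts: http, https, ftp, ftps.
SCHEMES = ("http://", "https://", "ftp://", "ftps://")


def is_excluded(url):
    """True if the URL is on the exclusion list (exact or by prefix)."""
    return url in EXCLUDED_EXACT or url.startswith(EXCLUDED_PREFIXES)


class HrefCollector(HTMLParser):
    """Collects (line number, url) for every href that is not excluded."""

    def __init__(self):
        super().__init__()
        self.hits = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name != "href" or not value:
                continue
            if value.startswith(SCHEMES) and not is_excluded(value):
                # getpos() reports the line where the current tag starts.
                self.hits.append((self.getpos()[0], value))


def unlisted_hrefs(html_text):
    parser = HrefCollector()
    parser.feed(html_text)
    parser.close()
    return parser.hits
```

The parser deals with the quoting style, so double-quoted, single-quoted and unquoted attributes are all handled the same way. Calling `unlisted_hrefs(open(path).read())` would give the line numbers and URLs to report, which can then be mapped back to the original lines if the full line text is needed.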

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


