I need a regex that validates the following forms of URL:

http://www.site.com
https://www.site.com
http://site.com
https://site.com
http://domain.site.com
https://domain.site.com
http://www.domain.site.com
https://www.domain.site.com
site.com 
domain.site.com
http://www.site.com/path/to/dir/
https://www.site.com/path/to/dir/
http://site.com/path/to/dir/
https://site.com/path/to/dir/
http://domain.site.com/path/to/dir/
https://domain.site.com/path/to/dir/
http://www.domain.site.com/path/to/dir/
https://www.domain.domain.site.com/path/to/dir/
site.com/path/to/dir/
domain.site.com/path/to/dir/
http://www.site.com/path/to/file.html
https://www.site.com/path/to/file.html
http://site.com/path/to/file.html
https://site.com/path/to/file.html
http://domain.site.com/path/to/file.html
https://domain.site.com/path/to/file.html
http://www.domain.site.com/path/to/file.html
https://www.domain.domain.site.com/path/to/file.html
site.com/path/to/file.html
domain.site.com/path/to/file.html

And relative paths
./path/to/file.html
./path/to/dir/
./path/to/dir
path/to/file.html
path/to/dir/
path/to/dir

(ftp:// NOT permitted)

The file extension may be html, php, gif, jpg, png.

With my knowledge of regex this would take me a year to accomplish (if not longer). It took me more than an hour yesterday to write a regex for relative URLs, and that didn't turn out the way I wanted it to! I feel like I need to apologize for not knowing regex! :(

Just to note, it's not a problem if the URL doesn't really point anywhere; my main concern is the format. I just need the format to be one of those (and only those) that the examples show (it's an exhaustive list). Those are the only formats I would be using, but if the regex also validates a different format that's OK, as long as the scheme is only http or https (ftp, ftps or anything else is NOT permitted).
Updated 7-Sep-21 4:50am
Comments
Maarten Kools 15-Feb-14 16:41pm
   
RegExr has a community library, which also contains plenty of URL validation expressions, that should give you a quick start. From there you'd have to tweak the expressions a bit to get the result you want.
EZW 15-Feb-14 16:56pm
   
lol that's actually where I'm at right now, trying to get it done while I wait for suggestions (thanks so much for your comment though). I either break it or don't get any new results at all.
Vedat Ozan Oner 15-Feb-14 16:57pm
   
thank you for the link. it is great :)
Maarten Kools 15-Feb-14 17:20pm
   
You're welcome, happy to help :)
Vedat Ozan Oner 15-Feb-14 17:07pm
   
((http|https)://)?[a-zA-Z]\w*(\.\w+)+(/\w*(\.\w+)*)*(\?.+)* is good for URL :)
EZW 15-Feb-14 17:48pm
   
That correctly validates all of my absolute URLs but not my relative URLs. :) Maybe I can modify it a bit. I'll try to modify it... my knowledge in RegEx is pretty bad. Thanks a lot for that post... it's 90% there... gets me that much closer :D
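A quick way to check that observation is to run the pattern from the comment above over a few of the sample URLs. This is only a sketch: the `~` delimiters, the `^`/`$` anchors and the variable names are assumptions added for the test, not part of the posted expression.

```php
<?php
// Pattern from the comment above. The '~' delimiters and the ^...$ anchors
// are additions (assumptions) so that the whole string has to match.
$pattern = '~^((http|https)://)?[a-zA-Z]\w*(\.\w+)+(/\w*(\.\w+)*)*(\?.+)*$~';

// A few samples taken from the question.
$samples = [
    'http://www.site.com',
    'site.com/path/to/file.html',
    './path/to/file.html',
    'path/to/dir',
];

foreach ($samples as $url) {
    $result = preg_match($pattern, $url) === 1 ? 'match' : 'no match';
    echo str_pad($url, 30), $result, PHP_EOL;
}
```

The two absolute forms should match and the two relative forms should not, because the pattern insists on a dot-separated host name right at the start of the string.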
Andreas Gieriet 15-Feb-14 19:14pm
   
What exactly do you mean by "validating"? Do you want to check whether the URLs are formally correct, i.e. according to RFC 1738 and Wikipedia: URL?
Which of your listed URLs do you consider correct? E.g. http://machine.domain.gaga/ is formally correct but has a meaningless host name part (gaga).
You may also encounter #, &, %, ?, = and ; in the trailing part, where they can make perfect sense. What do you expect "validation" to do with these (see the RFC above again)? Please note that only a full URL is a correct URL: if the scheme is missing you may guess one, but that is pure heuristic. You might need to improve your question (and make up your mind about what you really want to achieve) so that anyone here can give you a satisfactory answer. I guess the requirements are not stated well enough, which might be why you haven't managed to get a working solution.
BTW: I guess, you cannot easily solve it with *one* RegEx. Probably, you need to split the text into the various parts and check them individually (not necessarily with RegEx).
Cheers
Andi
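The split-and-check idea could look roughly like the sketch below. It is not anyone's posted solution: the function name `is_acceptable_url`, the allowed character sets and the two path patterns are assumptions chosen to fit the examples in the question, and `parse_url()` is used only to separate scheme, host and path.

```php
<?php
// Hypothetical helper: splits the URL with parse_url() and checks each part.
function is_acceptable_url(string $url): bool
{
    $parts = parse_url($url);
    if ($parts === false) {
        return false; // seriously malformed input
    }

    // Anything besides scheme, host and path (port, user, query, ...) is rejected.
    if (array_diff(array_keys($parts), ['scheme', 'host', 'path']) !== []) {
        return false;
    }

    // Scheme and host must appear together, and the scheme must be http(s).
    if (isset($parts['scheme']) !== isset($parts['host'])) {
        return false;
    }
    if (isset($parts['scheme'])
        && !in_array(strtolower($parts['scheme']), ['http', 'https'], true)) {
        return false;
    }

    // Optional ending: a trailing slash or one of the permitted extensions.
    $end  = '(/|[.](html|php|gif|jpg|png))?';
    $path = $parts['path'] ?? '';

    if (isset($parts['host'])) {
        // Absolute URL: dotted host, then an optional /dir/dir/file.ext path.
        return preg_match('~^[a-z0-9]+([.][a-z0-9]+)+$~i', $parts['host']) === 1
            && ($path === '' || preg_match('~^(/[a-z0-9]+)+' . $end . '$~i', $path) === 1);
    }

    // No scheme: either "site.com/..." or a relative path, optionally with "./".
    return preg_match('~^([.]/)?[a-z0-9]+([.][a-z0-9]+)*(/[a-z0-9]+)*' . $end . '$~i', $path) === 1;
}

var_dump(is_acceptable_url('https://www.domain.site.com/path/to/file.html'));
var_dump(is_acceptable_url('./path/to/dir'));
var_dump(is_acceptable_url('ftp://site.com'));
```

Checking the parts one by one keeps each expression small, and the scheme test becomes a plain whitelist lookup instead of something buried in a long pattern.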
EZW 15-Feb-14 19:27pm
   
Hi, thanks a lot for that constructive reply. The URLs would be system generated, so they WOULD be valid under normal circumstances, but they would end up in a place where the user could modify them (such as a URL query), so I need to check that each one is in a proper format and I don't leave the system open to attacks. Whether the URLs lead somewhere, I don't care... if one doesn't lead anywhere I have a fall-back option (or, if it doesn't open, a default action). I just need a regex that validates (or confirms) that the URL is in an acceptable format (as shown by the examples in the original post, which are exhaustive).

Try this:
(^(http[s]?://)?([w]{3}[.])?([a-z0-9]+[.])+com(((/[a-z0-9]+)*(/[a-z0-9]+/))*([a-z0-9]+[.](html|php|gif|png))?)$)|(^([.]/)?((([a-z0-9]+)/?)+|(([a-z0-9]+)/)+([a-z0-9]+[.](html|php|gif|png)))?$)
   
Comments
EZW 15-Feb-14 23:23pm
   
With a little modification, I got it to work!!! Thank you very much sir!!!!!!!

((^(http[s]?:\/\/)?([w]{3}[.])?(([a-z0-9\.]+)+(com|php))(((\/[a-z0-9]+)*(\/[a-z0-9]+\/?))*([a-z0-9]+[.](html|php|gif|png|jpg))?)$)|((^([.]\/)?((([a-z0-9]+)\/?)+|(([a-z0-9]+)\/)+([a-z0-9]+[.](html|php|gif|png|jpg))))$))
Peter Leow 15-Feb-14 23:36pm
   
I see that you added the jpg extension and an escape for php. I have tested it, and it works for site.com too. Accept this as the answer?
EZW 15-Feb-14 23:41pm
   
It does work for everything :D I believed an online regex tester which showed different results than my Apache... now I know not to trust them lol Accepted! Thanks a lot
You might try this (slightly shorter than solution #1):
PHP
^((https?:[/][/])?\w+[.])+com|((https?:[/][/])?\w+[.])+com[/]|[.][/])?\w+([/]\w+)*([/]|[.]html|[.]php|[.]gif|[.]jpg|[.]png)?)$

[EDIT1]
The correct pattern was
txt
^((https?:[/][/])?(\w+[.])+com|((https?:[/][/])?(\w+[.])+com[/]|[.][/])?\w+([/]\w+)*([/]|[.]html|[.]php|[.]gif|[.]jpg|[.]png)?)$
There was a mistake with the parentheses.
[/EDIT1]

This decomposes into (the <yyy> need to be replaced by the respective patterns):
txt
<valid>        = <prefix>|(<prefix>[/]|[.][/])?<path>
<prefix>       = (https?:[/][/])?<host>
<host>         = \w+([.]\w+)*[.]com
<path>         = \w+([/]\w+)*([/]|[.]html|[.]php|[.]gif|[.]jpg|[.]png)?
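To see the decomposition in action, the sketch below assembles the full pattern from the four building blocks and runs it over some of the URLs from the question. The variable names and the `~` delimiter are assumptions made for this example.

```php
<?php
// Building blocks, named after the decomposition above.
$host   = '\w+([.]\w+)*[.]com';
$prefix = '(https?:[/][/])?' . $host;
$path   = '\w+([/]\w+)*([/]|[.]html|[.]php|[.]gif|[.]jpg|[.]png)?';
$valid  = $prefix . '|(' . $prefix . '[/]|[.][/])?' . $path;

// Anchor the whole alternation. '~' is chosen as delimiter (an assumption)
// because it does not occur anywhere in the pattern.
$rx = '~^(' . $valid . ')$~';

$samples = [
    'https://www.domain.site.com',
    'site.com/path/to/dir/',
    './path/to/file.html',
    'path/to/dir',
    'ftp://site.com',          // must be rejected
];

foreach ($samples as $url) {
    echo str_pad($url, 32), preg_match($rx, $url) === 1 ? 'valid' : 'invalid', PHP_EOL;
}
```

Keeping `<host>`, `<prefix>` and `<path>` in separate variables also makes it easy to test each piece on its own before combining them.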

[EDIT2]
The query comes after the path or, if the path is absent, after the prefix. No query is allowed for a path without a prefix.
[/EDIT2]

[EDIT3]
To manage complexity, split the patterns into separate variables and concatenate them into the full pattern. This enables you to test parts of the full pattern.

E.g.
PHP
// query
$rx_qpart = '\\w+=[^&]*';
$rx_qhead = '[?]'.$rx_qpart;
$rx_qnext = '[&]'.$rx_qpart;
$rx_qtail = '('.$rx_qnext.')*';
$rx_query = '('.$rx_qhead.$rx_qtail.')?'; // *** to be used in the main pattern
// path
$rx_ppart = '\\w+';
$rx_phead = $rx_ppart;
$rx_pnext = '[/]'.$rx_ppart;
$rx_ptail = '('.$rx_pnext.')*';
$rx_pdend = '[/]';
$rx_pfend = '[.]html|[.]php|[.]gif|[.]jpg|[.]png';
$rx_pend  = '('.$rx_pdend.'|'.$rx_pfend.')?';
$rx_rpath = $rx_phead.$rx_ptail.$rx_pend;                     // *** to be used in the main pattern
$rx_qpath = $rx_phead.$rx_ptail.'('.$rx_pfend.')?'.$rx_query; // *** to be used in the main pattern
// host
$rx_hpart = '\\w+';
$rx_hhead = $rx_hpart;
$rx_hnext = '[.]'.$rx_hpart;
$rx_htail = '('.$rx_hnext.')*';
$rx_top   = '[.]com'; // I suggest replacing this with $rx_top = $rx_hnext;
$rx_host  = $rx_hhead.$rx_htail.$rx_top; // *** to be used in the main pattern
// protocol
$rx_protocol = '(https?:[/][/])?'; // *** to be used in the main pattern
// prefix
$rx_prefix = $rx_protocol.$rx_host;
// **** full pattern ****
$rx_url = '^('.$rx_prefix.'[/]?'
          .'|'.$rx_prefix.'[/]'.$rx_qpath
          .'|'.$rx_prefix.$rx_query
          .'|'.$rx_rpath
          .'|'.'[.][/]'.$rx_rpath
          .')$';
Note: use single quotes so that the PHP interpreter does not further interpret special characters such as $ and \ inside the strings.
[/EDIT3]


Cheers
Andi
   
Comments
EZW 16-Feb-14 20:36pm
   
That is smaller, but I get an unknown modifier '?' error... if I put '/' at either end the error changes to unknown modifier ']'. Thanks for the help :D
EZW 11-Apr-14 2:07am
   
This works now... though slightly modified, since the parentheses were mismatched

(^((https?:[/][/])?\w+[.])+com|(((https?:[/][/])?\w+[.])+com[/]|[.][/])?\w+([/]\w+)*([/]|[.]html|[.]php|[.]gif|[.]jpg|[.]png)?)$

Now I've got another problem (didn't foresee it then :/ ): I need the regex to allow a query string.
Andreas Gieriet 11-Apr-14 3:07am
   
Any list of examples?
Adding a query string has to be done in the prefix, after the host. E.g. <query> = ([?]\w+=\w*(&\w+=\w*)*)?
See my updated solution above.
Cheers
Andi
PS: I've corrected my pattern. It did indeed have a problem with the parentheses.
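As a rough illustration of that `<query>` suggestion, the sketch below appends the optional query part to a host-only pattern. The variable names, the `~` delimiter and the restriction to host-only URLs are simplifications assumed for the example.

```php
<?php
$host  = '\w+([.]\w+)*[.]com';
$query = '([?]\w+=\w*(&\w+=\w*)*)?';   // ?key=value&key=value, values may be empty

// Optional scheme, host, optional query. '~' is an assumed delimiter.
$rx = '~^(https?://)?' . $host . $query . '$~';

$samples = [
    'http://site.com?a=1&b=2',
    'site.com?flag=',
    'http://site.com?a',        // no "=": rejected
    'http://site.com?a=1&',     // dangling "&": rejected
];

foreach ($samples as $url) {
    echo str_pad($url, 28), preg_match($rx, $url) === 1 ? 'valid' : 'invalid', PHP_EOL;
}
```

Because keys and values are limited to `\w`, values containing characters such as `-`, `%` or spaces would be rejected. Whether that is strict enough or too strict depends on what the generated URLs actually contain.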
EZW 11-Apr-14 22:39pm
   
I get:

Warning: preg_match(): Unknown modifier ']'

:(
Andreas Gieriet 12-Apr-14 10:40am
   
Somehow I missed that you want it for PHP. I did wonder why it worked for me but not for you. My solution is for .Net (e.g. C#) and not for PHP. I assume it is similar, but might differ in details.
Cheers
Andi
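One likely explanation for the `Unknown modifier ']'` warning, offered here as an assumption rather than a confirmed diagnosis: `preg_match()` expects the pattern to be wrapped in delimiters, and PHP looks for the closing delimiter before it compiles the expression. With `/` as delimiter the pattern therefore ends at the first unescaped `/`, even inside `[/]`, and the `]` that follows is read as a modifier. The sketch below compares `/` with `~`, a delimiter that does not occur in the pattern.

```php
<?php
// Corrected pattern from the solution above, without any delimiters.
$body = '^((https?:[/][/])?(\w+[.])+com|((https?:[/][/])?(\w+[.])+com[/]|[.][/])?'
      . '\w+([/]\w+)*([/]|[.]html|[.]php|[.]gif|[.]jpg|[.]png)?)$';

// '/' as delimiter: the pattern is cut off inside the first '[/]'.
// '@' hides the warning; preg_match() returns false when compilation fails.
$broken = @preg_match('/' . $body . '/', 'http://site.com');

// '~' as delimiter: nothing in the pattern collides with it.
$working = preg_match('~' . $body . '~', 'http://site.com');

var_dump($broken, $working);
```

The alternative is to keep `/` as delimiter and escape every slash in the pattern as `\/`, which is what the later versions in this thread do.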
EZW 12-Apr-14 23:24pm
   
Oh, that might be it. Thanks for helping me out though, I really appreciate the effort
EZW 13-Apr-14 1:49am
   
I've fixed the error (the problem was unescaped slashes and signs such as '?', '&' and '.'). I now have the following regex, which almost works... it still doesn't allow for query strings.

^((https?[\:][\/][\/])?(\w+[\.])+com(((\&|\?)\w+\=\w*)*)?|((https?[\:][\/][\/])?(\w+[\.])+com(((\&|\?)\w+\=\w*)*)?[\/]|[\.][\/])?\w+([\/]\w+)*([\/]|[\.]html|[\.]php|[\.]gif|[\.]jpg|[\.]png)?)$
EZW 13-Apr-14 1:53am
   
I got the following (based on the 1st solution) to work:

((^(http[s]?:\/\/)?([w]{3}[.])?(([a-z0-9\.]+)+(com|php))(((\/[a-z0-9]+)*(\/[a-z0-9]+\/?))*([a-z0-9]+[.](html|php|gif|png|jpg))?)(((\&|\?)\w+\=\w*)*)$)|((^([.]\/)?((([a-z0-9]+)\/?)+|(([a-z0-9]+)\/)+([a-z0-9]+[.](html|php|gif|png|jpg))))$))

But it's pretty big, and I don't know if I got it right
You might try this simply:

C#
Uri.IsWellFormedUriString(YourURLString, UriKind.RelativeOrAbsolute)


See MSDN
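For PHP, a comparable built-in is `filter_var()` with `FILTER_VALIDATE_URL`. As far as I know it only accepts absolute URLs and does not restrict the scheme, so on its own it would not cover the relative forms from the question and would let `ftp://` through. The sketch adds an explicit scheme check; the helper name `is_http_url` is hypothetical.

```php
<?php
// Hypothetical helper: built-in validation plus an explicit scheme whitelist.
function is_http_url(string $url): bool
{
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false; // not an absolute, well-formed URL
    }
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    return in_array($scheme, ['http', 'https'], true);
}

var_dump(is_http_url('https://www.site.com/path/to/file.html'));
var_dump(is_http_url('ftp://site.com'));
var_dump(is_http_url('site.com/path/to/dir/'));
```

Scheme-less and relative forms such as `site.com/path/to/dir/` would still need separate handling, for example with one of the patterns above.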
   
Comments
CHill60 1-May-19 6:04am
   
Except URI and URL are not the same thing - the latter is only a sub-set of the former
I tried to post an answer many times, but the preview kept making me think the posted text was being modified.
So I deleted them to try again, but then got blocked, so this is almost what I was trying to post...
^((https?:\/\/)?([a-z0-9]+\.?)*[a-z0-9]+\.(com|php)(\/([a-z0-9]+\/)+)?|(\.\/)?([a-z0-9]+\/)+[a-z0-9]*$|(\.\/)?([a-z0-9]+\/)+)([a-z0-9]+\.(html?|php|png|jpg|gif))?$


If it's not displaying or working properly, go to https://regex101.com/r/jOvcz0/1 instead.
It shows the unedited expression, with notes at the bottom on my guesses about what should not match.
If anybody has advice on the best pre tags to use with regex, it would be very much appreciated.

I don't know much about .html or query strings, but if you post samples I'm certain the experts can help modify it.
It's very unfortunate that getting a truthful preview was like 1000x harder than trying to answer some questions.
Now I've finally learned not to look at that last preview after clicking "Submit your solution".
   
Comments
Richard Deeming 7-Sep-21 11:01am
   
How does that pattern differ from Peter Leow's solution (solution 1, posted February 2014)?

If you're going to post a new solution to such an old question, you need to make sure you're adding something to the discussion, and you need to clearly explain why your solution is better than the existing ones.
[no name] 7-Sep-21 12:18pm
   
First and foremost, I never saw the date, nor did I say it was better.
How's it different?...

1: Forward slashes are escaped for PHP.
2: No superfluous www matching, no redundant file-extension matching.
3: It won't match simple text strings like "abcdefg" or empty lines.
4: Blah, blah, no point. I'm not one to criticize, just trying to help.

My apologies for trying to help another human being, it won't happen again.
Not on this site. I would delete it, but maybe it can help someone else.
Don't worry, this account will be deleted as soon as I'm done thanking someone.
Richard Deeming 7-Sep-21 12:31pm
   
If you update your solution to explain how it differs from the existing solutions, and why someone might choose yours over any of the others, then it could be a valid solution, regardless of the age of the question.

Just dumping another regex without explanation, particularly when accompanied by an off-topic rant about the functionality of the site, is not a good solution.

And responding to constructive criticism with a threat to "rage-quit" is not a good approach to life.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



