There’s a lot of code out there that deals with URL validation. Basically all of it is concerned with “does this URL meet the RFC spec?” That’s actually not that interesting a question if you are validating user input. You don’t want ‘gopher://whatever/’ or ‘http://10.1.2.3’ as valid URLs in your system if you have asked the user for the address of a web page.
What’s more, because of punycode-powered internationalized URLs, most URL-validating code will tell you that real URLs people can use today are invalid (PHP’s parse_url is not even utf8-safe).
Here’s some PHP code that validates URLs in a more practical way. It checks the TLD against static::$validTlds, populated from the IANA list of valid TLDs, and assumes the presence of a utf8-safe $this->parseUrl such as Joomla’s version.
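If you need to build that TLD list, here’s a minimal sketch of a loader (the loadValidTlds name, the local file path, and the fetch-and-cache strategy are my assumptions, not part of the original code). IANA publishes the registry as a plain-text file at https://data.iana.org/TLD/tlds-alpha-by-domain.txt, one uppercase entry per line behind a '#' comment header:

protected static $validTlds = [];

// Hypothetical loader - fetch the IANA file once, cache it locally, and
// call this at bootstrap rather than on every validation.
public static function loadValidTlds($path = 'tlds-alpha-by-domain.txt') {
    $lines = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    foreach ($lines as $line) {
        // Skip the '#' version/date header; entries are already uppercase
        // ASCII (IDN TLDs appear in their punycode 'XN--' form).
        if (substr($line, 0, 1) !== '#') {
            static::$validTlds[] = $line;
        }
    }
}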
(c) 2013 Thomas David Baker, MIT License
/**
 * Return true if $url is valid, false otherwise.
 *
 * Note that this is not the RFC definition of a valid URL. For example, we
 * differ from the RFC in only accepting http and https URLs, in not
 * accepting single-word hosts, and in accepting non-ASCII characters in
 * hostnames (as modern browsers will punycode-translate them to ASCII
 * automatically).
 *
 * @param string $url URL to validate. Must include 'scheme://' to have any
 *                    chance of validating.
 *
 * @return boolean
 */
public function validUrl($url) {
    $parts = $this->parseUrl($url);
    // We must be able to recognize this as some form of URL.
    if (!$parts) {
        return false;
    }

    // SCHEME.
    // Must be qualified with a scheme.
    if (!isset($parts['scheme']) || !$parts['scheme']) {
        return false;
    }
    // Only http and https are acceptable. No ftp or similar.
    if (!in_array($parts['scheme'], ['http', 'https'])) {
        return false;
    }

    // CHECK FOR 'EXTRA PARTS'.
    // If a URL has unrecognized bits then it is not valid - for example the
    // 'z' in 'www.google.com:80z'. Reassemble the parsed components and
    // compare against the input: anything the parser skipped shows up as a
    // mismatch. Because no separator is added for user/pass components,
    // this check also invalidates URLs that use a user - we don't allow
    // those.
    $partsCheck = $parts;
    $partsCheck['scheme'] .= '://';
    if (isset($partsCheck['port'])) {
        $partsCheck['port'] = ':' . $partsCheck['port'];
    }
    if (isset($partsCheck['query'])) {
        $partsCheck['query'] = '?' . $partsCheck['query'];
    }
    if (isset($partsCheck['fragment'])) {
        $partsCheck['fragment'] = '#' . $partsCheck['fragment'];
    }
    if (implode('', $partsCheck) !== $url) {
        return false;
    }

    // HOST.
    if (!isset($parts['host']) || !$parts['host']) {
        return false;
    }
    // Single-word hosts are not acceptable.
    if (strpos($parts['host'], '.') === false) {
        return false;
    }
    // No spaces, no '--' (catches e.g. 'a.b--c.de'), no leading '-'.
    if (strpos($parts['host'], ' ') !== false) {
        return false;
    }
    if (strpos($parts['host'], '--') !== false) {
        return false;
    }
    if (strpos($parts['host'], '-') === 0) {
        return false;
    }
    // Cope with internationalized domain names. idn_to_ascii() returns
    // false for hosts it cannot convert.
    $host = idn_to_ascii($parts['host']);
    if (!$host) {
        return false;
    }
    $hostSegments = explode('.', $host);
    // The IANA lists TLDs in uppercase, so we compare in uppercase too.
    $tld = mb_strtoupper(array_pop($hostSegments));
    if (!$tld) {
        return false;
    }
    if (!in_array($tld, static::$validTlds)) {
        return false;
    }
    // There must be a domain in front of the TLD.
    $domain = array_pop($hostSegments);
    if (!$domain) {
        return false;
    }

    // PATH.
    // If a path is present it must start with '/'.
    if (isset($parts['path']) && substr($parts['path'], 0, 1) !== '/') {
        return false;
    }

    // If you made it this far you're golden.
    return true;
}
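Assuming the method lives on a class with static::$validTlds populated and a utf8-safe parseUrl wired in (the UrlValidator class name below is purely illustrative), calling it looks like this:

$validator = new UrlValidator();
var_dump($validator->validUrl('http://✪df.ws/123'));         // bool(true)
var_dump($validator->validUrl('http://userid@example.com'));  // bool(false)
var_dump($validator->validUrl('ftps://foo.bar/'));            // bool(false)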
Checked against the list of interesting URLs from http://mathiasbynens.be/demo/url-regex (and elsewhere), it allows all of the following:
http://foo.com/blah_blah
http://foo.com/blah_blah/
http://foo.com/blah_blah_(wikipedia)
http://foo.com/blah_blah_(wikipedia)_(again)
http://www.example.com/wpstyle/?p=364
https://www.example.com/foo/?bar=baz&inga=42&quux
http://✪df.ws/123
http://➡.ws/䨹
http://⌘.ws
http://⌘.ws/
http://foo.com/blah_(wikipedia)#cite-1
http://foo.com/blah_(wikipedia)_blah#cite-1
http://foo.com/unicode_(✪)_in_parens
http://foo.com/(something)?after=parens
http://☺.damowmow.com/
http://code.google.com/events/#&product=browser
http://j.mp
http://foo.com/?q=Test%20URL-encoded%20stuff
http://مثال.إختبار
http://例子.测试
http://उदाहरण.परीक्षा
http://1337.net
http://a.b-c.de
And disallows all of these:
# Invalid URLs
http://
http://.
http://..
http://../
http://?
http://??
http://??/
http://#
http://##
http://##/
http://foo.bar?q=Spaces should be encoded
//
//a
///a
///
http:///a
foo.com
rdar://1234
h://test
http:// shouldfail.com
:// should fail
http://foo.bar/foo(bar)baz quux
ftps://foo.bar/
http://-error-.invalid/
http://a.b--c.de/
http://-a.b.co
http://0.0.0.0
http://10.1.1.0
http://10.1.1.255
http://224.1.1.1
http://1.1.1.1.1
http://123.123.123
http://3628126748
http://.www.foo.bar/
http://www.foo.bar./
http://.www.foo.bar./
http://10.1.1.1
http://10.1.1.254
# The following URLs are valid by the letter of the law but we don't want to allow them.
http://userid:password@example.com:8080
http://userid:password@example.com:8080/
http://userid@example.com
http://userid@example.com/
http://userid@example.com:8080
http://userid@example.com:8080/
http://userid:password@example.com
http://userid:password@example.com/
http://-.~_!$&\'()*+,;=:%40:80%2f::::::@example.com
http://142.42.1.1/
http://142.42.1.1:8080/
http://223.255.255.254
ftp://foo.bar/baz
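Both lists drop straight into a quick regression check; a minimal sketch (only a few entries shown, $validator as above):

$shouldPass = ['http://foo.com/blah_blah', 'http://⌘.ws', 'http://1337.net'];
$shouldFail = ['http://', 'foo.com', 'ftp://foo.bar/baz'];

foreach ($shouldPass as $url) {
    if (!$validator->validUrl($url)) {
        echo "FAIL (expected valid): $url\n";
    }
}
foreach ($shouldFail as $url) {
    if ($validator->validUrl($url)) {
        echo "FAIL (expected invalid): $url\n";
    }
}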


