Practical URL Validation – bluebones.net

There’s a lot of code out there that deals with URL validation. Basically all of it is concerned with “does this URL meet the RFC spec?” That’s actually not that interesting a question if you are validating user input. You don’t want ‘gopher://whatever/’ or ‘http://10.1.2.3’ as valid URLs in your system if you have asked the user for the address of a web page.

What’s more, because of punycode-powered internationalized URLs, most URL-validating code will tell you that real URLs people can use today are invalid (PHP’s parse_url is not even UTF-8 safe).
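As a rough illustration of the parse_url problem, here is one way a UTF-8-safe wrapper could look. This is a sketch under my own assumptions, not Joomla’s implementation: the name parseUrlUtf8 is made up for this example. It percent-encodes every non-ASCII byte (and every literal '%') before calling parse_url, then decodes each component again so the parts match the original string byte for byte.

```php
<?php
/**
 * Hypothetical UTF-8-safe wrapper around parse_url() (illustration only).
 *
 * @param string $url Url to split into parts.
 *
 * @return array|false Same shape as parse_url(), or false on failure.
 */
function parseUrlUtf8(string $url)
{
    // Escape '%' and every byte >= 0x80 so parse_url() only sees ASCII.
    // Escaping '%' first means existing sequences like %20 survive the
    // decode step below unchanged.
    $encoded = preg_replace_callback(
        '/[%\x80-\xFF]/',
        function (array $m): string {
            return rawurlencode($m[0]);
        },
        $url
    );

    $parts = parse_url($encoded);
    if ($parts === false) {
        return false;
    }

    // Undo the escaping on string components (the port is an int).
    foreach ($parts as $key => $value) {
        if (is_string($value)) {
            $parts[$key] = rawurldecode($value);
        }
    }

    return $parts;
}

$parts = parseUrlUtf8('http://✪df.ws/123');
```

Escaping '%' as well as the non-ASCII bytes matters if you later compare the parts against the original URL, because anything the user already percent-encoded comes back exactly as typed.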

Here’s some PHP code that validates URLs in a more practical way. It checks hosts against static::$validTlds, populated from the IANA list of valid TLDs, and assumes the presence of a UTF-8-safe $this->parseUrl such as Joomla’s version. It also calls idn_to_ascii, which needs PHP’s intl extension.

    // (c) 2013 Thomas David Baker, MIT License

    /**
     * Return true if url is valid, false otherwise.
     *
     * Note that this is not the RFC definition of a valid URL.  For example we
     * differ from the RFC in only accepting http and https URLs, not accepting
     * single word hosts, and accepting any characters in hostnames (as modern
     * browsers will punycode translate them to ASCII automatically).
     *
     * @param string $url Url to validate.  Must include 'scheme://' to have any
     *                    chance of validating.
     *
     * @return boolean
     */
    public function validUrl($url) {
        $parts = $this->parseUrl($url);

        // We must be able to recognize this as some form of URL.
        if (!$parts) {
            return false;
        }

        // SCHEME.
        // Must be qualified with a scheme.
        if (!isset($parts['scheme']) || !$parts['scheme']) {
            return false;
        }
        // Only http and https are acceptable.  No ftp or similar.
        if (!in_array($parts['scheme'], ['http', 'https'], true)) {
            return false;
        }

        // CHECK FOR 'EXTRA PARTS'.
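        // If a URL has unrecognized bits then it is not valid - for example the
        // 'z' in 'www.google.com:80z'.  We rebuild the URL from its parts and
        // compare it with the input.
        // As a side effect this also rejects URLs with a user or password,
        // because the rebuilt string omits the '@' separator - we don't allow
        // those.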
        $partsCheck = $parts;
        $partsCheck['scheme'] .= '://';
        if (isset($partsCheck['port'])) {
            $partsCheck['port'] = ':' . $partsCheck['port'];
        }
        if (isset($partsCheck['query'])) {
            $partsCheck['query'] = '?' . $partsCheck['query'];
        }
        if (isset($partsCheck['fragment'])) {
            $partsCheck['fragment'] = '#' . $partsCheck['fragment'];
        }
        if (implode('', $partsCheck) !== $url) {
            return false;
        }

        // HOST.
        if (!isset($parts['host']) || !$parts['host']) {
            return false;
        }
        // Single word hosts are not acceptable.
        if (strpos($parts['host'], '.') === false) {
            return false;
        }
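        // Whitespace is never valid in a host.
        if (strpos($parts['host'], ' ') !== false) {
            return false;
        }
        // No consecutive hyphens.  Note this also rejects hosts that are
        // already in punycode form ('xn--...').
        if (strpos($parts['host'], '--') !== false) {
            return false;
        }
        // The host must not start with a hyphen.
        if (strpos($parts['host'], '-') === 0) {
            return false;
        }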
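        // Cope with internationalized domain names.  idn_to_ascii() returns
        // false when the host cannot be converted.
        $host = idn_to_ascii($parts['host']);
        if ($host === false) {
            return false;
        }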

        $hostSegments = explode('.', $host);
        // The IANA lists TLDs in uppercase, so we do too.
        $tld = mb_strtoupper(array_pop($hostSegments));
        if (!$tld) {
            return false;
        }
        // $tld is already uppercase at this point.
        if (!in_array($tld, static::$validTlds, true)) {
            return false;
        }
        // There must be a non-empty label in front of the TLD.
        $domain = array_pop($hostSegments);
        if (!$domain) {
            return false;
        }

        // PATH.
        if (isset($parts['path']) && substr($parts['path'], 0, 1) !== '/') {
            return false;
        }

        // If you made it this far you're golden.
        return true;
    }
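The function above expects static::$validTlds to already hold the IANA list. As a minimal sketch of how that array might be built, the helper below parses text in the shape of IANA’s tlds-alpha-by-domain.txt file: a comment line starting with '#', then one TLD per line. The name parseTldList is hypothetical, and fetching and caching the file is left out.

```php
<?php
/**
 * Hypothetical helper: turn the text of the IANA TLD list into an array
 * that could be assigned to static::$validTlds.
 *
 * @param string $text Contents of the TLD list file.
 *
 * @return string[] Uppercase TLDs, in file order.
 */
function parseTldList(string $text): array
{
    $tlds = [];
    foreach (preg_split('/\R/', $text) as $line) {
        $line = trim($line);
        // Skip blank lines and '#' comment lines.
        if ($line === '' || $line[0] === '#') {
            continue;
        }
        // Normalize to uppercase to match the comparison in validUrl().
        $tlds[] = strtoupper($line);
    }

    return $tlds;
}

// Shortened sample in the same shape as the real file.
$validTlds = parseTldList("# comment line\nCOM\nNET\nXN--P1AI\n");
```

Internationalized TLDs appear in the list in their punycode form, which is why validUrl converts the host with idn_to_ascii before looking up the TLD.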

Looking at the list of interesting URLs from http://mathiasbynens.be/demo/url-regex and elsewhere, it allows all of the following:

http://foo.com/blah_blah
http://foo.com/blah_blah/
http://foo.com/blah_blah_(wikipedia)
http://foo.com/blah_blah_(wikipedia)_(again)
http://www.example.com/wpstyle/?p=364
https://www.example.com/foo/?bar=baz&inga=42&quux
http://✪df.ws/123
http://➡.ws/䨹
http://⌘.ws
http://⌘.ws/
http://foo.com/blah_(wikipedia)#cite-1
http://foo.com/blah_(wikipedia)_blah#cite-1
http://foo.com/unicode_(✪)_in_parens
http://foo.com/(something)?after=parens
http://☺.damowmow.com/
http://code.google.com/events/#&product=browser
http://j.mp
http://foo.com/?q=Test%20URL-encoded%20stuff
http://مثال.إختبار
http://例子.测试
http://उदाहरण.परीक्षा
http://1337.net
http://a.b-c.de

And disallows all of these:

# Invalid URLs
http://
http://.
http://..
http://../
http://?
http://??
http://??/
http://#
http://##
http://##/
http://foo.bar?q=Spaces should be encoded
//
//a
///a
///
http:///a
foo.com
rdar://1234
h://test
http:// shouldfail.com
:// should fail
http://foo.bar/foo(bar)baz quux
ftps://foo.bar/
http://-error-.invalid/
http://a.b--c.de/
http://-a.b.co
http://0.0.0.0
http://10.1.1.0
http://10.1.1.255
http://224.1.1.1
http://1.1.1.1.1
http://123.123.123
http://3628126748
http://.www.foo.bar/
http://www.foo.bar./
http://.www.foo.bar./
http://10.1.1.1
http://10.1.1.254
# The following URLs are valid by the letter of the law but we don't want to allow them.
http://userid:password@example.com:8080
http://userid:password@example.com:8080/
http://userid@example.com
http://userid@example.com/
http://userid@example.com:8080
http://userid@example.com:8080/
http://userid:password@example.com
http://userid:password@example.com/
http://-.~_!$&'()*+,;=:%40:80%2f::::::@example.com
http://142.42.1.1/
http://142.42.1.1:8080/
http://223.255.255.254
ftp://foo.bar/baz
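Lists like the two above are easy to turn into a regression check. The sketch below is one way to do that and is not part of the original code: findMisclassified is a hypothetical helper that takes any validator callable, and the stand-in validator is only there so the example runs on its own. In practice you would pass something like [$yourObject, 'validUrl'].

```php
<?php
/**
 * Hypothetical helper: run a validator over fixture lists and report
 * every URL it classifies wrongly.
 *
 * @param callable $validator  Takes a URL string, returns bool.
 * @param string[] $allowed    URLs that should validate.
 * @param string[] $disallowed URLs that should not validate.
 *
 * @return string[] One message per misclassified URL.
 */
function findMisclassified(callable $validator, array $allowed, array $disallowed): array
{
    $wrong = [];
    foreach ($allowed as $url) {
        if (!$validator($url)) {
            $wrong[] = "rejected but should pass: $url";
        }
    }
    foreach ($disallowed as $url) {
        if ($validator($url)) {
            $wrong[] = "accepted but should fail: $url";
        }
    }

    return $wrong;
}

// Stand-in validator for demonstration only; far cruder than validUrl().
$standIn = function (string $url): bool {
    return strpos($url, 'http://') === 0 && strpos($url, '@') === false;
};

$wrong = findMisclassified(
    $standIn,
    ['http://foo.com/blah_blah', 'http://j.mp'],
    ['ftp://foo.bar/baz', 'http://userid@example.com']
);
```

An empty result means every fixture was classified as expected.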
