Dead Simple Google Maps API Geocoding

Google turned off the v2 Google Maps API on September 9th, which means my 2008 PHP Wrapper for Google Maps API Geocoding has ceased to function.

I’ve put a replacement dead simple PHP wrapper for the v3 Google Maps Geocoding API on GitHub. It has the same API as before.

v2 of the API now responds with "We're sorry... ... but your computer or network may be sending automated queries. To protect our users, we can't process your request right now. See Google Help for more information.", which isn’t a super useful message.
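If you would rather skip the wrapper entirely, the v3 Geocoding API is just a plain HTTP endpoint that returns JSON. Here is a minimal sketch; the endpoint and response shape are Google's, but geocodeUrl() and firstLatLng() are made-up helper names for illustration:

```php
<?php
// Build a v3 Geocoding API request URL.  The 'sensor' parameter is
// required by the v3 API at the time of writing.
function geocodeUrl($address) {
    return 'https://maps.googleapis.com/maps/api/geocode/json?'
        . http_build_query(['address' => $address, 'sensor' => 'false']);
}

// Pull the first result's coordinates out of a decoded JSON response,
// or return null if the lookup didn't succeed.
function firstLatLng(array $response) {
    if ($response['status'] !== 'OK') {
        return null;
    }
    $loc = $response['results'][0]['geometry']['location'];
    return [$loc['lat'], $loc['lng']];
}

// In real use you would fetch geocodeUrl(...) with file_get_contents()
// or cURL, then json_decode($body, true) and pass it to firstLatLng().
```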

Learning Clojure

I completed the first 50 problems at 4clojure.com.

The site helps you learn Clojure in the now-common way: by presenting you with a long series of problems.

As well as setting the problems, it tests your answers live in the browser.

You can follow other users, and once you complete a problem you can see their solutions to it.

Sometimes that leads you from a solution that looks like this (both remove duplicate items from a sequence while preserving order):

(fn [coll]
  ((fn dist [prev coll]
     (lazy-seq
      (when-let [[x & xs] (seq coll)]
        (let [more (dist (conj prev x) xs)]
          (if (contains? prev x)
            more
            (cons x more))))))
   #{} coll))

to one that looks like this:

reduce #(if (some #{%2} %) % (conj % %2)) []

Good stuff.

Practical URL Validation

There’s a lot of code out there that deals with URL validation. Basically all of it is concerned with “does this URL meet the RFC spec?” That’s actually not that interesting a question if you are validating user input. You don’t want ‘gopher://whatever/’ or ‘http://10.1.2.3’ as valid URLs in your system if you have asked the user for the address of a web page.

What’s more, because of Punycode-powered internationalized domain names, most URL validating code will tell you that real URLs that people can use today are invalid (PHP’s parse_url is not even UTF-8 safe).

Here’s some PHP code that validates URLs in a more practical way. It checks the TLD against static::$validTlds, populated from the IANA list of valid TLDs, and assumes the presence of a UTF-8 safe $this->parseUrl such as Joomla’s version (below).

    // (c) 2013 Thomas David Baker, MIT License

    /**
     * Return true if url is valid, false otherwise.
     *
     * Note that this is not the RFC definition of a valid URL.  For example we
     * differ from the RFC in only accepting http and https URLs, not accepting
     * single word hosts, and accepting any characters in hostnames (as modern
     * browsers will punycode translate them to ASCII automatically).
     *
     * @param string $url Url to validate.  Must include 'scheme://' to have any
     *                    chance of validating.
     *
     * @return boolean
     */
    public function validUrl($url) {
        $parts = $this->parseUrl($url);

        // We must be able to recognize this as some form of URL.
        if (!$parts) {
            return false;
        }

        // SCHEME.
        // Must be qualified with a scheme.
        if (!isset($parts['scheme']) || !$parts['scheme']) {
            return false;
        }
        // Only http and https are acceptable.  No ftp or similar.
        if (!in_array($parts['scheme'], ['http', 'https'])) {
            return false;
        }

        // CHECK FOR 'EXTRA PARTS'.
        // If a URL has unrecognized bits then it is not valid - for example the
        // 'z' in 'www.google.com:80z'.
        // This check invalidates URLs that use a user - we don't allow those.
        $partsCheck = $parts;
        $partsCheck['scheme'] .= '://';
        if (isset($partsCheck['port'])) {
            $partsCheck['port'] = ':' . $partsCheck['port'];
        }
        if (isset($partsCheck['query'])) {
            $partsCheck['query'] = '?' . $partsCheck['query'];
        }
        if (isset($partsCheck['fragment'])) {
            $partsCheck['fragment'] = '#' . $partsCheck['fragment'];
        }
        if (implode('', $partsCheck) !== $url) {
            return false;
        }

        // HOST.
        if (!isset($parts['host']) || !$parts['host']) {
            return false;
        }
        // Single word hosts are not acceptable.
        if (strpos($parts['host'], '.') === false) {
            return false;
        }
        // No spaces in hostnames.
        if (strpos($parts['host'], ' ') !== false) {
            return false;
        }
        // No '--' in hostnames (catches http://a.b--c.de/ from the test list
        // below).
        if (strpos($parts['host'], '--') !== false) {
            return false;
        }
        // Hostnames cannot start with a hyphen.
        if (strpos($parts['host'], '-') === 0) {
            return false;
        }
        // Cope with internationalized domain names.
        $host = idn_to_ascii($parts['host']);
        if ($host === false) {
            return false;
        }

        $hostSegments = explode('.', $host);
        // The IANA lists TLDs in uppercase, so we do too.
        $tld = mb_strtoupper(array_pop($hostSegments));
        if (!$tld) {
            return false;
        }
        if (!in_array($tld, static::$validTlds)) {
            return false;
        }
        // There must be a domain label in front of the TLD.
        $domain = array_pop($hostSegments);
        if (!$domain) {
            return false;
        }

        // PATH.
        if (isset($parts['path']) && substr($parts['path'], 0, 1) !== '/') {
            return false;
        }

        // If you made it this far you're golden.
        return true;
    }
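The static::$validTlds referenced above can be built from IANA's published file at https://data.iana.org/TLD/tlds-alpha-by-domain.txt, which lists one uppercase TLD per line with '#' comment lines. A sketch, with parseTldList() a made-up helper name:

```php
<?php
// Turn the raw IANA TLD file into an array of uppercase TLD strings.
function parseTldList($text) {
    $tlds = [];
    foreach (preg_split('/\R/', $text) as $line) {
        $line = trim($line);
        // Skip blank lines and the '# Version ...' comment header.
        if ($line === '' || $line[0] === '#') {
            continue;
        }
        $tlds[] = $line;
    }
    return $tlds;
}

// e.g. static::$validTlds = parseTldList(
//     file_get_contents('https://data.iana.org/TLD/tlds-alpha-by-domain.txt'));
```

Note that the IANA file already lists internationalized TLDs in their 'XN--' ASCII form, which matches what idn_to_ascii() produces in the HOST step.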

Tested against the list of interesting URLs from http://mathiasbynens.be/demo/url-regex and elsewhere, it allows all of the following:

http://foo.com/blah_blah
http://foo.com/blah_blah/
http://foo.com/blah_blah_(wikipedia)
http://foo.com/blah_blah_(wikipedia)_(again)
http://www.example.com/wpstyle/?p=364
https://www.example.com/foo/?bar=baz&inga=42&quux
http://✪df.ws/123
http://➡.ws/䨹
http://⌘.ws
http://⌘.ws/
http://foo.com/blah_(wikipedia)#cite-1
http://foo.com/blah_(wikipedia)_blah#cite-1
http://foo.com/unicode_(✪)_in_parens
http://foo.com/(something)?after=parens
http://☺.damowmow.com/
http://code.google.com/events/#&product=browser
http://j.mp
http://foo.com/?q=Test%20URL-encoded%20stuff
http://مثال.إختبار
http://例子.测试
http://उदाहरण.परीक्षा
http://1337.net
http://a.b-c.de

And disallows all of these:

http://
http://.
http://..
http://../
http://?
http://??
http://??/
http://#
http://##
http://##/
http://foo.bar?q=Spaces should be encoded
//
//a
///a
///
http:///a
foo.com
rdar://1234
h://test
http:// shouldfail.com
:// should fail
http://foo.bar/foo(bar)baz quux
ftps://foo.bar/
http://-error-.invalid/
http://a.b--c.de/
http://-a.b.co
http://0.0.0.0
http://10.1.1.0
http://10.1.1.255
http://224.1.1.1
http://1.1.1.1.1
http://123.123.123
http://3628126748
http://.www.foo.bar/
http://www.foo.bar./
http://.www.foo.bar./
http://10.1.1.1
http://10.1.1.254
# The following URLs are valid by the letter of the law but we don't want to allow them.
http://userid:password@example.com:8080
http://userid:password@example.com:8080/
http://userid@example.com
http://userid@example.com/
http://userid@example.com:8080
http://userid@example.com:8080/
http://userid:password@example.com
http://userid:password@example.com/
http://-.~_!$&\'()*+,;=:%40:80%2f::::::@example.com
http://142.42.1.1/
http://142.42.1.1:8080/
http://223.255.255.254
ftp://foo.bar/baz
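The idn_to_ascii() call in the HOST step (from PHP's intl extension) is what lets the internationalized examples above through: it translates each non-ASCII label into an ASCII form starting with 'xn--' before the TLD lookup. A quick illustration:

```php
<?php
// idn_to_ascii() comes from the intl extension; guard for its absence.
if (function_exists('idn_to_ascii')) {
    // Each non-ASCII label becomes an ASCII 'xn--' Punycode label, the
    // same form the IANA TLD list uses for internationalized TLDs.
    echo idn_to_ascii('例子.测试'), "\n";
}
```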

parse_url Is Not UTF-8 Safe

Handily, the good folks at Joomla have written a UTF-8 safe version:

    /**
     * Does a UTF-8 safe version of the PHP parse_url function
     *
     * @param   string  $url  URL to parse
     *
     * @return  mixed  Associative array or false if badly formed URL.
     *
     * @see     http://us3.php.net/manual/en/function.parse-url.php
     * @since   11.1
     */
    public static function parse_url($url)
    {
        $result = false;

        // Build arrays of values we need to decode before parsing
        $entities = array('%21', '%2A', '%27', '%28', '%29', '%3B', '%3A', '%40', '%26', '%3D', '%24', '%2C', '%2F', '%3F', '%23', '%5B', '%5D');
        $replacements = array('!', '*', "'", "(", ")", ";", ":", "@", "&", "=", "$", ",", "/", "?", "#", "[", "]");

        // Create encoded URL with special URL characters decoded so it can be parsed
        // All other characters will be encoded
        $encodedURL = str_replace($entities, $replacements, urlencode($url));

        // Parse the encoded URL
        $encodedParts = parse_url($encodedURL);

        // Now, decode each value of the resulting array
        if ($encodedParts)
        {
            $result = array();

            foreach ($encodedParts as $key => $value)
            {
                $result[$key] = urldecode(str_replace($replacements, $entities, $value));
            }
        }

        return $result;
    }

Although non-ASCII characters are not legal in URLs, if you want to parse possibly wonky data, internationalized domains (例子.测试), or other non-ASCII URLs (✪df.ws) that translate to ASCII via Punycode, then this is very handy.
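As a quick sanity check, here is the same encode/parse/decode trick condensed into a standalone function (utf8ParseUrl() is a made-up name) and run over an internationalized URL:

```php
<?php
// Percent-encode the whole URL, restore the URL metacharacters so that
// parse_url sees a plain-ASCII URL with its structure intact, then
// percent-decode each resulting component.
function utf8ParseUrl($url) {
    $entities = ['%21', '%2A', '%27', '%28', '%29', '%3B', '%3A', '%40', '%26',
                 '%3D', '%24', '%2C', '%2F', '%3F', '%23', '%5B', '%5D'];
    $replacements = ['!', '*', "'", '(', ')', ';', ':', '@', '&',
                     '=', '$', ',', '/', '?', '#', '[', ']'];
    $encodedParts = parse_url(str_replace($entities, $replacements, urlencode($url)));
    if (!$encodedParts) {
        return false;
    }
    $result = [];
    foreach ($encodedParts as $key => $value) {
        // Re-protect the metacharacters before urldecode() so literal
        // occurrences inside a component survive the round trip.
        $result[$key] = urldecode(str_replace($replacements, $entities, $value));
    }
    return $result;
}
```

Where plain parse_url mangles the non-ASCII host, this round-trips it intact.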