Mind Hacks at Foyles

Tom Stafford and Matt Webb were at Foyles on Charing Cross Road (London) on Wednesday night to publicize their book Mind Hacks (O’Reilly). They stepped through a few practical examples of the stuff from the book — why faded jeans make your legs look good; how eyes and the brain adapt to light and noise levels; why putting a pen in your mouth and pushing it back for three minutes makes you feel good — fairly successfully and made some jokes about leopards.

The Data Area Passed to the System Call is Too Small

I got the error “the data area passed to the system call is too small” when posting to this site. The problem was using HTTP GET for very long strings; using HTTP POST with the exact same text works fine. I was posting from Mozilla 1.0 to IIS 5. There was absolutely nothing on the web explaining the error and only three references in Google, so I thought I’d post my “answer”. More information on the differences between GET and POST.
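A minimal sketch of the difference, in Python using only the standard library. The URL and the field name "body" are made up for illustration, and nothing is sent over the network: the code only builds the two requests to show where the long text ends up.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and field name, for illustration only.
URL = "http://example.com/post"
long_text = "word " * 2000  # roughly 10,000 characters

encoded = urlencode({"body": long_text})

# GET: the whole text is squeezed into the URI's query string.
get_request = Request(URL + "?" + encoded)

# POST: the URI stays short and the text travels in the request body.
post_request = Request(URL, data=encoded.encode("ascii"))

print(get_request.get_method(), len(get_request.full_url))
print(post_request.get_method(), len(post_request.full_url))
```

The GET request carries a URI more than ten thousand characters long, which is the kind of thing a server or browser may refuse; the POST request keeps the URI at its original length.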

Eclipse Site Lacks Spark

Eclipse just won the Jolt 2005 award in the “Languages and Development Environments” category.

But eclipse.org doesn’t mention it. The first link on the site in the main body is to a ‘white paper’ in PDF format last updated in 2003. The site uses frames. The FAQ was last updated in 2002. There are no screen shots linked anywhere off the home page despite the fact that there are 96 links in the main frame.

Compare and contrast Basecamp, which for all its qualities surely has less to shout about than Eclipse.

What am I missing here?

URI Usability

URI Thoughts

Ever since I read Matthew Thomas’ outline of an ultimate weblogging system I’ve been thinking about URIs.

It is useful to be able to visit bbc.co.uk/football or bbc.co.uk/cricket and be redirected to the appropriate section in the BBC site (currently http://news.bbc.co.uk/sport1/hi/football/default.stm in the case of football).

Even more useful is the new URI scheme in place at eBay. my.ebay.co.uk, search.ebay.co.uk/ipod and search-completed.ebay.co.uk/microwave all do what you would expect them to (take you to “My eBay”, search current auctions for iPods and search completed auctions for microwaves), both saving time and increasing clarity and usability.

The question then is, what is the best URI scheme?

Of course, this depends on the site. One thing that I think is undisputed across sites is that URLs should not include extensions that give away technology choices. There is no advantage to the URL:

http://example.com/products/default.asp

over:

http://example.com/products/

and the former makes migrating away from ASP (even to ASP.NET) more problematic than it needs to be.

Another no-brainer is that URIs should be short. Short URIs are easier to remember and URIs over 78 characters long will wrap in some emails.

Jakob Nielsen (in URL as UI) found that mixed case and non-alphanumerics confused users. So let’s ditch those too. That means that a query string (?x=1&y=8394) makes a URI harder to read and less usable. Links with query strings (or sometimes only complex query strings) are also ignored by some search engines. There is a good discussion of this in Towards Next Generation URLs.

Without producing a static version of a site every time you make a change (which may actually be a workable solution in some cases) you can use URI rewriting to let your users have usable URIs but give the webserver the query string it needs to serve up the right content. This can be done with ISAPI Rewrite on IIS (Lite version is free, if not Free) and mod_rewrite on Apache.

With URI rewriting it does not matter what URI the underlying technology needs. You just need to decide on the appropriate scheme and write clever enough regular expressions to implement it.
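To make the idea concrete, here is a rough sketch in Python of what a rewriting rule does. It is not ISAPI Rewrite or mod_rewrite configuration: the rule, the /posts/ prefix and the ASP target are assumptions chosen for illustration.

```python
import re

# Assumed rule: the friendly URI /posts/<number> is served by an existing
# ASP page that wants a query string. Pattern and target are illustrative.
REWRITE_RULES = [
    (re.compile(r"^/posts/(\d+)/?$"),
     r"/news/default.asp?action=view_story&story_id=\1"),
]

def rewrite(path):
    """Return the internal URI for a friendly path, or the path unchanged."""
    for pattern, target in REWRITE_RULES:
        if pattern.match(path):
            return pattern.sub(target, path)
    return path

print(rewrite("/posts/97"))   # the user sees /posts/97, the server sees the ASP URI
print(rewrite("/about/"))     # no rule matches, so the path passes through
```

The user only ever sees the short form; the mapping to the query string happens on the server before the page is looked up.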

It would be possible to run a site where every URI is of the form:

http://example.com/7
http://example.com/23
http://example.com/274383

but that fails the user in that it provides no information about the page and prevents users from using URLs to navigate the site. It seems that some “directory structure” is required (although it may not reflect the actual directory structure of the site) and then a name for the page of content should be appended to that.

Cool URIs don’t change suggests dates as an effective structure and suggests that categories are too changeable over time to work.

An Example

The posts on this site currently have URIs in this format:

http://bluebones.net/news/default.asp?action=view_story&story_id=97

There we have underscores, a question mark and an ampersand plus the URI tells you very little about what you might expect if you clicked on it. Ripe for improvement. Here are the possible schemes I have considered:

A

bluebones.net/posts/96

Very short but not too informative.

B

bluebones.net/posts/2001/06/14/96

Not so short and still fairly oblique.

C

bluebones.net/posts/2001/06/14/internetboggle
bluebones.net/posts/2005/03/13/awstatsoniis5witholdlogfiles

I like this but the longer titles make the URIs overlong.

D

bluebones.net/posts/internetboggle
bluebones.net/posts/wherewizardsstayuplatetheoriginsoftheinternetbykatiehafnerandmatthewlyon

Even with the date structure removed some post titles are too long to be included in the URI. And what to do about posts with the same title?

E

bluebones.net/posts/internetboggle
bluebones.net/posts/wherewizardsstayupla

Truncated to 20 characters is OK but you can imagine some horrorshow URI caused by truncating at exactly the wrong point and the content is not so clear. The issue of posts with the same title is even more relevant here.

F

bluebones.net/posts/internetboggle
bluebones.net/posts/wherewizardsstayuplate

This is the scheme I am planning on implementing. The need to cope with duplicates in any scheme that uses strings and not integer ids means that some work will have to be done (by the server) upon posting that is not being done now. Given that I am going to have to write some code anyway, why not just add an extra field when posting (which could be autopopulated with a default of the title with all spaces and punctuation removed) that is called “link” or similar and appears after /posts/ as above? Of course duplicates will have to be checked for and either reported to the user or otherwise dealt with here, too.
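A minimal sketch of the “link” field default and the duplicate check, in Python. The function names are hypothetical, and appending a number to a duplicate is just one way of “otherwise dealing with” it; reporting the clash to the user would be equally valid.

```python
import re

def default_link(title):
    """Lower-case the title and drop everything except a-z and 0-9."""
    return re.sub(r"[^a-z0-9]", "", title.lower())

def unique_link(link, existing):
    """Return link, or link plus the first free numeric suffix if it is taken."""
    if link not in existing:
        return link
    n = 2
    while f"{link}{n}" in existing:
        n += 1
    return f"{link}{n}"

taken = {"internetboggle"}  # links already used by earlier posts
print(default_link("AWStats on IIS 5 with Old Log Files"))
print(unique_link(default_link("Internet Boggle"), taken))
```

The default is only a starting point: a long title still produces a long link, which is why the field stays editable when posting.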

What do you think? Help me out before I make irreversible decisions by commenting below.

AWStats on IIS 5 with Old Log Files

I wanted to use awstats to analyse my IIS 5 W3C format log files. I got the program up and running with the aid of these instructions, as well as conning it into believing that the port (always 80) was actually the bytes sent parameter (for some reason it won’t run without that info being in the log file). And I could only get it to parse the last log file.

I looked at the instructions about how to parse old log files but this involved issuing a command at the command line for each file. I have logs that go back to 2001. So in the end I wrote this Perl script to issue all the necessary commands:

#! c:/perl/perl.exe

sub main {
    
    my $dir = "C:/WINNT/system32/LogFiles/W3SVC1";
    
    for (my $year = 1; $year <= 2; $year++) {
        for (my $month = 1; $month <= 12; $month++) {
            for (my $date = 1; $date <= 31; $date++) {
                my $file = $dir . "/" . get_filename($year, $month, $date);
                if (-e $file) {
                    # Quote the log file path so the shell passes it as one argument.
                    my $cmd = "f:/inetpub/wwwroot/awstats/cgi-bin"
                        . "/awstats.pl  -config=bluebones.net -LogFile=\""
                        . $file . "\" -update";
                    print "$cmd\n";
                    system($cmd);
                } else {
                    print $file . " does not exist\n";
                }
            }
        }    
    }
}

sub get_filename {
    
    my ($year, $month, $date) = @_;
    
    my $filename = "ex" . pad($year) . pad($month) . pad($date) . ".log";
    
    return $filename;
}

sub pad {
    
    my $num = pop;
    
    if ($num < 10) {
        $num = "0" . $num;
    }
    
    return $num;
}

main();

I had to stop it running in the middle when I hit the date that I added referer to the log file format, alter the conf file manually and then start it running again. But I got there in the end.

What the world needs is a nice, clean API for log files that comes with parsers that intrinsically understand all the various standard formats. That is, I want to be able to just point the program at any Apache, IIS or other standard log files and have it chomp them all up and let me programmatically get at the data in any way I like (perhaps stick it all in a SQL database?). Crucially, the program should be able to "discover" the format of the log files by looking at the headers and there should be no configuration (unless you have really weird log files).
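As a rough sketch of what "discovering" the format could look like for W3C extended logs (the kind IIS writes), the parser below takes its column names from the #Fields: directive instead of from configuration. It is written in Python, the sample lines are invented, and it makes no attempt to cover Apache formats.

```python
def parse_w3c_log(lines):
    """Yield one dict per request, using the #Fields header for the keys."""
    fields = []
    for line in lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            # A new header replaces the column list from this point on.
            fields = line[len("#Fields:"):].split()
        elif line and not line.startswith("#") and fields:
            yield dict(zip(fields, line.split()))

# Invented sample data: the second header adds a referer column.
sample = [
    "#Software: Microsoft Internet Information Services 5.0",
    "#Fields: date time c-ip cs-method cs-uri-stem sc-status",
    "2005-03-13 10:15:02 192.0.2.1 GET /news/default.asp 200",
    "#Fields: date time c-ip cs-method cs-uri-stem sc-status cs(Referer)",
    "2005-03-14 09:01:47 192.0.2.7 GET /index.html 200 http://example.com/",
]

for record in parse_w3c_log(sample):
    print(record["date"], record["cs-uri-stem"], record.get("cs(Referer)", "-"))
```

Because the column list is re-read whenever a header appears, a change of format partway through the logs (such as adding the referer) needs no manual intervention in this sketch.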

Then people can write beautiful graphical reports for this API and everyone can use them regardless of the format that the original logfiles were in. Surely someone has thought of this before? I've put it on my todo list.

Turkish Press Pictures on Google News

Have you noticed that more than 50% of the photographs on news.google.com come from turkishpress.com?

I think it is simply because of their alt tags. For example:

South Korea’s Deputy Foreign Minister Soon Min-Soon arrives in Beijing. US and South Korean envoys held talks with China aimed at coaxing North Korea back into six-party nuclear talks as the CIA said the Stalinist regime could re-start long-range missile testing.

The algorithm Google uses to decide on the pictures obviously considers the alt text superior to the page text in determining the subject matter of an image. Long alt tags lead to more appearances in Google News. I wonder if they did it on purpose?

Humane Text Formats

I write in plain text a lot. I want to put stuff on the web a lot. Oftentimes it’s the stuff I already wrote in plain text. I wondered if I could learn some conventions that would convert to XHTML for no extra work after I’d written the plain text. In fact, I am writing this article now in Ultraedit and later it will go on bluebones.net in HTML. As I wrote “Ultraedit” just now I wanted to put a link in, but wasn’t sure whether to or not, because that would make this file HTML and I’d need to go back and put in <p> tags and so on. Let’s just say that I think learning one of these formats would be A Good Idea™.

For those wondering why I don’t write in HTML all the time check out these good reasons.

This seems to have been the rationale behind Markdown. There are also numerous other text formats like Textile and Almost Free Text with similar or identical motivations. I don’t want to learn them all, so which one to pick? I couldn’t find a good comparison or even much of a list of alternatives via Google. Answer: have a face off.

The Test

I decided to use the text of this very article.
(Originally I decided I was going to use a BBC news story too but I’d learnt enough about the formats by the time I’d been through them all once!)

Some things I definitely want the winner to be able to do are:

  • Unordered lists
  • Like this one

and

# Code examples like this.
print "This is essential!"

The quality of tools available is also a big plus. For this to reap rewards I must be able to go effortlessly from text to XHTML and (strongly preferred) back again.

The Results

Almost Free Text

http://www.maplefish.com/todd/aft.html

An enviable set of outputs: HTML, LaTeX, lout, DocBook and RTF. You have to tell it explicitly to use other than 8 spaces for tabstops, or use tabs. My text editor is set to use spaces for tabs because they travel better (email, etc.) and it is set to 4 spaces. No titles on links. Does table of contents. No line breaks allowed in link elements is a problem. Adds a whole load of extra formatting by default – makes whole documents instead of snippets. HTML 4.0 Transitional. Doesn’t seem to be any way to make snippets or XHTML.

See Almost Free Text Test Results

Markdown

http://daringfireball.net/projects/markdown/

A format that comes from Daring Fireball. Default formatting of the Instiki (http://instiki.org/) Wiki. Choked on converting the trademark symbol from the HTML character entity reference to Markdown and cannot do tables but otherwise superb, and the plain text looks right too – not marked up, just “natural”.

Tools include html2text, a Python script that converts a page of HTML into valid Markdown. There is also a PHP implementation. RedCloth (Ruby) has limited support for Markdown.

See Markdown Test Results

reStructuredText

http://docutils.sourceforge.net/rst.html

The format of Python doc strings. Only a “very rough prototype” for converting HTML to reStructuredText (written in OCaml). Links are clunkier than in Markdown or Textile. Cannot set title attributes on links. Nice autonumbering footnotes. No simple way to avoid turning processed text into a full document. I would ideally like to process snippets for cutting and pasting into existing documents or standard headers and footers.

See reStructuredText Test Results

Textile

http://www.textism.com/tools/textile/

Originally created for Textpattern. There is a Movable Type plugin. An alternative to the default Markdown in Instiki. Does class, id, style, language attributes and lots of character entity replacement (em dash, curly quotes, that kind of thing). Only a rudimentary HTML=>Textile converter available. Very obviously meant to be turned into HTML (look at the headings h1, h2, etc.) and not so good as just a way of formatting plain text.

I had trouble making html => text tool work – no trademark and strict xml parsing just exited with error “Junk after document element at line 12” despite passing W3C validator test for XHTML 1.0 Strict.

Code doesn’t work. Breaks where it finds CR (can’t have 80 col source and XHTML must use word wrap).

RedCloth (Ruby) supports Textile.

See Textile Test Results

Others

RDoc – Originally created to produce documentation from Ruby source files. Offered as an alternative markup option by Instiki. Outputs XML, HTML, CHM (Compiled HTML) and RI (whatever that is). Commandline tool. Now part of core Ruby which ensures continued support but perhaps only as a documentation tool not in the more general sense that I want to use it.

StructuredText – Allows embedded HTML and DHTML. Used by Zope and the related ZWiki. Rather horribly uses indentation rather than explicit heading markers. Supports tables. No way to go from HTML to StructuredText. Somewhat similar to reStructuredText.

Other formats I didn’t have time to consider in depth or which I discounted for certain reasons: WikiWikiWeb formatting (no tools only as part of WikiWikiWeb), DocBook (for whole books not snippets), atx (can’t find enough info – seems to have been superseded by Markdown), RDTool (didn’t like =begin/=end and lesser than RDoc from the same community), YAML (aimed mainly at configuration files), MoinMoin formatting (no tools to use it separate from the Wiki it comes from), SeText (superseded by StructuredText and ReStructuredText), POD (Plain Old Documentation – the Perl documentation format).

Summary

Textile and Markdown were the only formats I investigated that were truly practical for snippets not full documents. Textile had better support for more HTML features at the expense of looking more like HTML and less like plain text in the first place. Since I can write HTML any time I want anyway and because it has the better tools, Markdown is my provisional “winner”. If anyone wants to correct any errors above (in the comments section) I’m willing to revise my opinion. (Quick! Before I get wed to this syntax and can’t change!)

Feature Comparison Table

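Plain Text Formats Feature Comparison Table

Format           | To HTML Tool | From HTML Tool | Tables? | Link Titles? | class Attribute? | id Attribute? | Output formats                     | License
-----------------+--------------+----------------+---------+--------------+------------------+---------------+------------------------------------+---------------------------
Almost Free Text | Yes          | No             | No      | No           | No               | No            | HTML, LaTeX, lout, DocBook and RTF | Clarified Artistic License
Markdown         | Yes          | Yes            | No      | Yes          | No               | No            | XHTML                              | BSD-style
reStructuredText | Yes          | Sort of        | Yes     | No           | Yes              | Auto          | LaTeX, XML, PseudoXML, HTML        | Python
Textile          | Yes          | No             | Yes     | Yes          | Yes              | Yes           | XHTML                              | Textile License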

Medical Imaging Lecture

The first Hounsfield Memorial Lecture was given on Thursday 10
February 2005 at 17.30 by Professor Robert S. Balaban, Scientific Director,
National Heart, Lung & Blood Institute, National Institutes of Health, USA.

I attended following the course on Computer Vision I did at Imperial last
term.

“Imaging: An Interface between Physiology and Medicine”
covered imaging techniques allowing noninvasive viewing of internal organs and
processes. Particular attention was paid to X-Rays from infrared, CT, MRI,
CT-PET tumour detection, and CT-MRI. One practical illustration
was a trial run in a local hospital in the US where they used these imaging
techniques to detect whether those with chest pains in the ER have a real heart
problem or not. This is a big improvement on the current system of sitting
patients in a room and monitoring them until something goes really wrong.

Dr. Balaban also talked about image-guided robotic surgery, with MRI acting as
the eyes of the robot while the surgeon is not in the room. With these
techniques the surgeon can see not only what is happening on the surface but
also what is happening in the internal organs beneath where he is operating.

Motion is the big problem for these realtime views of the insides of living
creatures – Dr. Balaban illustrated showing us a great band of movement that
ruined his pictures of cells in a muscle, then revealing that the muscle was in
the leg and the movement came just from respiration.

CSS Zen Garden Design

I’ve completed a first draft of a potential CSS Zen Garden design.

I’m not a designer by nature and I’m quite happy with it. What I need is some constructive criticism though. Does it not work in your browser? Do you think it is just crap pseudopr0n? Do I need a different image on my listitems? Let me know in the comments section …

Quick Javadoc Reference

I write most of my code in Ultraedit and when I’m writing Java I really want to be able to access javadoc documentation for the Sun API as quickly as possible. This is a description of the evolution of a little utility to look up javadoc documentation very quickly.

I know that IDEs like IntelliJ and Eclipse offer in-window documentation tooltips (this is actually executed best in Microsoft’s Visual Studio.NET of all the IDEs I have used). But these horking great programs put me off because they feel clunky and tend to specialise in one language. I prefer to use the tricks I learn with my text editor with everything I am working on rather than learning a set of shortcuts, etc. for every language. I know Emacs has tagfiles for every language and is on zillions of platforms but it doesn’t feel like a native Windows app and as that’s where I spend 99% of my time it just doesn’t cut it. Ultraedit makes me feel nimble and encourages cross-application serendipity.

I used to bring up a Run box with Win-R and write the full path to the file I needed. So to get the docs on java.util.HashSet I’d do:

Win-R, "c:\docs\api\java\util\HashSet.html", Enter

(obviously autocomplete would make it a little bit quicker than that). This was far too slow – interrupting your train of thought to look up documentation.

To speed things up I dropped a copy of every HTML file into the docs folder. So to get the HashSet documentation I need only type:

Win-R, "c:\docs\HashSet.html", Enter

(again with autocomplete speeding me up). This wasn’t too bad speedwise but all the links in the file launched would not work. So if you wanted to look at the docs of a superclass or of a method’s return value you had to go back to the Run box. Not ideal.

It seemed the only answer was to write a little utility. I chose Perl as its filehandling is simple and the language itself pretty fast. The code I ended up with after a first pass was:

#!c:\perl\bin\perl.exe

use strict;
use warnings;

$ARGV[0] || die "No arg supplied.\n";

my $look_for = "/" . $ARGV[0] . ".html";
my @found;
my @files = search_dir("c:/docs/api");
if (@files == 1) {
    print "c:/progra~1/intern~1/IEXPLORE.EXE " . $files[0];
    exec("c:/progra~1/intern~1/iexplore " . $files[0]);
} else {
    # FIXME
}

sub search_dir {
    
    my $dir = pop;
    my @list = glob($dir . "/*");
    foreach (@list) {
        if (-d && $_ ne $dir && ! /class-use/) {
            search_dir($_);
        } elsif (/$look_for/) {
            push @found, $_;
        }
    }
    return @found;
}

I had to put in special cases so that one directory was not traversed indefinitely and to avoid the class-use directories that are not of interest (they contain links to classes that used the class).

This worked just fine (excluding the failure to handle multiple matches) but had a noticeable delay. The time I saved typing:

Win-R, docs.pl HashSet, Enter 

compared to the longer version was lost in the searching of the filesystem. Worse, the time was now idle time instead of active time and that seems longer (think of waiting for a bus).

What was taking the time was traversing the filesystem looking at filenames. So I decided that work could be done just once and cached. I ran “find” at a Cygwin bash prompt from the api directory and put the output in a quickref.txt file. I used “grep -v class-use” to remove the class-use directories and replaced /c/ (my symlink to the root of the C: drive under Cygwin) with c:/ to produce a list of paths to all the relevant files.
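For anyone without Cygwin, the same index could be built with a short script. This is a sketch in Python rather than the find and grep pipeline actually used; the function names are hypothetical, and unlike plain “find” it lists only .html files.

```python
import os

def build_index(api_dir):
    """Return sorted forward-slash paths of every .html file under api_dir,
    skipping the class-use directories."""
    paths = []
    for root, dirs, files in os.walk(api_dir):
        # Pruning dirs in place stops os.walk from descending into them.
        dirs[:] = [d for d in dirs if d != "class-use"]
        for name in files:
            if name.endswith(".html"):
                paths.append(os.path.join(root, name).replace(os.sep, "/"))
    return sorted(paths)

def write_index(api_dir, index_file="quickref.txt"):
    """Write one path per line, the layout the lookup script reads."""
    with open(index_file, "w") as out:
        for path in build_index(api_dir):
            out.write(path + "\n")
```

Calling write_index("c:/docs/api") once would produce the cached list; it only needs re-running when new javadocs are added.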

Now all I needed was code to read this file and find the correct entry.

#!c:/perl/bin/perl.exe

use strict;
use warnings;

$ARGV[0] || die "No arg supplied.\n";

my $look_for = "/" . $ARGV[0] . ".html";
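open(FILE, "<quickref.txt") or die "Cannot open quickref.txt: $!\n";

while (<FILE>) {
    chomp;  # drop the trailing newline before handing the path to the browser
    if (/$look_for/) {
        exec("c:/progra~1/intern~1/iexplore " . $_);
    }
}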

This code was a good deal shorter (the work’s largely been done in creating quickref.txt) and faster. I speeded it up even more by forcing Internet Explorer (quicker startup time than the otherwise superior Firefox) and short-circuiting at the first match (only conflict that occurs often is java.sql.Date and java.util.Date anyway).
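If the Date clash ever became annoying, the lookup could collect every match instead of stopping at the first. The sketch below is in Python with hypothetical function names and an invented three-line index. It uses the standard webbrowser module, which consults the BROWSER environment variable when it is set; whether a bare local path opens correctly depends on the platform, so treat that part as an assumption.

```python
import webbrowser

def find_matches(index_lines, class_name):
    """Return every indexed path whose file name is <class_name>.html."""
    suffix = "/" + class_name + ".html"
    return [line.strip() for line in index_lines
            if line.strip().endswith(suffix)]

def open_docs(index_lines, class_name):
    """Open the page if there is exactly one hit, otherwise list the hits."""
    matches = find_matches(index_lines, class_name)
    if len(matches) == 1:
        webbrowser.open(matches[0])
    else:
        for path in matches:
            print(path)
    return matches

# Invented index contents in the same layout as quickref.txt.
index = [
    "c:/docs/api/java/sql/Date.html\n",
    "c:/docs/api/java/util/Date.html\n",
    "c:/docs/api/java/util/HashSet.html\n",
]
print(find_matches(index, "Date"))
```

Note that the match is case-sensitive, just like the regular expression in the Perl version, so the class name has to be typed with its exact capitalisation.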

All that remained was to create a shortcut on my path to the file called ‘d’ allowing me to launch (for example) fully linked HashSet documentation with:

Win-R, "d HashSet", Enter

A future version could generate the quickref.txt file if it is not present. But as this is such a simple utility I doubt it will have a future version. Other javadocs can be simply added by appending their paths to quickref.txt. Obviously the browser command line should not be hardcoded and I could consider respecting ESR’s $BROWSER environment variable. If you think this is overkill or have suggestions for improvements or just simply think I’m a nutter please comment below.