October 2007 – Doug Clow's Imaginatively-Titled Blog

Just had a lunchtime conversation about how you find new music that then turned to how you find new books. There are lots of differences, with obvious reasons, but there’s one that is interesting and related to stuff I’m interested in work-wise.

For music, there are things like Last.fm, where social network effects help tell you that if you like X you might like Y, because lots of people who like X also like Y. And there are things like Pandora, where serious processing and analysis of the data can tell you that if you like X you might like Y, because X shares some computable properties with Y.
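To make the "lots of people who like X also like Y" idea concrete, here's a minimal sketch of co-occurrence counting in Python. It is not how Last.fm actually does it (I have no idea about their internals); the listener names, the data and the function names `build_cooccurrence` and `also_liked` are all made up for illustration.

```python
from collections import Counter
from itertools import permutations


def build_cooccurrence(likes):
    """For each item x, count how many people who like x also like each other item y."""
    counts = {}
    for items in likes.values():
        # set() guards against the same item being listed twice for one person
        for x, y in permutations(set(items), 2):
            counts.setdefault(x, Counter())[y] += 1
    return counts


def also_liked(counts, item, top_n=3):
    """Items most often liked alongside `item`, most common first."""
    return [other for other, _ in counts.get(item, Counter()).most_common(top_n)]


# Hypothetical listeners and their likes, purely for illustration.
likes = {
    "ann": ["Radiohead", "Portishead", "Massive Attack"],
    "bob": ["Radiohead", "Portishead"],
    "cat": ["Radiohead", "Oasis"],
    "dan": ["Oasis", "Jamiroquai"],
}

counts = build_cooccurrence(likes)
print(also_liked(counts, "Radiohead", top_n=1))  # ['Portishead']
```

Raw counts like this favour whatever is popular with everybody, so a real system would presumably normalise for overall popularity in some way.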

For books, there’s some social networking stuff (e.g. LibraryThing), but it doesn’t seem quite so effective. (Possibly because books are less exciting than hot new music?) And there is, of course, Amazon’s “people who bought X also bought Y”, but the general consensus is that this is not much help. There isn’t – so far as I know – anything that does the analysis-of-data thing and tells you that book X is like book Y in some computable way. On the face of it that seems entirely possible and potentially very interesting. But the trouble is getting hold of the data. Google’s busy scanning the world’s books in, but they’re not going to let you have the entire data corpus to play with. You can definitely do interesting stuff with full-text analysis – remember all that enthusiasm for Shakespearean authorship stuff years ago, and the who-wrote-Primary-Colors game? Things have moved on since then.
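As a toy illustration of "book X is like book Y in some computable way", here's a rough sketch that compares texts by their word counts using cosine similarity. This is only one very crude measure, picked by me for the example; the three "books" are invented one-line snippets, and the function names are hypothetical. With full texts you would probably want at least to discount very common words.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lower-case the text and count each word."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(title, texts):
    """Title of the other text whose word counts are closest to `title`'s."""
    target = word_counts(texts[title])
    scores = {
        other: cosine_similarity(target, word_counts(text))
        for other, text in texts.items()
        if other != title
    }
    return max(scores, key=scores.get)


# Invented snippets standing in for full book texts.
texts = {
    "Book A": "the detective followed the suspect through the rainy city streets",
    "Book B": "the inspector followed the suspect down the rainy streets at night",
    "Book C": "whisk the eggs and sugar then fold in the flour",
}

print(most_similar("Book A", texts))  # Book B
```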

I’ve gone on elsewhere about my belief that print-on-demand is going to be even more huge in the next five to ten years, reaching up from smallish documents that were the triumph of the 90s to full novel-length books. Iff the publishers allow it, of course. If they don’t they’ll go the way of the traditional music industry, which is looking extremely doomed at the moment (Radiohead having just released their latest album direct, eschewing the music industry, followed swiftly by Nine Inch Nails, Oasis and Jamiroquai, and today by Madonna – although she seems rather to have signed a traditional-style deal with a non-traditional outfit.)

Anyway – if publishers will allow distribution of electronic book texts as far as bookshops and libraries (and on to e-paper technologies as they come on stream), there is the potential for getting hold of the text of new books and doing interesting stuff with it, which will help people find new books they like. Which would be cool.

Edit: My lunchtime colleagues have suggested LibraryThing and What Should I Read Next – both ways of mining lists of what people have read/liked to generate suggestions for your own future reading. (LibraryThing’s UnSuggester – books you are unlikely to find appealing – is far more fun to play with!)

I’m not a huge fan of CAPTCHAs – those annoying prove-you’re-not-a-spam-bot thingummies where you have to type some disguised letters. Even if you can deploy them without the outrageous accessibility issues, they’re at best a necessary evil. (At worst, they don’t even fox sophisticated spam tools.) It’s fundamentally wrong for a computer to be setting a human make-work.

But reCAPTCHA is a fantastic hack around this. Instead of generating an image by distorting some known text, they use scanned text that has failed to OCR. This is very neat in two ways. Firstly, the human effort involved contributes to digitisation projects and so becomes real, useful work instead of wasted drudgery. And secondly, the scans in question are a great source of images that are likely to be easily read by a human (who has no visual impairment or challenges) but are known to be hard to read by a computer. Excellent!
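Here's a rough sketch of how such a scheme can work, as I understand it: each challenge pairs a word whose answer is already known (the control) with a word that failed OCR, and an answer for the unknown word only counts as a vote if the control was typed correctly. The class and function names, the three-vote threshold and the sample words are all my own assumptions for illustration, not reCAPTCHA's actual code or settings.

```python
from collections import Counter


class UnknownWord:
    """A scanned word that OCR could not read, plus the human guesses so far."""

    def __init__(self, image_id):
        self.image_id = image_id
        self.votes = Counter()

    def accepted_reading(self, min_votes=3):
        """Most popular guess once it has at least `min_votes` votes, else None."""
        if not self.votes:
            return None
        reading, count = self.votes.most_common(1)[0]
        return reading if count >= min_votes else None


def check_challenge(control_answer, typed_control, unknown, typed_unknown):
    """Pass the user if the control word is right; only then record their guess."""
    passed = typed_control.strip().lower() == control_answer.lower()
    if passed:
        unknown.votes[typed_unknown.strip().lower()] += 1
    return passed


word = UnknownWord("scan-0001")  # hypothetical image identifier
for guess in ["morning", "morning", "mourning", "morning"]:
    check_challenge("overlooks", "overlooks", word, guess)

# A bot that gets the control word wrong fails, and its guess is ignored.
print(check_challenge("overlooks", "xxxx", word, "spam"))  # False
print(word.accepted_reading())  # morning
```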

(Of course, it’s no help to the clever counter-attack strategy of bots getting people to solve CAPTCHAs in exchange for access to porn, although I’m not sure that’ll work in practice at scale.)

This sort of issue isn’t usually a problem in formal educational settings, since anonymous access isn’t usually allowed. But it is for informal educational projects like OpenLearn. And as I like to keep saying, the boundaries between the two are fuzzy and getting fuzzier.


Book recommendations

Neat hack: reCAPTCHA