Over the last little while, I’ve become fascinated by artificial intelligence in general, machine learning in particular, and neural networks in particular particular. AI is not only interesting for the new things it lets us do; it also helps us understand more about what it means to think at all. Part of which, of course, is learning.
It’s a cliché to say that we are living through a transformative time in the capabilities of computers and artificial intelligence. But we are! In my lifetime, computer chess has improved from being a bit of a joke, to being an interesting novelty for non-expert players, to being pretty challenging for all but masters, to famously defeating Kasparov. And it hasn’t stopped there: Deep Blue needed some pretty serious hardware to win that match in 1997; now you can download an app for your phone that easily outclasses human grandmasters. Go is harder, but DeepMind’s AlphaGo handily beat Lee Sedol and then Ke Jie, the world number 1, earlier this year.
More than anything, I’m reminded of Dijkstra’s aphorism “The question of whether machines can think is about as relevant as the question of whether submarines can swim.”
This all raises many serious, profound and important questions for humanity.
It’s also good for a giggle. This line of work from me is definitely the latter.
I’ve been following Janelle Shane’s fun and games with char-rnn, a recurrent neural network tool she’s trained on a variety of interesting datasets to generate:
- new names for cats – Parihen the Thawk, Snox Boops, Big Wiggy Bool
- new paint colours – sudden pine (69, 74, 61), felthy blue (112, 130, 148)
- 80s action figures – Stanker, Arnon Prombot
- new recipes – Spread over grease and make a gently pan mixture with 1 several hours, turning and boil on high until the mixture is completely golden.
- new Harry Potter fanfiction – The Perfect Party by iamisaac
Draco has been left alone, and Ginny confused must learn and who has his best friend. They were breathed by a love that didn’t become his grounds and the flowers begin.
- and much more.
I’m a firm believer that to really get to grips with a technology, you have to have a go yourself. I’m also a firm believer that British names are funny, and in particular, British placenames. So I set out to train a neural network to generate British place names.
First stop was to get the neural network software working on my laptop. (An early 2014 MacBook Air.) This took an unbelievable amount of faff. Rather than Janelle Shane’s beloved char-rnn, I’ve gone for torch-rnn, following Jeff Thompson’s really helpful guide. I won’t trouble you, gentle readers, with the details, although I really should write them up to help future researchers.
So it was with no small sense of personal accomplishment that I managed to train the network on the sample data (Shakespeare) and generated my very own Fakespeare, which – as others have noted – bears more than a passing resemblance in form to the Bard’s own work, although it is perhaps a little lacking in terms of meaning and intellectual coherence:
JULIET: Then, he was a warrel to feet she well be a prison.
First Senator: It is the common in the deed the father; I was his soul and dangerous prince, and off Of the word and this brother of some honest, The sins and like my death to me, where we see how thee The compartary doth distraines of your love.
KING RICHARD III: And then are they have no more than I do adming Of the common of my son to his out in our charge To meet the world to be to the people so.
GLOUCESTER: The glad somethere him that the complayell the executions I have not the more of the thrones of my woe, Which shall be see the state the more are gentlemen?
KING RICHARD III: Thou wert this presence of his brother's brother.
LADY ANNE: This that what you have been in the world by the world Unto the part of the wearing of the world, And condent the wides the surpring him of him.
CAMILLO: I will not be thine heart! I would show their king's hand, Which are the worst and sound to me to these words: When that then is the king and base and son. What say the death of our country's parts Was the act be a children of the wars to me, And they pray you, mark it is him to the country Than when our neither drunk and be the bold; And that the deady of the larkly same and game?
As I remarked on Twitter, if you declaim this loudly and with confidence, it sounds a lot like those bits of real Shakespeare where you can’t quite follow what’s going on but have to rely on the actor’s tone and other clues.
The next step was to get hold of some training data for my plans. Wikipedia has a list of UK locations, but it’s split across hundreds of subpages for ‘convenience’. I couldn’t get any sense out of Wikidata, and writing my own little scraper script to get all of it seemed more bother than this bit of whimsy was worth. A bit more searching led me to the Data.gov.uk catalogue page for OS Open Names, which led to the OS Open Names page on the Ordnance Survey site, which got me an emailed link to a monster zip file.
This was the mother lode of location data. It was a monster: 96.5 MB zipped, 1.61 GB unzipped. The Ordnance Survey is the authoritative source for location data in the UK, and this is pretty much it for names. This data file “contains over 870,000 named and numbered roads, nearly 44,000 settlements and over 1.6 million postcodes, all matched/mapped to the National Grid.”
Once I’d got my head around it (which took some doing, and is worth a post in itself if anyone’s interested), I began to appreciate just how much of a thing of wonder it is. I had the name of every single populated place in England, Wales and Scotland, and its location. And every road. And every postcode.
There were decisions to be made. I decided to split the data into England, Wales and Scotland, since they have quite distinctive naming patterns. I also decided to split up the populated places from the road names: there are way more road names, obviously, and they have a different form. I also stripped out all the semantic stuff and all the actual location stuff, since although that can be very valuable, it’s not very funny.
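For the curious, the splitting described above might look something like this minimal Python sketch. It is an illustration, not the script actually used for this post: the column names (`NAME1`, `TYPE`, `COUNTRY`), the type values (`populatedPlace`, `transportNetwork`) and the sample rows with dummy coordinates are all assumptions, so check them against the real OS Open Names documentation before relying on any of it.

```python
import csv
import io
from collections import defaultdict

# ASSUMPTION: hypothetical column names and type values, for illustration only.
NAME_COL, TYPE_COL, COUNTRY_COL = "NAME1", "TYPE", "COUNTRY"
KIND_BY_TYPE = {"populatedPlace": "places", "transportNetwork": "roads"}


def split_names(rows):
    """Group names by (country, kind), discarding every other column."""
    groups = defaultdict(list)
    for row in rows:
        kind = KIND_BY_TYPE.get(row[TYPE_COL])
        name = row[NAME_COL].strip()
        if kind and name:  # skips postcodes and anything else unwanted
            groups[(row[COUNTRY_COL], kind)].append(name)
    return dict(groups)


def to_training_text(names):
    """One name per line: plain text for a character-level model."""
    return "\n".join(names) + "\n"


# Tiny made-up sample standing in for the real files; coordinates are dummies.
SAMPLE = """\
NAME1,TYPE,COUNTRY,GEOMETRY_X,GEOMETRY_Y
Nether Wallop,populatedPlace,England,0,0
Mumbles,populatedPlace,Wales,0,0
Ecclefechan,populatedPlace,Scotland,0,0
Station Road,transportNetwork,England,0,0
SW1A 1AA,other,England,0,0
"""

groups = split_names(csv.DictReader(io.StringIO(SAMPLE)))
for (country, kind), names in sorted(groups.items()):
    print(f"{country} {kind}: {len(names)} name(s)")
```

Each list can then be passed through `to_training_text` and written to its own file, giving one training set per country and per kind of name, with none of the location data left in.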
As with most data science activities, getting hold of the data, understanding it, and processing it into the form you need to work with took something like 95% of the effort. But it is now working!
Results to follow: watch this space.