rnn-fun: English place names

I’m playing with neural networks to generate funny place names. The first results are for populated places in England.

  • Creswell Green, Staffordshire
  • Fressingfield, Suffolk
  • Salterbeck, Cumbria

[Photo: Upton Snodsbury, by John Firth (Geograph 3150880)]

These are genuine English places that you can go and visit. As is Upton Snodsbury, pictured above. These beauties of nomenclatural artistry are from my training data, the Ordnance Survey’s OS Open Names dataset, the motherlode of UK placenames. After processing, I had 32,637 genuine names of “populated places” (i.e. settlements) in England. That’s plenty for training a network, although for extra geek points an extra 131 names would’ve been better. After an hour or so of making my laptop whirr away, I was ready to go. The results were a joy.

  • Bollonby, Dorset
  • Shettleton Vale, Dorset
  • Woolsly, Dorset
  • Upper Haggerlane, Dorset
  • Knighton, Dorset
  • Ovansley Barcots, Dorset
  • Brownthwaite, Dorset
  • Leeenhill, Dorset
  • Harkstourt Paw, Dorset
  • West Bank, Somerset
  • Newmalk, Dorset
  • Abbey Green, Dorset
  • Chefton, City of Plymouth
  • Springham, Devon
  • Stokesage, Devon
  • Sherrady Paroe, Devon

These are very convincing English placenames. Apart, perhaps, from Leeenhill, which has just one too many eees. Some are accidentally genuine – there is an Abbey Green in Dorset, it turns out. Also, the network appeared to have become obsessed with the South West. Looking more closely, I realised that not only had it learned the right form for entries here (a name and a county), it had learned that the same counties tended to go together, since my training dataset had great runs of data in county order. That wasn’t a regularity I wanted it to learn, so I randomised the order of the training set and set it off again. This time the results came in random order, but were still top-quality:

  • Tregounden, County of Herefordshire
  • Kingstworth, Swinghamshire
  • Liston, Devon
  • Hatwell Holt, County of Herefordshire
  • Lower’s Common, Staffordshire
  • West Curwood, Hertfordshire
  • Topworth, North Yorkshire
  • Rykerook, Somerset
  • Marton, Cumbria
  • Glumhill, Staffordshire
  • Lines, Milton Keynes
  • St Egerperferton, County Durham
  • Penty Fiehall, Cumbria
  • Weetloss, East Riding of Yorkshire
  • Little Sandy Park, Kent
  • Lewtin, West Sussex
  • Great Burton, Rutland

It’s not perfect – Tregounden is clearly in Cornwall, not Herefordshire. But it’s pretty convincing. If there isn’t a Glumhill in Staffordshire, there should be.
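For anyone wanting to repeat the randomising step, here is a minimal Python sketch. It assumes the training file holds one "Name, County" entry per line; the function names and the optional seed are purely illustrative, not part of torch-rnn or of my original script.

```python
import random


def shuffle_lines(lines, seed=None):
    """Return a shuffled copy of the entries, leaving the input untouched."""
    rng = random.Random(seed)  # a fixed seed makes the shuffle repeatable
    shuffled = list(lines)
    rng.shuffle(shuffled)
    return shuffled


def shuffle_file(src_path, dst_path, seed=None):
    """Read one entry per line, drop blank lines, write them back shuffled."""
    with open(src_path, encoding="utf-8") as src:
        lines = [line.rstrip("\n") for line in src if line.strip()]
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(shuffle_lines(lines, seed)) + "\n")
```

Shuffling whole lines keeps each name attached to its own county while breaking up the long county-by-county runs that the network had latched on to.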

Almost all of the counties are genuine ones, but there are occasional fancies, like Swinghamshire, home of Kingstworth. There was also

  • Wimmer, Bourcestershire

which perhaps suggests the network had absorbed The Archers as part of the quintessence of Englishness.

Most of the names are very convincing. And even with the ones that aren’t, you can see what it was thinking. We all know – well, all people who know England well know – that you’d never get somewhere called St Egerperferton, but it’s a good guess. And while there are some straight Lines in Milton Keynes, there are a lot fewer than people imagine, and they don’t get called that.

Some of the suggestions have just the sort of is-that-actually-rude feel to them that makes English toponymy such a joy:

  • Sniddington, County Durham
  • Bockpole, East Sussex
  • Runter End, Cambridgeshire

But, of course, if you look hard enough through the data, you can find ones that are clearly quite rude. Or at least are sufficient to raise a schoolkid-like snigger:

  • Up Bumberidge, Hampshire
  • Bumlangout, North Yorkshire
  • Weston, Bumcershire
  • Milly Load, Nottinghamshire
  • Mounton in in the Woop, Bath and North East Somerset
  • Old Functon, Oxfordshire
  • Lower Mincor, Oxfordshire
  • Randiskerton, North Yorkshire
  • Thornwithich, Pooestop
  • Bookenby Weece, Northamptonshire

I think I vote Pooestop my least favourite fictional county, perhaps even worse than Bumcershire.

Next stop: I introduce the neural network to place names in the Celtic fringes – Scotland and Wales – and see whether its toponymical genius will work there too.


rnn-fun: Neural networks for fun and profit

(mostly fun)

Over the last little while, I’ve become fascinated by artificial intelligence in general, machine learning in particular, and neural networks in particular particular. AI is not only interesting for the new things it lets us do, it helps us understand more about what it means to think at all. Part of which, of course, is learning.

[Image: Alan Turing, thinking like a machine to crack Enigma]

It’s a cliché to say that we are living through a transformatory time in the capabilities of computers and artificial intelligence. But we are! In my lifetime, computer chess has improved from being a bit of a joke, to being an interesting novelty for non-expert players, to being pretty challenging for all but masters, to famously defeating Kasparov. And it hasn’t stopped there: Deep Blue needed some pretty serious hardware in 1997; now you can download an app for your phone that easily outclasses human grandmasters. Go is harder, but DeepMind’s AlphaGo handily beat Lee Sedol last year and then Ke Jie, the world number 1, earlier this year.

More than anything, I’m reminded of Dijkstra’s aphorism “The question of whether machines can think is about as relevant as the question of whether submarines can swim.”

This all raises many serious, profound and important questions for humanity.

It’s also good for a giggle. This line of work from me is definitely the latter.

I’ve been following Janelle Shane’s fun and games with char-rnn, a recurrent neural network tool she’s trained on a variety of interesting datasets to generate:

  • new names for cats – Parihen the Thawk, Snox Boops, Big Wiggy Bool
  • new paint colours – sudden pine 69 74 61, felthy blue 112, 130, 148
  • 80s action figures – Stanker, Arnon Prombot
  • new recipes – Spread over grease and make a gently pan mixture
    with 1 several hours, turning and boil on high until the mixture is completely golden.
  • new Harry Potter fanfiction – The Perfect Party by iamisaac
    Draco has been left alone, and Ginny confused must learn and who has his best friend. They were breathed by a love that didn’t become his grounds and the flowers begin.
  • and much more.

I’m a firm believer that to really get to grips with a technology, you have to have a go yourself. I’m also a firm believer that British names are funny, and in particular, British placenames. So I set out to train a neural network to generate British place names.

First stop was to get the neural network software working on my laptop. (An early 2014 MacBook Air.) This took an unbelievable amount of faff. Rather than Janelle Shane’s beloved char-rnn, I’ve gone for torch-rnn, following Jeff Thompson’s really helpful guide. I won’t trouble you, gentle readers, with the details, although I really should write them up to help future researchers.

So it was with no small sense of personal accomplishment that I managed to train the network on the sample data (Shakespeare) and generated my very own Fakespeare, which – as others have noted – bears more than a passing resemblance in form to the Bard’s own work, although it is perhaps a little lacking in terms of meaning and intellectual coherence:

JULIET:
Then, he was a warrel to feet she well be a prison.

First Senator:
It is the common in the deed the father;
I was his soul and dangerous prince, and off
Of the word and this brother of some honest,
The sins and like my death to me, where we see how thee
The compartary doth distraines of your love.

KING RICHARD III:
And then are they have no more than I do adming
Of the common of my son to his out in our charge
To meet the world to be to the people so.

GLOUCESTER:
The glad somethere him that the complayell the executions
I have not the more of the thrones of my woe,
Which shall be see the state the more are gentlemen?

KING RICHARD III:
Thou wert this presence of his brother's brother.

LADY ANNE:
This that what you have been in the world by the world
Unto the part of the wearing of the world,
And condent the wides the surpring him of him.

CAMILLO:
I will not be thine heart! I would show their king's hand,
Which are the worst and sound to me to these words:
When that then is the king and base and son.
What say the death of our country's parts
Was the act be a children of the wars to me,
And they pray you, mark it is him to the country
Than when our neither drunk and be the bold;
And that the deady of the larkly same and game?

As I remarked on Twitter, if you declaim this loudly and with confidence, it sounds a lot like those bits of real Shakespeare where you can’t quite follow what’s going on but have to rely on the actor’s tone and other clues.

The next step was to get hold of some training data for my plans. Wikipedia has a list of UK locations, but it’s split across hundreds of subpages for ‘convenience’. I couldn’t get any sense out of Wikidata, and writing my own little scraper script to get all of it seemed more bother than this bit of whimsy was worth. A bit more searching led me to the Data.gov.uk catalogue page for OS Open Names, which led to the OS Open Names page on the Ordnance Survey site, which got me an emailed link to a monster zip file.

This was the mother lode of location data. It was a monster: 96.5 MB zipped, 1.61 GB unzipped. The Ordnance Survey is the authoritative source for location data in the UK, and this is pretty much it for names. This data file “contains over 870,000 named and numbered roads, nearly 44,000 settlements and over 1.6 million postcodes, all matched/mapped to the National Grid.”

Once I’d got my head around it (which took some doing, and is worth a post in itself if anyone’s interested), I began to appreciate just how much of a thing of wonder it is. I had the name of every single populated place in England, Wales and Scotland, and its location. And every road. And every postcode.

There were decisions to be made. I decided to split the data into England, Wales and Scotland, since they have quite distinctive naming patterns. I also decided to split up the populated places from the road names: there are way more road names, obviously, and they have a different form. I also stripped out all the semantic stuff and all the actual location stuff, since although that can be very valuable, it’s not very funny.
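Those decisions boil down to a filter-and-group pass over the records. The Python sketch below is a rough illustration rather than the script I actually ran. The column names (NAME1, TYPE, COUNTY_UNITARY, COUNTRY) and the "populatedPlace" type value are assumptions that should be checked against the OS Open Names documentation, and both function names are made up for the example.

```python
import csv
from collections import defaultdict

# Assumed column names: verify against the OS Open Names documentation.
NAME, TYPE, COUNTY, COUNTRY = "NAME1", "TYPE", "COUNTY_UNITARY", "COUNTRY"


def load_rows(path, fieldnames=None):
    """Read a CSV file into dicts; pass fieldnames if the file has no header row."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f, fieldnames=fieldnames))


def training_lines(rows, wanted_type="populatedPlace"):
    """Group "Name, County" lines by country, keeping only one record type."""
    by_country = defaultdict(list)
    for row in rows:
        if row.get(TYPE) != wanted_type:
            continue  # skips roads, postcodes and anything else
        name = (row.get(NAME) or "").strip()
        county = (row.get(COUNTY) or "").strip()
        country = (row.get(COUNTRY) or "").strip()
        if name and county and country:  # drop incomplete records
            by_country[country].append(f"{name}, {county}")
    return dict(by_country)
```

Everything except the name and county is thrown away, so each country’s list is already in the plain "Name, County" form used for training.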

As with most data science activities, getting hold of the data, understanding it, and processing it into the form you need to work with took something like 95% of the effort. But it is now working!

Results to follow: watch this space.