iSpot and taxonomy

Work on the Biodiversity Observatory – to be known to the public as iSpot – is proceeding apace. One of the things we want to offer, to help people get scientific names, is a mapping between common names and scientific names. Once you know the scientific name of a species, you can find much more information than if you only know the common name. We also want to help people get scientific names right – it’s easy to get them wrong – so we want to provide facilities like ‘did you mean X’ when people mistype a name.
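As a rough illustration of the ‘did you mean X’ idea, here is a minimal sketch using fuzzy string matching from Python’s standard difflib module. The name list and the suggest function are made up for the example; a real version would draw its names from a proper species database, and the cutoff value is just a guess at a sensible starting point.

```python
import difflib

# Hypothetical sample of scientific names, for illustration only.
KNOWN_NAMES = [
    "Cornu aspersum",
    "Helix pomatia",
    "Erithacus rubecula",
    "Turdus merula",
]

def suggest(typed, names=KNOWN_NAMES, limit=3, cutoff=0.6):
    """Return up to `limit` known names that closely resemble `typed`."""
    # Compare case-insensitively, but return names in their canonical form.
    by_lower = {name.lower(): name for name in names}
    matches = difflib.get_close_matches(
        typed.lower(), list(by_lower), n=limit, cutoff=cutoff
    )
    return [by_lower[match] for match in matches]

# A mistyped name; the best suggestion comes first in the list.
print(suggest("Cornu asperum"))
```

Matching on lower-cased names means that somebody typing ‘cornu asperum’ gets the same suggestions as somebody who remembers the capital letter.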

To make that work, we need a database behind the scenes that has a list of correct scientific names, a mapping between common names and scientific names, and information about the taxonomic tree: each Species is part of a Genus, which is part of a Family, which is part of an Order, which is part of a Class, which is part of a Division, which is part of a Kingdom, which is part of Life.
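The basic tree can be sketched as a set of records that each point to their parent. This is only an illustration under simple assumptions: the Taxon class, build_chain and describe are hypothetical names, the garden snail’s placement is one commonly used classification rather than a definitive one, and the sketch ignores all the intermediate ranks.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Taxon:
    rank: str                          # e.g. "Genus"
    name: str                          # e.g. "Cornu"
    parent: Optional["Taxon"] = None   # None at the top of the tree

    def lineage(self) -> List["Taxon"]:
        """Return this taxon followed by each of its ancestors in turn."""
        chain = []
        node = self
        while node is not None:
            chain.append(node)
            node = node.parent
        return chain

def build_chain(levels):
    """Link (rank, name) pairs, listed top level first; return the last one."""
    node = None
    for rank, name in levels:
        node = Taxon(rank, name, parent=node)
    return node

# Illustrative placement only. Phylum is the zoological equivalent of Division.
garden_snail = build_chain([
    ("Kingdom", "Animalia"),
    ("Phylum", "Mollusca"),
    ("Class", "Gastropoda"),
    ("Order", "Stylommatophora"),
    ("Family", "Helicidae"),
    ("Genus", "Cornu"),
    ("Species", "Cornu aspersum"),
])

def describe(taxon):
    """Format the lineage from the top of the tree down to the taxon."""
    return " > ".join(f"{t.rank} {t.name}" for t in reversed(taxon.lineage()))

print(describe(garden_snail))
```

Storing only a parent pointer keeps relocation cheap: moving a genus to a different family means changing one link, and everything beneath it follows.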

This would get reasonably complicated even if everybody agreed on what goes where. There’s all sorts of messing around with sub-Families and super-Classes and things like that on top of the basic tree structure. But of course people don’t agree. And even if everybody agreed now, new information about species’ relationships to each other is becoming available all the time – especially as genetic sequencing becomes cheaper and easier to do, and cleverer ways of mining genetic information to reveal evolutionary history are devised. So as we learn more, species get renamed, merged, split, and relocated in the taxonomic tree. And it’s not just obscure species that most iSpot users will never see that get changed around like this – the common garden snail has now been given at least four different scientific names (Helix aspersa, Cryptomphalus aspersus, Cantareus aspersus, and Cornu aspersum), and which is ‘correct’ or ‘preferred’ has been a matter of debate, sometimes vigorous, over time.

We really don’t want to do this work ourselves. It’s a whole discipline in itself, and we can’t hope to duplicate or exceed it. And one of our central development principles is to build on or link to existing work, rather than duplicating effort.

Luckily, there are two important databases that have (some of!) the information we want.

The first is the National Biodiversity Network’s NBN Species Dictionary. This is as close as we can get to a complete, definitive list of species in the UK. Different parts of it (checklists) are maintained by different groups of specialists, and updated as those specialist groups decide. New versions are published roughly four times a year. (Although there’s a backlog of new information to check in for the latest update, so that’s somewhat delayed.) It includes a scientific name and an NBN species ID. This species ID can be used to access the lovely web services that NBN make available via an API. It also has some mapping to common names for classification groups – it has a controlled list of about 160 names (e.g. ‘terrestrial mammals’, ‘higher plants’) that map on to the scientific names for points on the taxonomic tree but are (hopefully) comprehensible to ordinary people – or at least, ordinary people who start to get a little bit interested in nature.

Even better, within each checklist, there is definitive hierarchical information – what Order, Family, Genus etc. each species belongs in – for each preferred scientific name. However, combining all these to give a single consensus tree is a huge amount of work. The Natural History Museum (who do a lot of the work looking after the Species Dictionary for the NBN) did this work once, but then gave up because maintaining it was so hard.

So the Species Dictionary can give us a definitive list of preferred scientific names (and also other scientific names and how they map on to preferred scientific names). It can also give us a broad-brush top-level classification for all of these. There is taxonomic hierarchy information, but not in a form that’s easily combinable. (I want to have a look to see if we can use the checklist-level hierarchy data to support browsing at that level.) But the Species Dictionary doesn’t give us very comprehensive mappings between common names and scientific names. There are some, but not a lot.
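The mapping from ‘other’ scientific names on to preferred ones can be pictured as a simple lookup table. The sketch below is an assumption-laden illustration, not the Species Dictionary’s actual format: the table, the function names, and the choice of Cornu aspersum as the preferred name for the garden snail are all picked just for the example.

```python
# Hypothetical synonym table: every known scientific name, including the
# preferred one itself, maps to the preferred name. Treating Cornu aspersum
# as preferred is an assumption made for this example.
SYNONYMS = {
    "Helix aspersa": "Cornu aspersum",
    "Cryptomphalus aspersus": "Cornu aspersum",
    "Cantareus aspersus": "Cornu aspersum",
    "Cornu aspersum": "Cornu aspersum",
}

def preferred_name(name, table=SYNONYMS):
    """Return the preferred name for `name`, or None if it is not known."""
    return table.get(name.strip())

def preferred_names(table=SYNONYMS):
    """Return the distinct preferred names, sorted alphabetically."""
    return sorted(set(table.values()))

print(preferred_name("Helix aspersa"))
print(len(SYNONYMS), "names reduce to", len(preferred_names()), "preferred name")
```

Because the preferred name lives in one place, a change of mind about which name is preferred means updating the values in the table rather than every record that mentions the species.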

The second source of data is the Natural History Museum’s Nature Navigator. This is a lovely website for browsing the taxonomic tree, switching back and forth as you want between scientific and common names. Nature Navigator contains everything that has a common name. Some common names map on to scientific species names, but others map on to other parts of the tree – so ‘Pea family’ maps on to ‘Family Fabaceae’. It also contains complete and reasonably definitive hierarchical information (all keyed from the scientific names, rather than the common ones, but you can generate the common-name version on the fly). This looks much more promising for our purposes, since it has so many more common names and complete, usable hierarchical data.
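The point that a common name can land on any level of the tree, not just on a species, can be sketched as a lookup that returns a rank along with the scientific name. The rows, function names and normalisation rule here are hypothetical and do not reflect how Nature Navigator actually stores its data.

```python
from collections import defaultdict

# Hypothetical rows: (common name, rank, scientific name).
ROWS = [
    ("Pea family", "Family", "Fabaceae"),
    ("Robin", "Species", "Erithacus rubecula"),
    ("Blackbird", "Species", "Turdus merula"),
]

def normalise(name):
    """Lower-case the name and collapse runs of whitespace."""
    return " ".join(name.lower().split())

def build_index(rows):
    """Index rows by normalised common name; one name may have several hits."""
    index = defaultdict(list)
    for common, rank, scientific in rows:
        index[normalise(common)].append((rank, scientific))
    return dict(index)

INDEX = build_index(ROWS)

def lookup(common):
    """Return the (rank, scientific name) pairs for a common name."""
    return INDEX.get(normalise(common), [])

print(lookup("pea family"))
```

Returning a list rather than a single answer leaves room for common names that are used for more than one thing.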

However (there has to be at least one ‘but’ in taxonomy, I’m learning): it only covers things which have a common name, and lots of things don’t, including things that people will want to spot on iSpot, such as insects and spiders. And it’s been frozen since the funding ran out in 2004, and things have changed since then. And the taxonomic data it uses differs from common UK usage in many important respects – for instance, the bird data is quite different to what most UK birders use.

In rough order-of-magnitude figures:

NBN Species Dictionary: contains 250,000 scientific names, which reduces to about 80,000 preferred scientific names for species when synonyms and so on are taken into account. Some patchy common name mappings. All classified into just over 100 ‘comprehensible’ taxonomic groupings. Updated regularly.

Nature Navigator: contains 140,000 common names, mapped on to appropriate preferred scientific names/points on the taxonomic tree.

Just to add to the fun, there is an international effort well under way to create a definitive list for all species across the world, called the Catalogue of Life, merging work by ITIS in the US and Species 2000 at the University of Reading. The aim is to create a globally-unique identifier for species – a Life Science Identifier, or LSID. Thankfully, though, we as a project can leave the coordination and mapping between that and the NBN Species Dictionary to others.

The Right Answer would be to include Nature Navigator data as a checklist within the NBN Species Dictionary, which would fold all of the Nature Navigator common name data into the definitive Species Dictionary. That’s a fair amount of work, but may well be within the scope of what the Taxonomic support project within OPAL (Open Air Laboratories – the parent project of the Biodiversity Observatory) will do. We’ll be pressing them to do that.

Of course, that almost certainly won’t happen in time for the launch of iSpot in the summer, so we’ll need a stopgap solution of some sort … somehow I think converting taxonomists, biologists and field studies experts to a loose, Web 2.0 folksonomy approach is going to be beyond the scope of this project!