Show me the data!

Using the NHM Data Portal API

September 30th, 2015 | Posted by rmounce in Content Mining | NHM

Anyone care to remember how awful and unusable the web interface for accessing the NHM’s specimen records used to be? Behold the horror below as it was in 2013, or visit the Web Archive to see just how bad it was. It’s not even the ‘look’ of it that was the major problem – it was more that it simply wouldn’t return results for many searches. No one I know actually used that web interface because of these issues. And obviously there was no API.

2013. It was worse than it looks.

The internal database that the NHM uses is based upon KE Emu, and everyone who’s had the misfortune of having to use it knows that it’s literally dinosaur software – it wouldn’t look out of place in the year 1999, and again, its actual (poor) performance is the far bigger problem. I guess by 2025 the museum might replace it, if there’s sufficient funding and the political issues keeping it in place are successfully navigated. To hear just how well everyone at the museum knows what I’m talking about, listen to the knowing laughter in the audience when I describe the NHM’s KE Emu database as “terrible” in my SciFri talk video below (from about 3:49 onwards):

Given the above, perhaps now you can better understand my astonishment at, and the sincere praise I feel is due to, the team behind the still relatively new online NHM Data Portal at: http://data.nhm.ac.uk/

The new Data Portal is flipping brilliant. Ben Scott is the genius behind it all – the lead architect of the project. Give that man a pay raise, ASAP!

He’s successfully implemented the open source CKAN software, which incidentally is maintained by the Open Knowledge Foundation (now known simply as Open Knowledge). This is the same software solution that both the US and UK governments use to publish their open government data. It’s a good, proven, popular design choice, it scales, and I’m pleased to say it works really well for both casual users and more advanced users. This is where the title of this post comes in…

The NHM Specimen Records now have an API and this is bloody brilliant

In my text mining project to find NHM specimens in the literature, and link them up to the NHM’s official specimen records, it’s vitally important to have a reliable, programmatic web service I can use to look up tens of thousands of catalogue numbers against. If I had to copy and paste in each one (e.g. “BMNH(E)1239048”) manually, using a GUI web browser, my work simply wouldn’t be possible. I wouldn’t have even started my project.

Put simply, the new Data Portal is a massive enabler for academic research.

To give something back for all the usage tips that Ben has been kindly giving me (thanks!), I thought I’d use this post to describe how I’ve been using the NHM Data Portal API to do my research:

At first, I was simply querying the database from a local dump. One of the many great features of the new Specimen Records database at the Data Portal is that the portal enables you to download the entire database as a single plain text table: over 3GB in size. Just click the “Download” button, you can’t miss it! But after a while, I realised this approach was impractical – after just a few weeks, my local copy was significantly out of date. New specimen records are made public on the Data Portal every week, I think!

So, I had to bite the bullet and learn how to use the web API. Yes: it’s a museum with an API! How cool is that? There really aren’t many of those around at the moment. This is cutting-edge technology for museums. The Berkeley Ecoinformatics Engine is one other I know of. Among other things it allows API access to geolocated specimen records from the Berkeley Natural History Museums. Let me know in the comments if you know of more.

The basic API query for the NHM Data Portal Specimen Records database is this:

That doesn’t look pretty, so let me break it down into meaningful chunks.

The first part of the URL is the base URL and is the typical CKAN DataStore Data API endpoint for data search. The second part specifies which exact database on the Data Portal you’d like to search. Each database has its own 32-digit GUID to uniquely identify it. There are currently 25 different databases/datasets available at the NHM Data Portal, including data from the PREDICTS project, assessing ecological diversity in changing terrestrial systems. The third and final part is the specific query you want to run against the specified database, in this case: “Archaeopteryx”. This is a simple search that queries across all fields of the database, which may be too generic for many purposes.
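As a rough illustration of those three parts, here is a minimal Python sketch that assembles such a query URL. It assumes the standard CKAN `datastore_search` action path on the portal's host and the standard `resource_id` and `q` parameters. The GUID below is a placeholder, not a real dataset identifier, and `build_search_url` is just a name made up for this example.

```python
from urllib.parse import urlencode

# Part 1: the usual CKAN DataStore search endpoint (assumed path on this host).
BASE_URL = "http://data.nhm.ac.uk/api/3/action/datastore_search"

# Part 2: placeholder only. Substitute the real GUID of the dataset you want.
RESOURCE_ID = "00000000-0000-0000-0000-000000000000"


def build_search_url(query, resource_id=RESOURCE_ID, base_url=BASE_URL):
    """Join endpoint, dataset identifier and free-text query into one URL."""
    # Part 3: 'q' is CKAN's full-text search across all fields.
    params = {"resource_id": resource_id, "q": query}
    return base_url + "?" + urlencode(params)


url = build_search_url("Archaeopteryx")
print(url)
```

Using `urlencode` rather than gluing strings together by hand means that queries containing spaces or punctuation are escaped properly.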

This query will return 2 specimen records in JSON format. The output doesn’t look pretty to human eyes, but to a computer this is cleanly-structured data and it can easily be further analysed, manipulated or converted.
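To show what “cleanly-structured” means in practice, here is a small sketch of pulling the records out of such a response. It assumes the usual CKAN DataStore response layout (a `success` flag and a `result` object holding `records` and `total`). The sample text is a made-up, trimmed stand-in with placeholder values, not real output from the portal, and `extract_records` is a hypothetical helper.

```python
import json

# Made-up, trimmed response in the usual CKAN DataStore shape.
# The values are placeholders, not real NHM specimen data.
sample_response = """
{
  "success": true,
  "result": {
    "total": 2,
    "records": [
      {"_id": 1, "catalogNumber": "EXAMPLE-1"},
      {"_id": 2, "catalogNumber": "EXAMPLE-2"}
    ]
  }
}
"""


def extract_records(response_text):
    """Return the list of record dicts, or raise if the API reported failure."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("API call did not succeed")
    return payload["result"]["records"]


records = extract_records(sample_response)
for record in records:
    print(record["_id"], record["catalogNumber"])
```

Once the records are plain Python dictionaries, converting them to CSV or loading them into a data frame is straightforward.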

More complex / realistic search queries using the API

The simple search queries across all fields. A more targeted query on a particular field of the database is sometimes more desirable. You can do this with the API too:

In the above example I have filtered my API query to search the “catalogNumber” field of the database for the exact string “PV P 51007”.
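A field-filtered query of this kind might be assembled as in the sketch below. It assumes the standard CKAN `filters` parameter, which takes a JSON object mapping field names to the exact values to match. The endpoint path is the usual CKAN one, the GUID is a placeholder, and `build_filter_url` is a hypothetical helper.

```python
import json
from urllib.parse import urlencode

# Usual CKAN DataStore search endpoint (assumed path on this host).
BASE_URL = "http://data.nhm.ac.uk/api/3/action/datastore_search"

# Placeholder only. Substitute the real GUID of the specimen dataset.
RESOURCE_ID = "00000000-0000-0000-0000-000000000000"


def build_filter_url(field, value, resource_id=RESOURCE_ID, base_url=BASE_URL):
    """Build a query that matches one field against one exact value."""
    # 'filters' must be a JSON object, so serialise it before URL-encoding.
    filters = json.dumps({field: value})
    params = {"resource_id": resource_id, "filters": filters}
    return base_url + "?" + urlencode(params)


filter_url = build_filter_url("catalogNumber", "PV P 51007")
print(filter_url)
```

Because the filter is an exact match, the spaces in “PV P 51007” matter, and they are preserved (escaped) in the generated URL.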

This isn’t very forgiving though. If you search for just “51007” with this type of filter you get 0 records returned:

So, the kind of search I’m actually going to use to look up my putative catalogue numbers (as found in the published literature) via the API will have to make use of the more complex SQL-query style:

This query returns 19 records that contain, at least partially, the string ‘51007’ in the catalogNumber field. Incidentally, you’ll see if you run this search that 3 completely different entomological specimen records share the exact same catalogue number: “BMNH(E)251007”:
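For readers who want to try a partial-match lookup themselves, here is one way such an SQL-style query could be put together. This is only a sketch: it assumes the portal exposes CKAN's standard `datastore_search_sql` action with its `sql` parameter, and the `LIKE` statement is my guess at a suitable query, not necessarily the exact one used above. The GUID is a placeholder and `build_partial_match_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# CKAN's SQL search action (assumed to be enabled on this host).
SQL_URL = "http://data.nhm.ac.uk/api/3/action/datastore_search_sql"

# Placeholder only. Substitute the real GUID of the specimen dataset.
RESOURCE_ID = "00000000-0000-0000-0000-000000000000"


def build_partial_match_url(fragment, field="catalogNumber",
                            resource_id=RESOURCE_ID, base_url=SQL_URL):
    """Build a query for records whose field contains the given fragment."""
    # Accept only letters and digits so the fragment cannot break out of
    # the quoted SQL literal below.
    if not fragment.isalnum():
        raise ValueError("fragment must be alphanumeric")
    sql = (
        f'SELECT * FROM "{resource_id}" '
        f"WHERE \"{field}\" LIKE '%{fragment}%'"
    )
    return base_url + "?" + urlencode({"sql": sql})


sql_url = build_partial_match_url("51007")
print(sql_url)
```

The alphanumeric check is deliberately strict. Catalogue numbers with brackets or spaces would need proper escaping rather than this simple guard.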

Thamastes dipterus Hagen, 1858 (Trichoptera, Limnephilidae)

Contarinia kanervoi Barnes, 1958 (Diptera, Cecidomyiidae)

Sympycnus peniculitarsus Hollis, D., 1964 (Diptera, Dolichopodidae)

NHM catalogue numbers are unfortunately far from uniquely identifying, but that’s something I’ll leave for the next post in this series!

Isn’t the NHM Data Portal amazing? I certainly think it is. Especially given what it was like before!