Show me the data!

Using the NHM Data Portal API

September 30th, 2015 | Posted by rmounce in Content Mining | NHM - (0 Comments)

Anyone care to remember how awful and unusable the web interface for accessing the NHM’s specimen records used to be? Behold the horror below as it was in 2013, or visit the Web Archive to see just how bad it was. It’s not even the ‘look’ of it that was the major problem – it was more that it simply wouldn’t return results for many searches. No one I know actually used that web interface because of these issues. And obviously there was no API.

2013. It was worse than it looks.


The internal database that the NHM uses is based upon KE Emu, and everyone who’s had the misfortune of having to use it knows that it’s dinosaur software – it wouldn’t look out of place in 1999 and, again, its actual (poor) performance is the far bigger problem. I guess by 2025 the museum might replace it, if there’s sufficient funding and the political issues keeping it in place are successfully navigated. To hear just how much everyone at the museum knows what I’m talking about, listen to the knowing laughter in the audience when I describe the NHM’s KE Emu database as “terrible” in my SciFri talk video below (from about 3:49 onwards):

Given the above, perhaps now you can better understand my astonishment at, and the sincere praise I feel is due to, the team behind the still relatively new online NHM Data Portal at: http://data.nhm.ac.uk/

The new Data Portal is flipping brilliant. Ben Scott is the genius behind it all – the lead architect of the project. Give that man a pay raise, ASAP!

He’s successfully implemented the open source CKAN software, which itself incidentally is maintained by the Open Knowledge Foundation (now known simply as Open Knowledge). This is the same software solution that both the US and UK governments use to publish their open government data. It’s a good, proven, popular design choice, it scales, and I’m pleased to say it works really well for both casual users and more advanced users. This is where the title of this post comes in…

The NHM Specimen Records now have an API and this is bloody brilliant

In my text mining project to find NHM specimens in the literature, and link them up to the NHM’s official specimen records, it’s vitally important to have a reliable, programmatic web service against which I can look up tens of thousands of catalogue numbers. If I had to copy and paste in each one, e.g. “BMNH(E)1239048”, manually using a GUI web browser, my work simply wouldn’t be possible. I wouldn’t have even started my project.

Put simply, the new Data Portal is a massive enabler for academic research.

To give something back for all the usage tips that Ben has been kindly giving me (thanks!), I’d thought I’d use this post to describe how I’ve been using the NHM Data Portal API to do my research:

At first, I was simply querying the database from a local dump. One of the many great features of the new Specimen Records database at the Data Portal is that the portal enables you to download the entire database as a single plain text table, over 3 GB in size. Just click the “Download” button, you can’t miss it! But after a while, I realised this approach was impractical – after just a few weeks my local copy was significantly out of date. New specimen records are made public on the Data Portal every week, I think!

So, I had to bite the bullet and learn how to use the web API. Yes: it’s a museum with an API! How cool is that? There really aren’t many of those around at the moment. This is cutting-edge technology for museums. The Berkeley Ecoinformatics Engine is one other I know of. Among other things it allows API access to geolocated specimen records from the Berkeley Natural History Museums. Let me know in the comments if you know of more.

The basic API query for the NHM Data Portal Specimen Records database is this:
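The query itself is just a URL. As a sketch of its shape built in Python (the resource GUID below is a placeholder, not the Specimen Records dataset’s real ID, which is shown on its page on the Data Portal):

```python
from urllib.parse import urlencode

# Standard CKAN DataStore search endpoint on the NHM Data Portal
BASE_URL = "http://data.nhm.ac.uk/api/3/action/datastore_search"

# Placeholder GUID -- substitute the real 32-digit identifier of the
# Specimen Records dataset from its Data Portal page
RESOURCE_ID = "00000000-0000-0000-0000-000000000000"

# A simple search: "q" queries "Archaeopteryx" across ALL fields
query_url = BASE_URL + "?" + urlencode({
    "resource_id": RESOURCE_ID,
    "q": "Archaeopteryx",
})
print(query_url)
```

Pasting the printed URL into a browser, or fetching it with any HTTP client, returns the results as JSON.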

That doesn’t look pretty, so let me break it down into meaningful chunks.

The first part of the URL is the base URL – the typical CKAN DataStore Data API endpoint for data search. The second part specifies which exact database on the Data Portal you’d like to search. Each database has its own 32-digit GUID to uniquely identify it. There are currently 25 different databases/datasets available at the NHM Data Portal, including data from the PREDICTS project, assessing ecological diversity in changing terrestrial systems. The third and final part is the specific query you want to run against the specified database, in this case: “Archaeopteryx”. This is a simple search that queries across all fields of the database, which may be too generic for many purposes.

This query will return 2 specimen records in JSON format. The output doesn’t look pretty to human eyes, but to a computer this is cleanly-structured data and it can easily be further analysed, manipulated or converted.
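To sketch what working with that JSON looks like: the response follows the standard CKAN envelope, with the matching rows under `result.records`. The response below is a trimmed, made-up example of that envelope, not real NHM data, and the field names shown are illustrative:

```python
import json

# A trimmed, illustrative CKAN response envelope (not real NHM data)
response_text = """
{
  "success": true,
  "result": {
    "total": 2,
    "records": [
      {"catalogNumber": "...", "scientificName": "Archaeopteryx lithographica"},
      {"catalogNumber": "...", "scientificName": "Archaeopteryx lithographica"}
    ]
  }
}
"""

data = json.loads(response_text)
if data["success"]:
    # Each record is a plain dictionary -- easy to analyse or convert
    for record in data["result"]["records"]:
        print(record["scientificName"])
```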

More complex / realistic search queries using the API

The simple search queries across all fields. A more targeted query on a particular field of the database is sometimes more desirable. You can do this with the API too:
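In CKAN’s DataStore API, field-level filtering uses the `filters` parameter, which takes a JSON dictionary mapping field names to exact values. A sketch, again with a placeholder GUID standing in for the real dataset ID:

```python
import json
from urllib.parse import urlencode

BASE_URL = "http://data.nhm.ac.uk/api/3/action/datastore_search"
RESOURCE_ID = "00000000-0000-0000-0000-000000000000"  # placeholder GUID

# "filters" is a JSON dictionary of field name -> exact value
query_url = BASE_URL + "?" + urlencode({
    "resource_id": RESOURCE_ID,
    "filters": json.dumps({"catalogNumber": "PV P 51007"}),
})
print(query_url)
```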

In the above example I have filtered my API query to search the “catalogNumber” field of the database for the exact string “PV P 51007”.

This isn’t very forgiving though. If you search for just “51007” with this type of filter you get 0 records returned:
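The same sketch with just the partial string shows why: `filters` does exact matching, so “51007” equals no full catalogNumber value (placeholder GUID as before):

```python
import json
from urllib.parse import urlencode

BASE_URL = "http://data.nhm.ac.uk/api/3/action/datastore_search"
RESOURCE_ID = "00000000-0000-0000-0000-000000000000"  # placeholder GUID

# Exact-match filtering on a partial catalogue number matches nothing
query_url = BASE_URL + "?" + urlencode({
    "resource_id": RESOURCE_ID,
    "filters": json.dumps({"catalogNumber": "51007"}),
})
print(query_url)
```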

So, the kind of search I’m actually going to use to lookup my putative catalogue numbers (as found in the published literature) via the API, will have to make use of the more complex SQL-query style:
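CKAN exposes a separate `datastore_search_sql` endpoint that accepts a PostgreSQL-flavoured SELECT statement, so a LIKE clause with wildcards can match catalogue numbers that merely contain the partial string. A sketch, with the placeholder GUID standing in for the real dataset ID:

```python
from urllib.parse import urlencode

SQL_URL = "http://data.nhm.ac.uk/api/3/action/datastore_search_sql"
RESOURCE_ID = "00000000-0000-0000-0000-000000000000"  # placeholder GUID

# LIKE with % wildcards matches any catalogNumber containing "51007"
sql = 'SELECT * FROM "{0}" WHERE "catalogNumber" LIKE \'%51007%\''.format(RESOURCE_ID)
query_url = SQL_URL + "?" + urlencode({"sql": sql})
print(query_url)
```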

This query returns 19 records that contain the string ‘51007’ somewhere in the catalogNumber field. Incidentally, you’ll see if you run this search that 3 completely different entomological specimen records share the exact same catalogue number: “BMNH(E)251007”:

Thamastes dipterus Hagen, 1858 (Trichoptera, Limnephilidae)

Contarinia kanervoi Barnes, 1958 (Diptera, Cecidomyiidae)

Sympycnus peniculitarsus Hollis, D., 1964 (Diptera, Dolichopodidae)

NHM catalogue numbers are unfortunately far from uniquely identifying, but that’s something I’ll leave for the next post in this series!

Isn’t the NHM Data Portal amazing? I certainly think it is. Especially given what it was like before!

Yesterday, I tried to read a piece of research relevant to my interests that was published in 1949. Sadly, as is usual, I hit a paywall asking me for £30 + tax to read it (I didn’t pay).

Needless to say, the author is almost certainly deceased so I can’t simply email him for a copy.

The paper copy is useless to me, even though my institution probably has one somewhere. I need electronic access. It would probably take me an hour to walk to the library, do the required catalogue searches, find the shelf, find the issue, find the page, re-type the paragraphs I need back into a computer, walk back to my desk etc… That whole paper-based workflow is a non-starter.

I noted the article is available electronically online to some lucky, privileged subscribers – but who? Why is the list of institutions that are privileged enough to have access to paywalled articles not public information? It would be extremely helpful to know what institutions have access to which journals & which journal year ranges.

So I thought I’d run an informal poll of people on Twitter about this issue:

I received an overwhelming number of responses. Probably over a hundred in total. Huge thanks to all those who took part.

Given such a brilliant community response it would be remiss of me not to share what I’ve learnt with everyone, not just those who helped contribute each little piece of information. So 24 hours later, here’s what I now know about who can access this 1949 paper (data supporting these statements is permanently archived at Zenodo):

Mounce, Ross. (2015). Data on which institutions have access to a 1949 paper, paywalled at Taylor & Francis. Zenodo.

I’m not pretending the following analysis of the data is rigorous science. It’s not. It’s anecdata about access to a single paper at a single journal (a classic n=1 experiment). Of course it also relies on each contributor correctly reporting the truth, and some potential respondents may have self-censored. The sampling is highly non-random and reflects my social sphere of influence on Twitter; predominantly US- and UK-centric, although I do have single data points from Brazil & Australia (thanks Gabi & Cameron!). Nevertheless, despite all these provisos it’s highly interesting anecdata:

The United Kingdom of Great Britain and Northern Ireland:
Of responses representing 41 different UK institutions including my own, only 3 have access to this paper, namely: University of Cambridge, University of Oxford, and University of Glasgow.
Had I got more responses from a wider variety of UK HEIs like the University of Lincoln and University Of Worcester where I also have friends, I suspect the overall percentage of UK institutions that have access would be even smaller! I’m particularly amused that it appears that no London-based institution has electronic access to this paper!

North America:
Of responses representing 29 different institutions in Canada and the United States, only 7 have access to the paper, namely: Virginia Tech, University of Illinois, University of Florida, North Carolina State University, Case Western Reserve University, Arizona State University, and McGill University. It’s intriguing that North American institutions appear to have slightly better access to this journal as originally the journal was published in London, England!

The ‘rest of the world’ (not meant in a patronising way):
Of responses representing 23 different institutions not based in the UK, Canada, or the United States, only 2 definitely have access to this paper: Wageningen University and Stockholm University. I note that the person who contributed data on Stockholm University access does not have an official recognised affiliation with Stockholm university and that they used alternative methods *cough* to discover this (just for clarity and to further demonstrate the sampling issues at play here!).

Despite asking far and wide, I only found 11 different institutions that actually have electronic access to this paper, and none from London where the paper was actually published.

I’m fascinated by this data, despite its limitations. I’d like to collect more and collect it more efficiently. Perhaps the librarian community could help by publishing exactly what each institution has access to? Although one conversation thread seemed to indicate that libraries may not even know exactly what they have subscribed to at any one point in time (Seriously? WTF!).

Why is this stuff important to know?

I often hear an old canard from certain people that we don’t need open access because “most researchers have access to all the journals and articles they need”. Sometimes some crap, misleading survey data is trotted out to support this opinion. Actual data on which actual institutions have actual access to subscription-only research is pivotal to countering this canard. For example, it is extremely useful to point out that institutions like Brock University and the University of Montreal, along with many other institutions that employ ecologists, do NOT have access to the bundle of Wiley journals – particularly at a time when, maddeningly, many societies have decided to start publishing with Wiley, e.g. the Ecological Society of America! It’s not very joined-up thinking and it’s going to create a lot of pain for a lot of people. I’m sure there are useful examples in other subject areas too of mismatch between subscriptions held & access needed. The solution, of course, is NOT to re-subscribe, but to fix the problem at its source: to fully recognise that access is a global issue, that many people need access to a very wide variety of different journals, and that a proper transition to an open access availability model is needed.

If I wait 26 years, it will be available for free in the Biodiversity Heritage Library. I hope I live that long!

What to do next?

If your institution isn’t listed in my dataset so far, please do still try and access this article and let me know if you can or cannot instantly access it via your institutional affiliations from Taylor & Francis.

Given we have researchers coming from all corners of the globe for OpenCon later this year, I will soon explore whether together, as the OpenCon community, we can do something like this on a grander scale to more rigorously document the patchy nature of subscription-provided access.

The final word

I’ll leave the final word to the obvious ‘elephant in the room’ that I haven’t discussed much so far: the 99.99%, relative to us privileged, institutionally-affiliated lucky ones. I am very obviously aware of, and do care about, independent researchers & readers among the ‘general public’, neither of whom can afford subscription access to most paywalled journals:

Today (2015-09-01) marks the public announcement of Research Ideas & Outcomes (RIO for short), a new open access journal for all disciplines that seeks to open up the entire research cycle with some truly novel features.

I know what you might be thinking: Another open access journal? Really? 

Neither I nor Daniel Mietchen would be involved with this project if it were just another boring open access journal. This journal packs a mighty combination of novel features into one platform:

  • 1.) RIO will publish research proposals, as well as regular research outputs such as articles, data papers and software – to my knowledge, this has never been done by a journal before
  • 2.) RIO will label research outputs with ‘Impact Categories’ based upon UN Millennium Development Goals (MDGs) and EU Societal Challenges, to highlight the real-world relevance of research and to better link-up research across disciplines (see below for some example MDGs).


  • 3.) RIO supports a variety of different types of peer-review, including ‘pre-submission, author-facilitated, external peer-review‘ (new), as well as post-publication journal-organized open peer-review (similar to that pioneered by F1000Research), and ‘spontaneous’ (not journal-organized) post-publication open peer-review which is actively encouraged. All peer-review will be open/public, in keeping with the overall guiding philosophy of the journal to increase transparency and reduce waste in the research cycle. Reviewer comments are highly valuable; it is a waste not to make them public. When supplied, all reviewer comments will be made openly available.
  • 4.) RIO offers flexibility in publishing services and pricing in a bold attempt to ‘decouple’ the traditional scholarly journal into its component services. Authors & funders thus may choose to pay for the publishing services they actually want, not an inflexible bundle of different services, as is the case at most journals.
Source: Priem, J. and Hemminger, B. M. 2012. Decoupling the scholarly journal. Frontiers in Computational Neuroscience. Licensed under CC BY-NC


 

  • 5.) On the technical side of things, RIO uses an integrated end-to-end XML-backed publication system for Authoring, Reviewing, Publishing, Hosting, and Archiving called ARPHA. As a publishing geek this excites me greatly as it eliminates the need for typesetting, ensuring a smooth and low-cost publishing process. Reviewers can make comments inline or more generally over the entire manuscript, on the very same document and platform that the authors wrote in, much like Google Docs. This has been successfully tried and tested for years at the Biodiversity Data Journal and is a system now ready for wider use.

 

For the above reasons and more, I’m hugely excited about this journal and am delighted to be one of its founding editors alongside Dr Daniel Mietchen. See our growing list of Advisory and Editorial Board members for insight into who else is backing this new journal – we’ve got some great people on board already! If you’re interested in supporting this initiative please do enquire about volunteering as an editor for the journal; we need more editors to support the broad scale and ambition of the journal. You can apply via the main website here.