Show me the data!

OpenCon 2015 Brussels was an amazing event. I’ll save a summary of it for the weekend, but in the meantime I urgently need to discuss something that came up at the conference.

At OpenCon, it emerged that Elsevier have apparently been blocking Chris Hartgerink’s attempts to access relevant psychological research papers for content mining.

No one can doubt that Chris’s research intent is legitimate – he’s not fooling around here. He’s a smart guy, statistically, programmatically and scientifically, and without doubt he has the technical skills to execute his proposed research. Only recently he was an author on an excellent paper highlighted in Nature News: ‘Smart software spots statistical errors in psychology papers’.

Why then are Elsevier interfering with his research?

I know nothing more about his case than what is in his blog posts. However, I have also had publishers block my own attempts to do content mining this year, so I think this is the right time for me to go public about this, in support of Chris.

My own use of content mining

I am trying to map where in the giant morass of research literature Natural History Museum (London) specimens are mentioned. No one has an accurate index of this information. With the use of simple regular expressions it’s easy to filter hundreds of thousands of full text articles to find, classify and look up potential mentions of specimens.
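A minimal sketch of that kind of filter, assuming the articles have already been saved as plain text files in one directory. The function name `find_specimen_mentions` and the pattern are illustrative assumptions, not the expressions actually used in this project:

```shell
# Hypothetical helper: list candidate specimen mentions in a directory of
# plain-text articles. The pattern is an assumption: an institution prefix,
# up to 8 non-digit characters, then a run of digits.
find_specimen_mentions() {
    # -r recurse, -H print file names, -o print only the matched text,
    # -E use extended regular expressions
    grep -rHoE '(NHMUK|BMNH)[^0-9]{0,8}[0-9]+' "$1"
}

# Usage: find_specimen_mentions /path/to/fulltext
```

Each output line pairs a file name with one candidate identifier, which can then be classified and looked up in a later step.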

In the course of this work, I was frequently obstructed by BioOne. My IP address kept getting blocked, stopping me from downloading any further papers from this publisher. I should note here that my institution (NHMUK) pays BioOne to provide access to all their papers – my access is both legitimate and paid-for.

Strong claims require strong evidence. Thankfully I was doing my work with the full support and knowledge of the NHM Library & Archives team, so they forwarded one or two of the threatening messages they were getting from the publishers I was mining. I have no idea how many messages were sent in total. Here’s one such message from BioOne (below):

Blocked by BioOne

According to BioOne, as I swiftly found out, downloading more than 100 full text articles in a single session is automatically deemed “excessive” and “a violation of permissible activity”.

Isn’t that absolutely crazy? In the age of ‘big data’, where anyone can download over a million full text articles from the PubMed Central OA subset in a few clicks, an artificially imposed restriction of just 100 is simply mad and is anti-science. As a member of a subscription-paying institution, surely I have a paid right to access and analyze this content? We are paying for access but not actually getting full access.

If I tell other journals like eLife, PLOS ONE, or PeerJ that I have downloaded every single one of their articles for analysis, I get a high-five: these journals understand the importance of analysis-at-scale. Furthermore, the subscription access business model needn’t be a barrier: the Royal Society journals are very friendly towards content mining. I have never had a problem downloading entire decades’ worth of journal content from the Royal Society journals.

I have two objectives for this blog post.

1.) A plea to traditional publishers: PLEASE STOP BLOCKING LEGITIMATE RESEARCH

Please get out of the way and let us do our research. If our institutions have paid for access, you should provide it to us. You are clearly impeding the progress of science. Far more content mining research has been done on open access content and there’s a reason for that – it’s a heck of a lot less hassle and (legal) danger. These artificial obstructions on access to research are absurd and unhelpful.

2.) A plea to researchers and librarians: SHARE YOUR STORIES

I’m absolutely sure it’s not just Chris and me who have experienced problems with traditional publishers artificially obstructing our research. Heather Piwowar is one great example I know. She bravely, extensively and publicly documented her torturous experiences of negotiating access to, and text mining of, Elsevier-controlled content. But we need more people to speak up. I fear that librarians in particular may be inadvertently sweeping these issues under the carpet – they are the most likely to get the most interesting emails from publishers with respect to these matters.

This is a serious matter. Given the experience of Aaron Swartz, who faced up to 50 years of imprisonment for downloading ‘too many’ JSTOR papers, it would not surprise me if few researchers come forward publicly.

Using the NHM Data Portal API

September 30th, 2015 | Posted by rmounce in Content Mining | NHM

Anyone care to remember how awful and unusable the web interface for accessing the NHM’s specimen records used to be? Behold the horror below as it was in 2013, or visit the Web Archive to see just how bad it was. It’s not even the ‘look’ of it that was the major problem – it was more that it simply wouldn’t return results for many searches. No one I know actually used that web interface because of these issues. And obviously there was no API.

2013. It was worse than it looks.

The internal database that the NHM uses is based upon KE Emu, and everyone who’s had the misfortune of having to use it knows that it’s dinosaur software. It wouldn’t look out of place in the year 1999 and, again, its actual (poor) performance is the far bigger problem. I guess by 2025 the museum might replace it, if there’s sufficient funding and the political issues keeping it in place are successfully navigated. To hear just how well everyone at the museum knows what I’m talking about, listen to the knowing laughter in the audience when I describe the NHM’s KE Emu database as “terrible” in my SciFri talk video below (from about 3:49 onwards):

Given the above, perhaps now you can better understand my astonishment at, and the sincere praise I feel is due to, the team behind the still relatively new online NHM Data Portal at:

The new Data Portal is flipping brilliant. Ben Scott is the genius behind it all – the lead architect of the project. Give that man a pay raise, ASAP!

He’s successfully implemented the open source CKAN software, which incidentally is maintained by the Open Knowledge Foundation (now known simply as Open Knowledge). This is the same software solution that both the US and UK governments use to publish their open government data. It’s a good, proven, popular design choice, it scales, and I’m pleased to say it works really well for both casual users and more advanced users. This is where the title of this post comes in…

The NHM Specimen Records now have an API and this is bloody brilliant

In my text mining project to find NHM specimens in the literature, and link them up to the NHM’s official specimen records, it’s vitally important to have a reliable, programmatic web service I can use to look up tens of thousands of catalogue numbers against. If I had to copy and paste in each one (e.g. “BMNH(E)1239048”) manually, using a GUI web browser, my work simply wouldn’t be possible. I wouldn’t have even started my project.

Put simply, the new Data Portal is a massive enabler for academic research.

To give something back for all the usage tips that Ben has been kindly giving me (thanks!), I thought I’d use this post to describe how I’ve been using the NHM Data Portal API to do my research:

At first, I was simply querying the database from a local dump. One of the many great features of the new Specimen Records database at the Data Portal is that the portal enables you to download the entire database as a single plain text table, over 3 GB in size. Just click the “Download” button, you can’t miss it! But after a while, I realised this approach was impractical: after just a few weeks my local copy was significantly out of date. New specimen records are made public on the Data Portal every week, I think!

So, I had to bite the bullet and learn how to use the web API. Yes: it’s a museum with an API! How cool is that? There really aren’t many of those around at the moment. This is cutting-edge technology for museums. The Berkeley Ecoinformatics Engine is one other I know of. Among other things it allows API access to geolocated specimen records from the Berkeley Natural History Museums. Let me know in the comments if you know of more.

The basic API query for the NHM Data Portal Specimen Records database is this:

That doesn’t look pretty, so let me break it down into meaningful chunks.





The first part of the URL is the base URL and is the typical CKAN DataStore Data API endpoint for data search. The second part specifies which exact database on the Data Portal you’d like to search. Each database has its own 32-digit GUID to uniquely identify it. There are currently 25 different databases/datasets available at the NHM Data Portal, including data from the PREDICTS project, assessing ecological diversity in changing terrestrial systems. The third and final part is the specific query you want to run against the specified database, in this case: “Archaeopteryx”. This is a simple search that queries across all fields of the database, which may be too generic for many purposes.

This query will return 2 specimen records in JSON format. The output doesn’t look pretty to human eyes, but to a computer this is cleanly-structured data and it can easily be further analysed, manipulated or converted.
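The three-part structure described above can be sketched in shell. This is only an illustration under stated assumptions: the host in `BASE_URL` is assumed, the helper name `build_search_url` is hypothetical, and the resource id in the usage comment is a placeholder rather than the real GUID. No URL encoding is done, so it only suits simple one-word queries:

```shell
# Part 1: CKAN DataStore search endpoint (the host name is an assumption)
BASE_URL="https://data.nhm.ac.uk/api/action/datastore_search"

# Hypothetical helper: join the three parts into one query URL.
build_search_url() {
    resource_id=$1   # part 2: GUID of the dataset to search
    query=$2         # part 3: free-text term matched across all fields
    printf '%s?resource_id=%s&q=%s\n' "$BASE_URL" "$resource_id" "$query"
}

# Usage (needs network access and a real resource id in place of the placeholder):
#   curl -s "$(build_search_url "<resource-id>" "Archaeopteryx")"
```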

More complex / realistic search queries using the API

The simple search queries across all fields. A more targeted query on a particular field of the database is sometimes more desirable. You can do this with the API too:




&filters={"catalogNumber":"PV P 51007"}

In the above example I have filtered my API query to search the “catalogNumber” field of the database for the exact string “PV P 51007”.

This isn’t very forgiving though. If you search for just “51007” with this type of filter you get 0 records returned: {%22catalogNumber%22:%2251007%22}
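As a rough sketch, the exact-match filter can be built and sent from the shell, letting curl handle the URL encoding of the quotes and spaces. The helper name `make_filter` is hypothetical, and the host and resource id in the usage comment are assumptions and placeholders:

```shell
# Hypothetical helper: build the JSON value for the `filters` parameter
# (exact match on one field). Assumes the field name and value contain
# no double quotes or backslashes.
make_filter() {
    printf '{"%s":"%s"}\n' "$1" "$2"
}

# Usage (needs network access; <resource-id> is a placeholder):
#   curl -sG "https://data.nhm.ac.uk/api/action/datastore_search" \
#        --data-urlencode "resource_id=<resource-id>" \
#        --data-urlencode "filters=$(make_filter catalogNumber 'PV P 51007')"
```

Using `curl -G` with `--data-urlencode` appends each parameter to the URL as an encoded query string, which avoids hand-writing sequences such as `%22`.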

So, the kind of search I’m actually going to use to look up my putative catalogue numbers (as found in the published literature) via the API will have to make use of the more complex SQL-query style:





This query returns 19 records that contain, at least partially, the string ‘51007’ in the catalogNumber field. Incidentally, you’ll see if you run this search that 3 completely different entomological specimen records share the exact same catalogue number: “BMNH(E)251007”:

Thamastes dipterus Hagen, 1858 (Trichoptera, Limnephilidae)

Contarinia kanervoi Barnes, 1958 (Diptera, Cecidomyiidae)

Sympycnus peniculitarsus Hollis, D., 1964 (Diptera, Dolichopodidae)

NHM catalogue numbers are unfortunately far from uniquely identifying, but that’s something I’ll leave for the next post in this series!
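For reference, a partial-match lookup of this kind might be composed as follows. This is a sketch, not the exact query used above: it assumes the portal exposes CKAN’s `datastore_search_sql` action, the helper name `make_like_sql` is hypothetical, and the resource id is a placeholder:

```shell
# Hypothetical helper: compose a partial-match SQL statement on catalogNumber.
# Assumes the search term contains no single quotes.
make_like_sql() {
    resource_id=$1   # placeholder GUID of the dataset
    term=$2          # substring to look for in the catalogNumber field
    # In the printf format, %% prints a literal % (the SQL LIKE wildcard)
    printf "SELECT * FROM \"%s\" WHERE \"catalogNumber\" LIKE '%%%s%%'\n" "$resource_id" "$term"
}

# Usage (needs network access; host and <resource-id> are assumptions):
#   curl -sG "https://data.nhm.ac.uk/api/action/datastore_search_sql" \
#        --data-urlencode "sql=$(make_like_sql '<resource-id>' 51007)"
```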

Isn’t the NHM Data Portal amazing? I certainly think it is. Especially given what it was like before!

With a first commit to GitHub not so long ago (2015-04-13), getpapers is one of the newest tools in the ContentMine toolchain.

It’s also the most readily accessible and perhaps most immediately exciting – it does exactly what it says on the tin: it gets papers for you en masse without having to click around all those different publisher websites. A superb time-saver.

It kinda reminds me of mps-youtube: a handy CLI application for watching/listening to YouTube.

Installation is super simple and usage is well documented at the source code repository on GitHub, and of course it’s available under an OSI-approved open source MIT license.

An example usage querying Europe PubMedCentral

Currently you can search 3 different aggregators of academic papers: Europe PubMedCentral, arXiv, and IEEE. Copyright restrictions unfortunately mean that full text article download with getpapers is restricted to only freely accessible or open access papers. The development team plans to add more sources that provide API access in future, although it should be noted that many research aggregators simply don’t appear to have an API at the moment e.g. bioRxiv.

The speed of the overall process is very impressive. I ran the below search & download command and it executed it all in 32 seconds, including the download of 50 full text PDFs of the search-relevant articles!

getpapers --query 'flaveria c4' -p --outdir test

You can choose to download different file formats of the search results: PDF, XML or even the supplementary data. Furthermore, getpapers integrates extremely well with the rest of the ContentMine toolchain, so it’s an ideal starting point for content mining.

getpapers is one of many tools in the ContentMine toolchain that I’ll be demonstrating to early career biologists at a FREE registration, one-day workshop at the University of Bath, Tuesday 28th July. If you’re interested in learning more about fully utilizing the research literature in scalable, reproducible ways, come along! We still have some places left. See the flyer below for more details or follow this link to the official workshop registration page:


Deep indexing supplementary data files

June 20th, 2015 | Posted by rmounce in Conservation Hackathon | Content Mining | Hack days

To prove my point about the way that supplementary data files bury useful data, making it utterly undiscoverable to most, I decided to do a little experiment (in relation to text mining for museum specimen identifiers, but also perhaps with some relevance to the NHM Conservation Hackathon):

I collected the links for all Biology Letters supplementary data files. I then filtered out the non-textual media such as audio, video and image files, then downloaded the remaining content.
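The filtering step could look something like this sketch, which assumes the links sit one per line in a text file. The function name, the file name `links.txt` and the list of media extensions are all illustrative assumptions:

```shell
# Hypothetical helper: drop links to audio, video and image files from a
# list of URLs read on stdin. The extension list is an assumption.
filter_textual_links() {
    # -v invert the match, -i ignore case, -E extended regular expressions
    grep -viE '\.(mp3|wav|avi|mov|mp4|mpg|wmv|jpg|jpeg|png|gif|tif|tiff)$'
}

# Usage (links.txt is a hypothetical file with one URL per line):
#   filter_textual_links < links.txt | wget -i - -P downloads/
```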

A breakdown of file extensions encountered in this downloaded subset:

763 .doc files
543 .pdf files
109 .docx files
75 .xls files
53 .xlsx files
25 .csv files
19 .txt files
14 .zip files
2 .rtf files
2 .nex files
1 .xml file
1 “.xltx” file
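A breakdown like the one above can be produced with a short pipeline. This is a sketch with a hypothetical function name, not necessarily how the counts above were made. It lower-cases extensions, so `.DOC` and `.doc` are counted together:

```shell
# Hypothetical helper: count files per extension under a directory,
# most common first.
count_extensions() {
    find "$1" -type f -name '*.*' |   # only files whose name contains a dot
        sed 's/.*\.//' |              # keep the text after the last dot
        tr 'A-Z' 'a-z' |              # normalise case
        sort | uniq -c | sort -rn |   # count and rank
        awk '{ print $1, "." $2 }'    # tidy output as "count .ext"
}

# Usage: count_extensions /path/to/downloads
```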

I then converted some of these unfriendly formats into simpler, more easily searchable plain text formats:

# Convert each format to plain text (or CSV) in a separate output directory
out=/home/ross/work/royal-soc-si/biol-letters-supp-info/transformed
for i in *.zip ; do unzip "$i" -d "$out/unzipped_$i" ; done
for i in *.docx ; do docx2txt "$i" "$out/$i.txt" ; done
for i in *.doc ; do catdoc -a "$i" > "$out/$i.txt" ; done
# pdftotext writes to a file rather than stdout, so name the output file explicitly
for i in *.pdf ; do pdftotext "$i" "$out/$i.txt" ; done
for i in *.rtf ; do unrtf --text "$i" > "$out/$i.txt" ; done
for i in *.xls *.xlsx ; do in2csv "$i" > "$out/$i.csv" ; done


Now everything is properly searchable and indexable!

In a matter of seconds I can find NHM specimen identifiers that might not otherwise be mentioned in the full text of the paper, without actually wasting any time manually reading any papers. Note, not all the ‘hits’ are true positives but most are, and those that aren’t e.g. “NHMQEVLEGYKKKYE” are easy to distinguish as NOT valid NHM specimen identifiers:

$ grep -ior 'nhm............'
20120949_ESM_1.txt:NHMUK R6792), N
20120949_ESM_1.txt:NHMUK R8646) in
20120949_ESM_1.txt:NHMUK R36615, ‘
20120949_ESM_1.txt:NHMUK R36620), 
20120949_ESM_1.txt:NHMUK R16586). 
20120949_ESM_1.txt:NHMUK R36620) a
20120949_ESM_1.txt:NHMUK R16586) a
20120949_ESM_1.txt:NHMUK R6856 was
20120949_ESM_1.txt:NHMUK Charig ar
20120949_ESM_1.txt:NHMUK R6856 and
20120949_ESM_1.txt:NHMUK R6856 wer
20120949_ESM_1.txt:NHMUK R6856 wer
20120949_ESM_1.txt:NHMUK R6856 and
20120949_ESM_1.txt:NHM R6856 just 
20120949_ESM_1.txt:NHM R6856 (figu
20120949_ESM_1.txt:NHMUK R6856 had
20120949_ESM_1.txt:NHMUK R3592) an
20120949_ESM_1.txt:NHMUK R6856. Th
20120949_ESM_1.txt:NHMUK R6856). M
20120949_ESM_1.txt:NHMUK with the 
20120949_ESM_1.txt:NHMUK R6856 is 
20120949_ESM_1.txt:NHMUK R6856 sug
20120949_ESM_1.txt:NHMUK R6856. Th
20120949_ESM_1.txt:NHMUK R6856 sug
20120949_ESM_1.txt:NHMUK R6856, bu
20120949_ESM_1.txt:NHMUK R6586 is 
20120949_ESM_1.txt:NHMUK R6586 als
20120949_ESM_1.txt:NHMUK R6586, we
20120949_ESM_1.txt:NHMUK R6586 can
20120949_ESM_1.txt:NHMUK R6586 was
20120949_ESM_1.txt:NHMUK R6586 may
20120949_ESM_1.txt:NHMUK R6856 are
20120949_ESM_1.txt:NHMUK R6856) av
20120949_ESM_1.txt:NHMUK R6795) in
20120949_ESM_1.txt:NHMUK R6795 is 
20120949_ESM_1.txt:NHMUK R6856 and
20120949_ESM_1.txt:NHMUK R6856 and
20120949_ESM_1.txt:NHMUK R6856 was
20120949_ESM_1.txt:NHMUK R6856 fal
20120949_ESM_1.txt:NHMUK R6856 is 
20120949_ESM_1.txt:NHMUK R6856 + S
20120949_ESM_1.txt:NHMUK R6856 whe
20120949_ESM_1.txt:NHMUK R6856 + S
20120949_ESM_1.txt:NHMUK 1, Tanzan
20120949_ESM_1.txt:NHMUK R6856 and
20120949_ESM_1.txt:NHMUK Charig ar
20120949_ESM_1.txt:NHMUK R6856 to 
20120949_ESM_1.txt:NHMUK) for perm
20120949_ESM_1.txt:NHMUK) for acce
20120949_ESM_1.txt:NHMUK Image Res
20120949_ESM_1.txt:NHMUK, The Natu
rsbl20060505supp.txt:NHM uncataloged
rsbl20060505supp.txt:NHM uncataloged
rsbl20070502supp01.doc.txt:NHM) provided v
rsbl20090302supp3.doc.txt:NHM = The Natur
rsbl20090302supp3.doc.txt:NHMW = Natural 
rsbl20090302supp3.doc.txt:NHM E32070	Plan
rsbl20090302supp3.doc.txt:NHM EE5034	Plan
rsbl20090302supp3.doc.txt:NHM E4381	Plank
rsbl20090302supp3.doc.txt:NHM E10384	Plan
rsbl20090302supp3.doc.txt:NHM EE4825	Plan
rsbl20090302supp3.doc.txt:NHM E8389	Plank
rsbl20090302supp3.doc.txt:NHM EE8132	Plan
rsbl20090302supp3.doc.txt:NHM EE5585	Non-
rsbl20090302supp3.doc.txt:NHM EE ?	Non-pl
rsbl20090302supp3.doc.txt:NHM EE1961	?	?	
rsbl20090302supp3.doc.txt:NHM E35551	Plan
rsbl20090302supp3.doc.txt:NHM E76539	?	Up
rsbl20090302supp3.doc.txt:NHM EE4055	Plan
rsbl20090302supp3.doc.txt:NHM E81494	Plan
rsbl20090302supp3.doc.txt:NHM EE4631	?	Ap
rsbl20090302supp3.doc.txt:NHM EE4632	?	Ap
rsbl20090302supp3.doc.txt:NHM EE4641	Plan
rsbl20090302supp3.doc.txt:NHM E20098	Plan
rsbl20090302supp3.doc.txt:NHM EE4404	Plan
rsbl20090302supp3.doc.txt:NHM EE8397	Plan
rsbl20090302supp3.doc.txt:NHM EE2372	?	Ma
rsbl20090302supp3.doc.txt:NHM E79718	Plan
rsbl20090302supp3.doc.txt:NHM E40574	Plan
rsbl20090302supp3.doc.txt:NHM EE4524	Plan
rsbl20090302supp3.doc.txt:NHM E79415	Non-
rsbl20090302supp3.doc.txt:NHM E45372	?	Tu
rsbl20090302supp3.doc.txt:NHM EE2321	Plan
rsbl20090302supp3.doc.txt:NHM EE2262	Plan
rsbl20090302supp3.doc.txt:NHM EE4610	Plan
rsbl20090302supp3.doc.txt:NHM E4052	Non-p
rsbl20090302supp3.doc.txt:NHM EE191	Plank
rsbl20090302supp3.doc.txt:NHM EE2353	Plan
rsbl20090302supp3.doc.txt:NHM E4034	Plank
rsbl20090302supp3.doc.txt:NHM EE2432	Plan
rsbl20090302supp3.doc.txt:NHM E4176	Plank
rsbl20090302supp3.doc.txt:NHM EE4048	?	Ma
rsbl20090302supp3.doc.txt:NHM E9892	Plank
rsbl20090302supp3.doc.txt:NHM E4979	?	Tur
rsbl20090302supp3.doc.txt:NHM E75821	Plan
rsbl20090302supp3.doc.txt:NHM E40974	?	Se
rsbl20090302supp3.doc.txt:NHM E79094	Plan
rsbl20090302supp3.doc.txt:NHM E582	Plankt
rsbl20090302supp3.doc.txt:NHMW 2005z0083/
rsbl20090302supp3.doc.txt:NHM E82582	?	U.
rsbl20090302supp3.doc.txt:NHM EE7698	Plan
rsbl20090302supp3.doc.txt:NHM E9392	Plank
rsbl20090302supp3.doc.txt:NHM E73207	?	Al
rsbl20090302supp3.doc.txt:NHM E43810	Plan
rsbl20090302supp3.doc.txt:NHM 56422	?	Apt
rsbl20090302supp3.doc.txt:NHM E83246	Plan
20120949_ESM_5.txt:NHMUK R6856) am
rsbl2011364supp1.doc.txt:NHM-72.666; MCZ
20120949_ESM_3.txt:NHMUK R6856). P
20120949_ESM_3.txt:NHMUK R6856) in
rsbl20090778supp1.doc.txt:NHM as a contro
rsbl20090139supp1.txt:NHM, The Natura
rsbl20090139supp1.txt:NHM R1034). As
20120949_ESM_2.txt:NHMUK R6856) in
20120949_ESM_2.txt:NHMUK R6856) in
rsbl20080409supp01.doc.txt:NHMW, Naturhist
rsbl20130021supp1.doc.txt:NHM, Staatliche
rsbl20130021supp1.doc.txt:NHMUK PV R498 a
rsbl20130021supp1.doc.txt:NHMUK PV OR3612
rsbl20130021supp1.doc.txt:NHMUK PV R3938 
rsbl20130021supp1.doc.txt:NHMUK PV R5465)
rsbl20130021supp1.doc.txt:NHMUK PV OR2003
rsbl20130021supp1.doc.txt:NHMUK PV R1158)
rsbl20130021supp1.doc.txt:NHMUK PV R5595)
rsbl20130021supp1.doc.txt:NHMUK PV R4086)
rsbl20130021supp1.doc.txt:NHMUK and GLAHM
rsbl20130021supp1.doc.txt:NHM); Sveltonec
rsbl20130021supp1.doc.txt:NHMUK PV R11185
rsbl20130021supp1.doc.txt:NHM1284-R); Mal
rsbl20130021supp1.doc.txt:NHMUK PV R6682)
rsbl20130021supp1.doc.txt:NHMUK PV R6682)
rsbl20130021supp1.doc.txt:NHMUK in 1959, 
rsbl20130021supp1.doc.txt:NHMUK. While th
rsbl20130021supp1.doc.txt:NHMUK PV R6682 
rsbl20130021supp1.doc.txt:NHMUK PV R6682)
rsbl20130021supp1.doc.txt:NHMUK PV R6682,
rsbl20130021supp1.doc.txt:NHMUK PV R6682,
rsbl20130021supp1.doc.txt:NHMUK PV R6682 
rsbl20130021supp1.doc.txt:NHMUK) for the 
20120949_ESM_4.txt:NHMUK R6856) wh
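
The false positives mentioned earlier could be reduced with a stricter pattern that insists on a catalogue-like number after the prefix. The sketch below is an illustration only: the function name is hypothetical and the pattern is an assumption, not a validated grammar for NHM identifiers:

```shell
# Hypothetical helper: keep only hits where the NHM/NHMUK prefix is followed
# by a separator, optional short letter codes, and a number.
strict_nhm_hits() {
    # Reads the named files, or stdin when no files are given
    grep -oE 'NHM(UK)?[ -]([A-Z]{1,2} )?[A-Z]{0,2}[0-9]+' "$@"
}

# Usage: strict_nhm_hits *.txt
```

A string such as “NHMQEVLEGYKKKYE” has no separator or digits after the prefix, so it is dropped, as are mentions like “NHMUK Charig”.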


Perhaps this approach might be useful to the PREDICTS / LPI teams, looking for species occurrence data sets?

I don’t know why figshare doesn’t do deep indexing by default – it’d be really useful to search the morass of published supplementary data that’s out there!