Show me the data!

In this post I’ll go through an illustrated example of what I plan to do with my text mining project: linking-up biological specimens from the Natural History Museum, London (sometimes known as BMNH or NHMUK) to the published research literature with persistent identifiers.

I’ve run some simple grep searches of the PMC open access subset already, and PLOS ONE make up a significant portion of the ‘hits’, unsurprisingly.

Below is a visual representation of the BMNH specimen ‘hits’ I found in the full text of one PLOS ONE paper:

Grohé C, Morlo M, Chaimanee Y, Blondel C, Coster P, et al. (2012) New Apterodontinae (Hyaenodontida) from the Eocene Locality of Dur At-Talah (Libya): Systematic, Paleoecological and Phylogenetical Implications. PLoS ONE 7(11): e49054. doi: 10.1371/journal.pone.0049054


I used the open source software Gephi, and the Yifan Hu layout to create the above graphical representation. The node marked in blue is the paper. Nodes marked in red are catalogue numbers I couldn’t find in the NHM Open Data Portal specimen collections dataset: 10 out of 34 not found.

Source data table below showing how uninformative the NHM persistent IDs are. I would have plotted them on the graph instead of the catalogue strings as that would be technically more correct (they are the unique IDs), but it would look horrible.


I’ve been failing to find a lot of well known entities in the online specimen collections dataset which makes me rather concerned about its completeness. High profile specimens such as Lesothosaurus “BMNH RUB 17” (as mentioned in this PLOS ONE paper, Table 1) can’t be found online via the portal under that catalogue number. I can however find RUB 16, RUB 52 and RUB 54 but these are probably different specimens. RUB 17 is mentioned in a great many papers by many different authors so it seems unlikely that they have all independently given the specimen an incorrect catalogue number – the problem is more likely to be in the completeness of the online dataset.

Another ‘missing’ example is “BMNH R4947” a specimen of Euoplocephalus tutus as referred to in Table 4 of this PLOS ONE paper by Arbour and Currie. There are two other records for that taxon, but not under R4947.

To end on a happier note, I can definitely answer one question conclusively:
What is the most ‘popular’ NHM specimen in PLOS ONE full text?

…it’s “BMNH 37001”, Archaeopteryx lithographica which is referred to in full text by four different papers (see below for details).

I have feeling many more NHM specimens are hiding out in separate supplementary materials files. Mining these will be hard unless figshare gets their act together and creates a full-text API for searching their collection – I believe it’s a metadata only API at the moment.

37001 in PLOS ONE papers


I’ve purposefully made very simple graphs so far. Once I get more data, I can start linking it up to create beautiful and complex graphs like the one below (of the taxa shared between 3000 microbial phylogenetic studies in IJSEM, unpublished), which I’m still trying to get my head around. The linked open data work continues…

Bacteria subutilis commonly used


Now I’m at the Natural History Museum, London I’ve started a new and ambitious text-mining project: to find, extract, publish, and link-up all mentions of NHM, London specimens published in the recent research literature (born digital, published post-2000).

Rod Page is already blazing a trail in this area with older BHL literature. See: Linking specimen codes to GBIF & Design Notes on Modelling Links for recent, relevant posts. But there’s still lots to be done I think, so here’s my modest effort.



It’s important to demonstrate the value of biological specimen collections. A lot of money is spent cataloguing, curating and keeping safe these specimens. It would be extremely useful to show that these specimens are being used, at scale, in real, recent research — it’s not just irrelevant stamp collecting.

Sometimes the NHM, London specimen catalogue has incorrect, incomplete or outdated data about it’s own specimens – there is better, newer data about them in the published literature that needs to be fed back to the museum.

An example: specimen “BMNH 2013.2.13.3” is listed in the online catalogue on the NHM open data portal as Petrochromis nov. sp. By searching the literature for BMNH specimens, I happened to find where the new species of this specimen was described: as Petrochromis horii Takahashi & Koblmüller, 2014. It’s also worth noting this specimen has associated nucleotide sequence data on GenBank here: .

Having talked a lot about the 5 stars of open data in the context of research data recently, I wonder… wouldn’t it be really useful to make 4 or 5 star linked open data around biological specimens? From Rod Page, I gather this is part of the grand goal of creating a biodiversity knowledge graph.

For this project, I will be focussing on linking BMNH (NHM, London) specimen identifiers with publication identifiers (e.g. DOIs) and GenBank accession numbers.


What questions to ask?

Where have NHM, London specimens been used/published? What are the most used NHM, London specimens in research? How does NHM, London specimen usage compare to other major museums such as the AMNH (New York) or MNHN (Paris).

Materials for Mining

1.) The PubMedCentral Open Access subset – a million papers, but mainly biomedical research.
2.) Open Access & free access journals that not included in PMC
3.) figshare – particularly useful if nothing else, as a means of mining PLOS ONE supplementary materials (I read recently that essentially 90% of figshare is actually PLOS ONE supp. material! See Table 2)
4.) select subscription access journals – annoyingly hard to get access to in bulk, but important to include as sadly much natural history research is still published behind paywalls.


(very) Preliminary Results

The PMC OA subset is fantastic & really facilitates this kind of research – I wish ALL of the biodiversity literature was aggregated like (some) of the open access biomedical literature is. You can literally just download a million papers, click, and go do your research. It facilitates rigorous research by allowing full machine access to full texts.

Simple grep searches for ‘NHMUK’ & ‘BMNH [A-Z0-9][0-9]’, two of the commonest citation forms by which specimens may be cited reveal many thousands of possible specimen mentions in the PMC OA subset I must now look through to clean-up & link-up. In terms of journals, these ‘hits’ in the PMC OA subset come from (in no particular order): PLOS ONE, Parasites & Vectors, PeerJ, ZooKeys, Toxins, Zoo J Linn Soc, Parasite, Frontiers in Zoology, Ecology & Evolution, BMC Research Notes, Biology Letters, BMC Evolutionary Biology, Aquatic Biosystems, BMC Biology, Molecular Ecology, Journal of Insect Science, Nucleic Acids Research and more…!

specimen “BMNH” is a great example to lookup / link-up on the NHM Open Data Portal: the catalogue record has 7 associated images openly available under CC BY, so I can liven up this post by including an image of the specimen (below)! I found this specimen used in a PLOS ONE paper: Walmsley et al. (2013) Why the Long Face? The Mechanics of Mandibular Symphysis Proportions in Crocodiles. doi: 10.1371/journal.pone.0053873 (in the text caption for figure 1 to be precise).

© The Trustees of the Natural History Museum, London. Licensed for reuse under CC BY 4.0. Source.



Questions Arising

How to find and extract mentions of NHM, London specimens in papers published in Science, Nature & PNAS ? There are sure to be many! I’m assuming the last 15 years worth of research published in these journals will be difficult to scrape – they would be quite likely to block my IP address if I tried to. Furthermore, all the actual science is typically buried in supplementary file PDFs in these journals not in the ‘main’ short article. Will Science, Nature & PNAS  let me download all their supp material from the last 15 years? Is this facilitated at all? How do people actually do rigorous research when the contents of supplementary data files published in these journals are so undiscoverable & inaccessible to search?


It’s clear to me there are many separate divisions when it comes to discoverability of research. There’s the divide between open access (highly discoverable & searchable) and subscription access (less discoverable, less searchable, depending upon publisher-restrictions). There’s also the divide between the ‘paper’ (more searchable) and ‘supplementary materials’ (less easily searchable). Finally, there’s also the divide between textual and non-textual media: a huge amount of knowledge in the scientific literature is trapped in non-textual forms such as figure images which simply aren’t instantly searchable by textual methods (figure captions DO NOT contain all of the information of the figure image! Also, OCR is time consuming and error-prone especially on the heterogeneity of fonts and orientation of words in most figures). For example, looking across thousands of papers with phylogenetic analyses published in the journal IJSEM, 95% of the taxa / GenBank accessions used in them are only mentioned in the figure image, nowhere else in the paper or supplementary materials as text! This needs to change.


As should be obvious by now; this is a very preliminary post, just to let people know what I’m doing and what I’m thinking. In my next post I’ll detail some of the subscription access journals I’ve been text mining for specimens, and the barriers I’ve encountered when trying to do so.


Bonus question: How should I publish this annotation data?

Easiest would be to release all annotations as a .csv on the NHM open data portal with 3 columns where each column mimics ‘subject’  ‘predicate’ ‘object’ notation: Specimen, “is mentioned in”, Article DOI.

But if I wanted to publish something a little better & a little more formal, what kind of RDF vocabulary can I use to describe “occurs in” or “is mentioned in”. What would be the most useful format to publish this data in so that it can be re-used and extended to become part of the biodiversity knowledge graph and have lasting value?

Making a journal scraper

May 13th, 2015 | Posted by rmounce in Content Mining - (8 Comments)

Yesterday, I made a journal scraper for the International Journal of Systematic and Evolutionary Microbiology (IJSEM).

Fortunately, Richard Smith-Unna and the ContentMine team have done most of the hard work in creating the general framework with quickscrape (open-source and available on github), I just had to modify the available journal-scrapers to work with IJSEM.

How did I do it?

Find an open access article in the target journal e..g James et al (2015) Kazachstania yasuniensis sp. nov., an ascomycetous yeast species found in mainland Ecuador and on the Galápagos

In your browser, view the HTML source of the full text page, in the Chrome/Chromium browser the keyboard shortcut to do this is Ctrl-U. You should then see something like this, perhaps with less funky highlighting colours:

I based my IJSEM scraper on the existing set of scraper definitions for eLife because I know both journals use similar underlying technology to create their webpages.

The first bit I clearly had to modify was the extraction of publisher. In the eLife scraper this works:

but at IJSEM that information isn’t specified with ‘citation_publisher’, instead it’s tagged as ‘DC.Publisher’ so I modified the element to reflect that:

The license and copyright information extraction is even more different between eLife and IJSEM, here’s the correct scraper for the former:

and here’s how I changed it to extract that information from IJSEM pages:

The XPath needed is completely different. The information is inside a div, not a meta tag.


Hardest of all though were the full size figures and the supplementary materials files – they’re not directly linked from the full text HTML page which is rather annoying. Richard had to help me out with these by creating “followables”:

In his words:

any element can ‘follow’ any other element in the elements array, just by adding the key-value pair "follow": "element_name" to the element that does the following. If you want to follow an element, but don’t want the followed element to be included in the results, you add it to a followables array instead of the elements array. The followed array must capture a URL.



The bottom-line is, it might look complicated initially, but actually it’s not that hard to write a fully-functioning  journal scraper definition, for use with quickscrape. I’m off to go and create one for Taylor & Francis journals now :)


Wouldn’t it be nice if all scholarly journals presented their content on the web in the same way, so we didn’t have to write a thousand different scrapers to download it? That’d be just too helpful wouldn’t it?



April Clyburne-Sherin asked an interesting question on the OpenCon Discussion List recently:

I am an author on a manuscript that my lab wants to publish in a subscription journal that normally retains the copyright. The manuscript is a desirable one so they are “willing” (haha) to provide it “open access” (that was my stipulation to my lab when they started speaking with the publisher). My lab is happy with this, but I do not trust the publisher and want to be able to negotiate a publishing agreement that guarantees:
  • We retain the copyright;
  • The article will be open access forever and no version will be behind a paywall at their journal ever;
  • That there are no sign-ins, registrations, DRM viewing issues, or other ‘free” obstacles to viewing the article.

Comment: Quite rightly, April does not trust the publisher to make the published work fully open access in perpetuity, and wants to do more as an author, with the publishing agreement (a formal contract) to ensure that the publisher will actually provide the exact services she wants.

Recent events this year, whereby Elsevier, Wiley and Springer have all been caught red-handed selling access to hybrid open access articles justifies this lack of trust. It’s a sad state of affairs that authors such as April & myself no longer trust some service providers to actually provide the services we pay them for (e.g. Open Access).

Some helpful links & pointers have been provided on the discussion list, and this may be a concern many other scholarly authors have so it’s valuable to collate, discuss and publicise possible solutions to the thorny problem of publishing agreements with legacy publishers. I certainly don’t pretend to have all the answers here and I think organisations like SPARC might want to act on this one.

Lorraine Chuen links to the Canadian Association of Research Libraries (CARL) ‘Resources for Authors’ page which amongst other things discusses the Canadian SPARC Author Addendum. I knew about the US SPARC Author Addendum, but I never knew there was a Canadian version too!

Matt Menzenski links to the University of Kansas Authors & Copyright page. I particularly like An Introduction to Publication Agreements for Authors (Armstrong, 2009) that they link to at the very top – it’s really useful information.

My Suggested Solutions

For my part, I chipped-in with four different ways that in their own way either partially or wholly fulfil some or all of the criteria April is looking for:

1.) Wait for them to send you their proposed publishing agreement & change the terms to ones you find agreeable

If they send you a standard CTA (Copyright Transfer Agreement) form as PDF, you can modify the wording of that PDF to terms you prefer and send it back to them and they probably won’t even notice as long as it’s signed & doesn’t look too different. It’s cheeky, but I got away with it for a book chapter once. Be careful to remove / replace the term ‘work for hire’ – it may look like an innocuous statement but apparently this is fairly key in legal terms – I neglected to remove that from my book chapter agreement.


2.) Transferring away your copyright away to another person
Not as easy perhaps for multi-author papers but Mike Taylor has a good (successful-ish) anecdote about transferring his copyright to his spouse, thereby preventing the Geological Society from taking the copyright of the work.


3.) Claim that one of the authors is a US federal government employee
Use Section 105 of the US Copyright Act by pretending that at least one of the authors is an employee of the US Government. Works of the U.S. federal government cannot be copyrighted by their authors in the US – they must be public domain, which is in practice achieved by applying the Creative Commons Zero waiver to the paper. The CTA form may contain a check box asking about this. If not, just email them about it. Michael Eisen famously, successfully liberated a NASA space research paper from behind a paywall at Science (AAAS), using Section 105 as justification.
Will publishers really bother fact-checking your assertion about the employment of one of the authors? I don’t think so. It could land them in big trouble if they dare disregard the US Copyright Act.


4.) Simply do not sign, or do not return the unfavourable publishing agreement
Another risky approach is simply not to sign or not to return the CTA the publisher sends you after acceptance (with the obvious risk that this could delay publication). I think this is perhaps the most promising approach, there is strong evidence that many academics currently employ this practice. When you think about it: publishers actually need our papers or they’ll go bust. They need a constant stream of content to justify their existence. If you don’t sign-off on their stipulated terms and conditions, after acceptance, they do have real pressures to get on and publish the paper anyway, especially with the increased focus on optimising submission to publication times these days.


I’ll let Reinhard Diestel (mathematician, University of Hamburg) have the last word on this post, it’s a solution I’m keenly interested in trying myself:
I stopped signing away my copyright on journal papers in the late 1990s. Interestingly, almost all publishers reacted either positively or not at all when I did not return the copyright form signed as requested: in all cases did they print the paper in question, usually without additional delay, and sometimes with unexpected understanding and support. (Yes, there have been one or two cases where things were a little more difficult at first, but these too were resolved amicably in the end.)” —