Show me the data!

To prove my point about the way that supplementary data files bury useful data, making it utterly undiscoverable to most, I decided to do a little experiment (in relation to text mining for museum specimen identifiers, but also perhaps with some relevance to the NHM Conservation Hackathon):

I collected the links for all Biology Letters supplementary data files. I then filtered out the non-textual media such as audio, video and image files, then downloaded the remaining content.

A breakdown of file extensions encountered in this downloaded subset:

763 .doc files
543 .pdf files
109 .docx files
75 .xls files
53 .xlsx files
25 .csv files
19 .txt files
14 .zip files
2 .rtf files
2 .nex files
1 .xml file
1 “.xltx” file
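
A breakdown like the one above can be produced with a few lines of Python. This is only a minimal sketch: the file names are made up for illustration, and the post does not say how the original tally was made.

```python
from collections import Counter
from pathlib import Path


def count_extensions(filenames):
    """Tally lower-cased file extensions, most common first."""
    counts = Counter(Path(name).suffix.lower() or "(none)" for name in filenames)
    return counts.most_common()


# Hypothetical file names, for illustration only.
example = ["supp1.doc", "supp2.PDF", "data.csv", "notes.doc"]
for ext, n in count_extensions(example):
    print(f"{n} {ext} files")
```

Lower-casing the extension means that ".PDF" and ".pdf" are counted together.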

I then converted some of these unfriendly formats into simpler, more easily searchable plain text formats:


Now everything is properly searchable and indexable!
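
As a rough illustration of the conversion step, here is a sketch that pulls plain text out of a .docx file using only the Python standard library. It is an assumption on my part, not a record of the tools actually used; legacy .doc and .pdf files need external converters, which are not shown here.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements inside word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the paragraph text of a .docx file, one paragraph per line."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Each run of text lives in a <w:t> element inside the paragraph.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        paragraphs.append(text)
    return "\n".join(paragraphs)
```

The function accepts a file path or an open binary file object, so it can be run over a whole directory of downloads and the output written to .txt files for searching.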

In a matter of seconds I can find NHM specimen identifiers that might not otherwise be mentioned in the full text of the paper, without actually wasting any time manually reading any papers. Note, not all the ‘hits’ are true positives but most are, and those that aren’t e.g. “NHMQEVLEGYKKKYE” are easy to distinguish as NOT valid NHM specimen identifiers:
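
One simple way to separate hits like “NHMQEVLEGYKKKYE” from real identifiers is sketched below. The rule (a museum prefix must be followed by at least one digit) is my own assumed heuristic for illustration; it will not catch every false positive.

```python
import re

# Assumption: real identifiers carry a number after the museum prefix,
# e.g. "BMNH 37001" or "NHMUK R3592"; letter-only strings are suspect.
PREFIX = re.compile(r"^(?:BMNH|NHMUK|NHM)")


def classify_hit(hit):
    """Label a raw search hit as 'plausible' or 'unlikely'."""
    tail = PREFIX.sub("", hit.strip())
    if any(ch.isdigit() for ch in tail):
        return "plausible"
    return "unlikely"


for hit in ["BMNH 37001", "NHMUK R3592", "NHMQEVLEGYKKKYE"]:
    print(hit, "->", classify_hit(hit))
```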


Perhaps this approach might be useful to the PREDICTS / LPI teams, looking for species occurrence data sets?

I don’t know why figshare doesn’t do deep indexing by default – it’d be really useful to search the morass of published supplementary data that’s out there!

Progress on specimen mining

June 14th, 2015 | Posted by rmounce in Content Mining - (0 Comments)

I’ve been on holiday to Japan recently, so work came to a halt on this for a while but I think I’ve largely ‘done’ PLOS ONE full text now (excluding supplementary materials).

My results are on github: one prettier file without the exact provenance or in-sentence context of each putative specimen entity, and one more extensive file with provenance & context included which unfortunately github can’t render/preview.


Some summary stats:

I found 427 unique BMNH/NHMUK specimen mentions from a total of just 69 unique PLOS ONE papers. That low paper count strongly suggests to me that there are a lot of ‘hidden’ specimen identifiers in difficult-to-search supplementary materials files.

I found 497 specimen mentions if you include instances where the same BMNH/NHMUK specimen is mentioned in different PLOS ONE papers.

Finding putative specimen entities in PLOS ONE full text is relatively automatic and easy. The time-consuming manual part is accurately matching them up with official NHM collection specimens data.

I could only confidently link up 314 of the 497 detected mentions to their corresponding unique IDs / URLs in the NHM Open Data Portal Collection Specimens dataset. The remaining 183 (just over one third) can’t be confidently matched up to a unique specimen in the online specimen collection dataset. I suspect this is mainly down to absence/incompleteness in the online collections data, although a small few are likely typos in PLOS ONE papers.

In my last post I was confident that the BM Archaeopteryx specimen would be the most frequently mentioned specimen but with more extensive data collection and analysis that appears now not to be true! NHMUK R3592 (a specimen of Erythrosuchus africanus) is mentioned in 5 different PLOS ONE papers. Pleasingly, Google Scholar also finds only five PLOS ONE papers mentioning this specimen – independent confirmation of my methodology.
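
Summary figures of this kind fall out of a list of (specimen, paper) pairs. The sketch below assumes that data shape and uses placeholder DOIs; it is not the actual analysis code.

```python
from collections import defaultdict


def summarise(mentions):
    """Summarise (specimen, doi) pairs; repeats within one paper count once."""
    papers_by_specimen = defaultdict(set)
    for specimen, doi in mentions:
        papers_by_specimen[specimen].add(doi)
    papers = {doi for dois in papers_by_specimen.values() for doi in dois}
    top = max(papers_by_specimen,
              key=lambda s: len(papers_by_specimen[s]),
              default=None)
    return {
        "unique_specimens": len(papers_by_specimen),
        "unique_papers": len(papers),
        "specimen_paper_mentions": sum(len(d) for d in papers_by_specimen.values()),
        "most_mentioned": top,
    }


# Placeholder DOIs, for illustration only.
pairs = [("NHMUK R3592", "doi-A"), ("NHMUK R3592", "doi-B"), ("BMNH 37001", "doi-A")]
print(summarise(pairs))
```

Here "unique_specimens" corresponds to the first figure above and "specimen_paper_mentions" to the second, where the same specimen appearing in different papers is counted once per paper.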

One of the BM specimens of Erythrosuchus is referred to more often in PLOS ONE than the BM Archaeopteryx specimen

Now I have these two ‘atomic’ identifiers linked-up (NHM specimen collections occurrence ID + the Digital Object Identifier of the research article in which it appears), I can if desired, find out a whole wealth of information about these specimens and the papers they are mentioned in.

My next steps will be to extend this search to all of the PubMedCentral OA subset, not just PLOS ONE.


In this post I’ll go through an illustrated example of what I plan to do with my text mining project: linking-up biological specimens from the Natural History Museum, London (sometimes known as BMNH or NHMUK) to the published research literature with persistent identifiers.

I’ve run some simple grep searches of the PMC open access subset already, and PLOS ONE make up a significant portion of the ‘hits’, unsurprisingly.

Below is a visual representation of the BMNH specimen ‘hits’ I found in the full text of one PLOS ONE paper:

Grohé C, Morlo M, Chaimanee Y, Blondel C, Coster P, et al. (2012) New Apterodontinae (Hyaenodontida) from the Eocene Locality of Dur At-Talah (Libya): Systematic, Paleoecological and Phylogenetical Implications. PLoS ONE 7(11): e49054. doi: 10.1371/journal.pone.0049054


I used the open source software Gephi, and the Yifan Hu layout to create the above graphical representation. The node marked in blue is the paper. Nodes marked in red are catalogue numbers I couldn’t find in the NHM Open Data Portal specimen collections dataset: 10 out of 34 not found.
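
For anyone wanting to build a similar graph, Gephi can import an edge list as a CSV file with "Source" and "Target" columns. The sketch below writes one edge per paper-to-specimen link; the catalogue numbers are placeholders, not the ones from the paper above.

```python
import csv
import io


def write_edge_list(paper_doi, catalogue_numbers, handle):
    """Write one Source,Target row per specimen mentioned in the paper."""
    writer = csv.writer(handle)
    writer.writerow(["Source", "Target"])
    for number in catalogue_numbers:
        writer.writerow([paper_doi, number])


# Placeholder catalogue numbers, for illustration only.
buffer = io.StringIO()
write_edge_list("10.1371/journal.pone.0049054", ["BMNH X1", "BMNH X2"], buffer)
print(buffer.getvalue())
```

In practice the handle would be a file opened with newline="" so the csv module controls the line endings.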

Source data table below showing how uninformative the NHM persistent IDs are. I would have plotted them on the graph instead of the catalogue strings as that would be technically more correct (they are the unique IDs), but it would look horrible.


I’ve been failing to find a lot of well known entities in the online specimen collections dataset which makes me rather concerned about its completeness. High profile specimens such as Lesothosaurus “BMNH RUB 17” (as mentioned in this PLOS ONE paper, Table 1) can’t be found online via the portal under that catalogue number. I can however find RUB 16, RUB 52 and RUB 54 but these are probably different specimens. RUB 17 is mentioned in a great many papers by many different authors so it seems unlikely that they have all independently given the specimen an incorrect catalogue number – the problem is more likely to be in the completeness of the online dataset.

Another ‘missing’ example is “BMNH R4947” a specimen of Euoplocephalus tutus as referred to in Table 4 of this PLOS ONE paper by Arbour and Currie. There are two other records for that taxon, but not under R4947.

To end on a happier note, I can definitely answer one question conclusively:
What is the most ‘popular’ NHM specimen in PLOS ONE full text?

…it’s “BMNH 37001”, Archaeopteryx lithographica which is referred to in full text by four different papers (see below for details).

I have a feeling many more NHM specimens are hiding out in separate supplementary materials files. Mining these will be hard unless figshare gets their act together and creates a full-text API for searching their collection – I believe it’s a metadata only API at the moment.

37001 in PLOS ONE papers


I’ve purposefully made very simple graphs so far. Once I get more data, I can start linking it up to create beautiful and complex graphs like the one below (of the taxa shared between 3000 microbial phylogenetic studies in IJSEM, unpublished), which I’m still trying to get my head around. The linked open data work continues…

Bacillus subtilis commonly used


Now I’m at the Natural History Museum, London I’ve started a new and ambitious text-mining project: to find, extract, publish, and link-up all mentions of NHM, London specimens published in the recent research literature (born digital, published post-2000).

Rod Page is already blazing a trail in this area with older BHL literature. See: Linking specimen codes to GBIF & Design Notes on Modelling Links for recent, relevant posts. But there’s still lots to be done I think, so here’s my modest effort.



It’s important to demonstrate the value of biological specimen collections. A lot of money is spent cataloguing, curating and keeping safe these specimens. It would be extremely useful to show that these specimens are being used, at scale, in real, recent research — it’s not just irrelevant stamp collecting.

Sometimes the NHM, London specimen catalogue has incorrect, incomplete or outdated data about its own specimens – there is better, newer data about them in the published literature that needs to be fed back to the museum.

An example: specimen “BMNH 2013.2.13.3” is listed in the online catalogue on the NHM open data portal as Petrochromis nov. sp. By searching the literature for BMNH specimens, I happened to find where the new species of this specimen was described: as Petrochromis horii Takahashi & Koblmüller, 2014. It’s also worth noting this specimen has associated nucleotide sequence data on GenBank here: .

Having talked a lot about the 5 stars of open data in the context of research data recently, I wonder… wouldn’t it be really useful to make 4 or 5 star linked open data around biological specimens? From Rod Page, I gather this is part of the grand goal of creating a biodiversity knowledge graph.

For this project, I will be focussing on linking BMNH (NHM, London) specimen identifiers with publication identifiers (e.g. DOIs) and GenBank accession numbers.


What questions to ask?

Where have NHM, London specimens been used/published? What are the most used NHM, London specimens in research? How does NHM, London specimen usage compare to other major museums such as the AMNH (New York) or MNHN (Paris)?

Materials for Mining

1.) The PubMedCentral Open Access subset – a million papers, but mainly biomedical research.
2.) Open Access & free access journals that are not included in PMC
3.) figshare – particularly useful if nothing else, as a means of mining PLOS ONE supplementary materials (I read recently that essentially 90% of figshare is actually PLOS ONE supp. material! See Table 2)
4.) select subscription access journals – annoyingly hard to get access to in bulk, but important to include as sadly much natural history research is still published behind paywalls.


(very) Preliminary Results

The PMC OA subset is fantastic & really facilitates this kind of research – I wish ALL of the biodiversity literature was aggregated like (some) of the open access biomedical literature is. You can literally just download a million papers, click, and go do your research. It facilitates rigorous research by allowing full machine access to full texts.

Simple grep searches for ‘NHMUK’ & ‘BMNH [A-Z0-9][0-9]’, two of the commonest forms by which specimens are cited, reveal many thousands of possible specimen mentions in the PMC OA subset that I must now look through to clean-up & link-up. In terms of journals, these ‘hits’ in the PMC OA subset come from (in no particular order): PLOS ONE, Parasites & Vectors, PeerJ, ZooKeys, Toxins, Zoo J Linn Soc, Parasite, Frontiers in Zoology, Ecology & Evolution, BMC Research Notes, Biology Letters, BMC Evolutionary Biology, Aquatic Biosystems, BMC Biology, Molecular Ecology, Journal of Insect Science, Nucleic Acids Research and more…!
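
The same two patterns can be expressed with Python's re module, which makes it easy to keep a little surrounding context for each hit. This is a sketch of an equivalent search, not the grep commands themselves; the function name and the context width are my own choices.

```python
import re

# Python equivalents of the two grep patterns quoted above.
PATTERNS = [re.compile(r"NHMUK"), re.compile(r"BMNH [A-Z0-9][0-9]")]


def find_hits(text, context=30):
    """Return each match with up to `context` characters either side."""
    hits = []
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            start = max(match.start() - context, 0)
            end = min(match.end() + context, len(text))
            hits.append(text[start:end])
    return hits


sample = "The holotype BMNH R4947 was compared with NHMUK R3592."
for hit in find_hits(sample):
    print(hit)
```

Note that the second pattern only anchors on the first two characters after "BMNH ", so the full catalogue number has to be read from the context and cleaned up afterwards.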

specimen “BMNH” is a great example to lookup / link-up on the NHM Open Data Portal: the catalogue record has 7 associated images openly available under CC BY, so I can liven up this post by including an image of the specimen (below)! I found this specimen used in a PLOS ONE paper: Walmsley et al. (2013) Why the Long Face? The Mechanics of Mandibular Symphysis Proportions in Crocodiles. doi: 10.1371/journal.pone.0053873 (in the text caption for figure 1 to be precise).

© The Trustees of the Natural History Museum, London. Licensed for reuse under CC BY 4.0. Source.



Questions Arising

How to find and extract mentions of NHM, London specimens in papers published in Science, Nature & PNAS? There are sure to be many! I’m assuming the last 15 years’ worth of research published in these journals will be difficult to scrape – they would be quite likely to block my IP address if I tried to. Furthermore, all the actual science is typically buried in supplementary file PDFs in these journals, not in the ‘main’ short article. Will Science, Nature & PNAS let me download all their supp material from the last 15 years? Is this facilitated at all? How do people actually do rigorous research when the contents of supplementary data files published in these journals are so undiscoverable & inaccessible to search?


It’s clear to me there are many separate divisions when it comes to discoverability of research. There’s the divide between open access (highly discoverable & searchable) and subscription access (less discoverable, less searchable, depending upon publisher-restrictions). There’s also the divide between the ‘paper’ (more searchable) and ‘supplementary materials’ (less easily searchable). Finally, there’s also the divide between textual and non-textual media: a huge amount of knowledge in the scientific literature is trapped in non-textual forms such as figure images which simply aren’t instantly searchable by textual methods (figure captions DO NOT contain all of the information of the figure image! Also, OCR is time consuming and error-prone especially on the heterogeneity of fonts and orientation of words in most figures). For example, looking across thousands of papers with phylogenetic analyses published in the journal IJSEM, 95% of the taxa / GenBank accessions used in them are only mentioned in the figure image, nowhere else in the paper or supplementary materials as text! This needs to change.


As should be obvious by now, this is a very preliminary post, just to let people know what I’m doing and what I’m thinking. In my next post I’ll detail some of the subscription access journals I’ve been text mining for specimens, and the barriers I’ve encountered when trying to do so.


Bonus question: How should I publish this annotation data?

Easiest would be to release all annotations as a .csv on the NHM open data portal with 3 columns where each column mimics ‘subject’  ‘predicate’ ‘object’ notation: Specimen, “is mentioned in”, Article DOI.
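
A minimal sketch of that three-column layout, using Python's csv module. The specimen and DOI values are placeholders, and the column names are my own labels for the three positions.

```python
import csv
import io


def write_triples(pairs, handle):
    """Write specimen / "is mentioned in" / DOI rows as a 3-column CSV."""
    writer = csv.writer(handle)
    writer.writerow(["subject", "predicate", "object"])
    for specimen, doi in pairs:
        writer.writerow([specimen, "is mentioned in", doi])


# Placeholder values, for illustration only.
buffer = io.StringIO()
write_triples([("BMNH EXAMPLE-1", "10.9999/example.doi")], buffer)
print(buffer.getvalue())
```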

But if I wanted to publish something a little better & a little more formal, what kind of RDF vocabulary can I use to describe “occurs in” or “is mentioned in”? What would be the most useful format to publish this data in so that it can be re-used and extended to become part of the biodiversity knowledge graph and have lasting value?
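
To make the question concrete, here is a sketch that serialises the same pairs as N-Triples. The predicate shown, Dublin Core's dcterms:isReferencedBy, is only one candidate that I am assuming for illustration, not a settled answer, and the specimen URI is a placeholder on example.org.

```python
# One candidate predicate (an assumption, not a recommendation from the post).
DCTERMS_IS_REFERENCED_BY = "http://purl.org/dc/terms/isReferencedBy"


def to_ntriples(pairs):
    """Serialise (specimen_uri, doi) pairs as N-Triples lines."""
    lines = []
    for specimen_uri, doi in pairs:
        lines.append(
            f"<{specimen_uri}> <{DCTERMS_IS_REFERENCED_BY}> <https://doi.org/{doi}> ."
        )
    return "\n".join(lines)


# Placeholder specimen URI and DOI, for illustration only.
print(to_ntriples([("http://example.org/specimen/1", "10.9999/example.doi")]))
```

Swapping in a different vocabulary only means changing the predicate constant, so the choice can be deferred until the community has a view.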

Making a journal scraper

May 13th, 2015 | Posted by rmounce in Content Mining - (5 Comments)

Yesterday, I made a journal scraper for the International Journal of Systematic and Evolutionary Microbiology (IJSEM).

Fortunately, Richard Smith-Unna and the ContentMine team have done most of the hard work in creating the general framework with quickscrape (open-source and available on github), I just had to modify the available journal-scrapers to work with IJSEM.

How did I do it?

Find an open access article in the target journal, e.g. James et al (2015) Kazachstania yasuniensis sp. nov., an ascomycetous yeast species found in mainland Ecuador and on the Galápagos

In your browser, view the HTML source of the full text page, in the Chrome/Chromium browser the keyboard shortcut to do this is Ctrl-U. You should then see something like this, perhaps with less funky highlighting colours:

I based my IJSEM scraper on the existing set of scraper definitions for eLife because I know both journals use similar underlying technology to create their webpages.

The first bit I clearly had to modify was the extraction of publisher. In the eLife scraper this works:

but at IJSEM that information isn’t specified with ‘citation_publisher’, instead it’s tagged as ‘DC.Publisher’ so I modified the element to reflect that:

The license and copyright information extraction is even more different between eLife and IJSEM, here’s the correct scraper for the former:

and here’s how I changed it to extract that information from IJSEM pages:

The XPath needed is completely different. The information is inside a div, not a meta tag.
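
To illustrate what the meta tag difference amounts to, here is a sketch in plain Python that looks for the publisher under either name. It is not quickscrape's own definition format, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name, content = attrs.get("name"), attrs.get("content")
        if name and content is not None:
            # Keep the first value seen for each meta name.
            self.meta.setdefault(name, content)


def extract_publisher(html, names=("citation_publisher", "DC.Publisher")):
    """Return the first publisher found under any of the given meta names."""
    collector = MetaCollector()
    collector.feed(html)
    for name in names:
        if name in collector.meta:
            return collector.meta[name]
    return None


page = '<head><meta name="DC.Publisher" content="Example Publisher" /></head>'
print(extract_publisher(page))
```

Information held inside a div rather than a meta tag, as with the IJSEM licence text, needs a different selector altogether, which is why that part of the scraper had to be rewritten rather than renamed.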


Hardest of all though were the full size figures and the supplementary materials files – they’re not directly linked from the full text HTML page which is rather annoying. Richard had to help me out with these by creating “followables”:

In his words:

any element can ‘follow’ any other element in the elements array, just by adding the key-value pair "follow": "element_name" to the element that does the following. If you want to follow an element, but don’t want the followed element to be included in the results, you add it to a followables array instead of the elements array. The followed array must capture a URL.



The bottom-line is, it might look complicated initially, but actually it’s not that hard to write a fully-functioning  journal scraper definition, for use with quickscrape. I’m off to go and create one for Taylor & Francis journals now :)


Wouldn’t it be nice if all scholarly journals presented their content on the web in the same way, so we didn’t have to write a thousand different scrapers to download it? That’d be just too helpful wouldn’t it?



April Clyburne-Sherin asked an interesting question on the OpenCon Discussion List recently:

I am an author on a manuscript that my lab wants to publish in a subscription journal that normally retains the copyright. The manuscript is a desirable one so they are “willing” (haha) to provide it “open access” (that was my stipulation to my lab when they started speaking with the publisher). My lab is happy with this, but I do not trust the publisher and want to be able to negotiate a publishing agreement that guarantees:
  • We retain the copyright;
  • The article will be open access forever and no version will be behind a paywall at their journal ever;
  • That there are no sign-ins, registrations, DRM viewing issues, or other “free” obstacles to viewing the article.

Comment: Quite rightly, April does not trust the publisher to make the published work fully open access in perpetuity, and wants to do more as an author, with the publishing agreement (a formal contract) to ensure that the publisher will actually provide the exact services she wants.

Recent events this year, whereby Elsevier, Wiley and Springer have all been caught red-handed selling access to hybrid open access articles, justify this lack of trust. It’s a sad state of affairs that authors such as April & myself no longer trust some service providers to actually provide the services we pay them for (e.g. Open Access).

Some helpful links & pointers have been provided on the discussion list, and this may be a concern many other scholarly authors have so it’s valuable to collate, discuss and publicise possible solutions to the thorny problem of publishing agreements with legacy publishers. I certainly don’t pretend to have all the answers here and I think organisations like SPARC might want to act on this one.

Lorraine Chuen links to the Canadian Association of Research Libraries (CARL) ‘Resources for Authors’ page which amongst other things discusses the Canadian SPARC Author Addendum. I knew about the US SPARC Author Addendum, but I never knew there was a Canadian version too!

Matt Menzenski links to the University of Kansas Authors & Copyright page. I particularly like An Introduction to Publication Agreements for Authors (Armstrong, 2009) that they link to at the very top – it’s really useful information.

My Suggested Solutions

For my part, I chipped-in with four different ways that in their own way either partially or wholly fulfil some or all of the criteria April is looking for:

1.) Wait for them to send you their proposed publishing agreement & change the terms to ones you find agreeable

If they send you a standard CTA (Copyright Transfer Agreement) form as PDF, you can modify the wording of that PDF to terms you prefer and send it back to them and they probably won’t even notice as long as it’s signed & doesn’t look too different. It’s cheeky, but I got away with it for a book chapter once. Be careful to remove / replace the term ‘work for hire’ – it may look like an innocuous statement but apparently this is fairly key in legal terms – I neglected to remove that from my book chapter agreement.


2.) Transferring your copyright away to another person
Not as easy perhaps for multi-author papers but Mike Taylor has a good (successful-ish) anecdote about transferring his copyright to his spouse, thereby preventing the Geological Society from taking the copyright of the work.


3.) Claim that one of the authors is a US federal government employee
Use Section 105 of the US Copyright Act by pretending that at least one of the authors is an employee of the US Government. Works of the U.S. federal government cannot be copyrighted by their authors in the US – they must be public domain, which is in practice achieved by applying the Creative Commons Zero waiver to the paper. The CTA form may contain a check box asking about this. If not, just email them about it. Michael Eisen famously, successfully liberated a NASA space research paper from behind a paywall at Science (AAAS), using Section 105 as justification.
Will publishers really bother fact-checking your assertion about the employment of one of the authors? I don’t think so. It could land them in big trouble if they dare disregard the US Copyright Act.


4.) Simply do not sign, or do not return the unfavourable publishing agreement
Another risky approach is simply not to sign or not to return the CTA the publisher sends you after acceptance (with the obvious risk that this could delay publication). I think this is perhaps the most promising approach, and there is strong evidence that many academics currently employ this practice. When you think about it: publishers actually need our papers or they’ll go bust. They need a constant stream of content to justify their existence. If you don’t sign-off on their stipulated terms and conditions, after acceptance, they do have real pressures to get on and publish the paper anyway, especially with the increased focus on optimising submission to publication times these days.


I’ll let Reinhard Diestel (mathematician, University of Hamburg) have the last word on this post, it’s a solution I’m keenly interested in trying myself:
“I stopped signing away my copyright on journal papers in the late 1990s. Interestingly, almost all publishers reacted either positively or not at all when I did not return the copyright form signed as requested: in all cases did they print the paper in question, usually without additional delay, and sometimes with unexpected understanding and support. (Yes, there have been one or two cases where things were a little more difficult at first, but these too were resolved amicably in the end.)”


Roughly ten days after I first blogged about this (see: Springer caught red-handed selling access to an Open Access article), Springer have now made a curious public statement acknowledging this debacle:

Statement on Annals of Forest Science article

Berlin, 6 May 2015

A number of tweets posted by Prof. Luis Apiolaza on 27 April, and by others active on social media, suggest that Springer is charging for access to open access articles published in Annals of Forest Science. After looking into this issue, there is indeed an issue with the status of the article, but this has to do with the background of the journal itself.

Annals of Forest Science is a journal owned by INRA (Institut National de la Recherche Agronomique). In 2009, when the article in question first appeared, the journal was being published by another company that allowed readers to read the articles without paying a fee (“free access”). When Springer started working with INRA in 2011 we agreed to add the 2007-2010 archives to SpringerLink, Springer’s online platform, in order to ensure a smooth transition and to give a wider distribution to the most recent articles. Since the copyright was not assigned to the author, and since there is no mention of the licensing used, we incorrectly assumed that the article was not open access.

It is clear that this article was intended to be open access, and it will be made so on SpringerLink as quickly as possible. Anyone that has purchased the article will, of course, be reimbursed.

Please note that we support Green Open Access and we feed all articles from INRA journals to the HAL repository after the 12-month embargo, making the articles freely downloadable there (this is clearly written on the journal’s webpage, with a link to the HAL platform). The article in question can also be found there for free (since 2011).

This has been an oversight, and we apologize for not being more thorough and vigilant.


Ruth Francis | Springer | Corporate Communications
tel +44 203192 2732 |


I am pleased that Springer are committing to reimbursing all (reader) purchasers of wrongly-paywalled articles, and I shall check my bank balance regularly in the coming weeks to see if they honour this promise.

I am also pleased that Springer see fit to formally apologize for their carelessness of publishing. I note that AFAIK neither Wiley nor Elsevier have apologised for similar incidents this year.

But I’m rather bemused by this wording they have chosen: “It is clear that this article was intended to be open access, and it will be made so on SpringerLink as quickly as possible”

Indeed it seems they chose this wording carefully, because as far as I can tell with my browser, Luis’s open access article is still on sale (see screenshot below).

Update: As of 2015-07-05 13:20 (BST) the article is now no longer paywalled. At the time of writing, as can be seen below it was clearly paywalled.



Springer SBM as an entity makes nearly a billion euros per year in turnover. Despite the considerable size, wealth and ‘experience’ in publishing, Springer can’t seem to unpaywall Luis’s article. Astonishing.

Today, the author of a paid-for, ‘hybrid’ open access article published in 2009, found that it was wrongly on sale at a Springer website:

FWIW it’s still freely available at the original publisher website here.

To test if Springer really were just brazenly selling a copy of the exact same open access article, I paid Springer to access a copy myself (screenshot below) and found it was exactly the same:

my receipt

I don’t actually care whether this is technically ‘legal’ any more. That doesn’t matter. This is scammy publishing. I want a refund and I will be contacting Springer shortly to ask for this. The author also hopes I get a refund – he wanted his article to be open access, not available for a ransom:


Frankly, I’m getting tired of writing these blog posts, but it needs to be done to record what happened, because it keeps on happening.

I really think we need to setup a c.f. to monitor and report on these types of incidents. It’s clear the publishers don’t care about this issue themselves – they get extra money from readers by making these ‘mistakes’ and no financial penalty if anyone does spot these mistakes. Calculated indifference.

Are these known incidents just the tip of the iceberg? How do we know this isn’t happening at a greater scale, unobserved? There are more than 50 million research articles on sale at the moment. Perhaps in small part this explains the obscene profits of the legacy publishers?

It’s yet another nail in the coffin for hybrid OA – we simply can’t trust these publishers to keep this content open and paywall-free.

A recap of recent incidents of selling open access articles, without the publisher acknowledging to the reader/buyer that it is an open access article:

Springer (April, 2015) this post

Wiley (March, 2015) link

Elsevier (March, 2015) link

Elsevier (2014) link