Show me the data!

Progress on specimen mining

June 14th, 2015 | Posted by rmounce in Content Mining

I’ve been on holiday to Japan recently, so work came to a halt on this for a while but I think I’ve largely ‘done’ PLOS ONE full text now (excluding supplementary materials).

My results are on GitHub in two files: a prettier one without the exact provenance or in-sentence context of each putative specimen entity, and a more extensive one with provenance & context included, which unfortunately GitHub can’t render/preview.


Some summary stats:

I found 427 unique BMNH/NHMUK specimen mentions from a total of just 69 unique PLOS ONE papers. The latter strongly suggests to me that there are a lot of ‘hidden’ specimen identifiers hiding out in difficult-to-search supplementary materials files.

I found 497 specimen mentions if you include instances where the same BMNH/NHMUK specimen is mentioned in different PLOS ONE papers.
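A rough sketch of how those two counts differ, using made-up data (the DOIs below are placeholders, not real results): the first number counts each specimen once, the second counts each specimen once per paper it appears in.

```python
from collections import Counter

# Hypothetical (specimen, paper DOI) mentions; the DOIs are placeholders.
mentions = [
    ("NHMUK R3592", "10.1371/example.0001"),
    ("NHMUK R3592", "10.1371/example.0001"),  # repeated within the same paper
    ("NHMUK R3592", "10.1371/example.0002"),
    ("BMNH 37001", "10.1371/example.0002"),
]

# One entry per specimen per paper, however often the paper repeats it.
specimen_paper_pairs = set(mentions)

# Each specimen counted once, regardless of how many papers mention it.
unique_specimens = {specimen for specimen, _ in specimen_paper_pairs}

# Each paper counted once, regardless of how many specimens it mentions.
unique_papers = {doi for _, doi in specimen_paper_pairs}

# Number of distinct papers mentioning each specimen.
papers_per_specimen = Counter(specimen for specimen, _ in specimen_paper_pairs)

print(len(unique_specimens), len(specimen_paper_pairs), len(unique_papers))
print(papers_per_specimen.most_common(1))
```

The same `papers_per_specimen` counter is all that is needed to ask which specimen turns up in the most papers.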

Finding putative specimen entities in PLOS ONE full text is relatively automatic and easy. The time-consuming manual part is accurately matching them up with official NHM collection specimens data.

I could only confidently link up 314 of the 497 detected mentions to their corresponding unique IDs / URLs in the NHM Open Data Portal Collection Specimens dataset. Approximately one third can’t be confidently matched to a unique specimen in the online specimen collection dataset. I suspect this is mainly down to absence/incompleteness in the online collections data, although a few are likely typos in the PLOS ONE papers.

In my last post I was confident that the BM Archaeopteryx specimen would be the most frequently mentioned specimen but with more extensive data collection and analysis that appears now not to be true! NHMUK R3592 (a specimen of Erythrosuchus africanus) is mentioned in 5 different PLOS ONE papers. Pleasingly, Google Scholar also finds only five PLOS ONE papers mentioning this specimen – independent confirmation of my methodology.

One of the BM specimens of Erythrosuchus is referred to more often in PLOS ONE than the BM Archaeopteryx specimen

Now that I have these two ‘atomic’ identifiers linked up (the NHM specimen collections occurrence ID + the Digital Object Identifier of the research article in which it appears), I can, if desired, find out a whole wealth of information about these specimens and the papers they are mentioned in.

My next steps will be to extend this search to all of the PubMedCentral OA subset, not just PLOS ONE.


In this post I’ll go through an illustrated example of what I plan to do with my text mining project: linking-up biological specimens from the Natural History Museum, London (sometimes known as BMNH or NHMUK) to the published research literature with persistent identifiers.

I’ve run some simple grep searches of the PMC open access subset already, and PLOS ONE papers make up a significant portion of the ‘hits’, unsurprisingly.

Below is a visual representation of the BMNH specimen ‘hits’ I found in the full text of one PLOS ONE paper:

Grohé C, Morlo M, Chaimanee Y, Blondel C, Coster P, et al. (2012) New Apterodontinae (Hyaenodontida) from the Eocene Locality of Dur At-Talah (Libya): Systematic, Paleoecological and Phylogenetical Implications. PLoS ONE 7(11): e49054. doi: 10.1371/journal.pone.0049054


I used the open source software Gephi, and the Yifan Hu layout to create the above graphical representation. The node marked in blue is the paper. Nodes marked in red are catalogue numbers I couldn’t find in the NHM Open Data Portal specimen collections dataset: 10 out of 34 not found.

Source data table below showing how uninformative the NHM persistent IDs are. I would have plotted them on the graph instead of the catalogue strings as that would be technically more correct (they are the unique IDs), but it would look horrible.

Catalogue Number   NHM Data Portal Persistent ID
BMNH M 85319   114e7dab-b3bd-43a2-9c0f-876ec41ecfc9
BMNH M 85318   14502273-5337-4d1b-a7dd-52375f62e684
BMNH M 85315   273019df-4252-4640-97f4-902ba8c9fc44
BMNH M 85321   28931039-5bcf-45ba-a531-415475515624
BMNH M 8437    2ca4c323-5810-4b01-8dc7-cf203c973d98
BMNH M 8436    2cdeb90c-6bb0-4430-8e2b-5873db0caca8
BMNH M 9259    37f115c5-39cc-4d1d-994b-8a4e4fcd470f
BMNH M 85323   38698d5e-69f3-4f5c-8bba-5543021aac44
BMNH M 85301   477b9900-e0a7-442c-939f-048e3354bc14
BMNH M 85313   499d8a9d-d4eb-40f8-b904-c199785cc5ff
BMNH M 85297   49c91ade-fe6e-48a5-8fce-97c168252552
BMNH M 8441    4b5e3722-e92a-4a8a-997e-775e1823a07e
BMNH M 85298   4cf564e8-554d-4747-8eaa-03dd1bee0fb1
BMNH M 85320   5877de10-5327-4fcb-9fb4-51b807e97e62
BMNH M 9257    67033c65-667d-453c-8e69-265e640b2202
BMNH M 85317   73dc0d26-02e4-4889-b6b3-d3617f47300e
BMNH M 85312   87bce8d5-1b40-46a2-9f9b-15f8bd7bddae
BMNH M 8880    a89d2901-e4c0-47cf-aef4-ce458ecb98f2
BMNH M 8512    b1d01430-0046-4650-a111-7008f9597b35
BMNH M 85300   cbb0d734-a052-4132-a912-7b647468d6fc
BMNH M 85316   d789f148-3393-4a48-9143-b3bbd0ed7dee
BMNH M 85303   ee8f78d9-84f2-425c-b2e7-7e8ca013c39f
BMNH M 8873    f5de5e4d-bf48-45ef-8d7b-5484f9aa78fd
BMNH M 85322   fd30af1d-9b7f-4834-8964-e1b244550c46
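BMNH M 10348   (not found)
BMNH M 85324   (not found)
BMNH M 85325   (not found)
BMNH M 85326   (not found)
BMNH M 85327   (not found)
BMNH M 85328   (not found)
BMNH M 85330   (not found)
BMNH M 85331   (not found)
BMNH M 85332   (not found)
BMNH M 85335   (not found)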
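For anyone wanting to rebuild a graph like this, here is a minimal sketch of turning the table above into node and edge tables of the kind Gephi's spreadsheet importer reads (an edge table with Source and Target columns, a node table with Id and Label). Only three rows of the table are included, and the `Status` column name and the helper functions are my own illustrative choices, not something Gephi requires.

```python
import csv
import io

PAPER_DOI = "10.1371/journal.pone.0049054"

# Three rows from the table above; None marks a catalogue number
# that could not be found in the NHM Open Data Portal.
SPECIMENS = {
    "BMNH M 85319": "114e7dab-b3bd-43a2-9c0f-876ec41ecfc9",
    "BMNH M 8437": "2ca4c323-5810-4b01-8dc7-cf203c973d98",
    "BMNH M 10348": None,
}


def build_tables(doi, specimens):
    """Return (nodes, edges) as lists of dicts: one paper node, one node per specimen."""
    nodes = [{"Id": doi, "Label": doi, "Status": "paper"}]
    edges = []
    for catalogue_number, persistent_id in specimens.items():
        status = "found" if persistent_id else "not found"
        nodes.append({"Id": catalogue_number, "Label": catalogue_number, "Status": status})
        edges.append({"Source": doi, "Target": catalogue_number})
    return nodes, edges


def to_csv(rows):
    """Serialise a list of dicts to CSV text, with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


nodes, edges = build_tables(PAPER_DOI, SPECIMENS)
nodes_csv = to_csv(nodes)
edges_csv = to_csv(edges)
print(edges_csv)
```

The `Status` column can then be used inside Gephi to colour the paper, the found specimens and the missing specimens differently.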


I’ve been failing to find a lot of well known entities in the online specimen collections dataset which makes me rather concerned about its completeness. High profile specimens such as Lesothosaurus “BMNH RUB 17” (as mentioned in this PLOS ONE paper, Table 1) can’t be found online via the portal under that catalogue number. I can however find RUB 16, RUB 52 and RUB 54 but these are probably different specimens. RUB 17 is mentioned in a great many papers by many different authors so it seems unlikely that they have all independently given the specimen an incorrect catalogue number – the problem is more likely to be in the completeness of the online dataset.

Another ‘missing’ example is “BMNH R4947” a specimen of Euoplocephalus tutus as referred to in Table 4 of this PLOS ONE paper by Arbour and Currie. There are two other records for that taxon, but not under R4947.

To end on a happier note, I can definitely answer one question conclusively:
What is the most ‘popular’ NHM specimen in PLOS ONE full text?

…it’s “BMNH 37001”, Archaeopteryx lithographica which is referred to in full text by four different papers (see below for details).

I have a feeling many more NHM specimens are hiding out in separate supplementary materials files. Mining these will be hard unless figshare gets their act together and creates a full-text API for searching their collection – I believe it’s a metadata only API at the moment.

37001 in PLOS ONE papers


I’ve purposefully made very simple graphs so far. Once I get more data, I can start linking it up to create beautiful and complex graphs like the one below (of the taxa shared between 3000 microbial phylogenetic studies in IJSEM, unpublished), which I’m still trying to get my head around. The linked open data work continues…

Bacillus subtilis commonly used


Now that I’m at the Natural History Museum, London, I’ve started a new and ambitious text-mining project: to find, extract, publish, and link-up all mentions of NHM, London specimens published in the recent research literature (born digital, published post-2000).

Rod Page is already blazing a trail in this area with older BHL literature. See: Linking specimen codes to GBIF & Design Notes on Modelling Links for recent, relevant posts. But there’s still lots to be done I think, so here’s my modest effort.



It’s important to demonstrate the value of biological specimen collections. A lot of money is spent cataloguing, curating and keeping safe these specimens. It would be extremely useful to show that these specimens are being used, at scale, in real, recent research — it’s not just irrelevant stamp collecting.

Sometimes the NHM, London specimen catalogue has incorrect, incomplete or outdated data about its own specimens – there is better, newer data about them in the published literature that needs to be fed back to the museum.

An example: specimen “BMNH 2013.2.13.3” is listed in the online catalogue on the NHM open data portal as Petrochromis nov. sp. By searching the literature for BMNH specimens, I happened to find where the new species of this specimen was described: as Petrochromis horii Takahashi & Koblmüller, 2014. It’s also worth noting this specimen has associated nucleotide sequence data on GenBank.

Having talked a lot about the 5 stars of open data in the context of research data recently, I wonder… wouldn’t it be really useful to make 4 or 5 star linked open data around biological specimens? From Rod Page, I gather this is part of the grand goal of creating a biodiversity knowledge graph.

For this project, I will be focussing on linking BMNH (NHM, London) specimen identifiers with publication identifiers (e.g. DOIs) and GenBank accession numbers.


What questions to ask?

Where have NHM, London specimens been used/published? What are the most used NHM, London specimens in research? How does NHM, London specimen usage compare to other major museums such as the AMNH (New York) or MNHN (Paris)?

Materials for Mining

1.) The PubMedCentral Open Access subset – a million papers, but mainly biomedical research.
2.) Open Access & free access journals that are not included in PMC
3.) figshare – particularly useful if nothing else, as a means of mining PLOS ONE supplementary materials (I read recently that essentially 90% of figshare is actually PLOS ONE supp. material! See Table 2)
4.) select subscription access journals – annoyingly hard to get access to in bulk, but important to include as sadly much natural history research is still published behind paywalls.


(very) Preliminary Results

The PMC OA subset is fantastic & really facilitates this kind of research – I wish ALL of the biodiversity literature was aggregated like (some) of the open access biomedical literature is. You can literally just download a million papers, click, and go do your research. It facilitates rigorous research by allowing full machine access to full texts.

Simple grep searches for ‘NHMUK’ & ‘BMNH [A-Z0-9][0-9]’, two of the commonest forms by which specimens are cited, reveal many thousands of possible specimen mentions in the PMC OA subset, which I must now look through to clean-up & link-up. In terms of journals, these ‘hits’ in the PMC OA subset come from (in no particular order): PLOS ONE, Parasites & Vectors, PeerJ, ZooKeys, Toxins, Zoo J Linn Soc, Parasite, Frontiers in Zoology, Ecology & Evolution, BMC Research Notes, Biology Letters, BMC Evolutionary Biology, Aquatic Biosystems, BMC Biology, Molecular Ecology, Journal of Insect Science, Nucleic Acids Research and more…!
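For those who prefer Python to grep, the same two patterns can be sketched like this. The sample sentences are made up for illustration (only the catalogue strings come from this blog), and `grep_hits` is a hypothetical helper name.

```python
import re

# The two grep patterns quoted above, combined into a single alternation.
SPECIMEN_RE = re.compile(r"NHMUK|BMNH [A-Z0-9][0-9]")


def grep_hits(text):
    """Return the lines of `text` that contain a putative specimen mention."""
    return [line for line in text.splitlines() if SPECIMEN_RE.search(line)]


# Made-up sentences; only the catalogue strings are taken from the posts.
SAMPLE = """Specimen NHMUK R3592 was re-examined.
We compared it with BMNH 37001 and BMNH R4947.
Material includes BMNH M 85319 and BMNH RUB 17.
No museum specimens were used in this study."""

hits = grep_hits(SAMPLE)
for line in hits:
    print(line)
```

Run as written, this picks up the first two lines but not the third: spaced forms such as ‘BMNH M 85319’ or ‘BMNH RUB 17’ don't fit ‘BMNH [A-Z0-9][0-9]’, so catching them needs a looser pattern.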

specimen “BMNH” is a great example to lookup / link-up on the NHM Open Data Portal: the catalogue record has 7 associated images openly available under CC BY, so I can liven up this post by including an image of the specimen (below)! I found this specimen used in a PLOS ONE paper: Walmsley et al. (2013) Why the Long Face? The Mechanics of Mandibular Symphysis Proportions in Crocodiles. doi: 10.1371/journal.pone.0053873 (in the text caption for figure 1 to be precise).

© The Trustees of the Natural History Museum, London. Licensed for reuse under CC BY 4.0. Source.



Questions Arising

How to find and extract mentions of NHM, London specimens in papers published in Science, Nature & PNAS? There are sure to be many! I’m assuming the last 15 years’ worth of research published in these journals will be difficult to scrape – they would be quite likely to block my IP address if I tried to. Furthermore, all the actual science is typically buried in supplementary file PDFs in these journals, not in the ‘main’ short article. Will Science, Nature & PNAS let me download all their supp material from the last 15 years? Is this facilitated at all? How do people actually do rigorous research when the contents of supplementary data files published in these journals are so undiscoverable & inaccessible to search?


It’s clear to me there are many separate divisions when it comes to discoverability of research. There’s the divide between open access (highly discoverable & searchable) and subscription access (less discoverable, less searchable, depending upon publisher-restrictions). There’s also the divide between the ‘paper’ (more searchable) and ‘supplementary materials’ (less easily searchable). Finally, there’s also the divide between textual and non-textual media: a huge amount of knowledge in the scientific literature is trapped in non-textual forms such as figure images which simply aren’t instantly searchable by textual methods (figure captions DO NOT contain all of the information of the figure image! Also, OCR is time consuming and error-prone especially on the heterogeneity of fonts and orientation of words in most figures). For example, looking across thousands of papers with phylogenetic analyses published in the journal IJSEM, 95% of the taxa / GenBank accessions used in them are only mentioned in the figure image, nowhere else in the paper or supplementary materials as text! This needs to change.


As should be obvious by now, this is a very preliminary post, just to let people know what I’m doing and what I’m thinking. In my next post I’ll detail some of the subscription access journals I’ve been text mining for specimens, and the barriers I’ve encountered when trying to do so.


Bonus question: How should I publish this annotation data?

Easiest would be to release all annotations as a .csv on the NHM open data portal with 3 columns where each column mimics ‘subject’  ‘predicate’ ‘object’ notation: Specimen, “is mentioned in”, Article DOI.
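As a minimal sketch of that three-column layout, here are two rows built from the Grohé et al. (2012) table earlier on this page. The column header names and the choice of the portal's persistent ID as the ‘subject’ are my assumptions for illustration, not a settled format.

```python
import csv
import io

PREDICATE = "is mentioned in"

# (specimen persistent ID, article DOI) pairs taken from the table for
# Grohé et al. (2012); the header names below are illustrative assumptions.
LINKS = [
    ("114e7dab-b3bd-43a2-9c0f-876ec41ecfc9", "10.1371/journal.pone.0049054"),
    ("14502273-5337-4d1b-a7dd-52375f62e684", "10.1371/journal.pone.0049054"),
]


def triples_csv(links):
    """Write one subject, predicate, object row per specimen-article link."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["specimen", "predicate", "article_doi"])
    for specimen_id, doi in links:
        writer.writerow([specimen_id, PREDICATE, doi])
    return buffer.getvalue()


annotations = triples_csv(LINKS)
print(annotations)
```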

But if I wanted to publish something a little better & a little more formal, what kind of RDF vocabulary can I use to describe “occurs in” or “is mentioned in”? What would be the most useful format to publish this data in so that it can be re-used and extended to become part of the biodiversity knowledge graph and have lasting value?

Making a journal scraper

May 13th, 2015 | Posted by rmounce in Content Mining

Yesterday, I made a journal scraper for the International Journal of Systematic and Evolutionary Microbiology (IJSEM).

Fortunately, Richard Smith-Unna and the ContentMine team have done most of the hard work in creating the general framework with quickscrape (open-source and available on github), I just had to modify the available journal-scrapers to work with IJSEM.

How did I do it?

Find an open access article in the target journal, e.g. James et al (2015) Kazachstania yasuniensis sp. nov., an ascomycetous yeast species found in mainland Ecuador and on the Galápagos

In your browser, view the HTML source of the full text page, in the Chrome/Chromium browser the keyboard shortcut to do this is Ctrl-U. You should then see something like this, perhaps with less funky highlighting colours:

<!DOCTYPE html
  PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "">
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
      <title>Kazachstania yasuniensis sp. nov., an ascomycetous yeast species found in mainland Ecuador and on the Galápagos </title>
      <meta name="googlebot" content="NOODP" />
      <meta name="" content="/cgi/content/full/65/Pt_4/1304" />
      <meta content="/ijs/65/Pt_4/1304.atom" name="HW.identifier" />
      <meta name="DC.Format" content="text/html" />
      <meta name="DC.Language" content="en" />
      <meta content="Kazachstania yasuniensis sp. nov., an ascomycetous yeast species found in mainland Ecuador and on the Galápagos"
            name="DC.Title" />
      <meta content="10.1099/ijs.0.000102" name="DC.Identifier" />
      <meta content="2015-04-01" name="DC.Date" />
      <meta content="Society for General Microbiology" name="DC.Publisher" />
      <meta content="Stephen A. James" name="DC.Contributor" />
      <meta content="Enrique Javier Carvajal Barriga" name="DC.Contributor" />
      <meta content="Patricia Portero Barahona" name="DC.Contributor" />
      <meta content="Carmen Nueno-Palop" name="DC.Contributor" />
      <meta content="Kathryn Cross" name="DC.Contributor" />
      <meta content="Christopher J. Bond" name="DC.Contributor" />
      <meta content="Ian N. Roberts" name="DC.Contributor" />
      <meta content="International Journal of Systematic and Evolutionary&#xA;                Microbiology"
            name="citation_journal_title" />

I based my IJSEM scraper on the existing set of scraper definitions for eLife because I know both journals use similar underlying technology to create their webpages.

The first bit I clearly had to modify was the extraction of publisher. In the eLife scraper this works:

  "url": "elifesciences\\.org",
  "elements": {
    "publisher": {
      "selector": "//meta[@name='citation_publisher']",
      "attribute": "content"

but at IJSEM that information isn’t specified with ‘citation_publisher’, instead it’s tagged as ‘DC.Publisher’ so I modified the element to reflect that:

  "url": "ijs\\.sgmjournals\\.org",
  "elements": {
    "publisher": {
      "selector": "//meta[@name='DC.Publisher']",
      "attribute": "content"

The license and copyright information extraction is even more different between eLife and IJSEM, here’s the correct scraper for the former:

    "license": {
      "selector": "//meta[@name='DC.Rights']",
      "attribute": "text"
    "copyright": {
      "selector": "//meta[@name='DC.Rights']",
      "attribute": "text"

and here’s how I changed it to extract that information from IJSEM pages:

    "license": {
      "selector": "//div[contains(@class, 'license')]",
      "attribute": "text"
    "copyright": {
      "selector": "//div/p[contains(@class, 'copyright')]",
      "attribute": "text"

The XPath needed is completely different. The information is inside a div, not a meta tag.


Hardest of all though were the full size figures and the supplementary materials files – they’re not directly linked from the full text HTML page which is rather annoying. Richard had to help me out with these by creating “followables”:

In his words:

any element can ‘follow’ any other element in the elements array, just by adding the key-value pair "follow": "element_name" to the element that does the following. If you want to follow an element, but don’t want the followed element to be included in the results, you add it to a followables array instead of the elements array. The followed array must capture a URL.


   "url": "ijs\\.sgmjournals\\.org",
  "followables": {
    "figure_expansion": {
      "selector": "//div[contains(@class, 'fig-inline')]//a[text()='In this window']",
      "attribute": "href"
    "suppdata_expansion": {
      "selector": "//a[@rel='supplemental-data']",
      "attribute": "href"


     "supplementary_material": {
-      "selector": "//a[@rel='supplemental-data']",
      "follow": "suppdata_expansion",
      "selector": "//div[@id='content-block']//a",
       "attribute": "href",
       "download": true
     "figure": {
-      "selector": "//div[contains(@class, 'fig-inline')]/a/img",
-      "attribute": "src",
      "follow": "figure_expansion",
      "selector": "//div[contains(@class, 'fig-expansion')]/a",
      "attribute": "href",
       "download": true


The bottom line is, it might look complicated initially, but actually it’s not that hard to write a fully functioning journal scraper definition for use with quickscrape. I’m off to go and create one for Taylor & Francis journals now :)


Wouldn’t it be nice if all scholarly journals presented their content on the web in the same way, so we didn’t have to write a thousand different scrapers to download it? That’d be just too helpful wouldn’t it?