In this post I’ll go through an illustrated example of what I plan to do with my text mining project: linking-up biological specimens from the Natural History Museum, London (sometimes known as BMNH or NHMUK) to the published research literature with persistent identifiers.
I’ve run some simple grep searches of the PMC open access subset already, and PLOS ONE make up a significant portion of the ‘hits’, unsurprisingly.
Below is a visual representation of the BMNH specimen ‘hits’ I found in the full text of one PLOS ONE paper:
(2012) New Apterodontinae (Hyaenodontida) from the Eocene Locality of Dur At-Talah (Libya): Systematic, Paleoecological and Phylogenetical Implications. PLoS ONE 7(11): e49054. doi: 10.1371/journal.pone.0049054
I used the open source software Gephi, and the Yifan Hu layout to create the above graphical representation. The node marked in blue is the paper. Nodes marked in red are catalogue numbers I couldn’t find in the NHM Open Data Portal specimen collections dataset: 10 out of 34 not found.
Source data table below showing how uninformative the NHM persistent IDs are. I would have plotted them on the graph instead of the catalogue strings as that would be technically more correct (they are the unique IDs), but it would look horrible.
Catalogue Number NHM Data Portal Persistent ID
BMNH M 85319 114e7dab-b3bd-43a2-9c0f-876ec41ecfc9
BMNH M 85318 14502273-5337-4d1b-a7dd-52375f62e684
BMNH M 85315 273019df-4252-4640-97f4-902ba8c9fc44
BMNH M 85321 28931039-5bcf-45ba-a531-415475515624
BMNH M 8437 2ca4c323-5810-4b01-8dc7-cf203c973d98
BMNH M 8436 2cdeb90c-6bb0-4430-8e2b-5873db0caca8
BMNH M 9259 37f115c5-39cc-4d1d-994b-8a4e4fcd470f
BMNH M 85323 38698d5e-69f3-4f5c-8bba-5543021aac44
BMNH M 85301 477b9900-e0a7-442c-939f-048e3354bc14
BMNH M 85313 499d8a9d-d4eb-40f8-b904-c199785cc5ff
BMNH M 85297 49c91ade-fe6e-48a5-8fce-97c168252552
BMNH M 8441 4b5e3722-e92a-4a8a-997e-775e1823a07e
BMNH M 85298 4cf564e8-554d-4747-8eaa-03dd1bee0fb1
BMNH M 85320 5877de10-5327-4fcb-9fb4-51b807e97e62
BMNH M 9257 67033c65-667d-453c-8e69-265e640b2202
BMNH M 85317 73dc0d26-02e4-4889-b6b3-d3617f47300e
BMNH M 85312 87bce8d5-1b40-46a2-9f9b-15f8bd7bddae
BMNH M 8880 a89d2901-e4c0-47cf-aef4-ce458ecb98f2
BMNH M 8512 b1d01430-0046-4650-a111-7008f9597b35
BMNH M 85300 cbb0d734-a052-4132-a912-7b647468d6fc
BMNH M 85316 d789f148-3393-4a48-9143-b3bbd0ed7dee
BMNH M 85303 ee8f78d9-84f2-425c-b2e7-7e8ca013c39f
BMNH M 8873 f5de5e4d-bf48-45ef-8d7b-5484f9aa78fd
BMNH M 85322 fd30af1d-9b7f-4834-8964-e1b244550c46
BMNH M 10348
BMNH M 85324
BMNH M 85325
BMNH M 85326
BMNH M 85327
BMNH M 85328
BMNH M 85330
BMNH M 85331
BMNH M 85332
I’ve been failing to find a lot of well known entities in the online specimen collections dataset which makes me rather concerned about its completeness. High profile specimens such as Lesothosaurus “BMNH RUB 17″ (as mentioned in this PLOS ONE paper, Table 1) can’t be found online via the portal under that catalogue number. I can however find RUB 16, RUB 52 and RUB 54 but these are probably different specimens. RUB 17 is mentioned in a great many papers by many different authors so it seems unlikely that they have all independently given the specimen an incorrect catalogue number – the problem is more likely to be in the completeness of the online dataset.
Another ‘missing’ example is “BMNH R4947″ a specimen of Euoplocephalus tutus as referred to in Table 4 of this PLOS ONE paper by Arbour and Currie. There are two other records for that taxon, but not under R4947.
To end on a happier note, I can definitely answer one question conclusively:
What is the most ‘popular’ NHM specimen in PLOS ONE full text?
…it’s “BMNH 37001″, Archaeopteryx lithographica which is referred to in full text by four different papers (see below for details).
I have feeling many more NHM specimens are hiding out in separate supplementary materials files. Mining these will be hard unless figshare gets their act together and creates a full-text API for searching their collection – I believe it’s a metadata only API at the moment.
I’ve purposefully made very simple graphs so far. Once I get more data, I can start linking it up to create beautiful and complex graphs like the one below (of the taxa shared between 3000 microbial phylogenetic studies in IJSEM, unpublished), which I’m still trying to get my head around. The linked open data work continues…