Show me the data!

To prove my point about the way that supplementary data files bury useful data, making it utterly undiscoverable to most people, I decided to do a little experiment (in relation to text mining for museum specimen identifiers, but also perhaps with some relevance to the NHM Conservation Hackathon):

I collected the links for all Biology Letters supplementary data files. I then filtered out the non-textual media such as audio, video and image files, then downloaded the remaining content.

A breakdown of file extensions encountered in this downloaded subset:

763 .doc files
543 .pdf files
109 .docx files
75 .xls files
53 .xlsx files
25 .csv files
19 .txt files
14 .zip files
2 .rtf files
2 .nex files
1 .xml file
1 “.xltx” file
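
A tally like the one above can be produced in a few lines. This is only a minimal sketch, not the script actually used for this post; the function name and the file names are made up for illustration.

```python
from collections import Counter
from pathlib import Path


def count_extensions(filenames):
    """Tally lower-cased file extensions, most common first."""
    counts = Counter(Path(name).suffix.lower() for name in filenames)
    return counts.most_common()


# Hypothetical file names, for illustration only.
files = ["rsbl_s1.doc", "rsbl_s2.PDF", "table.doc", "data.csv"]
for ext, n in count_extensions(files):
    print(n, ext)
```

Lower-casing the suffix means ".PDF" and ".pdf" are counted together.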

I then converted some of these unfriendly formats into simpler, more easily searchable plain text formats:


Now everything is properly searchable and indexable!
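
As an illustration of the kind of conversion involved, here is a rough sketch for just one of the formats, .docx, using only the Python standard library. It is not the tooling used for this post: the function name is hypothetical, and it relies on the fact that a .docx file is a zip archive whose main text lives in word/document.xml. Formats such as .doc and .pdf would need separate converters.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements inside word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the paragraph text of a .docx file, one paragraph per line."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's visible text is split across several <w:t> runs.
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(paragraphs)
```

The result can be written straight to a .txt file next to the original. This sketch only reads the main document body, so it ignores headers, footers and footnotes.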

In a matter of seconds I can find NHM specimen identifiers that might not otherwise be mentioned in the full text of the paper, without wasting any time manually reading papers. Note that not all the ‘hits’ are true positives, but most are, and those that aren’t, e.g. “NHMQEVLEGYKKKYE”, are easy to distinguish as NOT valid NHM specimen identifiers.
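
A search along these lines can be approximated with a regular expression. The pattern below is my own guess at a reasonable shape for these identifiers (an institution code, an optional short collection prefix, then digits); it is an assumption for illustration, not the exact search used here. Requiring digits is what screens out protein-sequence-like strings such as the false positive above.

```python
import re

# Assumed pattern: institution code, optional short prefix such as "R", then digits.
SPECIMEN_RE = re.compile(
    r"\b(?:NHMUK|BMNH|NHM)\s?(?:[A-Z]{1,3}\.?\s?)?\d+(?:[./]\d+)*\b"
)


def find_specimen_ids(text):
    """Return every substring of text that looks like an NHM specimen identifier."""
    return SPECIMEN_RE.findall(text)


sample = "Compared with NHMUK R3592; the peptide NHMQEVLEGYKKKYE was excluded."
print(find_specimen_ids(sample))
```

A stricter pattern loses real identifiers written in unusual styles, and a looser one lets in more noise, so the hits still need a quick check by eye.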


Perhaps this approach might be useful to the PREDICTS / LPI teams, looking for species occurrence data sets?

I don’t know why figshare doesn’t do deep indexing by default. It’d be really useful to be able to search the morass of published supplementary data that is out there!

Progress on specimen mining

June 14th, 2015 | Posted by rmounce in Content Mining

I’ve been on holiday in Japan recently, so work on this came to a halt for a while, but I think I’ve largely ‘done’ PLOS ONE full text now (excluding supplementary materials).

My results are on GitHub: one prettier file without the exact provenance or in-sentence context of each putative specimen entity, and one more extensive file with provenance and context included, which unfortunately GitHub can’t render/preview.


Some summary stats:

I found 427 unique BMNH/NHMUK specimen mentions from a total of just 69 unique PLOS ONE papers. That so few papers are involved strongly suggests to me that there are a lot of ‘hidden’ specimen identifiers in difficult-to-search supplementary materials files.

I found 497 specimen mentions if you include instances where the same BMNH/NHMUK specimen is mentioned in different PLOS ONE papers.
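
The difference between these two counts is easy to see in code. The sketch below assumes the mentions are held as (specimen ID, DOI) pairs; that layout, the function name and the placeholder DOIs in the example are my own illustration, not the format of the results files.

```python
def summarise(mentions):
    """Summarise an iterable of (specimen_id, doi) pairs; duplicates are allowed."""
    pairs = set(mentions)  # repeat mentions within one paper collapse to one pair
    return {
        "unique_specimens": len({specimen for specimen, _ in pairs}),
        "unique_papers": len({doi for _, doi in pairs}),
        "specimen_paper_pairs": len(pairs),
    }


# Placeholder DOIs, for illustration only.
example = [
    ("NHMUK R3592", "doi-A"),
    ("NHMUK R3592", "doi-A"),  # same specimen mentioned twice in one paper
    ("NHMUK R3592", "doi-B"),
    ("NHMUK R100", "doi-A"),
]
print(summarise(example))
```

Here "unique_specimens" corresponds to the 427 figure and "specimen_paper_pairs" to the 497 figure.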

Finding putative specimen entities in PLOS ONE full text is relatively automatic and easy. The time-consuming manual part is accurately matching them up with official NHM collection specimens data.

I could only confidently link up 314 of the 497 detected mentions to their corresponding unique IDs / URLs in the NHM Open Data Portal Collection Specimens dataset. The remaining 183 (a little over one third) can’t be confidently matched up to a unique specimen in the online specimen collection dataset. I suspect this is mainly down to absence/incompleteness in the online collections data, although a small few are likely typos in PLOS ONE papers.
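
Most of this linking was manual, but the easy cases could be caught automatically first. The sketch below is an assumption-laden illustration rather than what I actually did: it treats BMNH and NHMUK as interchangeable codes, ignores case and spacing, and takes the catalogue as a plain dictionary of catalogue number to record ID. All names here are hypothetical.

```python
import re


def normalise(identifier):
    """Upper-case, map the older BMNH code to NHMUK, and strip whitespace."""
    ident = identifier.upper().replace("BMNH", "NHMUK")
    return re.sub(r"\s+", "", ident)


def link_mentions(mentions, catalogue):
    """Split mentions into those with exactly one catalogue hit and the rest.

    catalogue: dict mapping a catalogue number string to a record ID.
    """
    index = {}
    for number, record_id in catalogue.items():
        index.setdefault(normalise(number), []).append(record_id)
    linked, unlinked = {}, []
    for mention in mentions:
        hits = index.get(normalise(mention), [])
        if len(hits) == 1:
            linked[mention] = hits[0]
        else:
            unlinked.append(mention)  # no hit, or ambiguous: leave for manual checking
    return linked, unlinked
```

Anything that lands in the unlinked list, whether missing from the catalogue or matching more than one record, still needs a human to look at it.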

In my last post I was confident that the BM Archaeopteryx specimen would be the most frequently mentioned specimen, but with more extensive data collection and analysis that now appears not to be true! NHMUK R3592 (a specimen of Erythrosuchus africanus) is mentioned in five different PLOS ONE papers. Pleasingly, Google Scholar also finds only five PLOS ONE papers mentioning this specimen, which is independent confirmation of my methodology.

One of the BM specimens of Erythrosuchus is referred to more often in PLOS ONE than the BM Archaeopteryx specimen
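
Ranking specimens by the number of distinct papers that mention them is a one-liner once the mentions are in (specimen ID, DOI) pairs. As before, that data layout and the function name are assumptions for illustration, and the DOIs in the example are placeholders.

```python
from collections import Counter


def papers_per_specimen(mentions):
    """Count distinct papers per specimen from (specimen_id, doi) pairs."""
    # set() removes repeat mentions of a specimen within the same paper.
    return Counter(specimen for specimen, _ in set(mentions))


# Placeholder data, for illustration only.
ranking = papers_per_specimen([
    ("NHMUK R3592", "doi-A"),
    ("NHMUK R3592", "doi-A"),
    ("NHMUK R3592", "doi-B"),
    ("NHMUK R100", "doi-A"),
])
print(ranking.most_common(1))
```

Counting papers rather than raw mentions stops a single paper that cites one specimen many times from dominating the ranking.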

Now that I have these two ‘atomic’ identifiers linked up (the NHM specimen collections occurrence ID and the Digital Object Identifier of the research article in which it appears), I can, if desired, find out a wealth of information about these specimens and the papers they are mentioned in.

My next step will be to extend this search to all of the PubMed Central OA subset, not just PLOS ONE.