Show me the data!

In this post I’ll go through an illustrated example of what I plan to do with my text mining project: linking-up biological specimens from the Natural History Museum, London (sometimes known as BMNH or NHMUK) to the published research literature with persistent identifiers.

I’ve run some simple grep searches of the PMC open access subset already, and PLOS ONE papers make up a significant portion of the ‘hits’, unsurprisingly.

Below is a visual representation of the BMNH specimen ‘hits’ I found in the full text of one PLOS ONE paper:

Grohé C, Morlo M, Chaimanee Y, Blondel C, Coster P, et al. (2012) New Apterodontinae (Hyaenodontida) from the Eocene Locality of Dur At-Talah (Libya): Systematic, Paleoecological and Phylogenetical Implications. PLoS ONE 7(11): e49054. doi: 10.1371/journal.pone.0049054


I used the open source software Gephi, and the Yifan Hu layout to create the above graphical representation. The node marked in blue is the paper. Nodes marked in red are catalogue numbers I couldn’t find in the NHM Open Data Portal specimen collections dataset: 10 out of 34 not found.

The source data table below shows how uninformative the NHM persistent IDs are. I would have plotted them on the graph instead of the catalogue strings, as that would be technically more correct (they are the unique IDs), but it would look horrible.


I’ve been failing to find a lot of well-known entities in the online specimen collections dataset, which makes me rather concerned about its completeness. High profile specimens such as Lesothosaurus “BMNH RUB 17” (as mentioned in this PLOS ONE paper, Table 1) can’t be found online via the portal under that catalogue number. I can however find RUB 16, RUB 52 and RUB 54 but these are probably different specimens. RUB 17 is mentioned in a great many papers by many different authors so it seems unlikely that they have all independently given the specimen an incorrect catalogue number – the problem is more likely to be in the completeness of the online dataset.

Another ‘missing’ example is “BMNH R4947”, a specimen of Euoplocephalus tutus as referred to in Table 4 of this PLOS ONE paper by Arbour and Currie. There are two other records for that taxon, but not under R4947.

To end on a happier note, I can definitely answer one question conclusively:
What is the most ‘popular’ NHM specimen in PLOS ONE full text?

…it’s “BMNH 37001”, Archaeopteryx lithographica, which is referred to in full text by four different papers (see below for details).

I have a feeling many more NHM specimens are hiding out in separate supplementary materials files. Mining these will be hard unless figshare gets their act together and creates a full-text API for searching their collection – I believe it’s a metadata-only API at the moment.

37001 in PLOS ONE papers


I’ve purposefully made very simple graphs so far. Once I get more data, I can start linking it up to create beautiful and complex graphs like the one below (of the taxa shared between 3000 microbial phylogenetic studies in IJSEM, unpublished), which I’m still trying to get my head around. The linked open data work continues…

Bacteria subutilis commonly used


Now I’m at the Natural History Museum, London I’ve started a new and ambitious text-mining project: to find, extract, publish, and link-up all mentions of NHM, London specimens published in the recent research literature (born digital, published post-2000).

Rod Page is already blazing a trail in this area with older BHL literature. See: Linking specimen codes to GBIF & Design Notes on Modelling Links for recent, relevant posts. But there’s still lots to be done I think, so here’s my modest effort.



It’s important to demonstrate the value of biological specimen collections. A lot of money is spent cataloguing, curating and keeping safe these specimens. It would be extremely useful to show that these specimens are being used, at scale, in real, recent research — it’s not just irrelevant stamp collecting.

Sometimes the NHM, London specimen catalogue has incorrect, incomplete or outdated data about its own specimens – there is better, newer data about them in the published literature that needs to be fed back to the museum.

An example: specimen “BMNH 2013.2.13.3” is listed in the online catalogue on the NHM open data portal as Petrochromis nov. sp. By searching the literature for BMNH specimens, I happened to find where the new species of this specimen was described: as Petrochromis horii Takahashi & Koblmüller, 2014. It’s also worth noting that this specimen has associated nucleotide sequence data on GenBank.

Having talked a lot about the 5 stars of open data in the context of research data recently, I wonder… wouldn’t it be really useful to make 4 or 5 star linked open data around biological specimens? From Rod Page, I gather this is part of the grand goal of creating a biodiversity knowledge graph.

For this project, I will be focussing on linking BMNH (NHM, London) specimen identifiers with publication identifiers (e.g. DOIs) and GenBank accession numbers.


What questions to ask?

Where have NHM, London specimens been used/published? What are the most used NHM, London specimens in research? How does NHM, London specimen usage compare to other major museums such as the AMNH (New York) or MNHN (Paris)?

Materials for Mining

1.) The PubMedCentral Open Access subset – a million papers, but mainly biomedical research.
2.) Open Access & free access journals that are not included in PMC
3.) figshare – particularly useful if nothing else, as a means of mining PLOS ONE supplementary materials (I read recently that essentially 90% of figshare is actually PLOS ONE supp. material! See Table 2)
4.) select subscription access journals – annoyingly hard to get access to in bulk, but important to include as sadly much natural history research is still published behind paywalls.


(very) Preliminary Results

The PMC OA subset is fantastic & really facilitates this kind of research – I wish ALL of the biodiversity literature was aggregated like (some) of the open access biomedical literature is. You can literally just download a million papers, click, and go do your research. It facilitates rigorous research by allowing full machine access to full texts.

Simple grep searches for ‘NHMUK’ & ‘BMNH [A-Z0-9][0-9]’, two of the commonest forms by which specimens are cited, reveal many thousands of possible specimen mentions in the PMC OA subset, which I must now look through to clean up & link up. In terms of journals, these ‘hits’ in the PMC OA subset come from (in no particular order): PLOS ONE, Parasites & Vectors, PeerJ, ZooKeys, Toxins, Zoo J Linn Soc, Parasite, Frontiers in Zoology, Ecology & Evolution, BMC Research Notes, Biology Letters, BMC Evolutionary Biology, Aquatic Biosystems, BMC Biology, Molecular Ecology, Journal of Insect Science, Nucleic Acids Research and more…!
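As a rough illustration of those two search patterns, here is a minimal Python sketch. The regular expression mirrors the grep patterns above; the trailing part that extends the match over the rest of the catalogue number, the function name and the sample sentence are all my own assumptions, and a real run would need far more clean-up than this.

```python
import re

# 'NHMUK' and 'BMNH [A-Z0-9][0-9]' are the two grep patterns from the post.
# The trailing "[0-9.]*" is an added assumption so that the match extends
# over the remaining digits (and dots) of the catalogue number.
SPECIMEN_RE = re.compile(r"NHMUK|BMNH [A-Z0-9][0-9][0-9.]*")


def find_specimen_mentions(text):
    """Return every candidate specimen mention found in `text`, in order."""
    return SPECIMEN_RE.findall(text)


# Hypothetical sentence using two catalogue numbers discussed in this post.
sample = "The holotype BMNH 37001 and the skull BMNH R4947 were examined."
print(find_specimen_mentions(sample))
```

Note that, like the grep pattern it copies, this misses citation forms such as “BMNH RUB 17”, where the second character after the space is a letter rather than a digit.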

Specimen “BMNH” is a great example to look up / link up on the NHM Open Data Portal: the catalogue record has 7 associated images openly available under CC BY, so I can liven up this post by including an image of the specimen (below)! I found this specimen used in a PLOS ONE paper: Walmsley et al. (2013) Why the Long Face? The Mechanics of Mandibular Symphysis Proportions in Crocodiles. doi: 10.1371/journal.pone.0053873 (in the text caption for figure 1 to be precise).

© The Trustees of the Natural History Museum, London. Licensed for reuse under CC BY 4.0. Source.



Questions Arising

How to find and extract mentions of NHM, London specimens in papers published in Science, Nature & PNAS? There are sure to be many! I’m assuming the last 15 years’ worth of research published in these journals will be difficult to scrape – they would be quite likely to block my IP address if I tried to. Furthermore, all the actual science is typically buried in supplementary file PDFs in these journals, not in the ‘main’ short article. Will Science, Nature & PNAS let me download all their supp material from the last 15 years? Is this facilitated at all? How do people actually do rigorous research when the contents of supplementary data files published in these journals are so undiscoverable & inaccessible to search?


It’s clear to me there are many separate divisions when it comes to discoverability of research. There’s the divide between open access (highly discoverable & searchable) and subscription access (less discoverable, less searchable, depending upon publisher-restrictions). There’s also the divide between the ‘paper’ (more searchable) and ‘supplementary materials’ (less easily searchable). Finally, there’s also the divide between textual and non-textual media: a huge amount of knowledge in the scientific literature is trapped in non-textual forms such as figure images which simply aren’t instantly searchable by textual methods (figure captions DO NOT contain all of the information of the figure image! Also, OCR is time-consuming and error-prone, especially given the heterogeneity of fonts and word orientations in most figures). For example, looking across thousands of papers with phylogenetic analyses published in the journal IJSEM, 95% of the taxa / GenBank accessions used in them are only mentioned in the figure image, nowhere else in the paper or supplementary materials as text! This needs to change.


As should be obvious by now, this is a very preliminary post, just to let people know what I’m doing and what I’m thinking. In my next post I’ll detail some of the subscription access journals I’ve been text mining for specimens, and the barriers I’ve encountered when trying to do so.


Bonus question: How should I publish this annotation data?

Easiest would be to release all annotations as a .csv on the NHM open data portal with 3 columns, where each column mimics ‘subject’ ‘predicate’ ‘object’ notation: Specimen, “is mentioned in”, Article DOI.
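For what it’s worth, the three-column option could be as simple as the Python sketch below. The column header names are my own assumptions, and the example row is a placeholder rather than a real specimen/DOI pair.

```python
import csv
import io


def annotations_to_csv(annotations):
    """Serialise (specimen, predicate, article DOI) triples as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Header names are assumptions, chosen to mirror subject/predicate/object.
    writer.writerow(["specimen", "predicate", "article_doi"])
    writer.writerows(annotations)
    return buffer.getvalue()


# Placeholder row for illustration only: not a real catalogue number or DOI.
rows = [("BMNH EXAMPLE-1", "is mentioned in", "10.1234/example.doi")]
print(annotations_to_csv(rows))
```

Keeping the predicate as its own column, even though it is the same on every row for now, leaves room to add other relationships later without changing the file layout.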

But if I wanted to publish something a little better & a little more formal, what kind of RDF vocabulary can I use to describe “occurs in” or “is mentioned in”? What would be the most useful format to publish this data in so that it can be re-used and extended to become part of the biodiversity knowledge graph and have lasting value?

Making a journal scraper

May 13th, 2015 | Posted by rmounce in Content Mining

Yesterday, I made a journal scraper for the International Journal of Systematic and Evolutionary Microbiology (IJSEM).

Fortunately, Richard Smith-Unna and the ContentMine team have done most of the hard work in creating the general framework with quickscrape (open-source and available on github); I just had to modify the available journal-scrapers to work with IJSEM.

How did I do it?

Find an open access article in the target journal e.g. James et al (2015) Kazachstania yasuniensis sp. nov., an ascomycetous yeast species found in mainland Ecuador and on the Galápagos

In your browser, view the HTML source of the full text page; in the Chrome/Chromium browser the keyboard shortcut to do this is Ctrl-U. You should then see something like this, perhaps with less funky highlighting colours:

I based my IJSEM scraper on the existing set of scraper definitions for eLife because I know both journals use similar underlying technology to create their webpages.

The first bit I clearly had to modify was the extraction of publisher. In the eLife scraper this works:

but at IJSEM that information isn’t specified with ‘citation_publisher’, instead it’s tagged as ‘DC.Publisher’ so I modified the element to reflect that:

The license and copyright information extraction is even more different between eLife and IJSEM, here’s the correct scraper for the former:

and here’s how I changed it to extract that information from IJSEM pages:

The XPath needed is completely different. The information is inside a div, not a meta tag.
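To make the publisher example concrete, here is a small Python sketch of why a selector written for ‘citation_publisher’ finds nothing on a page that uses ‘DC.Publisher’. This is not a quickscrape scraper definition, just a standard-library illustration; the two HTML snippets and the publisher values in them are made-up placeholders.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect the content attribute of every <meta name="..."> tag."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name is not None:
            self.meta[name] = attributes.get("content")


def extract_publisher(html, meta_name):
    """Return the content of the named meta tag, or None if it is absent."""
    collector = MetaTagCollector()
    collector.feed(html)
    return collector.meta.get(meta_name)


# Placeholder pages: only the meta tag names come from the post.
ELIFE_STYLE = '<html><head><meta name="citation_publisher" content="Example Publisher A"></head></html>'
IJSEM_STYLE = '<html><head><meta name="DC.Publisher" content="Example Publisher B"></head></html>'

print(extract_publisher(ELIFE_STYLE, "citation_publisher"))
print(extract_publisher(IJSEM_STYLE, "citation_publisher"))
print(extract_publisher(IJSEM_STYLE, "DC.Publisher"))
```

The same idea carries over to the licence and copyright fields, except that there the value lives in the text of a div rather than in a meta tag attribute, so the selector has to change shape and not just name.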


Hardest of all though were the full size figures and the supplementary materials files – they’re not directly linked from the full text HTML page which is rather annoying. Richard had to help me out with these by creating “followables”:

In his words:

any element can ‘follow’ any other element in the elements array, just by adding the key-value pair "follow": "element_name" to the element that does the following. If you want to follow an element, but don’t want the followed element to be included in the results, you add it to a followables array instead of the elements array. The followed array must capture a URL.



The bottom-line is, it might look complicated initially, but actually it’s not that hard to write a fully-functioning  journal scraper definition, for use with quickscrape. I’m off to go and create one for Taylor & Francis journals now :)


Wouldn’t it be nice if all scholarly journals presented their content on the web in the same way, so we didn’t have to write a thousand different scrapers to download it? That’d be just too helpful wouldn’t it?



[Update: I’ve submitted this idea as a FORCE11 £1K Challenge research proposal 2015-01-13. I may be unemployed from April 2015 onwards (unsolicited job offers welcome!), so I certainly might find myself with plenty of time on my hands to properly get this done…!]

Inspired by something I heard Stephen Curry say recently, and with a little bit of help from Jo McIntyre I’ve started a project to compare EuropePMC author manuscripts with their publisher-made (mangled?) ‘version of record’ twins.

How different are author manuscripts from the publisher version of record? Or, to put it another way, what value do publishers add to each manuscript? With the aggregation & linkage provided by EuropePMC – an excellent service – we can rigorously test this.


In this blog post I’ll go through one paper I chose at random from EuropePMC:

Sinha, N., Manohar, S., and Husain, M. 2013. Impulsivity and apathy in Parkinson’s disease. J Neuropsychol 7:255-283. doi: 10.1111/jnp.12013 (publisher version) PMCID: PMC3836240 (EuropePMC version)


A quick & dirty analysis with a simple tool that’s easy to use & available to everyone:

pdftotext -layout     (you’re welcome to suggest a better method by the way, I like hacking PDFs)

(P) = Publisher-version , (A) = Author-version

Manual Post-processing – remove the header and footer crud from each e.g. “262 Nihal Sinha et al.” (P) and “J Neuropsychol. Author manuscript; available in PMC 2013 November 21.” (A)

Automatic Post-processing – I’m not interested in numbers, punctuation or words of 3 letters or fewer, so I applied this bash one-liner (the \n in the replacement needs GNU sed):

strings "$inputfile" | tr '[A-Z]' '[a-z]' | sed 's/[[:punct:]]/ /g' | sed 's/[[:digit:]]/ /g' | sed 's/ /\n/g' | awk 'length > 3' | sort | uniq -c | sort -nr > "$outputfile"

Then I just manually diff’d the resulting word lists – there’s so little difference it’s easy for this particular pair.
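If the manual diff ever becomes tedious, the same comparison can be roughly sketched in Python. This is not the method I used above, just an approximate equivalent of the one-liner plus the diff step; the function names are hypothetical and the two sample strings are cut down from the correspondence lines of this paper.

```python
import re
from collections import Counter


def word_counts(text):
    """Lower-case, drop digits and punctuation, keep words longer than 3 letters."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(word for word in words if len(word) > 3)


def compare_counts(author_text, publisher_text):
    """Return (words over-represented in A, words over-represented in P)."""
    author = word_counts(author_text)
    publisher = word_counts(publisher_text)
    # Counter subtraction keeps only the positive differences.
    return author - publisher, publisher - author


author_text = "Correspondence should be addressed to Nuffield Department"
publisher_text = (
    "Correspondence should be addressed to Masud Husain, "
    "Nuffield Department (e-mail"
)
only_author, only_publisher = compare_counts(author_text, publisher_text)
print(only_author)
print(only_publisher)
```

Because it compares word frequencies and ignores word order, this only shows which words were added or removed, not where in the text the change happened.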



The correspondence line changed slightly from this in the author version:

Correspondence should be addressed to Nuffield Department of Clinical Neurosciences and Department Experimental Psychology, Oxford University, Oxford OX3 9DU, UK ( . (A)

To this in the publisher version (I’ve added bold-face to highlight the changes):

Correspondence should be addressed to Masud Husain, Nuffield Department of Clinical Neurosciences and Department Experimental Psychology, Oxford University, Oxford OX3 9DU, UK (e-mail: (P)


Reference styling has been changed. Why I don’t know, seems a completely pointless change. Either style seems perfectly functional to me tbh:

Drijgers RL, Dujardin K, Reijnders JSAM, Defebvre L, Leentjens AFG. Validation of diagnostic criteria for apathy in Parkinson’s disease. Parkinsonism & Related Disorders. 2010; 16:656–660. doi:10.1016/j.parkreldis.2010.08.015. [PubMed: 20864380] (A)

to this in the publisher version:

Drijgers, R. L., Dujardin, K., Reijnders, J. S. A. M., Defebvre, L., & Leentjens, A. F. G. (2010). Validation of diagnostic criteria for apathy in Parkinson’s disease. Parkinsonism & Related Disorders, 16, 656–660. doi:10.1016/j.parkreldis.2010.08.015 (P)

In the publisher-version only (P) “Continued” has been added below some tables to acknowledge that they overflow onto the next page. Arguably the publisher has made the tables worse as they’ve put them sideways (landscape) so they now overflow onto other pages. In the author-version (A) they are portrait-orientated and so each fits entirely on one page.


Finally, and most intriguingly, some of the figure-text comes out only in the publisher-version (P). In the author-version (A) the figure text is entirely image pixels, not copyable text. Yet the publisher version has introduced some clearly imperfect figure text. Look closely and you’ll see in some places e.g. “Dyskinetic state” of figure 2 c) in (P), the ‘ti’ has been ligatured and is copied out as a theta symbol:

DyskineƟc state
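Artefacts like this one can be hunted for automatically. The sketch below is only a heuristic I am assuming would be useful here: it flags any word in the extracted text that mixes plain ASCII letters with a non-ASCII letter. The function name is hypothetical, and legitimate words with accented characters (author surnames, for instance) will be flagged too, so the output needs checking by eye.

```python
def suspect_words(text):
    """Return words that mix ASCII letters with non-ASCII letters."""
    flagged = []
    for word in text.split():
        has_ascii_letter = any(ch.isascii() and ch.isalpha() for ch in word)
        has_non_ascii_letter = any(not ch.isascii() and ch.isalpha() for ch in word)
        if has_ascii_letter and has_non_ascii_letter:
            flagged.append(word)
    return flagged


# "\u019f" is the character that replaced the 'ti' ligature in the example above.
print(suspect_words("Dyskine\u019fc state"))
```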




I don’t know about you, but for this particular article, it doesn’t seem like the publisher has really done all that much aside from add their own header & footer material, some copyright stamps & their journal logo – oh, and ‘organizing peer-review’. How much do we pay academic publishers for these services? Billions? Is it worth it?

I plan to sample at least 100 ‘twinned’ manuscript-copies and see what the average difference is between author-manuscripts and publisher-versions. If the above is typical of most then this will be really bad news for the legacy academic journal publishers… Watch this space!


Thoughts or comments as to how to improve the method, or relevant papers to read on this subject are welcome. Collaboration welcome too – this is an activity that scales well between collaborators.

I’m proud to announce an interesting public output from my BBSRC-funded postdoc project:
PLUTo: Phyloinformatic Literature Unlocking Tools. Software for making published phyloinformatic data discoverable, open, and reusable


Screenshot of some of the PLOS ONE phylogeny figure collection on Flickr















I’ve made openly available my first-pass filter of PLOS ONE phylogeny figures (I’m not in any way claiming this is *all* of them).

This curated & tagged image collection is on Flickr for easy browsing:

As well as on Github for version control, open archiving, and collaboration (I have remote collaborators):

(Github doesn’t like repositories over 1GB so I’ve had to split-up the content between 4 separate repositories)



The aim of the PLUTo project is to re-extract & liberate phylogenetic data & associated metadata from the research literature. Sadly, only ~4% of modern published phylogenetic analysis studies make their underlying data available. Another study finds that if you ask the authors for this data, only 16% will be kind enough to reply with the requested data!

This particular data type is a cornerstone of modern evolutionary biology. You’ll find phylogenetic analyses across a whole host of journal subjects – medical, ecological, natural history, palaeontology… There are also many different ways in which this data can be re-used e.g. supertrees  & comparative cladistics. Not to mention, simple validation studies &/or analyses which extend-upon or map new data on to a phylogeny. It’s really useful data and we should be archiving it for future re-use and re-analysis. To my great delight, this is what I’m being paid to attempt to do for my first postdoc; on a grant I co-wrote – finding & liberating phylogenetic data for everyone!




  • PLOS ONE is a BOAI-compliant open access journal that publishes most articles under CC BY, with a few under CC0.
    • This means I can openly re-publish figures online (provided sufficient attribution is given) — no need to worry about DMCA takedown notices or ‘getting sued’! This makes the process of research much easier. Private, non-public, access-restricted repositories for collaboration are a hassle I’d rather do without.
  • It’s a high-volume ‘megajournal’ publishing ~200 articles per day, many of which include phylogenetic analyses.
    • Thus it’s worthwhile establishing a regular daily or weekly method for parsing-out phylogenetic tree figures from this journal
  • Killer feature: as far as I know, PLOS are the only publisher to embed rich metadata inside their figure image files.
    • This makes satisfying the CC BY licence trivially easy — sufficient attribution metadata is already embedded in the file. Just ensure that wherever you’re uploading the file to doesn’t wipe this embedded data, hence why I chose Flickr as my initial upload platform.


What does this enable or make easier?


On its own, this collection doesn’t do much, as this is still an early stage – but it gives us an important insight into the prevalence of certain types of visual display-style that researchers are using:

‘radial’ phylogenies

Source: Zerillo et al 2013 PLOS ONE. Carbohydrate-Active Enzymes in Pythium and Their Role in Plant Cell Wall and Storage Polysaccharide Degradation

‘geophylogeny’ (phylogeny displayed relative to a map of some sort, 2D or 3D)

Source: Guo et al 2012 PLOS ONE. Evolution and Biogeography of the Slipper Orchids: Eocene Vicariance of the Conduplicate Genera in the Old and New World Tropics

‘timescaled’ (phylogenies where the branch lengths are proportional to units of time or geological periods)

Source: Pol et al 2014 PLOS ONE. A New Notosuchian from the Late Cretaceous of Brazil and the Phylogeny of Advanced Notosuchians

Source: McDowell et al 2013 PLOS ONE. The Opportunistic Pathogen Propionibacterium acnes: Insights into Typing, Human Disease, Clonal Diversification and CAMP Factor Evolution

Arguably it also facilitates complex searches for specific types of phylogeny

e.g. analyses using cytochrome b
(you could use PLOS’s API to do this, particularly their figure/table caption search field — but you’d get a lot of false positives — this is an expert-curated collection that has filtered-out non-phylo figures)
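For comparison, the caption-search route might look something like the Python sketch below. It only builds the query URL and does not send it. The endpoint and the `figure_table_caption` field name are written from my memory of PLOS’s search API and should be treated as assumptions to check against their documentation; the function name and the other parameters are purely illustrative.

```python
from urllib.parse import urlencode

# Assumed endpoint and field name: verify both against PLOS's API documentation.
PLOS_SEARCH_URL = "http://api.plos.org/search"
CAPTION_FIELD = "figure_table_caption"


def caption_search_url(phrase, rows=10):
    """Build a search URL for an exact phrase in figure/table captions."""
    query = f'{CAPTION_FIELD}:"{phrase}"'
    params = {"q": query, "fl": "id,title", "rows": rows, "wt": "json"}
    return f"{PLOS_SEARCH_URL}?{urlencode(params)}"


print(caption_search_url("cytochrome b"))
```

Anything returned by a query like this would still need filtering by hand, which is the whole reason for keeping a curated collection.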

In my initial roadmap, the plan is to do PLOS ONE, the other PLOS journals, then BMC journals, then possibly Zootaxa & Phytotaxa (Magnolia Press). There will be a Github-based website for the project soon, lots still to do…!


Want to know more / collaborate / critique ?


I’ve got an accepted lightning talk at iEvoBio in Raleigh, NC later this year about the PLUTo project.

As well as an accepted lightning talk at the Bioinformatics Open Source Conference (BOSC) in Boston, MA.

Elsewise, contact me via twitter @rmounce , the comment section on this blog post, or email ross dot mounce <at> gmail dot com

Setting-up AMI2 on Windows

October 6th, 2013 | Posted by rmounce in Content Mining

I’ve been rather preoccupied in the last few months hence the lack of blog posts. (Apologies!)

Here’s a quick recap of some things I’ve done since July:

  • Got married in China (in September)
  • Successfully proposed that the Systematics Association (of which I’m a council member) should sign DORA
  • Gave an invited talk on open science at an INNGE workshop at INTECOL 2013
  • Completed and handed-in my PhD thesis last Thursday!

So yeah, I really didn’t have time to blog until now.

But now that my PhD thesis is handed in, I can concentrate on the next step… Matthew Wills, Peter Murray-Rust and I have an approved BBSRC grant to work on further developing AMI2 to extract phylogenetic trees from the literature (born-digital PDFs).

At the moment it is in alpha stage so it doesn’t extract trees perfectly – it needs work. But in case you might want to try it out I thought I’d use this post to explain how to get a test development version of it running on Windows (I don’t usually use Windows myself; I much prefer Linux). These notes are thus as much an open notebook science ‘aide memoire’ for myself as they are instructions for others!

Dependencies and IDE:

1.) You’ll need Java JDK, Eclipse, Mercurial and Maven for starters.

If you haven’t got this setup already you may need to set your environment variables e.g. JAVA_HOME

2.) Within Eclipse you need to install the m2e (maven integration) plugin

(from within the Eclipse GUI) click ‘Help’ -> Install New Software -> All available sites (from the dropdown) -> select m2e


3.) Using mercurial, clone the AMI2 suite to a clean workspace folder. The suite includes:


[euclid-dev itself has many dependencies which are indicated in its POM file which you shouldn’t need to worry about – they should be pulled-in automatically. These include:  commons-io, log4j, xom, joda and junit.]

4.) From within the Eclipse GUI import your workspace of AMI2 tools:

click ‘File’ -> Import -> Maven -> select ‘Existing Maven Projects’ -> Next -> select your workspace


5.) Test if it works. In the package explorer side-pane window you should now see folders corresponding to the six AMI2 tools listed above.

Right-click on svg2xml-dev -> select ‘Run-as’ -> JUnit Test

and sit back and watch the test run in the console at the bottom of the Eclipse GUI.

(The tests are a little slow, have patience, it may take a few minutes – it took me 175 seconds)

To view the results, in the package explorer pane, navigate inside the svg2xml-dev document tree into /target/output/multiple-1471-2148-11-312 and click on TEXT.0 to see what the text-extraction looks like. You should see something like this below (note it successfully gets italics, bold, and superscripts)


Gene conversion and purifying selection shape nucleotide variation in gibbon L/M opsin genes

Tomohide Hiwatashi 1 , Akichika Mikami 2,8 , Takafumi Katsumura 1 , Bambang Suryobroto 3 , Dyah Perwitasari-Farajallah 3,4 , Suchinda Malaivijitnond 5 , Boripat Siriaroonrat 6 , Hiroki Oota 1,9 , Shunji Goto 7,10 and Shoji Kawamura 1*


Abstract Background: Routine trichromatic color vision is a characteristic feature of catarrhines (humans, apes and Old World monkeys). This is enabled by L and M opsin genes arrayed on the X chromosome and an autosomal S opsin gene. In non-human catarrhines, genetic variation affecting the color vision phenotype is reported to be absent or rare in both L and M opsin genes, despite the suggestion that gene conversion has homogenized the two genes. However, nucleotide variation of both introns and exons among catarrhines has only been examined in detail for the L opsin gene of humans and chimpanzees. In the present study, we examined the nucleotide variation of gibbon (Catarrhini, Hylobatidae) L and M opsin genes. Specifically, we focused on the 3.6~3.9-kb region that encompasses the centrally located exon 3 through exon 5, which encode the amino acid sites functional for the spectral tuning of the genes.

Results: Among 152 individuals representing three genera ( Hylobates ,  Nomascus and  Symphalangus ), all had both L and M opsin genes and no L/M hybrid genes. Among 94 individuals subjected to the detailed DNA sequencing, the nucleotide divergence between L and M opsin genes in the exons was significantly higher than the divergence in introns in each species. The ratio of the inter-LM divergence to the intra-L/M polymorphism was significantly lower in the introns than that in synonymous sites. When we reconstructed the phylogenetic tree using the exon sequences, the L/M gene duplication was placed in the common ancestor of catarrhines, whereas when intron sequences were used, the gene duplications appeared multiple times in different species. Using the GENECONV program, we also detected that tracts of gene conversions between L and M opsin genes occurred mostly within the intron regions.

Conclusions: These results indicate the historical accumulation of gene conversions between L and M opsin genes in the introns in gibbons. Our study provides further support for the homogenizing role of gene conversion between the L and M opsin genes and for the purifying selection against such homogenization in the central exons to maintain the spectral difference between L and M opsins in non-human catarrhines.


Background In catarrhine primates (humans, apes and Old World monkeys) the L and M opsin genes are closely juxta-posed on the X chromosome and, in combination with the autosomal S opsin gene, enable routinely trichro-matic color vision [1,2]. The L and M opsin genes have a close evolutionary relationship and are highly similar in nucleotide sequence (~96% identity). Among 15

* Correspondence: 1 Department of Integrated Biosciences, Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa 277-8562, Japan Full list of author information is available at the end of the article

amino acid differences between the human L and M opsin genes, three account for the main shifts in spectral sensitivities and tuning [3-9]. The organization of the L and M opsin genes among humans is known to be variable and includes the absence of an L or M opsin gene or the presence of L/M hybrid genes with an intermediate spectral sensitivity. A high incidence (approximately 3-8%) of color vision   deficien-cies in males results as a consequence [10].
Hiwatashi et al . BMC Evolutionary Biology 2011, 11 :312

If you’d like to try your own PDFs with it you’ll need to do two things:

A.) place the PDF(s) to be tested within the folder:    svg2xml-dev/src/test/resources/pdfs

B.) edit the file:    svg2xml-dev/src/test/java/org/xmlcml/svg2xml/pdf/

so that

new PDFAnalyzer().analyzePDFFile(new File(" …

points at your file(s).


You can then right-click ‘multipletest’ from within and select Run As -> JUnit Test


We’re working with BMC journal content for the moment, and when we perfect it on this, we will expand our scope to include subscription access content too.



In the last 2 weeks I’ve given talks in Brussels & Amsterdam.

The first one was given during a European Commission (Brussels) working group meeting on Text & Data Mining. There were perhaps only ~30 people in the room for that.

The second presentation was given just a few days ago at Beyond The PDF 2 (#btpdf2) in Amsterdam.

I uploaded the slides from both of these talks to Slideshare just before or after I gave each talk to help maximize their impact. Since then they’ve had nearly 1000 views according to my Slideshare analytics dashboard.

It’s not just the view count I’m impressed with. The global reach is pretty cool too (see below, created with BatchGeo):

View My Slideshare Impact 08/Mar/2013 to 22/Mar/2013 in a full screen map

Now obviously, these view counts don’t mean that every viewer went through all the slides, and a minority of the views are bots crawling the web, but I’m still pretty pleased. Imagine if I hadn’t uploaded my Content Mining presentation to the public web? I would have travelled all the way to Brussels and back again (in the same day!) for the benefit of *just* ~30 people (albeit rather important people!). Instead, over 800 people have had the opportunity to view my slides, from all over the world (although, admittedly mostly just US & Europe).

The moral of this short story: upload your slides & tweet about them whenever you give a talk!
You may not appreciate just how big your potential audience could be. Something academics sceptical of Open Access should perhaps think about?

Particular thanks should go to @openscience for helping disseminate these slides far and wide. During just a 60 minute period, upon first release, thanks to @openscience and others my PDF metadata slidedeck got over 100 views this Wednesday!

Next step… must work on getting these stats into an ImpactStory widget for the next version of my CV!

So a week ago, I investigated publisher-produced Version of Record PDFs with pdfinfo and the results were very disappointing. Lots of missing metadata was found and one could not reliably identify most of these PDFs from metadata alone, let alone extract particular fields of interest.

But Rod Page kindly alerted me to the fact that I might be using the wrong tool for this investigation. So at his suggestion I’ve tried again to extract metadata from the exact same set of PDFs as last time…

Only this time I’ll be using exiftool version 9.10.

This time I’ve put the full raw metadata output from exiftool on figshare for each and every PDF file, just to really prove the point, reproducible research and all. I’d love to post the corresponding PDFs too, but sadly many of them are not Open Access, which prevents me from uploading them to a public space. **Insert timely comment here about how closed access publications stifle effective research practices…**

Exiftool is really simple to use. You just need to type:
exiftool NameOfPDF.pdf
to get a human-readable exhaustive output of all possible metadata.

exiftool -b -XMP NameOfPDF.pdf
to get XML-structured metadata. I could only extract this from 56 of the 69 PDF files. The data output from this for those 56 PDFs is available as a separate fileset on figshare here.

Finally, if you want to test a whole bunch of PDF files in your working directory I’ve made a simple shell script that loops through all PDFs in your working directory, available here (oops, it’s not data, perhaps I should have put that on github instead?). [I’m sure many readers will be able to create a simple bash loop themselves but just for those that don’t…]
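
For readers who would rather not download the script, a loop along these lines would do the job. This is a minimal sketch rather than the exact script linked above: the function name `dump_pdf_metadata` and the `.txt`/`.xmp` output naming are assumptions made for illustration.

```shell
# Sketch: run both exiftool commands from this post over every PDF in a directory.
# dump_pdf_metadata and the .txt/.xmp output names are illustrative assumptions.
dump_pdf_metadata() {
    local dir="${1:-.}"
    local pdf base
    for pdf in "$dir"/*.pdf; do
        # Skip the unexpanded pattern when the directory holds no PDFs
        [ -e "$pdf" ] || continue
        base="${pdf%.pdf}"
        # Human-readable dump of all metadata fields
        exiftool "$pdf" > "$base.txt"
        # Raw XMP packet; the redirect leaves an empty file if nothing is written
        exiftool -b -XMP "$pdf" > "$base.xmp"
    done
}
```

Calling `dump_pdf_metadata .` from the directory holding the PDFs should leave one `.txt` and one `.xmp` file next to each PDF.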


I’m assuming that the reason exiftool -b -XMP failed on 13 of those PDFs is because they have no embedded XMP metadata – an empty (zero-byte sized) file is created for these. This is an assumption though… I notice that those 13 exactly correspond with all the 13 that were produced with iText. I checked the website and I’m pretty sure iText 2.x and up can embed XMP metadata, it’s just whether the publishers have bothered to use & apply this functionality.
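
Picking out the zero-byte outputs is easy to automate. The sketch below assumes the XMP output for each PDF was saved as a `.xmp` file with the same base name (my naming convention for this example, not something exiftool does by itself); `list_missing_xmp` is a hypothetical helper name.

```shell
# Sketch: print the PDF name for every zero-byte .xmp file in a directory.
# Assumes each NameOfPDF.pdf has a matching NameOfPDF.xmp output file.
list_missing_xmp() {
    local dir="${1:-.}"
    local xmp
    for xmp in "$dir"/*.xmp; do
        # Skip the unexpanded pattern when there are no .xmp files
        [ -e "$xmp" ] || continue
        # -s is true only for files larger than zero bytes
        [ -s "$xmp" ] || printf '%s\n' "$(basename "$xmp" .xmp).pdf"
    done
}
```

Each listed PDF can then be checked by hand with `exiftool -Producer NameOfPDF.pdf` to see which software produced it.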

So if I’m right, neither Taylor & Francis, BRILL, nor Acta Palaeontologica Polonica embed XMP metadata (at all!) in their PDFs. The alternative explanation is that the XMP metadata is in there but exiftool for whatever reason can’t read/parse it from iText-produced PDFs. I find this an unlikely alternative explanation though, tbh.
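
One rough way to test the alternative explanation without relying on exiftool is to search the raw bytes of each PDF for the `<?xpacket begin` marker that conventionally opens an XMP packet. This is only a sketch, and it assumes the packet is stored uncompressed and unencrypted with the usual packet wrapper (common, but not guaranteed); `raw_xmp_status` is a hypothetical helper name.

```shell
# Sketch: look for the XMP packet header in the raw bytes of a PDF.
# Assumption: the packet is uncompressed and unencrypted and uses the
# customary "<?xpacket begin" wrapper; absence here is a hint, not proof.
raw_xmp_status() {
    # -F matches the marker as a fixed string, -q suppresses output
    if LC_ALL=C grep -q -F -- '<?xpacket begin' "$1"; then
        echo "present"
    else
        echo "absent"
    fi
}
```

If `raw_xmp_status` reports "present" for a PDF where `exiftool -b -XMP` produced an empty file, that would point to a parsing problem rather than missing metadata.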

Elsevier have superior XMP metadata to everyone else by the looks of it, but Elsevier aside the metadata is still very poor, so my conclusions from last week’s post still stand I think.

Most of the others do contain metadata (of some sort) but by and large it’s rather poor. I need to get some other work done on Monday so I’m afraid this is where I’m going to leave this for now. But I hope I’ve made the point.

Further angles to explore

Interestingly, Brian Kelly has taken this in a slightly different direction and looked at the metadata of PDFs in institutional repositories. I hadn’t realised this but apparently some institutional repositories (IRs) add cover pages to most deposits as a matter of course. If this is done without care for the embedded metadata, the original metadata can be wiped and/or replaced with newer (less informative) metadata. Not to mention that cover pages are completely unnecessary: all the information on a cover page is exactly the kind of stuff that should be put in embedded metadata! No need to waste time and space by putting that info as the first page. JSTOR does this too (cover pages) and it annoys the hell out of me.

After some excellent chat on Twitter about this IR angle I’ve discovered that UKOLN, based here on campus at Bath, have also done some interesting research in this area, in particular the FixRep project which is described in more detail here. CrossRef Labs’ pdfmark tool also looks like something of interest for fixing PDFs with poor quality metadata. I’ve got this installed/compiled from the source on github but haven’t tried it out yet. It would be interesting to see the difference it makes – a before and after comparison of metadata to see what we’re missing… But why should we fix a problem that shouldn’t exist in the first place? Publishers are the point of origin for this. It’s their job to be the first to publish the Version of Record. They should provide the highest level of metadata possible IMO.


Why would publishers add metadata?

Because their customers – libraries, governments, research funders (in the case of Open Access PDFs) – should demand it. A pipe dream perhaps but that’s my $.02. I would ask for a refund if I downloaded MP3s from iTunes/Amazon MP3 with insufficient embedded metadata. Why not the same principle for electronically published PDFs?


PS Apologies for some of the very cryptic filenames in the metadata uploads on figshare. You’ll have to cross-match with this list here or the spreadsheet I uploaded last week to work out which metadata file corresponds to which PDF/Bibliographic Data record/Publisher.

| Publisher | Identifier | Journal | Contains embedded XMP metadata? | Filename |
| --- | --- | --- | --- | --- |
| American Association for the Advancement of Science | Ezard2011 | Science | yes? | ezard_11_interplay_759293.pdf |
| American Association for the Advancement of Science | Nagalingum2011 | Science | yes? | nagalingum_11_recent_719133.pdf |
| American Association for the Advancement of Science | Rowe2011 | Science | yes? | Science-2011-Rowe-955-7.pdf |
| Blackwell Publishing Ltd | Burks2011 | Cladistics | yes? | burks_11_combined_694888.pdf |
| Blackwell Publishing Ltd | Janies2011 | Cladistics | yes? | janies_11_supramap_779773.pdf |
| Blackwell Publishing Ltd | Simmons2011 | Cladistics | yes? | simmons_11_deterministic_779537.pdf |
| BRILL | Barbosa2011 | Insect Systematics & Evolution | no | barbosa_11_phylogeny_779910.pdf |
| BRILL | Dellape2011 | Insect Systematics & Evolution | no | dellape_11_phylogenetic_779909.pdf |
| Cambridge Journals Online | Knoll2010 | Geological Magazine | yes? | knoll_10_primitive_475553.pdf |
| Cambridge Journals Online | Saucede2007 | Geological Magazine | yes? | thomas_saucegraved_07_phylogeny_506869.pdf |
| CSIRO | Chamorro2011 | Invertebrate Systematics | yes? | chamorro_11_phylogeny_780467.pdf |
| CSIRO | Daugeron2011 | Invertebrate Systematics | yes? | daugeron_11_phylogenetic_780466.pdf |
| CSIRO | Johnson2011 | Invertebrate Systematics | yes? | johnson_11_collaborative_750540.pdf |
| Elsevier | Lane2011 | Molecular Phylogenetics and Evolution | yes | E3-1-s2.0-S1055790311001448-main.pdf |
| Elsevier | Cunha2011 | Molecular Phylogenetics and Evolution | yes | E2-1-s2.0-S1055790311001680-main.pdf |
| Elsevier | Spribille2011 | Molecular Phylogenetics and Evolution | yes | E1-1-s2.0-S1055790311001606-main.pdf |
| Frontiers In | Horn2011 | Frontiers in Neuroscience | yes? | fnins-05-00088.pdf |
| Frontiers In | Ogura2011 | Frontiers in Neuroscience | yes? | fnins-05-00091.pdf |
| Frontiers In | Tsagareli2011 | Frontiers in Neuroscience | yes? | fnins-05-00092.pdf |
| Hindawi | Diniz2012 | Psyche: A Journal of Entomology | yes? | 79139500.pdf |
| Hindawi | Restrepo2012 | Psyche: A Journal of Entomology | yes? | 516419.pdf |
| Hindawi | Savopoulou2012 | Psyche: A Journal of Entomology | yes? | 167420.pdf |
| Institute of Paleobiology, Polish Academy of Sciences | Amson2011 | Acta Palaeontologica Polonica | no | amson_11_affinities_666987.pdf |
| Institute of Paleobiology, Polish Academy of Sciences | Edgecombe2011 | Acta Palaeontologica Polonica | no | edgecombe_11_new_666988.pdf |
| Institute of Paleobiology, Polish Academy of Sciences | Williamson2011 | Acta Palaeontologica Polonica | no | app2E20092E0147.pdf |
| Magnolia Press | Agiuar2011 | Zootaxa | yes? | zt02846p098.pdf |
| Magnolia Press | Ebach2011 | Zootaxa | yes? | ebach_11_taxonomy_599972.pdf |
| Magnolia Press | Nelson2011 | Zootaxa | yes? | nelson_11_resemblance_688762.pdf |
| National Academy of Sciences | Casanovas2011 | Proceedings of the National Academy of Sciences | yes? | casanovas-vilar_11_updated_644658.pdf |
| National Academy of Sciences | Goswami2011 | Proceedings of the National Academy of Sciences | yes? | goswami_11_radiation_814757.pdf |
| National Academy of Sciences | Thorne2011 | Proceedings of the National Academy of Sciences | yes? | thorne_11_resetting_654055.pdf |
| Nature Publishing Group | Meng2011 | Nature | yes? | meng_11_transitional_644647.pdf |
| Nature Publishing Group | Rougier2011 | Nature | yes? | rougier_11_highly_720202.pdf |
| Nature Publishing Group | Venditti2011 | Nature | yes? | venditti_11_multiple_779840.pdf |
| NRC Research Press | CruzadoCaballero2010 | Canadian Journal of Earth Sciences | yes? | 650000.pdf |
| NRC Research Press | Druckenmiller2010 | Canadian Journal of Earth Sciences | yes? | 80000000c5.pdf |
| NRC Research Press | Mazierski2010 | Canadian Journal of Earth Sciences | yes? | mazierski_10_description_577223.pdf |
| NRC Research Press | Modesto2009 | Canadian Journal of Earth Sciences | yes? | modesto_09_new_577201.pdf |
| NRC Research Press | Parsons2009 | Canadian Journal of Earth Sciences | yes? | parsons_09_new_575744.pdf |
| NRC Research Press | Wu2007 | Canadian Journal of Earth Sciences | yes? | wu_07_new_622125.pdf |
| Pensoft Publishers | Hagedorn2011 | ZooKeys | yes? | hagedorn_11_creative_779747.pdf |
| Pensoft Publishers | Penev2011 | ZooKeys | yes? | penev_11_interlinking_694886.pdf |
| Pensoft Publishers | Thessen2011 | ZooKeys | yes? | thessen_11_data_779746.pdf |
| Public Library of Science | Hess2011 | PLoS ONE | yes? | hess_11_addressing_694222.pdf |
| Public Library of Science | McDonald2011 | PLoS ONE | yes? | mcdonald_11_subadult_694229.pdf |
| Public Library of Science | Wicherts2011 | PLoS ONE | yes? | wicherts_11_willingness_779788.pdf |
| SAGE Publications | deKloet2011 | Journal of Veterinary Diagnostic Investigation | yes? | Invest-2011-deKloet-421-9.pdf |
| SAGE Publications | Richter2011 | Journal of Veterinary Diagnostic Investigation | yes? | Invest-2011-Richter-430-5.pdf |
| SAGE Publications | Wassmuth2011 | Journal of Veterinary Diagnostic Investigation | yes? | Invest-2011-Wassmuth-436-53.pdf |
| Senckenberg Natural History Collections Dresden | Fresneda2011 | Arthropod Systematics & Phylogeny | yes? | fresneda_11_phylogenetic_785869.pdf |
| Senckenberg Natural History Collections Dresden | Mally2011 | Arthropod Systematics & Phylogeny | yes? | ASP_69_1_Mally_55-71.pdf |
| Senckenberg Natural History Collections Dresden | Shimizu2011 | Arthropod Systematics & Phylogeny | yes? | ASP_69_2_Shimizu_75-81.pdf |
| Springer-Verlag | Beermann2011 | Zoomorphology | yes? | 10.1007_s00435-011-0129-9.pdf |
| Springer-Verlag | Cuezzo2011 | Zoomorphology | yes? | cuezzo_11_ultrastructure_694669.pdf |
| Springer-Verlag | Vinn2011 | Zoomorphology | yes? | 10.1007_s00435-011-0133-0.pdf |
| Taylor & Francis | Bianucci2011 | Journal of Vertebrate Paleontology | no | bianucci_11_aegyptocetus_778747.pdf |
| Taylor & Francis | Makovicky2011 | Journal of Vertebrate Paleontology | no | makovicky_11_new_694826.pdf |
| Taylor & Francis | Pietri2011 | Journal of Vertebrate Paleontology | no | pietri_11_revision_689491.pdf |
| Taylor & Francis | Rook2011 | Journal of Vertebrate Paleontology | no | rook_11_phylogeny_694916.pdf |
| Taylor & Francis | Tsuihiji2011 | Journal of Vertebrate Paleontology | no | tsuihiji_11_cranial_660620.pdf |
| Taylor & Francis | Yates2011 | Journal of Vertebrate Paleontology | no | yates_11_new_694821.pdf |
| Taylor & Francis | Gerth2011 | Systematics and Biodiversity | no | gerth_11_wolbachia_779749.pdf |
| Taylor & Francis | Krebes2011 | Systematics and Biodiversity | no | krebes_11_phylogeography_779700.pdf |
| Sociedade Brasileira de Ictiologia | Britski2011 | Neotropical Ichthyology | yes? | a02v9n2.pdf |
| Sociedade Brasileira de Ictiologia | Sarmento2011 | Neotropical Ichthyology | yes? | a03v9n2.pdf |
| Sociedade Brasileira de Ictiologia | Calegari2011 | Neotropical Ichthyology | yes? | a04v9n2.pdf |
| Royal Society | Billet2011 | Proceedings of the Royal Society B: Biological Sciences | yes? | billet_11_oldest_687630.pdf |
| Royal Society | Polly2011 | Proceedings of the Royal Society B: Biological Sciences | yes? | polly_11_history_625430.pdf |
| Royal Society | Sansom2011 | Proceedings of the Royal Society B: Biological Sciences | yes? | sansom_11_decay_625429.pdf |