Show me the data!

Today (2015-09-01) marks the public announcement of Research Ideas & Outcomes (RIO for short), a new open access journal for all disciplines that seeks to open up the entire research cycle with some truly novel features.

I know what you might be thinking: Another open access journal? Really? 

Neither I nor Daniel Mietchen would be involved with this project if it were just another boring open access journal. This journal packs a mighty combination of novel features into one platform:

  • 1.) RIO will publish research proposals, as well as regular research outputs such as articles, data papers and software – this has never been done by a journal before to my knowledge
  • 2.) RIO will label research outputs with ‘Impact Categories’ based upon UN Millennium Development Goals (MDGs) and EU Societal Challenges, to highlight the real-world relevance of research and to better link-up research across disciplines (see below for some example MDGs).


  • 3.) RIO supports a variety of different types of peer-review, including ‘pre-submission, author-facilitated, external peer-review’ (new), as well as post-publication journal-organized open peer-review (similar to that pioneered by F1000Research), and ‘spontaneous’ (not journal-organized) post-publication open peer-review, which is actively encouraged. All peer-review will be open/public, in keeping with the overall guiding philosophy of the journal to increase transparency and reduce waste in the research cycle. Reviewer comments are highly valuable; it is a waste not to make them public. When supplied, all reviewer comments will be made openly available.
  • 4.) RIO offers flexibility in publishing services and pricing in a bold attempt to ‘decouple’ the traditional scholarly journal into its component services. Authors & funders may thus choose to pay for the publishing services they actually want, rather than an inflexible bundle of different services, as is the case at most journals.
Source: Priem, J. and Hemminger, B. M. 2012. Decoupling the scholarly journal. Frontiers in Computational Neuroscience. Licensed under CC BY-NC



  • 5.) On the technical side of things, RIO uses an integrated end-to-end XML-backed publication system for Authoring, Reviewing, Publishing, Hosting, and Archiving called ARPHA. As a publishing geek this excites me greatly, as it eliminates the need for typesetting, ensuring a smooth and low-cost publishing process. Reviewers can make comments inline or more generally over the entire manuscript, on the very same document and platform that the authors wrote in, much like Google Docs. This has been successfully tried and tested for years at the Biodiversity Data Journal and is a system now ready for wider use.


For the above reasons and more, I’m hugely excited about this journal and am delighted to be one of its founding editors alongside Dr Daniel Mietchen. See our growing list of Advisory and Editorial Board members for insight into who else is backing this new journal – we’ve got some great people on board already! If you’re interested in supporting this initiative, please do enquire about volunteering as an editor for the journal; we need more editors to support the broad scale and ambition of the journal. You can apply via the main website here.

In this post I’ll go through an illustrated example of what I plan to do with my text mining project: linking up biological specimens from the Natural History Museum, London (sometimes known as BMNH or NHMUK) to the published research literature with persistent identifiers.

I’ve run some simple grep searches of the PMC open access subset already, and PLOS ONE papers make up a significant portion of the ‘hits’, unsurprisingly.
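For illustration, a search of this kind could be sketched in Python rather than grep. This is a minimal sketch, not the exact search that was run: the regular expression is an assumption based only on the catalogue strings quoted in this post (for example “BMNH 37001”, “BMNH R4947” and “BMNH RUB 17”), and the function name `find_specimens` and the sample sentence are made up for the example.

```python
import re

# Assumed pattern: institution code, an optional short letter prefix, then digits.
# Real catalogue numbers are more varied than this; treat it as a starting point.
SPECIMEN_RE = re.compile(r"\b(BMNH|NHMUK)\s+((?:[A-Z]{1,4}\s?)?\d+)\b")


def find_specimens(text):
    """Return the sorted, de-duplicated catalogue strings mentioned in text."""
    return sorted({f"{m.group(1)} {m.group(2)}" for m in SPECIMEN_RE.finditer(text)})


# Invented sentence, used only to exercise the pattern.
sample = (
    "The holotype BMNH 37001 was compared with BMNH R4947 and "
    "BMNH RUB 17; BMNH 37001 is figured again in Table 2."
)
print(find_specimens(sample))
```

A pattern this loose will miss unusual catalogue formats and may pick up false positives, so any hits would still need checking by eye.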

Below is a visual representation of the BMNH specimen ‘hits’ I found in the full text of one PLOS ONE paper:

Grohé C, Morlo M, Chaimanee Y, Blondel C, Coster P, et al. (2012) New Apterodontinae (Hyaenodontida) from the Eocene Locality of Dur At-Talah (Libya): Systematic, Paleoecological and Phylogenetical Implications. PLoS ONE 7(11): e49054. doi: 10.1371/journal.pone.0049054


I used the open source software Gephi, and the Yifan Hu layout to create the above graphical representation. The node marked in blue is the paper. Nodes marked in red are catalogue numbers I couldn’t find in the NHM Open Data Portal specimen collections dataset: 10 out of 34 not found.
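The found / not found split can be thought of as a simple set comparison. The sketch below is only an assumption about how such a check might be done: the helper names `normalise` and `split_found_missing` are hypothetical, the portal data is represented as a plain list of catalogue strings rather than the real NHM Open Data Portal format, and the values are toy examples.

```python
def normalise(catalogue_number):
    """Uppercase and drop whitespace so 'rub 17' and 'RUB17' compare equal."""
    return "".join(catalogue_number.upper().split())


def split_found_missing(mined, portal):
    """Split mined catalogue numbers into those present in / absent from portal."""
    portal_keys = {normalise(c) for c in portal}
    found = [c for c in mined if normalise(c) in portal_keys]
    missing = [c for c in mined if normalise(c) not in portal_keys]
    return found, missing


# Toy values only: not the real contents of the paper or the portal dataset.
mined_from_paper = ["RUB 16", "RUB 17", "R4947"]
portal_records = ["rub16", "RUB 52", "RUB 54"]

found, missing = split_found_missing(mined_from_paper, portal_records)
print(f"{len(missing)} out of {len(mined_from_paper)} not found: {missing}")
```

Normalising before comparing matters here, because the same specimen is often written with different spacing or capitalisation in papers and in collection databases.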

The source data table below shows how uninformative the NHM persistent IDs are. I would have plotted them on the graph instead of the catalogue strings, as that would be technically more correct (they are the unique IDs), but it would look horrible.


I’ve been failing to find a lot of well known entities in the online specimen collections dataset which makes me rather concerned about its completeness. High profile specimens such as Lesothosaurus “BMNH RUB 17” (as mentioned in this PLOS ONE paper, Table 1) can’t be found online via the portal under that catalogue number. I can however find RUB 16, RUB 52 and RUB 54 but these are probably different specimens. RUB 17 is mentioned in a great many papers by many different authors so it seems unlikely that they have all independently given the specimen an incorrect catalogue number – the problem is more likely to be in the completeness of the online dataset.

Another ‘missing’ example is “BMNH R4947” a specimen of Euoplocephalus tutus as referred to in Table 4 of this PLOS ONE paper by Arbour and Currie. There are two other records for that taxon, but not under R4947.

To end on a happier note, I can definitely answer one question conclusively:
What is the most ‘popular’ NHM specimen in PLOS ONE full text?

…it’s “BMNH 37001”, Archaeopteryx lithographica which is referred to in full text by four different papers (see below for details).
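Once the hits are grouped by paper, the ‘popularity’ question comes down to counting how many distinct papers mention each specimen. A minimal sketch, assuming the hits are held in a dictionary from a paper identifier to the catalogue strings found in it; the function name, paper identifiers and values below are placeholders, not the real results.

```python
from collections import Counter


def papers_per_specimen(hits):
    """Count, for each specimen, the number of distinct papers that mention it."""
    counts = Counter()
    for specimens in hits.values():
        # set() ensures a paper counts once, however often it repeats a specimen.
        counts.update(set(specimens))
    return counts


# Placeholder data for illustration only.
toy_hits = {
    "paper_A": ["BMNH 37001", "BMNH R4947"],
    "paper_B": ["BMNH 37001", "BMNH 37001"],
    "paper_C": ["BMNH RUB 17"],
}

counts = papers_per_specimen(toy_hits)
print(counts.most_common(1))
```

Counting papers rather than raw mentions avoids a single paper that repeats one catalogue number many times dominating the ranking.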

I have a feeling many more NHM specimens are hiding out in separate supplementary materials files. Mining these will be hard unless figshare gets their act together and creates a full-text API for searching their collection – I believe it’s a metadata-only API at the moment.

37001 in PLOS ONE papers


I’ve purposefully made very simple graphs so far. Once I get more data, I can start linking it up to create beautiful and complex graphs like the one below (of the taxa shared between 3000 microbial phylogenetic studies in IJSEM, unpublished), which I’m still trying to get my head around. The linked open data work continues…

Bacteria subutilis commonly used


For those that know me as a biologist it might perhaps surprise you to know that my most cited publication so far is on Open Access and Altmetrics (published in April 2013, 25 cites and counting…) — nothing to do with biology per se!

So I took great interest in this new publication:

Wang, X., Liu, C., Mao, W., and Fang, Z. 2015. The open access advantage considering citation, article usage and social media attention. Scientometrics, pp. 1-10. DOI: 10.1007/s11192-015-1547-0

The authors have gathered some really fascinating data measuring day-by-day altmetrics of papers at the journal Nature Communications, which at the time was hybrid: some articles were behind a paywall, while others were paid-for open access at a cost of $5200 to the authors/funders. (The cost of open access here is an absolute rip-off. I do not endorse or recommend outrageously priced paid-for open access outlets like Nature Communications. PLOS ONE costs just $1350, remember! PeerJ is just $99 per author!)

The paper is by no means perfect – I’m not saying it is – but the ideas behind it are good. Many on twitter have commented that it’s ironic that this paper on open access advantage is itself only made available behind a paywall at the publisher.

The good news is that Dr Xianwen Wang has responded to this and has made an ‘eprint’ copy (stripped of all publisher branding) freely available at arXiv as of 2015-03-19 (post-publication). The written English throughout the manuscript is not brilliant, but I feel this reflects poorly on the journal rather than the authors – it’s remarkable that Scientometrics can charge a subscription fee if it offers no copy-editing on accepted manuscripts! Finally, technical detail on precisely how the data was obtained is rather lacking. So that’s the critique out of the way…

My tweets about this paper have been very popular e.g.

But I wanted to dig deeper into the data. So I emailed the corresponding author, Xianwen, for a copy of the data behind figure 2 and he happily and quickly sent it to me. I was fairly shocked (in a good way) that he sent the data. Most of the email requests for data I’ve sent in the past have been ultimately unsuccessful. This is well documented in the field of phylogenetics *sad face*. The ‘email the author’ system simply cannot be relied upon, and is one of many reasons why I feel all non-sensitive data supporting research should be made publicly available, alongside the article, on the day of publication.

I did my own re-analysis of the raw data Xianwen sent over, and discovered there were lots of odd jumps in data, which couldn’t really be explained by peaks in social media activity e.g. for A cobalt complex redox shuttle for dye-sensitized solar cells with high open-circuit potentials (visualized below). ~520 days after it was first published, in one single day it apparently accumulated 21,577 page views! There was also a smaller spike of 2000 page views earlier.

Article View Spikes
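One simple way to flag one-day jumps like these is to compare each day against a typical day for the same article. The sketch below is an assumption for illustration only: it is not the filter used in the paper, nor the re-analysis described here. The function name `flag_spikes`, the `factor` of 20 and the toy series are all made up, although the two large values echo the spike sizes mentioned above.

```python
from statistics import median


def flag_spikes(daily_views, factor=20.0):
    """Return the indices of days whose views exceed factor x the median day."""
    baseline = median(daily_views)
    # Guard against a zero baseline for articles with many empty days.
    threshold = factor * max(baseline, 1.0)
    return [day for day, views in enumerate(daily_views) if views > threshold]


# Toy daily page-view series; only 2000 and 21577 come from the text above.
toy_views = [40, 35, 50, 2000, 45, 38, 42, 21577, 41]
print(flag_spikes(toy_views))
```

The median is used instead of the mean because a single enormous spike would otherwise drag the baseline up and hide smaller jumps. Whatever rule is chosen, it belongs in the methods section so readers know which points were filtered.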

Xianwen had filtered these suspicious jumps out of his figures but neglected to mention that in the methods section. When I informed him of this discrepancy, he told me he’s going to contact the editor to sort it out. A great little example of how data sharing results in improved science? The unfiltered data looks a little bit like the plot below:

Anyway, back to the spikes/jumps in activity – they certainly aren’t an error introduced by the authors of the paper – they can also be seen via Altmetric (a service provider of altmetrics). The question is: what is causing these one-day spikes in activity?

I have alerted the team at Altmetric, and they have/will alert Nature Publishing Group to investigate further.

Most of the spikes are likely to be accidental in cause but it would be good to know more. A downloading script gone awry? But there is still a possibility that within this dataset there is putative evidence for deliberate gaming of altmetrics, specifically: article views. I look forward to hearing more from Altmetric and Nature Publishing Group about this… the ball is very much in their court right now.

Moreover, now that these peculiar spikes have been detected, what, if anything, should be done about them?

Just a quick post to congratulate the Bill & Melinda Gates Foundation for their fabulous new research policy covering both open access & open data.

One of the key things they’ve implemented for 2017 is ZERO TOLERANCE for post-publication embargoes of research. Work MUST be made openly available IMMEDIATELY upon publication to be compliant. No ifs, no buts.

Let’s just remind ourselves why other major research funders like RCUK & Wellcome Trust allow publishers to impose an embargo on academic work before it can be made public:



Do any academics want a post-publication embargo on their work, that stops it being shared, read & re-used by the widest readership possible?




Does it benefit readers, patients, policy-makers or practitioners to have a post-publication embargo delaying their access to the very latest research?




Does it benefit research funders themselves to have a post-publication embargo on work they fund?




The only stakeholders that benefit from research funder policies that allow post-publication embargoes preventing free access to research are the legacy publishers. The fact that RCUK, Wellcome Trust and many others pander to these parasitic publishers and their laughably unfit-for-purpose business model is just WRONG and it makes me angry. JUST SAY “NO” TO POST-PUBLICATION EMBARGOES!




It’s high time that major research funders wrote policies that ask for what WE ALL ACTUALLY WANT, instead of a bullshit compromise that minimises fiscal harm to the multi-billion dollar legacy publishers.


I admire the Gates Foundation. They understand what we all need and they’ve implemented that in a clear and appropriate policy, optimal for readers, researchers, patients, practitioners and policy-makers. We want immediate open access, and we want it NOW! The ball is now in your court, Wellcome Trust. Make your move!