Show me the data!

This post is about my new preprint I’ve uploaded to PeerJ PrePrints:

Mounce, R. (2015) Dark Research: information content in some paywalled research papers is not easily discoverable online. PeerJ PrePrints

Needless to say, it’s not peer-reviewed yet but you can change that by commenting on it at the excellent PeerJ PrePrints website. All feedback is welcome.

The central hypothesis of this work is that content in some academic journals is less discoverable than in other journals. What do I mean by discoverable? It’s simple, really. Imagine a paper as a bag-of-words:


If academic search providers like Google Scholar, Web of Knowledge, and Scopus can correctly tell me that this paper contains the word ‘rat’, then this is good and what science needs. If they can’t find it, it’s bad for the funders, authors and potential readers of that paper – the rat research remains hidden as ‘dark research’: published, but not easily found. More formally, in terms of information retrieval, you can measure search performance across many documents by assessing recall.

Recall is defined as:

the fraction of the documents that are relevant to the query that are successfully retrieved

As a toy example: if there are 100 papers containing the word ‘rat’ in Zootaxa, and Google Scholar returns 50 search results containing the word ‘rat’ in Zootaxa, then we ascertain that Google Scholar has 50% recall for the word ‘rat’ in Zootaxa.
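The recall calculation is trivial to express in code. Here’s a minimal sketch of the toy example above (the counts are hypothetical, exactly as in the example):

```python
def recall(retrieved_relevant: int, total_relevant: int) -> float:
    """Fraction of the relevant documents that were successfully retrieved."""
    return retrieved_relevant / total_relevant

# Toy example from above: 100 Zootaxa papers contain the word 'rat',
# and Google Scholar returns 50 of them.
print(recall(50, 100))  # 0.5, i.e. 50% recall
```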

In my preprint I test recall for terms related to my subject-area against >20,000 full text papers from PLOS ONE & Zootaxa. The results are really intriguing:

  • Web of Knowledge is shown to be consistently poor at recall across both journals (not surprising – it only indexes titles, abstracts and keywords, making it woeful at detecting words that are typically present only in the methods section).
  • Google Scholar appears to have near-perfect recall of PLOS ONE content (open access), but less than 50% recall on average of Zootaxa content.
  • Scopus shows an inverse trend: reasonably consistent and good recall of Zootaxa content, averaging ~70% recall for all tests but poorer at recalling PLOS ONE content (45% recall on average).


Why is Google Scholar so poor at recalling terms used in Zootaxa papers? Is it because Zootaxa is predominantly a subscription access journal?


Why is Scopus so poor at recalling terms used in PLOS ONE papers? PLOS ONE is an open access journal, published predominantly under CC-BY – there should be no difficulty indexing it. Google Scholar demonstrates that it can be done well.


Why is Scopus so much better than Google Scholar at indexing terms in Zootaxa? What does Scopus do, or have, that Google Scholar doesn’t?


I don’t pretend to have answers to these questions – academic search providers tend to be incredibly opaque about how they operate. But it’s fascinating, and slightly worrying for our ability to do research effectively if we can’t know where knowledge has been published.

More general thoughts

Why is academic content search so bad in 2015? It’s really not that hard for born-digital papers! Is this another sign that academic publishing is broken? Discoverability is broken & inconsistent. Access is broken & inconsistent. Peer review is broken & inconsistent. Hiring & tenure is broken & inconsistent…


The good news is that there’s a clear model for success here, if we can identify its exact determinants: PLOS ONE & Google Scholar together provide excellent discoverability (>95% recall). Whatever they’re doing, I suggest publishers copy it to ensure high discoverability of research.



Recently I had the opportunity to collaborate on an extremely timely paper on data sharing and data re-use in phylogenetics, as part of the continuing MIAPA (Minimal Information for a Phylogenetic Analysis) working group project:


Additionally, in order to practise what we preach about data archiving, we opted (it wasn’t mandated by the journal or editors) to put the underlying data for this publication in Dryad, under a CC0 waiver, so that it was immediately freely available for re-use/scrutiny/whatever upon publication of the paper.

Dryad (and similar free services like FigShare, MorphoBank & LabArchives) allows research data to be made available pre-publication, on publication, or even post-publication with optional embargoes (access denied for up to 1 year after the paper is published). I’m strongly against the use of data embargoes, but Dryad allows it because embargoed data is better than no data at all! I’ve seen some recent papers that have made use of this option, and apparently the journals, editors & reviewers are ‘fine’ with this practice of proactively denying access to data. I guess it’s a generational thing? That sort of practice was understandably okay pre-Internet, when digital data was costly to distribute. But now that we can freely & easily distribute supporting data, there are a multitude of reasons why we really should, unless there are justifiable reasons not to, e.g. privacy with sensitive medical/patient data.


I haven’t had all that much experience of the publication process so far – I’m amazed how kludgy it can be at times; far from smooth or efficient, IMO. I was in charge of the Dryad data deposition for this paper, among other things, and because the journal isn’t integrated with Dryad’s deposition process it took me quite a few emails to work out what to do and when. It wasn’t a major difficulty though – the benefits of doing this will almost certainly outweigh the small effort cost. Those journals with a Dryad-integrated workflow will no doubt have a smoother process.

Another thing I learnt from this manuscript was that publishers commonly outsource their typesetting to developing countries (for the cheaper labour available there). So in this instance BMC sent our MS to the Philippines to be re-typeset for publication, and when the proofs came back we encountered some really comical errors, e.g. Phylomatic had been re-typeset as ‘phlegmatic’. This sparked a very serendipitous conversation on Twitter, which eventually led to Bryan Vickery (Chief Operating Officer at BMC) inviting me to visit the London office of BMC to have a chat about ‘all-things-publishing’ (and btw, serious *props* to PLOS and BMC for having such nice, helpful tweeps on Twitter):

Bryan and I arranged a time and a date (after SVP), and so I ended up visiting BMC for more than 2 hours on Wednesday 24th October. I got to meet not only Bryan but also Deborah Kahn, Shane Canning and others, including some of the editors of BMC Research Notes (thanks again for helping publish our paper!) & BMC Evolutionary Biology. Iain Hrynaszkiewicz was there too (Hi Iain!); given our shared enthusiasm for Open Data (do read his *excellent* paper ‘Open By Default’ in the same article collection as ours), I’m sure we’ll meet again at more workshops and events in future.

I couldn’t possibly go through everything that was explained to me there, but it certainly was illuminating. I suspect many junior academics like myself have little or no clue at all as to the behind-the-scenes processes that go on to get manuscripts into a state ready for publication. Perhaps a publisher visit (or even short placement?) scheme like this should be run as part of a postgraduate skills training session? Moreover, perhaps it could help alleviate the ‘too many PhDs, too few academic jobs’ problem by highlighting skilled sciencey jobs like STM publishing as viable and noble alternatives to the extremely overpopulated rat-race for tenure-track academic jobs. STM publishing isn’t even an ‘exit’ from academia: people like Jennifer Rohn (chair of Science is Vital) have demonstrated that one can go into STM publishing and still go back into academia afterwards.

The cost of peer-review & publishing


This part of the post has sat on the backburner for a long time because it’s a complex one.

From what I was told (and I can well believe it), organizing peer review can be an immensely variable process. Sometimes it can be very simple: automated tools such as peer2ref can be used to select appropriate reviewers for a manuscript, and if these reviewers accept and get on with it nicely and in a timely fashion, the process carries very little administrative burden. However, there are also times when maybe 10 or 12 reviewers need to be contacted before 2 agree, and then there can be complications after this, leading to a very time-consuming, costly and burdensome process. So organizing peer review costs money, but it’s difficult, or perhaps commercially sensitive (?), to put an average price on that process – I’m still in the dark on how much it should cost. If anyone knows of a reputable source for data on this, please do let me know.


What of DOIs?  Why do some high-volume journals like Zootaxa & Taxon operate without DOIs? Is there really much money to be saved by dispensing with them? Well, Bryan kindly pointed me to this link for all the salient info.

It’s just $1 per DOI. That’s nothing, tbh. What’s more, it’s even cheaper to retrospectively add DOIs to older, already-published content: ‘backfile’ DOIs are just $0.15 each. That means Zootaxa could retrospectively add DOIs to all ~5866 of their backfile articles (2004-2009) for just ~$880! There’s plenty of other things that would need fixing before that happened, though – Zootaxa doesn’t even have proper article landing pages, as was pointed out to me by Rod Page. No doubt there would also be some labour cost associated with getting someone to add DOIs to all those thousands of articles. Still, it looks cheap to me. I still feel justified in the annoyed rant I sent to TAXACOM a while ago about this pressing issue with respect to DOIs and the responsibility of publishers.
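For transparency, here’s the back-of-envelope arithmetic behind that backfile figure, using the per-DOI fees quoted above:

```python
# CrossRef DOI fees quoted in the post (USD)
CURRENT_DOI_FEE = 1.00    # per newly registered DOI
BACKFILE_DOI_FEE = 0.15   # per retrospectively registered 'backfile' DOI

zootaxa_backfile_articles = 5866  # ~2004-2009 articles currently lacking DOIs
cost = zootaxa_backfile_articles * BACKFILE_DOI_FEE
print(round(cost, 2))  # 879.9 -> roughly $880
```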

This also has ramifications for some of the changes I’ve been pushing for now that I’m on the Systematics Association council. Our main publication is a book, and each of its chapters *could* have a DOI issued for it, but currently doesn’t. I suggested we issue DOIs at the last council meeting, but alas it’s not up to me; we need co-operation from our publisher to make this happen (Hi, Cambridge University Press!). Book chapter DOIs cost just $0.25 each, so I think this small cost would certainly be worth it if it raises the discoverability and citeability of our publications.

Article submission

A final point of interest from my BMC visit: Bryan told me that BMC used to offer a means by which authors could submit their works directly via an XML authoring tool. It wasn’t popular, but I wonder whether this was perhaps because it was a little before its time? The whole process of biologists submitting Word files, and having figures and text inadvertently mangled and wrongly re-typeset at the publisher, seems extremely inefficient to me. Physicists & computational scientists seem to get along fine with LaTeX submission processes, which alleviate some, but not all, of the typesetting shenanigans. Perhaps it is the authors, and the authoring tools, that need to change to enable more re-usable research in future and fully realise the potential of the semantic Web. It looks like Pensoft might be trying to go in this direction again with its Pensoft Writing Tool.

image by Gregor Hagedorn. CC BY-SA

On that note, it might be good to end with a small advert for the Pro-iBiosphere biodiversity informatics & taxonomy workshop in February 2013, in Leiden (NL).

I very much look forward to meeting taxonomists IRL!





just a quick post…

I’m pretty shocked at the poor indexing service given by Thomson Reuters Web of Knowledge (or ISI Web of Science, as you might know it).

I’ve unashamedly bashed them before and I’ll bash them again here. (They deserve criticism because they’re paid a lot of money to do this as a commercial for-profit enterprise, and I don’t think they’re doing it as well as they could be.)

I performed a very simple search today looking for the articles containing the word ‘cladistic’ but NOT ‘phylogen*’ for articles published in the year 2010.

Topic=(cladistic) NOT Topic=(phylogen*) AND Year Published=(2010)

Below is a screenshot of just one of many of the disappointing results. I’ve refined the search to just the PLoS paper, to clearly show that it does come-up in this search:

It’s an Open Access paper, so we can all go and see for ourselves the FULL content of the paper:

Na, B.-K., Bae, Y.-A., Zo, Y.-G., Choe, Y., Kim, S.-H., Desai, P. V., Avery, M. A., Craik, C. S., Kim, T.-S., Rosenthal, P. J., and Kong, Y. 2010. Biochemical properties of a novel cysteine protease of plasmodium vivax, vivapain-4. PLOS NEGLECTED TROPICAL DISEASES 4.

In which we find the text caption for figure 1 mentions ‘phylogen*’ twice!

from Na, B.-K., Bae, Y.-A., Zo, Y.-G., Choe, Y., Kim, S.-H., Desai, P. V., Avery, M. A., Craik, C. S., Kim, T.-S., Rosenthal, P. J., and Kong, Y. 2010. Biochemical properties of a novel cysteine protease of plasmodium vivax, vivapain-4. PLOS NEGLECTED TROPICAL DISEASES 4. CC-BY licenced

So, at the very least, I suspect Web of Science (WoS) is systematically NOT indexing the caption text of figures (if you know more about this than I do, please do comment). Academics rely on services like this to search the literature effectively and accurately, and to perform comprehensive reviews and such. If all the textual content of science isn’t actually being indexed by WoS, that’s clearly going to lead to bad science at some point (e.g. a vital paper missed, not picked up in an otherwise well-designed literature search). I could forgive them for not being able to OCR the text within the images of figures, but NOT for fully machine-readable text captions like this one. Furthermore, it’s Open Access and fully digital – why aren’t they indexing figure caption text?



It appears it’s not just figure caption text they don’t index. Do they index only titles and abstracts?

Many of the other 81 results (papers) of that search for ‘cladistic’ but NOT ‘phylogen*’ contain the word-stem ‘phylogen*’ in the full text of the paper!


Wilts, E. F., Arbizu, P. M., and Ahlrichs, W. H. 2010. Description of bryceella perpusilla n. sp (monogononta: Proalidae), a new rotifer species from terrestrial mosses, with notes on the ground plan of bryceella REMANE, 1929. INTERNATIONAL REVIEW OF HYDROBIOLOGY 95.

Echeverry, A. and Morrone, J. J. 2010. Parsimony analysis of endemicity as a panbiogeographical tool: an analysis of caribbean plant taxa. BIOLOGICAL JOURNAL OF THE LINNEAN SOCIETY 101.

Stutz, H. L., Shiozawa, D. K., and Evans, R. P. 2010. Inferring dispersal of aquatic invertebrates from genetic variation: a comparative study of an amphipod and mayfly in great basin springs. JOURNAL OF THE NORTH AMERICAN BENTHOLOGICAL SOCIETY 29.

Campo, D., Molares, J., Garcia, L., Fernandez-Rueda, P., Garcia-Gonzalez, C., and Garcia-Vazquez, E. 2010. Phylogeography of the european stalked barnacle (pollicipes pollicipes): identification of glacial refugia. MARINE BIOLOGY 157.

Choiniere, J. N., Clark, J. M., Forster, C. A., and Xu, X. 2010. A basal coelurosaur (dinosauria: Theropoda) from the late jurassic (oxfordian) of the shishugou formation in wucaiwan, people’s republic of china. JOURNAL OF VERTEBRATE PALEONTOLOGY 30.

Caldwell, M. W. and Palci, A. 2010. A new species of marine ophidiomorph lizard, adriosaurus skrbinensis, from the upper cretaceous of slovenia. JOURNAL OF VERTEBRATE PALEONTOLOGY 30.

Hastings, A. K., Bloch, J. I., Cadena, E. A., and Jaramillo, C. A. 2010. A new small short-snouted dyrosaurid (crocodylomorpha, mesoeucrocodylia) from the paleocene of northeastern colombia. JOURNAL OF VERTEBRATE PALEONTOLOGY 30.

Karanovic, I. and McKay, K. 2010. Two new species of leicacandona karanovic (ostracoda, candoninae) from the great sandy desert, australia. JOURNAL OF NATURAL HISTORY 44.

(and more – these are just some of the articles whose full text I’ve looked at so far… I think it’s safe to say now that this is NOT a one-off phenomenon)

I’ve now found, through manual inspection, that at least 47 of the ‘hits’ for this search actually contain a ‘phylogen*’ word within the main text of the paper (excluding the reference list).

I guess I’m probably not the first to realise this but… wow. Is this not *really* poor service? I’m pretty sure my desktop software could do a better job of indexing than this. All it is, is simple string matching!

…and of course, I can do a better job of this myself with Open Access papers. All one need do is download the OA corpus from UKPMC and index the *FULL* text yourself, including figure caption text and reference lists. I wonder how many more relevant papers I might ‘find’ with my searches if I did this rather than relying on Web of Science?
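To illustrate just how simple that string matching is, here’s a minimal sketch of the ‘cladistic’ NOT ‘phylogen*’ search run over a local directory of plain-text OA papers (the directory name is a hypothetical assumption, purely for illustration):

```python
import re
from pathlib import Path

CORPUS_DIR = Path("oa_corpus")  # hypothetical directory of plain-text papers

def matches(text: str) -> bool:
    """True if the text contains 'cladistic' but no 'phylogen*' word-stem."""
    text = text.lower()
    return "cladistic" in text and not re.search(r"phylogen", text)

# Search the FULL text of every paper, captions and reference lists included
hits = [p.name for p in CORPUS_DIR.glob("*.txt")
        if matches(p.read_text(errors="ignore"))]
print(len(hits), "papers match")
```

Because this runs over the full text, it cannot miss a ‘phylogen*’ lurking in a figure caption.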

I realise that, thus far, I may not have explained too clearly exactly what I’m doing for my Panton fellowship. With this post I shall attempt to remedy that and shed a little more light on what I’ve been doing lately.

The main thrust of my fellowship is to extract phylogenetic tree data from the literature using content mining approaches (think text mining, but not just text!) – using the literature in its entirety as my data. I have very little prior experience in this area, but luckily I have an expert mentor guiding me: Peter Murray-Rust (whom you may often see referred to as PMR). For those of us biologists who may not be familiar with his work – and trying not to be too sycophantic about it – PMR is simply brilliant: it’s amazing what he and his collaborators have done to extract chemical data from the chemical literature and provide it openly for everyone, in spite of fierce opposition at times from those with vested interests in this data remaining ‘closed’.

Now he’s turned his attention to the biological literature for my project, and together we’re going to try to provide open tools to extract phylogenetic data from the literature. Initially I proposed trying to grab just tree topology and tip labels – a kind of bare minimum – but PMR has convinced me that we should be ambitious and all-encompassing, and thus our aims have expanded to include branch lengths, support values, the data-type the phylogeny was inferred from, and other useful metadata. And why not? We’re ingesting the totality of the paper in our process, from title page to reference list, so there’s plenty of machine-readable data to be gleaned. The question is, can we glean it accurately enough, balancing precision and recall?

So, for starters, we’ve been using test materials that we’re legally allowed to use – namely Open Access CC-BY papers from BMC & PLoS – to test our extraction tools, specifically focusing on a subset of the ~8500 papers from BMC containing the word-stem phylogen*. It’s a rough proxy for papers that’ll contain a tree, and it’s good enough for now – we’ll need to be able to deal with false positives along with all the true positives, so it’s instructive to keep these in our sample.

We’ve been working on the regular structure of BMC PDFs, extracting bibliographic metadata and the main text for further NLP processing downstream, to pick out data- and method-relevant words like, say, PAUP*, ML, mitochondrial loci, etc. But the real reason we’re deliberately using PDFs rather than the XML (which we also have access to) is the figures – where all the valuable phylogenetic tree data is. If these can be re-interpreted with reference to the bibliographic metadata, the figure caption and further methodological details from the full text of the paper, then we may be able to reconstruct some fairly rich and useful phylogenetic data.
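As a flavour of that downstream word-spotting, here’s a minimal sketch (not our actual pipeline; the term list is illustrative only):

```python
import re

# Illustrative data- and method-relevant terms to spot in extracted main text
METHOD_TERMS = [r"PAUP\*", r"\bML\b", r"\bBayesian\b", r"mitochondrial loci"]
pattern = re.compile("|".join(METHOD_TERMS))

text = "Trees were inferred with PAUP* under ML from two mitochondrial loci."
found = pattern.findall(text)
print(found)  # ['PAUP*', 'ML', 'mitochondrial loci']
```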

To make it clear, in slight contrast to the Lapp et al. iEvoBio presentation embedded above, we’re not trying to just extract the images, but rather to re-interpret them back into actual re-useable data, probably to be provided in NeXML (and from there on, whatever form you want). We’re pretty sure it’s an achievable goal. Programs like TreeThief, TreeRipper, and TreeSnatcher Plus have gone some way towards this already, but they’ve never before been incorporated into a content mining workflow, AFAIK.

Unfortunately I wasn’t at iEvoBio 2012 (I’m short on money and time these days), but it’s great to see from the slides the growing recognition of the SVG image file format as a brilliant tool for communicating digital science. I also put a bit about that in my Hennig XXXI talk slides (towards the end). Programs like TNT can output SVG files, so there’s scope to make this a normal part of any publication workflow. Regrettably, though, rather few publisher-produced PDFs contain SVG-formatted images – but if people, and editorial boards (perhaps?), can be made aware of their advantages, perhaps we can change this in future…?

The very same file, opened as plain text. It’s fairly easy to reconvert back into re-useable, machine-readable data.


Agapornis phylogeny.svg from Wikipedia (PD)
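To show just how mineable SVG is, here’s a minimal sketch that pulls branch line segments and tip labels out of an SVG tree figure using only the standard library (the tiny inlined figure is hypothetical, not the actual Wikipedia file):

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

# Tiny hypothetical SVG tree figure, inlined for illustration
svg = """<svg xmlns="http://www.w3.org/2000/svg">
  <line x1="0" y1="10" x2="50" y2="10"/>
  <line x1="0" y1="30" x2="50" y2="30"/>
  <text x="55" y="12">Agapornis roseicollis</text>
  <text x="55" y="32">Agapornis fischeri</text>
</svg>"""

root = ET.fromstring(svg)
branches = [(e.get("x1"), e.get("y1"), e.get("x2"), e.get("y2"))
            for e in root.iter(SVG_NS + "line")]
tips = [e.text for e in root.iter(SVG_NS + "text")]
print(branches)  # branch segment coordinates
print(tips)      # ['Agapornis roseicollis', 'Agapornis fischeri']
```

With coordinates and labels recovered like this, reconstructing the topology becomes a geometry problem rather than an image-recognition one.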



Gathering phylogenetic data from beyond PLoS, BMC and other smaller Open Access publishers is going to be hard – not for technical reasons, but purely legal ones:

The scope and scale of phylogenetic research (using ‘phylogen*’ as a proxy):

There’s a lot of phylogenetic research out there… but little of it is Open Access – which is problematic for content mining approaches – particularly if subscription-access publishers are reticent to allow access.

Some facts:

  • with a Thomson Reuters Web of Science search of the SCI-EXPANDED database (only), the query Topic=(phylogen*) AND Year Published=(2000-2011) returns 101,669 results (at the time of searching; YMMV)
  • 91,788 of which are primary Research Articles (as opposed to Reviews, Proceedings Papers, Meeting Abstracts, Editorial Materials, Corrections, Book Reviews etc…)
  • Recent MIAPA working group research I contributed to (in review) quantitatively estimates that approximately 66% of papers containing ‘phylogen*’ report a new phylogenetic analysis (new data).
  • Thus conservatively assuming just one tree per paper (there are often many per paper), there are > 60,000 trees contained within just 21st century research articles.
  • As with STM publishing as a whole, the number of phylogenetic research articles being published each year shows consistent year-on-year increases
  • Cross-match this with publisher licensing data and you’ll find that only ~11% of phylogenetic research published in 2010 was CC-BY Open Access (and this percentage probably decreases as you go back before 2010)
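The tree estimate in the bullets above is just multiplication, reproduced here for transparency:

```python
# Back-of-envelope estimate from the Web of Science numbers above
wos_results = 101_669        # hits for phylogen*, 2000-2011
research_articles = 91_788   # primary research articles among them
new_analysis_rate = 0.66     # MIAPA estimate: fraction reporting a new analysis

min_trees = research_articles * new_analysis_rate
print(int(min_trees))  # >60,000 trees, assuming just one tree per paper
```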
So the real fun and games will come later this year, when I’m sure we’ll have the capability (software tools) to do some amazing stuff, having first perfected it on OA materials… but will they let us? Heather Piwowar’s experience earlier this year didn’t look too fun – and that was all for just one publisher. Phylogenetic research occurs across at least 80 separate STM publishers by my count (let alone the >500 journals it occurs in!) – so there’s no way anyone would bother trying to negotiate with them all! I’m sticking by the intuitive principle that the Right to Read Is the Right to Mine, but I’ll think about that some more when we actually come to that bridge.

Finally, it’s also worth acknowledging that we’re certainly not the first in this peculiar non-biomedical mining space – ‘biodiversity informaticists’ have been doing useful things with these techniques for a while now, in innovative ways largely unrelated to medicine, e.g. LINNAEUS from Casey Bergman’s lab, and a recent review of other projects from Thessen et al. (2012) [hat-tip to @rdmpage for bringing that latter paper to the world’s attention via Twitter]. Literally all areas of academia could probably benefit from some form or another of content mining – it’s not just a biomed/biochem tool.

So, I hope that explains things a bit better. Any questions?


Some references (but not all!):

Gerner, M., Nenadic, G., and Bergman, C. 2010. LINNAEUS: A species name identification system for biomedical literature. BMC Bioinformatics 11:85+. [CC-BY Open Access]

Thessen, A. E., Cui, H., and Mozzherin, D. 2012. Applications of natural language processing in biodiversity science. Advances in Bioinformatics 2012:1-17. [CC-BY Open Access]

Hughes, J. 2011. TreeRipper web application: towards a fully automated optical tree recognition software. BMC Bioinformatics 12:178+.  [CC-BY Open Access]

Laubach, T., von Haeseler, A., and Lercher, M. 2012. TreeSnatcher plus: capturing phylogenetic trees from images. BMC Bioinformatics 13:110+. [CC-BY Open Access, incidentally I was one of the reviewers for this paper. I signed my review, and made a point of it too. Nor was it a soft review either I might add]