Show me the data!

This is a re-post of something I was invited to write to sum up my experiences at OKCon 2011. The original post can be viewed here on the official OKFN Open Science blog. For some reason the Prezi embed code at the bottom didn’t work there, but it does here on my blog.

Many thanks to Jenny Molloy for inviting me to write the post, and Maria Neicu for editing it.

A couple of months ago, I gave a talk at the Open Knowledge Conference 2011 on ‘Open Palaeontology’ – based upon 18 months’ experience as a lowly PhD student trying, and mostly failing, to get usable digital data from palaeontological research papers. As you might well have inferred already from that last sentence, it’s been an interesting ride.

The main point of my talk was the sheer stupidity/naivety of the way in which data is supplied (or in some cases, not supplied at all!) with or within research papers. Effective science operates through the accumulation of knowledge and data; all advances are incremental and build upon the work of others – the Panton Principles probably sum it up far better than I could. Any barriers to the accumulation of knowledge/data therefore impede the progress of science.

Whilst there are numerous barriers to academic research – access to research papers being perhaps the most well-known and well-publicised – the issue that most aggravates me is not access to these papers, but the actual papers themselves. In the context of the 21st century (I’m thinking the Internet Age here…), they are only barely adequate (at best) for communicating research data, and this is a major problem for the future legacy of our published work… and my research project.

My PhD thesis title is quite broad: ‘The Importance of Fossils in Phylogeny’. Given this title and (wide) scope, I need to look at a lot of papers, in a lot of different journals, and extract data from these articles to re-analyse, so as to assess the importance of fossils in phylogeny on a meta-scale. There are long-established data formats for the particular type of data I wish to extract: so well established and easy to understand that there’s even a Wikipedia page here describing the most commonly used data format (NEXUS). There exist multiple databases set aside specifically to host this type of data, e.g. TreeBASE and MorphoBank. Yet despite all this standardisation and provisioning for palaeomorphological phylogenetic data, far less than 1% of all the data published is actually readily available in a standardised, digital, usable format.
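To give a flavour of why a NEXUS file is so much more usable than a table locked inside a PDF, here is a minimal sketch in Python. The toy matrix, the taxon names and the `read_matrix` helper are all invented for illustration, and the parser deliberately ignores real-world features of the format (comments, interleaved matrices, quoted names, polymorphic states), so treat it as a sketch under those assumptions rather than a proper NEXUS reader.

```python
import re

# A toy NEXUS file: taxon names and character states are invented for illustration.
TOY_NEXUS = """#NEXUS
BEGIN DATA;
  DIMENSIONS NTAX=3 NCHAR=5;
  FORMAT DATATYPE=STANDARD MISSING=? GAP=-;
  MATRIX
    Taxon_A  0101?
    Taxon_B  0111-
    Taxon_C  1100?
  ;
END;
"""


def read_matrix(nexus_text):
    """Return {taxon: state string} from the first MATRIX block.

    Deliberately minimal: no comments, interleaving, quoted names or polymorphism.
    """
    match = re.search(r"MATRIX\s+(.*?);", nexus_text, flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        raise ValueError("no MATRIX block found")
    matrix = {}
    for line in match.group(1).splitlines():
        parts = line.split()
        if len(parts) == 2:  # one taxon name followed by one run of states
            taxon, states = parts
            matrix[taxon] = states
    return matrix


matrix = read_matrix(TOY_NEXUS)
for taxon, states in matrix.items():
    print(taxon, states)
```

A dozen or so lines are enough to get from the file to something a program can work with; nothing comparable is possible with a sideways table printed in a PDF.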

In most cases the data is there; you just have to dig very, very hard to release it from the PDF file it’s usually buried in (and then spend copious and unnecessary amounts of time manually reformatting and validating it). See the picture below for a typical example (and yes, it is sadly printed sideways; this is a common and silly practice that publishers use to squeeze data matrices into papers):

I hope you’ll agree with me that this is clearly absurd and hugely inefficient. As I explain in my presentation (slides at the bottom of this post), the data, as originally analysed/used, comes in a much richer, more usable, digital, standardised format. Yet when published it gets stripped of all useful metadata and converted into a flat, inextricable and significantly obfuscated table. Why? It’s my belief that this practice is a lazy, unwanted, vestigial hangover from the days of paper-based (only) publishing, in which this might have been the only way to convey the data with the paper. But in 2011, I can confidently say that the vast majority of researchers read and use the digital versions of research papers – so why not make full and proper use of the digital format to aid scientific communication? I’m not arguing that we should axe paper copies, only that digital versions can and should (IMO) be more than just plain PDF versions of the paper copy.

With this goal in mind, I set about writing an Open Letter to the rest of my research community to explain why we need to richly digitise our published research data ASAP. Naturally, I wouldn’t get very far just by myself, so I enlisted the support of a variety of academic friends via Facebook, and (inspired by OKFN pads I’d seen) we concocted a draft letter together using an Etherpad. The result of this was a fairly basic Drupal-based website that we launched and disseminated via mailing lists and Twitter, as far and wide as we possibly could, *hoping*, just hoping, that our fellow academics would read, take note and support our cause.

Surprisingly, it worked to an extent, and a lot of big names in palaeontology signed our Open Letter in support of our cause. Then things got even better when a Nature journalist (Ewen Callaway) got interested in our campaign and wrote an article for Nature News about it, which can be found here. A huge thanks must go to everyone who helped out with the campaign; it has generated truly international support, as can be seen on the map below:
(You might have to zoom out a bit. For some reason it zooms into Africa by default.)


It’s far too soon to know the true impact of the campaign. Journal editorial boards can be very slow to change their editorial policies, especially if it requires a modicum of extra effort on the part of the publisher. Additionally, once editorial policy does change at a journal, it can only apply to articles submitted from then on, so articles already in the submission pipeline aren’t affected by any new guidelines. Delays of a year between submission and publication are not uncommon in palaeontology, so for this and other reasons I’m not expecting to see visible change until 2012, but I think we might have helped get the ball rolling, if nothing else…
The Paleontological Society journals (Paleobiology and Journal of Paleontology) have recently adopted mandatory data submission to the Dryad repository, and the Journal of Vertebrate Paleontology has also improved its editorial policy with respect to certain types of data, but these are just a few of the many, many journals that publish palaeontological articles. I’m very much hoping that other journals will follow suit in the next few months and years by taking steps to improve the way in which research data is communicated, for the good of everyone: authors, publishers, funders and readers.

Anyway, here’s the Prezi I used to convey some of that (and more) at OKCon 2011. Huge thanks to the conference organisers for inviting me to give this talk. It was the most professionally run conference I’ve ever been to, by far. Great food, excellent WiFi provisioning, good comms, superb accommodation… I could go on. If the conference is on next year – I’ll be there for sure!

After more than 6 months of waiting, my long overdue first paper has finally been published in Nature.

Why do I say ‘long overdue’?
Doesn’t it usually take a long time for academic papers to get published?

Well… if you read it here [paywalled], you’ll see it’s only 600 words and 1 figure (at the absolute limit allowed by Nature’s editorial policy). It’s not really a full and lengthy contribution to science, just a simple “this previous paper is significantly wrong, and this is why”. I would have gone into greater depth of analysis had I been allowed to, but the strict word-limit rather prohibits this. I have no idea quite why it took so long to get accepted and published. I suspect Nature isn’t to blame. Politeness dictates that the original study authors can take a certain amount of time to reply to such letters, fair’s fair – but 6 months? Hmmm…

Anyway, some background and context:

My entire research thesis relies on re-analysing other people’s data. I have no access to specimens and I’m not a field palaeontologist; my research angle is ‘palaeoinformatics’ – large-scale re-analyses of hundreds if not thousands of palaeontological datasets. So naturally, when a new fossil specimen (reconstruction below) with novel phylogenetic data gets published in Nature, I justifiably take a keen interest.

Artistic reconstruction of Diania cactiformis by Mingguang Chi as first published in Nature

Using the freely available software program TNT, one can re-analyse the data matrix given in the Liu et al. 2011 supplementary materials in a matter of seconds on any desktop computer. Any reasonably informed parsimony analysis of this data (there are numerous parameters and settings one could specify) does NOT generate the consensus phylogeny they depict; furthermore, the ‘real’ result generates cladograms in which the position of Diania cactiformis is significantly different.

It took several different re-analyses to be sure that I was onto something interesting. Frustratingly, authors in phylogenetics aren’t always as explicit in their methods sections as I’d like them to be. In this instance (as in most papers), it was not stated which branch-collapsing rules were followed. Is it safe to assume that they used the default setting? I think not!
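For readers unfamiliar with what a parsimony analysis actually optimises, here is a minimal sketch in Python. It is not how TNT works internally and it does not use the Liu et al. matrix: the four-taxon matrix, the names and the helper functions are all invented for illustration. It uses Fitch’s algorithm to count the minimum number of state changes (the ‘length’) that a given tree requires for unordered characters; the most parsimonious tree is simply the one with the smallest length.

```python
def fitch_length(tree, states):
    """Minimum number of changes for one unordered character on a binary tree.

    tree:   nested 2-tuples of taxon names, e.g. (("A", "B"), ("C", "D"))
    states: {taxon: state} for a single character
    """
    changes = 0

    def state_set(node):
        nonlocal changes
        if isinstance(node, str):      # a tip: just its observed state
            return {states[node]}
        left, right = (state_set(child) for child in node)
        shared = left & right
        if shared:                     # children can agree: no change needed here
            return shared
        changes += 1                   # children disagree: one change required
        return left | right

    state_set(tree)
    return changes


def tree_length(tree, matrix):
    """Sum of Fitch lengths over every character in the matrix."""
    n_chars = len(next(iter(matrix.values())))
    return sum(
        fitch_length(tree, {taxon: row[i] for taxon, row in matrix.items()})
        for i in range(n_chars)
    )


# Hypothetical toy data: four taxa scored for three binary characters.
TOY_MATRIX = {"A": "000", "B": "001", "C": "110", "D": "111"}

# The three possible groupings of four taxa.
CANDIDATE_TREES = [
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
    (("A", "D"), ("B", "C")),
]

lengths = [tree_length(tree, TOY_MATRIX) for tree in CANDIDATE_TREES]
for tree, length in zip(CANDIDATE_TREES, lengths):
    print(tree, length)
```

With real data there are far too many trees to check exhaustively, and several trees are often equally short, which is where search settings and branch-collapsing rules start to matter for the consensus that gets reported.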

So, once I was sure there was a problem, I emailed the corresponding author with my concerns. This was her reply, verbatim:

Hello, Ross, thank you for your attention, now I am very busy with writing an application (the deadline is March 10th), could you wait me for few days, I will show you all the phylogeny tree which I got, the methods, the rules and so on. Thank you very much!
All best regards!

I never did get a follow-up email. [I assume because I submitted a formal reply to Nature, of which Jianni will have been notified.] As a lowly grad student I’m sadly used to getting fobbed off all the time…
Good thing I didn’t wait too long to formally reply either. It soon became apparent via Facebook that another research group intended to challenge this paper too. Well done to David Legg et al for their successful reply, also published in Nature.

Finally, I’d like to thank my supervisor Matthew Wills for helping me appropriately word my submission. I may have had the idea, and done the analyses, but it might not have been published without his excellent editorial input into the wording of the piece. We tried very hard to be polite and maintain the importance of the specimen, whilst necessarily pointing out the flaws of the analysis given.

I’ve been in Nature twice now in 2011! I didn’t write the first piece though. Can I make it a hat-trick? Time will tell…

As for the Liu et al. counter-reply, which (like the rest of the world) I’ve only just been allowed to see – I’m surprised certain bits of it made it past peer review. Their use of the PTP test is particularly intriguing and defies current scientific consensus on the validity of this test:

“Additionally, a significant value of the partitioning tail permutation (PTP) test (P = 0.01) suggests the presence of a clear phylogenetic signal in the morphological data, also strongly supporting the topology shown.”

First of all, I’ll generously assume they intended to refer to the permutation tail probability test, not the “partitioning tail permutation” [sic] test. I don’t doubt the numerical result of the PTP test they present. It’s the inference they make from that result, that it somehow “strongly support[s] the topology shown”, that I find illogical and unjustified given the numerous papers that have critically examined the PTP test. Well, it doesn’t!!! One only has to look at the titles of numerous papers on the PTP test (although, by all means, please read them!) to see that few if any can recommend its use for the purpose of supporting particular topologies:

  • Swofford et al (1996) The Topology-Dependent permutation test for monophyly does not test for monophyly.
  • Slowinski & Crother (1998) Is the PTP test useful? [Barely; only to determine if data has ‘signal’]
  • Peres-Neto & Marques (2000) When are random data not random, or is the PTP test useful? [Largely, no]
  • Harshman (2001) Does the T-PTP test tell us anything we want to know? [Largely, no]
  • Wilkinson, M. et al (2002) Type 1 error rates of the parsimony permutation tail probability test. [Points out errors in the Peres-Neto & Marques 2000 analysis, but still agrees that “the parsimony PTP cannot generally be assumed to guarantee well-supported phylogenetic hypotheses.”]

FYI the original papers describing the PTP test are Archie 1989; Faith 1991; and Faith & Cranston 1991 [full citations and links given at the bottom].
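For anyone who hasn’t met the PTP test before, here is a toy sketch in Python of the general idea as I understand it: shuffle the states of each character among the taxa, find the length of the shortest tree for each shuffled matrix, and report the proportion of matrices (the original included) whose shortest tree is as short as or shorter than the original’s. Everything here is an assumption for illustration: the four-taxon matrix is invented, the ‘tree search’ is just a comparison of the three possible four-taxon groupings, and the function names are mine. It is not the analysis run by Liu et al. or by any of the papers listed above.

```python
import random


def fitch_length(tree, states):
    """Minimum number of changes for one unordered character on a binary tree."""
    changes = 0

    def state_set(node):
        nonlocal changes
        if isinstance(node, str):
            return {states[node]}
        left, right = (state_set(child) for child in node)
        if left & right:
            return left & right
        changes += 1
        return left | right

    state_set(tree)
    return changes


def shortest_length(matrix, trees):
    """Length of the most parsimonious tree among the candidate trees."""
    taxa = list(matrix)
    n_chars = len(matrix[taxa[0]])
    return min(
        sum(fitch_length(tree, {t: matrix[t][i] for t in taxa}) for i in range(n_chars))
        for tree in trees
    )


def ptp(matrix, trees, n_perm=99, seed=0):
    """Return (observed shortest length, permutation tail probability)."""
    rng = random.Random(seed)
    taxa = list(matrix)
    n_chars = len(matrix[taxa[0]])
    observed = shortest_length(matrix, trees)
    as_short = 0
    for _ in range(n_perm):
        columns = []
        for i in range(n_chars):
            column = [matrix[t][i] for t in taxa]
            rng.shuffle(column)            # permute states among taxa, per character
            columns.append(column)
        permuted = {t: "".join(col[k] for col in columns) for k, t in enumerate(taxa)}
        if shortest_length(permuted, trees) <= observed:
            as_short += 1
    # The original matrix counts as one of the data sets.
    return observed, (as_short + 1) / (n_perm + 1)


# Hypothetical toy data: four taxa, six binary characters.
TOY_MATRIX = {"A": "000000", "B": "000001", "C": "111110", "D": "111111"}
FOUR_TAXON_TREES = [
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
    (("A", "D"), ("B", "C")),
]

observed, p_value = ptp(TOY_MATRIX, FOUR_TAXON_TREES)
print(f"shortest tree length = {observed}, PTP = {p_value:.2f}")
```

Note that the function returns a single P value for the whole matrix. Nothing in the calculation refers to any particular tree, which is why a significant PTP can at most tell you the data are more structured than shuffled data, not that a specific topology is well supported.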

There are other problems with the Liu et al reply which I hope others may be able to see for themselves but I’ll leave it at that for now. I perceive this statement for instance:

In this context, Mounce and Wills seem to have overlooked the potential significance of their reanalysis of our data.

to be a bit of a ‘cheap shot’, especially considering the extremely short word limit enforced on our comment by Nature. Of course we would have loved to have described the implications of the reanalysis for each and every character – but this simply wasn’t relevant enough to the criticism we were presenting. How this was deemed relevant to the rebuttal of our valid and justified points in the Mounce & Wills comment, I leave up to you to decide.

I’d also like to say how much I admire the work of many of the scientists on the Liu et al paper. I think they produce some excellent work, and I’ve met Jason Dunlop in particular many times – he’s a nice guy and an excellent scientist. As a middle author I’m pretty confident he has little to do with the problems inherent in the original Liu et al Nature paper and the subsequent Liu et al counter-reply. My own objections to both papers are nothing personal – just an obsession with good science and logical reasoning!

Welcome to the world of academia…

    References & Links:

Archie, J. W. A randomization test for phylogenetic information in systematic data. Systematic Biology 38, 239-252 (1989).

Callaway, E. Fossil data enter the web period. Nature (2011).

Coddington, J. & Scharff, N. Problems with zero-length branches. Cladistics 10, 415-423 (1994).

Faith, D. P. Cladistic permutation tests for monophyly and nonmonophyly. Systematic Zoology 40, 366-375 (1991).

Faith, D. P. & Cranston, P. S. Could a cladogram this short have arisen by chance alone?: On permutation tests for cladistic structure. Cladistics 7, 1-28 (1991).

Harshman, J. Does the T-PTP test tell us anything we want to know? Systematic Biology 50 (2001).

Legg, D. A. et al. Lobopodian phylogeny reanalysed. Nature 476, E1 (2011).

Liu, J., Steiner, M., Dunlop, J., Keupp, H., Shu, D., Ou, Q., Han, J., Zhang, Z. & Zhang, X. An armoured Cambrian lobopodian from China with arthropod-like appendages. Nature 470 (7335), 526-530 (2011). DOI: 10.1038/nature09704

Liu, J. et al. Liu et al. reply. Nature 476, E1 (2011).

Mounce, R. & Wills, M. Phylogenetic position of Diania challenged. Nature 476 (7359) (2011). DOI: 10.1038/nature10266

Peres-Neto, P. R. & Marques, F. When are random data not random, or is the PTP test useful? Cladistics 16, 420-424 (2000).

Slowinski, J. B. & Crother, B. I. Is the PTP test useful? Cladistics 14, 297-302 (1998).

Swofford, D. L., Thorne, J. L., Felsenstein, J. & Wiegmann, B. M. The Topology-Dependent permutation test for monophyly does not test for monophyly. Systematic Biology 45, 575-579 (1996).

Wilkinson, M., Peres Neto, P. R., Foster, P. G. & Moncrieff, C. B. Type 1 error rates of the parsimony permutation tail probability test. Systematic Biology 51, 524-527 (2002).