Show me the data!

A FlyTree in the Ointment of Falsifiability

June 10th, 2011 | Posted by rmounce in Open Data | Phylogenetics

Part inspired by the ‘Bugs!’ blogging contest, part inspired by Morgan Jackson’s post, I thought I’d write some thoughts and observations on the recent FlyToL paper (Wiegmann et al., PNAS, 2011).

[IMPORTANT UPDATE: Since I wrote this blogpost, the authors have made the data publicly available. In fact, credit to them – they’ve done more than they strictly needed to and deposited their data matrix in TreeBASE here. I’m sure this data will now be a valuable and well-cited resource for the rest of the scientific community.]

First, a disclaimer:

My remarks should NOT be taken personally. They are my views, and my views alone (although hopefully many would support them). The following observations are in fact just one high-profile case of some of the points I shall make in my SystAss 2011 talk in Belfast next month (draft abstract here). I have a special interest in this topic because my research, my PhD, my life(!) utterly depend on being able to re-use and reproduce other people’s primary phylogenetic analyses. My research is that of synthesis and meta-analysis – no shame, and no lesser a science IMO. My only ‘agenda’ is that of a concern for my discipline: phylogenetics and its reproducibility.


So… to begin, with a bang!

Short Summary: The phylogeny reported in the recent FlyToL paper is not falsifiable. That is not to say that it is ‘false’ or ‘made-up’, just that the hypothesis cannot be challenged because insufficient evidence is provided with the paper. Personally, I do not doubt that the results presented are the correct results of the analysis they describe; it just irks me that the normally mandatory and indeed requisite evidence hasn’t been presented. I have notified the authors and editors of my objections and had some acknowledgement of the issue but, so far, no visible improvement to the publicly provided evidence has been made.

The Details: Most published phylogenies represent quantitatively calculated, evidence-derived hypotheses of evolutionary relationships between taxa. Thus, when a phylogeny is presented as a result of an analysis in a paper, the presentation of the underlying evidence and of the methods used to generate it is crucial. Without one or both, the phylogeny presented becomes irrefutable or non-falsifiable. This is the unfortunate situation that the FlyToL phylogeny is in.

Where is the evidence for figure 1 (the phylogeny, the raison d’être of the paper)? I could draw a tree like that myself with pen and paper – the supporting evidence is what makes it a scholarly hypothesis of phylogeny.

The method is well detailed. They used RAxML and MrBayes as stated in the ‘Phylogenetic Inference’ subsection of the Materials and Methods, and further details are given in the separate ‘Supporting Information’ file.

But what data did they analyse with these methods? Well, we are given the identity of the taxa they analysed. We are told they used molecular AND morphological data in some of their analyses. But for the phylogeny to be falsifiable, we need to know and have access to the exact molecular and morphological data they used, down to the very last DNA base pair (for molecular) and genital character (for morphological). Remarkably, neither dataset is present in this paper, nor linked to from this paper. Molecular sequence data as used in phylogenetic analyses *has to* be databased in GenBank as a pre-requisite before most papers will even be allowed to be published. It is clearly written in most editorial policies e.g. Nature, Science, PNAS, Syst Biol… take your pick! I assume they have GenBank’d their sequences; they just don’t seem to have supplied the GenBank Accession numbers with the paper (as they claim to, in Table S1, numbers still absent at the time of writing this blogpost [10/06/11]). Likewise with the morphological data – it is not available. One can find a description of the characters they used “on the FLYTREE morphology Web site” but this is NOT the data itself.
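For the programmatically inclined, the check I am describing can be automated. What follows is only a minimal sketch in Python, assuming the supplementary table can be exported as CSV with one row per taxon and one column per gene. The column names, taxon names and accession strings are placeholders I made up for illustration; they are not taken from the actual Table S1.

```python
import csv
import io

# Hypothetical excerpt of a supplementary table: one row per taxon, one column per gene.
# Every name and accession string here is a made-up placeholder, not a real record.
TABLE = """taxon,gene_a,gene_b
Taxon one,ACC_PLACEHOLDER_1,ACC_PLACEHOLDER_2
Taxon two,,ACC_PLACEHOLDER_3
Taxon three,,
"""


def missing_accessions(table_text):
    """Return {taxon: [gene columns with no accession]} for every incomplete row."""
    missing = {}
    for row in csv.DictReader(io.StringIO(table_text)):
        taxon = row.pop("taxon")
        # A blank or whitespace-only cell counts as a missing accession number.
        empty = [gene for gene, acc in row.items() if not (acc or "").strip()]
        if empty:
            missing[taxon] = empty
    return missing


if __name__ == "__main__":
    for taxon, genes in missing_accessions(TABLE).items():
        print(f"{taxon}: no accession for {', '.join(genes)}")
```

A check like this only confirms that accession numbers are listed; it says nothing about whether the records behind them are public or match the alignment that was actually analysed.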

If *all* phylogenetic hypotheses were allowed to be published without requisite supporting evidence, I would be extremely worried about the discipline as a whole.

Also, this affects my ability to do my research – I could easily have re-used that morphological dataset in some of my work, and given them a citation in the process! [N.B. After a fair few email exchanges, I did eventually get given the morphological dataset, but still no sight of the molecular dataset, or its GenBank Accession numbers]

The nature of my research means I come across such [assumed] innocent mistakes in papers relatively frequently – most journal editors in my experience are both grateful and pro-active when I raise such data issues: so take a bow Paul D. Taylor (Journal of Systematic Palaeontology), Paul Barrett & Annalisa Berta (Journal of Vertebrate Paleontology), Peter J Hayward (Zoo J Linn Soc), Henry Gee (Nature) etc… but I’m surprised PNAS haven’t done anything about this yet.

What do you think? Am I being an overly pedantic nutcase? Or is there some validity to my concerns? Should I dare raise it again?

Reference: Wiegmann, B. M. et al. Episodic radiations in the fly tree of life. Proceedings of the National Academy of Sciences 108, 5690-5695 (2011).

Comments (resurrected from the old blog, sorry for the lack of formatting)

Brian has been quite good in the past in submitting his alignments and trees to TreeBASE, so I don’t know what happened here. It could just be an oversight, but a very annoying one, and definitely one worth pursuing. I think that the PNAS editor and reviewers should be ashamed of their work; it should not have appeared without all the supporting evidence.

12 months ago

Ross Mounce
That’s what I really don’t *get*. It’s a massive collaboration of top top scientists – a who’s who of dipteran systematists, and an NSF ToL grant. It shouldn’t have happened in the first place, and I don’t know why it hasn’t been (easily) corrected yet. Perhaps there are technical difficulties over at GenBank getting accession numbers for so many sequences(?).

They told me they’re withholding the morphology matrix so they can submit that as part of another set of analyses to Cladistics (is this legit?). It will be interesting to see what’s in that paper when it comes out…

Publicly (tax-payer) funded research in particular should be extra careful to be transparent and reproducible. Phylogenetics shouldn’t encounter a ‘Climategate’-like situation, but can you imagine if this happened with a paper on, say, the origins of HIV or avian flu? That’d be the perfect recipe for a media disaster.

12 months ago in reply to bljog

J. Salvador Arias
In general, I agree with you ;)! But I am more concerned with reproduction and legacy than with falsifiability.

You can test hand-made (or random) relationships, even in the absence of the source data. So you might test the FlyToL tree, with published matrices or not. That is why you can test Linnaeus’s groups, and those of countless classic authors, even if they never published anything like a phylogenetic matrix.

But it is a valid point to be skeptical of the results. That is in fact a good attribute for a scientist! And as long as the data from the FlyToL continue to be private, their results are just high-profile preaching (and, like you, I believe that they show the right results).

Systematics was hampered by a long history of research based on authority, and this is just a sad return of such practices.

So keep fighting! You are right!

12 months ago

Juan Pablo Carbajal
You are not pedantic.
These issues (reproducibility, falsifiability) should be so clear to everyone that your post would make no sense. BUT IT DOES!
You are taking the pains (and the ugly looks) of many just to safeguard the values of scientific work and collaboration. All my support!

It is a pity that I have no knowledge of the subject you are talking about; if I did, I would’ve said something less general.

12 months ago

Morgan Jackson
Excellent point, Ross, and a matter which I didn’t have room to discuss in my post. My assumption is that the GenBank numbers will be included in the final publication (it is still in pre-publication, and not quite “official” yet), although that’s really no excuse for a paper making such large claims. When the data is made available, however, I’m sure there will be many people who will be sectioning pieces off and using bits in various other analyses, and I’d be interested to hear more about your potential uses of the data!

11 months ago

Ross Mounce
Hey thanks Morgan, I liked your post too. Interesting points about the topology they found.

It’s true it’s still ‘pre-publication’ but I haven’t seen any other pre-publication papers without underlying data supplied – this isn’t a valid excuse IMO. And usually ‘pre-publication’ papers come out in PNAS very soon after with a turn-around of ~ a month, no?

This has been hanging around in ‘pre-print’ status since March 14th! Perhaps *because* of these data issues?

Also, in the ‘Digital Age’ – when papers come out online – that IS effectively when they get published (in all but terms of citation date which takes the paper publication date).

If it wasn’t ready to be published online, it shouldn’t have been put online. Am I too harsh?

Like you say, when the data does get published I’m sure it’ll get a LOT of re-usage. Which is great for them, the journal and the re-users (myself included). In the meantime, all those parties are missing out on that benefit: Science is the loser.

11 months ago in reply to Morgan Jackson

Karen Cranston
You have probably seen this by now, but just in case you haven’t, the TreeBASE submission is here:

I just talked to Brian, and the GenBank records are submitted but not yet public (some of the required intron annotations are not done).

10 months ago in reply to Ross Mounce