Show me the data!

Part inspired by the ‘Bugs!’ blogging contest, part inspired by Morgan Jackson’s post, I thought I’d write some thoughts and observations on the recent FlyToL paper (Wiegmann et al., PNAS, 2011).

[IMPORTANT UPDATE: Since I wrote this blogpost, the authors have made the data publicly available. In fact, credit to them – they’ve done more than they strictly needed to and deposited their data matrix in TreeBASE here. I’m sure this data will now be a valuable and well-cited resource for the rest of the scientific community.]

First, a disclaimer:

My remarks should NOT be taken personally. They are my views, and my views alone (although hopefully many would support them). The following observations are in fact just one high-profile case of some of the points I shall make in my SystAss 2011 talk in Belfast next month (draft abstract here). I have a special interest in this topic because my research, my PhD, my life(!) utterly depend on being able to re-use and reproduce other people’s primary phylogenetic analyses. My research is that of synthesis and meta-analysis – no shame, and no lesser a science IMO. My only ‘agenda’ is that of a concern for my discipline: phylogenetics and its reproducibility.


So… to begin, with a bang!

Short Summary: The phylogeny reported in the recent FlyToL paper is not falsifiable. That is not to say that it is ‘false’ or ‘made-up’, just that the hypothesis cannot be challenged because insufficient evidence is provided with the paper. Personally, I do not doubt that the results presented are the correct results of the analysis they describe; it just irks me that the normally mandatory and indeed requisite evidence hasn’t been presented. I have notified the authors and editors of my objections and had some acknowledgement of the issue, but so far no visible improvement to the publicly-provided evidence has been made.

The Details: Most published phylogenies represent quantitatively-calculated evidence-derived hypotheses of evolutionary relationships between taxa. Thus, when a phylogeny is presented as a result of an analysis in a paper – the presentation of the underlying evidence and methods used to generate this are crucial. Without one or both, the phylogeny presented becomes unrefutable or non-falsifiable. This is the unfortunate situation that the FlyToL phylogeny is in.

Where is the evidence for figure 1 (the phylogeny, the raison d’être of the paper)? I could draw a tree like that myself with pen and paper – the supporting evidence is what makes it a scholarly hypothesis of phylogeny.

The method is well detailed. They used RAxML and MrBayes as stated in the ‘Phylogenetic Inference’ subsection of the Materials and Methods, and further details are given in the separate ‘Supporting Information’ file.

But what data did they analyse with these methods? Well, we are given the identity of the taxa they analysed. We are told they used molecular AND morphological data in some of their analyses. But to be falsifiable, we need to know and have access to the exact molecular and morphological data they used, down to the very last DNA base pair (for molecular) and genital character (for morphological). Remarkably, neither dataset is present in this paper, nor linked to from this paper. Molecular sequence data as used in phylogenetic analyses *has to* be databased in GenBank as a pre-requisite before most papers will even be allowed to be published. It is clearly written in most editorial policies e.g. Nature, Science, PNAS, Syst Biol… take your pick! I assume they have GenBank’d their sequences, they just don’t seem to have supplied the GenBank Accession numbers with the paper (as they claim to, in Table S1, numbers still absent at the time of writing this blogpost [10/06/11]). Likewise with the morphological data – it is not available. One can find a description of the characters they used “on the FLYTREE morphology Web site” but this is NOT the data itself.

If *all* phylogenetic hypotheses were allowed to be published without requisite supporting evidence, I would be extremely worried about the discipline as a whole.

Also, this affects my ability to do my research – I could easily have re-used that morphological dataset in some of my work, and given them a citation in the process! [N.B. After a fair few email exchanges, I did eventually get given the morphological dataset, but still no sight of the molecular dataset, or its GenBank Accession numbers]

The nature of my research means I come across such [assumed] innocent mistakes in papers relatively frequently – most journal editors in my experience are both grateful and pro-active when I raise such data issues: so take a bow Paul D. Taylor (Journal of Systematic Palaeontology), Paul Barrett & Annalisa Berta (Journal of Vertebrate Paleontology), Peter J Hayward (Zoo J Linn Soc), Henry Gee (Nature) etc… but I’m surprised PNAS haven’t done anything about this yet.

What do you think? Am I being an overly pedantic nutcase? Or is there some validity to my concerns? Should I dare raise it again?

Reference: Wiegmann, B. M. et al. Episodic radiations in the fly tree of life. Proceedings of the National Academy of Sciences 108, 5690-5695 (2011).

Comments (resurrected from the old blog, sorry for the lack of formatting)

Brian has been quite good in the past in submitting his alignments and trees to TreeBASE, so I don’t know what happened here. It could just be an oversight, but a very annoying one, and definitely one worth pursuing. I think that the PNAS editor and reviewers should be ashamed of their work; it should not have appeared without all the supporting evidence.

12 months ago 1 Like

Ross Mounce
That’s what I really don’t *get*. It’s a massive collaboration of top top scientists – a who’s who of dipteran systematists, and an NSF ToL grant. It shouldn’t have happened in the first place, and I don’t know why it hasn’t been (easily) corrected yet. Perhaps there are technical difficulties over at GenBank getting accession numbers for so many sequences(?).

They told me they’re withholding the morphology matrix so they can submit that as part of another set of analyses to Cladistics (is this legit?). It will be interesting to see what’s in that paper when it comes out…

Publicly (tax-payer) funded research in particular should be extra careful to be transparent and reproducible. Phylogenetics shouldn’t encounter a ‘Climategate’-like situation but can you imagine if this happened with a paper on say the origins of HIV or avian flu? That’d be the perfect recipe for a media disaster.

12 months ago in reply to bljog 1 Like

J. Salvador Arias
In general, I agree with you ;)! But I am most concerned with reproducibility and legacy rather than falsifiability.

You can test hand-made (or random) relationships, even in the absence of the source data. So you might test the FlyToL tree, with published matrices or not. That is why you can test Linnaeus’s groups, and those of countless classic authors, even if they never published anything like a phylogenetic matrix.

But it is a valid point to be skeptical of the results. That is in fact a good attribute for a scientist! And as long as the data from the FlyToL continue to be private, their results are just high-profile preaching (and like you, I believe that they show the right results).

Systematics was hampered by a long history of research based on authority, and this is just a sad return of such practices.

So keep fighting! You are right!

12 months ago 1 Like

Juan Pablo Carbajal
You are not pedantic.
These issues (reproducibility, falsifiability) should be so clear to everyone that your post makes no sense. BUT IT DOES!
You are taking the pains (and the ugly looks) of many just to safeguard the values of scientific work and collaboration. All my support!

It is a pity that I have no knowledge of what you are talking about; if I had, I would’ve said something less general.

12 months ago 1 Like

Morgan Jackson
Excellent point Ross, and a matter which I didn’t have room to discuss in my post. My assumption is that the GenBank numbers will be included in the final publication (it is still in pre-publication, and not quite “official” yet), although that’s really no excuse for a paper making such large claims. When the data is made available however, I’m sure there will be many people who will be sectioning pieces off and using bits in various other analyses, and I’d be interested to hear more about your potential uses of the data!

11 months ago

Ross Mounce
Hey thanks Morgan, I liked your post too. Interesting points about the topology they found.

It’s true it’s still ‘pre-publication’ but I haven’t seen any other pre-publication papers without underlying data supplied – this isn’t a valid excuse IMO. And usually ‘pre-publication’ papers come out in PNAS very soon after with a turn-around of ~ a month, no?

This has been hanging around in ‘pre-print’ status since March 14th! Perhaps *because* of these data issues?

Also, in the ‘Digital Age’ – when papers come out online – that IS effectively when they get published (in all but terms of citation date which takes the paper publication date).

If it wasn’t ready to be published online, it shouldn’t have been put online. Am I too harsh?

Like you say, when the data does get published I’m sure it’ll get a LOT of re-usage. Which is great for them, the journal and the re-users (myself included). In the meantime, all those parties are missing out on that benefit: Science is the loser.

11 months ago in reply to Morgan Jackson

Karen Cranston
You have probably seen this by now, but just in case you haven’t, the TreeBASE submission is here:

I just talked to Brian, and the GenBank records are submitted but not yet public (some of the required intron annotations are not done).

10 months ago in reply to Ross Mounce 1 Like

What’s your Norell number?

Taking inspiration from the Erdős number game, I thought I’d use ColWiz’s ‘Link Explorer’ function, and my citeulike bibliographic data to explore (in a highly unscientific way!) the authorship connections between people in my reference database of ~2750 papers.

Mark A Norell has LOADS of publications. Not too surprising considering he works at the AMNH (without doubt the best palaeontological research institute in the world if one measures by number of publications and/or Impact Factor).

How many authorship connections does he have? In my library I have just 35 of his innumerable publications, a small sample of what he’s truly published. In these 35 publications (mostly vertebrate phylogeny related stuff, 2000-present) he has a total of ~55 different co-authors.

I would like to have calculated this exactly but sadly ColWiz only provides a visual exploration with no quantitation.

So his 1st order co-author relationships according to my (limited) data look like this:

Now it gets interesting… What about the 2nd order relations, the co-authors of co-authors of his (according to my limited bib-library)? Here’s the result below. *wow*


Far, far too many authors to count manually! From this I’d guess that no published palaeontologist with more than say 3 papers has a Norell number greater than 5.

Does anyone know any good, FREE, quantitative bibliometric software that would actually calculate some of this for me, rather than just providing pretty (but fairly meaningless) visualisations?
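In the absence of such software, the counting itself is not hard to sketch. What follows is a minimal, unofficial Python sketch, not anything ColWiz or citeulike provides: it assumes the bibliographic data has already been turned into a plain list of author lists (one per paper), builds a co-authorship graph from that, and uses a breadth-first search to get every author's distance from a chosen 'hub'. The function names and the toy author labels ("Hub", "A", "B"…) are made up purely for illustration.

```python
from collections import deque
from itertools import combinations


def build_coauthor_graph(papers):
    """Map each author to the set of people they have shared a paper with."""
    graph = {}
    for authors in papers:
        for author in authors:
            graph.setdefault(author, set())
        # Every pair of authors on the same paper gets an undirected link.
        for a, b in combinations(set(authors), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def collaboration_distances(graph, source):
    """Breadth-first search: shortest co-authorship distance from source."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for neighbour in graph.get(current, ()):
            if neighbour not in distances:
                distances[neighbour] = distances[current] + 1
                queue.append(neighbour)
    return distances


# Toy data only: each inner list is the author list of one made-up paper.
toy_papers = [
    ["Hub", "A", "B"],
    ["B", "C"],
    ["C", "D"],
    ["E"],  # never co-authored with anyone, so unreachable from "Hub"
]

graph = build_coauthor_graph(toy_papers)
distances = collaboration_distances(graph, "Hub")
first_order = sorted(name for name, d in distances.items() if d == 1)
```

With real data, the 1st order co-author count would be the number of authors at distance 1, the 2nd order count those at distance 2, and anyone missing from the result has no co-authorship path to the hub within that library. Name disambiguation (initials, spelling variants) is ignored here and would matter a great deal in practice.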

Could there be a better candidate for the linkman in palaeontology? On Facebook, Dave Godfrey suggested Romer, Halstead, or Ostrom. They’re great suggestions but I’m looking for a good link-point for 21st century palaeontology tbh. So Norell it is!

Anyway, back to *proper* work…

PS If one counts my Open Letter as a publication then I have a Norell number of 3 (via co-authorship with Graeme Lloyd who has a Norell number of 2). If one only counts my soon to be published paper with my supervisor then my Norell number is 4 (Ross Mounce -> Matthew Wills -> Paul Barrett -> Diego Pol -> Norell). Some of my friends e.g. Shaena Montanari are lucky enough to have the lowest Norell number possible already -> 1. It’s a small world…

UPDATE 01/06/11 I’ve now submitted a modified version of this abstract. Many thanks to all who commented.

One of the many conferences I’m going to this year is a fairly big event. The abstract deadline also happens to be very soon too. So I thought I’d post a draft of my abstract here and see what peeps think.

Is my Latin correctly formed? What do you think? Here it is anyway…

Nullius in Calculo: On the Explicitness and Reproducibility of Cladistic Analyses

Author: Mounce, Ross C P

Abstract: The result of a cladistic analysis should be repeatable. Yet I present here numerous examples of recent papers in which the results contained therein cannot be replicated, given only the content of the paper, supplementary materials and links. Barriers to study replication include (1) absence of requisite information, (2) typesetting errors, and even (3) author error. I argue that these problems, many of which are easily spotted, should not be appearing in peer-reviewed published papers with such regularity. I humbly suggest that reviewers and editors not only examine the words of papers, but also the underlying data and calculations: Nullius in Verba, Nullius in Calculo. Furthermore, I believe the reporting of phylogenetic analyses would greatly benefit from increased standardization following community-agreed criteria, cf. MIAPA (Leebens-Mack et al, 2006), and data deposition in appropriate data archives specifically designed to accommodate phylogenetic data e.g. TreeBASE or MorphoBank. In addition to problems with reproducibility, I also detail problems with explicitness of method reporting. A detailed manual examination of over 300 recently published ILD tests provides evidence to suggest that method sections are rarely sufficiently explicit in their detail to exactly replicate the methods used to generate reported results. In particular I suggest that authors should be encouraged to explicitly state how ‘gaps’ are coded and treated, and which branch collapsing rules are followed in analyses. Different settings can and do generate different results, therefore all such important settings should be explicitly stated.

[end of abstract]

Some further comment:

Any feedback good/bad/indifferent would be much appreciated.
My supervisor has given it ‘the green light’ in principle, but obviously is keen for me to handle the topic delicately and sensitively.

I’m not out to ‘name and shame’ individual errors with this – it’s the system that needs changing IMO, and I’ll do my damnedest to make that crystal clear when I give the talk.

I’d be pretty surprised if this abstract was rejected – I’ve got strong evidence including an accepted Nature paper demonstrating some of this. Would love to blog more about this but Nature’s embargo policy rather prevents/scares me from doing so! My last talk at a Systematics Association conference (below) went down a storm too, picking up a special commendation from the judging panel for its “unscorable” uniqueness [I used a Prezi] so I can only hope this next one will be as successful.


May 26th, 2011 | Posted by rmounce in Conferences | phdchat | Travel - (0 Comments)

I’ve been travelling rather a LOT in the last 6 weeks or so; Rhossili Bay, London, Edinburgh, Leicester, Barcelona, Cambridge. It’s so nice to be back in Bath for a bit again.

While the memories are still fresh I thought I’d write a short summary of what I’ve been doing -> so much to update on my CV!


April 16 – 21 in Rhossili Bay helping my supervisor (Matthew Wills) teach undergrads about ecology, experimental design and real fieldwork. The weather was absolutely “lush” as they say in Wales. The students were well-behaved and did some excellent projects while they were there – staying up til past 3am in some instances on the last night to finish off their presentations!

The only downside was the lack of mobile phone signal. It’s stressful to be without internet for so long!


The very next week (April 26 – 28) I flew up to Edinburgh for a PaleoDB short course. Congratulations are sincerely owed to the organisers, especially Al McGowan, for taking the time to arrange such a beneficial event. PaleoDB is an *excellent* free-to-use resource but I do wish it was more Open. I had no hesitation in pointing this out during discussions. I have since applied to become a contributing member of the database (application pending).


After Edinburgh, there was the small matter of a conference to attend: Progressive Palaeontology 2011, ably hosted at Leicester University. I gave a well-received talk on my research, sneaking in some data sharing advocacy at the end:

Best of all, one of my labmates Anne O’Connor won the Best Poster prize – w00t!


Next-up: a well deserved holiday in Barcelona sunshine. Beautiful beaches, bountiful cerveza, and amazing architecture. Lovely!

Inside the Sagrada Família


No rest for the wicked… on the 19th and 20th I attended a short course at EBI-Hinxton entitled ‘Linking Open Data in Biology using Ontologies and Literature Mining’. Had a superlative 3 course dinner in St Catharine’s College (Cambridge University) on the Thursday night, and a further meal in The Eagle the next night.

You might reasonably ask – why is a palaeontologist interested in ontologies and text mining? Well, it’s simple really – this field is crying out for these techniques to be applied, and I hope I’ll be one of the first to utilize them on palaeontological data. The pickings could be rich, if only we had Open linked knowledge infrastructure in place… I shall no doubt blog more in future on this topic.


Finally, today I’ll be going to a London BioGeeks event. Really looking forward to finally meeting Mark Hahnel face to face for the first time. I really think his FigShare initiative is an excellent idea. I’ve supported it myself by adding a few bits of test data. Will no doubt add more in the future…

Future Travel

Needless to say, all this travelling has completely wiped out my £200 BBSRC Travel Grant. Have applied for more grants to fund my trips to Belfast, Berlin, Brazil and Las Vegas later on this year!

*fingers crossed* – for richer or poorer, it’s going to be an exciting year!