Show me the data!

While doing some background research ahead of my imminent talk at OKCon 2011 in Berlin, I noticed that the Open Access journal PLoS ONE is actually an excellent journal to publish in, with respect to Impact Factor.

In the Digital Age, journals are merely vessels in which we can publish our research. Aside from the prestige of the huge, well-established journals like Nature and Science, there’s not all that much difference between the other journals. Sure, there’s cost to think about, the perceived quality of peer review, the length of time it takes to get from submission to print, and a few other factors, but really it’s impact (not necessarily best measured by the Impact Factor metric, as Bjorn Brembs often points out) that, for me at least, is the most important.

If the PLoS ONE Paleontology Collection were a journal, it would have a 2010 Impact Factor of 4.15, which would make it the #1 Paleontology-specific journal (vs 2009 JCR ‘Paleontology’ journal scores). But it’s not a journal, so perhaps the comparison is an unfair one. Likewise, I’m sure that if one collected together Nature’s palaeontological articles and treated them as a ‘journal’, that ‘Nature Palaeontology’ pseudo-journal would have a massive Impact Factor.

Here are my calculations (numbers listed in the order that the publications appear in my personal online CUL library, linked to below):

Cites in 2010 to items published in: 2009 = 3 + 6 + 6 + 1 + 1 + 0 + 2 + 1 + 3 + 5 + 6 + 0 + 2 + 5 + 3 + 2 + 5 + 5 + 2 + 4 + 2 + 2 + 3 + 12 = 81

Cites in 2010 to items published in: 2008 = 0 + 6 + 4 + 3 + 4 + 5 + 4 + 10 + 4 + 3 + 6 + 0 + 5 + 8 + 15 + 8 = 85

Number of items published in: 2009 = 24 (link to bibliography)

Number of items published in: 2008 = 16 (link to bibliography)

Calculation: IF = (Cites to recent items / Number of recent items) = (81+85) / (24+16) = 4.15
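For anyone who wants to check the arithmetic, here is a minimal Python sketch of the calculation above. The citation counts are copied straight from the two lists in this post; the function name `impact_factor` and the variable names are illustrative labels only, and this reproduces the rough Google Scholar-based estimate, not whatever procedure the official JCR uses.

```python
# Cites received in 2010 by each Collection paper, as listed in the post.
cites_2010_to_2009 = [3, 6, 6, 1, 1, 0, 2, 1, 3, 5, 6, 0,
                      2, 5, 3, 2, 5, 5, 2, 4, 2, 2, 3, 12]
cites_2010_to_2008 = [0, 6, 4, 3, 4, 5, 4, 10, 4, 3, 6, 0, 5, 8, 15, 8]


def impact_factor(*cites_per_year):
    """Two-year-style IF: total cites to recent items / number of recent items.

    Each argument is a list with one citation count per published item.
    (Hypothetical helper, named here only for illustration.)
    """
    total_cites = sum(sum(cites) for cites in cites_per_year)
    total_items = sum(len(cites) for cites in cites_per_year)
    return total_cites / total_items


if_2010 = impact_factor(cites_2010_to_2009, cites_2010_to_2008)
print(f"Cites: {sum(cites_2010_to_2009)} + {sum(cites_2010_to_2008)}")
print(f"Items: {len(cites_2010_to_2009)} + {len(cites_2010_to_2008)}")
print(f"Estimated 2010 IF: {if_2010:.2f}")
```

Keeping one count per paper (rather than just the totals) makes it easy to correct a single entry if Google Scholar turns out to have the wrong year for an article.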

Of course, Thomson Reuters’ official JCR probably doesn’t count citations from journals such as “Caminhos de Geografia”, and Google Scholar (which I used because it’s much quicker/easier/Open than WoK) doesn’t always provide the correct year metadata for each article. But still, as a rough estimate, I think this is quite impressive. Well done PLoS!

The task now is to convince fellow palaeontologists that it’s worth publishing here.

Every day I get hugely frustrated that I can’t access articles published in otherwise excellent journals such as Neues Jahrbuch für Geologie und Paläontologie, Abhandlungen, the Canadian Journal of Earth Sciences, and Zootaxa.

These journals, and the authors who publish in this Closed manner, aren’t doing themselves any favours IMO. What’s the point of publishing research if only a very select few people can read it?

Granted, many palaeontologists will happily send you a PDF if you ask for one, either directly via email or on a mailing list such as VRTPALEO, but those routes don’t always work…

Whether it be ‘Gold’ Open Access or ‘Green’ Open Access, it’s a simple matter of logic that Open Access is beneficial for authors and readers alike.

Part inspired by the ‘Bugs!’ blogging contest, part inspired by Morgan Jackson’s post, I thought I’d write up some thoughts and observations on the recent FlyToL paper (Wiegmann et al., PNAS, 2011).

[IMPORTANT UPDATE: Since I wrote this blogpost, the authors have made the data publicly available. In fact, credit to them – they’ve done more than they strictly needed to and deposited their data matrix in TreeBASE here. I’m sure this data will now be a valuable and well-cited resource for the rest of the scientific community.]

First, a disclaimer:

My remarks should NOT be taken personally. They are my views, and my views alone (although hopefully many would support them). The following observations are in fact just one high-profile case of some of the points I shall make in my SystAss 2011 talk in Belfast next month (draft abstract here). I have a special interest in this topic because my research, my PhD, my life(!) utterly depend on being able to re-use and reproduce other people’s primary phylogenetic analyses. My research is that of synthesis and meta-analysis – no shame, and no lesser a science IMO. My only ‘agenda’ is that of a concern for my discipline: phylogenetics and its reproducibility.

</end>

So… to begin, with a bang!

Short Summary: The phylogeny reported in the recent FlyToL paper is not falsifiable. That is not to say that it is ‘false’ or ‘made-up’, just that the hypothesis cannot be challenged because insufficient evidence is provided with the paper. Personally, I do not doubt that the results presented are the correct results of the analysis they describe, it just irks me that the normally mandatory and indeed requisite evidence hasn’t been presented. I have notified the authors and editors of my objections and had some acknowledgement of the issue but so far, no visible improvement to the publicly-provided evidence has been made.

The Details: Most published phylogenies represent quantitatively-calculated, evidence-derived hypotheses of evolutionary relationships between taxa. Thus, when a phylogeny is presented as a result of an analysis in a paper, the presentation of the underlying evidence and of the methods used to generate it is crucial. If either is missing, the phylogeny presented becomes irrefutable, i.e. non-falsifiable. This is the unfortunate situation that the FlyToL phylogeny is in.

Where is the evidence for figure 1 (the phylogeny, the raison d’être of the paper)? I could draw a tree like that myself with pen and paper – the supporting evidence is what makes it a scholarly hypothesis of phylogeny.

The method is well detailed. They used RAxML and MrBayes as stated in the ‘Phylogenetic Inference’ subsection of the Materials and Methods, and further details are given in the separate ‘Supporting Information’ file.

But what data did they analyse with these methods? Well, we are given the identity of the taxa they analysed. We are told they used molecular AND morphological data in some of their analyses. But for the result to be falsifiable, we need to know and have access to the exact molecular and morphological data they used, down to the very last DNA base pair (for molecular) and genital character (for morphological). Remarkably, neither dataset is present in this paper, nor linked to from this paper. Molecular sequence data as used in phylogenetic analyses *has to* be databased in GenBank as a pre-requisite before most papers will even be allowed to be published. It is clearly written in most editorial policies e.g. Nature, Science, PNAS, Syst Biol… take your pick! I assume they have GenBank’d their sequences; they just don’t seem to have supplied the GenBank Accession numbers with the paper (as they claim to in Table S1; the numbers were still absent at the time of writing this blogpost [10/06/11]). Likewise with the morphological data – it is not available. One can find a description of the characters they used “on the FLYTREE morphology Web site” but this is NOT the data itself.

If *all* phylogenetic hypotheses were allowed to be published without requisite supporting evidence, I would be extremely worried about the discipline as a whole.

Also, this affects my ability to do my research – I could easily have re-used that morphological dataset in some of my work, and given them a citation in the process! [N.B. After a fair few email exchanges, I did eventually get given the morphological dataset, but still no sight of the molecular dataset, or its GenBank Accession numbers]

The nature of my research means I come across such [assumed] innocent mistakes in papers relatively frequently – most journal editors in my experience are both grateful and pro-active when I raise such data issues: so take a bow Paul D. Taylor (Journal of Systematic Palaeontology), Paul Barrett & Annalisa Berta (Journal of Vertebrate Paleontology), Peter J Hayward (Zoo J Linn Soc), Henry Gee (Nature) etc… but I’m surprised PNAS haven’t done anything about this yet.

What do you think? Am I being an overly pedantic nutcase? Or is there some validity to my concerns? Should I dare raise it again?

Reference: Wiegmann, B. M. et al. Episodic radiations in the fly tree of life. Proceedings of the National Academy of Sciences 108, 5690-5695 (2011). URL http://dx.doi.org/10.1073/pnas.1012675108.

Comments (resurrected from the old blog, sorry for the lack of formatting)

bljog
Brian has been quite good in the past at submitting his alignments and trees to TreeBASE, so I don’t know what happened here. It could just be an oversight, but a very annoying one, and definitely one worth pursuing. I think that the PNAS editor and reviewers should be ashamed of their work; it should not have appeared without all the supporting evidence.

12 months ago

Ross Mounce
That’s what I really don’t *get*. It’s a massive collaboration of top top scientists – a who’s who of dipteran systematists, and an NSF ToL grant. It shouldn’t have happened in the first place, and I don’t know why it hasn’t been (easily) corrected yet. Perhaps there are technical difficulties over at GenBank getting accession numbers for so many sequences(?).

They told me they’re withholding the morphology matrix so they can submit that as part of another set of analyses to Cladistics (is this legit?). It will be interesting to see what’s in that paper when it comes out…

Publicly (tax-payer) funded research in particular should be extra careful to be transparent and reproducible. Phylogenetics shouldn’t encounter a ‘Climategate’-like situation but can you imagine if this happened with a paper on say the origins of HIV or avian flu? That’d be the perfect recipe for a media disaster.

12 months ago, in reply to bljog

J. Salvador Arias
In general, I agree with you ;)! But I am more concerned with reproduction and legacy than with falsifiability.

You can test hand-made (or random) relationships, even in the absence of the source data. So you might test the FlyToL tree with published matrices or not. That is why you can test the groups of Linnaeus, and of countless classic authors, even if they never published anything like a phylogenetic matrix.

But it is a valid point to be skeptical of the results. That is in fact a good attribute for a scientist! And as long as the data from the FlyToL continue to be private, their results are just high-profile preaching (and, like you, I believe that they show the right results).

Systematics was hampered by a long history of research based on authority, and this is just a sad return of such practices.

So keep fighting! You are right!

12 months ago

Juan Pablo Carbajal
Man!
You are not pedantic.
These issues (reproducibility, falsifiability) should be so clear to everyone that your post should make no sense. BUT IT DOES!
You are taking the pains (and the ugly looks) of many just to safeguard the values of scientific work and collaboration. All my support!

It is a pity that I have no knowledge of what you are talking about; if I did, I would’ve said something less general.

12 months ago

Morgan Jackson
Excellent point Ross, and a matter which I didn’t have room to discuss in my post. My assumption is that the GenBank numbers will be included in the final publication (it is still in pre-publication, and not quite “official” yet), although that’s really no excuse for a paper making such large claims. When the data is made available however, I’m sure there will be many people who will be sectioning pieces off and using bits in various other analyses, and I’d be interested to hear more about your potential uses of the data!

11 months ago

Ross Mounce
Hey thanks Morgan, I liked your post too. Interesting points about the topology they found.

It’s true it’s still ‘pre-publication’ but I haven’t seen any other pre-publication papers without underlying data supplied – this isn’t a valid excuse IMO. And usually ‘pre-publication’ papers come out in PNAS very soon after with a turn-around of ~ a month, no?

This has been hanging around in ‘pre-print’ status since March 14th! Perhaps *because* of these data issues?

Also, in the ‘Digital Age’ – when papers come out online – that IS effectively when they get published (in all but terms of citation date which takes the paper publication date).

If it wasn’t ready to be published online, it shouldn’t have been put online. Am I too harsh?

Like you say, when the data does get published I’m sure it’ll get a LOT of re-usage. Which is great for them, the journal and the re-users (myself included). In the meantime, all those parties are missing out on that benefit: Science is the loser.

11 months ago, in reply to Morgan Jackson

Karen Cranston
You have probably seen this by now, but just in case you haven’t, the TreeBASE submission is here:

http://www.treebase.org/treebase-web/search/study/summary.html?id=11372

I just talked to Brian, and the GenBank records are submitted but not yet public (some of the required intron annotations are not done).

10 months ago, in reply to Ross Mounce

What’s your Norell number?

Taking inspiration from the Erdős number game, I thought I’d use ColWiz’s ‘Link Explorer’ function and my CiteULike bibliographic data to explore (in a highly unscientific way!) the authorship connections between people in my reference database of ~2750 papers.

Mark A Norell has LOADS of publications. Not too surprising considering he works at the AMNH (without doubt the best palaeontological research institute in the world if one measures by number of publications and/or Impact Factor).

How many authorship connections does he have? In my library I have just 35 of his innumerable publications. Just a small sample of what he’s truly published. In these 35 publications (mostly vertebrate phylogeny related stuff, 2000-present) he has a total of ~55 different co-authors.

I would like to have calculated this exactly but sadly ColWiz only provides a visual exploration with no quantitation.

So his 1st order co-author relationships according to my (limited) data look like this:

[Figure: 1st-order co-author network]

Now it gets interesting… What about the 2nd order relations? The co-authors of co-authors of his (according to my limited bib-library). Here’s the result of this below. *wow*

[Figure: 2nd-order co-author network]

Far, far too many authors to count manually! From this I’d guess that no published palaeontologist with more than say 3 papers has a Norell number greater than 5.

Does anyone know any good, FREE, quantitative bibliometric software that would actually calculate some of this for me, rather than just providing pretty (but fairly meaningless) visualisations?

Could there be a better candidate for the linkman in palaeontology? On Facebook, Dave Godfrey suggested Romer, Halstead, or Ostrom. They’re great suggestions but I’m looking for a good link-point for 21st century palaeontology tbh. So Norell it is!

Anyway, back to *proper* work…

PS If one counts my Open Letter as a publication then I have a Norell number of 3 (via co-authorship with Graeme Lloyd who has a Norell number of 2). If one only counts my soon to be published paper with my supervisor then my Norell number is 4 (Ross Mounce -> Matthew Wills -> Paul Barrett -> Diego Pol -> Norell). Some of my friends e.g. Shaena Montanari are lucky enough to have the lowest Norell number possible already -> 1. It’s a small world…
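In lieu of proper bibliometric software, a Norell number is just a shortest-path distance in a co-authorship graph, so a breadth-first search will compute it. The sketch below is a minimal Python illustration under stated assumptions: the `papers` list is a toy stand-in for a real bibliography export, containing only the co-authorship links named in the PS above (each "paper" is reduced to one pair of authors), and the function names are hypothetical helpers, not part of ColWiz or CiteULike.

```python
from collections import deque

# Toy data: one author list per paper. Only the links stated in the post
# are included; a real run would load author lists from a library export.
papers = [
    ["Ross Mounce", "Matthew Wills"],
    ["Matthew Wills", "Paul Barrett"],
    ["Paul Barrett", "Diego Pol"],
    ["Diego Pol", "Mark A Norell"],
    ["Shaena Montanari", "Mark A Norell"],
]


def build_coauthor_graph(papers):
    """Map each author to the set of people they have co-authored with."""
    graph = {}
    for authors in papers:
        for author in authors:
            others = (a for a in authors if a != author)
            graph.setdefault(author, set()).update(others)
    return graph


def collaboration_distances(graph, root):
    """Breadth-first search: fewest co-authorship steps from root to everyone."""
    distances = {root: 0}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for neighbour in graph.get(current, ()):
            if neighbour not in distances:
                distances[neighbour] = distances[current] + 1
                queue.append(neighbour)
    return distances


graph = build_coauthor_graph(papers)
norell_numbers = collaboration_distances(graph, "Mark A Norell")

# Exact count of distinct 1st-order co-authors (in this toy data only).
first_order = len(graph["Mark A Norell"])

for author, number in sorted(norell_numbers.items(), key=lambda kv: kv[1]):
    print(f"{author}: {number}")
```

The same graph also gives the exact 1st-order co-author count that the visual explorer could not: it is simply the size of the root author's neighbour set. Authors with no path to the root never appear in the result, which is one simple way to flag people with no Norell number at all.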