Show me the data!

Reflections on the OKFN Open Science hackday

April 1st, 2012 | Posted by rmounce in Open Data

Yesterday, I dragged myself out of bed (it was a Saturday!) to go to my first ever ‘hackathon’. Thankfully it was a lot less geeky than it sounds – just a cosy little get-together of people interested in Open Science, to work on things in a shared public space.

Nick Stenning, Stefan Wehrmeyer, Jenny Molloy, Caspar Addyman and I were all beavering away on our laptops at the Barbican Centre, joined in the afternoon by surprise guest Todd Vision (Dryad & UNC). We also had online participation from afar via Etherpad & IRC, including Rufus Pollock giving me a few pointers on PDF image extraction tools and James Casbon working on notebook.js.

You can see a record of all the things we worked on here on the official Etherpad for the event.

I have to say, I didn’t make all that much progress on my tasks for the day, owing to a variety of n00by errors. The tools I wanted to use were rather large to download, particularly the Eclipse IDE, which took a fair while to get over the public WiFi we were using. I was also using a small netbook. This is handy for my regular train journeys between Bath & London but not so useful when you need several windows open at once, e.g. IRC + PDF manual + terminal + browser. The 24″ desktop screens I usually work on have probably led me astray into such inefficient multi-window habits! Using a translucent dropdown terminal (Tilda) saved me some window switching, but not enough to make things easy…

So for next time I’ve learnt:

1.) Bring a comfortably sized laptop. Unless you really know what you’re doing on the command-line, you’re gonna need screen real estate

2.) Download all the large files you’ll need before you go

3.) Consider bringing your own food, drink & snacks! I think I must have spent over £10 just on lunch there, and the canteen only had over-priced tuna sandwiches :/

All in all though, the session was great. There’s no substitute for meeting people IRL. There was time for excellent therapeutic #PhDchat with Jenny, tactical discussions with Todd on how to encourage more palaeontologists to publicly archive the data behind their research publications, and meeting other people in the Open Science community I’d never met before. As we discussed at the hackday, it’s not something we would do every weekend, but as a special event every now and again it’s well worth going to!

Perhaps I might see YOU at the next one? All are welcome


Research data should be appropriately licensed with re-use in mind

November 29th, 2011 | Posted by rmounce in Open Data | Palaeontology | Phylogenetics

I’m really pleased this new Open Access paper has just been published.

CC BY 3.0 Zookeys Special Issue

Hagedorn, G. et al. Creative Commons licenses and the non-commercial condition: Implications for the re-use of biodiversity information. ZooKeys 150, 127-149 (2011).

Some background…

After parading my Open Data t-shirt (pictured below) around the Society of Vertebrate Paleontology meeting this month, I was invited to give an impromptu pitch in front of the great and good of the Mammal AToL project & MorphoBank people. Having pointed out to MorphoBank a while ago that they should really make explicit the terms and conditions [license] under which they make their (?) data available, I naturally advocated CC-BY 3.0 and CC0 licences. I talked about this very subject and pleaded with them NOT to use the NC clause, referring to Rod Page & Peter Murray-Rust’s [1,2] thoughts on the matter.

Data providers vs Data re-users – need they really be in opposition?

The trouble is, a lot of (data providing) institutions seem hell-bent on ‘protecting commercial interests’, at the expense of research opportunities. So as I understand it, databases such as these currently face an awkward choice: either honour the restrictions requested by data providers OR give data re-users [such as myself!] the permissive re-use terms they need. The needs of both camps are seldom entirely met.


I see this paper as an important step in persuading such restriction-minded institutions of the absolute importance of #OpenData / #PantonPrinciples and how NC clauses can genuinely obstruct and impair real academic research.
I just hope people read it and take note!

[Most of this is just a re-post of my spur of the moment G+ post here.
I’m reposting here so that this might hopefully get picked up by Research Blogging to give this paper the publicity it deserves. Much of the content is widely applicable IMO to most of scholarly communications, not just biodiversity informatics, and indeed the whole ZooKeys special issue (Open Access) is well worth a browse.]


[3] Hagedorn, G., Mietchen, D., Morris, R., Agosti, D., Penev, L., Berendsohn, W., & Hobern, D. (2011). Creative Commons licenses and the non-commercial condition: Implications for the re-use of biodiversity information. ZooKeys, 150. DOI: 10.3897/zookeys.150.2189

This is a re-post of something I was invited to write to sum up my experiences at OKCon 2011. The original post can be viewed here on the official OKFN Open Science blog. For some reason the Prezi embed code at the bottom didn’t work there, but it does here on my blog.

Many thanks to Jenny Molloy for inviting me to write the post, and Maria Neicu for editing it.

A couple of months ago, I gave a talk at the Open Knowledge Conference 2011, on ‘Open Palaeontology’ – based upon 18 months’ experience as a lowly PhD student trying, and mostly failing, to get usable digital data from palaeontological research papers. As you might well have inferred already from that last sentence, it’s been an interesting ride.

The main point of my talk was the sheer stupidity/naivety of the way in which data is supplied (or in some cases, not supplied at all!) with or within research papers. Effective science operates through the accumulation of knowledge and data: all advances are incremental and build upon the work of others – the Panton Principles probably sum it up far better than I could. Any barriers to the accumulation of knowledge/data therefore impede the progress of science.

Whilst there are numerous barriers to academic research (access to research papers being perhaps the most well-known and well-publicised), the issue that most aggravates me is not access to these papers but the papers themselves. In the context of the 21st century (I’m thinking the Internet Age here…), they are only barely adequate (at best) for communicating research data, and this is a major problem for the future legacy of our published work… and my research project.

My PhD thesis title is quite broad: ‘The Importance of Fossils in Phylogeny’. Given this title and (wide) scope, I need to look at a lot of papers, in a lot of different journals, and extract data from these articles to re-analyse, so that I can assess the importance of fossils in phylogeny on a meta-scale. There are long-established data formats for the particular type of data I wish to extract. So well established and easy to understand, in fact, that there’s even a Wikipedia page here describing the most commonly used data format (nexus). There exist multiple databases set aside specifically to host this type of data e.g. TreeBASE and MorphoBank. Yet despite all this standardisation and provisioning for paleomorphological phylogenetic data, far less than 1% of all published data is actually readily available in a standardised, digital, usable format.
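To give a flavour of how little is needed to make such data machine-readable, here is a rough sketch of a tiny nexus-style data block and a few lines of Python that read it. The taxon names and character states are made up purely for illustration, and the parser is a deliberately simplified assumption (one taxon per line, no quoted names, no interleaving or polymorphism brackets), not a full implementation of the format.

```python
import re

# A tiny, made-up nexus file: the taxon names and character states are
# illustrative only and are not taken from any published matrix.
NEXUS_TEXT = """#NEXUS
BEGIN DATA;
  DIMENSIONS NTAX=3 NCHAR=5;
  FORMAT DATATYPE=STANDARD MISSING=? GAP=-;
  MATRIX
    Taxon_A  01?10
    Taxon_B  0011-
    Taxon_C  11010
  ;
END;
"""


def parse_matrix(text):
    """Return {taxon: character string} from the first MATRIX block.

    Simplified on purpose: assumes one taxon per line and unquoted
    taxon names without spaces.
    """
    match = re.search(r"MATRIX\s*(.*?);", text, flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        raise ValueError("no MATRIX block found")
    matrix = {}
    for line in match.group(1).splitlines():
        parts = line.split()
        if len(parts) == 2:  # skip blank lines
            taxon, states = parts
            matrix[taxon] = states
    return matrix


def declared_dimensions(text):
    """Return (ntax, nchar) as declared in the DIMENSIONS statement."""
    ntax = int(re.search(r"NTAX\s*=\s*(\d+)", text, re.IGNORECASE).group(1))
    nchar = int(re.search(r"NCHAR\s*=\s*(\d+)", text, re.IGNORECASE).group(1))
    return ntax, nchar


matrix = parse_matrix(NEXUS_TEXT)
ntax, nchar = declared_dimensions(NEXUS_TEXT)

# Basic validation: the matrix should agree with its own declared dimensions.
assert len(matrix) == ntax
assert all(len(states) == nchar for states in matrix.values())
```

The point is that a file like this can be checked against its own declared dimensions in seconds, whereas a matrix printed as a table in a PDF has to be re-typed and validated by hand.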

In most cases the data is there; you just have to dig very very hard to release it from the PDF file it’s usually buried in (and then spend unnecessary and copious amounts of time manually reformatting and validating it). See the picture below for a typical example (and yes, it is sadly printed sideways; this is a common and silly practice that publishers use to inappropriately squeeze data matrices into papers):

I hope you’ll agree with me that this is clearly absurd and hugely inefficient. As I explain in my presentation (slides at the bottom of this post), the data, as originally analysed/used, comes in a much richer, more usable, digital, standardised format. Yet when published it gets stripped of all useful metadata and converted into a flat, inextricable and significantly obfuscated table. Why? It’s my belief that this practice is a lazy, unwanted, vestigial hangover from the days of paper-based (only) publishing, in which this might have been the only way to convey the data with the paper. But in 2011, I can confidently say that the vast majority of researchers read and use the digital versions of research papers – so why not make full and proper use of the digital format to aid scientific communication? I’m not arguing that we should axe paper copies, only that digital versions can and should IMO be more than just plain PDF versions of the paper copy.

With this goal in mind, I set about writing an Open Letter to the rest of my research community to explain why we need to richly-digitise our published research data ASAP. Naturally, I wouldn’t get very far just by myself, so I enlisted the support of a variety of academic friends via Facebook, and (inspired by OKFN pads I’d seen) we concocted a draft letter together using an Etherpad. The result of this was a fairly basic Drupal-based website that we launched and disseminated via mailing lists, Twitter, as far and wide as we possibly could, *hoping* just hoping, that our fellow academics would read, take note and support our cause.

Surprisingly, it worked to an extent and a lot of big names in palaeontology signed our Open Letter in support of our cause; then things got even better when a Nature journalist (Ewen Callaway) got interested in our campaign and wrote an article for Nature News about it, which can be found here. A huge thanks must go to everyone who helped out with the campaign. It’s generated truly international support, as can be seen on the map below:
(You might have to zoom out a bit; for some reason it zooms into Africa by default.)

View Open Letter Signatures in a larger map

It’s far too soon to know the true impact of the campaign. Journal editorial boards can be very slow to change their editorial policies, especially if it requires a modicum of extra effort on the part of the publisher. Additionally, once editorial policy does change at a journal, it can only apply to articles submitted from then on, so articles already in the submission pipeline aren’t affected by any new guidelines. Delays of a year between submission and publication are not uncommon in palaeontology, so for this and other reasons I’m not expecting to see visible change until 2012, but I think we might have helped get the ball rolling, if nothing else…
The Paleontological Society journals (Paleobiology and Journal of Paleontology) have recently adopted mandatory data submission to the Dryad repository, and the Journal of Vertebrate Paleontology has also improved its editorial policy with respect to certain types of data, but these are just a few of the many, many journals that publish palaeontological articles. I’m very much hoping that other journals will follow suit in the next few months and years by taking steps to improve the way in which research data is communicated, for the good of everyone: authors, publishers, funders and readers.

Anyway, here’s the Prezi I used to convey some of that (and more) at OKCon 2011. Huge thanks to the conference organisers for inviting me to give this talk. It was the most professionally run conference I’ve ever been to, by far. Great food, excellent WiFi provisioning, good comms, superb accommodation… I could go on. If the conference is on next year – I’ll be there for sure!

A FlyTree in the Ointment of Falsifiability

June 10th, 2011 | Posted by rmounce in Open Data | Phylogenetics

Part inspired by the ‘Bugs!’ blogging contest, part inspired by Morgan Jackson’s post, I thought I’d write some thoughts and observations on the recent FlyToL paper (Wiegmann et al., PNAS, 2011).

[IMPORTANT UPDATE: Since I wrote this blogpost, the authors have made the data publicly available. In fact, credit to them – they’ve done more than they strictly needed to and deposited their data matrix in TreeBASE here. I’m sure this data will now be a valuable and well-cited resource for the rest of the scientific community.]

First, a disclaimer:

My remarks should NOT be taken personally. They are my views, and my views alone (although hopefully many would support them). The following observations are in fact just one high-profile case of some of the points I shall make in my SystAss 2011 talk in Belfast next month (draft abstract here). I have a special interest in this topic because my research, my PhD, my life(!) utterly depend on being able to re-use and reproduce other people’s primary phylogenetic analyses. My research is that of synthesis and meta-analysis – no shame, and no lesser a science IMO. My only ‘agenda’ is that of a concern for my discipline: phylogenetics and its reproducibility.


So… to begin, with a bang!

Short Summary: The phylogeny reported in the recent FlyToL paper is not falsifiable. That is not to say that it is ‘false’ or ‘made-up’, just that the hypothesis cannot be challenged because insufficient evidence is provided with the paper. Personally, I do not doubt that the results presented are the correct results of the analysis they describe; it just irks me that the normally mandatory and indeed requisite evidence hasn’t been presented. I have notified the authors and editors of my objections and had some acknowledgement of the issue but, so far, no visible improvement to the publicly-provided evidence has been made.

The Details: Most published phylogenies represent quantitatively-calculated, evidence-derived hypotheses of evolutionary relationships between taxa. Thus, when a phylogeny is presented as a result of an analysis in a paper, the presentation of the underlying evidence and of the methods used to generate it is crucial. If either is missing, the phylogeny presented becomes irrefutable, i.e. non-falsifiable. This is the unfortunate situation that the FlyToL phylogeny is in.

Where is the evidence for figure 1 (the phylogeny, the raison d’être of the paper)? I could draw a tree like that myself with pen and paper – the supporting evidence is what makes it a scholarly hypothesis of phylogeny.

The method is well detailed. They used RAxML and MrBayes as stated in the ‘Phylogenetic Inference’ subsection of the Materials and Methods, and further details are given in the separate ‘Supporting Information’ file.

But what data did they analyse with these methods? Well, we are given the identity of the taxa they analysed. We are told they used molecular AND morphological data in some of their analyses. But for the result to be falsifiable, we need to know and have access to the exact molecular and morphological data they used, down to the very last DNA base pair (for molecular) and genital character (for morphological). Remarkably, neither dataset is present in this paper, nor linked to from this paper. Molecular sequence data as used in phylogenetic analyses *has to* be databased in GenBank as a pre-requisite before most papers will even be allowed to be published. It is clearly written in most editorial policies e.g. Nature, Science, PNAS, Syst Biol… take your pick! I assume they have GenBank’d their sequences; they just don’t seem to have supplied the GenBank Accession numbers with the paper (as they claim to, in Table S1; the numbers were still absent at the time of writing this blogpost [10/06/11]). Likewise with the morphological data – it is not available. One can find a description of the characters they used “on the FLYTREE morphology Web site” but this is NOT the data itself.

If *all* phylogenetic hypotheses were allowed to be published without requisite supporting evidence, I would be extremely worried about the discipline as a whole.

Also, this affects my ability to do my research – I could easily have re-used that morphological dataset in some of my work, and given them a citation in the process! [N.B. After a fair few email exchanges, I did eventually get given the morphological dataset, but still no sight of the molecular dataset, or its GenBank Accession numbers]

The nature of my research means I come across such [assumed] innocent mistakes in papers relatively frequently – most journal editors in my experience are both grateful and pro-active when I raise such data issues: so take a bow Paul D. Taylor (Journal of Systematic Palaeontology), Paul Barrett & Annalisa Berta (Journal of Vertebrate Paleontology), Peter J Hayward (Zoo J Linn Soc), Henry Gee (Nature) etc… but I’m surprised PNAS haven’t done anything about this yet.

What do you think? Am I being an overly pedantic nutcase? Or is there some validity to my concerns? Should I dare raise it again?

Reference: Wiegmann, B. M. et al. Episodic radiations in the fly tree of life. Proceedings of the National Academy of Sciences 108, 5690-5695 (2011).

Comments (resurrected from the old blog, sorry for the lack of formatting)

Brian has been quite good in the past at submitting his alignments and trees to TreeBASE, so I don’t know what happened here. It could just be an oversight, but a very annoying one, and definitely one worth pursuing. I think that the PNAS editor and reviewers should be ashamed of their work; it should not have appeared without all the supporting evidence.

12 months ago

Ross Mounce
That’s what I really don’t *get*. It’s a massive collaboration of top top scientists – a who’s who of dipteran systematists, and an NSF ToL grant. It shouldn’t have happened in the first place, and I don’t know why it hasn’t been (easily) corrected yet. Perhaps there are technical difficulties over at GenBank getting accession numbers for so many sequences(?).

They told me they’re withholding the morphology matrix so they can submit that as part of another set of analyses to Cladistics (is this legit?). It will be interesting to see what’s in that paper when it comes out…

Publicly (tax-payer) funded research in particular should be extra careful to be transparent and reproducible. Phylogenetics shouldn’t encounter a ‘Climategate’-like situation but can you imagine if this happened with a paper on say the origins of HIV or avian flu? That’d be the perfect recipe for a media disaster.

12 months ago in reply to bljog

J. Salvador Arias
In general, I agree with you ;)! But I am most concerned with reproduction and legacy rather than falsifiability.

You can test hand-made (or random) relationships, even in the absence of the source data. So you might test the FlyToL tree, with published matrices or not. That is why you can test Linnaeus’ groups, and those of countless classic authors, even if they never published anything like a phylogenetic matrix.

But it is a valid point to be skeptical of the results. That is in fact a good attribute for a scientist! And as long as the data from the FlyToL continue to be private, their results are just high-profile preaching (and, like you, I believe that they show the right results).

Systematics was hampered by a long history of research based on authority, and this is just a sad return of such practices.

So keep fighting! You are right!

12 months ago

Juan Pablo Carbajal
You are not pedantic.
These issues (reproducibility, falsifiability) should be so clear to everyone that your post should make no sense. BUT IT DOES!
You are taking the pains (and the ugly looks) of many just to safeguard the values of scientific work and collaboration. All my support!

It is a pity that I have no knowledge of the subject you are talking about; if I had, I would’ve said something less general.

12 months ago

Morgan Jackson
Excellent point Ross, and a matter which I didn’t have room to discuss in my post. My assumption is that the GenBank numbers will be included in the final publication (it is still in pre-publication, and not quite “official” yet), although that’s really no excuse for a paper making such large claims. When the data is made available however, I’m sure there will be many people who will be sectioning pieces off and using bits in various other analyses, and I’d be interested to hear more about your potential uses of the data!

11 months ago

Ross Mounce
Hey thanks Morgan, I liked your post too. Interesting points about the topology they found.

It’s true it’s still ‘pre-publication’ but I haven’t seen any other pre-publication papers without underlying data supplied – this isn’t a valid excuse IMO. And usually ‘pre-publication’ papers come out in PNAS very soon after with a turn-around of ~ a month, no?

This has been hanging around in ‘pre-print’ status since March 14th! Perhaps *because* of these data issues?

Also, in the ‘Digital Age’ – when papers come out online – that IS effectively when they get published (in all but terms of citation date which takes the paper publication date).

If it wasn’t ready to be published online, it shouldn’t have been put online. Am I too harsh?

Like you say, when the data does get published I’m sure it’ll get a LOT of re-usage. Which is great for them, the journal and the re-users (myself included). In the meantime, all those parties are missing out on that benefit: Science is the loser.

11 months ago in reply to Morgan Jackson

Karen Cranston
You have probably seen this by now, but just in case you haven’t, the TreeBASE submission is here:

I just talked to Brian, and the GenBank records are submitted but not yet public (some of the required intron annotations are not done).

10 months ago in reply to Ross Mounce