Show me the data!

This is a re-post of something I was invited to write to sum up my experiences at OKCon 2011. The original post can be viewed here on the official OKFN Open Science blog. For some reason the Prezi embed code at the bottom didn’t work there, but it does here on my blog.

Many thanks to Jenny Molloy for inviting me to write the post, and Maria Neicu for editing it.

A couple of months ago, I gave a talk at the Open Knowledge Conference 2011 on ‘Open Palaeontology’ – based upon 18 months’ experience as a lowly PhD student trying, and mostly failing, to get usable digital data from palaeontological research papers. As you might well have inferred already from that last sentence, it’s been an interesting ride.

The main point of my talk was the sheer stupidity/naivety of the way in which data is supplied with or within research papers (or in some cases, not supplied at all!). Effective science operates through the accumulation of knowledge and data: all advances are incremental and build upon the work of others – the Panton Principles probably sum it up far better than I could. Any barriers to the accumulation of knowledge/data therefore impede the progress of science.

Whilst there are numerous barriers to academic research – access to research papers being perhaps the best known and most publicised – the issue that most aggravates me is not access to these papers, but the papers themselves. In the context of the 21st century (I’m thinking the Internet Age here…), they are barely adequate (at best) for communicating research data, and this is a major problem for the future legacy of our published work… and for my research project.

My PhD thesis title is quite broad: ‘The Importance of Fossils in Phylogeny’. Given this title and its (wide) scope, I need to look at a lot of papers, in a lot of different journals, and extract data from these articles to re-analyse, so as to assess the importance of fossils in phylogeny on a meta-scale. There are long-established data formats for the particular type of data I wish to extract: so well established and easy to understand that there’s even a Wikipedia page here describing the most commonly used data format (NEXUS). Multiple databases exist specifically to host this type of data, e.g. TreeBASE and MorphoBank. Yet despite all this standardisation and provisioning for paleomorphological phylogenetic data, far less than 1% of all such published data is actually readily available in a standardised, digital, usable format.
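To give a flavour of how simple the format is, here is a minimal sketch in Python. The NEXUS text is a made-up example (three hypothetical taxa, five characters), and `parse_nexus_matrix` is an illustrative helper of my own naming, not part of any library. It assumes a simple, non-interleaved DATA block with one-word taxon names; real files can be considerably more varied.

```python
import re

# A tiny, invented NEXUS file: three hypothetical taxa scored for five characters.
NEXUS_TEXT = """#NEXUS
BEGIN DATA;
  DIMENSIONS NTAX=3 NCHAR=5;
  FORMAT DATATYPE=STANDARD MISSING=? GAP=-;
  MATRIX
    Taxon_A  01001
    Taxon_B  0110?
    Taxon_C  1-101
  ;
END;
"""


def parse_nexus_matrix(text):
    """Return (ntax, nchar, {taxon: states}) from a simple NEXUS DATA block.

    Assumes a non-interleaved matrix and taxon names without spaces.
    """
    dims = re.search(r"NTAX\s*=\s*(\d+)\s+NCHAR\s*=\s*(\d+)", text, re.IGNORECASE)
    if dims is None:
        raise ValueError("no DIMENSIONS statement found")
    ntax, nchar = int(dims.group(1)), int(dims.group(2))

    # Everything between the MATRIX keyword and the next semicolon.
    body = re.search(r"MATRIX\s*(.*?);", text, re.IGNORECASE | re.DOTALL)
    if body is None:
        raise ValueError("no MATRIX block found")

    matrix = {}
    for line in body.group(1).splitlines():
        parts = line.split()
        if len(parts) == 2:  # taxon name, then its character states
            matrix[parts[0]] = parts[1]

    # The declared dimensions double as a cheap validity check.
    if len(matrix) != ntax:
        raise ValueError(f"expected {ntax} taxa, found {len(matrix)}")
    for taxon, states in matrix.items():
        if len(states) != nchar:
            raise ValueError(f"{taxon}: expected {nchar} characters, found {len(states)}")
    return ntax, nchar, matrix


ntax, nchar, matrix = parse_nexus_matrix(NEXUS_TEXT)
```

The point is that the file carries its own dimensions and its own symbols for missing data and gaps, so a few lines of code can both read and sanity-check it.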

In most cases the data is there; you just have to dig very, very hard to release it from the PDF file it’s usually buried in (and then spend copious and unnecessary amounts of time manually reformatting and validating it). See the picture below for a typical example (and yes, it is sadly printed sideways; this is a common and silly practice that publishers use to inappropriately squeeze data matrices into papers):
[Image: a data matrix printed sideways in a published paper]

I hope you’ll agree with me that this is clearly absurd and hugely inefficient. As I explain in my presentation (slides at the bottom of this post), the data, as originally analysed/used, comes in a much richer, more usable, digital, standardised format. Yet when published it gets stripped of all useful metadata and converted into a flat, inextricable and significantly obfuscated table. Why? It’s my belief that this practice is a lazy, unwanted, vestigial hangover from the days of paper-based (only) publishing, in which this might have been the only way to convey the data with the paper. But in 2011, I can confidently say that the vast majority of researchers read and use the digital versions of research papers – so why not make full and proper use of the digital format to aid scientific communication? I argue not for axing paper copies, but for making sure that digital versions are more than just plain PDF versions of the paper copy, as they can and should be, IMO.
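To illustrate the manual reformatting involved, here is a rough sketch in Python of going the other way: taking rows as they might come out of a copy-pasted table and rebuilding a NEXUS-style DATA block. The rows, taxon names and the helper names `rows_to_matrix` and `to_nexus` are all hypothetical, and the sketch assumes one-word taxon names and single-symbol character states, which real published tables often violate.

```python
def rows_to_matrix(rows):
    """Turn rows like 'Taxon_A 0 1 0 0 1' into {taxon: '01001'}.

    Assumes the first token is a one-word taxon name and every other
    token is a single character state.
    """
    matrix = {}
    for row in rows:
        tokens = row.split()
        if not tokens:
            continue  # skip blank lines left over from copy-pasting
        matrix[tokens[0]] = "".join(tokens[1:])
    return matrix


def to_nexus(matrix, missing="?", gap="-"):
    """Write a {taxon: states} dict as a simple, non-interleaved NEXUS DATA block."""
    if not matrix:
        raise ValueError("empty matrix")
    lengths = {len(states) for states in matrix.values()}
    if len(lengths) != 1:
        # Ragged rows usually mean a cell was lost or merged during extraction.
        raise ValueError("rows differ in length; check the extracted table")
    nchar = lengths.pop()
    width = max(len(name) for name in matrix)

    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={len(matrix)} NCHAR={nchar};",
        f"  FORMAT DATATYPE=STANDARD MISSING={missing} GAP={gap};",
        "  MATRIX",
    ]
    for name, states in matrix.items():
        lines.append(f"    {name.ljust(width)}  {states}")
    lines += ["  ;", "END;"]
    return "\n".join(lines) + "\n"


# Invented rows standing in for text recovered from a published table.
flat_rows = ["Taxon_A 0 1 0 0 1", "Taxon_B 0 1 1 0 ?"]
recovered = rows_to_matrix(flat_rows)
nexus_text = to_nexus(recovered)
```

Even in this toy form, the length check shows why validation eats so much time: one dropped cell in an extracted table silently shifts every character after it.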

With this goal in mind, I set about writing an Open Letter to the rest of my research community to explain why we need to richly-digitise our published research data ASAP. Naturally, I wouldn’t get very far just by myself, so I enlisted the support of a variety of academic friends via Facebook, and (inspired by OKFN pads I’d seen) we concocted a draft letter together using an Etherpad. The result of this was a fairly basic Drupal-based website (http://supportpalaeodataarchiving.co.uk/) that we launched and disseminated via mailing lists, Twitter and Academia.edu as far and wide as we possibly could, *hoping*, just hoping, that our fellow academics would read, take note and support our cause.

Surprisingly, it worked to an extent, and a lot of big names in Palaeontology signed our Open Letter in support of our cause. Then things got even better when a Nature journalist (Ewen Callaway) got interested in our campaign and wrote an article for Nature News about it, which can be found here. A huge thanks must go to everyone who helped out with the campaign. It’s generated truly international support, as can be seen on the map below:
[Interactive map: Open Letter Signatures]

It’s far too soon to know the true impact of the campaign. Journal editorial boards can be very slow to change their editorial policies, especially if it requires a modicum of extra effort on the part of the publisher. Additionally, once editorial policy does change at a journal, it can only apply to articles submitted from then on, so articles already in the submission pipeline aren’t affected by any new guidelines. Delays of a year between submission and publication are not uncommon in palaeontology, so for this and other reasons I’m not expecting to see visible change until 2012, but I think we might have helped get the ball rolling, if nothing else…
The Paleontological Society journals (Paleobiology and Journal of Paleontology) have recently adopted mandatory data submission to the Dryad repository, and the Journal of Vertebrate Paleontology has also improved its editorial policy with respect to certain types of data, but these are just a few of the many, many journals that publish palaeontological articles. I’m very much hoping that other journals will follow suit in the next few months and years by taking steps to improve the way in which research data is communicated, for the good of everyone: authors, publishers, funders and readers.

Anyway, here’s the Prezi I used to convey some of that (and more) at OKCon 2011. Huge thanks to the conference organisers for inviting me to give this talk. It was the most professionally run conference I’ve ever been to, by far. Great food, excellent WiFi provisioning, good comms, superb accommodation… I could go on. If the conference is on next year – I’ll be there for sure!

A Conference Abstract #OpenDraft

May 29th, 2011 | Posted by rmounce in Conferences

UPDATE 01/06/11 I’ve now submitted a modified version of this abstract. Many thanks to all who commented.

One of the many conferences I’m going to this year is a fairly big event. The abstract deadline also happens to be very soon. So I thought I’d post a draft of my abstract here and see what peeps think.

Is my Latin correctly formed? What do you think? Here it is anyway…

Title:
Nullius in Calculo: On the Explicitness and Reproducibility of Cladistic Analyses

Author: Mounce, Ross C P

Abstract: The result of a cladistic analysis should be repeatable.
Yet I present here numerous examples of recent papers in which the results contained therein cannot be replicated, given only the content of the paper, supplementary materials and links. Barriers to study replication include (1) absence of requisite information, (2) typesetting errors, and even (3) author error. I argue that these problems, many of which are easily spotted, should not be appearing in peer-reviewed published papers with such regularity. I humbly suggest that reviewers and editors not only examine the words of papers, but also the underlying data and calculations: Nullius in Verba, Nullius in Calculo. Furthermore, I believe the reporting of phylogenetic analyses would greatly benefit from increased standardization following community-agreed criteria, cf. MIAPA (Leebens-Mack et al., 2006), and data deposition in appropriate data archives specifically designed to accommodate phylogenetic data, e.g. TreeBASE or MorphoBank. In addition to problems with reproducibility, I also detail problems with explicitness of method reporting. A detailed manual examination of over 300 recently published ILD tests provides evidence to suggest that method sections are rarely explicit enough to allow exact replication of the methods used to generate reported results. In particular I suggest that authors should be encouraged to explicitly state how ‘gaps’ are coded and treated, and which branch collapsing rules are followed in analyses. Different settings can and do generate different results, therefore all such important settings should be explicitly stated.

[end of abstract]

Some further comment:

Any feedback good/bad/indifferent would be much appreciated.
My supervisor has given it ‘the green light’ in principle, but obviously is keen for me to handle the topic delicately and sensitively.

I’m not out to ‘name and shame’ individual errors with this – it’s the system that needs changing IMO, and I’ll do my damnedest to make that crystal clear when I give the talk.

I’d be pretty surprised if this abstract was rejected – I’ve got strong evidence, including an accepted Nature paper, demonstrating some of this. Would love to blog more about this but Nature’s embargo policy rather prevents/scares me from doing so! My last talk at a Systematics Association conference (below) went down a storm too, picking up a special commendation from the judging panel for its “unscorable” uniqueness [I used a Prezi], so I can only hope this next one will be as successful.

#RoadWarrior

May 26th, 2011 | Posted by rmounce in Conferences | phdchat | Travel

I’ve been travelling rather a LOT in the last 6 weeks or so: Rhossili Bay, London, Edinburgh, Leicester, Barcelona, Cambridge. It’s so nice to be back in Bath for a bit again.

While the memories are still fresh I thought I’d write a short summary of what I’ve been doing -> so much to update on my CV!

    Wales

April 16 – 21 in Rhossili Bay helping my supervisor (Matthew Wills) teach undergrads about ecology, experimental design and real fieldwork. The weather was absolutely “lush” as they say in Wales. The students were well-behaved and did some excellent projects while they were there – staying up til past 3am in some instances on the last night to finish off their presentations!

The only downside was the lack of mobile phone signal. It’s stressful to be without internet for so long!

    Scotland

The very next week (April 26 – 28) I flew up to Edinburgh for a PaleoDB short course. Congratulations are sincerely owed to the organisers, especially Al McGowan, for taking the time to arrange such a beneficial event. PaleoDB is an *excellent* free-to-use resource but I do wish it was more Open. I had no hesitation in pointing this out during discussions. I have since applied to become a contributing member of the database (application pending).

    Leicester

After Edinburgh, there was the small matter of a conference to attend.
Progressive Palaeontology 2011 was ably hosted at Leicester University. I gave a well-received talk on my research, sneaking in some data sharing advocacy at the end.

Best of all, one of my labmates Anne O’Connor won the Best Poster prize – w00t!

    Spain

Next-up: a well deserved holiday in Barcelona sunshine. Beautiful beaches, bountiful cerveza, and amazing architecture. Lovely!

Inside the Sagrada Família

    Cambridge

No rest for the wicked… on the 19th and 20th I attended a short course at EBI-Hinxton entitled ‘Linking Open Data in Biology using Ontologies and Literature Mining’. Had a superlative 3-course dinner in St Catharine’s College (Cambridge University) on the Thursday night, and a further meal in The Eagle the next night.

You might reasonably ask – why is a palaeontologist interested in ontologies and text mining? Well, it’s simple really – this field is crying out for these techniques to be applied, and I hope I’ll be one of the first to use them on palaeontological data. The pickings could be rich, if only we had Open linked knowledge infrastructure in place… I shall no doubt blog more in future on this topic.

    London

Finally, today I’ll be going to a London BioGeeks event. Really looking forward to finally meeting Mark Hahnel face to face for the first time. I really think his FigShare initiative is an excellent idea. I’ve supported it myself by adding a few bits of test data. Will no doubt add more in the future…

    Future Travel

Needless to say, all this travelling has completely wiped out my £200 BBSRC Travel Grant. Have applied for more grants to fund my trips to Belfast, Berlin, Brazil and Las Vegas later on this year!

*fingers crossed* – for richer or poorer, it’s going to be an exciting year!