Show me the data!

Research Councils UK (RCUK), a partnership of seven core UK research funding bodies (AHRC, BBSRC, EPSRC, ESRC, MRC, NERC, and STFC), has recently released a very welcome draft policy document detailing their proposed Open Access mandate for all research which they help fund.

The new proposed policies include (quoting from the draft):

  • Peer reviewed research papers which result from research that is wholly or partially funded by the Research Councils must be published in journals which are compliant with Research Council policy on Open Access.
  • Research papers which result from research that is wholly or partially funded by the Research Councils should ideally be made Open Access on publication, and must be made Open Access after no longer than the Research Councils’ maximum acceptable embargo period. [6 months for all except AHRC & ESRC for which 12 months is the maximum delay permitted].
  • researchers are strongly encouraged to publish their work in compliance with the policy as soon as possible. [added emphasis, mine]

As a researcher funded by BBSRC myself – I’m thrilled to read this document.

It shows a clear understanding of the issues, including explicit statements on the need for different types of access – both manual AND automated:

The existing policy will be clarified by specifically stating that Open Access includes unrestricted use of manual and automated text and data mining tools. Also, that it allows unrestricted re-use of content with proper attribution – as defined by the Creative Commons CC-BY licence


But as a strong supporter of the Panton Principles for Open Data in Science, and the Science Code Manifesto, I’m a little disappointed that the policy improvements with respect to data and code access are comparatively minor. Such underlying research materials need only be ‘accessible’, with few further stipulations as to how. AFAIK this allows researchers to make their data available via pigeon-transport (only) on Betamax tapes, 10 years after the data was generated, if there is no ‘best practice’ standard in one’s field.

The BBSRC’s data sharing policy, for example, seems to favour cost-effectiveness over transparency: “It should also be cost effective and the data shared should be of the highest quality.” Maddeningly, it also seems to give researchers ownership over data, even though the data was obtained using BBSRC funding: “Ownership of the data generated from the research that BBSRC funds resides with the investigators and their institutions.” This seems rather devoid of logic to me – if taxpayers paid for this data to be created, surely they should have some ownership of it? Finally: “Where best practice does not exist, release of data within three years of its generation is suggested.” 3 years huh? And that’s only a suggestion! Does anyone actually check that data is made available after those 3 years? I suspect not.

Admittedly, it would be hard to create a good one-size-fits-all policy, and policing it would cost more money, but I do feel that data & code sharing policies could be tightened up in places, to enable more frictionless sharing, re-using and building on previous research outputs.

So all in all this is a great step in the right direction towards Open Scholarship, particularly for BBB-compliant Open Access.

Related reactions and comments which are highly worth reading include posts by Casey Bergman, Peter Suber, and Richard Van Noorden.

This blog post is licensed under a Creative Commons Attribution 3.0 Unported License, so feel free to redistribute, remix and re-use! All that I ask for is attribution :)

[Rather than summarise what’s already been said about Elsevier and their for-excessive-profit practices in recent weeks, I’ll just lazily assume you’ve read it all… right then. Here’s what I have to add.]

This post is a real-world anecdote of the problems that Elsevier’s journal bundling & excessive profiteering*** cause. It’s just one of many reasons which persuaded me to sign my name along with 5,000+ other academics over at The Cost of Knowledge, to register my disapproval of what Elsevier (and other publishers) are doing with scholarly works.

Recently, I discovered to my dismay that my institutional library (University of Bath) had cut its subscription (and therefore easy access) to an important journal in my field. Literally, one week I had free access to the journal content, and then the next week I found I didn’t!

The journal is Biology Letters, a general biology journal by Royal Society Publishing [RSP from now on]. **


Intellectual property owned by Royal Society Publishing (taken from Wikipedia)

Interestingly RSP take a relatively enlightened stance on Open Access, and have made some interesting statements in the past, such as this gem [from a statement published way back in 2008]:

“…some companies do appear to be making excessive profits from the publication of researchers’ papers”

I think RSP is a non-profit organisation (source) and hence it doesn’t surprise me that they have such prescient criticism of Elsevier & co to offer. They aren’t in the business of excessive profiteering like some.

So… RSP’s Biology Letters has been cut from our subscriptions budget. Why? – was the very first question I emailed the subject librarian at my institution. To their credit, I got some wonderfully informative replies from our librarian staff – I have no doubt they’ve done their best, given the limited powers they have. Like all institutions, we don’t have an unlimited budget. Something had to be cut, and unfortunately it was our subscription to Biology Letters. Which by the way, would only have cost us £852 for an institutional online-only subscription.

Why was this journal, from which I read/used at least 15 separate articles in 2011 alone, cut from our subscriptions instead of a journal like… Elsevier’s ‘International Journal of Coal Geology‘?*

I think this is a fair question to ask. Biology Letters has a higher Impact Factor (not that the journal Impact Factor is a particularly brilliant metric of quality) and would cost a lot less (£1107 [Biol. Lett. print version+online] vs 2540 Euros, the current institutional subscription price for the print version of ‘International Journal of Coal Geology’). Most damningly of all, I suspect no-one at my institution ever reads this Elsevier journal (feel free to correct me on this), and I’m sure I could find plenty of other Elsevier journals that satisfy this last property.

But the answer to this question is of course unrelated to any of the 3 rational points above (unfortunately) – Biology Letters can be cut because it’s vulnerable, as it’s not part of a MegaBundle sold by a large for-profit publisher. The International Journal of Coal Geology cannot be cut because access to it comes as part of a ‘Big Deal’ bundle, in which there are some *vital* journals to which we *must* have access (and the corporation selling access knows and exploits this). So despite the fact that no one needs it here, that it’s ~2x more expensive, and that it has a lower Impact Factor – I have access to this, and countless other bundled journals I DON’T need, and I DON’T have access to vital articles from another journal I *do* need for my research.

Welcome to the crazy world of academic publishing! Much of it simply doesn’t make sense in the Digital Age. Of current explanations, I’d say Mike Taylor’s parable explains this most clearly.

I can’t claim to have explained all of the problems and intricacies here – but rest assured it clearly doesn’t make sense to me. Journal mega-bundling is plainly inefficient, and we can’t let this practice continue.

Stop feeding the beast! The Cost of Knowledge

* Throughout this post I use the example of the International Journal of Coal Geology, not out of disrespect for the editorial board, or the scholarly quality of the work presented therein – I’m sure it’s great if you’re into Coal Geology. I only use it because a) it’s an Elsevier journal to which Elsevier very arguably adds very little value, and b) I sincerely believe virtually no researchers at my institution make use of this journal.

** Just for the record, I don’t blame RSP or my librarians for this subscription cut happening. It’s out of their control. RSP do a great job IMO, as do my librarians.

*** I just read that one UK institution pays over £1,000,000 (yes, more than a million) every year for Elsevier’s ‘Big Deal’ bundle (source). I think this is a disgraceful ransom.

This is a re-post of something I was invited to write to sum up my experiences at OKCon 2011. The original post can be viewed here on the official OKFN Open Science blog. For some reason the Prezi embed code at the bottom didn’t work there, but it does here on my blog.

Many thanks to Jenny Molloy for inviting me to write the post, and Maria Neicu for editing it.

A couple of months ago, I gave a talk at the Open Knowledge Conference 2011, on ‘Open Palaeontology’ – based upon 18 months’ experience as a lowly PhD student trying, and mostly failing, to get usable digital data from palaeontological research papers. As you might well have inferred already from that last sentence, it’s been an interesting ride.

The main point of my talk was the sheer stupidity/naivety of the way in which data is supplied (or in some cases, not supplied at all!) with or within research papers. Effective science operates through the accumulation of knowledge and data; all advances are incremental and build upon the work of others – the Panton Principles probably sum it up far better than I could. Any barriers to the accumulation of knowledge/data therefore impede the progress of science.

Whilst there are numerous barriers to academic research – access to research papers being perhaps the most well-known and well-publicised – the issue that most aggravates me is not access to these papers, but the actual papers themselves. In the context of the 21st century (I’m thinking the Internet Age here…), they are only barely adequate (at best) for communicating research data, and this is a major problem for the future legacy of our published work… and my research project.

My PhD thesis title is quite broad: ‘The Importance of Fossils in Phylogeny’. Given this title and (wide) scope, I need to look at a lot of papers, in a lot of different journals, and extract data from these articles to re-analyse, in order to assess the importance of fossils in phylogeny on a meta-scale. There are long-established data formats for the particular type of data I wish to extract. So well established and easy to understand that there’s even a Wikipedia page here describing the most commonly used data format (nexus). There exist multiple databases set aside specifically to host this type of data, e.g. TreeBASE and MorphoBank. Yet despite all this standardisation and provisioning for paleomorphological phylogenetic data – far less than 1% of all published data is actually readily available in a standardised, digital, usable format.
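
To give a flavour of how little is needed to share a matrix in this format, here’s a minimal Python sketch that writes a tiny character matrix out as a NEXUS DATA block. The taxon names, the character scores and the `to_nexus` helper are all made up by me for illustration; real files usually carry more metadata (character labels, ordering, trees), and the FORMAT options people use vary between programs.

```python
def to_nexus(matrix, missing="?", gap="-"):
    """Return a minimal NEXUS DATA block for a dict of taxon -> character string.

    Hypothetical helper for illustration only; not taken from any library.
    """
    if not matrix:
        raise ValueError("matrix is empty")
    # Every taxon must be scored for the same number of characters.
    lengths = {len(states) for states in matrix.values()}
    if len(lengths) != 1:
        raise ValueError("all taxa must have the same number of characters")
    nchar = lengths.pop()
    width = max(len(name) for name in matrix)  # pad names so columns line up
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={len(matrix)} NCHAR={nchar};",
        f"  FORMAT DATATYPE=STANDARD MISSING={missing} GAP={gap};",
        "  MATRIX",
    ]
    for name, states in matrix.items():
        lines.append(f"    {name.ljust(width)}  {states}")
    lines += ["  ;", "END;"]
    return "\n".join(lines) + "\n"


# Made-up example data: three invented taxa scored for five characters.
example = {
    "Taxon_A": "0101?",
    "Taxon_B": "1101-",
    "Taxon_C": "00110",
}
print(to_nexus(example))
```

A plain-text file like this can be deposited alongside a paper, so nobody has to re-type the matrix from a PDF table.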

In most cases the data is there; you just have to dig very very hard to release it from the pdf file it’s usually buried in (and then spend unnecessary and copious amounts of time, manually reformatting and validating it). See the picture below for a typical example (and yes, it is sadly printed sideways, this is a common and silly practice that publishers use to inappropriately squeeze data matrices into papers):

I hope you’ll agree with me that this is clearly absurd and hugely inefficient. As I explain in my presentation (slides at the bottom of this post) the data, as originally analysed/used, comes in a much richer, more usable, digital, standardised format. Yet when published it gets stripped of all useful metadata and converted into a flat, inextricable and significantly obfuscated table. Why? It’s my belief this practice is a lazy, unwanted vestigial hangover from the days of paper-based (only) publishing, in which this might have been the only way in which to convey the data with the paper. But in 2011, I can confidently say that the vast majority of researchers read and use the digital versions of research papers – so why not make full and proper use of the digital format to aid scientific communication? I’m not arguing that we should axe paper copies, but that we should make sure digital versions are more than just plain pdf versions of the paper copy, as they can and should be IMO.

With this goal in mind, I set about writing an Open Letter to the rest of my research community to explain why we need to richly-digitise our published research data ASAP. Naturally, I wouldn’t get very far just by myself, so I enlisted the support of a variety of academic friends via Facebook, and (inspired by OKFN pads I’d seen) we concocted a draft letter together using an Etherpad. The result of this was a fairly basic Drupal-based website that we launched and disseminated via mailing lists, Twitter, as far and wide as we possibly could, *hoping* just hoping, that our fellow academics would read, take note and support our cause.

Surprisingly, it worked to an extent and a lot of big names in Palaeontology signed our Open Letter in support of our cause; then things got even better when a Nature journalist (Ewen Callaway) got interested in our campaign and wrote an article for Nature News about it, which can be found here. A huge thanks must go to everyone who helped out with the campaign. It’s generated truly international support, as can be seen on the map below:
(You might have to zoom out a bit; for some reason it zooms into Africa by default.)


It’s far too soon to know the true impact of the campaign. Journal editorial boards can be very slow to change their editorial policies, especially if it requires a modicum of extra effort on the part of the publisher. Additionally, once editorial policy does change at a journal, it can only apply to articles submitted from then on, so articles already in the submission pipeline aren’t affected by any new guidelines. Delays of a year between submission and publication are not uncommon in palaeontology, so for this and other reasons, I’m not expecting to see visible change until 2012, but I think we might have helped get the ball rolling, if nothing else…
The Paleontological Society journals (Paleobiology and Journal of Paleontology) have recently adopted mandatory data submission to the Dryad repository, and the Journal of Vertebrate Paleontology has also improved its editorial policy with respect to certain types of data, but these are just a few of the many, many journals that publish palaeontological articles. I’m very much hoping that other journals will follow suit in the next few months and years by taking steps to improve the way in which research data is communicated, for the good of everyone: authors, publishers, funders and readers.

Anyway, here’s the Prezi I used to convey some of that (and more) at OKCon 2011. Huge thanks to the conference organisers for inviting me to give this talk. It was the most professionally run conference I’ve ever been to, by far. Great food, excellent WiFi provisioning, good comms, superb accommodation… I could go on. If the conference is on next year – I’ll be there for sure!

While doing some background research ahead of my imminent talk at OKCon 2011 in Berlin, it’s come to my attention that the Open Access journal PLoS ONE is actually an excellent journal to publish in, with respect to Impact Factor.

In the Digital Age, journals are merely vessels in which we can publish our research. Aside from the prestige of the huge, well-established journals like Nature and Science, there’s not all that much difference between the other journals. Sure, there’s cost to think about, perceived quality of peer-review, length of time it takes to get from submission to being printed, and a few other factors but really – it’s impact (this is not necessarily best measured by the Impact Factor metric, as Bjorn Brembs often points out) that for me at least, is the most important.

If the PLoS ONE Paleontology Collection was a journal it would have a 2010 Impact Factor of 4.15 which would make it the #1 Paleontology-specific journal (vs 2009 JCR ‘Paleontology’ journal scores). But it’s not a journal so perhaps the comparison is an unfair one. Likewise I’m sure if one collected together Nature palaeontological articles and treated them as a ‘journal’ that ‘Nature Palaeontology’ pseudo-journal would have a massive Impact Factor.

Here are my calculations (numbers listed in the order that the publications are in my personal online CUL library, linked to below):

Cites in 2010 to items published in: 2009 = 3 + 6 + 6 + 1 + 1 + 0 + 2 + 1 + 3 + 5 + 6 + 0 + 2 + 5 + 3 + 2 + 5 + 5 + 2 + 4 + 2 + 2 + 3 + 12 = 81

Cites in 2010 to items published in: 2008 = 0 + 6 + 4 + 3 + 4 + 5 + 4 + 10 + 4 + 3 + 6 + 0 + 5 + 8 + 15 + 8 = 85

Number of items published in: 2009 = 24 link to bibliography

Number of items published in: 2008 = 16 link to bibliography

Calculation: IF = (Cites to recent items / Number of recent items) = (81+85) / (24+16) = 4.15
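
For anyone who wants to check the arithmetic, here’s the same sum as a short Python sketch. The per-article counts are the ones listed above; the `impact_factor` function name is just mine, and it only implements the simple two-year ratio used here, not whatever extra rules the official JCR applies.

```python
# Google Scholar cites in 2010 to each Collection article, as listed above.
cites_2009 = [3, 6, 6, 1, 1, 0, 2, 1, 3, 5, 6, 0,
              2, 5, 3, 2, 5, 5, 2, 4, 2, 2, 3, 12]
cites_2008 = [0, 6, 4, 3, 4, 5, 4, 10, 4, 3, 6, 0, 5, 8, 15, 8]


def impact_factor(*cites_per_year):
    """Cites to recent items divided by the number of recent items."""
    total_cites = sum(sum(year) for year in cites_per_year)
    total_items = sum(len(year) for year in cites_per_year)
    return total_cites / total_items


print(sum(cites_2009), len(cites_2009))                 # 81 24
print(sum(cites_2008), len(cites_2008))                 # 85 16
print(round(impact_factor(cites_2009, cites_2008), 2))  # 4.15
```

Keeping the counts per article (rather than just the totals) makes it easy to redo the estimate if any of the Google Scholar numbers turn out to be wrong.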

Of course Thomson Reuters’ official JCR probably doesn’t count citations from journals such as “Caminhos de Geografia”, and Google Scholar (which I used because it’s much quicker/easier/Open than WoK) doesn’t always provide the correct year metadata for each article. But still, as a rough estimate I think this is quite impressive. Well done PLoS!

The task now is to convince fellow palaeontologists that it’s worth publishing here.

Every day I get hugely frustrated that I can’t access articles published in otherwise excellent journals such as Neues Jahrbuch für Geologie und Paläontologie, Abhandlungen, the Canadian Journal of Earth Sciences, and Zootaxa.

These journals and authors who publish in this Closed manner aren’t doing themselves any favours IMO. What’s the point of publishing research if only a very select few people can read it?

Sure, granted, many palaeontologists will happily send you a pdf if you ask for one, either directly via email or on a mailing list such as VRTPALEO, but those routes don’t always work…

Whether it be ‘Gold’ Open Access or ‘Green’ Open Access, it’s a simple matter of logic that Open Access is beneficial for authors and readers alike.