Show me the data!

So there’s been a few already:

SpotOn day 1 & SpotOn day 2 being the best I’ve read, from Ian Mulvany

see also: SpotOn London – a global conference by Jon Tennant | #solo12 reflection by A J Cann | SpotOn London 2012 in brief by Charis Cook | Altmetric @ SpotOn London 2012 by Jean Liu, and more

but since SpotOn London 2012 was such an interesting event, and there were so many parallel sessions I thought it would be good to add some more to the post-conference discussion

Data is the new black

photo by Bastian Greshake (@gedankenstuecke), Copyright not mine.

It wasn’t just a popular badge at the conference… data is a hot topic in science right now (as it should be!). Data is an undervalued but absolutely vital output of research. Research funding agencies appear to have over-incentivized the production of research publications (many of which are mere executive summaries of the years of research effort they represent) to the exclusion of almost everything else.

Science isn’t just about the production of papers; data and code are extremely important research outputs too (I’m not going to mention patents – they’re a sticky issue best dealt with in another post). The good news is that funding bodies now seem to have realised that they’re seriously missing out on RoI by focusing solely on papers. Just recently the NSF Grant Proposal Guidelines changed their terminology from the narrow ‘Publications’ to the broader term ‘Products’, which explicitly recognises non-publication outputs as creditworthy first-class research objects. (Incidentally, this was one of the many excellent suggestions made in the Force11 manifesto for ‘Improving Future Research Communication and e-Scholarship’; read it if you haven’t already.)

The immense value to be gained, time to be saved, and innovative research enabled by making data available for re-use was up for discussion at the #solo12reuse session. Mark Hahnel (@figshare) was organiser/chair, and Sarah Callaghan (@sorcha_ni) of the British Atmospheric Data Centre and I were the invited panelists for a ~1hr slot. As the conference was extremely well-organized *all* sessions were live-streamed via Google Hangouts & made publicly available via YouTube afterwards. I’ve embedded the stream of the #solo12reuse session below:

A transcript of some of what was discussed:

Intros from ~02:00 … then straight into discussion from ~09:00 onwards: Josh Greenberg (@epistemographer) contends that data sharing in chemistry perhaps ‘doesn’t make as much sense’ – I have a feeling PMR & many others would disagree with this!

At 13:20 Sarah Callaghan: NERC sets its data embargo policy so that data can only be withheld for a maximum of 2 years after it was collected after which it must be made publicly available, somewhere, somehow – the ambiguity of which IMO needs to be worked on…

At 14:25 discussion of ‘levels of re-usability’ and definition. Access control as a means of encouraging data sharing (?)

17:30 Sarah Callaghan: “It’s important to have ‘first dibs’ on your own data” but not beyond this without peer-vetted justification/scrutiny IMO

18:30 David Shotton (@dshotton): noted that one shouldn’t expect absolutely every data point/item to be shared – not all data is useful/valuable. It’s about retaining & making available bits that might be of re-use value.

At 20:24 I start to introduce AMI2 & the OpenContentMining project.

37:50 I bring the Panton Principles on screen, I also had the OKFN Science Working Group page displayed (although not discussed) for a good ~10 minutes. Note to self: hijack the display computer at panel sessions more often…

from 40:17 onwards… Mark Hahnel: “In terms of re-use and getting people incentivized, are Data Papers the future?” Sarah Callaghan “NO. Until research achievement is predicated on something other than publishing in ‘high-impact’ journals then we’re stuffed: we’ve got to shoehorn data & code in order for them to ‘count’ [lamentably]” So for now we need data papers, but perhaps in the future we won’t need to constrain these outputs to a ‘paper’ style format.

from 43:00 Martin Fenner (@mfenner) plays Devil’s Advocate and suggests that data citation may not work and that perhaps #altmetrics might be better indicators of usage. Much debate ensues…

from 45:50 I give a plug to Iain H’s paper: ‘Open By Default’ Hrynaszkiewicz, I. and Cockerill, M. 2012. Open by default: a proposed copyright license and waiver agreement for open access research and data in peer-reviewed journals. BMC Research Notes 5:494+ then we discuss legal barriers to re-using data.

This post has taken a while to write and is fairly long now, so I’m going to split my recap of #solo12 into two or more parts. In part 2 I’ll attempt to discuss some of the other *excellent* sessions I saw, in particular the brilliant, well-received outburst on the absurd inefficiency of the publication process by professional typesetter Dr Kaveh Bazargan during the #solo12journals session. I’m surprised someone hasn’t done a whole blogpost about this already – it was my highlight of the conference tbh!
I’ll be posting part two on Monday 19th November (weekends are slow for blogs… I want people to read this!)

Until then…

Gold OA Pricewatch

November 7th, 2012 | Posted by rmounce in Open Access - (8 Comments)

An interesting move from Nature Publishing Group today…

In a press release dated 7 November 2012 they’ve announced they’re allowing the Creative Commons Attribution (CC BY) license to be applied to articles in some (but not all) of their journals, specifically citing Wellcome Trust and RCUK policies that now require their funded authors to publish Gold OA with a CC BY license (or alternatively to use the Green OA route), recognizing that more restrictive licenses get the funders less return on investment.

Also included is a terribly poor quality screenshot of the new Gold OA pricing scheme that will apply for these journals (below)

An image of a table of numbers like this would never be allowed to be published in any one of NPG’s journals. So why did they do this here? Are they actively trying to make it harder for people to compare Gold OA charges between journals? Odd.

But what’s really outrageous about this: they’re explicitly charging MORE for applying/allowing a CC BY license relative to the more restrictive licenses. Applying a license to a digital work costs nothing. By charging £100-400 more for CC BY they’re really taking the piss – charging more for ABSOLUTELY NO ADDITIONAL EFFORT on their part. Horrid.

Other than greed what is the justification for this?

UPDATE: the income made from printing paper (deadtree) reprints, for profit, is cited as the justification. This still doesn’t get away from the fact that this is going to penalise RCUK-funded authors who wish to publish via the Gold OA route. I also don’t remember Nature Publishing Group charging differentiated OA prices for journals that previously offered a choice of different licences – has Scientific Reports always charged different rates for different licenses? NO it seems, just one flat price: £890 AND a choice of three different Creative Commons licenses including CC BY !

Recently I had the opportunity to collaborate on an extremely timely paper on data sharing and data re-use in phylogenetics, as part of the continuing MIAPA (Minimal Information for a Phylogenetic Analysis) working group project:


Additionally, in order to also practice what we preach about data archiving, we opted (it wasn’t mandated by the journal or editors) to put the underlying data for this publication in Dryad so it was immediately freely available for re-use/scrutiny/whatever upon publication of the paper, under a CC0 waiver

Dryad (and similar free services like FigShare, MorphoBank & LabArchives) allow research data to be made available either pre-publication, on publication, or even post-publication with optional embargoes (access denied for up to 1 year after the paper is published). I’m strongly against the use of data embargoes but Dryad allow it because embargoed data is better than no data at all! I’ve seen some recent papers that have made use of this option and apparently the journals, editors & reviewers are ‘fine’ with this practice of proactively denying access to data. I guess it’s a generational thing? That sort of practice used to be understandably okay pre-Internet, when data was costly to distribute. But now that we can freely & easily distribute supporting data, there are a multitude of reasons why we really should, unless there are justifiable reasons not to, e.g. privacy with sensitive medical/patient data.


I haven’t had all that much experience of the publication process so far – I’m amazed how kludgy it can be at times – far from smooth or efficient IMO. I was in charge of the Dryad data deposition for this paper among other things and because the journal isn’t integrated with Dryad’s deposition process it took me quite a few emails to work out what & when to do things but it wasn’t a major difficulty – the benefits of doing this will almost certainly outweigh the small effort cost of doing it. Those journals with a Dryad-integrated workflow will no doubt have a smoother process.

Another thing I learnt from this manuscript was that publishers commonly outsource their typesetting to developing countries (for the cheaper labor available there). So in this instance BMC sent our MS to the Philippines to be re-typeset for publication and when the proofs came back we encountered some really comical errors, e.g. Phylomatic had been re-typeset as ‘phlegmatic’. This sparked a very serendipitous conversation on Twitter, which eventually led to Bryan Vickery (Chief Operating Officer at BMC) inviting me to visit the London office of BMC to have a chat about ‘all-things-publishing’ (and btw, serious *props* to PLOS and BMC for having such nice, helpful tweeps on Twitter):

Bryan and I arranged a time and a date (after-SVP) and so I ended-up visiting BMC for more than 2 hours on Wednesday 24th October. I got to meet not only Bryan but also Deborah Kahn, Shane Canning and others including some of the editors for BMC Research Notes (thanks again for helping publish our paper!) & BMC Evolutionary Biology. Iain Hrynaszkiewicz was there too (Hi Iain!), given our enthusiasm for Open Data (do read his *excellent* paper ‘Open By Default’ in the same article collection as ours) I’m sure we’ll meet again at more workshops and events in future.

I couldn’t possibly go through everything that was explained to me there but it certainly was illuminating. I suspect many junior academics like myself have little or no clue at all as to the behind-the-scenes processes that go on with manuscripts to get them into a state ready for publication. Perhaps a publisher visit (or even short placement?) scheme like this should be run as part of a postgraduate skills training session? Moreover perhaps it could help alleviate the ‘too many PhDs, too few academic jobs’ problem by highlighting skilled sciencey jobs like STM publishing as viable and noble alternatives to the extremely overpopulated rat-race for tenure-track academic jobs. STM publishing isn’t even an ‘exit’ from academia. People like Jenny Rohn (chair of Science is Vital) have demonstrated that one can go into STM publishing and still go back into academia after this.

The cost of peer-review & publishing


This part of the post has sat on the backburner for a long time because it’s a complex one.

From what I was told (and I could well believe) organizing peer-review can be an immensely variable process. Sometimes it can be very simple. Automated processes such as peer2ref can be used to select appropriate reviewers for a manuscript; if these reviewers accept and get on with it nicely and in a timely fashion, the process can be of very little administrative burden. However there are also times when maybe 10 or 12 reviewers need to be contacted before 2 may agree, and then there can be complications after this leading to a very time-consuming, costly and burdensome process. So organizing peer-review costs money, but it’s difficult, or perhaps commercially-sensitive (?), to put an average price on that process, so I’m still in the dark on how much this process should cost. If anyone knows of a reputable source for data on this please do let me know.


What of DOIs? Why do some high-volume journals like Zootaxa & Taxon operate without DOIs? Is there really much money to be saved by dispensing with them? Well, Bryan kindly pointed me to this link here for all the salient info.

It’s just $1 per DOI. That’s nothing tbh. What’s more, it’s even cheaper to retrospectively add DOIs to older already published content: ‘backfile’ DOIs are just $0.15. That means Zootaxa could retrospectively add DOIs to all ~5866 of their backfile articles (2004-2009) for just $880! There are plenty of other things that would need fixing before that happened though; Zootaxa doesn’t even have proper article landing pages, as was pointed out to me by Rod Page. No doubt there would also be some labour cost associated with getting someone to add DOIs to all those thousands of articles. Still, it looks cheap to me. I still feel justified in the annoyed rant I sent to TAXACOM a while ago about this pressing issue with respect to DOIs and the responsibility of publishers.
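For anyone who wants to redo that back-of-the-envelope sum for their own journal, here is a minimal sketch in Python. The two fees are simply the figures quoted above (they may well have changed since), and the function and variable names are my own invention for illustration, not anything official.

```python
from decimal import Decimal

# Per-DOI fees in US dollars, as quoted in this post.
# Assumption: a flat fee per DOI; check the current fee schedule before relying on these.
CURRENT_FEE = Decimal("1.00")    # newly published articles
BACKFILE_FEE = Decimal("0.15")   # older, already published content


def doi_cost(n_articles: int, fee_per_doi: Decimal) -> Decimal:
    """Total DOI fees for n_articles at a flat per-DOI fee (labour not included)."""
    return n_articles * fee_per_doi


zootaxa_backfile = 5866  # approximate backfile article count (2004-2009) quoted above
print(doi_cost(zootaxa_backfile, BACKFILE_FEE))  # 879.90
```

Using `Decimal` rather than floats keeps the cents exact; the ‘$880’ above is just this figure rounded up.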

This also has ramifications for some of the changes I’ve been pushing for now I’m on the Systematics Association council. Our main publication is a book, and each of the chapters *could* have a DOI issued for it but currently doesn’t. I suggested we issue DOIs at the last council meeting, but alas it’s not up to me; we need co-operation from our publisher to make this happen (Hi, Cambridge University Press!). Book chapter DOIs cost just $0.25 per DOI, so I think this small cost would certainly be worth it, if it raises the discoverability and citeability of our publications.

Article submission

A final point of interest from my BMC visit: Bryan told me that BMC used to offer a means by which authors could submit their works directly via an XML authoring tool. It wasn’t popular, but I wonder whether this was perhaps because it was a little before its time? The whole process of Biologists submitting Word files, having figures and text inadvertently mangled and wrongly re-typeset at the publisher seems extremely inefficient to me. Physicists & Computational Scientists seem to get along fine with LaTeX submission processes which alleviate some but not all of the typesetting shenanigans. Perhaps it is the authors, and the authoring tools that need to change to enable more re-usable research in the future, to fully enable the potential of the semantic Web. It looks like Pensoft might be trying to go again in this direction with its Pensoft Writing Tool.

image by Gregor Hagedorn. CC BY-SA

On that note, it might be good to end with a small advert for the Pro-iBiosphere biodiversity informatics & taxonomy workshop in Leiden (NL), February 2013.

I very much look forward to meeting taxonomists IRL!





I just submitted some comments to SPARC / PLOS / OASPA’s request for public comment on their new HowOpenIsIt? material here. If you haven’t done so yourself, the deadline is TODAY 5pm (EST).

Below are the comments I submitted: a mixture of praise for remembering to include machine-readability, concern over some possible interpretations, and practical points on providing hyperlinks or URLs for all the CC licenses mentioned:


* I heartily support & commend that Machine Readability takes pride of place within this guide to Open Access. This freedom was there from the start in the Budapest declaration: “…crawl them for indexing, pass them as data to software, or use them for any other lawful purpose…” but in recent years this freedom has been often neglected by some, and worse actively-restricted by some subscription-based publishers in their contractual agreements. Yet it represents one of the most important freedoms that needs to be enabled by Open Access. It has been estimated that over 50 million academic articles have been published and the volume of publications is increasing rapidly year on year. The only rational way we’ll be able to make full use of all this research both NOW and in the future, is if we are allowed to use machines to help us make sense of this vast and growing literature.

* I am slightly worried that the statement on machine readability for Open Access could yet provide a barrier that publishers might use to protect their content from mining: “…through a community standard API or protocol” perhaps leaves too much to interpretation. The API provided could be a poor one, inflexible and not sufficiently cutting-edge for the research required. I think there is no need for a clause on how machines might be allowed access to Open Access research if it is published CC-BY as mentioned under Reuse Rights; only that the medium in which the work is published (PDF, HTML, XML or other) is sufficiently machine-interpretable and not DRM-protected.

* I support that the guide itself is licensed under CC-BY-NC-ND to prevent derivative or modified works, to prevent interoperability problems. This is in line with both W3C and IETF practices.

* May I suggest the paper version of this guide (if there is to be one) be printed with full URLs to the CC-BY-NC-ND, CC-BY, & CC BY-NC licenses mentioned in the guide. Likewise the electronic/digital version should have clickable hyperlinks to further explain these contractions.

* I think the guide should make it clearer that the label ‘Open Access’ should only be applied to content that has all of the full top-line suite of rights. Anything less than this in any of the categories is nearly but not quite Open Access. There are other terms available for such less Open content, like ‘free access’, ‘public access’, ‘less-restricted access’ that can all be applied in some form or combination to apply to the set of rights in between ‘Open Access’ and ‘Closed Access’. This guide should reaffirm that only the full suite of Open rights makes a work Open Access.

* However, I do wonder if the question of who holds copyright (author or publisher) is somewhat irrelevant to Open Access? I certainly support that authors retain copyright to their own content, but in instances where the publisher has taken the copyright and the work is in all other respects fulfilling the other qualities of Open Access – is this not Open Access? Surely then the Copyright column is just a special case subset of the Reuse Rights column? The issue of who holds copyright is something important but separate to Open Access in my opinion.

* Ditto for ‘Author Posting’ this duplicates what is given in the Reuse Rights column, just a special case for the author. This section is usefully distinct in grey not-quite-Open Access cases, but for Open Access it is just a rewritten duplication that *anyone* has the right to reuse/repost.

At some point I also intend to make comment on BMC’s Open Data & Open Bibliography RFP but the deadline for that is much later and I have LOTS of work to do in the mean time, so that’ll have to wait for a bit…

Opportunity Knocks

October 3rd, 2012 | Posted by rmounce in Open Access | Palaeontology - (2 Comments)

A few months ago I gave a short talk about the Open Knowledge Foundation and its activities as relevant to academics at a small (but good!) palaeontology conference in Cambridge (which I blogged about previously).

I didn’t need to give this talk. Neither the OKF nor my academic progression required me to give this talk. I just felt it might be helpful to let my friends and peers know who the OKF are, what they’re trying to achieve, and what my Panton Fellowship is about.

That optional talk has now paid HUGE dividends: enabling me to talk live on BBC Radio 3 last night about Open Access and the beneficial impact this will have on research with our Minister for Science & Universities, David Willetts MP & Dame Janet Finch (writer of ‘the Finch report’). I got some good time at the end after the show to speak with David about encouraging efficiently run ultra low-cost journals like the Journal of Machine Learning Research. I hope this will have had some influence, if not, I certainly tried!

So how did this come about?

Nick Crumpton, PhD student at the University of Cambridge, and one of the student organisers for Progressive Palaeontology 2012 (ProgPal) is also a BBC Online British Science Association Media Fellow and thus has good contacts at the BBC. They were apparently looking for a young scientist to come on the show and give an informed opinion from ‘the coalface’ of research so Nick kindly remembered my impassioned talk from ProgPal on OKF & openness in academia and recommended me.

I got in touch with the programme producer, and was invited to join the live radio debate later that night.

Image © British Broadcasting Corporation. Click through to listen to the radio programme. The Open Access discussion segment occurs from about 6min40s in

…and that’s how it happened.

With Open Access Week coming up very soon, 22-28 October, I guess the point of this post is:

No matter how small your contribution towards the advocacy of Open Access might seem; every little helps. Keep at it. Keep speaking out about OA until all publicly funded research everywhere (glares at the US) is Open Access.

Postscript: That same day Sir Mark Walport was also interviewed on BBC Radio, partly about Open Access – I highly recommend & agree with his opinions; the link is here. Listen from 11.38 to 15.10 for the OA bits h/t Steve Hitchcock @stevehit

Wow! Where to begin… In this post I shall attempt to summarise some of OKFestival 2012.

Some Background:

I had been to the Open Knowledge Conference last year (in Berlin), where I gave an invited talk on Open Palaeontology and met lots of brilliant people in the Open Science community like Bjoern Brembs, Cameron Neylon & Peter Murray-Rust. But this year the event was even bigger, and even better – teaming up with the annual Open Government Data Camp for a mega-event.

The Event Itself:

It was a little awkward that it was held so far away from most of the conference accommodation – everyone had a 20-30 minute commute before getting to the venue, and some of the talk rooms were fairly far apart. But once the conference goers got used to that it was plain sailing from there, and the Aalto University buildings themselves were wonderfully modern and well equipped for it (inc. great WiFi). I got to Helsinki on the Tuesday, and caught the tail end of the Data Journalism session that day, including an excellent, inspirational talk that, amongst other things, detailed the amazing knowledge and insight gained from tracking the movement of ships with open data. I couldn’t help thinking that academics could learn a lot from these open data visualization experts (myself included!).

An interesting example of Shippr data – ships turn off their beacons once they pass the point for fear of pirates…

Wednesday – my chance to make a difference

I really liked the way that the conference had an introductory session to the day’s parallel events in the morning from 10am – 11am. If one was unsure of which stream to go to, these Morning Plenaries gave each topic stream a chance to pitch their events in a short slot to the awaiting audience. I thought this was very helpful given there were 13 separate topic streams at the conference!

I was involved in two sessions this day. Firstly the Open Access discussion panel, the video for which is here, with Tim Hubbard (Sanger Institute), Carlos Rossel (World Bank), Peter Murray-Rust (University of Cambridge / Open Knowledge Foundation) and Tom Olijhoek & Mark MacGillivray (Open Access Index):

It’s a long video; we covered many topics, with excellent contributions from the audience including Puneet Kishor from Creative Commons and Matt Todd from the Open Source Drug Discovery team amongst others.

Then after this there was the research data session with contributions from Mark Wainwright on CKAN, Mark Hahnel on Figshare and Joss Winn of the Orbital project.

Finally we finished with the Panton Fellowships Session with talks from myself and Sophie Kershaw on what we’d been doing in our fellowship work:

The day was rounded off with a hugely inspirational talk from Matt Todd summarising his Open Source Drug Discovery work in the main lecture theatre, with a lovely if expensive meal afterwards in Lasipalatsi Ravintola.


I spent some quality time with Peter working on a BBSRC grant proposal.
I also thoroughly enjoyed Hans Rosling’s fantastic keynote presentation, which I urge you all to watch – it was brilliant, and thrilling to be there live in the audience for.


If there’s one thing that impresses me most of all about OKFestival, it’s this: it’s not just about talking – they do things here too. Lots of ‘hacking’ sessions on Friday to create new tools and collate awesome new data. Most conferences are extremely boring in that it’s just talk after talk after talk. Things get done here, new collaborations are started, fresh links across disciplinary boundaries are made connecting journalism with academia, economic development with open architectural design, and other incredible trans-disciplinary mashups. It’s a joy to behold.

I’m really glad I came to OKFestival, as ever I got a lot out of it.

Next year it’ll be in Switzerland (?), I hope I didn’t just make that up… I seem to remember that it was announced to be there but I couldn’t find any confirmation from Google. Rest assured I’ll try and be there though!

I said I would make an update on Tuesday (today), so if I get this posted before midnight I will (just) have met that  goal…

In this (minor) update I have:

added: Ubiquity Press (great low cost option!), SPIE (scored for 1-column per page), SAGE Open, Frontiers, WileyOpenAccess, OxfordOpen (OUP hybrid option), GigaScience, Open Biology (Royal Society)

added the label for: Pensoft (sincerest apologies, it is tied with Copernicus and was on the 0.1 plot, just unlabelled!)

changed the categorization of: Scientific Reports (NPG) [I have put it in a no-man’s-land between CC BY and CC BY NC since they give authors a choice of licenses. I think this is a bad idea as it allows authors to make the mistake of choosing a less open licence (are there really any common circumstances in which they might want a less open, free to read licence?)]


As noted elsewhere there are actually a lot of completely fee-free Gold Open Access journals out there (I shall try and make a listing of them in a future post), they’re just not perhaps all that well-known. GigaScience and Open Biology (Royal Society) are temporarily completely fee-free options that certainly look like good recommendations!


I shall endeavour to add-in more of a variety of the various differently priced BMC journals in the next update of the plot. Basically I believe most of them lie in the range between BMC Research Notes, and BMC Biology.

My site stats show that in just a few days v0.1 of the plot had nearly 1000 pageviews, which is HUGE for my otherwise low-key blog!

And it has had real impact already. Thanks to Mike Taylor, Acta Pal. Polonica is thinking of adopting the CC BY licence. Brilliant news! It is fee-free but not explicitly licensed to allow re-use at the moment. Hopefully this will change soon.


Anyway, I have to get off the train now, so that’ll be the end of this post.




Since Sunday afternoon I’ve been at an International Council for Science (ICSU) / Royal Society invited workshop on ‘Revaluing Science in the Digital Age’.

We’ve had a fascinating set of talks from academics, publishers (PLoS, Nature, BMC), librarians, policymakers, data managers, scientific societies…

Attendees included:
Jose Cotta (European Commission)

Mark Thorley (RCUK)
Chris Banks  (University Librarian and Director, Aberdeen)
Mark Hahnel (Figshare)
Max Wilkinson (UCL, Head of Research Data Service)
Dave Roberts (ViBRANT)
Rob Frost (GSK)
Catriona MacCallum (PLoS)
Mark Forster (Syngenta)
Iain Hrynaszkiewicz (BMC)
Ruth Wilson (Nature Publishing Group)
Kaitlin Thaney (Digital Science)
Stuart Taylor (Royal Society)
Robert Simpson (Zooniverse)
Paul Groth (OpenPHACTS)
and more…


I gave a talk on content mining and the importance of full BOAI-compliant Open Access with respect to this, on behalf of the Open Knowledge Foundation:

There was lots of discussion on reproducibility, provenance of data, peer review, incentives, research misconduct and ethics.

I’ve met many new people and have learnt many new things. For example, on the subject of reproducibility I talked about Roger Peng and the journal Biostatistics in discussion, and then was soon informed that there was an analogous journal in Chemistry called Organic Syntheses whereby:

In order for a procedure to be accepted for publication, each reaction must be successfully repeated in the laboratory of a member of the Editorial Board at least twice, with similar yields (generally ±5%) and selectivity similar to that reported by the submitters.

Fantastic! We were also informed that this rigorous protocol ensures that research published in this journal is very highly regarded. I’ve suggested similar such reproducibility checks for phylogenetics research before (at the Systematics Association Biennial meeting Belfast, 2011) but this was viewed as too futuristic / infeasible…
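To make that acceptance rule a bit more concrete, here is a rough sketch of how such a check could be expressed in code. It is purely illustrative: I am assuming the ‘±5%’ means absolute percentage points of yield, it ignores the selectivity part of the rule entirely, and the function name and parameters are hypothetical, not anything the journal actually uses.

```python
def is_reproduced(reported_yield, repeat_yields, tolerance=5.0, min_repeats=2):
    """Return True if at least `min_repeats` repeat runs gave a yield within
    `tolerance` percentage points of the yield reported by the submitters.

    Assumption: yields are percentages (0-100) and the tolerance is absolute.
    """
    close_enough = [y for y in repeat_yields if abs(y - reported_yield) <= tolerance]
    return len(close_enough) >= min_repeats


# Submitters report 82%; the checker's lab gets 80% and 85.5% on two repeats.
print(is_reproduced(82.0, [80.0, 85.5]))  # True
# One of the two repeats falls well short, so the procedure is not yet reproduced.
print(is_reproduced(82.0, [80.0, 70.0]))  # False
```

An analogous check for phylogenetics would need some agreed notion of ‘similar’ results in place of yields, which is presumably where the hard part lies.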

Right now we’re working on a draft statement of outcome from this workshop that ICSU can pass to its members to possibly officially agree to endorse.

So I better finish here, and get back to the discussion.
I’m rather hoping they will endorse the Panton Principles rather than reinvent the wheel (policy-wise).

Exciting times!


PS I have made a Storify of the tweets from the workshop here .