Show me the data!

Notes on my Palaeontology [Online] guest post

December 2nd, 2012 | Posted by rmounce in Open Access | Open Data | Palaeontology | Panton Fellowship updates

I’m proud to announce I have a new article over at Palaeontology [Online]

The Palaeontology [Online] logo – by the P [O] team, licensed under a Creative Commons Attribution License

Posts at ‘P [O]’ are primarily aimed at public engagement, and since the site was launched back in July 2011, with sponsorship and support from the Palaeontological Association, one post per month has been featured on site. This month [December], I’ve written a rather different type of post for them. Not so much about fossils, creatures, classification and rocks – but instead on how palaeontology and science-as-a-whole is made available with respect to Open Access, Open Data, Open Source (code), and Open Educational Resources (OERs). Incidentally, I think it’s also the first P [O] post with embedded video content – really making use of the digital medium!

I’ve tied these strands together with an explicit acknowledgement that Creative Commons has legally enabled all this Open content and that it’s a fantastic achievement. Consider it my early birthday present to celebrate that it’s now been nearly 10 years since Creative Commons first launched (#cc10 on Twitter btw for related news & events).

I’m hoping it will raise awareness that citizens & scientists alike can directly read the primary scientific literature themselves (via Open Access journals and articles), and that they should be encouraged to, given that as taxpayers they’ve paid for most of it to be created! Beyond mere engagement, I’ve also highlighted that uniquely with an Open philosophy there’s nothing stopping ‘amateur’ or citizen science contributions in palaeontology. It’s sad that more of the literature, data, code and educational resources in this area aren’t openly available for re-use – arguably the world would know a lot more about palaeontology if they were.

With specific reference to http://opendefinition.org/ I try to make it clear what open actually means in this context. There’s been a lot of openwashing this year. Open is clearly a desirable state, and a label which will help sell and ‘add value’ to products, therefore both innocent and malicious temptation abounds to mistakenly label or brand things as ‘open’ when they are de facto not open. Education and awareness-raising clearly have a significant role to play here in preventing this problem.

 

During the production of the article some interesting points were raised, which in the end didn’t make it to the ‘final’ version of the post, so I’ll blog them here instead.

 

On Open Access:

For the sake of simplicity I neglected to point out that in actuality the definition of OA is slightly narrower and more specific than just open as per http://opendefinition.org/ . OA is defined by the BOAI-definition which does not require nor allow(?) the ShareAlike (SA) clause. It does however require the Attribution clause (BY):

the relevant excerpt…

… The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited. …

see Mike Taylor’s excellent posts over at SVPOW for more.

 

On Open Data:

I wonder if perhaps there is still a perception out there that there are technical barriers to sharing data openly?

Particularly with regard to very very large datasets & data files. I decided this was too niche a point for inclusion in the main post but in case anyone’s wondering – you can easily share *any* filesize these days.

Journals like GigaScience specialise in publishing ‘big data’ studies and already make available petabytes (1 petabyte = 1 million gigabytes) worth of data. Data archives like figshare allow unlimited filesize uploads (only limited to 1GB if you keep it private), and I’m sure Dryad would also be willing to archive large files. Want proof? Look no further than this 21GB database of microbial data, made available via BioTorrents, that’s been downloaded at least 10 times. I couldn’t find anyone seeding it just now, but if there was greater institutional support for p2p data sharing I’m sure this would take off.

 

On Open Source (code):

There’s an excellent editorial in PLoS Computational Biology that, regrettably, I became aware of too late to include. It’s by Andreas Prlić & Hilmar Lapp, the latter of whom I had the pleasure of meeting recently at NESCent in Durham, North Carolina. It’s a short paper and Open Access, so I recommend you all read at least the Why Do We Support Open-Source Scientific Software? section – it’s an excellent, clear and concise summary of the greater value of open in this area.

 

On MOOCs: 

The American Museum of Natural History (AMNH) have some online courses available here, and whilst they’re probably of the very highest quality, they are neither MOOCs nor OERs because they’re not open, nor free. Each course costs $495 plus a $25 one-time registration fee. Grad credit is also available at an additional cost.

Perhaps one day the AMNH might be persuaded to run one of these courses as a MOOC? Not only to help advertise and drive interest in the other courses, but also to demonstrate their quality. If MIT can do it…

I should also refer interested readers to the excellent set of seminars on phylogenetics over at phyloseminar. I’ve virtually-attended (live, over-the-internet but not there in person) a few of these and have enjoyed recordings of others. It’s not a complete course (so not a MOOC) but depending upon exact licensing details could perhaps be classed as OER-like material. The next one will be soon: 5pm (UK time) Wednesday 5th December, Understanding biodiversity patterns using the Tree of Life, given by Hélène Morlon.

 

 

All the posts over at P [O] are of very high-quality and are worthy academic contributions. As such I’m going to list my post there on my CV as soon as I update it. It’ll sit nicely in my publications list alongside articles in BMC Research Notes, Nature, and The Systematist. I deliberately intermix peer-reviewed publications and non-peer-reviewed publications to make people reconsider and examine the relative merits of each, rather than just counting volume or (worse) the journal Impact Factor which is of course irrelevant.

I encourage everyone else who’s published an article at P [O] to also proudly display it on their CV.

 

 

A couple of days ago I posted specifically about the data re-use session.
I’m going to use this post to muse about the conference more generally.

About SpotOn London 2012

It used to be called Science Online London – an informative, sensible and appropriate name. This year I hear (rumours) that it had to change name to SpotOn because Science AAAS or some other litigious entity was claiming brand identity infringement. I have sympathies with the organisers for this enforced change but ‘SpotOn London’ which apparently stands for “Science Policy, Outreach and Tools Online” (you wouldn’t know unless told!) is not my cup of tea, tbh. I suggest next year we continue to use the #solo hashtag and continue to call it Science Online London (informally), even if for legal reasons this can’t be the official conference title.

SOLO12

As the focus of the conference was tripartite (“Policy”, “Outreach”, and “Tools”), the mix of speakers, panellists & attendees was refreshingly diverse – unlike most academic conferences I go to. High-level & high-profile academics like Prof. Stephen Curry (Imperial College), Prof. Athene Donald (U. of Cambridge), Dr Ethan Perlstein and Dr Jenny Rohn (Science is Vital) mixed freely with PhD students like myself, Jon Tennant, Jojo Scoble, Nick Crumpton, Tom Phillips and others. There were policy people like Mark Henderson and Nic Bilham and even politicians themselves: we should all be grateful for Julian Huppert, MP for Cambridge, one of unfortunately few UK politicians to take a genuine interest in science. There were publishers’ reps including Matt Hodgkinson & Martin Fenner (PLOS), Brian Hole (Ubiquity Press), Graeme Moffat & Kamila Markram (FrontiersIn), Ian Mulvany (eLife), Michael Habib (Elsevier), and ‘independents’ like Anna Sharman (Anna Sharman Editorial Services) & Kaveh Bazargan (of River Valley). There were also librarians Peter Morgan (U. of Cambridge) & Frank Norman, research funders like Geraldine Clement-Stoneham (MRC), journalists like Ed Yong, and really interesting people who defy easy classification(!) like Brian Kelly (UKOLN), Tony Hirst, and the Digital Science team (some of): Euan Adie, Mark Hahnel and Kaitlin Thaney.

Now apologies to those I didn’t name check in the above list – there were many other brilliant and interesting people there (Ed Baker, Vince Smith, David Shotton, Josh Greenberg, I could go on… There’s a fuller list of attendees by Twitter handle here). I merely selected a few from broad categories to show the impressive diversity of representation there. This is one of the very best things about the conference – it attracts virtually all of the stakeholders of science. It’s not just about researchers, publishers, research funders and librarians – it rightly recognises that science isn’t only for ivory tower academics; it’s for everyone.

[Incidentally, for those interested I’d say gender diversity was quite balanced. Alas racial diversity was rather too imbalanced – perhaps sadly reflective of academia as a whole?]

As befits a conference formerly known as ‘Science Online’, *all* of the talks were recorded & tweeted, so there are videos on YouTube of every single one, and Storifys (of the tweets) available to view.

Selected Highlights (aside from the #solo12reuse session):

The journal is dead, long live the journal

In the early stages of this session, I was worried it wasn’t really going anywhere interesting with the discussion…

and then Dr Kaveh Bazargan took the microphone at ~28:22 (skip to that section, it’s brilliant)

on the publishing process, author manuscript submissions & typesetting:

It’s madness really. I’m here to say I shouldn’t be in business.

as any manuscript-submitting biologist knows… publishers ask for all sorts of ridiculously pedantic formatting from us, particularly for reference lists. As Kaveh reminds us, this is all pointless and stupid because when the publishers get the manuscript, they send it off to typesetters to be typeset anyhow – this process is hugely inefficient: “madness”. Not only this, but if a submitted manuscript gets rejected from one journal, the poor authors often have to waste significant amounts of time and energy re-formatting their manuscript to suit the stylistic vagaries of another journal. Microsoft Word is not a good authoring tool – it’s largely unstructured. The publishing process requires a high degree of technical structure, usually provided by XML or TeX.

If you dig into the issue a little bit, you’ll see that programs like Mendeley (and any other reference manager I would think) are fully capable of providing reference lists as structured XML. And yet journal policies enforce that we submit plain text (in say a Word doc), only for the typesetters to get paid by the publishing companies to then re-implement those plain text references back into fully-structured XML. Madness!
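To make that round trip concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the record layout is hypothetical, the XML element names are loosely modelled on NLM-style tags and are not claimed to be valid NLM XML, and this is not how any particular reference manager or typesetter actually works. The bibliographic details are those of the Hrynaszkiewicz & Cockerill paper cited elsewhere on this blog.

```python
import xml.etree.ElementTree as ET

# Hypothetical structured record, roughly as a reference manager might hold it.
reference = {
    "authors": [("Hrynaszkiewicz", "I."), ("Cockerill", "M.")],
    "year": "2012",
    "title": "Open by default: a proposed copyright license and waiver "
             "agreement for open access research and data in peer-reviewed journals",
    "journal": "BMC Research Notes",
    "volume": "5",
    "page": "494",
}


def to_xml(ref):
    """Serialise the record as XML (element names are illustrative only)."""
    citation = ET.Element("citation")
    group = ET.SubElement(citation, "person-group")
    for surname, given in ref["authors"]:
        name = ET.SubElement(group, "name")
        ET.SubElement(name, "surname").text = surname
        ET.SubElement(name, "given-names").text = given
    fields = [("year", "year"), ("article-title", "title"),
              ("source", "journal"), ("volume", "volume"), ("fpage", "page")]
    for tag, key in fields:
        ET.SubElement(citation, tag).text = ref[key]
    return citation


def to_plain_text(citation):
    """Flatten the XML into one plain-text string (one of many house styles)."""
    names = [f"{n.findtext('surname')}, {n.findtext('given-names')}"
             for n in citation.iter("name")]
    return (f"{' and '.join(names)} {citation.findtext('year')}. "
            f"{citation.findtext('article-title')}. "
            f"{citation.findtext('source')} "
            f"{citation.findtext('volume')}:{citation.findtext('fpage')}")


citation = to_xml(reference)
xml_string = ET.tostring(citation, encoding="unicode")
plain = to_plain_text(citation)

print(xml_string)  # every field is labelled and machine-readable
print(plain)       # the labels are gone; only punctuation hints at the structure
```

Going from the structured record to a formatted string is a few lines of code and could be repeated for any journal's house style. Going the other way means guessing where the authors end and the title begins, which is the manual work the typesetters are being paid to redo.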

Typesetters are mostly located in areas where labour is cheap: India, the Philippines, etc. It’s an intensely manual process, and perhaps in future may be less of a necessity(?). Furthermore, as I discovered with a recent BMC manuscript I was an author on, typesetters can sometimes introduce new errors into the publication process, which slows things down even further! I commend the brutal honesty of Dr Bazargan in bravely speaking out about this. This is an issue completely separate to OA/TA journals; both can be guilty of this madness.

It also affects re-use potential, as he also remarked in the #solo12reuse session. Not every publisher publicly exposes the XML version of the papers they publish, even though these are of extreme importance to re-use potential (e.g. mining). The Geological Society of London is one publisher that doesn’t, which is a great shame. I asked Neal Marriott of GSL about this back in June via email and he replied: “We do not currently have a feature to allow download of the NLM XML source.” I also tried to take this up politely with Nic Bilham at the bar after the first night of the conference, but for someone with “external relations” in his job title I found him rather frosty towards me. Happy to say I had no such problems with Grace Baynes (NPG); it was charming to meet her in person after our exchanges regarding NPG’s new OA pricing strategy.

So how do we get GSL and other publishers to expose their XML which they surely have? They already provide HTML & PDF versions, what’s difficult about exposing the underlying XML version too?

 

Altmetrics

I was at another session at the time this was going on but I think this is an important session I should highlight. Broadening the assessment & evaluation of research beyond incredibly narrow metrics (like the journal Impact Factor; die die!) is something that’s clearly very important. Everyone agrees it’s “early days” and that not all impact is measurable, but that shouldn’t dissuade us from actively researching and cautiously embracing this new (positive) trend.

Fraud in Science

Virginia Barbour, Ed Yong and others were on the panel for this one and again, whilst I wasn’t in the room at the time for this session – looking back at the video, I rather wish I was at the session – it was really interesting:

  • Virginia Barbour 18:45 “I think there’s much more evidence of sloppiness than outright fraud… at some PLOS journals we ask authors for the original figures to check for figure manipulation before acceptance… when we ask authors to supply these a large number of authors can’t do it” (is it really that they can’t find the files, or just that they don’t *want* to supply the originals?) “It is completely unacceptable, but not uncommon” Amen Virginia – I agree very much!
  • Virginia Barbour 21:36 “…the larger issues that plague science; sloppiness, unwillingness to share data, conflict of interest, and publication bias. There are *solutions* to these and the great thing is that the internet makes it much easier to spot and actually makes it easier to address than previously…” +1
  • much of the later talk about clinical trials refers to work done in this paper: Prayle, A. P., Hurley, M. N., and Smyth, A. R. 2012. Compliance with mandatory reporting of clinical trial results on ClinicalTrials.gov: cross sectional study. BMJ 344. in which the authors report a rather disappointing 22% compliance rate with US Food and Drug Administration Amendments Act legislation that requires the results of all clinical trials to be reported within reasonable time.
  • Finally, I was seriously impressed and pleased that Ed Yong extolled the virtues of Open Data from 45:07 to enable greater transparency and lower the barriers to critical re-analyses. This is something I most definitely would have raised had I been there, alas I think I was at the “publishing research data: what’s in it for me?” session

So yeah, the conference was great. Not all the sessions were brilliant. The ‘big data’ session was a little disappointing (no offence to any of the panel, just small attendance, little engagement) – perhaps because the topic is already well-covered, with alternative events like the recent O’Reilly Strata Big Data meetup dominating?

I’ll be there at next year’s Science Online London event for sure – whatever it’s called!
;)

So there’s been a few already:

SpotOn day 1 & SpotOn day 2 being the best I’ve read, from Ian Mulvany

see also: SpotOn London – a global conference by Jon Tennant | #solo12 reflection by A J Cann | SpotOn London 2012 in brief by Charis Cook | Altmetric @ SpotOn London 2012 by Jean Liu, and more

but since SpotOn London 2012 was such an interesting event, and there were so many parallel sessions I thought it would be good to add some more to the post-conference discussion

Data is the new black

photo by Bastian Greshake (@gedankenstuecke), Copyright not mine.

It wasn’t just a popular badge at the conference… data is a hot topic in science right now (as it should be!). Data is an undervalued but absolutely vital output of research. Research funding agencies appear to have over-incentivized the production of research publications (many of which are mere executive summaries of the years of research effort they represent) to the exclusion of almost everything else.

Science isn’t just about the production of papers; data and code are extremely important research outputs too (I’m not going to mention patents – they’re a sticky issue best dealt with in another post). The good news is that funding bodies now seem to have realised that they’re seriously missing out on RoI by focusing solely on papers; just recently NSF Grant Proposal Guidelines changed with amended terminology away from narrow-measurement ‘Publications’ to the newer broader term ‘Products’ that explicitly recognises non-publication outputs as creditworthy first class research objects (incidentally, this was one of the many excellent suggestions made in the Force11 manifesto for ‘Improving Future Research Communication and e-Scholarship’ read it if you haven’t already).

The immense value to be gained, time to be saved, and innovative research enabled by making data available for re-use was up for discussion at the #solo12reuse session. Mark Hahnel (@figshare) was organiser/chair, and Sarah Callaghan (@sorcha_ni) of the British Atmospheric Data Centre and I were the invited panelists for a ~1hr slot. As the conference was extremely well-organized *all* sessions were live-streamed via Google Hangouts & made publicly available via YouTube afterwards. I’ve embedded the stream of the #solo12reuse session below:

A transcript of some of what was discussed:

Intros from ~02:00 … then straight into discussion from ~09:00 onwards: Josh Greenberg (@epistemographer) contends that data sharing in chemistry perhaps ‘doesn’t make as much sense’ – I have a feeling PMR & many others would disagree with this!

At 13:20 Sarah Callaghan: NERC sets its data embargo policy so that data can only be withheld for a maximum of 2 years after it was collected, after which it must be made publicly available, somewhere, somehow – the ambiguity of which IMO needs to be worked on…

At 14:25 discussion of ‘levels of re-usability’ and definition. Access control as a means of encouraging data sharing (?)

17:30 Sarah Callaghan: “It’s important to have ‘first dibs’ on your own data” but not beyond this without peer-vetted justification/scrutiny IMO

18:30 David Shotton (@dshotton): noted that one shouldn’t expect absolutely every data point/item to be shared – not all data is useful/valuable. It’s about retaining & making available bits that might be of re-use value.

At 20:24 I start to introduce AMI2 & the OpenContentMining project.

37:50 I bring the Panton Principles on screen, I also had the OKFN Science Working Group page displayed (although not discussed) for a good ~10 minutes. Note to self: hijack the display computer at panel sessions more often…

from 40:17 onwards… Mark Hahnel: “In terms of re-use and getting people incentivized, are Data Papers the future?” Sarah Callaghan “NO. Until research achievement is predicated on something other than publishing in ‘high-impact’ journals then we’re stuffed: we’ve got to shoehorn data & code in order for them to ‘count’ [lamentably]” So for now we need data papers, but perhaps in the future we won’t need to constrain these outputs to a ‘paper’ style format.

from 43:00 Martin Fenner (@mfenner) plays Devil’s Advocate and suggests that data citation may not work and that perhaps #altmetrics might be better indicators of usage. Much debate ensues…

from 45:50 I give a plug to Iain H’s paper, ‘Open By Default’: Hrynaszkiewicz, I. and Cockerill, M. 2012. Open by default: a proposed copyright license and waiver agreement for open access research and data in peer-reviewed journals. BMC Research Notes 5:494. Then we discuss legal barriers to re-using data.

This post has taken a while to write and is fairly long now, so I’m going to split my recap of #solo12 into two or more parts. In part 2 I’ll attempt to discuss some of the other *excellent* sessions I saw, in particular the brilliant, well-received outburst on the absurd inefficiency of the publication process by professional typesetter Dr Kaveh Bazargan during the #solo12journals session. I’m surprised someone hasn’t done a whole blogpost about this already – it was my highlight of the conference tbh!
I’ll be posting part two on Monday 19th November (weekends are slow for blogs… I want people to read this!)

Until then…
 

Gold OA Pricewatch

November 7th, 2012 | Posted by rmounce in Open Access - (12 Comments)

An interesting move from Nature Publishing Group today…

In a press release dated 7 November 2012 they’ve announced they’re allowing the Creative Commons Attribution (CC BY) license to be applied to articles in some (but not all) of their journals, specifically citing Wellcome Trust and RCUK policies that now require their funded authors to publish Gold OA with a CC BY license (or alternatively to use the Green OA route), recognizing that more restrictive licenses get the funders less return on investment.

Also included is a terribly poor quality screenshot of the new Gold OA pricing scheme that will apply for these journals (below)

An image of a table of numbers like this would never be allowed to be published in any one of NPG’s journals. So why did they do this here? Are they actively trying to make it harder for people to compare Gold OA charges between journals? Odd.

But what’s really outrageous about this: they’re explicitly charging MORE for applying/allowing a CC BY license relative to the more restrictive licenses. Applying a license to a digital work costs nothing. By charging £100-400 more for CC BY they’re really taking the piss – charging more for ABSOLUTELY NO ADDITIONAL EFFORT on their part. Horrid.

Other than greed what is the justification for this?

UPDATE: the income made from printing paper (deadtree) reprints, for profit, is cited as the justification. This still doesn’t get away from the fact that this is going to penalise RCUK-funded authors who wish to publish via the Gold OA route. I also don’t remember Nature Publishing Group charging differentiated OA prices for journals that previously offered a choice of different licences – has Scientific Reports always charged different rates for different licenses? NO it seems, just one flat price: £890 AND a choice of three different Creative Commons licenses including CC BY !