Show me the data!

A couple of days ago I posted specifically about the data re-use session.
I’m going to use this post to muse about the conference more generally.

About SpotOn London 2012

It used to be called Science Online London – an informative, sensible and appropriate name. This year I hear (rumours) that it had to change name to SpotOn because Science AAAS or some other litigious entity was claiming brand identity infringement. I have sympathies with the organisers for this enforced change but ‘SpotOn London’ which apparently stands for “Science Policy, Outreach and Tools Online” (you wouldn’t know unless told!) is not my cup of tea, tbh. I suggest next year we continue to use the #solo hashtag and continue to call it Science Online London (informally), even if for legal reasons this can’t be the official conference title.


As the focus of the conference was tripartite (“Policy”, “Outreach” and “Tools”), the mix of speakers, panellists & attendees was refreshingly diverse – unlike most academic conferences I go to. High-level & high-profile academics like Prof. Stephen Curry (Imperial College), Prof. Athene Donald (U. of Cambridge), Dr Ethan Perlstein and Dr Jenny Rohn (Science is Vital) mixed freely with PhD students like myself, Jon Tennant, Jojo Scoble, Nick Crumpton, Tom Phillips and others. There were policy people like Mark Henderson and Nic Bilham and even politicians themselves: we should all be grateful to Julian Huppert, MP for Cambridge, one of unfortunately few UK politicians to take a genuine interest in science. There were publishers’ reps including Matt Hodgkinson & Martin Fenner (PLOS), Brian Hole (Ubiquity Press), Graeme Moffat & Kamila Markram (FrontiersIn), Ian Mulvany (eLife), Michael Habib (Elsevier), and ‘independents’ like Anna Sharman (Anna Sharman Editorial Services) & Kaveh Bazargan (of River Valley). Also there were librarians Peter Morgan (U. of Cambridge) & Frank Norman, research funder Geraldine Clement-Stoneham (MRC), journalist Ed Yong, and really interesting people who defy easy classification(!) like Brian Kelly (UKOLN), Tony Hirst, and (some of) the Digital Science team: Euan Adie, Mark Hahnel and Kaitlin Thaney.

Now apologies to those I didn’t name check in the above list – there were many other brilliant and interesting people there (Ed Baker, Vince Smith, David Shotton, Josh Greenberg, I could go on… There’s a fuller list of attendees by Twitter handle here). I merely selected a few from broad categories to show the impressive diversity of representation there. This is one of the very best things about the conference – it attracts virtually all of the stakeholders of science. It’s not just about researchers, publishers, research funders and librarians – it rightly recognises that science isn’t only for ivory tower academics; it’s for everyone.

[Incidentally, for those interested, I’d say gender diversity was quite balanced. Alas, racial diversity was rather too imbalanced – perhaps sadly reflective of academia as a whole?]

As befits a conference formerly known as ‘Science Online’, *all* of the talks were recorded & tweeted, so there are videos on YouTube of every single one, and Storifys (of the tweets) available to view.

Selected Highlights (aside from the #solo12reuse session):

The journal is dead, long live the journal

In the early stages of this session, I was worried it wasn’t really going anywhere interesting with the discussion…

and then Dr Kaveh Bazargan took the microphone at about 28:22 (skip to that section, it’s brilliant)

on the publishing process, author manuscript submissions & typesetting:

It’s madness really. I’m here to say I shouldn’t be in business.

As any manuscript-submitting biologist knows, publishers ask for all sorts of ridiculously pedantic formatting from us, particularly for reference lists. As Kaveh reminds us, this is all pointless and stupid because when the publishers get the manuscript, they send it off to typesetters to be typeset anyhow. The process is hugely inefficient: “madness”. Not only this, but if a submitted manuscript gets rejected from one journal, the poor authors often have to waste significant amounts of time and energy re-formatting their manuscript to suit the stylistic vagaries of another journal. Microsoft Word is not a good authoring tool – it’s largely unstructured. The publishing process requires a high degree of technical structure, usually provided by XML or TeX.

If you dig into the issue a little bit, you’ll see that programs like Mendeley (and any other reference manager, I would think) are fully capable of providing reference lists as structured XML. And yet journal policies insist that we submit plain text (in, say, a Word doc), only for the publishing companies to pay typesetters to turn those plain-text references back into fully structured XML. Madness!
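To make that concrete, here’s a minimal Python sketch (mine, not Kaveh’s and not how Mendeley actually does it) of one reference held as structured data and then written out both as structured XML and as a plain-text string in one journal’s house style. The element names are illustrative, loosely modelled on NLM-style tagging rather than any validated schema, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# One reference held as structured data, roughly how a reference manager stores it.
reference = {
    "authors": [("Hrynaszkiewicz", "I."), ("Cockerill", "M.")],
    "year": "2012",
    "title": ("Open by default: a proposed copyright license and waiver "
              "agreement for open access research and data in "
              "peer-reviewed journals"),
    "journal": "BMC Research Notes",
    "volume": "5",
    "page": "494",
}


def to_xml(ref):
    """Serialise a reference as structured XML (element names are illustrative)."""
    citation = ET.Element("citation")
    group = ET.SubElement(citation, "person-group")
    for surname, given in ref["authors"]:
        name = ET.SubElement(group, "name")
        ET.SubElement(name, "surname").text = surname
        ET.SubElement(name, "given-names").text = given
    fields = [("year", "year"), ("article-title", "title"),
              ("source", "journal"), ("volume", "volume"), ("fpage", "page")]
    for tag, key in fields:
        ET.SubElement(citation, tag).text = ref[key]
    return ET.tostring(citation, encoding="unicode")


def to_plain_text(ref):
    """Flatten the same reference into one (hypothetical) journal house style."""
    authors = " and ".join(f"{surname}, {given}" for surname, given in ref["authors"])
    return (f"{authors} {ref['year']}. {ref['title']}. "
            f"{ref['journal']} {ref['volume']}:{ref['page']}")


xml_version = to_xml(reference)
text_version = to_plain_text(reference)

# The XML can be read straight back by a machine; no guesswork needed.
surnames = [el.text for el in ET.fromstring(xml_version).iter("surname")]

print(xml_version)
print(text_version)
print(surnames)
```

Going from the structured record to any house style is a few lines of code. Going the other way, from the plain-text string back to tagged fields, means guessing where the authors end and the title begins, which is the manual work being paid for.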

Typesetters are mostly located in areas where labour is cheap (India, the Philippines etc.). It’s an intensely manual process, and perhaps in future it may be less of a necessity(?). Furthermore, as I discovered with a recent BMC manuscript I was an author on, typesetters can sometimes introduce new errors into the publication process, which slows things down even further! I commend the brutal honesty of Dr Bazargan in bravely speaking out about this. It is an issue completely separate from OA/TA journals; both can be guilty of this madness.

It also affects re-use potential, as he remarked in the #solo12reuse session. Not every publisher publicly exposes the XML version of the papers they publish, yet these are of extreme importance to re-use potential (e.g. mining). The Geological Society of London is one publisher that doesn’t expose the XML of its publications, which is a great shame. I asked Neal Marriott of GSL about this back in June via email and he replied: “We do not currently have a feature to allow download of the NLM XML source.” I also tried to take this up politely with Nic Bilham at the bar after the first night of the conference, but for someone with “external relations” in his job title I found him rather frosty towards me. Happy to say I had no such problems with Grace Baynes (NPG); it was charming to meet her in person after our exchanges regarding NPG’s new OA pricing strategy.

So how do we get GSL and other publishers to expose their XML which they surely have? They already provide HTML & PDF versions, what’s difficult about exposing the underlying XML version too?



I was at another session at the time this was going on but I think this is an important session I should highlight. Broadening the assessment & evaluation of research beyond incredibly narrow metrics (like the journal Impact Factor; die die!) is something that’s clearly very important. Everyone agrees it’s “early days” and that not all impact is measurable, but that shouldn’t dissuade us from actively researching and cautiously embracing this new (positive) trend.

Fraud in Science

Virginia Barbour, Ed Yong and others were on the panel for this one. Again, I wasn’t in the room at the time, but looking back at the video I rather wish I had been – it was really interesting:

  • Virginia Barbour 18:45 “I think there’s much more evidence of sloppiness than outright fraud… at some PLOS journals we ask authors for the original figures to check for figure manipulation before acceptance… when we ask authors to supply these a large number of authors can’t do it” (is it really that they can’t find the files, or just that they don’t *want* to supply the originals?) “It is completely unacceptable, but not uncommon” Amen Virginia – I agree very much!
  • Virginia Barbour 21:36 “…the larger issues that plague science; sloppiness, unwillingness to share data, conflict of interest, and publication bias. There are *solutions* to these and the great thing is that the internet makes it much easier to spot and actually makes it easier to address than previously…” +1
  • Much of the later talk about clinical trials refers to work done in this paper: Prayle, A. P., Hurley, M. N., and Smyth, A. R. 2012. Compliance with mandatory reporting of clinical trial results on ClinicalTrials.gov: cross sectional study. BMJ 344. In it the authors report a rather disappointing 22% compliance rate with US Food and Drug Administration Amendments Act legislation that requires the results of all clinical trials to be reported within reasonable time.
  • Finally, I was seriously impressed and pleased that Ed Yong extolled the virtues of Open Data from 45:07 to enable greater transparency and lower the barriers to critical re-analyses. This is something I most definitely would have raised had I been there; alas, I think I was at the “publishing research data: what’s in it for me?” session.

So yeah, the conference was great. Not all the sessions were brilliant. The ‘big data’ session was a little disappointing (no offence to any of the panel, just small attendance and little engagement) – perhaps because the topic is already well covered elsewhere, with alternative events like the recent O’Reilly Strata Big Data meetup dominating?

I’ll be there at next year’s Science Online London event for sure – whatever it’s called!

So there’s been a few already:

SpotOn day 1 & SpotOn day 2 being the best I’ve read, from Ian Mulvany

see also: SpotOn London – a global conference by Jon Tennant | #solo12 reflection by A J Cann | SpotOn London 2012 in brief by Charis Cook | Altmetric @ SpotOn London 2012 by Jean Liu, and more

but since SpotOn London 2012 was such an interesting event, and there were so many parallel sessions I thought it would be good to add some more to the post-conference discussion

Data is the new black

photo by Bastian Greshake (@gedankenstuecke), Copyright not mine.

It wasn’t just a popular badge at the conference… data is a hot topic in science right now (as it should be!). Data is an undervalued but absolutely vital output of research. Research funding agencies appear to have over-incentivized the production of research publications (many of which are mere executive summaries of the years of research effort they represent) to the exclusion of almost everything else.

Science isn’t just about the production of papers; data and code are extremely important research outputs too (I’m not going to mention patents – they’re a sticky issue best dealt with in another post). The good news is that funding bodies now seem to have realised that they’re seriously missing out on RoI by focusing solely on papers; just recently NSF Grant Proposal Guidelines changed with amended terminology away from narrow-measurement ‘Publications’ to the newer broader term ‘Products’ that explicitly recognises non-publication outputs as creditworthy first class research objects (incidentally, this was one of the many excellent suggestions made in the Force11 manifesto for ‘Improving Future Research Communication and e-Scholarship’ read it if you haven’t already).

The immense value to be gained, time to be saved, and innovative research enabled by making data available for re-use was up for discussion at the #solo12reuse session. Mark Hahnel (@figshare) was organiser/chair, and Sarah Callaghan (@sorcha_ni) of the British Atmospheric Data Centre and I were the invited panelists for a ~1hr slot. As the conference was extremely well-organized *all* sessions were live-streamed via Google Hangouts & made publicly available via YouTube afterwards. I’ve embedded the stream of the #solo12reuse session below:

A transcript of some of what was discussed:

Intros from ~02:00 … then straight into discussion from ~09:00 onwards: Josh Greenberg (@epistemographer) contends that data sharing in chemistry perhaps ‘doesn’t make as much sense’ – I have a feeling PMR & many others would disagree with this!

At 13:20 Sarah Callaghan: NERC sets its data embargo policy so that data can only be withheld for a maximum of 2 years after it was collected after which it must be made publicly available, somewhere, somehow – the ambiguity of which IMO needs to be worked on…

At 14:25 discussion of ‘levels of re-usability’ and definition. Access control as a means of encouraging data sharing (?)

17:30 Sarah Callaghan: “It’s important to have ‘first dibs’ on your own data” but not beyond this without peer-vetted justification/scrutiny IMO

18:30 David Shotton (@dshotton): noted that one shouldn’t expect absolutely every data point/item to be shared – not all data is useful/valuable. It’s about retaining & making available bits that might be of re-use value.

At 20:24 I start to introduce AMI2 & the OpenContentMining project.

37:50 I bring the Panton Principles on screen, I also had the OKFN Science Working Group page displayed (although not discussed) for a good ~10 minutes. Note to self: hijack the display computer at panel sessions more often…

from 40:17 onwards… Mark Hahnel: “In terms of re-use and getting people incentivized, are Data Papers the future?” Sarah Callaghan “NO. Until research achievement is predicated on something other than publishing in ‘high-impact’ journals then we’re stuffed: we’ve got to shoehorn data & code in order for them to ‘count’ [lamentably]” So for now we need data papers, but perhaps in the future we won’t need to constrain these outputs to a ‘paper’ style format.

from 43:00 Martin Fenner (@mfenner) plays Devil’s Advocate and suggests that data citation may not work and that perhaps #altmetrics might be better indicators of usage. Much debate ensues…

from 45:50 I give a plug to Iain H’s paper ‘Open By Default’: Hrynaszkiewicz, I. and Cockerill, M. 2012. Open by default: a proposed copyright license and waiver agreement for open access research and data in peer-reviewed journals. BMC Research Notes 5:494. Then we discuss legal barriers to re-using data.

This post has taken a while to write and is fairly long now, so I’m going to split my recap of #solo12 into two or more parts. In part 2 I’ll attempt to discuss some of the other *excellent* sessions I saw, in particular the brilliant, well-received outburst on the absurd inefficiency of the publication process by professional typesetter Dr Kaveh Bazargan during the #solo12journals session. I’m surprised someone hasn’t done a whole blogpost about this already – it was my highlight of the conference tbh!
I’ll be posting part two on Monday 19th November (weekends are slow for blogs… I want people to read this!)

Until then…

Wow! Where to begin… In this post I shall attempt to summarise some of OKFestival 2012.

Some Background:

I had been to the Open Knowledge Conference last year (in Berlin), where I gave an invited talk on Open Palaeontology and met lots of brilliant people in the Open Science community like Bjoern Brembs, Cameron Neylon & Peter Murray-Rust. But this year the event was even bigger, and even better – teaming up with the annual Open Government Data Camp for a mega-event.

The Event Itself:

It was a little awkward that it was held so far away from most of the conference accommodation – everyone had a 20-30 minute commute before getting to the venue, and some of the talk rooms were fairly far apart. But once the conference goers got used to that it was plain sailing from there, and the Aalto University buildings themselves were wonderfully modern and well equipped for it (inc. great WiFi). I got to Helsinki on the Tuesday, and caught the tail end of the Data Journalism session that day including an excellent, inspirational talk on amongst other things. It detailed the amazing knowledge and insight gained from tracking the movement of ships with open data. I couldn’t help thinking that academics could learn a lot from these open data visualization experts (myself included!).

An interesting example of Shippr data – ships turn off their beacons once they pass the point for fear of pirates…

Wednesday – my chance to make a difference

I really liked the way that the conference had an introductory session to the day’s parallel events in the morning from 10am – 11am. If one was unsure of which stream to go to, these Morning Plenaries gave each topic stream a chance to pitch their events in a short slot to the awaiting audience. I thought this was very helpful given there were 13 separate topic streams at the conference!

I was involved in two sessions this day. Firstly the Open Access discussion panel, the video for which is here, with Tim Hubbard (Sanger Institute), Carlos Rossel (World Bank), Peter Murray-Rust (University of Cambridge / Open Knowledge Foundation) and Tom Olijhoek & Mark MacGillivray (Open Access Index):

It’s a long video and we covered many topics, with excellent contributions from the audience, including Puneet Kishor from Creative Commons and Matt Todd from the Open Source Drug Discovery team amongst others.

Then after this there was the research data session with contributions from Mark Wainwright on CKAN, Mark Hahnel on Figshare and Joss Winn of the Orbital project.

Finally we finished with the Panton Fellowships Session with talks from myself and Sophie Kershaw on what we’d been doing in our fellowship work:

The day was rounded off with a hugely inspirational talk from Matt Todd summarising his Open Source Drug Discovery work in the main lecture theatre, with a lovely if expensive meal afterwards in Lasipalatsi Ravintola.


I spent some quality time with Peter working on a BBSRC grant proposal.
I also thoroughly enjoyed Hans Rosling’s fantastic keynote presentation, which I urge you all to watch – it was brilliant, and thrilling to be there live in the audience for.


If there’s one thing that impresses me most of all about OKFestival, it’s this: it’s not just about talking – they do things here too. Lots of ‘hacking’ sessions on Friday to create new tools and collate awesome new data. Most conferences are extremely boring in that it’s just talk after talk after talk. Things get done here, new collaborations are started, fresh links across disciplinary boundaries are made connecting journalism with academia, economic development with open architectural design, and other incredible trans-disciplinary mashups. It’s a joy to behold.

I’m really glad I came to OKFestival, as ever I got a lot out of it.

Next year it’ll be in Switzerland (?), I hope I didn’t just make that up… I seem to remember that it was announced to be there but I couldn’t find any confirmation from Google. Rest assured I’ll try and be there though!

Since Sunday afternoon I’ve been at an International Council for Science (ICSU) / Royal Society invited workshop on ‘Revaluing Science in the Digital Age’.

We’ve had a fascinating set of talks from academics, publishers (PLoS, Nature, BMC), librarians, policymakers, data managers, scientific societies…

Attendees included:
Jose Cotta, European Commission

Mark Thorley (RCUK)
Chris Banks (University Librarian and Director, Aberdeen)
Mark Hahnel (Figshare)
Max Wilkinson (UCL, Head of Research Data Service)
Dave Roberts (ViBRANT)
Rob Frost (GSK)
Catriona MacCallum (PLoS)
Mark Forster (Syngenta)
Iain Hrynaszkiewicz (BMC)
Ruth Wilson (Nature Publishing Group)
Kaitlin Thaney (Digital Science)
Stuart Taylor (Royal Society)
Robert Simpson (Zooniverse)
Paul Groth (OpenPHACTS)
and more…


I gave a talk on content mining and the importance of full BOAI-compliant Open Access with respect to this, on behalf of the Open Knowledge Foundation:

There was lots of discussion on reproducibility, provenance of data, peer review, incentives, research misconduct and ethics.

I’ve met many new people and have learnt many new things. For example, on the subject of reproducibility I talked about Roger Peng and the journal Biostatistics in discussion, and then was soon informed that there was an analogous journal in Chemistry called Organic Syntheses whereby:

In order for a procedure to be accepted for publication, each reaction must be successfully repeated in the laboratory of a member of the Editorial Board at least twice, with similar yields (generally ±5%) and selectivity similar to that reported by the submitters.

Fantastic! We were also informed that this rigorous protocol ensures that research published in this journal is very highly regarded. I’ve suggested similar such reproducibility checks for phylogenetics research before (at the Systematics Association Biennial meeting Belfast, 2011) but this was viewed as too futuristic / infeasible…
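For the curious, the Organic Syntheses acceptance rule quoted above is simple enough to sketch in a few lines of Python. This is only my reading of it: I’m assuming “±5%” means percentage points of yield, I’m ignoring selectivity entirely, and the function names and example yields are invented for illustration.

```python
def successful_repeats(reported_yield, repeat_yields, tolerance=5.0):
    """Return the repeat yields that fall within `tolerance` points of the reported one."""
    return [y for y in repeat_yields if abs(y - reported_yield) <= tolerance]


def passes_check(reported_yield, repeat_yields, tolerance=5.0, min_repeats=2):
    """True if at least `min_repeats` repeats reproduced the reported yield."""
    return len(successful_repeats(reported_yield, repeat_yields, tolerance)) >= min_repeats


# Invented example numbers (percent yield), not real data.
print(passes_check(82.0, [80.5, 85.0]))  # both repeats are close to the reported yield
print(passes_check(82.0, [80.5, 70.0]))  # one repeat is far off
print(passes_check(82.0, [84.0]))        # only one repeat was attempted
```

An equivalent check for phylogenetics would need a different notion of “similar” (tree topology and support values rather than a single number), which is presumably part of why it was seen as infeasible.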

Right now we’re working on a draft statement of outcome from this workshop that ICSU can pass to its members to possibly officially agree to endorse.

So I’d better finish here, and get back to the discussion.
I’m rather hoping they will endorse the Panton Principles rather than reinvent the wheel (policy-wise).

Exciting times!


PS I have made a Storify of the tweets from the workshop here.