Show me the data!


October 21st, 2013 | Posted by rmounce in Open Access - (3 Comments)

I handed in my thesis not long ago, on Thursday 3rd October 2013. No idea when my viva is yet. I can’t blog many of the chapters because I haven’t convinced my manuscript co-authors of the value of preprints, yet. I’m also a bit unsure as to how some of the other chapters will be received and thus I’ll wait until after the viva before I decide what to do next with it.

Given it’s open access week this week, there is one bit of my thesis I should definitely share: the acknowledgements!

I can’t possibly thank everyone enough for the help I’ve received over the past 4 years – my knowledge, skills, and connections have been vastly extended. Note in particular the bit I’ve highlighted in bold just for this blog post – I want everyone to know how absolutely reliant I’ve been on ‘alternate’ forms of literature access during my research – this is the new ‘normal’ for many early career researchers I fear, until open access is more prevalent we’ll have to continue to hunt, scavenge, beg, steal, and borrow for every PDF. My generation of researchers grew-up using Napster, Isohunt, Copyright infringement is an everyday activity for many of us – WE DONT CARE. Have you been to a conference? How many of the pictures on the speakers slides weren’t technically infringing someone else’s copyright? WE DONT CARE. One can shut down or block specific portals, but doing so doesn’t really solve the basic problem: from what I’ve seen, time and time again, copyright’s only role in science is to obstruct it. My biggest hope for Open Access Week 2013 is that someone will torrent Elsevier’s back catalogue – journal/publisher torrents have been done before and will be done again!  It probably won’t happen, but I can dream…


I would like to thank my supervisor, Matthew Wills for putting up with me all this time. I
have been lucky to have such accommodating and understanding support. I also must
thank my lab mates Martin Hughes, Anne O’Connor, Sylvain Gerber, Katie Davis, Rob
Sansom and everyone else in the Biodiversity Lab at the University of Bath – we had some
great times and some brilliant times together. Sincere thanks also to the University of Oslo
Bioportal computing cluster for providing me free cloud computation for my work.
Many people have helped spur my imagination along the way with ideas for different
chapters of this thesis. For this I would like to thank Ward Wheeler, Pablo Goloboff, Mark
Siddall, Dan Janies, Steve Farris and the generous financial support of the Willi Hennig
Society. I want to thank all those in the palaeontology community who have shared their
published data with me, particularly Graeme Lloyd for his stirling work in making
dinosaurian data available – I hope I have done something interesting with the data I have
used and opened eyes to new possibilities. I also want to thank all those in the open
science community – Peter Murray-Rust, Todd Vision, Heather Piwowar, Mark Hahnel,
Martin Fenner, Geoffrey Boulton, Jenny Molloy and so many more I’ve had the pleasure of
meeting in person. The energy and enthusiasm I drew from countless online discussions
on Facebook, Google+ and Twitter was truly inspirational.
For facilitating greater access to scientific literature I must heartily thank the Natural
History Museum, London library and archives, the #icanhazpdf community on Twitter,
Wikipaleo on Facebook, References Wanted on FriendFeed,, and SciHub.
Without these additional literature access facilitators I would not have been able to read
half the sources I cite in this thesis.
I must thank my wife Avril for her patience with me especially during the write-up phase,
for allowing me to go away to all these amazing conferences abroad, and for tolerating all
those long nights into mid-morning when I was tapping away on my noisy keyboard.
Finally, I thank my family: Richard, Rosemary & Tara for repeatedly encouraging me to
finish my thesis – I got there in the end!

Hack4ac recap

July 9th, 2013 | Posted by rmounce in BMC | eLife | Hack days | Open Access | Open Data | Open Science | PeerJ | PLoS - (4 Comments)

Last Saturday I went to Hack4Ac – a hackday in London bringing together many sections of the academic community in pursuit of two goals:

  • To demonstrate the value of the CC-BY licence within academia. We are interested in supporting innovations around and on top of the literature.
  • To reach out to academics who are keen to learn or improve their programming skills to better their research. We’re especially interested in academics who have never coded before


The list of attendees was stellar, cross-disciplinary (inc. Humanities) and international. The venue (Skills Matter) & organisation were also suitably first-class – lots of power leads, spare computer mice, projectors, whiteboards, good wi-fi, separate workspaces for the different self-assembled hack teams, tea, coffee & snacks all throughout the day to keep us going, prizes & promo swag for all participants…

The principal organizers; Jason Hoyt (PeerJ, formerly at Mendeley) & Ian Mulvany (Head of Tech at eLife) thus deserve a BIG thank you for making all this happen. I hear this may also be turned into a fairly regular set of meetups too, which will be great for keeping up the momentum of innovation going on right now in academic publishing.

The hack projects themselves…

The overall winner of the day was ScienceGist as voted for by the attendees. All the projects were great in their own way considering we only had from ~10am to 5pm to get them in a presentable state.



This project was initiated by Jure Triglav, building upon his previous experience with Tiris. This new project aims to provide an open platform for post-publication summaries (‘gists’) of research papers, providing shorter, more easily understandable summaries of the content of each paper.

I also led a project under the catchy-title of Figures → Data where-by we tried to provide added-value by taking CC-BY bar charts and histograms from the literature and attempting to re-extract the numerical data from those plots with automated efforts using computer vision techniques. On my team for the day I had Peter Murray-Rust, Vincent Adam (of HackYourPhD) and Thomas Branch (Imperial College). This was handy because I know next to nothing about computer vision – I’m Your Typical Biologist ™  in that I know how to script in R, perl, bash and various other things, just enough to get by but not nearly enough to attempt something ambitious like this on my own!

Forgive me the self-indulgence if I talk about this  Figures → Data project more than I do the others but I thought it would be illuminative to discuss the whole process in detail…

In order to share links between our computers in real-time, and to share initial ideas and approaches, Vincent set-up an etherpad here to record our notes. You can see the development of our collaborative note-taking using the timeslider function below (I did a screen record of it for prosperity using recordmydesktop):

In this etherpad we document that there are a variety of ways in which to discover bar charts & histograms:

  • figuresearch is one such web-app that searches the PMC OA subset for figure captions & figure images. With this you can find over 7,000 figure captions containing the word ‘histogram’ (you would assume that the corresponding figure would contain at least one histogram for 99% of those figures, although there are exceptions).
  • figshare has nearly 10,000 hits for histogram figures, whilst BMC & PLOS can also be commended for providing the ability to search their literature stack by just figure captions, making the task of figure discovery far more efficient and targeted.

Jason Hoyt was in the room with us for quite a bit of the hack and clearly noted the search features we were looking for – just yesterday he tweeted: “PeerJ now supports figure search & all images free to use CC-BY (inspired by @rmounce at #hack4ac)” [link] – I’m really glad to see our hack goals helped Jason to improve content search for PeerJ to better enable the needs (albeit somewhat niche in this case) of real researchers. It’s this kind of unique confluence of typesetters, publishers, researchers, policymakers and hackers at doing-events like this that can generate real change in academic publishing.

The downside of our project was that we discovered someone’s done much of this before. ReVision: Automated Classification, Analysis and Redesign of Chart Images  [PDF] was an award-winning paper at an ACM conference in 2011. Much of this project would have helped our idea, particularly the classification of figures tech. Yet sadly, as with so much of ‘closed’ science we couldn’t find any open source code associated with this project. There were comments that this type of non-code sharing behaviour, blocking re-use and progress, are fairly typical in computer science & ACM conferences (I wouldn’t know but it was muttered…).  If anyone does know of the existence of related open source code available for this project do let me know!

So… we had to start from a fairly low-level ourselves: Vincent & Thomas tried MATLAB & C based approaches with OpenCV and their code is all up on our project github. Peter tried using AMI2 toolset, particularly the Canny algorithm, whilst I built up an annotated corpus of 40 CC-BY bar charts & histograms for testing purposes. Results of all three approaches can be seen below in their attempts to simplify this hilarious figure about dolphin cognition from a PLOS paper:

The plastic fish just wasn't as captivating...

“Figure 5. Total time spent looking at different targets.” from Siniscalchi M, Dimatteo S, Pepe AM, Sasso R, Quaranta A (2012) Visual Lateralization in Wild Striped Dolphins (Stenella coeruleoalba) in Response to Stimuli with Different Degrees of Familiarity. PLoS ONE 7(1): e30001. doi:10.1371/journal.pone.0030001 CC-BY

Peter’s results (using AMI2):


Thomas’s results (OpenCV & C):


Vincent’s results (OpenCV & MATLAB & bilateral filtering)

We might not have won 1st prize but I think our efforts are pretty cool, and we got some laughs from our slides presenting our days’ work at the end (e.g. see below). Importantly, *everything* we did that day is openly-available on github to re-use, re-work and improve upon (I’ll ping Thomas & Vincent soon to make sure their code contributions are openly licensed). Proper full-stack open science basically!

some figures are just awful


Other hack projects…

As I thought would happen I’ve waffled on about our project. If you’d like to know more about the other projects hopefully someone else will blog about them at greater length (sorry!) I’ve got my thesis to write y’know! ;)

You can find more about them all either on the Twitter channel #hack4ac or alternatively on the hack4ac github page. I’ll write a little bit more below, but it’ll be concise, I warn you!

  • Textmining PLOS Author Contributions

This project has a lovely website for itself: and so needs no more explanation.

  • Getting more openly-licensed content on wikipedia

This group had problems with the YouTube API I think. Ask @EvoMRI (Daniel Mietchen) if you’re interested…

  • articlEnhancer

Not content with helping out the PLOS author contribs project Walther Georg also unveiled his own article enhancer project which has a nice webpage about it here:

  • Qual or Quant methods?

Dan Stowell & co used NLP techniques on full-text accessible CC-BY research papers, to classify all of them in an automated way determining whether they were qualitative or quantitative papers (or a mixture of the two). The last tweeted report of it sounded rather promising: “Upstairs at #hack4ac we’re hacking a system to classify research papers as qual or quant. first results: 96 right of 97. #woo #NLPwhysure” More generally, I believe their idea was to enable a “search by methods” capability, which I think would be highly sought-after if they could do it. Best of luck!
Apologies if I missed any projects. Feel free to post epic-long comments about them below ;)




This post was originally posted over at the LSE Impact blog where I was kindly invited to write on this theme by the Managing Editor. It’s a widely read platform and I hope it inspires some academics to upload more of their work for everyone to read and use

Recently I tried to explain on twitter in a few tweets how everyone can take easy steps towards open scholarship with their own work. It’s really not that hard and potentially very beneficial for your own career progress – open practices enable people to read & re-use your work, rather than let it gather dust unread and undiscovered in a limited access venue as is traditional. For clarity I’ve rewritten the ethos of those tweets below:

Step 1: before submitting to a journal or peer-review service upload your manuscript to a public preprint server

Step 2: after your research is accepted for publication, deposit all the outputs – full-text, data & code in subject or institutional repositories

The above is the concise form of it, but as with everything in life there is devil in the detail, and much to explain, so I will elaborate upon these steps in this post.

Step 1: Preprints

Uploading a preprint before submission is technically very easy to do – it takes just a few clicks, but the barrier that prevents many from doing this in practice is cultural and psychological. In disciplines like physics it’s completely normal to upload preprints to and their submission to a journal in some cases has more to do with satisfying the requirements of the Research Excellence Framework exercise than any real desire to see it in a journal. Many preprints on arXiv get cited and are valued scientific contributions, even without them ever being published in a journal. That said, even within this community author perceptions differ as to the exact practice of when to upload a preprint in the publication cycle.

Within biology it’s relatively unheard of to upload a preprint before submission but that’s likely to change this year because of an excellent well-put article advocating their use in biology and the very many different outlets available for them. My own experience of this has been illuminating – I recently co-authored a paper openly on github and the preprint was made available with a citable DOI via figshare. We’ve received a nice comment, more than 250 views and a citation from another preprint. All before our paper has been ‘published’ in the traditional sense. I hope this illustrates well how open practices really do accelerate progress.

This is not a one-off occurrence either. As with open access papers, freely accessible preprints have a clear citation advantage over traditional subscription access papers:


Outside of the natural sciences the situation is also similar; Martin Fenner notes that in the social sciences (SSRN) and economics (RePEc) preprints are also common either in this guise, or as ‘working papers’ – the name may be different but the pre-submission accessibility is the same. Yet I suspect, like in biology, this practice isn’t yet mainstream in the Arts & Humanities – perhaps just a matter of time before this cultural shift occurs (more on this later on in the post…)?

There is one important caveat to mention with respect to posting preprints – a small minority of conservative, traditional journals will not accept articles that have been posted online prior to submission. You might well want to check Sherpa/RoMEo before you upload your preprint to ensure that your preferred destination journal accepts preprint submissions. There is an increasing grass-roots led trend apparent to convince these journals that preprint submissions should be allowed, of which some have already succeeded.

If even much-loathed publishers like Elsevier allow preprints, unconditionally, I think it goes to show how rather uncontroversial preprints are. Prior to submission it’s your work and you can put it anywhere you wish.


Step 2: Postprints


Unlike with preprints, the postprint situation is a little trickier. Publishers like to think that they have the exclusive right to publish your peer-reviewed work. The exact terms of these agreements will vary from journal to journal depending on the exact terms of the copyright or licencing agreement you might have signed. Some publishers try to enforce ’embargoes’ upon postprints, to maintain the artificial scarcity of your work and their monopoly of control over access to it. But rest assured, at some point, often just 12 months after publication, you’ll be ‘allowed’ to upload copies of your work to the public internet (again SHERPA/RoMEO gives excellent information with respect to this).

So, assuming you already have some form of research output(s) to show for your work, you’ll want these to be discoverable, readable and re-usable by others – after all, what’s the point of doing research if no-one knows about it! If you’ve invested a significant amount of time writing a publication, gathering data, or developing software – you want people to be able to read and use this output. All outputs are important, not just publications. If you’ve published a paper in a traditional subscription access journal, then most of the world can’t read it. But, you can make a postprint of that work available, subject to the legal nonsense referred to above.

If it’s allowed, why don’t more people do it?

Similar to the cultural issues discussed with preprints, for some reason, researchers on the whole don’t tend to use institutional repositories (IR) to make their work more widely available. My IR at the University of Bath lists metadata for over 3300 published papers, yet relatively few of those metadata records have a fulltext copy of the item deposited with them for various reasons. Just ~6.9% of records have fulltext deposits, as published back in June 2011.

I think it’s because institutional repositories have an image problem: some are functional but extremely drab. I also hear of researchers full of disdain who say of their IR’s (I paraphrase):

“Oh, that thing? Isn’t that just for theses & dissertations – you wouldn’t put proper research there”

All this is set to change though as researchers are increasingly being mandated to deposit their fulltext outputs in IR’s. One particular noteworthy driver of change in this realm could be the newly-launched Zenodo service. Unlike or ResearchGate which are for-profit operations, and are really just websites in many respects; Zenodo is a proper repository – it supports harvesting of content via the OAI-PMH protocol and all metadata about the content is CC0, and it’s a not-for-profit operation. Crucially, it provides a repository for academics less well-served by the existing repository systems – not all research institutions have a repository, and independent or retired scholars also need a discoverable place to put their postprints. I think the attractive, modern-look, and altmetrics to demonstrate impact will also add that missing ‘sex appeal’ to provide the extra incentive to upload.


Providing Access to Your Published Research Data Benefits You

A new preprint on PeerJ shows that papers with associated open research data have a citation advantage. Furthermore other research has shown that willingness to share research data is related to the strength of the evidence and the quality of the results. Traditional repository software was designed around handling metadata records and publications. They don’t tend be great at storing or visualizing research data. But a new development in this arena is the use of CKAN software for research data management. Originally CKAN was developed by the Open Knowledge Foundation to help make open government data more discoverable and usable; the UK, US, and governments around the world now use this technology to make data available. Now research institutions like the University of Lincoln are also using this too for research data management, and like Zenodo the interface is clean, modern and provides excellent discoverability.


Repositories are superior for enabling discovery of your work

Even though I use & ResearchGate myself. They’re not perfect solutions. If someone is looking for your papers, or a particular paper that you wrote these websites do well in making your output discoverable for these types of searches from a simple Google search. But interestingly, for more complex queries, these simple websites don’t provide good discoverability.

An example: I have a fulltext copy of my Nature letter on, it can’t be found from Google Scholar – but the copy in my institutional repository at Bath can. This is the immense value of interoperable and open metadata. Academics would do well to think closely about how this affects the discoverability of their work online.

The technology for searching across repositories for freely accessible postprints isn’t as good as I’d want it to be. But repository search engines like BASE, CORE and Repository Search are improving day by day. Hopefully, one day we’ll have a working system where you can paste-in a DOI and it’ll take you to a freely available postprint copy of the work; Jez Cope has an excellent demo of this here.

Open scholarship is now open to all

So, if there aren’t any suitable fee-free journals in your subject area (1), you find you don’t have funds to publish a gold open access article (2), and you aren’t eligible for am OA fee waiver (3), fear not. With a combination of preprint & postprint postings, you too can make your research freely available online, even if it has the misfortune to be published in a traditional subscription access journal. Upload your work today!

This is a re-post, originally first blogged by myself on the Open Knowledge Foundation main blog here

Recently Science Europe published a clear and concise position statement titled:
Principles on the Transition to Open Access to Research Publications

This is an extremely timely & important document that clarifies what governments and research funders should expect during the transition to open access. Unlike the recent US OSTP public access policy which allows publishers to apply up to a 12 month access embargo (to the disgust of some scientists like Michael Eisen) on publicly-funded research, this new Science Europe statement makes clear that only up to a 6 month embargo at maximum should be accepted for publicly funded STEM research. The recent RCUK (UK research councils) open access policy also requires 6 months embargo at most, with some caveats.

But among the many excellent principles is a particularly bold and welcome proclamation:

the hybrid model, as currently defined and implemented by publishers, is not a working and viable pathway to Open Access. Any model for transition to Open Access supported by Science Europe Member Organisations must prevent ‘double dipping’ and increase cost transparency

Hybrid options are typically far more expensive than ‘pure’ open access journal costs, and they don’t typically aid transparency or the wider transition to open access.

The Open Knowledge Foundation heartily endorses these principles as together with the above they respect, and reinforce the need for free access AND full re-use rights to scientific research.

About Science Europe:

Science Europe is an association of European Research Funding Organisations and Research Performing Organisations, based in Brussels. At present Science Europe comprises 51 Research Funding and Research Performing Organisations from 26 countries, representing around €30 billion per annum.