Show me the data!

Easy steps towards open scholarship

May 20th, 2013 | Posted by rmounce in Open Access | Open Data | Open Science

This post was originally posted over at the LSE Impact blog where I was kindly invited to write on this theme by the Managing Editor. It’s a widely read platform and I hope it inspires some academics to upload more of their work for everyone to read and use.

Recently I tried to explain on Twitter in a few tweets how everyone can take easy steps towards open scholarship with their own work. It’s really not that hard and potentially very beneficial for your own career progress – open practices enable people to read & re-use your work, rather than let it gather dust unread and undiscovered in a limited access venue as is traditional. For clarity I’ve rewritten the ethos of those tweets below:

Step 1: before submitting to a journal or peer-review service upload your manuscript to a public preprint server

Step 2: after your research is accepted for publication, deposit all the outputs – full-text, data & code – in subject or institutional repositories

The above is the concise form of it, but as with everything in life the devil is in the detail, and there is much to explain, so I will elaborate upon these steps in this post.

Step 1: Preprints

Uploading a preprint before submission is technically very easy to do – it takes just a few clicks, but the barrier that prevents many from doing this in practice is cultural and psychological. In disciplines like physics it’s completely normal to upload preprints to arXiv, and their submission to a journal in some cases has more to do with satisfying the requirements of the Research Excellence Framework exercise than any real desire to see the work in a journal. Many preprints on arXiv get cited and are valued scientific contributions, even without them ever being published in a journal. That said, even within this community author perceptions differ as to exactly when in the publication cycle a preprint should be uploaded.

Within biology it’s relatively unheard of to upload a preprint before submission but that’s likely to change this year because of an excellent, well-put article advocating their use in biology and the very many different outlets available for them. My own experience of this has been illuminating – I recently co-authored a paper openly on GitHub and the preprint was made available with a citable DOI via figshare. We’ve received a nice comment, more than 250 views and a citation from another preprint, all before our paper has been ‘published’ in the traditional sense. I hope this illustrates well how open practices really do accelerate progress.

This is not a one-off occurrence either. As with open access papers, freely accessible preprints have a clear citation advantage over traditional subscription access papers.


Outside of the natural sciences the situation is also similar; Martin Fenner notes that in the social sciences (SSRN) and economics (RePEc) preprints are also common either in this guise, or as ‘working papers’ – the name may be different but the pre-submission accessibility is the same. Yet I suspect, like in biology, this practice isn’t yet mainstream in the Arts & Humanities – perhaps just a matter of time before this cultural shift occurs (more on this later on in the post…)?

There is one important caveat to mention with respect to posting preprints – a small minority of conservative, traditional journals will not accept articles that have been posted online prior to submission. You might well want to check SHERPA/RoMEO before you upload your preprint to ensure that your preferred destination journal accepts preprint submissions. There is a growing grass-roots effort to convince these journals that preprint submissions should be allowed, and some of these campaigns have already succeeded.

If even much-loathed publishers like Elsevier allow preprints, unconditionally, I think it goes to show how rather uncontroversial preprints are. Prior to submission it’s your work and you can put it anywhere you wish.


Step 2: Postprints


Unlike with preprints, the postprint situation is a little trickier. Publishers like to think that they have the exclusive right to publish your peer-reviewed work. The exact terms will vary from journal to journal depending on the copyright or licensing agreement you might have signed. Some publishers try to enforce ‘embargoes’ upon postprints, to maintain the artificial scarcity of your work and their monopoly of control over access to it. But rest assured, at some point, often just 12 months after publication, you’ll be ‘allowed’ to upload copies of your work to the public internet (again SHERPA/RoMEO gives excellent information with respect to this).

So, assuming you already have some form of research output(s) to show for your work, you’ll want these to be discoverable, readable and re-usable by others – after all, what’s the point of doing research if no-one knows about it! If you’ve invested a significant amount of time writing a publication, gathering data, or developing software – you want people to be able to read and use this output. All outputs are important, not just publications. If you’ve published a paper in a traditional subscription access journal, then most of the world can’t read it. But, you can make a postprint of that work available, subject to the legal nonsense referred to above.

If it’s allowed, why don’t more people do it?

Similar to the cultural issues discussed with preprints, for some reason, researchers on the whole don’t tend to use institutional repositories (IR) to make their work more widely available. My IR at the University of Bath lists metadata for over 3300 published papers, yet relatively few of those metadata records have a fulltext copy of the item deposited with them for various reasons. Just ~6.9% of records have fulltext deposits, as published back in June 2011.

I think it’s because institutional repositories have an image problem: some are functional but extremely drab. I also hear of researchers full of disdain who say of their IRs (I paraphrase):

“Oh, that thing? Isn’t that just for theses & dissertations – you wouldn’t put proper research there”

All this is set to change though as researchers are increasingly being mandated to deposit their fulltext outputs in IRs. One particularly noteworthy driver of change in this realm could be the newly-launched Zenodo service. Unlike Academia.edu or ResearchGate, which are for-profit operations and are really just websites in many respects, Zenodo is a proper repository – it supports harvesting of content via the OAI-PMH protocol, all metadata about the content is CC0, and it’s a not-for-profit operation. Crucially, it provides a repository for academics less well-served by the existing repository systems – not all research institutions have a repository, and independent or retired scholars also need a discoverable place to put their postprints. I think the attractive, modern look, and altmetrics to demonstrate impact, will also add that missing ‘sex appeal’ to provide the extra incentive to upload.
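To make "harvesting via OAI-PMH" a little more concrete, here is a minimal Python sketch of what a harvester does: it builds a ListRecords request and pulls identifiers and titles out of the XML that comes back. The verb, the oai_dc metadata prefix and the XML namespaces come from the OAI-PMH 2.0 and Dublin Core specifications; the base URL, the sample response and the function names are illustrative assumptions only, so check a repository's own documentation for its real endpoint.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text)
    results = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext(
            "oai:header/oai:identifier", default="", namespaces=NS
        )
        title = record.findtext(".//dc:title", default="", namespaces=NS)
        results.append((identifier, title))
    return results


# Made-up response in the shape the protocol prescribes (not real data).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1234</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example postprint</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    # example.org is a placeholder, not a real repository endpoint.
    print(list_records_url("https://example.org/oai"))
    print(parse_records(SAMPLE_RESPONSE))
```

Because every compliant repository answers the same request in the same format, a search engine can run this kind of loop over thousands of repositories, which is exactly what a plain website does not allow.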


Providing Access to Your Published Research Data Benefits You

A new preprint on PeerJ shows that papers with associated open research data have a citation advantage. Furthermore, other research has shown that willingness to share research data is related to the strength of the evidence and the quality of the results. Traditional repository software was designed around handling metadata records and publications, and doesn’t tend to be great at storing or visualizing research data. But a new development in this arena is the use of CKAN software for research data management. Originally CKAN was developed by the Open Knowledge Foundation to help make open government data more discoverable and usable; the UK, US, and governments around the world now use this technology to make data available. Now research institutions like the University of Lincoln are also using it for research data management, and like Zenodo the interface is clean, modern and provides excellent discoverability.


Repositories are superior for enabling discovery of your work

Even though I use Academia.edu & ResearchGate myself, they’re not perfect solutions. If someone is looking for your papers, or a particular paper that you wrote, these websites do well in making your output discoverable from a simple Google search. But interestingly, for more complex queries, these simple websites don’t provide good discoverability.

An example: I have a fulltext copy of my Nature letter on Academia.edu, but it can’t be found from Google Scholar – whereas the copy in my institutional repository at Bath can. This is the immense value of interoperable and open metadata. Academics would do well to think closely about how this affects the discoverability of their work online.

The technology for searching across repositories for freely accessible postprints isn’t as good as I’d want it to be. But repository search engines like BASE, CORE and Repository Search are improving day by day. Hopefully, one day we’ll have a working system where you can paste-in a DOI and it’ll take you to a freely available postprint copy of the work; Jez Cope has an excellent demo of this here.
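As a rough illustration of the "paste in a DOI" idea (this is not the demo mentioned above, just a sketch under stated assumptions), the core of such a service is a lookup from a normalised DOI to a postprint URL. The index below is a hypothetical hand-written dictionary; in practice it would have to be built from harvested repository metadata, and the function names are my own.

```python
import re


def normalise_doi(raw):
    """Strip common prefixes and lower-case a pasted DOI (DOIs are case-insensitive)."""
    doi = raw.strip()
    doi = re.sub(
        r"^(https?://(dx\.)?doi\.org/|doi:\s*)", "", doi, flags=re.IGNORECASE
    )
    return doi.lower()


def find_postprint(raw_doi, index):
    """Look up a pasted DOI in a dict mapping normalised DOIs to postprint URLs.

    Returns None when no freely available copy is known.
    """
    return index.get(normalise_doi(raw_doi))


# Hypothetical index; the DOI and the repository URL are placeholders.
INDEX = {"10.1234/example.5678": "https://repository.example.org/id/eprint/42"}

if __name__ == "__main__":
    print(find_postprint("https://doi.org/10.1234/EXAMPLE.5678", INDEX))
    print(find_postprint("10.9999/missing", INDEX))
```

The lookup itself is trivial; the hard part is the index, which only works if repositories expose DOIs in open, harvestable metadata.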

Open scholarship is now open to all

So, if there aren’t any suitable fee-free journals in your subject area (1), you find you don’t have funds to publish a gold open access article (2), and you aren’t eligible for an OA fee waiver (3), fear not. With a combination of preprint & postprint postings, you too can make your research freely available online, even if it has the misfortune to be published in a traditional subscription access journal. Upload your work today!

    Hi Ross, I’m just wondering what your thoughts are on CC0 vs. CC-BY? I see Palaeo-online, where you posted a couple of years back, and this blog are CC-BY (and fair enough), and I’m just wondering when you think the requirement for attribution could/should be dropped?

    What limitations do you think CC-BY imposes on say written articles as opposed to CC0?

    Also, given that attribution is required in references within academic papers, is the BY of CC-BY actually necessary? (Or at least, if it’s not specified how attribution should be given)

    • Ross Mounce

      Good questions!

      ***When you think the requirement for attribution could/should be dropped?

      My answer:

      For data.

      For US government works (and for *any* government work around the world ideally).

      For academic research, where the authors are happy to choose CC0 themselves (thereby removing the potential burden on future researchers caused by attribution-stacking problems & such)

      (and even if the *legal* requirement for attribution is dropped, we must remember that this is totally separate to the real reasons why objects get attributed [cited] in the research literature: cultural, scholarly norms NOT legal reasons. Thus even if the *legal* requirement is dropped, the object may still be attributed formally in a scholarly work if appropriate. Odd people seem to think “CC0 == no one will cite my work”; that’s not how it works. CC0/PD just removes the *legal* obligation to attribute.)

      ***What limitations do you think CC-BY imposes on say written articles as opposed to CC0?

      Attribution stacking. Text-mining is made difficult by the legal requirement to attribute. If I ‘mine’ the text of 1,000,000 papers it is very difficult to sufficiently attribute all 1,000,000 of those mined papers, especially if some of those papers don’t have DOIs!

      If you dig into the details of CC-BY it can get quite awkward. In theory authors can state exactly *how* they wish to be attributed. So 1,000 papers could have 1,000 different styles of attribution you would have to cater for. But I haven’t seen too many people in real life complain about exactly ‘how’ & what ‘style’ of attribution they’ve been given by a re-user, although this could certainly be a sticking-point in the future.

      ***Also, given that attribution is required in references within academic papers, is the BY of CC-BY actually necessary?

      No. In the academic realm of usage it’s there purely as a psychological comfort to authors – they *must* get cited if their work is re-used. I have never heard of an academic actually using legal means to sue another academic for not attributing re-used work. CC-BY only provides *legal* means. If you felt someone didn’t abide by the terms of CC-BY you’d set, you’d have to go through the courts to actually get something done about it.

      Nina Paley has an excellent blog post related to this: she changed her excellent film ‘Sita Sings the Blues’ from CC-BY-SA to CC0 because she realised “why point a loaded gun at everyone when I’d never fire it?” i.e. why apply -BY-SA restrictions when she’d never (for this work) actually bother suing anyone for breaking either -BY or -SA restrictions.

      Politically, CC BY is an acceptable compromise for OA advocacy. Tell all taxpayer-funded researchers to waive *all* copyright on their work under CC0 and you probably won’t get very far (unless they’re US gov employees). Whereas with CC BY authors retain their copyright (nice warm fuzzy feeling) but otherwise permit almost every kind of re-use. It’s a happy compromise and I’m happy with it.

      Some related reads:

      PS It’s my ambition to publish a paper under CC0 in the near future just to prove the point (as many others have).


        Really really interesting. I think intuition suggests that CC0 would permit plagiarism, though it’s academic policy and just general cultural norm to penalise that. Just Googled attrib. stacking, cheers for the info ‒ the Dryad link in particular verified some of my inklings.

        One thing you didn’t touch upon though, why -BY for Palaeo Online & your blog? I would’ve thought the latter’s an easy place to start, or is that related to plagiarism/copyright of your own writing? Not looking to pick holes in your stance, just interested in pros/cons of use for really simple example such as personal projects. The Canadensys site is BY too, so I’m guessing there’s been a conscious decision there too which I’m interested to hear the details of.

        I was looking at the details and notice it includes waiving “moral rights”. These caused concern recently when people (erroneously) thought Nature had the power to sign these off (doesn’t apply to scholarly articles in the UK though), and looking again at the definition it doesn’t apply to reuse in newspapers or magazines — i.e. press wouldn’t be able to reuse work, but perhaps a website would.

        Mike Linksvayer argues quite succinctly for “upgrading to CC0”:

        Attribution (BY). Do not take part in the debasement of attribution, and more broadly, provenance, already useful to readers, communities of practice, and publishers, by making them seem mere objects of copyright license compliance. If attribution is useful, it will be provided. If not, robots will find out. Rarely does anyone comply with the exact legal requirements of the attribution term anyway, and as a licensor, you probably won’t provide the information needed by licensees to easily comply. Plus, the corresponding icon looks like a men’s bathroom sign.

        Rather like Unlicense is to code. There are other ways of stamping out unfair use and at the end of the day I’m not going to ‘fire the loaded gun’ either – I mean it’s just the internet.

        • Ross Mounce

          Of course CC0 *permits* plagiarism in terms of legal (law) aspects. I can’t / won’t / don’t need to deny that.

          It’s essentially irrelevant though.

          Copyright and copyright-enforced restrictions on re-use act in the legal realm, in courts of law. The legal realm & the ‘academic norms’ realm are completely different, and whether the legal realm happens to permit something is irrelevant because if it’s scholarship published in an academic journal the rules of ‘academic norms’ still apply regardless of the legally-enforced rules. The ‘courts’ for plagiarism of scholarship are: fellow academics, editors, and the ‘High Court’ / ‘Supreme Court’ adjudicator is COPE.

          In defence of my blog licensing, Palaeo Online, the Canadensys blog – they’re all blogs (Canadensys makes all its data CC0 but the blog posts are CC BY).

          I see -BY as a legit, defensible option for grey literature like blogs & newsletters (relative to CC0) because blogs are definitely a mode of literature that *do* frequently suffer from plagiarism / insufficient attribution. There is greater ‘respect’ for formally published papers and they have higher profile, so you’re less likely to see people plagiarising academic papers. Not so for blog content…

          These days you see many ‘content aggregator’ blogs e.g. Bioinfo-Bloggers (explained here) that just scrape material and are lazy at attribution. There is more artistry in blogging (and hence copyright), and importantly I’m clearly not paid to blog by my employer in any way.

          This is a big distinction from academic research papers which I *am* effectively paid to write & publish by a tax-payer funded research council (BBSRC). Therefore if they wanted all BBSRC research to be CC0 – that’d be fine – they’re the one that’s paying me to do the research, and because attribution stacking is a real problem it’s an understandable ask – trying to maximise the ease of re-use of the work they pay to create.

          Simply put:

          CC0 is what is best for everyone else (not necessarily what is absolutely best in all cases for the creator of the work).

          So charities & governments often use CC0 to maximise the ease of re-use for everyone, regardless of the creator – because they pay the creator to do that work any way, so morally that’s completely okay.

          Practically, CC BY provides otherwise unfunded creators with a tiny bit of protection to cause a little fuss if their work isn’t properly attributed. I readily admit I’m extremely unlikely to sue anyone for plagiarising this blog content BUT I certainly will use CC BY to kick-up a big fuss online if content is re-used without appropriate attribution.

          If another blog reposts one of my blog posts with no attribution to me – I’ll kick-up a fuss online and point to CC BY (but likely go no further than that).

          Whereas, in the unlikely event that a major global brand e.g. Coca Cola uses my CC BY material without even attributing me, CC BY enables me to think about firing that gun of getting the lawyers in.

          For blog content the CC BY or CC0 thing isn’t much of an issue.

          Blog content isn’t aggregated on the same scale or ways as data – data can be recombined and re-used on a massive scale so attribution stacking is a very *real* problem there and hence CC0 being preferred to CC BY is much more practically important for data.

          The ‘next cure for cancer’ isn’t likely to come from large-scale analysis of blog posts, therefore I’m not going to lose sleep over CC BY (and the attribution stacking probs it can cause) for blog content :)

          The compromise I allude to is the current advocacy by most OA advocates & organizations such as OASPA for CC BY (rather than CC0). In a perfect world, I actually think all taxpayer-funded scientific research papers should be CC0 to avoid attrib stacking problems in future, rather than CC BY. But that would NOT be palatable to many academics, thus, at least on my part I recognise that advocacy for CC BY instead of CC0 is actually a compromise on what’s ‘best’ for science.


            I’m with you now. That dumb Bioinfo Bloggers site hasn’t even gone to the trouble of highlighting the authors e.g. Bergman who’re using Share-Alike… Thanks again

