Show me the data!

Open in order to unleash the power of text mining

October 23rd, 2017 | Posted by rmounce in Generation Open | Open Access

In 2017, we have a vast toolbox of informative methods to help us analyse large volumes of text. Sentiment analysis, topic modelling, and named entity recognition are but a few of these exciting approaches. Computational power and storage capacity are not the limiting factors on what we could do with the 100 million or so journal articles that comprise the ever-growing research literature so far. But the continued observance of 17th century limitations on how we can use research is simply jarring. Thanks to computers and the internet, we have the ability to do wonderful things, but the licensing and access restrictions placed on most of the research literature explicitly and artificially prevent most of us from trying. As a result, few researchers bother thinking about using text mining techniques – it is often simpler and easier to just farm out repetitive large-scale literature analysis tasks to an array of student minions and volunteers to do by hand – even though computers could and perhaps should be doing these analyses for us.

Inadequate computational access to research has already caused us great harm. Just ask the Ministry of Health in Liberia: they were not pleased to discover, after a lethal Ebola virus outbreak, that vital knowledge locked away in “forgotten papers” published in the 1980s clearly warned that the Ebola virus might be present in Liberia. This information wasn’t in the title, keywords, metadata, or abstract; it was completely hidden behind a paywall. Full text mining approaches would have easily found this buried knowledge and would have provided vital early warning that Ebola could come to Liberia, which might have prevented some deaths during the West African Ebola virus epidemic (2013–2016).
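To make the difference concrete, here is a minimal sketch in Python, assuming a toy in-memory corpus. The two records are invented placeholders rather than real papers, and `find_mentions` is a hypothetical helper, not part of any real library. A search confined to titles and abstracts misses a paper whose only mention of both terms sits in the body, while a full-text search finds it:

```python
# Toy illustration only: the records below are invented placeholders,
# not real papers, and find_mentions is a hypothetical helper.
corpus = [
    {
        "id": "paper-1",
        "title": "Serological survey of haemorrhagic fever viruses",
        "abstract": "We report antibody prevalence across several sites.",
        "body": "Samples from Liberia included sera with antibodies to Ebola virus.",
    },
    {
        "id": "paper-2",
        "title": "Crop yields in West Africa",
        "abstract": "A study of rice production.",
        "body": "Fieldwork was carried out in Liberia.",
    },
]


def find_mentions(records, terms, fields):
    """Return ids of records where every term occurs in the chosen fields."""
    hits = []
    for record in records:
        # Join only the requested fields and compare case-insensitively.
        text = " ".join(record[field] for field in fields).lower()
        if all(term.lower() in text for term in terms):
            hits.append(record["id"])
    return hits


terms = ["ebola", "liberia"]
metadata_hits = find_mentions(corpus, terms, ["title", "abstract"])
fulltext_hits = find_mentions(corpus, terms, ["title", "abstract", "body"])
print(metadata_hits)   # nothing found in titles and abstracts alone
print(fulltext_hits)   # the body text reveals paper-1
```

Real text mining would use better matching than plain substring tests, but even this crude approach only works if the full text is available to the computer in the first place.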

Some subscription-based publishers have been known to use ‘defence’ mechanisms such as ‘trap URLs’ that hinder text miners – making it even harder to do basic research. Other subscription publishers, like Royal Society Publishing, are helpfully supportive of text miners, as are open access publishers. Hindawi, for instance, allows anyone to download every single article they’ve ever published with a single mouse-click. Thanks to open licensing, aggregators like Europe PubMedCentral can bring together the outputs of many different OA publishers, making millions of articles available with a minimum of fuss. It is “no bullshit” access. You want it? You can have it all. No need to beg permission, to spend months negotiating and signing additional contracts, or to use complicated publisher-controlled access APIs and their associated restrictions. Furthermore, OA publishers typically provide highly structured full-text XML files, which make things even easier for text miners. But only a small fraction of the research literature is openly-licensed open access. It’s for these reasons and more that many of the best text-mining researchers operate on, and enrich our understanding of, open access papers only, e.g. Florez-Vargas et al. 2016.
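As a rough sketch of why structured XML helps, the Python snippet below pulls the title and body paragraphs out of a small article using only the standard library. The sample document is a hypothetical, heavily simplified JATS-like file written for this illustration, and `body_paragraphs` is an invented helper name. Real publisher XML is richer and its element names may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified JATS-like article; real publisher XML
# is richer and may use namespaces or different element names.
SAMPLE_XML = """<article>
  <front>
    <article-meta>
      <title-group><article-title>An example article</article-title></title-group>
    </article-meta>
  </front>
  <body>
    <sec><title>Methods</title><p>We sampled <italic>ten</italic> sites.</p></sec>
    <sec><title>Results</title><p>Two sites tested positive.</p></sec>
  </body>
</article>"""


def body_paragraphs(xml_text):
    """Return the plain text of every <p> element inside <body>."""
    root = ET.fromstring(xml_text)
    body = root.find("body")
    if body is None:
        return []
    # itertext() flattens inline markup such as <italic> into plain text.
    return ["".join(p.itertext()).strip() for p in body.iter("p")]


title = ET.fromstring(SAMPLE_XML).findtext(".//article-title")
paragraphs = body_paragraphs(SAMPLE_XML)
print(title)
for paragraph in paragraphs:
    print(paragraph)
```

Because the sections and paragraphs are explicitly tagged, there is no need to guess at structure from page layout, as there is when scraping text out of a PDF.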

So if I had but one wish this Christmas, it would be for the artificial, legally-imposed restrictions on the bulk download and analysis of research texts to be unambiguously removed for everyone, worldwide – so that no researcher need fear imprisonment or other punitive action, simply for doing justified and ethical academic research. Unchain the literature, and we might be able to properly unleash and apply the collected knowledge of humanity.


This is my short contribution for Open Access Week 2017, and the #OpenInOrderTo website created by SPARC, to move beyond talking about openness in itself and focus on what openness enables.


Today (2015-09-01) marks the public announcement of Research Ideas & Outcomes (RIO for short), a new open access journal for all disciplines that seeks to open up the entire research cycle with some truly novel features.

I know what you might be thinking: Another open access journal? Really? 

Neither I nor Daniel Mietchen would be involved with this project if it were just another boring open access journal. This journal packs a mighty combination of novel features into one platform:

  • 1.) RIO will publish research proposals, as well as regular research outputs such as articles, data papers and software – this has never been done by a journal before to my knowledge
  • 2.) RIO will label research outputs with ‘Impact Categories’ based upon UN Millennium Development Goals (MDGs) and EU Societal Challenges, to highlight the real-world relevance of research and to better link-up research across disciplines (see below for some example MDGs).


  • 3.) RIO supports a variety of different types of peer-review, including ‘pre-submission, author-facilitated, external peer-review’ (new), as well as post-publication journal-organized open peer-review (similar to that pioneered by F1000Research), and ‘spontaneous’ (not journal-organized) post-publication open peer-review which is actively encouraged. All peer-review will be open/public, in keeping with the overall guiding philosophy of the journal to increase transparency and reduce waste in the research cycle. Reviewer comments are highly valuable; it is a waste not to make them public. When supplied, all reviewer comments will be made openly available.
  • 4.) RIO offers flexibility in publishing services and pricing in a bold attempt to ‘decouple’ the traditional scholarly journal into its component services. Authors & funders may thus choose to pay for the publishing services they actually want, rather than an inflexible bundle of different services, as is offered at most journals.
Source: Priem, J. and Hemminger, B. M. 2012. Decoupling the scholarly journal. Frontiers in Computational Neuroscience. Licensed under CC BY-NC



  • 5.) On the technical side of things, RIO uses an integrated end-to-end XML-backed publication system for Authoring, Reviewing, Publishing, Hosting, and Archiving called ARPHA. As a publishing geek this excites me greatly as it eliminates the need for typesetting, ensuring a smooth and low-cost publishing process. Reviewers can make comments inline or more generally over the entire manuscript, on the very same document and platform that the authors wrote in, much like Google Docs. This has been successfully tried and tested for years at the Biodiversity Data Journal and is a system now ready for wider use.


For the above reasons and more, I’m hugely excited about this journal and am delighted to be one of its founding editors alongside Dr Daniel Mietchen. See our growing list of Advisory and Editorial Board members for insight into who else is backing this new journal – we’ve got some great people on board already! If you’re interested in supporting this initiative, please do enquire about volunteering as an editor for the journal; we need more editors to support the broad scale and ambition of the journal. You can apply via the main website here.

Wiley & Readcube have done something rather sneaky recently, and it’s not escaped the attention of diligent readers of the scientific literature.

excellent facebook comment

On the article landing pages for some, if not all(?), journal articles at Wiley, they’ve replaced all links to download the PDF file of the article with links that direct you to Readcube instead, at least in JavaScript-enabled web browsers.

This is incredibly annoying – they are literally forcing us to use Readcube. That is not cool.

Some will rush to the defence of Readcube and point out that if they detect you have the rights to, you can download the PDF from within Readcube, but that’s missing the point. No-one need waste their precious time whilst Readcube takes ages to load in your browser tab, when all you wanted in the first place was the PDF.

What Readcube provides IS NOT EVEN PDF. It’s a mishmash of JavaScript, HTML and DRM technology. Thus when Wiley has icons saying “get PDF” they’re lying. Clicking the “get PDF” link does NOT send you to the PDF. It sends you to Readcube’s proprietary, rights-restricted mock-up of a PDF.

It doesn’t even render the figure images properly, sometimes missing important bits e.g. this figure (below):

Luckily there’s a simple solution: you can block Readcube in your browser settings and get simple, direct one-click access to PDF files again by selectively disabling JavaScript on all Readcube-infected websites e.g., and

Firefox users

Install the add-on called YesScript and ‘blacklist’ all Readcube-tainted websites.

Google Chrome / Chromium users

Use Vince Buffalo’s ‘Get Me the F**king PDF’ Chrome plugin. It’s really good.

Alternatively, this browser is so clever you don’t even need to install anything new. Selective JavaScript blacklisting of websites is an in-built function:

A) Click the menu button in the top right hand corner of your browser
B) Select Settings
C) (scroll to bottom) Click Show advanced settings
D) Underneath the “Privacy” section, click the “Content settings” button.
E) Under the “Javascript” section, click “Manage Exceptions” and add at least these three Readcube-infected websites:, and (example screenshot below)


Safari users

I haven’t tested this but the JavaScript Blocker extension looks like it should do the job.

Internet Explorer users

I’m tempted to say: install Chrome or Firefox but I’m well aware that some unfortunate academics have ‘university-managed’ computers on which they can’t easily install things. If so try the instructions for IE here. Let me know if you have better solutions for unfortunate IE users.

Before (left) and After (right) disabling JavaScript on the page.


Added bonus function – extra privacy!

Would you want advertisers to be collecting data on you, knowing what you’ve been reading? It’s possible, though not proven AFAIK, that the journal publishers themselves, or the advertisers they use, are recording information about what articles you’re reading. They might know you read that article about average penis length three times last week for instance… Eric Hellman wrote quite an alarming post about the extent of this tracking at publisher websites recently. Thus blocking JavaScript at publisher websites provides extra privacy, not just protection against Readcube!

Above all I think we should #BlockReadcube not just for our own utility (easier access to the real PDF), but to send them a powerful message: we do not want the literature to be assimilated and enclosed in rights-restrictions by new technology. We do not want non-consenting ‘cubification of the research literature. We are Starfleet, and as far as I’m concerned: Readcube is the Borg.


PS If you like some of the features of Readcube, try Utopia Docs – it’s free and it’s released under an Open Source license, and it doesn’t force you to use it!

Update 2015-03-20: This post does not indicate I’m suddenly ‘in favour’ of PDFs by the way, as some seem to have interpreted. If Wiley wanted to do something good, they should publish their full text XML on site like other good publishers do e.g. PLOS, eLife, Hindawi, MDPI, Pensoft, BMC, Copernicus… If they did this then readers could choose to use innovative open source viewing software such as eLife Lens. That kind of change would add value & choice, rather than subtract value (& rights) as they have in this case.

Further discussion of Readcube and rights-restrictions:

So, apparently Elsevier are launching a new open access mega-journal some time this year, joining the bandwagon of similar efforts from almost every other major publisher. A lovely acknowledgement of the roaring success of PLOS ONE, who did it first a long time ago.

They’re only ~8 years behind, but they’re learning. I for one am pleased they are asking the research community what they want from this new journal. One of their “key points” in the press release is: “the journal will be developed in close collaboration with the research community and will evolve in response to feedback”.

Well, I’m a member of the research community. I’m a BBSRC-funded postdoc at the University of Bath. I publish research myself AND I re-use published research, so I have a dual perspective that Elsevier should find useful. Here’s my feedback on their new open access journal proposal:


  • Does the research community really need or want a new journal?

We have at least 27,000 other peer-reviewed journals (source: Ulrich’s). I can’t see anything in Elsevier’s proposal that’s really new, or better than anything that already exists – you’ll be hard pressed to beat PeerJ. More journals add to the fragmentation of the research literature – it’s already hard to search across all these journals effectively. Why not just accept more volume in existing journals? It’d be great if you flipped The Lancet, Cell, and Trends in Ecology and Evolution to full (100%) open access journals, and rejected fewer submitted papers that present sound science. I genuinely do not know of any researcher who asked specifically for an additional new Elsevier journal.


The definition of open access always has been, and always will be this:

By “open access” to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited. (BOAI)

If you’re going to allow the CC-BY-NC-ND licence then by definition you can’t call it an open access journal. Either don’t allow that restrictive non-open licence, or call this new journal a ‘free-to-read’ journal or a ‘public access’ journal. These are the established terms for cost-free but not open journal content that the research community uses. Speak our language for a change instead of deliberately opaque legalese.


  • Take feedback on the design of your new journal from the WORLD not just the research community

Approximately 80% of the world’s academic research is taxpayer or charitably funded. The world is therefore your customer, not just researchers. Ask the world what they want from your new journal.


Take inspiration from the Panton Principles: “Science is based on building on, reusing and openly criticising the published body of scientific knowledge” – help researchers do the best science possible by not allowing them any excuses to not share non-sensitive data with their colleagues. The ‘email the author’ system has been widely shown not to work, and that matches my own experience too.


  • Make peer reviews open for all to see, post-publication alongside the paper

At the time of review, you can do single or double blind, but after the manuscript is accepted and published, please publish the reviews alongside the accepted paper. The research community can then see for themselves how good peer review is at your new journal. Allow people to sign their reviews if they wish to (and personally I think this is best in most circumstances).


  • Encourage data citation

Do I really need to explain this one? Old school academic editors have apparently been striking these out at some journals. Please make all editors aware that this is both a good thing and is encouraged.


  • Encourage authors to provide their ORCIDs upon submission, (and ORCIDs for reviewers and editors too please)

This will help people disambiguate who’s who, which is important when there are at least 7 million active researchers.


  • Charge a reasonable APC ($1350 or less), and be generous with fee waivers and discounts for those that cannot afford them

Anything more than $1350 per article for a new journal in 2015 is daylight robbery. For the first year of publication you should waive charges for everyone, as everyone else does.


  • Provide open, full text XML

Great for text-mining. We don’t need your API. Just give us the content.


There you go Elsevier – that’s my feedback. If you can do ALL of the above or better, I might even publish with you myself. I have stated what I think you should do; it’s up to you now to implement it. I anticipate the launch of your glorious new journal. When your new journal comes out I shall revisit this post & score your new journal against it.


I encourage all other researchers & the scholarly poor who feel similarly, to also make their feelings known to Elsevier, and to add points I have perhaps overlooked. I’d say good luck Elsevier, but you don’t need luck with your fat profit margins – it’s simple to openly publish a good peer-reviewed research journal – just get on and do it already.




Ross Mounce, PhD