Making a journal scraper

May 13th, 2015

Yesterday, I made a journal scraper for the International Journal of Systematic and Evolutionary Microbiology (IJSEM).

Fortunately, Richard Smith-Unna and the ContentMine team have done most of the hard work in creating the general framework with quickscrape (open source and available on GitHub); I just had to modify the available journal-scrapers to work with IJSEM.

How did I do it?

Find an open access article in the target journal, e.g. James et al. (2015) Kazachstania yasuniensis sp. nov., an ascomycetous yeast species found in mainland Ecuador and on the Galápagos

In your browser, view the HTML source of the full text page. In the Chrome/Chromium browser the keyboard shortcut to do this is Ctrl-U. You should then see something like this, perhaps with less funky highlighting colours:

I based my IJSEM scraper on the existing set of scraper definitions for eLife because I know both journals use similar underlying technology to create their webpages.

The first bit I clearly had to modify was the extraction of publisher. In the eLife scraper this works:

but at IJSEM that information isn’t specified with ‘citation_publisher’, instead it’s tagged as ‘DC.Publisher’ so I modified the element to reflect that:
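As a rough illustration of why the selector had to change (a sketch only, not the actual scraper definition), the Python below runs the same XPath lookup against two tiny, made-up page heads. The publisher values, the sample markup and the `extract_publisher` helper are all assumptions for illustration; only the two meta names, ‘citation_publisher’ and ‘DC.Publisher’, come from the pages described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified page heads. Real pages carry many more
# tags, and the publisher values here are placeholders.
ELIFE_HEAD = """<html><head>
<meta name="citation_publisher" content="Example Publisher A"/>
</head></html>"""

IJSEM_HEAD = """<html><head>
<meta name="DC.Publisher" content="Example Publisher B"/>
</head></html>"""


def extract_publisher(page, meta_name):
    """Return the content of the first <meta> tag with this name, or None."""
    root = ET.fromstring(page)
    tag = root.find(f".//meta[@name='{meta_name}']")
    return None if tag is None else tag.get("content")


# The eLife-style selector finds nothing on the IJSEM-style page...
print(extract_publisher(IJSEM_HEAD, "citation_publisher"))
# ...but the DC.Publisher selector does.
print(extract_publisher(IJSEM_HEAD, "DC.Publisher"))
```

The point is simply that the lookup logic stays the same and only the name in the XPath changes.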

The license and copyright information extraction differs even more between eLife and IJSEM. Here’s the correct scraper for the former:

and here’s how I changed it to extract that information from IJSEM pages:

The XPath needed is completely different. The information is inside a div, not a meta tag.
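To make the meta-versus-div difference concrete, here is a minimal sketch in Python. Everything in the markup is invented: the meta name `DC.Rights`, the class name `license` and the licence text are placeholders, not the real eLife or IJSEM markup. It only shows how the shape of the XPath changes when the value sits in element text rather than in an attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for the two styles (names are assumptions).
META_STYLE = """<html><head>
<meta name="DC.Rights" content="Example licence statement"/>
</head><body/></html>"""

DIV_STYLE = """<html><head/><body>
<div class="license"><p>Example licence statement</p></div>
</body></html>"""


def license_from_meta(page):
    """Read the licence from a <meta> tag's content attribute, or None."""
    tag = ET.fromstring(page).find(".//meta[@name='DC.Rights']")
    return None if tag is None else tag.get("content")


def license_from_div(page):
    """Read the licence from the text inside a <div>, or None."""
    div = ET.fromstring(page).find(".//div[@class='license']")
    return None if div is None else "".join(div.itertext()).strip()


print(license_from_meta(DIV_STYLE))  # the meta lookup finds nothing here
print(license_from_div(DIV_STYLE))   # the div lookup returns the text
```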


Hardest of all though were the full size figures and the supplementary materials files – they’re not directly linked from the full text HTML page which is rather annoying. Richard had to help me out with these by creating “followables”:

In his words:

any element can ‘follow’ any other element in the elements array, just by adding the key-value pair "follow": "element_name" to the element that does the following. If you want to follow an element, but don’t want the followed element to be included in the results, you add it to a followables array instead of the elements array. The followed array must capture a URL.
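The idea in that quote can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how quickscrape itself is implemented: the `PAGES` dictionary stands in for real HTTP requests, the URLs, selectors and class names are invented, and elements and followables are modelled here as dictionaries keyed by name. The `scrape` and `select` helpers are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical pages keyed by URL, standing in for real HTTP requests.
PAGES = {
    "http://example.org/article": """<html><body>
        <a class="fig-link" href="http://example.org/article/F1"/>
    </body></html>""",
    "http://example.org/article/F1": """<html><body>
        <img class="full-size" src="http://example.org/F1.large.jpg"/>
    </body></html>""",
}

# Toy definition: 'figure' follows 'figure_page', which lives in
# followables so that it is not reported in the results itself.
DEFINITION = {
    "followables": {
        "figure_page": {"selector": ".//a[@class='fig-link']",
                        "attribute": "href"},
    },
    "elements": {
        "figure": {"selector": ".//img[@class='full-size']",
                   "attribute": "src",
                   "follow": "figure_page"},
    },
}


def select(url, spec):
    """Apply one selector to one page and return the captured attributes."""
    root = ET.fromstring(PAGES[url])
    return [el.get(spec["attribute"]) for el in root.findall(spec["selector"])]


def scrape(url, definition):
    """Resolve each element, visiting followed URLs first where asked."""
    followables = definition.get("followables", {})
    results = {}
    for name, spec in definition["elements"].items():
        if "follow" in spec:
            target = spec["follow"]
            followed = followables.get(target) or definition["elements"][target]
            urls = select(url, followed)  # the followed element captures URLs
        else:
            urls = [url]
        results[name] = [value for u in urls for value in select(u, spec)]
    return results


print(scrape("http://example.org/article", DEFINITION))
```

Note that `figure_page` never appears in the output, which is the behaviour the followables array is described as providing.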



The bottom line is, it might look complicated initially, but actually it’s not that hard to write a fully functioning journal scraper definition for use with quickscrape. I’m off to go and create one for Taylor & Francis journals now :)


Wouldn’t it be nice if all scholarly journals presented their content on the web in the same way, so we didn’t have to write a thousand different scrapers to download it? That’d be just too helpful wouldn’t it?



April Clyburne-Sherin asked an interesting question on the OpenCon Discussion List recently:

I am an author on a manuscript that my lab wants to publish in a subscription journal that normally retains the copyright. The manuscript is a desirable one so they are “willing” (haha) to provide it “open access” (that was my stipulation to my lab when they started speaking with the publisher). My lab is happy with this, but I do not trust the publisher and want to be able to negotiate a publishing agreement that guarantees:
  • We retain the copyright;
  • The article will be open access forever and no version will be behind a paywall at their journal ever;
  • That there are no sign-ins, registrations, DRM viewing issues, or other ‘free’ obstacles to viewing the article.

Comment: Quite rightly, April does not trust the publisher to make the published work fully open access in perpetuity, and wants to do more as an author, with the publishing agreement (a formal contract) to ensure that the publisher will actually provide the exact services she wants.

Recent events this year, whereby Elsevier, Wiley and Springer have all been caught red-handed selling access to hybrid open access articles, justify this lack of trust. It’s a sad state of affairs that authors such as April and me no longer trust some service providers to actually provide the services we pay them for (e.g. Open Access).

Some helpful links & pointers have been provided on the discussion list, and this may be a concern many other scholarly authors have so it’s valuable to collate, discuss and publicise possible solutions to the thorny problem of publishing agreements with legacy publishers. I certainly don’t pretend to have all the answers here and I think organisations like SPARC might want to act on this one.

Lorraine Chuen links to the Canadian Association of Research Libraries (CARL) ‘Resources for Authors’ page which amongst other things discusses the Canadian SPARC Author Addendum. I knew about the US SPARC Author Addendum, but I never knew there was a Canadian version too!

Matt Menzenski links to the University of Kansas Authors & Copyright page. I particularly like An Introduction to Publication Agreements for Authors (Armstrong, 2009) that they link to at the very top – it’s really useful information.

My Suggested Solutions

For my part, I chipped in with four different ways that each either partially or wholly fulfil some or all of the criteria April is looking for:

1.) Wait for them to send you their proposed publishing agreement & change the terms to ones you find agreeable

If they send you a standard CTA (Copyright Transfer Agreement) form as PDF, you can modify the wording of that PDF to terms you prefer and send it back to them and they probably won’t even notice as long as it’s signed & doesn’t look too different. It’s cheeky, but I got away with it for a book chapter once. Be careful to remove / replace the term ‘work for hire’ – it may look like an innocuous statement but apparently this is fairly key in legal terms – I neglected to remove that from my book chapter agreement.


2.) Transferring your copyright away to another person
Not as easy perhaps for multi-author papers but Mike Taylor has a good (successful-ish) anecdote about transferring his copyright to his spouse, thereby preventing the Geological Society from taking the copyright of the work.


3.) Claim that one of the authors is a US federal government employee
Use Section 105 of the US Copyright Act by pretending that at least one of the authors is an employee of the US Government. Works of the U.S. federal government cannot be copyrighted by their authors in the US – they must be public domain, which is in practice achieved by applying the Creative Commons Zero waiver to the paper. The CTA form may contain a check box asking about this. If not, just email them about it. Michael Eisen famously, successfully liberated a NASA space research paper from behind a paywall at Science (AAAS), using Section 105 as justification.
Will publishers really bother fact-checking your assertion about the employment of one of the authors? I don’t think so. It could land them in big trouble if they dare disregard the US Copyright Act.


4.) Simply do not sign, or do not return the unfavourable publishing agreement
Another risky approach is simply not to sign or not to return the CTA the publisher sends you after acceptance (with the obvious risk that this could delay publication). I think this is perhaps the most promising approach: there is strong evidence that many academics currently employ this practice. When you think about it, publishers actually need our papers or they’ll go bust. They need a constant stream of content to justify their existence. If you don’t sign off on their stipulated terms and conditions after acceptance, they do have real pressures to get on and publish the paper anyway, especially with the increased focus on optimising submission-to-publication times these days.


I’ll let Reinhard Diestel (mathematician, University of Hamburg) have the last word on this post. It’s a solution I’m keenly interested in trying myself:
I stopped signing away my copyright on journal papers in the late 1990s. Interestingly, almost all publishers reacted either positively or not at all when I did not return the copyright form signed as requested: in all cases did they print the paper in question, usually without additional delay, and sometimes with unexpected understanding and support. (Yes, there have been one or two cases where things were a little more difficult at first, but these too were resolved amicably in the end.)


Roughly ten days after I first blogged about this (see: Springer caught red-handed selling access to an Open Access article), Springer have now made a curious public statement acknowledging this debacle:

Statement on Annals of Forest Science article

Berlin, 6 May 2015

A number of tweets posted by Prof. Luis Apiolaza on 27 April, and by others active on social media, suggest that Springer is charging for access to open access articles published in Annals of Forest Science. After looking into this issue, there is indeed an issue with the status of the article, but this has to do with the background of the journal itself.

Annals of Forest Science is a journal owned by INRA (Institut National de la Recherche Agronomique). In 2009, when the article in question first appeared, the journal was being published by another company that allowed readers to read the articles without paying a fee (“free access”). When Springer started working with INRA in 2011 we agreed to add the 2007-2010 archives to SpringerLink, Springer’s online platform, in order to ensure a smooth transition and to give a wider distribution to the most recent articles. Since the copyright was not assigned to the author, and since there is no mention of the licensing used, we incorrectly assumed that the article was not open access.

It is clear that this article was intended to be open access, and it will be made so on SpringerLink as quickly as possible. Anyone that has purchased the article will, of course, be reimbursed.

Please note that we support Green Open Access and we feed all articles from INRA journals to the HAL repository after the 12-month embargo, making the articles freely downloadable there (this is clearly written on the journal’s webpage, with a link to the HAL platform). The article in question can also be found there for free (since 2011).

This has been an oversight, and we apologize for not being more thorough and vigilant.


Ruth Francis | Springer | Corporate Communications
tel +44 203192 2732


I am pleased that Springer are committing to reimbursing all (reader) purchasers of wrongly-paywalled articles, and I shall check my bank balance regularly in the coming weeks to see if they honour this promise.

I am also pleased that Springer see fit to formally apologize for their carelessness of publishing. I note that AFAIK neither Wiley nor Elsevier have apologised for similar incidents this year.

But I’m rather bemused by this wording they have chosen: “It is clear that this article was intended to be open access, and it will be made so on SpringerLink as quickly as possible”

Indeed it seems they chose this wording carefully, because as far as I can tell with my browser, Luis’s open access article is still on sale (see screenshot below).

Update: As of 2015-07-05 13:20 (BST) the article is now no longer paywalled. At the time of writing, as can be seen below it was clearly paywalled.



Springer SBM as an entity makes nearly a billion euros per year in turnover. Despite the considerable size, wealth and ‘experience’ in publishing, Springer can’t seem to unpaywall Luis’s article. Astonishing.

Today, the author of a paid-for, ‘hybrid’ open access article published in 2009 found that it was wrongly on sale at a Springer website:

FWIW it’s still freely available at the original publisher website here.

To test if Springer really were just brazenly selling a copy of the exact same open access article, I paid Springer to access a copy myself (screenshot below) and found it was exactly the same:

my receipt

I don’t actually care whether this is technically ‘legal’ any more. That doesn’t matter. This is scammy publishing. I want a refund and I will be contacting Springer shortly to ask for this. The author also hopes I get a refund – he wanted his article to be open access, not available for a ransom:


Frankly, I’m getting tired of writing these blog posts, but it needs to be done to record what happened, because it keeps on happening.

I really think we need to set up a c.f. to monitor and report on these types of incidents. It’s clear the publishers don’t care about this issue themselves – they get extra money from readers by making these ‘mistakes’ and face no financial penalty if anyone does spot these mistakes. Calculated indifference.

Are these known incidents just the tip of the iceberg? How do we know this isn’t happening at a greater scale, unobserved? There are more than 50 million research articles on sale at the moment. Perhaps in small part this explains the obscene profits of the legacy publishers?

It’s yet another nail in the coffin for hybrid OA – we simply can’t trust these publishers to keep this content open and paywall-free.

A recap of recent incidents of selling open access articles, without the publisher acknowledging to the reader/buyer that it is an open access article:

