Show me the data!

Traditional Publishers: please stop blocking research

November 19th, 2015 | Posted by rmounce in Content Mining

OpenCon 2015 Brussels was an amazing event. I’ll save a summary of it for the weekend, but in the meantime I urgently need to discuss something that came up at the conference.

At OpenCon, it emerged that Elsevier have apparently been blocking Chris Hartgerink’s attempts to access relevant psychological research papers for content mining.

No one can doubt that Chris’s research intent is legitimate – he’s not fooling around here. He’s a smart guy, statistically, programmatically and scientifically, and without doubt he has the technical skills to execute his proposed research. Only recently he was an author on an excellent paper highlighted in Nature News: ‘Smart software spots statistical errors in psychology papers’.

Why then are Elsevier interfering with his research?

I know nothing more about his case than what is in his blog posts. However, I have also had publishers block my own attempts at content mining this year, so I think this is the right time for me to go public about this, in support of Chris.

My own use of content mining

I am trying to map where in the giant morass of research literature Natural History Museum (London) specimens are mentioned. No one has an accurate index of this information. With the use of simple regular expressions it’s easy to filter hundreds of thousands of full text articles to find, classify and look up potential mentions of specimens.
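To give a flavour of what that kind of filtering might look like, here is a minimal Python sketch. It is not the code used for this project: the pattern, the function names and the sample sentences are all assumptions made up for illustration. It assumes specimens are cited with an institutional prefix such as NHMUK (or the older BMNH) followed by a catalogue number, which will certainly miss some real-world citation styles.

```python
import re

# Assumed pattern, for illustration only: an institutional prefix
# (NHMUK, or the older BMNH), up to two short collection codes,
# then a catalogue number that may contain '.', '-' or '/' separators.
SPECIMEN_RE = re.compile(
    r"\b(?:NHMUK|BMNH)[\s.]*(?:[A-Z]{1,3}[\s.]*){0,2}\d+(?:[.\-/]\d+)*"
)


def find_specimen_mentions(text):
    """Return every substring of `text` that looks like a specimen code."""
    return [match.group(0) for match in SPECIMEN_RE.finditer(text)]


def filter_articles(articles):
    """Keep only the articles (id -> full text) with at least one candidate mention."""
    hits = {}
    for article_id, text in articles.items():
        mentions = find_specimen_mentions(text)
        if mentions:
            hits[article_id] = mentions
    return hits


# Made-up sample texts, not real articles.
sample = {
    "article-1": "The holotype (NHMUK PV R 1234) was compared with BMNH 1985.3.12.1.",
    "article-2": "No museum specimens are cited in this paper.",
}
print(filter_articles(sample))
```

Anything a pattern like this turns up is only a potential mention, so the matches would still need to be classified and looked up against the museum’s own records before being treated as real.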

In the course of this work, I was frequently obstructed by BioOne. My IP address kept getting blocked, stopping me from downloading any further papers from this publisher. I should note here that my institution (NHMUK) pays BioOne to provide access to all their papers – my access is both legitimate and paid-for.

Strong claims require strong evidence. Thankfully I was doing my work with the full support and knowledge of the NHM Library & Archives team, so they forwarded one or two of the threatening messages they were getting from the publishers I was mining. I have no idea how many messages were sent in total. Here’s one such message from BioOne (below).

Blocked by BioOne

So, as I swiftly found out, BioOne automatically deems downloading more than 100 full text articles in a single session “excessive” and “a violation of permissible activity”.

Isn’t that absolutely crazy? In the age of ‘big data’, where anyone can download over a million full text articles from the PubMed Central OA subset in a few clicks, an artificially imposed restriction of just 100 is simply mad and is anti-science. As a member of a subscription-paying institution I have a paid right to be able to access and analyze this content, surely? We are paying for access but not actually getting full access.

If I tell other journals like eLife, PLOS ONE, or PeerJ that I have downloaded every single one of their articles for analysis, I get a high-five: these journals understand the importance of analysis-at-scale. Furthermore, the subscription access business model needn’t be a barrier: the Royal Society journals are very friendly towards content mining – I have never had a problem downloading entire decades’ worth of journal content from them.

I have two objectives for this blog post.

1.) A plea to traditional publishers: PLEASE STOP BLOCKING LEGITIMATE RESEARCH

Please get out of the way and let us do our research. If our institutions have paid for access, you should provide it to us. You are clearly impeding the progress of science. Far more content mining research has been done on open access content and there’s a reason for that – it’s a heck of a lot less hassle and (legal) danger. These artificial obstructions on access to research are absurd and unhelpful.

2.) A plea to researchers and librarians: SHARE YOUR STORIES

I’m absolutely sure it’s not just Chris and me who have experienced problems with traditional publishers artificially obstructing our research. Heather Piwowar is one great example I know. She bravely, extensively and publicly documented her torturous experiences of negotiating access to, and text mining of, Elsevier-controlled content. But we need more people to speak up. I fear that librarians in particular may be inadvertently sweeping these issues under the carpet – they are the most likely to receive the most interesting emails from publishers with respect to these matters.

This is a serious matter. Given the experience of Aaron Swartz, who faced up to 50 years of imprisonment for downloading ‘too many’ JSTOR papers, it would not surprise me if few researchers come forward publicly.