Show me the data!

Page View Spikes on Research Articles

March 24th, 2015 | Posted by rmounce in Open Data

For those who know me as a biologist, it might surprise you that my most-cited publication so far is on Open Access and altmetrics (published in April 2013, 25 cites and counting…) – nothing to do with biology per se!

So I took great interest in this new publication:

Wang, X., Liu, C., Mao, W., and Fang, Z. 2015. The open access advantage considering citation, article usage and social media attention. Scientometrics, pp. 1-10. DOI: 10.1007/s11192-015-1547-0

The authors have gathered some really fascinating data measuring day-by-day altmetrics of papers at the journal Nature Communications, which at the time was hybrid: some articles were behind a paywall, while others were paid-for open access at a cost of $5200 to the authors/funders. (The cost of open access here is an absolute rip-off. I do not endorse or recommend outrageously priced paid-for open access outlets like Nature Communications. PLOS ONE costs just $1350, remember! PeerJ is just $99 per author!)

The paper is by no means perfect – I’m not saying it is – but the ideas behind it are good. Many on Twitter have commented that it’s ironic that this paper on the open access advantage is itself only available behind a paywall at the publisher.

The good news is that Dr Xianwen Wang has responded to this and made an ‘eprint’ copy (stripped of all publisher branding) freely available at arXiv as of 2015-03-19 (post-publication). The written English throughout the manuscript is not brilliant, but I feel this reflects poorly on the journal rather than the authors – it’s remarkable that Scientometrics can charge subscribers a fee when it offers no copy-editing of accepted manuscripts! Finally, technical detail on precisely how the data were obtained is rather lacking. So that’s the critique out of the way…

My tweets about this paper have been very popular.

But I wanted to dig deeper into the data, so I emailed the corresponding author, Xianwen, for a copy of the data behind figure 2, and he happily and quickly sent it to me. I was fairly shocked (in a good way) that he did. Most of my past email requests for data have ultimately been unsuccessful – this is well documented in the field of phylogenetics *sad face*. The ’email the author’ system simply cannot be relied upon, which is one of many reasons why I feel all non-sensitive data supporting research should be made publicly available, alongside the article, on the day of publication.

I did my own re-analysis of the raw data Xianwen sent over and discovered lots of odd jumps in the data that couldn’t really be explained by peaks in social media activity, e.g. for A cobalt complex redox shuttle for dye-sensitized solar cells with high open-circuit potentials (visualized below). Roughly 520 days after it was first published, it apparently accumulated 21,577 page views in a single day! There was also an earlier, smaller spike of about 2,000 page views.
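For readers curious how one might flag such one-day jumps in a daily page-view series, here is a minimal sketch. To be clear, this is my own illustration, not the method used in the paper (which does not detail its filtering); the function name and the thresholds are arbitrary assumptions.

```python
from statistics import median

def flag_spikes(daily_views, factor=10, min_views=100):
    """Return indices of days whose view count vastly exceeds the
    series median.

    daily_views: list of ints, one entry per day since publication.
    factor and min_views are illustrative thresholds, not values
    from the paper.
    """
    baseline = median(daily_views)
    threshold = max(min_views, factor * baseline)
    return [i for i, v in enumerate(daily_views) if v > threshold]

# A quiet series with one huge single-day jump, like the one described above
views = [30, 28, 35, 31, 21577, 29, 33]
print(flag_spikes(views))  # → [4]: only the 21,577-view day is flagged
```

Real altmetrics pipelines would presumably do something more robust (rolling baselines, bot-traffic heuristics), but even a crude median-based cutoff catches jumps of this magnitude.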

Article View Spikes

Xianwen had filtered these suspicious jumps out of his figures but neglected to mention that in the methods section; upon being informed of this discrepancy, he told me he is going to contact the editor to sort it out. A great little example of how data sharing results in improved science? The unfiltered data look a little like the plot below:

Anyway, back to the spikes/jumps in activity – they certainly aren’t an error introduced by the authors of the paper, as they can also be seen via Altmetric (a provider of altmetrics services). The question is: what is causing these one-day spikes in activity?

I have alerted the team at Altmetric, and they have/will alert Nature Publishing Group to investigate further.

Most of the spikes are likely accidental in origin, but it would be good to know more. A downloading script gone awry? Still, there remains a possibility that within this dataset is putative evidence of deliberate gaming of altmetrics, specifically article views. I look forward to hearing more from Altmetric and Nature Publishing Group about this… the ball is very much in their court right now.

Moreover, now that these peculiar spikes have been detected, what, if anything, should be done about them?

  • Ross – nice post, and deeply ironic that the authors didn’t take advantage of this OA effect themselves…

    In my experience, usage spikes like this are never the result of author gaming, but usually the result of some innocent ‘script gone awry’ or ‘spam bot trying to leave lots of comments’. So I doubt the answer will turn out to be that someone deliberately manipulated the data (it is far too easy to spot); more likely it is something that their download counter needs to exclude.

    An interesting question is how should you fix the historical record for these things – if you identify “shoebot” leaving spam comments to buy shoes (for example), do you ‘invisibly’ remove the bad data? Or do you leave it in there with a note that it is bad data? Or some variation of visibility?

  • Luke Skibinski

    Fun post – my experience with these sorts of spikes is not so much malicious gaming as a change to the program that gathers the metrics. Some pattern is tweaked or some new source is added (a long-forgotten printer-friendly page linked to directly from Google, for example…) and these incredible spikes happen. The responsible thing to do is to correlate these changes in metrics with commits and releases in the software, but that only works if the software is open source. Open source metrics gathering and Open Access go hand in hand.