Show me the data!

Pricing, Citation Impact, and Springer Nature (Part 1)

November 17th, 2021 | Posted by rmounce in SpringerNature

On 26 October 2021, Springer Nature published version 1 of a (not peer-reviewed) “white paper” titled “Going for gold: exploring the reach and impact of Gold open access articles in hybrid journals” by Christina Emery, Mithu Lucraft, Jessica Monaghan, David Stuart, and Susie Winter.

Springer Nature present cherry-picked analyses, with an experimental design of their choosing, covering 60,567 articles published in 1,262 of their ‘hybrid’ journals.

What is a ‘hybrid’ journal?

A ‘hybrid’ journal is predominantly a paywalled subscription journal, but it permits individual articles within that journal to escape the paywall if one of three things happens:

  • (1) the author(s), institution, or funder pays a fee (APC) to ensure permanent openness of an individual article with a Creative Commons license,


  • (2) the journal grants a fee waiver and gives permanent open access to an individual article with a Creative Commons license,


  • (3) the journal just turns off the paywall on an individual article at whim, without author-side payment, without a Creative Commons license, and importantly without any assurance of permanence of the turned-off paywall state (so-called ‘bronze OA’).

Springer Nature use the results of various analyses to claim, presumably to an audience of governments, policymakers, and researchers, that they should focus their efforts (funding $$$) on hybrid gold OA journals rather than green OA infrastructure:

 “Efforts which seek to increase the availability of Green OA don’t create the intended benefits and risk delaying or even preventing the take up of full Gold OA and achieving the benefits described above. While sharing of subscription-tied earlier versions can help the dissemination of research, they do not have as strong a reach or impact as full Gold OA, and remain dependent on the continuation of subscription models to fund the costs of editorial and publishing processes to validate and improve the manuscript. As such, we believe investment in Gold OA should be a priority and is the only way to achieve full, immediate and sustainable OA.”

The methods they have chosen for their analysis do not accurately or cleanly test the hypothesis at hand — which intervention has more impact, ‘gold OA’ or ‘green OA’? — let alone address the substantial pricing differential between the two. Simply put, the experimental design in the (not peer-reviewed) white paper is not a fair or logical comparison.

Extensive cherry-picking

As with any analysis, it’s always good to look at the data sources. In the SpringerNature white paper they look at full original research papers (excluding other items found within journals, such as Editorials, Letters, News, Correction Notices, Retraction Notices, Book Reviews, Obituaries, Opinions, Research Summaries, “Other Journals in Brief”, “Product News”, et cetera) that were first published online at the journal in the calendar year 2018 (that is, from 2018-01-01 to 2018-12-31 inclusive). An interesting quirk of this is that many research articles within the first 2018 issue of SN hybrid journals are excluded, because they tend to be published online in 2017 and only assigned to a “2018” issue later on. Similarly, there are many articles in the SN whitepaper dataset that sit in “2019” journal issues but were first published online in 2018, e.g. in December 2018. [That’s not a complaint fwiw, just an observation…]

When I first looked at an exemplar SN hybrid journal, namely Insectes Sociaux, I was shocked to observe a large discrepancy between the number of full research articles in that journal published in the calendar year 2018 (53) and the much smaller number of articles from that journal included in SN’s whitepaper dataset (29). By my analysis, the whitepaper arbitrarily excludes 24 (= 53 − 29) of the full research articles published in 2018 in this journal.

The SN whitepaper is pseudo-transparent about the selectivity of their sampling. On page 8 they mention:

Only those primary research articles where all the necessary metadata was available were included in the analysis:

  • 138,449/157,333 (88%) of the articles were identified as being published in a journal with an impact factor
  • 68,668/157,333 (44%) of the articles had a corresponding author that had an identifiable THE institutional ranking and country.

The overlap between these two factors left a final data set of 60,567 records incorporated in the analysis.

Careful readers will observe that 60,567 out of 157,333 amounts to just 38.5% of the set of full research articles in SN hybrid journals published in the calendar year 2018. It might be okay were this a random sample, but clearly it is explicitly non-random: it excludes full research articles whose corresponding authors are outside the set of 2,112 institutions included in the Times Higher Education (THE) rankings. For context, estimates vary, but there are thought to be at least 31,000 higher education institutions in the world. This bakes in a significant bias towards Western institutions and does not give a truly global or balanced picture of what’s being published in SN hybrid journals.
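Those fractions are easy to sanity-check from the figures the whitepaper itself quotes. A minimal Python sketch, using only the numbers in the passage above:

```python
# Figures quoted from the SN whitepaper (page 8)
total = 157_333      # full research articles in SN hybrid journals, 2018
with_if = 138_449    # published in a journal with an impact factor
with_the = 68_668    # corresponding author matched to a THE-ranked institution
final = 60_567       # overlap of the two filters, used in the analysis

print(f"impact-factor filter: {with_if / total:.1%}")   # ~88.0%
print(f"THE-ranking filter:   {with_the / total:.1%}")  # ~43.6% (whitepaper rounds to 44%)
print(f"final sample:         {final / total:.1%}")     # ~38.5%
```

Note that the THE-ranking filter alone discards over half of the articles, which is the crux of the non-random-sampling complaint.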

Their vague description of their selection methodology doesn’t even correspond to the data they’ve excluded. For instance, within Insectes Sociaux I found a paper (DOI: 10.1007/s00040-018-0616-9) published in April 2018. The journal is a hybrid journal, it has a Clarivate Journal Impact Factor, and the corresponding author’s affiliation on the paper is “Graduate School of Education, University of the Ryukyus, Japan”; the University of the Ryukyus is one of the lucky 2,112 institutions included in the THE rankings. Therefore I can’t see why this paper is not in their dataset of 60,567 articles. The way in which they have whittled down the sample from 157,333 articles to 60,567 is not reproducible and does not appear to match their stated selection criteria.

Via email, I asked the authors of the report for the full list of 157,333 DOIs of the initial sample (just the DOIs, nothing more) and the response from Mithu Lucraft was “I’m not able to release the broader dataset. If you wish to apply for a NDA to utilise a different dataset, I’ll look into the appropriate contact internally for this purpose”. I can’t help but note that the 60,567-article dataset is publicly available from figshare under a CC BY license, yet when I ask merely for a list of DOIs pertaining to the very same study it is hinted I would have to apply for an NDA. SpringerNature operate transparency and open licensing only when it suits them. I have no intention of ever signing a non-disclosure agreement with SpringerNature, so I assume I will now have to recreate the list of ~157,333 full research articles published in SN hybrid journals in 2018 myself, without their assistance.
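For what it’s worth, recreating a candidate list of that kind should be possible from public metadata via the Crossref REST API, journal by journal. A hedged sketch follows: the ISSN shown is what I believe Insectes Sociaux uses, the online-publication-date filters are my assumption about how best to approximate “first published online in 2018”, and Crossref’s `journal-article` type is broader than “full research article”, so any counts would need manual cleaning:

```python
from urllib.parse import urlencode

# Print ISSN for Insectes Sociaux (verify against the journal's own pages).
ISSN = "0020-1812"

def crossref_count_url(issn: str, year: int) -> str:
    """Build a Crossref REST API query asking only for the count of
    journal-article records first published online in the given year."""
    filters = ",".join([
        "type:journal-article",
        f"from-online-pub-date:{year}-01-01",
        f"until-online-pub-date:{year}-12-31",
    ])
    # rows=0 asks Crossref for the count alone, not the records themselves
    query = urlencode({"filter": filters, "rows": 0})
    return f"https://api.crossref.org/journals/{issn}/works?{query}"

print(crossref_count_url(ISSN, 2018))
```

Fetching that URL returns a JSON body whose `message.total-results` field gives the count without downloading every record; repeating the query per hybrid-journal ISSN would rebuild an approximation of the 157,333-article denominator.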

A closer look at hybrid gold versus green preprints posted at arXiv, for physics papers

Leaving aside the rampant cherry-picking used to create the whitepaper dataset, if we drill down into a subject area, e.g. ‘Physics’, we can observe from the dataset the median number of citations of a 2018-published Physics paper in an SN hybrid journal (as assessed in November 2021, i.e. an elapsed period of at most 3 years and 10 months [if published 2018-01-01] and at least 2 years and 10 months [if published 2018-12-31]):

  • that was made gold open access at an SN hybrid journal is 3 citations (across n=315 articles)
  • that is paywalled at the SN hybrid journal but is also publicly accessible via an arXiv preprint copy is 3 citations (across n=838 articles)
  • that is neither open access at the journal nor publicly accessible via arXiv or other preprint servers or repositories is 2 citations (across n=2,103 articles)
  • (from my own analysis; this data is not provided by SpringerNature) for the 111 papers published in 2018 at the fully open access journal SciPost Physics, which is NOT published by SpringerNature, the median number of citations is 10 citations
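For anyone wanting to reproduce this comparison from the publicly available figshare dataset, the computation is just a per-group median. A minimal sketch in plain Python; the records below are illustrative stand-ins I made up so that each group’s median mirrors the reported figures, and the real dataset’s column names will differ:

```python
from statistics import median

# Illustrative stand-in records (NOT real data): values chosen so the
# per-group medians echo the figures reported above. Replace with rows
# read from the figshare dataset; field names here are my own guesses.
records = [
    {"access_route": "hybrid gold", "citations": 2},
    {"access_route": "hybrid gold", "citations": 3},
    {"access_route": "hybrid gold", "citations": 7},
    {"access_route": "green (arXiv)", "citations": 1},
    {"access_route": "green (arXiv)", "citations": 3},
    {"access_route": "green (arXiv)", "citations": 9},
    {"access_route": "closed", "citations": 0},
    {"access_route": "closed", "citations": 2},
    {"access_route": "closed", "citations": 5},
]

def median_citations_by_route(rows):
    """Group records by access route; return each group's median citations."""
    groups = {}
    for row in rows:
        groups.setdefault(row["access_route"], []).append(row["citations"])
    return {route: median(cites) for route, cites in groups.items()}

print(median_citations_by_route(records))
```

The same grouping logic applies unchanged however many thousands of rows the real dataset holds.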

From examining the data SN provide, the citation difference between gold OA and green OA as routes to achieving public access to research is negligible.

From a theoretical perspective, providing open access, or at least public access, to a version of a research output could clearly create more measurable impact (e.g. citations, downloads, altmetrics). However, over 130 peer-reviewed studies have previously tested for the existence of the ‘open access citation advantage’ (OACA), and a recent meta-analysis points out that most of them are poorly done. This ‘Going for gold’ SN whitepaper sadly joins the majority of poorly executed studies.

What, then, of costs?

  • arXiv’s running costs famously amount to less than $10 per paper. [I’m conscious that this is a barebones figure that is not sustainable in the long run, and that arXiv sorely needs more financial support from the world’s research institutions]
  • At 2021 list price, hybrid gold APCs at SpringerNature physics journals range from $4,480 at The Astronomy and Astrophysics Review down to $2,690 at Celestial Mechanics and Dynamical Astronomy; the median list price per article to enable open access in SN hybrid physics journals is $2,780.
  • Alternatively, one could calculate the cost of hybrid gold on the basis of the estimated per-article costs contained within ‘transformative agreement’ big deals, as listed at the ESAC registry. Under the 2021 Irish IREL consortium agreement with Springer Nature, Irish institutions covered by that agreement will pay €2,410 per article for open access, which, converted from euros to USD, is $2,727.80 per article.
  • Another interesting comparator group left out of the SN whitepaper is fully open access journals; the SN whitepaper chose to analyse ‘hybrid’ journals exclusively. A suitable physics journal that enables open access for all its articles is SciPost Physics. According to their data, it costs them about 620 euros per article (~$700 USD), and their model is such that there is no author-facing charge (no APC).
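Putting the per-article figures above side by side makes the gap stark. A quick sketch; the arXiv figure is the barebones running-cost estimate, and the IREL figure uses the euro-to-dollar conversion quoted above:

```python
# Approximate per-article costs in USD, from the figures discussed above.
costs = {
    "arXiv (green OA, running cost)": 10.00,
    "SciPost Physics (fully OA, no APC)": 700.00,        # ~620 EUR
    "SN hybrid gold (IREL deal, 2,410 EUR)": 2727.80,
    "SN hybrid gold (median 2021 list APC)": 2780.00,
}

baseline = costs["SciPost Physics (fully OA, no APC)"]
for route, usd in costs.items():
    # Express each route as a multiple of SciPost's per-article cost
    print(f"{route:40s} ${usd:>8,.2f}  ({usd / baseline:5.2f}x SciPost)")
```

On these numbers, the median hybrid gold list price is roughly four times SciPost’s per-article cost, and nearly 280 times arXiv’s.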

So, policymakers: when faced with a choice of enabling public access to research via ‘green’ routes such as arXiv, via fully open access journals such as SciPost Physics, or via hybrid ‘gold’ routes such as SpringerNature hybrid journals, which would you choose? On the basis of the evidence — both what SpringerNature cares to cherry-pick in their report and data external to it — in a world where money is in limited supply, it’s clear to me that green open access and fully open access journals are the better options. Hybrid journals, no matter how much you cherry-pick the data and methods, always come out as the most expensive and the most prone to price gouging going forwards. Spending money on hybrid journals is wasteful, and SpringerNature’s own data (!) actually demonstrates this.