Show me the data!

Central Tendency, Citation Distributions, and Springer Nature (Part 2)

November 19th, 2021 | Posted by rmounce in SpringerNature

“In statistics, a central tendency (or measure of central tendency) is a central or typical value for a probability distribution. Colloquially, measures of central tendency are often called averages. The most common measures of central tendency are the arithmetic mean, the median, and the mode.” — Wikipedia.

In the UK, we teach school kids how to calculate the mean, median, and mode in Year 6 (kids aged 10-11); it’s simple stuff.

If your data is normally distributed then the mean is an appropriate measure of central tendency to describe your data. However, if your data has significant skew and/or big outliers then it is not considered appropriate to report the mean, and instead one should use the median or mode.
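To make that concrete, here is a minimal sketch in Python. The ten citation counts are made up purely for illustration (they are not taken from any real dataset); the point is only to show how a single outlier drags the mean away from the typical value while the median and mode stay put.

```python
from statistics import mean, median, mode

# Hypothetical citation counts for ten articles (invented for illustration,
# not from any real dataset). One heavily cited outlier skews the sample.
citations = [0, 1, 1, 2, 2, 2, 3, 4, 5, 120]

citation_mean = mean(citations)      # pulled upwards by the single outlier
citation_median = median(citations)  # middle of the sorted values
citation_mode = mode(citations)      # most frequently occurring value

print("mean:  ", citation_mean)
print("median:", citation_median)
print("mode:  ", citation_mode)
```

Nine of the ten made-up articles have 5 citations or fewer, yet the mean lands well above all of them; the median and mode both describe the typical article far better.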

You’ll see this advice in countless stats textbooks and websites e.g.

“In a strongly skewed distribution, what is the best indicator of central tendency?
It is usually inappropriate to use the mean in such situations where your data is skewed. You would normally choose the median or mode, with the median usually preferred.” (from Laerd Statistics)

In the Penn State “Elementary Statistics” course they teach that: “For distributions that have outliers or are skewed, the median is often the preferred measure of central tendency because the median is more resistant to outliers than the mean.”

In the SpringerNature “white paper” titled “Going for gold: exploring the reach and impact of Gold open access articles in hybrid journals” by Christina Emery, Mithu Lucraft, Jessica Monaghan, David Stuart, and Susie Winter, the authors examine the distribution of citations to 60,567 individual articles within 1,262 of Springer Nature’s ‘hybrid’ journals. To help understand the central tendency or ‘average’ of citations accrued to articles, the authors of this report frequently chose to refer to and display means. The main figures of the paper (figures 1, 2, and 3) are particularly peculiar as they are bar-chart-style comparisons of means and model predictions.

(image caption) Reproduction of figure 1 from the Springer Nature produced, not-peer-reviewed “white paper” titled “Going for gold: exploring the reach and impact of Gold open access articles in hybrid journals” by Christina Emery, Mithu Lucraft, Jessica Monaghan, David Stuart, and Susie Winter. This work is available for re-use under a Creative Commons Attribution License 4.0, copyright of Emery et al.

Figure 1, Figure 2, and Figure 3 are all textbook examples of misleading statistical malpractice. Beneath the misleading choice of presentation, what we have in Figure 1 is a comparison between the number of citations to 60,567 articles published by SpringerNature, split into three categories: “Non-OA”, “EarlyV”, and “Gold OA”. The “Non-OA” bar represents data about 44,557 articles, the “EarlyV” bar represents data about 8,350 articles, and the “Gold OA” bar represents data about 7,660 articles. Let’s have a look at the actual data, shall we? Below are my histogram frequency density plots of the citation distributions for each of SpringerNature’s categories: “Non-OA”, “EarlyV”, and “Gold OA”:

Full disclosure: for the sake of convenience, the relatively few exceptional papers with more than 40 citations are not plotted. One thing that I hope you’ll immediately notice with all three of these citation distributions is that they are heavily skewed. With the help of the R package ‘e1071’ I calculated the skewness of each of these three distributions. For context, any value larger than 1 or smaller than -1 is considered indicative of a strongly skewed distribution. The “Non-OA” set has a skew of 8.1, the “EarlyV” set has a skew of 6.0, and the “Gold OA” set has a skew of 5.4. All three citation distributions are highly skewed. This level of skew is absolutely to be expected: Per Seglen (1992) termed the typical skew of journal citation distributions “the skewness of science”. Any decent statistician will tell you that you should not represent the central tendency of a highly skewed distribution with the mean, and yet this is exactly what the authors of the SN white paper have chosen to do.
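For readers who want a feel for what a skewness figure means, here is a rough Python sketch. It uses the plain moment-based definition (third central moment divided by the second central moment raised to the power 1.5). That is an assumption on my part about what is being measured: the e1071 package offers several closely related estimators, and this sketch does not attempt to reproduce its exact output. Both samples are invented for illustration.

```python
from statistics import mean

def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5 (an illustrative estimator)."""
    n = len(values)
    centre = mean(values)
    m2 = sum((v - centre) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - centre) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Invented samples: one perfectly symmetric, one with a long right tail
# of the kind citation counts typically show.
symmetric = [1, 2, 3, 4, 5, 6, 7]
long_tailed = [0, 0, 1, 1, 2, 2, 3, 4, 6, 60]

symmetric_skew = skewness(symmetric)
long_tailed_skew = skewness(long_tailed)

print("symmetric sample skew:  ", symmetric_skew)
print("long-tailed sample skew:", long_tailed_skew)
```

The symmetric sample has no skew at all, while the long-tailed one already exceeds the rule-of-thumb threshold of 1 with just a single outlier.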

A more statistically appropriate representation of these three distributions is to use boxplots, inter-quartile ranges, and the median. Here’s how that looks (the black bar indicates the median, which is 4 citations for “Non-OA” and “EarlyV” and is 6 citations for “Gold OA”):

To their credit, they do display a boxplot analysis of this data but I can’t help but notice that they stick it in the Appendix as Figure 4 on page 19 of the PDF! They choose a log-scale for the y-axis whereas here I prefer a normal scale, albeit that choice means that outlier papers with >30 citations are not shown.

Am I concerned about the 2 citation difference in medians, over a period of ~3 years, between “EarlyV” (Green OA) and “Gold OA” (expensive hybrid gold open access at the journal)? No. Why?

1.) SN massively cherry-picked their sample, choosing only 38.5% of the full research articles they could have otherwise included. If we add back in the articles they chose to exclude, who knows what the picture will actually look like.

2.) There’s a huge unaddressed flaw in the “white paper” methodology with respect to author manuscripts made publicly available at repositories. SpringerNature hybrid journals set an ‘embargo’ of either 6 months or 12 months, depending on the journal. Comparing the citation performance of an article that was made immediately (day-0) open access at the journal (their “Gold OA”) with the citation performance of an article which has a parallel copy publicly available only 365 days after the publication date gives the “EarlyV” set much less time for the purported open access benefit to take effect. Effectively it’s an unfair comparison, where the “Gold OA” set has been given an extra six months or a year to accrue citations relative to the eventual public emergence of green OA author manuscripts. But with the advent of the Rights Retention Strategy, whereby author manuscripts can be archived with a zero-day embargo, we may eventually be able to do a ‘fairer’ analysis between the citation benefit of open access provided either at the journal (“Gold”) or at a repository (“Green”).

3.) SN failed to factor in other possible biasing factors which might be co-associated with “Gold OA”, e.g. research funding. If grant funded research from funders with an open access policy tends to be more highly cited than, say, non-grant funded research, or than grant funded research from a funder that does not pay for open access in hybrid journals, then that would bias the results. What this result would really be demonstrating is funder choice for research that tends to be more highly cited, relative to non-grant funded research.

4.) Hybrid Gold Open Access is typically priced as the most expensive way possible of doing open access. Whilst the Max Planck Digital Library appears happy to pay Springer Nature $11,200 for some articles, the rest of the world sensibly will not pay this ransom. There also seems to be no cap on the constant above-inflation price rises of hybrid OA options over time. At current prices, even for ‘cheaper’ SN hybrid journals, most research institutions simply cannot afford to pay for hybrid gold open access at Springer Nature for all their articles. Even if it did somehow garner a tiny citation benefit over a three year period, is it worth $2780 per article? I think not.

5.) Fully open access journals, of which there are over 17,000 listed at DOAJ, are typically both lower in price and often higher in citation performance per article, as I demonstrated with SciPost Physics in Part 1 of this series.

All that SpringerNature have demonstrated with their white paper is alarming statistical illiteracy, and a lack of reproducibility and transparency. Given how popular measures like Clarivate’s Journal Impact Factor are (which is also calculated in a statistically illiterate way), perhaps SpringerNature just decided to run with it anyway, despite the methodological and statistical wrongness? As SPARC notes, the lead author of the report is SN’s Senior Marketing Manager – this “white paper” is pure marketing, not rigorous research.

Pricing, Citation Impact, and Springer Nature (Part 1)

November 17th, 2021 | Posted by rmounce in SpringerNature

On the 26th October 2021, Springer Nature published version 1 of a (not peer-reviewed) “white paper” titled “Going for gold: exploring the reach and impact of Gold open access articles in hybrid journals” by Christina Emery, Mithu Lucraft, Jessica Monaghan, David Stuart, and Susie Winter.

Springer Nature present cherry-picked analyses with an experimental design of their choosing, of 60,567 articles published in 1,262 of their ‘hybrid’ journals.

What is a ‘hybrid’ journal?

A ‘hybrid’ journal is a journal that is predominantly a paywalled subscription journal, albeit that it permits individual articles within that journal to not be paywalled if one of three things happens:

  • (1) the author(s), institution, or funder pays a fee (APC) to ensure permanent openness of an individual article with a Creative Commons license,


  • (2) the journal grants a fee waiver and gives permanent open access to an individual article with a Creative Commons license,


  • (3) the journal just turns off the paywall on an individual article at whim without author-side payment, without a Creative Commons license, and importantly without any assurance of permanence of the turned-off paywall state (so called ‘bronze OA’).

Springer Nature use the results of various analyses to claim, presumably to an audience of governments, policymakers, and researchers, that they should focus their efforts (funding $$$) on their hybrid gold OA journals rather than green OA infrastructure:

 “Efforts which seek to increase the availability of Green OA don’t create the intended benefits and risk delaying or even preventing the take up of full Gold OA and achieving the benefits described above. While sharing of subscription-tied earlier versions can help the dissemination of research, they do not have as strong a reach or impact as full Gold OA, and remain dependent on the continuation of subscription models to fund the costs of editorial and publishing processes to validate and improve the manuscript. As such, we believe investment in Gold OA should be a priority and is the only way to achieve full, immediate and sustainable OA.”

The methods by which they have chosen to do their analysis do not accurately or cleanly tackle the question of which intervention has more impact, ‘gold OA’ or ‘green OA’, let alone the substantial pricing differential between the two. Simply put, the experimental design in the (not peer reviewed) white paper is not a fair or logical comparison.

Extensive cherry-picking

As with any analysis it’s always good to look at the data sources. In the SpringerNature white paper they look at full original research papers (excluding other items found within journals such as Editorials, Letters, News, Correction Notices, Retraction Notices, Book Reviews, Obituaries, Opinions, Research Summaries, “Other Journals in Brief”, “Product News”, et cetera) that are first published online at the journal in the calendar year 2018 (that is from 2018-01-01 to 2018-12-31 inclusive). An interesting quirk of this is that lots of research articles within the first 2018 issue of SN hybrid journals are excluded because they tend to be published online in 2017 and only assigned to a “2018” issue later on. Similarly there are many articles in the SN whitepaper dataset that are in “2019” journal issues but were first published online in 2018, e.g. in December 2018. [That’s not a complaint fwiw, just an observation…]

When I first looked at an exemplar SN hybrid journal, namely Insectes Sociaux, I was shocked to observe a large discrepancy between the number of full research articles in that journal that were published in the calendar year 2018 (53) and the much smaller number of articles from that journal included in SN’s whitepaper dataset (29). By my analysis the whitepaper arbitrarily excludes 24 (=53-29) of the full research articles published in 2018 in this journal.

The SN whitepaper is pseudo-transparent about the selectivity of their sampling. On page 8 they mention:

Only those primary research articles where all the necessary metadata was available were included in the analysis:
• 138,449/157,333 (88%) of the articles were identified as being published in a journal with an impact factor
• 68,668/157,333 (44%) of the articles had a corresponding author that had an identifiable THE institutional ranking and country.

The overlap between these two factors left a final data set of 60,567 records incorporated in the analysis.

Careful readers will observe that 60,567 out of 157,333 amounts to just 38.5% of the set of full research articles in SN hybrid journals, published in the calendar year 2018. It might be okay were this sample a random sample but clearly it is explicitly non-random – it excludes full research articles with corresponding authors from outside the set of 2,112 institutions included in the Times Higher Education (THE) rankings. For context, estimates vary, but there are thought to be at least 31,000 higher education institutions in the world. This bakes-in a significant bias towards Western institutions and does not give a truly global or balanced picture of what’s being published in SN hybrid journals.
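The arithmetic here is easy to check for yourself. A minimal Python sketch, using only the counts quoted from page 8 of the whitepaper (the variable names are my own):

```python
# Counts as quoted from page 8 of the whitepaper.
initial_sample = 157_333      # primary research articles considered
with_impact_factor = 138_449  # published in a journal with an impact factor
with_the_ranking = 68_668     # corresponding author with a THE ranking and country
final_dataset = 60_567        # records actually analysed

def share(part, whole):
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100 * part / whole, 1)

impact_factor_share = share(with_impact_factor, initial_sample)
the_ranking_share = share(with_the_ranking, initial_sample)
final_share = share(final_dataset, initial_sample)
excluded = initial_sample - final_dataset

print(f"impact factor filter keeps {impact_factor_share}%")
print(f"THE ranking filter keeps   {the_ranking_share}%")
print(f"final dataset keeps        {final_share}%")
print(f"articles excluded:         {excluded:,}")
```

The THE ranking filter is what does nearly all of the damage: on its own it discards more than half of the initial sample.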

Their vague description of their selection methodology doesn’t even correspond with the data they’ve excluded. For instance, within Insectes Sociaux, I found this paper (DOI: 10.1007/s00040-018-0616-9) published in April 2018. The journal is a hybrid journal, it has a Clarivate Journal Impact Factor, and the corresponding author affiliation on this paper is “Graduate School of Education, University of the Ryukyus, Japan”. The University of the Ryukyus is one of the lucky 2,112 institutions to be included in the THE rankings, therefore I can’t see why this paper is not included in their dataset of 60,567 articles. The way in which they have whittled down the sample from 157,333 articles to 60,567 is not reproducible and does not appear to match their stated selection criteria.

Via email, I asked the authors of the report for the full list of 157,333 DOIs of the initial sample (just the DOIs, nothing more) and the response from Mithu Lucraft was “I’m not able to release the broader dataset. If you wish to apply for a NDA to utilise a different dataset, I’ll look into the appropriate contact internally for this purpose”. I can’t help but note that the 60,567 dataset is publicly available from figshare under a CC BY license, yet when I ask merely for a list of DOIs pertaining to the very same study it is hinted I would have to apply for an NDA. SpringerNature operate transparency and open licensing only when it suits them. I have no intention of ever signing a non-disclosure agreement with SpringerNature and so I assume I will now have to recreate the list of ~ 157,333 full research articles published in SN hybrid journals in 2018, myself, without their assistance.

A closer look at hybrid gold versus green preprints posted at arXiv, for physics papers

Leaving aside the rampant cherry-picking that has occurred to create the whitepaper dataset, if we drill down into a subject area, e.g. ‘Physics’, we can observe from the dataset the median number of citations of a Physics paper published in 2018 in an SN hybrid journal. When assessed in November 2021, the elapsed period is at most 3 years and 10 months (if published 2018-01-01) and at least 2 years and 10 months (if published 2018-12-31). The median for a paper:

  • that was made gold open access at an SN hybrid journal is 3 citations (across n=315 articles)
  • that is paywalled at the hybrid SN journal but is also publicly accessible via an arXiv preprint copy is 3 citations (across n=838 articles)
  • that is neither open access at the journal, nor publicly accessible via arXiv or other preprint servers or repositories is 2 citations (for n=2103)
  • (this data is not provided by SpringerNature, my own analysis) for the 111 papers published in 2018 at the fully open access journal SciPost Physics, which is NOT published by SpringerNature, the median number of citations is 10 citations

From examining the data SN provide, the citation difference between gold OA and green OA as routes to achieving public access to research is negligible.

Providing open access, or at least public access, to a version of a research output could, from a theoretical perspective, clearly create more measurable impact (e.g. citations, downloads, altmetrics). However, whilst over 130 peer-reviewed studies have previously tested for the existence of the ‘open access citation advantage’ (OACA), a recent meta-analysis points out that most of them are poorly done. This ‘Going for Gold’ SN whitepaper sadly joins the majority of poorly executed studies.

What then with regard to costs?

  • arXiv’s running costs famously amount to less than $10 per paper. [I’m conscious that this is a barebones figure that is not sustainable in the long-run and that arXiv sorely needs more financial support from the world’s research institutions]
  • By 2021 list-price, hybrid gold APCs at SpringerNature physics journals vary from $4480 in The Astronomy and Astrophysics Review to just $2690 in Celestial Mechanics and Dynamical Astronomy; the median list price per article to enable open access in SN hybrid Physics journals is $2780
  • Alternatively, one could calculate the cost of hybrid gold on the basis of estimated per article costs contained within ‘transformative agreement’ big deals as listed at the ESAC registry. In the 2021 Irish IREL consortium agreement with Springer Nature, Irish institutions covered by that agreement will pay €2,410 per article for open access, which converted from Euros to USD is $2727.80 per article.
  • Another interesting comparator group left-out of the SN whitepaper is the existence of fully open access journals. The SN whitepaper chose to provide analysis exclusively of ‘hybrid’ journals. A suitable physics journal that enables open access at the journal for all articles is SciPost Physics. According to their data, it costs them about 620 euros per article (~ $700 USD), and their model is such that there is no author-facing charge (no APC).
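To put those per-article figures side by side, here is a small Python sketch. It only uses the numbers quoted in the list above: the arXiv figure is treated as an upper bound, the SciPost figure is the approximate USD value given, and the exchange rate is simply the one implied by the €2,410 to $2727.80 conversion. The route labels and variable names are my own.

```python
# Per-article costs in USD, as quoted in the list above.
cost_per_article_usd = {
    "arXiv (green OA, upper bound)": 10.00,
    "SciPost Physics (fully OA, approx.)": 700.00,
    "SN hybrid via IREL agreement": 2727.80,
    "SN hybrid median list-price APC": 2780.00,
}

# Exchange rate implied by the EUR-to-USD conversion quoted above.
implied_eur_to_usd = 2727.80 / 2410

# How many times more expensive the median hybrid list price is than each route.
hybrid_list_price = cost_per_article_usd["SN hybrid median list-price APC"]
hybrid_multiple = {
    route: hybrid_list_price / cost
    for route, cost in cost_per_article_usd.items()
}

print(f"implied EUR to USD rate: {implied_eur_to_usd:.4f}")
for route, multiple in hybrid_multiple.items():
    print(f"{route}: hybrid list price is {multiple:.1f}x this cost")
```

Even taking arXiv's figure as a ceiling, the median hybrid list price is hundreds of times higher per paper, and roughly four times the per-article cost reported by SciPost Physics.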

So, policymakers: when faced with a choice of enabling public access to research via ‘green’ routes such as arXiv, or fully open access journals such as SciPost Physics, or hybrid ‘gold’ routes such as SpringerNature hybrid journals, which would you choose? On the basis of the evidence, both what SpringerNature cares to cherry-pick in their report and data external to that, in a world where money is in limited supply, it’s clear to me that green open access and fully open access journals are better options. Hybrid journals, no matter how much you cherry-pick the data and methods, always come out as the most expensive and the most prone to price gouging practices going forwards. Spending money on hybrid journals is wasteful and SpringerNature’s own data (!) actually demonstrates this.