Show me the data!

Author Archives: rmounce

Central Tendency, Citation Distributions, and Springer Nature (Part 2)

November 19th, 2021 | Posted by rmounce in SpringerNature

“In statistics, a central tendency (or measure of central tendency) is a central or typical value for a probability distribution. Colloquially, measures of central tendency are often called averages. The most common measures of central tendency are the arithmetic mean, the median, and the mode.” — Wikipedia.

In the UK, we teach school kids how to calculate the mean, median, and mode in Year 6 (kids aged 10-11); it’s simple stuff.

If your data is normally distributed then the mean is an appropriate measure of central tendency to describe your data. However, if your data has significant skew and/or big outliers then it is not considered appropriate to report the mean, and instead one should use the median or mode.

You’ll see this advice in countless stats textbooks and websites e.g.

“In a strongly skewed distribution, what is the best indicator of central tendency?
It is usually inappropriate to use the mean in such situations where your data is skewed. You would normally choose the median or mode, with the median usually preferred.” — from Laerd Statistics.

In the Penn State “Elementary Statistics” course they teach that: “For distributions that have outliers or are skewed, the median is often the preferred measure of central tendency because the median is more resistant to outliers than the mean.”
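To make the textbook advice concrete, here is a minimal sketch in Python. The citation counts are made up for illustration and are not taken from the white paper dataset; the point is only to show how a single outlier affects the mean and the median differently.

```python
from statistics import mean, median

# Hypothetical citation counts for ten articles (made up for illustration):
# most articles have a handful of citations, one is an extreme outlier.
citations = [0, 1, 1, 2, 2, 3, 3, 4, 5, 120]

mean_all = mean(citations)
median_all = median(citations)

# Drop the single outlier and recompute both measures.
without_outlier = citations[:-1]
mean_trimmed = mean(without_outlier)
median_trimmed = median(without_outlier)

print(f"mean:   {mean_all:.1f} (with outlier) vs {mean_trimmed:.1f} (without)")
print(f"median: {median_all:.1f} (with outlier) vs {median_trimmed:.1f} (without)")
```

In this toy sample the single outlier drags the mean from roughly 2.3 up to 14.1, while the median only moves from 2 to 2.5.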

In the SpringerNature “white paper” titled “Going for gold: exploring the reach and impact of Gold open access articles in hybrid journals” by Christina Emery, Mithu Lucraft, Jessica Monaghan, David Stuart, and Susie Winter, the authors examine the distribution of citations to 60,567 individual articles within 1,262 of Springer Nature’s ‘hybrid’ journals. To convey the central tendency or ‘average’ of citations accrued to articles, the authors of this report frequently chose to refer to and display means. The main figures of the paper (figures 1, 2, and 3) are particularly peculiar as they are bar-chart-style comparisons of means and model predictions.

Figure 1: Reproduction of figure 1 from the Springer Nature produced, not-peer-reviewed “white paper” titled “Going for gold: exploring the reach and impact of Gold open access articles in hybrid journals” by Christina Emery, Mithu Lucraft, Jessica Monaghan, David Stuart, and Susie Winter. This work is available for re-use under a Creative Commons Attribution License 4.0, copyright of Emery et al.

Figure 1, Figure 2, and Figure 3 are all textbook examples of misleading statistical malpractice. Beneath the misleading choice of presentation, what we have in Figure 1 is a comparison between the number of citations to 60,567 articles published by SpringerNature, split into three categories: “Non-OA”, “EarlyV”, and “Gold OA”. The “Non-OA” bar represents data about 44,557 articles, the “EarlyV” bar represents data about 8,350 articles, and the “Gold OA” bar represents data about 7,660 articles. Let’s have a look at the actual data, shall we? Below are my histogram frequency density plots of the citation distributions for each of SpringerNature’s categories: “Non-OA”, “EarlyV”, and “Gold OA”:

Full disclosure: for the sake of convenience, the relatively few exceptional papers with citations >40 are not plotted. One thing that I hope you’ll immediately notice with all three of these citation distributions is that they are heavily skewed. With the help of the R package ‘e1071’ I calculated the skewness of each of these three distributions. For context, any value larger than 1 or smaller than -1 is considered indicative of a strongly skewed distribution. The “Non-OA” set has a skew of 8.1, the “EarlyV” set has a skew of 6.0, and the “Gold OA” set has a skew of 5.4. All three citation distributions are highly skewed. This level of skew is absolutely to be expected. Per Seglen (1992) termed the typical skew of journal citation distributions “the skewness of science”. Any decent statistician will tell you that you should not represent the central tendency of a highly skewed distribution with the mean, and yet this is exactly what the authors of the SN white paper have chosen to do.
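For readers who want to see what a skewness figure measures, here is a minimal sketch in Python of the simple moment-based skewness coefficient (third central moment divided by the 1.5 power of the second). This is an illustrative stand-in, not the exact calculation behind the numbers above: the ‘e1071’ package offers several closely related estimators, so its output may differ slightly from this one. The sample data is made up.

```python
def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5.

    A value near 0 suggests a roughly symmetric distribution; values
    above 1 (or below -1) are commonly read as strongly skewed.
    """
    n = len(values)
    centre = sum(values) / n
    m2 = sum((x - centre) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - centre) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical data for illustration only (not from the white paper dataset).
symmetric = [1, 2, 3, 4, 5]
long_right_tail = [0, 1, 1, 2, 2, 3, 3, 4, 5, 120]

print(f"symmetric sample:       {skewness(symmetric):.2f}")
print(f"long right tail sample: {skewness(long_right_tail):.2f}")
```

A citation distribution where most articles have few citations and a handful have very many behaves like the second sample: a large positive skew.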

A more statistically appropriate representation of the three distributions is to use boxplots, inter-quartile ranges, and the median. Here’s how that looks (the black bar indicates the median, which is 4 citations for “Non-OA” and “EarlyV” and is 6 citations for “Gold OA”):

To their credit, they do display a boxplot analysis of this data, but I can’t help but notice that they stick it in the Appendix as Figure 4 on page 19 of the PDF! They choose a log scale for the y-axis whereas here I prefer a linear scale, albeit that choice means that outlier papers with >30 citations are not shown.

Am I concerned about the 2 citation difference in medians, over a period of ~3 years, between “EarlyV” (Green OA) and “Gold OA” (expensive hybrid gold open access at the journal)? No. Why?

1.) SN massively cherry-picked their sample, choosing only 38.5% of the full research articles they could have otherwise included. If we add back in the articles they chose to exclude, who knows what the picture will actually look like.

2.) There’s a huge unaddressed flaw in the “white paper” methodology with respect to author manuscripts made publicly available at repositories. SpringerNature hybrid journals set an ‘embargo’ of either 6 months or 12 months, depending on the journal. Comparing the citation performance of an article that was made immediately (day-0) open access at the journal (their “Gold OA”) with the citation performance of an article which has a parallel copy publicly available only 365 days after the publication date gives the “EarlyV” set much less time for the purported open access benefit to take effect. Effectively it’s an unfair comparison where the “Gold OA” set has been given an extra six months or a year to accrue citations relative to the eventual public emergence of green OA author manuscripts. But with the advent of the Rights Retention Strategy, whereby author manuscripts can be archived with a zero-day embargo, we may eventually be able to do a ‘fairer’ analysis of the citation benefit of open access provided either at the journal (“Gold”) or at a repository (“Green”).

3.) SN failed to factor in other possible biasing factors which might be co-associated with “Gold OA”, e.g. research funding. If grant funded research from funders with an open access policy tends to be more highly cited than, say, non-grant funded research, or grant funded research from a funder that does not pay for open access in hybrid journals, then that would bias the results. What this result would really be demonstrating is funder choice for research that tends to be more highly cited, relative to non-grant funded research.

4.) Hybrid Gold Open Access is typically priced as the most expensive way possible of doing open access. Whilst the Max Planck Digital Library appears happy to pay Springer Nature $11,200 for some articles, the rest of the world sensibly will not pay this ransom. There also seems no cap on the constant above-inflation price rises of hybrid OA options over time. At current prices, even for ‘cheaper’ SN hybrid journals, most research institutions simply cannot afford to pay for hybrid gold open access at Springer Nature for all their articles. Even if it did somehow garner a tiny citation benefit over a three year period, is it worth $2780 per article? I think not.

5.) Fully open access journals, of which there are over 17,000 listed at DOAJ, are typically both lower in price and often higher in citation performance per article, as I demonstrated with SciPost Physics in Part 1 of this series.

All that SpringerNature have demonstrated with their white paper is alarming statistical illiteracy, and a lack of reproducibility and transparency. Given how popular measures like Clarivate’s Journal Impact Factor are (which is also calculated in a statistically illiterate way), perhaps SpringerNature just decided to run with it anyway, despite the methodological and statistical wrongness? As SPARC notes, the lead author of the report is SN’s Senior Marketing Manager – this “white paper” is pure marketing, not rigorous research.

Pricing, Citation Impact, and Springer Nature (Part 1)

November 17th, 2021 | Posted by rmounce in SpringerNature

On the 26th October 2021, Springer Nature published version 1 of a (not peer-reviewed) “white paper” titled “Going for gold: exploring the reach and impact of Gold open access articles in hybrid journals” by Christina Emery, Mithu Lucraft, Jessica Monaghan, David Stuart, and Susie Winter.

Springer Nature present cherry-picked analyses with an experimental design of their choosing, of 60,567 articles published in 1,262 of their ‘hybrid’ journals.

What is a ‘hybrid’ journal?

A ‘hybrid’ journal is a journal that is predominantly a paywalled subscription journal, albeit that it permits individual articles within that journal to not be paywalled if one of three things happens:

  • (1) the author(s), institution, or funder pays a fee (APC) to ensure permanent openness of an individual article with a Creative Commons license,


  • (2) the journal grants a fee waiver and gives permanent open access to an individual article with a Creative Commons license,


  • (3) the journal just turns off the paywall on an individual article at whim without author-side payment, without a Creative Commons license, and importantly without any assurance of permanence of the turned-off paywall state (so called ‘bronze OA’).

Springer Nature use the results of various analyses to claim, presumably to an audience of governments, policymakers, and researchers, that they should focus their efforts (funding $$$) on their hybrid gold OA journals rather than green OA infrastructure:

 “Efforts which seek to increase the availability of Green OA don’t create the intended benefits and risk delaying or even preventing the take up of full Gold OA and achieving the benefits described above. While sharing of subscription-tied earlier versions can help the dissemination of research, they do not have as strong a reach or impact as full Gold OA, and remain dependent on the continuation of subscription models to fund the costs of editorial and publishing processes to validate and improve the manuscript. As such, we believe investment in Gold OA should be a priority and is the only way to achieve full, immediate and sustainable OA.”

The methods by which they have chosen to do their analysis do not accurately or cleanly tackle the question of which intervention has more impact, ‘gold OA’ or ‘green OA’, let alone the substantial pricing differential between the two. Simply put, the experimental design in the (not peer reviewed) white paper is not a fair or logical comparison.

Extensive cherry-picking

As with any analysis it’s always good to look at the data sources. In the SpringerNature white paper they look at full original research papers (excluding other items found within journals such as Editorials, Letters, News, Correction Notices, Retraction Notices, Book Reviews, Obituaries, Opinions, Research Summaries, “Other Journals in Brief”, “Product News”, et cetera) that are first published online at the journal in the calendar year 2018 (that is, from 2018-01-01 to 2018-12-31 inclusive). An interesting quirk of this is that lots of research articles within the first 2018 issue of SN hybrid journals are excluded because they tend to be published online in 2017 and only assigned to a “2018” issue later on. Similarly there are many articles in the SN whitepaper dataset that are in “2019” journal issues but were first published online in 2018, e.g. in December 2018. [That’s not a complaint fwiw, just an observation…]

When I first looked at an exemplar SN hybrid journal, namely Insectes Sociaux, I was shocked to observe a large discrepancy between the number of full research articles in that journal that were published in the calendar year 2018 (53) and the much smaller number of articles from that journal included in SN’s whitepaper dataset (29). By my analysis the whitepaper arbitrarily excludes 24 (=53-29) of the full research articles published in 2018 in this journal.

The SN whitepaper is pseudo-transparent about the selectivity of their sampling. On page 8 they mention:

“Only those primary research articles where all the necessary metadata was available were included in the analysis:
  • 138,449/157,333 (88%) of the articles were identified as being published in a journal with an impact factor
  • 68,668/157,333 (44%) of the articles had a corresponding author that had an identifiable THE institutional ranking and country.

The overlap between these two factors left a final data set of 60,567 records incorporated in the analysis.”

Careful readers will observe that 60,567 out of 157,333 amounts to just 38.5% of the set of full research articles in SN hybrid journals, published in the calendar year 2018. It might be okay were this sample a random sample but clearly it is explicitly non-random – it excludes full research articles with corresponding authors from outside the set of 2,112 institutions included in the Times Higher Education (THE) rankings. For context, estimates vary, but there are thought to be at least 31,000 higher education institutions in the world. This bakes-in a significant bias towards Western institutions and does not give a truly global or balanced picture of what’s being published in SN hybrid journals.
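The percentages quoted above are easy to check. The sketch below, in Python, uses only the four counts stated in the white paper excerpt; the variable and function names are my own.

```python
def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole


# Counts as quoted from page 8 of the white paper.
initial_sample = 157_333      # full research articles, SN hybrid journals, 2018
with_impact_factor = 138_449  # published in a journal with an impact factor
with_the_ranking = 68_668     # corresponding author with a THE ranking and country
final_dataset = 60_567        # records actually analysed

excluded = initial_sample - final_dataset

print(f"impact factor filter: {percent(with_impact_factor, initial_sample):.1f}%")
print(f"THE ranking filter:   {percent(with_the_ranking, initial_sample):.1f}%")
print(f"final dataset:        {percent(final_dataset, initial_sample):.1f}%")
print(f"articles excluded:    {excluded}")
```

The final dataset works out at about 38.5% of the initial sample, which means 96,766 full research articles were left out of the analysis.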

Their vague description of their selection methodology doesn’t even correspond with the data they’ve excluded. For instance, within Insectes Sociaux, I found this paper (DOI: 10.1007/s00040-018-0616-9) published in April 2018. The journal is a hybrid journal, it has a Clarivate Journal Impact Factor, and the corresponding author affiliation on this paper is “Graduate School of Education, University of the Ryukyus, Japan”. The University of the Ryukyus is one of the lucky 2,112 institutions to be included in the THE rankings, therefore I can’t see why this paper is not included in their dataset of 60,567 articles. The way in which they have whittled down the sample from 157,333 articles to 60,567 is not reproducible and does not appear to match their stated selection criteria.

Via email, I asked the authors of the report for the full list of 157,333 DOIs of the initial sample (just the DOIs, nothing more) and the response from Mithu Lucraft was “I’m not able to release the broader dataset. If you wish to apply for a NDA to utilise a different dataset, I’ll look into the appropriate contact internally for this purpose”. I can’t help but note that the 60,567 dataset is publicly available from figshare under a CC BY license, yet when I ask merely for a list of DOIs pertaining to the very same study it is hinted I would have to apply for an NDA. SpringerNature operate transparency and open licensing only when it suits them. I have no intention of ever signing a non-disclosure agreement with SpringerNature and so I assume I will now have to recreate the list of ~ 157,333 full research articles published in SN hybrid journals in 2018, myself, without their assistance.

A closer look at hybrid gold versus green preprints posted at arXiv, for physics papers

Leaving aside the rampant cherry-picking that has occurred to create the whitepaper dataset, if we drill down into a subject area, e.g. ‘Physics’, we can observe from the dataset the median number of citations of a Physics paper published in 2018 in an SN hybrid journal. Citations were assessed in November 2021, an elapsed period of at most 3 years and 10 months (if published 2018-01-01) and at least 2 years and 10 months (if published 2018-12-31). The median for a paper:

  • that was made gold open access at an SN hybrid journal is 3 citations (across n=315 articles)
  • that is paywalled at the hybrid SN journal but is also publicly accessible via an arXiv preprint copy is 3 citations (across n=838 articles)
  • that is neither open access at the journal, nor publicly accessible via arXiv or other preprint servers or repositories is 2 citations (for n=2103)
  • for comparison (my own analysis; this data is not provided by SpringerNature): for the 111 papers published in 2018 at the fully open access journal SciPost Physics, which is NOT published by SpringerNature, the median number of citations is 10

From examining the data SN provide, the citation difference between gold OA and green OA as routes to achieving public access to research is negligible.

From a theoretical perspective, providing open access, or at least public access, to a version of a research output could clearly create more measurable impact (e.g. citations, downloads, altmetrics). Over 130 peer-reviewed studies have previously tested for the existence of the ‘open access citation advantage’ (OACA), but a recent meta-analysis points out that most of them are poorly done. This ‘Going for Gold’ SN whitepaper sadly joins the majority of poorly executed studies.

What then with regard to costs?

  • arXiv’s running costs famously amount to less than $10 per paper. [I’m conscious that this is a barebones figure that is not sustainable in the long run and that arXiv sorely needs more financial support from the world’s research institutions]
  • By 2021 list price, hybrid gold APCs in SpringerNature physics journals vary from $4480 in The Astronomy and Astrophysics Review to just $2690 in Celestial Mechanics and Dynamical Astronomy; the median list price per article to enable open access in SN hybrid Physics journals is $2780.
  • Alternatively, one could calculate the cost of hybrid gold on the basis of estimated per article costs contained within ‘transformative agreement’ big deals as listed at the ESAC registry. In the 2021 Irish IREL consortium agreement with Springer Nature, Irish institutions covered by that agreement will pay €2,410 per article for open access, which converted from Euros to USD is $2727.80 per article.
  • Another interesting comparator group left-out of the SN whitepaper is the existence of fully open access journals. The SN whitepaper chose to provide analysis exclusively of ‘hybrid’ journals. A suitable physics journal that enables open access at the journal for all articles is SciPost Physics. According to their data, it costs them about 620 euros per article (~ $700 USD), and their model is such that there is no author-facing charge (no APC).
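To put these per-article figures side by side, here is a rough sketch in Python. The per-article costs are the ones quoted in the list above (with arXiv’s “less than $10” treated as an upper bound of $10); the $100,000 budget is a purely hypothetical figure chosen for illustration.

```python
# Hypothetical budget, chosen only to make the comparison tangible.
BUDGET_USD = 100_000

# Per-article costs in USD as quoted in the text above.
cost_per_article_usd = {
    "arXiv (green, upper bound)": 10.00,
    "SciPost Physics (fully OA)": 700.00,
    "SN hybrid via IREL agreement": 2727.80,
    "SN hybrid physics median list APC": 2780.00,
}

# Whole articles that the budget would cover by each route.
articles_per_budget = {
    route: int(BUDGET_USD // cost) for route, cost in cost_per_article_usd.items()
}

for route, count in articles_per_budget.items():
    print(f"{route:36s} {count:>6d} articles")
```

On these numbers the same budget covers around 10,000 articles at arXiv, 142 at SciPost Physics, and only 35 or 36 via Springer Nature hybrid gold.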

So, policymakers, when faced with a choice of enabling public access to research via ‘green’ routes such as arXiv, fully open access journals such as SciPost Physics, or hybrid ‘gold’ routes such as SpringerNature hybrid journals, which would you choose? On the basis of the evidence, both what SpringerNature cares to cherry-pick in their report and data external to that, in a world where money is in limited supply, it’s clear to me that green open access and fully open access journals are better options. Hybrid journals, no matter how much you cherry-pick the data and methods, always come out as the most expensive and the most prone to price gouging practices going forwards. Spending money on hybrid journals is wasteful and SpringerNature’s own data (!) actually demonstrates this.

Suppression as a form of liberation?

July 3rd, 2020 | Posted by rmounce in Research Assessment

On Monday 29th June 2020, I learned from Retraction Watch that Clarivate, the for-profit proprietor of Journal Impact Factor ™, has newly “suppressed” 33 journals from their indexing service. The immediate consequence of this “suppression” is that these 33 journals do not get assigned an official Clarivate Journal Impact Factor ™. Clarivate justify this action on the basis of “anomalous citation patterns” but give little further detail for each of the journals other than the overall “% Self-cites” of the journal, and the effect of those self-cites on Clarivate’s citation-based ranking of journals (% Distortion of category rank).

Amongst the 33 journals, I spotted not one but two systematics journals that I know very well: Zootaxa and the International Journal of Systematic and Evolutionary Microbiology (IJSEM).

I have read, cited, and analysed (textmining and image analysis) articles from both of these journals extensively. Chapter 6 of my PhD thesis mined over 12,000 Zootaxa articles looking for phylogenetic data. In a more recent work published in Research Ideas and Outcomes (RIO Journal), I mined over 5,800 IJSEM articles for phylogenetic tree data. Of relevance, I should also say I was a council member of the Systematics Association for many years.

Given the experiences listed above, I am therefore very well placed to say that what Clarivate has done to these two systematics journals is utter brainless idiocy.

The reason why Zootaxa articles cite quite a high proportion of other Zootaxa articles (“self-citation” at the journal level, from Clarivate’s point of view) is obvious to anyone in the discipline. Zootaxa is an important ‘megajournal’ for the zoological systematics community. According to data, Zootaxa published over 5,000 items (articles and monographs) between 2018 and 2019. Clarivate’s own records from another one of their proprietary analytics services called ‘Zoological Record’ indicate that 26.57% of all new zoological taxa are published in Zootaxa. For many decades descriptive taxonomy has been pushed out of for-profit journals. Zootaxa is a vital refugium for sound science in a poorly funded discipline.

The case for legitimate ‘high’ journal-level self-citation at International Journal of Systematic and Evolutionary Microbiology (IJSEM) is even clearer and easier to explain. The International Code of Nomenclature of Prokaryotes (ICNP) requires that all new bacteria names are published in IJSEM and nowhere else (a very sensible idea which the bacteriology community should be commended for). Hence a lot of the systematic and evolutionary microbiology articles in IJSEM will cite prior IJSEM articles.

Wayne Maddison has commented on Twitter that the researchers hardest hit by this action might be those in developing countries. I agree. The problem here is that many institutions and research funders idiotically use the Journal Impact Factor ™ to assess the quality of an individual’s research output. In some regimes, if a researcher publishes a paper in a journal that has a Journal Impact Factor ™ then it ‘counts’, whereas if a researcher publishes a paper in a journal that has not been given an official Journal Impact Factor ™ by Clarivate then that paper may not ‘count’ towards the assessment of that researcher.

The zoology section of the Chilean Society of Biology has already petitioned Clarivate to unsuppress Zootaxa, to give it back its Journal Impact Factor ™. I understand why they would do this, but I would actually call for something quite different and more far-reaching.

I would encourage all systematists, taxonomists, zoologists, microbiologists, and biologists in general to see the real problem here: Clarivate, a for-profit analytics company, should never be so relied-upon by research evaluation committees to arbitrarily decide the value of a research output. Especially given that the Journal Impact Factor ™ is untransparent, irreproducible, and fundamentally statistically illiterate.

Thus to bring us back to my title. I wonder if Clarivate’s wacky “suppression” might actually be a pathway to liberation from the inappropriate stupidity of using Journal Impact Factor ™ to evaluate individual research outputs. Given we have all now witnessed just how brainless some of Clarivate’s decision making is, I would ask Clarivate to please “suppress” all journals thereby removing the harmful stupidity of Journal Impact Factor ™ from the lives of researchers.

Referring Elsevier/RELX to the Advertising Standards Authority

May 14th, 2018 | Posted by rmounce in Paywall Watch

In late 2016, Martin Eve, Stuart Lawson and Jon Tennant referred Elsevier/RELX to the Competition and Markets Authority. Inspired by this, I thought I would try referring a complaint to the UK Advertising Standards Authority (ASA) about some blatant fibbing I saw Elsevier engage in with their marketing spiel at a recent conference.

The content of my submission is below:

Name: Ross Mounce

Ad type: Leaflets, flyers and circulars

Brand/product: Elsevier

Date: 26th February 2018

Your complaint:
Elsevier, a large academic publishing company, have flyers and a large poster, both containing the same information at the Researcher to Reader Conference (British Medical Association House, London). They claim on both the flyers and the poster that “Fact #2: Our APC prices are value for money Our APC prices range from $150 – $5000 US dollars…” [APC means Article Processing Charge, a publishing service for academic customers] I believe this is false advertising as some of their journals clearly charge $5200 US dollars as an APC. $5200 is greater than the maximum of $5000 advertised. They also report these prices without VAT added-on, this is also misleading as this meeting is in the UK. UK customers choosing this service would have to pay the APC plus VAT tax and so the prices should be displayed inclusive of taxes in adverts like this. There is no mention of the need to pay VAT on either the flyers or the poster. I went to their website the same day and found thirteen journals published at Elsevier, that by Elsevier’s own price list charge $5200 US dollars, not including VAT. Those journals are: Cancer Cell, Cell, Cell Chemical Biology, Cell Host & Microbe, Cell Metabolism, Cell Stem Cell, Cell Systems, Current Biology, Developmental Cell, Immunity, Molecular Cell, Neuron, and Structure. For reference I have attached a PDF of Elsevier’s online price list which I downloaded from Elsevier’s official website here: which takes one to this PDF URL:

I attached images of the offending poster and flyers. Below is a photo I took of the misleading flyer:

Misleading Elsevier Flyer

I am pleased to announce that the UK Advertising Standards Authority upheld my complaint.

Here is their reply:

ASA Enquiry Ref: A18-443580 – RELX (UK) Ltd t/a Elsevier

Dear Dr Mounce,

Thank you for contacting the Advertising Standards Authority (ASA).

Your Complaint: RELX (UK) Ltd t/a Elsevier

I understand from your complaint that you felt that Elsevier’s advertising was misleading because it did not accurately reflect the price range of their products and they do not quote prices with VAT.  Please note that we have only reviewed the leaflet which you forwarded to us, because we considered that the sign constituted point of sale material, which is not covered by our Codes.

We have concluded that the leaflet was likely to have breached the Advertising Rules that we apply and I am writing to let you know that we have taken steps to address this.

We have explained your concerns to the advertiser and provided guidance to them on the areas that require attention, together with advice on how to ensure that their advertising complies with the Codes.

Comments such as yours help us to understand the issues that matter to consumers and we will keep a record of your complaint on file for use in future monitoring. If you would like more information about our complaint handling principles, please visit our website here.

Thank you once again for contacting us with your concerns.

Yours sincerely,


Damson Warner-Allen

Complaints Executive

Direct line 020 7492 2173

Advertising Standards Authority

Mid City Place, 71 High Holborn

London WC1V 6QT

Telephone 020 7492 2222

I am thrilled that the Advertising Standards Authority has officially upheld my complaint, and I encourage others who notice similar problems with the business practices of Elsevier and other academic publishers to come forward with further complaints. These companies are not immune to regulation – they must abide by the law at all times. The punishment for now is just a slap-on-the-wrist, but if they are consistently caught misadvertising, stronger punishments can and would be meted out. Perhaps now is the time for more regulators to start seriously investigating complaints about these richly profitable publishing companies with dubious business practices? Watch this space…