“In statistics, a central tendency (or measure of central tendency) is a central or typical value for a probability distribution. Colloquially, measures of central tendency are often called averages. The most common measures of central tendency are the arithmetic mean, the median, and the mode.” — Wikipedia.
In the UK, we teach school kids how to calculate the mean, median, and mode in Year 6 (ages 10-11); it’s simple stuff.
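That Year 6 arithmetic takes a few lines with Python’s standard-library `statistics` module. The numbers below are invented purely for illustration, not data from any paper:

```python
import statistics

# A small, invented sample (not data from the white paper)
values = [2, 3, 3, 4, 5, 7, 11]

print(statistics.mean(values))    # 5 (sum 35 divided by count 7)
print(statistics.median(values))  # 4 (the middle of the sorted values)
print(statistics.mode(values))    # 3 (the most frequent value)
```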
If your data is normally distributed, then the mean is an appropriate measure of central tendency to describe it. However, if your data has significant skew and/or large outliers, then it is not considered appropriate to report the mean; instead, one should use the median or mode.
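A one-line illustration of why: a single heavily cited outlier drags the mean far away from the typical value while leaving the median untouched. A minimal Python sketch, again on invented numbers:

```python
import statistics

# Nine typical papers plus one heavily cited outlier (invented numbers)
citations = [1, 1, 2, 2, 3, 3, 4, 4, 5, 500]

print(statistics.mean(citations))    # 52.5, dominated by the one outlier
print(statistics.median(citations))  # 3.0, the typical paper
```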
You’ll see this advice in countless stats textbooks and websites, e.g.:
“In a strongly skewed distribution, what is the best indicator of central tendency?
It is usually inappropriate to use the mean in such situations where your data is skewed. You would normally choose the median or mode, with the median usually preferred.” — from Laerd Statistics.
In the Penn State “Elementary Statistics” course they teach that: “For distributions that have outliers or are skewed, the median is often the preferred measure of central tendency because the median is more resistant to outliers than the mean.”
In the SpringerNature “white paper” titled “Going for gold: exploring the reach and impact of Gold open access articles in hybrid journals” by Christina Emery, Mithu Lucraft, Jessica Monaghan, David Stuart, and Susie Winter, the authors examine the distribution of citations to 60,567 individual articles within 1,262 of Springer Nature’s ‘hybrid’ journals. To help understand the central tendency or ‘average’ of citations accrued to articles, the authors of this report frequently chose to refer to and display means. The main figures of the paper (figures 1, 2, and 3) are particularly peculiar, as they are bar-chart-style comparisons of means and model predictions.
Figure 1, Figure 2, and Figure 3 are all textbook examples of misleading statistical malpractice. Beneath the misleading choice of presentation, what we have in Figure 1 is a comparison between the number of citations to 60,567 articles published by SpringerNature, split into three categories: “Non-OA”, “EarlyV”, and “Gold OA”. The “Non-OA” bar represents data about 44,557 articles, the “EarlyV” bar represents data about 8,350 articles, and the “Gold OA” bar represents data about 7,660 articles. Let’s have a look at the actual data, shall we? Below are my histogram frequency density plots of the citation distributions for each of SpringerNature’s categories: “Non-OA”, “EarlyV”, and “Gold OA”:
Full disclosure: for the sake of convenience, the relatively few exceptional papers with citations >40 are not plotted. One thing that I hope you’ll immediately notice about all three of these citation distributions is that they are heavily skewed. With the help of the R package ‘e1071’ I calculated the skewness of each of these three distributions. For context, any value larger than 1 or smaller than -1 is considered indicative of a strongly skewed distribution. The “Non-OA” set has a skew of 8.1, the “EarlyV” set has a skew of 6.0, and the “Gold OA” set has a skew of 5.4. All three citation distributions are highly skewed. This level of skew is absolutely to be expected: Per Seglen (1992) termed the typical skew of journal citation distributions “the skewness of science”. Any decent statistician will tell you that you should not represent the central tendency of a highly skewed distribution with the mean, and yet this is exactly what the authors of the SN white paper have chosen to do.
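For readers without R to hand, the same moment-based check is easy to sketch in Python. The sample below is invented for illustration (not SN’s data), and note that e1071’s default estimator applies a small finite-sample correction that this plain g1 version omits:

```python
# Simple moment-based sample skewness, g1 = m3 / m2**1.5
# (e1071's default estimator differs only by a finite-sample factor)
def skewness(xs):
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in xs) / n  # third central moment
    return m3 / m2 ** 1.5

# An invented, citation-like sample: many low counts, a long right tail
sample = [0] * 20 + [1] * 15 + [2] * 10 + [3] * 5 + [10, 20, 40]
print(skewness(sample))  # well above 1, i.e. strongly right-skewed
```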
A more statistically appropriate representation of the three distributions is to use boxplots, inter-quartile ranges, and the median. Here’s how that looks (the black bar indicates the median, which is 4 citations for “Non-OA” and “EarlyV”, and 6 citations for “Gold OA”):
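Those summary statistics are cheap to compute with Python’s standard library; a sketch on an invented sample standing in for one journal category, not SN’s data:

```python
import statistics

# An invented right-skewed sample standing in for one journal category
citations = [0, 1, 1, 2, 3, 4, 4, 5, 6, 8, 12, 35]

median = statistics.median(citations)
q1, _, q3 = statistics.quantiles(citations, n=4)  # quartiles, exclusive method

print(median)   # 4.0
print(q3 - q1)  # 6.25, the inter-quartile range; the outlier barely matters
```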
To their credit, they do display a boxplot analysis of this data, but I can’t help noticing that they tuck it away in the Appendix as Figure 4, on page 19 of the PDF! They chose a log scale for the y-axis, whereas here I prefer a linear scale, though that choice means outlier papers with >30 citations are not shown.
Am I concerned about the 2-citation difference in medians, over a period of ~3 years, between “EarlyV” (Green OA) and “Gold OA” (expensive hybrid gold open access at the journal)? No. Why not?
1.) SN massively cherry-picked their sample, choosing only 38.5% of the full research articles they could otherwise have included. If we add back the articles they chose to exclude, who knows what the picture would actually look like.
2.) There’s a huge unaddressed flaw in the “white paper” methodology with respect to author manuscripts made publicly available at repositories. SpringerNature hybrid journals set an ‘embargo’ of either 6 or 12 months, depending on the journal. Comparing the citation performance of an article made immediately (day-0) open access at the journal (their “Gold OA”) against that of an article whose parallel copy became publicly available only 6 or 12 months after publication gives the “EarlyV” set much less time for any purported open access benefit to take effect. Effectively it’s an unfair comparison: the “Gold OA” set has been given an extra six or twelve months to accrue citations, relative to the eventual public emergence of green OA author manuscripts. But with the advent of the Rights Retention Strategy, whereby author manuscripts can be archived with a zero-day embargo, we may eventually be able to do a ‘fairer’ analysis of the citation benefit of open access provided either at the journal (“Gold”) or at a repository (“Green”).
3.) SN failed to factor in other possible biasing factors that might be co-associated with “Gold OA”, e.g. research funding. If grant-funded research from funders with an open access policy tends to be more highly cited than, say, non-grant-funded research, or than grant-funded research from a funder that does not pay for open access in hybrid journals, then that would bias the results. In that case, the result would really be demonstrating that such funders select for research that tends to be more highly cited, not any citation benefit of “Gold OA” itself.
4.) Hybrid gold open access is typically priced as the most expensive possible way of doing open access. Whilst the Max Planck Digital Library appears happy to pay Springer Nature $11,200 for some articles, the rest of the world sensibly will not pay this ransom. There also seems to be no cap on the constant above-inflation price rises of hybrid OA options over time. At current prices, even for ‘cheaper’ SN hybrid journals, most research institutions simply cannot afford to pay for hybrid gold open access at Springer Nature for all their articles. Even if it did somehow garner a tiny citation benefit over a three-year period, is it worth $2,780 per article? I think not.
5.) Fully open access journals, of which there are over 17,000 listed in DOAJ, are typically both lower in price and often higher in citation performance per article, as I demonstrated with SciPost Physics in Part 1 of this series.
All that SpringerNature have demonstrated with their white paper is alarming statistical illiteracy, and a lack of reproducibility and transparency. Given how popular measures like Clarivate’s Journal Impact Factor are (which is also just a mean over a highly skewed citation distribution), perhaps SpringerNature simply decided to run with it anyway, despite the methodological and statistical wrongness. As SPARC notes, the lead author of the report is SN’s Senior Marketing Manager: this “white paper” is pure marketing, not rigorous research.