Show me the data!

## I am supporting RIO Journal. I think you should too

September 1st, 2015 | Posted by in Generation Open | Open Access | Open Data | Open Science | Publications - (0 Comments)

Today (2015-09-01) marks the public announcement of Research Ideas & Outcomes (RIO for short), a new open access journal for all disciplines that seeks to open up the entire research cycle with some truly novel features.

I know what you might be thinking: Another open access journal? Really?

Neither I nor Daniel Mietchen would be involved with this project if it were just another boring open access journal. This journal packs a mighty combination of novel features into one platform:

• 1.) RIO will publish research proposals, as well as regular research outputs such as articles, data papers and software – this has never been done by a journal before to my knowledge
• 2.) RIO will label research outputs with ‘Impact Categories’ based upon UN Millennium Development Goals (MDGs) and EU Societal Challenges, to highlight the real-world relevance of research and to better link-up research across disciplines (see below for some example MDGs).

• 3.) RIO supports a variety of different types of peer-review, including ‘pre-submission, author-facilitated, external peer-review’ (new), as well as post-publication journal-organized open peer-review (similar to that pioneered by F1000Research), and ‘spontaneous’ (not journal-organized) post-publication open peer-review, which is actively encouraged. All peer-review will be open/public, in keeping with the journal’s overall guiding philosophy of increasing transparency and reducing waste in the research cycle. Reviewer comments are highly valuable and it is a waste not to make them public, so when supplied, all reviewer comments will be made openly available.
• 4.) RIO offers flexibility in publishing services and pricing in a bold attempt to ‘decouple’ the traditional scholarly journal into its component services. Authors & funders may thus choose to pay for the publishing services they actually want, not an inflexible bundle of different services, as is the case at most journals.

Source: Priem, J. and Hemminger, B. M. 2012. Decoupling the scholarly journal. Frontiers in Computational Neuroscience. Image licensed under CC BY-NC.

• 5.) On the technical side of things, RIO uses ARPHA, an integrated, end-to-end, XML-backed publication system for Authoring, Reviewing, Publishing, Hosting, and Archiving. As a publishing geek this excites me greatly, as it eliminates the need for typesetting, ensuring a smooth and low-cost publishing process. Reviewers can make comments inline or more generally over the entire manuscript, on the very same document and platform the authors wrote in, much like Google Docs. This has been successfully tried and tested for years at the Biodiversity Data Journal and the system is now ready for wider use.

For the above reasons and more, I’m hugely excited about this journal and am delighted to be one of its founding editors alongside Dr Daniel Mietchen. See our growing list of Advisory and Editorial Board members for insight into who else is backing this new journal – we’ve got some great people on board already! If you’re interested in supporting this initiative, please do enquire about volunteering as an editor; we need more editors to support the broad scale and ambition of the journal. You can apply via the main website here.

## What is Journal Visibility?

August 28th, 2015 | Posted by in Publications - (2 Comments)

I’ve just read a paper published in Systematic Biology called ‘A Falsification of the Citation Impediment in the Taxonomic Literature‘.

Having read the full paper many times, including the 64-page PDF supplementary materials file, I’m amazed the paper was published in its current form.

Early on, in the abstract no less, the authors introduce a parameter called ‘journal visibility’. Apparently they ‘correct’ the number of citations for it.

We compared the citation numbers of 306 taxonomic and 2291 non-taxonomic research articles (2009–2012) on mosses, orchids, ciliates, ants, and snakes, using Web of Science (WoS) and correcting for journal visibility. For three of the five taxa, significant differences were absent in citation numbers between taxonomic and non-taxonomic papers.

I count over twenty further instances of the term ‘visibility’ or ‘visible’ in this paper. It is clearly an important part of the work and calculations. But what is it, and how did they correct for it? All parameters in reputable scientific papers should be clearly defined, as should any numerical ‘correction’ operations performed. Yet in this paper I honestly can’t find an explicit definition of ‘journal visibility’. As Brian O’Meara points out, they define highly visible journals as “those included in WoS and with a good standing”. ‘Good standing’ is not further defined or scored. No definition is given for what a lowly visible or middlingly visible journal is. All journals indexed in Web of Science are assigned an Impact Factor, so ‘included in WoS’ and ‘has an Impact Factor’ are two ways of saying the same thing.

For the sake of clarity I will now quote and number all other passages in the paper, aside from the abstract, that mention ‘visibility’ or ‘visible’ (I have highlighted each instance in red):

1 & 2 & 3

In more detail, we address five questions: Does publishing taxonomy harm a journal’s citation performance? Is it within the possibilities of journal editors to influence taxonomy’s visibility? If more high-visibility journals opened their doors to taxonomic publications, would taxonomy’s productivity be sufficient for an increase in the number of taxonomic papers in these journals? Can taxonomy be published by taxonomists only or by a larger community? And finally, would the community use the chance to publish more taxonomic papers in highly visible journals?

4

Just 14 of the 47 journals published both taxonomic and non-taxonomic papers on the focal taxa on a yearly basis in the years 2009–2012 (Table 1). The analyzed taxonomic publications in these 14 journals might have experienced lower visibility than publications in the other 33 journals. This is due to the fact that the average IF 2012 of the 14 journals with both taxonomic and non-taxonomic publications was significantly lower (1.16 ± 0.51 standard deviation [SD]) than the average IF of the other 33 journals (2.66 ± 1.60; Student’s t-test, P < 0.001).

5

Because of the correction for journal visibility, we consider the results for the 14 journals to be more representative of the citation performance of taxonomic versus non-taxonomic per se than the results for all journals.

6 & 7 & 8

[Section Heading] EDITORS CAN INCREASE THE VISIBILITY OF TAXONOMIC PUBLICATIONS

For strengthening the impact and prospects of taxonomy, equal opportunity is needed for taxonomists and non-taxonomists. In practice, this means that taxonomists should be able to publish in highly visible journals (those included in WoS and with a good standing). Editors of highly visible periodicals that include taxonomy will contribute actively to reducing the taxonomic impediment and, considering our analyses, might on top of this do the best for their journals.

9 & 10

The IF 2012 of these 19 journals that (in principle) publish taxonomy (2.61 ± 1.64) does, on average, not differ significantly from that of the 14 journals that do not publish taxonomy at all (2.73 ± 1.61; Student’s t-test, P = 0.84), meaning that equal visibility for taxonomists and non-taxonomists might, in fact, not be out of reach. In essence, for many editors of highly visible periodicals, it might not so much be a question of changing the scope of their journals but of increasing the frequency of taxonomic publications and thus simply of communicating the willingness to publish taxonomy to the community.

11 & 12 & 13

[Section Heading] TAXONOMY’S PRODUCTIVITY WOULD BE SUFFICIENT TO INCREASE THE NUMBER OF PAPERS IN HIGHLY VISIBLE JOURNALS

It is not enough, however, for editors of highly visible journals to actively invite taxonomic contributions. A crucial question about whether increasing taxonomy’s visibility will work is the capacity of taxonomy to follow the invitation. One way to approach this issue is looking at the growth rate of taxonomy

14

To our knowledge, a comprehensive taxonomic literature database is available just for animals, Zoological Record (ZR). For 2012, the latest year considered here, ZR lists 2.1 times more publications on animal taxonomy than WoS (Fig. 2b, c). This indicates that already in the short term, there is sufficient taxonomic publication output for editors of highly visible journals to indeed increase their share in taxonomy.

15

On the whole, the capacity for increased publication of taxonomy in highly visible journals seems to be there. Accepting that the potential exists, there is still a question of whether taxonomy’s flexibility will be sufficient for a change in publication culture to be realized.

16 & 17 & 18 & 19

[Section Heading] THE COMMUNITY WOULD LIKELY USE THE CHANCE TO INCREASE TAXONOMY’S VISIBILITY

… This suggests that taxonomists indeed would use also other chances of publishing in highly visible journals, should the opportunity arise. The resulting shift from aiming at low visibility to targeting highly visible journals will be very important for taxonomists in working toward both an improved image (Carbayo and Marques 2011) and an improved measure of their scientific impact (Agnarsson and Kuntner 2007).

20 & 21 & 22

Editors of highly visible journals in biology could help (i) increase the visibility of taxonomic publications by encouraging taxonomists to publish in their journals (thereby generally not harming but possibly boosting their journals) and (ii) increase total taxonomic output by making it attractive for scientists working in species delimitation (with their primary focus different from taxonomy) to publish the taxonomic consequences of their research.

The task of taxonomic authors, in turn, will be to follow the invitation and to submit indeed their best papers to the best-visible journals available for submission—just as authors of non-taxonomic papers do.

### My inferences on visibility

For independent, unbiased confirmation, I looked up the definition of ‘visibility’ online and found:

### Noun

visibility ‎(countable and uncountable, plural visibilities)

1. (uncountable) The condition of being visible.
2. (countable) The degree to which things may be seen.

By the above definition, which is not unreasonable, I would have thought that open access journals would have the highest ‘journal visibility’, as everyone with an internet connection is able to see articles in them without having to log in or pay to view.

Popular subscription-access journals like Nature arguably have middling visibility, as many scientists have access to them (although not that many actually read all the articles in them – I certainly don’t). Finally, many subscription-access journals, e.g. Zootaxa, are known to be less widely subscribed to by both individuals and institutions. (I would love to have data to demonstrate this more objectively; it is certainly true for UK Higher Education Institutions, of which significantly more subscribe to Nature than to Zootaxa.)

I get the feeling that the authors of this paper did not score ‘visibility’ in this manner.

Many of the mentions of ‘visibility’ appear near discussion of Impact Factor (IF). Perhaps the authors mean to suggest that visibility and Impact Factor are one and the same thing, or are highly correlated? No evidence or citation is given to support this idea. I find this conflation of ‘visibility’ and Impact Factor to be simply wrong and dangerously misleading. Why?

Take the visibility of Elsevier journals, for instance. They range in Impact Factor from 0 (many journals, e.g. Arab Journal of Gastroenterology), to 2 (e.g. Academic Pediatrics), up to 45 (The Lancet). Yet I’d argue the visibility of most Elsevier subscription journals is much the same, because institutional libraries tend to (be practically forced to) buy Elsevier journals as a bundle – the euphemistically-titled ‘Freedom Collection‘. With the privilege of an institutional affiliation you typically either have access to all the Elsevier journals, including the cruddy ones, or you have access to none of them (in one ARL survey from 2012, 92% of surveyed libraries subscribed to the Elsevier bundle). Unfortunately very few academic libraries opt to subscribe to just a few select Elsevier subscription-only journals rather than the bundle; MIT is one of the rare exceptions. Thus whether an individual subscription-access Elsevier journal has an Impact Factor of 0, 2, 5, or 10, the global visibility of its articles is relatively similar across Elsevier journals. The only exceptions are the very most popular titles like The Lancet, which might have an appreciable number of individual subscribers and of institutions that subscribe to the journal without subscribing to the rest of Elsevier’s bundle.

Journals aren’t a good unit of measure anyway – citations, views, downloads and ‘quality’ (broadly-defined) can vary greatly even within the same journal. Articles are a more appropriate unit of measure and we have abundant article-level metrics (ALMs) these days. Let’s not lose sight of that fact.

Surely this article needs correction at the very least? This is more than just a minor linguistic quibble. If the authors mean Impact Factor every time they say ‘visible’ or ‘visibility’, why don’t they just say so? Perhaps it is because the Impact Factor is so widely and rightly derided, not to mention statistically illiterate (the distribution of citations across journal articles is well known to be skewed, so one should take the median, not the mean, to measure central tendency; the Impact Factor uses the mean in its calculation – oops!), that they knew it wouldn’t be meaningful and so masked it behind the weasel-word ‘visibility’ instead?
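The skew argument is easy to demonstrate. Here is a minimal sketch with made-up citation counts (the numbers are invented for illustration, not taken from the paper), using only Python’s standard library:

```python
import statistics

# Hypothetical citation counts for one journal's articles over two years:
# most articles gather few citations, a handful gather many (a skewed distribution).
citations = [0, 0, 1, 1, 1, 2, 2, 3, 3, 4, 5, 7, 12, 48, 95]

mean = statistics.mean(citations)      # what the Impact Factor effectively reports
median = statistics.median(citations)  # a more robust measure of the 'typical' article

print(f"mean = {mean:.1f}, median = {median}")  # prints: mean = 12.3, median = 3
```

A couple of highly cited outliers drag the mean far above what a typical article in the journal actually achieves – exactly the distortion the Impact Factor bakes in.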

This article seems to be asking: Is it within the possibilities of journal editors to influence taxonomy’s ~~visibility~~ Impact Factor?

## I’m saying NO to Wiley

August 18th, 2015 | Posted by in Ecology | Open Access | Publications | Wiley | Wrongly selling OA articles - (2 Comments)

I was recently invited to review a manuscript for Methods in Ecology and Evolution (MEE), a British Ecological Society journal published with Wiley.

I rejected the request and will from now on decline to review for all Wiley journals. In this post I duplicate my email to the Assistant Editor (Chris Greaves) explaining why. FWIW Chris has handled my letter extremely well and will forward it on for me to where it needs to be seen/read within the British Ecological Society.

Below is the email I sent earlier today in full:

from: Ross Mounce <ross.mounce@gmail.com>
to: coordinator@methodsinecologyandevolution.org
date: 18 August 2015 at 11:57
subject: Re: Follow-up: Invitation to Review for Methods in Ecology and Evolution

Dear Chris,

Thank you (and Rich FitzJohn) for inviting me to review this manuscript.

It looks interesting from the abstract and in other circumstances I would certainly agree to review it.

However, I refused to review this manuscript and will refuse to review any subsequent manuscript for this publisher (Wiley) because I believe they are actively impeding progress in science by choosing to operate a predominantly subscription-based business model – artificially restricting access to knowledge that taxpayers (through government funding) and charities predominantly fund. Furthermore, they do an extremely poor job of it.

• They produce but actively withhold full text XML (even from subscribers). Reputable open access publishers have no qualms in making their full text XML available to all. This is deeply frustrating for those interested in synthesis, reproducibility and getting the most from published science in a time-efficient manner. As the manuscript I was just asked to review was principally about ‘automated content analysis’ I find this particularly galling and I am wondering why the authors thought it was appropriate to submit this to such a journal.
• They use an outdated back-end system: ‘ManuscriptCentral’, which is by all accounts an extremely poor system. Wiley have made huge profits each and every year in the past decade and yet seem completely unwilling to re-invest them in improving their systems. There wasn’t even a free text box to explain my reasons for declining to review this manuscript. Utterly poor, neglected design. Try PeerJ or Pensoft’s submission system. They have clearly worked hard and invested time and effort into making publishing research better for everyone, not just their own profit-margin.
• Wiley’s hybrid open access charge ($3,000) is outrageously expensive and bears no resemblance or link to the actual cost of production or services provided. I am aware of the ‘discount’ offered to British Ecological Society members (down to $2,250), but the ‘discount’ is only gained if one of the authors pays ~$80 to join BES (full, ordinary member rate). That is still far too high. For context, some other open access fees: PLOS ONE charges $1,350; PeerJ just $99 per author (the manuscript I was just asked to review has only 4 authors); Ubiquity Press journals $500; and Biodiversity Data Journal is still FREE ($0) whilst in launch phase. This to me is strong evidence of deep inefficiency, profit-gouging, or a mixture of both on Wiley’s part, none of which is excusable. I am certainly not alone in thinking this. See recent tweets from Rob Lanfear (an excellent scientist):

https://twitter.com/RobLanfear/status/630523174061342720
https://twitter.com/RobLanfear/status/630526920086568960

• Wiley are a significant player in the modern oligopoly of academic publisher knowledge racketeering. Data from FOI requests in the UK show that in the last five years (2010–2014), 125 UK Higher Education Institutes have collectively spent nearly £77,000,000 renting access to knowledge that Wiley has captured. That’s just the UK. Wiley doesn’t pay authors for their content, nor do they pay reviewers. I don’t know why the British Ecological Society (BES) partners with these racketeers – I find this arrangement severely detrimental to the goals of BES and academic research.
• Like the other big knowledge racketeers, Wiley operate a ‘big bundle’ subscription system. By adding BES journals to this big bundle of subscriber-only knowledge, it becomes harder for libraries around the world to cancel their subscriptions. Wiley know this and hence are actively trying to acquire as many good journals as possible (e.g. ESA journals) to make themselves ‘too big to cancel’.
• On a personal note, I am particularly aggrieved with Wiley because they are currently, without my consent, charging $45.60 (including tax) to ‘non-subscribers’ for access to one of my open access articles, which they have copied over from where it is freely available at the original publisher. Charging $45.60 to access something that is freely available at the original publisher is simply astonishing, and is just another facet of the lunacy of the many ways in which Wiley and companies like it seek to profiteer from and restrict access to research.

For all these reasons and many more I simply cannot agree to review manuscripts for any Wiley journal. I am already boycotting Elsevier, and am considering applying the same to subscription-access Springer Nature and Taylor & Francis journals for similar reasons.

I urge the British Ecological Society to reconsider their ‘partnership’ with this profiteering entity and to pursue publishing with organisations that are actually competent at modern 21st-century academic publishing, particularly those that support and actively facilitate content mining, e.g. Pensoft, PLOS, PeerJ, eLife, Ubiquity Press, MDPI and F1000Research, to name but a few.

Sincerely,

Ross Mounce

———————————–

I feel relieved to have done this. Having reviewed for Wiley only last month, it didn’t feel right. Why would I help them whilst boycotting Elsevier? They are essentially as bad as each other. My position is more logically consistent now.

Many thanks to others who have also publicly written about refusing to review for legacy publishers; these posts certainly helped me in my decision-making:

Heather Piwowar: Sending A Message
Ethan White: Why I will no longer review for your journal
Casey Bergman: Just Say No – The Roberts/Ashburner Response

PS Having read Tom Pollard’s post on this matter, I might also write to one of the authors to explain why I declined to review their article.
I wish them well and I look forward to reading their article when it comes out.

## Advice for journal-publishing academic societies

August 2nd, 2015 | Posted by in Ecology | Open Access | Open Science | Publications - (1 Comments)

I read some sad news on Twitter recently. The Ecological Society of America has decided to publish its journals with Wiley.

Whilst I think the decision to move away from their old, unloved publishing platform is a good one, the move to publish their journals with Wiley is a strategically poor one. In this post I shall explain my reasoning and some of the widespread dissatisfaction with the direction of this change.

### Society journals should not be a profit-driven business

The stated goals of the Ecological Society of America (ESA) are noble and I reproduce them below to help you understand what the society in theory aims to do:

• promote ecological science by improving communication among ecologists;
• raise the public’s level of awareness of the importance of ecological science;
• increase the resources available for the conduct of ecological science;
• ensure the appropriate use of ecological science in environmental decision making by enhancing communication between the ecological community and policy-makers.

Reading those four bullet points, it strikes me that a society with this stated mission should be a vanguard of the open access movement. An efficient, well-implemented open access publishing system, supported (and thus empowered) by the ESA, would positively address all four of those goals.

Do I need to explain how open access would improve communication among ecologists? It should be obvious to most. Some facts: universities around the world do not have access to all subscription journals – not even Harvard. Wiley’s big bundle of journal subscriptions is no exception to this rule. Brock University in Canada is one such notable example.
‘Ecology and Evolution’ is one of the two “main themes” of Brock’s Biology Department, yet Brock does not have access to the Wiley bundle of subscription journals. Furthermore, as the above tweet demonstrates, many ecologists are not based at universities. Not all use or readership of ecology journals is by ecologists; it’s absolutely not sufficient to provide access to ecologists alone. It’s vital that policymakers and the public have access to the latest research, with no embargoes.

Want evidence that policymakers lack access to research? Look no further than this blog post from a recent intern at the UK Parliamentary Office of Science & Technology (POST):

The level of access to journals was far lower than I had expected (it was actually shocking) – I ended up using my academic access throughout my placement.

If the ESA seriously wants to “ensure the appropriate use of ecological science in environmental decision making by enhancing communication between the ecological community and policy-makers”, then making it easier for policymakers like those at POST to access research published in ESA journals would surely be a great way of doing that.

How does the ESA expect to “raise the public’s level of awareness of the importance of ecological science” if most of the science that they themselves publish in their own journals is behind an expensive paywall? $20 for 30-day access to one article? Admittedly that’s cheaper than many, but it’s simply not supportive of ESA’s mission.

Does this unnecessary paywall help raise the awareness of ecological science?

Lastly, with respect to increasing “resources available for the conduct of ecological science”, the ESA urgently needs to consider the big picture here. Wiley, Springer Nature, Elsevier and other legacy publishers are a major drain on the financial resources available for research. With their big bundle deals they ransom/rent access to libraries for sums that can reach many millions of dollars per year, per institution. Money should instead be diverted into efficient, high-quality publishing systems like JMLR, Open Library of Humanities, PeerJ, Pensoft and Ubiquity Press, to name but a few. All of these not only provide open access, but also high-quality publishing services at a significantly lower cost. Many provide added extras such as semantically-enhanced full-text XML, which would make synthesis of ecological science easier. Wiley does not provide direct access to per-article full-text XML even to its paying subscribers! They do half the job for thrice the price. Why would the ESA want to help sustain and enhance Wiley’s famous 42% profit margin? These legacy publishers are strategically merging and acquiring journals in order to make it harder for libraries to cancel their dross-laden ‘big bundle’ subscription packages. It doesn’t seem like a logical decision to me, or to others.

Comparing this to other recent journal publishing changes

To put into context the ESA move to Wiley, let’s look at three other recent examples of academic societies changing publisher:

1.) Museum für Naturkunde Berlin journals (flipping to open access)

In 2014 all of their journals moved away from being published with Wiley. Their two zoological journals, which have been around since before the ESA was even formed(!), transferred to open access publishing with Pensoft. Their Earth Science journal Fossil Record also moved away from Wiley, to open access publishing with Copernicus Publications. Guess what? The sky didn’t fall. I predict the articles in these journals will start being read, downloaded and cited more now that they are open access to everyone.

2.) Paleontology Society journals (switching to arguably a more benign, less profit-driven legacy publisher)

In 2015 PalSoc journals switched to being published with Cambridge University Press (CUP). I’m not super enthusiastic about CUP, but if a society really wants to do legacy publishing without worsening the stranglehold of the big publishing companies over libraries, then CUP or other university presses (Oxford, Johns Hopkins, Chicago) seem like safer custodians of academic intellectual property to me.

3.) American Society of Limnology and Oceanography (moving to Wiley)

I for one was completely unaware that ESA were looking for a new publisher. I would have tried to help if I had known. I have many unanswered questions about the consultation process. For example, the ESA has an Open Science section and mailing list; its members are extremely knowledgeable about the academic publishing landscape and publishing technology.

Was the ESA membership in its entirety specifically and clearly asked which publisher they would like the ESA to publish with? Did they ask their membership what features they wanted from their new publishing platform? I would have requested a platform that provides access to semantically-enriched full-text XML – Wiley does not provide this. Given a choice, and the vital context and information given above, I think few ESA members, policymakers, or members of the public would choose Wiley as ESA’s new publisher.

I gather from Twitter that “any and all” were invited to submit a proposal to publish ESA journals, and that Elsevier submitted one. But a lazy tendering process only biases decisions towards major conglomerates who have the time, energy and resources to make slick proposals – I wonder if smaller but high-quality publishing companies were pro-actively approached by ESA to submit a proposal? In the public interest, I think the ESA should publish the names of all organisations that submitted proposals to publish ESA journals – just that data alone might reveal flaws in the tendering process. I’m finding it really hard to reconcile the goals of the ESA with the shareholder-profit motivation of Wiley. I genuinely think the leadership of ESA is out of touch with its membership, and that the membership may not have been properly consulted about this major change to the society.

This is a long post, and I’ve said enough, so I’ll leave it to a professional scholarly communications expert (Kevin Smith, Duke University), to have the last word about Wiley, and the recent trend towards cancelling Wiley subscriptions:

I don’t know if Wiley is the worst offender amongst the large commercial publishers, or whether there is a real trend toward cancelling Wiley packages. But I know the future of scholarship lies elsewhere than with these large legacy corporations.

Postscript

But perhaps we can turn this negative into a positive by creating resources and impartial educational guides for academic societies on how to negotiate better publishing deals, and how to start a tendering process with an eye towards the inevitable future of open access. If SPARC or SPARC Europe already provides these resources, please do point me at them!

July 2nd, 2015 | Posted by in Content Mining | Open Science - (1 Comments)

With a first commit to github not so long ago (2015-04-13), getpapers is one of the newest tools in the ContentMine toolchain.

It’s also the most readily accessible and perhaps most immediately exciting – it does exactly what it says on the tin: it gets papers for you en masse without having to click around all those different publisher websites. A superb time-saver.

It kinda reminds me of mps-youtube: a handy CLI application for watching/listening to youtube.

Installation is super simple and usage is well documented at the source code repository on github, and of course it’s available under an OSI-approved open source MIT license.

An example usage querying Europe PubMedCentral

Currently you can search three different aggregators of academic papers: Europe PubMedCentral, arXiv, and IEEE. Copyright restrictions unfortunately mean that full-text article download with getpapers is restricted to freely accessible or open access papers. The development team plans to add more sources that provide API access in the future, although it should be noted that many research aggregators simply don’t appear to have an API at the moment, e.g. bioRxiv.

The speed of the overall process is very impressive. I ran the below search & download command and it completed in just 32 seconds, including the download of 50 full-text PDFs of the search-relevant articles!
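Under the hood, getpapers talks to REST APIs such as Europe PMC’s. As a rough illustration of the kind of request involved – the query term and page size here are my own illustrative choices, not the exact command I ran – building such a search URL takes only the standard library:

```python
from urllib.parse import urlencode

# Illustrative Europe PMC search: open access papers matching a query term.
# OPEN_ACCESS:Y restricts results to papers whose full text can be freely downloaded.
base = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"
params = {
    "query": "dinosaur AND OPEN_ACCESS:Y",
    "format": "json",
    "pageSize": 50,  # mirror a 50-result download
}
url = base + "?" + urlencode(params)
print(url)
```

Fetching that URL returns JSON metadata for each hit, from which the full-text download links follow – getpapers simply automates the whole loop for you.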

You can choose to download different file formats of the search results: PDF, XML or even the supplementary data. Furthermore, getpapers integrates extremely well with the rest of the ContentMine toolchain, so it’s an ideal starting point for content mining.

getpapers is one of many tools in the ContentMine toolchain that I’ll be demonstrating to early-career biologists at a free-registration, one-day workshop at the University of Bath on Tuesday 28th July. If you’re interested in learning more about fully utilizing the research literature in scalable, reproducible ways, come along! We still have some places left. See the flyer below for more details or follow this link to the official workshop registration page: bit.ly/MiningWrkshp

## Deep indexing supplementary data files

June 20th, 2015 | Posted by in Conservation Hackathon | Content Mining | Hack days - (0 Comments)

To prove my point about the way that supplementary data files bury useful data, making it utterly undiscoverable to most, I decided to do a little experiment (in relation to text mining for museum specimen identifiers, but also perhaps with some relevance to the NHM Conservation Hackathon):

I collected the links to all Biology Letters supplementary data files, filtered out the non-textual media such as audio, video and image files, and downloaded the remaining content.
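That filtering step is nothing more than an extension blacklist; here is a self-contained toy sketch of it (the file names are invented):

```shell
# Toy list of supplementary-file names; drop audio/video/image extensions
printf '%s\n' 'supp1.doc' 'movie_S1.mov' 'figure_S2.jpg' 'table_S3.xls' > all_links.txt
grep -viE '\.(mp3|wav|mov|avi|mp4|jpe?g|png|gif|tiff?)$' all_links.txt > textual_links.txt
cat textual_links.txt   # only supp1.doc and table_S3.xls survive
```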

763 .doc files
543 .pdf files
109 .docx files
75 .xls files
53 .xlsx files
25 .csv files
19 .txt files
14 .zip files
2 .rtf files
2 .nex files
1 .xml file
1 “.xltx” file

I then converted some of these unfriendly formats into simpler, more easily searchable plain text formats:
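For reference, all of these conversions can be scripted from the command line; a sketch assuming LibreOffice and poppler-utils (for pdftotext) are installed:

```shell
# Word documents and spreadsheets -> plain text / CSV via headless LibreOffice
libreoffice --headless --convert-to txt *.doc *.docx
libreoffice --headless --convert-to csv *.xls *.xlsx
# PDFs -> plain text, preserving the original base name
for f in *.pdf; do pdftotext "$f" "${f%.pdf}.txt"; done
```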

Now everything is properly searchable and indexable!

In a matter of seconds I can find NHM specimen identifiers that might not otherwise be mentioned in the full text of the paper, without wasting any time manually reading papers. Note that not all the ‘hits’ are true positives, but most are, and those that aren’t, e.g. “NHMQEVLEGYKKKYE”, are easy to distinguish as NOT valid NHM specimen identifiers:
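The true-positive/false-positive split can itself be automated with a slightly stricter pattern – the regex below is my rough guess at a plausible catalogue-number shape, not an official NHM format:

```shell
# Keep strings that look like catalogue numbers (museum prefix, optional
# letter code, digits) and drop letter-only runs like 'NHMQEVLEGYKKKYE'
printf '%s\n' 'NHMUK R3592' 'BMNH 37001' 'NHMQEVLEGYKKKYE' \
  | grep -E '^(NHMUK|BMNH) [A-Z]{0,3} ?[0-9]+$'   # prints the first two only
```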

Perhaps this approach might be useful to the PREDICTS / LPI teams, looking for species occurrence data sets?

I don’t know why figshare doesn’t do deep indexing by default – it’d be really useful to search the morass of published supplementary data that’s out there!

## Progress on specimen mining

June 14th, 2015 | Posted by in Content Mining - (0 Comments)

I’ve been on holiday in Japan recently, so work on this came to a halt for a while, but I think I’ve now largely ‘done’ PLOS ONE full text (excluding supplementary materials).

My results are on GitHub: https://github.com/rossmounce/NHM-specimens/tree/master/results – one prettier file without the exact provenance or in-sentence context of each putative specimen entity, and one more extensive file with provenance & context included, which unfortunately GitHub can’t render/preview.

Some summary stats:

I found 427 unique BMNH/NHMUK specimen mentions across a total of just 69 unique PLOS ONE papers. That low paper count strongly suggests to me that a lot of ‘hidden’ specimen identifiers are lurking in difficult-to-search supplementary materials files.

Including instances where the same BMNH/NHMUK specimen is mentioned in different PLOS ONE papers, I found 497 specimen mentions in total.
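The total-versus-unique distinction is a one-liner with sort and uniq; a toy example with invented data:

```shell
# Toy mention list: one catalogue number per line, duplicates allowed
printf '%s\n' 'NHMUK R3592' 'NHMUK R3592' 'BMNH 37001' > mentions.txt
wc -l < mentions.txt           # total mentions: 3
sort -u mentions.txt | wc -l   # unique specimens: 2
```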

Finding putative specimen entities in PLOS ONE full text is relatively automatic and easy. The time-consuming manual part is accurately matching them up with official NHM collection specimens data.

I could only confidently link up 314 of the 497 detected mentions to their corresponding unique IDs / URLs in the NHM Open Data Portal Collection Specimens dataset. Approximately one third can’t confidently be matched to a unique specimen in the online specimen collections dataset – I suspect this is mainly down to absent or incomplete records in the online collections data, although a few are likely typos in PLOS ONE papers.
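The automatic part of the linking is a plain relational join on the catalogue number; a self-contained sketch with join(1), using invented portal IDs and placeholder DOIs:

```shell
# Mentions (catalogue number, article DOI placeholder) and a collections
# export (catalogue number, portal ID placeholder); both sorted on field 1
printf '%s\n' 'BMNH 37001,paper-doi-1' 'BMNH R4947,paper-doi-2' > mentions.csv
printf '%s\n' 'BMNH 37001,portal-id-abc' 'NHMUK R3592,portal-id-def' > collections.csv
join -t, mentions.csv collections.csv        # confident links: mention + portal ID
join -t, -v1 mentions.csv collections.csv    # mentions with no matching record
```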

In my last post I was confident that the BM Archaeopteryx specimen would be the most frequently mentioned specimen, but with more extensive data collection and analysis that now appears not to be true! NHMUK R3592 (a specimen of Erythrosuchus africanus) is mentioned in 5 different PLOS ONE papers. Pleasingly, Google Scholar also finds only five PLOS ONE papers mentioning this specimen – independent confirmation of my methodology.

One of the BM specimens of Erythrosuchus is referred to more often in PLOS ONE than the BM Archaeopteryx specimen

Now that I have these two ‘atomic’ identifiers linked up (the NHM specimen collections occurrence ID + the Digital Object Identifier of the research article in which it appears), I can, if desired, find out a whole wealth of information about these specimens and the papers they are mentioned in.

My next steps will be to extend this search to all of the PubMedCentral OA subset, not just PLOS ONE.

## BMNH specimens used in PLOS ONE

May 24th, 2015 | Posted by in Content Mining | Open Data - (1 Comments)

In this post I’ll go through an illustrated example of what I plan to do with my text mining project: linking-up biological specimens from the Natural History Museum, London (sometimes known as BMNH or NHMUK) to the published research literature with persistent identifiers.

I’ve already run some simple grep searches of the PMC open access subset, and unsurprisingly PLOS ONE papers make up a significant portion of the ‘hits’.
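Searches of this kind need nothing fancier than recursive grep; a self-contained toy version (the two ‘papers’ are invented):

```shell
# Build a tiny toy corpus, then list the files that mention a specimen code
mkdir -p corpus
printf 'The holotype, BMNH 37001, is figured in ...\n' > corpus/paper1.txt
printf 'No museum specimens are cited here.\n' > corpus/paper2.txt
grep -rlE 'BMNH|NHMUK' corpus   # prints corpus/paper1.txt
```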

Below is a visual representation of the BMNH specimen ‘hits’ I found in the full text of one PLOS ONE paper:

Grohé C, Morlo M, Chaimanee Y, Blondel C, Coster P, et al. (2012) New Apterodontinae (Hyaenodontida) from the Eocene Locality of Dur At-Talah (Libya): Systematic, Paleoecological and Phylogenetical Implications. PLoS ONE 7(11): e49054.

I used the open source software Gephi with the Yifan Hu layout to create the above graphical representation. The node marked in blue is the paper. Nodes marked in red are catalogue numbers I couldn’t find in the NHM Open Data Portal specimen collections dataset: 10 out of 34 were not found.

The source data table below shows how uninformative the NHM persistent IDs are. I would have plotted them on the graph instead of the catalogue strings, as that would be technically more correct (they are the unique IDs), but it would look horrible.

I’ve been failing to find a lot of well-known entities in the online specimen collections dataset, which makes me rather concerned about its completeness. High-profile specimens such as the Lesothosaurus specimen “BMNH RUB 17” (as mentioned in this PLOS ONE paper, Table 1) can’t be found via the portal under that catalogue number. I can, however, find RUB 16, RUB 52 and RUB 54, but these are probably different specimens. RUB 17 is mentioned in a great many papers by many different authors, so it seems unlikely that they have all independently given the specimen an incorrect catalogue number – the problem is more likely the completeness of the online dataset.

Another ‘missing’ example is “BMNH R4947”, a specimen of Euoplocephalus tutus referred to in Table 4 of this PLOS ONE paper by Arbour and Currie. There are two other records for that taxon, but none under R4947.

To end on a happier note, I can definitely answer one question conclusively:
What is the most ‘popular’ NHM specimen in PLOS ONE full text?

…it’s “BMNH 37001”, Archaeopteryx lithographica which is referred to in full text by four different papers (see below for details).

I have a feeling many more NHM specimens are hiding out in separate supplementary materials files. Mining these will be hard unless figshare gets its act together and creates a full-text API for searching its collection – I believe it’s a metadata-only API at the moment.

I’ve purposefully made very simple graphs so far. Once I get more data, I can start linking it up to create beautiful and complex graphs like the one below (of the taxa shared between 3000 microbial phylogenetic studies in IJSEM, unpublished), which I’m still trying to get my head around. The linked open data work continues…