Show me the data!

Deep indexing supplementary data files

June 20th, 2015 | Posted by rmounce in Conservation Hackathon | Content Mining | Hack days - (Comments Off on Deep indexing supplementary data files)

To prove my point about the way supplementary data files bury useful data, making it effectively undiscoverable to most, I decided to do a little experiment (in relation to text mining for museum specimen identifiers, but perhaps also with some relevance to the NHM Conservation Hackathon):

I collected the links to all Biology Letters supplementary data files, filtered out non-textual media such as audio, video and image files, and then downloaded the remaining content.
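The media-filtering step can be sketched with a simple extension blacklist – a hypothetical example, assuming the collected URLs sit one-per-line in a file I'll call links.txt (the filename and the exact extension list are my assumptions, not the actual pipeline used):

```shell
# Drop obvious audio/video/image links before downloading anything.
grep -viE '\.(mp3|wav|mp4|avi|mov|mpg|jpe?g|png|gif|tiff?)([?#].*)?$' links.txt > textual-links.txt
# then fetch what remains, e.g.:
# wget -i textual-links.txt
```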

A breakdown of file extensions encountered in this downloaded subset:

763 .doc files
543 .pdf files
109 .docx files
75 .xls files
53 .xlsx files
25 .csv files
19 .txt files
14 .zip files
2 .rtf files
2 .nex files
1 .xml file
1 .xltx file
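A breakdown like this can be tallied with a short shell pipeline – a sketch, assuming you run it inside the flat directory of downloaded files:

```shell
# Tally files by extension, most common first.
for f in *.*; do
  printf '%s\n' "${f##*.}"   # strip everything up to the last dot
done | sort | uniq -c | sort -rn
```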

I then converted some of these unfriendly formats into simpler, more easily searchable plain text formats:

DEST=/home/ross/work/royal-soc-si/biol-letters-supp-info/transformed
for i in *.zip  ; do unzip "$i" -d "$DEST/unzipped_$i" ; done
for i in *.docx ; do docx2txt "$i" "$DEST/$i.txt" ; done
for i in *.doc  ; do catdoc -a "$i" > "$DEST/$i.txt" ; done
for i in *.pdf  ; do pdftotext "$i" "$DEST/$i.txt" ; done    # pdftotext writes to the named output file; no redirect needed
for i in *.rtf  ; do unrtf --text "$i" > "$DEST/$i.txt" ; done
for i in *.xls  ; do in2csv "$i" > "$DEST/$i.csv" ; done
for i in *.xlsx ; do in2csv "$i" > "$DEST/$i.csv" ; done


Now everything is properly searchable and indexable!

In a matter of seconds I can find NHM specimen identifiers that might not otherwise be mentioned in the full text of the paper, without wasting any time manually reading papers. Note that not all the ‘hits’ are true positives, but most are, and those that aren’t, e.g. “NHMQEVLEGYKKKYE”, are easy to distinguish as NOT valid NHM specimen identifiers:

$ grep -ior 'nhm............'
20120949_ESM_1.txt:NHMUK R6792), N
20120949_ESM_1.txt:NHMUK R8646) in
20120949_ESM_1.txt:NHMUK R36615, ‘
20120949_ESM_1.txt:NHMUK R36620), 
20120949_ESM_1.txt:NHMUK R16586). 
20120949_ESM_1.txt:NHMUK R36620) a
20120949_ESM_1.txt:NHMUK R16586) a
20120949_ESM_1.txt:NHMUK R6856 was
20120949_ESM_1.txt:NHMUK Charig ar
20120949_ESM_1.txt:NHMUK R6856 and
20120949_ESM_1.txt:NHMUK R6856 wer
20120949_ESM_1.txt:NHMUK R6856 wer
20120949_ESM_1.txt:NHMUK R6856 and
20120949_ESM_1.txt:NHM R6856 just 
20120949_ESM_1.txt:NHM R6856 (figu
20120949_ESM_1.txt:NHMUK R6856 had
20120949_ESM_1.txt:NHMUK R3592) an
20120949_ESM_1.txt:NHMUK R6856. Th
20120949_ESM_1.txt:NHMUK R6856). M
20120949_ESM_1.txt:NHMUK with the 
20120949_ESM_1.txt:NHMUK R6856 is 
20120949_ESM_1.txt:NHMUK R6856 sug
20120949_ESM_1.txt:NHMUK R6856. Th
20120949_ESM_1.txt:NHMUK R6856 sug
20120949_ESM_1.txt:NHMUK R6856, bu
20120949_ESM_1.txt:NHMUK R6586 is 
20120949_ESM_1.txt:NHMUK R6586 als
20120949_ESM_1.txt:NHMUK R6586, we
20120949_ESM_1.txt:NHMUK R6586 can
20120949_ESM_1.txt:NHMUK R6586 was
20120949_ESM_1.txt:NHMUK R6586 may
20120949_ESM_1.txt:NHMUK R6856 are
20120949_ESM_1.txt:NHMUK R6856) av
20120949_ESM_1.txt:NHMUK R6795) in
20120949_ESM_1.txt:NHMUK R6795 is 
20120949_ESM_1.txt:NHMUK R6856 and
20120949_ESM_1.txt:NHMUK R6856 and
20120949_ESM_1.txt:NHMUK R6856 was
20120949_ESM_1.txt:NHMUK R6856 fal
20120949_ESM_1.txt:NHMUK R6856 is 
20120949_ESM_1.txt:NHMUK R6856 + S
20120949_ESM_1.txt:NHMUK R6856 whe
20120949_ESM_1.txt:NHMUK R6856 + S
20120949_ESM_1.txt:NHMUK 1, Tanzan
20120949_ESM_1.txt:NHMUK R6856 and
20120949_ESM_1.txt:NHMUK Charig ar
20120949_ESM_1.txt:NHMUK R6856 to 
20120949_ESM_1.txt:NHMUK) for perm
20120949_ESM_1.txt:NHMUK) for acce
20120949_ESM_1.txt:NHMUK Image Res
20120949_ESM_1.txt:NHMUK, The Natu
rsbl20060505supp.txt:NHM uncataloged
rsbl20060505supp.txt:NHM uncataloged
rsbl20070502supp01.doc.txt:NHM) provided v
rsbl20090302supp3.doc.txt:NHM = The Natur
rsbl20090302supp3.doc.txt:NHMW = Natural 
rsbl20090302supp3.doc.txt:NHM E32070	Plan
rsbl20090302supp3.doc.txt:NHM EE5034	Plan
rsbl20090302supp3.doc.txt:NHM E4381	Plank
rsbl20090302supp3.doc.txt:NHM E10384	Plan
rsbl20090302supp3.doc.txt:NHM EE4825	Plan
rsbl20090302supp3.doc.txt:NHM E8389	Plank
rsbl20090302supp3.doc.txt:NHM EE8132	Plan
rsbl20090302supp3.doc.txt:NHM EE5585	Non-
rsbl20090302supp3.doc.txt:NHM EE ?	Non-pl
rsbl20090302supp3.doc.txt:NHM EE1961	?	?	
rsbl20090302supp3.doc.txt:NHM E35551	Plan
rsbl20090302supp3.doc.txt:NHM E76539	?	Up
rsbl20090302supp3.doc.txt:NHM EE4055	Plan
rsbl20090302supp3.doc.txt:NHM E81494	Plan
rsbl20090302supp3.doc.txt:NHM EE4631	?	Ap
rsbl20090302supp3.doc.txt:NHM EE4632	?	Ap
rsbl20090302supp3.doc.txt:NHM EE4641	Plan
rsbl20090302supp3.doc.txt:NHM E20098	Plan
rsbl20090302supp3.doc.txt:NHM EE4404	Plan
rsbl20090302supp3.doc.txt:NHM EE8397	Plan
rsbl20090302supp3.doc.txt:NHM EE2372	?	Ma
rsbl20090302supp3.doc.txt:NHM E79718	Plan
rsbl20090302supp3.doc.txt:NHM E40574	Plan
rsbl20090302supp3.doc.txt:NHM EE4524	Plan
rsbl20090302supp3.doc.txt:NHM E79415	Non-
rsbl20090302supp3.doc.txt:NHM E45372	?	Tu
rsbl20090302supp3.doc.txt:NHM EE2321	Plan
rsbl20090302supp3.doc.txt:NHM EE2262	Plan
rsbl20090302supp3.doc.txt:NHM EE4610	Plan
rsbl20090302supp3.doc.txt:NHM E4052	Non-p
rsbl20090302supp3.doc.txt:NHM EE191	Plank
rsbl20090302supp3.doc.txt:NHM EE2353	Plan
rsbl20090302supp3.doc.txt:NHM E4034	Plank
rsbl20090302supp3.doc.txt:NHM EE2432	Plan
rsbl20090302supp3.doc.txt:NHM E4176	Plank
rsbl20090302supp3.doc.txt:NHM EE4048	?	Ma
rsbl20090302supp3.doc.txt:NHM E9892	Plank
rsbl20090302supp3.doc.txt:NHM E4979	?	Tur
rsbl20090302supp3.doc.txt:NHM E75821	Plan
rsbl20090302supp3.doc.txt:NHM E40974	?	Se
rsbl20090302supp3.doc.txt:NHM E79094	Plan
rsbl20090302supp3.doc.txt:NHM E582	Plankt
rsbl20090302supp3.doc.txt:NHMW 2005z0083/
rsbl20090302supp3.doc.txt:NHM E82582	?	U.
rsbl20090302supp3.doc.txt:NHM EE7698	Plan
rsbl20090302supp3.doc.txt:NHM E9392	Plank
rsbl20090302supp3.doc.txt:NHM E73207	?	Al
rsbl20090302supp3.doc.txt:NHM E43810	Plan
rsbl20090302supp3.doc.txt:NHM 56422	?	Apt
rsbl20090302supp3.doc.txt:NHM E83246	Plan
20120949_ESM_5.txt:NHMUK R6856) am
rsbl2011364supp1.doc.txt:NHM-72.666; MCZ
20120949_ESM_3.txt:NHMUK R6856). P
20120949_ESM_3.txt:NHMUK R6856) in
rsbl20090778supp1.doc.txt:NHM as a contro
rsbl20090139supp1.txt:NHM, The Natura
rsbl20090139supp1.txt:NHM R1034). As
20120949_ESM_2.txt:NHMUK R6856) in
20120949_ESM_2.txt:NHMUK R6856) in
rsbl20080409supp01.doc.txt:NHMW, Naturhist
rsbl20130021supp1.doc.txt:NHM, Staatliche
rsbl20130021supp1.doc.txt:NHMUK PV R498 a
rsbl20130021supp1.doc.txt:NHMUK PV OR3612
rsbl20130021supp1.doc.txt:NHMUK PV R3938 
rsbl20130021supp1.doc.txt:NHMUK PV R5465)
rsbl20130021supp1.doc.txt:NHMUK PV OR2003
rsbl20130021supp1.doc.txt:NHMUK PV R1158)
rsbl20130021supp1.doc.txt:NHMUK PV R5595)
rsbl20130021supp1.doc.txt:NHMUK PV R4086)
rsbl20130021supp1.doc.txt:NHMUK and GLAHM
rsbl20130021supp1.doc.txt:NHM); Sveltonec
rsbl20130021supp1.doc.txt:NHMUK PV R11185
rsbl20130021supp1.doc.txt:NHM1284-R); Mal
rsbl20130021supp1.doc.txt:NHMUK PV R6682)
rsbl20130021supp1.doc.txt:NHMUK PV R6682)
rsbl20130021supp1.doc.txt:NHMUK in 1959, 
rsbl20130021supp1.doc.txt:NHMUK. While th
rsbl20130021supp1.doc.txt:NHMUK PV R6682 
rsbl20130021supp1.doc.txt:NHMUK PV R6682)
rsbl20130021supp1.doc.txt:NHMUK PV R6682,
rsbl20130021supp1.doc.txt:NHMUK PV R6682,
rsbl20130021supp1.doc.txt:NHMUK PV R6682 
rsbl20130021supp1.doc.txt:NHMUK) for the 
20120949_ESM_4.txt:NHMUK R6856) wh
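The crude fixed-width pattern above drags in trailing context and the occasional protein-sequence false positive. A slightly stricter pattern can extract just the catalogue-number-shaped hits – a sketch only, since real NHM identifier formats vary and the character classes here are my assumptions:

```shell
# Match NHM or NHMUK, optionally 'PV', then a (possibly letter-prefixed)
# catalogue number, e.g. 'NHMUK R6856', 'NHMUK PV R6682', 'NHM E32070'.
grep -horE 'NHM(UK)?( PV)? [A-Z]{0,2}[0-9]+' . \
  | sort | uniq -c | sort -rn    # tally the distinct specimen hits
```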


Perhaps this approach might be useful to the PREDICTS / LPI teams, looking for species occurrence data sets?

I don’t know why figshare doesn’t do deep indexing by default – it’d be really useful to be able to search the morass of published supplementary data that’s out there!

Wiley & Readcube have done something rather sneaky recently, and it’s not escaped the attention of diligent readers of the scientific literature.

excellent facebook comment

On the article landing pages of some, if not all(?), Wiley journal articles, in JavaScript-enabled web browsers they’ve replaced all links to download the PDF file of the article with links that redirect you to Readcube instead.

This is incredibly annoying – they are literally forcing us to use Readcube. That is not cool.

Some will rush to the defence of Readcube and point out that, if it detects you have the rights, you can download the PDF from within Readcube, but that misses the point. No one should have to waste precious time while Readcube slowly loads in a browser tab when all they wanted in the first place was the PDF.

What Readcube provides IS NOT EVEN A PDF. It’s a mishmash of JavaScript, HTML and DRM technology. Thus when Wiley has icons saying “get PDF” they’re lying. Clicking the “get PDF” link does NOT get you the PDF; it sends you to Readcube’s proprietary, rights-restricted mock-up of one.

It doesn’t even render the figure images properly, sometimes missing important bits e.g. this figure (below):

Luckily there’s a simple solution: you can block Readcube in your browser settings and get simple, direct, one-click access to PDF files again by selectively disabling JavaScript on all Readcube-infected websites:

Firefox users

Install the add-on called YesScript and ‘blacklist’ all Readcube-tainted websites.

Google Chrome / Chromium users

Use Vince Buffalo’s ‘Get Me the F**king PDF’ Chrome plugin – it’s really good. Alternatively, this browser is so clever you don’t even need to install anything new: selective JavaScript blacklisting of websites is a built-in function:

A) Click the menu button in the top right hand corner of your browser
B) Select Settings
C) (scroll to bottom) Click Show advanced settings
D) Underneath the “Privacy” section, click the “Content settings” button.
E) Under the “Javascript” section, click “Manage Exceptions” and add at least the three Readcube-infected websites (example screenshot below)


Safari users

I haven’t tested this but the JavaScript Blocker extension looks like it should do the job.

Internet Explorer users

I’m tempted to say: install Chrome or Firefox but I’m well aware that some unfortunate academics have ‘university-managed’ computers on which they can’t easily install things. If so try the instructions for IE here. Let me know if you have better solutions for unfortunate IE users.

Before (left) and After (right) disabling JavaScript on the page.


Added bonus function – extra privacy!

Would you want advertisers collecting data on you, knowing what you’ve been reading? It’s possible, though not proven AFAIK, that the journal publishers themselves, or the advertisers they use, are recording information about what articles you’re reading. They might know you read that article about average penis length three times last week, for instance… Eric Hellman recently wrote quite an alarming post about the extent of this tracking at publisher websites. Thus blocking JavaScript at publisher websites provides extra privacy, not just protection against Readcube!

Above all I think we should #BlockReadcube not just for our own utility (easier access to the real PDF), but to send them a powerful message: we do not want the literature to be assimilated and enclosed in rights-restrictions by new technology. We do not want non-consenting ‘cubification of the research literature. We are Starfleet, and as far as I’m concerned: Readcube is the Borg.


PS If you like some of the features of Readcube, try Utopia Docs – it’s free and it’s released under an Open Source license, and it doesn’t force you to use it!

Update 2015-03-20: This post does not indicate I’m suddenly ‘in favour’ of PDFs, by the way, as some seem to have interpreted. If Wiley wanted to do something good, they should publish their full-text XML on site like other good publishers do, e.g. PLOS, eLife, Hindawi, MDPI, Pensoft, BMC, Copernicus… If they did this then readers could choose to use innovative open source viewing software such as the eLife Lens. That kind of change would add value & choice, rather than subtract value (& rights) as they have in this case.

Further discussion of Readcube and rights-restrictions:

Hack4ac recap

July 9th, 2013 | Posted by rmounce in BMC | eLife | Hack days | Open Access | Open Data | Open Science | PeerJ | PLoS - (4 Comments)

Last Saturday I went to Hack4Ac – a hackday in London bringing together many sections of the academic community in pursuit of two goals:

  • To demonstrate the value of the CC-BY licence within academia. We are interested in supporting innovations around and on top of the literature.
  • To reach out to academics who are keen to learn or improve their programming skills to better their research. We’re especially interested in academics who have never coded before.


The list of attendees was stellar, cross-disciplinary (inc. Humanities) and international. The venue (Skills Matter) & organisation were also suitably first-class – lots of power leads, spare computer mice, projectors, whiteboards, good wi-fi, separate workspaces for the different self-assembled hack teams, tea, coffee & snacks all throughout the day to keep us going, prizes & promo swag for all participants…

The principal organizers, Jason Hoyt (PeerJ, formerly at Mendeley) & Ian Mulvany (Head of Tech at eLife), thus deserve a BIG thank you for making all this happen. I hear this may also be turned into a fairly regular set of meetups, which would be great for keeping up the momentum of innovation going on right now in academic publishing.

The hack projects themselves…

The overall winner of the day, as voted for by the attendees, was ScienceGist. All the projects were great in their own way, considering we only had from ~10am to 5pm to get them into a presentable state.



This project was initiated by Jure Triglav, building upon his previous experience with Tiris. This new project aims to provide an open platform for post-publication summaries (‘gists’) of research papers, providing shorter, more easily understandable summaries of the content of each paper.

I also led a project under the catchy title of Figures → Data, whereby we tried to provide added value by taking CC-BY bar charts and histograms from the literature and attempting to re-extract the numerical data from those plots using automated computer vision techniques. On my team for the day I had Peter Murray-Rust, Vincent Adam (of HackYourPhD) and Thomas Branch (Imperial College). This was handy because I know next to nothing about computer vision – I’m Your Typical Biologist™ in that I know how to script in R, perl, bash and various other things, just enough to get by but not nearly enough to attempt something this ambitious on my own!

Forgive me the self-indulgence if I talk about this Figures → Data project more than I do the others, but I thought it would be illuminating to discuss the whole process in detail…

In order to share links between our computers in real time, and to share initial ideas and approaches, Vincent set up an etherpad here to record our notes. You can see the development of our collaborative note-taking using the timeslider function below (I made a screen recording of it for posterity using recordmydesktop):

In this etherpad we document that there are a variety of ways in which to discover bar charts & histograms:

  • figuresearch is one such web-app that searches the PMC OA subset for figure captions & figure images. With this you can find over 7,000 figure captions containing the word ‘histogram’ (you would assume that the corresponding figure would contain at least one histogram for 99% of those figures, although there are exceptions).
  • figshare has nearly 10,000 hits for histogram figures, whilst BMC & PLOS can also be commended for providing the ability to search their literature stack by just figure captions, making the task of figure discovery far more efficient and targeted.

Jason Hoyt was in the room with us for quite a bit of the hack and clearly noted the search features we were looking for – just yesterday he tweeted: “PeerJ now supports figure search & all images free to use CC-BY (inspired by @rmounce at #hack4ac)” [link] – I’m really glad to see our hack goals helped Jason to improve content search for PeerJ to better enable the needs (albeit somewhat niche in this case) of real researchers. It’s this kind of unique confluence of typesetters, publishers, researchers, policymakers and hackers at doing-events like this that can generate real change in academic publishing.

The downside of our project was that we discovered someone’s done much of this before. ReVision: Automated Classification, Analysis and Redesign of Chart Images [PDF] was an award-winning paper at an ACM conference in 2011. Much of this project would have helped our idea, particularly the figure-classification tech. Yet sadly, as with so much ‘closed’ science, we couldn’t find any open source code associated with the project. There were comments that this type of non-code-sharing behaviour, blocking re-use and progress, is fairly typical in computer science & ACM conferences (I wouldn’t know, but it was muttered…). If anyone does know of related open source code for this project, do let me know!

So… we had to start from a fairly low level ourselves: Vincent & Thomas tried MATLAB- and C-based approaches with OpenCV, and their code is all up on our project github. Peter tried the AMI2 toolset, particularly the Canny algorithm, whilst I built up an annotated corpus of 40 CC-BY bar charts & histograms for testing purposes. Results of all three approaches can be seen below, in their attempts to simplify this hilarious figure about dolphin cognition from a PLOS paper:

The plastic fish just wasn't as captivating...

“Figure 5. Total time spent looking at different targets.” from Siniscalchi M, Dimatteo S, Pepe AM, Sasso R, Quaranta A (2012) Visual Lateralization in Wild Striped Dolphins (Stenella coeruleoalba) in Response to Stimuli with Different Degrees of Familiarity. PLoS ONE 7(1): e30001. doi:10.1371/journal.pone.0030001 CC-BY

Peter’s results (using AMI2):


Thomas’s results (OpenCV & C):


Vincent’s results (OpenCV & MATLAB & bilateral filtering)

We might not have won 1st prize but I think our efforts are pretty cool, and we got some laughs from our slides presenting our day’s work at the end (e.g. see below). Importantly, *everything* we did that day is openly available on github to re-use, re-work and improve upon (I’ll ping Thomas & Vincent soon to make sure their code contributions are openly licensed). Proper full-stack open science, basically!

some figures are just awful


Other hack projects…

As I thought would happen, I’ve waffled on about our project. If you’d like to know more about the other projects, hopefully someone else will blog about them at greater length (sorry – I’ve got my thesis to write, y’know!) ;)

You can find more about them all either on the Twitter channel #hack4ac or alternatively on the hack4ac github page. I’ll write a little bit more below, but it’ll be concise, I warn you!

  • Textmining PLOS Author Contributions

This project has a lovely website for itself: and so needs no more explanation.

  • Getting more openly-licensed content on wikipedia

This group had problems with the YouTube API I think. Ask @EvoMRI (Daniel Mietchen) if you’re interested…

  • articlEnhancer

Not content with helping out the PLOS author contribs project, Walther Georg also unveiled his own articlEnhancer project, which has a nice webpage about it here:

  • Qual or Quant methods?

Dan Stowell & co used NLP techniques on full-text accessible CC-BY research papers, to classify all of them in an automated way determining whether they were qualitative or quantitative papers (or a mixture of the two). The last tweeted report of it sounded rather promising: “Upstairs at #hack4ac we’re hacking a system to classify research papers as qual or quant. first results: 96 right of 97. #woo #NLPwhysure” More generally, I believe their idea was to enable a “search by methods” capability, which I think would be highly sought-after if they could do it. Best of luck!
Apologies if I missed any projects. Feel free to post epic-long comments about them below ;)