Show me the data!

Hack4ac recap

July 9th, 2013 | Posted by rmounce in BMC | eLife | Hack days | Open Access | Open Data | Open Science | PeerJ | PLoS

Last Saturday I went to Hack4Ac – a hackday in London bringing together many sections of the academic community in pursuit of two goals:

  • To demonstrate the value of the CC-BY licence within academia. We are interested in supporting innovations around and on top of the literature.
  • To reach out to academics who are keen to learn or improve their programming skills to better their research. We’re especially interested in academics who have never coded before


The list of attendees was stellar, cross-disciplinary (inc. Humanities) and international. The venue (Skills Matter) & organisation were also suitably first-class – lots of power leads, spare computer mice, projectors, whiteboards, good wi-fi, separate workspaces for the different self-assembled hack teams, tea, coffee & snacks all throughout the day to keep us going, prizes & promo swag for all participants…

The principal organizers; Jason Hoyt (PeerJ, formerly at Mendeley) & Ian Mulvany (Head of Tech at eLife) thus deserve a BIG thank you for making all this happen. I hear this may also be turned into a fairly regular set of meetups too, which will be great for keeping up the momentum of innovation going on right now in academic publishing.

The hack projects themselves…

The overall winner of the day was ScienceGist as voted for by the attendees. All the projects were great in their own way considering we only had from ~10am to 5pm to get them in a presentable state.



This project was initiated by Jure Triglav, building upon his previous experience with Tiris. This new project aims to provide an open platform for post-publication summaries (‘gists’) of research papers, providing shorter, more easily understandable summaries of the content of each paper.

I also led a project under the catchy-title of Figures → Data where-by we tried to provide added-value by taking CC-BY bar charts and histograms from the literature and attempting to re-extract the numerical data from those plots with automated efforts using computer vision techniques. On my team for the day I had Peter Murray-Rust, Vincent Adam (of HackYourPhD) and Thomas Branch (Imperial College). This was handy because I know next to nothing about computer vision – I’m Your Typical Biologist ™  in that I know how to script in R, perl, bash and various other things, just enough to get by but not nearly enough to attempt something ambitious like this on my own!

Forgive me the self-indulgence if I talk about this  Figures → Data project more than I do the others but I thought it would be illuminative to discuss the whole process in detail…

In order to share links between our computers in real-time, and to share initial ideas and approaches, Vincent set-up an etherpad here to record our notes. You can see the development of our collaborative note-taking using the timeslider function below (I did a screen record of it for prosperity using recordmydesktop):

In this etherpad we document that there are a variety of ways in which to discover bar charts & histograms:

  • figuresearch is one such web-app that searches the PMC OA subset for figure captions & figure images. With this you can find over 7,000 figure captions containing the word ‘histogram’ (you would assume that the corresponding figure would contain at least one histogram for 99% of those figures, although there are exceptions).
  • figshare has nearly 10,000 hits for histogram figures, whilst BMC & PLOS can also be commended for providing the ability to search their literature stack by just figure captions, making the task of figure discovery far more efficient and targeted.

Jason Hoyt was in the room with us for quite a bit of the hack and clearly noted the search features we were looking for – just yesterday he tweeted: “PeerJ now supports figure search & all images free to use CC-BY (inspired by @rmounce at #hack4ac)” [link] – I’m really glad to see our hack goals helped Jason to improve content search for PeerJ to better enable the needs (albeit somewhat niche in this case) of real researchers. It’s this kind of unique confluence of typesetters, publishers, researchers, policymakers and hackers at doing-events like this that can generate real change in academic publishing.

The downside of our project was that we discovered someone’s done much of this before. ReVision: Automated Classification, Analysis and Redesign of Chart Images  [PDF] was an award-winning paper at an ACM conference in 2011. Much of this project would have helped our idea, particularly the classification of figures tech. Yet sadly, as with so much of ‘closed’ science we couldn’t find any open source code associated with this project. There were comments that this type of non-code sharing behaviour, blocking re-use and progress, are fairly typical in computer science & ACM conferences (I wouldn’t know but it was muttered…).  If anyone does know of the existence of related open source code available for this project do let me know!

So… we had to start from a fairly low-level ourselves: Vincent & Thomas tried MATLAB & C based approaches with OpenCV and their code is all up on our project github. Peter tried using AMI2 toolset, particularly the Canny algorithm, whilst I built up an annotated corpus of 40 CC-BY bar charts & histograms for testing purposes. Results of all three approaches can be seen below in their attempts to simplify this hilarious figure about dolphin cognition from a PLOS paper:

The plastic fish just wasn't as captivating...

“Figure 5. Total time spent looking at different targets.” from Siniscalchi M, Dimatteo S, Pepe AM, Sasso R, Quaranta A (2012) Visual Lateralization in Wild Striped Dolphins (Stenella coeruleoalba) in Response to Stimuli with Different Degrees of Familiarity. PLoS ONE 7(1): e30001. doi:10.1371/journal.pone.0030001 CC-BY

Peter’s results (using AMI2):


Thomas’s results (OpenCV & C):


Vincent’s results (OpenCV & MATLAB & bilateral filtering)

We might not have won 1st prize but I think our efforts are pretty cool, and we got some laughs from our slides presenting our days’ work at the end (e.g. see below). Importantly, *everything* we did that day is openly-available on github to re-use, re-work and improve upon (I’ll ping Thomas & Vincent soon to make sure their code contributions are openly licensed). Proper full-stack open science basically!

some figures are just awful


Other hack projects…

As I thought would happen I’ve waffled on about our project. If you’d like to know more about the other projects hopefully someone else will blog about them at greater length (sorry!) I’ve got my thesis to write y’know! ;)

You can find more about them all either on the Twitter channel #hack4ac or alternatively on the hack4ac github page. I’ll write a little bit more below, but it’ll be concise, I warn you!

  • Textmining PLOS Author Contributions

This project has a lovely website for itself: and so needs no more explanation.

  • Getting more openly-licensed content on wikipedia

This group had problems with the YouTube API I think. Ask @EvoMRI (Daniel Mietchen) if you’re interested…

  • articlEnhancer

Not content with helping out the PLOS author contribs project Walther Georg also unveiled his own article enhancer project which has a nice webpage about it here:

  • Qual or Quant methods?

Dan Stowell & co used NLP techniques on full-text accessible CC-BY research papers, to classify all of them in an automated way determining whether they were qualitative or quantitative papers (or a mixture of the two). The last tweeted report of it sounded rather promising: “Upstairs at #hack4ac we’re hacking a system to classify research papers as qual or quant. first results: 96 right of 97. #woo #NLPwhysure” More generally, I believe their idea was to enable a “search by methods” capability, which I think would be highly sought-after if they could do it. Best of luck!
Apologies if I missed any projects. Feel free to post epic-long comments about them below ;)




  • Martin Fenner

    Thanks for the excellent summary! We ran out of time calculating the data for all PLOS papers in our project (PLOS author contributions). Will try to finish soon and will then upload the dataset to an appropriate place.

  • Dan Stowell

    Thanks Ross for writing this up! I just want to point out that our “search by methods” hack all came from Nev’s idea. Unfortunately I don’t have her full name but all credit for the idea goes to her – it’s a neat idea.

  • Kaveh Bazargan

    Thanks for the write-up, Ross. I learnt so much and got great ideas. Worth regular repeats, and happy to help in any way…

  • Pingback: Unilever Centre for Molecular Informatics, Cambridge - Hack4ac, Content-mining and open-science in Oxford « petermr's blog