Show me the data!

Open Access discussed on the radio

August 20th, 2012 | Posted by rmounce in Open Access

[I’m cross posting this from the OKFN version so I can embed the audio of the show in the post]

Last Friday (17/08/12), representing the Open Knowledge Foundation, I had the pleasure of discussing the new Research Councils UK (RCUK) plan for all UK publicly-funded research to be published Open Access, on a special half hour Voice of Russia UK broadcast radio discussion.

I have written about this policy before and am very supportive of it, just as I am of Open Access in academia in general. I personally believe it will aid transparency and equality in research – so that no researcher has an unfair advantage over another through greater/easier access to vital research literature (just one of many worthy benefits arising from Open Access). But there are certainly also vocal opponents to this plan – mostly those with vested interests in keeping alive the obscene profits of the traditional subscription-access publishing system (which commonly generates >30% profit margins, largely derived from the taxpayer-funded spending of the world’s research libraries on journal subscriptions). Others express vague and often unspecified “concerns” about Open Access, and many more academics are notably apathetic towards it, or are even proudly agnostic on the issue.

Thus a publicly-broadcast discussion of this new open policy is well warranted.

No secret science
[Audio: Voice of Russia UK radio Open Access discussion hosted by Daniel Cinna]
I won’t say anything about the discussion itself, only that you should listen to it (embedded above; alt link here) if you are at all interested in the future of science, and the benefits of the new RCUK Open Access policy.

The members of the discussion panel included Rita Gardner, the Director of the Royal Geographical Society, noted for her concerns about the potential effects of Open Access on UK Learned Society income and revenue [paywalled link]; Ross Mounce, Panton Fellow promoting open data in science (myself) from the Open Knowledge Foundation; Bjorn Brembs, Professor at the Department of Genetics at the University of Leipzig, noted critic of for-profit publishers and their lack of ‘value-add’ amongst other issues; and Timothy Gowers, the Rouse Ball Professor of Mathematics at Cambridge University, instigator of the popular academic-led boycott of the academic for-profit publisher Elsevier.

The ensuing discussion was ably guided by Voice of Russia radio presenter Daniel Cinna, and recorded by a backroom team with an impressively professional studio setup (Timothy & Bjorn were joining the debate via phone from abroad almost seamlessly, whilst Rita and I were in the London studio). As noted by Rita off-air, it would have been nice to have had a publisher representative in the discussion to add their unique viewpoint, but apparently none of the for-profit publishers the VoR production team had asked was willing to take part. So one cannot attribute any blame to the VoR team if the discussion panel lacked representational balance.

About Voice of Russia (adapted from their own website):

The Voice of Russia is the world’s oldest international broadcaster and is among the world’s top five radio broadcasters today which include the BBC, the Voice of America, Deutsche Welle and Radio France International. The London-based team produces programs for VoR that bring our listeners a Russian perspective on our two countries and the world. VoR broadcasts to 160 countries in 38 languages using short and medium waves, FM, satellite and the global communications network. In London we are now also available online and via DAB radio. We aim to welcome a new British audience to our 109 million listeners worldwide.

Beware when using Thomson Reuters Web of Knowledge

July 18th, 2012 | Posted by rmounce in Content Mining | Open Access | Open Data | Phylogenetics | PLoS

just a quick post…

I’m pretty shocked at the poor indexing service given by Thomson Reuters Web of Knowledge (or ISI Web of Science as you might know it).

I’ve unashamedly bashed them before and I’ll bash them again here. (They deserve criticism because they’re paid a lot of money to do this as a commercial for-profit enterprise, and I don’t think they’re doing it as well as they could be.)

I performed a very simple search today looking for the articles containing the word ‘cladistic’ but NOT ‘phylogen*’ for articles published in the year 2010.

Topic=(cladistic) NOT Topic=(phylogen*) AND Year Published=(2010)

Below is a screenshot of just one of many of the disappointing results. I’ve refined the search to just the PLoS paper, to clearly show that it does come up in this search:

It’s an Open Access paper, so we can all go see for ourselves the FULL content of the paper:

Na, B.-K., Bae, Y.-A., Zo, Y.-G., Choe, Y., Kim, S.-H., Desai, P. V., Avery, M. A., Craik, C. S., Kim, T.-S., Rosenthal, P. J., and Kong, Y. 2010. Biochemical properties of a novel cysteine protease of plasmodium vivax, vivapain-4. PLOS NEGLECTED TROPICAL DISEASES 4.

In which we find the text caption for figure 1 mentions ‘phylogen*’ twice!

from Na, B.-K., Bae, Y.-A., Zo, Y.-G., Choe, Y., Kim, S.-H., Desai, P. V., Avery, M. A., Craik, C. S., Kim, T.-S., Rosenthal, P. J., and Kong, Y. 2010. Biochemical properties of a novel cysteine protease of plasmodium vivax, vivapain-4. PLOS NEGLECTED TROPICAL DISEASES 4. CC-BY licenced

So at the very least I suspect Web of Science (WoS) is systematically NOT indexing the caption text of figures (if you know more than I do about this, please do comment). Academics rely on services like this to effectively and accurately search the literature, to perform comprehensive reviews and such. If all the textual content of science isn’t actually being indexed by WoS, that’s clearly going to lead to bad science at some point (e.g. a vital missing paper, not picked up in an otherwise well designed literature search). I could forgive them for not being able to OCR the text within the images of figures, but NOT for missing fully machine-readable text captions like this one. Furthermore, it’s Open Access and fully-digital – why aren’t they indexing figure caption text?



It appears it’s not just figure caption text they don’t index. Do they index only titles and abstracts?

Many of the other 81 results (papers) of that search for ‘cladistic’ but NOT ‘phylogen*’ contain the word-stem ‘phylogen*’ in the full text of the paper!


Wilts, E. F., Arbizu, P. M., and Ahlrichs, W. H. 2010. Description of bryceella perpusilla n. sp (monogononta: Proalidae), a new rotifer species from terrestrial mosses, with notes on the ground plan of bryceella REMANE, 1929. INTERNATIONAL REVIEW OF HYDROBIOLOGY 95.

Echeverry, A. and Morrone, J. J. 2010. Parsimony analysis of endemicity as a panbiogeographical tool: an analysis of caribbean plant taxa. BIOLOGICAL JOURNAL OF THE LINNEAN SOCIETY 101.

Stutz, H. L., Shiozawa, D. K., and Evans, R. P. 2010. Inferring dispersal of aquatic invertebrates from genetic variation: a comparative study of an amphipod and mayfly in great basin springs. JOURNAL OF THE NORTH AMERICAN BENTHOLOGICAL SOCIETY 29

Campo, D., Molares, J., Garcia, L., Fernandez-Rueda, P., Garcia-Gonzalez, C., and Garcia-Vazquez, E. 2010. Phylogeography of the european stalked barnacle (pollicipes pollicipes): identification of glacial refugia. MARINE BIOLOGY 157.

Choiniere, J. N., Clark, J. M., Forster, C. A., and Xu, X. 2010. A basal coelurosaur (dinosauria: Theropoda) from the late jurassic (oxfordian) of the shishugou formation in wucaiwan, people’s republic of china. JOURNAL OF VERTEBRATE PALEONTOLOGY 30.

Caldwell, M. W. and Palci, A. 2010. A new species of marine ophidiomorph lizard, adriosaurus skrbinensis, from the upper cretaceous of slovenia. JOURNAL OF VERTEBRATE PALEONTOLOGY 30.

Hastings, A. K., Bloch, J. I., Cadena, E. A., and Jaramillo, C. A. 2010. A new small short-snouted dyrosaurid (crocodylomorpha, mesoeucrocodylia) from the paleocene of northeastern colombia. JOURNAL OF VERTEBRATE PALEONTOLOGY 30.

Karanovic, I. and McKay, K. 2010. Two new species of leicacandona karanovic (ostracoda, candoninae) from the great sandy desert, australia. JOURNAL OF NATURAL HISTORY 44.

(and more, these are just some of the articles I’ve looked at the full-text of so far… I think it’s safe to say now this is NOT a one off phenomenon)

I’ve now found through manual inspection that at least 47 of the ‘hits’ for this search actually contain a ‘phylogen*’ word within the main text of the paper (excluding the reference list)

I guess I’m probably not the first to realise this but… wow. Is this not *really* poor service? I’m pretty sure my desktop software could do a better job of indexing than this. All it is, is simple string matching!

…and of course I can do a better job of this myself with Open Access papers. All one need do is download the OA corpus from UKPMC and index the *FULL* text including figure caption text and reference lists yourself. I wonder how many more relevant papers I might ‘find’ with my searches if I did this rather than relying on Web of Science?
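To make the ‘index it yourself’ idea concrete, here’s a minimal sketch of the same search (‘cladistic’ but NOT ‘phylogen*’) run over full text rather than just abstracts. Everything in it is an assumption for illustration: the two toy papers, the section names and the `search` helper are made up, and a real run would load the OA corpus from disk rather than from a dict.

```python
import re

# Hypothetical mini-corpus: each paper keeps its sections separately,
# so captions are searchable alongside the abstract and main text.
papers = {
    "paper_a": {
        "abstract": "A cladistic analysis of cysteine proteases.",
        "body": "Sequences were aligned and compared.",
        "captions": "Figure 1. Phylogenetic analysis of vivapain-4.",
    },
    "paper_b": {
        "abstract": "A cladistic study of rotifer morphology.",
        "body": "Characters were scored from specimens.",
        "captions": "Figure 1. Habitus, dorsal view.",
    },
}


def matches(paper, include, exclude_stem, sections=None):
    """True if `include` occurs as a whole word and no word starts with `exclude_stem`."""
    chosen = sections or paper.keys()
    text = " ".join(paper[s] for s in chosen).lower()
    has_term = re.search(r"\b" + re.escape(include) + r"\b", text) is not None
    has_stem = re.search(r"\b" + re.escape(exclude_stem), text) is not None
    return has_term and not has_stem


def search(corpus, include, exclude_stem, sections=None):
    """Return the sorted ids of papers matching 'include NOT exclude_stem*'."""
    return sorted(
        pid for pid, paper in corpus.items()
        if matches(paper, include, exclude_stem, sections)
    )


# Searching abstracts only behaves like the disappointing result above...
abstract_only = search(papers, "cladistic", "phylogen", sections=["abstract"])
# ...whereas searching every section correctly drops the paper whose caption says 'Phylogenetic'.
full_text = search(papers, "cladistic", "phylogen")
print(abstract_only, full_text)
```

The only real design choice here is keeping sections separate, so you can compare what an abstract-only index would return against a genuine full-text one.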

I realise thus far, I may not have explained too clearly exactly what I’m doing for my Panton fellowship. With this post I shall attempt to remedy this and shed a little more light on what I’ve been doing lately.

The main thrust of my fellowship is to extract phylogenetic tree data from the literature using content mining approaches (think text mining, but not just text!) – using the literature in its entirety as my data. I have very little prior experience in this area, but luckily I have an expert mentor guiding me: Peter Murray-Rust (whom you may often see referred to as PMR). For those of us biologists who may not be familiar with his work, and whilst trying not to be too sycophantic about it, PMR is simply brilliant. It’s amazing what he and his collaborators have done to extract chemical data from the chemical literature and provide it openly for everyone, in spite of fierce opposition at times from those with vested interests in this data remaining ‘closed’.

Now he’s turned his attention to the biological literature for my project and together we’re going to try and provide open tools to extract phylogenetic data from the literature. Initially I proposed trying to grab just tree topology and tip labels – a kind of bare minimum, but PMR has convinced me that we should be ambitious and all-encompassing, and thus our aims have expanded to include branch lengths, support values, the data-type the phylogeny was inferred from, and other useful metadata. And why not? We’re ingesting the totality of the paper in our process, from title page to reference list, so there’s plenty of machine-readable data to be gleaned. The question is, can we glean it accurately enough, balancing precision and recall?
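For anyone unfamiliar with the terms: precision is the fraction of what we extract that is actually correct, and recall is the fraction of what’s really there that we manage to extract. A minimal sketch of how one might score an extraction against a hand-checked answer, assuming made-up tip labels and a hypothetical `precision_recall` helper (this is not our actual evaluation code):

```python
def precision_recall(extracted, truth):
    """Score a set of extracted items against a hand-checked set of true items."""
    extracted, truth = set(extracted), set(truth)
    true_positives = len(extracted & truth)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical example: tip labels on a tree figure, checked by eye...
truth = {
    "Agapornis canus",
    "Agapornis pullarius",
    "Agapornis roseicollis",
    "Agapornis fischeri",
}
# ...versus what an extraction tool returned (one label missed, one piece of junk picked up).
extracted = {
    "Agapornis canus",
    "Agapornis pullarius",
    "Agapornis roseicollis",
    "Figure 2",
}

precision, recall = precision_recall(extracted, truth)
print(precision, recall)
```

Tuning a tool to grab more text from a figure tends to push recall up and precision down, which is the balance referred to above.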

So for starters, we’ve been testing our extraction tools on materials that we’re legally allowed to use, namely Open Access CC-BY papers from BMC & PLoS, specifically focusing on a subset of all ~8500 papers containing the word-stem phylogen* from BMC. It’s a rough proxy for papers that’ll contain a tree, and it’s good enough for now – we’ll need to be able to deal with false positives along with all the true positives, so it’s instructive to keep these in our sample.

We’ve been working on the regular structure of BMC PDFs, getting out bibliographic metadata and the main text for further NLP processing downstream, to pick out data- and method-relevant words like, say, PAUP*, ML, mitochondrial loci etc… But the real reason we’re deliberately using PDFs rather than the XML (which we also have access to) is the figures – where all the valuable phylogenetic tree data is. If this can be re-interpreted with reference to the bibliographic metadata, the figure caption and further methodological details from the full-text of the paper, then we may be able to reconstruct some fairly rich and useful phylogenetic data.
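As a rough illustration of the ‘pick out method-relevant words’ step, here’s a minimal sketch using plain regular expressions. The term list, the `spot_terms` helper and the sample sentence are all assumptions for illustration; a real pipeline would use a much larger vocabulary and proper NLP tooling.

```python
import re

# Hypothetical vocabulary: label -> pattern. 'ML' is matched case-sensitively
# as a whole word, so that it doesn't fire on words like 'HTML'.
METHOD_TERMS = {
    "PAUP*": r"PAUP\*",
    "maximum likelihood": r"(?i:\bmaximum likelihood\b)|\bML\b",
    "mitochondrial": r"(?i:\bmitochondrial\b)",
}


def spot_terms(text):
    """Return the labels of all vocabulary terms found in `text`."""
    return [
        label for label, pattern in METHOD_TERMS.items()
        if re.search(pattern, text)
    ]


sample = (
    "Trees were inferred under maximum likelihood (ML) in PAUP* "
    "from mitochondrial loci."
)
found = spot_terms(sample)
print(found)
```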

To make it clear, in slight contrast to the Lapp et al iEvoBio presentation embedded above, we’re not trying to just extract the images, but rather to re-interpret them back into actual re-useable data, probably to be provided in NeXML (and from there on, whatever form you want). We’re pretty sure it’s an achievable goal. Programs like TreeThief, TreeRipper, and TreeSnatcher Plus have gone some way towards this already, but have never before been incorporated in a content mining workflow AFAIK.

Unfortunately I wasn’t at iEvoBio 2012 (I’m short on money and on time these days) but it’s great to see from the slides the growing recognition of the SVG image file format as a brilliant tool for communicating digital science. I put a bit about that in my Hennig XXXI talk slides too (towards the end). Programs like TNT do output SVG files, so there’s scope to make this a normal part of any publication workflow. Regrettably though, rather few publisher-produced PDFs contain SVG formatted images – but if people, and editorial boards (perhaps?), can be made aware of their advantages, perhaps we can change this in future…?

[Figure: Agapornis phylogeny.svg from Wikipedia (PD)]

[Figure: the very same file, opened as plain text. It’s fairly easy to reconvert back into re-useable machine-readable data.]
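To give a flavour of why an SVG figure is so much friendlier than a bitmap, here’s a minimal sketch that reads tip labels and branch segments straight out of the markup with Python’s standard library. The tiny SVG below is a made-up toy, not the contents of the Wikipedia file, and real tree figures often use `<path>` elements and transforms that this sketch doesn’t handle.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

# Hypothetical toy figure: two tips joined by a vertical branch.
svg_source = """<svg xmlns="http://www.w3.org/2000/svg" width="200" height="80">
  <line x1="10" y1="20" x2="60" y2="20"/>
  <line x1="10" y1="60" x2="60" y2="60"/>
  <line x1="10" y1="20" x2="10" y2="60"/>
  <text x="65" y="24">Agapornis canus</text>
  <text x="65" y="64">Agapornis pullarius</text>
</svg>"""

root = ET.fromstring(svg_source)

# Tip labels come out as real text with coordinates, no OCR required.
labels = [
    (float(t.get("x")), float(t.get("y")), "".join(t.itertext()).strip())
    for t in root.iter(SVG_NS + "text")
]

# Branches come out as line segments (x1, y1, x2, y2).
segments = [
    tuple(float(line.get(k)) for k in ("x1", "y1", "x2", "y2"))
    for line in root.iter(SVG_NS + "line")
]

print(labels)
print(segments)
```

Turning those labels and segments back into a tree topology is the harder part, but the raw ingredients are all there as text.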
Gathering phylogenetic data from beyond PLoS, BMC and other smaller Open Access publishers is going to be hard, not for technical, but purely legal reasons:

The scope and scale of phylogenetic research (using ‘phylogen*’ as a proxy):

There’s a lot of phylogenetic research out there… but little of it is Open Access – which is problematic for content mining approaches – particularly if subscription-access publishers are reluctant to allow access.

Some facts:

  • A Thomson Reuters Web of Science search of the SCI-EXPANDED database (only) for Topic=(phylogen*) AND Year Published=(2000-2011) returns 101,669 results (at the time of searching, YMMV)
  • 91,788 of which are primary Research Articles (as opposed to Reviews, Proceedings Papers, Meeting Abstracts, Editorial Materials, Corrections, Book Reviews etc…)
  • Recent MIAPA working group research I contributed to (in review) quantitatively estimates that approximately 66% of papers containing ‘phylogen*’ report a new phylogenetic analysis (new data).
  • Thus conservatively assuming just one tree per paper (there are often many per paper), there are > 60,000 trees contained within just 21st century research articles.
  • As with STM publishing as a whole, the number of phylogenetic research articles being published each year shows consistent year-on-year increases
  • Cross-match this with publisher licencing data and you’ll find that only ~11% of phylogenetic research published in 2010 was CC-BY Open Access (and this % probably decreases as you go back before 2010)
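For the sceptical, the ‘> 60,000 trees’ figure above is just the product of the numbers in the list. A quick sketch of the arithmetic (the variable names are mine):

```python
# Numbers taken from the list above.
research_articles = 91_788          # primary Research Articles, 2000-2011
fraction_with_new_analysis = 0.66   # MIAPA working group estimate
min_trees_per_paper = 1             # deliberately conservative

estimated_trees = (
    research_articles * fraction_with_new_analysis * min_trees_per_paper
)
print(round(estimated_trees))
```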
So the real fun and games will come later this year, when I’m sure we’ll have the capability (software tools) to do some amazing stuff, having first perfected it on OA materials… but will they let us? Heather Piwowar’s experience earlier this year didn’t look too fun – and that was all for just one publisher. Phylogenetic research occurs in and beyond at least 80 separate STM publishers by my count (let alone the >500 journals it occurs in!) – so there’s no way anyone would bother trying to negotiate with them all! I’m sticking by the intuitive principle that The Right to Read Is the Right to Mine but I’ll have a think about that some more when we actually get to that bridge.

Finally, it’s also worth acknowledging that we’re certainly not the first in this peculiar non-biomedical mining space – ‘biodiversity informaticists’ have been doing useful things with these techniques for a while now in innovative ways largely unrelated to medicine e.g. LINNAEUS from Casey Bergman’s lab, and a recent review of other projects from Thessen et al (2012) [hat-tip to @rdmpage for bringing that latter paper to the world’s attention via Twitter]. Literally all areas of academia could probably benefit from some form or another of content mining – it’s not just a biomed / biochem tool.

So, I hope that explains things a bit better. Any questions?


Some references (but not all!):

Gerner, M., Nenadic, G., and Bergman, C. 2010. LINNAEUS: A species name identification system for biomedical literature. BMC Bioinformatics 11:85+. [CC-BY Open Access]

Thessen, A. E., Cui, H., and Mozzherin, D. 2012. Applications of natural language processing in biodiversity science. Advances in Bioinformatics 2012:1-17. [CC-BY Open Access]

Hughes, J. 2011. TreeRipper web application: towards a fully automated optical tree recognition software. BMC Bioinformatics 12:178+.  [CC-BY Open Access]

Laubach, T., von Haeseler, A., and Lercher, M. 2012. TreeSnatcher plus: capturing phylogenetic trees from images. BMC Bioinformatics 13:110+. [CC-BY Open Access, incidentally I was one of the reviewers for this paper. I signed my review, and made a point of it too. Nor was it a soft review either I might add]

[A monthly update on my Panton Fellowship related activities]

Last month I was slightly late with my monthly report, so this month I’m going to get things back on track and write my post now, on this leisurely sunny Sunday afternoon…

It’s been a good month:

First of all, I had the chance to speak about my Fellowship work for the Ede & Ravenscroft Prize final. I made a few choice comments to our Pro-Vice-Chancellor who was present, about the plurality of benefits of Open Access & Open Data, and the difficulties of trying to do content mining research on subscription-access journals. I didn’t win the prize in the end, but getting to the final, and being recognised as one of the top 5 research students at the University of Bath was pretty cool. I then immediately went out and spent part of the £50 runners-up prize on Michael Nielsen’s excellent Open Science book Reinventing Discovery. I gave it a read, then immediately passed it on to another lab for a friend to read, and it now resides with my supervisor who will also hopefully find time to read it (part of my not so subtle attempt to help spread the knowledge of how digital, networked, openness can hugely benefit research).

I bought some other books too, but this was the important one

Then on the 11th of May, PMR came to Bath to give a talk to our Biology & Biochemistry Department. Those who came (including our subject librarian – thanks for coming!) were wowed with the ways in which PMR and colleagues have helped make semantically-enriched Linked Open Data available on chemicals for everyone, not just academic chemists! It’s brilliant to have an expert demonstration of the ways in which projects like CrystalEye have made the data underlying some chemical research publications far more easily searchable, open, and re-usable across many thousands of publications. There’s a strong, easily-justified need for more of this type of post-publication data scraping in biology (and palaeontology I might add!). We share a strong belief that research publications should be made open and explicitly re-usable without restriction.

Sadly, most of the biological literature in my domain is neither open, nor re-usable without permission (more of which in a later post) – which makes my highly integrative data-focused research that much harder than it otherwise could be. As I’ve said before on the internet – I have all of PLoS on my USB stick, and I’ve no doubt I could store all the relevant papers I need, and scrape data from them, on just my desktop computer hard drive – yet subscription-access paywalls and current copyright law prevent me from doing this for much of the literature (PLoS and other Open Access literature aside). I can understand how we arrived at this strange situation (we didn’t use to have such computational power to analyze large volumes of data, nor the Internet with which to freely & easily distribute research) but now we *do*, it seems like utter madness to continue to publish research in ways (e.g. subscription-access, copyright-transferred to the publisher) which make it very hard to analyze or re-use en masse.

The Panton Principles

So I’ve been joining the nascent OKFN working group Skype calls on Content Mining and soon we will hopefully have some interesting things to announce…

PMR also got the chance to meet my PhD supervisor and the rest of the lab which is great since I’m doing this fellowship work concurrently with my PhD work on fossils & phylogeny.

Later on in the month, I suggested the excellent Panton Discussions be made more amenable for podcasting. An OKFN group are now working on producing an audio-only version of all of them, and making them more easily integrable on personal listening devices (mobile phones & MP3 players).

Finally, the past week has been a whirlwind:

On Tuesday (22nd May) I was at the Natural History Museum, London to talk with Dr Mark Wilkinson about some PhD project-related work – he’s kindly supplied me with some source code (among other things), so I can recompile his programs to run on my linux machines. I told him all about the OKFN & Panton Fellowship and he was very supportive of the goals. Time and time again, I encounter such enlightened, high-up academics and wonder why & how academic publishing is still in its current state – it’s not for want of researcher support for Open Access in my experience!

On Wednesday, I was back with PMR in Cambridge hacking PDFs, focusing particularly on BMC literature as this is BOAI-compliant Open Access and we can do what we want with such material. Towards the end of the session we had a think about what metadata would be desirable to extract from the text of the papers and figure labels, that might add context and information to the phylogenetic analysis performed, and the phylogenetic tree presented, in each of the papers. By coincidence the Open Tree of Life group have also just republished the MIAPA working group list of desirable metadata for phylogenies. We certainly won’t be able to get all this information, and the information we can extract may not necessarily be interpreted and associated 100% correctly, but it will certainly be hugely valuable as this information would otherwise take 4 years to re-digitise(!) by some estimates.

On Thursday, I went to ProgPal (Progressive Palaeontology), a conference also in Cambridge. There I gave a short ‘announcement’ talk with slides to explain to everyone there a) what the Open Knowledge Foundation are about, and b) why they might be of interest to academic palaeontologists. I touched upon Open Access and Open Data issues in palaeontology and encouraged those with an interest to visit the website, join the Open Science mailing list, listen to or watch the Panton Discussions, and consider applying for a Panton Fellowship next year if they had any innovative ideas for paleo-data. This talk tied-in very well with the other announcement talks for Palaeontology Online (a new free outreach & education initiative) and Palaeocast (a new paleo podcast).

Which reminds me, I should really pop them both an email to explain why they should post their content with a Creative Commons Attribution Licence, so their materials can be re-used, re-posted and remixed as Open Educational Resources.

Best of all, on Friday I travelled down to London to my alma mater to attend & furiously tweet the Open Access debate at Imperial College London, in the very same lecture room where I sat through most of my undergrad lectures! There were rather a lot of palaeontologists also there, including Tori Herridge and Nick Crumpton, and a large volume of tweets under the #OAdebate hashtag were sent (archived here if you’re interested). Graham Taylor of the Publishers Association said some rather provocative things that got me rather hot under the collar including:

…we [publishers] are the stewards of genuine science…

Which I think could all too easily be misinterpreted to overstate the importance of the role that publishers play in organising peer-review, spell-checking, typesetting and other such tasks. I also couldn’t help laughing out loud at Graham’s straight-faced proposal for subscription-access publishers to offer ‘fee-waived walk-in access at public libraries’ as a way to provide taxpayer access to taxpayer funded research. Stephen Curry (also on the panel) thankfully quickly interrupted to state how ridiculous that was. I’ll leave it to Mike Taylor’s post here to explain just how ludicrous that proposal is in light of 21st century technology. I will however give Graham Taylor credit for further disavowing the Research Works Act; he said of it [and presumably his organisation’s initial support for it]: “the RWA was not such a good idea, don’t ask me to defend that one”, which elicited a pleased response from the audience.

There will be another debate held after the release of the Finch report which I suspect will be rather more exciting. A lot of the issues were aired at this debate, but the brevity of the time slot allowed for the event meant that there was not enough time for in-depth discussion IMO.

That’s just about it for the month. I can’t wait for what the next month will bring!