Show me the data!

I realise that, thus far, I may not have explained very clearly what I’m doing for my Panton Fellowship. With this post I shall attempt to remedy that and shed a little more light on what I’ve been doing lately.

The main thrust of my fellowship is to extract phylogenetic tree data from the literature using content mining approaches (think text mining, but not just text!) – using the literature in its entirety as my data. I have very little prior experience in this area, but luckily I have an expert mentor guiding me: Peter Murray-Rust (whom you may often see referred to as PMR). For those of us biologists who may not be familiar with his work (and whilst trying not to be too sycophantic about it), PMR is simply brilliant: it’s amazing what he and his collaborators have done to extract chemical data from the chemical literature and provide it openly for everyone, in spite of fierce opposition at times from those with vested interests in keeping this data ‘closed’.

Now he’s turned his attention to the biological literature for my project, and together we’re going to try to provide open tools to extract phylogenetic data from the literature. Initially I proposed grabbing just tree topology and tip labels – a kind of bare minimum – but PMR has convinced me that we should be ambitious and all-encompassing, and thus our aims have expanded to include branch lengths, support values, the data-type the phylogeny was inferred from, and other useful metadata. And why not? We’re ingesting the totality of the paper in our process, from title page to reference list, so there’s plenty of machine-readable data to be gleaned. The question is, can we glean it accurately enough, balancing precision and recall?

So for starters, we’ve been testing our extraction tools on materials that we’re legally allowed to use, namely Open Access CC-BY papers from BMC & PLoS, specifically focusing on the subset of ~8,500 BMC papers containing the word-stem phylogen*. It’s a rough proxy for papers that’ll contain a tree, and it’s good enough for now – we’ll need to be able to deal with false positives alongside the true positives, so it’s instructive to keep these in our sample.

We’ve been working on the regular structure of BMC PDFs, getting out bibliographic metadata and the main text for further NLP processing downstream, to pick out data- and method-relevant words such as PAUP*, ML, mitochondrial loci, etc. But the real reason we’re deliberately using PDFs rather than the XML (which we also have access to) is the figures – where all the valuable phylogenetic tree data is. If this can be re-interpreted with reference to the bibliographic metadata, the figure caption, and further methodological details from the full text of the paper, then we may be able to reconstruct some fairly rich and useful phylogenetic data.
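As a toy illustration of that downstream keyword-spotting step – the term list, regex and example sentence below are my own inventions, not the actual pipeline, which uses fuller NLP over the extracted PDF text:

```python
import re

# Hypothetical mini-lexicon of method/data terms (illustrative only)
METHOD_TERMS = re.compile(
    r"PAUP\*|\b(?:ML|maximum likelihood|Bayesian|mitochondrial)\b"
)

# An invented methods-section sentence
methods_text = ("Trees were inferred in PAUP* under maximum parsimony "
                "and by ML from mitochondrial loci.")

# Collect the distinct terms that occur in the text
hits = sorted({m.group(0) for m in METHOD_TERMS.finditer(methods_text)})
print(hits)  # ['ML', 'PAUP*', 'mitochondrial']
```

A real lexicon would of course be far larger (programs, models, markers, gene names) and would need disambiguation, but the basic spot-and-tag step really is this simple.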

To be clear, in slight contrast to the Lapp et al. iEvoBio presentation embedded above, we’re not trying just to extract the images, but rather to re-interpret them back into actual re-useable data, probably to be provided in NeXML (and from there, converted into whatever form you want). We’re pretty sure it’s an achievable goal. Programs like TreeThief, TreeRipper, and TreeSnatcher Plus have gone some way towards this already, but AFAIK they have never before been incorporated into a content mining workflow.

Unfortunately I wasn’t at iEvoBio 2012 (I’m short on money and on time these days) but it’s great to see from the slides the growing recognition of the SVG image file format as a brilliant tool for communicating digital science. I also put a bit about that in my Hennig XXXI talk slides (towards the end). Programs like TNT do output SVG files, so there’s scope to make this a normal part of any publication workflow. Regrettably, though, rather few publisher-produced PDFs contain SVG-formatted images – but if people, and editorial boards (perhaps?), can be made aware of their advantages, perhaps we can change this in future…?

The very same file, opened as plain text. It’s fairly easy to convert back into re-useable, machine-readable data.


Agapornis phylogeny.svg from Wikipedia (PD)
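To illustrate why SVG figures are so mining-friendly: the tip labels sit in the file as ordinary `<text>` elements, so a few lines of standard XML parsing recover them. A minimal sketch – the SVG snippet below is a hand-made stand-in, not the actual Wikipedia file:

```python
import xml.etree.ElementTree as ET

# Hand-made stand-in for a tree figure's SVG source (real files are richer,
# but the tip labels still live in <text> elements like these)
svg = """<svg xmlns="http://www.w3.org/2000/svg">
  <path d="M10,10 L50,10"/>
  <text x="55" y="12">Agapornis roseicollis</text>
  <text x="55" y="30">Agapornis fischeri</text>
</svg>"""

SVG_NS = "{http://www.w3.org/2000/svg}"
tips = [t.text for t in ET.fromstring(svg).iter(SVG_NS + "text")]
print(tips)  # ['Agapornis roseicollis', 'Agapornis fischeri']
```

Compare that with a rasterised PNG of the same tree, where recovering those labels means OCR. Reconstructing the branching structure from the `<path>` elements is harder, but still tractable geometry rather than image recognition.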

Gathering phylogenetic data from beyond PLoS, BMC and other smaller Open Access publishers is going to be hard, not for technical, but purely legal reasons:

The scope and scale of phylogenetic research (using ‘phylogen*’ as a proxy):

There’s a lot of phylogenetic research out there… but little of it is Open Access – which is problematic for content mining approaches, particularly if subscription-access publishers are reluctant to allow access.

Some facts:

  • a Thomson Reuters Web of Science search of the SCI-EXPANDED database (only), with Topic=(phylogen*) AND Year Published=(2000-2011), returns 101,669 results (at the time of searching; YMMV)
  • 91,788 of which are primary Research Articles (as opposed to Reviews, Proceedings Papers, Meeting Abstracts, Editorial Materials, Corrections, Book Reviews etc…)
  • Recent MIAPA working group research I contributed to (in review) quantitatively estimates that approximately 66% of papers containing ‘phylogen*’ report a new phylogenetic analysis (new data).
  • Thus conservatively assuming just one tree per paper (there are often many per paper), there are > 60,000 trees contained within just 21st century research articles.
  • As with STM publishing as a whole, the number of phylogenetic research articles being published each year shows consistent year-on-year increases.
  • Cross-match this with publisher licensing data and you’ll find that only ~11% of phylogenetic research published in 2010 was CC-BY Open Access (and this percentage probably decreases as you go back before 2010)
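The back-of-envelope arithmetic behind the >60,000 figure, spelled out (all numbers taken from the bullets above):

```python
# Figures from the Web of Science search and MIAPA estimate above
results_2000_2011 = 101_669   # WoS hits for Topic=(phylogen*), 2000-2011
research_articles = 91_788    # of which are primary research articles
new_analysis_rate = 0.66      # fraction reporting a new phylogenetic analysis

# Conservatively assume exactly one tree per qualifying paper
trees_lower_bound = research_articles * new_analysis_rate
print(round(trees_lower_bound))  # 60580, i.e. > 60,000 trees
```

And since many papers contain several trees, the true count is surely much higher.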
So the real fun and games will come later this year, when I’m sure we’ll have the capability (software tools) to do some amazing stuff, having first perfected it on OA materials… but will they let us? Heather Piwowar’s experience earlier this year didn’t look too fun – and that was all for just one publisher. Phylogenetic research appears across at least 80 separate STM publishers by my count (let alone the >500 journals it occurs in!) – so there’s no way anyone could realistically negotiate with them all! I’m sticking by the intuitive principle that The Right to Read Is the Right to Mine, but I’ll think about that some more when we actually come to that bridge.

Finally, it’s also worth acknowledging that we’re certainly not the first in this peculiar non-biomedical mining space – ‘biodiversity informaticists’ have been doing useful things with these techniques for a while now, in innovative ways largely unrelated to medicine: e.g. LINNAEUS from Casey Bergman’s lab, and a recent review of other projects from Thessen et al. (2012) [hat-tip to @rdmpage for bringing that latter paper to the world’s attention via Twitter]. Nearly all areas of academia could probably benefit from some form of content mining – it’s not just a biomed/biochem tool.

So, I hope that explains things a bit better. Any questions?


Some references (but not all!):

Gerner, M., Nenadic, G., and Bergman, C. 2010. LINNAEUS: A species name identification system for biomedical literature. BMC Bioinformatics 11:85+. [CC-BY Open Access]

Thessen, A. E., Cui, H., and Mozzherin, D. 2012. Applications of natural language processing in biodiversity science. Advances in Bioinformatics 2012:1-17. [CC-BY Open Access]

Hughes, J. 2011. TreeRipper web application: towards a fully automated optical tree recognition software. BMC Bioinformatics 12:178+.  [CC-BY Open Access]

Laubach, T., von Haeseler, A., and Lercher, M. 2012. TreeSnatcher plus: capturing phylogenetic trees from images. BMC Bioinformatics 13:110+. [CC-BY Open Access, incidentally I was one of the reviewers for this paper. I signed my review, and made a point of it too. Nor was it a soft review either I might add]

Running TNT in parallel

July 14th, 2012 | Posted by rmounce in Conferences | Phylogenetics | TNT - (6 Comments)

Building upon the instructions given here and here I thought I’d write up one of the many useful things Pablo Goloboff kindly taught us at the TNT scripting workshop after the Hennig XXXI meeting.

It’s actually not the easiest thing to set up if you’re using Ubuntu… Pablo had to help me do it – I would never have got it up and running on my own.


You’ll need to install the pvm package either from the repositories with

sudo apt-get install pvm

or download and compile from source

It’s actually better to compile from source, because the pvm package in the Ubuntu repositories is out of date – they provide only version 3.4.5, whilst the latest version of pvm, released way back in 2009, is 3.4.6! I guess the packaging team have other priorities…

Then you’ll need to configure pvm on your machine:

  • edit your bashrc file  nano ~/.bashrc  and insert this line:

export PVM_ROOT=/usr/lib/pvm3

save & close the bashrc file. Now run source ~/.bashrc and then test the path with echo $PVM_ROOT – this should now return /usr/lib/pvm3


  • in your user home directory (for me this is /home/ross/ ) create a plaintext file called hostlist (the exact name doesn’t matter but remember it) and write one line within this file:

rossnetbook ep=/usr/bin/

(replace ‘rossnetbook’ with your computer hostname – if you’re not sure what this is then nano /etc/hostname will tell you) save and close this file.

  • now start pvm from your user home directory with pvm hostlist – this tells pvm your hostname and the path. Unfortunately you’ll need to start up pvm this way every time you restart your computer. Perhaps there’s a better way? Let me know if so…

Finally, make sure you’ve copied the 64-bit TNT binary to both /usr/bin/ & to your user home directory and make sure that they’re executable.

Now you should be ready to go…

if you get an error message like this from TNT:

tnt*>ptnt begin ajob2 2 = mult 5; return ; ptnt wait . ;
Macro language is ON
Macros: 50.5 Kb in use, 51.8 Kb free
libpvm [pid7539] /tmp/pvmd.1000: No such file or directory
libpvm [pid7539] /tmp/pvmd.1000: No such file or directory
libpvm [pid7539] /tmp/pvmd.1000: No such file or directory
libpvm [pid7539]: pvm_config(): Can’t contact local daemon

Can’t enter parallel interface (make sure PVM is running)

you’ve probably forgotten to start pvm with pvm hostlist

see the video I uploaded below for a demonstration of the speed-up possible by performing tasks in parallel:

the video shows me performing a simple search on the zilla dataset of Chase et al. (1993) using traditional heuristic settings (60 reps), performed first in serial, then in parallel (starting after 2:00) as 20 reps × 3 slaves.

100 seconds for search 1, down to just 48 seconds for search 2 (in parallel). YMMV

Neither of these searches found the shortest length trees btw!

mxram 100; /* increase memory */
p zilla.tnt; /* read in the data */
hold 20000; /* increase the maximum number of trees held */
mult 60; /* perform a traditional search with 60 replications */
le; /* tree lengths */

/* parallel tnt job, called 'ajob', using 3 slaves each performing 'mult 20' */
ptnt begin ajob 3 = mult 20; return ; ptnt wait . ;

basically, just insert whatever you want your slaves to do between the '=' and the 'return;' commands.

ptnt get ajob; /* get data back from slaves to master */

This was just the tip of the iceberg of the course. I can’t even begin to write up the rest of it in this much detail! But I hope this helps…

many many thanks Pablo, and all the organisers of this workshop AND the conference – it was *much* appreciated

After sending a letter to my local MP, urging him to support the recommendations of the Hargreaves Report on Intellectual Property reform in parliament nearly a month ago (sent on the 17th June 2012) – I finally have a reply!

Sadly, it’s not the reply I wanted. Don Foster does not appear willing to support the Early Day Motion on Intellectual Property law reform to further enable research, which I explicitly asked him to sign.


Below is a verbatim copy of his letter as was emailed to me earlier (12th July 2012), which I am posting here so all his constituents can see, for their own satisfaction (or not, as in my case), his position with respect to IP reform.



12 July 2012

Mr Ross Mounce

Flat 3, Rochfort Court

Forester Avenue



Our Ref: Moun010/1

Dear Mr Mounce


Thank you for writing to me in reference to the exceptions for digital content proposed by the Hargreaves Review. I had in fact received your letter, so I apologise the confusion on Twitter and the delay in replying; life has been rather hectic recently not least because of the impending Olympics and my role as a member of the Olympic Board. Adding this to the very large amounts of correspondence received in my office each day, I’m sorry to say that I cannot always respond as quickly as I’d like.


As you may know, I have been involved in these issues for some time. I am currently working within Government on the development of the Communications White paper (which will touch on these issues when it is published in early 2013), I helped initiate – and serve as a member of – the Creative Industries Council (which, among other issues, is reviewing the Hargreaves Report) and am a member of the All Party Parliamentary Group on IP (which is currently reviewing the work of the IPO and its various recent pronouncements).


Inevitably I am somewhat constrained in responding in detail to your letter since some of the work I am involved in is not yet public and, more importantly, because final conclusions haven’t been reached.


I am also conscious that we have to work within – while seeking, potentially, to change – relevant EU legislation. As you will know, this includes the InfoSoc Directive (Directive 2001/29/EC; see


You will know better than I, that the development of “exceptions” is never easy. An exception for “format shifting” may be alright and reasonable for, say, the music industry but the situation is very different for the UK Video Industry. Similarly, an exception for “parody” could make sense for small snippets used in a comedy show, but would not necessarily be appropriate for a situation where one artist does a complete performance of another artist’s song and claims it to be a parody. Some of the proposed exceptions for the copying of educational works seem to worryingly disregard the importance of copyright/IP protection in ensuring the flow of new works. 

Whilst Don Foster MP will treat as confidential any personal information which you pass on, he may need to allow his staff and authorised volunteers to see it if this is needed to help and advise you. Don may pass on all or some of this information to agencies such as DWP, Inland Revenue or Local Council if this is necessary to help with your case. We may write to you from time to time, to keep you informed on issues that you might find of interest. Please let us know if you do not wish him to contact you for this purpose

— page break —

In your letter you argue that the exceptions proposed in relation to “data mining” should be accepted.


I am well aware of the strength of feeling among some on this issue. But, as the IPO’s summary of the responses to consultation on the issue makes clear, proposals to enable researchers to use computerised techniques to read information contained in journal articles without infringing publishers’ rights have drawn “strongly divided” views from the industry.


Certainly it is true that, as the IPO puts it, “Researchers and research institutions generally supported the proposed exception. They argued that copyright was not established to restrict the use of data, and the added value of these technologies was provided by the actions of researchers, not publishers.”


However, there is also a strongly held opposing view; one which suggested that it was “too soon to seek a regulatory solution in a new and fast-developing sector, ” that “a copyright exception would prevent publishers from ensuring security of content and stability of provision,” and that “an unremunerated exception would remove the incentive for publishers to make the considerable investments needed to convert content into the right forms, to develop their own services, and to support the application of services by researchers or third parties.”


In the light of these arguments we are currently working to find a way forward. I am not yet in a position to give you assurances that I will press from the exceptions you seek as they are currently formulated (although, I am more inclined to support them than to oppose them).


I hope you will understand, and again, apologies for the delay in replying.

With best wishes,

Yours sincerely,

Rt Hon Don Foster MP

Please reply to 31 James Street West, Bath, BA1 2BT

Tel: 01225 338 973 Fax: 01225 463 630


I have also forwarded Don’s letter to all the staff in my department, as this is about research, and their local parliamentary representative’s views with respect to research, and thus concerns them:


Dear all,

Last month, I sent a formal letter to our local MP Don Foster, urging him
to sign his support for recommendations in the Hargreaves report[1] on
Intellectual Property law reform as it relates to research, particularly
the proposed exceptions to copyright to enable text-mining of otherwise ‘closed
access’ content. (You can read the full letter I sent here[2] it’s fairly short.)
The legal obstructions to text-mining research were also recently described in a Nature news piece[3].

This directly relates to my fellowship research project, as I’m using
content mining techniques to re-extract phylogenetic data from the
literature. At the moment I can only legally mine just ~13% of UKPMC
literature (the XML of which is just 5.5Gb[4] btw, let me know
if you want a copy). This is a great shame, as we have the tools and
capability to make use of ALL of the research literature and more with
current techniques. Only legal barriers prevent this.

I have attached his reply. At best he is sitting on the fence. At worst I
think he has failed to critically evaluate the meekness of the
counter-arguments given, such as it being “too soon to seek a regulatory
solution in a new and fast-developing sector”.

So, is anyone here interested in sending a further letter in support of
enabling text-mining research? This is perhaps our one and only chance to
directly influence government policy on this research issue. I feel a more
senior scientist from the University of Bath would perhaps help convince
Don to lend his support to this issue. The parliamentary motion already has
the support of 27 MPs[5], but will need more.

Please feel free to contact me offlist about this. I will not contact this
list about this issue again.

Many thanks for your time,



Having invested some time in this already, there seems little point in giving up now. I will wait and see over the weekend if there is anyone else from my department (or the University as a whole) that wants to pursue this further before I try again.

The subscription-access segment of the STM publishing industry that is protesting the recommendations of the Hargreaves report undoubtedly has paid lobbyists dedicated to this issue, and they have clearly done a good job here. I am unpaid and inexperienced with respect to arguing this from the research perspective, and probably outnumbered. But I will not stop trying to further enable research.

I sent my local MP (Don Foster, Lib Dem) a simple, fairly short (~265 words), clear & concise formal letter 18 days ago – I blogged the draft of it (which is virtually the same) here.

It’s been at least 13 working days now by my count and I still haven’t received a proper reply, so I tweeted @DonFosterMP last night:



I soon also got a reply from Don Foster’s press officer, email below:


From: “ROBERTS, Nick”
To: “‘Ross Mounce'”
Subject: RE: Letter from your constituent Ross Mounce
Date: Thu, 5 Jul 2012 10:12:06 +0000

Dear Ross,

I’ve just seen your tweets to Don. Apologies for the mix up. acts as a middleman and forwards your query on to our casework folder as opposed to Don directly. There is nothing wrong with this, it just means that our acknowledgment email goes to them and does not get forwarded to you.

We did receive your email of the 17th and rest assured we will get that response to you asap. However, we get over 500 emails a day so as you’ll appreciate there are backlogs. This is especially true in the summer when there are staff holidays. Our small team then have to prioritise “urgent” cases.

Once again, please accept my apologies for the delay in Don’s reply.



Nick Roberts
Caseworker & Press Officer
Office of the Rt Hon Don Foster MP

31 James Street West, Bath, BA1 2BT
t: 01225 338973
f: 01225 463630

NOTE: Information in this email is confidential and may be privileged. It is intended for the addressee only. If you have received it in error please notify the sender immediately and delete it from your system. You should not otherwise copy it, retransmit it, use or disclose its contents unless permission to do so is explicitly stated. Views expressed in personal emails do not necessarily reflect the position or opinion of the Liberal Democrats.

Personally I don’t care if my MP occasionally receives 500 emails a day – he can still send confirmation of receipt messages automatically surely? It would have been nice to know that my letter had at least been looked at.

I can’t help feeling that I’m being fobbed off here.

But it appears I’m not alone in being ignored. According to 2008 statistics, Don Foster only replied to 57% of letters sent to him via that service (and in fairness this makes him far from the ‘worst’ MP in terms of response rate for that year; stats here).

With so many young people today completely apathetic towards UK politics, myself included, this hardly sets a great precedent. It was my first attempt to engage, and so far it’s been largely unsatisfactory.

So what to do next? I don’t know frankly. I await a fuller response from Don Foster himself.

I’ll keep this post updated with any further relevant correspondences.


It’s that time again… time to write my monthly Panton Fellowship update.

The trouble is, as I start writing this it’s 6am (London, UK). I arrived back from the Hennig XXXI meeting (University of California Riverside) after a long flight yesterday and am supremely jetlagged. I still can’t decide whether this is awesome (I can get more work done, by waking up earlier), or terrible as I can’t keep my eyes open past 9pm at night!

At this conference I shoe-horned some of my Panton Fellowship project work into the latter half of my talk (slides below), as it fitted in with the theme of the submitted abstract on supertrees.

Supertrees are just one of many, many different possible (re)uses of the phylogenetic tree data I am trying to liberate from the literature for this project. I tried to stress this during my talk, as a lot of people at Hennig aren’t too keen on supertrees as a method for inferring large phylogenies. In fact, there was a compelling talk, with solid data, from Dan Janies later on in the conference, critiquing supertree methods such as SuperFine and SuperTriplets, which were outperformed in most tests, in terms of both speed and optimality (tree length), by supermatrix methods using TNT. That’s fine though – there are so many other interesting hypotheses one can investigate with large samples of real phylogenetic estimates (trees).


  • Do model-based phylogenetic analyses perform better than parsimony? [Probably not, judging by the conclusions in this paper] – I’d like to see this hypothesis re-tested more rigorously using tree-to-tree distance comparisons between the trees from the different methods. Except we can’t currently do this very easily, because there’s a paucity of machine-readable tree data from published papers.
  • Meta-analysis of phylogenetic tree balance and factors that influence balance e.g. (this thesis, and this PLoS ONE article).  Are large trees more imbalanced than small trees? Are vertebrate trees more balanced than invertebrate trees?
  • Fossil taxa in phylogenetic trees – are they more often than not found at the base of the tree? Is this ‘real’ or perhaps apparent ‘stem-ward slippage‘ caused by preservational biases?
  • Similarity and dissimilarity between phylogeny and measures of morphological disparity as studied  by my lab mate Martin Hughes
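As a concrete example of the balance question in the second bullet: the Colless index sums, over every internal node of a bifurcating tree, the absolute difference in tip counts between its two daughter clades (0 = perfectly balanced). A quick sketch of my own, using nested tuples as trees – this is not code from any of the cited works:

```python
def colless(tree):
    """Colless imbalance of a bifurcating tree given as nested 2-tuples."""
    def walk(node):
        # returns (number of tips below node, imbalance accumulated below node)
        if not isinstance(node, tuple):          # a tip
            return 1, 0
        n_left, i_left = walk(node[0])
        n_right, i_right = walk(node[1])
        return n_left + n_right, i_left + i_right + abs(n_left - n_right)
    return walk(tree)[1]

balanced = (("a", "b"), ("c", "d"))              # perfectly balanced, 4 tips
caterpillar = ((("a", "b"), "c"), "d")           # maximally imbalanced, 4 tips
print(colless(balanced), colless(caterpillar))   # 0 3
```

With thousands of liberated trees, one could compute this (suitably normalised for tree size) across the whole sample and test, say, vertebrate vs invertebrate clades.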

So, I hope you’ll appreciate this data isn’t just needed for producing large supertrees.

I could go on about the conference – it was excellent as ever, but I’ll save that for a dedicated later post.

Other activities this month included:

  • submitting my quarterly Panton report to the Fellowship Board
  • attending the OKFN Bibliohack session at QMUL’s Mile End campus (13th & 14th June) helping out with the creation of the OKFN Open Access Index, and learning how to use & debug a few issues with PubCrawler (a web crawler for scraping academic publication information, not a beer finder app!), with Peter Murray-Rust
  • discussing Open Access, Open Data and full-text XML publishing with the Geological Society of London. The GSL have a working group currently investigating if/how they can transition to greater openness. Kudos to them for looking into this. Many a UK academic society may currently be hiding its head in the sand, ignoring that the UK is now committed, policy-wise, to Open Access as the future of research publishing. It probably won’t be easy for GSL to make this transition, as their accounts [PDF] show they are rather reliant on subscription-based journals and books for income. It’s hard to see how Open Access article processing charges could immediately replace the £millions of subscription income per year from relatively few books & journals. Careful and perhaps difficult decisions will have to be made at some point to balance the goals of this charitable society, the acceptable level of income, and the choice and amount of expenditure on non-publication-related activities (e.g. ‘outreach events’). Interestingly, I note the American Geophysical Union (AGU) has recently decided to outsource their publications to an external company. Does anyone know who yet? I just hope it’s not Elsevier.

Finally, the audio for the talk on the Open Knowledge Foundation and the Panton Fellowships that I gave in Cambridge recently has now been uploaded, so I can present the slides and the actual talk together (below) for the first time! Many thanks to the organisers of the conference for doing all this work to make audio from all the talks available – it’s really cool that a relatively modest, small PhD-student conference can produce such an excellent digital archive of what happened. I only wish the ‘bigger’ conferences had the resources & willpower to do this too!

…and if that’s not enough Panton updates for you, you can read Sophie Kershaw’s updates for June too, over on her blog

Taking inspiration from Cameron Neylon, I have written a DRAFT letter to my local MP urging him to support the recommendations in the Hargreaves Review of Intellectual Property.

[UPDATE: Letter now sent :) ]


Dear Don Foster,

As a graduate research student at the University of Bath, currently using text-mining techniques to do scientific research in an efficient and comprehensive manner, I urge you to sign this early day motion (no. 151, tabled 11.06.2012).

Aside from my tweets to @DonFosterMP last night I have never before been moved to formally contact you, but this urgently requires your action.

Simple desktop computers can interpret vast amounts of digital information. The capacity, tools and eagerness already exist to enable scientists to do systematic reviews, knowledge syntheses, and innovative analyses on a scale never before imagined. But there is one barrier alone that stifles all this strong potential for good: current UK copyright law.

The exceptions for digital content proposed by Professor Hargreaves in his review would be a boon for research (Chapter 5). It is galling that 87% of the research contained within UK PubMed Central cannot be legally mined for information (p47 of the report). This is especially exasperating when we are allowed to manually ‘human-read’ nearly all of this content – we just are not allowed to efficiently use machines to read the literature for us.

Current UK copyright law is outdated and is sometimes the *only* factor holding back scientific research. We need to remove this unnecessary artificial barrier to let UK scientists perform world-class research with modern and innovative tools and ideas. Otherwise we will be left in the Dark Ages, instead of the Bright, Shiny Digital Economy of the Future.
Yours sincerely,

Ross Mounce

PhD Candidate & Panton Fellow
Fossils, Phylogeny and Macroevolution Research Group
University of Bath,

Further resources:
* The Hargreaves Report
* PMR’s response to the Hargreaves report
* TechDirt – UK Publishers Pretend To Embrace Copyright Reform… In Order To Kill Copyright Reform
* Glyn Moody’s – Review of the UK Government Response to the Hargreaves Review



[A monthly update on my Panton Fellowship related activities]

Last month I was slightly late with my monthly report, so this month I’m going to get things back on track and write my post now, on this leisurely sunny Sunday afternoon…

It’s been a good month:

First of all, I had the chance to speak about my Fellowship work for the Ede & Ravenscroft Prize final. I made a few choice comments to our Pro-Vice-Chancellor who was present, about the plurality of benefits of Open Access & Open Data, and the difficulties of trying to do content mining research on subscription-access journals. I didn’t win the prize in the end, but getting to the final, and being recognised as one of the top 5 research students at the University of Bath was pretty cool. I then immediately went out and spent part of the £50 runners-up prize on Michael Nielsen‘s excellent Open Science book Reinventing Discovery. I gave it a read, then immediately passed it on to another lab for a friend to read, and it now resides with my supervisor who will also hopefully find time to read it (part of my not so subtle attempt to help spread the knowledge of how digital, networked, openness can hugely benefit research).

I bought some other books too, but this was the important one

Then on the 11th of May, PMR came to Bath to give a talk to our Biology & Biochemistry Department. Those who came (including our subject librarian – thanks for coming!) were wowed by the ways in which PMR and colleagues have helped make semantically-enriched Linked Open Data available on chemicals, for everyone, not just academic chemists! It’s brilliant to have an expert demonstration of the ways in which projects like CrystalEye have made the data underlying some chemical research publications far more easily searchable, open, and re-usable across many thousands of publications. There’s a strong, easily-justified need for more of this type of post-publication data scraping in biology (and palaeontology, I might add!). We share a strong belief that research publications should be made open and explicitly re-usable without restriction.

Sadly, most of the biological literature in my domain is neither open, nor re-usable without permission (more of which in a later post) – which makes my highly integrative data-focused research, that much harder than it otherwise could be. As I’ve said before on the internet – I have all of PLoS on my USB stick, I’ve no doubt I could put all the relevant papers I need & scrape data from them, on just my desktop computer hard drive – yet subscription-access paywalls, and current copyright law prevent me from doing this for much of the literature (PLoS and other Open Access literature aside). I can understand how we arrived at this strange situation (we didn’t used to have such computational power to analyze large volumes of data, nor the Internet with which to freely & easily distribute research) but now we *do*, it seems like utter madness to continue to publish research in ways (e.g. subscription-access, copyright-transferred to the publisher) which make it very hard to analyze or re-use en masse.

The Panton Principles

So I’ve been joining the nascent OKFN working group Skype calls on Content Mining and soon we will hopefully have some interesting things to announce…

PMR also got the chance to meet my PhD supervisor and the rest of the lab which is great since I’m doing this fellowship work concurrently with my PhD work on fossils & phylogeny.

Later on in the month, I suggested the excellent Panton Discussions be made more amenable to podcasting. An OKFN group are now working on producing audio-only versions of all of them, making them easier to listen to on personal devices (mobile phones & MP3 players).

Finally, the past week has been a whirlwind:

On Tuesday (22nd May) I was at the Natural History Museum, London to talk with Dr Mark Wilkinson about some PhD project-related work – he’s kindly supplied me with some source code (among other things), so I can recompile his programs to run on my linux machines. I told him all about the OKFN & Panton Fellowship and he was very supportive of the goals. Time and time again, I encounter such enlightened, high-up academics and wonder why & how academic publishing is still in its current state – it’s not for want of researcher support for Open Access in my experience!

On Wednesday, I was back with PMR in Cambridge hacking PDFs, focusing particularly on BMC literature as this is BOAI-compliant Open Access and we can do what we want with such material. Towards the end of the session we had a think about what metadata would be desirable to extract from the text of the papers and figure labels, that might add context and information to the phylogenetic analysis performed, and the phylogenetic tree presented, in each of the papers. By coincidence the Open Tree of Life group have also just republished the MIAPA working group list of desirable metadata for phylogenies. We certainly won’t be able to get all this information, and the information we can extract may not necessarily be interpreted and associated 100% correctly, but it will certainly be hugely valuable, as this information would otherwise take 4 years to re-digitise(!) by some estimates.
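To give a flavour of the kind of simple heuristic we might use for this, here’s a little sketch of my own (not our actual extraction pipeline – the keyword list and regex are made up purely for illustration) that scans a paper’s full text for tell-tale phrases hinting at which inference method was used and whether bootstrap support was reported:

```python
import re

# Illustrative keyword heuristics only -- a real extractor would need
# a much richer vocabulary and proper disambiguation.
METHOD_KEYWORDS = {
    "maximum likelihood": "maximum likelihood",
    "bayesian": "Bayesian inference",
    "neighbor-joining": "neighbour joining",
    "maximum parsimony": "maximum parsimony",
}

def guess_metadata(fulltext):
    """Return rough metadata guesses from a paper's full text."""
    text = fulltext.lower()
    found = {"inference_methods": [], "bootstrap": False}
    for keyword, label in METHOD_KEYWORDS.items():
        if keyword in text:
            found["inference_methods"].append(label)
    # Spot phrases like "1000 bootstrap replicates"
    if re.search(r"\d+\s+bootstrap\s+replicates", text):
        found["bootstrap"] = True
    return found

example = ("Trees were inferred under maximum likelihood, "
           "with 1000 bootstrap replicates.")
print(guess_metadata(example))
# → {'inference_methods': ['maximum likelihood'], 'bootstrap': True}
```

Crude string matching like this obviously won’t be 100% accurate (hence the precision/recall balancing act mentioned above), but it shows how much contextual metadata is sitting there in the machine-readable text.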

On Thursday, I went to ProgPal (Progressive Palaeontology), a conference also in Cambridge. There I gave a short ‘announcement’ talk with slides to explain to everyone there a) what the Open Knowledge Foundation are about, and b) why they might be of interest to academic palaeontologists. I touched upon Open Access and Open Data issues in palaeontology and encouraged those with an interest to visit the website, join the Open Science mailing list, listen to or watch the Panton Discussions, and consider applying for a Panton Fellowship next year if they had any innovative ideas for paleo-data. This talk tied in very well with the other announcement talks for Palaeontology Online (a new free outreach & education initiative) and Palaeocast (a new paleo podcast).

Which reminds me, I should really pop them both an email to explain why they should post their content with a Creative Commons Attribution Licence, so their materials can be re-used, re-posted and remixed as Open Educational Resources

Best of all, on Friday I travelled down to London to my alma mater to attend & furiously tweet the Open Access debate at Imperial College London, in the very same lecture room where I sat most of my undergrad lectures! There were rather a lot of palaeontologists there too, including Tori Herridge and Nick Crumpton, and a large volume of tweets was sent under the #OAdebate hashtag (archived here if you’re interested). Graham Taylor of the Publishers Association said some rather provocative things that got me rather hot under the collar, including:

…we [publishers] are the stewards of genuine science…

Which I think could all too easily be misinterpreted to overstate the importance of the role that publishers play in organising peer-review, spell-checking, typesetting and other such tasks. I also couldn’t help laughing out loud at Graham’s straight-faced proposal for subscription-access publishers to offer ‘fee-waived walk-in access at public libraries‘ as a way to provide taxpayer access to taxpayer funded research. Stephen Curry (also on the panel) thankfully quickly interrupted to state how ridiculous that was. I’ll leave it to Mike Taylor’s post here to explain just how ludicrous that proposal is in light of 21st century technology. I will however give Graham Taylor credit for further disavowing the Research Works Act, he said of it [and presumably his organisation’s initial support for it]: “the RWA was not such a good idea, don’t ask me to defend that one”, which elicited a pleased response from the audience.

There will be another debate held after the release of the Finch report, which I suspect will be rather more exciting. A lot of the issues were aired at this debate, but the brevity of the time slot allowed for the event meant there was not enough room for in-depth discussion, IMO.

That’s just about it for the month. I can’t wait for what the next month will bring!


I have previously commented elsewhere on other blogs, that uniquely, with BOAI-compliant Open Access literature, one is able to re-distribute research however one wishes (provided proper attribution is given). I believe this to be hugely beneficial and perhaps a rather under-appreciated facet of the plurality of benefits offered by Open Access publishing.

Below is an expanded version of the comment I made on Cameron Neylon’s excellent blog Science in the Open on this very theme (and please do read Cameron’s post too for greater context):

Decentralized journal/article distribution is already happening.

I have 20,000+ PLoS articles on my computer right now. You can get them too – via BioTorrents. When compressed (as initially provided there) it’s less than 16 GB of files – a trivial amount for anyone with a broadband connection. I can now (and do!) take PLoS on a USB stick with me wherever I go, allowing me to do research on trains, planes, and in remote locations completely hassle-free, without even an internet connection. It was easy to download (pretty much 1-click) too via my high-speed institutional connection – and didn’t overload PLoS’s servers, because I didn’t *get* the articles from their servers. With peer-to-peer file sharing the load is balanced between seeders (and in turn, I’m now seeding this torrent too, to help share the load). If all institutions/libraries agreed to help seed the world’s research literature, without copyright restriction on electronic redistribution (which we could do tomorrow if it weren’t for the legal copyright barriers imposed by most traditional subscription-access publishers), doing literature research would be pretty much frictionless! We could even get papers & data on campus much quicker over the campus LAN than over the internet.

Institutions already agree to help distribute code, e.g. R and its multitude of packages – this is hugely beneficial, and helps share the costs associated with bandwidth — why not for research publications? The PLoS corpus is a great way to try out content mining ideas – it shows you how easy academic life *could* be if everything was Open Access. I’ve run some simple scripts on it myself. I’m not sure the simple things I did, such as string matching, could be classified as ‘text mining’ – but one thing I do know is: it was 100,000 times easier/quicker doing this locally, machine-reading files, rather than doing it paper by paper negotiating paywalls (where do I click, how many hoops do I have to jump through before I’m let in, what information are the ‘helpful’ tracking cookies keeping about me…) and getting cut off by publishers. It’s worth pointing out as well, that once you have all the literature you need on your computer – you don’t even need the internet to do your research! For researchers in less economically developed countries, with weaker telecoms infrastructure, I’d imagine this would be a real boon.
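To show just how trivial this kind of local string matching is once the corpus is on your own disk, here’s a minimal sketch (the `plos_corpus` directory name is hypothetical, and this assumes the articles have been converted to plain-text files):

```python
from pathlib import Path

def count_matches(corpus_dir, stem):
    """Count locally-stored papers whose text mentions a given word-stem."""
    hits = 0
    for path in Path(corpus_dir).glob("*.txt"):
        # errors="ignore" skips over any badly-encoded characters
        if stem in path.read_text(errors="ignore").lower():
            hits += 1
    return hits

# e.g. how many papers mention the word-stem "phylogen"?
# count_matches("plos_corpus", "phylogen")
```

No paywalls, no sign-ins, no rate limits – just a loop over files on your own hard drive.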

It’s a window on the world that *could* be possible if we just changed our attitude with respect to copyright and research publishing. That PLoS, BMC and other Open Access publishers use the Creative Commons Attribution Licence makes this all possible.

I predict that the rights to electronically redistribute, and machine-read research will be vital for 21st century research – yet currently we academics often wittingly or otherwise relinquish these rights to publishers. This has got to stop. The world is networked, thus scholarly literature should move with the times and be openly networked too.

In short, I think research would be a whole lot easier to do, and ultimately (all things considered) be more cost-effective, if all future publicly-funded research could be made BOAI-compliant Open Access. This is just my opinion – you are welcome to disagree in the comments section below, I sincerely hope I don’t sound like an Open Access ‘zealot‘ for this is certainly not my intention.