Show me the data!

Answers to questions about the SVP meeting abstract embargo

August 27th, 2012 | Posted by rmounce in Panton Fellowship updates - (Comments Off on Answers to questions about the SVP meeting abstract embargo)

Considering I emailed him on a Friday, Darin Croft has done well to reply to my questions about the SVP abstract embargo so soon, on Monday. I don’t always get such swift replies, so that is much appreciated. [Thanks Darin!]

Below is his email in full, as promised (the bits in quotes are my original questions) publicly supplied so the confusion can be cleared-up for all:

> 1.) What would happen if a researcher (and SVP member) deliberately broke
> the embargo and blogged/tweeted/published research that was the basis of
> their own submitted talk abstract (I’m surprised this hasn’t happened
> already tbh, given how early the abstract deadline is – some e-journals have
> very quick turnaround times…)

Our embargo is meant to protect the researchers themselves so that
they have greater control over when and how their research is made
accessible to members of the media. Therefore, our embargo policy does
not apply to a researcher publicizing their own work. This has, in
fact, happened many times already, typically in the scenario you note
(i.e., a researcher’s work is published after the abstract deadline
but before the annual meeting). Based on your question, perhaps this
is something we should clarify to avoid confusion.

> 2.) What would happen if a researcher (and SVP member) broke the embargo and
> blogged or tweeted some or all the of the content of another researcher’s
> talk abstract

This would most likely be referred to the SVP’s Ethics Committee,
which is the standard procedure in the case of possible violations of
the SVP’s Bylaws or policies by a member.

> 3.) If a blogger or journalist *did* write an article or two on the basis of
> the meeting abstract booklet – do you seriously think that could harm the
> chances of VP’ers getting published in one of the glamour mags?

I believe this has actually occurred in the past, though before my
tenure as Chair of the Media Liaison Committee. Regardless, I cannot
speak for what the editors of high profile journals might or might not
do in such an instance. I would suggest you contact them directly. Our
current policy is that the potential risk for researchers does not
outweigh any potential benefit.

I hope that information is useful. Thank you for letting me know
beforehand that you plan to publish these responses.



[End of email]

So, from that it seems we can at least talk about our own abstracts. I still disagree that the potential ‘risk’ of freely accessible abstracts outweighs the benefits, but I’ll leave it there for now – I’m just happy to let you all know what my talk is about without fear of losing the talk slot!

I thoroughly agree with Darin that they should change the wording of the policy next year to make this clearer, because frankly what is written in the embargo policy currently (as emailed to all conference registrants) clearly contradicts what Darin says here, and I’m not the only one to have been confused and slightly annoyed by this.

I’d also be intrigued to know more about the SVP’s Ethics Committee procedures, Bylaws and rules. Perhaps there is a URL for these somewhere? But I will not pursue that any further now.

That just leaves me to say that my talk for this year’s SVP will be:


by Ross Mounce & Matthew A Wills, University of Bath

Previous phylogenetic work using conventional character partition homogeneity tests has
often revealed significant incongruence between cranial and postcranial character data. We
extend this approach by applying pairwise character compatibility tests across a sample of
more than 60 pseudo-independent vertebrate data sets. We contrast ‘fuzzy’ compatibility,
boildown bootstrap and clique approaches. In particular, we find that the Le Quesne
probability (LQP) has several desirable properties. The LQP is simply the probability that a
randomly permuted character will have incompatibility with other characters in the matrix
as low or lower than that of the original character. Within recent analyses of Sauropod taxa
we find that characters related to neural arches often conflict with dental characters in some
datasets but it is difficult to generalise; we are still exploring possible causative mechanisms
for this. In contrast, other vertebrate groups such as ratites appear to have relatively
little character conflict between morphological characters. Pairwise tests of character
compatibility work well with binary data and ordered multistate characters, but can only give
an indication of ‘potential compatibility’ with unordered multistate characters. Composite
‘higher’ taxa and polymorphic codes are also problematic for existing compatibility
software, typically creating artificial incompatibilities. We recommend that composite taxa
are decomposed into their constituents in order to remove ambiguity for the purpose of these
tests, or else that polymorphic states are treated as missing data.

It’s part review, part defence of an oft ignored method, and part meta-analysis of lots of datasets using congruence methods to look at character compatibility. It forms part of my thesis work on comparing different statistical methods to compare and contrast the utility & congruence of morphological characters in phylogenetic analyses.

Great to be able to talk about my research without worry :)

Panton Fellowship updates: July (month 4)

August 4th, 2012 | Posted by rmounce in Content Mining | Panton Fellowship updates - (Comments Off on Panton Fellowship updates: July (month 4))

It’s the Olympics now so this work update is a) late and b) short


As ever progress has been exciting – look what we can extract from some PDFs:

(click to enlarge each) Attribution: The left panel is from Cánovas et al. BMC Evolutionary Biology 2011 11:371 doi:10.1186/1471-2148-11-371

On the left is the original figure, and on the right we have an SVG representation of the data we can extract automatically from this figure. We have the topology, the taxon labels AND the support values 100% correctly interpreted! Obviously we can’t reclaim phylogenetic data with this much precision and recall from all papers. But it’s a promising example, automatically generated – no manual guidance or tweaking needed – just feed it the PDF. [My WordPress server won’t let me upload the original SVG copy of this for “security reasons” so the image on the right is a .jpg copy of the original .svg]


I should also note this was achieved completely independently of previous image-based tree-extraction softwares like TreeSnatcher Plus, TreeRipper & TreeThief. This is a great example of why it’s very important for editors and publishers to strictly stipulate that diagrams in figures containing data such as this be uploaded and produced in the final PDF version as lossless vector graphics rather than lossy bitmaps such as .png .jpg or .bmp – only vectors keep the fidelity of the underlying data. We note that there are many publishers out there who regularly seem to produce figures in their PDFs that are NOT on the whole very good quality wrt this. Difficult to know whether the authors or the publishers are to blame in each case but either way standards need to be improved.


By mining PDFs we can re-extract and re-release far more than just phylogenetic data from the literature – we’re fairly sure we can reliably identify the rough type of figure depicted in PDFs by machine methods using certain diagnostic features such as number & proportion of horizontal and vertical lines.



Peter Murray-Rust & I now are looking for a collaborator to help us implement machine learning methods to classify scientific figures into discrete categories e.g. bar charts, scatter plots, network diagrams (including phylogenies), pie charts, box & whisker plots etc… in an automated way.

If you’re interested please contact myself or Peter.

That’s all for now.

PS If you’re watching the London 2012 Olympics Volleyball tomorrow morning you may well just see me in the crowd. Managed to snaffle some returned tickets by setting up an alert for new tickets using a combination of (to alert me to page changes on the ticket website) and to email me as soon as the RSS feed gets a new item (updated ticket information). Without this nifty trick I very much doubt I’d have got any tickets.

I realise thus far, I may not have explained too clearly exactly what I’m doing for my Panton fellowship. With this post I shall attempt to remedy this and shed a little more light on what I’ve been doing lately.

The main thrust of my fellowship is to extract phylogenetic tree data from the literature using content mining approaches (think text mining, but not just text!) – using the literature in its entirety as my data. I have very little prior experience in this area, but luckily I have an expert mentor guiding me: Peter Murray-Rust (whom you may often see referred to as PMR). For those of us biologists who may not be familiar with his work, whilst trying not to be too sycophantic about it, PMR is simply brilliant, it’s amazing what he and his collaborators have done to extract chemical data from the chemical literature and provide it openly for everyone, in spite of fierce opposition at times from those with vested interests in this data remaining ‘closed’.

Now he’s turned his attention to the biological literature for my project and together we’re going to try and provide open tools to extract phylogenetic data from the literature. Initially I proposed trying to grab just tree topology and tip labels – a kind of bare minimum, but PMR has convinced me that we should be ambitious and all-encompassing, and thus our aims have expanded to include branch lengths, support values, the data-type the phylogeny was inferred from, and other useful metadata. And why not? We’re ingesting the totality of the paper in our process, from title page to reference list, so there’s plenty of machine-readable data to be gleaned. The question is, can we glean it off accurately enough, balancing precision and recall?

So for starters, we’ve been using test materials that we’re legally allowed to, namely Open Access CC-BY papers from BMC & PLoS to test our extraction tools, specifically focusing on a subset of all ~8500 papers containing the word-stem phylogen* from BMC. It’s a rough proxy for papers that’ll contain a tree, and it’s good enough for now – we’ll need to be able to deal with false positives along with all the positive positives, so it’s instructive to keep these in our sample.

We’ve been working on the regular structure of BMC PDFs, getting out bibliographic metadata, and the main-text for further NLP processing downstream to pick out data & method relevant words like say PAUP* , ML , mitochondrial loci etc… But the real reason we’re deliberately using PDFs rather than the XML (which we also have access to) is the figures – where all the valuable phylogenetic tree data is. If this can be re-interpreted with reference to the bibliographic metadata, the figure caption and further methodological details from the full-text of the paper, then we may be able to reconstruct some fairly rich and useful phylogenetic data.

To make it clear, in slight contrast to the Lapp et al iEvoBio presentation embedded above, we’re not trying to just extract the images, but rather to re-interpret them back into actual re-useable data, probably to be provided in NeXML (and from there on, whatever form you want). We’re pretty sure it’s an achievable goal. Programs like TreeThief, TreeRipper, and TreeSnatcher Plus have gone some way towards this already, but never before been incorporated in a content mining workflow AFAIK.

Unfortunately I wasn’t at iEvoBio 2012 (I’m short on money and on time these days) but it’s great to see from the slides the growing recognition of the SVG image file format as a brilliant tool for communicating digital science. I also put a bit about that in my Hennig XXXI talk slides too (towards the end). Programs like TNT do output SVG files, so there’s scope to make this a normal part of any publication workflow. Regrettably though, rather few publisher produced PDFs contain SVG formatted images – but if people, and editorial boards (perhaps?) can be made aware of their advantages, perhaps we can change this in future…?

the very same file, opened as plain-text. It’s fairly easy to reconvert back into re-useable machine-readable data.


Agapornis phylogeny.svg from Wikipedia (PD)










Gathering phylogenetic data from beyond PLoS, BMC and other smaller Open Access publishers is going to be hard, not for technical, but purely legal reasons:

The scope and scale of phylogenetic research (using ‘phylogen*’ as a proxy):

There’s a lot of phylogenetic research out there… but little of it is Open Access – which is problematic for content mining approaches – particularly if subscription-access publishers are reticent to allow access.

Some facts:

  • with a Thomson Reuters Web of Science search, SCI-EXPANDED database (only), Topic=(phylogen*) AND Year Published=(2000-2011) this returns 101,669 results (at the time of searching YMMV)
  • 91,788 of which are primary Research Articles (as opposed to Reviews, Proceedings Papers, Meeting Abstracts, Editorial Materials, Corrections, Book Reviews etc…)
  • Recent MIAPA working group research I contributed to (in review) quantitatively estimates that approximately 66% of papers containing ‘phylogen*’ report a new phylogenetic analysis (new data).
  • Thus conservatively assuming just one tree per paper (there are often many per paper), there are > 60,000 trees contained within just 21st century research articles.
  • As with STM publishing as a whole, the number of phylogenetic research articles being published each year shows consistent year-on-year increases
  • Cross-match this with publisher licencing data and you’ll find that only ~11% of phylogenetic research published in 2010 was CC-BY Open Access (and this % probably decreases as you go back before 2010)
So the real fun and games will come later this year, when I’m sure we’ll have the capability (software tools) to do some amazing stuff, having first perfected it on OA materials… but will they let us? Heather Piwowar’s experience earlier this year didn’t look too fun – and that was all for just one publisher. Phylogenetic research occurs in and beyond at least 80 separate STM publishers by my count (let alone the >500 journals it occurs in!) – so there’s no way anyone would bother trying to negotiate with them all! I’m sticking by the intuitive principle that The Right to Read Is the Right to Mine but I’ll have a think about that some more when we actually get to that bridge.

Finally, it’s also worth acknowledging that we’re certainly not the first in this peculiar non-biomedical mining space – ‘biodiversity informaticists’ have been doing useful things with these techniques for a while now in innovative ways largely unrelated to medicine e.g. LINNAEUS from Casey Bergmann’s lab, and a recent review of other projects from Thessen et al (2012) [hat-tip to @rdmpage for bringing that later paper to the world’s attention via Twitter]. Literally all areas of academia could probably benefit from some form or another of content mining – it’s not just a biomed / biochem tool.

So, I hope that explains things a bit better. Any questions?


Some references (but not all!):

Gerner, M., Nenadic, G., and Bergman, C. 2010. LINNAEUS: A species name identification system for biomedical literature. BMC Bioinformatics 11:85+. [CC-BY Open Access]

Thessen, A. E., Cui, H., and Mozzherin, D. 2012. Applications of natural language processing in biodiversity science. Advances in Bioinformatics 2012:1-17. [CC-BY Open Access]

Hughes, J. 2011. TreeRipper web application: towards a fully automated optical tree recognition software. BMC Bioinformatics 12:178+.  [CC-BY Open Access]

Laubach, T., von Haeseler, A., and Lercher, M. 2012. TreeSnatcher plus: capturing phylogenetic trees from images. BMC Bioinformatics 13:110+. [CC-BY Open Access, incidentally I was one of the reviewers for this paper. I signed my review, and made a point of it too. Nor was it a soft review either I might add]

It’s that time again… time to write my monthly Panton Fellowship update.

The trouble is, as I start writing this it’s 6am (London, UK). I arrived back from the Hennig XXXI meeting (University of California Riverside) after a long flight yesterday and am supremely jetlagged. I still can’t decide whether this is awesome (I can get more work done, by waking up earlier), or terrible as I can’t keep my eyes open past 9pm at night!

At this conference I shoe-horned some of my Panton Fellowship project work into the latter half of my talk (slides below), as it fitted in with the theme of the submitted abstract on supertrees.

Supertrees are just one of many many different possible (re)uses of the phylogenetic tree data I am trying to liberate from the literature for this project. I tried to stress this during my talk, as a lot of people at Hennig aren’t too keen on supertrees as a method for inferring large phylogenies. In fact, there was a compelling talk with solid data from Dan Janies given later on in conference, critiquing supertree methods such as SuperFine and SuperTriplets, which were outperformed in most tests in terms of both speed and optimality (tree length) by supermatrix methods using TNT. That’s fine though – there are so many other interesting hypotheses one can investigate with large samples of real phylogenetic estimates (trees).


  • Do model-based phylogenetic analyses perform better than parsimony? [Probably not, judging by the conclusions in this paper]  –  I’d like to see this hypothesis re-tested more rigorously using tree-to-tree distance comparisons between the different method trees. Except we can’t currently do this very easily because there’s a paucity of machine-readable tree data from published papers
  • Meta-analysis of phylogenetic tree balance and factors that influence balance e.g. (this thesis, and this PLoS ONE article).  Are large trees more imbalanced than small trees? Are vertebrate trees more balanced than invertebrate trees?
  • Fossil taxa in phylogenetic trees – are they more often than not found at the base of the tree? Is this ‘real’ or perhaps apparent ‘stem-ward slippage‘ caused by preservational biases?
  • Similarity and dissimilarity between phylogeny and measures of morphological disparity as studied  by my lab mate Martin Hughes

So, I hope you’ll appreciate this data isn’t just needed for producing large supertrees.

I could go on about the conference – it was excellent as ever, but I’ll save that for a dedicated later post.

Other activities this month included:

  • submitting my quarterly Panton report to the Fellowship Board
  • attending the OKFN Bibliohack session at QMUL’s Mile End campus (13th & 14th June) helping out with the creation of the OKFN Open Access Index, and learning how to use & debug a few issues with PubCrawler (a web crawler for scraping academic publication information, not a beer finder app!), with Peter Murray-Rust
  • discussing Open Access, Open Data and full text XML publishing with the Geological Society of London. The GSL have a working group currently investigating if/how they can transition to greater openness. Kudos to them for looking into this. Many a UK academic society may currently be hiding their heads in the sand at moment ignoring that the UK policy-wise is now committed to Open Access as the future of research publishing. It probably won’t be easy for GSL to make this transition as their accounts[PDF] show they are rather reliant on subscription-based journals and books for income. It’s hard to see how Open Access article processing charges could immediately replace the £millions subscription income per year from relatively few books & journals. Careful and perhaps difficult decisions will have to be made at some point to balance the goals of this charitable society, the acceptable level of income and the choice and amount of expenditure on non-publication related activites (e.g. ‘outreach events’).Interestingly, I note The American Geophysical Union (AGU) has recently decided to outsource their publications to an external company. Does anyone know ‘who’ yet? I just hope it’s not Elsevier.

Finally, the audio for the talk on the Open Knowledge Foundation and the Panton Fellowships, I gave in Cambridge recently, has now been uploaded, so I can now present the slides and the actual talk I gave together (below) for the first time! Many thanks for the organisers of the conference for doing all this work to make audio from all the talks available – it’s really cool that a relatively modest, small PhD student conference can produce such an excellent digital archive of what happened – I only wish the ‘bigger’ conferences has the resources & willpower to do this too!

…and if that’s not enough Panton updates for you, you can read Sophie Kershaw’s updates for June too, over on her blog