Show me the data!

To publicize the variety of Gold Open Access article publication options on offer, I’ve decided to create a visualization of the journal data previously collected in my ‘survey of Open Access publisher licenses’ spreadsheet.

Here is version 0.1 of the ‘Mounce plot’ (much more data still to be added! It’s a work in progress). I may well refine it, and perhaps add a third axis or variable, in future versions:

UPDATE: version 0.2 of the plot is now available here



Not all “open access” options provided by publishers actually provide BOAI-compliant Open Access, and this is very important – thus I have used the y-axis in this plot to reflect the level of openness supplied for the fees paid (x-axis). Therefore, the ‘best’ journals providing CC BY BOAI-compliant Open Access for the lowest fees possible appear in the top left of the plot. The ‘worst’ journals providing a far inferior level of openness for a high price appear in the bottom right of the plot. The lowest level of ‘free’ access is provided by journals and societies who provide free access to papers, but seem not to provide them with recognised standard licences such as those from the Creative Commons suite. Ambiguity is arguably the worst and laziest thing a publisher can offer from a re-user / reader POV and thus I score this as the lowest class.
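To make the ‘top left is best’ reading concrete, here is a minimal Python sketch of how the two axes combine. The licence classes, their numeric scores and the journal entries are hypothetical placeholders of mine, not the actual categories or data in the spreadsheet.

```python
# Hypothetical ordinal openness scores (higher = more open).
# The classes used in the real plot may differ.
OPENNESS = {
    "CC BY": 3,
    "CC BY-NC": 2,
    "free access, no recognised licence": 1,
}

def rank_journals(journals):
    """Order journals so the 'top left' of the plot comes first:
    most open licence first, then lowest fee."""
    return sorted(journals, key=lambda j: (-OPENNESS[j["licence"]], j["fee_usd"]))

# Made-up entries, purely for illustration.
example = [
    {"name": "Journal A", "licence": "CC BY-NC", "fee_usd": 3000},
    {"name": "Journal B", "licence": "CC BY", "fee_usd": 500},
    {"name": "Journal C", "licence": "free access, no recognised licence", "fee_usd": 0},
    {"name": "Journal D", "licence": "CC BY", "fee_usd": 1350},
]
ranked = [j["name"] for j in rank_journals(example)]
```

Note that a free-to-publish journal with an ambiguous licence still ranks last here, which matches the scoring described above: openness is considered first, price second.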


Kudos then to Cellular Therapy and Transplantation, Multidisciplinary Digital Publishing Institute, Copernicus Publications, Journal of Advances in Modeling Earth Systems (an American Geophysical Union journal; ’tis a shame they charge $3500 for other AGU journals!), Standards in Genomic Sciences and others for providing low cost BOAI-compliant immediate Gold Open Access publishing options. […and what a mouthful that last statement was. It is such a pity that the meaning of ‘open access’ has been degraded and loosely applied since it was originally (well-)defined, that I have to apply so many additional adjectives to describe exactly what I mean.]

I’d be amazed if The Lancet & Cell Press journals (e.g. Cell) published by Elsevier could still get away with the absurdly high APCs they ask for in 5 years’ time. I hope all researchers are sensible enough to realise that they can publish their manuscripts in other Open Access venues and have just the same research impact (and avoid these hugely expensive options).

I may well make further posts in future with updated, corrected and further explored and deliberated plots. There’s a lot still to talk about!


UPDATES: I sometimes encounter academics who have never heard of fee waiver schemes before. If you look at this plot as an unfunded academic with little or no institutional funding, you might panic. DON’T: a lot of good Open Access publishers offer ‘fee waiver’ schemes to academics who really cannot pay the APCs. Examples are PLoS and BMC. You can’t always get your fee waived but it is certainly worth asking if you think your circumstances deserve it.

PMR has noted that I have included some ‘predatory’ Open Access publishers in this plot e.g. the OMICS publishing group. I will just state that by placing publishers on this plot I am not especially endorsing any of them unless otherwise stated. There are of course other important criteria aside from ‘price’ and ‘openness’ in choosing where to submit a manuscript. Choose your venue wisely!


Further Reading:

[1] Page, R. 2010

[2] Murray-Rust, P. 2010

[3] Hagedorn, G., Mietchen, D., Morris, R., Agosti, D., Penev, L., Berendsohn, W., & Hobern, D. (2011). Creative Commons licenses and the non-commercial condition: Implications for the re-use of biodiversity information ZooKeys, 150 DOI: 10.3897/zookeys.150.2189

[4] Carroll, M. W., Nov. 2011. Why full open access matters. PLoS Biol 9 (11), e1001210+.

Technical notes:

  • Some journals charge a fee per page; for these I have assumed 10 pages per article for my plot.
  • Where fees are not listed in USD I have converted them into USD using the current exchange rates.
  • Where the journal permits authors to choose a licence I have assumed authors will choose the better, less restrictive licence option (although sadly, in reality some authors do opt for a more restrictive licence for their work).
  • Some journals offer discounted ‘OA fees’ to authors at a ‘member institution’ or some such. This usually involves additional cost to the member/subscribing institution, thus I have used the full ‘non-member’ rate in such instances for a fairer comparison.
  • I only included a couple of BMC journals, just to show the range of prices offered (small: BMC Research Notes, and larger: BMC Biology). They split their prices quite finely between journals so I chose not to overcrowd the CC BY layer with too many BMC journals.
  • When I get time I’ll update WileyOpenAccess to the CC BY top class (they recently changed policy). Sadly, the Wiley OnlineOpen (hybrid) option, which is available to 100 times as many journals, is still NC-restricted and less open.
  • Don’t see your publisher or Open Access option on the plot? Make your own plot then – the data is all there on the spreadsheet. I don’t doubt that many could make a better job of visualising it than I have done here…
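For anyone making their own plot from the spreadsheet, the fee assumptions in the notes above can be sketched as follows. This is only an illustration: the function name, its parameters and the exchange rate are placeholders I have made up, and you would need to supply current rates yourself.

```python
ASSUMED_PAGES = 10  # per the note above: per-page fees assume a 10-page article

def comparable_fee_usd(fee, currency="USD", per_page=False, usd_per_unit=None):
    """Return one per-article fee in USD.

    fee          -- the listed fee (use the full 'non-member' rate)
    per_page     -- True if the journal charges per page
    usd_per_unit -- dict of currency code -> USD value of one unit,
                    supplied by the caller (nothing is hard-coded here)
    """
    amount = fee * ASSUMED_PAGES if per_page else fee
    if currency == "USD":
        return float(amount)
    if usd_per_unit is None or currency not in usd_per_unit:
        raise ValueError(f"no exchange rate supplied for {currency}")
    return amount * usd_per_unit[currency]

# Placeholder rate, purely for illustration (not a real quote).
rates = {"EUR": 1.25}
flat = comparable_fee_usd(1000, "EUR", usd_per_unit=rates)
paged = comparable_fee_usd(50, "USD", per_page=True)
```

Raising an error for a missing exchange rate, rather than silently assuming USD, keeps hand-collected data from being mis-plotted.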

Comments, extra data and/or corrections are welcome as always.  This data was hand-collected, so there may be errors.

Considering I emailed him on a Friday, Darin Croft has done well to reply to my questions about the SVP abstract embargo so soon, on Monday. I don’t always get such swift replies, so that is much appreciated. [Thanks Darin!]

Below is his email in full, as promised (the bits in quotes are my original questions), publicly supplied so the confusion can be cleared up for all:

> 1.) What would happen if a researcher (and SVP member) deliberately broke
> the embargo and blogged/tweeted/published research that was the basis of
> their own submitted talk abstract (I’m surprised this hasn’t happened
> already tbh, given how early the abstract deadline is – some e-journals have
> very quick turnaround times…)

Our embargo is meant to protect the researchers themselves so that
they have greater control over when and how their research is made
accessible to members of the media. Therefore, our embargo policy does
not apply to a researcher publicizing their own work. This has, in
fact, happened many times already, typically in the scenario you note
(i.e., a researcher’s work is published after the abstract deadline
but before the annual meeting). Based on your question, perhaps this
is something we should clarify to avoid confusion.

> 2.) What would happen if a researcher (and SVP member) broke the embargo and
> blogged or tweeted some or all of the content of another researcher’s
> talk abstract

This would most likely be referred to the SVP’s Ethics Committee,
which is the standard procedure in the case of possible violations of
the SVP’s Bylaws or policies by a member.

> 3.) If a blogger or journalist *did* write an article or two on the basis of
> the meeting abstract booklet – do you seriously think that could harm the
> chances of VP’ers getting published in one of the glamour mags?

I believe this has actually occurred in the past, though before my
tenure as Chair of the Media Liaison Committee. Regardless, I cannot
speak for what the editors of high profile journals might or might not
do in such an instance. I would suggest you contact them directly. Our
current policy is that the potential risk for researchers does not
outweigh any potential benefit.

I hope that information is useful. Thank you for letting me know
beforehand that you plan to publish these responses.



[End of email]

So, from that it seems we can at least talk about our own abstracts. I still disagree that the potential ‘risk’ of freely accessible abstracts outweighs the benefits, but I’ll leave it there for now – I’m just happy to let you all know what my talk is about without fear of losing the talk slot!

I thoroughly agree with Darin that they should change the wording of the policy next year to make this clearer, because frankly what is written in the embargo policy currently (as emailed to all conference registrants) clearly contradicts what Darin says here, and I’m not the only one to have been confused and slightly annoyed by this.

I’d also be intrigued to know more about the SVP’s Ethics Committee procedures, Bylaws and rules. Perhaps there is a URL for these somewhere? But I will not pursue that any further now.

That just leaves me to say that my talk for this year’s SVP will be:


by Ross Mounce & Matthew A Wills, University of Bath

Previous phylogenetic work using conventional character partition homogeneity tests has
often revealed significant incongruence between cranial and postcranial character data. We
extend this approach by applying pairwise character compatibility tests across a sample of
more than 60 pseudo-independent vertebrate data sets. We contrast ‘fuzzy’ compatibility,
boildown bootstrap and clique approaches. In particular, we find that the Le Quesne
probability (LQP) has several desirable properties. The LQP is simply the probability that a
randomly permuted character will have incompatibility with other characters in the matrix
as low or lower than that of the original character. Within recent analyses of Sauropod taxa
we find that characters related to neural arches often conflict with dental characters in some
datasets but it is difficult to generalise; we are still exploring possible causative mechanisms
for this. In contrast, other vertebrate groups such as ratites appear to have relatively
little character conflict between morphological characters. Pairwise tests of character
compatibility work well with binary data and ordered multistate characters, but can only give
an indication of ‘potential compatibility’ with unordered multistate characters. Composite
‘higher’ taxa and polymorphic codes are also problematic for existing compatibility
software, typically creating artificial incompatibilities. We recommend that composite taxa
are decomposed into their constituents in order to remove ambiguity for the purpose of these
tests, or else that polymorphic states are treated as missing data.
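For readers unfamiliar with the Le Quesne probability described in the abstract, here is a minimal Python sketch for binary characters. It is my own illustration of the definition given above, not the software used for the talk; the function names are hypothetical, and skipping any state other than 0 or 1 is an assumption that mirrors treating polymorphic or unknown states as missing data.

```python
import random

def compatible(a, b):
    """Pairwise compatibility of two binary characters.

    Characters are sequences of '0', '1' or anything else for missing.
    They are incompatible only if all four state pairs (00, 01, 10, 11)
    occur among taxa scored for both characters.
    """
    pairs = {(x, y) for x, y in zip(a, b) if x in ("0", "1") and y in ("0", "1")}
    return len(pairs) < 4

def incompatibility_count(char, others):
    """Number of other characters that conflict with char."""
    return sum(not compatible(char, o) for o in others)

def lqp(char, others, n_perm=999, seed=0):
    """Proportion of random permutations of char whose incompatibility
    with the other characters is as low or lower than the observed value."""
    rng = random.Random(seed)
    observed = incompatibility_count(char, others)
    states = list(char)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(states)  # reassign the same states to taxa at random
        if incompatibility_count(states, others) <= observed:
            hits += 1
    return hits / n_perm

# Toy matrix: each string is one character scored across six taxa.
others = ["000111", "001111", "000011"]
p = lqp("000111", others)
```

A low value suggests the character conflicts with the rest of the matrix less often than expected by chance. Some implementations count the original character among the permutations; this sketch uses the plain proportion.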

It’s part review, part defence of an oft ignored method, and part meta-analysis of lots of datasets using congruence methods to look at character compatibility. It forms part of my thesis work on comparing different statistical methods to compare and contrast the utility & congruence of morphological characters in phylogenetic analyses.

Great to be able to talk about my research without worry :)

I just sent this email to Darin Croft (of SVP). I chose to contact him because he recently answered questions about the embargo for EmbargoWatch and it was rather unclear who else I should approach. I did not want to blanket email the whole council.

This is the (entire) email I sent him, from my gmail account:
(I will post his reply as and when I receive it)

Dear Darin,

It’s been noted many times before, by many different researchers – but the SVP meeting abstract embargo just doesn’t make sense to me. I know of no other conference that operates like this, and indeed for most other conferences the abstract booklet (and its open, free availability online) is a big promotional aid in getting people interested in the event in the lead-up to it.

I saw you answered some questions on EmbargoWatch recently, so I thought you might be the correct person to contact for my queries on the same subject:

I have blogged my own displeasure with the embargo policy here:

I would like to ask:

1.) What would happen if a researcher (and SVP member) deliberately broke the embargo and blogged/tweeted/published research that was the basis of their own submitted talk abstract (I’m surprised this hasn’t happened already tbh, given how early the abstract deadline is – some e-journals have very quick turnaround times…)

2.) What would happen if a researcher (and SVP member) broke the embargo and blogged or tweeted some or all of the content of another researcher’s talk abstract

3.) If a blogger or journalist *did* write an article or two on the basis of the meeting abstract booklet – do you seriously think that could harm the chances of VP’ers getting published in one of the glamour mags?

I look forward to hearing from you, and will publish your response in full context with this email on my blog



Ross Mounce
PhD Student & Panton Fellow
Fossils, Phylogeny and Macroevolution Research Group
University of Bath, 4 South Building, Lab 1.07


Sometimes you just have to laugh…

The year is 2012, we have the internet, we have blogs, and a huge variety of other tools to enable free, efficient and rapid communication of information and yet the Society of Vertebrate Paleontology annual meeting rules still insist that all information within this year’s abstract booklet remain a big secret until the day of the event.

Many others have justly written to complain about this before.

Here’s the 2012 version I just received in my inbox today:

SVP Embargo Policy Regarding Content in the Program and Abstract Book

Unless specified otherwise, coverage of abstracts presented orally at the Annual Meeting is strictly prohibited until the start time of the presentation, and coverage of poster presentations is prohibited until the relevant poster session opens for viewing. As defined here, “coverage” includes all types of electronic and print media; this includes blogging, tweeting and other intent to communicate or disseminate results or discussion presented at the SVP Annual Meeting. Content that may be pre-published online in advance of print publication is also subject to the SVP embargo policy.

So I think I can tell you I’m giving a talk there in the ‘Phylogenetic and Comparative Paleobiology — New Approaches to the Study of Vertebrate Macroevolution’ symposium.

But can I tell you what the title of my talk is, or the abstract I submitted (a rather long time ago, which is another bugbear I have with this particular conference)? Well, given the quote above, probably not!

And therein is part of the ridiculousness of the embargo. By submitting a (subsequently accepted) talk & abstract to this conference – I’m banned from communicating about my own research on that subject until I give the talk. Not even a tweet about it.

It also seems to me that they’re preventing their own members from effectively promoting the event with this policy. Wouldn’t it be great if all speakers could blog and tweet: “Hey, I’m giving a talk on new dinosaur XXXX and its unusual anatomy (further details of which are in my abstract here) at a meeting in Raleigh, NC. Come along, tickets still available here”? Isn’t that 100 times better than “Hey, I’m giving a talk at this conference – I can’t tell you what the title is or the subject, sorry”?

This policy strikes me as a massive and unjustified own goal. I appreciate some of the science glamour mags don’t take kindly to press reportage of science before it is published in their glossy pages BUT I think we’ve got to remember that science talks & posters are NOT papers, and they should not be, and are not, treated as such. The abstracts for SVP are only minimally peer-reviewed before acceptance and the talk content itself is completely unreviewed. Therefore if a journalist/blogger/tweeter did report on the abstract booklet (and btw, it would take tremendous journalistic spin to make good, interesting copy from most talk abstracts I’ve ever seen – they’re rather short!) they’d be reporting non-peer reviewed discussion, that may or may not be related to unspecified future peer-reviewed publications. So I don’t buy [what I presume is the justification for all this?] the argument that reportage of talk abstracts jeopardises the publication of peer-reviewed papers. The two may be related, but are also very distinct from each other.

I think it’s only a matter of time until this policy changes. SVP have been doing reasonably well with respect to openness recently. They’ve reduced their hybrid Open Access fees, and instituted new editorial policy encouraging data archiving so that data published in their journal is more transparent & re-usable (=better science). But it seems there are still improvements to be made. Will there be an abstract embargo in 2013 I wonder? I for one hope not.

Open Access discussed on the radio

August 20th, 2012 | Posted by rmounce in Open Access

[I’m cross posting this from the OKFN version so I can embed the audio of the show in the post]

Last Friday (17/08/12), representing the Open Knowledge Foundation, I had the pleasure of discussing the new Research Councils UK (RCUK) plan for all UK publicly-funded research to be published Open Access, on a special half hour Voice of Russia UK broadcast radio discussion.

I have written about this policy before and am very supportive of it, just as I am with Open Access in academia in general. I personally believe it will aid transparency and equality in research – so that no researcher has an unfair advantage over another through greater/easier access to vital research literature (just one of many worthy benefits arising from Open Access). But there are certainly also vocal opponents to this plan – mostly those with vested interests in keeping alive the obscene profits of the traditional subscription access publishing system (which commonly generates >30% profit margins, largely derived from the taxpayer-funded spending of the world’s research libraries on journal subscriptions). Others express vague and often unspecified “concerns” about Open Access, and many academics are notably apathetic towards it, or are even proudly agnostic on the issue.

Thus a publicly-broadcast discussion of this new open policy is well warranted.

No secret science
Voice of Russia UK radio Open Access discussion hosted by Daniel Cinna
I won’t say anything about the discussion itself, only that you should listen to it (embedded above; alt link here) if you are at all interested in the future of science, and the benefits of the new RCUK Open Access policy.

The members of the discussion panel included Rita Gardner, the Director of the Royal Geographical Society, noted for her concerns about the potential effects of Open Access on UK Learned Society income and revenue [paywalled link]; Ross Mounce, Panton Fellow promoting open data in science (myself) from the Open Knowledge Foundation; Bjorn Brembs, Professor at the Department of Genetics at the University of Leipzig, noted critic of for-profit publishers and their lack of ‘value-add’ amongst other issues; and Timothy Gowers, the Rouse Ball Professor of Mathematics at Cambridge University, instigator of the popular academic-led boycott of the academic for-profit publisher Elsevier.

The ensuing discussion was ably guided by VoiceofRussia radio presenter Daniel Cinna, and recorded by a backroom team with an impressively professional studio setup (Timothy & Bjorn were joining the debate via phone from abroad almost seamlessly, whilst Rita and I were in the London studio). As noted by Rita off-air, it would have been nice to have had a publisher representative in the discussion to add their unique viewpoint, but apparently the VoR production team had asked and no for-profit publisher was willing to take part. So one cannot attribute any blame to the VoR team if the discussion panel lacked representational balance.

About Voice of Russia (adapted from their own website):

The Voice of Russia is the world’s oldest international broadcaster and is among the world’s top five radio broadcasters today which include the BBC, the Voice of America, Deutsche Welle and Radio France International. The London-based team produces programs for VoR that bring our listeners a Russian perspective on our two countries and the world. VoR broadcasts to 160 countries in 38 languages using short and medium waves, FM, satellite and the global communications network. In London we are now also available online and via DAB radio. We aim to welcome a new British audience to our 109 million listeners worldwide.

It’s the Olympics now so this work update is a) late and b) short


As ever progress has been exciting – look what we can extract from some PDFs:

(click to enlarge each) Attribution: The left panel is from Cánovas et al. BMC Evolutionary Biology 2011 11:371 doi:10.1186/1471-2148-11-371

On the left is the original figure, and on the right we have an SVG representation of the data we can extract automatically from this figure. We have the topology, the taxon labels AND the support values 100% correctly interpreted! Obviously we can’t reclaim phylogenetic data with this much precision and recall from all papers. But it’s a promising example, automatically generated – no manual guidance or tweaking needed – just feed it the PDF. [My WordPress server won’t let me upload the original SVG copy of this for “security reasons” so the image on the right is a .jpg copy of the original .svg]


I should also note this was achieved completely independently of previous image-based tree-extraction software such as TreeSnatcher Plus, TreeRipper & TreeThief. This is a great example of why it’s very important for editors and publishers to strictly stipulate that diagrams in figures containing data such as this be uploaded and produced in the final PDF version as vector graphics rather than rasterised bitmaps such as .png, .jpg or .bmp – only vectors keep the fidelity of the underlying data. We note that there are many publishers out there who regularly seem to produce figures in their PDFs that are NOT on the whole very good quality wrt this. It is difficult to know whether the authors or the publishers are to blame in each case, but either way standards need to be improved.


By mining PDFs we can re-extract and re-release far more than just phylogenetic data from the literature – we’re fairly sure we can reliably identify the rough type of figure depicted in PDFs by machine methods using certain diagnostic features such as number & proportion of horizontal and vertical lines.
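To give a flavour of the kind of diagnostic feature meant here, below is a minimal Python sketch that counts near-horizontal and near-vertical line segments. It assumes the segments have already been pulled out of the figure’s vector drawing commands (that step is not shown), and the function name and angle tolerance are illustrative choices of mine rather than our actual tool.

```python
import math

def line_orientation_features(segments, tol_degrees=2.0):
    """Count near-horizontal and near-vertical segments.

    segments: iterable of (x1, y1, x2, y2) tuples.
    Returns the number of usable segments and the proportion that are
    horizontal or vertical within tol_degrees.
    """
    horizontal = vertical = other = 0
    for x1, y1, x2, y2 in segments:
        dx, dy = abs(x2 - x1), abs(y2 - y1)
        if dx == 0 and dy == 0:
            continue  # ignore zero-length segments
        angle = math.degrees(math.atan2(dy, dx))  # 0 = horizontal, 90 = vertical
        if angle <= tol_degrees:
            horizontal += 1
        elif angle >= 90.0 - tol_degrees:
            vertical += 1
        else:
            other += 1
    total = horizontal + vertical + other
    return {
        "n_segments": total,
        "prop_horizontal": horizontal / total if total else 0.0,
        "prop_vertical": vertical / total if total else 0.0,
    }

features = line_orientation_features(
    [(0, 0, 10, 0), (0, 0, 0, 5), (0, 0, 3, 3), (1, 1, 1, 1)]
)
```

A rectangular phylogram would be expected to score high on both proportions, whereas a scatter plot or pie chart would not; features like these could then be fed to whatever classifier is chosen.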



Peter Murray-Rust & I are now looking for a collaborator to help us implement machine learning methods to classify scientific figures into discrete categories e.g. bar charts, scatter plots, network diagrams (including phylogenies), pie charts, box & whisker plots etc… in an automated way.

If you’re interested please contact me or Peter.

That’s all for now.

PS If you’re watching the London 2012 Olympics Volleyball tomorrow morning you may well just see me in the crowd. Managed to snaffle some returned tickets by setting up an alert for new tickets using a combination of (to alert me to page changes on the ticket website) and to email me as soon as the RSS feed gets a new item (updated ticket information). Without this nifty trick I very much doubt I’d have got any tickets.

just a quick post…

I’m pretty shocked at the poor indexing service given by Thomson Reuters Web of Knowledge (or ISI Web of Science as you might know it).

I’ve unashamedly bashed them before and I’ll bash them again here. (They deserve criticism because they’re paid a lot of money to do this as a commercial for-profit enterprise, and I don’t think they’re doing it as well as they could be.)

I performed a very simple search today looking for the articles containing the word ‘cladistic’ but NOT ‘phylogen*’ for articles published in the year 2010.

Topic=(cladistic) NOT Topic=(phylogen*) AND Year Published=(2010)

Below is a screenshot of just one of many of the disappointing results. I’ve refined the search to just the PLoS paper, to clearly show that it does come up in this search:

It’s an Open Access paper, so we can all go see for ourselves the FULL content of the paper

Na, B.-K., Bae, Y.-A., Zo, Y.-G., Choe, Y., Kim, S.-H., Desai, P. V., Avery, M. A., Craik, C. S., Kim, T.-S., Rosenthal, P. J., and Kong, Y. 2010. Biochemical properties of a novel cysteine protease of plasmodium vivax, vivapain-4. PLOS NEGLECTED TROPICAL DISEASES 4.

In which we find the text caption for figure 1 mentions ‘phylogen*’ twice!

from Na, B.-K., Bae, Y.-A., Zo, Y.-G., Choe, Y., Kim, S.-H., Desai, P. V., Avery, M. A., Craik, C. S., Kim, T.-S., Rosenthal, P. J., and Kong, Y. 2010. Biochemical properties of a novel cysteine protease of plasmodium vivax, vivapain-4. PLOS NEGLECTED TROPICAL DISEASES 4. CC-BY licenced

So at the very least I suspect Web of Science (WoS) is systematically NOT indexing the caption text of figures (if you know more than I about this, please do comment). Academics rely on services like this to effectively and accurately search the literature, to perform comprehensive reviews and such. If all the textual content of science isn’t actually being indexed by WoS, that’s clearly going to lead to bad science at some point (e.g. a vital missing paper, not picked up in an otherwise well designed literature search). I could forgive them for not being able to OCR the text within the images of figures, but NOT for the fully machine-readable text captions like this one. Furthermore, it’s Open Access and fully-digital – why aren’t they indexing figure caption text?



It appears it’s not just figure caption text they don’t index. Do they index only titles and abstracts?

Many of the other 81 results (papers) of that search for ‘cladistic’ but NOT ‘phylogen*’ contain the word-stem ‘phylogen*’ in the full text of the paper!


Wilts, E. F., Arbizu, P. M., and Ahlrichs, W. H. 2010. Description of bryceella perpusilla n. sp (monogononta: Proalidae), a new rotifer species from terrestrial mosses, with notes on the ground plan of bryceella REMANE, 1929. INTERNATIONAL REVIEW OF HYDROBIOLOGY 95.

Echeverry, A. and Morrone, J. J. 2010. Parsimony analysis of endemicity as a panbiogeographical tool: an analysis of caribbean plant taxa. BIOLOGICAL JOURNAL OF THE LINNEAN SOCIETY 101.

Stutz, H. L., Shiozawa, D. K., and Evans, R. P. 2010. Inferring dispersal of aquatic invertebrates from genetic variation: a comparative study of an amphipod and mayfly in great basin springs. JOURNAL OF THE NORTH AMERICAN BENTHOLOGICAL SOCIETY 29

Campo, D., Molares, J., Garcia, L., Fernandez-Rueda, P., Garcia-Gonzalez, C., and Garcia-Vazquez, E. 2010. Phylogeography of the european stalked barnacle (pollicipes pollicipes): identification of glacial refugia. MARINE BIOLOGY 157.

Choiniere, J. N., Clark, J. M., Forster, C. A., and Xu, X. 2010. A basal coelurosaur (dinosauria: Theropoda) from the late jurassic (oxfordian) of the shishugou formation in wucaiwan, people’s republic of china. JOURNAL OF VERTEBRATE PALEONTOLOGY 30.

Caldwell, M. W. and Palci, A. 2010. A new species of marine ophidiomorph lizard, adriosaurus skrbinensis, from the upper cretaceous of slovenia. JOURNAL OF VERTEBRATE PALEONTOLOGY 30.

Hastings, A. K., Bloch, J. I., Cadena, E. A., and Jaramillo, C. A. 2010. A new small short-snouted dyrosaurid (crocodylomorpha, mesoeucrocodylia) from the paleocene of northeastern colombia. JOURNAL OF VERTEBRATE PALEONTOLOGY 30.

Karanovic, I. and McKay, K. 2010. Two new species of leicacandona karanovic (ostracoda, candoninae) from the great sandy desert, australia. JOURNAL OF NATURAL HISTORY 44.

(and more, these are just some of the articles I’ve looked at the full-text of so far… I think it’s safe to say now this is NOT a one off phenomenon)

I’ve now found through manual inspection that at least 47 of the ‘hits’ for this search actually contain a ‘phylogen*’ word within the main text of the paper (excluding the reference list)

I guess I’m probably not the first to realise this but… wow. Is this not *really* poor service? I’m pretty sure my desktop software could do a better job of indexing than this. All it is, is simple string matching!

…and of course I can do a better job of this myself with Open Access papers. All one need do is download the OA corpus from UKPMC and index the *FULL* text including figure caption text and reference lists yourself. I wonder how many more relevant papers I might ‘find’ with my searches if I did this rather than relying on Web of Science?
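The ‘simple string matching’ I have in mind looks roughly like this minimal Python sketch. It assumes you already have the full text of each paper (body, figure captions and reference list) as plain strings keyed by some identifier; the function names are made up, and I can’t promise the word matching mirrors exactly how WoS interprets Topic=(cladistic).

```python
import re

def matches_query(text):
    """True if text contains the word 'cladistic' but no word
    beginning with 'phylogen' (case-insensitive)."""
    has_cladistic = re.search(r"\bcladistic\b", text, re.IGNORECASE) is not None
    has_phylogen = re.search(r"\bphylogen\w*", text, re.IGNORECASE) is not None
    return has_cladistic and not has_phylogen

def search_corpus(docs):
    """docs: mapping of identifier -> full text. Returns matching ids."""
    return sorted(doc_id for doc_id, text in docs.items() if matches_query(text))

# Toy corpus, purely for illustration.
docs = {
    "paper-a": "We present a cladistic analysis of the group.",
    "paper-b": "A cladistic analysis. Figure 1: Phylogenetic tree of the taxa.",
    "paper-c": "No relevant terms here.",
}
hits = search_corpus(docs)
```

Because the caption text is part of what gets searched, the second toy paper is correctly excluded, which is exactly what the WoS results above fail to do.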

I realise thus far, I may not have explained too clearly exactly what I’m doing for my Panton fellowship. With this post I shall attempt to remedy this and shed a little more light on what I’ve been doing lately.

The main thrust of my fellowship is to extract phylogenetic tree data from the literature using content mining approaches (think text mining, but not just text!) – using the literature in its entirety as my data. I have very little prior experience in this area, but luckily I have an expert mentor guiding me: Peter Murray-Rust (whom you may often see referred to as PMR). For those of us biologists who may not be familiar with his work, whilst trying not to be too sycophantic about it, PMR is simply brilliant: it’s amazing what he and his collaborators have done to extract chemical data from the chemical literature and provide it openly for everyone, in spite of fierce opposition at times from those with vested interests in this data remaining ‘closed’.

Now he’s turned his attention to the biological literature for my project and together we’re going to try and provide open tools to extract phylogenetic data from the literature. Initially I proposed trying to grab just tree topology and tip labels – a kind of bare minimum, but PMR has convinced me that we should be ambitious and all-encompassing, and thus our aims have expanded to include branch lengths, support values, the data-type the phylogeny was inferred from, and other useful metadata. And why not? We’re ingesting the totality of the paper in our process, from title page to reference list, so there’s plenty of machine-readable data to be gleaned. The question is, can we glean it accurately enough, balancing precision and recall?

So for starters, we’ve been using test materials that we’re legally allowed to, namely Open Access CC-BY papers from BMC & PLoS to test our extraction tools, specifically focusing on a subset of all ~8500 papers containing the word-stem phylogen* from BMC. It’s a rough proxy for papers that’ll contain a tree, and it’s good enough for now – we’ll need to be able to deal with false positives along with all the true positives, so it’s instructive to keep these in our sample.

We’ve been working on the regular structure of BMC PDFs, getting out bibliographic metadata, and the main-text for further NLP processing downstream to pick out data & method relevant words like say PAUP* , ML , mitochondrial loci etc… But the real reason we’re deliberately using PDFs rather than the XML (which we also have access to) is the figures – where all the valuable phylogenetic tree data is. If this can be re-interpreted with reference to the bibliographic metadata, the figure caption and further methodological details from the full-text of the paper, then we may be able to reconstruct some fairly rich and useful phylogenetic data.

To make it clear, in slight contrast to the Lapp et al iEvoBio presentation embedded above, we’re not trying to just extract the images, but rather to re-interpret them back into actual re-useable data, probably to be provided in NeXML (and from there on, whatever form you want). We’re pretty sure it’s an achievable goal. Programs like TreeThief, TreeRipper, and TreeSnatcher Plus have gone some way towards this already, but have never before been incorporated in a content mining workflow AFAIK.

Unfortunately I wasn’t at iEvoBio 2012 (I’m short on money and on time these days) but it’s great to see from the slides the growing recognition of the SVG image file format as a brilliant tool for communicating digital science. I also put a bit about that in my Hennig XXXI talk slides too (towards the end). Programs like TNT do output SVG files, so there’s scope to make this a normal part of any publication workflow. Regrettably though, rather few publisher produced PDFs contain SVG formatted images – but if people, and editorial boards (perhaps?) can be made aware of their advantages, perhaps we can change this in future…?

[Figure: Agapornis phylogeny.svg from Wikipedia (PD)]

[Figure: the very same file, opened as plain text. It’s fairly easy to reconvert back into re-useable machine-readable data.]
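To give a flavour of why SVG is so friendly to re-use: it’s just XML, so the labels and line coordinates can be read straight out with a standard parser. A minimal sketch, assuming a tiny hand-written SVG (not the Wikipedia file itself) in which branches are `<line>` elements and tip labels are `<text>` elements:

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

# A tiny hand-written SVG standing in for a tree figure (not the Wikipedia file).
svg_source = """<svg xmlns="http://www.w3.org/2000/svg" width="200" height="100">
  <line x1="10" y1="20" x2="80" y2="20"/>
  <line x1="10" y1="60" x2="80" y2="60"/>
  <line x1="10" y1="20" x2="10" y2="60"/>
  <text x="85" y="24">Agapornis roseicollis</text>
  <text x="85" y="64">Agapornis fischeri</text>
</svg>"""

root = ET.fromstring(svg_source)

# Tip labels are ordinary text nodes with coordinates.
labels = [(el.text, float(el.get("x")), float(el.get("y")))
          for el in root.iter(SVG_NS + "text")]

# Branches are line segments with explicit end points.
branches = [tuple(float(el.get(k)) for k in ("x1", "y1", "x2", "y2"))
            for el in root.iter(SVG_NS + "line")]

print(labels)
print(len(branches), "line segments")
```

Real-world SVGs often draw branches as `<path>` elements instead, and working out which segment joins which to rebuild the tree topology is the hard part; this only shows that the raw ingredients are sitting there in plain text.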

Gathering phylogenetic data from beyond PLoS, BMC and other smaller Open Access publishers is going to be hard, not for technical but for purely legal reasons:

The scope and scale of phylogenetic research (using ‘phylogen*’ as a proxy):

There’s a lot of phylogenetic research out there… but little of it is Open Access – which is problematic for content mining approaches – particularly if subscription-access publishers are reluctant to allow access.

Some facts:

  • A Thomson Reuters Web of Science search of the SCI-EXPANDED database (only), Topic=(phylogen*) AND Year Published=(2000-2011), returns 101,669 results (at the time of searching, YMMV)
  • 91,788 of these are primary Research Articles (as opposed to Reviews, Proceedings Papers, Meeting Abstracts, Editorial Materials, Corrections, Book Reviews etc…)
  • Recent MIAPA working group research I contributed to (in review) quantitatively estimates that approximately 66% of papers containing ‘phylogen*’ report a new phylogenetic analysis (new data).
  • Thus, conservatively assuming just one tree per paper (there are often many per paper), there are > 60,000 trees (66% of 91,788) contained within just 21st century research articles.
  • As with STM publishing as a whole, the number of phylogenetic research articles being published each year shows consistent year-on-year increases
  • Cross-match this with publisher licensing data and you’ll find that only ~11% of phylogenetic research published in 2010 was CC-BY Open Access (and this % probably decreases as you go back before 2010)
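
If you want to check the back-of-the-envelope tree estimate yourself, it’s a one-liner. The figures are the ones quoted in the list above; the variable names are just mine for illustration.

```python
research_articles = 91_788      # Web of Science count quoted above
share_with_new_analysis = 0.66  # MIAPA working group estimate quoted above
trees_per_paper = 1             # deliberately conservative assumption

estimated_trees = research_articles * share_with_new_analysis * trees_per_paper
print(round(estimated_trees))   # a little over 60,000
```

Bump `trees_per_paper` up to anything more realistic and the number grows accordingly.
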
So the real fun and games will come later this year, when I’m sure we’ll have the capability (software tools) to do some amazing stuff, having first perfected it on OA materials… but will they let us? Heather Piwowar’s experience earlier this year didn’t look too fun – and that was all for just one publisher. Phylogenetic research is published by at least 80 separate STM publishers by my count (let alone the >500 journals it appears in!) – so there’s no way anyone would bother trying to negotiate with them all! I’m sticking by the intuitive principle that The Right to Read Is the Right to Mine, but I’ll think about that some more when we actually get to that bridge.

Finally, it’s also worth acknowledging that we’re certainly not the first in this peculiar non-biomedical mining space – ‘biodiversity informaticists’ have been doing useful things with these techniques for a while now, in innovative ways largely unrelated to medicine, e.g. LINNAEUS from Casey Bergman’s lab, and a recent review of other projects from Thessen et al (2012) [hat-tip to @rdmpage for bringing that latter paper to the world’s attention via Twitter]. Literally all areas of academia could probably benefit from some form or another of content mining – it’s not just a biomed / biochem tool.

So, I hope that explains things a bit better. Any questions?


Some references (but not all!):

Gerner, M., Nenadic, G., and Bergman, C. 2010. LINNAEUS: A species name identification system for biomedical literature. BMC Bioinformatics 11:85+. [CC-BY Open Access]

Thessen, A. E., Cui, H., and Mozzherin, D. 2012. Applications of natural language processing in biodiversity science. Advances in Bioinformatics 2012:1-17. [CC-BY Open Access]

Hughes, J. 2011. TreeRipper web application: towards a fully automated optical tree recognition software. BMC Bioinformatics 12:178+. [CC-BY Open Access]

Laubach, T., von Haeseler, A., and Lercher, M. 2012. TreeSnatcher plus: capturing phylogenetic trees from images. BMC Bioinformatics 13:110+. [CC-BY Open Access. Incidentally, I was one of the reviewers for this paper. I signed my review, and made a point of it too. Nor was it a soft review, I might add.]