Welcome to 21st century academia…

In the last 2 weeks I’ve given talks in Brussels & Amsterdam.

The first one was given during a European Commission (Brussels) working group meeting on Text & Data Mining. There were perhaps only ~30 people in the room for that.

The second presentation was given just a few days ago at Beyond The PDF 2 (#btpdf2) in Amsterdam.

I uploaded the slides from both of these talks to Slideshare just before or after I gave each talk to help maximize their impact. Since then they’ve had nearly 1000 views according to my Slideshare analytics dashboard.

It’s not just the view count I’m impressed with. The global reach is pretty cool too (see below, created with BatchGeo):

View My Slideshare Impact 08/Mar/2013 to 22/Mar/2013 in a full screen map

Now obviously, these view counts don’t mean that every viewer went through all the slides, and a minority of the views are bots crawling the web, but still I’m pretty pleased. Imagine if I hadn’t uploaded my Content Mining presentation to the public web. I would have travelled all the way to Brussels and back again (in the same day!) for the benefit of *just* ~30 people (albeit rather important people!). Instead, over 800 people have had the opportunity to view my slides, from all over the world (although, admittedly, mostly just the US & Europe).

The moral of this short story: upload your slides & tweet about them whenever you give a talk!
You may not appreciate just how big your potential audience could be. Something academics sceptical of Open Access should perhaps think about?

Particular thanks should go to @openscience for helping disseminate these slides far and wide. During just a 60 minute period, upon first release, thanks to @openscience and others my PDF metadata slidedeck got over 100 views this Wednesday!

Next step… must work on getting these stats into an ImpactStory widget for the next version of my CV!

So a week ago, I investigated publisher-produced Version of Record PDFs with pdfinfo and the results were very disappointing. Lots of missing metadata was found and one could not reliably identify most of these PDFs from metadata alone, let alone extract particular fields of interest.

But Rod Page kindly alerted me to the fact that I might be using the wrong tool for this investigation. So at his suggestion I’ve tried again to extract metadata from the exact same set of PDFs as last time…

Only this time I’ll be using exiftool version 9.10.

This time I’ve put the full raw metadata output from exiftool on figshare for each and every PDF file, just to really prove the point, reproducible research and all. I’d love to post the corresponding PDFs too but sadly many of them are not Open Access, which prevents me from uploading them to a public space. **Insert timely comment here about how closed access publications stifle effective research practices…**

Exiftool is really simple to use. You just need to type:
exiftool NameOfPDF.pdf
to get a human-readable exhaustive output of all possible metadata.

and
exiftool -b -XMP NameOfPDF.pdf
to get XML-structured metadata. I could only extract this from 56 of the 69 PDF files. The data output from this for those 56 PDFs is available as a separate fileset on figshare here.

Finally, if you want to test a whole bunch of PDF files in your working directory I’ve made a simple shell script that loops through all PDFs in your working directory, available here (oops, it’s not data, perhaps I should have put that on github instead?). [I'm sure many readers will be able to create a simple bash loop themselves but just for those that don't...]

I’m assuming that the reason exiftool -b -XMP failed on 13 of those PDFs is because they have no embedded XMP metadata – an empty (zero-byte sized) file is created for these. This is an assumption though… I notice that those 13 exactly correspond with all the 13 that were produced with iText. I checked the website and I’m pretty sure iText 2.x and up can embed XMP metadata, it’s just whether the publishers have bothered to use & apply this functionality.
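
For anyone who would rather do the looping in Python than in bash, here is a minimal sketch of the same idea. It is not the shell script linked above. The function names (xmp_command, extract_xmp, without_xmp) are made up for illustration, and it assumes exiftool is installed and on your PATH. It runs exiftool -b -XMP on every PDF in a directory and treats empty output as "no XMP packet found", which is the same assumption I make about the zero-byte files.

```python
import subprocess
from pathlib import Path


def xmp_command(pdf_path):
    """Build the command used in the post: exiftool -b -XMP NameOfPDF.pdf"""
    return ["exiftool", "-b", "-XMP", str(pdf_path)]


def extract_xmp(directory, run=None):
    """Return {pdf filename: raw XMP bytes} for every *.pdf in `directory`.

    `run` is a hypothetical hook (not an exiftool feature): any callable that
    takes the command list and returns the bytes written to stdout.
    """
    if run is None:
        # Default: actually call exiftool (assumed to be on PATH).
        def run(cmd):
            return subprocess.run(cmd, capture_output=True).stdout

    results = {}
    for pdf in sorted(Path(directory).glob("*.pdf")):
        results[pdf.name] = run(xmp_command(pdf))
    return results


def without_xmp(results):
    """Names of the PDFs for which no XMP bytes came back."""
    return sorted(name for name, xmp in results.items() if not xmp)


if __name__ == "__main__":
    found = extract_xmp(".")
    print(len(found), "PDFs checked;", len(without_xmp(found)), "returned no XMP")
```

Note that empty output only tells you exiftool returned nothing; it cannot distinguish "no XMP embedded" from "XMP present but unreadable", so the caveat above still applies.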

So if I’m right, neither Taylor & Francis, BRILL, nor Acta Palaeontologica Polonica embed XMP metadata (at all!) in their PDFs. The alternative explanation is that the XMP metadata is in there but exiftool for whatever reason can’t read/parse it from iText-produced PDFs. I find this an unlikely alternative explanation though tbh.

Elsevier have superior XMP metadata to everyone else by the looks of it, but Elsevier aside the metadata is still very poor, so my conclusions from last week’s post still stand I think.

Most of the others do contain metadata (of some sort) but by and large it’s rather poor. I need to get some other work done on Monday so I’m afraid this is where I’m going to leave this for now. But I hope I’ve made the point.

Further angles to explore

Interestingly, Brian Kelly has taken this in a slightly different direction and looked at the metadata of PDFs in institutional repositories. I hadn’t realised this but apparently some institutional repositories (IRs) routinely add cover pages to most deposits. If this is done without care for the embedded metadata, the original metadata can be wiped and/or replaced with newer (less informative) metadata. Not to mention that cover pages are completely unnecessary: all the information on a cover page is exactly the kind of stuff that should be put in embedded metadata! No need to waste time and space by putting that info as the first page. JSTOR does this too (cover pages) and it annoys the hell out of me.

After some excellent chat on Twitter about this IR angle I’ve discovered that UKOLN, based here on campus at Bath, have also done some interesting research in this area, in particular the FixRep project which is described in more detail here. CrossRef Labs’ pdfmark tool also looks like something of interest for fixing PDFs with poor quality metadata. I’ve got this installed/compiled from the source on github but haven’t tried it out yet. It would be interesting to see the difference it makes – a before and after comparison of metadata to see what we’re missing… But why should we fix a problem that shouldn’t exist in the first place? Publishers are the point of origin for this. It’s their job to be the first to publish the Version of Record. They should provide the highest level of metadata possible IMO.

Why would publishers add metadata?

Because their customers – libraries, governments, research funders (in the case of Open Access PDFs) – should demand it. A pipe dream perhaps but that’s my $.02. I would ask for a refund if I downloaded MP3s from iTunes/Amazon MP3 with insufficient embedded metadata. Why not the same principle for electronically published PDFs?

PS Apologies for some of the very cryptic filenames in the metadata uploads on figshare. You’ll have to cross-match with this list here or the spreadsheet I uploaded last week to work out which metadata file corresponds to which PDF/Bibliographic Data record/Publisher.

Publisher | Identifier | Journal | Contains embedded XMP metadata? | Filename
American Association for the Advancement of Science | Ezard2011 | Science | yes? | ezard_11_interplay_759293.pdf
American Association for the Advancement of Science | Nagalingum2011 | Science | yes? | nagalingum_11_recent_719133.pdf
American Association for the Advancement of Science | Rowe2011 | Science | yes? | Science-2011-Rowe-955-7.pdf
Blackwell Publishing Ltd | Burks2011 | Cladistics | yes? | burks_11_combined_694888.pdf
Blackwell Publishing Ltd | Janies2011 | Cladistics | yes? | janies_11_supramap_779773.pdf
Blackwell Publishing Ltd | Simmons2011 | Cladistics | yes? | simmons_11_deterministic_779537.pdf
BRILL | Barbosa2011 | Insect Systematics & Evolution | no | barbosa_11_phylogeny_779910.pdf
BRILL | Dellape2011 | Insect Systematics & Evolution | no | dellape_11_phylogenetic_779909.pdf
Cambridge Journals Online | Knoll2010 | Geological Magazine | yes? | knoll_10_primitive_475553.pdf
Cambridge Journals Online | Saucede2007 | Geological Magazine | yes? | thomas_saucegraved_07_phylogeny_506869.pdf
CSIRO | Chamorro2011 | Invertebrate Systematics | yes? | chamorro_11_phylogeny_780467.pdf
CSIRO | Daugeron2011 | Invertebrate Systematics | yes? | daugeron_11_phylogenetic_780466.pdf
CSIRO | Johnson2011 | Invertebrate Systematics | yes? | johnson_11_collaborative_750540.pdf
Elsevier | Lane2011 | Molecular Phylogenetics and Evolution | yes | E3-1-s2.0-S1055790311001448-main.pdf
Elsevier | Cunha2011 | Molecular Phylogenetics and Evolution | yes | E2-1-s2.0-S1055790311001680-main.pdf
Elsevier | Spribille2011 | Molecular Phylogenetics and Evolution | yes | E1-1-s2.0-S1055790311001606-main.pdf
Frontiers In | Horn2011 | Frontiers in Neuroscience | yes? | fnins-05-00088.pdf
Frontiers In | Ogura2011 | Frontiers in Neuroscience | yes? | fnins-05-00091.pdf
Frontiers In | Tsagareli2011 | Frontiers in Neuroscience | yes? | fnins-05-00092.pdf
Hindawi | Diniz2012 | Psyche: A Journal of Entomology | yes? | 79139500.pdf
Hindawi | Restrepo2012 | Psyche: A Journal of Entomology | yes? | 516419.pdf
Hindawi | Savopoulou2012 | Psyche: A Journal of Entomology | yes? | 167420.pdf
Institute of Paleobiology, Polish Academy of Sciences | Amson2011 | Acta Palaeontologica Polonica | no | amson_11_affinities_666987.pdf
Institute of Paleobiology, Polish Academy of Sciences | Edgecombe2011 | Acta Palaeontologica Polonica | no | edgecombe_11_new_666988.pdf
Institute of Paleobiology, Polish Academy of Sciences | Williamson2011 | Acta Palaeontologica Polonica | no | app2E20092E0147.pdf
Magnolia Press | Agiuar2011 | Zootaxa | yes? | zt02846p098.pdf
Magnolia Press | Ebach2011 | Zootaxa | yes? | ebach_11_taxonomy_599972.pdf
Magnolia Press | Nelson2011 | Zootaxa | yes? | nelson_11_resemblance_688762.pdf
National Academy of Sciences | Casanovas2011 | Proceedings of the National Academy of Sciences | yes? | casanovas-vilar_11_updated_644658.pdf
National Academy of Sciences | Goswami2011 | Proceedings of the National Academy of Sciences | yes? | goswami_11_radiation_814757.pdf
National Academy of Sciences | Thorne2011 | Proceedings of the National Academy of Sciences | yes? | thorne_11_resetting_654055.pdf
Nature Publishing Group | Meng2011 | Nature | yes? | meng_11_transitional_644647.pdf
Nature Publishing Group | Rougier2011 | Nature | yes? | rougier_11_highly_720202.pdf
Nature Publishing Group | Venditti2011 | Nature | yes? | venditti_11_multiple_779840.pdf
NRC Research Press | CruzadoCaballero2010 | Canadian Journal of Earth Sciences | yes? | 650000.pdf
NRC Research Press | Druckenmiller2010 | Canadian Journal of Earth Sciences | yes? | 80000000c5.pdf
NRC Research Press | Mazierski2010 | Canadian Journal of Earth Sciences | yes? | mazierski_10_description_577223.pdf
NRC Research Press | Modesto2009 | Canadian Journal of Earth Sciences | yes? | modesto_09_new_577201.pdf
NRC Research Press | Parsons2009 | Canadian Journal of Earth Sciences | yes? | parsons_09_new_575744.pdf
NRC Research Press | Wu2007 | Canadian Journal of Earth Sciences | yes? | wu_07_new_622125.pdf
Pensoft Publishers | Hagedorn2011 | ZooKeys | yes? | hagedorn_11_creative_779747.pdf
Pensoft Publishers | Penev2011 | ZooKeys | yes? | penev_11_interlinking_694886.pdf
Pensoft Publishers | Thessen2011 | ZooKeys | yes? | thessen_11_data_779746.pdf
Public Library of Science | Hess2011 | PLoS ONE | yes? | hess_11_addressing_694222.pdf
Public Library of Science | McDonald2011 | PLoS ONE | yes? | mcdonald_11_subadult_694229.pdf
Public Library of Science | Wicherts2011 | PLoS ONE | yes? | wicherts_11_willingness_779788.pdf
SAGE Publications | deKloet2011 | Journal of Veterinary Diagnostic Investigation | yes? | Invest-2011-deKloet-421-9.pdf
SAGE Publications | Richter2011 | Journal of Veterinary Diagnostic Investigation | yes? | Invest-2011-Richter-430-5.pdf
SAGE Publications | Wassmuth2011 | Journal of Veterinary Diagnostic Investigation | yes? | Invest-2011-Wassmuth-436-53.pdf
Senckenberg Natural History Collections Dresden | Fresneda2011 | Arthropod Systematics & Phylogeny | yes? | fresneda_11_phylogenetic_785869.pdf
Senckenberg Natural History Collections Dresden | Mally2011 | Arthropod Systematics & Phylogeny | yes? | ASP_69_1_Mally_55-71.pdf
Senckenberg Natural History Collections Dresden | Shimizu2011 | Arthropod Systematics & Phylogeny | yes? | ASP_69_2_Shimizu_75-81.pdf
Springer-Verlag | Beermann2011 | Zoomorphology | yes? | 10.1007_s00435-011-0129-9.pdf
Springer-Verlag | Cuezzo2011 | Zoomorphology | yes? | cuezzo_11_ultrastructure_694669.pdf
Springer-Verlag | Vinn2011 | Zoomorphology | yes? | 10.1007_s00435-011-0133-0.pdf
Taylor & Francis | Bianucci2011 | Journal of Vertebrate Paleontology | no | bianucci_11_aegyptocetus_778747.pdf
Taylor & Francis | Makovicky2011 | Journal of Vertebrate Paleontology | no | makovicky_11_new_694826.pdf
Taylor & Francis | Pietri2011 | Journal of Vertebrate Paleontology | no | pietri_11_revision_689491.pdf
Taylor & Francis | Rook2011 | Journal of Vertebrate Paleontology | no | rook_11_phylogeny_694916.pdf
Taylor & Francis | Tsuihiji2011 | Journal of Vertebrate Paleontology | no | tsuihiji_11_cranial_660620.pdf
Taylor & Francis | Yates2011 | Journal of Vertebrate Paleontology | no | yates_11_new_694821.pdf
Taylor & Francis | Gerth2011 | Systematics and Biodiversity | no | gerth_11_wolbachia_779749.pdf
Taylor & Francis | Krebes2011 | Systematics and Biodiversity | no | krebes_11_phylogeography_779700.pdf
Sociedade Brasileira de Ictiologia | Britski2011 | Neotropical Ichthyology | yes? | a02v9n2.pdf
Sociedade Brasileira de Ictiologia | Sarmento2011 | Neotropical Ichthyology | yes? | a03v9n2.pdf
Sociedade Brasileira de Ictiologia | Calegari2011 | Neotropical Ichthyology | yes? | a04v9n2.pdf
Royal Society | Billet2011 | Proceedings of the Royal Society B: Biological Sciences | yes? | billet_11_oldest_687630.pdf
Royal Society | Polly2011 | Proceedings of the Royal Society B: Biological Sciences | yes? | polly_11_history_625430.pdf
Royal Society | Sansom2011 | Proceedings of the Royal Society B: Biological Sciences | yes? | sansom_11_decay_625429.pdf

Since Sunday afternoon I’ve been at an International Council for Science (ICSU) / Royal Society invited workshop on ‘Revaluing Science in the Digital Age’.

We’ve had a fascinating set of talks from academics, publishers (PLoS, Nature, BMC), librarians, policymakers, data managers, scientific societies…

Attendees included:

Peter Buneman (http://en.wikipedia.org/wiki/Peter_Buneman)
Geoffrey Boulton (http://en.wikipedia.org/wiki/Geoffrey_Boulton)
Jose Cotta, European Commission
John M. Ball (http://en.wikipedia.org/wiki/John_M._Ball)

Mark Thorley (RCUK)
Chris Banks (University Librarian and Director, Aberdeen)
Mark Hahnel (Figshare)
Max Wilkinson (UCL, Head of Research Data Service)
Dave Roberts (ViBRANT)
Rob Frost (GSK)
Catriona MacCallum (PLoS)
Mark Forster (Syngenta)
Iain Hrynaszkiewicz (BMC)
Ruth Wilson (Nature Publishing Group)
Kaitlin Thaney (Digital Science)
Stuart Taylor (Royal Society)
Robert Simpson (Zooniverse)
Paul Groth (OpenPHACTS)
and more…

I gave a talk on content mining and the importance of full BOAI-compliant Open Access with respect to this, on behalf of the Open Knowledge Foundation:

There was lots of discussion on reproducibility, provenance of data, peer review, incentives, research misconduct and ethics.

I’ve met many new people and have learnt many new things. For example, on the subject of reproducibility I talked about Roger Peng and the journal Biostatistics in discussion, and then was soon informed that there was an analogous journal in Chemistry called Organic Syntheses whereby:

In order for a procedure to be accepted for publication, each reaction must be successfully repeated in the laboratory of a member of the Editorial Board at least twice, with similar yields (generally ±5%) and selectivity similar to that reported by the submitters.

Fantastic! We were also informed that this rigorous protocol ensures that research published in this journal is very highly regarded. I’ve suggested similar such reproducibility checks for phylogenetics research before (at the Systematics Association Biennial meeting Belfast, 2011) but this was viewed as too futuristic / infeasible…

Right now we’re working on a draft statement of outcome from this workshop that ICSU can pass to its members, who may then officially agree to endorse it.

So I better finish here, and get back to the discussion.
I’m rather hoping they will endorse the Panton Principles rather than reinvent the wheel (policy-wise).

Exciting times!

PS I have made a Storify of the tweets from the workshop here.

It’s the Olympics now so this work update is a) late and b) short

Never mind…

As ever progress has been exciting – look what we can extract from some PDFs:

(click to enlarge each) Attribution: The left panel is from Cánovas et al. BMC Evolutionary Biology 2011 11:371 doi:10.1186/1471-2148-11-371

On the left is the original figure, and on the right we have an SVG representation of the data we can extract automatically from this figure. We have the topology, the taxon labels AND the support values 100% correctly interpreted! Obviously we can’t reclaim phylogenetic data with this much precision and recall from all papers. But it’s a promising example, automatically generated – no manual guidance or tweaking needed – just feed it the PDF. [My WordPress server won't let me upload the original SVG copy of this for "security reasons" so the image on the right is a .jpg copy of the original .svg]

I should also note this was achieved completely independently of previous image-based tree-extraction software like TreeSnatcher Plus, TreeRipper & TreeThief. This is a great example of why it’s very important for editors and publishers to strictly stipulate that diagrams in figures containing data such as this be uploaded and produced in the final PDF version as vector graphics rather than raster bitmaps such as .png, .jpg or .bmp – only vectors keep the fidelity of the underlying data. We note that there are many publishers out there who regularly seem to produce figures in their PDFs that are, on the whole, NOT very good quality in this respect. It’s difficult to know whether the authors or the publishers are to blame in each case but either way standards need to be improved.

By mining PDFs we can re-extract and re-release far more than just phylogenetic data from the literature – we’re fairly sure we can reliably identify the rough type of figure depicted in PDFs by machine methods using certain diagnostic features such as number & proportion of horizontal and vertical lines.
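
To give a flavour of what "diagnostic features" might look like, here is a rough Python sketch. It is not our actual extraction code. It assumes the figure’s vector paths have already been reduced to straight line segments, given as (x1, y1, x2, y2) tuples, and the function name line_features is hypothetical. It just counts the lines and works out what proportion are horizontal or vertical, which is the kind of number a classifier could then be trained on.

```python
def line_features(segments, tol=1e-6):
    """Summarise line segments given as (x1, y1, x2, y2) tuples.

    Returns the number of segments and the fractions that are horizontal
    or vertical (within `tol`). Zero-length segments count as neither.
    """
    horizontal = vertical = other = 0
    for x1, y1, x2, y2 in segments:
        if abs(y1 - y2) <= tol and abs(x1 - x2) > tol:
            horizontal += 1
        elif abs(x1 - x2) <= tol and abs(y1 - y2) > tol:
            vertical += 1
        else:
            other += 1
    total = horizontal + vertical + other
    return {
        "n_lines": total,
        "frac_horizontal": horizontal / total if total else 0.0,
        "frac_vertical": vertical / total if total else 0.0,
    }


if __name__ == "__main__":
    # Toy input: a rectangular "tree-like" elbow plus one diagonal line.
    toy = [(0, 0, 10, 0), (10, 0, 10, 5), (10, 5, 20, 5), (0, 0, 3, 4)]
    print(line_features(toy))
```

A rectangular cladogram would be expected to score high on both fractions, whereas a scatter plot would contribute mostly axis lines, but that is an intuition to be tested rather than a result.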

Peter Murray-Rust & I are now looking for a collaborator to help us implement machine learning methods to classify scientific figures into discrete categories e.g. bar charts, scatter plots, network diagrams (including phylogenies), pie charts, box & whisker plots etc… in an automated way.

If you’re interested please contact me or Peter.

That’s all for now.

PS If you’re watching the London 2012 Olympics Volleyball tomorrow morning you may well just see me in the crowd. Managed to snaffle some returned tickets by setting up an alert for new tickets using a combination of www.page2rss.com (to alert me to page changes on the ticket website) and http://ifttt.com/ to email me as soon as the RSS feed gets a new item (updated ticket information). Without this nifty trick I very much doubt I’d have got any tickets.

just a quick post…

I’m pretty shocked at the poor indexing service given by Thomson Reuters Web of Knowledge (or ISI Web of Science as you might know it).

I’ve unashamedly bashed them before and I’ll bash them again here. (They deserve criticism because they’re paid a lot of money to do this as a commercial for-profit enterprise, and I don’t think they’re doing it as well as they could be.)

I performed a very simple search today, looking for articles published in the year 2010 that contain the word ‘cladistic’ but NOT ‘phylogen*’.

Topic=(cladistic) NOT Topic=(phylogen*) AND Year Published=(2010)
Refined by: Source Titles=( PLOS NEGLECTED TROPICAL DISEASES )
Databases=SCI-EXPANDED.

Below is a screenshot of just one of many of the disappointing results. I’ve refined the search to just the PLoS paper, to clearly show that it does come up in this search:

It’s an Open Access paper, so we can all go see for ourselves the FULL content of the paper.

Na, B.-K., Bae, Y.-A., Zo, Y.-G., Choe, Y., Kim, S.-H., Desai, P. V., Avery, M. A., Craik, C. S., Kim, T.-S., Rosenthal, P. J., and Kong, Y. 2010. Biochemical properties of a novel cysteine protease of plasmodium vivax, vivapain-4. PLOS NEGLECTED TROPICAL DISEASES 4.

In which we find the text caption for figure 1 mentions ‘phylogen*’ twice!

from Na, B.-K., Bae, Y.-A., Zo, Y.-G., Choe, Y., Kim, S.-H., Desai, P. V., Avery, M. A., Craik, C. S., Kim, T.-S., Rosenthal, P. J., and Kong, Y. 2010. Biochemical properties of a novel cysteine protease of plasmodium vivax, vivapain-4. PLOS NEGLECTED TROPICAL DISEASES 4. http://dx.doi.org/10.1371/journal.pntd.0000849 CC-BY licenced

So at the very least I suspect Web of Science (WoS) is systematically NOT indexing the caption text of figures (if you know more than I about this, please do comment). Academics rely on services like this to effectively and accurately search the literature, to perform comprehensive reviews and such. If all the textual content of science isn’t actually being indexed by WoS, that’s clearly going to lead to bad science at some point (e.g. a vital missing paper, not picked up in an otherwise well designed literature search). I could forgive them for not being able to OCR the text within the images of figures, but NOT for the fully machine-readable text captions like this one. Furthermore, it’s Open Access and fully-digital – why aren’t they indexing figure caption text?

*grr*

UPDATE

It appears it’s not just figure caption text they don’t index. Do they index only titles and abstracts?

Many of the other 81 results (papers) of that search for ‘cladistic’ but NOT ‘phylogen*’ contain the word-stem ‘phylogen*’ in the full text of the paper!

e.g.

Wilts, E. F., Arbizu, P. M., and Ahlrichs, W. H. 2010. Description of bryceella perpusilla n. sp (monogononta: Proalidae), a new rotifer species from terrestrial mosses, with notes on the ground plan of bryceella REMANE, 1929. INTERNATIONAL REVIEW OF HYDROBIOLOGY 95. http://dx.doi.org/10.1002/iroh.201011280

Echeverry, A. and Morrone, J. J. 2010. Parsimony analysis of endemicity as a panbiogeographical tool: an analysis of caribbean plant taxa. BIOLOGICAL JOURNAL OF THE LINNEAN SOCIETY 101. http://dx.doi.org/10.1111/j.1095-8312.2010.01535.x

Stutz, H. L., Shiozawa, D. K., and Evans, R. P. 2010. Inferring dispersal of aquatic invertebrates from genetic variation: a comparative study of an amphipod and mayfly in great basin springs. JOURNAL OF THE NORTH AMERICAN BENTHOLOGICAL SOCIETY 29 http://dx.doi.org/10.1899/09-157.1

Campo, D., Molares, J., Garcia, L., Fernandez-Rueda, P., Garcia-Gonzalez, C., and Garcia-Vazquez, E. 2010. Phylogeography of the european stalked barnacle (pollicipes pollicipes): identification of glacial refugia. MARINE BIOLOGY 157. http://dx.doi.org/10.1007/s00227-009-1305-z

Choiniere, J. N., Clark, J. M., Forster, C. A., and Xu, X. 2010. A basal coelurosaur (dinosauria: Theropoda) from the late jurassic (oxfordian) of the shishugou formation in wucaiwan, people’s republic of china. JOURNAL OF VERTEBRATE PALEONTOLOGY 30. http://dx.doi.org/10.1080/02724634.2010.520779

Caldwell, M. W. and Palci, A. 2010. A new species of marine ophidiomorph lizard, adriosaurus skrbinensis, from the upper cretaceous of slovenia. JOURNAL OF VERTEBRATE PALEONTOLOGY 30. http://dx.doi.org/10.1080/02724631003762963

Hastings, A. K., Bloch, J. I., Cadena, E. A., and Jaramillo, C. A. 2010. A new small short-snouted dyrosaurid (crocodylomorpha, mesoeucrocodylia) from the paleocene of northeastern colombia. JOURNAL OF VERTEBRATE PALEONTOLOGY 30. http://dx.doi.org/10.1080/02724630903409204

Karanovic, I. and McKay, K. 2010. Two new species of leicacandona karanovic (ostracoda, candoninae) from the great sandy desert, australia. JOURNAL OF NATURAL HISTORY 44. http://dx.doi.org/10.1080/00222933.2010.502977

(and more, these are just some of the articles I’ve looked at the full-text of so far… I think it’s safe to say now this is NOT a one off phenomenon)

I’ve now found through manual inspection that at least 47 of the ‘hits’ for this search actually contain a ‘phylogen*’ word within the main text of the paper (excluding the reference list).

I guess I’m probably not the first to realise this but… wow. Is this not *really* poor service? I’m pretty sure my desktop software could do a better job of indexing than this. All it is, is simple string matching!

…and of course I can do a better job of this myself with Open Access papers. All one need do is download the OA corpus from UKPMC and index the *FULL* text including figure caption text and reference lists yourself. I wonder how many more relevant papers I might ‘find’ with my searches if I did this rather than relying on Web of Science?
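
Just to show how little is needed, here is a minimal Python sketch of that ‘cladistic’ but NOT ‘phylogen*’ test applied to the full text of a paper. It assumes you already have each paper as one plain-text string (captions and all). The function name matches_query is my own, and this says nothing about how Web of Science actually works internally.

```python
import re


def matches_query(full_text, include="cladistic", exclude_stem="phylogen"):
    """True if the text contains the whole word `include` and no word
    beginning with `exclude_stem` (i.e. the wildcard search phylogen*)."""
    lowered = full_text.lower()
    has_include = re.search(r"\b" + re.escape(include) + r"\b", lowered) is not None
    has_excluded = re.search(r"\b" + re.escape(exclude_stem) + r"\w*", lowered) is not None
    return has_include and not has_excluded


if __name__ == "__main__":
    # Two made-up snippets standing in for full-text papers.
    abstract_only = "A cladistic analysis of rotifer morphology."
    with_caption = abstract_only + " Figure 1. Phylogenetic tree of the sampled taxa."
    print(matches_query(abstract_only))
    print(matches_query(with_caption))
```

Run over titles and abstracts only, a paper like the one above would be a ‘hit’; run over the full text including the figure caption, it would correctly be excluded.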

I realise thus far, I may not have explained too clearly exactly what I’m doing for my Panton fellowship. With this post I shall attempt to remedy this and shed a little more light on what I’ve been doing lately.

The main thrust of my fellowship is to extract phylogenetic tree data from the literature using content mining approaches (think text mining, but not just text!) – using the literature in its entirety as my data. I have very little prior experience in this area, but luckily I have an expert mentor guiding me: Peter Murray-Rust (whom you may often see referred to as PMR). For those of us biologists who may not be familiar with his work, and whilst trying not to be too sycophantic about it, PMR is simply brilliant. It’s amazing what he and his collaborators have done to extract chemical data from the chemical literature and provide it openly for everyone, in spite of fierce opposition at times from those with vested interests in this data remaining ‘closed’.

Now he’s turned his attention to the biological literature for my project and together we’re going to try and provide open tools to extract phylogenetic data from the literature. Initially I proposed trying to grab just tree topology and tip labels – a kind of bare minimum, but PMR has convinced me that we should be ambitious and all-encompassing, and thus our aims have expanded to include branch lengths, support values, the data-type the phylogeny was inferred from, and other useful metadata. And why not? We’re ingesting the totality of the paper in our process, from title page to reference list, so there’s plenty of machine-readable data to be gleaned. The question is, can we glean it accurately enough, balancing precision and recall?

So for starters, we’ve been using test materials that we’re legally allowed to use, namely Open Access CC-BY papers from BMC & PLoS, to test our extraction tools, specifically focusing on a subset of all ~8500 papers containing the word-stem phylogen* from BMC. It’s a rough proxy for papers that’ll contain a tree, and it’s good enough for now – we’ll need to be able to deal with false positives along with all the true positives, so it’s instructive to keep these in our sample.

We’ve been working on the regular structure of BMC PDFs, getting out bibliographic metadata, and the main-text for further NLP processing downstream to pick out data & method relevant words like say PAUP*, ML, mitochondrial loci etc… But the real reason we’re deliberately using PDFs rather than the XML (which we also have access to) is the figures – where all the valuable phylogenetic tree data is. If this can be re-interpreted with reference to the bibliographic metadata, the figure caption and further methodological details from the full-text of the paper, then we may be able to reconstruct some fairly rich and useful phylogenetic data.
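
As a toy illustration of that downstream step, here is a Python sketch that spots a fixed list of data & method terms in the main text. It is not the NLP pipeline we are building. The term list is just the three examples above and the function name find_terms is hypothetical. Matching is case-sensitive on purpose, so that ‘ML’ is not found inside ordinary words.

```python
import re

# Example terms taken from the paragraph above; a real list would be far longer.
METHOD_TERMS = ["PAUP*", "ML", "mitochondrial loci"]


def find_terms(text, terms=METHOD_TERMS):
    """Return the terms that occur in `text` as stand-alone tokens."""
    found = []
    for term in terms:
        # No word character immediately before or after the term.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, text):
            found.append(term)
    return found


if __name__ == "__main__":
    sentence = "Trees were inferred with PAUP* under ML from three mitochondrial loci."
    print(find_terms(sentence))
```

Simple matching like this will of course miss synonyms and spelling variants, which is part of why proper NLP processing is needed.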

To make it clear, in slight contrast to the Lapp et al iEvoBio presentation embedded above, we’re not trying to just extract the images, but rather to re-interpret them back into actual re-useable data, probably to be provided in NeXML (and from there on, whatever form you want). We’re pretty sure it’s an achievable goal. Programs like TreeThief, TreeRipper, and TreeSnatcher Plus have gone some way towards this already, but have never before been incorporated in a content mining workflow AFAIK.

Unfortunately I wasn’t at iEvoBio 2012 (I’m short on money and on time these days) but it’s great to see from the slides the growing recognition of the SVG image file format as a brilliant tool for communicating digital science. I also put a bit about that in my Hennig XXXI talk slides too (towards the end). Programs like TNT do output SVG files, so there’s scope to make this a normal part of any publication workflow. Regrettably though, rather few publisher produced PDFs contain SVG formatted images – but if people, and editorial boards (perhaps?) can be made aware of their advantages, perhaps we can change this in future…?

the very same file, opened as plain-text. It’s fairly easy to reconvert back into re-useable machine-readable data.

Agapornis phylogeny.svg from Wikipedia (PD)
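
Because an SVG is just XML text, pulling labels back out of it takes very little code. Here is a minimal Python sketch using only the standard library. The function name svg_text_labels is my own, the toy SVG in the example is made up (it is not the Wikipedia file), and it assumes each text element has a single numeric x and y attribute.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"


def svg_text_labels(svg_source):
    """Return a list of (x, y, label) for every <text> element in an SVG string."""
    root = ET.fromstring(svg_source)
    labels = []
    for node in root.iter(SVG_NS + "text"):
        # itertext() also collects text nested inside <tspan> children.
        label = "".join(node.itertext()).strip()
        if label:
            labels.append((float(node.get("x", "0")), float(node.get("y", "0")), label))
    return labels


if __name__ == "__main__":
    toy_svg = (
        '<svg xmlns="http://www.w3.org/2000/svg">'
        '<line x1="0" y1="5" x2="10" y2="5"/>'
        '<text x="12" y="5">Agapornis roseicollis</text>'
        '<text x="12" y="15">Agapornis canus</text>'
        '</svg>'
    )
    for x, y, label in svg_text_labels(toy_svg):
        print(x, y, label)
```

Tip labels with their coordinates are only the first step; matching them to the ends of the branch lines is what turns the picture back into a tree.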

Gathering phylogenetic data from beyond PLoS, BMC and other smaller Open Access publishers is going to be hard, not for technical, but purely legal reasons:

The scope and scale of phylogenetic research (using ‘phylogen*’ as a proxy):

There’s a lot of phylogenetic research out there… but little of it is Open Access – which is problematic for content mining approaches – particularly if subscription-access publishers are reluctant to allow access.

Some facts:

  • A Thomson Reuters Web of Science search, SCI-EXPANDED database (only), Topic=(phylogen*) AND Year Published=(2000-2011), returns 101,669 results (at the time of searching, YMMV)
  • 91,788 of which are primary Research Articles (as opposed to Reviews, Proceedings Papers, Meeting Abstracts, Editorial Materials, Corrections, Book Reviews etc…)
  • Recent MIAPA working group research I contributed to (in review) quantitatively estimates that approximately 66% of papers containing ‘phylogen*’ report a new phylogenetic analysis (new data).
  • Thus conservatively assuming just one tree per paper (there are often many per paper), there are > 60,000 trees contained within just 21st century research articles.
  • As with STM publishing as a whole, the number of phylogenetic research articles being published each year shows consistent year-on-year increases
  • Cross-match this with publisher licencing data and you’ll find that only ~11% of phylogenetic research published in 2010 was CC-BY Open Access (and this % probably decreases as you go back before 2010)
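
The ‘> 60,000 trees’ figure is just the numbers above multiplied together. Spelled out in Python, and assuming (as I do) that the ~66% estimate applies to the 91,788 research articles and that each such paper contains only one tree:

```python
research_articles = 91_788      # primary Research Articles, 2000-2011 (from the search above)
fraction_new_analysis = 0.66    # approximate share reporting a new phylogenetic analysis
trees_per_paper = 1             # deliberately conservative assumption

estimated_trees = research_articles * fraction_new_analysis * trees_per_paper
print(round(estimated_trees))   # about 60,580, i.e. more than 60,000
```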
So the real fun and games will come later this year, when I’m sure we’ll have the capability (software tools) to do some amazing stuff, having first perfected it on OA materials… but will they let us? Heather Piwowar’s experience earlier this year didn’t look too fun – and that was all for just one publisher. Phylogenetic research occurs in and beyond at least 80 separate STM publishers by my count (let alone the >500 journals it occurs in!) – so there’s no way anyone would bother trying to negotiate with them all! I’m sticking by the intuitive principle that The Right to Read Is the Right to Mine but I’ll have a think about that some more when we actually get to that bridge.

Finally, it’s also worth acknowledging that we’re certainly not the first in this peculiar non-biomedical mining space – ‘biodiversity informaticists’ have been doing useful things with these techniques for a while now in innovative ways largely unrelated to medicine e.g. LINNAEUS from Casey Bergman’s lab, and a recent review of other projects from Thessen et al (2012) [hat-tip to @rdmpage for bringing that latter paper to the world's attention via Twitter]. Literally all areas of academia could probably benefit from some form or another of content mining – it’s not just a biomed / biochem tool.

So, I hope that explains things a bit better. Any questions?

Some references (but not all!):

Gerner, M., Nenadic, G., and Bergman, C. 2010. LINNAEUS: A species name identification system for biomedical literature. BMC Bioinformatics 11:85+. http://dx.doi.org/10.1186/1471-2105-11-85 [CC-BY Open Access]

Thessen, A. E., Cui, H., and Mozzherin, D. 2012. Applications of natural language processing in biodiversity science. Advances in Bioinformatics 2012:1-17. http://dx.doi.org/10.1155/2012/391574 [CC-BY Open Access]

Hughes, J. 2011. TreeRipper web application: towards a fully automated optical tree recognition software. BMC Bioinformatics 12:178+. http://dx.doi.org/10.1186/1471-2105-12-178  [CC-BY Open Access]

Laubach, T., von Haeseler, A., and Lercher, M. 2012. TreeSnatcher plus: capturing phylogenetic trees from images. BMC Bioinformatics 13:110+. http://dx.doi.org/10.1186/1471-2105-13-110 [CC-BY Open Access. Incidentally, I was one of the reviewers for this paper. I signed my review, and made a point of it too. Nor was it a soft review, I might add]

After sending a letter to my local MP, urging him to support the recommendations of the Hargreaves Report on Intellectual Property reform in parliament nearly a month ago (sent on the 17th June 2012) – I finally have a reply!

Sadly, it’s not the reply I wanted. Don Foster does not appear willing to support the Early Day Motion on Intellectual Property law reform to further enable research, that I explicitly asked him to sign.

 

Below is a verbatim copy of his letter as it was emailed to me earlier (12th July 2012), which I am posting here so all his constituents can see, for their own satisfaction (or not, as in my case), his position with respect to IP reform.

 

 

12 July 2012

Mr Ross Mounce

Flat 3, Rochfort Court

Forester Avenue

Bath

BA2 6QY

Our Ref: Moun010/1

Dear Mr Mounce

 

Thank you for writing to me in reference to the exceptions for digital content proposed by the Hargreaves Review. I had in fact received your letter, so I apologise the confusion on Twitter and the delay in replying; life has been rather hectic recently not least because of the impending Olympics and my role as a member of the Olympic Board. Adding this to the very large amounts of correspondence received in my office each day, I’m sorry to say that I cannot always respond as quickly as I’d like.

 

As you may know, I have been involved in these issues for some time. I am currently working within Government on the development of the Communications White paper (which will touch on these issues when it is published in early 2013), I helped initiate – and serve as a member of – the Creative Industries Council (which, among other issues, is reviewing the Hargreaves Report) and am a member of the All Party Parliamentary Group on IP (which is currently reviewing the work of the IPO and its various recent pronouncements).

 

Inevitably I am somewhat constrained in responding in detail to your letter since some of the work I am involved in is not yet public and, more importantly, because final conclusions haven’t been reached.

 

I am also conscious that we have to work within – while seeking, potentially, to change – relevant EU legislation. As you will know, this includes the InfoSoc Directive (Directive 2001/29/EC; see http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=CELEX:32001L0029:EN:HTML.

 

You will know better than I, that the development of “exceptions” is never easy. An exception for “format shifting” may be alright and reasonable for, say, the music industry but the situation is very different for the UK Video Industry. Similarly, an exception for “parody” could make sense for small snippets used in a comedy show, but would not necessarily be appropriate for a situation where one artist does a complete performance of another artist’s song and claims it to be a parody. Some of the proposed exceptions for the copying of educational works seem to worryingly disregard the importance of copyright/IP protection in ensuring the flow of new works. 

Whilst Don Foster MP will treat as confidential any personal information which you pass on, he may need to allow his staff and authorised volunteers to see it if this is needed to help and advise you. Don may pass on all or some of this information to agencies such as DWP, Inland Revenue or Local Council if this is necessary to help with your case. We may write to you from time to time, to keep you informed on issues that you might find of interest. Please let us know if you do not wish him to contact you for this purpose

— page break —

In your letter you argue that the exceptions proposed in relation to “data mining” should be accepted.

 

I am well aware of the strength of feeling among some on this issue. But, as the IPO’s summary of the responses to consultation on the issue makes clear, proposals to enable researchers to use computerised techniques to read information contained in journal articles without infringing publishers’ rights have drawn “strongly divided” views from the industry.

 

Certainly it is true that, as the IPO puts it, “Researchers and research institutions generally supported the proposed exception. They argued that copyright was not established to restrict the use of data, and the added value of these technologies was provided by the actions of researchers, not publishers.”

 

However, there is also a strongly held opposing view; one which suggested that it was “too soon to seek a regulatory solution in a new and fast-developing sector, ” that “a copyright exception would prevent publishers from ensuring security of content and stability of provision,” and that “an unremunerated exception would remove the incentive for publishers to make the considerable investments needed to convert content into the right forms, to develop their own services, and to support the application of services by researchers or third parties.”

 

In the light of these arguments we are currently working to find a way forward. I am not yet in a position to give you assurances that I will press from the exceptions you seek as they are currently formulated (although, I am more inclined to support them than to oppose them).

 

I hope you will understand, and again, apologies for the delay in replying.

With best wishes,

Yours sincerely,

Rt Hon Don Foster MP

Please reply to 31 James Street West, Bath, BA1 2BT

Tel: 01225 338 973 Fax: 01225 463 630

 

I have also forwarded Don’s letter to all the staff in my department, since it sets out their local parliamentary representative’s views on research and so concerns them too:

 

Dear all,

Last month, I sent a formal letter to our local MP Don Foster, urging him
to sign his support for recommendations in the Hargreaves report[1] on
Intellectual Property law reform as it relates to research, particularly
the proposed exceptions to copyright to enable text-mining of otherwise ‘closed
access’ content. (You can read the full letter I sent here[2] it’s fairly short.)
The legal obstructions to text-mining research were also recently described in a Nature news piece[3].

This directly relates to my fellowship research project, as I’m using
content mining techniques to re-extract phylogenetic data from the
literature. At the moment I can only legally mine just ~13% of UKPMC
literature (the XML of which is just 5.5Gb[4] btw, let me know
if you want a copy). This is a great shame, as we have the tools and
capability to make use of ALL of the research literature and more with
current techniques. Only legal barriers prevent this.

I have attached his reply. At best he is sitting on the fence. At worst I
think he has failed to critically evaluate the meekness of the
counter-arguments given such as it being “too soon to seek a regulatory
solution”.

So, is anyone here interested in sending a further letter in support of
enabling text-mining research? This is perhaps our one and only chance to
directly influence government policy on this research issue. I feel a more
senior scientist from the University of Bath would perhaps help convince
Don to lend his support to this issue. The parliamentary motion already has
the support of 27 MPs[5], but will need more.

Please feel free to contact me offlist about this. I will not contact this
list about this issue again.

Many thanks for your time,

Ross

Links:
——
[1] http://www.ipo.gov.uk/ipreview.htm
[2] http://rossmounce.co.uk/2012/06/15/please-dont-restrict-scientific-endeavour/
[3] http://www.nature.com/news/trouble-at-the-text-mine-1.10184
[4] http://ukpmc.ac.uk/ftp/oa
[5] http://www.parliament.uk/edm/2012-13/151

Having invested some time in this already, there seems little point in giving up now. I will wait and see over the weekend if there is anyone else from my department (or the University as a whole) who wants to pursue this further before I try again.

The subscription-access segment of the STM publishing industry that is protesting the recommendations of the Hargreaves report undoubtedly has paid lobbyists dedicated to this issue, and they have clearly done a good job here. I am unpaid and inexperienced with respect to arguing this from the research perspective, and probably outnumbered. But I will not stop trying to further enable research.