Show me the data!

So a week ago, I investigated publisher-produced Version of Record PDFs with pdfinfo and the results were very disappointing. Lots of missing metadata was found and one could not reliably identify most of these PDFs from metadata alone, let alone extract particular fields of interest.

But Rod Page kindly alerted to me the fact that I might be using the wrong tool for this investigation. So at his suggestion I’ve tried again to extract metadata from the exact same set of PDFs as last time…

Only this time I’ll be using exiftool version 9.10.

This time I’ve put the full raw metadata output from exiftool on figshare for each and every PDF file, just to really prove the point, reproducible research and all. I’d love to post the corresponding PDFs too but sadly many of them are not Open Access and this thus prevents me from uploading them to a public space.   **Insert timely comment here about how closed access publications stifle effective research practices…**

Exiftool is really simple to use. You just need type:
exiftool NameOfPDF.pdf
to get a human-readable exhaustive output of all possible metadata.

exiftool -b -XMP NameOfPDF.pdf
to get XML-structured metadata. I could only extract this from 56 of the 69 PDF files. The data output from this for those 56 PDFs is available as a separate fileset on figshare here.

Finally, if you want to test a whole bunch of PDF files in your working directory I’ve made a simple shell script that loops through all PDFs in your working directory, available here (oops, it’s not data, perhaps I should have put that on github instead?). [I’m sure many readers will be able to create a simple bash loop themselves but just for those that don’t…]


I’m assuming that the reason exiftool -b -XMP failed on 13 of those PDFs is because they have no embedded XMP metadata – an empty (zero-byte sized) file is created for these. This is an assumption though… I notice that those 13 exactly correspond with all the 13 that were produced with iText. I checked the website and I’m pretty sure iText 2.x and up can embed XMP metadata, it’s just whether the publishers have bothered to use & apply this functionality.

So if I’m right, neither Taylor & Francis, BRILL, nor Acta Palaeontology Polonica embed XMP metadata (at all!) in their PDFs. The alternative explanation is that the XMP metadata is in there but exiftool for whatever reason can’t read/parse it from iText produced PDFs. I find this an unlikely alternative explanation though tbh.

Elsevier have superior XMP metadata to everyone else by the looks of it, but Elsevier aside the metadata is still very poor, so my conclusions from last week’s post still stand I think.

Most of the others do contain metadata (of some sort) but by and large it’s rather poor. I need to get some other work done on Monday so I’m afraid this is where I’m going to leave this for now. But I hope I’ve made the point.

Further angles to explore

Interestingly Brian Kelly, has taken this a slightly different direction and looked at the metadata of PDFs in institutional repositories. I hadn’t realised this but apparently some institutional repositories (IRs) universally add cover pages to most deposits. If this is done without care for the embedded metadata, the original metadata can be wiped and/or replaced with newer (less informative) metadata.  Not to mention that cover pages are completely unnecessary -> all the information on a cover page is exactly the kind of stuff that should be put in embedded metadata! No need to waste time and space by putting that info as the first page. JSTOR does this too (cover pages) and it annoys the hell out of me.

After some excellent chat on Twitter about this IR angle I’ve discovered that UKOLN based here on campus at Bath have also done some interesting research in this area, in particular the FixRep project which is described in more detail here. CrossRef labs pdfmark tool also looks like something of interest towards fixing poor quality metadata PDFs. I’ve got this installed/compiled from the source on github but haven’t tried it out yet. It would be interesting to see the difference it makes – a before and after comparison of metadata to see what we’re missing… But why should we fix a problem that shouldn’t exist in the first place? Publishers are the point of origin for this. It’s their job to be the first to publish the Version of Record. They should provide the highest level of metadata possible IMO.


Why would publishers add metadata?

Because their customers – libraries, governments, research funders (in the case of Open Access PDFs ) should demand it. A pipe dream perhaps but that’s my $.02.  I would ask for a refund if I downloaded MP3’s from iTunes/Amazon MP3 with insufficient embedded metadata. Why not the same principle for electronically published PDFs?


PS Apologies for some of the very cryptic filenames in the metadata uploads on figshare. You’ll have to cross-match with this list here or the spreadsheet I uploaded last week to work out which metadata file corresponds to which PDF/Bibliographic Data record/Publisher.

Publisher Identifier Journal Contains embedded XMP metadata? Filename
American Association for the Advancement of Science Ezard2011 Science yes? ezard_11_interplay_759293.pdf
American Association for the Advancement of Science Nagalingum2011 Science yes? nagalingum_11_recent_719133.pdf
American Association for the Advancement of Science Rowe2011 Science yes? Science-2011-Rowe-955-7.pdf
Blackwell Publishing Ltd Burks2011 Cladistics yes? burks_11_combined_694888.pdf
Blackwell Publishing Ltd Janies2011 Cladistics yes? janies_11_supramap_779773.pdf
Blackwell Publishing Ltd Simmons2011 Cladistics yes? simmons_11_deterministic_779537.pdf
BRILL Barbosa2011 Insect Systematics & Evolution no barbosa_11_phylogeny_779910.pdf
BRILL Dellape2011 Insect Systematics & Evolution no dellape_11_phylogenetic_779909.pdf
Cambridge Journals Online Knoll2010 Geological Magazine yes? knoll_10_primitive_475553.pdf
Cambridge Journals Online Saucede2007 Geological Magazine yes? thomas_saucegraved_07_phylogeny_506869.pdf
CSIRO Chamorro2011 Invertebrate Systematics yes? chamorro_11_phylogeny_780467.pdf
CSIRO Daugeron2011 Invertebrate Systematics yes? daugeron_11_phylogenetic_780466.pdf
CSIRO Johnson2011 Invertebrate Systematics yes? johnson_11_collaborative_750540.pdf
Elsevier Lane2011 Molecular Phylogenetics and Evolution yes E3-1-s2.0-S1055790311001448-main.pdf
Elsevier Cunha2011 Molecular Phylogenetics and Evolution yes E2-1-s2.0-S1055790311001680-main.pdf
Elsevier Spribille2011 Molecular Phylogenetics and Evolution yes E1-1-s2.0-S1055790311001606-main.pdf
Frontiers In Horn2011 Frontiers in Neuroscience yes? fnins-05-00088.pdf
Frontiers In Ogura2011 Frontiers in Neuroscience yes? fnins-05-00091.pdf
Frontiers In Tsagareli2011 Frontiers in Neuroscience yes? fnins-05-00092.pdf
Hindawi Diniz2012 Psyche: A Journal of Entomology yes? 79139500.pdf
Hindawi Restrepo2012 Psyche: A Journal of Entomology yes? 516419.pdf
Hindawi Savopoulou2012 Psyche: A Journal of Entomology yes? 167420.pdf
Institute of Paleobiology, Polish Academy of Sciences Amson2011 Acta Palaeontologica Polonica no amson_11_affinities_666987.pdf
Institute of Paleobiology, Polish Academy of Sciences Edgecombe2011 Acta Palaeontologica Polonica no edgecombe_11_new_666988.pdf
Institute of Paleobiology, Polish Academy of Sciences Williamson2011 Acta Palaeontologica Polonica no app2E20092E0147.pdf
Magnolia Press Agiuar2011 Zootaxa yes? zt02846p098.pdf
Magnolia Press Ebach2011 Zootaxa yes? ebach_11_taxonomy_599972.pdf
Magnolia Press Nelson2011 Zootaxa yes? nelson_11_resemblance_688762.pdf
National Academy of Sciences Casanovas2011 Proceedings of the National Academy of Sciences yes? casanovas-vilar_11_updated_644658.pdf
National Academy of Sciences Goswami2011 Proceedings of the National Academy of Sciences yes? goswami_11_radiation_814757.pdf
National Academy of Sciences Thorne2011 Proceedings of the National Academy of Sciences yes? thorne_11_resetting_654055.pdf
Nature Publishing Group Meng2011 Nature yes? meng_11_transitional_644647.pdf
Nature Publishing Group Rougier2011 Nature yes? rougier_11_highly_720202.pdf
Nature Publishing Group Venditti2011 Nature yes? venditti_11_multiple_779840.pdf
NRC Research Press CruzadoCaballero2010 Canadian Journal of Earth Sciences yes? 650000.pdf
NRC Research Press Druckenmiller2010 Canadian Journal of Earth Sciences yes? 80000000c5.pdf
NRC Research Press Mazierski2010 Canadian Journal of Earth Sciences yes? mazierski_10_description_577223.pdf
NRC Research Press Modesto2009 Canadian Journal of Earth Sciences yes? modesto_09_new_577201.pdf
NRC Research Press Parsons2009 Canadian Journal of Earth Sciences yes? parsons_09_new_575744.pdf
NRC Research Press Wu2007 Canadian Journal of Earth Sciences yes? wu_07_new_622125.pdf
Pensoft Publishers Hagedorn2011 ZooKeys yes? hagedorn_11_creative_779747.pdf
Pensoft Publishers Penev2011 ZooKeys yes? penev_11_interlinking_694886.pdf
Pensoft Publishers Thessen2011 ZooKeys yes? thessen_11_data_779746.pdf
Public Library of Science Hess2011 PLoS ONE yes? hess_11_addressing_694222.pdf
Public Library of Science McDonald2011 PLoS ONE yes? mcdonald_11_subadult_694229.pdf
Public Library of Science Wicherts2011 PLoS ONE yes? wicherts_11_willingness_779788.pdf
SAGE Publications deKloet2011 Journal of Veterinary Diagnostic Investigation yes? Invest-2011-deKloet-421-9.pdf
SAGE Publications Richter2011 Journal of Veterinary Diagnostic Investigation yes? Invest-2011-Richter-430-5.pdf
SAGE Publications Wassmuth2011 Journal of Veterinary Diagnostic Investigation yes? Invest-2011-Wassmuth-436-53.pdf
Senckenberg Natural History Collections Dresden Fresneda2011 Arthropod Systematics & Phylogeny yes? fresneda_11_phylogenetic_785869.pdf
Senckenberg Natural History Collections Dresden Mally2011 Arthropod Systematics & Phylogeny yes? ASP_69_1_Mally_55-71.pdf
Senckenberg Natural History Collections Dresden Shimizu2011 Arthropod Systematics & Phylogeny yes? ASP_69_2_Shimizu_75-81.pdf
Springer-Verlag Beermann2011 Zoomorphology yes? 10.1007_s00435-011-0129-9.pdf
Springer-Verlag Cuezzo2011 Zoomorphology yes? cuezzo_11_ultrastructure_694669.pdf
Springer-Verlag Vinn2011 Zoomorphology yes? 10.1007_s00435-011-0133-0.pdf
Taylor & Francis Bianucci2011 Journal of Vertebrate Paleontology no bianucci_11_aegyptocetus_778747.pdf
Taylor & Francis Makovicky2011 Journal of Vertebrate Paleontology no makovicky_11_new_694826.pdf
Taylor & Francis Pietri2011 Journal of Vertebrate Paleontology no pietri_11_revision_689491.pdf
Taylor & Francis Rook2011 Journal of Vertebrate Paleontology no rook_11_phylogeny_694916.pdf
Taylor & Francis Tsuihiji2011 Journal of Vertebrate Paleontology no tsuihiji_11_cranial_660620.pdf
Taylor & Francis Yates2011 Journal of Vertebrate Paleontology no yates_11_new_694821.pdf
Taylor & Francis Gerth2011 Systematics and Biodiversity no gerth_11_wolbachia_779749.pdf
Taylor & Francis Krebes2011 Systematics and Biodiversity no krebes_11_phylogeography_779700.pdf
Sociedade Brasileira de Ictiologia Britski2011 Neotropical Ichthyology yes? a02v9n2.pdf
Sociedade Brasileira de Ictiologia Sarmento2011 Neotropical Ichthyology yes? a03v9n2.pdf
Sociedade Brasileira de Ictiologia Calegari2011 Neotropical Ichthyology yes? a04v9n2.pdf
Royal Society Billet2011 Proceedings of the Royal Society B: Biological Sciences yes? billet_11_oldest_687630.pdf
Royal Society Polly2011 Proceedings of the Royal Society B: Biological Sciences yes? polly_11_history_625430.pdf
Royal Society Sansom2011 Proceedings of the Royal Society B: Biological Sciences yes? sansom_11_decay_625429.pdf

Since Sunday afternoon I’ve been at an International Council for Science (ICSU) / Royal Society invited workshop on ‘Revaluing Science in the Digital Age’.

We’ve had a fascinating set of talks from academics, publishers (PLoS, Nature, BMC), librarians, policymakers, data managers, scientific societies…

Attendees included:
Jose Cotta, European Commision

Mark Thorley (RCUK)
Chris Banks  (University Librarian and Director, Aberdeen)
Mark Hahnel (Figshare)
Max Wilkinson (UCL, Head of Research Data Service)
Dave Roberts (ViBRANT)
Rob Frost (GSK)
Catriona MacCallum (PLoS)
Mark Forster (Syngenta)
Iain Hrynaszkiewicz (BMC)
Ruth Wilson (Nature Publishing Group)
Kaitlin Thaney (Digital Science)
Stuart Taylor (Royal Society)
Robert Simpson (Zooniverse)
Paul Groth (OpenPHACTS)
and more…


I gave a talk on content mining and the importance of full BOAI-compliant Open Access with respect to this, on behalf of the Open Knowledge Foundation:

There was lots of discussion on reproducibility, provenance of data, peer review, incentives, research misconduct and ethics.

I’ve met many new people and have learnt many new things. For example, on the subject of reproducibility I talked about Roger Peng and the journal Biostatistics in discussion, and then was soon informed that there was an analogous journal in Chemistry called Organic Syntheses whereby:

In order for a procedure to be accepted for publication, each reaction must be successfully repeated in the laboratory of a member of the Editorial Board at least twice, with similar yields (generally ±5%) and selectivity similar to that reported by the submitters.

Fantastic! We were also informed that this rigorous protocol ensures that research published in this journal is very highly regarded. I’ve suggested similar such reproducibility checks for phylogenetics research before (at the Systematics Association Biennial meeting Belfast, 2011) but this was viewed as too futuristic / infeasible…

Right now we’re working on a draft statement of outcome from this workshop that ICSU can pass to its members to possibly officially agree to endorse.

So I better finish here, and get back to the discussion.
I’m rather hoping they will endorse the Panton Principles rather than reinvent the wheel (policy-wise).

Exciting times!


PS I have made a Storify of the tweets from the workshop here .

Panton Fellowship updates: July (month 4)

August 4th, 2012 | Posted by rmounce in Content Mining | Panton Fellowship updates - (Comments Off on Panton Fellowship updates: July (month 4))

It’s the Olympics now so this work update is a) late and b) short


As ever progress has been exciting – look what we can extract from some PDFs:

(click to enlarge each) Attribution: The left panel is from Cánovas et al. BMC Evolutionary Biology 2011 11:371 doi:10.1186/1471-2148-11-371

On the left is the original figure, and on the right we have an SVG representation of the data we can extract automatically from this figure. We have the topology, the taxon labels AND the support values 100% correctly interpreted! Obviously we can’t reclaim phylogenetic data with this much precision and recall from all papers. But it’s a promising example, automatically generated – no manual guidance or tweaking needed – just feed it the PDF. [My WordPress server won’t let me upload the original SVG copy of this for “security reasons” so the image on the right is a .jpg copy of the original .svg]


I should also note this was achieved completely independently of previous image-based tree-extraction softwares like TreeSnatcher Plus, TreeRipper & TreeThief. This is a great example of why it’s very important for editors and publishers to strictly stipulate that diagrams in figures containing data such as this be uploaded and produced in the final PDF version as lossless vector graphics rather than lossy bitmaps such as .png .jpg or .bmp – only vectors keep the fidelity of the underlying data. We note that there are many publishers out there who regularly seem to produce figures in their PDFs that are NOT on the whole very good quality wrt this. Difficult to know whether the authors or the publishers are to blame in each case but either way standards need to be improved.


By mining PDFs we can re-extract and re-release far more than just phylogenetic data from the literature – we’re fairly sure we can reliably identify the rough type of figure depicted in PDFs by machine methods using certain diagnostic features such as number & proportion of horizontal and vertical lines.



Peter Murray-Rust & I now are looking for a collaborator to help us implement machine learning methods to classify scientific figures into discrete categories e.g. bar charts, scatter plots, network diagrams (including phylogenies), pie charts, box & whisker plots etc… in an automated way.

If you’re interested please contact myself or Peter.

That’s all for now.

PS If you’re watching the London 2012 Olympics Volleyball tomorrow morning you may well just see me in the crowd. Managed to snaffle some returned tickets by setting up an alert for new tickets using a combination of (to alert me to page changes on the ticket website) and to email me as soon as the RSS feed gets a new item (updated ticket information). Without this nifty trick I very much doubt I’d have got any tickets.