PDF metadata: different tool, same story
January 6th, 2013 | Posted in Content Mining | Open Data | Panton Fellowship updates
So a week ago, I investigated publisher-produced Version of Record PDFs with pdfinfo and the results were very disappointing. A lot of metadata was missing, and one could not reliably identify most of these PDFs from metadata alone, let alone extract particular fields of interest.
But Rod Page kindly alerted me to the fact that I might be using the wrong tool for this investigation. So at his suggestion I’ve tried again to extract metadata from the exact same set of PDFs as last time…
Only this time I’ll be using exiftool version 9.10.
This time I’ve put the full raw metadata output from exiftool on figshare for each and every PDF file, just to really prove the point, reproducible research and all. I’d love to post the corresponding PDFs too but sadly many of them are not Open Access and this thus prevents me from uploading them to a public space. **Insert timely comment here about how closed access publications stifle effective research practices…**
Exiftool is really simple to use. You just need to type:
exiftool NameOfPDF.pdf
to get a human-readable exhaustive output of all possible metadata.
and
exiftool -b -XMP NameOfPDF.pdf
to get XML-structured metadata. I could only extract this from 56 of the 69 PDF files. The data output from this for those 56 PDFs is available as a separate fileset on figshare here.
Finally, if you want to test a whole bunch of PDF files in your working directory I’ve made a simple shell script that loops through all PDFs in your working directory, available here (oops, it’s not data, perhaps I should have put that on github instead?). [I’m sure many readers will be able to create a simple bash loop themselves but just for those that don’t…]
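For readers who would rather not download the script, a loop along the following lines should do the same job. This is a minimal sketch, not the script posted on figshare: the function name dump_pdf_metadata and the .exif.txt and .xmp output suffixes are assumptions made for illustration. Only the two exiftool invocations are taken from the post.

```shell
#!/usr/bin/env bash
# Sketch only: NOT the author's figshare script.
# dump_pdf_metadata DIR writes, for every PDF in DIR (default: current dir):
#   <name>.exif.txt  human-readable output of `exiftool file.pdf`
#   <name>.xmp       raw XMP packet from `exiftool -b -XMP file.pdf`
# The function name and output suffixes are hypothetical.
dump_pdf_metadata() {
    local dir="${1:-.}" pdf base
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue        # glob matched nothing: skip
        base="${pdf%.pdf}"               # strip the .pdf extension
        exiftool "$pdf" > "$base.exif.txt"
        exiftool -b -XMP "$pdf" > "$base.xmp"
    done
}
```

Running dump_pdf_metadata . in a directory of PDFs leaves two files next to each PDF. A PDF without an XMP packet should end up with an empty .xmp file.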
I’m assuming that the reason exiftool -b -XMP failed on 13 of those PDFs is that they have no embedded XMP metadata – an empty (zero-byte) file is created for these. This is an assumption though… I notice that those 13 correspond exactly to the 13 that were produced with iText. I checked the website and I’m pretty sure iText 2.x and up can embed XMP metadata; it’s just a question of whether the publishers have bothered to use this functionality.
So if I’m right, neither Taylor & Francis, BRILL, nor Acta Palaeontologica Polonica embed XMP metadata (at all!) in their PDFs. The alternative explanation is that the XMP metadata is in there but exiftool, for whatever reason, can’t read or parse it from iText-produced PDFs. I find this alternative unlikely though, tbh.
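The correlation between empty XMP output and iText can be checked mechanically. The sketch below lists each PDF whose XMP extraction comes back empty, together with its Producer field. It assumes the PDF's info dictionary has a Producer entry that exiftool reports under the tag name Producer. The function name report_missing_xmp is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print "<filename><TAB><producer>" for every PDF in DIR whose
# `exiftool -b -XMP` output is empty.
# Assumption: the Producer entry of the PDF info dictionary is readable
# via `exiftool -s3 -Producer` (-s3 prints the value only, no tag name).
report_missing_xmp() {
    local dir="${1:-.}" pdf xmp producer
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue                 # glob matched nothing
        xmp="$(exiftool -b -XMP "$pdf")"
        if [ -z "$xmp" ]; then                    # no XMP packet extracted
            producer="$(exiftool -s3 -Producer "$pdf")"
            printf '%s\t%s\n' "${pdf##*/}" "${producer:-unknown}"
        fi
    done
}
```

If every line of the report names iText as the producer, that supports the assumption above. It still does not rule out the alternative that exiftool cannot read XMP from those files.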
Elsevier have superior XMP metadata to everyone else by the looks of it, but Elsevier aside the metadata is still very poor, so my conclusions from last week’s post still stand I think.
Most of the others do contain metadata (of some sort) but by and large it’s rather poor. I need to get some other work done on Monday so I’m afraid this is where I’m going to leave this for now. But I hope I’ve made the point.
Further angles to explore
Interestingly, Brian Kelly has taken this in a slightly different direction and looked at the metadata of PDFs in institutional repositories. I hadn’t realised this, but apparently some institutional repositories (IRs) add cover pages to most deposits as a matter of course. If this is done without care for the embedded metadata, the original metadata can be wiped and/or replaced with newer (less informative) metadata. Not to mention that cover pages are completely unnecessary: all the information on a cover page is exactly the kind of stuff that should be put in embedded metadata! No need to waste time and space by putting that info on the first page. JSTOR does this too (cover pages) and it annoys the hell out of me.
After some excellent chat on Twitter about this IR angle I’ve discovered that UKOLN, based here on campus at Bath, have also done some interesting research in this area, in particular the FixRep project, which is described in more detail here. CrossRef Labs’ pdfmark tool also looks like something of interest for fixing PDFs with poor-quality metadata. I’ve compiled and installed it from the source on github but haven’t tried it out yet. It would be interesting to see the difference it makes – a before and after comparison of metadata to see what we’re missing… But why should we fix a problem that shouldn’t exist in the first place? Publishers are the point of origin for this. It’s their job to be the first to publish the Version of Record. They should provide the highest level of metadata possible IMO.
Why would publishers add metadata?
Because their customers – libraries, governments, research funders (in the case of Open Access PDFs) – should demand it. A pipe dream perhaps, but that’s my $.02. I would ask for a refund if I downloaded MP3s from iTunes/Amazon MP3 with insufficient embedded metadata. Why not the same principle for electronically published PDFs?
PS Apologies for some of the very cryptic filenames in the metadata uploads on figshare. You’ll have to cross-match with this list here or the spreadsheet I uploaded last week to work out which metadata file corresponds to which PDF/Bibliographic Data record/Publisher.
Publisher | Identifier | Journal | Contains embedded XMP metadata? | Filename |
American Association for the Advancement of Science | Ezard2011 | Science | yes? | ezard_11_interplay_759293.pdf |
American Association for the Advancement of Science | Nagalingum2011 | Science | yes? | nagalingum_11_recent_719133.pdf |
American Association for the Advancement of Science | Rowe2011 | Science | yes? | Science-2011-Rowe-955-7.pdf |
Blackwell Publishing Ltd | Burks2011 | Cladistics | yes? | burks_11_combined_694888.pdf |
Blackwell Publishing Ltd | Janies2011 | Cladistics | yes? | janies_11_supramap_779773.pdf |
Blackwell Publishing Ltd | Simmons2011 | Cladistics | yes? | simmons_11_deterministic_779537.pdf |
BRILL | Barbosa2011 | Insect Systematics & Evolution | no | barbosa_11_phylogeny_779910.pdf |
BRILL | Dellape2011 | Insect Systematics & Evolution | no | dellape_11_phylogenetic_779909.pdf |
Cambridge Journals Online | Knoll2010 | Geological Magazine | yes? | knoll_10_primitive_475553.pdf |
Cambridge Journals Online | Saucede2007 | Geological Magazine | yes? | thomas_saucegraved_07_phylogeny_506869.pdf |
CSIRO | Chamorro2011 | Invertebrate Systematics | yes? | chamorro_11_phylogeny_780467.pdf |
CSIRO | Daugeron2011 | Invertebrate Systematics | yes? | daugeron_11_phylogenetic_780466.pdf |
CSIRO | Johnson2011 | Invertebrate Systematics | yes? | johnson_11_collaborative_750540.pdf |
Elsevier | Lane2011 | Molecular Phylogenetics and Evolution | yes | E3-1-s2.0-S1055790311001448-main.pdf |
Elsevier | Cunha2011 | Molecular Phylogenetics and Evolution | yes | E2-1-s2.0-S1055790311001680-main.pdf |
Elsevier | Spribille2011 | Molecular Phylogenetics and Evolution | yes | E1-1-s2.0-S1055790311001606-main.pdf |
Frontiers In | Horn2011 | Frontiers in Neuroscience | yes? | fnins-05-00088.pdf |
Frontiers In | Ogura2011 | Frontiers in Neuroscience | yes? | fnins-05-00091.pdf |
Frontiers In | Tsagareli2011 | Frontiers in Neuroscience | yes? | fnins-05-00092.pdf |
Hindawi | Diniz2012 | Psyche: A Journal of Entomology | yes? | 79139500.pdf |
Hindawi | Restrepo2012 | Psyche: A Journal of Entomology | yes? | 516419.pdf |
Hindawi | Savopoulou2012 | Psyche: A Journal of Entomology | yes? | 167420.pdf |
Institute of Paleobiology, Polish Academy of Sciences | Amson2011 | Acta Palaeontologica Polonica | no | amson_11_affinities_666987.pdf |
Institute of Paleobiology, Polish Academy of Sciences | Edgecombe2011 | Acta Palaeontologica Polonica | no | edgecombe_11_new_666988.pdf |
Institute of Paleobiology, Polish Academy of Sciences | Williamson2011 | Acta Palaeontologica Polonica | no | app2E20092E0147.pdf |
Magnolia Press | Agiuar2011 | Zootaxa | yes? | zt02846p098.pdf |
Magnolia Press | Ebach2011 | Zootaxa | yes? | ebach_11_taxonomy_599972.pdf |
Magnolia Press | Nelson2011 | Zootaxa | yes? | nelson_11_resemblance_688762.pdf |
National Academy of Sciences | Casanovas2011 | Proceedings of the National Academy of Sciences | yes? | casanovas-vilar_11_updated_644658.pdf |
National Academy of Sciences | Goswami2011 | Proceedings of the National Academy of Sciences | yes? | goswami_11_radiation_814757.pdf |
National Academy of Sciences | Thorne2011 | Proceedings of the National Academy of Sciences | yes? | thorne_11_resetting_654055.pdf |
Nature Publishing Group | Meng2011 | Nature | yes? | meng_11_transitional_644647.pdf |
Nature Publishing Group | Rougier2011 | Nature | yes? | rougier_11_highly_720202.pdf |
Nature Publishing Group | Venditti2011 | Nature | yes? | venditti_11_multiple_779840.pdf |
NRC Research Press | CruzadoCaballero2010 | Canadian Journal of Earth Sciences | yes? | 650000.pdf |
NRC Research Press | Druckenmiller2010 | Canadian Journal of Earth Sciences | yes? | 80000000c5.pdf |
NRC Research Press | Mazierski2010 | Canadian Journal of Earth Sciences | yes? | mazierski_10_description_577223.pdf |
NRC Research Press | Modesto2009 | Canadian Journal of Earth Sciences | yes? | modesto_09_new_577201.pdf |
NRC Research Press | Parsons2009 | Canadian Journal of Earth Sciences | yes? | parsons_09_new_575744.pdf |
NRC Research Press | Wu2007 | Canadian Journal of Earth Sciences | yes? | wu_07_new_622125.pdf |
Pensoft Publishers | Hagedorn2011 | ZooKeys | yes? | hagedorn_11_creative_779747.pdf |
Pensoft Publishers | Penev2011 | ZooKeys | yes? | penev_11_interlinking_694886.pdf |
Pensoft Publishers | Thessen2011 | ZooKeys | yes? | thessen_11_data_779746.pdf |
Public Library of Science | Hess2011 | PLoS ONE | yes? | hess_11_addressing_694222.pdf |
Public Library of Science | McDonald2011 | PLoS ONE | yes? | mcdonald_11_subadult_694229.pdf |
Public Library of Science | Wicherts2011 | PLoS ONE | yes? | wicherts_11_willingness_779788.pdf |
SAGE Publications | deKloet2011 | Journal of Veterinary Diagnostic Investigation | yes? | Invest-2011-deKloet-421-9.pdf |
SAGE Publications | Richter2011 | Journal of Veterinary Diagnostic Investigation | yes? | Invest-2011-Richter-430-5.pdf |
SAGE Publications | Wassmuth2011 | Journal of Veterinary Diagnostic Investigation | yes? | Invest-2011-Wassmuth-436-53.pdf |
Senckenberg Natural History Collections Dresden | Fresneda2011 | Arthropod Systematics & Phylogeny | yes? | fresneda_11_phylogenetic_785869.pdf |
Senckenberg Natural History Collections Dresden | Mally2011 | Arthropod Systematics & Phylogeny | yes? | ASP_69_1_Mally_55-71.pdf |
Senckenberg Natural History Collections Dresden | Shimizu2011 | Arthropod Systematics & Phylogeny | yes? | ASP_69_2_Shimizu_75-81.pdf |
Springer-Verlag | Beermann2011 | Zoomorphology | yes? | 10.1007_s00435-011-0129-9.pdf |
Springer-Verlag | Cuezzo2011 | Zoomorphology | yes? | cuezzo_11_ultrastructure_694669.pdf |
Springer-Verlag | Vinn2011 | Zoomorphology | yes? | 10.1007_s00435-011-0133-0.pdf |
Taylor & Francis | Bianucci2011 | Journal of Vertebrate Paleontology | no | bianucci_11_aegyptocetus_778747.pdf |
Taylor & Francis | Makovicky2011 | Journal of Vertebrate Paleontology | no | makovicky_11_new_694826.pdf |
Taylor & Francis | Pietri2011 | Journal of Vertebrate Paleontology | no | pietri_11_revision_689491.pdf |
Taylor & Francis | Rook2011 | Journal of Vertebrate Paleontology | no | rook_11_phylogeny_694916.pdf |
Taylor & Francis | Tsuihiji2011 | Journal of Vertebrate Paleontology | no | tsuihiji_11_cranial_660620.pdf |
Taylor & Francis | Yates2011 | Journal of Vertebrate Paleontology | no | yates_11_new_694821.pdf |
Taylor & Francis | Gerth2011 | Systematics and Biodiversity | no | gerth_11_wolbachia_779749.pdf |
Taylor & Francis | Krebes2011 | Systematics and Biodiversity | no | krebes_11_phylogeography_779700.pdf |
Sociedade Brasileira de Ictiologia | Britski2011 | Neotropical Ichthyology | yes? | a02v9n2.pdf |
Sociedade Brasileira de Ictiologia | Sarmento2011 | Neotropical Ichthyology | yes? | a03v9n2.pdf |
Sociedade Brasileira de Ictiologia | Calegari2011 | Neotropical Ichthyology | yes? | a04v9n2.pdf |
Royal Society | Billet2011 | Proceedings of the Royal Society B: Biological Sciences | yes? | billet_11_oldest_687630.pdf |
Royal Society | Polly2011 | Proceedings of the Royal Society B: Biological Sciences | yes? | polly_11_history_625430.pdf |
Royal Society | Sansom2011 | Proceedings of the Royal Society B: Biological Sciences | yes? | sansom_11_decay_625429.pdf |
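To cross-match a cryptic filename against the list above without searching by eye, a small awk lookup is enough. This is a sketch that assumes the list has been saved to a text file in exactly the pipe-delimited layout shown here, with the filename in the fifth column. The function name lookup_pdf and the file name used in the usage note are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: lookup_pdf TABLE FILENAME
# Prints "<publisher><TAB><journal><TAB><XMP status>" for the row of the
# pipe-delimited TABLE whose fifth column equals FILENAME.
# Assumes the column order used in the list above.
lookup_pdf() {
    awk -F'|' -v name="$2" '
        {
            # trim leading/trailing whitespace from every field
            for (i = 1; i <= NF; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
        }
        $5 == name { printf "%s\t%s\t%s\n", $1, $3, $4 }
    ' "$1"
}
```

For example, with the list saved as pdf_list.txt, lookup_pdf pdf_list.txt app2E20092E0147.pdf prints the publisher, journal and XMP status for that file.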