Show me the data!

PDF metadata – why so poor?

December 31st, 2012 | Posted by rmounce in Content Mining | Open Data | Panton Fellowship updates

Why is it so difficult to identify academic publisher PDFs?

 

Published MP3 audio files usually come with rather good metadata. Take, for example, an MP3 file I downloaded from Hacker Public Radio, available at the bottom of this post.

[Screenshot: the metadata embedded in the MP3 file]

 

The Full Circle Magazine team added value to the content they published by embedding clear and relevant metadata within the MP3, so that even years later, and after renaming the file, I still know exactly what it contains without needing to open or listen to it. The metadata standard for MP3 files is called ID3. I haven’t shown it, but there’s even a nice little picture embedded as metadata. Rich, valuable metadata is a good thing to have and easy to provide.

 

What then for PDF files?

 

They have embedded metadata too. On *nix machines you can use the CLI tool pdfinfo to show the metadata on your PDF files. I read that the metadata embedded in PDF is called XMP (Adobe’s Extensible Metadata Platform).
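For anyone wanting to collect these fields across many files, here is a minimal sketch in Python. It assumes pdfinfo (from Poppler or Xpdf) is installed and on your PATH; the function names and the sample text are made up for illustration. As far as I can tell, pdfinfo's default output is taken from the PDF's document information dictionary, and the XMP packet, where present, is printed separately with `pdfinfo -meta`.

```python
import subprocess


def parse_pdfinfo(text):
    """Turn pdfinfo's 'Key:   value' lines into a dict of strings."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values containing colons survive.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def pdfinfo(path):
    """Run the pdfinfo CLI on one file and return its fields as a dict."""
    result = subprocess.run(
        ["pdfinfo", path], capture_output=True, text=True, check=True
    )
    return parse_pdfinfo(result.stdout)


# Invented sample in the layout pdfinfo prints (not from a real article).
SAMPLE = """\
Title:          An example article title
Author:         A. N. Author
Creator:        LaTeX with hyperref package
Producer:       Acrobat Distiller 7.0 (Windows)
Pages:          12
Page size:      595 x 842 pts (A4)
File size:      123456 bytes
Optimized:      no
PDF version:    1.4
"""

fields = parse_pdfinfo(SAMPLE)
# Fields the publisher left out are simply absent from pdfinfo's output.
missing = [k for k in ("Title", "Author", "Subject", "Keywords") if k not in fields]
print("PDF version:", fields["PDF version"], "| missing:", missing)
```

Calling `pdfinfo("some-article.pdf")` on a real file returns the same kind of dict, which makes it easy to tabulate which fields each publisher fills in.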

Is there an XMP standard / schema for published academic works in the PDF format? I ask because when I look at the Version of Record (VoR) files from many different publishers, the metadata I find is so variable in quality! No wonder Mendeley has difficulty identifying all the PDFs I feed it.

Below are results from a little preliminary survey of academic publisher PDF metadata I’ve done, with the supporting data uploaded to figshare here if you’re interested…

Sample population:  21 different academic publishers, ~3 Version of Record PDFs per publisher, almost all published in 2011.

  • AAAS (Science), Wiley-Blackwell, BRILL, BMJ, Cambridge Journals Online, CSIRO, Elsevier, Frontiers In, Hindawi, Polish Academy of Sciences, Magnolia Press, National Academy of Sciences (PNAS), NPG, NRC Research Press, Pensoft Publishers, PLOS, Royal Society, SAGE, Senckenberg Natural History Collections Dresden, Springer-Verlag, Taylor & Francis, Sociedade Brasileira de Ictiologia (Neotropical Ichthyology)

You’ll notice I’m mixing large publishers (Elsevier, Wiley, Springer…) with tiny society and institutional publishers. Open Access and Toll Access (TA) publishers are both represented. The results are rather interesting…

I gathered data on 11 metadata fields:

PDF Version, Optimized?, File Size (bytes), Page Size, Producer, Creator, Title, Subject, Author, Pages, Keywords

 

Results

The results are a real ragbag. Out of the 70 PDFs I’ve published (meta)data on over at figshare, only 8 had Keywords metadata embedded in them. So take a bow Arthropod Systematics & Phylogeny, Frontiers in Neuroscience, and Geological Magazine (Cambridge Journals Online) for those.

55% of them were not Optimized. Among those not optimized were Science (AAAS), Insect Systematics & Evolution (BRILL), Psyche (Hindawi), Acta Palaeontologica Polonica, Proceedings of the National Academy of Sciences, Zookeys (Pensoft) and more. Hard to say whether PDF optimization is a good thing or not; it depends on your POV, I suppose. If anyone has any strong preferences either way, please do comment.

PDF version: rare praise for Elsevier here – they appear to be one of only two publishers I’ve sampled (SAGE being the other) that actually publish PDFs according to the latest standard (1.7), which incidentally has been around since 2008! I’m no PDF guru though, so I don’t know if this actually brings any benefits or if there’s much difference between the versions. The average Joe probably wouldn’t notice the difference.

[Pie chart: PDF versions of the sampled PDFs]

Curiously, although two of the sampled Royal Society PDFs published in 2011 were version 1.4, a third was a version 1.2 PDF: inconsistent and odd! As the pie chart shows, version 1.4 was the most common (28 PDFs).

 

Page size is entertainingly variable too: Geological Magazine, Acta Palaeontologica Polonica, Proceedings of the Royal Society B: Biological Sciences, Zootaxa, and Arthropod Systematics & Phylogeny all go for 595 x 842 pts (A4). Science, Journal of Vertebrate Paleontology and Canadian Journal of Earth Sciences use 612 x 792 pts (letter). The rest use an odd variety of sizes. Pensoft’s choice of 467.717 x 680.315 pts looks small in comparison to the rest. I wonder what the rationale behind that choice was?
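Sizes in points are easier to judge once converted to millimetres. A small sketch, assuming the usual PDF convention that one point is 1/72 of an inch (the function and dictionary names are just for illustration):

```python
MM_PER_PT = 25.4 / 72  # one point = 1/72 inch, one inch = 25.4 mm


def pts_to_mm(width_pt, height_pt):
    """Convert a page size in points to millimetres, rounded to 0.1 mm."""
    return (round(width_pt * MM_PER_PT, 1), round(height_pt * MM_PER_PT, 1))


# Page sizes quoted in the survey above.
SIZES = {
    "A4": (595, 842),
    "letter": (612, 792),
    "Pensoft": (467.717, 680.315),
}

for name, (w, h) in SIZES.items():
    w_mm, h_mm = pts_to_mm(w, h)
    print(f"{name}: {w} x {h} pts = {w_mm} x {h_mm} mm")
```

On this arithmetic the Pensoft size works out to 165.0 x 240.0 mm, so the odd-looking decimals are just a round metric page expressed in points.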

Author: Only just over 50% of the sampled PDFs embedded author metadata. Geological Magazine, Invertebrate Systematics, Zootaxa, Arthropod Systematics & Phylogeny and Systematics and Biodiversity can all take credit for supplying full author data for every author on the author list of each PDF.

Others like Nature, Molecular Phylogenetics and Evolution (Elsevier), Frontiers in Neuroscience, and Psyche (Hindawi) only acknowledge the first author of each PDF. The last of these at least has the decency to acknowledge this with an embedded “{et al.}”.

Title: Only Cladistics, Geological Magazine, Invertebrate Systematics, Frontiers in Neuroscience, Zootaxa, Nature, and Arthropod Systematics & Phylogeny properly filled in this field.

Subject: bit of an odd field this one. Some publishers put the title of the journal in this field e.g. Psyche (Hindawi) and Nature (NPG). Whilst most of the others sampled had no metadata for this field, Frontiers in Neuroscience interestingly used this field for the first sentence of the abstract!

Creator: data given here and in the next metadata item (Producer) sheds light on how these PDFs were created.

Values extracted included:

“Arbortext Advanced Print Publisher 9.0.114/W”
“Arbortext Advanced Print Publisher 9.1.510/W Unicode”
“3B2 Total Publishing System 7.51n/W”
“3B2 Total Publishing System 8.07j/W”
“dvips(k) 5.95a Copyright 2005 Radical Eye Software” (Geological Magazine)
“Elsevier” (no prizes for guessing the publisher of this…)
“Adobe InDesign CS5 (7.0.4)”
“Adobe InDesign CS4 (6.0)” (SAGE)
“LaTeX with hyperref package” (Hindawi)
“FrameMaker 8.0” (Zootaxa)
“Adobe PageMaker 7.0” (Neotropical Ichthyology)

Producer

Values extracted include:

“Adobe PDF Library 9.0.1”
“Adobe PDF Library 9.9”
“PDFlib PLOP 2.0.0p6 (SunOS)/Acrobat Distiller 7.0 (Windows)” (Cladistics)
“iText 2.1.7 by 1T3XT”
“iText 2.1.5 (by lowagie.com)”
“Acrobat Distiller 6.0.1 (Windows)” (Nature)
“Acrobat Distiller 7.0 (Windows)” (Invertebrate Systematics)
“Acrobat Distiller 8.1.0 (Windows)” (Elsevier)
“Acrobat Distiller 9.3.3 (Windows)” (Canadian Journal of Earth Sciences)

It’s interesting that Nature in 2011 were using the oldest version of Acrobat Distiller; perhaps they value stability over updates. In both the Producer and Creator metadata, NRC Research Press (as represented by Canadian Journal of Earth Sciences) seems to have had the newest, most bleeding-edge PDF software setup in 2011.

Discussion

 

Clearly, as with MP3s, there’s a need for good, rich metadata to identify the millions of different files out there. Publishers could provide this and, as I’ve shown, some do.

  • If there are agreed-upon standards in STM publishing, what are they?
  • Is there any agreed metadata standard for STM-published PDFs? If not, I think there should be.

I for one would like richer metadata in 2013 so that PDFs can be more easily identified in a machine-readable way – not even Mendeley can cope with all the PDFs I throw at it – my library is a mess.

Given we live in an increasingly mixed world of Open Access and Closed Access publications, with content mining applications on the rise, it seems obvious that these PDFs in particular need a copyright and/or licensing metadata field (as there is in MP3 metadata) to indicate clearly what can and cannot be done with each PDF.
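For what it’s worth, XMP already defines properties that could carry this: dc:rights for a human-readable statement, and xmpRights:Marked and xmpRights:WebStatement in the XMP Rights Management namespace. Below is a rough sketch of what such a fragment might look like, built with Python’s standard library. The function name, the rights statement and the licence URL are example values only, and actually embedding the packet in a PDF would need a PDF library, which isn’t shown here.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
XMP_RIGHTS = "http://ns.adobe.com/xap/1.0/rights/"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Use the conventional prefixes when serialising.
for prefix, uri in (("rdf", RDF), ("dc", DC), ("xmpRights", XMP_RIGHTS)):
    ET.register_namespace(prefix, uri)


def rights_rdf(statement, licence_url, copyrighted=True):
    """Build an RDF fragment holding rights metadata (illustrative only)."""
    rdf = ET.Element(f"{{{RDF}}}RDF")
    desc = ET.SubElement(rdf, f"{{{RDF}}}Description", {f"{{{RDF}}}about": ""})

    # dc:rights is a language alternative: rdf:Alt with an x-default entry.
    rights = ET.SubElement(desc, f"{{{DC}}}rights")
    alt = ET.SubElement(rights, f"{{{RDF}}}Alt")
    li = ET.SubElement(alt, f"{{{RDF}}}li", {XML_LANG: "x-default"})
    li.text = statement

    # xmpRights:Marked says whether the work is rights-managed;
    # xmpRights:WebStatement points at a page describing the terms.
    ET.SubElement(desc, f"{{{XMP_RIGHTS}}}Marked").text = (
        "True" if copyrighted else "False"
    )
    ET.SubElement(desc, f"{{{XMP_RIGHTS}}}WebStatement").text = licence_url
    return ET.tostring(rdf, encoding="unicode")


# Example values, not taken from any real article.
xml_text = rights_rdf(
    "Copyright 2011 The Authors. Distributed under CC BY 3.0.",
    "https://creativecommons.org/licenses/by/3.0/",
)
print(xml_text)
```

A content-mining tool could then read the licence URL straight out of the file instead of guessing from the publisher’s website.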

Please publishers, sort it out!

 

Happy New Year everyone…
