Show me the data!

PDF metadata: different tool, same story

January 6th, 2013 | Posted by rmounce in Content Mining | Open Data | Panton Fellowship updates

So a week ago, I investigated publisher-produced Version of Record PDFs with pdfinfo and the results were very disappointing. Lots of missing metadata was found and one could not reliably identify most of these PDFs from metadata alone, let alone extract particular fields of interest.

But Rod Page kindly alerted me to the fact that I might be using the wrong tool for this investigation. So at his suggestion I’ve tried again to extract metadata from the exact same set of PDFs as last time…

Only this time I’ll be using exiftool version 9.10.

This time I’ve put the full raw metadata output from exiftool on figshare for each and every PDF file, just to really prove the point, reproducible research and all. I’d love to post the corresponding PDFs too but sadly many of them are not Open Access and this thus prevents me from uploading them to a public space.   **Insert timely comment here about how closed access publications stifle effective research practices…**

Exiftool is really simple to use. You just need to type:
exiftool NameOfPDF.pdf
to get a human-readable exhaustive output of all possible metadata.

exiftool -b -XMP NameOfPDF.pdf
to get XML-structured metadata. I could only extract this from 56 of the 69 PDF files. The data output from this for those 56 PDFs is available as a separate fileset on figshare here.

Finally, if you want to test a whole bunch of PDF files in your working directory I’ve made a simple shell script that loops through all PDFs in your working directory, available here (oops, it’s not data, perhaps I should have put that on github instead?). [I’m sure many readers will be able to create a simple bash loop themselves but just for those that don’t…]
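For readers who want to roll their own, a loop along these lines should do the job. This is a minimal sketch and not the script linked above: the function name dump_xmp, the EXIFTOOL variable and the .xmp output suffix are assumptions made purely for illustration.

```shell
#!/bin/sh
# Sketch only: dump the raw XMP packet of every PDF in a directory.
# EXIFTOOL, dump_xmp and the .xmp suffix are illustrative assumptions.
EXIFTOOL="${EXIFTOOL:-exiftool}"

dump_xmp() {
    dir="${1:-.}"
    for pdf in "$dir"/*.pdf; do
        # Skip the unexpanded pattern when the directory holds no PDFs.
        [ -e "$pdf" ] || continue
        # -b -XMP prints the embedded XMP packet as-is; a PDF without
        # any XMP leaves a zero-byte output file behind.
        "$EXIFTOOL" -b -XMP "$pdf" > "${pdf%.pdf}.xmp"
    done
}
```

Calling dump_xmp with no argument works on the current directory; dump_xmp /path/to/pdfs works on another one. Each PDF gets a sibling .xmp file, so the outputs are easy to cross-match by name.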


I’m assuming that the reason exiftool -b -XMP failed on 13 of those PDFs is because they have no embedded XMP metadata – an empty (zero-byte sized) file is created for these. This is an assumption though… I notice that those 13 exactly correspond with all the 13 that were produced with iText. I checked the website and I’m pretty sure iText 2.x and up can embed XMP metadata, it’s just whether the publishers have bothered to use & apply this functionality.
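The empty-file observation can be checked mechanically rather than by eye. A minimal sketch, assuming the XMP dumps were saved alongside the PDFs with a .xmp suffix (that naming and the function name list_empty_xmp are assumptions of this example, not something the post specifies):

```shell
#!/bin/sh
# Sketch only: list XMP dumps that came out empty (no XMP was extracted).
# The .xmp suffix and the function name are illustrative assumptions.
list_empty_xmp() {
    dir="${1:-.}"
    for f in "$dir"/*.xmp; do
        # Skip the unexpanded pattern when there are no dumps at all.
        [ -e "$f" ] || continue
        # -s is true only for files larger than zero bytes.
        [ -s "$f" ] || printf '%s\n' "$f"
    done
}
```

Running exiftool -Producer on the PDFs that correspond to the listed files then shows which software generated them, which is how a pattern like the iText one would show up.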

So if I’m right, neither Taylor & Francis, BRILL, nor Acta Palaeontologica Polonica embed XMP metadata (at all!) in their PDFs. The alternative explanation is that the XMP metadata is in there but exiftool, for whatever reason, can’t read/parse it from iText-produced PDFs. I find this an unlikely alternative explanation though tbh.

Elsevier have superior XMP metadata to everyone else by the looks of it, but Elsevier aside the metadata is still very poor, so my conclusions from last week’s post still stand I think.

Most of the others do contain metadata (of some sort) but by and large it’s rather poor. I need to get some other work done on Monday so I’m afraid this is where I’m going to leave this for now. But I hope I’ve made the point.

Further angles to explore

Interestingly, Brian Kelly has taken this in a slightly different direction and looked at the metadata of PDFs in institutional repositories. I hadn’t realised this but apparently some institutional repositories (IRs) routinely add cover pages to most deposits. If this is done without care for the embedded metadata, the original metadata can be wiped and/or replaced with newer (less informative) metadata. Not to mention that cover pages are completely unnecessary: all the information on a cover page is exactly the kind of stuff that should be put in embedded metadata! No need to waste time and space by putting that info on the first page. JSTOR does this too (cover pages) and it annoys the hell out of me.

After some excellent chat on Twitter about this IR angle I’ve discovered that UKOLN, based here on campus at Bath, have also done some interesting research in this area, in particular the FixRep project which is described in more detail here. CrossRef Labs’ pdfmark tool also looks like something of interest for fixing PDFs with poor-quality metadata. I’ve got this installed/compiled from the source on github but haven’t tried it out yet. It would be interesting to see the difference it makes – a before and after comparison of metadata to see what we’re missing… But why should we fix a problem that shouldn’t exist in the first place? Publishers are the point of origin for this. It’s their job to be the first to publish the Version of Record. They should provide the highest level of metadata possible IMO.


Why would publishers add metadata?

Because their customers – libraries, governments, research funders (in the case of Open Access PDFs) should demand it. A pipe dream perhaps but that’s my $.02. I would ask for a refund if I downloaded MP3’s from iTunes/Amazon MP3 with insufficient embedded metadata. Why not the same principle for electronically published PDFs?


PS Apologies for some of the very cryptic filenames in the metadata uploads on figshare. You’ll have to cross-match with this list here or the spreadsheet I uploaded last week to work out which metadata file corresponds to which PDF/Bibliographic Data record/Publisher.

| Publisher | Identifier | Journal | Contains embedded XMP metadata? | Filename |
|---|---|---|---|---|
| American Association for the Advancement of Science | Ezard2011 | Science | yes? | ezard_11_interplay_759293.pdf |
| American Association for the Advancement of Science | Nagalingum2011 | Science | yes? | nagalingum_11_recent_719133.pdf |
| American Association for the Advancement of Science | Rowe2011 | Science | yes? | Science-2011-Rowe-955-7.pdf |
| Blackwell Publishing Ltd | Burks2011 | Cladistics | yes? | burks_11_combined_694888.pdf |
| Blackwell Publishing Ltd | Janies2011 | Cladistics | yes? | janies_11_supramap_779773.pdf |
| Blackwell Publishing Ltd | Simmons2011 | Cladistics | yes? | simmons_11_deterministic_779537.pdf |
| BRILL | Barbosa2011 | Insect Systematics & Evolution | no | barbosa_11_phylogeny_779910.pdf |
| BRILL | Dellape2011 | Insect Systematics & Evolution | no | dellape_11_phylogenetic_779909.pdf |
| Cambridge Journals Online | Knoll2010 | Geological Magazine | yes? | knoll_10_primitive_475553.pdf |
| Cambridge Journals Online | Saucede2007 | Geological Magazine | yes? | thomas_saucegraved_07_phylogeny_506869.pdf |
| CSIRO | Chamorro2011 | Invertebrate Systematics | yes? | chamorro_11_phylogeny_780467.pdf |
| CSIRO | Daugeron2011 | Invertebrate Systematics | yes? | daugeron_11_phylogenetic_780466.pdf |
| CSIRO | Johnson2011 | Invertebrate Systematics | yes? | johnson_11_collaborative_750540.pdf |
| Elsevier | Lane2011 | Molecular Phylogenetics and Evolution | yes | E3-1-s2.0-S1055790311001448-main.pdf |
| Elsevier | Cunha2011 | Molecular Phylogenetics and Evolution | yes | E2-1-s2.0-S1055790311001680-main.pdf |
| Elsevier | Spribille2011 | Molecular Phylogenetics and Evolution | yes | E1-1-s2.0-S1055790311001606-main.pdf |
| Frontiers In | Horn2011 | Frontiers in Neuroscience | yes? | fnins-05-00088.pdf |
| Frontiers In | Ogura2011 | Frontiers in Neuroscience | yes? | fnins-05-00091.pdf |
| Frontiers In | Tsagareli2011 | Frontiers in Neuroscience | yes? | fnins-05-00092.pdf |
| Hindawi | Diniz2012 | Psyche: A Journal of Entomology | yes? | 79139500.pdf |
| Hindawi | Restrepo2012 | Psyche: A Journal of Entomology | yes? | 516419.pdf |
| Hindawi | Savopoulou2012 | Psyche: A Journal of Entomology | yes? | 167420.pdf |
| Institute of Paleobiology, Polish Academy of Sciences | Amson2011 | Acta Palaeontologica Polonica | no | amson_11_affinities_666987.pdf |
| Institute of Paleobiology, Polish Academy of Sciences | Edgecombe2011 | Acta Palaeontologica Polonica | no | edgecombe_11_new_666988.pdf |
| Institute of Paleobiology, Polish Academy of Sciences | Williamson2011 | Acta Palaeontologica Polonica | no | app2E20092E0147.pdf |
| Magnolia Press | Agiuar2011 | Zootaxa | yes? | zt02846p098.pdf |
| Magnolia Press | Ebach2011 | Zootaxa | yes? | ebach_11_taxonomy_599972.pdf |
| Magnolia Press | Nelson2011 | Zootaxa | yes? | nelson_11_resemblance_688762.pdf |
| National Academy of Sciences | Casanovas2011 | Proceedings of the National Academy of Sciences | yes? | casanovas-vilar_11_updated_644658.pdf |
| National Academy of Sciences | Goswami2011 | Proceedings of the National Academy of Sciences | yes? | goswami_11_radiation_814757.pdf |
| National Academy of Sciences | Thorne2011 | Proceedings of the National Academy of Sciences | yes? | thorne_11_resetting_654055.pdf |
| Nature Publishing Group | Meng2011 | Nature | yes? | meng_11_transitional_644647.pdf |
| Nature Publishing Group | Rougier2011 | Nature | yes? | rougier_11_highly_720202.pdf |
| Nature Publishing Group | Venditti2011 | Nature | yes? | venditti_11_multiple_779840.pdf |
| NRC Research Press | CruzadoCaballero2010 | Canadian Journal of Earth Sciences | yes? | 650000.pdf |
| NRC Research Press | Druckenmiller2010 | Canadian Journal of Earth Sciences | yes? | 80000000c5.pdf |
| NRC Research Press | Mazierski2010 | Canadian Journal of Earth Sciences | yes? | mazierski_10_description_577223.pdf |
| NRC Research Press | Modesto2009 | Canadian Journal of Earth Sciences | yes? | modesto_09_new_577201.pdf |
| NRC Research Press | Parsons2009 | Canadian Journal of Earth Sciences | yes? | parsons_09_new_575744.pdf |
| NRC Research Press | Wu2007 | Canadian Journal of Earth Sciences | yes? | wu_07_new_622125.pdf |
| Pensoft Publishers | Hagedorn2011 | ZooKeys | yes? | hagedorn_11_creative_779747.pdf |
| Pensoft Publishers | Penev2011 | ZooKeys | yes? | penev_11_interlinking_694886.pdf |
| Pensoft Publishers | Thessen2011 | ZooKeys | yes? | thessen_11_data_779746.pdf |
| Public Library of Science | Hess2011 | PLoS ONE | yes? | hess_11_addressing_694222.pdf |
| Public Library of Science | McDonald2011 | PLoS ONE | yes? | mcdonald_11_subadult_694229.pdf |
| Public Library of Science | Wicherts2011 | PLoS ONE | yes? | wicherts_11_willingness_779788.pdf |
| SAGE Publications | deKloet2011 | Journal of Veterinary Diagnostic Investigation | yes? | Invest-2011-deKloet-421-9.pdf |
| SAGE Publications | Richter2011 | Journal of Veterinary Diagnostic Investigation | yes? | Invest-2011-Richter-430-5.pdf |
| SAGE Publications | Wassmuth2011 | Journal of Veterinary Diagnostic Investigation | yes? | Invest-2011-Wassmuth-436-53.pdf |
| Senckenberg Natural History Collections Dresden | Fresneda2011 | Arthropod Systematics & Phylogeny | yes? | fresneda_11_phylogenetic_785869.pdf |
| Senckenberg Natural History Collections Dresden | Mally2011 | Arthropod Systematics & Phylogeny | yes? | ASP_69_1_Mally_55-71.pdf |
| Senckenberg Natural History Collections Dresden | Shimizu2011 | Arthropod Systematics & Phylogeny | yes? | ASP_69_2_Shimizu_75-81.pdf |
| Springer-Verlag | Beermann2011 | Zoomorphology | yes? | 10.1007_s00435-011-0129-9.pdf |
| Springer-Verlag | Cuezzo2011 | Zoomorphology | yes? | cuezzo_11_ultrastructure_694669.pdf |
| Springer-Verlag | Vinn2011 | Zoomorphology | yes? | 10.1007_s00435-011-0133-0.pdf |
| Taylor & Francis | Bianucci2011 | Journal of Vertebrate Paleontology | no | bianucci_11_aegyptocetus_778747.pdf |
| Taylor & Francis | Makovicky2011 | Journal of Vertebrate Paleontology | no | makovicky_11_new_694826.pdf |
| Taylor & Francis | Pietri2011 | Journal of Vertebrate Paleontology | no | pietri_11_revision_689491.pdf |
| Taylor & Francis | Rook2011 | Journal of Vertebrate Paleontology | no | rook_11_phylogeny_694916.pdf |
| Taylor & Francis | Tsuihiji2011 | Journal of Vertebrate Paleontology | no | tsuihiji_11_cranial_660620.pdf |
| Taylor & Francis | Yates2011 | Journal of Vertebrate Paleontology | no | yates_11_new_694821.pdf |
| Taylor & Francis | Gerth2011 | Systematics and Biodiversity | no | gerth_11_wolbachia_779749.pdf |
| Taylor & Francis | Krebes2011 | Systematics and Biodiversity | no | krebes_11_phylogeography_779700.pdf |
| Sociedade Brasileira de Ictiologia | Britski2011 | Neotropical Ichthyology | yes? | a02v9n2.pdf |
| Sociedade Brasileira de Ictiologia | Sarmento2011 | Neotropical Ichthyology | yes? | a03v9n2.pdf |
| Sociedade Brasileira de Ictiologia | Calegari2011 | Neotropical Ichthyology | yes? | a04v9n2.pdf |
| Royal Society | Billet2011 | Proceedings of the Royal Society B: Biological Sciences | yes? | billet_11_oldest_687630.pdf |
| Royal Society | Polly2011 | Proceedings of the Royal Society B: Biological Sciences | yes? | polly_11_history_625430.pdf |
| Royal Society | Sansom2011 | Proceedings of the Royal Society B: Biological Sciences | yes? | sansom_11_decay_625429.pdf |

  • @rmounce “Why would publishers add metadata? Because their customers – libraries, governments, research funders (in the case of Open Access PDFs) should demand it.” I’m not seeing a compelling business case here. High-quality metadata would be nice, but can anybody argue that their research is being hampered by a lack of such metadata? Could someone working in publishing make a case to their boss that adding such metadata would generate more revenue, web traffic, manuscript submissions (insert whatever metric matters)?

    You ask “I would ask for a refund if I downloaded MP3’s from iTunes/Amazon MP3 with insufficient embedded metadata. Why not the same principle for electronically published PDFs?” You may buy MP3’s, but I suspect the vast majority of people don’t buy PDFs (have you bought an article PDF as an individual?). You aren’t the customer; University libraries (and others with large budgets) are the customers. If you’re not the customer then your wishes aren’t going to matter a great deal. Get your credit card out and things may change ;)

    • I guess it’s just so obvious to me why metadata is vitally important I forget to state it sometimes.

      Science is digital these days. Researchers get their articles mostly via PDF from publishers and store these on their computers. Storing, arranging and retrieving these PDFs from one’s own personal library of thousands of such files is no trivial task. Sophisticated programs like Papers, Mendeley, Zotero, EndNote, Paperpile, ColWiz… help, but since the metadata provided with PDFs is so poor (independently confirmed by Victor @ Mendeley btw) they often can’t get 100% correct metadata for each PDF. Therefore every academic I know periodically spends (wastes) time arranging, filing and adding metadata to their personal library.

      My second point, perhaps an even more important one, is the case in which PDFs **are the data** for analyses – content mining & textmining analyses. Sure *some* publishers provide special XML formats especially for this technique but a lot don’t – I think it’s easier to list the select places where you *will* find XML (e.g. EuropePMC, PLoS, BMC, Hindawi, Pensoft, Frontiers) than where you won’t. Furthermore, particular approaches that want to mine the figure content too aren’t best served by XML. Rich PDF metadata would significantly help such PDF-mining analyses. And if licence information was included then it would help us PDF-miners to be computationally aware of which PDFs are Open Access and which are not. Not even Elsevier currently embeds anything about the licensing information in their PDFs, which is crucially important, especially for mixed journals like Cell Reports, whose content is variously either CC BY or CC BY-NC-ND (a huge difference! not Open Access either).
      Elsevier embeds copyright info e.g. “© 2012 The Authors” but this is something different and whilst welcome doesn’t provide any machine-readable information on what we are allowed to do with the PDF.

      Mining is also one of JISC’s top 7 predictions for the future of research:

      • @rmounce Perhaps I’m being a bit obtuse, but what I’m getting at is that the reasons why you want something may be irrelevant to a publisher. I understand why you’d like rich metadata in PDFs, they help solve problems you face. But how does solving your problems help the publisher? I guess I’m looking for the incentive for them. If one exists, they are more likely to produce the kind of metadata you are after. Organising your PDFs has no benefit for a publisher, nor does data mining (unless the publisher can monetize the results of the mining, which is where I suspect Elsevier are heading). If you were a publisher why would you embed metadata – and “because it’s the right thing to do” isn’t an answer ;)

        Regarding the issue of organising PDFs (the fabled “iTunes of papers”), there are several ways this could be tackled. Embedded metadata is one, another is a Gracenote-style solution where, say, you submit a sha1 hash of the PDF and you get back the corresponding metadata. Mendeley could offer something like this if their API could be searched by sha1 (they store sha1 signatures for PDFs that users uploaded). One wrinkle is that some publishers generate unique PDFs with each download, so each PDF will have a unique signature.
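The fingerprint half of that Gracenote-style idea is easy to sketch. What follows only computes the SHA-1 of a file; pdf_sha1 is a hypothetical helper name, and no real lookup API (Mendeley's or anyone else's) is assumed or called.

```shell
#!/bin/sh
# Sketch only: compute the SHA-1 fingerprint a lookup service might key on.
# pdf_sha1 is a hypothetical helper; no real lookup API is assumed here.
pdf_sha1() {
    if command -v sha1sum >/dev/null 2>&1; then
        # GNU coreutils: output is "<hash>  <filename>".
        sha1sum "$1" | cut -d ' ' -f 1
    else
        # macOS and some BSDs ship shasum instead of sha1sum.
        shasum -a 1 "$1" | cut -d ' ' -f 1
    fi
}
```

As the comment above notes, this only helps when the publisher serves byte-identical files: a PDF stamped per download will hash differently every time.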

        • bobcorrigan

          Ah, the dilemma of product development: having no customers is a setback, having one customer is a disaster.

        • Mike Taylor

          Rod, if your argument is just that predatory publishers don’t give a shit about researchers, then I guess I can’t argue with that. But with all the we’re-your-friends rhetoric we keep hearing, it doesn’t seem unreasonable to me that publishers who do give a shit should do their damn jobs properly.

          You say that “because it’s the right thing to do” isn’t an answer. I hope publishers are better than that – they certainly tell us they are, repeatedly. As for those that really aren’t – well, we have plenty of case-history on what happens to companies that don’t care about their customers.

          • @MikeTaylor I’m not saying publishers “don’t give a shit”, I just imagine a meeting where there’s a bunch of things publishers are thinking about doing next. Where would “embedding full publication metadata using XMP” fall in that list? Publishers are grappling with a changing landscape, such as the rise of Open Access publishing models, mobile (e.g., do we design web sites for mobile, or develop apps?), increasing demands for alternative metrics beyond impact factor, and so on. Any change in practice has costs, so I’m looking for reasons why publishers would want to add XMP. In other words, if you could sit down with a publisher and wanted to convince them that they just had to add full metadata (say in XMP), what would you say? Is this the number one thing you’d want them to do (as opposed, say, to digitising their back catalogue, adding ePub as an output format, assigning DOIs if they don’t have them, or <insert other thing we’d like>).

          • In addition, the XMP toolkit from Adobe is quite shit. exiftool is OK, but not the most amazing tool for a production system. As it happens, I’m going to be visiting our typesetter in India in February, and I’ll be able to learn first hand how they do XMP embedding into our PDFs, I’ll write up my notes when I get back.

          • Looking forward to those notes :)

            It would be good if you could also report on how figures are handled during this process. PMR and I are very keen on seeing that vector formats *remain* vectors and are kept largely intact from submission – no transmogrification into rasters please!

            It’s nice to know there’s someone willing to lift the lid on these otherwise secretive black-box processes that go on behind the scenes at publishers :)

          • Mike Taylor

            The thing is, Rod, doing this right is trivial. It’s just a matter of taking ten minutes to fix the pipeline. Whereas the other things you mention — digitising a back-catalogue for example — are very significant undertakings.

            So adding proper metadata shouldn’t even be on the agenda for your hypothetical meeting. They should just do it.

          • @MikeTaylor I’m not sure this is trivial. Publishers will have different pipelines, they may not have much control over fine details of production (it may be contracted out, the contract might require re-negotiating or additional charges if changed, etc.). This thread desperately needs input from people involved in the publishing process, which is why I look forward to reading what @IanMulvany learns after visiting their typesetter. I don’t think we’re in a position to simply say “they should just do it” if we don’t understand the production process. I’m not claiming any special insight, but my experience editing Systematic Biology was eye opening. When we deployed the Manuscript Central editorial system we had conference calls where pretty much any change to the system required the publisher talking to Scholar One, and any change of substance was treated as a billable item.

          • Mike Taylor

            This is why we desperately need new publishers (PLOS, eLife, PeerJ). Even leaving aside the business ethics of the old ones, the whole approach is so desperately mired in inefficiencies that a technically trivial change like this becomes organisationally hard. For every $1 you pay these corporations, 1¢ goes on actually getting the work done and 99¢ on managing the process around it. I bet we’ll hear no whinging from PeerJ about how hard it is to run
            $ set-pdftag author “Michael P. Taylor”.

          • Rod, I don’t think you’ll get any publishers commenting here, nor any typesetters (except for Mr big mouth, here!!).

            My view is that most publishers, and their suppliers, have got themselves in a corner by using inappropriate tools for XML generation, PDF generation, etc. And I think they are a bit embarrassed to discuss these points in public!! It is the use of these “broken” and hugely expensive systems and tools that makes it so hard to embed XMP which is a no-brainer to any academic.

            It’s not hard – publishers all demand XML-first typesetting, which I understand to mean fully automated creation of PDF from XML. This is what we do every day and have done for a decade. The XMP is in the XML and gets embedded automatically in the PDF. Job done. Now where it gets hard is when the PDF is made first in a desktop publishing package, then XML is generated by some complex means, and then checked and rechecked to make sure it is right. Adding XMP becomes an enormous task.

            So the ones who have trouble are those who are using broken systems. And let’s face it, they really don’t deserve to be around long. ;-)

            By the way we are one of the typesetters to Elsevier, so I’m glad their files were better. ;-)

        • One incentive for OA publishers to do this is that Google does pay attention to XMP data, so it should make one’s material more visible in the long tail of search results. It’s hard to do a verifiable experiment on this, however.

          • Nice angle. Rich metadata as a tool for increasing discoverability, I like it :)
            If all Open Access publishers could put their licensing details in a standardised way in XMP data perhaps we could one day have effective ways to search for OA-only papers, in the same way that one can search by licence type on Flickr and other places.
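To make the licence idea concrete, here is one way such details could be written with exiftool. It is a sketch under stated assumptions: no agreed standard is implied, the choice of the XMP-dc:Rights and XMP-xmpRights:WebStatement tags is just one plausible convention, and the function name tag_licence and the EXIFTOOL variable are made up for this example.

```shell
#!/bin/sh
# Sketch only: write a licence statement and licence URL into a PDF's XMP.
# tag_licence and EXIFTOOL are illustrative names; the choice of tags is
# an assumption, not an established publisher convention.
EXIFTOOL="${EXIFTOOL:-exiftool}"

tag_licence() {
    pdf="$1"        # PDF to tag
    statement="$2"  # human-readable licence, e.g. "CC BY 3.0"
    url="$3"        # URL of the licence text
    "$EXIFTOOL" "-XMP-dc:Rights=$statement" \
                "-XMP-xmpRights:WebStatement=$url" \
                "$pdf"
}
```

For example, tag_licence paper.pdf "CC BY 3.0" "http://creativecommons.org/licenses/by/3.0/" would tag a file named paper.pdf. By default exiftool keeps the untouched file as a backup with _original appended to its name, so the result can be compared against it with exiftool -a -G.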

  • Phil Harvey

    I would be interested to see why ExifTool doesn’t extract metadata from the iText PDF files. If possible, could you email a sample to me (phil at Thanks.

    • Sent. Many thanks for taking the time to look into this. I’d be very interested to know the results of this. The Acta Palaeontologica Polonica PDFs are all Open Access and available from here: so anyone can try this themselves with these papers.

      • Brilliant. Phil suggested I try exiftool -a -G instead as it tells one exactly where each metadata came from. I’m now sure this is an area in which publishers both commercial and non-commercial alike could significantly improve their products (publications).
        As an example for the Nagalingum et al 2011 Science paper we now get (which again shows that there’s really not much of use in the XMP provisioned metadata):

        exiftool -a -G nagalingum_11_recent_719133.pdf

        [ExifTool] ExifTool Version Number : 9.10
        [File] File Name : nagalingum_11_recent_719133.pdf
        [File] Directory : .
        [File] File Size : 367 kB
        [File] File Modification Date/Time : 2013:01:06 17:50:40+00:00
        [File] File Access Date/Time : 2013:01:06 17:50:57+00:00
        [File] File Inode Change Date/Time : 2013:01:06 17:50:40+00:00
        [File] File Permissions : rw-r--r--
        [File] File Type : PDF
        [File] MIME Type : application/pdf
        [PDF] PDF Version : 1.4
        [PDF] Linearized : No
        [PDF] Page Count : 5
        [PDF] Create Date : 2011:11:11 00:18:09-08:00
        [PDF] Modify Date : 2011:11:11 00:18:10-08:00
        [PDF] Producer : Adobe PDF Library 9.0.1
        [XMP] XMP Toolkit : Adobe XMP Core 4.2.1-c043 52.389687, 2009/06/02-13:20:35
        [XMP] Modify Date : 2011:11:11 00:18:10-08:00
        [XMP] Create Date : 2011:11:11 00:18:09-08:00
        [XMP] Metadata Date : 2011:11:11 00:18:10-08:00
        [XMP] Document ID : uuid:fe09fdeb-1dd1-11b2-0a00-000088ccd6ff
        [XMP] Instance ID : uuid:fe0a792e-1dd1-11b2-0a00-0ff678ced6ff
        [XMP] Format : application/pdf
        [XMP] Producer : Adobe PDF Library 9.0.1

        • and I confirm again, that I’m not seeing *any* XMP data in the iText produced PDFs:

          exiftool -a -G rook_11_phylogeny_694916.pdf

          [ExifTool] ExifTool Version Number : 9.10
          [File] File Name : rook_11_phylogeny_694916.pdf
          [File] Directory : .
          [File] File Size : 626 kB
          [File] File Modification Date/Time : 2013:01:06 17:50:40+00:00
          [File] File Access Date/Time : 2013:01:06 17:50:56+00:00
          [File] File Inode Change Date/Time : 2013:01:06 17:50:40+00:00
          [File] File Permissions : rw-r--r--
          [File] File Type : PDF
          [File] MIME Type : application/pdf
          [PDF] PDF Version : 1.4
          [PDF] Linearized : No
          [PDF] Page Count : 7
          [PDF] Producer : iText 2.1.5 (by
          [PDF] Title : Phylogeny of the Taeniodonta: evidence from dental characters and stratigraphy
          [PDF] Subject : Journal of Vertebrate Paleontology 2011.31:422-427
          [PDF] Modify Date : 2011:09:01 15:08:04-07:00
          [PDF] Author : Deborah L. Rook a b & John P. Hunter c
          [PDF] Create Date : 2011:09:01 15:08:04-07:00
          [ICC_Profile] Profile CMM Type : Lino
          [ICC_Profile] Profile Version : 2.1.0
          [ICC_Profile] Profile Class : Display Device Profile
          [ICC_Profile] Color Space Data : RGB
          [ICC_Profile] Profile Connection Space : XYZ
          [ICC_Profile] Profile Date Time : 1998:02:09 06:49:00
          [ICC_Profile] Profile File Signature : acsp
          [ICC_Profile] Primary Platform : Microsoft Corporation
          [ICC_Profile] CMM Flags : Not Embedded, Independent
          [ICC_Profile] Device Manufacturer : IEC
          [ICC_Profile] Device Model : sRGB
          [ICC_Profile] Device Attributes : Reflective, Glossy, Positive, Color
          [ICC_Profile] Rendering Intent : Perceptual
          [ICC_Profile] Connection Space Illuminant : 0.9642 1 0.82491
          [ICC_Profile] Profile Creator : HP
          [ICC_Profile] Profile ID : 0
          [ICC_Profile] Profile Copyright : Copyright (c) 1998 Hewlett-Packard Company
          [ICC_Profile] Profile Description : sRGB IEC61966-2.1
          [ICC_Profile] Media White Point : 0.95045 1 1.08905
          [ICC_Profile] Media Black Point : 0 0 0
          [ICC_Profile] Red Matrix Column : 0.43607 0.22249 0.01392
          [ICC_Profile] Green Matrix Column : 0.38515 0.71687 0.09708
          [ICC_Profile] Blue Matrix Column : 0.14307 0.06061 0.7141
          [ICC_Profile] Device Mfg Desc : IEC
          [ICC_Profile] Device Model Desc : IEC 61966-2.1 Default RGB colour space - sRGB
          [ICC_Profile] Viewing Cond Desc : Reference Viewing Condition in IEC61966-2.1
          [ICC_Profile] Viewing Cond Illuminant : 19.6445 20.3718 16.8089
          [ICC_Profile] Viewing Cond Surround : 3.92889 4.07439 3.36179
          [ICC_Profile] Viewing Cond Illuminant Type : D50
          [ICC_Profile] Luminance : 76.03647 80 87.12462
          [ICC_Profile] Measurement Observer : CIE 1931
          [ICC_Profile] Measurement Backing : 0 0 0
          [ICC_Profile] Measurement Geometry : Unknown (0)
          [ICC_Profile] Measurement Flare : 0.999%
          [ICC_Profile] Measurement Illuminant : D65
          [ICC_Profile] Technology : Cathode Ray Tube Display
          [ICC_Profile] Red Tone Reproduction Curve : (Binary data 2060 bytes, use -b option to extract)
          [ICC_Profile] Green Tone Reproduction Curve : (Binary data 2060 bytes, use -b option to extract)
          [ICC_Profile] Blue Tone Reproduction Curve : (Binary data 2060 bytes, use -b option to extract)

  • Mike Taylor

    “Lots of missing metadata was found.”

    Oh, good! Where did you find it?

  • I can speak from a publisher’s perspective on why all parties involved (authors, funders, editors, publishers, data miners, etc.) should want XMP. (Note: I am a co-founder at PeerJ and before that was at Mendeley.)

    In one word, branding. More and more researchers use PDF tools to organize and extract metadata from the PDFs that they download. Even those who don’t use those tools, are coming across data sets and statistics that make use of that aggregate data from others using these tools (and some of those people are major decision-makers).

    When those tools built to utilize XMP are unable to properly extract the metadata, due to insufficient XMP in the PDF, then at least two negative consequences occur.

    First, users are often frustrated with the tool that is unable to extract the metadata, but they also become frustrated with the publisher or journal. Imagine a user with 200 articles from journal ‘ABC’ and 200 from journal ‘XYZ’, where only the articles from ‘XYZ’ have their metadata properly extracted, at close to 100%. As we move more towards an author-pays model and away from subscriptions, that is an important negative branding experience for journal ‘ABC’. Even if it doesn’t affect an author’s intent on where to publish next, the fact that journal ‘XYZ’ has its metadata shown in full with every interaction will be a positive branding experience for the user. This is why brands like Coca-Cola spend millions/billions on display adverts on and offline: constant presence. Even if you don’t buy that Coke today, you will tomorrow.

    Second, as mentioned above, that metadata is eventually aggregated, either alone in services such as Mendeley, or aggregated even further through a combination of APIs from various services or other means. The catch is that only accurate metadata can be properly aggregated; the rest is either lost or too incomplete to give an accurate count of how often an article is being used. That too is bad branding, as it not only limits the dissemination of the brand but also reduces the appearance of being a highly read journal (or specific article within that journal). Not good news for the publisher, authors, editor, funders, or potentially the reviewers of that article.

    There are more reasons, but a less data-driven third reason is that XMP represents just the minimum bar in innovation. If a publisher cannot achieve that minimum standard, then how likely is it that they can be entrusted to improve science going forward?

    • +1 very well put!

      Your first point particularly resonates with me. Certain journals *really* frustrate me WRT metadata (not just strictly PDFs but they would help no doubt) – if I can’t easily get the 100% correct metadata into Mendeley / Zotero / CiteULike every time for every article, my likelihood of wanting to cite articles in that publication goes down, and I’ll go find another more easily citable paper to cite (not always an option, but increasingly so). Similarly if I’m disinclined to cite an article just because it’s harder to get accurate metadata / bibliographic data for it, I’ll remember that when it comes to submitting my next manuscript and avoid that difficult-to-cite journal…

      Frustration with brands & journals can all too easily occur. is perhaps a rather good example of brand damage to think about!

      • @jasonHoyt @rmounce Just to continue to play Devil’s advocate, I suspect whether the journal embeds XMP metadata or not currently has pretty much zero impact on anybody’s decisions about where they publish or what they cite. And yes, if a tool can’t extract metadata from a PDF I’ll blame the tool rather than the PDF (as unfair as that may be). I say this as someone who embeds XMP metadata in PDFs I generate, see

        I take the point about branding, but I’m unconvinced that XMP has a big part in that. Put another way, Elsevier does great XMP, what impact does that have, if any, on its brand? Nature also supports XMP, but their brand is probably seen as innovative for other reasons (e.g., ePub-based publishing on iOS devices).

        Lastly, journals have been experimenting with metadata formats for a long time now, including RSS feeds, Dublin Core and Google Scholar tags in HTML, OAI-PMH harvesting, Medline/PubMed indexing, CrossRef metadata, XMP, OpenURL, etc. It would be interesting to discover which of these had the most impact for publishers and/or for users.

        • @rdmpage if embedding XMP was the only secret to driving a positive experience for a research publication then it would be quite the easy job :). You’re right that it is just one part of building trust, which can be ruined in a flash as we’ve seen happen before. That said, there is no excuse for a publisher with dozens, hundreds, or thousands of staff not to be able to include XMP and other metadata tools.

  • Zack Barkley

    Even if the publishers are not responsible enough to do such a simple thing as properly tag their PDFs to make researchers’ lives easier and more productive, we as researchers should at least be able to write our own tags into PDFs so that we can easily sort them in Windows Explorer. This was doable in Windows XP, but sadly, I have worked many hours on this and it just seems impossible with Windows 7/8. Exiftool writes some tags, but not the ones Adobe uses, and neither the exif nor the Adobe tags are visible in Explorer from Vista onwards, although incredibly PDFs are the most commonly shared document style and there are 288 tags for other things.