Show me the data!

Just a quick post.

I happened to see @wisealic Tweet about her “new Atira/Pure colleagues” yesterday. I didn’t know what Atira was, but I’d heard of PURE.

I googled it to find out more… and soon found the official Elsevier press release, dated August 15, 2012 (so this isn’t really new news). But combined with recent rumours it does worry me. Elsevier own perhaps a fifth of the academic literature; whatever the true figure, it’s a significant share. Despite the research that went into most of those papers being publicly or charitably funded, Elsevier now rent access to this work back to us (the world) for vast sums of money each and every year.

Not to mention the fake journals they published, the arms dealings their parent company (Reed Elsevier) was involved in, their initial support for the RWA (since withdrawn), the megabundling of journals, the non-provision of open bibliographic metadata (even NPG release this!), the obscene profit margins (and to be fair they’re not the only corporate publisher making a killing here by selling freely provided academic work). There are 1001 reasons why, and this isn’t an exhaustive list of all the evils…

So Elsevier are not a well-loved company in academia at the moment – more than 13,000 people have signed a boycott of them.

There are rumours that Elsevier are in talks to buy Mendeley at the moment. And Atira/PURE, now part of the Elsevier (umbrella?) corporation, are I think the exclusive(?) providers of the research information ‘management’ systems that the UK will be using for its next Research Excellence Framework (REF, formerly RAE) exercise in 2014.

So… Elsevier own a significant portion of our papers, and they may soon own a significant chunk of the bibliographic metadata stored by academics (Mendeley data) and all the commercial insight and advantage that gives, AND they own the company that is managing the data that evaluates UK academics, and more around the world no doubt.

I do wonder if there isn’t a significant conflict of interest if thousands of UK academics have publicly boycotted Elsevier and now their academic work is going to be evaluated by… Elsevier. Academic jobs thoroughly depend on the results of these evaluations as I understand it, and heads will roll if the results at an institution are below expectations.

From a purely business perspective many financial analysts would rightly applaud these acquisitions as “good business moves” (good for profits no doubt). But from an ethical standpoint? Elsevier now seem to have a worrying empire of services built around academia and a significant amount of data which presumably they can pool together from each of these different services to gain additional insight? They also have a very poor record when it comes to providing open data. Why are we still giving them our data so easily – they’re only going to rent it back to us at a later date?

To me it’s clear: we’re giving up far too much of our data to this company and they do not have our best interests at heart – shareholder profits are by definition their primary goal. They have a sizeable monopoly on academic data in all its forms, which they can and do leverage, and I suspect we’re going to be made to pay for this mistake in the future as we have with hugely inflated journal subscription prices.

Is it just me that’s worried?

Anyone who knows me, knows I’m very passionate on the subject of data sharing in science, and after all the relevant conferences I’ve been to and research I’ve done – I don’t mind saying I’m fairly knowledgeable on the subject too.

It’s part of the reason I got this Panton Fellowship that has helped me develop my work and do what I want to do in pursuit of Open Data goals.

So when I saw this article come up on my RSS feeds – I thought great! It’s finally happening. The vertebrate palaeontology community is finally seeing the light – the absolute need to share research data associated with published papers (we’ll tackle pre-publication data sharing later, first things first…)!

Uhen, M. D., Barnosky, A. D., Bills, B., Blois, J., Carrano, M. T., Carrasco, M. A., Erickson, G. M., Eronen, J. T., Fortelius, M., Graham, R. W., Grimm, E. C., O'Leary, M. A., Mast, A., Piel, W. H., Polly, P. D., and Säilä, L. K. 2013.
From card catalogs to computers: databases in vertebrate paleontology. Journal of Vertebrate Paleontology 33:13-28.

…and yet when I read the paper – it sorely disappointed me for a variety of reasons.

Choosing examples: bad choices & odd absences

Despite clear criteria given, I found the choice of databases reviewed to be an odd selection – for example they choose to include AHOB (Ancient Human Occupation of Britain) and write about it that:

“Access is restricted to project members during the life of the project, after which access will be publicly granted.”

This probably explains why, when I go to the database website, I can’t seem to get access to any of the data purported to be there!

Screenshot of the login screen for AHOB. Try it yourself.

Yet apparently: “More than 250 publications have results from the AHOB project, all of which are recorded in the database.”

How many more publications will come out of this cosy little database before access will be publicly granted I wonder? I don’t think this is a good example of a research database as it doesn’t seem to publicly share any data.

Where’s Dryad?

Furthermore there are some really big, obvious, relevant databases it neglects to review, in particular Dryad – the only mention of which is that TreeBASE received “some support from Dryad” – with absolutely no mention anywhere that Dryad itself is a database with lots of vertebrate palaeontological data in it, and likely to be a very important, long-lasting database in this area for the foreseeable future IMO! Even some data associated with an article in JVP itself is in Dryad! Although less prominently paleo-related, figshare (with no fewer than 26 paleontology-related datasets there at the moment; TreeBASE has approximately as many!) might have been worth mentioning too.

Dryad has a partnership with The Paleontological Society and many evolutionary biology journals. Dryad even bought a promotional stand at last year’s Society of Vertebrate Paleontology annual meeting (the society that publishes the Journal of Vertebrate Paleontology), but as Richard Butler has pointed out to me on Twitter, this article was submitted before that meeting. Still, it’s simply impossible that none of the 16 authors listed knows about Dryad. I find the non-inclusion of Dryad deeply suspicious and possibly political, given it could ‘compete’ to store much of the data that some of the other reviewed databases do (it’s a broad generalist in the types of data it accepts).

Isn’t there a conflict of interest issue given that most of the authors of this paper are involved with at least one of the ‘reviewed’ (=advertised) databases in the paper? I see no mention of this conflict of interest anywhere in the paper. I dearly hope this paper was peer-reviewed – that it is an ‘invited article’ makes me wonder a bit about that…

The inclusion of Polyglot Paleontologist in the reviewed databases also rather stretches the meaning of ‘data’ in the word database. Are translations of 434 different papers ‘data’? In the same way that TreeBASE or PaleoDB contain data? It’s a fantastic freely provided resource, no doubt – I mean no criticism of it – but is it data? I think not tbh.

Strong contenders for things that could/should have been cited but weren’t

WRT Data Portals: rOpenSci provide great R interfaces for a wide variety of databases, including TreeBASE, which was one of the ‘reviewed’ databases.

WRT the History of databases section: I find it odd that they didn’t think to mention my own widely publicised and well-supported call for data archiving in palaeontology back in 2011. Nearly 200 palaeontologists signed in support of our ideas, with some memorable quotes of support, e.g. Brian Huber: “This is the way of the future”, and P J Wagner: “I’ve been trying to get the Paleo Society to sign on with Dryad, but it’s been like slamming my head on jello…”

They could have explained why freely accessible databases/archives are so important a bit better in my opinion:
that ‘Data archiving is a good investment’ (Piwowar et al, 2012),
that only 4% of phylogenetic data is currently archived and that it’s really useful data (Stoltzfus et al, 2012),
that Willingness to Share Research Data Is Related to the Strength of the Evidence and the Quality of Reporting of Statistical Results (Wicherts et al, 2011),
that the “data available upon request” system really doesn’t work (Wicherts et al, 2006),
and the undesirable consequences of non-commercial clauses applied to biodiversity data (Hagedorn et al, 2011).

Odd wording

“…community approach, facilitated by the open access of the WWW and…”

sounds like something my dad would say about the interweb

“The CCL 3.0 license allows…”

a classic mistake – which CCL license?
In this case they mean the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 license, or CC BY-NC-SA for short. Calling it “Creative Commons License 3.0 (BY-NC-SA)” makes me wonder how familiar they are with licensing. Perhaps a sub-editor did this. And why they link specifically to the US version and not the international unported license, I do not know.

Data Citation: the Elephant in the Room?

Attribution is mentioned many times, and is vitally important to motivate people to share data. Yet the concept of citing data in countable ways or Data Citation isn’t explicitly mentioned once. Nor altmetrics for that matter.

This would have been an excellent opportunity, at the start of a new year, to encourage authors to actually cite data that they re-use from someone else, so that those citations can be easily counted and contribute towards research evaluations. But alas, no.

So what now?

So I like some of the message of this paper. But I don’t think it goes far enough, nor does it do a good job of it. Call me egotistical, but I think I could do better and expand upon what I’ve written above.

If any journal editor happens to read this, and would like to commission an ‘invited article’, comment, or proper independent critical review of databases in vertebrate palaeontology / evolutionary biology please contact me. I think I could offer an interesting perspective.

PS I’m not going to write to the journal. I tried that with Nature and it took 6 months from submission for my comment to get published! It’s 2013 – if I’m going to do post-publication peer review – I’ll definitely be blogging it from now on, Rosie Redfield style!

So a week ago, I investigated publisher-produced Version of Record PDFs with pdfinfo and the results were very disappointing. Lots of missing metadata was found and one could not reliably identify most of these PDFs from metadata alone, let alone extract particular fields of interest.

But Rod Page kindly alerted me to the fact that I might be using the wrong tool for this investigation. So at his suggestion I’ve tried again to extract metadata from the exact same set of PDFs as last time…

Only this time I’ll be using exiftool version 9.10.

This time I’ve put the full raw metadata output from exiftool on figshare for each and every PDF file, just to really prove the point, reproducible research and all. I’d love to post the corresponding PDFs too, but sadly many of them are not Open Access and this prevents me from uploading them to a public space. **Insert timely comment here about how closed access publications stifle effective research practices…**

Exiftool is really simple to use. You just need to type:
exiftool NameOfPDF.pdf
to get a human-readable exhaustive output of all possible metadata.

and
exiftool -b -XMP NameOfPDF.pdf
to get XML-structured metadata. I could only extract this from 56 of the 69 PDF files. The data output from this for those 56 PDFs is available as a separate fileset on figshare here.

Finally, if you want to test a whole bunch of PDF files in your working directory I’ve made a simple shell script that loops through all PDFs in your working directory, available here (oops, it’s not data, perhaps I should have put that on github instead?). [I’m sure many readers will be able to create a simple bash loop themselves but just for those that don’t…]
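
For anyone who would rather see the idea than download a file, the loop can be sketched roughly as follows. This is not the script linked above, only a minimal sketch that assumes exiftool is on your PATH; the function name extract_xmp_all, the EXIFTOOL variable and the .xmp output suffix are names made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: dump the XMP packet of every PDF in a directory.
# EXIFTOOL, extract_xmp_all and the .xmp suffix are illustrative names,
# not taken from the script linked in the post.
EXIFTOOL="${EXIFTOOL:-exiftool}"

extract_xmp_all() {
    local dir="${1:-.}" pdf
    for pdf in "$dir"/*.pdf; do
        # Skip the unexpanded pattern when the directory has no PDFs.
        [ -e "$pdf" ] || continue
        # -b -XMP writes the raw XMP packet to stdout; the redirect
        # leaves a zero-byte file behind if exiftool prints nothing.
        "$EXIFTOOL" -b -XMP "$pdf" > "${pdf%.pdf}.xmp"
    done
}
```

Source the file and run extract_xmp_all . in the directory holding the PDFs; each NameOfPDF.pdf should then have a NameOfPDF.xmp next to it.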

 

I’m assuming that the reason exiftool -b -XMP failed on 13 of those PDFs is because they have no embedded XMP metadata – an empty (zero-byte sized) file is created for these. This is an assumption though… I notice that those 13 correspond exactly to the 13 that were produced with iText. I checked the website and I’m pretty sure iText 2.x and up can embed XMP metadata; it’s just whether the publishers have bothered to use & apply this functionality.
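
One way to check that correlation without opening each file by hand is sketched below. It is only a sketch under stated assumptions: that the XMP dumps sit next to the PDFs as NameOfPDF.xmp (a hypothetical naming scheme), that exiftool is on your PATH, and that the PDF Producer field is where iText identifies itself. The function name report_empty_xmp is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: for each PDF whose XMP dump exists but is empty, print
# the PDF Producer field so the iText pattern can be checked by eye.
# The NAME.xmp layout and the function name are assumptions.
EXIFTOOL="${EXIFTOOL:-exiftool}"

report_empty_xmp() {
    local dir="${1:-.}" pdf xmp producer
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue
        xmp="${pdf%.pdf}.xmp"
        # Only report dumps that exist and are zero bytes long.
        [ -e "$xmp" ] && [ ! -s "$xmp" ] || continue
        # -s3 prints just the tag value, without the tag name.
        producer="$("$EXIFTOOL" -s3 -Producer "$pdf")"
        printf '%s\t%s\n' "$(basename "$pdf")" "${producer:-unknown}"
    done
}
```

Each output line is a filename and its Producer string separated by a tab, so a quick sort or grep for iText shows whether every empty dump really comes from an iText-produced PDF.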

So if I’m right, none of Taylor & Francis, BRILL or Acta Palaeontologica Polonica embed XMP metadata (at all!) in their PDFs. The alternative explanation is that the XMP metadata is in there but exiftool, for whatever reason, can’t read/parse it from iText-produced PDFs. I find this an unlikely alternative explanation though tbh.

Elsevier have superior XMP metadata to everyone else by the looks of it, but Elsevier aside the metadata is still very poor, so my conclusions from last week’s post still stand I think.

Most of the others do contain metadata (of some sort) but by and large it’s rather poor. I need to get some other work done on Monday so I’m afraid this is where I’m going to leave this for now. But I hope I’ve made the point.

Further angles to explore

Interestingly, Brian Kelly has taken this in a slightly different direction and looked at the metadata of PDFs in institutional repositories. I hadn’t realised this, but apparently some institutional repositories (IRs) routinely add cover pages to most deposits. If this is done without care for the embedded metadata, the original metadata can be wiped and/or replaced with newer (less informative) metadata. Not to mention that cover pages are completely unnecessary: all the information on a cover page is exactly the kind of stuff that should be put in embedded metadata! No need to waste time and space by putting that info as the first page. JSTOR does this too (cover pages) and it annoys the hell out of me.

After some excellent chat on Twitter about this IR angle I’ve discovered that UKOLN, based here on campus at Bath, have also done some interesting research in this area, in particular the FixRep project, which is described in more detail here. CrossRef Labs’ pdfmark tool also looks like something of interest for fixing PDFs with poor quality metadata. I’ve got this installed/compiled from the source on github but haven’t tried it out yet. It would be interesting to see the difference it makes – a before and after comparison of metadata to see what we’re missing… But why should we fix a problem that shouldn’t exist in the first place? Publishers are the point of origin for this. It’s their job to be the first to publish the Version of Record. They should provide the highest level of metadata possible IMO.

 

Why would publishers add metadata?

Because their customers – libraries, governments, research funders (in the case of Open Access PDFs) – should demand it. A pipe dream perhaps, but that’s my $.02. I would ask for a refund if I downloaded MP3s from iTunes/Amazon MP3 with insufficient embedded metadata. Why not the same principle for electronically published PDFs?

 

PS Apologies for some of the very cryptic filenames in the metadata uploads on figshare. You’ll have to cross-match with this list here or the spreadsheet I uploaded last week to work out which metadata file corresponds to which PDF/Bibliographic Data record/Publisher.

Publisher | Identifier | Journal | Contains embedded XMP metadata? | Filename
American Association for the Advancement of Science | Ezard2011 | Science | yes? | ezard_11_interplay_759293.pdf
American Association for the Advancement of Science | Nagalingum2011 | Science | yes? | nagalingum_11_recent_719133.pdf
American Association for the Advancement of Science | Rowe2011 | Science | yes? | Science-2011-Rowe-955-7.pdf
Blackwell Publishing Ltd | Burks2011 | Cladistics | yes? | burks_11_combined_694888.pdf
Blackwell Publishing Ltd | Janies2011 | Cladistics | yes? | janies_11_supramap_779773.pdf
Blackwell Publishing Ltd | Simmons2011 | Cladistics | yes? | simmons_11_deterministic_779537.pdf
BRILL | Barbosa2011 | Insect Systematics & Evolution | no | barbosa_11_phylogeny_779910.pdf
BRILL | Dellape2011 | Insect Systematics & Evolution | no | dellape_11_phylogenetic_779909.pdf
Cambridge Journals Online | Knoll2010 | Geological Magazine | yes? | knoll_10_primitive_475553.pdf
Cambridge Journals Online | Saucede2007 | Geological Magazine | yes? | thomas_saucegraved_07_phylogeny_506869.pdf
CSIRO | Chamorro2011 | Invertebrate Systematics | yes? | chamorro_11_phylogeny_780467.pdf
CSIRO | Daugeron2011 | Invertebrate Systematics | yes? | daugeron_11_phylogenetic_780466.pdf
CSIRO | Johnson2011 | Invertebrate Systematics | yes? | johnson_11_collaborative_750540.pdf
Elsevier | Lane2011 | Molecular Phylogenetics and Evolution | yes | E3-1-s2.0-S1055790311001448-main.pdf
Elsevier | Cunha2011 | Molecular Phylogenetics and Evolution | yes | E2-1-s2.0-S1055790311001680-main.pdf
Elsevier | Spribille2011 | Molecular Phylogenetics and Evolution | yes | E1-1-s2.0-S1055790311001606-main.pdf
Frontiers In | Horn2011 | Frontiers in Neuroscience | yes? | fnins-05-00088.pdf
Frontiers In | Ogura2011 | Frontiers in Neuroscience | yes? | fnins-05-00091.pdf
Frontiers In | Tsagareli2011 | Frontiers in Neuroscience | yes? | fnins-05-00092.pdf
Hindawi | Diniz2012 | Psyche: A Journal of Entomology | yes? | 79139500.pdf
Hindawi | Restrepo2012 | Psyche: A Journal of Entomology | yes? | 516419.pdf
Hindawi | Savopoulou2012 | Psyche: A Journal of Entomology | yes? | 167420.pdf
Institute of Paleobiology, Polish Academy of Sciences | Amson2011 | Acta Palaeontologica Polonica | no | amson_11_affinities_666987.pdf
Institute of Paleobiology, Polish Academy of Sciences | Edgecombe2011 | Acta Palaeontologica Polonica | no | edgecombe_11_new_666988.pdf
Institute of Paleobiology, Polish Academy of Sciences | Williamson2011 | Acta Palaeontologica Polonica | no | app2E20092E0147.pdf
Magnolia Press | Agiuar2011 | Zootaxa | yes? | zt02846p098.pdf
Magnolia Press | Ebach2011 | Zootaxa | yes? | ebach_11_taxonomy_599972.pdf
Magnolia Press | Nelson2011 | Zootaxa | yes? | nelson_11_resemblance_688762.pdf
National Academy of Sciences | Casanovas2011 | Proceedings of the National Academy of Sciences | yes? | casanovas-vilar_11_updated_644658.pdf
National Academy of Sciences | Goswami2011 | Proceedings of the National Academy of Sciences | yes? | goswami_11_radiation_814757.pdf
National Academy of Sciences | Thorne2011 | Proceedings of the National Academy of Sciences | yes? | thorne_11_resetting_654055.pdf
Nature Publishing Group | Meng2011 | Nature | yes? | meng_11_transitional_644647.pdf
Nature Publishing Group | Rougier2011 | Nature | yes? | rougier_11_highly_720202.pdf
Nature Publishing Group | Venditti2011 | Nature | yes? | venditti_11_multiple_779840.pdf
NRC Research Press | CruzadoCaballero2010 | Canadian Journal of Earth Sciences | yes? | 650000.pdf
NRC Research Press | Druckenmiller2010 | Canadian Journal of Earth Sciences | yes? | 80000000c5.pdf
NRC Research Press | Mazierski2010 | Canadian Journal of Earth Sciences | yes? | mazierski_10_description_577223.pdf
NRC Research Press | Modesto2009 | Canadian Journal of Earth Sciences | yes? | modesto_09_new_577201.pdf
NRC Research Press | Parsons2009 | Canadian Journal of Earth Sciences | yes? | parsons_09_new_575744.pdf
NRC Research Press | Wu2007 | Canadian Journal of Earth Sciences | yes? | wu_07_new_622125.pdf
Pensoft Publishers | Hagedorn2011 | ZooKeys | yes? | hagedorn_11_creative_779747.pdf
Pensoft Publishers | Penev2011 | ZooKeys | yes? | penev_11_interlinking_694886.pdf
Pensoft Publishers | Thessen2011 | ZooKeys | yes? | thessen_11_data_779746.pdf
Public Library of Science | Hess2011 | PLoS ONE | yes? | hess_11_addressing_694222.pdf
Public Library of Science | McDonald2011 | PLoS ONE | yes? | mcdonald_11_subadult_694229.pdf
Public Library of Science | Wicherts2011 | PLoS ONE | yes? | wicherts_11_willingness_779788.pdf
SAGE Publications | deKloet2011 | Journal of Veterinary Diagnostic Investigation | yes? | Invest-2011-deKloet-421-9.pdf
SAGE Publications | Richter2011 | Journal of Veterinary Diagnostic Investigation | yes? | Invest-2011-Richter-430-5.pdf
SAGE Publications | Wassmuth2011 | Journal of Veterinary Diagnostic Investigation | yes? | Invest-2011-Wassmuth-436-53.pdf
Senckenberg Natural History Collections Dresden | Fresneda2011 | Arthropod Systematics & Phylogeny | yes? | fresneda_11_phylogenetic_785869.pdf
Senckenberg Natural History Collections Dresden | Mally2011 | Arthropod Systematics & Phylogeny | yes? | ASP_69_1_Mally_55-71.pdf
Senckenberg Natural History Collections Dresden | Shimizu2011 | Arthropod Systematics & Phylogeny | yes? | ASP_69_2_Shimizu_75-81.pdf
Springer-Verlag | Beermann2011 | Zoomorphology | yes? | 10.1007_s00435-011-0129-9.pdf
Springer-Verlag | Cuezzo2011 | Zoomorphology | yes? | cuezzo_11_ultrastructure_694669.pdf
Springer-Verlag | Vinn2011 | Zoomorphology | yes? | 10.1007_s00435-011-0133-0.pdf
Taylor & Francis | Bianucci2011 | Journal of Vertebrate Paleontology | no | bianucci_11_aegyptocetus_778747.pdf
Taylor & Francis | Makovicky2011 | Journal of Vertebrate Paleontology | no | makovicky_11_new_694826.pdf
Taylor & Francis | Pietri2011 | Journal of Vertebrate Paleontology | no | pietri_11_revision_689491.pdf
Taylor & Francis | Rook2011 | Journal of Vertebrate Paleontology | no | rook_11_phylogeny_694916.pdf
Taylor & Francis | Tsuihiji2011 | Journal of Vertebrate Paleontology | no | tsuihiji_11_cranial_660620.pdf
Taylor & Francis | Yates2011 | Journal of Vertebrate Paleontology | no | yates_11_new_694821.pdf
Taylor & Francis | Gerth2011 | Systematics and Biodiversity | no | gerth_11_wolbachia_779749.pdf
Taylor & Francis | Krebes2011 | Systematics and Biodiversity | no | krebes_11_phylogeography_779700.pdf
Sociedade Brasileira de Ictiologia | Britski2011 | Neotropical Ichthyology | yes? | a02v9n2.pdf
Sociedade Brasileira de Ictiologia | Sarmento2011 | Neotropical Ichthyology | yes? | a03v9n2.pdf
Sociedade Brasileira de Ictiologia | Calegari2011 | Neotropical Ichthyology | yes? | a04v9n2.pdf
Royal Society | Billet2011 | Proceedings of the Royal Society B: Biological Sciences | yes? | billet_11_oldest_687630.pdf
Royal Society | Polly2011 | Proceedings of the Royal Society B: Biological Sciences | yes? | polly_11_history_625430.pdf
Royal Society | Sansom2011 | Proceedings of the Royal Society B: Biological Sciences | yes? | sansom_11_decay_625429.pdf