Show me the data!

This is a re-post of something I was invited to write to sum up my experiences at OKCon 2011. The original post can be viewed here on the official OKFN Open Science blog. For some reason the Prezi embed code at the bottom didn’t work there, but it does here on my blog.

Many thanks to Jenny Molloy for inviting me to write the post, and Maria Neicu for editing it.

A couple of months ago, I gave a talk at the Open Knowledge Conference 2011 on ‘Open Palaeontology’ – based upon 18 months’ experience as a lowly PhD student trying, and mostly failing, to get usable digital data from palaeontological research papers. As you might well have inferred from that last sentence, it’s been an interesting ride.

The main point of my talk was the sheer stupidity/naivety of the way in which data is supplied with or within research papers (or, in some cases, not supplied at all!). Effective science operates through the accumulation of knowledge and data: all advances are incremental and build upon the work of others – the Panton Principles probably sum it up far better than I could. Any barrier to the accumulation of knowledge and data therefore impedes the progress of science.

Whilst there are numerous barriers to academic research – access to research papers being perhaps the most well-known and well-publicised – the issue that most aggravates me is not access to these papers, but the papers themselves. In the context of the 21st century (I’m thinking of the Internet Age here…), they are only barely adequate (at best) for communicating research data, and this is a major problem for the future legacy of our published work… and for my research project.

My PhD thesis title is quite broad: ‘The Importance of Fossils in Phylogeny’. Given this title and (wide) scope, I need to look at a lot of papers in a lot of different journals, and extract data from these articles to re-analyse, in order to assess the importance of fossils in phylogeny on a meta-scale. There are long-established data formats for the particular type of data I wish to extract – so well established and easy to understand that there’s even a Wikipedia page describing the most commonly used format (NEXUS). There exist multiple databases set aside specifically to host this type of data, e.g. TreeBASE and MorphoBank. Yet despite all this standardisation and provisioning for palaeomorphological phylogenetic data, far less than 1% of all such published data is actually readily available in a standardised, digital, usable format.

In most cases the data is there; you just have to dig very, very hard to release it from the pdf file it’s usually buried in (and then spend copious, unnecessary amounts of time manually reformatting and validating it). See the picture below for a typical example (and yes, it is sadly printed sideways; this is a common and silly practice that publishers use to inappropriately squeeze data matrices into papers):
[image: a typical data matrix, printed sideways across the page of a published paper]

I hope you’ll agree with me that this is clearly absurd and hugely inefficient. As I explain in my presentation (slides at the bottom of this post), the data, as originally analysed/used, comes in a much richer, more usable, digital, standardised format. Yet when published, it gets stripped of all useful metadata and converted into a flat, inextricable and significantly obfuscated table. Why? It’s my belief that this practice is a lazy, unwanted, vestigial hangover from the days of paper-only publishing, in which this might have been the only way to convey the data with the paper. But in 2011, I can confidently say that the vast majority of researchers read and use the digital versions of research papers – so why not make full and proper use of the digital format to aid scientific communication? I’m not arguing for axing paper copies, but for making sure that digital versions are more than just plain pdf versions of the paper copy, as they can and (IMO) should be.

With this goal in mind, I set about writing an Open Letter to the rest of my research community to explain why we need to richly digitise our published research data ASAP. Naturally, I wouldn’t get very far just by myself, so I enlisted the support of a variety of academic friends via Facebook, and (inspired by OKFN pads I’d seen) we concocted a draft letter together using an Etherpad. The result was a fairly basic Drupal-based website, http://supportpalaeodataarchiving.co.uk/, which we launched and disseminated via mailing lists, Twitter and Academia.edu, as far and wide as we possibly could – *hoping*, just hoping, that our fellow academics would read it, take note and support our cause.

Surprisingly, it worked, to an extent: a lot of big names in palaeontology signed our Open Letter in support of our cause. Then things got even better when a Nature journalist (Ewen Callaway) got interested in our campaign and wrote an article for Nature News about it, which can be found here. A huge thanks must go to everyone who helped out with the campaign; it’s generated truly international support, as can be seen on the map below:
(You might have to zoom out a bit – for some reason it zooms into Africa by default.)


View Open Letter Signatures in a larger map

It’s far too soon to know the true impact of the campaign. Journal editorial boards can be very slow to change their editorial policies, especially if the change requires a modicum of extra effort on the part of the publisher. Additionally, once editorial policy does change at a journal, it can only apply to newly submitted articles, so articles already in the submission pipeline aren’t affected by any new guidelines. It’s not uncommon for there to be a delay of a year between submission and publication in palaeontology, so for this and other reasons I’m not expecting to see visible change until 2012 – but I think we might have helped get the ball rolling, if nothing else…
The Paleontological Society journals (Paleobiology and the Journal of Paleontology) have recently adopted mandatory data submission to the Dryad repository, and the Journal of Vertebrate Paleontology has also improved its editorial policy with respect to certain types of data – but these are just a few of the many, many journals that publish palaeontological articles. I’m very much hoping that other journals will follow suit in the next few months and years by taking steps to improve the way in which research data is communicated, for the good of everyone: authors, publishers, funders and readers.

Anyway, here’s the Prezi I used to convey some of that (and more) at OKCon 2011. Huge thanks to the conference organisers for inviting me to give this talk. It was the most professionally run conference I’ve ever been to, by far. Great food, excellent WiFi provisioning, good comms, superb accommodation… I could go on. If the conference is on next year – I’ll be there for sure!

After more than 6 months of waiting, my long-overdue first paper has finally been published in Nature.

Why do I say ‘long overdue’?
Doesn’t it usually take a long time for academic papers to get published?

Well… if you read it here [paywalled], you’ll see it’s only 600 words and 1 figure (the absolute limit allowed by Nature’s editorial policy). It’s not really a full and lengthy contribution to science, just a simple “this previous paper is significantly wrong, and this is why”. I would have gone into greater depth of analysis had I been allowed to, but the strict word-limit rather prohibited this. I have no idea quite why it took so long to get accepted and published. I suspect Nature isn’t to blame. Politeness dictates that the original study’s authors can take a certain amount of time to reply to such letters – fair’s fair – but 6 months? Hmmm…

Anyway, some background and context:

My entire research thesis relies on re-analysing other people’s data. I have no access to specimens; I’m not a field palaeontologist; my research angle is ‘palaeoinformatics’ – large-scale re-analyses of hundreds, if not thousands, of palaeontological datasets. So naturally, when a new fossil specimen (reconstruction below) with novel phylogenetic data gets published in Nature, I justifiably take a keen interest.

Artistic reconstruction of Diania cactiformis by Mingguang Chi as first published in Nature

Using the freely-available software program TNT, one can quickly and easily re-analyse the data matrix given in the Liu et al. 2011 supplementary materials in a matter of seconds on any desktop computer. Any reasonably informed parsimony analysis of this data (there are numerous parameters and settings one could specify) does NOT generate the consensus phylogeny they depict; furthermore, the ‘real’ result generates cladograms in which the position of Diania cactiformis is significantly different.
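For readers unfamiliar with what a parsimony re-analysis actually computes, here is a minimal sketch of Fitch small-parsimony scoring – not TNT’s heuristic tree search, just the score-one-character-on-one-tree step at its core. The four-taxon tree and the states below are hypothetical, not taken from the Diania matrix:

```python
# Toy sketch of Fitch (small-parsimony) scoring for a single character.
# The tree and tip states are hypothetical examples, not the Liu et al. data.

def fitch_score(tree, states):
    """Count the minimum number of state changes on a rooted binary tree.

    tree:   nested tuples of tip names, e.g. (('A', 'B'), ('C', 'D'))
    states: dict mapping each tip name -> its character state
    """
    changes = 0

    def postorder(node):
        nonlocal changes
        if isinstance(node, str):          # a tip: its state set is fixed
            return {states[node]}
        left, right = (postorder(child) for child in node)
        common = left & right
        if common:                          # children agree: keep intersection
            return common
        changes += 1                        # children disagree: one change needed
        return left | right

    postorder(tree)
    return changes

# One binary character on the four-taxon tree ((A,B),(C,D)):
print(fitch_score((("A", "B"), ("C", "D")),
                  {"A": "0", "B": "0", "C": "1", "D": "1"}))  # 1 change
```

A real analysis repeats this over every character and searches many topologies for the minimum total; that search is where the parameters and collapsing rules mentioned above start to matter.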

It took several different re-analyses to be sure that I was onto something interesting. Frustratingly, in phylogenetics authors aren’t always as explicit in their methods sections as I’d like them to be. In this instance, it was not stated (as is true of most papers) which branch-collapsing rules were followed. Is it safe to assume that they used the default setting? I think not!

So, once I was sure there was a problem, I emailed the corresponding author with my concerns. This was her reply, verbatim:

Hello, Ross, thank you for your attention, now I am very busy with writing an application (the deadline is March 10th), could you wait me for few days, I will show you all the phylogeny tree which I got, the methods, the rules and so on. Thank you very much!
All best regards!
Jianni

I never did get a follow-up email. [I assume because I submitted a formal reply to Nature, which Jianni will have been notified of.] As a lowly grad student, I’m sadly used to getting fobbed off all the time…
Good thing I didn’t wait too long to formally reply, either. It soon became apparent via Facebook that another research group intended to challenge this paper too. Well done to David Legg et al. for their successful reply, also published in Nature.

Finally, I’d like to thank my supervisor Matthew Wills for helping me word my submission appropriately. I may have had the idea and done the analyses, but it might not have been published without his excellent editorial input into the wording of the piece. We tried very hard to be polite and to acknowledge the importance of the specimen, whilst necessarily pointing out the flaws of the analysis given.

I’ve been in Nature twice now in 2011! (I didn’t write the first piece, though.) Can I make it a hat-trick? Time will tell…

As for the Liu et al. counter-reply, which I’ve only just been allowed to see, like the rest of the world – I’m surprised certain bits of it made it past peer-review. Their use of the PTP test is particularly intriguing, and defies the current scientific consensus on the validity of this test:

“Additionally, a significant value of the partitioning tail permutation (PTP) test (P = 0.01) suggests the presence of a clear phylogenetic signal in the morphological data, also strongly supporting the topology shown.”

First of all, I’ll generously assume they intended to refer to the permutation tail probability test, not the “partitioning tail permutation” [sic] test. I don’t doubt the numerical result of the PTP test they present; it’s the inference they make from that result that I find illogical and unjustified, given the numerous papers that have critically examined the PTP test. The claim that this result somehow “strongly support[s] the topology shown”? Well, it doesn’t! One only has to look at the titles of the numerous papers on the PTP test (although please, by all means, read them!) to see that few if any can recommend its use for the purpose of supporting particular topologies:

  • Swofford et al (1996) The Topology-Dependent permutation test for monophyly does not test for monophyly.
  • Slowinski & Crother (1998) Is the PTP test useful? [Barely; only to determine if data has ‘signal’]
  • Peres-Neto & Marques (2000) When are random data not random, or is the PTP test useful? [Largely, no]
  • Harshman (2001) Does the T-PTP test tell us anything we want to know? [Largely, no]
  • Wilkinson, M. et al (2002) Type 1 error rates of the parsimony permutation tail probability test. [Points out errors in the Peres-Neto & Marques 2000 analysis, but still agrees that “the parsimony PTP cannot generally be assumed to guarantee well-supported phylogenetic hypotheses.”]

FYI, the original papers describing the PTP test are Archie (1989), Faith (1991) and Faith & Cranston (1991) [full citations and links given at the bottom].
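For context, the PTP test’s logic runs roughly as follows (an outline, not an endorsement): permute the character states among taxa many times, find the shortest tree for each permuted matrix, and ask whether the original data yields a shorter tree than nearly all of the permutations. Here is a toy sketch, assuming some tree-length function `score_fn` is supplied; it is simplified in that it permutes one character on a fixed scoring function, whereas the real test re-searches tree space for every permuted matrix:

```python
# Toy outline of the permutation tail probability (PTP) logic.
# 'score_fn' is an assumed, user-supplied tree-length function; the real
# test recomputes the shortest tree over ALL topologies per permutation.
import random

def ptp_like_p_value(score_fn, states, n_perm=999, seed=1):
    """Fraction of permuted datasets scoring at least as 'short' as the
    observed data. A small p is read as 'phylogenetic signal present' --
    which, as the papers listed above argue, does NOT equate to support
    for any particular topology."""
    rng = random.Random(seed)
    observed = score_fn(states)
    taxa, vals = list(states), list(states.values())
    count = 1                      # the observed data counts as one outcome
    for _ in range(n_perm):
        rng.shuffle(vals)          # randomly reassign states to taxa
        if score_fn(dict(zip(taxa, vals))) <= observed:
            count += 1
    return count / (n_perm + 1)
```

Even granting the arithmetic, the inferential leap from “signal present” to “this topology is strongly supported” is exactly what the literature above rejects.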

There are other problems with the Liu et al. reply, which I hope others may be able to see for themselves, but I’ll leave it at that for now. I perceive this statement, for instance:

In this context, Mounce and Wills seem to have overlooked the potential significance of their reanalysis of our data.

to be a bit of a ‘cheap shot’, especially considering the extremely short word-limit enforced on our comment by Nature. Of course we would have loved to have described the implications of the reanalysis for each and every character – but this simply wasn’t relevant enough to the criticism we were presenting. How this was deemed relevant to the rebuttal of the valid and justified points in the Mounce & Wills comment, I leave up to you to decide.

I’d also like to say how much I admire the work of many of the scientists on the Liu et al. paper. I think they produce some excellent work, and I’ve met Jason Dunlop in particular many times – he’s a nice guy and an excellent scientist. As a middle author, I’m pretty confident he had little to do with the problems inherent in the original Liu et al. Nature paper and the subsequent counter-reply. My own objections to both papers are nothing personal – just an obsession with good science and logical reasoning!

Welcome to the world of academia…

    References & Links:

Archie, J. W. A randomization test for phylogenetic information in systematic data. Systematic Biology 38, 239-252 (1989). URL http://dx.doi.org/10.2307/2992285.

Callaway, E. Fossil data enter the web period : Nature news. Nature (2011). URL http://www.nature.com/news/2011/110411/full/472150a.html.

Coddington, J. & Scharff, N. Problems with zero-length branches. Cladistics 10, 415-423 (1994). URL http://dx.doi.org/10.1111/j.1096-0031.1994.tb00187.x.

Faith, D. P. Cladistic permutation tests for monophyly and nonmonophyly. Systematic Zoology 40, 366-375 (1991). URL http://dx.doi.org/10.2307/2992329.

Faith, D. P. & Cranston, P. S. Could a cladogram this short have arisen by chance alone?: On permutation tests for cladistic structure. Cladistics 7, 1-28 (1991). URL http://dx.doi.org/10.1111/j.1096-0031.1991.tb00020.x.

Harshman, J. Does the T-PTP test tell us anything we want to know? Systematic Biology 50 (2001). URL http://dx.doi.org/10.2307/3070842.

Legg, D. A. et al. Lobopodian phylogeny reanalysed. Nature 476, E1 (2011). URL http://dx.doi.org/10.1038/nature10267.

Liu, J., Steiner, M., Dunlop, J., Keupp, H., Shu, D., Ou, Q., Han, J., Zhang, Z. & Zhang, X. An armoured Cambrian lobopodian from China with arthropod-like appendages. Nature 470, 526-530 (2011). URL http://dx.doi.org/10.1038/nature09704.

Liu, J. et al. Liu et al. reply. Nature 476, E1 (2011). URL http://dx.doi.org/10.1038/nature10268.

Mounce, R. & Wills, M. Phylogenetic position of Diania challenged. Nature 476 (7359) (2011). URL http://dx.doi.org/10.1038/nature10266.

Peres-Neto, P. R. & Marques, F. When are random data not random, or is the PTP test useful? Cladistics 16, 420-424 (2000). URL http://dx.doi.org/10.1111/j.1096-0031.2000.tb00361.x.

Slowinski, J. B. & Crother, B. I. Is the PTP test useful? Cladistics 14, 297-302 (1998). URL http://dx.doi.org/10.1111/j.1096-0031.1998.tb00340.x.

Swofford, D. L., Thorne, J. L., Felsenstein, J. & Wiegmann, B. M. The Topology-Dependent permutation test for monophyly does not test for monophyly. Syst Biol 45, 575-579 (1996). URL http://dx.doi.org/10.1093/sysbio/45.4.575.

Wilkinson, M., Peres Neto, P. R., Foster, P. G. & Moncrieff, C. B. Type 1 error rates of the parsimony permutation tail probability test. Systematic Biology 51, 524-527 (2002). URL http://dx.doi.org/10.2307/3070887.

Having just written a lengthy blog post / rant about publishing data for another blog (I’ll link to it later, if/when it gets published), I thought I’d post a technical demonstration of my issues here.

I need to extract simple matrices of numbers from research papers for my PhD research. Theoretically, I shouldn’t even need to do this. In an ideal world, where funding/author/researcher effort (overall) was time- and cost-efficient, all the data I need would be automatically (and indeed mandatorily) submitted to proper phylogenetic data repositories like TreeBASE or MorphoBank. But it isn’t. It’s just lazily and inappropriately lobbed into pdfs, with little thought or care as to its potential re-use legacy or fidelity…

So I have to do embarrassingly complicated and time-consuming things to re-extract this simple data AND then put it in a usable form (reformatting). Scarily enough, a lot of the time I have to resort to OCR scanning to re-extract data. Below is such a case, rather ironically from a freely-accessible (Open Access? it’s unclear) paper; sure, it’s openly accessible, but it’s certainly NOT openly re-usable data-wise!

Orozco, J. & Philips, T. K. Phylogenetic analysis of the American genus Euphoria and related groups based on morphological characters of adults (Coleoptera: Scarabaeidae: Cetoniinae: Cetoniini). Insect Systematics & Evolution 39-54 (2010).

The data I need is in a table printed sideways AND unnecessarily split between two separate pages: a classic example of deranged/lazy print-based thinking in an otherwise digital world. [Update: if it’s not clear already, my ire is for publishers only. I absolve all authors of any ‘blame’, particularly in the case I present here. Just FYI.]

[image: the two halves of the data matrix, shown side by side as split across pages 44 and 45]

Using the image below as my test object, I thought I’d test out a variety of free OCR options at my disposal. I’m aware these options are very basic; if you know of anything better, or have tips on how I can ‘train’ tesseract to perform better, then please fire away – I’m all ears.

1.) Google Docs OCR

2.) http://www.free-ocr.com/

3.) http://www.onlineocr.net/

4.) http://www.newocr.com/

5.) tesseract

Method (in brief): page 45 pdf -> png image -> rotate to correct orientation -> OCR
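As a rough sketch, that method can be scripted (assuming poppler’s pdftoppm, ImageMagick’s convert and tesseract are installed; the filenames, 300 dpi resolution and 90° rotation here are this example’s guesses, not a tested recipe):

```python
# Sketch of the page -> png -> rotate -> OCR method as subprocess calls.
# Assumes pdftoppm (poppler-utils), ImageMagick 'convert' and tesseract are
# on the PATH; the file names, dpi and rotation angle are assumptions.
import subprocess

def build_pipeline(pdf, page, out="matrix"):
    """Return the three commands needed, without running them."""
    return [
        # 1. render only the target page of the pdf to a 300 dpi png
        ["pdftoppm", "-png", "-r", "300", "-f", str(page), "-l", str(page), pdf, out],
        # 2. rotate the sideways-printed matrix upright
        #    (pdftoppm appends the page number to its output file name;
        #    depending on the document it may also zero-pad it)
        ["convert", f"{out}-{page}.png", "-rotate", "90", f"{out}.png"],
        # 3. OCR the rotated image; tesseract writes <out>.txt
        ["tesseract", f"{out}.png", out],
    ]

def run_pipeline(pdf, page):
    for cmd in build_pipeline(pdf, page):
        subprocess.run(cmd, check=True)

# e.g. run_pipeline("orozco2010.pdf", 45)
```

Separating command construction from execution also makes the pipeline easy to inspect or log before anything is actually run.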

[image: page 45 of the pdf, rendered to a png and rotated to the correct orientation – the OCR test image]

Results (verbatim):

1.) Google – where did the taxon names go!!!!

1201100000 210110000? 04?0100400 2010000001 2101100010 2101000000 2100011001 2101000000 2101000001 2100010210 010 1 100000 1 20 1 000000 200 0020000 2000020000 200 003 0000 101 1020000 1010020000 0001101000 2200021000 0101101000

00000 120 00 on 000020 10 0000020000 0000000000 0000000000 l 0000 020 1 0 DU 000 020 1 U DU O00 O20 1 U DD 000 020 00 O0 0000 10?!) 0000010010 0101000010 0100010010 1000010010 0100010010 1101000010 1100000010 0100010010 0000010010 0i‘000lI|010

0011011000 0011010001 110103011? 2000010000 2000030000 0012131001 0010031001 00—0030000 0000130000 0010130001 0002111000 00l—031000 0011031000 0011031000 0011031000 0012031000 0012031001 00-0010000 0011031000 00—00I0000′

0000000011 000010011

00-01000!)0000100001 1000100001 1000011011 1000001111 0000001111 0111001011 110?!) U1-1 0000001011 001101111

0000001011 0000001011 0000001111 0000001011 0000011011 01?00010l1 0000011111 0010001011

0001000 0000??0 10100-0 0l00??0 00200-0 101

0021000 0101000 0021000 202 1010 2021000 2021000 2021000 2021000 2021000 2021000‘ 0021000 2022000 1021000

2.) free-ocr: it wouldn’t work with .png, so I converted to .jpg

Diplognat/ya gagates 1201100000 0000012000 0011011000 0000000011 2000000011 0001000

Coptomia aliveri 210110000? 0000002010 0011010001 000010011 0000200010 0000??0

P/medirmu meridionalis 04?0100400 0000020000 110103011? 00-010000— 0010000011 10100-0

Poerilopharis so/:0:/2i 2010000001 0000000000 2000010000 0000100001 0000000001 0100??0

Tmesorrbina uiridirinrta 2101100010 0000000000 2000030000 1000100001 0000200100 00200-0

Oxycetoniajucunda 2101000000 1000002010 0012131001 1000011011 2000201011 1011??0

Oxyt/ryrmfinmta 2100011001 0000002010 0010031001 1000001111 100021101? ?0210-0

Cetonia aumta 2101000000 0000002010 00—0030000 0000001111 ?100001000 0021000

R/nzbdoti: sobrina 2101000001 0000002000 0000130000 0111001011 ?000210011 0101000

E/aphinis irrorata 2100010210 00000010?0 0010130001 110?0 01-1 000021001? 0021000

E auita 0101100000 0000010010 0002111000 0000001011 0000001000 2021010

E. basalis 1201000000 0101000010 001-031000 001101111 0001001011 2021000

E. biguttata 2000020000 0100010010 0011031000 0000001011 2000011011 2021000

E. lineoligem 2000020000 1000010010 0011031000 0000001011 2000011011 2021000

E. mnescms 2000030000 0100010010 0011031000 0000001111 0000011011 2021000

E subtammtasa 1011020000 1101000010 0012031000 0000001011 2000111011 2021000

E. histronim 1010020000 1100000010 0012031001 0000011011 3000011011 2021000

E. fasnfem 0001101000 0100010010 00—0010000 01?000l011 100000-011 0021000

E. pulcbelbz 2200021000 0000010010 0011031000 0000011111 0000012011 2022000

E. mndezei 0101101000 0?00000010 00—0010000 0010001011 1000001000 1021000

3.) onlineocr.net doesn’t seem to like numbers greater than 1

0001001 0001000001 1101000100 0000100-00 0100000010 0001011010 ’2.7.4, Y 0000000 1100100000 1111100000 000101100 0100100000 00010000ZZ 47/.P7.d’i 0001000 110-000001 1101000110 0000100-00 0100100010 0001011000 m=f,,,,,f ’1 0001000 1101100000 1101100000 100100100 0100000011 0000000101 4,00.o01 0001000 1101110000 1101000000 000100100 0100001011 0000001101 r5omazuwq,15 y 000100? 1101100000 1111000000 0001001100 0100100010 000000000 5,..,52m4′,Y 0001000 1101100000 1101000000 0001001100 0100100001 0000000000 ,,,,,,p2m,/ Y 0001000 1101100000 1101000000 0001001100 0100100010 0000000000 orm.19 ’1 0001000 1101001000 111101100 000100-100 0100001010 0000001001 5701 ’1 0101000 0001000000 1101000000 0001110000 0100100000 0000011010 ’1.4′Y 0001000 /100100000 1-100/011 1000010100 0101000000 0100100010 0001010 1100100001 1101001110 0000010000 0000000000 100000101? 0001000 0001000011 1111000000 0000000-00 0100000000 0000001011 0-01002 /101100001 1111000001 100100100 0100000000 1001100011 001101 1101000000 1101100001 1001010100 0100000001 0000001011 70,..,,,,,I.6,0 0-00Z00 0010000000 1000010001 000000000? 0000000000 0100011010 4,..,.’0′.,,,1″°52.1 000010 1000000000 1000010000 0000100000 0000000000 1000000100 ’0′”0″‘”7′?’ ‘V 0-00101 1100000100 -00■01010 i11001011 0000000000 004001010 0110000 0100000000 I 1001011110 1000101100 0100000000 1000011010 iddaqormuodop 0001010 1100000000 1100000000 0001101100 0000100000 0000011001 ,darvraoldi,

4.) newocr provides annoyingly columnar output

lD§pbgnanbagwgunn
Cfipmnfianfiuni
Phanhhuninrrufinnaflk
lhnihyimrksdwrhi
Tmrmrr/Jina viridirinrta
Chgnflvniajurunah
Cbgwkwwaj%nznu
Cnnnia aumta
R/Jabdntis mbrina
ELI];/Jinis irmrata
Eiawhw
Eihwafis
Eibgwnum
E. /inm/igmz
Eirannrnu
lisubnvnrnnwa
Eihknwnhw
Eifkrfikw
Eipuhhrfla
Eiranakzn

1201100000 210110000? 04?0100400 2010000001 2101100010 2101000000 2100011001 2101000000 2101000001 2100010210 0101100000 1201000000 2000020000 2000020000 2000030000 1011020000 1010020000 0001101000 2200021000 0101101000

0000012000 0000002010 0000020000 0000000000 0000000000 1000002010 0000002010 0000002010 0000002000 00000010?0 0000010010 0101000010 0100010010 1000010010 0100010010 1101000010 1100000010 0100010010 0000010010 0?00000010

0011011000 0011010001 110103011? 2000010000 2000030000 0012131001 0010031001 00-0030000 0000130000 0010130001 0002111000 001-031000 0011031000 0011031000 0011031000 0012031000 0012031001 00-0010000 0011031000 00-0010000

0000000011 000010011 00-010000- 0000100001 1000100001 1000011011 1000001111 0000001111 0111001011 110?001-1 0000001011 001101111 0000001011 0000001011 0000001111 0000001011 0000011011 01?0001011 0000011111 0010001011

2000000011 0000200010 0010000011 0000000001 0000200100 2000201011 100021101? ?100001000 ?000210011 000021001? 0000001000 0001001011 2000011011 2000011011 0000011011 2000111011 3000011011 100000-011 0000012011 1000001000

0001000 0000H0 10100-0 0100H0 00200-0 1011H0 ?0210-0 0021000 0101000 0021000 2021010 2021000 2021000 2021000 2021000 2021000 2021000 0021000 2022000 1021000

5.) tesseract: it only accepts flattened, decompressed .tif input

Dq>/agmztbugagutn 1201 100000 0000012000 001101 1000 0000000011 200000001 1 0001000

Captnmm a/mn: 2101 10000? 0000002010 0011010001 00001001 1 0000200010 0000??0

Pbardnmn mrnduzmz/u 04?0100400 0000020000 1 1010301 1? 0040100004 001000001 1 1010040

Parr:/apburxx xrbarbx 2010000001 0000000000 2000010000 0000100001 0000000001 0100??0

Tmrmnbzmz zuudmmta 2101 100010 0000000000 2000030000 1000100001 0000200100 0020040

Oxyrrtnnxapnxné 2101000000 1000002010 0012131001 1000011011 2000201011 1011??0

Oxytbyrrujiannta 210001 1001 0000002010 0010031001 1000001 1 11 100021 101? ?021040

Crtama mmzta 2101000000 0000002010 0040030000 0000001 1 11 ?100001000 0021000

Rhalzdatu mbnmz 2101000001 0000002000 0000130000 011 1001011 ?000210011 0101000

Ehphmn ummm 2100010210 0000001020 0010130001 110?0 0141 000021001? 0021000

E, mum 0101100000 0000010010 0002111000 0000001011 0000001000 2021010

E, [mm/:1 1201000000 0101000010 0014031000 001101111 0001001011 2021000

E, Ingmtuta 2000020000 0100010010 0011031000 0000001011 2000011011 2021000

E, /mm/zgmz 2000020000 1000010010 0011031000 0000001011 200001 101 1 2021000

E nmnrrm 2000030000 0100010010 0011031000 0000001 1 11 000001 101 1 2021000

E, mlztamrmam 1011020000 1101000010 0012031000 0000001011 2000111011 2021000

E, butmmnz 1010020000 1100000010 0012031001 0000011011 3000011011 2021000

E,jb.rry}m 0001 101000 0100010010 0040010000 01?000101 1 100000401 1 0021000

E, pu/[hr/M 2200021000 0000010010 0011031000 0000011111 0000012011 2022000

E mndrzr: 0101 101000 0?00000010 0040010000 0010001011 1000001000 1021000

Discussion:

Even though the image is reasonably high-res, not one of the tools managed 100% accuracy.

Only tesseract and free-ocr show any promise of being a viable solution. At the moment, free-ocr seems demonstrably better at the names than tesseract, but this could change if I start training tesseract on italicised Latin binomials.
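One partial remedy, whichever engine is used: since a cell in this matrix may legitimately contain only the symbols 0–4, ‘?’ and ‘-’, the classic OCR digit confusions can be repaired mechanically and anything else flagged for manual checking. A minimal sketch – the confusion table is an assumption drawn from the errors above, not an exhaustive list:

```python
# Minimal post-OCR cleanup sketch for matrix cells. Assumes valid cells may
# only contain 0-4, '?' and '-'; the substitution table below is a guess
# based on the confusions seen in the outputs above (l/I/i -> 1, O/o/D/U -> 0).
import re

CONFUSIONS = str.maketrans({"l": "1", "I": "1", "i": "1",
                            "O": "0", "o": "0", "D": "0", "U": "0"})
VALID = re.compile(r"^[0-4?\-]+$")

def clean_cell(cell):
    """Return (cleaned_cell, ok); ok is False if junk symbols remain."""
    cleaned = cell.translate(CONFUSIONS)
    return cleaned, bool(VALID.match(cleaned))

print(clean_cell("01?000l011"))    # ('01?0001011', True)
print(clean_cell("0i‘000lI|010"))  # still contains junk -> flagged False
```

This doesn’t remove the need for line-by-line checking, but it concentrates human attention on the cells that actually need it.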

Finally, to make it a usable nexus file, I have to add the metadata wrappers above and below the matrix, with extra parameters filled in – discovered only by manually reading through the paper – so that it eventually looks like this:

#NEXUS

[THIS IS AN OPTIONAL COMMENT:
Orozco, J. & Philips, T. K. Phylogenetic analysis of the American genus Euphoria and related groups based on morphological characters of adults (Coleoptera: Scarabaeidae: Cetoniinae: Cetoniini). Insect Systematics & Evolution 39-54 (2010). URL http://dx.doi.org/10.1163/187631210X12628483805310.]

BEGIN DATA;
DIMENSIONS NTAX=20 NCHAR=57;
FORMAT DATATYPE=STANDARD GAP=- MISSING=? SYMBOLS="0 1 2 3 4";
MATRIX
Diplognatha_gagates 1201100000 0000012000 0011011000 0000000011 2000000011 0001000
Coptomia_oliveri 210110000? 0000002010 0011010001 000010011 0000200010 0000??0
Phaedimus_meridionalis 04?0100400 0000020000 110103011? 00-010000- 0010000011 10100-0
Poecilopharis_schochi 2010000001 0000000000 2000010000 0000100001 0000000001 0100??0
Tmesorrhina_viridicincta 2101100010 0000000000 2000030000 1000100001 0000200100 00200-0
Oxycetonia_jucunda 2101000000 1000002010 0012131001 1000011011 2000201011 1011??0
Oxythyrea_funesta 2100011001 0000002010 0010031001 1000001111 100021101? ?0210-0
Cetonia_aurata 2101000000 0000002010 00-0030000 0000001111 ?100001000 0021000
Rhabdotis_sobrina 2101000001 0000002000 0000130000 0111001011 ?000210011 0101000
Elaphinis_irrorata 2100010210 00000010?0 0010130001 110?0 01-1 000021001? 0021000
E.avita 0101100000 0000010010 0002111000 0000001011 0000001000 2021010
E.basalis 1201000000 0101000010 001-031000 001101111 0001001011 2021000
E.biguttata 2000020000 0100010010 0011031000 0000001011 2000011011 2021000
E.lineoligera 2000020000 1000010010 0011031000 0000001011 2000011011 2021000
E.canescens 2000030000 0100010010 0011031000 0000001111 0000011011 2021000
E.subtomentosa 1011020000 1101000010 0012031000 0000001011 2000111011 2021000
E.histronica 1010020000 1100000010 0012031001 0000011011 3000011011 2021000
E.fascifera 0001101000 0100010010 00-0010000 01?0001011 100000-011 0021000
E.pulchella 2200021000 0000010010 0011031000 0000011111 0000012011 2022000
E.candezei 0101101000 0?00000010 00-0010000 0010001011 1000001000 1021000
;
end;

Final step: checking (validating) that the file works, by reading it into a suitable nexus-file-reading program, e.g. PAUP.

After all that effort – I discover that there’s a huge problem with the published dataset:

taxon 2, Coptomia oliveri, has only 56 characters coded in the printed matrix.

It’s not immediately obvious – partly because the matrix is printed sideways and split over two pages, which I feel rather aided and abetted this mistake in escaping the corrective gaze of peer review. Indeed, I’m led to believe that it’s rather common for underlying data to go entirely uncritiqued and unobserved during the review process. Rather odd, considering data is the very basis of most phylogeny papers!
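For what it’s worth, a mistake like this is trivially machine-detectable: count the coded characters in each row and compare against NCHAR. A minimal sketch, using two rows copied from the matrix above:

```python
# Sanity-check matrix rows against the declared NCHAR. The two rows below
# are copied verbatim from the matrix above; spaces are only cosmetic
# block separators, not coded characters.
rows = {
    "Diplognatha_gagates":
        "1201100000 0000012000 0011011000 0000000011 2000000011 0001000",
    "Coptomia_oliveri":
        "210110000? 0000002010 0011010001 000010011 0000200010 0000??0",
}
NCHAR = 57

for taxon, row in rows.items():
    n = len(row.replace(" ", ""))   # strip the cosmetic spaces before counting
    if n != NCHAR:
        print(f"{taxon}: {n} characters coded, expected {NCHAR}")
# -> Coptomia_oliveri: 56 characters coded, expected 57
```

A repository ingest step (or a reviewer) running exactly this kind of check would have caught the error before publication.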

This makes the entire dataset unusable for my purposes, and I’ll now have to run the gauntlet of finding a current email address for the author, composing and sending an email, and then waiting indefinitely for a possible response, in which they may or may not send me the correct data.

This is sadly not an uncommon situation, and I have tried to make my fellow colleagues aware of this. Only just recently I gave a talk here at the Systematics Association biennial meeting, on the very subject of data publishing. I doubt it’ll have made much impact, but I’m happy with myself for at least raising the issue in a public forum.

When will the madness of burying, corrupting, obfuscating and generally throwing away valuable data end? Data is far more useful in its original, usable format, and in most cases I’d argue it’s easier/better for all stakeholders (funders, authors, publishers, re-users, readers…) if it’s left that way.

But in the meantime, I’ll just have to keep digging away at those pdfs to extract the data I need…

*sighs*

Comments resurrected from the old blog:

Graeme Lloyd
This story is in no way familiar…

Personally I have NEVER trusted OCR to do the job right and so have typed in countless numbers of these by hand.

A Liked Reply
10 months ago 2 Likes

Ross Mounce
I should probably make clear as well (thanks for the prompt) that I too *never* trust OCR, I always carefully check line-by-line what it gives me – but I do use it for large and/or awkward matrices.

Edit Reply
10 months ago in reply to Graeme Lloyd

alf
Why OCR? It’s not an image PDF; easy enough to copy/paste the data out of the table: https://spreadsheets.google.co…

(not quite as easy as it could have been, but not too difficult)

Like Reply
10 months ago

Ross Mounce
You’ve copied out Table 1 which is on page 42, yes; I can do this easily too.

I want the Table 2 that’s split between pages 44 and 45. Get that with perfect fidelity (e.g. formating preserved) and not much fuss and then and only then will I be impressed ;)

:P

Edit Reply
10 months ago in reply to alf

Alf Eaton
Table 2 is in that spreadsheet as well, in a separate tab. It took a quick regexp and a couple of manual edits to restore the tabulation, that’s all.

Like Reply
10 months ago in reply to Ross Mounce

Ross Mounce
okay, pdftotext command-line hackery can harvest the data with fidelity. But the tabulation (formatting) is still screwed, and like you said requires manual effort to put back in place

Alf Eaton
I just copy/pasted the table from Adobe Reader – no need for command-line tools. That was my point, really: unless the PDF is a scanned image, the data’s right there to be copied… It’s not as nice as if it was HTML, obviously, but much easier than trying to OCR something that isn’t even an image :-)

bljog
Sorry, can’t remember the specifics of GOCR but tesseract-OCR performed better in later benchmarking for phylogeny labels.

Ross Mounce
Seems like I’m certainly not the only one that needs to do this kind of stuff. Joseph Hughes had a go a few years ago (and seems to have had some excellent results using GOCR):
http://evo-karma.blogspot.com/…

Slightly different data, but same problem, and same methods :)

Ross Mounce
Just tried GOCR. I converted the image into a .pcx file with GIMP and then ran: ‘gocr .pcx’ and got a load of gibberish back…

_n_n_n_nn_n_n_nn_________%%_,mM,:__5____,,__,v_,_,_v_, _____,_,_?_?____ _,_,_m_ _v___._, n,

methinks I’m probably doing something wrong…

While doing some research ahead of my imminent OKCon 2011 talk in Berlin, it came to my attention that the Open Access journal PLoS ONE is actually an excellent journal to publish in, with respect to Impact Factor.

In the Digital Age, journals are merely vessels in which we publish our research. Aside from the prestige of huge, well-established journals like Nature and Science, there's not all that much difference between the rest. Sure, there's cost to think about, the perceived quality of peer review, the length of time it takes to get from submission to print, and a few other factors, but really it's impact (not necessarily best measured by the Impact Factor metric, as Bjorn Brembs often points out) that, for me at least, matters most.

If the PLoS ONE Paleontology Collection was a journal it would have a 2010 Impact Factor of 4.15 which would make it the #1 Paleontology-specific journal (vs 2009 JCR ‘Paleontology’ journal scores). But it’s not a journal so perhaps the comparison is an unfair one. Likewise I’m sure if one collected together Nature palaeontological articles and treated them as a ‘journal’ that ‘Nature Palaeontology’ pseudo-journal would have a massive Impact Factor.

Here are my calculations (numbers listed in the order the publications appear in my personal online CUL library, linked to below):

Cites in 2010 to items published in: 2009 = 3 + 6 + 6 + 1 + 1 + 0 + 2 + 1 + 3 + 5 + 6 + 0 + 2 + 5 + 3 + 2 + 5 + 5 + 2 + 4 + 2 + 2 + 3 + 12 = 81

Cites in 2010 to items published in: 2008 = 0 + 6 + 4 + 3 + 4 + 5 + 4 + 10 + 4 + 3 + 6 + 0 + 5 + 8 + 15 + 8 = 85

Number of items published in: 2009 = 24 link to bibliography

Number of items published in: 2008 = 16 link to bibliography

Calculation: IF = (Cites to recent items / Number of recent items) = (81+85) / (24+16) = 4.15
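For anyone who wants to check the arithmetic, here is the same back-of-envelope calculation as a short Python 3 sketch, with the citation counts copied verbatim from the lists above:

```python
# Citations received in 2010 by items published in each year,
# copied from the counts listed above.
cites_2009 = [3, 6, 6, 1, 1, 0, 2, 1, 3, 5, 6, 0, 2, 5, 3, 2,
              5, 5, 2, 4, 2, 2, 3, 12]
cites_2008 = [0, 6, 4, 3, 4, 5, 4, 10, 4, 3, 6, 0, 5, 8, 15, 8]

items_2009, items_2008 = 24, 16  # items published in each year

# IF = cites to recent items / number of recent items
impact_factor = (sum(cites_2009) + sum(cites_2008)) / (items_2009 + items_2008)
print(round(impact_factor, 2))  # 4.15
```

The sums come out at 81 and 85 respectively, giving 166 citations over 40 items, i.e. 4.15.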

Of course, Thomson Reuters' official JCR probably doesn't count citations from journals such as "Caminhos de Geografia", and Google Scholar (which I used because it's much quicker, easier and more Open than WoK) doesn't always provide the correct year metadata for each article. But still, as a rough estimate I think this is quite impressive. Well done PLoS!

The task now is to convince fellow palaeontologists that it's worth publishing here.

Every day I get hugely frustrated that I can’t access articles published in otherwise excellent journals such as Neues Jahrbuch für Geologie und Paläontologie, Abhandlungen, the Canadian Journal of Earth Sciences, and Zootaxa.

These journals and authors who publish in this Closed manner aren’t doing themselves any favours IMO. What’s the point of publishing research if only a very select few people can read it?

Granted, many palaeontologists will happily send you a PDF if you ask for one, either directly via email or on a mailing list such as VRTPALEO, but those routes don't always work…

Whether it be 'Gold' Open Access or 'Green' Open Access, it's a simple matter of logic that Open Access is beneficial for authors and readers alike.