Having just written a lengthy blog post / rant about publishing data for another blog (I’ll link to it later if/when it gets published), I thought I’d post a technical demonstration of my issues here.
I need to extract simple matrices of numbers from research papers for my PhD research. Theoretically, I shouldn’t even need to do this. In an ideal world where funding/author/researcher effort (overall) was time and cost efficient, all the data I need would be automatically (and indeed mandatorily) submitted to proper phylogenetic data repositories like TreeBASE or MorphoBank, but it isn’t. It’s just lazily and inappropriately lobbed into pdfs with little thought or care as to its potential re-use, legacy or fidelity…
So I have to do embarrassingly complicated and time-consuming things to re-extract this simple data AND then put it in a usable form (reformatting). Scarily enough, a lot of the time, I have to resort to OCR scanning to re-extract data. Below is such a case, rather ironically from a freely-accessible (Open Access? it’s unclear) paper; sure it’s openly accessible but it’s certainly NOT openly re-usable data-wise!
Orozco, J. & Philips, T. K. Phylogenetic analysis of the American genus Euphoria and related groups based on morphological characters of adults (Coleoptera: Scarabaeidae: Cetoniinae: Cetoniini). Insect Systematics & Evolution 39-54 (2010)
The data I need is in a table printed sideways AND unnecessarily split between two separate pages: a classic example of deranged/lazy print-based thinking in an otherwise digital world. [Update: if it’s not clear already, my ire is for publishers only. I absolve all authors of any ‘blame’ particularly in the case I present here, just FYI ]
Using the image below as my test object, I thought I’d test out a variety of free OCR options at my disposal – I’m aware these options are very basic. If you know of anything better, or have tips on how I can ‘train’ tesseract to perform better, then please fire away – I’m all ears.
1.) Google Docs OCR
2.) http://www.free-ocr.com/
3.) http://www.onlineocr.net/
4.) http://www.newocr.com/
5.) tesseract
Method (in brief): page 45 pdf -> png image -> rotate to correct orientation -> OCR
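For anyone wanting to repeat the method above, here is a minimal sketch of how the steps could be scripted. It is an assumption-laden illustration, not a record of what I actually ran: it assumes Poppler's `pdftoppm`, ImageMagick's `convert` and `tesseract` are installed, and the function names, the file stem and the 90-degree rotation are all placeholders. Note that `pdf_page` is the page's position within the PDF file, which need not equal the printed journal page number.

```python
import subprocess


def build_ocr_commands(pdf_path, pdf_page, degrees=90, stem="page"):
    """Return the command lines for: PDF page -> PNG -> rotated TIFF -> OCR text.

    Hypothetical helper: 'degrees' and 'stem' are illustrative defaults only.
    """
    png = f"{stem}.png"
    tif = f"{stem}.tif"
    return [
        # Poppler: render a single page at 300 dpi; -singlefile writes <stem>.png
        ["pdftoppm", "-f", str(pdf_page), "-l", str(pdf_page),
         "-r", "300", "-png", "-singlefile", pdf_path, stem],
        # ImageMagick: rotate upright, drop the alpha channel, write uncompressed TIFF
        ["convert", png, "-rotate", str(degrees),
         "-alpha", "off", "-compress", "none", tif],
        # tesseract writes the recognised text to <stem>.txt
        ["tesseract", tif, stem],
    ]


def run_pipeline(pdf_path, pdf_page, degrees=90, stem="page"):
    """Run the three commands in order and return the name of the text file."""
    for cmd in build_ocr_commands(pdf_path, pdf_page, degrees, stem):
        subprocess.run(cmd, check=True)  # stop at the first failing step
    return f"{stem}.txt"
```

Keeping the command construction separate from the execution makes it easy to print the commands and try each one by hand first, which is handy when the rotation angle has to be found by trial and error.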
Results (verbatim):
1.) Google – where did the taxon names go!!!!
1201100000 210110000? 04?0100400 2010000001 2101100010 2101000000 2100011001 2101000000 2101000001 2100010210 010 1 100000 1 20 1 000000 200 0020000 2000020000 200 003 0000 101 1020000 1010020000 0001101000 2200021000 0101101000
00000 120 00 on 000020 10 0000020000 0000000000 0000000000 l 0000 020 1 0 DU 000 020 1 U DU O00 O20 1 U DD 000 020 00 O0 0000 10?!) 0000010010 0101000010 0100010010 1000010010 0100010010 1101000010 1100000010 0100010010 0000010010 0i‘000lI|010
0011011000 0011010001 110103011? 2000010000 2000030000 0012131001 0010031001 00—0030000 0000130000 0010130001 0002111000 00l—031000 0011031000 0011031000 0011031000 0012031000 0012031001 00-0010000 0011031000 00—00I0000′
0000000011 000010011
00-01000!)0000100001 1000100001 1000011011 1000001111 0000001111 0111001011 110?!) U1-1 0000001011 001101111
0000001011 0000001011 0000001111 0000001011 0000011011 01?00010l1 0000011111 0010001011
0001000 0000??0 10100-0 0l00??0 00200-0 101
0021000 0101000 0021000 202 1010 2021000 2021000 2021000 2021000 2021000 2021000‘ 0021000 2022000 1021000
2.) free-ocr: wouldn’t work with .png so I converted to .jpg
Diplognat/ya gagates 1201100000 0000012000 0011011000 0000000011 2000000011 0001000
Coptomia aliveri 210110000? 0000002010 0011010001 000010011 0000200010 0000??0
P/medirmu meridionalis 04?0100400 0000020000 110103011? 00-010000— 0010000011 10100-0
Poerilopharis so/:0:/2i 2010000001 0000000000 2000010000 0000100001 0000000001 0100??0
Tmesorrbina uiridirinrta 2101100010 0000000000 2000030000 1000100001 0000200100 00200-0
Oxycetoniajucunda 2101000000 1000002010 0012131001 1000011011 2000201011 1011??0
Oxyt/ryrmfinmta 2100011001 0000002010 0010031001 1000001111 100021101? ?0210-0
Cetonia aumta 2101000000 0000002010 00—0030000 0000001111 ?100001000 0021000
R/nzbdoti: sobrina 2101000001 0000002000 0000130000 0111001011 ?000210011 0101000
E/aphinis irrorata 2100010210 00000010?0 0010130001 110?0 01-1 000021001? 0021000
E auita 0101100000 0000010010 0002111000 0000001011 0000001000 2021010
E. basalis 1201000000 0101000010 001-031000 001101111 0001001011 2021000
E. biguttata 2000020000 0100010010 0011031000 0000001011 2000011011 2021000
E. lineoligem 2000020000 1000010010 0011031000 0000001011 2000011011 2021000
E. mnescms 2000030000 0100010010 0011031000 0000001111 0000011011 2021000
E subtammtasa 1011020000 1101000010 0012031000 0000001011 2000111011 2021000
E. histronim 1010020000 1100000010 0012031001 0000011011 3000011011 2021000
E. fasnfem 0001101000 0100010010 00—0010000 01?000l011 100000-011 0021000
E. pulcbelbz 2200021000 0000010010 0011031000 0000011111 0000012011 2022000
E. mndezei 0101101000 0?00000010 00—0010000 0010001011 1000001000 1021000
3.) onlineocr.net doesn’t seem to like numbers greater than 1
0001001 0001000001 1101000100 0000100-00 0100000010 0001011010 ’2.7.4, Y 0000000 1100100000 1111100000 000101100 0100100000 00010000ZZ 47/.P7.d’i 0001000 110-000001 1101000110 0000100-00 0100100010 0001011000 m=f,,,,,f ’1 0001000 1101100000 1101100000 100100100 0100000011 0000000101 4,00.o01 0001000 1101110000 1101000000 000100100 0100001011 0000001101 r5omazuwq,15 y 000100? 1101100000 1111000000 0001001100 0100100010 000000000 5,..,52m4′,Y 0001000 1101100000 1101000000 0001001100 0100100001 0000000000 ,,,,,,p2m,/ Y 0001000 1101100000 1101000000 0001001100 0100100010 0000000000 orm.19 ’1 0001000 1101001000 111101100 000100-100 0100001010 0000001001 5701 ’1 0101000 0001000000 1101000000 0001110000 0100100000 0000011010 ’1.4′Y 0001000 /100100000 1-100/011 1000010100 0101000000 0100100010 0001010 1100100001 1101001110 0000010000 0000000000 100000101? 0001000 0001000011 1111000000 0000000-00 0100000000 0000001011 0-01002 /101100001 1111000001 100100100 0100000000 1001100011 001101 1101000000 1101100001 1001010100 0100000001 0000001011 70,..,,,,,I.6,0 0-00Z00 0010000000 1000010001 000000000? 0000000000 0100011010 4,..,.’0′.,,,1″°52.1 000010 1000000000 1000010000 0000100000 0000000000 1000000100 ’0′”0″‘”7′?’ ‘V 0-00101 1100000100 -00■01010 i11001011 0000000000 004001010 0110000 0100000000 I 1001011110 1000101100 0100000000 1000011010 iddaqormuodop 0001010 1100000000 1100000000 0001101100 0000100000 0000011001 ,darvraoldi,
4.) newocr provides annoyingly columnar output
lD§pbgnanbagwgunn
Cfipmnfianfiuni
Phanhhuninrrufinnaflk
lhnihyimrksdwrhi
Tmrmrr/Jina viridirinrta
Chgnflvniajurunah
Cbgwkwwaj%nznu
Cnnnia aumta
R/Jabdntis mbrina
ELI];/Jinis irmrata
Eiawhw
Eihwafis
Eibgwnum
E. /inm/igmz
Eirannrnu
lisubnvnrnnwa
Eihknwnhw
Eifkrfikw
Eipuhhrfla
Eiranakzn
1201100000
210110000?
04?0100400
2010000001
2101100010
2101000000
2100011001
2101000000
2101000001
2100010210
0101100000
1201000000
2000020000
2000020000
2000030000
1011020000
1010020000
0001101000
2200021000
0101101000
0000012000
0000002010
0000020000
0000000000
0000000000
1000002010
0000002010
0000002010
0000002000
00000010?0
0000010010
0101000010
0100010010
1000010010
0100010010
1101000010
1100000010
0100010010
0000010010
0?00000010
0011011000
0011010001
110103011?
2000010000
2000030000
0012131001
0010031001
00-0030000
0000130000
0010130001
0002111000
001-031000
0011031000
0011031000
0011031000
0012031000
0012031001
00-0010000
0011031000
00-0010000
0000000011
000010011
00-010000-
0000100001
1000100001
1000011011
1000001111
0000001111
0111001011
110?001-1
0000001011
001101111
0000001011
0000001011
0000001111
0000001011
0000011011
01?0001011
0000011111
0010001011
2000000011
0000200010
0010000011
0000000001
0000200100
2000201011
100021101?
?100001000
?000210011
000021001?
0000001000
0001001011
2000011011
2000011011
0000011011
2000111011
3000011011
100000-011
0000012011
1000001000
0001000
0000H0
10100-0
0100H0
00200-0
1011H0
?0210-0
0021000
0101000
0021000
2021010
2021000
2021000
2021000
2021000
2021000
2021000
0021000
2022000
1021000
5.) tesseract: only takes flattened, uncompressed .tif
Dq>/agmztbugagutn 1201 100000 0000012000 001101 1000 0000000011 200000001 1 0001000
Captnmm a/mn: 2101 10000? 0000002010 0011010001 00001001 1 0000200010 0000??0
Pbardnmn mrnduzmz/u 04?0100400 0000020000 1 1010301 1? 0040100004 001000001 1 1010040
Parr:/apburxx xrbarbx 2010000001 0000000000 2000010000 0000100001 0000000001 0100??0
Tmrmnbzmz zuudmmta 2101 100010 0000000000 2000030000 1000100001 0000200100 0020040
Oxyrrtnnxapnxné 2101000000 1000002010 0012131001 1000011011 2000201011 1011??0
Oxytbyrrujiannta 210001 1001 0000002010 0010031001 1000001 1 11 100021 101? ?021040
Crtama mmzta 2101000000 0000002010 0040030000 0000001 1 11 ?100001000 0021000
Rhalzdatu mbnmz 2101000001 0000002000 0000130000 011 1001011 ?000210011 0101000
Ehphmn ummm 2100010210 0000001020 0010130001 110?0 0141 000021001? 0021000
E, mum 0101100000 0000010010 0002111000 0000001011 0000001000 2021010
E, [mm/:1 1201000000 0101000010 0014031000 001101111 0001001011 2021000
E, Ingmtuta 2000020000 0100010010 0011031000 0000001011 2000011011 2021000
E, /mm/zgmz 2000020000 1000010010 0011031000 0000001011 200001 101 1 2021000
E nmnrrm 2000030000 0100010010 0011031000 0000001 1 11 000001 101 1 2021000
E, mlztamrmam 1011020000 1101000010 0012031000 0000001011 2000111011 2021000
E, butmmnz 1010020000 1100000010 0012031001 0000011011 3000011011 2021000
E,jb.rry}m 0001 101000 0100010010 0040010000 01?000101 1 100000401 1 0021000
E, pu/[hr/M 2200021000 0000010010 0011031000 0000011111 0000012011 2022000
E mndrzr: 0101 101000 0?00000010 0040010000 0010001011 1000001000 1021000
Discussion:
Even though the image is reasonably high-res, not one of the tools managed 100% accuracy.
Only tesseract and free-ocr show any promise of being a viable solution. At the moment free-ocr seems demonstrably better at the names than tesseract but this could change if I start training tesseract in italicised Latin binomials.
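To put a rough number on "promise", the character states in an OCR'd row can be compared position by position with a hand-corrected row. The sketch below is only an illustration under stated assumptions: `split_ocr_row` and `agreement` are hypothetical helpers, a "matrix token" is assumed to be any run of digits, `?` and `-`, and stray spaces inside a block (as tesseract produces) are simply closed up.

```python
import re

# A token counts as matrix data if it contains only digits, '?' or '-'
MATRIX_TOKEN = re.compile(r"^[0-9?\-]+$")


def split_ocr_row(line):
    """Split one OCR'd row into (taxon name, character states with spaces removed)."""
    # Some tools return the gap symbol as an en or em dash; map both to '-'
    line = line.replace("\u2013", "-").replace("\u2014", "-")
    tokens = line.split()
    for i, token in enumerate(tokens):
        if MATRIX_TOKEN.match(token):
            # Everything before the first matrix token is the name
            return " ".join(tokens[:i]), "".join(tokens[i:])
    return " ".join(tokens), ""  # no matrix data found on this line


def agreement(ocr_states, reference_states):
    """Fraction of positions that agree; 0.0 if the lengths differ or are zero."""
    if not reference_states or len(ocr_states) != len(reference_states):
        return 0.0
    same = sum(a == b for a, b in zip(ocr_states, reference_states))
    return same / len(reference_states)


# Usage: a tesseract row from above against the hand-corrected states
name, states = split_ocr_row(
    "Pbardnmn mrnduzmz/u 04?0100400 0000020000 1 1010301 1? "
    "0040100004 001000001 1 1010040"
)
reference = ("04?0100400" "0000020000" "110103011?"
             "00-010000-" "0010000011" "10100-0")
print(name, round(agreement(states, reference), 3))
```

A positional comparison like this only makes sense once the row lengths match; rows where the OCR has dropped or invented a character need aligning (or eyeballing) first, so this is a complement to line-by-line checking rather than a replacement for it.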
Finally, to make it a usable nexus file I have to add the metadata wrappers above and below with extra parameters added in, discovered only by manually reading through the paper; so it eventually looks like this:
#NEXUS
[THIS IS AN OPTIONAL COMMENT:
Orozco, J. & Philips, T. K. Phylogenetic analysis of the American genus Euphoria and related groups based on morphological characters of adults (Coleoptera: Scarabaeidae: Cetoniinae: Cetoniini). Insect Systematics & Evolution 39-54 (2010). URL http://dx.doi.org/10.1163/187631210X12628483805310.]
BEGIN DATA;
DIMENSIONS NTAX =20 NCHAR=57;
FORMAT DATATYPE = STANDARD GAP =- MISSING =? SYMBOLS = "0 1 2 3 4";
MATRIX
Diplognatha_gagates 1201100000 0000012000 0011011000 0000000011 2000000011 0001000
Coptomia_oliveri 210110000? 0000002010 0011010001 000010011 0000200010 0000??0
Phaedimus_meridionalis 04?0100400 0000020000 110103011? 00-010000- 0010000011 10100-0
Poecilopharis_schochi 2010000001 0000000000 2000010000 0000100001 0000000001 0100??0
Tmesorrhina_viridicincta 2101100010 0000000000 2000030000 1000100001 0000200100 00200-0
Oxycetonia_jucunda 2101000000 1000002010 0012131001 1000011011 2000201011 1011??0
Oxythyrea_funesta 2100011001 0000002010 0010031001 1000001111 100021101? ?0210-0
Cetonia_aurata 2101000000 0000002010 00-0030000 0000001111 ?100001000 0021000
Rhabdotis_sobrina 2101000001 0000002000 0000130000 0111001011 ?000210011 0101000
Elaphinis_irrorata 2100010210 00000010?0 0010130001 110?0 01-1 000021001? 0021000
E.avita 0101100000 0000010010 0002111000 0000001011 0000001000 2021010
E.basalis 1201000000 0101000010 001-031000 001101111 0001001011 2021000
E.biguttata 2000020000 0100010010 0011031000 0000001011 2000011011 2021000
E.lineoligera 2000020000 1000010010 0011031000 0000001011 2000011011 2021000
E.canescens 2000030000 0100010010 0011031000 0000001111 0000011011 2021000
E.subtomentosa 1011020000 1101000010 0012031000 0000001011 2000111011 2021000
E.histronica 1010020000 1100000010 0012031001 0000011011 3000011011 2021000
E.fascifera 0001101000 0100010010 00-0010000 01?0001011 100000-011 0021000
E.pulchella 2200021000 0000010010 0011031000 0000011111 0000012011 2022000
E.candezei 0101101000 0?00000010 00-0010000 0010001011 1000001000 1021000
;
end;
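Wrapping the cleaned rows by hand gets tedious, so here is a minimal sketch of generating the same kind of file programmatically. `to_nexus` is a hypothetical helper written for this post, not part of any NEXUS library; it assumes rows arrive as (name, states) pairs, and it does not escape a `]` inside the comment, which would end the NEXUS comment early.

```python
def to_nexus(rows, nchar, symbols="0 1 2 3 4", comment=None):
    """Build a NEXUS DATA block from a list of (taxon_name, states) pairs."""
    lines = ["#NEXUS"]
    if comment:
        lines.append(f"[{comment}]")  # NEXUS comments live in square brackets
    lines += [
        "BEGIN DATA;",
        f"DIMENSIONS NTAX={len(rows)} NCHAR={nchar};",
        f'FORMAT DATATYPE=STANDARD GAP=- MISSING=? SYMBOLS="{symbols}";',
        "MATRIX",
    ]
    # Taxon names must not contain spaces, so join genus and species with '_'
    names = [name.replace(" ", "_") for name, _ in rows]
    width = max(len(n) for n in names)
    for name, (_, states) in zip(names, rows):
        lines.append(f"{name.ljust(width)}  {states}")
    lines += [";", "END;"]
    return "\n".join(lines) + "\n"


# Usage with the first two taxa from the matrix above
rows = [
    ("Diplognatha gagates",
     "1201100000 0000012000 0011011000 0000000011 2000000011 0001000"),
    ("Coptomia oliveri",
     "210110000? 0000002010 0011010001 000010011 0000200010 0000??0"),
]
print(to_nexus(rows, nchar=57))
```

NTAX is taken from the number of rows, but NCHAR and the symbol list still have to come from reading the paper, which is exactly the manual step described above.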
Final step: validating the file by reading it into a suitable NEXUS-reading program, e.g. PAUP.
After all that effort – I discover that there’s a huge problem with the published dataset:
taxon 2: Coptomia oliveri only has 56 characters coded in the printed matrix.
It’s not all that obvious immediately – partly because the matrix is printed sideways and split over two pages; I feel this has rather helped the mistake escape the corrective gaze of peer review. Indeed I’m led to believe that it’s rather common for underlying data to go entirely uncritiqued and unobserved during the review process. Rather odd considering data is the very basis for most phylogenetic papers!
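A row-length check like the one that exposed this can also be done before the file ever reaches a NEXUS reader. The following is a small sketch under my own assumptions (`check_lengths` is a hypothetical helper, and the rows are passed in as (name, states) pairs with the printed spacing left in):

```python
def check_lengths(rows, nchar):
    """Return {taxon: actual_length} for every row that is not nchar characters long."""
    problems = {}
    for name, states in rows:
        # Spaces between blocks are only visual grouping, so ignore them
        n = len("".join(states.split()))
        if n != nchar:
            problems[name] = n
    return problems


# Usage with the first two rows of the matrix as printed
rows = [
    ("Diplognatha_gagates",
     "1201100000 0000012000 0011011000 0000000011 2000000011 0001000"),
    ("Coptomia_oliveri",
     "210110000? 0000002010 0011010001 000010011 0000200010 0000??0"),
]
print(check_lengths(rows, nchar=57))  # {'Coptomia_oliveri': 56}
```

The check can say that a row is short, but not which character is missing; that still needs the original data from the authors.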
This makes the entire dataset unusable for my purposes and I’ll now have to run the gauntlet of finding a current email address for the author, composing and sending an email, and then waiting indefinitely for a possible response in which they may or may not send me the correct data.
This is sadly not an uncommon situation, and I have tried to make my colleagues aware of it. Only recently I gave a talk here at the Systematics Association biennial meeting on the very subject of data publishing. I doubt it’ll have made much impact, but I’m happy with myself for at least raising the issue in a public forum.
When will the madness of burying, corrupting, obfuscating and generally throwing away valuable data end? Data is far more useful in its original, usable formats, and in most cases I’d argue it’s easier/better for all stakeholders (funders, authors, publishers, re-users, readers…) if it’s left this way.
But in the meantime, I’ll just have to keep digging away at those pdfs to extract the data I need…
*sighs*
Comments resurrected from the old blog:
Graeme Lloyd
This story is in no way familiar…
Personally I have NEVER trusted OCR to do the job right and so have typed in countless numbers of these by hand.
10 months ago 2 Likes
Ross Mounce
I should probably make clear as well (thanks for the prompt) that I too *never* trust OCR, I always carefully check line-by-line what it gives me – but I do use it for large and/or awkward matrices.
10 months ago in reply to Graeme Lloyd
alf
Why OCR? It’s not an image PDF; easy enough to copy/paste the data out of the table: https://spreadsheets.google.co…
(not quite as easy as it could have been, but not too difficult)
10 months ago
Ross Mounce
You’ve copied out Table 1 which is on page 42, yes; I can do this easily too.
I want the Table 2 that’s split between pages 44 and 45. Get that with perfect fidelity (e.g. formatting preserved) and not much fuss and then and only then will I be impressed ;)
:P
10 months ago in reply to alf
Alf Eaton
Table 2 is in that spreadsheet as well, in a separate tab. It took a quick regexp and a couple of manual edits to restore the tabulation, that’s all.
10 months ago in reply to Ross Mounce
Ross Mounce
okay, pdftotext command-line hackery can harvest the data with fidelity. But the tabulation (formatting) is still screwed, and like you said requires manual effort to put back in place
10 months ago in reply to Alf Eaton
Alf Eaton
I just copy/pasted the table from Adobe Reader – no need for command-line tools. That was my point, really: unless the PDF is a scanned image, the data’s right there to be copied… It’s not as nice as if it was HTML, obviously, but much easier than trying to OCR something that isn’t even an image :-)
10 months ago in reply to Ross Mounce
bljog
Sorry, can’t remember the specifics of GOCR but tesseract-OCR performed better in later benchmarking for phylogeny labels.
10 months ago
Ross Mounce
Seems like I’m certainly not the only one that needs to do this kind of stuff. Joseph Hughes had a go a few years ago (and seems to have had some excellent results using GOCR):
http://evo-karma.blogspot.com/…
Slightly different data, but same problem, and same methods :)
10 months ago
Ross Mounce
Just tried GOCR. I converted the image into a .pcx file with GIMP and then ran: ‘gocr .pcx’ and got a load of gibberish back…
_n_n_n_nn_n_n_nn_________%%_,mM,:__5____,,__,v_,_,_v_, _____,_,_?_?____ _,_,_m_ _v___._, n,
methinks I’m probably doing something wrong…
10 months ago in reply to Ross Mounce