Show me the data!

Taking inspiration from Cameron Neylon, I have written a DRAFT letter to my local MP urging him to support the recommendations in the Hargreaves Review of Intellectual Property.

[UPDATE: Letter now sent :) ]

 

Dear Don Foster,

As a graduate research student at the University of Bath, currently using text-mining techniques to do scientific research in an efficient and comprehensive manner, I urge you to sign this early day motion (no. 151, tabled 11.06.2012).

Aside from my tweets to @DonFosterMP last night I have never before been moved to formally contact you, but this urgently requires your action.

Simple desktop computers can interpret vast amounts of digital information. The capacity, tools and eagerness already exist to enable scientists to do systematic reviews, knowledge syntheses, and innovative analyses on a scale never before imagined. But there is one barrier alone that stifles all this strong potential for good: current UK copyright law.

The exceptions for digital content proposed by Professor Hargreaves in his review would be a boon for research (http://www.ipo.gov.uk/ipreview-finalreport.pdf; Chapter 5). It is galling that 87% of the research contained within UK PubMed Central cannot be legally mined for information (p47 of the report). This is especially exasperating when we are allowed to manually ‘human-read’ nearly all of this content. We are simply not allowed to use machines to read the literature efficiently for us.

Current UK copyright law is outdated and is sometimes the *only* factor holding back scientific research. We need to remove this unnecessary artificial barrier to let UK scientists perform world-class research with modern and innovative tools and ideas. Otherwise we will be left in the Dark Ages, instead of the Bright, Shiny Digital Economy of the Future.

Yours sincerely,

Ross Mounce

PhD Candidate & Panton Fellow
Fossils, Phylogeny and Macroevolution Research Group
University of Bath, http://about.me/rossmounce

Further resources:
* The Hargreaves Report
* PMR’s response to the Hargreaves report
* TechDirt – UK Publishers Pretend To Embrace Copyright Reform… In Order To Kill Copyright Reform
* Glyn Moody’s – Review of the UK Government Response to the Hargreaves Review


Libre redistribution – a key facet of Open Access

May 28th, 2012 | Posted by rmounce in Content Mining | Open Access

I have previously commented on other blogs that, uniquely, with BOAI-compliant Open Access literature one is able to redistribute research however one wishes (provided proper attribution is given). I believe this to be hugely beneficial, and perhaps a rather under-appreciated facet of the many benefits offered by Open Access publishing.

Below is an expanded version of the comment I made on Cameron Neylon’s excellent blog Science in the Open on this very theme (and please do read Cameron’s post too for greater context):

Decentralized journal/article distribution is already happening.

I have 20,000+ PLoS articles on my computer right now. You can get them too – via BioTorrents. When compressed (as initially provided there) it’s less than 16 GB of files – a trivial amount for anyone with a broadband connection. I can now (and do!) take PLoS on a USB stick with me wherever I go, allowing me to do research on trains, planes, and in remote locations completely hassle-free, without even an internet connection. It was easy to download (pretty much 1-click) too via my high-speed institutional connection – and didn’t overload PLoS’s servers because I didn’t *get* the articles from their servers. With peer-to-peer file sharing the load is balanced between seeders (and in turn, I’m now seeding this torrent too, to help share the load). If all institutions/libraries agreed to help seed the world’s research literature, without copyright restriction on electronic redistribution (which we could do tomorrow if it weren’t for the legal copyright barriers imposed by most traditional subscription-access publishers), doing literature research would be pretty much frictionless! We could even get papers & data on campus much quicker over the campus LAN rather than the internet.

Institutions already agree to help distribute code, e.g. R and its multitude of packages – this is hugely beneficial, and helps share the costs associated with bandwidth – so why not for research publications? The PLoS corpus is a great way to try out content mining ideas – it shows you how easy academic life *could* be if everything was Open Access. I’ve run some simple scripts on it myself. I’m not sure the simple things I did, such as string matching, could be classified as ‘text mining’ – but one thing I do know is that it was 100,000 times easier/quicker doing this locally, machine-reading files, rather than doing it paper by paper, negotiating paywalls (where do I click, how many hoops do I have to jump through before I’m let in, what information are the ‘helpful’ tracking cookies keeping about me…) and getting cut off by publishers. It’s worth pointing out as well that once you have all the literature you need on your computer, you don’t even need the internet to do your research! In less economically developed countries, with weaker telecoms infrastructure, I’d imagine this would be a real boon for research.

It’s a window on the world that *could* be possible if we just changed our attitude with respect to copyright and research publishing. That PLoS, BMC and other Open Access publishers use the Creative Commons Attribution Licence makes this all possible.
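For a sense of how little code local string matching needs, here is a minimal Python sketch (not the scripts mentioned above). It assumes the corpus has been unpacked into a folder of XML files; the folder name `plos_corpus`, the `*.xml` pattern and the function name `find_matches` are all illustrative assumptions.

```python
from pathlib import Path


def find_matches(corpus_dir, needle, pattern="*.xml"):
    """Return sorted paths of files under corpus_dir whose text contains needle.

    Matching is a plain case-insensitive substring test, nothing cleverer.
    """
    needle = needle.lower()
    hits = []
    for path in sorted(Path(corpus_dir).rglob(pattern)):
        # errors="ignore" skips over any bytes that are not valid UTF-8
        text = path.read_text(encoding="utf-8", errors="ignore")
        if needle in text.lower():
            hits.append(path)
    return hits


if __name__ == "__main__":
    # "plos_corpus" is a hypothetical local folder of article files
    for hit in find_matches("plos_corpus", "phylogen"):
        print(hit)
```

Everything here runs offline against local files, which is the whole point: no paywalls, no rate limits, no network.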

I predict that the rights to electronically redistribute, and machine-read research will be vital for 21st century research – yet currently we academics often wittingly or otherwise relinquish these rights to publishers. This has got to stop. The world is networked, thus scholarly literature should move with the times and be openly networked too.

In short, I think research would be a whole lot easier to do, and ultimately (all things considered) be more cost-effective, if all future publicly-funded research could be made BOAI-compliant Open Access. This is just my opinion – you are welcome to disagree in the comments section below. I sincerely hope I don’t sound like an Open Access ‘zealot’, for this is certainly not my intention.

Using off-the-shelf OCR to re-extract data

July 20th, 2011 | Posted by rmounce in Content Mining | Palaeontology | Phylogenetics

Having just written a lengthy blog post / rant about publishing data for another blog (I’ll link to it later if/when it gets published), I thought I’d post a technical demonstration of my issues here.

I need to extract simple matrices of numbers from research papers for my PhD research. Theoretically, I shouldn’t even need to do this. In an ideal world where funding/author/researcher effort (overall) was time- and cost-efficient, all the data I need would be automatically (and indeed mandatorily) submitted to proper phylogenetic data repositories like TreeBASE or MorphoBank, but it isn’t. It’s just lazily and inappropriately lobbed into pdfs with little thought or care as to its potential re-use, legacy or fidelity…

So I have to do embarrassingly complicated and time-consuming things to re-extract this simple data AND then put it in a usable form (reformatting). Scarily enough, a lot of the time, I have to resort to OCR scanning to re-extract data. Below is such a case, rather ironically from a freely-accessible (Open Access? it’s unclear) paper; sure it’s openly accessible but it’s certainly NOT openly re-usable data-wise!

Orozco, J. & Philips, T. K. Phylogenetic analysis of the American genus Euphoria and related groups based on morphological characters of adults (Coleoptera: Scarabaeidae: Cetoniinae: Cetoniini). Insect Systematics & Evolution 39-54 (2010)

The data I need is in a table printed sideways AND unnecessarily split between two separate pages: a classic example of deranged/lazy print-based thinking in an otherwise digital world. [Update: if it’s not clear already, my ire is for publishers only. I absolve all authors of any ‘blame’ particularly in the case I present here, just FYI ]

[Image: the table as printed, sideways and split across two pages]

Using this (below) image as my test object, I thought I’d test out a variety of free OCR options at my disposal – I’m aware these options are very basic. If you know of anything better, or have tips on how I can ‘train’ tesseract to perform better, then please fire away – I’m all ears.

1.) Google Docs OCR

2.) http://www.free-ocr.com/

3.) http://www.onlineocr.net/

4.) http://www.newocr.com/

5.) tesseract

Method (in brief): page 45 pdf -> png image -> rotate to correct orientation -> OCR
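The method above could be scripted roughly as follows. This is a sketch under assumptions, not the exact commands used for this post: it assumes `pdftoppm` (from poppler-utils), ImageMagick's `convert` (named `magick` in newer releases) and `tesseract` are installed, and the file names, the 300 dpi resolution and the 90-degree rotation are guesses that need adjusting to the actual page.

```python
import subprocess


def build_commands(pdf, page, stem, degrees=90, dpi=300):
    """Return the command lines for: one PDF page -> PNG -> rotated TIFF -> OCR."""
    png = f"{stem}.png"
    tif = f"{stem}.tif"
    return [
        # pdftoppm: render only the requested page to <stem>.png
        ["pdftoppm", "-f", str(page), "-l", str(page),
         "-r", str(dpi), "-png", "-singlefile", pdf, stem],
        # ImageMagick: turn the sideways table upright, write an uncompressed TIFF
        ["convert", png, "-rotate", str(degrees), "-compress", "None", tif],
        # tesseract: writes the recognised text to <stem>.txt
        ["tesseract", tif, stem],
    ]


def run_pipeline(pdf, page, stem, **kwargs):
    """Run each step in order, stopping at the first failure."""
    for cmd in build_commands(pdf, page, stem, **kwargs):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # "paper.pdf" and "page45" are placeholder names
    for cmd in build_commands("paper.pdf", 45, "page45"):
        print(" ".join(cmd))
```

Building the commands separately from running them makes it easy to print and eyeball each step before anything touches the files.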

[Image: the test image]

Results (verbatim):

1.) Google – where did the taxon names go!!!!

1201100000 210110000? 04?0100400 2010000001 2101100010 2101000000 2100011001 2101000000 2101000001 2100010210 010 1 100000 1 20 1 000000 200 0020000 2000020000 200 003 0000 101 1020000 1010020000 0001101000 2200021000 0101101000

00000 120 00 on 000020 10 0000020000 0000000000 0000000000 l 0000 020 1 0 DU 000 020 1 U DU O00 O20 1 U DD 000 020 00 O0 0000 10?!) 0000010010 0101000010 0100010010 1000010010 0100010010 1101000010 1100000010 0100010010 0000010010 0i‘000lI|010

0011011000 0011010001 110103011? 2000010000 2000030000 0012131001 0010031001 00—0030000 0000130000 0010130001 0002111000 00l—031000 0011031000 0011031000 0011031000 0012031000 0012031001 00-0010000 0011031000 00—00I0000′

0000000011 000010011

00-01000!)0000100001 1000100001 1000011011 1000001111 0000001111 0111001011 110?!) U1-1 0000001011 001101111

0000001011 0000001011 0000001111 0000001011 0000011011 01?00010l1 0000011111 0010001011

0001000 0000??0 10100-0 0l00??0 00200-0 101

0021000 0101000 0021000 202 1010 2021000 2021000 2021000 2021000 2021000 2021000‘ 0021000 2022000 1021000

2.) free-ocr: wouldn’t work with .png so I converted to .jpg

Diplognat/ya gagates 1201100000 0000012000 0011011000 0000000011 2000000011 0001000

Coptomia aliveri 210110000? 0000002010 0011010001 000010011 0000200010 0000??0

P/medirmu meridionalis 04?0100400 0000020000 110103011? 00-010000— 0010000011 10100-0

Poerilopharis so/:0:/2i 2010000001 0000000000 2000010000 0000100001 0000000001 0100??0

Tmesorrbina uiridirinrta 2101100010 0000000000 2000030000 1000100001 0000200100 00200-0

Oxycetoniajucunda 2101000000 1000002010 0012131001 1000011011 2000201011 1011??0

Oxyt/ryrmfinmta 2100011001 0000002010 0010031001 1000001111 100021101? ?0210-0

Cetonia aumta 2101000000 0000002010 00—0030000 0000001111 ?100001000 0021000

R/nzbdoti: sobrina 2101000001 0000002000 0000130000 0111001011 ?000210011 0101000

E/aphinis irrorata 2100010210 00000010?0 0010130001 110?0 01-1 000021001? 0021000

E auita 0101100000 0000010010 0002111000 0000001011 0000001000 2021010

E. basalis 1201000000 0101000010 001-031000 001101111 0001001011 2021000

E. biguttata 2000020000 0100010010 0011031000 0000001011 2000011011 2021000

E. lineoligem 2000020000 1000010010 0011031000 0000001011 2000011011 2021000

E. mnescms 2000030000 0100010010 0011031000 0000001111 0000011011 2021000

E subtammtasa 1011020000 1101000010 0012031000 0000001011 2000111011 2021000

E. histronim 1010020000 1100000010 0012031001 0000011011 3000011011 2021000

E. fasnfem 0001101000 0100010010 00—0010000 01?000l011 100000-011 0021000

E. pulcbelbz 2200021000 0000010010 0011031000 0000011111 0000012011 2022000

E. mndezei 0101101000 0?00000010 00—0010000 0010001011 1000001000 1021000

3.) onlineocr.net doesn’t seem to like numbers greater than 1

0001001 0001000001 1101000100 0000100-00 0100000010 0001011010 ’2.7.4, Y 0000000 1100100000 1111100000 000101100 0100100000 00010000ZZ 47/.P7.d’i 0001000 110-000001 1101000110 0000100-00 0100100010 0001011000 m=f,,,,,f ’1 0001000 1101100000 1101100000 100100100 0100000011 0000000101 4,00.o01 0001000 1101110000 1101000000 000100100 0100001011 0000001101 r5omazuwq,15 y 000100? 1101100000 1111000000 0001001100 0100100010 000000000 5,..,52m4′,Y 0001000 1101100000 1101000000 0001001100 0100100001 0000000000 ,,,,,,p2m,/ Y 0001000 1101100000 1101000000 0001001100 0100100010 0000000000 orm.19 ’1 0001000 1101001000 111101100 000100-100 0100001010 0000001001 5701 ’1 0101000 0001000000 1101000000 0001110000 0100100000 0000011010 ’1.4′Y 0001000 /100100000 1-100/011 1000010100 0101000000 0100100010 0001010 1100100001 1101001110 0000010000 0000000000 100000101? 0001000 0001000011 1111000000 0000000-00 0100000000 0000001011 0-01002 /101100001 1111000001 100100100 0100000000 1001100011 001101 1101000000 1101100001 1001010100 0100000001 0000001011 70,..,,,,,I.6,0 0-00Z00 0010000000 1000010001 000000000? 0000000000 0100011010 4,..,.’0′.,,,1″°52.1 000010 1000000000 1000010000 0000100000 0000000000 1000000100 ’0′”0″‘”7′?’ ‘V 0-00101 1100000100 -00■01010 i11001011 0000000000 004001010 0110000 0100000000 I 1001011110 1000101100 0100000000 1000011010 iddaqormuodop 0001010 1100000000 1100000000 0001101100 0000100000 0000011001 ,darvraoldi,

4.) newocr provides annoyingly columnar output

lD§pbgnanbagwgunn

Cfipmnfianfiuni

Phanhhuninrrufinnaflk

lhnihyimrksdwrhi

Tmrmrr/Jina viridirinrta

Chgnflvniajurunah

Cbgwkwwaj%nznu

Cnnnia aumta

R/Jabdntis mbrina

ELI];/Jinis irmrata

Eiawhw

Eihwafis

Eibgwnum

E. /inm/igmz

Eirannrnu

lisubnvnrnnwa

Eihknwnhw

Eifkrfikw

Eipuhhrfla

Eiranakzn

1201100000

210110000?

04?0100400

2010000001

2101100010

2101000000

2100011001

2101000000

2101000001

2100010210

0101100000

1201000000

2000020000

2000020000

2000030000

1011020000

1010020000

0001101000

2200021000

0101101000

0000012000

0000002010

0000020000

0000000000

0000000000

1000002010

0000002010

0000002010

0000002000

00000010?0

0000010010

0101000010

0100010010

1000010010

0100010010

1101000010

1100000010

0100010010

0000010010

0?00000010

0011011000

0011010001

110103011?

2000010000

2000030000

0012131001

0010031001

00-0030000

0000130000

0010130001

0002111000

001-031000

0011031000

0011031000

0011031000

0012031000

0012031001

00-0010000

0011031000

00-0010000

0000000011

000010011

00-010000-

0000100001

1000100001

1000011011

1000001111

0000001111

0111001011

110?001-1

0000001011

001101111

0000001011

0000001011

0000001111

0000001011

0000011011

01?0001011

0000011111

0010001011

2000000011

0000200010

0010000011

0000000001

0000200100

2000201011

100021101?

?100001000

?000210011

000021001?

0000001000

0001001011

2000011011

2000011011

0000011011

2000111011

3000011011

100000-011

0000012011

1000001000

0001000

0000H0

10100-0

0100H0

00200-0

1011H0

?0210-0

0021000

0101000

0021000

2021010

2021000

2021000

2021000

2021000

2021000

2021000

0021000

2022000

1021000

5.) tesseract: only takes a flattened, uncompressed .tif

Dq>/agmztbugagutn 1201 100000 0000012000 001101 1000 0000000011 200000001 1 0001000

Captnmm a/mn: 2101 10000? 0000002010 0011010001 00001001 1 0000200010 0000??0

Pbardnmn mrnduzmz/u 04?0100400 0000020000 1 1010301 1? 0040100004 001000001 1 1010040

Parr:/apburxx xrbarbx 2010000001 0000000000 2000010000 0000100001 0000000001 0100??0

Tmrmnbzmz zuudmmta 2101 100010 0000000000 2000030000 1000100001 0000200100 0020040

Oxyrrtnnxapnxné 2101000000 1000002010 0012131001 1000011011 2000201011 1011??0

Oxytbyrrujiannta 210001 1001 0000002010 0010031001 1000001 1 11 100021 101? ?021040

Crtama mmzta 2101000000 0000002010 0040030000 0000001 1 11 ?100001000 0021000

Rhalzdatu mbnmz 2101000001 0000002000 0000130000 011 1001011 ?000210011 0101000

Ehphmn ummm 2100010210 0000001020 0010130001 110?0 0141 000021001? 0021000

E, mum 0101100000 0000010010 0002111000 0000001011 0000001000 2021010

E, [mm/:1 1201000000 0101000010 0014031000 001101111 0001001011 2021000

E, Ingmtuta 2000020000 0100010010 0011031000 0000001011 2000011011 2021000

E, /mm/zgmz 2000020000 1000010010 0011031000 0000001011 200001 101 1 2021000

E nmnrrm 2000030000 0100010010 0011031000 0000001 1 11 000001 101 1 2021000

E, mlztamrmam 1011020000 1101000010 0012031000 0000001011 2000111011 2021000

E, butmmnz 1010020000 1100000010 0012031001 0000011011 3000011011 2021000

E,jb.rry}m 0001 101000 0100010010 0040010000 01?000101 1 100000401 1 0021000

E, pu/[hr/M 2200021000 0000010010 0011031000 0000011111 0000012011 2022000

E mndrzr: 0101 101000 0?00000010 0040010000 0010001011 1000001000 1021000

Discussion:

Even though the image is reasonably high-res, not one of the tools managed 100% accuracy.

Only tesseract and free-ocr show any promise of being a viable solution. At the moment free-ocr seems demonstrably better at the names than tesseract, but this could change if I start training tesseract on italicised Latin binomials.
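Since two tools come close, one cheap check is to compare their output row by row and only eyeball the positions where they disagree. The sketch below is an assumption-laden illustration: the function names are hypothetical, and it guesses that the character data starts at the first whitespace-separated token containing a digit, which fails if the OCR puts a digit into a taxon name.

```python
def matrix_string(line):
    """Return the character data of one OCR'd row with all spaces removed.

    Assumption: the taxon name contains no digits, so the data starts at
    the first token that does.
    """
    tokens = line.split()
    for i, tok in enumerate(tokens):
        if any(ch.isdigit() for ch in tok):
            return "".join(tokens[i:])
    return ""


def disagreements(line_a, line_b):
    """Compare two readings of the same row.

    Returns (diffs, len_a, len_b), where diffs lists (position, char_a, char_b)
    with positions counted from 1.
    """
    a, b = matrix_string(line_a), matrix_string(line_b)
    diffs = [(i, x, y) for i, (x, y) in enumerate(zip(a, b), start=1) if x != y]
    return diffs, len(a), len(b)


if __name__ == "__main__":
    free_ocr = ("Tmesorrbina uiridirinrta 2101100010 0000000000 2000030000 "
                "1000100001 0000200100 00200-0")
    tesseract = ("Tmrmnbzmz zuudmmta 2101 100010 0000000000 2000030000 "
                 "1000100001 0000200100 0020040")
    print(disagreements(free_ocr, tesseract))
```

On the two readings of that row taken from the results above, the only disagreement is at character 56, where one tool reads the gap symbol `-` and the other reads `4`. Agreement between tools is not proof of correctness, so a line-by-line check against the page is still needed.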

Finally, to make it a usable nexus file I have to add the metadata wrappers above and below with extra parameters added in, discovered only by manually reading through the paper; so it eventually looks like this:

#NEXUS

[THIS IS AN OPTIONAL COMMENT:

Orozco, J. & Philips, T. K. Phylogenetic analysis of the american genus euphoria and related groups based on morphological characters of adults (coleoptera: Scarabaeidae: Cetoniinae: Cetoniini). Insect Systematics & Evolution 39-54 (2010). URL http://dx.doi.org/10.1163/187631210X12628483805310.]

BEGIN DATA;

DIMENSIONS NTAX =20 NCHAR=57;

FORMAT DATATYPE = STANDARD GAP =- MISSING =? SYMBOLS = "0 1 2 3 4";

MATRIX

Diplognatha_gagates 1201100000 0000012000 0011011000 0000000011 2000000011 0001000

Coptomia_oliveri 210110000? 0000002010 0011010001 000010011 0000200010 0000??0

Phaedimus_meridionalis 04?0100400 0000020000 110103011? 00-010000- 0010000011 10100-0

Poecilopharis_schochi 2010000001 0000000000 2000010000 0000100001 0000000001 0100??0

Tmesorrhina_viridicincta 2101100010 0000000000 2000030000 1000100001 0000200100 00200-0

Oxycetonia_jucunda 2101000000 1000002010 0012131001 1000011011 2000201011 1011??0

Oxythyrea_funesta 2100011001 0000002010 0010031001 1000001111 100021101? ?0210-0

Cetonia_aurata 2101000000 0000002010 00-0030000 0000001111 ?100001000 0021000

Rhabdotis_sobrina 2101000001 0000002000 0000130000 0111001011 ?000210011 0101000

Elaphinis_irrorata 2100010210 00000010?0 0010130001 110?0 01-1 000021001? 0021000

E.avita 0101100000 0000010010 0002111000 0000001011 0000001000 2021010

E.basalis 1201000000 0101000010 001-031000 001101111 0001001011 2021000

E.biguttata 2000020000 0100010010 0011031000 0000001011 2000011011 2021000

E.lineoligera 2000020000 1000010010 0011031000 0000001011 2000011011 2021000

E.canescens 2000030000 0100010010 0011031000 0000001111 0000011011 2021000

E.subtomentosa 1011020000 1101000010 0012031000 0000001011 2000111011 2021000

E.histronica 1010020000 1100000010 0012031001 0000011011 3000011011 2021000

E.fascifera 0001101000 0100010010 00-0010000 01?000l011 100000-011 0021000

E.pulchella 2200021000 0000010010 0011031000 0000011111 0000012011 2022000

E.candezei 0101101000 0?00000010 00-0010000 0010001011 1000001000 1021000

;

end;
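Writing the wrapper by hand is error-prone, so here is a hedged sketch of generating it from cleaned rows. The function `to_nexus` and its parameters are hypothetical; NTAX is taken from the number of rows, while NCHAR, the symbol list and the gap and missing characters still have to come from reading the paper.

```python
def to_nexus(rows, nchar, symbols="0 1 2 3 4", gap="-", missing="?"):
    """Build a minimal NEXUS DATA block as a string.

    rows is a list of (taxon_name, character_string) pairs; spaces inside
    the character strings are dropped.
    """
    # Unquoted NEXUS taxon names cannot contain spaces, so use underscores
    cleaned = [(name.strip().replace(" ", "_"), "".join(chars.split()))
               for name, chars in rows]
    width = max(len(name) for name, _ in cleaned)
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"DIMENSIONS NTAX={len(cleaned)} NCHAR={nchar};",
        f'FORMAT DATATYPE=STANDARD GAP={gap} MISSING={missing} SYMBOLS="{symbols}";',
        "MATRIX",
    ]
    for name, chars in cleaned:
        # pad names so the matrix columns line up
        lines.append(f"{name.ljust(width)}  {chars}")
    lines += [";", "END;"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    demo = [("Cetonia aurata", "2101000000 0000002010"),
            ("E.avita", "0101100000 0000010010")]
    print(to_nexus(demo, nchar=20))
```

The demo uses only the first twenty characters of two rows, purely to keep the example short.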

Final step: validation, i.e. checking that the file works by reading it into a suitable NEXUS-reading program, e.g. PAUP.

After all that effort – I discover that there’s a huge problem with the published dataset:

taxon 2: Coptomia oliveri only has 56 characters coded in the printed matrix.

It’s not all that obvious immediately – partly because the matrix is printed sideways and split over two pages; I feel this has rather helped the mistake escape the corrective gaze of peer review. Indeed I’m led to believe that it’s rather common for underlying data to go entirely uncritiqued and unobserved during the review process. Rather odd, considering data is the very basis for most phylogenetic papers!
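A length check of this kind is easy to automate before a matrix goes anywhere near PAUP. The following is a minimal sketch with hypothetical function names; it assumes each row is a taxon name without spaces followed by space-separated blocks of characters, as in the matrix above.

```python
def row_lengths(matrix_lines):
    """Map each taxon name to its number of coded characters.

    Spaces between blocks are ignored; lines without data are skipped.
    """
    lengths = {}
    for line in matrix_lines:
        parts = line.split()
        if len(parts) < 2:
            continue
        name, blocks = parts[0], parts[1:]
        lengths[name] = sum(len(block) for block in blocks)
    return lengths


def wrong_length(matrix_lines, nchar):
    """Return only the taxa whose row length differs from nchar."""
    return {name: n for name, n in row_lengths(matrix_lines).items() if n != nchar}


if __name__ == "__main__":
    rows = [
        "Diplognatha_gagates 1201100000 0000012000 0011011000 0000000011 2000000011 0001000",
        "Coptomia_oliveri 210110000? 0000002010 0011010001 000010011 0000200010 0000??0",
    ]
    print(wrong_length(rows, 57))
```

Run on those two rows with NCHAR=57, it flags Coptomia_oliveri as having 56 characters.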

This makes the entire dataset unusable for my purposes and I’ll now have to run the gauntlet of finding a current email address for the author, composing and sending an email, and then waiting indefinitely for a possible response in which they may or may not send me the correct data.

This is sadly not an uncommon situation, and I have tried to make my colleagues aware of this. Only recently I gave a talk here at the Systematics Association biennial meeting on the very subject of data publishing. I doubt it’ll have made much impact, but I’m happy with myself for at least raising the issue in a public forum.

When will the madness of burying, corrupting, obfuscating and generally throwing away valuable data end? Data is far more useful in its original, usable formats, and in most cases I’d argue it’s easier/better for all stakeholders (funders, authors, publishers, re-users, readers…) if it’s left this way.

But in the meantime, I’ll just have to keep digging away at those pdfs to extract the data I need…

*sighs*

Comments resurrected from the old blog:

Graeme Lloyd
This story is in no way familiar…

Personally I have NEVER trusted OCR to do the job right and so have typed in countless numbers of these by hand.

10 months ago

Ross Mounce
I should probably make clear as well (thanks for the prompt) that I too *never* trust OCR, I always carefully check line-by-line what it gives me – but I do use it for large and/or awkward matrices.

10 months ago in reply to Graeme Lloyd

alf
Why OCR? It’s not an image PDF; easy enough to copy/paste the data out of the table: https://spreadsheets.google.co…

(not quite as easy as it could have been, but not too difficult)

10 months ago

Ross Mounce
You’ve copied out Table 1 which is on page 42, yes; I can do this easily too.

I want the Table 2 that’s split between pages 44 and 45. Get that with perfect fidelity (e.g. formatting preserved) and not much fuss and then and only then will I be impressed ;)

:P

10 months ago in reply to alf

Alf Eaton
Table 2 is in that spreadsheet as well, in a separate tab. It took a quick regexp and a couple of manual edits to restore the tabulation, that’s all.

10 months ago in reply to Ross Mounce

Ross Mounce
okay, pdftotext command-line hackery can harvest the data with fidelity. But the tabulation (formatting) is still screwed, and like you said requires manual effort to put back in place

10 months ago in reply to Alf Eaton

Alf Eaton
I just copy/pasted the table from Adobe Reader – no need for command-line tools. That was my point, really: unless the PDF is a scanned image, the data’s right there to be copied… It’s not as nice as if it was HTML, obviously, but much easier than trying to OCR something that isn’t even an image :-)

10 months ago in reply to Ross Mounce

bljog
Sorry, can’t remember the specifics of GOCR but tesseract-OCR performed better in later benchmarking for phylogeny labels.

10 months ago

Ross Mounce
Seems like I’m certainly not the only one that needs to do this kind of stuff. Joseph Hughes had a go a few years ago (and seems to have had some excellent results using GOCR):
http://evo-karma.blogspot.com/…

Slightly different data, but same problem, and same methods :)

10 months ago

Ross Mounce
Just tried GOCR. I converted the image into a .pcx file with GIMP and then ran: ‘gocr .pcx’ and got a load of gibberish back…

_n_n_n_nn_n_n_nn_________%%_,mM,:__5____,,__,v_,_,_v_, _____,_,_?_?____ _,_,_m_ _v___._, n,

methinks I’m probably doing something wrong…

10 months ago in reply to Ross Mounce