Show me the data!

Deep indexing supplementary data files

June 20th, 2015 | Posted by rmounce in Conservation Hackathon | Content Mining | Hack days

To prove my point about the way that supplementary data files bury useful data, making it utterly indiscoverable to most, I decided to do a little experiment (in relation to text mining for museum specimen identifiers, but also perhaps with some relevance to the NHM Conservation Hackathon):

I collected the links for all Biology Letters supplementary data files. I then filtered out the non-textual media such as audio, video and image files, then downloaded the remaining content.

A breakdown of file extensions encountered in this downloaded subset:

763 .doc files
543 .pdf files
109 .docx files
75 .xls files
53 .xlsx files
25 .csv files
19 .txt files
14 .zip files
2 .rtf files
2 .nex files
1 .xml file
1 “.xltx” file

I then converted some of these unfriendly formats into simpler, more easily searchable plain text formats:

for i in *.zip ; do unzip $i -d /home/ross/work/royal-soc-si/biol-letters-supp-info/transformed/unzipped_$i ; done
for i in *.docx ; do docx2txt $i /home/ross/work/royal-soc-si/biol-letters-supp-info/transformed/$i.txt ; done
for i in *.doc ; do catdoc -a $i > /home/ross/work/royal-soc-si/biol-letters-supp-info/transformed/$i.txt ; done
for i in *.pdf ; do pdftotext $i > /home/ross/work/royal-soc-si/biol-letters-supp-info/transformed/$i.txt ; done
for i in *.rtf ; do unrtf --text $i > /home/ross/work/royal-soc-si/biol-letters-supp-info/transformed/$i.txt ; done
for i in *.xls ; do in2csv $i > /home/ross/work/royal-soc-si/biol-letters-supp-info/transformed/$i.csv ; done
for i in *.xlsx ; do in2csv $i > /home/ross/work/royal-soc-si/biol-letters-supp-info/transformed/$i.csv ; done


Now everything is properly searchable and indexable!

In a matter of seconds I can find NHM specimen identifiers that might not otherwise be mentioned in the full text of the paper, without actually wasting any time manually reading any papers. Note, not all the ‘hits’ are true positives but most are, and those that aren’t e.g. “NHMQEVLEGYKKKYE” are easy to distinguish as NOT valid NHM specimen identifiers:

$ grep -ior 'nhm............'
20120949_ESM_1.txt:NHMUK R6792), N
20120949_ESM_1.txt:NHMUK R8646) in
20120949_ESM_1.txt:NHMUK R36615, ‘
20120949_ESM_1.txt:NHMUK R36620), 
20120949_ESM_1.txt:NHMUK R16586). 
20120949_ESM_1.txt:NHMUK R36620) a
20120949_ESM_1.txt:NHMUK R16586) a
20120949_ESM_1.txt:NHMUK R6856 was
20120949_ESM_1.txt:NHMUK Charig ar
20120949_ESM_1.txt:NHMUK R6856 and
20120949_ESM_1.txt:NHMUK R6856 wer
20120949_ESM_1.txt:NHMUK R6856 wer
20120949_ESM_1.txt:NHMUK R6856 and
20120949_ESM_1.txt:NHM R6856 just 
20120949_ESM_1.txt:NHM R6856 (figu
20120949_ESM_1.txt:NHMUK R6856 had
20120949_ESM_1.txt:NHMUK R3592) an
20120949_ESM_1.txt:NHMUK R6856. Th
20120949_ESM_1.txt:NHMUK R6856). M
20120949_ESM_1.txt:NHMUK with the 
20120949_ESM_1.txt:NHMUK R6856 is 
20120949_ESM_1.txt:NHMUK R6856 sug
20120949_ESM_1.txt:NHMUK R6856. Th
20120949_ESM_1.txt:NHMUK R6856 sug
20120949_ESM_1.txt:NHMUK R6856, bu
20120949_ESM_1.txt:NHMUK R6586 is 
20120949_ESM_1.txt:NHMUK R6586 als
20120949_ESM_1.txt:NHMUK R6586, we
20120949_ESM_1.txt:NHMUK R6586 can
20120949_ESM_1.txt:NHMUK R6586 was
20120949_ESM_1.txt:NHMUK R6586 may
20120949_ESM_1.txt:NHMUK R6856 are
20120949_ESM_1.txt:NHMUK R6856) av
20120949_ESM_1.txt:NHMUK R6795) in
20120949_ESM_1.txt:NHMUK R6795 is 
20120949_ESM_1.txt:NHMUK R6856 and
20120949_ESM_1.txt:NHMUK R6856 and
20120949_ESM_1.txt:NHMUK R6856 was
20120949_ESM_1.txt:NHMUK R6856 fal
20120949_ESM_1.txt:NHMUK R6856 is 
20120949_ESM_1.txt:NHMUK R6856 + S
20120949_ESM_1.txt:NHMUK R6856 whe
20120949_ESM_1.txt:NHMUK R6856 + S
20120949_ESM_1.txt:NHMUK 1, Tanzan
20120949_ESM_1.txt:NHMUK R6856 and
20120949_ESM_1.txt:NHMUK Charig ar
20120949_ESM_1.txt:NHMUK R6856 to 
20120949_ESM_1.txt:NHMUK) for perm
20120949_ESM_1.txt:NHMUK) for acce
20120949_ESM_1.txt:NHMUK Image Res
20120949_ESM_1.txt:NHMUK, The Natu
rsbl20060505supp.txt:NHM uncataloged
rsbl20060505supp.txt:NHM uncataloged
rsbl20070502supp01.doc.txt:NHM) provided v
rsbl20090302supp3.doc.txt:NHM = The Natur
rsbl20090302supp3.doc.txt:NHMW = Natural 
rsbl20090302supp3.doc.txt:NHM E32070	Plan
rsbl20090302supp3.doc.txt:NHM EE5034	Plan
rsbl20090302supp3.doc.txt:NHM E4381	Plank
rsbl20090302supp3.doc.txt:NHM E10384	Plan
rsbl20090302supp3.doc.txt:NHM EE4825	Plan
rsbl20090302supp3.doc.txt:NHM E8389	Plank
rsbl20090302supp3.doc.txt:NHM EE8132	Plan
rsbl20090302supp3.doc.txt:NHM EE5585	Non-
rsbl20090302supp3.doc.txt:NHM EE ?	Non-pl
rsbl20090302supp3.doc.txt:NHM EE1961	?	?	
rsbl20090302supp3.doc.txt:NHM E35551	Plan
rsbl20090302supp3.doc.txt:NHM E76539	?	Up
rsbl20090302supp3.doc.txt:NHM EE4055	Plan
rsbl20090302supp3.doc.txt:NHM E81494	Plan
rsbl20090302supp3.doc.txt:NHM EE4631	?	Ap
rsbl20090302supp3.doc.txt:NHM EE4632	?	Ap
rsbl20090302supp3.doc.txt:NHM EE4641	Plan
rsbl20090302supp3.doc.txt:NHM E20098	Plan
rsbl20090302supp3.doc.txt:NHM EE4404	Plan
rsbl20090302supp3.doc.txt:NHM EE8397	Plan
rsbl20090302supp3.doc.txt:NHM EE2372	?	Ma
rsbl20090302supp3.doc.txt:NHM E79718	Plan
rsbl20090302supp3.doc.txt:NHM E40574	Plan
rsbl20090302supp3.doc.txt:NHM EE4524	Plan
rsbl20090302supp3.doc.txt:NHM E79415	Non-
rsbl20090302supp3.doc.txt:NHM E45372	?	Tu
rsbl20090302supp3.doc.txt:NHM EE2321	Plan
rsbl20090302supp3.doc.txt:NHM EE2262	Plan
rsbl20090302supp3.doc.txt:NHM EE4610	Plan
rsbl20090302supp3.doc.txt:NHM E4052	Non-p
rsbl20090302supp3.doc.txt:NHM EE191	Plank
rsbl20090302supp3.doc.txt:NHM EE2353	Plan
rsbl20090302supp3.doc.txt:NHM E4034	Plank
rsbl20090302supp3.doc.txt:NHM EE2432	Plan
rsbl20090302supp3.doc.txt:NHM E4176	Plank
rsbl20090302supp3.doc.txt:NHM EE4048	?	Ma
rsbl20090302supp3.doc.txt:NHM E9892	Plank
rsbl20090302supp3.doc.txt:NHM E4979	?	Tur
rsbl20090302supp3.doc.txt:NHM E75821	Plan
rsbl20090302supp3.doc.txt:NHM E40974	?	Se
rsbl20090302supp3.doc.txt:NHM E79094	Plan
rsbl20090302supp3.doc.txt:NHM E582	Plankt
rsbl20090302supp3.doc.txt:NHMW 2005z0083/
rsbl20090302supp3.doc.txt:NHM E82582	?	U.
rsbl20090302supp3.doc.txt:NHM EE7698	Plan
rsbl20090302supp3.doc.txt:NHM E9392	Plank
rsbl20090302supp3.doc.txt:NHM E73207	?	Al
rsbl20090302supp3.doc.txt:NHM E43810	Plan
rsbl20090302supp3.doc.txt:NHM 56422	?	Apt
rsbl20090302supp3.doc.txt:NHM E83246	Plan
20120949_ESM_5.txt:NHMUK R6856) am
rsbl2011364supp1.doc.txt:NHM-72.666; MCZ
20120949_ESM_3.txt:NHMUK R6856). P
20120949_ESM_3.txt:NHMUK R6856) in
rsbl20090778supp1.doc.txt:NHM as a contro
rsbl20090139supp1.txt:NHM, The Natura
rsbl20090139supp1.txt:NHM R1034). As
20120949_ESM_2.txt:NHMUK R6856) in
20120949_ESM_2.txt:NHMUK R6856) in
rsbl20080409supp01.doc.txt:NHMW, Naturhist
rsbl20130021supp1.doc.txt:NHM, Staatliche
rsbl20130021supp1.doc.txt:NHMUK PV R498 a
rsbl20130021supp1.doc.txt:NHMUK PV OR3612
rsbl20130021supp1.doc.txt:NHMUK PV R3938 
rsbl20130021supp1.doc.txt:NHMUK PV R5465)
rsbl20130021supp1.doc.txt:NHMUK PV OR2003
rsbl20130021supp1.doc.txt:NHMUK PV R1158)
rsbl20130021supp1.doc.txt:NHMUK PV R5595)
rsbl20130021supp1.doc.txt:NHMUK PV R4086)
rsbl20130021supp1.doc.txt:NHMUK and GLAHM
rsbl20130021supp1.doc.txt:NHM); Sveltonec
rsbl20130021supp1.doc.txt:NHMUK PV R11185
rsbl20130021supp1.doc.txt:NHM1284-R); Mal
rsbl20130021supp1.doc.txt:NHMUK PV R6682)
rsbl20130021supp1.doc.txt:NHMUK PV R6682)
rsbl20130021supp1.doc.txt:NHMUK in 1959, 
rsbl20130021supp1.doc.txt:NHMUK. While th
rsbl20130021supp1.doc.txt:NHMUK PV R6682 
rsbl20130021supp1.doc.txt:NHMUK PV R6682)
rsbl20130021supp1.doc.txt:NHMUK PV R6682,
rsbl20130021supp1.doc.txt:NHMUK PV R6682,
rsbl20130021supp1.doc.txt:NHMUK PV R6682 
rsbl20130021supp1.doc.txt:NHMUK) for the 
20120949_ESM_4.txt:NHMUK R6856) wh


Perhaps this approach might be useful to the PREDICTS / LPI teams, looking for species occurrence data sets?

I don’t know why figshare doesn’t do deep indexing by default – it’d be really useful to search the morass of published supplementary data that out there!