Show me the data!

Twitter tips for Systematists

January 11th, 2013 | Posted by rmounce in Publications

I wrote a piece for The Systematist newsletter last year which has now been published & disseminated to members. The official version won’t be freely accessible from the website until next year (instant access is currently a perk of Systematics Association membership only) so in the meantime I’ll re-blog it here:

Is this the first mention of #icanhazpdf in scholarly literature?

I’d like once again (I already have by email) to thank the new editor Jane Droop for taking care to provide many many clickable linkouts in the PDF to all the different resources I mention – there’s a *lot* of links!

Here’s the full reference for the original version:
Mounce, R 2012 Twitter for systematists. The Systematist, vol. 34, pages 14-15

Twitter for systematists

Despite, or perhaps because of, being limited to messages of just 140 characters at a time, Twitter is an excellent medium for the near instantaneous dissemination of information over the Internet. It’s been successfully used to remotely sense earthquakes [1] and flu outbreaks [2], and to predict the outcomes of elections [3] and box office success [4]. It’s also a very handy tool for academics, with ever-increasing usage amongst the population.

Here are my top tips for using Twitter for science (a far from exhaustive list):

Remotely following conferences you can’t attend.

There are too many interesting conferences these days. No one has the time or money to attend them all. Furthermore, some may occur simultaneously and one cannot be in two places at the same time! But with Twitter one can often get a reasonable description of what’s going on at a conference by following the official conference hashtag e.g. #evol2012 #ievobio (Evolution, Ottawa), and #HennigXXXI (Hennig, Riverside). At some conferences remote participation via Twitter is possible, to ask questions from afar at panel discussions and such.

Expand the impact of your conference talks

Extending upon the above, if you’re giving a talk at a conference – put your Twitter handle on your conference name badge and on the title slide of your talk so tweeters in the audience can link to you on Twitter when describing your talk. This is particularly useful if you have a common name – John Smith could be anyone online but @JSmith69 exactly identifies who (and is shorter). If you can, put your slides online before your talk using a service like Slideshare or Prezi and use a URL shortener to provide an easily tweetable link to that online slidedeck. Put this short-link on your first and last slides, so tweeters can disseminate it to everyone following the conference hashtag from afar, who can then view your slides too. This can dramatically increase the number of people seeing your talk (albeit, a slide-only version of it). For example, my talk this year at #HennigXXXI once tweeted out by @rdmpage and others (thanks!) was seen by over 200 people online after just a couple of days. At the conference itself there were fewer than 100 people in attendance, so it really helped maximise the impact of the talk.

Discuss, promote and critique papers on Twitter

Like a paper? Tweet about it including a link to the paper (attribution and links are key on Twitter) and maybe start a discussion with fellow academics. Don’t just tweet-promote your own papers or those of your close colleagues – this is bad netiquette. Some groups even have journal clubs conducted in the open on Twitter e.g. http://www.twitjc.com/

Get help or canvass the opinion of your research community

Got a problem you can’t solve yourself, but which might easily & quickly be solved by someone else? One can’t abuse Twitter for this all the time, but the occasional well-put question on Twitter often elicits good responses if you have enough followers. The key here is reciprocity – if you’re always asking for help you’ll soon be ignored. But if you can give as well as receive help you’ll generate a healthy respect. Twitter convention has it that questions are often marked with the #lazyweb hashtag – use this to indicate you have a question that you want answered. Similarly if you need a PDF you don’t have subscription access to, try supplying the URL link to the paper + your email address + #icanhazpdf in a tweet. @BoraZ created this convention and it’s now rather popular with many requests *every* day appearing on Twitter for PDFs. This facilitates quick and easy access to the literature, enabling thorough scholarship, by-passing the often tedious and slow inter-library loans procedure.

The Systematics Association, like other societies (e.g. @SVP_vertpaleo, @GeolSoc, @LinneanSociety) and journals (e.g. @systbiol, @MethodsEcolEvol, @BiolJLinnSoc, @ecologyletters), has had a presence on Twitter since 2011: @SystAssn.

Want to talk about systematics? Tweet us at @SystAssn. Happy tweeting tweeps :)

References

1. Sakaki, T., Okazaki, M., and Matsuo, Y. 2010. Earthquake shakes twitter users: real-time event detection by social sensors. In Proceedings of the 19th international conference on World wide web, WWW ’10, pp. 851-860, New York, NY, USA. ACM. http://dx.doi.org/10.1145/1772690.1772777
2. Culotta, A. 2010. Towards detecting influenza epidemics by analyzing Twitter messages. KDD Workshop on Social Media Analytics http://arxiv.org/abs/1007.4748
3. Tumasjan, A., Sprenger, T. O., Sandner, P. G., and Welpe, I. M. 2010. Predicting elections with twitter: What 140 characters reveal about political sentiment. In Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media, pp. 178-185. http://www.aaai.org/ocs/index.php/ICWSM/ICWSM10/paper/viewFile/1441/1852
4. http://www.hpl.hp.com/research/scl/papers/socialmedia/socialmedia.pdf

A list of some relevant accounts on Twitter to follow:

@David_Hillis (University of Texas)
@kcranstn Karen Cranston (Open Tree of Life)
@rdmpage (Professor of Taxonomy at Glasgow University)
@cydparr (EOL)
@phylofoundation (updates from The Phyloinformatics Research Foundation)
@phylogenomics (Prof. Jonathan Eisen, UC Davis)
@Dr_Bik (marine genomics, UC Davis)
@JChrisPires (plant genomics)
@k8hert (Kate Hertweck, NESCent)
@TRyanGregory (University of Guelph)
@pedrobeltrao (bioinformatics, UCSF)
@ewanbirney (associate director at the EBI)
@caseybergman (University of Manchester)
@ianholmes (computational biologist)
@lukejharmon (University of Idaho)
@cboettig (theoretical ecology & evolution)
@tomezard (University of Surrey)
@eperlste (evolutionary pharmacologist, Princeton University)
@RosieRedfield (UBC)
@NYCuratrix (Susan Perkins, AMNH)
@theleechguy (Mark Siddall, AMNH)
@AndyFarke (vertebrate paleontologist)
@TomHoltzPaleo (paleobiologist)
@Bill_Sutherland (conservationist)

and at the Natural History Museum London:

@nhm_london (official NHM London account)
@edwbaker (biodiversity informatics)
@DavidMyWilliams (diatomist)
@vsmithuk (cybertaxonomist)
@Coleopterist (Max Barclay)
@SandyKnapp (Solanaceae taxonomist)
@NHMdinolab (updates from Paul Barrett’s lab)
@gna-phylo (updates from Thomas Richards’ lab)

So a week ago, I investigated publisher-produced Version of Record PDFs with pdfinfo and the results were very disappointing. Lots of missing metadata was found and one could not reliably identify most of these PDFs from metadata alone, let alone extract particular fields of interest.

But Rod Page kindly alerted me to the fact that I might be using the wrong tool for this investigation. So at his suggestion I’ve tried again to extract metadata from the exact same set of PDFs as last time…

Only this time I’ll be using exiftool version 9.10.

This time I’ve put the full raw metadata output from exiftool on figshare for each and every PDF file, just to really prove the point, reproducible research and all. I’d love to post the corresponding PDFs too but sadly many of them are not Open Access, which prevents me from uploading them to a public space. **Insert timely comment here about how closed access publications stifle effective research practices…**

Exiftool is really simple to use. You just need to type:
exiftool NameOfPDF.pdf
to get a human-readable exhaustive output of all possible metadata.

and
exiftool -b -XMP NameOfPDF.pdf
to get XML-structured metadata. I could only extract this from 56 of the 69 PDF files. The data output from this for those 56 PDFs is available as a separate fileset on figshare here.
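If you want something machine-readable rather than the human-readable listing, exiftool also has a -json option. Below is a rough sketch in Python (not anything I actually ran for this post) of how one might check which descriptive fields a PDF lacks. The list of wanted fields and the function names are purely illustrative assumptions.

```python
import json
import subprocess

# Hypothetical choice of fields that would help identify a paper.
WANTED_FIELDS = ("Title", "Author", "Subject", "Keywords")


def missing_fields(record, wanted=WANTED_FIELDS):
    """Return the wanted fields that are absent or blank in one metadata record."""
    missing = []
    for name in wanted:
        value = record.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing


def read_metadata(pdf_path):
    """Run `exiftool -json` on one file and return its metadata as a dict."""
    result = subprocess.run(
        ["exiftool", "-json", str(pdf_path)],
        capture_output=True, text=True, check=True,
    )
    # exiftool prints a JSON array containing one object per input file.
    return json.loads(result.stdout)[0]
```

With exiftool installed, missing_fields(read_metadata("NameOfPDF.pdf")) would list whichever of those fields are empty or absent for that file.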

Finally, if you want to test a whole bunch of PDF files in your working directory I’ve made a simple shell script that loops through all PDFs in your working directory, available here (oops, it’s not data, perhaps I should have put that on github instead?). [I'm sure many readers will be able to create a simple bash loop themselves but just for those that don't...]
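For those who would rather not write the loop in bash, here is a rough equivalent sketched in Python. It is not the script linked above: the function names and the .xmp output naming are assumptions made for illustration. It records how many bytes of XMP came back for each PDF, so files with no extractable XMP are easy to spot.

```python
import subprocess
from pathlib import Path


def extract_xmp(pdf_path):
    """Return the raw XMP packet of one PDF as bytes (empty if none came back)."""
    result = subprocess.run(
        ["exiftool", "-b", "-XMP", str(pdf_path)],
        capture_output=True,
    )
    return result.stdout


def dump_xmp(directory=".", extract=extract_xmp):
    """Write NAME.xmp next to every NAME.pdf and return {filename: byte count}."""
    sizes = {}
    for pdf in sorted(Path(directory).glob("*.pdf")):
        xmp = extract(pdf)
        # An empty result still produces a (zero-byte) .xmp file.
        pdf.with_suffix(".xmp").write_bytes(xmp)
        sizes[pdf.name] = len(xmp)
    return sizes
```

Calling dump_xmp() in the directory holding the PDFs returns a dictionary of byte counts; any entry with a count of 0 is a file for which no XMP was extracted. The extract parameter is only there so the loop can be tried out without exiftool installed.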

 

I’m assuming that the reason exiftool -b -XMP failed on 13 of those PDFs is because they have no embedded XMP metadata – an empty (zero-byte sized) file is created for these. This is an assumption though… I notice that those 13 exactly correspond with all the 13 that were produced with iText. I checked the website and I’m pretty sure iText 2.x and up can embed XMP metadata, it’s just whether the publishers have bothered to use & apply this functionality.

So if I’m right, neither Taylor & Francis, BRILL, nor Acta Palaeontologica Polonica embed XMP metadata (at all!) in their PDFs. The alternative explanation is that the XMP metadata is in there but exiftool for whatever reason can’t read/parse it from iText produced PDFs. I find this an unlikely alternative explanation though tbh.

Elsevier have superior XMP metadata to everyone else by the looks of it, but Elsevier aside the metadata is still very poor, so my conclusions from last week’s post still stand I think.

Most of the others do contain metadata (of some sort) but by and large it’s rather poor. I need to get some other work done on Monday so I’m afraid this is where I’m going to leave this for now. But I hope I’ve made the point.

Further angles to explore

Interestingly, Brian Kelly has taken this in a slightly different direction and looked at the metadata of PDFs in institutional repositories. I hadn’t realised this but apparently some institutional repositories (IRs) universally add cover pages to most deposits. If this is done without care for the embedded metadata, the original metadata can be wiped and/or replaced with newer (less informative) metadata. Not to mention that cover pages are completely unnecessary: all the information on a cover page is exactly the kind of stuff that should be put in embedded metadata! No need to waste time and space by putting that info as the first page. JSTOR does this too (cover pages) and it annoys the hell out of me.

After some excellent chat on Twitter about this IR angle I’ve discovered that UKOLN, based here on campus at Bath, have also done some interesting research in this area, in particular the FixRep project which is described in more detail here. CrossRef Labs’ pdfmark tool also looks like something of interest for fixing PDFs with poor quality metadata. I’ve got this installed/compiled from the source on github but haven’t tried it out yet. It would be interesting to see the difference it makes – a before and after comparison of metadata to see what we’re missing… But why should we fix a problem that shouldn’t exist in the first place? Publishers are the point of origin for this. It’s their job to be the first to publish the Version of Record. They should provide the highest level of metadata possible IMO.

 

Why would publishers add metadata?

Because their customers – libraries, governments, research funders (in the case of Open Access PDFs) – should demand it. A pipe dream perhaps but that’s my $.02. I would ask for a refund if I downloaded MP3s from iTunes/Amazon MP3 with insufficient embedded metadata. Why not the same principle for electronically published PDFs?

 

PS Apologies for some of the very cryptic filenames in the metadata uploads on figshare. You’ll have to cross-match with this list here or the spreadsheet I uploaded last week to work out which metadata file corresponds to which PDF/Bibliographic Data record/Publisher.

Publisher | Identifier | Journal | Contains embedded XMP metadata? | Filename
American Association for the Advancement of Science | Ezard2011 | Science | yes? | ezard_11_interplay_759293.pdf
American Association for the Advancement of Science | Nagalingum2011 | Science | yes? | nagalingum_11_recent_719133.pdf
American Association for the Advancement of Science | Rowe2011 | Science | yes? | Science-2011-Rowe-955-7.pdf
Blackwell Publishing Ltd | Burks2011 | Cladistics | yes? | burks_11_combined_694888.pdf
Blackwell Publishing Ltd | Janies2011 | Cladistics | yes? | janies_11_supramap_779773.pdf
Blackwell Publishing Ltd | Simmons2011 | Cladistics | yes? | simmons_11_deterministic_779537.pdf
BRILL | Barbosa2011 | Insect Systematics & Evolution | no | barbosa_11_phylogeny_779910.pdf
BRILL | Dellape2011 | Insect Systematics & Evolution | no | dellape_11_phylogenetic_779909.pdf
Cambridge Journals Online | Knoll2010 | Geological Magazine | yes? | knoll_10_primitive_475553.pdf
Cambridge Journals Online | Saucede2007 | Geological Magazine | yes? | thomas_saucegraved_07_phylogeny_506869.pdf
CSIRO | Chamorro2011 | Invertebrate Systematics | yes? | chamorro_11_phylogeny_780467.pdf
CSIRO | Daugeron2011 | Invertebrate Systematics | yes? | daugeron_11_phylogenetic_780466.pdf
CSIRO | Johnson2011 | Invertebrate Systematics | yes? | johnson_11_collaborative_750540.pdf
Elsevier | Lane2011 | Molecular Phylogenetics and Evolution | yes | E3-1-s2.0-S1055790311001448-main.pdf
Elsevier | Cunha2011 | Molecular Phylogenetics and Evolution | yes | E2-1-s2.0-S1055790311001680-main.pdf
Elsevier | Spribille2011 | Molecular Phylogenetics and Evolution | yes | E1-1-s2.0-S1055790311001606-main.pdf
Frontiers In | Horn2011 | Frontiers in Neuroscience | yes? | fnins-05-00088.pdf
Frontiers In | Ogura2011 | Frontiers in Neuroscience | yes? | fnins-05-00091.pdf
Frontiers In | Tsagareli2011 | Frontiers in Neuroscience | yes? | fnins-05-00092.pdf
Hindawi | Diniz2012 | Psyche: A Journal of Entomology | yes? | 79139500.pdf
Hindawi | Restrepo2012 | Psyche: A Journal of Entomology | yes? | 516419.pdf
Hindawi | Savopoulou2012 | Psyche: A Journal of Entomology | yes? | 167420.pdf
Institute of Paleobiology, Polish Academy of Sciences | Amson2011 | Acta Palaeontologica Polonica | no | amson_11_affinities_666987.pdf
Institute of Paleobiology, Polish Academy of Sciences | Edgecombe2011 | Acta Palaeontologica Polonica | no | edgecombe_11_new_666988.pdf
Institute of Paleobiology, Polish Academy of Sciences | Williamson2011 | Acta Palaeontologica Polonica | no | app2E20092E0147.pdf
Magnolia Press | Agiuar2011 | Zootaxa | yes? | zt02846p098.pdf
Magnolia Press | Ebach2011 | Zootaxa | yes? | ebach_11_taxonomy_599972.pdf
Magnolia Press | Nelson2011 | Zootaxa | yes? | nelson_11_resemblance_688762.pdf
National Academy of Sciences | Casanovas2011 | Proceedings of the National Academy of Sciences | yes? | casanovas-vilar_11_updated_644658.pdf
National Academy of Sciences | Goswami2011 | Proceedings of the National Academy of Sciences | yes? | goswami_11_radiation_814757.pdf
National Academy of Sciences | Thorne2011 | Proceedings of the National Academy of Sciences | yes? | thorne_11_resetting_654055.pdf
Nature Publishing Group | Meng2011 | Nature | yes? | meng_11_transitional_644647.pdf
Nature Publishing Group | Rougier2011 | Nature | yes? | rougier_11_highly_720202.pdf
Nature Publishing Group | Venditti2011 | Nature | yes? | venditti_11_multiple_779840.pdf
NRC Research Press | CruzadoCaballero2010 | Canadian Journal of Earth Sciences | yes? | 650000.pdf
NRC Research Press | Druckenmiller2010 | Canadian Journal of Earth Sciences | yes? | 80000000c5.pdf
NRC Research Press | Mazierski2010 | Canadian Journal of Earth Sciences | yes? | mazierski_10_description_577223.pdf
NRC Research Press | Modesto2009 | Canadian Journal of Earth Sciences | yes? | modesto_09_new_577201.pdf
NRC Research Press | Parsons2009 | Canadian Journal of Earth Sciences | yes? | parsons_09_new_575744.pdf
NRC Research Press | Wu2007 | Canadian Journal of Earth Sciences | yes? | wu_07_new_622125.pdf
Pensoft Publishers | Hagedorn2011 | ZooKeys | yes? | hagedorn_11_creative_779747.pdf
Pensoft Publishers | Penev2011 | ZooKeys | yes? | penev_11_interlinking_694886.pdf
Pensoft Publishers | Thessen2011 | ZooKeys | yes? | thessen_11_data_779746.pdf
Public Library of Science | Hess2011 | PLoS ONE | yes? | hess_11_addressing_694222.pdf
Public Library of Science | McDonald2011 | PLoS ONE | yes? | mcdonald_11_subadult_694229.pdf
Public Library of Science | Wicherts2011 | PLoS ONE | yes? | wicherts_11_willingness_779788.pdf
SAGE Publications | deKloet2011 | Journal of Veterinary Diagnostic Investigation | yes? | Invest-2011-deKloet-421-9.pdf
SAGE Publications | Richter2011 | Journal of Veterinary Diagnostic Investigation | yes? | Invest-2011-Richter-430-5.pdf
SAGE Publications | Wassmuth2011 | Journal of Veterinary Diagnostic Investigation | yes? | Invest-2011-Wassmuth-436-53.pdf
Senckenberg Natural History Collections Dresden | Fresneda2011 | Arthropod Systematics & Phylogeny | yes? | fresneda_11_phylogenetic_785869.pdf
Senckenberg Natural History Collections Dresden | Mally2011 | Arthropod Systematics & Phylogeny | yes? | ASP_69_1_Mally_55-71.pdf
Senckenberg Natural History Collections Dresden | Shimizu2011 | Arthropod Systematics & Phylogeny | yes? | ASP_69_2_Shimizu_75-81.pdf
Springer-Verlag | Beermann2011 | Zoomorphology | yes? | 10.1007_s00435-011-0129-9.pdf
Springer-Verlag | Cuezzo2011 | Zoomorphology | yes? | cuezzo_11_ultrastructure_694669.pdf
Springer-Verlag | Vinn2011 | Zoomorphology | yes? | 10.1007_s00435-011-0133-0.pdf
Taylor & Francis | Bianucci2011 | Journal of Vertebrate Paleontology | no | bianucci_11_aegyptocetus_778747.pdf
Taylor & Francis | Makovicky2011 | Journal of Vertebrate Paleontology | no | makovicky_11_new_694826.pdf
Taylor & Francis | Pietri2011 | Journal of Vertebrate Paleontology | no | pietri_11_revision_689491.pdf
Taylor & Francis | Rook2011 | Journal of Vertebrate Paleontology | no | rook_11_phylogeny_694916.pdf
Taylor & Francis | Tsuihiji2011 | Journal of Vertebrate Paleontology | no | tsuihiji_11_cranial_660620.pdf
Taylor & Francis | Yates2011 | Journal of Vertebrate Paleontology | no | yates_11_new_694821.pdf
Taylor & Francis | Gerth2011 | Systematics and Biodiversity | no | gerth_11_wolbachia_779749.pdf
Taylor & Francis | Krebes2011 | Systematics and Biodiversity | no | krebes_11_phylogeography_779700.pdf
Sociedade Brasileira de Ictiologia | Britski2011 | Neotropical Ichthyology | yes? | a02v9n2.pdf
Sociedade Brasileira de Ictiologia | Sarmento2011 | Neotropical Ichthyology | yes? | a03v9n2.pdf
Sociedade Brasileira de Ictiologia | Calegari2011 | Neotropical Ichthyology | yes? | a04v9n2.pdf
Royal Society | Billet2011 | Proceedings of the Royal Society B: Biological Sciences | yes? | billet_11_oldest_687630.pdf
Royal Society | Polly2011 | Proceedings of the Royal Society B: Biological Sciences | yes? | polly_11_history_625430.pdf
Royal Society | Sansom2011 | Proceedings of the Royal Society B: Biological Sciences | yes? | sansom_11_decay_625429.pdf
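
As a quick cross-check on the table above, the per-publisher counts can be tallied to confirm the totals quoted earlier (56 of 69 PDFs gave XMP output, 13 gave none). This is just a small Python sketch; the counts are transcribed by hand from the table and the variable names are arbitrary.

```python
from collections import Counter

# (publisher, XMP status, number of PDFs), transcribed from the table above.
ROWS = [
    ("American Association for the Advancement of Science", "yes?", 3),
    ("Blackwell Publishing Ltd", "yes?", 3),
    ("BRILL", "no", 2),
    ("Cambridge Journals Online", "yes?", 2),
    ("CSIRO", "yes?", 3),
    ("Elsevier", "yes", 3),
    ("Frontiers In", "yes?", 3),
    ("Hindawi", "yes?", 3),
    ("Institute of Paleobiology, Polish Academy of Sciences", "no", 3),
    ("Magnolia Press", "yes?", 3),
    ("National Academy of Sciences", "yes?", 3),
    ("Nature Publishing Group", "yes?", 3),
    ("NRC Research Press", "yes?", 6),
    ("Pensoft Publishers", "yes?", 3),
    ("Public Library of Science", "yes?", 3),
    ("SAGE Publications", "yes?", 3),
    ("Senckenberg Natural History Collections Dresden", "yes?", 3),
    ("Springer-Verlag", "yes?", 3),
    ("Taylor & Francis", "no", 8),
    ("Sociedade Brasileira de Ictiologia", "yes?", 3),
    ("Royal Society", "yes?", 3),
]

# Sum the number of PDFs per XMP status.
totals = Counter()
for publisher, status, n_pdfs in ROWS:
    totals[status] += n_pdfs

with_xmp = totals["yes"] + totals["yes?"]
print(sum(totals.values()), "PDFs in total")
print(with_xmp, "gave XMP output;", totals["no"], "gave none")
```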

I’ve enrolled in some MOOCs

January 5th, 2013 | Posted by rmounce in phdchat

I wrote about MOOCs last December but never actually enrolled in one myself… until now.

Sure, I’ve done Codecademy courses and Codeschool courses which I’ve immensely enjoyed but they’re perhaps(?) not quite the same thing.

This year I’ve decided to bite the bullet and do some Coursera courses (depicted below, confusingly there are different courses run by different teams with the exact same titles/topics):

[Image: Coursera courses]
The more I think about it – why not? It’s free to enrol. It’s free to drop-out & ignore if you don’t have the time for it, or you realise it’s too easy/hard/uninteresting. WHY NOT?

So I’ve sent a few tweets out that unashamedly I’m enrolling in some Coursera courses this year and not surprisingly found that other people I respect are also dipping their toes in the MOOC water: @gawbul (Steve Moss, University of Hull), a fellow PhD student, is also taking many of the same courses that caught my eye.

Some initial observations:

  • Coursera definitely isn’t Open. I see no Creative Commons licenses anywhere – you probably can’t repost or remix the content provided on each of these courses which is a big shame IMO. It’s an MFOC (free rather than open) not a MOOC, but sadly few would recognize this distinction.
  • Roger Peng is running the Computing for Data Analysis course. I’m a huge fan of reproducible research; I got my first little peer-reviewed contribution in Nature simply through reproducing (and finding significant error with) published research – it’s really cool to see lectures from someone you kinda idolise. There’s 0% chance of personal interaction with him through the course; there are simply too many thousands enrolled but still that’s pretty cool – a big name draw.
  • The sheer diversity of people enrolled in the courses is very inspiring: in one discussion thread of IT professionals I find Ahmed from Sudan (“Software Architect Trying to Learn more about Statistics and Business”) and Gurneet from India, old and young people from across the globe all wanting to learn. I really do get that warm fuzzy feeling that MOOCs could contribute significantly to educating the world and making it a better place. It’s not about replacing or being the alternative to a college degree, it’s just about learning what you want to learn and feeding curiosity.
  • Without looking at any of the lectures or materials on my first attempt I managed to get 9/10 on the first Computing for Data Analysis quiz assessment (which I’ve since re-attempted to get the full 10/10 score). So at week 1, introducing R and data manipulation in R, it’s fairly easy for me. But even so it did help me tighten-up, refresh and test my knowledge. I’m looking forward to week 2 of the course starting 9th January. And especially the start of the Machine Learning & NLP courses. These will be invaluable for my postdoc work I suspect…

So far so good. Do let me know in the comments if you’ve signed up for a MOOC too, I’d be interested to know. At first I felt mildly guilty as a PhD student enrolling for these things but now I see it’s a no brainer – if you have time for it, and it might benefit you – why not give it a try? There’s no shame in that.

Just a quick note that BMC journal APCs have increased from what they were in 2012.

 

Luckily I had the 2012 data saved on my computer so I can compare prices directly.
I’ve put the data for 97 journals (not all of them) here on figshare.

The mean price increase is just over 5%.

Although to give it a fair statistical treatment – the median price increase is just 3.3% (to 1 d.p.). There is a lot of variance. Some of the biggest price hikes appear to be from society journals e.g. Journal of Physiological Anthropology (An official journal of the Japan Society of Physiological Anthropology) and thus the price hike is probably the society’s decision rather than BMC’s doing. But in the era of PeerJ & eLife should prices be going up at all? If anything I’d expect prices to go down to remain competitive. Perhaps BMC are hoping things will be business as usual this year?
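
To show why the mean and median can differ like this, here is a small Python sketch of the calculation. The journal names and prices are made up purely for illustration (they are not the real BMC figures, which are in the figshare data): a single large hike pulls the mean up while leaving the median untouched.

```python
from statistics import mean, median


def percent_increases(old_prices, new_prices):
    """Percentage change per journal, for journals present in both years."""
    return {
        journal: 100.0 * (new_prices[journal] - old) / old
        for journal, old in old_prices.items()
        if journal in new_prices
    }


# Made-up prices for illustration only (not the real BMC figures).
apc_2012 = {"Journal A": 1000, "Journal B": 1200, "Journal C": 1500, "Journal D": 800}
apc_2013 = {"Journal A": 1030, "Journal B": 1236, "Journal C": 1545, "Journal D": 920}

changes = percent_increases(apc_2012, apc_2013)
print(f"mean increase:   {mean(changes.values()):.1f}%")    # 6.0%
print(f"median increase: {median(changes.values()):.1f}%")  # 3.0%
```

Three of the four hypothetical journals go up by 3% and one by 15%, so the mean is 6% but the median stays at 3%.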

I got what I assume to be the correct 2013 prices over at the official BioMedCentral website today.

It’s a shame y’know. I’ve read a little of the history of the Open Access movement and in earlier times, perhaps a decade ago BioMedCentral really helped enable Open Access, convincing sceptical academics that it could work.

But now, it does make me wonder whether their prices aren’t a bit too high:

BMC tweet

As James McInerney tweeted on 1st January 2013: are BMC price gouging?

I’m proud to announce I have a new article over at Palaeontology [Online]

The Palaeontology [Online] logo – by the P [O] team, licensed under a Creative Commons Attribution License

Posts at ‘P [O]’ are primarily aimed at public-engagement and since the site was launched back in July 2011, with sponsorship and support from the Palaeontological Association, one post per month has been featured on site. This month [December], I’ve written a rather different type of post for them. Not so much about fossils, creatures, classification and rocks – but instead on how palaeontology and science-as-a-whole is made available with respect to Open Access, Open Data, Open Source (code), and Open Educational Resources (OERs). Incidentally, I think it’s also the first P [O] post with embedded video content too – really making use of the digital medium!

I’ve tied these strands together with an explicit acknowledgement that Creative Commons has legally enabled all this Open content and that it’s a fantastic achievement. Consider it my early birthday present to celebrate that it’s now been nearly 10 years since Creative Commons first launched (#cc10 on Twitter btw for related news & events).

I’m hoping it will raise awareness that citizens & scientists alike can directly read the primary scientific literature themselves (via Open Access journals and articles) and they should be encouraged to – given as taxpayers they’ve paid for most of it to be created! Also more than just mere engagement, I’ve highlighted that uniquely with an Open philosophy there’s nothing stopping ‘amateur’ or citizen science contributions in palaeontology – it’s sad that more of the literature, data, code and educational resources in this area aren’t openly available for re-use – arguably the world would know a lot more about palaeontology if they were.

With specific reference to http://opendefinition.org/ I try and make it clear what open actually means in this context. There’s been a lot of openwashing this year. Open is clearly a desirable state, and a label which will help sell and ‘add value’ to products, therefore both innocent and malicious temptation abounds to mistakenly label or brand things as ‘open’ when they are de facto not open. Education and awareness-raising clearly has a significant role to play here in preventing this problem.

 

During the production of the article some interesting points were raised, which in the end didn’t make it to the ‘final’ version of the post, so I’ll blog them here instead.

 

On Open Access:

For the sake of simplicity I neglected to point out that in actuality the definition of OA is slightly narrower and more specific than just open as per http://opendefinition.org/ . OA is defined by the BOAI-definition which does not require nor allow(?) the ShareAlike (SA) clause. It does however require the Attribution clause (BY):

the relevant excerpt…

… The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited. …

see Mike Taylor’s excellent posts over at SVPOW for more.

 

On Open Data:

I wonder if perhaps there is still a perception out there that there are still technical barriers to sharing data openly?

Particularly with regard to very very large datasets & data files. I decided this was too niche a point for inclusion in the main post but in case anyone’s wondering – you can easily share *any* filesize these days.

Journals like GigaScience specialise in publishing ‘big data’ studies and already make available petabytes (=1 million gigabytes) worth of data. Data archives like figshare allow unlimited filesize uploads (only limited to 1GB if you keep it private), I’m sure Dryad would also be willing to archive large files. Want proof? Look no further than this 21GB database of microbial data that’s been downloaded at least 10 times as made available via BioTorrents - I couldn’t find anyone seeding it just now, but if there was greater institutional support for p2p data sharing I’m sure this would take off.

 

On Open Source (code):

There’s an excellent editorial in PLoS Computational Biology that regrettably I only became aware of too late to include. It’s by Andreas Prlić & Hilmar Lapp, the latter of whom I had the pleasure of meeting recently at NESCent in Durham, North Carolina. It’s a short paper and Open Access so I recommend you all read at least the Why Do We Support Open-Source Scientific Software? section – it’s an excellent clear and concise summary of the greater value of open in this area.

 

On MOOCs: 

The American Museum of Natural History (AMNH) have some online courses available here, and whilst they’re probably of the very highest quality, they are neither MOOCs nor OERs because they’re not open, nor free. Each course costs $495 plus a $25 one-time registration fee. Grad credit is also available at an additional cost.

Perhaps one day the AMNH might be persuaded to run one of these courses as a MOOC? If not just to help advertise and drive interest in the other courses, then also to demonstrate their quality. If MIT can do it…

I should also refer interested readers to the excellent set of seminars on phylogenetics over at phyloseminar. I’ve virtually-attended (live, over-the-internet but not there in person) a few of these and have enjoyed recordings of others. It’s not a complete course (so not a MOOC) but depending upon exact licensing details could perhaps be classed as OER-like material. The next one will be soon: 5pm (UK time) Wednesday 5th December, Understanding biodiversity patterns using the Tree of Life, given by Hélène Morlon.

 

 

All the posts over at P [O] are of very high-quality and are worthy academic contributions. As such I’m going to list my post there on my CV as soon as I update it. It’ll sit nicely in my publications list alongside articles in BMC Research Notes, Nature, and The Systematist. I deliberately intermix peer-reviewed publications and non-peer-reviewed publications to make people reconsider and examine the relative merits of each, rather than just counting volume or (worse) the journal Impact Factor which is of course irrelevant.

I encourage everyone else who’s published an article at P [O] to also proudly display it on their CV.

 

 

A couple of days ago I posted specifically about the data re-use session.
I’m going to use this post to muse about the conference more generally.

About SpotOn London 2012

It used to be called Science Online London – an informative, sensible and appropriate name. This year I hear (rumours) that it had to change name to SpotOn because Science AAAS or some other litigious entity was claiming brand identity infringement. I have sympathies with the organisers for this enforced change but ‘SpotOn London’ which apparently stands for “Science Policy, Outreach and Tools Online” (you wouldn’t know unless told!) is not my cup of tea, tbh. I suggest next year we continue to use the #solo hashtag and continue to call it Science Online London (informally), even if for legal reasons this can’t be the official conference title.

SOLO12

As the focus of the conference was tripartite (“Policy”, “Outreach”, and “Tools”), the mix of speakers, panellists & attendees was refreshingly diverse – unlike most academic conferences I go to. High-level & high-profile academics like Prof. Stephen Curry (Imperial College), Prof. Athene Donald (U. of Cambridge), Dr Ethan Perlstein and Dr Jenny Rohn (Science is Vital) mixed freely with PhD students like myself, Jon Tennant, Jojo Scoble, Nick Crumpton, Tom Phillips and others. There were policy people like Mark Henderson and Nic Bilham and even politicians themselves: we should all be grateful for Julian Huppert, MP for Cambridge, one of unfortunately few UK politicians to take a genuine interest in science. There were publishers’ reps including Matt Hodgkinson & Martin Fenner (PLOS), Brian Hole (Ubiquity Press), Graeme Moffat & Kamila Markram (FrontiersIn), Ian Mulvany (eLife), Michael Habib (Elsevier), and ‘independents’ like Anna Sharman (Anna Sharman Editorial Services) & Kaveh Bazargan (of River Valley). There were also librarians Peter Morgan (U. of Cambridge) & Frank Norman, research funder Geraldine Clement-Stoneham (MRC), journalist Ed Yong, and really interesting people who defy easy classification(!) like Brian Kelly (UKOLN), Tony Hirst, and (some of) the Digital Science team: Euan Adie, Mark Hahnel and Kaitlin Thaney.

Now apologies to those I didn’t name check in the above list – there were many other brilliant and interesting people there (Ed Baker, Vince Smith, David Shotton, Josh Greenberg, I could go on… There’s a fuller list of attendees by Twitter handle here). I merely selected a few from broad categories to show the impressive diversity of representation there. This is one of the very best things about the conference – it attracts virtually all of the stakeholders of science. It’s not just about researchers, publishers, research funders and librarians – it rightly recognises that science isn’t only for ivory tower academics; it’s for everyone.

[Incidentally, for those interested I'd say gender diversity was quite balanced. Alas racial diversity was rather too imbalanced - perhaps sadly reflective of academia as a whole?]

As befits a conference formerly known as ‘Science Online’ *all* of the talks were recorded & tweeted, so there are videos on YouTube of every single one, and Storifys (of the tweets) available to view.

Selected Highlights (aside from the #solo12reuse session):

The journal is dead, long live the journal

In the early stages of this session, I was worried it wasn’t really going anywhere interesting with the discussion…

and then Dr Kaveh Bazargan took the microphone at ~28:22 (skip to that section, it’s brilliant)

on the publishing process, author manuscript submissions & typesetting:

It’s madness really. I’m here to say I shouldn’t be in business.

As any manuscript-submitting biologist knows, publishers ask for all sorts of ridiculously pedantic formatting from us, particularly for reference lists. As Kaveh reminds us, this is all pointless and stupid because when the publishers get the manuscript, they send it off to typesetters to be typeset anyhow – this process is hugely inefficient: “madness”. Not only this, but if a submitted manuscript gets rejected from one journal, the poor authors often have to waste significant amounts of time and energy re-formatting their manuscript to suit the stylistic vagaries of another journal. Microsoft Word is not a good authoring tool – it’s largely unstructured. The publishing process requires a high degree of technical structure, usually provided by XML or TeX.

If you dig into the issue a little bit, you’ll see that programs like Mendeley (and any other reference manager, I would think) are fully capable of providing reference lists as structured XML. And yet journal policies enforce that we submit plain text (in, say, a Word doc), only for the typesetters to get paid by the publishing companies to then re-implement those plain text references back into fully-structured XML. Madness!
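To make the plain-text vs structured-XML point concrete, here’s a minimal Python sketch. It is only an illustration: the element names (`reference`, `authors`, `author` and so on) are made up by me and are not the schema that Mendeley, typesetters or any publisher actually use. The reference itself is the Prayle et al. BMJ paper mentioned elsewhere in this post.

```python
import xml.etree.ElementTree as ET

# Reference fields as a reference manager might hold them internally.
reference = {
    "authors": ["Prayle, A. P.", "Hurley, M. N.", "Smyth, A. R."],
    "year": "2012",
    "title": "Compliance with mandatory reporting of clinical trial results "
             "on ClinicalTrials.gov: cross sectional study",
    "journal": "BMJ",
    "volume": "344",
}


def to_plain_text(ref):
    """Flatten the reference into one journal-style string (structure is lost)."""
    authors = ", ".join(ref["authors"])
    return f'{authors} {ref["year"]}. {ref["title"]}. {ref["journal"]} {ref["volume"]}.'


def to_xml(ref):
    """Serialise the same fields as XML; element names are hypothetical."""
    root = ET.Element("reference")
    authors = ET.SubElement(root, "authors")
    for name in ref["authors"]:
        ET.SubElement(authors, "author").text = name
    for field in ("year", "title", "journal", "volume"):
        ET.SubElement(root, field).text = ref[field]
    return ET.tostring(root, encoding="unicode")


plain = to_plain_text(reference)
structured = to_xml(reference)
print(plain)
print(structured)

# Round trip: the XML hands every field back without any guesswork.
parsed = ET.fromstring(structured)
recovered_authors = [a.text for a in parsed.find("authors")]
print(recovered_authors)
```

Getting the authors back out of the XML is a one-liner; getting them back out of the plain-text string means guessing where one comma-laden name ends and the next begins, which is roughly the job the typesetters are being paid to redo.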

Typesetters are mostly located in areas where the labour is cheap: India, the Philippines etc. It’s an intensely manual process and perhaps in future may be less of a necessity(?). Furthermore, as I discovered with a recent BMC manuscript I was an author on, typesetters can sometimes introduce new errors into the publication process which slow it down even further! I commend the brutal honesty of Dr Bazargan in bravely speaking out about this. This is an issue completely separate to OA/TA journals; both can be guilty of this madness.

It also affects re-use potential, as he also remarked in the #solo12reuse session. Not every publisher publicly exposes the XML version of the papers they publish, even though these are of extreme importance to re-use potential (e.g. mining). The Geological Society of London is one publisher that doesn’t expose XML for its publications, which is a great shame. I asked Neal Marriott of GSL about this back in June via email and he replied: “We do not currently have a feature to allow download of the NLM XML source.” I also tried to take this up politely with Nic Bilham at the bar after the first night of the conference, but for someone with “external relations” in his job title I found him rather frosty towards me. Happy to say I had no such problems with Grace Baynes (NPG); it was charming to meet her in person after our exchanges regarding NPG’s new OA pricing strategy.

So how do we get GSL and other publishers to expose their XML which they surely have? They already provide HTML & PDF versions, what’s difficult about exposing the underlying XML version too?
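As a rough idea of why exposed XML matters for mining, here’s a small Python sketch that counts which journals a paper cites. The article fragment is entirely made up and its element names are hypothetical; it does not follow the NLM schema or any particular publisher’s format.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# A made-up article fragment, for illustration only.
ARTICLE_XML = """
<article>
  <title>An example paper</title>
  <references>
    <reference><journal>BMJ</journal><year>2012</year></reference>
    <reference><journal>BMC Research Notes</journal><year>2012</year></reference>
    <reference><journal>BMJ</journal><year>2011</year></reference>
  </references>
</article>
"""


def count_cited_journals(xml_text):
    """Count how often each journal appears in the reference list."""
    root = ET.fromstring(xml_text.strip())
    return Counter(ref.findtext("journal") for ref in root.iter("reference"))


counts = count_cited_journals(ARTICLE_XML)
print(counts.most_common())
```

With only a PDF or HTML rendering to work from, the same question needs scraping and pattern-matching that breaks whenever the journal’s house style changes.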

 

Altmetrics

I was at another session at the time this was going on but I think this is an important session I should highlight. Broadening the assessment & evaluation of research beyond incredibly narrow metrics (like the journal Impact Factor; die die!) is something that’s clearly very important. Everyone agrees it’s “early days” and that not all impact is measurable, but that shouldn’t dissuade us from actively researching and cautiously embracing this new (positive) trend.

Fraud in Science

Virginia Barbour, Ed Yong and others were on the panel for this one. Again, I wasn’t in the room for this session, but looking back at the video I rather wish I had been – it was really interesting:

  • Virginia Barbour 18:45 “I think there’s much more evidence of sloppiness than outright fraud… at some PLOS journals we ask authors for the original figures to check for figure manipulation before acceptance… when we ask authors to supply these a large number of authors can’t do it” (is it really that they can’t find the files, or just that they don’t *want* to supply the originals?) “It is completely unacceptable, but not uncommon” Amen Virginia – I agree very much!
  • Virginia Barbour 21:36 “…the larger issues that plague science; sloppiness, unwillingness to share data, conflict of interest, and publication bias. There are *solutions* to these and the great thing is that the internet makes it much easier to spot and actually makes it easier to address than previously…” +1
  • much of the later talk about clinical trials refers to work done in this paper: Prayle, A. P., Hurley, M. N., and Smyth, A. R. 2012. Compliance with mandatory reporting of clinical trial results on ClinicalTrials.gov: cross sectional study. BMJ 344. in which the authors report a rather disappointing 22% compliance rate with US Food and Drug Administration Amendments Act legislation that requires the results of all clinical trials to be reported within reasonable time.
  • Finally, I was seriously impressed and pleased that Ed Yong (from 45:07) extolled the virtues of Open Data for enabling greater transparency and lowering the barriers to critical re-analyses. This is something I most definitely would have raised had I been there; alas, I think I was at the “publishing research data: what’s in it for me?” session.

So yeah, the conference was great. Not all the sessions were brilliant. The ‘big data’ session was a little disappointing (no offence to any of the panel, just small attendance, little engagement) – perhaps because the topic is already well-covered by alternative events, with the likes of the recent O’Reilly Strata Big Data meetup dominating?

I’ll be there at next year’s Science Online London event for sure – whatever it’s called!
;)

So there’s been a few write-ups of the conference already:

SpotOn day 1 & SpotOn day 2 being the best I’ve read, from Ian Mulvany

see also: SpotOn London – a global conference by Jon Tennant | #solo12 reflection by A J Cann | SpotOn London 2012 in brief by Charis Cook | Altmetric @ SpotOn London 2012 by Jean Liu, and more

but since SpotOn London 2012 was such an interesting event, and there were so many parallel sessions I thought it would be good to add some more to the post-conference discussion

Data is the new black

photo by Bastian Greshake (@gedankenstuecke), Copyright not mine.

It wasn’t just a popular badge at the conference… data is a hot topic in science right now (as it should be!). Data is an undervalued but absolutely vital output of research. Research funding agencies appear to have over-incentivized the production of research publications (many of which are mere executive summaries of the years of research effort they represent) to the exclusion of almost everything else.

Science isn’t just about the production of papers; data and code are extremely important research outputs too (I’m not going to mention patents – they’re a sticky issue best dealt with in another post). The good news is that funding bodies now seem to have realised that they’re seriously missing out on RoI by focusing solely on papers; just recently the NSF Grant Proposal Guidelines changed their terminology from the narrow ‘Publications’ to the newer, broader term ‘Products’, which explicitly recognises non-publication outputs as creditworthy first class research objects (incidentally, this was one of the many excellent suggestions made in the Force11 manifesto for ‘Improving Future Research Communication and e-Scholarship’; read it if you haven’t already).

The immense value to be gained, time to be saved, and innovative research enabled by making data available for re-use was up for discussion at the #solo12reuse session. Mark Hahnel (@figshare) was organiser/chair, and Sarah Callaghan (@sorcha_ni) of the British Atmospheric Data Centre and I were the invited panelists for a ~1hr slot. As the conference was extremely well-organized *all* sessions were live-streamed via Google Hangouts & made publicly available via YouTube afterwards. I’ve embedded the stream of the #solo12reuse session below:

A transcript of some of what was discussed:

Intros from ~02:00 … then straight into discussion from ~09:00 onwards: Josh Greenberg (@epistemographer) contends that data sharing in chemistry perhaps ‘doesn’t make as much sense’ – I have a feeling PMR & many others would disagree with this!

At 13:20 Sarah Callaghan: NERC sets its data embargo policy so that data can only be withheld for a maximum of 2 years after it was collected, after which it must be made publicly available, somewhere, somehow – the ambiguity of which IMO needs to be worked on…

At 14:25 discussion of ‘levels of re-usability’ and definition. Access control as a means of encouraging data sharing (?)

17:30 Sarah Callaghan: “It’s important to have ‘first dibs’ on your own data” but not beyond this without peer-vetted justification/scrutiny IMO

18:30 David Shotton (@dshotton): noted that one shouldn’t expect absolutely every data point/item to be shared – not all data is useful/valuable. It’s about retaining & making available bits that might be of re-use value.

At 20:24 I start to introduce AMI2 & the OpenContentMining project.

37:50 I bring the Panton Principles on screen, I also had the OKFN Science Working Group page displayed (although not discussed) for a good ~10 minutes. Note to self: hijack the display computer at panel sessions more often…

from 40:17 onwards… Mark Hahnel: “In terms of re-use and getting people incentivized, are Data Papers the future?” Sarah Callaghan: “NO. Until research achievement is predicated on something other than publishing in ‘high-impact’ journals then we’re stuffed: we’ve got to shoehorn data & code in order for them to ‘count’ [lamentably]” So for now we need data papers, but perhaps in the future we won’t need to constrain these outputs to a ‘paper’ style format.

from 43:00 Martin Fenner (@mfenner) plays Devil’s Advocate and suggests that data citation may not work and that perhaps #altmetrics might be better indicators of usage. Much debate ensues…

from 45:50 I give a plug to Iain H’s paper ‘Open By Default’: Hrynaszkiewicz, I. and Cockerill, M. 2012. Open by default: a proposed copyright license and waiver agreement for open access research and data in peer-reviewed journals. BMC Research Notes 5:494. Then we discuss legal barriers to re-using data.

This post has taken a while to write and is fairly long now, so I’m going to split my recap of #solo12 into two or more parts. In part 2 I’ll attempt to discuss some of the other *excellent* sessions I saw, in particular the brilliant, well-received outburst on the absurd inefficiency of the publication process by professional typesetter Dr Kaveh Bazargan during the #solo12journals session. I’m surprised someone hasn’t done a whole blogpost about this already – it was my highlight of the conference tbh!
I’ll be posting part two on Monday 19th November (weekends are slow for blogs… I want people to read this!)

Until then…