Show me the data!

[A monthly update on my Panton Fellowship related activities]

Last month I was slightly late with my monthly report, so this month I’m going to get things back on track and write my post now, on this leisurely sunny Sunday afternoon…

It’s been a good month:

First of all, I had the chance to speak about my Fellowship work for the Ede & Ravenscroft Prize final. I made a few choice comments to our Pro-Vice-Chancellor who was present, about the plurality of benefits of Open Access & Open Data, and the difficulties of trying to do content mining research on subscription-access journals. I didn’t win the prize in the end, but getting to the final, and being recognised as one of the top 5 research students at the University of Bath was pretty cool. I then immediately went out and spent part of the £50 runner-up prize on Michael Nielsen’s excellent Open Science book Reinventing Discovery. I gave it a read, then immediately passed it on to another lab for a friend to read, and it now resides with my supervisor who will also hopefully find time to read it (part of my not so subtle attempt to help spread the knowledge of how digital, networked openness can hugely benefit research).

I bought some other books too, but this was the important one

Then on the 11th of May, PMR came to Bath to give a talk to our Biology & Biochemistry Department. Those who came (including our subject librarian – thanks for coming!) were wowed with the ways in which PMR and colleagues have helped make semantically-enriched Linked Open Data available on chemicals for everyone, not just academic chemists! It’s brilliant to have an expert demonstration of the ways in which projects like CrystalEye have made the data underlying some chemical research publications far more easily searchable, open, and re-usable across many thousands of publications. There’s a strong, easily-justified need for more of this type of post-publication data scraping in biology (and palaeontology I might add!). We share a strong belief that research publications should be made open and explicitly re-usable without restriction.

Sadly, most of the biological literature in my domain is neither open, nor re-usable without permission (more of which in a later post) – which makes my highly integrative, data-focused research that much harder than it otherwise could be. As I’ve said before on the internet – I have all of PLoS on my USB stick, and I’ve no doubt I could fit all the relevant papers I need, and scrape data from them, on just my desktop computer hard drive – yet subscription-access paywalls and current copyright law prevent me from doing this for much of the literature (PLoS and other Open Access literature aside). I can understand how we arrived at this strange situation (we didn’t use to have such computational power to analyze large volumes of data, nor the Internet with which to freely & easily distribute research) but now we *do*, it seems like utter madness to continue to publish research in ways (e.g. subscription-access, copyright-transferred to the publisher) which make it very hard to analyze or re-use en masse.

The Panton Principles

So I’ve been joining the nascent OKFN working group Skype calls on Content Mining and soon we will hopefully have some interesting things to announce…

PMR also got the chance to meet my PhD supervisor and the rest of the lab, which was great since I’m doing this fellowship work concurrently with my PhD work on fossils & phylogeny.

Later on in the month, I suggested the excellent Panton Discussions be made more amenable for podcasting. An OKFN group are now working on producing an audio-only version of all of them, and making them more easily integrable on personal listening devices (mobile phones & MP3 players).

Finally, the past week has been a whirlwind:

On Tuesday (22nd May) I was at the Natural History Museum, London to talk with Dr Mark Wilkinson about some PhD project-related work – he’s kindly supplied me with some source code (among other things), so I can recompile his programs to run on my Linux machines. I told him all about the OKFN & Panton Fellowship and he was very supportive of the goals. Time and time again, I encounter such enlightened, high-up academics and wonder why & how academic publishing is still in its current state – it’s not for want of researcher support for Open Access in my experience!

On Wednesday, I was back with PMR in Cambridge hacking PDFs, focusing particularly on BMC literature as this is BOAI-compliant Open Access and we can do what we want with such material. Towards the end of the session we had a think about what metadata would be desirable to extract from the text of the papers and figure labels that might add context and information to the phylogenetic analysis performed, and phylogenetic tree presented, in each of the papers. By coincidence the Open Tree of Life group have also just republished the MIAPA working group list of desirable metadata for phylogenies. We certainly won’t be able to get all this information, and the information we can extract may not necessarily be interpreted and associated 100% correctly, but it will certainly be hugely valuable as this information would otherwise take 4 years to re-digitise(!) by some estimates.

On Thursday, I went to ProgPal (Progressive Palaeontology), a conference also in Cambridge. There I gave a short ‘announcement’ talk with slides to explain to everyone there a) what the Open Knowledge Foundation are about, and b) why they might be of interest to academic palaeontologists. I touched upon Open Access and Open Data issues in palaeontology and encouraged those with an interest to visit the website, join the Open Science mailing list, listen to or watch the Panton Discussions, and consider applying for a Panton Fellowship next year if they had any innovative ideas for paleo-data. This talk tied-in very well with the other announcement talks for Palaeontology Online (a new free outreach & education initiative) and Palaeocast (a new paleo podcast).

Which reminds me, I should really pop them both an email to explain why they should post their content with a Creative Commons Attribution Licence, so their materials can be re-used, re-posted and remixed as Open Educational Resources.

Best of all, on Friday I travelled down to London to my alma mater to attend & furiously tweet the Open Access debate at Imperial College London, in the very same lecture room where I sat through most of my undergrad lectures! There were a good many palaeontologists there too, including Tori Herridge and Nick Crumpton, and a large volume of tweets were sent under the #OAdebate hashtag (archived here if you’re interested). Graham Taylor of the Publishers Association said some rather provocative things that got me rather hot under the collar, including:

…we [publishers] are the stewards of genuine science…

Which I think could all too easily be misinterpreted to overstate the importance of the role that publishers play in organising peer-review, spell-checking, typesetting and other such tasks. I also couldn’t help laughing out loud at Graham’s straight-faced proposal for subscription-access publishers to offer ‘fee-waived walk-in access at public libraries’ as a way to provide taxpayer access to taxpayer-funded research. Stephen Curry (also on the panel) thankfully quickly interrupted to state how ridiculous that was. I’ll leave it to Mike Taylor’s post here to explain just how ludicrous that proposal is in light of 21st century technology. I will however give Graham Taylor credit for further disavowing the Research Works Act. He said of it [and presumably his organisation's initial support for it]: “the RWA was not such a good idea, don’t ask me to defend that one”, which elicited a pleased response from the audience.

There will be another debate held after the release of the Finch report which I suspect will be rather more exciting. A lot of the issues were aired at this debate, but the brevity of the time slot allowed for the event meant that there was not enough time for in-depth discussion IMO.

That’s just about it for the month. I can’t wait for what the next month will bring!

 

I have previously commented elsewhere on other blogs, that uniquely, with BOAI-compliant Open Access literature, one is able to re-distribute research however one wishes (provided proper attribution is given). I believe this to be hugely beneficial and perhaps a rather under-appreciated facet of the plurality of benefits offered by Open Access publishing.

Below is an expanded version of the comment I made on Cameron Neylon’s excellent blog Science in the Open on this very theme (and please do read Cameron’s post too for greater context):

Decentralized journal/article distribution is already happening.

I have 20,000+ PLoS articles on my computer right now. You can get them too – via BioTorrents. When compressed (as initially provided there) it’s less than 16 GB of files – a trivial amount for anyone with a broadband connection. I can now (and do!) take PLoS on a USB stick with me wherever I go, allowing me to do research on trains, planes, and in remote locations completely hassle-free without even an internet connection. It was easy to download (pretty much 1-click) too via my high-speed institutional connection – and didn’t overload PLoS’s servers because I didn’t *get* the articles from their servers. With peer-to-peer file sharing the load is balanced between seeders (and in turn, I’m now seeding this torrent too, to help share the load). If all institutions/libraries agreed to help seed the world’s research literature, without copyright restriction on electronic redistribution (which we could do tomorrow if it weren’t for the legal copyright barriers imposed by most traditional subscription-access publishers) doing literature research would be pretty much frictionless! We could even get papers & data on campus much quicker over campus LAN rather than the internet.

Institutions already agree to help distribute code, e.g. R and its multitude of packages – this is hugely beneficial, and helps share the costs associated with bandwidth – why not for research publications? The PLoS corpus is a great way to try out content mining ideas – it shows you how easy academic life *could* be if everything was Open Access. I’ve run some simple scripts on it myself. I’m not sure the simple things I did, such as string matching, could be classified as ‘text mining’ – but one thing I do know is – it was 100,000 times easier/quicker doing this locally, machine-reading files, rather than doing it paper by paper, negotiating paywalls (where do I click, how many hoops do I have to jump through before I’m let in, what information are the ‘helpful’ tracking cookies keeping about me…) and getting cut off by publishers. It’s worth pointing out as well that once you have all the literature you need on your computer – you don’t even need the internet to do your research! For researchers in less economically developed countries, with weaker telecoms infrastructure – I’d imagine this would be a real boon.
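To give a flavour of the kind of simple string matching I mean, here is a minimal sketch rather than the actual scripts I ran. It assumes the corpus has been unpacked into a local folder of XML files; the folder name `plos_corpus`, the `*.xml` pattern and the search term are all hypothetical placeholders, so adjust them to however your copy is laid out.

```python
from pathlib import Path


def find_matches(corpus_dir, term, pattern="*.xml"):
    """Return sorted paths of files under corpus_dir whose text contains term.

    Matching is plain case-insensitive substring search, nothing cleverer.
    """
    needle = term.lower()
    hits = []
    for path in sorted(Path(corpus_dir).rglob(pattern)):
        if not path.is_file():
            continue
        # errors="ignore" drops undecodable bytes instead of aborting the scan
        text = path.read_text(encoding="utf-8", errors="ignore")
        if needle in text.lower():
            hits.append(path)
    return hits


if __name__ == "__main__":
    # "plos_corpus" is a hypothetical location for the unpacked articles
    for hit in find_matches("plos_corpus", "phylogen"):
        print(hit)
```

Everything here runs against files on the local disk, so there are no paywalls, cookies or rate limits in the loop, which is really the whole point.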

It’s a window on the world that *could* be possible if we just changed our attitude with respect to copyright and research publishing. That PLoS, BMC and other Open Access publishers use the Creative Commons Attribution Licence makes this all possible.

I predict that the rights to electronically redistribute, and machine-read research will be vital for 21st century research – yet currently we academics often wittingly or otherwise relinquish these rights to publishers. This has got to stop. The world is networked, thus scholarly literature should move with the times and be openly networked too.

In short, I think research would be a whole lot easier to do, and ultimately (all things considered) be more cost-effective, if all future publicly-funded research could be made BOAI-compliant Open Access. This is just my opinion – you are welcome to disagree in the comments section below. I sincerely hope I don’t sound like an Open Access ‘zealot’, for this is certainly not my intention.

If you haven’t heard yet – I was successful in my Panton Fellowship proposal

I wasn’t the only successful applicant either – huge congratulations to Sophie Kershaw and her excellent proposal to train doctoral students how to do Open Science at Oxford University. We’ll be working together on shared goals throughout the year, I suspect.

As part of the Fellowship process I’ll be making monthly short reports on progress and more lengthy quarterly reports.

So without further ado, here’s what I’ve been getting up to in April:

  • For the main component of my proposal – extracting phylogenetic data from PDFs – I’ve spent the month getting up to speed under the expert guidance of PMR. I even spent a whole day (16th Apr) in Cambridge with PMR working on this. Things are coming on in leaps and bounds.
  • Visited Digital Science HQ in King’s Cross to have a chat with them about all the exciting web technology they’re working on.
  • Successfully arranged for the Open Knowledge Foundation to have a stall, and possibly a talk at the upcoming Progressive Palaeontology academic conference in Cambridge later this month.
  • Raised transparency and Open Data issues at the Systematics Association council meeting. As a result of this, we will soon upload our official constitution to our website to make it crystal clear what our guiding principles are. Additionally, all council members unanimously agreed in principle that we should try and make the data underlying our future Systematics Association special volume publications Open Data online somewhere, somehow – but we need to get feedback and agreement from our publisher, Cambridge University Press before we proceed further with this.
  • Together with Sophie Kershaw we agreed a strategy for our OKFest plans and with the excellent help of Laura Newman submitted a talk session proposal for the OKfestival, Helsinki later this year.
  • Attended the OKFN London Open Science hackday, further details on that are in my previous blogpost.

 

And of course this is all concurrent with my ‘regular’ PhD work, which included two manuscripts currently being prepared, three conference abstract submissions (and associated work to actually have something to write about!), undergrad demonstrating work and all the other day-to-day stuff.

I even had time for a small holiday over the long Bank Holiday weekend, to St Austell to see The Lost Gardens of Heligan & The Eden Project amongst other things.

It’s been a busy month!

 

PS I’ve been enjoying the new HTML classes on Codecademy. Below I’m going to see if some of these new HTML tricks work in WordPress:

This box should have rounded corners
This box should have a black shadow

I can guess the number you are thinking of

Follow the rules and then hover over the card below

  1. Think of a number below 10
  2. Double the number you have
  3. Add 6
  4. Divide it by 2
  5. Subtract the original number from your answer
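Spoiler: the steps above always land on the same answer, because (2n + 6) / 2 − n = n + 3 − n = 3, whatever n you started with. The card itself has not survived here, so presumably it showed a 3. A quick sketch to check the arithmetic (the function name `card_trick` is just made up for illustration):

```python
def card_trick(n):
    """Apply the five steps of the trick to a starting number n."""
    doubled = n * 2            # step 2: double the number
    plus_six = doubled + 6     # step 3: add 6
    halved = plus_six // 2     # step 4: divide by 2 (2n + 6 is always even)
    return halved - n          # step 5: subtract the original number


# Try every whole number below 10 and collect the distinct answers
results = {card_trick(n) for n in range(10)}
print(results)  # prints {3}
```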