Show me the data!

PLOS ONE PHYLOGENY

May 7th, 2014 | Posted by rmounce in Content Mining | Open Data | Open Science | PLoS | PLUTo

I’m proud to announce an interesting public output from my BBSRC-funded postdoc project:
PLUTo: Phyloinformatic Literature Unlocking Tools. Software for making published phyloinformatic data discoverable, open, and reusable

MOAR PHYLOGENY!

Screenshot of some of the PLOS ONE phylogeny figure collection on Flickr

I’ve made openly available my first-pass filter of PLOS ONE phylogeny figures (I’m not in any way claiming this is *all* of them).

This curated & tagged image collection is on Flickr for easy browsing: http://bit.ly/PLOStrees

As well as on Github for version control, open archiving, and collaboration (I have remote collaborators):

https://github.com/rossmounce/P1-phylo-part1

https://github.com/rossmounce/P1-phylo-part2

https://github.com/rossmounce/P1-phylo-part3

https://github.com/rossmounce/P1-phylo-part4

(Github doesn’t like repositories over 1GB, so I’ve had to split the content across four separate repositories.)
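
The four repository URLs above follow a simple numbered pattern, so fetching them all can be scripted. This is a minimal sketch, assuming `git` is installed and on the PATH; the helper names `repo_urls` and `clone_all` are made up for illustration and are not part of the project.

```python
import subprocess

# URL prefix shared by the four repositories listed above.
BASE = "https://github.com/rossmounce/P1-phylo-part"


def repo_urls(parts=4):
    """Return the clone URLs for parts 1..parts (four at the time of writing)."""
    return [f"{BASE}{n}" for n in range(1, parts + 1)]


def clone_all(parts=4, dry_run=True):
    """Clone every part; with dry_run=True only print the git commands."""
    for url in repo_urls(parts):
        cmd = ["git", "clone", url]
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if a clone fails
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    clone_all(dry_run=True)
```

Passing `dry_run=False` would actually run the clones. The `parts` argument is there only in case the collection is ever split further.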

 

Why?

The aim of the PLUTo project is to re-extract & liberate phylogenetic data & associated metadata from the research literature. Sadly, only ~4% of modern published phylogenetic analysis studies make their underlying data available. Another study finds that if you ask the authors for this data, only 16% will be kind enough to reply with the requested data!

This particular data type is a cornerstone of modern evolutionary biology. You’ll find phylogenetic analyses across a whole host of journal subjects – medical, ecological, natural history, palaeontology… There are also many different ways in which this data can be re-used, e.g. supertrees & comparative cladistics, not to mention simple validation studies &/or analyses which extend a phylogeny or map new data onto it. It’s really useful data and we should be archiving it for future re-use and re-analysis. To my great delight, this is what I’m being paid to attempt to do for my first postdoc, on a grant I co-wrote: finding & liberating phylogenetic data for everyone!

 

Why PLOS ONE?

 

  •  It’s a BOAI-compliant open access journal that publishes most articles under CC BY, with a few under CC0.
    • This means I can openly re-publish figures online (provided sufficient attribution is given) — no need to worry about DMCA takedown notices or ‘getting sued’! This makes the process of research much easier. Private, non-public, access-restricted repositories for collaboration are a hassle I’d rather do without.
  • It’s a high-volume ‘megajournal’ publishing ~100 articles per day, many of which include phylogenetic analyses.
    • Thus it’s worthwhile establishing a regular daily or weekly method for parsing out phylogenetic tree figures from this journal.
  • Killer feature: as far as I know, PLOS are the only publisher to embed rich metadata inside their figure image files.
    • This makes satisfying the CC BY licence trivially easy — sufficient attribution metadata is already embedded in the file. Just ensure that wherever you’re uploading the file to doesn’t wipe this embedded data, hence why I chose Flickr as my initial upload platform.

 

What does this enable or make easier?

 

On its own, this collection doesn’t do much; this is still an early stage. But it gives us an important insight into the prevalence of certain types of visual display style that researchers are using:

‘radial’ phylogenies

https://www.flickr.com/search?user_id=123621741%40N08&sort=relevance&text=radial

Source: Zerillo et al 2013 PLOS ONE. Carbohydrate-Active Enzymes in Pythium and Their Role in Plant Cell Wall and Storage Polysaccharide Degradation

‘geophylogeny’ (phylogeny displayed relative to a map of some sort, 2D or 3D)

https://www.flickr.com/search?user_id=123621741%40N08&sort=relevance&text=geophylogeny

Source: Guo et al 2012 PLOS ONE. Evolution and Biogeography of the Slipper Orchids: Eocene Vicariance of the Conduplicate Genera in the Old and New World Tropics

‘timescaled’ (phylogenies where the branch lengths are proportional to units of time or geological periods)
https://www.flickr.com/search?user_id=123621741%40N08&sort=relevance&text=timescaled

Source: Pol et al 2014 PLOS ONE. A New Notosuchian from the Late Cretaceous of Brazil and the Phylogeny of Advanced Notosuchians

‘splitstrees’

https://www.flickr.com/search?user_id=123621741%40N08&sort=relevance&text=splitstree

Source: McDowell et al 2013 PLOS ONE. The Opportunistic Pathogen Propionibacterium acnes: Insights into Typing, Human Disease, Clonal Diversification and CAMP Factor Evolution

Arguably it also facilitates complex searches for specific types of phylogeny

e.g. analyses using cytochrome b
https://www.flickr.com/search/?w=123621741@N08&q=%22cyt%20b%22%20OR%20%22cytochrome%20b%22
(you could use PLOS’s API to do this, particularly their figure/table caption search field — but you’d get a lot of false positives — this is an expert-curated collection that has filtered-out non-phylo figures)
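
For comparison, the uncurated route through PLOS’s search API might look something like the sketch below. It only builds the request URL and makes no network call. The endpoint `http://api.plos.org/search` and the `figure_table_caption` field name are written from memory of PLOS’s public Solr-based API and should be checked against the current documentation; `caption_query_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

PLOS_SEARCH = "http://api.plos.org/search"  # assumed endpoint


def caption_query_url(terms, rows=10):
    """Build a search URL matching any of `terms` in figure/table captions."""
    # Quote each term so multi-word phrases are matched as phrases.
    phrases = " OR ".join(f'"{t}"' for t in terms)
    params = {
        "q": f"figure_table_caption:({phrases})",  # assumed field name
        "fl": "id,title",  # fields to return for each hit
        "wt": "json",      # response format
        "rows": rows,      # number of hits per page
    }
    return f"{PLOS_SEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    print(caption_query_url(["cyt b", "cytochrome b"]))
```

A query like this is not restricted to PLOS ONE and matches any caption mentioning the terms, phylogeny or not, which is where the false positives come from.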

In my initial roadmap, the plan is to do PLOS ONE, the other PLOS journals, then BMC journals, then possibly Zootaxa & Phytotaxa (Magnolia Press). There will be a Github-based website for the project soon, lots still to do…!

 

Want to know more / collaborate / critique?

Conferences:

I’ve got an accepted lightning talk at iEvoBio in Raleigh, NC later this year about the PLUTo project.

I also have an accepted poster at the Bioinformatics Open Source Conference (BOSC) in Boston, MA.

Otherwise, contact me via Twitter (@rmounce), the comment section on this blog post, or email: ross dot mounce <at> gmail dot com

  • Mike Taylor

    This is helpful. But splitting the data across four separate github repos is really, really not helpful. Surely github has paid plans for going over 1 Gb? Or given that your project is specifically about openness and they are Good Guys, I bet they’d waive the fee if you asked.

    • http://www.rossmounce.co.uk Ross Mounce

      Possibly.

      But tbh it just requires four lines, instead of one to get all the files:

      git clone https://github.com/rossmounce/P1-phylo-part1

      git clone https://github.com/rossmounce/P1-phylo-part2

      git clone https://github.com/rossmounce/P1-phylo-part3

      git clone https://github.com/rossmounce/P1-phylo-part4

      Github already provides essentially unlimited space for public repo’s. I feel that’s already amazingly generous – I don’t want, nor do I need to push for more.

      • Mike Taylor

        This is a recipe for confusion and inconsistency.

        “Don’t you see my changes?”
        “No, and I pulled the repo!”
        “Which one?”
        “Aaahhhh….”

        And later:

        “Don’t you see my new data?”
        “No, and I pulled all four repos!”
        “Oh, there are five now”
        “Aaarrrggghhhh….”

    • arfon

      Hi Ross and Mike, it’s Arfon Smith from GitHub here.

      Firstly, congratulations Ross on a very impressive piece of work! I’m planning on being at BOSC this year so hopefully I’ll get to see your poster there.

      As for the disk usage, Mike: I agree this is less than optimal for cases like this, but the 1GB limit is actually more about us providing a predictable level of performance for all of our users than a paid vs free repo thing. The 1GB limit is in place on free and paid plans.

      Ross – have you considered putting all of these images into a data archiving service such as Dryad, figshare or Zenodo? They’d be able to host these images in a single dataset and mint you a DOI to go with it.

      Cheers
      Arfon

      • http://www.rossmounce.co.uk Ross Mounce

        Hi Arfon! Thanks for the Github stickers. I grabbed a few at #CW14

        Figshare & Zenodo *meh* – good for static, final product dumping.

        I was concerned they wouldn’t allow a ~3GB collection to be uploaded. Pre-publication I’m not fussed about DOIs. One doesn’t need a DOI to cite something anyhow! Carl Boettiger’s got a pretty definitive post with more on that: http://www.carlboettiger.info/2013/06/03/DOI-citable.html

        Dryad can be ruled-out because a) it charges and b) I think (?) it only takes data associated with a publication and/or thesis. When I’ve written/submitted a corresponding paper I may well upload to Dryad, for a more lasting record of my research. I can delete those Github repo’s at any time so they’re not optimal for long-term preservation of stuff associated with a formal publication IMO.

        But at this early, pre-submission, pre-analysis data-gathering stage Github seems pretty optimal & agile for doing science openly. It’s highly likely I’ll be adding to the collection, modifying the metadata of the files (adding XMP tags), I can use informative commit messages to describe the changes between versions… etc. So Github would seem to me to be much better than figshare/Zenodo which don’t seem to me to cater to *dynamic* storage of living, changing datasets (?).

        Another of Digital Science’s things – ‘Projects’ might be more suitable http://www.digital-science.com/blog/posts/introducing-projects-digital-science-s-first-home-grown-tool but AFAIK there’s still no linux version so until then it’s of no use to me.

        • arfon

          They’re all fair points – to be honest I hadn’t quite realised these weren’t a ‘static’ dataset (my bad) so I agree that something that supports explicit versioning with documentation is a better solution.

          I’m glad you’ve found a way to make this work for you on GitHub!

          Cheers
          Arfon

          • http://www.rossmounce.co.uk Ross Mounce

            No no, my bad – I can appreciate from the lack of action on the repo so far that it *will* look fairly static. And to be fair there are unlikely to be many updates but XMP metadata tags are one such update I plan :)
