Show me the data!

Making a journal scraper

May 13th, 2015 | Posted by rmounce in Content Mining

Yesterday, I made a journal scraper for the International Journal of Systematic and Evolutionary Microbiology (IJSEM).

Fortunately, Richard Smith-Unna and the ContentMine team have done most of the hard work in creating the general framework with quickscrape (open source and available on GitHub). I just had to modify the available journal-scrapers to work with IJSEM.

How did I do it?

Find an open access article in the target journal, e.g. James et al. (2015), Kazachstania yasuniensis sp. nov., an ascomycetous yeast species found in mainland Ecuador and on the Galápagos

In your browser, view the HTML source of the full text page. In the Chrome/Chromium browser, the keyboard shortcut to do this is Ctrl-U. You should then see something like this, perhaps with less funky highlighting colours:

I based my IJSEM scraper on the existing set of scraper definitions for eLife because I know both journals use similar underlying technology to create their webpages.

The first bit I clearly had to modify was the extraction of publisher. In the eLife scraper this works:

but at IJSEM that information isn’t specified with ‘citation_publisher’; instead it’s tagged as ‘DC.Publisher’, so I modified the element to reflect that:
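To illustrate the difference between the two tag names outside of quickscrape, here is a minimal Python sketch. It is not the scraper definition itself, just a stand-alone illustration: the function name extract_publisher, the MetaCollector class and the two sample page heads are all made up for this example, and the publisher values are placeholders.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            # Keep the first value seen for each meta name.
            self.meta.setdefault(name, content)


def extract_publisher(html, names=("citation_publisher", "DC.Publisher")):
    """Return the first publisher meta value found, trying each name in order."""
    collector = MetaCollector()
    collector.feed(html)
    for name in names:
        if name in collector.meta:
            return collector.meta[name]
    return None


# Made-up page heads, for illustration only.
ELIFE_LIKE = '<html><head><meta name="citation_publisher" content="Example Publisher A"></head></html>'
IJSEM_LIKE = '<html><head><meta name="DC.Publisher" content="Example Publisher B"></head></html>'

print(extract_publisher(ELIFE_LIKE))
print(extract_publisher(IJSEM_LIKE))
```

The point is simply that the same piece of metadata lives under a different meta name at each journal, so the selector has to change even though everything else stays the same.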

The license and copyright information extraction differs even more between eLife and IJSEM. Here’s the correct scraper for the former:

and here’s how I changed it to extract that information from IJSEM pages:

The XPath needed is completely different. The information is inside a div, not a meta tag.
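A rough sketch of why the XPath has to change, using Python's built-in ElementTree (which supports a small subset of XPath). The markup below is hypothetical: the meta name DC.Rights, the class name "license" and the licence wording are assumptions for illustration, not copied from either journal, so check the real page source before relying on them.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippets. The meta name, the class name and the
# wording are made up; the real pages will differ.
META_STYLE = """<html><head>
<meta name="DC.Rights" content="Example licence statement"/>
</head><body/></html>"""

DIV_STYLE = """<html><head/><body>
<div class="license"><p>Example licence statement</p></div>
</body></html>"""


def license_from_meta(page):
    """Read the licence from a meta tag's content attribute."""
    node = ET.fromstring(page).find(".//meta[@name='DC.Rights']")
    return node.get("content") if node is not None else None


def license_from_div(page):
    """Read the licence from the text inside a div."""
    node = ET.fromstring(page).find(".//div[@class='license']")
    if node is None:
        return None
    # itertext() gathers text from the div and all of its children;
    # split/join collapses any runs of whitespace.
    return " ".join("".join(node.itertext()).split())


print(license_from_meta(META_STYLE))
print(license_from_div(DIV_STYLE))
```

With a meta tag you read an attribute; with a div you read the text content, which is why the two selectors end up looking nothing alike.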


Hardest of all, though, were the full-size figures and the supplementary materials files – they’re not directly linked from the full text HTML page, which is rather annoying. Richard had to help me out with these by creating “followables”:

In his words:

any element can ‘follow’ any other element in the elements array, just by adding the key-value pair "follow": "element_name" to the element that does the following. If you want to follow an element, but don’t want the followed element to be included in the results, you add it to a followables array instead of the elements array. The followed array must capture a URL.
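The idea can be sketched as a toy resolver in Python. This is not how quickscrape is implemented; it only mimics the behaviour described in the quote. The keys "follow", "elements" and "followables" come from the description above, while "selector" and "attribute", the use of dictionaries, the element names, the URLs and the page markup are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Toy stand-in for the web: URL -> well-formed page source (all made up).
PAGES = {
    "http://example.org/article": (
        '<html><body><a class="fig-link" href="http://example.org/fig1">Figure 1</a>'
        "</body></html>"
    ),
    "http://example.org/fig1": (
        '<html><body><img class="full" src="http://example.org/fig1-large.png"/>'
        "</body></html>"
    ),
}

# Hypothetical definition, loosely modelled on the description above.
DEFINITION = {
    "followables": {
        "figure_page": {"selector": ".//a[@class='fig-link']", "attribute": "href"},
    },
    "elements": {
        "figure_image": {
            "follow": "figure_page",
            "selector": ".//img[@class='full']",
            "attribute": "src",
        },
    },
}


def select(url, spec):
    """Apply one selector to the page at `url`, returning attribute values."""
    root = ET.fromstring(PAGES[url])
    return [node.get(spec["attribute"]) for node in root.findall(spec["selector"])]


def scrape(url, definition):
    """Resolve every element, following intermediate URLs where asked."""
    followables = definition.get("followables", {})
    elements = definition["elements"]
    results = {}
    for name, spec in elements.items():
        start_urls = [url]
        if "follow" in spec:
            # Look the followed element up in either place; it must yield URLs.
            target = followables.get(spec["follow"]) or elements[spec["follow"]]
            start_urls = select(url, target)
        results[name] = [v for u in start_urls for v in select(u, spec)]
    return results


print(scrape("http://example.org/article", DEFINITION))
```

Because figure_page sits under followables rather than elements, its URL is used to reach the second page but does not appear in the results; only figure_image does.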


The bottom line is that it might look complicated initially, but it’s actually not that hard to write a fully functioning journal scraper definition for use with quickscrape. I’m off to go and create one for Taylor & Francis journals now :)

Wouldn’t it be nice if all scholarly journals presented their content on the web in the same way, so we didn’t have to write a thousand different scrapers to download it? That’d be just too helpful, wouldn’t it?

  • Interesting! So having scraped this metadata, do you then republish it openly in a canonical format?

    • rmounce

      Could do I suppose. A gigantic bibliography?
      Would certainly help establish / prove when articles are paywalled (when they shouldn’t be!).

      We do plan to normalize the full text output to aid downstream analysis. See ‘norma’ for that: https://github.com/ContentMine/norma

      • That would be very valuable.

        • Mike,
          Yes – we are certainly interested in extracting the metadata and some of this is already in our CATalog. Are you interested in becoming an early adopter to test it out and drive it?

          • Yes. Where’s the access point and what’s the API?

            If you want to communicate this privately to me in the meantime, you’re welcome to use email to the usual address.

  • hi there how can i add to the plos json to get just the author’s summary?
    and do you know any available datasets that has author summaries from plos?
    thanks!

  • hi

    sorry for noob question! i want to know how to extract just the author summary from a PLOS article using quickscrape?

    thanks for any hints!
    mura

    • rmounce

      Hi Mura,

      Try asking that question at the ContentMine forums (they make quickscrape). They may be able to help you further: http://discuss.contentmine.org/