PlosGet.py – Downloading test data from PlosONE

One of the big problems that I’ve been having recently is a severe lack of test data for trying out new machine learning behaviours. I started off with just the papers from the ART Corpus plus a few that I manually cherry-picked from PlosONE. However, after going through each paper by hand and doing right click -> download as XML on every one, I decided it was time to automate paper acquisition.

Plos provide a great RESTful search API which can be used for free provided that you accept their very reasonable terms of use. They also provide examples of how to use their API in Python (which, coincidentally, is the language that I’m primarily building Partridge in).

After a little bit of fiddling, I’ve managed to set up a basic command-line script, plosget.py, which downloads batches of up to 100 papers at a time from PlosONE and puts them in a directory of your choice. If you already have a paper with the same name in that directory, it skips it. You can also completely control the query sent to the Plos API, retrieving specific papers as required. An example call to the program looks like the following:

~$ python plosget.py -s 0 -q "subject:\"Computer and information sciences\"" --field-query="article_type:Review AND doc_type:full" -r 50
Requesting http://api.plos.org/search?fq=article_type%3AReview%20AND%20doc_type%3Afull&rows=100&q=subject%3A%22Computer%20and%20information%20sciences%22&start=0&wt=json&api_key=dCFoAOqdkBUcOI2
Found 37 documents matching query
Downloading http://www.plosone.org/article/fetchObjectAttachment.action?uri=info%3Adoi%2F10.1371/journal.ppat.0030116&representation=XML...
Storing it at papers/journal.ppat.0030116.xml...
Downloading http://www.ploscompbiol.org/article/fetchObjectAttachment.action?uri=info%3Adoi%2F10.1371/journal.pcbi.1002291&representation=XML...
Storing it at papers/journal.pcbi.1002291.xml...
Downloading http://www.plosone.org/article/fetchObjectAttachment.action?uri=info%3Adoi%2F10.1371/journal.pgen.0020118&representation=XML...
Storing it at papers/journal.pgen.0020118.xml...
Downloading http://www.plosone.org/article/fetchObjectAttachment.action?uri=info%3Adoi%2F10.1371/journal.pntd.0000893&representation=XML...
Storing it at papers/journal.pntd.0000893.xml...
Downloading http://www.ploscompbiol.org/article/fetchObjectAttachment.action?uri=info%3Adoi%2F10.1371/journal.pcbi.1000029&representation=XML...
Storing it at papers/journal.pcbi.1000029.xml...
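For anyone curious how the request URL and file names in that output fit together, here is a minimal sketch. It is not the actual plosget.py source. The endpoint and the parameter names (q, fq, start, rows, wt, api_key) are read straight off the log, and the file naming is inferred from the "Storing it at" lines. The function names build_search_url, local_path and needs_download are hypothetical, and YOUR_API_KEY is a placeholder.

```python
import os
from urllib.parse import quote, urlencode

# Endpoint as it appears in the "Requesting ..." line of the log.
SEARCH_URL = "http://api.plos.org/search"


def build_search_url(query, field_query, start=0, rows=100,
                     api_key="YOUR_API_KEY"):
    """Assemble a search URL (hypothetical helper, not from plosget.py)."""
    params = {
        "q": query,          # main query, e.g. subject:"..."
        "fq": field_query,   # filter query, e.g. article_type:Review
        "start": start,      # offset of the first result to return
        "rows": rows,        # number of results per request
        "wt": "json",        # response format
        "api_key": api_key,
    }
    # quote_via=quote encodes spaces as %20, matching the logged URL.
    return SEARCH_URL + "?" + urlencode(params, quote_via=quote)


def local_path(doi, directory):
    """Map '10.1371/journal.pcbi.1002291' to '<directory>/journal.pcbi.1002291.xml'."""
    return os.path.join(directory, doi.split("/", 1)[1] + ".xml")


def needs_download(doi, directory):
    """Skip papers that already exist in the target directory."""
    return not os.path.exists(local_path(doi, directory))
```

Calling build_search_url with the query and field query from the example call should give a URL with the same encoded q and fq values as the logged one, though the parameters may appear in a different order.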

As you can see, the query is controlled through the -q and --field-query options. The -s option changes where you start downloading from; the offset counts from 0, as in the example call, so if you’ve already seen the first 100 papers, start from the 101st with -s 100. The -r option sets the maximum number of papers to download and accepts any positive integer up to and including 100. The -d option specifies the directory in which to store the papers and check for existing ones.
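As a rough illustration of how these options could be wired up, here is a sketch using argparse. Only the flag names (-s, -q, --field-query, -r, -d) come from the script; the dest names, the help text and every default value are assumptions, including the "papers" directory (guessed from the output of the example call) and the "*:*" match-all query.

```python
import argparse


def rows_type(value):
    """Accept only integers from 1 to 100, the batch limit described above."""
    n = int(value)
    if not 1 <= n <= 100:
        raise argparse.ArgumentTypeError("rows must be between 1 and 100")
    return n


def build_parser():
    """Hypothetical parser mirroring the documented options."""
    parser = argparse.ArgumentParser(prog="plosget.py")
    parser.add_argument("-s", dest="start", type=int, default=0,
                        help="offset of the first result (0-based)")
    parser.add_argument("-q", dest="query", default="*:*",  # assumed default
                        help="main search query")
    parser.add_argument("--field-query", dest="field_query", default=None,
                        help="filter query restricting the results")
    parser.add_argument("-r", dest="rows", type=rows_type, default=100,
                        help="maximum number of papers to download (1-100)")
    parser.add_argument("-d", dest="directory", default="papers",  # assumed
                        help="directory to store papers in")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Validating -r in a custom type function means an out-of-range value is rejected with a normal usage error before any request is sent.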

PlosONE themselves have even been sending me encouraging tweets today! It’s great to have their support!


My plan is to use plosget.py to download a selection of 1000 or so random papers and see what my partitioning clusterer does with them.

If you think plosget.py might be useful to you, you can get it from the Partridge repository on the right-hand side!

2 comments on “PlosGet.py – Downloading test data from PlosONE”
  1. dahl says:

    Hey I’m very interested in the script, but the [email protected] link is 404’ing on me!
