One of the big problems I've been having recently is a severe lack of test data for trying out new machine learning behaviours. I started off with just the papers from the ART Corpus and some papers manually cherry-picked from PlosONE. However, after going through them by hand, viewing each paper and doing right click -> download as XML on every one, I decided it was time to automate paper acquisition.
After a little bit of fiddling, I've managed to set up a basic command-line script, plosget.py, which downloads batches of up to 100 papers at a time from PlosONE and puts them in a directory of your choice. If that directory already contains a paper with the same name, the script skips it. You can also completely control the query sent to the Plos API, so you can retrieve exactly the papers you need. An example call to the program looks like the following:
~$ python plosget.py -s 0 -q "subject:\"Computer and information sciences\"" --field-query="article_type:Review AND doc_type:full" -r 50
Requesting http://api.plos.org/search?fq=article_type%3AReview%20AND%20doc_type%3Afull&rows=100&q=subject%3A%22Computer%20and%20information%20sciences%22&start=0&wt=json&api_key=dCFoAOqdkBUcOI2
Found 37 documents matching query
Downloading http://www.plosone.org/article/fetchObjectAttachment.action?uri=info%3Adoi%2F10.1371/journal.ppat.0030116&representation=XML...
Storing it at papers/journal.ppat.0030116.xml...
Downloading http://www.ploscompbiol.org/article/fetchObjectAttachment.action?uri=info%3Adoi%2F10.1371/journal.pcbi.1002291&representation=XML...
Storing it at papers/journal.pcbi.1002291.xml...
Downloading http://www.plosone.org/article/fetchObjectAttachment.action?uri=info%3Adoi%2F10.1371/journal.pgen.0020118&representation=XML...
Storing it at papers/journal.pgen.0020118.xml...
Downloading http://www.plosone.org/article/fetchObjectAttachment.action?uri=info%3Adoi%2F10.1371/journal.pntd.0000893&representation=XML...
Storing it at papers/journal.pntd.0000893.xml...
Downloading http://www.ploscompbiol.org/article/fetchObjectAttachment.action?uri=info%3Adoi%2F10.1371/journal.pcbi.1000029&representation=XML...
Storing it at papers/journal.pcbi.1000029.xml...
As you can see, the query is controlled by the -q and --field-query options. The -s option changes where downloading starts. The example's -s 0 shows up as start=0 in the request URL, so the value is a zero-based offset (i.e. if you've already seen the first 100 papers, continue with -s 100). The -r option specifies the maximum number of papers to download and accepts any positive integer up to and including 100. The -d option specifies which directory to store the papers in; the same directory is checked for existing papers.
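To make the options a bit more concrete, here is a rough sketch of how a request URL like the one in the log could be assembled, and how the "skip papers I already have" check might work. This is not the actual plosget.py code: the function names build_search_url, local_path and needs_download are made up for illustration, and the file naming is only inferred from the "Storing it at" lines in the output above.

```python
import os
from urllib.parse import urlencode, quote

# Endpoint copied from the "Requesting ..." line in the log output.
API_URL = "http://api.plos.org/search"
# The post states that at most 100 papers can be fetched per batch.
MAX_ROWS = 100


def build_search_url(query, field_query=None, start=0, rows=MAX_ROWS, api_key=None):
    """Hypothetical helper: build a search URL from -q, --field-query, -s and -r."""
    if not 1 <= rows <= MAX_ROWS:
        raise ValueError("rows must be a positive integer up to %d" % MAX_ROWS)
    if start < 0:
        raise ValueError("start must not be negative")
    params = {"q": query, "start": start, "rows": rows, "wt": "json"}
    if field_query:
        params["fq"] = field_query
    if api_key:
        params["api_key"] = api_key
    # quote_via=quote encodes spaces as %20, matching the URL seen in the log.
    return API_URL + "?" + urlencode(params, quote_via=quote)


def local_path(doi, directory):
    """Hypothetical helper: map a DOI to a file name.

    Assumes the naming seen in the log, where DOI 10.1371/journal.pcbi.1002291
    is stored as <directory>/journal.pcbi.1002291.xml.
    """
    suffix = doi.split("/", 1)[1]
    return os.path.join(directory, suffix + ".xml")


def needs_download(doi, directory):
    """Return True only if the paper is not already present in the directory."""
    return not os.path.exists(local_path(doi, directory))


if __name__ == "__main__":
    print(build_search_url(
        'subject:"Computer and information sciences"',
        field_query="article_type:Review AND doc_type:full",
        start=0,
        rows=50,
    ))
```

Keeping the URL building separate from the file check means each bit can be tried out without touching the network, which is handy when you are still fiddling with queries.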
PlosONE themselves have even been sending me encouraging tweets today! It’s great to have their support!
— PLOS ONE (@PLOSONE) February 26, 2013
My plan is to use plosget.py to download a selection of 1000 or so random papers and see what my partitioning clusterer does with them.
If you think plosget.py might be useful to you, you can get it from the Patridge repository linked on the right-hand side!