After long weeks of trying, I’ve managed to get some positive results from a paper type classifier that uses Random Decision Forests to sort papers into the categories “Research Paper”, “Review Paper” or “Case Study”.
Acquiring Test Data
Firstly, I needed papers, and lots of them. As I posted below, I wrote a small utility for downloading papers in batches from PlosONE. I also harvested a set of case studies from the PubMedCentral open access repository. I now have approximately 600 papers in total: roughly 200 each of the Research, Review and Case Study types.
Picking out the best features
Next, I used k-means clustering on my entire paper database to find which features would be best for discriminating between sets of unlabelled papers. I tried all combinations of 3 CoreSCs. After about half an hour of chugging away, my laptop came back to me: the best silhouette value was for k=3, using the features Background, Methodology and Motivation. This was exactly what I was hoping for, especially given that I’d already identified these three CoreSCs as very important for distinguishing between Review and Research papers.
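A minimal sketch of this kind of search in plain Python follows. It is an illustration of the idea rather than the code behind the result above: the helper names (`kmeans`, `silhouette`, `best_feature_triple`) are hypothetical, the k-means initialisation is a simple deterministic farthest-first choice, and each paper is assumed to be a row of numeric CoreSC feature values.

```python
from itertools import combinations
from math import dist


def kmeans(points, k, iterations=20):
    """Tiny k-means with deterministic farthest-first initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from every centroid chosen so far.
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: dist(p, centroids[j])) for p in points
        ]
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; closer to 1.0 means cleaner clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return 0.0
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for name, other in clusters.items()
            if name != lab
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def best_feature_triple(rows, feature_names, k=3):
    """Cluster on every 3-feature combination; return (score, names) of the best."""
    best = None
    for combo in combinations(range(len(feature_names)), 3):
        points = [tuple(row[i] for i in combo) for row in rows]
        score = silhouette(points, kmeans(points, k))
        if best is None or score > best[0]:
            best = (score, tuple(feature_names[i] for i in combo))
    return best
```

The pairwise silhouette computation is quadratic in the number of papers, and it is repeated for every feature triple, which is one reason a search like this can take a while on a laptop.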
Forests and Trees
The next step was to train a random forest to learn how to classify papers, based upon the database I’d built up. I used three-fold cross-validation, whereby the whole database is randomly split into three parts and each part in turn is held out as testing data while the other two parts are used for training. The results from each testing set are then compared and averaged to get a good picture of the system’s accuracy.
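For reference, the standard form of k-fold cross-validation can be sketched in a few lines of plain Python. This is an illustrative sketch, not the code used for the results in this post: `k_fold_splits` and `cross_validate` are hypothetical names, and `train_fn` and `predict_fn` stand in for whatever learner is being evaluated.

```python
import random


def k_fold_splits(n_items, k=3, seed=0):
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        # Train on every fold except the one held out for testing.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(items, labels, train_fn, predict_fn, k=3, seed=0):
    """Return the classification accuracy averaged over the k test folds."""
    accuracies = []
    for train, test in k_fold_splits(len(items), k, seed):
        model = train_fn([items[i] for i in train], [labels[i] for i in train])
        correct = sum(predict_fn(model, items[i]) == labels[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k
```

With around 600 papers and k=3, each fold would train on roughly 400 papers and test on the remaining 200, and every paper is used for testing exactly once.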
I also used two different types of tree for my random forest. One used Information Gain to select the features to split upon; the other split randomly. The results of the test can be seen below:
--------------------3 Fold Cross Validation------------------------
Learner    CA     Brier  AUC
for_gain   0.800  0.282  0.934
for_simp   0.792  0.278  0.938
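To make the difference between the two tree types concrete, here is a minimal sketch of the two ways a tree node might pick its split: one searches for the feature and threshold with the highest information gain, the other simply picks at random. The function names are hypothetical and the exact behaviour of the learners used for the table above may well differ; rows are assumed to be tuples of numeric feature values.

```python
import random
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting on `value <= threshold`."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted


def pick_split_by_gain(rows, labels):
    """Return (gain, feature_index, threshold) for the best split found."""
    best = (-1.0, None, None)
    for feature in range(len(rows[0])):
        values = [row[feature] for row in rows]
        # Candidate thresholds: every distinct value except the largest.
        for threshold in sorted(set(values))[:-1]:
            gain = information_gain(values, labels, threshold)
            if gain > best[0]:
                best = (gain, feature, threshold)
    return best


def pick_split_randomly(rows, rng):
    """Return a (feature_index, threshold) pair chosen without looking at labels."""
    feature = rng.randrange(len(rows[0]))
    values = sorted(set(row[feature] for row in rows))
    threshold = rng.choice(values[:-1]) if len(values) > 1 else values[0]
    return feature, threshold
```

A single randomly split tree is usually weaker than one grown with information gain, but averaging many of them in a forest tends to even this out, which is consistent with the two learners scoring so closely.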
The system was approximately 80% accurate, with a low Brier score (closer to zero is better) and a high Area Under Curve (AUC) reading (closer to 1.0 is better). I was still curious, so I decided to put all of the CoreSC types in as features to see what would happen. I was again pleasantly surprised.
--------------------3 Fold Cross Validation------------------------
Learner    CA     Brier  AUC
for_gain   0.848  0.202  0.966
for_simp   0.869  0.203  0.966
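For readers unfamiliar with the CA, Brier and AUC columns, the sketch below shows one common way to compute each of them. It makes some assumptions: the Brier score here is the multi-class form that sums squared errors over all classes (so it ranges from 0 to 2), and the AUC shown is the plain two-class version; how the multi-class AUC in the tables above was aggregated is not covered. All function names are hypothetical.

```python
def classification_accuracy(predicted, actual):
    """Fraction of papers whose predicted type matches the true type."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def brier_score(probabilities, actual):
    """Mean squared error of predicted class probabilities (0 is perfect).

    `probabilities` is a list of dicts mapping each class to its predicted
    probability; squared errors are summed over all classes per paper.
    """
    total = 0.0
    for probs, truth in zip(probabilities, actual):
        total += sum(
            (p - (1.0 if label == truth else 0.0)) ** 2
            for label, p in probs.items()
        )
    return total / len(actual)


def binary_auc(scores, is_positive):
    """Probability that a random positive outscores a random negative."""
    positives = [s for s, pos in zip(scores, is_positive) if pos]
    negatives = [s for s, pos in zip(scores, is_positive) if not pos]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))
```

An AUC of 0.5 corresponds to random guessing, so readings above 0.9 indicate that the classifier ranks the correct type well above the alternatives most of the time.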
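Although Background, Methodology and Motivation were very good discriminators on their own, correctly categorising about 80% of papers, adding all of the CoreSC types as features improved classification accuracy by roughly 5 to 8 percentage points (from 0.800 to 0.848 for for_gain and from 0.792 to 0.869 for for_simp).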
What’s next?
Now that I’m able to correctly discern a paper’s type from its CoreSC content, I can get started on paper type filtering in Partridge’s frontend.