Clustering adventures (contd.)

So I have been back through and done some more work on the clustering of the data. It looks like my algorithm is now clustering research and review papers accurately. To begin with, I had a very low k-means silhouette value, and I was finding that my clusters changed unpredictably depending on which paper was used as the initial centroid.

I noticed that the main differences between review and research papers are in the Background and Model CoreSC sections of the paper. Therefore, I decided to rebuild my model, taking into account only these attributes. My strongest k-means silhouette score immediately went up from 0.2 to around 0.7.
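
For anyone unfamiliar with the silhouette score, here is a minimal pure-Python sketch of how it can be computed. This is not the code I actually ran: the function names, the Euclidean distance and the toy feature vectors (imagined as "fraction of Background sentences" and "fraction of Model sentences" per paper) are all assumptions made up for illustration.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (simple O(n^2) version)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Common convention: a point alone in its cluster scores 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the nearest other cluster.
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: (background fraction, model fraction) per paper.
toy_points = [(0.10, 0.30), (0.12, 0.28), (0.50, 0.05), (0.55, 0.02)]
good_labels = [0, 0, 1, 1]  # groups the two tight pairs together
bad_labels = [0, 1, 0, 1]   # splits each pair across clusters

good_score = silhouette_score(toy_points, good_labels)
bad_score = silhouette_score(toy_points, bad_labels)
print(good_score, bad_score)
```

The score runs from -1 to 1. Values near 1 mean each paper sits much closer to its own cluster than to the next nearest one, which is why a jump from 0.2 to around 0.7 is encouraging.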

I’ve also been examining the contents of the clusters and found that the algorithm is fairly accurate at putting the research papers (from the ART corpus) and the review papers (from the PLOS ONE corpus) in the right cluster.

Loading data...
Clustering Data...
Counter({'pone': 33, '?': 7, 'art': 6})
Counter({'art': 257, 'pone': 6, '?': 2})
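
The per-cluster tallies above are `collections.Counter` objects. A rough sketch of how such tallies can be produced, and how to read off the majority share of each cluster, might look like the following. The function names and the tiny example lists are hypothetical; only the two counters at the bottom are copied from the output above.

```python
from collections import Counter, defaultdict


def tally_clusters(cluster_ids, sources):
    """Count how many papers from each source landed in each cluster."""
    tallies = defaultdict(Counter)
    for cluster_id, source in zip(cluster_ids, sources):
        tallies[cluster_id][source] += 1
    return dict(tallies)


def majority_share(counter):
    """Return the most common source in a cluster and its fraction."""
    label, count = counter.most_common(1)[0]
    return label, count / sum(counter.values())


# Hypothetical miniature example: one cluster id and one source tag per paper.
example = tally_clusters([0, 0, 1, 1, 0], ["pone", "?", "art", "art", "pone"])

# The two counters reported in the output above.
reported = [
    Counter({"pone": 33, "?": 7, "art": 6}),
    Counter({"art": 257, "pone": 6, "?": 2}),
]
shares = [majority_share(c) for c in reported]
for label, share in shares:
    print(label, round(share, 3))
```

On the reported numbers this gives a majority share of roughly 72% for the 'pone' cluster (33 of 46) and roughly 97% for the 'art' cluster (257 of 265).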

I was worried that the anomalous ART papers and PLOS ONE papers that ended up in the wrong clusters meant that the system had misclassified them. However, upon closer examination, these turned out to be review papers and research papers respectively that had slipped through the net when the corpora were assembled. This is great news, as it suggests a strong correlation between the amount of background information in a paper and its ‘type’.

Next step will be training a supervised model to recognise these differences and then adding this behaviour to the paper preclassifier built into Partridge’s upload system.
