Blog Archives

timetrack improvements

I’ve just added a couple of improvements to timetrack that allow you to append to existing time recordings (either with an amount like 15m or using live to time additional minutes spent and append them). You can also remove entries using timetrack rm instead of remove – saving keystrokes is what programming is all about. … Continue reading timetrack improvements

Posted in Open Source, PhD, timetrack

ElasticSearch: Turning analysis off and why it’s useful

I have recently been playing with ElasticSearch a lot for my PhD and have started trying some more complicated queries and pattern matching using the query DSL. I have an index on my local machine called impact_studies which contains all 6637 REF 2014 impact case studies in JSON format. One of the … Continue reading ElasticSearch: Turning analysis off and why it’s useful

Posted in analysis, elasticsearch, indexing, PhD

Freecite python wrapper

I’ve written a simple wrapper around the Brown University Citation parser FreeCite. I’m planning to use the service to pull out author names from references in REF impact studies and try to link them back to investigators listed on RCUK funding applications. The code is here and is MIT licensed. It provides a simple method … Continue reading Freecite python wrapper

Posted in citations, freecite, PhD, rcuk, ref, references

Scrolling in ElasticSearch

I know I’m doing a lot of flip-flopping between Solr and ElasticSearch at the moment – I’m trying to figure out the key similarities and differences between them and where one is more suitable than the other. The following is an example of how to map a function f onto an entire set of indexed data in ElasticSearch using … Continue reading Scrolling in ElasticSearch

Posted in elasticsearch, lucene, PhD, results, scan, scroll

SSSplit Improvements

Introduction: As part of my continuing work on Partridge, I’ve been working on improving the sentence splitting capability of SSSplit – the component used to split academic papers from PLOS ONE and PubMed Central into separate sentences. Papers arrive in our system as big blocks of text with the occasional diagram or formula and in order … Continue reading SSSplit Improvements

Posted in demo, improvements, java, PhD, regex, split, sssplit, test, Work

New Annotation Backend

Over the last couple of months I’ve been working with Dr Liakata on rewriting the SAPIENTA project in Python. The work was fun and rewarding and I got to visit Warwick University and deliver a talk about Partridge to some

Posted in Uncategorized

New paper processing system

After a brief discussion with my supervisors, it turns out that Partridge has been annotating papers incorrectly (very, very slightly) and this may have caused problems with our data integrity. That means we’re having to rebuild the database from

Posted in Uncategorized

PlosGet.py – Downloading test data from PlosONE

One of the big problems that I’ve been having recently is a severe lack of test data for trying out new machine learning behaviours. I started off with just papers from the ART Corpus and manually cherry-picked some papers from

Posted in Uncategorized