My master's thesis, done together with my colleague and friend Salvatore Trani, tackled a nice and challenging problem: recommending citations for academic papers. We basically built a system that can suggest related papers given the general idea of the research you want to do.
The system aims to shorten and simplify the initial exploratory step of a research project. The prototype is available at http://vinello.isti.cnr.it/cite-as-you-write/ but it's quite far from production ready: it's kind of slow, SQLite doesn't scale well enough for the archive we serve, and the web service used for internal communication is Bottle, a very nice and simple Python module but probably not the best choice we could have made. Still, it might be interesting to some of you. We've tested it on a few real-life cases and it gave surprising results. Our thesis is written in Italian, but I'm very glad to share what we've done. This post is meant as an introduction to the most interesting parts of the work from the algorithmic point of view. Actually, most of the effort went into preprocessing the data, but I'll omit that part here; maybe I'll cover it in another post.
First, let me point out how this differs from other search engines: Cite-as-you-write is not a search engine. Search engines are usually keyword based, which means that if you submit a very long query to Google you're likely to see irrelevant results. To let Cite-as-you-write answer very long queries (possibly spanning more than one sentence), the index must be very tolerant and return documents that don't contain all the words entered by the user. We accomplished this by extracting a kind of context from the query. The summarization is done simply by selecting the most important words from the query, in a very naive way: sort the terms by tf-idf and take the first K. The posting lists of the selected terms are then merged together; we take the union rather than the intersection of the lists, since intersecting would shrink the candidate list massively.
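To make the candidate selection concrete, here is a minimal sketch. It assumes an inverted index stored as a plain mapping from term to posting list and a precomputed idf table; the names `index`, `idf` and the default K are made up for illustration, not taken from our actual code.

    from collections import Counter

    def select_candidates(query_terms, index, idf, k=10):
        # Score each query term by tf-idf: its frequency within the query
        # times its inverse document frequency in the collection.
        tf = Counter(query_terms)
        top_terms = sorted(tf, key=lambda t: tf[t] * idf.get(t, 0.0),
                           reverse=True)[:k]
        # Merge (union, not intersection) the posting lists of the top-K
        # terms: a candidate only needs to match one important term.
        candidates = set()
        for term in top_terms:
            candidates.update(index.get(term, []))
        return candidates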
Once the candidates are selected, we rank them to order the suggestions. The ranking is done in two steps: the first step simply computes the cosine similarity between the query and the title/abstract of each candidate; the second step uses a Random Forest model to also exploit other features such as the number of citations, PageRank, publication date, and so on.
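Here is a rough sketch of the two steps. It uses scikit-learn as a stand-in for our actual code (the Random Forest in the prototype was trained with mltool, described below), and the feature layout in the second step is just an assumption for illustration.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def rank_step1(query, candidate_texts, top_n=100):
        # Step 1: cosine similarity between the query and each candidate's
        # title+abstract, keeping only the best top_n candidates.
        vectors = TfidfVectorizer().fit_transform([query] + candidate_texts)
        sims = cosine_similarity(vectors[0:1], vectors[1:]).ravel()
        keep = np.argsort(sims)[::-1][:top_n]
        return keep, sims[keep]

    def rank_step2(model, features):
        # Step 2: a trained Random Forest (here, e.g., sklearn's
        # RandomForestRegressor) rescores the survivors; each feature row
        # would hold the cosine similarity plus the extra signals:
        # citation count, PageRank, publication date, ...
        scores = model.predict(np.asarray(features))
        return np.argsort(scores)[::-1]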
Some of you might argue that the cosine similarity could be too slow for the first ranking step. Indeed it is; one option would be to adopt one of the early-termination methods discussed in the literature. We tried another path instead: aggressive index pruning. What does that mean? Usually, to index a document you split its text into terms and then add the document to the posting list of each term. A light pruning strategy would at least remove the stopwords; our aggressive choice removes everything but the "important terms". Once again, the important terms are the first K terms sorted by tf-idf.
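A minimal sketch of indexing with this aggressive pruning follows; the whitespace tokenization and the value of K are placeholders, and `idf` is again assumed to be precomputed over the whole collection.

    from collections import Counter, defaultdict

    def index_document(doc_id, text, index, idf, k=20):
        # Keep only the top-K terms of the document by tf-idf; every other
        # term is pruned, so its posting list never sees this document.
        tf = Counter(text.lower().split())  # naive tokenization, for brevity
        important = sorted(tf, key=lambda t: tf[t] * idf.get(t, 0.0),
                           reverse=True)[:k]
        for term in important:
            index[term].append(doc_id)

    index = defaultdict(list)  # term -> posting list of document ids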
The code of the prototype is not published yet, but we've released a small piece of it: the tool used to train the ranking function for the second step, which we've called mltool (it stands for Machine Learning Tool). The tool is available on Bitbucket at https://bitbucket.org/duilio/mltool. Please feel free to use it, fork it, complain, and suggest improvements. You can actually install it without even looking at the source code, using pip or easy_install:
$ pip install mltool
$ easy_install mltool
Enjoy ;-)