Notes on Learning To Rank


We want to learn a function \(f(q, D)\) which takes in a query \(q\) and a list of documents \(D=\{d_1, d_2, ..., d_n\}\), and produces scores using which we can rank/order the list of documents.


There are multiple ways we can formulate the problem:

  1. Pointwise
  2. Pairwise
  3. Listwise


In this approach we learn \(f(q,d)\), which scores the match-ness between the query and document independently. When scoring a data point, the function does not take other document in the list into consideration.

To train a model in this approach, the data would be in the long format where each row contains a \((q,d)\) pair and we need labels for every row. Either the label is binary(classification) or relevance scores(regression).


In this approach we learn \(Pr(rank(d_i,q)\succ rank(d_j,q))\), that is to learn to determine relevant preference between two documents given a query.

It can be treated as binary classification problem, the data would be in the format where each row contains a triplet of \((q,d_i,d_j)\) and we need a binary label for each row. We can hand crafting features that captures the difference between \(d_i,d_j\) with respect to \(q\) and feed that difference to a binary classifier.

Or more often we learn it in a pointwise fashion, by learning the intermediate rank function. Let \(rank(d_i,q)=s_i\) , the pairwise classification problem becomes classification on the difference between rank scores. That is:

\[Pr(rank(d_i,q) > rank(d_j,q))=\frac{1}{1+exp(-(s_i-s_j))}\]

The loss would be the negative log of this likelihood, which is: \(L_{ij}=log(1+exp(s_j-s_i))\) and we can train the rank function to minimize this loss.

If we work out the graident with respect to the parameter in the rank function, it is:

\[\begin{aligned} \frac{\partial L_{ij}}{\partial \theta}&=\frac{\partial L_{ij}}{\partial s_i}\frac{\partial s_i}{\partial \theta} + \frac{\partial L_{ij}}{\partial s_j}\frac{\partial s_j}{\partial \theta} \\ &=-\frac{1}{1 + exp(s_i-s_j)}(\frac{\partial s_i}{\theta} - \frac{\partial s_j}{\theta}) \\ &=\lambda_{ij}(\frac{\partial s_i}{\theta} - \frac{\partial s_j}{\theta}) \end{aligned}\]

As a result, a single gradient descent step with this gradient is doing a gradient ascent for \(s_i\) and gradient descent for \(s_j\) together with a weight of \(\lambda_{ij}\). That is, for a given pair of documents, we make the score of a more relevant document higher, and make the score of a less relevant document lower, and how much we perform the update is determined by the score difference.


To be continued.