# Notes on Learning To Rank

We want to learn a function $$f(q, D)$$ which takes in a query $$q$$ and a list of documents $$D=\{d_1, d_2, ..., d_n\}$$, and produces scores using which we can rank/order the list of documents.

## Types

There are multiple ways we can formulate the problem:

1. Pointwise
2. Pairwise
3. Listwise

### Pointwise

In this approach we learn $$f(q,d)$$, which scores the match-ness between the query and document independently. When scoring a data point, the function does not take other document in the list into consideration.

To train a model in this approach, the data would be in the long format where each row contains a $$(q,d)$$ pair and we need labels for every row. Either the label is binary(classification) or relevance scores(regression).

### Pairwise

In this approach we learn $$Pr(rank(d_i,q)\succ rank(d_j,q))$$, that is to learn to determine relevant preference between two documents given a query.

It can be treated as binary classification problem, the data would be in the format where each row contains a triplet of $$(q,d_i,d_j)$$ and we need a binary label for each row. We can hand crafting features that captures the difference between $$d_i,d_j$$ with respect to $$q$$ and feed that difference to a binary classifier.

Or more often we learn it in a pointwise fashion, by learning the intermediate rank function. Let $$rank(d_i,q)=s_i$$ , the pairwise classification problem becomes classification on the difference between rank scores. That is:

$Pr(rank(d_i,q) > rank(d_j,q))=\frac{1}{1+exp(-(s_i-s_j))}$

The loss would be the negative log of this likelihood, which is: $$L_{ij}=log(1+exp(s_j-s_i))$$ and we can train the rank function to minimize this loss.

If we work out the graident with respect to the parameter in the rank function, it is:

\begin{aligned} \frac{\partial L_{ij}}{\partial \theta}&=\frac{\partial L_{ij}}{\partial s_i}\frac{\partial s_i}{\partial \theta} + \frac{\partial L_{ij}}{\partial s_j}\frac{\partial s_j}{\partial \theta} \\ &=-\frac{1}{1 + exp(s_i-s_j)}(\frac{\partial s_i}{\theta} - \frac{\partial s_j}{\theta}) \\ &=\lambda_{ij}(\frac{\partial s_i}{\theta} - \frac{\partial s_j}{\theta}) \end{aligned}

As a result, a single gradient descent step with this gradient is doing a gradient ascent for $$s_i$$ and gradient descent for $$s_j$$ together with a weight of $$\lambda_{ij}$$. That is, for a given pair of documents, we make the score of a more relevant document higher, and make the score of a less relevant document lower, and how much we perform the update is determined by the score difference.

To be continued.