This post continues from the earlier post on fixing DPO (https://kyunghyuncho.me/a-proper-preference-optimization-loss-and-its-gradient/). by the way, the dinner reservation was at Ramro (https://www.ramronyc.com/, https://maps.app.goo.gl/jwpyPvy2pjNsxS6h9), and i recommend you try it out. a very interesting cuisine!

## Direct Preference Optimization

let’s start by stating the direct preference optimization (DPO) loss for each example $(x,y_+, y_-)$:

\[

\log \left( 1 + \exp \left(-\left(

\beta \log \frac{\pi(y_+)}{\pi(y_-)}

-\gamma \log \frac{\pi_0(y_+)}{\pi_0(y_-)}

\right) \right) \right).

\]

this takes a slightly different form from the original DPO loss. in the original DPO loss, $\gamma = \beta$ was forced, which leaves the scale (or entropy) of the reference model $\pi_0$ uncontrollable. this formulation above is more desirable, as it allows us to remove the effect of the scale of the reference model by tuning $\gamma$ appropriately.
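as a minimal sketch of the loss above (not the reference DPO implementation), the per-example loss can be written in plain Python. the function name `dpo_loss` and its argument names are made up for illustration; the inputs are assumed to be sequence-level log-probabilities of $y_+$ and $y_-$ under $\pi$ and $\pi_0$.

```python
import math

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=1.0, gamma=1.0):
    """Per-example loss log(1 + exp(-z)), where
    z = beta * log(pi(y+)/pi(y-)) - gamma * log(pi_0(y+)/pi_0(y-)).
    Setting gamma == beta gives the original (tied) DPO form."""
    z = beta * (logp_pos - logp_neg) - gamma * (ref_logp_pos - ref_logp_neg)
    # numerically stable evaluation of log(1 + exp(-z))
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# toy numbers (hypothetical): the policy mildly prefers y_+,
# the reference prefers it strongly
print(dpo_loss(-1.0, -1.5, -1.0, -3.0, beta=1.0, gamma=1.0))
print(dpo_loss(-1.0, -1.5, -1.0, -3.0, beta=1.0, gamma=0.0))
```

with $\gamma = 0$ the reference model drops out entirely, so only the policy's own log-ratio matters.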

take, as an example, two reference distributions $\pi_0$ and $\pi_0'$ that satisfy

\[

\pi_0'(y) \propto \pi_0(y)^\alpha,

\]

where $\alpha > 0$. the preference ranking is the same under both distributions, but the preference ratio may change dramatically, since

\[

\left(\frac{\pi_0(y_+)}{\pi_0(y_-)}\right)^\alpha =

\frac{\pi_0'(y_+)}{\pi_0'(y_-)}.

\]

because it is the relative ranking between $y_+$ and $y_-$ we are concerned with in DPO, we should arrive at more or less the same solution regardless of whether we use $\pi_0$ or $\pi_0'$ as a reference (or a prior) distribution. without an extra hyperparameter $\gamma$, this is essentially impossible, since tying $\gamma$ to $\beta$ leaves no way to absorb the exponent $\alpha$. we thus stick to the formulation above in the rest of this post.
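a small numerical sketch of this argument, under made-up numbers: a toy three-element output space, a tempered reference $\pi_0' \propto \pi_0^\alpha$ with $\alpha = 3$, and hypothetical helper names `log_softmax` and `dpo_loss`. dividing $\gamma$ by $\alpha$ should recover the loss under the original reference, while keeping $\gamma = \beta$ should not.

```python
import math

def log_softmax(scores):
    """Turn unnormalized log-scores into log-probabilities."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def dpo_loss(pi_gap, ref_gap, beta=1.0, gamma=1.0):
    """log(1 + exp(-z)) with z = beta * pi_gap - gamma * ref_gap, where each
    gap is log p(y_+) - log p(y_-) under the corresponding model."""
    z = beta * pi_gap - gamma * ref_gap
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

alpha = 3.0
log_ref = log_softmax([2.0, 0.5, -1.0])                       # log pi_0 (toy)
log_ref_sharp = log_softmax([alpha * lp for lp in log_ref])   # log pi_0', proportional to pi_0^alpha
log_pi = log_softmax([1.0, 0.8, 0.1])                         # log pi (toy)

pos, neg = 0, 1  # indices of y_+ and y_- in the toy output space
pi_gap = log_pi[pos] - log_pi[neg]
ref_gap = log_ref[pos] - log_ref[neg]
sharp_gap = log_ref_sharp[pos] - log_ref_sharp[neg]  # equals alpha * ref_gap

loss_original = dpo_loss(pi_gap, ref_gap, beta=1.0, gamma=1.0)
loss_tied = dpo_loss(pi_gap, sharp_gap, beta=1.0, gamma=1.0)              # gamma forced to beta
loss_rescaled = dpo_loss(pi_gap, sharp_gap, beta=1.0, gamma=1.0 / alpha)  # gamma absorbs alpha
print(loss_original, loss_tied, loss_rescaled)
```

the first and third numbers should agree up to floating-point error, because the normalization constant of $\pi_0'$ cancels in the log-ratio and the remaining factor $\alpha$ is absorbed by $\gamma$.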

## Margin-based formulation of DPO

already in 2011, Collobert, Weston et al. told us that “*[i]t is therefore desirable to define alternative training criteria [to cross-entropy, a.k.a. log-loss]. We propose here to use a pairwise ranking approach.*” so, i will follow their lead (and also because it makes the analysis below easier) and create a hinge loss variant of DPO:

\[

\max\left(0,

\gamma \log \frac{\pi_0(y_+)}{\pi_0(y_-)}

-\beta \log \pi(y_+) + \beta \log \pi(y_-)

\right).

\]

this version of the DPO loss is minimized when the following condition is satisfied:

\[

\begin{array}{l l}

&\frac{\gamma}{\beta} \log \frac{\pi_0(y_+)}{\pi_0(y_-)}

-\log \pi(y_+) + \log \pi(y_-) \leq 0

\\

\iff&

\log \pi(y_+) \geq \log \pi(y_-) + \frac{\gamma}{\beta} \log \frac{\pi_0(y_+)}{\pi_0(y_-)},

\end{array}

\]

where we assume $\beta > 0$.

in other words, the log-probability assigned to $y_+$ should be greater than that assigned to $y_-$ by a margin of $\frac{\gamma}{\beta} \log \frac{\pi_0(y_+)}{\pi_0(y_-)}$. this margin can be written down as

\[

\frac{\gamma}{\beta}

\left(\log \pi_0(y_+) - \log \pi_0(y_-)\right).

\]
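the hinge variant and its zero-loss condition can be sketched in a few lines. the names `hinge_dpo_loss` and `margin_satisfied` are hypothetical, and the inputs are again assumed to be sequence-level log-probabilities.

```python
def hinge_dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=1.0, gamma=1.0):
    """max(0, gamma * log(pi_0(y+)/pi_0(y-)) - beta * (log pi(y+) - log pi(y-)))."""
    ref_term = gamma * (ref_logp_pos - ref_logp_neg)
    return max(0.0, ref_term - beta * (logp_pos - logp_neg))

def margin_satisfied(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=1.0, gamma=1.0):
    """True when log pi(y+) >= log pi(y-) + (gamma / beta) * log(pi_0(y+)/pi_0(y-)).
    Assumes beta > 0, as in the text."""
    margin = (gamma / beta) * (ref_logp_pos - ref_logp_neg)
    return logp_pos >= logp_neg + margin

# toy pair (hypothetical numbers): the reference margin is 1.0
print(hinge_dpo_loss(-1.0, -3.0, -1.0, -2.0), margin_satisfied(-1.0, -3.0, -1.0, -2.0))
print(hinge_dpo_loss(-1.0, -1.5, -1.0, -2.0), margin_satisfied(-1.0, -1.5, -1.0, -2.0))
```

in the first call the policy's log-ratio (2.0) clears the margin, so the loss is zero; in the second it falls short (0.5), so the loss is positive.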

## What does this margin-based DPO do?

let $\gamma/\beta=1$, without loss of generality.

consider a pair $(y_+, y_-)$ with which the reference (prior) model disagrees, i.e. $\log \frac{\pi_0(y_+)}{\pi_0(y_-)} < 0$. the new model $\pi$ then does not need to ensure $\pi(y_+) > \pi(y_-)$. it only needs to ensure that $\log \pi(y_+) - \log \pi(y_-) \geq \log \pi_0(y_+) - \log \pi_0(y_-)$. in other words, as long as the new model puts ever so slightly more probability on $y_+$ than on $y_-$, relative to the reference model, it is all fine.

on the other hand, when the reference (prior) model is already correct, the new model must also ensure that it puts a higher probability on $y_+$ than on $y_-$, with a margin that matches the log-probability ratio between $y_+$ and $y_-$ under the prior model.

it is a usual practice, if not *the* practice, to initialize the new model $\pi$ with the reference model $\pi_0$. in this case, with $\gamma/\beta = 1$, there is almost no learning, since the (hinge loss-based) DPO loss is already zero by default. even the original version of DPO would be very close to zero, resulting in a very small amount of learning.

let us dig deeper into this below.

### $\gamma$ determines the learning behaviour

w.l.o.g., let us assume $\beta=1$.

if $\gamma = 0$, the new model must learn the preference purely from the data, as there is no (positive or negative) margin derived from the reference model. in this case, there is no constraint on how much the new model can deviate from the reference model, even if it started from the reference model. it will however learn the correct preference ranking.

as $\gamma$ increases, i.e. $\gamma \to \infty$, two things start to happen. first, for a pair $(y_+, y_-)$ for which the prior model $\pi_0$ is incorrect, the new model $\pi$ also does not need to get this pair correct, as long as the log-probability assigned to $y_+$ is within some margin of the log-probability assigned to $y_-$. with a very large $\gamma$, learning simply does not need to do anything with such a pair. in other words, learning does not make the new model get such a pair correct. if the new model already gets it correct, nothing happens.

second, for a pair $(y_+, y_-)$ for which the prior model is correct, the new model needs to get this pair correct at least as well as the scaled prior model, $\pi_0^{\gamma/\beta}$. if the pair was incorrect under the new model, the new model will be updated to ensure the pair becomes correctly ranked. even if the pair was correct to begin with under the new model, learning will continue to increase the margin, i.e. $\log \pi(y_+) - \log \pi(y_-)$, until it at least matches that under the scaled prior model.

overall, these observations tell us that a large $\gamma$ (given fixed $\beta$) effectively prevents most of the training pairs from contributing to learning. as shown in the table below, the current DPO formulation is heavily asymmetric in that learning is largely driven only by the training examples for which the reference model is already correct.

| $\gamma \gg 0$ | $\pi_0$ incorrect | $\pi_0$ correct |
|---|---|---|
| $\pi$ incorrect | no learning (*) | learning |
| $\pi$ correct | no learning | learning (*) |

this is in stark contrast to the case of $\gamma=0$, where learning happens as long as the new model $\pi$ is incorrect, as shown in the following table:

| $\gamma = 0$ | $\pi_0$ incorrect | $\pi_0$ correct |
|---|---|---|
| $\pi$ incorrect | learning (*) | learning |
| $\pi$ correct | no learning | no learning (*) |
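the two tables can be reproduced with a tiny sketch, under the hinge loss with $\beta = 1$ and made-up gap sizes (the policy's log-ratio is $\pm 0.5$, the reference's is $\pm 1$). the helper names `hinge_is_active` and `learning_table` are hypothetical, and "learning" here simply means that the hinge loss is positive for that pair.

```python
def hinge_is_active(policy_log_ratio, ref_log_ratio, beta=1.0, gamma=0.0):
    """True when gamma * ref_log_ratio - beta * policy_log_ratio > 0,
    i.e. the hinge loss is positive and the pair yields a gradient."""
    return gamma * ref_log_ratio - beta * policy_log_ratio > 0.0

def learning_table(gamma, policy_gap=0.5, ref_gap=1.0, beta=1.0):
    """Fill in one 2x2 table; a negative log-ratio means the model is incorrect."""
    table = {}
    for pi_label, pi_sign in (("pi incorrect", -1.0), ("pi correct", 1.0)):
        for ref_label, ref_sign in (("pi_0 incorrect", -1.0), ("pi_0 correct", 1.0)):
            active = hinge_is_active(pi_sign * policy_gap, ref_sign * ref_gap, beta, gamma)
            table[(pi_label, ref_label)] = "learning" if active else "no learning"
    return table

for gamma in (10.0, 0.0):
    print("gamma =", gamma)
    for cell, outcome in learning_table(gamma).items():
        print("  ", cell, "->", outcome)
```

the $\gamma \gg 0$ pattern relies on the assumption that the scaled reference gap dominates the policy gap; with a much larger policy gap, some cells would flip.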

### When $\pi$ is initialized to be $\pi_0$, …

a usual practice, if not *the* practice, is to initialize the new policy $\pi$ with the reference policy $\pi_0$. in that case, these two models agree with each other on all training examples initially. in other words, we only need to focus on the diagonal of the tables above (marked with *).

a weird observation with $\gamma \gg 0$ is that learning largely happens only when the new model $\pi$ is already correct, while this is not the case with $\gamma = 0$. that is, DPO with $\gamma \gg 0$ effectively operates on the small subset of examples for which the reference model was correct, by increasing the margin on those already-correct examples. because the model was already correct on them, there is less to learn than with $\gamma=0$, which implicitly regularizes learning so that the new model stays close to the original (prior) model. this is a weird way to regularize learning, compared to a more explicit approach, such as mixout (yes, yes, i wanted to plug it in ..)
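to make the initialization case concrete, here is a short worked step, obtained by substituting $\pi = \pi_0$ into the hinge loss from earlier in the post (nothing beyond that substitution is assumed):

\[
\max\left(0,
(\gamma - \beta) \log \frac{\pi_0(y_+)}{\pi_0(y_-)}
\right).
\]

this is positive only when $\gamma - \beta$ and $\log \frac{\pi_0(y_+)}{\pi_0(y_-)}$ have the same sign. with $\gamma > \beta$, only the pairs that the reference model already ranks correctly are active at initialization; with $\gamma < \beta$ (including $\gamma = 0$), only the incorrectly ranked pairs are; and with $\gamma = \beta$, no pair is active.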

## So, what are you saying?

i have a nagging feeling that DPO isn’t what people claim it is, and that its success is not actually due to DPO, or its motivation, but simply due to the combination of some luck, hyperparameter tuning (including early stopping) and stochasticity (yes, i consider stochasticity separate from luck.)

and … i just saw on twitter this past weekend that Meng et al. (2024) proposed a variant of DPO, called SimPO, that effectively sets $\gamma=0$ and introduces a constant margin (instead of the input-dependent margin in DPO.) i want to say great minds think alike, but i know that this team is of greater mind than i am, and that the margin-based ranking loss has been Jason Weston’s favourite loss ever since the early 2000s; what a visionary!