This post continues from the earlier post on fixing DPO (https://kyunghyuncho.me/a-proper-preference-optimization-loss-and-its-gradient/). by the way, the dinner reservation was at Ramro (https://www.ramronyc.com/, https://maps.app.goo.gl/jwpyPvy2pjNsxS6h9), and i recommend you try it out. a very interesting cuisine!

Direct Preference Optimization

let’s start by stating the direct preference optimization (DPO) loss for each example $(x, y_+, y_-)$:

\[\log \left( 1 + \exp \left(-\left(\beta \log \frac{\pi(y_+)}{\pi(y_-)}-\gamma \log \frac{\pi_0(y_+)}{\pi_0(y_-)}\right) \right) \right).\]

this takes a slightly different form from the original DPO loss. in the original DPO loss, $\gamma = \beta$ was forced, which leaves the scale (or entropy) of the reference model $\pi_0$ uncontrollable. this formulation above is
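as a rough illustration of the per-example loss stated in the excerpt above, here is a minimal sketch in plain Python. the function names (`softplus`, `generalized_dpo_loss`) and the choice to pass log-probabilities as plain floats are assumptions made for this example, not anything taken from the post itself.

```python
import math


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def generalized_dpo_loss(
    logp_pos: float,   # log pi(y_+ | x) under the model being trained
    logp_neg: float,   # log pi(y_- | x) under the model being trained
    logp0_pos: float,  # log pi_0(y_+ | x) under the reference model
    logp0_neg: float,  # log pi_0(y_- | x) under the reference model
    beta: float = 1.0,
    gamma: float = 1.0,
) -> float:
    """log(1 + exp(-(beta * log-ratio under pi - gamma * log-ratio under pi_0)))."""
    margin = beta * (logp_pos - logp_neg) - gamma * (logp0_pos - logp0_neg)
    return softplus(-margin)


# hypothetical log-probabilities, chosen only to exercise the function
loss_original_dpo = generalized_dpo_loss(-1.0, -3.0, -2.0, -2.5, beta=0.1, gamma=0.1)
loss_decoupled = generalized_dpo_loss(-1.0, -3.0, -2.0, -2.5, beta=0.1, gamma=0.5)
```

setting `gamma == beta` gives back the original DPO form, while choosing them separately changes how strongly the reference model's log-ratio enters the margin.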

# Author: kyunghyuncho

## Fixing DPO but I have a dinner reservation …

Direct preference optimization (DPO; https://arxiv.org/abs/2305.18290) is all the rage, i heard. i also hear from my students that DPO, which minimizes the following loss, often results in weird behaviours, such as unreasonable preference toward lengthy responses (even when there is no statistical difference in lengths between desirable and undesirable responses.) i won’t go into details of these issues, but i feel like there’s a relatively simple reason behind these pathologies based on basic calculus. \[\mathcal{L}_{\mathrm{dpo}}(\theta) = \log \left(1 + \exp \left(- \log \frac{p_{\theta}(y|x)}{p_{0}(y|x)}+ \log \frac{p_\theta(y'|x)}{p_{0}(y'|x)}\right)\right),\] where $p_0$ is the so-called reference model from which $y$ and $y'$ were drawn independently

## A random thought on retrieval-augmented generation

retrieval-augmented generation (RAG) is all the rage in the world of LLMs (i heard.) RAG confuses me quite a bit, since it’s unclear to me how RAG should work. in particular, i have a major confusion in how language models should be trained to be good at retrieval-augmented generation. it’s a simple confusion, and let me describe it here. let $D$ be an entire training corpus i have prepared to train a language model. a naive way to train a language model is to \[\max_{\theta} \sum_{x \in D} \log p_{\theta}(x).\] this whole process of learning can be thought of

## Gradient-based planning, mapping and execution

this post continues from the previous post <Gradient-based trajectory planning>, because i became even busier. in fact, i should work on my presentation slide for my talk at the University of Washington tomorrow (sorry, Yejin and Noah!), and probably because of that, i decided to push it a bit further. the main assumption i made in the previous post was that our bot has access to the entire map. this is a huge assumption that does not often hold in practice. instead, i decided to restrict the visibility of our bot. it will be able to see the obstacles in

## Gradient-based trajectory planning

this semester has been completely crazy for me, and i anticipate that this madness will only worsen over the next couple of months. of course, because of this crazy schedule, my brain started to revolt by growing a doubt inside me on how much i trust gradient descent. crazy, right? yes. i then succumbed to this temptation and looked for some simple example to test my trust in gradient descent. yes, i know that i should never doubt our lord Gradient Descent, but my belief is simply too weak. so, i decided to use gradient descent for simple trajectory planning