let’s start by stating the direct preference optimization (DPO) loss for each example $(x,y_+, y_-)$:

\[

\log \left( 1 + \exp \left(-\left(

\beta \log \frac{\pi(y_+)}{\pi(y_-)}

-\gamma \log \frac{\pi_0(y_+)}{\pi_0(y_-)}

\right) \right) \right).

\]

this takes a slightly different form from the original DPO loss. in the original DPO loss, $\gamma = \beta$ was forced, which leaves the scale (or entropy) of the reference model $\pi_0$ uncontrollable. this formulation above is more desirable, as it allows us to remove the effect of the scale of the reference model by tuning $\gamma$ appropriately.
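as a small sketch, the per-example loss above can be computed directly from the four log-probabilities (the helper name below is mine, not from any DPO implementation):

```python
import math

def dpo_loss(logp_pos, logp_neg, logp0_pos, logp0_neg, beta=1.0, gamma=1.0):
    """generalized DPO loss with a separate coefficient gamma on the reference.

    logp_*  are log pi(y) under the model being trained;
    logp0_* are log pi_0(y) under the reference model.
    """
    margin = beta * (logp_pos - logp_neg) - gamma * (logp0_pos - logp0_neg)
    # log(1 + exp(-margin)), computed in a numerically stable way
    return max(0.0, -margin) + math.log1p(math.exp(-abs(margin)))
```

setting $\gamma = \beta$ recovers the original DPO loss.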

take, as an example, two reference distributions $\pi_0$ and $\pi_0'$ that satisfy

\[

\pi_0'(y) \propto \pi_0(y)^\alpha,

\]

where $\alpha \geq 0$. the preference ordering is maintained between the two distributions, but the preference ratio may change dramatically, since

\[

\left(\frac{\pi_0(y_+)}{\pi_0(y_-)}\right)^\alpha =

\frac{\pi_0'(y_+)}{\pi_0'(y_-)}.

\]

because it is the relative ranking between $y_+$ and $y_-$ we are concerned with in DPO, we should arrive at more or less the same solution regardless of whether we use $\pi_0$ or $\pi_0'$ as a reference (or prior) distribution. without the extra hyperparameter $\gamma$, this is essentially impossible. we thus stick to the formulation above in the rest of this post.
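a quick numeric check of this point, with made-up log-probabilities: sharpening the reference by $\alpha$ can be exactly undone by rescaling $\gamma$ to $\gamma/\alpha$ (the normalization constant of $\pi_0'$ cancels in the ratio, so working with $\alpha \log \pi_0$ up to a constant is fine):

```python
import math

def dpo_loss(logp_pos, logp_neg, logp0_pos, logp0_neg, beta, gamma):
    m = beta * (logp_pos - logp_neg) - gamma * (logp0_pos - logp0_neg)
    return max(0.0, -m) + math.log1p(math.exp(-abs(m)))

# hypothetical log-probabilities for a single (y+, y-) pair
lp, ln = -2.0, -5.0        # under pi
lp0, ln0 = -4.0, -1.5      # under pi_0
alpha = 3.0                # pi_0' ∝ pi_0^alpha, so log pi_0' = alpha * log pi_0 + const

base = dpo_loss(lp, ln, lp0, ln0, beta=1.0, gamma=0.5)
# with the sharpened reference, gamma/alpha restores the same loss
rescaled = dpo_loss(lp, ln, alpha * lp0, alpha * ln0, beta=1.0, gamma=0.5 / alpha)
assert abs(base - rescaled) < 1e-12
```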

already in 2011, Collobert et al. told us that "*[i]t is therefore desirable to define alternative training criteria [to cross-entropy, a.k.a. log-loss]. We propose here to use a pairwise ranking approach.*" so, i will follow their lead (and also because it makes the analysis below easier) and create a hinge-loss variant of DPO:

\[

\max\left(0,

\gamma \log \frac{\pi_0(y_+)}{\pi_0(y_-)}

-\beta \log \pi(y_+) + \beta \log \pi(y_-)

\right).

\]

this version of the DPO loss is minimized when the following condition is satisfied:

\[

\begin{array}{l l}

&\frac{\gamma}{\beta} \log \frac{\pi_0(y_+)}{\pi_0(y_-)}

-\log \pi(y_+) + \log \pi(y_-) \leq 0

\\

\iff&

\log \pi(y_+) \geq \log \pi(y_-) + \frac{\gamma}{\beta} \log \frac{\pi_0(y_+)}{\pi_0(y_-)},

\end{array}

\]

where we assume $\beta > 0$.

in other words, the log probability assigned to $y_+$ should be greater than that assigned to $y_-$ with a margin of $\frac{\gamma}{\beta} \log \frac{\pi_0(y_+)}{\pi_0(y_-)}$. this margin can be written as

\[

\frac{\gamma}{\beta}

\left(\log \pi_0(y_+) - \log \pi_0(y_-)\right).

\]
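as a sketch, the hinge variant and its zero-loss condition look like this (again with my own hypothetical helper):

```python
def hinge_dpo_loss(logp_pos, logp_neg, logp0_pos, logp0_neg, beta=1.0, gamma=1.0):
    # max(0, gamma * log(pi0(y+)/pi0(y-)) - beta * log pi(y+) + beta * log pi(y-))
    return max(0.0, gamma * (logp0_pos - logp0_neg) - beta * (logp_pos - logp_neg))

# the loss is zero exactly when the margin condition holds:
# log pi(y+) >= log pi(y-) + (gamma / beta) * log(pi0(y+)/pi0(y-))
assert hinge_dpo_loss(-1.0, -2.0, -1.0, -2.0) == 0.0   # margin met exactly
assert hinge_dpo_loss(-2.0, -1.0, -1.0, -2.0) == 2.0   # margin violated by 2
```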

let $\gamma/\beta=1$, without loss of generality.

consider $(y_+, y_-)$ on which the reference (prior) model disagrees with the data, i.e. $\log \frac{\pi_0(y_+)}{\pi_0(y_-)} < 0$. the new model $\pi$ then does not need to ensure $\pi(y_+) > \pi(y_-)$. it only needs to ensure that $\log \pi(y_+) - \log \pi(y_-) \geq \log \pi_0(y_+) - \log \pi_0(y_-)$. in other words, as long as the new model puts ever so slightly more probability on $y_+$ than on $y_-$, relative to the reference model, it is all fine.

on the other hand, when the reference (prior) model is already correct, the new model must also ensure that it puts a higher probability on $y_+$ than on $y_-$ with the margin that matches the probability ratio between $y_+$ and $y_-$ under the prior model.

it is a usual practice, if not *the* practice, to initialize the new model $\pi$ with the reference model $\pi_0$. in this case, with $\gamma/\beta = 1$, there is almost no learning, since the (hinge-loss-based) DPO loss is already zero at initialization. even the original version of DPO starts at the same constant value ($\log 2$) for every pair, providing little signal to distinguish examples and resulting in a small amount of learning.
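a tiny numeric check of this: initializing $\pi$ at $\pi_0$ (identical log-probabilities) with $\gamma/\beta = 1$ zeroes the hinge loss, while the logistic loss sits at the same constant $\log 2$ for every pair (the pairs below are made up):

```python
import math

def hinge_dpo(lp, ln, lp0, ln0, beta=1.0, gamma=1.0):
    return max(0.0, gamma * (lp0 - ln0) - beta * (lp - ln))

def logistic_dpo(lp, ln, lp0, ln0, beta=1.0, gamma=1.0):
    m = beta * (lp - ln) - gamma * (lp0 - ln0)
    return max(0.0, -m) + math.log1p(math.exp(-abs(m)))

# initialize the new model at the reference: identical log-probabilities
for (lp0, ln0) in [(-1.0, -4.0), (-3.0, -0.5), (-2.0, -2.0)]:
    assert hinge_dpo(lp0, ln0, lp0, ln0) == 0.0
    assert abs(logistic_dpo(lp0, ln0, lp0, ln0) - math.log(2.0)) < 1e-12
```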

Let us dig more deeply into this below.

w.l.o.g., let us assume $\beta=1$.

if $\gamma = 0$, the new model must learn the preference purely from the data, as there is no (positive or negative) margin derived from the reference model. in this case, there is no constraint on how much the new model can deviate from the reference model, even if it started from the reference model. it will however learn the correct preference ranking.

as $\gamma$ increases toward $\infty$, two things start to happen. first, for a pair $(y_+, y_-)$ for which the prior model $\pi_0$ is incorrect, the new model $\pi$ does not need to get this pair correct either, as long as the log-probability assigned to $y_+$ is within some margin of the log-probability assigned to $y_-$. with a very large $\gamma$, learning simply does nothing with such a pair: it does not make the new model get the pair correct, and if the new model already gets it correct, nothing happens.

second, for a pair $(y_+, y_-)$ for which the prior model is correct, the new model needs to get this pair correct as well as the scaled prior model, $\pi_0^{\gamma/\beta}$, does. if the pair was incorrectly ranked under the new model, the new model will be updated to ensure the pair becomes correctly ranked. even if the pair was correct to begin with under the new model, learning will continue to increase the margin, i.e. $\log \pi(y_+) - \log \pi(y_-)$, until it matches at least that under the prior model.

overall, these observations tell us that a large $\gamma$ (given fixed $\beta$) effectively prevents most of the training pairs from contributing to learning. as shown in the table below, the current DPO formulation is heavily asymmetric in that learning is largely driven only by the training examples for which the reference model is already correct.

| $\gamma \gg 0$ | $\pi_0$ incorrect | $\pi_0$ correct |
| --- | --- | --- |
| $\pi$ incorrect | no learning (*) | learning |
| $\pi$ correct | no learning | learning |

this is in stark contrast to the case of $\gamma=0$, where learning happens as long as the new model $\pi$ is incorrect. this can be shown in the following table:

| $\gamma = 0$ | $\pi_0$ incorrect | $\pi_0$ correct |
| --- | --- | --- |
| $\pi$ incorrect | learning (*) | learning |
| $\pi$ correct | no learning | no learning (*) |

a usual practice, if not *the* practice, is to initialize the new policy $\pi$ with the reference policy $\pi_0$. in that case, these two models agree with each other on all training examples initially. in other words, we only need to focus on the diagonal of the tables above (marked with *).

a weird observation with $\gamma \gg 0$ is that learning largely happens only when the new model $\pi$ is already correct, while this is not the case with $\gamma = 0$. that is, DPO with $\gamma \gg 0$ effectively operates on a small subset of examples for which the reference model was correct, by increasing the margin on those correct examples. because it was already correct, there is less to learn, compared to $\gamma=0$, implicitly regularizing learning so that the new model effectively stays close to the original (prior) model. this is a weird way to regularize learning, compared to a more explicit approach, such as mixout (yes, yes, i wanted to plug it in ..)

i have a nagging feeling that DPO isn’t what people claim it is, and that its success is not actually due to DPO, or its motivation, but simply due to the combination of some luck, hyperparameter tuning (including early stopping) and stochasticity (yes, i consider stochasticity separate from luck.)

and … i just saw on twitter this past weekend that Meng et al. (2024) proposed a variant of DPO, called SimPO, that effectively sets $\gamma=0$ and introduces a constant margin (instead of the input-dependent margin in DPO.) i want to say great minds think alike, but i know that this team is of greater mind than i am, and that the margin-based ranking loss has been Jason Weston's favourite loss ever since the early 2000s; what a visionary!

\[

\mathcal{L}_{\mathrm{dpo}}(\theta) = \log \left(1 + \exp \left(- \log \frac{p_{\theta}(y|x)}{p_{0}(y|x)}

+ \log \frac{p_\theta(y'|x)}{p_{0}(y'|x)}\right)\right),

\]

where $p_0$ is the so-called reference model from which $y$ and $y'$ were drawn independently given $x$.

let's consider DPO with a fixed set $D$ of query-response triplets $(x, y, y')$ drawn from $p_0$. that is, $y, y' \sim p_0(\cdot | x)$. without loss of generality, i will always say that $y$ is preferred over $y'$. the overall loss is then:

\[

J_{0} (\theta) =

\mathbb{E}_{x} \mathbb{E}_{y, y' \sim p_0(\cdot | x)} \left[

\mathcal{L}_{\mathrm{dpo}}(\theta)

\right].

\]

what’s the issue here? the issue is that updating $\theta$ by minimizing this loss does not necessarily lead to $p_{\theta}(\cdot |x)$ from which we draw a good response. that is, there is no reason why $p_{\theta}(y|x) \gg p_{\theta}(\tilde{y}|x)$, where $y \in D$ and $\tilde{y}$ is an arbitrary sequence.

instead, a proper loss would be the following:

\[

J_{\mathrm{PDPO}} (\theta) =

\mathbb{E}_{x} \mathbb{E}_{y, y' \sim p_{\theta}(\cdot | x)} \left[

\mathcal{L}_{\mathrm{dpo}}(\theta)

\right] =

\sum_{x} p(x)

\sum_{y, y'} p_{\theta}(y|x) p_{\theta}(y'|x) \mathcal{L}_{\mathrm{dpo}}(\theta).

\]

the main difference is that we are not using a fixed set of triplets drawn from $p_0$, but use samples drawn from the latest model $p_{\theta}$. this makes perfect sense, since the responses we care about are those that we are more likely to draw from the trained model $p_{\theta}$. let's now look at the gradient of this proper loss $J_{\mathrm{PDPO}}$ with respect to $\theta$.

\[

\begin{array}{rl}

\nabla J_{\mathrm{PDPO}}

=&

\nabla \mathbb{E}_x

\mathbb{E}_{y, y' \sim p_{\theta}(\cdot|x)} \left[ \mathcal{L}_{\mathrm{dpo}}(\theta)\right]

\\

=&

\mathbb{E}_x

\mathbb{E}_{y, y' \sim p_{\theta}(\cdot|x)}

\left[

-\mathcal{L}_{\mathrm{dpo}}(y, y', x)

\nabla_{\theta} (\mathcal{L}_{\mathrm{NLL}}(y, x) + \mathcal{L}_{\mathrm{NLL}}(y', x))

+

\nabla_{\theta} \mathcal{L}_{\mathrm{dpo}}(y, y', x)

\right],

\end{array}

\]

where we use a couple of tricks for computing the derivative, such as $\nabla (a \cdot b) = (\nabla a) b + a (\nabla b)$ and the log-derivative trick ($\nabla a = a \nabla \log a$). we use $\mathcal{L}_{\mathrm{NLL}}(y, x)$ as a short-hand notation of $- \log p_{\theta}(y|x)$.
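the log-derivative trick can be checked numerically on a toy categorical model; here i hold the per-pair loss fixed (i.e., dropping the $\nabla_{\theta} \mathcal{L}_{\mathrm{dpo}}$ term) and compare the score-function gradient against central finite differences:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
K = 5
theta = rng.normal(size=K)
f = rng.normal(size=(K, K))          # an arbitrary, fixed per-pair loss

def J(th):
    p = softmax(th)
    return float(np.einsum('i,j,ij->', p, p, f))

# score-function gradient: E_{y,y'~p}[ f(y,y') (grad log p(y) + grad log p(y')) ]
p = softmax(theta)
grad_sf = np.zeros(K)
for y in range(K):
    for yp in range(K):
        g = -2.0 * p                  # grad_theta log p(y) = e_y - p for softmax
        g[y] += 1.0
        g[yp] += 1.0
        grad_sf += p[y] * p[yp] * f[y, yp] * g

# compare against central finite differences
eps = 1e-6
grad_fd = np.array([
    (J(theta + eps * np.eye(K)[k]) - J(theta - eps * np.eye(K)[k])) / (2 * eps)
    for k in range(K)
])
assert np.allclose(grad_sf, grad_fd, atol=1e-5)
```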

what is interesting is that we automatically end up with two types of loss functions. the first one is the usual DPO loss. the second one is a likelihood term on both desirable and undesirable responses, weighted by the DPO loss. the second one is extremely important, since it ensures that, after training, we are more likely to sample responses for which the first one (DPO) was optimized.

now, this proper DPO loss (perhaps i can call it PDPO, since i was told we must name every single math formula in an obscure way) is not easy to minimize, as we must be able to determine which of an arbitrary pair of responses $(y, y')$ given the query $x$ is more desirable. if the responses are molecular descriptions, we would need to synthesize them and experiment with them to tell which is better. in other words, this PDPO loss is more readily usable when we have a ready and cheap way to tell the preference.

we can instead use importance sampling with the fixed set $D$ of the preference triplets $(x, y, y')$:

\[

\nabla J_{\mathrm{PDPO}}^{\mathrm{IS}}(\theta)

\approx

\sum_{(x, y, y') \in D}

\frac{p_\theta(y|x)}{p_0(y|x)}

\frac{p_\theta(y'|x)}{p_0(y'|x)}

\left(

-\mathcal{L}_{\mathrm{dpo}}(y, y', x)

\nabla_{\theta} (\mathcal{L}_{\mathrm{NLL}}(y, x) + \mathcal{L}_{\mathrm{NLL}}(y', x))

+

\nabla_{\theta} \mathcal{L}_{\mathrm{dpo}}(y, y', x)

\right).

\]

the importance weights, $\frac{p_\theta(y|x)}{p_0(y|x)}\frac{p_\theta(y'|x)}{p_0(y'|x)}$, say that we should use the pre-collected preference triplets only if they are reasonably likely under the current model $p_{\theta}$. this makes sense, as we care about the examples that are more likely to be drawn from the current model. unfortunately, this approach is not ideal, since the quality of this gradient estimate degrades as $\theta$ drifts away from the reference model.
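a minimal sketch of these weights, with made-up log-probabilities:

```python
import math

def is_weight(logp_theta_y, logp0_y, logp_theta_yp, logp0_yp):
    # product of the per-response likelihood ratios p_theta / p_0
    return math.exp((logp_theta_y - logp0_y) + (logp_theta_yp - logp0_yp))

# right after initialization (p_theta == p_0), every triplet has weight 1
assert is_weight(-3.0, -3.0, -5.0, -5.0) == 1.0
# a triplet that has become unlikely under p_theta is heavily down-weighted
assert is_weight(-9.0, -3.0, -5.0, -5.0) < 1e-2
```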

so, what should we do? we should (1) draw triplets from the current model $p_{\theta}$ as frequently as possible based on the available resources and constraints and (2) use importance sampling $\nabla J_{\mathrm{PDPO}}^{\mathrm{IS}}(\theta)$ to update the parameters rather than the original gradient. unfortunately (1) is very difficult because it’s often really costly to measure the preference over two responses. (2) is also difficult, because the variance of this estimator will blow up quite rapidly as $\theta$ quickly evolves.

do i have any empirical evidence? unfortunately i have a dinner reservation and need to leave for the restaurant shortly.

**Acknowledgement**: i’d like to thank Richard Pang, who’s a graduating PhD student at NYU, for spending a couple of hours late afternoon on Friday to hear my rant and (then-incorrect) derivation. also, i thank Weizhe Yuan and Angie Chen for keeping me up-to-date on the mysteries and magics people perform each day finetuning language models, which serve as constant motivation for me to think about this problem.

let $D$ be an entire training corpus i have prepared to train a language model. a naive way to train a language model is to

\[

\max_{\theta} \sum_{x \in D} \log p_{\theta}(x).

\]

this whole process of learning can be thought of as compressing $D$ into $\theta$, so that, given a new instance $x'$ in the future, the language model can approximately go through $D$, check how similar this new instance is to the instances within $D$, and return a similarity score, i.e., $\log p_{\theta} (x')$.

let’s say i want to ensure that my language model is good at retrieval augmented generation. then, i would change the objective above into

\[

\max_{\theta} \sum_{x \in D} \mathbb{E}_{i \sim \mathrm{uniform}(1,|x|)} \log p_{\theta}(x_{i+1:|x|}|x_{1:i}, \mathrm{retrieval}(x_{1:i}, D)),

\]

where $|x|$ is the length of $x$ and $\mathrm{retrieval}(x_{1:i}, D)$ is the retrieval function that retrieves passages from $D$ that are similar to $x_{1:i}$. sounds good so far, right?

… not really, because there is a genuine ambiguity in whether the language model should use information from $D$ relevant to predicting $x_{i+1:|x|}$ given $x_{1:i}$ via its parameters $\theta$ or via $\mathrm{retrieval}(x_{1:i}, D)$. after all, $\theta$ is a compressed version of $D$ in a way that facilitates retrieval of relevant information by the language model. what would make the language model prefer to rely on the passages retrieved by $\mathrm{retrieval}$ rather than on its own (unknown) internal mechanism?

of course, we can fix this explicitly by splitting $D$ into a training set $D_{\mathrm{train}}$ and a retrieval set $D_{\mathrm{retrieval}}$, such that $D_{\mathrm{train}} \cup D_{\mathrm{retrieval}} = D$ and $D_{\mathrm{train}} \cap D_{\mathrm{retrieval}} = \emptyset$. we can then change the training objective above into

\[

\max_{\theta} \sum_{x \in D_{\mathrm{train}}} \mathbb{E}_{i \sim \mathrm{uniform}(1,|x|)} \log p_{\theta}(x_{i+1:|x|}|x_{1:i}, \mathrm{retrieval}(x_{1:i}, D_{\mathrm{retrieval}})),

\]

in this case, the language model must learn to rely on the retrieved passages, since the retrieved passages are not included in the training set and thereby not in $\theta$ (at least not their verbatim copies.)
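a minimal sketch of this objective, with a toy word-overlap retriever standing in for $\mathrm{retrieval}$ and a dummy scorer standing in for the language model (all names and data below are made up):

```python
import random

def retrieval(prefix, passages, k=1):
    # toy stand-in for retrieval(x_{1:i}, D_retrieval): rank by word overlap
    pw = set(prefix.split())
    return sorted(passages, key=lambda p: len(pw & set(p.split())), reverse=True)[:k]

def rag_nll(x, d_retrieval, logp_fn, rng):
    # one-sample Monte Carlo estimate of E_i[-log p(x_{i+1:|x|} | x_{1:i}, retrieved)]
    tokens = x.split()
    i = rng.randrange(1, len(tokens))  # i ~ uniform over split points
    prefix, target = ' '.join(tokens[:i]), ' '.join(tokens[i:])
    return -logp_fn(target, prefix, retrieval(prefix, d_retrieval))

# disjoint split: train on one half, retrieve from the other
corpus = ['the cat sat on the mat', 'dogs chase cats',
          'gradient descent rules', 'cats nap in the sun']
d_train, d_retrieval = corpus[:2], corpus[2:]
rng = random.Random(0)
# a dummy log-probability standing in for the language model
logp_fn = lambda target, prefix, passages: -0.1 * len(target.split())
losses = [rag_nll(x, d_retrieval, logp_fn, rng) for x in d_train]
assert all(l > 0 for l in losses)
```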

of course, this approach brings up perhaps even more difficult questions. first, how big should $D_{\mathrm{retrieval}}$ be? we might think this should be big enough, but of course, that implies that $D_{\mathrm{train}}$ is smaller. we know the importance of using a large corpus to train a language model, and it’s unclear how much compromise we can make on the size of the training set.

second, what if the retrieval function misses relevant information from $D_{\mathrm{retrieval}}$? when that happens, such relevant information is totally missed during training, effectively leading to the loss of information from the overall corpus $D$.

so, are you training your language models to excel at RAG? if so, how are you doing it?

p.s. this question came to my mind and was crystallized over Andrew Drozdov's dissertation defense that just finished (and yes, he successfully defended his dissertation!)

the main assumption i made in the previous post was that our bot has access to the entire map. this is a huge assumption that does not often hold in practice. instead, i decided to restrict the visibility of our bot. it will be able to see the obstacles only in its neighbourhood. furthermore, this visibility will be noisy. this noisy observation is implemented as:

\[

\tilde{m}_o(i,j) = \min(\max(m_c(i,j) m_o(i,j) + \epsilon_{i,j}, 0), 1),

\]

where $m_c(i,j)$ and $m_o(i,j)$ are the soft map of the current position of the bot and the true obstacle map, respectively. they were defined in the previous post. $\epsilon_{i,j}$ is the observational noise.

as an example, when the true map with the walls was like on the left panel, the bot, which is situated in the top left corner, would see the map on the right panel, below. this is with Gaussian noise of mean $0$ and standard deviation $0.001$. so, it can get a glimpse of the wall, but there’s quite a bit of noise.
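a sketch of this observation model (the 5x5 map, the bot's position and the smoothness $\beta$ below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_observation(m_c, m_o, noise_std=0.001):
    # \tilde m_o(i,j) = clip(m_c(i,j) * m_o(i,j) + eps_{ij}, 0, 1)
    eps = rng.normal(0.0, noise_std, size=m_o.shape)
    return np.clip(m_c * m_o + eps, 0.0, 1.0)

# a toy 5x5 map with one wall column; the bot sits in the top-left corner
m_o = np.zeros((5, 5)); m_o[:, 3] = 1.0
x, y = 0.0, 0.0
ii, jj = np.meshgrid(np.arange(5), np.arange(5), indexing='ij')
m_c = np.exp(-((x - ii) ** 2 + (y - jj) ** 2) / 2.0)
m_c /= m_c.sum()
obs = noisy_observation(m_c, m_o)
assert obs.shape == (5, 5) and obs.min() >= 0.0 and obs.max() <= 1.0
```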

for estimating the true map based on successive noisy observations, we will again rely on Our Lord gradient descent. although there must be a better loss function, i got lazy and decided to use the simplest possible version here:

\[

l_{\mathrm{map}}(\hat{m}_o, \tilde{m}_o) =

\sum_{i,j} | m_c(i,j) \cdot \mathrm{sigmoid}(\hat{m}_o(i,j)) - \tilde{m}_o(i,j) |,

\]

where $\hat{m}_o(i,j)$ is a logit. why do i estimate the logit instead of the actual $[0,1]$ value of each position? because i am a deep learner.
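a sketch of this loss and a few plain gradient-descent updates on a toy noiseless example (the manual (sub)gradient below is mine; the post relies on autograd instead):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_loss_and_grad(logits, m_c, obs):
    # l = sum_ij | m_c * sigmoid(logits) - obs |, with its (sub)gradient w.r.t. logits
    s = sigmoid(logits)
    resid = m_c * s - obs
    loss = np.abs(resid).sum()
    grad = np.sign(resid) * m_c * s * (1.0 - s)
    return loss, grad

# toy setup: noiseless observation of a 4x4 map, uniform position weights
m_o = np.zeros((4, 4)); m_o[1, 2] = 1.0
m_c = np.full((4, 4), 1.0 / 16)
obs = m_c * m_o
logits = np.zeros((4, 4))
losses = []
for _ in range(200):
    loss, grad = map_loss_and_grad(logits, m_c, obs)
    losses.append(loss)
    logits -= 5.0 * grad          # plain gradient descent
assert losses[-1] < losses[0]
```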

because the bot is restless and moves around, it’s not easy to interpret what it means to minimize this loss function. if we assume the bot is in a fixed location, we can however interpret this as estimating the underlying map from repeated observations by assuming that noise will cancel out across those observations. this is a good interpretation, but also reveals a weakness of this approach. that is, it will not work well with biased noise. i won’t address this here, but this shouldn’t be too difficult to address as long as we know the noise model (but how!?)

so, how well does it work? i’ve implemented a very simple random-walk bot and let it walk over the same map from the previous post. while taking this stroll, the bot kept updating its internal view of the map. as you can see below, the bot is able to figure out where the walls are, when those walls are near where it has been.

when the level of noise is higher ($0.01$), the estimated map is indeed noisier:

you can check the mapping code at https://github.com/kyunghyuncho/map_plan_backprop/blob/main/test_mapping.ipynb.

now we have the two components we need in order to endow our bot with the ability to navigate toward a given goal location when it does not have access to the map. given the current position of the bot and the goal position, the bot can repeat the following steps to reach the goal position in an unknown environment:

- observe the environment and update its map estimate using several gradient descent updates.
- plan the trajectory toward the goal given the estimated map so far using several gradient descent updates.
- take a small step toward the first position in the trajectory that is different from the current position.

the size of the small step in (3) above is determined by the actuator of the bot, and we can assume for now that it can move to any position up to 3 steps away along each axis. of course, the bot cannot move out of the map nor run over a wall. it will simply stop when it hits the wall, just like your Roomba at home does.

first, we want to know if this scheme works when there is no observation noise. although there is no noise, the observation is not full, in the sense that the bot has only limited visibility into the neighbourhood of its current position. the bot takes 20 Adam updates for both updating the map estimate and producing the trajectory after each move in this case. i ran it up to 500 steps but allowed the bot to terminate when it is within distance 2 from the goal position.

as you can see here, it works pretty much perfectly. near the end of the execution, the bot has a good estimate of the entire map, except for the top right corner which was very far away from the bot’s trajectory. because there was no noise, we see very crisp estimates of the walls.

when the noise level increases to $0.01$, we start to observe that the bot’s estimate of the map is quite rough, as below. near the bot’s trajectory, we can notice some of the true walls, but even slightly further away, the bot has pretty much no idea what the whole environment looks like. the third wall (the bottom right one) simply doesn’t show up at all in the bot’s map estimate. nevertheless, the bot successfully reached the goal, since the local estimate is all that matters in this case, just like how actor-critic algorithms in reinforcement learning do not often require us to have a global critic but only a local critic (and often just a linear approximation to it.)

you can reproduce the figures above and play around with our first autonomous bot almost entirely based on gradient descent at https://github.com/kyunghyuncho/map_plan_backprop/blob/main/test_move.ipynb.

although i wrote it as if everything can be done as gradient descent, that’s not true, as it becomes eventually necessary to discretize the outcome of gradient descent in order to interact with the world (i.e. the 2-D grid in our case.) for this, i had to add in a number of heuristics here and there to avoid various degenerate cases such as gradient descent taking a zero-discrete step. you can check all these heuristics (though, there are only a few) at https://github.com/kyunghyuncho/map_plan_backprop.

of course, you can now imagine what the next step would be. the next step should be for us to grant our bot an ability to localize its own location, which was assumed to be given so far. once we are done with localization, we would end up with a full gradient-based autonomous bot that can simultaneously localize, map and navigate in an unknown environment. this will however have to wait until my next decompression coding (what a beautiful phrase, Jakob!)

and, if you’re interested in building autonomous robots based on machine learning, don’t forget to check out *Probabilistic Robotics* by Sebastian Thrun.

so, i decided to use gradient descent for simple trajectory planning given a 2D map. for instance, the following map is a sample map i decided to use for this exercise. the starting position is (0,0), and the goal position is (19,10). the sky blue blocks denote obstacles (or walls). the goal is to find a reasonable trajectory from the starting position to the goal position while avoiding the obstacles. in order to make the problem simpler (and thereby my life simpler), i assume the whole map is available. furthermore, we assume that each position in the map is discrete, i.e., represented as a tuple of integers, which makes it slightly tricky to use gradient descent.

the obstacle map is represented as

\[

m_o(i,j) = \begin{cases}

1,& \text{ if there is an obstacle at }(i,j) \\

0,& \text{ otherwise}

\end{cases}

\]

the goal position is simply represented as $(x_g, y_g)$.

since the goal is to use gradient descent, we need to think about how to smooth the current position of a bot on the map, which allows us to compute the gradient of the loss function w.r.t. its current position. let $(x,y)$ be the current position of the bot. we then construct a _soft_ map showing the bot’s current position by setting the value of each coordinate $(i,j)$ as

\[

m_c(i,j) \propto \exp(-\frac{(x-i)^2 + (y-j)^2}{\beta}).

\]

we normalize $m_c(i,j)$ to sum to one. $\beta$ controls the smoothness of this map, and also tells us about our confidence in the bot’s position. this formulation allows us to compute the gradient of $m_c$ with respect to $(x,y)$. here’s a sample of the soft map of the current position $(0,0)$ with $\beta=0.1$ (for the purpose of better visualization):

now we initialize the trajectory of a fixed number of steps randomly by

\[

\mathcal{T}_x(t) \sim \mathcal{U}(0, x_{\max})\quad\text{and}\quad

\mathcal{T}_y(t) \sim \mathcal{U}(0, y_{\max}).

\]

an example of a random trajectory is shown here. not really great, is it? it’s all over the place, overlaps with the obstacles and also each transition is unrealistic for our bot to take.

how would we know whether the bot is too close to the obstacle? the soft map formula above turned out to be extremely handy in this case, as all we need to do is to compute the expected obstacleness of the map given the soft map of the bot’s current position, as in

\[

s(m_c) = \sum_{i,j} m_c(i,j) \times m_o(i,j).

\]

if the bot is near the obstacle, this score would be high. if the bot is far away from any obstacle, the score would converge toward $0$.
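both the soft position map and the obstacleness score above can be sketched in a few lines (map size and wall placement are made up):

```python
import numpy as np

def soft_position_map(x, y, shape, beta=1.0):
    # m_c(i,j) ∝ exp(-((x-i)^2 + (y-j)^2) / beta), normalized to sum to one
    ii, jj = np.meshgrid(np.arange(shape[0]), np.arange(shape[1]), indexing='ij')
    m = np.exp(-((x - ii) ** 2 + (y - jj) ** 2) / beta)
    return m / m.sum()

def obstacleness(m_c, m_o):
    # s(m_c) = sum_ij m_c(i,j) * m_o(i,j): expected obstacleness at the bot's position
    return float((m_c * m_o).sum())

# a wall along column 5 of a 10x10 map
m_o = np.zeros((10, 10)); m_o[:, 5] = 1.0
near = obstacleness(soft_position_map(4.0, 5.0, (10, 10)), m_o)
far = obstacleness(soft_position_map(0.0, 0.0, (10, 10)), m_o)
assert near > far
```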

now, we need to design a loss function to be minimized by gradient descent with respect to the bot’s positions in the trajectory. more specifically, the loss function needs at least the following four terms.

- the distance of the final position in the trajectory to the goal position $L_1(\mathcal{T}) = \| \mathcal{T}_{x,y}(T) - [x_g, y_g] \|$.
- the amount of collision with the obstacles $L_2(\mathcal{T}) = \sum_{t=1}^T s(m_{\mathcal{T}_{x,y}(t)})$.
- the distance of the initial position of the trajectory to the starting position $L_3(\mathcal{T}) = \| \mathcal{T}_{x,y}(1) - [x_s, y_s] \|$.
- the smoothness of the trajectory $L_4(\mathcal{T}) = \sum_{t=1}^{T-1} \| \mathcal{T}_{x,y}(t) - \mathcal{T}_{x,y}(t+1) \|$.

the first one is trivially understandable, since we want the planned trajectory to end up near the goal. the second one is also understandable, as we want to minimize the chance of colliding into the obstacles/walls along the planned trajectory. the third term is there to ensure that the trajectory starts from where the bot was placed.

the final one is there to ensure that the transition from one position to the next position within the trajectory is small enough so that our bot’s actuator can execute this transition without too much trouble. Without this term, the trajectory could very well simply jump from the starting point to the goal point in one step and call it a day.

the final loss w.r.t. the trajectory is then the weighted sum of these loss functions:

\[

L(\mathcal{T}) = \sum_{k=1}^4 \omega_k L_k(\mathcal{T}).

\]

because $L$ is differentiable w.r.t. all positions in $\mathcal{T}$, you can now use your favourite optimization algorithm (in my case, i stick to naive gradient descent here). you can also let your favourite software package compute the gradient $\nabla_{\mathcal{T}} L$ automatically for you (in my case, i use PyTorch.)
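as a sketch, the full loss can be written in plain numpy (the weights $\omega_k$ and the toy map below are made up; the post optimizes this with PyTorch autograd instead of computing gradients by hand):

```python
import numpy as np

def planning_loss(traj, m_o, start, goal, weights=(1.0, 10.0, 1.0, 0.1), beta=1.0):
    """weighted sum of the four terms above; traj is a (T, 2) array of positions."""
    w1, w2, w3, w4 = weights
    ii, jj = np.meshgrid(np.arange(m_o.shape[0]), np.arange(m_o.shape[1]), indexing='ij')
    l1 = np.linalg.norm(traj[-1] - np.asarray(goal))          # end near the goal
    l3 = np.linalg.norm(traj[0] - np.asarray(start))          # start at the bot
    l4 = np.linalg.norm(np.diff(traj, axis=0), axis=1).sum()  # short transitions
    l2 = 0.0                                                  # expected collisions
    for (x, y) in traj:
        m_c = np.exp(-((x - ii) ** 2 + (y - jj) ** 2) / beta)
        l2 += (m_c / m_c.sum() * m_o).sum()
    return w1 * l1 + w2 * l2 + w3 * l3 + w4 * l4

# sanity check on an obstacle-free map: a straight line beats a random trajectory
m_o = np.zeros((8, 8))
start, goal = (0.0, 0.0), (7.0, 7.0)
line = np.linspace(start, goal, 10)
rng = np.random.default_rng(1)
rand = rng.uniform(0.0, 7.0, size=(10, 2))
assert planning_loss(line, m_o, start, goal) < planning_loss(rand, m_o, start, goal)
```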

one final step we need after performing gradient descent is to discretize each estimated point in the final trajectory, as we assumed that the map is given to us as a discrete grid. in this particular case, i simply round each coordinate to the nearest integer. that is, for all $t=1,\ldots,T$,

\[

\mathcal{T}_x(t) \leftarrow \mathrm{round}(\mathcal{T}_x(t))

\quad\text{ and }\quad

\mathcal{T}_y(t) \leftarrow \mathrm{round}(\mathcal{T}_y(t))

\]

in this particular example, i’ve set the learning rate to $0.1$ and ran $1,000$ steps of gradient descent (probably unnecessarily many, though) on a trajectory of length 20. this results in the following optimized trajectory after rounding:

it slightly missed the goal by one, but the trajectory looks pretty reasonable. each transition is more or less about 2-3 blocks long, and the final point is within 1 block away from the goal. it also avoids all three walls perfectly.

at the end of the day, gradient descent has stood yet another test, and i must repent my sin of doubting the almighty gradient descent. it even works well for planning on a discretized 2d grid, although it was probably too simple a map for any algorithm to fail. indeed, i recall that one of the homework assignments in the course on Probabilistic Robotics by Prof. Kee Eung Kim at KAIST a long time ago was to build a planner in almost exactly the same environment, but using dynamic programming.

what can we do further if we want to test our trust in gradient descent further? we can introduce a potentially noisy differentiable model of the bot’s actuation and optimize a trajectory not of positions but of controls. we can also perform this planning repeatedly as the bot makes progress toward the goal to make it more realistic. finally, we can imagine extending it to a partially-observed environment, where we also run gradient-descent-based inference on missing parts of the map. many interesting problems … that are not new but have been studied extensively in the fields of robotics, control and machine learning …

you can find the code at the following Github repo. it includes both the main code and the notebook to reproduce all the figures above:

https://github.com/kyunghyuncho/map_plan_backprop/

and … i heard i must disclose these days what i have used to do any research/development. hence, here you go: i’ve used Visual Studio Code with Github CoPilot.

in this blog post, i will write out what kind of conditions i believe any watermarking technique should satisfy, in order for watermarking to be useful and effective.

let $x \in \mathcal{X}$ be the observation we want to watermark with a marker $m \in \mathcal{M}$. we will use $F: \mathcal{X} \times \mathcal{M} \to \mathcal{X}$ as a watermarking function. the first condition $F$ needs to satisfy, called *perceptual indistinguishability*, is

$$d_{\mathrm{perceptual}}(x, F(x, m)) \leq \delta,$$

which states that the perceptual difference between $x$ and its watermarked version $F(x, m)$ must be very small (smaller than $\delta$). that is, we shouldn’t be able to distinguish between the original and watermarked observations.

the second condition is *marker verifiability*. there must be a tractable way (i will get to why it needs to be tractable shortly when discussing the third condition) to tell whether a certain marker was applied. given a verification function $G: \mathcal{X} \times \mathcal{M} \to \left\{0, 1\right\}$, this can be written down as

$$\frac{\mathrm{Pr}(G(F(x,m), m)=1)}{\mathrm{Pr}(G(F(x,m), m)=0)} > 1$$

and

$$\frac{\mathrm{Pr}(G(F(x,m), m)=1)}{\mathrm{Pr}(G(F(x,m), m’)=1)} > 1~\forall m \neq m’.$$

that is, $G$ must be able to tell whether the watermark was applied and which watermark was applied as well.
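as a toy illustration of these first two conditions (not a serious watermarking scheme), consider embedding the marker into the least-significant bits of an 8-bit image; all names and data below are made up:

```python
import numpy as np

def F(x, m):
    # embed the marker bits into the least-significant bit of the first |m| pixels
    out = x.copy().ravel()
    out[:len(m)] = (out[:len(m)] & ~np.uint8(1)) | np.asarray(m, dtype=np.uint8)
    return out.reshape(x.shape)

def G(x, m):
    # verify: do the first |m| least-significant bits match the marker?
    return int(np.array_equal(x.ravel()[:len(m)] & 1, np.asarray(m, dtype=np.uint8)))

rng = np.random.default_rng(0)
x = rng.integers(0, 256, size=(8, 8), dtype=np.uint8)
m = [1, 0, 1, 1, 0, 1, 0, 0]
xw = F(x, m)
# perceptual indistinguishability: at most 1 intensity level changes per pixel
assert int(np.abs(xw.astype(int) - x.astype(int)).max()) <= 1
# marker verifiability: the correct marker is accepted, a different one rejected
assert G(xw, m) == 1 and G(xw, [0, 1, 0, 0, 1, 0, 1, 1]) == 0
```

this toy scheme, of course, fails the third condition below: zeroing out the least-significant bits removes the marker in linear time.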

the third condition, to which i will refer as *marker irreversibility*, implies that a watermarked version cannot easily be reverted to the original version. this is as important as the first two conditions, but is often overlooked, which makes many watermarking techniques pretty much obsolete. for instance, you have probably noticed that pirated TV shows often have the top 5% of each frame cut off; this is done to remove the TV station logo, which is sometimes used as a watermark to track whether a TV show from that particular station was pirated.

we can write this condition in the context of computational complexity. for instance, we want to ensure that the inverse watermarking function $F^{-1}: \mathcal{X} \times \mathcal{M} \to \mathcal{X}$ takes exponential time w.r.t. the sizes of the watermarked object $F(x, m) \in \mathcal{X}$ and the marker $m \in \mathcal{M}$, i.e., $O(e^{\max\{|F(x,m)|, |m|\}})$.

then, we must think a bit about what this complexity should be; is exponential complexity enough? an interesting observation here is that what is enough is relative to the second condition on *marker verifiability* above. if the computational complexities of verification $G$ and removal $F^{-1}$ were of the same order, e.g. both take linear time $O(\max\{|F(x,m)|, |m|\})$, watermarking is a bit of a moot point, since anyone who wants to break tracking by watermarking would simply remove the marker from the content before watching and forwarding it to others, while spending the same amount of computation as any verifier would.

in other words, this condition of *marker irreversibility* is defined w.r.t. the condition of *marker verifiability*. that is, the marker removal and verification must reside in different levels in the polynomial hierarchy, with the verification on a lower level.

in summary, there are three conditions that must be met by any reasonable watermarking technique:

- *perceptual indistinguishability*: a watermarked object must be (almost) perceptually indistinguishable from the original object.
- *marker verifiability*: we must be able to tractably verify that a given object was watermarked with a particular marker and not with another.
- *marker irreversibility*: it must be intractable for anyone to remove the marker from a watermarked object to obtain the original object.
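to make $F$ and $G$ concrete, here is a deliberately naive toy sketch (my own example, not a real scheme) that hides a marker id in the least-significant bits of an 8-bit image. it satisfies the first two conditions, but, just like the airline marker, fails marker irreversibility: anyone can zero out the LSBs in linear time.

```python
import numpy as np

N_BITS = 16  # marker space: integers below 2**16

def F(x, m):
    """embed marker m into image x via the least-significant bits
    of the first N_BITS pixels."""
    y = x.copy().ravel()
    for i in range(N_BITS):
        y[i] = (y[i] & 0xFE) | ((m >> i) & 1)  # overwrite the LSB with one marker bit
    return y.reshape(x.shape)

def G(y, m):
    """verification function: 1 iff marker m was embedded in y."""
    bits = y.ravel()[:N_BITS] & 1
    return int(sum(int(b) << i for i, b in enumerate(bits)) == m)

rng = np.random.default_rng(0)
x = rng.integers(0, 256, size=(8, 8), dtype=np.uint8)
wm = F(x, m=1234)

assert G(wm, 1234) == 1 and G(wm, 999) == 0                  # marker verifiability
assert np.max(np.abs(wm.astype(int) - x.astype(int))) <= 1   # (near-)indistinguishability
```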

up until this point, i have assumed the marker $m$ is openly available, as in the airline marker on movies on airplanes. this is the reason why we wanted to ensure that marker reversal was significantly more complex than marker verification. if we however can (it’s a big CAN) keep the marker $m$ secret, it becomes trivial to separate marker verification and reversibility in terms of computational complexity, as the marker space can be made arbitrarily large. this is however a bit unrealistic, and it may be that the marker will leak via multiple watermarked objects eventually.

based on these conditions, let’s examine the airline’s watermarking strategy:

- *perceptual indistinguishability*: largely satisfied **✓**. those airline markers show up only rarely throughout the whole show, so they don’t really bother me or most viewers.
- *marker verifiability*: largely satisfied **✓**. not sure how to implement it effectively off the top of my head, but it feels pretty straightforward to do so with rudimentary image processing tools.
- *marker irreversibility*: not satisfied **×**. it’s quite trivial to remove such a marker, especially with the recent advances in machine learning for image processing and generation; it can be done in almost linear time w.r.t. the length of the show.

so, it’s not really a great watermarking technology from the technical perspective, although the simplicity of such an approach is quite attractive from both business and maintenance perspectives. if i were a hollywood studio executive, i would ask for a stronger watermarking strategy, as any leak affects my studio much more than it affects the airline via which the show was leaked.

that said, *how does your novel watermarking algorithm fare?*

since then, a lot has changed. the then-raging pandemic is largely behind us. Prescient Design, which i co-founded at the very beginning of 2021, has been fully acquired by Genentech, and we have already spent two years as part of gRED. Putin started an unjustifiable and cruel war against Ukraine. Korea has a new president. my father retired after 30+ years as a professor of korean literature and language. i served as a program chair of both NeurIPS and ICML. i finally saw mountain gorillas in the wild. i have graduated 11 PhD students since June 2021. i am two years older than then, as is everyone else who was already born by then. i really can’t list all that happened, but so many infuriating, sad, happy and joyous moments have come and gone.

despite all these ups and downs, i want to make sure at least one thing about me stays constant; that is, i want to contribute however little i can to supporting education for the next generation of students by supporting higher-education institutes. to this cause, i’ve just made the following donations to my alma maters (though, it’s unclear whether a postdoc institute is considered an alma mater):

- Mila: \$20,000 USD
- to support AI4Good Lab, which is a 7-week program that equips women and other under-served populations in STEM with the skills to build ML projects.

- Aalto University: \$20,000 USD
- the first half to support Aalto Junior, which offers art, science, technology and entrepreneurship education for children, young people and teachers; i was given a tour of Aalto Junior earlier this year and could not have been more impressed by their programs and efforts.
- the other half to support various programs within Aalto at their discretion, including supporting the special course for rebuilding Ukraine hosted at the School of Arts, Design, and Architecture.

- KAIST: ₩15,000,000 KRW (approximately \$12,000 USD)
- to support building a scholarship at the School of Computing for students who are excluded from existing scholarship schemes due to various (often family and personal) reasons.

it’s much smaller than 2 years ago, and probably much smaller than what i should donate and can afford to. but i hope this compels me to donate more often in the future.

also, i’d like to ask my fellow alumni of these institutes and my colleagues to join me in supporting our colleagues, friends and family of the future for their education.

in their paper, Kostrikov et al. present the following loss function to estimate the $\tau$-th expectile of a random variable $X$:

$$\arg\min_{m_{\tau}} \mathbb{E}_{x \sim X}\left[ L_2^\tau (x - m_{\tau}) \right],$$

where $L_2^\tau(u) = | \tau - \mathbf{1}(u < 0) | u^2$ and $\tau \in (0.5, 1]$.

i couldn’t tell where this loss function comes from, and together with Daekyu i tried to reason our way toward it. to be frank, i had never heard of “expectile” as a term before this …

first, i decided to figure out the definition of “expectile” and found it inside the scipy.stats.expectile documentation. based on the documentation, the $\tau$-th expectile $m_{\tau}$ satisfies

$$\tau \mathbb{E}_{x \sim X} \left[ \max(0, x - m_\tau) \right] = (1-\tau) \mathbb{E}_{x \sim X} \left[ \max(0, m_\tau-x) \right].$$

now, let’s rewrite this equation a bit by first moving the right hand side to the left hand side:

$$\tau \mathbb{E}_{x \sim X} \left[ \max(0, x - m_\tau) \right] + (\tau - 1)\mathbb{E}_{x \sim X} \left[ \max(0, m_\tau-x) \right] = 0.$$

i love expectation (not expectile) because it is linear:

$$\mathbb{E}_{x \sim X} \left[ \tau \max(0, x - m_\tau) + (\tau - 1) \max(0, m_\tau-x) \right] = 0.$$

let’s use the indicator function $\mathbb{1}(a) = 1$ if $a$ is true and $0$ otherwise:

$$\mathbb{E}_{x \sim X} \left[ \mathbb{1}(x > m_{\tau}) \tau(x - m_\tau) - \mathbb{1}(x \leq m_{\tau}) (\tau - 1) (x-m_\tau) \right] = 0.$$

moving things around a bit, i end up with

$$\mathbb{E}_{x \sim X} \left[ \left(\mathbb{1}(x > m_{\tau}) \tau - \mathbb{1}(x \leq m_{\tau}) (\tau - 1)\right) (x-m_\tau) \right] = 0.$$

at this point, i can see that for this equation to hold, i need to make $m_\tau$ very very close to $x$ in expectation. being a proud deep learner, i naturally want to minimize $(x - m_\tau)^2$. but then, i notice that i don’t want to make $m_{\tau}$ equally close to $x$ across all $x$. rather, there is a weighting factor:

$$\mathbb{1}(x > m_{\tau}) \tau - \mathbb{1}(x \leq m_{\tau}) (\tau - 1)$$

if $x > m_{\tau}$, the weight is $\tau$. otherwise, it is $1 - \tau$, which is equal to $| \tau - 1|$ because $\tau \in [0, 1]$. also because of this condition, $\tau = |\tau|$. in other words, we can combine these two cases into:

$$| \tau - \mathbb{1}(x \leq m_{\tau})|.$$

finally, by multiplying the $L_2$ loss $(x – m_\tau)^2$ with this weighting coefficient, we end up with the loss function from Kostrikov et al. (2021):

$$\mathbb{E}_{x \sim X} \left[ | \tau - \mathbb{1}(x \leq m_{\tau})| (x - m_\tau)^2 \right].$$
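as a sanity check (entirely my own, not from the paper), the sketch below computes the $\tau$-th expectile two ways with numpy: by bisecting the balance equation from the scipy documentation, and by gradient descent on this weighted squared loss. the two should agree, and at $\tau = 0.5$ both should reduce to the plain mean.

```python
import numpy as np

def expectile_balance(x, tau, iters=100):
    """solve tau * E[max(0, x - m)] = (1 - tau) * E[max(0, m - x)] for m by bisection."""
    lo, hi = x.min(), x.max()
    for _ in range(iters):
        m = 0.5 * (lo + hi)
        # the residual is decreasing in m: positive means m is still too small
        r = tau * np.maximum(0, x - m).mean() - (1 - tau) * np.maximum(0, m - x).mean()
        if r > 0:
            lo = m
        else:
            hi = m
    return 0.5 * (lo + hi)

def expectile_loss_min(x, tau, lr=0.5, steps=2000):
    """minimize E[|tau - 1(x <= m)| (x - m)^2] over m by gradient descent."""
    m = x.mean()
    for _ in range(steps):
        w = np.abs(tau - (x <= m))           # the weighting coefficient above
        grad = (-2.0 * w * (x - m)).mean()   # d/dm of the loss, holding w fixed
        m -= lr * grad
    return m

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)

m1 = expectile_balance(x, 0.9)
m2 = expectile_loss_min(x, 0.9)
assert abs(m1 - m2) < 1e-3                                # both routes agree
assert abs(expectile_balance(x, 0.5) - x.mean()) < 1e-3   # tau = 0.5 recovers the mean
```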

ugh … why did i derive it myself without trusting our proud alumnus Ilya and decide to write a blog post …? waste of time … but it was fun.

i do not want to discuss any particular paper/tweet/blog, because this topic seems to attract a weird set of people arguing for weird things, when in fact there are just a couple of different views into a single phenomenon, which is only natural in science and engineering. that said, if anyone’s interested in this recent (non-)controversy, these two papers seem to be the ones to take a look: Wei et al. [2022 TMLR] and Schaeffer et al. [2023 arXiv].

in this blog post, let me instead define *emergence* in my own words so that i can point anyone to it when i end up talking about *emergence* with them. as the first step, here are three variables we must keep in mind:

- $x \in \mathbb{R}$: the quantity that we vary ourselves to study emergence. some examples are # of parameters given a particular parametrization scheme, # of data points sampled from a particular distribution, etc. these are all discrete quantities, but we can imagine these as points sampled from the real line.
- $z \in \mathcal{Z}$: the quantity that we can’t/don’t control or sometimes don’t even observe while varying $x$. some examples include a bit flip caused by a cosmic ray. we often want to marginalize this out.
- because we often can’t control nor observe $z$, we assume $z$ follows a distribution $p_Z$.

- $y \in \mathbb{R}$: the quantity that we observe given $x$ and $z$. some examples are accuracy (average 0-1 loss), average negative log-probability (tight upperbound to the average 0-1 loss), etc.

with these variables, i can think of the very first definition of emergence:

**Definition 1** [Weak subjective emergence of $y$]. Given $y = \mathbb{E}_z f(x, z)$, $\delta > 0$ and $\epsilon > 0$, there exists $x' \in \mathbb{R}$ such that $\left| \mathbb{E}_z \frac{\partial f}{\partial x}(x', z) \right| > \left| \mathbb{E}_z \frac{\partial f}{\partial x}(\tilde{x}, z) \right| + \delta$ for all $\left| \tilde{x} - x'\right| > \epsilon$.

in words, this definition says that emergence is defined as the existence of a point $x’$ at which the change in $y$ is greater than any other point $\tilde{x}$. this can be further strengthened to include all higher order derivatives instead of only the first order derivative, but let me just stop here for now.

to measure whether this *subjective emergence* happens in a neural net of a particular architecture w.r.t. the number of parameters, we can follow the steps below:

- given the number of parameters $x$, train the neural net multiple times while varying random seeds in order to account for $z$. let the average validation accuracy be $y(x)$.
- $f$ then corresponds to training a neural net and measuring its accuracy on a held-out validation set.

- repeat this while varying the number of parameters.
- find a pair of consecutive $x$’s between which the validation accuracy changes most; call the mid-point $x’$.
- if this validation accuracy change is greater than that of any other consecutive pair by a meaningful amount $\delta$, we call it *weak subjective emergence*.
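the recipe above can be simulated end-to-end with a toy stand-in for $f$. in the sketch below (entirely synthetic; the sigmoidal “accuracy” curve and all constants are my own assumptions), we average over seeds to account for $z$, locate the largest consecutive jump, and test it against every other jump:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """toy stand-in for f(x, z): a 'validation accuracy' with a sharp sigmoidal
    jump at x = 5.1 plus small seed-dependent noise. purely illustrative."""
    return 1.0 / (1.0 + np.exp(-8.0 * (x - 5.1))) + 0.01 * rng.normal()

xs = np.linspace(0.0, 10.0, 41)   # the quantity we vary, e.g. # of parameters
ys = np.array([np.mean([f(x) for _ in range(20)]) for x in xs])  # average over z

# find the consecutive pair with the largest change; the midpoint is x'
jumps = np.abs(np.diff(ys))
i = int(np.argmax(jumps))
x_prime = 0.5 * (xs[i] + xs[i + 1])

# "weak subjective emergence": the largest jump beats every other jump
# by a (subjectively chosen) margin delta
delta = 0.05
emerged = bool(jumps[i] > np.delete(jumps, i).max() + delta)
```

with this synthetic curve, `x_prime` lands near the built-in jump at 5.1 and `emerged` comes out true; the point is only to make the procedure, and the subjective choices of $f$ and $\delta$, explicit.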

this sounds reasonable, but it raises a lot of questions. some of those questions include:

- why is the particular choice of $f$ meaningful?
- why is the number of parameters a meaningful quantity to use? what if we use the number of bits after compressing all the parameters using e.g. gzip after each update? what makes the former more interesting than the latter?
- why is the accuracy a meaningful quantity to use? what if we use the margin loss since we care about the quality of decision boundary beyond mere accuracy? what makes the former more interesting than the latter?

- why is the particular resolution of $x$ and $y$ meaningful?
- how do we decide on the meaningful amount $\delta$?
- how do we decide on the neighbourhood size $\epsilon$?

there are a few more questions i had, such as whether marginalization of $z$ is desirable over taking a max or min over $z$, but they seem rather minor compared to the questions above. though, i must emphasize that we have to take $z$ into account one way or another; it feels very weird to look at only one particular configuration of $z$.

these questions naturally answer why i called this particular notion of emergence *subjective*; it is subjective because we leave the answers to these critical questions to the one who declares *emergence* of a property. in other words, one can use their *subjective* choices of $f$, $\delta$ and $\epsilon$. furthermore, this emergence is *weak* in that one merely needs to choose *one particular choice* of $f$, $\delta$ and $\epsilon$ to show that emergence happens.

can we then define a stronger version of subjective emergence? i believe we can, but this requires us to introduce a few more concepts:

- $T_x: \mathbb{R} \to \mathbb{R} \in \mathcal{T}_x$: this is a transformation that can be applied to $x$ to change e.g. its scale, magnitude, etc.
- one example of $\mathcal{T}_x$ is the set of all monotonic transformations on $x$, although we can imagine many other types of transformations.
- in the case of neural net training, another example is to simply enumerate all the things that change as the number of updates (or the number of parameters) changes. for instance, $T_x$ may map the number of updates to the $L_2$-norm of the parameters.

- $T_y: \mathbb{R} \to \mathbb{R} \in \mathcal{T}_y$: this is a transformation that can be applied to $y$ to change e.g. its scale, magnitude, etc.
- for instance, $T_y$ can map the average accuracy to the logit of the true class.

we can now define a stronger version of subjective emergence:

**Definition 2** [Strong subjective emergence of $y$]. For all $T_x \in \mathcal{T}_x$ and $T_y \in \mathcal{T}_y$, let $T_y(y) = \mathbb{E}_z f(T_x(x), z)$. Then, given $\delta_{T_x,T_y} > 0$ and $\epsilon_{T_x} > 0$, there exists $T_x(x') \in \mathbb{R}$ such that $\left| \mathbb{E}_z \frac{\partial f}{\partial T_x(x)}(T_x(x'), z) \right| > \left| \mathbb{E}_z \frac{\partial f}{\partial T_x(x)}(T_x(\tilde{x}), z) \right| + \delta_{T_x,T_y}$ for all $\left| T_x(\tilde{x}) - T_x(x')\right| > \epsilon_{T_x}$.

this is essentially identical to *weak subjective emergence* except that we now impose that emergence should hold over a set of possible transformations made to $x$ and $y$. that is, we cannot simply choose *one* particular choice of $x$ and $y$, observe emergence and declare that emergence happened. rather, we need to show that such emergence happens even if we transform $x$ and $y$ in many reasonable ways.

these two definitions collapse onto each other when $|\mathcal{T}_x|=1$ and $|\mathcal{T}_y|=1$; that is, if we only consider one particular combination of $x$ and $y$ without considering any other possible transformations of them.

this definition of emergence is still subjective, since it relies on the subjective choice of $\mathcal{T}_x$, $\mathcal{T}_y$, $\delta$ (for each combination $(T_x,T_y)$) and $\epsilon$ (again for each $(T_x,T_y)$). one may even say this is even more subjective, as we need to decide on more things here, including transformations of $x$ and $y$ as well as the tolerance and neighbourhood radius for each transformation combination. nevertheless, because the notion of emergence must hold over a larger set of how we define $x$ and $y$, i’d find emergence observed according to this definition to be stronger and much more interesting.

so, we want these transformation sets to be neither too narrow, in which case these two definitions collapse onto each other, nor too broad, in which case we would never observe strong emergence. what would be some possible transformation sets that fall in the middle (since the answer is almost always somewhere in the middle)?

in my view, a good choice of the transformation set (either $x$ or $y$) is a set of all (noisy) monotonic transformations. for instance, if we take $x$ to be the number of updates in neural net training, we should also consider the $L_2$-norm of the parameters, as it grows (almost) monotonically w.r.t. the number of updates. if the claimed weak emergence over the number of updates disappears when we transform it into the $L_2$-norm of the parameters, we can’t claim stronger emergence. in the case of $y$, an interesting transformation is the repeated application of $\log$. how many $\log$-transformations of $y$ does the claimed emergence withstand? this would give us a sense of the strength of observed emergence.
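as a tiny illustration of the kind of robustness check this paragraph suggests (a synthetic example of my own, not from any paper): a smooth exponential curve always has its largest consecutive jump at the far end of the range, which can masquerade as emergence, yet a single $\log$ transform of $y$ flattens it entirely.

```python
import numpy as np

xs = np.linspace(0.0, 10.0, 21)
ys = np.exp(xs)                    # a perfectly smooth exponential "metric"

# on the raw scale, the largest consecutive jump always sits at the far end,
# which can be mistaken for emergence ...
raw_jump = int(np.argmax(np.diff(ys)))
assert raw_jump == len(xs) - 2

# ... but a single log transform (T_y = log) flattens the curve completely:
log_jumps = np.diff(np.log(ys))
assert np.allclose(log_jumps, log_jumps[0])
```

a claimed emergence that disappears under one application of $\log$ is, by the second definition, weak at best.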

finally, can there be *objective emergence*? i believe so, although such emergence would be very narrow in that there is essentially no room for any choice or interpretation. for instance, earlier together with Laura Graesser and Douwe Kiela, we demonstrated that a symmetric pair-wise protocol only emerges among communicating agents if there are at least three agents (it’s a bit obvious, though.) in this case, this emergence is objective, in that there’s no other transformation to choose (i.e., the number of agents is just the number of agents, and the communication success is defined as 0-1 and no other way) nor any other definition of tolerance or neighbourhood. in other words, *objective emergence* would be identical to *subjective emergence* except that the problem setup is extremely constrained to the point that there is no room for subjective choice nor interpretation, which makes it less interesting in general.

that wraps up yet another post of my random thoughts that would never make it to papers. have a nice day!

**Acknowledgement**:

- Thank you, Prof. Ernest Davis, for pointing out that the emergence should be defined w.r.t. $y$. this comment has been reflected.
- Thanks to Daniel Paleka’s comment, i clarified in the second definition that $\delta$ and $\epsilon$ are dependent on the choice of transformations.

for instance, imagine training a face detector for your phone’s camera in order to determine which filter to apply (one optimized for portraits and the other for other types of pictures). if most of the training examples for building such a face detector were taken in bright daylight, one often says without hesitation that this face detector would work better on pictures taken in bright daylight than on pictures taken indoors. this sounds pretty reasonable *until* you start thinking of some simplified scenarios. and that thinking started for me a few years ago, which eventually led me to write this blog post.

so, let’s consider a very simple binary classification setup. let $D=\{(x_n, y_n)\}_{n=1}^N$ be the training set. $f(x)$ counts the occurrences of $x$ within $D$, that is,

$$f(x) = \sum_{n=1}^N I(d(x_n, x) \leq \epsilon),$$

where $d$ is a distance metric, $\epsilon$ is a similarity threshold, and $I$ is an indicator function. if we set $\epsilon=0$ and $d(a,b) = I(a \neq b)$, $f(x)$ literally counts the number of duplicates of $x$ within the training set.
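read concretely, $f$ is a one-liner. below is a tiny numpy sketch of the counting (un-normalized) version, which is the form the $k$NN discussion later in this post uses, with $d(a,b)=|a-b|$ as an assumed concrete choice of the metric:

```python
import numpy as np

def f(x, D_x, eps=0.0):
    """count the training inputs within eps of x (exact duplicates when eps = 0)."""
    return int(np.sum(np.abs(D_x - x) <= eps))

D_x = np.array([0.0, 0.0, 0.0, 1.0, 2.0])   # three duplicates of 0.0
assert f(0.0, D_x) == 3
assert f(1.0, D_x) == 1
```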

we assume that the training set is *separable*, which makes everything so much easier to imagine in our head and also reason through.

in this simple setup, what is really interesting (to me, at least) is that the number of duplicates $f(x_n)$ of any $x_n \in D$ does not affect a separating decision boundary. as soon as one of the duplicates is correctly classified (i.e., on the right side of the decision boundary), all the other duplicates are equally well classified and would not affect our choice of the decision boundary.

this is most clearly demonstrated by the perceptron learning rule which is defined as

$$w^n = \begin{cases}

w^{n-1}, &\text{if } y_n (w^{n-1} \cdot x_n) > 0 \\

w^{n-1} + y_n x_n, &\text{otherwise}.

\end{cases}$$

that is, the decision boundary defined by $w^{n-1}$ is only updated if $x_n$ is incorrectly classified, i.e., $y_n (w^{n-1} \cdot x_n) \leq 0$. once $x_n$ is correctly classified, all the subsequent duplicates of $x_n$ do not contribute to the decision boundary.
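this no-update-on-duplicates behaviour is easy to check in code. below is a minimal numpy sketch of the perceptron rule (with the conventional $y_n$ factor in the update): once an example sits on the right side of the boundary, presenting it again, however many times, changes nothing.

```python
import numpy as np

def perceptron_step(w, x, y):
    """one application of the perceptron rule: update only on a mistake."""
    if y * np.dot(w, x) > 0:       # correctly classified: boundary untouched
        return w
    return w + y * x               # mistake: move the boundary toward y * x

w = np.zeros(2)
x, y = np.array([1.0, 2.0]), 1

w = perceptron_step(w, x, y)       # first visit: y * (w . x) = 0, so w is updated
assert np.array_equal(w, np.array([1.0, 2.0]))

# x is now correctly classified; any number of duplicates is a no-op
for _ in range(100):
    assert np.array_equal(perceptron_step(w, x, y), w)
```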

another example is a max-margin classifier, such as a support vector machine. in this case, we can think of how the margin of a (separating) decision boundary is defined. the margin is the sum of the distances to the nearest correctly-classified examples from the two classes (positive and negative) respectively. in other words, the only examples that matter for determining the optimal decision boundary are the nearest correctly-classified ones (at least two; they are called *support vectors*), and all the other examples that are correctly classified and far from the decision boundary (recall the separability assumption) do not contribute to the optimal decision boundary. it really doesn’t matter whether there are many duplicate copies of any particular example: a group of duplicates either contributes to the margin as a single example would or does not contribute at all.

Then, does it mean that the existence of duplicates of each training example does not matter when it comes to learning a classifier? Or, better put, why do we think the existence of duplicates changes how our classifiers work?

every now and then, i stumble upon discussion on the difference between parametric and non-parametric methods. every time, i believe i have found an answer to this question that is explainable to my students and colleagues, but my belief in that answer quickly fades away, and i start to doubt myself as a computer scientist. the last episode was pretty recent, and you can find people’s responses and insightful answers at

it turned out that this seemingly naive and dumb question connects to this issue of whether/how duplicates of training examples impact classification. what do i mean by that?

instead of the perceptron and support vector machine above, which can be thought of as parametric approaches since their discovered decision boundaries are described *without* referring to the training examples, i.e., on their own, let us consider one of the simplest and perhaps most powerful non-parametric classifiers, whose decision boundary is a function of the training examples and whose complexity grows as we include more training examples: the $k$-nearest neighbour classifier ($k$NN).

given a new example $x$ that we want to classify using our $k$NN classifier, let $(x_n,y_n)$ be the nearest neighbour of $x$. given the number of duplicates $f(x_n)$ in the training set, we can now tell how many other neighbours are considered by this $k$NN; the answer is $k - f(x_n)$ (assuming $f(x_n) \leq k$). that is, the probability of this new example $x$ belonging to $y_n$ is written down as:

$$p(y=y_n| x) = \frac{\min(k, f(x_n))}{k} + \frac{1}{k} \sum_{(x',y') \in \mathcal{N}_k(x)} I(x' \neq x_n) I(y' = y_n),$$

where $\mathcal{N}_k(x)$ is the set of $k$ nearest neighbours of $x$. as $f(x_n)$ grows, the first term dominates, and the chance of classifying $x$ into $y_n$ grows with it. in other words, the more duplicates we have of $x_n$, the higher the probability of $y_n$; the region corresponding to $(x_n,y_n)$ grows as the number of its duplicates increases, which is precisely how a non-parametric classifier behaves.
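this duplicate effect is easy to see in a toy 1-d numpy sketch (my own illustrative example; the data and the stable tie-breaking are arbitrary choices): duplicating the nearest positive example crowds the $k$-neighbour set and raises $p(y = y_n \mid x)$.

```python
import numpy as np
from collections import Counter

def knn_prob(X, Y, x, k, label):
    """probability that x is assigned `label` by a k-nearest-neighbour vote."""
    idx = np.argsort(np.abs(X - x), kind="stable")[:k]   # 1-d toy: distance = |X - x|
    votes = Counter(int(Y[i]) for i in idx)
    return votes[label] / k

# one positive example near the query, surrounded by negatives
X = np.array([0.0, 1.0, 1.1, 1.2, 1.3, 1.4])
Y = np.array([1, -1, -1, -1, -1, -1])
p1 = knn_prob(X, Y, x=0.1, k=5, label=1)   # no duplicates: 1 of 5 neighbours

# add three duplicates of the positive example: they crowd the k-neighbour set
Xd = np.concatenate([X, np.zeros(3)])
Yd = np.concatenate([Y, np.ones(3, dtype=int)])
p2 = knn_prob(Xd, Yd, x=0.1, k=5, label=1)  # now 4 of 5 neighbours

assert p1 == 0.2 and p2 == 0.8 and p2 > p1
```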

so, what does this tell us? the impact of duplicates in the training set differs between parametric and non-parametric approaches. it is not only in classification, but also in generative modeling, since much of generative modeling can be thought of as supervised learning in disguise. if we are dealing with non-parametric methods, we probably want to take into account duplicates in the training set and either keep them as they are or de-duplicate them. this decision will have to be made for each problem separately. if we are working with parametric methods, we probably don’t need to worry about these duplicates beyond the computational concern.

how does this observation connect with the urban legend/myth on the impact of duplicates? i believe it simply tells us that the classifiers we use in practice are often non-parametric, including $k$NN, neural nets and random forests. in other words, it wasn’t really about whether duplicates matter, but about what is common practice in modern machine learning; that is, we use non-parametric classifiers.

there’s nothing serious nor insightful here, but i enjoyed this thought experiment!
