i’ve been thinking about mixup quite a bit over the past few years, since it was proposed in [1710.09412] mixup: Beyond Empirical Risk Minimization (arxiv.org). what a fascinatingly simple and yet intuitively correct idea! we want our model to behave linearly between any pair of training examples, which helps it generalize better to an unseen example that is likely to lie close to an interpolated point between some pair of training examples.

if we consider the case of regression (oh i hate this name “regression” so much..) we can write this down as maximizing the (gaussian) log-likelihood

$$

-\frac{1}{2} \| \alpha y + (1-\alpha) y' - G(F(\alpha x + (1-\alpha) x'))\|^2,

$$

where \((x,y)\) and \((x',y')\) are two training examples, and \(\alpha \in [0, 1]\) is a mixing ratio. nothing more to explain beyond simply looking at this objective: we want our regressor \(G \circ F\) to interpolate linearly between any pair \((x,x')\).
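as a concrete toy sketch (my own illustration, not the paper’s code), the input-mixup objective for scalar \(x\) and \(y\) looks like this; the particular `model` is a hypothetical stand-in for the full regressor \(G \circ F\), and i write it as a squared error to be minimized, i.e. the negative of the expression above:

```python
# a minimal toy sketch of the input-mixup objective for scalar x and y;
# `model` is a hypothetical stand-in for G∘F, not anything from the paper.
def model(x):
    return 2.0 * x + 1.0  # hypothetical regressor G(F(x))

def mixup_loss(x, y, x2, y2, alpha):
    # interpolate the inputs and the targets with the same ratio, then
    # measure the squared error of the model on the mixed input
    x_mix = alpha * x + (1.0 - alpha) * x2
    y_mix = alpha * y + (1.0 - alpha) * y2
    return 0.5 * (y_mix - model(x_mix)) ** 2
```

since this particular `model` happens to be linear, the mixed loss vanishes whenever both original examples are fit exactly; with a nonlinear model, a mixed pair can still incur a loss even when both endpoints are fit, which is exactly the pressure towards linear behavior.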

Manifold mixup ([1806.05236] Manifold Mixup: Better Representations by Interpolating Hidden States (arxiv.org)) followed up on the original mixup by proposing to interpolate in the hidden space (after \(F\) above), similarly to my own work with Jake Zhao on interpolating the hidden representations of retrieved examples (see here). although there are quite a few details in manifold mixup, such as randomly selecting the layer at which mixing is done, let me ignore those and consider just the key objective:

$$

L^{\mathrm{mmix}} = -\frac{1}{2} \| \alpha y + (1-\alpha) y' - G(\alpha F(x) + (1-\alpha) F(x'))\|^2.

$$
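with a scalar hidden state and a linear readout (the same simplification i use for the gradient below), this objective can be sketched as follows; the concrete choices of \(F\) and \(G\) are assumptions for illustration only, again written as a squared error to be minimized:

```python
import math

# a toy sketch of the manifold-mixup objective with a scalar hidden state,
# an arbitrary nonlinear F, and a linear readout G (all hypothetical).
def F(x):
    return math.tanh(x)  # hypothetical nonlinear feature extractor

def G(h):
    return 3.0 * h       # hypothetical linear readout

def manifold_mixup_loss(x, y, x2, y2, alpha):
    # interpolate in the space induced by F, not in the input space
    h_mix = alpha * F(x) + (1.0 - alpha) * F(x2)
    y_mix = alpha * y + (1.0 - alpha) * y2
    return 0.5 * (y_mix - G(h_mix)) ** 2
```

because G is linear, the mixed residual decomposes into \(\alpha\) times the residual on \((x,y)\) plus \(1-\alpha\) times the residual on \((x',y')\), which is the identity used in the derivation below.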

a natural inclination is to think of this as ensuring that \(G\) interpolates linearly between any two points in the space induced by \(F\). that is probably what the authors of manifold mixup meant by saying that “*Manifold Mixup Flattens Representations*”, although their theory (§3.1) doesn’t seem to have anything to do with this phenomenon of flattening. their theory is largely about universal approximation (which doesn’t tell us much about linear interpolation) and about classes eventually becoming linearly separable (which, again, doesn’t tell us much about linear interpolation.)

one thing that’s emphasized in the manifold mixup paper is that it “*backpropagates gradients through the earlier parts of the network*” (i.e., \(F\) above). totally understandable to any deep learner, since the motto we live and die by is end-to-end learning. but if \(F\) changes, the space over which \(G\) linearly interpolates changes with it; \(G\) can appear to interpolate linearly in the space induced by \(F\) simply because \(F\) adapted, rather than \(G\). furthermore, what linear interpolation between two hidden representations corresponds to in the input space can change dramatically as the nonlinear \(F\) changes. so… confusing…

let’s look at the gradient of this objective w.r.t. \(F\) ourselves, assuming for simplicity that \(y \in \mathbb{R}\) and \(G\) is a linear function (similar to sentMixup in [1905.08941] Augmenting Data with Mixup for Sentence Classification: An Empirical Study (arxiv.org)).

$$

\frac{\partial L^{\mathrm{mmix}}}{\partial F} =

(\alpha y + (1-\alpha) y' - G(\alpha F(x) + (1-\alpha) F(x')))

\frac{\partial G}{\partial Z}

\frac{\partial Z}{\partial F},

$$

where $Z = \alpha F(x) + (1-\alpha) F(x')$ and

$$

\frac{\partial Z}{\partial F} = \alpha F'(x) + (1-\alpha) F'(x').

$$

because $G$ is linear,

$$

(\alpha y + (1-\alpha) y' - G(\alpha F(x) + (1-\alpha) F(x'))) =

(\alpha y + (1-\alpha) y' - \alpha G(F(x)) - (1-\alpha) G(F(x'))).

$$

combining all these together,

$$

\frac{\partial L^{\mathrm{mmix}}}{\partial F} =

\left(

\alpha (y-G(F(x))) + (1-\alpha) (y' - G(F(x')))

\right)

\left(

\alpha \frac{\partial G}{\partial F}(x) +

(1-\alpha) \frac{\partial G}{\partial F}(x')

\right).

$$

what you notice here is that there are essentially four terms after expanding this multiplication. two terms are usual gradients we get from making $G \circ F$ predict $y$ given $x$ and $y’$ given $x’$, just like any regression:

- $\alpha^2 (y-G(F(x))) \frac{\partial G}{\partial F}(x)$
- $(1-\alpha)^2 (y'-G(F(x'))) \frac{\partial G}{\partial F}(x')$

the other two terms are quite unusual:

- $\alpha(1-\alpha) (y-G(F(x))) \frac{\partial G}{\partial F}(x')$
- $\alpha(1-\alpha) (y'-G(F(x'))) \frac{\partial G}{\partial F}(x)$

in other words, the direction and scale of the update of $F$ at $x$ are determined by the regression error for $x'$ (!), and the update at $x'$ by the error for $x$ (!).
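these four terms are easy to sanity-check numerically. below is a toy illustration of my own (every concrete choice, $F(x) = \tanh(w x)$ with a scalar parameter $w$, $G(h) = v h$, and all the constants, is an assumption, not from the paper): summing the four terms should reproduce a finite-difference gradient of $L^{\mathrm{mmix}}$ w.r.t. $w$.

```python
import math

# numeric sanity check of the four-term expansion: F(x) = tanh(w x) with a
# single scalar parameter w, and a fixed linear readout G(h) = v h, so that
# the loose dG/dF(x) notation in the text becomes v * dF(x)/dw here.
w, v = 0.7, 3.0
x, y, x2, y2, alpha = 0.5, 1.0, -1.2, 2.0, 0.3

def F(w_, x_):
    return math.tanh(w_ * x_)

def objective(w_):
    # the manifold-mixup objective -1/2 (t - p)^2, as a function of w
    h_mix = alpha * F(w_, x) + (1.0 - alpha) * F(w_, x2)
    t = alpha * y + (1.0 - alpha) * y2
    return -0.5 * (t - v * h_mix) ** 2

def dF_dw(x_):
    # d tanh(w x)/dw = (1 - tanh(w x)^2) x
    return (1.0 - math.tanh(w * x_) ** 2) * x_

# residuals on the *original* examples and the per-example sensitivities
r1, r2 = y - v * F(w, x), y2 - v * F(w, x2)
g1, g2 = v * dF_dw(x), v * dF_dw(x2)

# the two usual terms plus the two unusual cross terms, as listed above
grad_terms = (alpha ** 2 * r1 * g1
              + (1 - alpha) ** 2 * r2 * g2
              + alpha * (1 - alpha) * r1 * g2
              + alpha * (1 - alpha) * r2 * g1)

# finite-difference gradient of the objective for comparison
eps = 1e-6
grad_fd = (objective(w + eps) - objective(w - eps)) / (2.0 * eps)
```

the cross terms are clearly visible in `grad_terms`: each pairs the residual of one example with the sensitivity of the other.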

one could think of these two terms as the ones that flatten the representation space induced by $F$, but notice that the regression error terms are shared between the two usual terms and the two unusual terms. in other words, the gradient is zero as soon as regression on the original pairs $(x,y)$ and $(x',y')$ is solved, regardless of how *flattened* the space induced by $F$ is.
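this observation can be checked directly (again a toy illustration of my own, with every concrete choice an assumption): construct the targets so that both original examples are fit exactly, and the manifold-mixup gradient w.r.t. $F$'s parameter vanishes for every $\alpha$, no matter how $F$ curves between the two points.

```python
import math

# once G∘F predicts both original examples exactly, the manifold-mixup
# gradient w.r.t. F's parameter is zero for any mixing ratio alpha.
w, v = 1.0, 2.0
x1, x2 = 0.5, -1.0

def F(x):
    return math.tanh(w * x)  # hypothetical nonlinear feature extractor

def G(h):
    return v * h             # linear readout

# construct targets so that both original examples are fit exactly
y1, y2 = G(F(x1)), G(F(x2))

def grad_w(alpha):
    # analytic d/dw of the objective -1/2 (t - G(h_mix))^2 for this model
    h_mix = alpha * F(x1) + (1.0 - alpha) * F(x2)
    t = alpha * y1 + (1.0 - alpha) * y2
    dh_dw = (alpha * (1.0 - F(x1) ** 2) * x1
             + (1.0 - alpha) * (1.0 - F(x2) ** 2) * x2)
    return (t - G(h_mix)) * v * dh_dw
```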

this is unlike the original mixup (or input mixup), where the contributions of $x$ and $x'$ cannot be separated anywhere in the network ($G \circ F$). in manifold mixup, because the contributions of $x$ and $x'$ can be separated out at the level of $F$ (though not at $G$), there is room for $F$ to make linear interpolation pretty much meaningless.

in fact, this may be what the authors already pointed out in the theory of manifold mixup: “*In the more general case with larger $\mathrm{dim} (H)$, the majority of directions in H-space will be empty in the class-conditional manifold.*” there is no meaningful interpolation between these class-conditional manifolds, because a majority of the directions that would otherwise connect them are empty (pretty much meaningless from $G$’s perspective.)

another way to put it is that the feature extractor $F$ can easily give up on inducing a space in which one can meaningfully interpolate between any pair of training examples, since it stops changing as long as the model $G \circ F$ predicts the original training examples well. in other words, there is no reason why $F$ should induce a space over which $G$ linearly interpolates in a meaningful way.

this leaves us with a BIG mystery: why does manifold mixup work well? it worked well for the authors of the original manifold mixup paper, and since then, various authors have reported that it works well (see, e.g., sentMixup as well as TMix). what do those two unusual terms in the gradient do to make the final model generalize better?

until this mystery is resolved, my suggestion is to stick to a much more explicit way of ensuring the representation is *flattened*: ensure that small changes in the input space indeed map to small changes in the representation space. this can be done by, e.g., making the representation predictive of the input (see, e.g., [1306.3874] Classifying and Visualizing Motion Capture Sequences using Deep Neural Networks (arxiv.org), http://machinelearning.org/archive/icml2008/papers/601.pdf, [1207.4404] Better Mixing via Deep Representations (arxiv.org), etc.) or explicitly making the representation linear using some auxiliary information such as time (see, e.g., [1506.03011] Learning to Linearize Under Uncertainty (arxiv.org)). of course, i need to plug my latest work on learning to interpolate in an unsupervised way as well: [2112.13969] LINDA: Unsupervised Learning to Interpolate in Natural Language Processing (arxiv.org).
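for concreteness, here is a hypothetical sketch of the first option, and only a sketch: the tiny linear decoder `D`, the weighting `lam`, and all other names are my own illustration, not taken from any of the cited papers. the idea is simply to attach a decoder that must reconstruct the input from the representation, and to add its error to the regression loss.

```python
import math

# a hypothetical sketch of "make the representation predictive of the
# input": a reconstruction term keeps F from collapsing distinct inputs.
def F(x, w):
    return math.tanh(w * x)          # encoder / feature extractor

def G(h, v):
    return v * h                     # linear predictor

def D(h, u):
    return u * h                     # hypothetical linear decoder back to x

def total_loss(x, y, w, v, u, lam=0.1):
    h = F(x, w)
    pred_loss = 0.5 * (y - G(h, v)) ** 2    # usual regression loss
    recon_loss = 0.5 * (x - D(h, u)) ** 2   # representation must retain x
    return pred_loss + lam * recon_loss
```

minimizing the reconstruction term discourages $F$ from mapping distinct inputs to the same representation, which is one explicit way to keep the induced space meaningful for interpolation.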