my inbox started to over-flow with emails that urgently require my attention, and my TODO list (which doesn’t exist outside my own brain) started to randomly remove entries to avoid overflowing. of course, this is perfect time for me to think of some random stuff.

This time, this random stuff is contrastive learning. my thought on this stuff was sparked by Lerrel Pinto’s message on #random in our group’s Slack responding to the question “*What is wrong with contrastive learning?*” thrown by Andrew Gordon Wilson. Lerrel said,

Lerrel Pinto (2021)

My understanding is that getting negatives for contrastive learning is difficult.

## Restricted Boltzmann Machines

i haven’t worked on the (post-)modern version of contrastive learning, but every time i hear of “*negative samples*” i am reminded me of my phd years. during my phd years, i’ve mainly worked on a restricted Boltzmann machine which defines a distribution over the observation space as

$$p(x; W, b, c) \propto \exp(x^\top b) \prod_{j=1}^J (1+\exp(x^\top w_{\cdot, j} + c_j)),$$

where $W$, $b$ and $c$ are the weight matrix, visible bias and hidden bias. for simplicity, i’ll assume the visible bias is $0$, which is equivalent to saying that the input is on expectation an all-zero vector. This makes the definition above a bit simpler, and especially so when we look at the log-probability:

$$\log p(x; W, c) = \sum_{j=1}^J \log (1+\exp(x^\top w_{\cdot, j} + c_j)) – \log Z,$$

where $\log Z$ is the log-partition function or log-normalization constant.

the goal of learning with a restricted Boltzmann machine is then to maximize the log-probabilities of the observations (training examples):

$$\max_{W, c} \mathbb{E}_{x \sim D} [\log p(x; W, c)],$$

using stochastic gradient descent with the stochastic gradient derived to be

$$g_{\theta} = \sum_{j=1}^J \nabla_\theta \log (1+\exp(x^\top w_{\cdot,j} + c_j)) – \mathbb{E}_{x_- \sim p(x; W,c)} [\sum_{j=1}^J \nabla_\theta \log (1+\exp({x_-}^\top w_{\cdot,j} + c_j)].$$

the first term ensures that each hidden unit (or expert) $j$ is well aligned with the correct observation $x$ drawn from the data distribution (or training set.) not too surprising, since the alignment (dot product) between the expert weight $w_{\cdot, j}$ and a given observation gives rise to the probability of $x$.

the second term corresponds to computing the expected negative energy (ugh, i hate this discrepancy; we maximize the probability but we minimize the energy) over all possible observations according to the model distribution. what this term does is to look for all input configurations $x_-$ that are good under our current model and to make sure the hidden units (or experts) are not well aligned with them.

you can imagine this as playing whac-a-mole. we try to pull out our favourite moles, while we “whac” any mole that’s favoured by the whac-a-mole machine.

in training a restricted boltzmann machine, the major difficulty lies with how to efficiently and effectively draw negative samples from the model distribution. a lot of bright minds at the University of Toronto and University of Montreal back then (somewhere between 2006 and 2013) spent years on figuring this out. unfortunately, we (as the field) have never got it to work well, which is probably not surprising since we’re talking about sampling from an unnormalized (often discrete) distribution over hundreds if not thousands of dimensions. if it were easy, we would’ve solved most of problems in ML already.

## Data augmentation creates a restricted Boltzmann machine

let’s consider a stochastic transformation $T: \mathcal{X} \to \mathcal{X}$, where $\mathcal{X}$ is the input space. given any input $x \in \mathcal{X}$, this transformation outputs $\tilde{x} \sim T$ that highly likely maintains the same semantics as the original $x$. this is often used for data augmentation which has been found to be a critical component of contrastive learning (or as a matter of fact any so-called self-supervised learning algorithms).

imagine a widely used set of input transformations in e.g. computer vision. $T$ would include (limited) translation, (limited) rotation, (limited) color distortion, (limited) elastic distortion, etc. we know these transformations often in advance, and these are often domain/problem-specific.

what we will now do is to create a very large set of hidden units (or experts) by drawing transformed inputs from the stochastic transformation $T$ for one particular input $x$. that is, we have $J$-many $\tilde{x}_j \sim T(x)$. in the case of computer vision, we’ll have $J$-many possible distortions of $x$ that largely maintain the semantics of $x$.

these hidden units then define a restricted Boltzmann machine and allow us to compute the probability of any input $x’$:

$$\log p(x’ | \tilde{x}_1, \ldots, \tilde{x}_J) = \sum_{j=1}^J \log (1+\exp(s(x’,\tilde{x}_j))) – \log Z,$$

where i’m now using a compatibility function $s: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ instead of the dot-product for more generality.

starting from here, we’ll make two changes (one relaxation and one restriction). first, we don’t want to only use $J$ many transformed copies of the input $x$. we want to in fact use all possible transformed versions of $x$ out of $T$. in other words, we want to relax our construction that this restricted Boltzmann machine has a finite number of hidden units. this turns the equation above to be:

$$\log p(x’ | x, T) = \mathbb{E}_{\tilde{x} \sim T(x)}\left[ \log (1+\exp(s(x’,\tilde{x})))\right] – \log Z.$$

second, we will assume that the input space $\mathcal{X}$ coincides with the training set $D$ which has a finite number of training examples, i.e., $D=\left\{ x_1, \ldots, x_N \right\}$. this second change only affects the second term (the log-partition function) only:

$$\log p(x’ | T(x)) = \mathbb{E}_{\tilde{x} \sim T(x)}\left[ \log (1+\exp(s(x’,\tilde{x})))\right] – \log \frac{1}{N} \sum_{n=1}^N \mathbb{E}_{\tilde{x} \sim T(x)}\left[ \log (1+\exp(s(x_n,\tilde{x})))\right].$$

to summarize what we’ve done so far: we build one restricted Boltzmann machine for a given input $x \in \mathcal{X}$ by drawing the hidden units (or experts) from the transformation distribution $\tilde{x} \sim T(x)$. the support of this restricted Boltzmann machine is restricted (pun intended) to be a training set.

## Contrastive learning trains *N* restricted Boltzmann machines

what would be a good training criterion for one such restricted Boltzmann machine? the answer is almost always maximum likelihood! in this particular case, we want to ensure that the original example $x$ is most likely under the restricted Boltzmann machine induced by itself:

$$\max_{\theta} \log p(x | T(x)),$$

where $\theta$ is the parameters for defining the compatibility function $s$ from above.

we do so for all $N$ restricted Boltzmann machines induced from $N$ training examples:

$$\max_{\theta} \frac{1}{N} \sum_{n=1}^N \log p(x_n | T(x_n)).$$

since it’s decomposed over the training examples, let’s consider only one example $x \in D$. we then train the induced restricted Boltzmann machine with stochastic gradient descent, following

$$\frac{1}{M} \sum_{m=1}^M \nabla_{\theta} \log (1+\exp(s(x, \tilde{x}_m; \theta))) – \frac{1}{M} \sum_{m=1}^M \sum_{n=1}^N p(x_n|T(x)) \nabla_{\theta} \log (1+\exp(s(x_n, \tilde{x}_m; \theta))),$$

where we use $M$ transformed copies to approximate the two expectations over $T(x)$ but not $p(x_n|T(x))$. we probably should use another set of $M$ transformed copies to get the unbiased estimate.

this does look quite similar to more recently popular variants of contrastive learning. we start from a training example $x$, generate a transformed version $\tilde{x}$, maximize the compatibility between $x$ and $\tilde{x}$, and minimize the compatibility between $\tilde{x}$ and all the training examples (including $x$). there are minor differences, such as the choice of nonlinearity, but at the high level, it turned out we can derive contrastive learning from the restricted Boltzmann machine.

perhaps the only major difference is that this formulation gives us a clear guideline on how we should pick the negative examples. that is, according to this formula, we should either use all the training examples weighted according to how likely they are under this $x$-induced restricted Boltzmann machine or use a subset of training examples drawn according to the $x$-induced restricted Boltzmann machine without further weighting. of course, another alternative is to use uniformly-selected training examples as negative samples but weight them according to their probabilities under the $x$-induced restricted Boltzmann machine, *à la* importance sampling.

so, yes, contrastive learning can be derived from restricted Boltzmann machines, and this is advantageous, because this tells us how we should pick negative examples. in fact, as i was writing this blog post (and an earlier internal slack message,) i was reminded of a recent workshop i’ve attended together with Yoshua Bengio. there was a talk on how to choose *hard* negative samples for contrastive learning (or representation learning) on graphs, and after the talk was over, Yoshua raised his hand and made this remark

Yoshua Bengio (2019, paraphrased)

That’s called Boltzmann machine learning!

Indeed…

## Data augmentation is what matters

Based on this exercise of deriving modern contrastive learning from restricted Boltzmann machines, we can now have a meta-framework for coming up with a contrastive learning recipe. Any recipe must consist of three major ingredients:

**A per-example density estimator**: i used the restricted Boltzmann machine, but you may very well use variational autoencoders, independent component analysis, principal component analysis, sparse coding, etc. these will give rise to different variants of self-supervised learning. the latter three are particularly interesting, because they are fully described by a set of basis vectors and don’t require any negative samples for learning. i’m almost 100% certain you can derive all these non-contrastive learning algorithms by choosing one of these three.**A compatibility function**$s$: this is the part where we design a network “architecture”, and how the output from this network is used to compute a scalar that indicates how similar a pair of examples is. it looks like the current practice is to use a deep neural net with a cosine similarity to implement this compatibility function.**A stochastic transformation****generator**: this generator effectively generates a density estimator for each example. this is very important, since it defines the set of bases used by these density estimators. any aspect of data cannot be modelled if these generated bases do not cover them.

we have a pretty good idea of what kind of density estimator is suitable for various purposes. we have a pretty good idea what’s the best way to measure the similarity between two highly-complex, high-dimensional inputs (thanks, deep learning!) but, we cannot know what the right stochastic transformation generator should be, because it is heavily dependent on the problem and domain. for instance, the optimal transformation generator for static, natural images won’t be optimal for e.g. natural language text.

so, my sense is that the success of using contrastive learning (or any self-supervised learning) for any given problem will ultimately boil down to the choice and design of stochastic transformation, since there’s a chance that we may find a near-optimal pair of the first two (density estimator and compatibility function) that works well across multiple problems and domains.