my inbox started to overflow with emails that urgently require my attention, and my TODO list (which doesn’t exist outside my own brain) started to randomly remove entries to avoid overflowing. of course, this is the perfect time for me to think of some random stuff.
This time, this random stuff is contrastive learning. my thoughts on this stuff were sparked by Lerrel Pinto’s message on #random in our group’s Slack, responding to the question “What is wrong with contrastive learning?” thrown by Andrew Gordon Wilson. Lerrel said,
My understanding is that getting negatives for contrastive learning is difficult.
Lerrel Pinto (2021)
Restricted Boltzmann Machines
i haven’t worked on the (post-)modern version of contrastive learning, but every time i hear of “negative samples” i am reminded of my phd years. during my phd years, i mainly worked on a restricted Boltzmann machine, which defines a distribution over the observation space as

$$p_\theta(x) = \frac{1}{Z(\theta)} \prod_{j=1}^{K} \left(1 + \exp\left(w_j^\top x + b_j\right)\right),$$

where

$$Z(\theta) = \sum_{x'} \prod_{j=1}^{K} \left(1 + \exp\left(w_j^\top x' + b_j\right)\right)$$

is the normalization constant, and where each weight vector $w_j$ (together with its bias $b_j$) corresponds to one hidden unit, or expert.
the goal of learning with a restricted Boltzmann machine is then to maximize the log-probabilities of the observations (training examples $x_1, \ldots, x_N$):

$$\max_\theta \; \frac{1}{N} \sum_{n=1}^{N} \log p_\theta(x_n),$$

using stochastic gradient descent, with the stochastic gradient derived to be

$$\nabla_\theta \log p_\theta(x_n) = \sum_{j=1}^{K} \nabla_\theta \log\left(1 + \exp\left(w_j^\top x_n + b_j\right)\right) - \mathbb{E}_{x' \sim p_\theta}\left[\sum_{j=1}^{K} \nabla_\theta \log\left(1 + \exp\left(w_j^\top x' + b_j\right)\right)\right].$$
the first term ensures that each hidden unit (or expert) $w_j$ assigns a high compatibility (equivalently, a low energy) to the training example $x_n$.
the second term corresponds to computing the expected negative energy (ugh, i hate this discrepancy; we maximize the probability but we minimize the energy) over all possible observations according to the model distribution. what this term does is to look for all input configurations $x'$ that the model currently assigns high probability to (the so-called negative samples) and push their probabilities back down.
you can imagine this as playing whac-a-mole. we try to pull out our favourite moles, while we “whac” any mole that’s favoured by the whac-a-mole machine.
in training a restricted Boltzmann machine, the major difficulty lies in how to efficiently and effectively draw negative samples from the model distribution. a lot of bright minds at the University of Toronto and the University of Montreal back then (somewhere between 2006 and 2013) spent years figuring this out. unfortunately, we (as a field) never got it to work well, which is probably not surprising since we’re talking about sampling from an unnormalized (often discrete) distribution over hundreds if not thousands of dimensions. if it were easy, we would’ve solved most of the problems in ML already.
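to make the role of negative samples concrete, here is a minimal numpy sketch (mine, not part of the original derivation) of one stochastic gradient step for a binary restricted Boltzmann machine, where the intractable model expectation is approximated with contrastive divergence (CD-1), i.e., negative samples obtained from a single step of block Gibbs sampling. all names, dimensions and hyperparameters below are placeholder choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_step(x, W, b, c, lr=0.01):
    """One CD-1 update for a binary RBM.

    x: (batch, n_visible) binary observations (the positive samples)
    W: (n_visible, n_hidden) weights
    b: (n_hidden,) hidden biases, c: (n_visible,) visible biases
    """
    # positive phase: hidden activations given the data
    ph_data = sigmoid(x @ W + b)

    # negative phase: one step of block Gibbs sampling to obtain negative samples
    h_sample = (rng.random(ph_data.shape) < ph_data).astype(x.dtype)
    pv_model = sigmoid(h_sample @ W.T + c)
    v_model = (rng.random(pv_model.shape) < pv_model).astype(x.dtype)
    ph_model = sigmoid(v_model @ W + b)

    # gradient estimate = data statistics - negative-sample statistics
    W += lr * (x.T @ ph_data - v_model.T @ ph_model) / x.shape[0]
    b += lr * (ph_data - ph_model).mean(axis=0)
    c += lr * (x - v_model).mean(axis=0)
    return v_model  # the negative samples used for this update

# toy usage with random binary data (placeholder): 20 visible, 16 hidden units
x = (rng.random((64, 20)) < 0.5).astype(np.float64)
W = 0.01 * rng.standard_normal((20, 16))
b = np.zeros(16)
c = np.zeros(20)
for _ in range(10):
    cd1_step(x, W, b, c)
```

even this crude single-step chain illustrates the difficulty Lerrel alluded to: the quality of learning hinges entirely on how well `v_model` approximates a true sample from the model distribution.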
Data augmentation creates a restricted Boltzmann machine
let’s consider a stochastic transformation $q(\tilde{x} \mid x)$ that takes an input $x$ and returns a randomly transformed copy $\tilde{x}$ of it.
imagine a widely used set of input transformations in e.g. computer vision.
what we will now do is to create a very large set of hidden units (or experts) by drawing transformed inputs from the stochastic transformation: given a training example $x_n$, we draw $\tilde{x}_n^1, \ldots, \tilde{x}_n^M \sim q(\cdot \mid x_n)$.
these hidden units then define a restricted Boltzmann machine and allow us to compute the probability of any input $x$:

$$p_n(x) = \frac{1}{Z(x_n)} \prod_{m=1}^{M} \left(1 + \exp\left(\phi(\tilde{x}_n^m, x)\right)\right),$$

where i’m now using a compatibility function $\phi(a, b)$, which returns a scalar saying how well two inputs $a$ and $b$ go together, in place of the linear energy $w_j^\top x$ from above.
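as a concrete (hypothetical) example of such a compatibility function, here is a short PyTorch sketch following the deep-net-plus-cosine-similarity practice mentioned later in this post; the encoder architecture, dimensions and temperature are placeholder choices of mine.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Compatibility(nn.Module):
    """phi(a, b): scalar compatibility between two inputs.

    Implemented here as cosine similarity between encoded representations,
    divided by a temperature. The encoder is just a placeholder MLP.
    """
    def __init__(self, dim_in=784, dim_z=128, temperature=0.1):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(dim_in, 512), nn.ReLU(),
            nn.Linear(512, dim_z),
        )
        self.temperature = temperature

    def forward(self, a, b):
        za = F.normalize(self.encoder(a), dim=-1)
        zb = F.normalize(self.encoder(b), dim=-1)
        return (za * zb).sum(dim=-1) / self.temperature  # (batch,)

phi = Compatibility()
a, b = torch.randn(4, 784), torch.randn(4, 784)
print(phi(a, b))  # four compatibility scores
```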
starting from here, we’ll make two changes (one relaxation and one restriction). first, we don’t want to only use the $M$ transformed copies we happened to draw; we relax this and let every transformed copy of $x_n$ under $q(\cdot \mid x_n)$ act as an expert, i.e., we take the expectation over the stochastic transformation rather than a finite sample.
second, we will assume that the input space is restricted to the $N$ training examples, so that the normalization constant is computed by summing over just these $N$ examples rather than over the entire (exponentially large) input space.
to summarize what we’ve done so far: we build one restricted Boltzmann machine for a given input $x_n$, whose experts are the transformed copies of $x_n$ under the stochastic transformation $q$ (scored by the compatibility function $\phi$) and whose support is restricted to the $N$ training examples.
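putting the pieces so far into code, here is a rough sketch of this per-example restricted Boltzmann machine, under the same assumptions as above; `phi`, `transform` and `n_copies` are placeholder names of mine, and the normalization runs over the training set only, per the restriction above.

```python
import torch
import torch.nn.functional as F

def rbm_log_prob(x_n, candidates, train_set, phi, transform, n_copies=8):
    """log p_n(candidate) for each candidate, under the RBM built from x_n.

    x_n:        (d,) a single training example
    candidates: (C, d) inputs whose probability we want
    train_set:  (N, d) all training examples (the restricted input space)
    phi:        compatibility function, phi(a, b) -> (batch,) scores
    transform:  stochastic transformation, x -> randomly transformed copy of x
    """
    copies = torch.stack([transform(x_n) for _ in range(n_copies)])  # (M, d) experts

    def score(inputs):
        # unnormalized log-probability: sum over experts of log(1 + exp(phi)) = softplus
        s = 0.0
        for copy in copies:
            s = s + F.softplus(phi(copy.expand_as(inputs), inputs))
        return s  # (batch,)

    log_z = torch.logsumexp(score(train_set), dim=0)  # normalize over the N training examples
    return score(candidates) - log_z
```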
Contrastive learning trains N restricted Boltzmann machines
what would be a good training criterion for one such restricted Boltzmann machine? the answer is almost always maximum likelihood! in this particular case, we want to ensure that the original example $x_n$ is assigned a high probability under the restricted Boltzmann machine built from its own transformed copies:

$$\log p_n(x_n) = s_n(x_n) - \log \sum_{n'=1}^{N} \exp\left(s_n(x_{n'})\right),$$

where $s_n(x) = \mathbb{E}_{\tilde{x} \sim q(\cdot \mid x_n)}\left[\log\left(1 + \exp\left(\phi(\tilde{x}, x)\right)\right)\right]$ is the unnormalized log-probability of $x$ under this restricted Boltzmann machine, and the normalization runs over the $N$ training examples only (our restriction from above).
we do so for all $N$ training examples, i.e., we maximize $\sum_{n=1}^{N} \log p_n(x_n)$; since these restricted Boltzmann machines share the compatibility function $\phi$, this trains all of them jointly.
since it’s decomposed over the training examples, let’s consider only one example $x_n$:

$$\log p_n(x_n) \approx \log\left(1 + \exp\left(\phi(\tilde{x}_n, x_n)\right)\right) - \log \sum_{n'=1}^{N} \left(1 + \exp\left(\phi(\tilde{x}_n, x_{n'})\right)\right),$$

where we use a single transformed copy $\tilde{x}_n \sim q(\cdot \mid x_n)$ to approximate the expectation over the stochastic transformation.
this does look quite similar to more recently popular variants of contrastive learning. we start from a training example $x_n$, create a positive pair by applying a random transformation to it, and contrast this pair against pairs formed with the other training examples, which serve as negative examples.
perhaps the only major difference is that this formulation gives us a clear guideline on how we should pick the negative examples. that is, according to this formula, we should either use all the training examples weighted according to how likely they are under this per-example restricted Boltzmann machine, or sample a few negative examples in proportion to those probabilities.
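to connect this to code one last time, here is a rough sketch (again with my placeholder names `phi` and `transform`, and a single transformed copy per example) of the per-example loss above, together with one way to weight, or sample, negative examples according to the per-example model distribution as just suggested.

```python
import torch
import torch.nn.functional as F

def per_example_loss(x_n, train_set, phi, transform):
    """Negative log-probability of x_n under its own RBM, with a single
    transformed copy and the training set as the restricted input space."""
    x_tilde = transform(x_n)                                            # positive view of x_n
    scores = F.softplus(phi(x_tilde.expand_as(train_set), train_set))   # (N,) expert scores
    score_pos = F.softplus(phi(x_tilde.unsqueeze(0), x_n.unsqueeze(0)))[0]
    return -(score_pos - torch.logsumexp(scores, dim=0))

def negative_weights(x_n, train_set, phi, transform):
    """How much each training example contributes as a negative:
    its probability under the per-example model distribution."""
    x_tilde = transform(x_n)
    scores = F.softplus(phi(x_tilde.expand_as(train_set), train_set))
    return torch.softmax(scores, dim=0)  # (N,) weights; one could also sample from these
```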
so, yes, contrastive learning can be derived from restricted Boltzmann machines, and this is advantageous, because it tells us how we should pick negative examples. in fact, as i was writing this blog post (and an earlier internal Slack message), i was reminded of a recent workshop i attended together with Yoshua Bengio. there was a talk on how to choose hard negative samples for contrastive learning (or representation learning) on graphs, and after the talk was over, Yoshua raised his hand and made this remark:
That’s called Boltzmann machine learning!
Yoshua Bengio (2019, paraphrased)
Indeed…
Data augmentation is what matters
Based on this exercise of deriving modern contrastive learning from restricted Boltzmann machines, we now have a meta-framework for coming up with a contrastive learning recipe. Any such recipe consists of three major ingredients (a schematic sketch of how they plug together follows the list):
- A per-example density estimator: i used the restricted Boltzmann machine, but you may very well use variational autoencoders, independent component analysis, principal component analysis, sparse coding, etc. these will give rise to different variants of self-supervised learning. the latter three are particularly interesting, because they are fully described by a set of basis vectors and don’t require any negative samples for learning. i’m almost 100% certain you can derive all these non-contrastive learning algorithms by choosing one of these three.
- A compatibility function $\phi(\cdot, \cdot)$: this is the part where we design a network “architecture” and decide how the output from this network is used to compute a scalar that indicates how similar a pair of examples is. it looks like the current practice is to use a deep neural net with a cosine similarity to implement this compatibility function.
- A stochastic transformation generator: this generator effectively creates a density estimator for each example. this is very important, since it defines the set of bases used by these density estimators. any aspect of the data that these generated bases do not cover simply cannot be modelled.
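to make these three ingredients explicit, here is a purely schematic sketch (my own framing, not a prescription from this post) of how they plug together; each component can be swapped out independently.

```python
from dataclasses import dataclass
from typing import Callable
import torch

@dataclass
class ContrastiveRecipe:
    # ingredient 1: a per-example density estimator, built on the fly from one example
    # (example, compatibility, transform) -> a score function over batches of inputs
    build_density_estimator: Callable
    # ingredient 2: a compatibility function between two inputs (e.g., deep net + cosine similarity)
    compatibility: Callable
    # ingredient 3: a stochastic transformation generator (domain-specific, the hard part)
    transform: Callable

    def loss(self, example: torch.Tensor, train_set: torch.Tensor) -> torch.Tensor:
        # negative log-probability of the example under its own per-example estimator,
        # normalized over the training set (the restriction from earlier in the post)
        score = self.build_density_estimator(example, self.compatibility, self.transform)
        return -(score(example.unsqueeze(0))[0] - torch.logsumexp(score(train_set), dim=0))
```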
we have a pretty good idea of what kind of density estimator is suitable for various purposes. we also have a pretty good idea of the best way to measure the similarity between two highly complex, high-dimensional inputs (thanks, deep learning!). but we cannot know in advance what the right stochastic transformation generator should be, because it is heavily dependent on the problem and domain. for instance, the optimal transformation generator for static, natural images won’t be optimal for e.g. natural language text.
so, my sense is that the success of contrastive learning (or any self-supervised learning) on any given problem will ultimately boil down to the choice and design of the stochastic transformation, since there’s a chance we may find a near-optimal pair of the first two ingredients (density estimator and compatibility function) that works well across multiple problems and domains.