last night, Douwe Kiela sent me a link to this article by Ted Chiang. i was already quite drunk by then, quickly read the whole column, and posted the following tweet:
Delip Rao then retweeted it and said that he does not “buy his lossy compression analogy for LMs”, in particular in the context of JPEG compression. Delip and i exchanged a few tweets earlier today, and i thought i’d expand here, in a blog post, on the following tweet, in which i described why i think LM training and JPEG compression share the same conceptual background:
one way in which i view a compression algorithm $F$ is that it produces a concise description of a distribution $p_{compressed}$ that closely mimics the original distribution $p_{true}$. that is, the goal of $F$ is to turn the description of $p_{true}$ (i.e., $d(p_{true})$) into the description of $p_{compressed}$ (i.e., $d(p_{compressed})$) such that (1) $p_{true}$ and $p_{compressed}$ are similar to each other, and (2) $d(p_{true}) \gg d(p_{compressed})$. now, this is only one way to think of compression, as it doesn’t really tell us much about whether i can reduce the number of bits i need to describe one particular instance $x$ (drawn from the space over which these distributions are defined).
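to make this concrete, here is a tiny sketch with entirely made-up numbers; i am using entropy in bits as a crude proxy for the description length $d(\cdot)$ and KL divergence as the similarity measure in (1), neither of which is the only reasonable choice:

```python
import numpy as np

rng = np.random.default_rng(0)

# a hypothetical "true" distribution over 8 configurations
p_true = rng.dirichlet(np.ones(8))

# a hypothetical compressed version: mass pushed onto fewer configurations
p_compressed = np.where(p_true > p_true.mean(), p_true, 1e-6)
p_compressed = p_compressed / p_compressed.sum()

def entropy_bits(p):
    # crude proxy for the description length d(.)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def kl_bits(p, q):
    # criterion (1): how closely p_compressed mimics p_true
    mask = p > 0
    return float((p[mask] * np.log2(p[mask] / q[mask])).sum())

print("KL(p_true || p_compressed):", kl_bits(p_true, p_compressed))
print("H(p_true):      ", entropy_bits(p_true), "bits")
print("H(p_compressed):", entropy_bits(p_compressed), "bits")  # typically smaller
```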
then, how can JPEG be viewed from this angle? in JPEG, there is a compression-decompression routine that can be thought of as a conditional distribution over JPEG encoded-then-decoded images given the original image, i.e., $p_{JPEG}(x' | x)$, where $x$ and $x'$ are both images. this routine is almost always deterministic and may thus be considered a Dirac delta distribution. then, given the true natural image distribution $p_{true}$, we can get the following compressed distribution:
$$p_{compressed}(x') = \sum_{x \in \mathcal{X}_{image}} p_{JPEG}(x'|x) p_{true}(x).$$
that is, we push all the original images through the JPEG conditional distribution (marginalizing over them) to obtain the compressed distribution.
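to see the formula in action, here is a minimal sketch on a made-up discrete “image” space; `jpeg()` below is a hypothetical stand-in (simple pixel quantization) for the deterministic compression-decompression routine, i.e., a Dirac delta $p_{JPEG}(x'|x)$:

```python
import itertools
from collections import defaultdict

# every 2-pixel "image" with pixel values in {0, 1, 2, 3}
images = list(itertools.product(range(4), repeat=2))

# a hypothetical p_true: uniform over all 16 tiny images
p_true = {x: 1.0 / len(images) for x in images}

def jpeg(x):
    # hypothetical stand-in for JPEG: deterministically quantize each pixel to {0, 2},
    # i.e., p_JPEG(x'|x) is a Dirac delta at this output
    return tuple(2 * (v // 2) for v in x)

# p_compressed(x') = sum_x p_JPEG(x'|x) p_true(x)
p_compressed = defaultdict(float)
for x, p in p_true.items():
    p_compressed[jpeg(x)] += p

print(len(p_true), "probable images under p_true")              # 16
print(len(p_compressed), "probable images under p_compressed")  # 4
```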
why is this compression? because JPEG loses many of the fine details of the original image, many original images map to a single image with JPEG-induced artifacts. this makes the number of probable modes under $p_{compressed}$ smaller than that under the original distribution, leading to a lower entropy. this in turn means we need fewer bits to describe this distribution; hence, compression.
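to put rough numbers on this with the toy sketch above, 16 equally probable images collapse onto 4 possible outputs, each output receiving probability $1/4$:

$$H(p_{true}) = -\sum_{x} \frac{1}{16}\log_2\frac{1}{16} = 4~\text{bits} \quad\gg\quad H(p_{compressed}) = -\sum_{x'} \frac{1}{4}\log_2\frac{1}{4} = 2~\text{bits}.$$

(the numbers are of course specific to this made-up example; the point is only that the many-to-one mapping pushes the entropy down.)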
when there is a mismatch between $p_{true}$ and $p_{compressed}$, we can imagine two scenarios. one is that a configuration probable under $p_{true}$ is lost in $p_{compressed}$, which is often referred to as mode collapse. the other is that $p_{compressed}(x)$ is high for a configuration $x$ for which $p_{true}(x)$ is low, which is often referred to as hallucination. the latter is not really desirable in the case of JPEG compression, as we do not want it to produce an image that has nothing to do with any original image, but it is at the heart of generalization.
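here is a toy illustration of these two kinds of mismatch, with made-up labels and probabilities (not from any actual JPEG or LM experiment):

```python
# spotting the two kinds of mismatch between p_true and p_compressed
p_true       = {"cat photo": 0.5, "dog photo": 0.5, "static noise": 0.0}
p_compressed = {"cat photo": 0.9, "dog photo": 0.0, "static noise": 0.1}

eps = 1e-12
for x in p_true:
    if p_true[x] > eps and p_compressed[x] < eps:
        print(f"mode collapse: {x!r} is probable under p_true but lost in p_compressed")
    if p_true[x] < eps and p_compressed[x] > eps:
        print(f"hallucination: {x!r} is probable under p_compressed but not under p_true")
```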
combining these two cases, we end up with what we mean by lossy compression. in other words, any mismatch between $p_{true}$ and $p_{compressed}$ is what we mean by lossy.
in language modeling, we start with a vast number of training examples, which i will collectively consider to constitute $p_{true}$, and our compression algorithm is regularized maximum likelihood (yeah, yeah, RLHF, instructions, blah blah). this compression algorithm (LM training, if you prefer that term) results in $p_{compressed}$, which we represent with a trained neural net (though this does not imply that the trained net is the most concise representation of $p_{compressed}$.)
just like with JPEG, LM training inevitably results in a discrepancy between $p_{true}$ (i.e., the training set under my definition above) and $p_{compressed}$, due to a number of factors including the use of finite data as well as our imperfect parametrization. this mismatch however turns out to be a blessing in this particular case, as it implies generalization. that is, $p_{compressed}$ is able to assign a high probability to an input configuration that was not seen during training, and such a highly probable input turns out to look amazing to us (humans!)
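here is a toy sketch of this, with a character-level bigram model and add-$k$ smoothing standing in, very crudely, for regularized maximum likelihood (this is of course not how modern LMs are trained). the three training strings play the role of $p_{true}$, the smoothed bigram table plays the role of $p_{compressed}$, and the final, unseen string still receives a non-zero probability, i.e., generalization:

```python
from collections import defaultdict

training_set = ["the cat sat", "the dog sat", "the cat ran"]
vocab = sorted(set("".join(training_set)))
k = 0.1  # smoothing strength, a crude stand-in for the regularizer

# count character bigrams over the training set (our p_true, in the sense above)
counts = defaultdict(lambda: defaultdict(float))
for s in training_set:
    for a, b in zip(s, s[1:]):
        counts[a][b] += 1.0

def p_next(a, b):
    # regularized (smoothed) maximum-likelihood estimate of p(b | a)
    total = sum(counts[a].values()) + k * len(vocab)
    return (counts[a][b] + k) / total

def p_string(s):
    # probability of a string under the bigram model (our p_compressed)
    p = 1.0
    for a, b in zip(s, s[1:]):
        p *= p_next(a, b)
    return p

# "the dog ran" never appears in the training set, yet it gets non-zero probability
print(p_string("the dog ran"))
```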
in summary, both JPEG compression and LM training turn the original distributions of natural images and human-written text, respectively, into their compressed versions. in doing so, an inevitable mismatch arises between these two distributions in each case, and this is why we refer to this process as lossy compression. this lossy nature ends up assigning non-zero probabilities to unseen input configurations, and this is generalization. in the case of JPEG, such generalization is often undesirable, while desirable generalization happens with LMs thanks to our decades of innovations that have culminated in modern language models.
so, yes, both are lossy compression with comparable if not identical underlying conceptual frameworks. the real question, however, is not whether lossy compression makes LMs less or more interesting, but rather which of the ingredients we have found to build these large-scale LMs contribute to such desirable generalization, and how.