[WARNING: there is nothing “WOW” or technical about this post; it is just a piece of thought i had about GPT-3 and few-shot learning.]

Many aspects of OpenAI’s GPT-3 have fascinated and continue to fascinate people, including myself. these aspects include the sheer scale (in terms of the number of parameters, the amount of compute and the size of data), the amazing infrastructure technology that has enabled training this massive model, etc. of course, among all these fascinating aspects, meta-learning, or few-shot learning, seems to be the one that fascinates people most.

the idea behind this observation of GPT-3 as a meta-learner is relatively straightforward. GPT-3 in its essence computes the conditional distribution over all possible next tokens (from a predefined vocabulary) given a prefix: $p(x' | x_1, \ldots, x_t)$. this conditional distribution can be chained to form a conditional distribution over sequences given the prefix: $p(x'_1, \ldots, x'_{t'} | x_1, \ldots, x_t) = \prod_{t''=1}^{t'} p(x'_{t''} | x'_{<t''}, x_{\leq t})$. this makes GPT-3 subsume a so-called sequence-to-sequence or encoder-decoder model, allowing one to use GPT-3 to find an answer $(x'_1, \ldots, x'_{t'})$ given a question $(x_1, \ldots, x_t)$ (often referred to as a “prompt”, which comes together with a couple of known examples) by solving

\[

\arg\max_{x'_1, \ldots, x'_{t'}} \log p(x'_1, \ldots, x'_{t'} | x_1, \ldots, x_t).

\]

This problem turns out to be intractable to solve exactly, and people have been using an approximate search or sampling algorithm, such as greedy search or top-$k$ sampling, to find an answer given a prompt. In the GPT-3 paper, the authors present an impressive set of experimental results highlighting this meta-learning aspect of GPT-3.
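To make the chain rule and the two approximate decoders concrete, here is a minimal sketch. The vocabulary, the table of probabilities and the function names are all made up for illustration; `next_token_probs` is a toy stand-in for a language model and has nothing to do with how GPT-3 is actually served.

```python
import math
import random

# Hypothetical toy vocabulary; a stand-in for a real model's vocabulary.
VOCAB = ["a", "b", "c", "<eos>"]


def next_token_probs(prefix):
    """Toy p(x' | prefix). Assumption: it only looks at the last token."""
    last = prefix[-1] if prefix else "a"
    table = {
        "a": [0.1, 0.6, 0.2, 0.1],
        "b": [0.2, 0.1, 0.6, 0.1],
        "c": [0.1, 0.1, 0.1, 0.7],
        "<eos>": [0.0, 0.0, 0.0, 1.0],
    }
    return dict(zip(VOCAB, table[last]))


def sequence_log_prob(prompt, answer):
    """Chain rule: sum over i of log p(answer[i] | answer[:i], prompt)."""
    context = list(prompt)
    total = 0.0
    for token in answer:
        total += math.log(next_token_probs(context)[token])
        context.append(token)
    return total


def greedy_decode(prompt, max_len=10):
    """Pick the single most likely next token at every step."""
    context, answer = list(prompt), []
    for _ in range(max_len):
        probs = next_token_probs(context)
        token = max(probs, key=probs.get)
        answer.append(token)
        context.append(token)
        if token == "<eos>":
            break
    return answer


def top_k_sample(prompt, k=2, max_len=10, rng=None):
    """Sample each token from the k most likely candidates, renormalized."""
    rng = rng or random.Random(0)
    context, answer = list(prompt), []
    for _ in range(max_len):
        probs = next_token_probs(context)
        top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
        tokens, weights = zip(*top)
        # random.choices normalizes the weights for us.
        token = rng.choices(tokens, weights=weights, k=1)[0]
        answer.append(token)
        context.append(token)
        if token == "<eos>":
            break
    return answer


greedy_answer = greedy_decode(["a"])
greedy_score = sequence_log_prob(["a"], greedy_answer)
sampled_answer = top_k_sample(["a"], k=2)
```

Neither decoder is guaranteed to find the true $\arg\max$; greedy search commits to one token at a time, and top-$k$ sampling trades some likelihood for diversity.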

But, then, you start to wonder. in my case, i began to wonder about this just today during our research group’s weekly meeting, when Elman Mansimov presented a few recent papers that follow up on this meta-learning aspect of a language model, of which GPT-3 greatly increased awareness. What do i wonder? I wonder if it’s meta-learning, as we think of meta-learning conceptually, that drives this phenomenon, or if there is actually a simpler mechanism behind this observation.

let’s imagine a wonderful hypothetical world in which I can train another GPT-3 on the same data myself at NYU, but this time i will make one slight tweak. that is, i will train this new GPT-3, to which i refer as GPT-E, after reversing the order of all documents in the original dataset. that is, GPT-E computes the conditional distribution over all possible previous tokens given a suffix: $p(x | x'_t, x'_{t-1}, \ldots)$. since OpenAI has successfully trained GPT-3, you’d trust that i would be able to train this model in this hypothetical, but happy world. I will also assume that in this happy parallel universe, i can hire at NYU all the amazing talents who worked on GPT-3, perhaps as postdocs or even as PhD students, so that the quality of GPT-E rivals that of GPT-3.
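The data tweak itself is tiny. A minimal sketch, assuming each document is already a list of tokens (the tokenization and the function names here are made up for illustration): reverse every document, then train an ordinary left-to-right model on the result, so that "predict the next token" on the reversed text amounts to "predict the previous token" on the original.

```python
def reverse_document(tokens):
    """Reverse the token order of one document."""
    return list(reversed(tokens))


def build_reversed_corpus(corpus):
    """Apply the reversal to every document; this is GPT-E's training set."""
    return [reverse_document(doc) for doc in corpus]


def next_token_pairs(tokens):
    """(context, target) pairs that a left-to-right model is trained on."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Toy corpus with whitespace-level "tokens" (an assumption for readability).
corpus = [["the", "cat", "sat"], ["Q:", "2+2?", "A:", "4"]]
reversed_corpus = build_reversed_corpus(corpus)

# On the reversed question-answer document, the model sees the answer
# first and has to produce the question afterwards.
pairs = next_token_pairs(reversed_corpus[1])
```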

but, then, something weird happens. if we believe in GPT-3’s meta-learning capability, GPT-E does something as amazing as (if not more amazing than) what GPT-3 can do. It takes as input a test question-answer pair and can output the prompt, which contains both a few training examples and a test question (!), assuming of course that the amounts of information on both sides are comparable (which should be the case for zero-shot or few-shot learning.)

Do you see what I am getting at? yes, we can now alternate between GPT-3 and GPT-E to sequentially create an encyclopedia of all the knowledge in the world (well, at least the part that was represented in the training set.) We start from a random factoid and call it $(Q_0,A_0)$. We can find a reasonable “prompt” by feeding GPT-E with $(r(A_0), r(Q_0))$, where $r$ reverses a string, and sampling $P_0 \sim p(x_1, \ldots, x_t | A_0, Q_0)$, preferably using top-$k$ sampling to reduce noise while maintaining some stochasticity. this prompt $P_0$ would consist of a (noisy) description of the task that corresponds to this factoid and a few noisy examples that are not exactly $(Q_0,A_0)$, in addition to the next question $Q_1$. We switch to GPT-3 and now sample another factoid $(Q_1, A_1)$ based on $P_0$. We alternate between these two steps, or rather between GPT-3 (real) and GPT-E (hypothetical), for as long as we want and accumulate $(Q_n, A_n)$ to create the encyclopedia of world knowledge. Beautiful, isn’t it?
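The alternation can be sketched as a short loop. Everything here is hypothetical: `sample_forward` plays the role of GPT-3, `sample_backward` plays the role of GPT-E, and the toy stand-ins below simply look things up in a three-entry table so that the loop can run at all. It is only meant to show the shape of the procedure, not an actual implementation.

```python
import random


def reverse(tokens):
    """The function r from the text, applied to a list of tokens."""
    return tokens[::-1]


def build_encyclopedia(sample_forward, sample_backward, q0, a0, rounds):
    """Alternate between a backward model (GPT-E) and a forward model (GPT-3).

    sample_backward(reversed_suffix) -> prompt tokens in reverse order
    sample_forward(prompt)           -> (question tokens, answer tokens)
    """
    facts = [(q0, a0)]
    q, a = q0, a0
    for _ in range(rounds):
        # GPT-E reads (r(A_n), r(Q_n)) and emits the prompt P_n backwards.
        reversed_prompt = sample_backward(reverse(a) + reverse(q))
        prompt = reverse(reversed_prompt)
        # GPT-3 reads P_n and produces the next factoid (Q_{n+1}, A_{n+1}).
        q, a = sample_forward(prompt)
        facts.append((q, a))
    return facts


def make_toy_models(facts, seed=0):
    """Toy stand-ins for the two models, backed by a lookup table."""
    rng = random.Random(seed)
    questions = sorted(facts)

    def sample_backward(reversed_suffix):
        current_q = reversed_suffix[-1]
        next_q = rng.choice([q for q in questions if q != current_q])
        example_q = rng.choice(questions)
        # Prompt layout (an assumption): one example pair, then a question.
        prompt = [example_q, facts[example_q], next_q]
        return reverse(prompt)

    def sample_forward(prompt):
        question = prompt[-1]
        return [question], [facts[question]]

    return sample_forward, sample_backward


FACTS = {"capital of France?": "Paris", "2+2?": "4", "color of the sky?": "blue"}
forward, backward = make_toy_models(FACTS)
encyclopedia = build_encyclopedia(forward, backward, ["2+2?"], ["4"], rounds=5)
```

In the toy version, all the "knowledge" sits in the lookup table and the only randomness is which question comes next, which is pretty much the question i am asking about the real thing.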

But, hold on. Where did meta-learning go? where is meta-learning in this Gibbs-like sampling procedure? is meta-learning just “noise” injected in each round of alternating between GPT-3 and GPT-E, for this Gibbs-like procedure to explore the space of knowledge effectively? If i wanted to put a positive, promising spin on it: is meta-learning how such noise is shaped by a large neural net so that it only spans relevant directions in this high-dimensional space corresponding to the knowledge manifold?

as I warned you at the beginning, there’s neither a “wow” moment nor a “wow” conclusion in this post. this is just one piece of thought i had about GPT-3 that got me even more confused about all things machine learning (meta-learning, generative modeling, denoising, gibbs sampling, etc.)

P.S. i’m waiting for big tech firms with deep pockets (Amazon, Google, FB, etc. i’m looking at you) to train GPT-E for me to test this idea 😛

P.P.S. you see why it was called GPT-E?

3LIT3!