[Updated on Nov 30 2020] added a section about the scaling law w.r.t. the model size, per request from Felix Hill.

[Updated on Dec 1 2020] added a paragraph referring to Dauphin & Bengio’s “Big Neural Networks Waste Capacity“.

this is a short post on why i **thought** (or more like imagined) the scaling laws from <scaling laws for autoregressive generative modeling> by Henighan et al. “[are] inevitable from using log loss (the reducible part of KL(p||q))” when “the log loss [was used] with a max entropy model“, which was my response to Tim Dettmers’s tweet on “why people are not talking more about the OpenAI scaling law papers“. thanks to João Guilherme for bringing this to my attention. it’s given me a chance to run some fun thought experiments over the weekend, although most, if not all, of them failed, as usual with any ideas and experiments i have. anyhow, i thought i’d leave here why i thought so, particularly from the perspective of dataset size.

- The scaling law for Bernoulli w.r.t. the dataset size
- The scaling law for Bernoulli w.r.t. the model size
- The scaling law for Bernoulli w.r.t. the compute amount
- Final thoughts

instead of considering a grand neural autoregressive model, i’ll simply consider estimating the mean of a Bernoulli variable after $N$ trials, and compare the log loss at this point against the log loss computed after $N+\Delta$ trials. let’s start by writing down the loss value after $N$ trials:

$$

-L(N) = p^* \log \frac{N_1}{N} + (1-p^*) \log \frac{N-N_1}{N} =

p^* \log N_1 + (1-p^*) \log (N-N_1) - \log N,

$$

where $p^*$ is the true ratio of heads and $N_1 < N$ is the number of heads from the $N$ trials.
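to make this concrete, here is a quick numerical sanity check (a sketch of my own; the function and variable names are mine) that the log loss computed from the empirical estimate $N_1/N$ approaches the entropy of the true Bernoulli distribution as $N$ grows:

```python
import math
import random

def log_loss(p_star, n1, n):
    # cross-entropy between the true Bernoulli(p_star) and
    # the empirical estimate q = n1 / n
    q = n1 / n
    return -(p_star * math.log(q) + (1 - p_star) * math.log(1 - q))

random.seed(0)
p_star = 0.3
for n in (100, 1_000, 10_000, 100_000):
    n1 = sum(random.random() < p_star for _ in range(n))
    print(n, round(log_loss(p_star, n1, n), 6))
```

the loss never goes below the entropy $-(p^* \log p^* + (1-p^*) \log (1-p^*)) \approx 0.611$ for $p^*=0.3$ (the irreducible part), and the gap above it, the reducible KL term, shrinks as $N$ grows.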

let’s now consider tossing the coin $\Delta$ more times. i will use $\Delta_1 < \Delta$ as the number of additional heads after these additional trials. what’s the loss after $N+\Delta$ trials?

$$

-L(N+\Delta) = p^* \log (N_1 + \Delta_1) + (1-p^*)\log(N+\Delta - N_1 - \Delta_1) - \log (N+\Delta).

$$

so far so good. now, what kind of relationship between these two quantities $L(N)$ and $L(N+\Delta)$ do i want to get? in my mind, one way to say there’s a power-law-like structure behind $L$ is to show that the amount of improvement i get by running $\Delta$ more trials decreases as the number of existing trials $N$ increases. that is, there are diminishing returns from a unit of effort as more effort has been put in.*

then, let’s look at their difference by starting from the loss at $N+\Delta$, while assuming that $\Delta \ll N$ (and naturally $\Delta_1 \ll N_1$ as well) so that i can use $\log (1+x) \approx x$ when $x$ is small:

$$

\begin{align*}
-L(N+\Delta) =& p^* \log (N_1 + \Delta_1) + (1-p^*)\log(N+\Delta - N_1 - \Delta_1) - \log (N+\Delta) \\
=& p^* \log N_1 \left(1+ \frac{\Delta_1}{N_1}\right) + (1-p^*) \log (N-N_1) \left(1 + \frac{\Delta - \Delta_1}{N-N_1}\right) - \log N \left(1+ \frac{\Delta}{N}\right) \\
\approx& \underbrace{p^* \log N_1 + (1-p^*) \log (N-N_1) - \log N}_{=-L(N)} + p^* \frac{\Delta_1}{N_1} + (1-p^*)\frac{\Delta - \Delta_1}{N-N_1} - \frac{\Delta}{N}.
\end{align*}

$$

The decrease in the loss by running $\Delta$ more trials can now be written as

$$

L(N) - L(N+\Delta) = p^* \frac{\Delta_1}{N_1} + (1-p^*)\frac{\Delta - \Delta_1}{N-N_1} - \frac{\Delta}{N}.

$$

since $\Delta_1 < \Delta$ and $N_1 < N$, let’s rewrite them as $\Delta_1 = \beta \Delta$ and $N_1 = \alpha N$, where $\alpha \in [0,1]$ and $\beta \in [0,1]$. then,

$$

L(N) - L(N+\Delta) = p^* \frac{\beta \Delta}{\alpha N} + (1-p^*) \frac{(1-\beta)\Delta}{(1-\alpha)N} - \frac{\Delta}{N} = \frac{\Delta}{N} \left(p^* \frac{\beta}{\alpha} + (1-p^*)\frac{1-\beta}{1-\alpha} - 1\right).

$$

this says that the change from the loss at $N$ to the loss at $N+\Delta$ is inversely proportional to $N$ itself, which is what i wanted to see from the beginning. although there were a few leaps of faith along the way, it looks like the more tosses i have made (i.e., the larger $N$), the smaller the change i can make to my loss with a constant number of extra tosses.
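a quick numerical check of this (the constants are illustrative choices of mine): fix $\alpha$ and $\beta$, and compare the exact loss drop against the first-order expression above for increasing $N$:

```python
import math

def loss(p_star, n1, n):
    # cross-entropy between Bernoulli(p_star) and the estimate n1 / n
    q = n1 / n
    return -(p_star * math.log(q) + (1 - p_star) * math.log(1 - q))

p_star, alpha, beta = 0.3, 0.28, 0.31  # illustrative values
delta = 100
for n in (1_000, 10_000, 100_000):
    n1, d1 = alpha * n, beta * delta
    exact = loss(p_star, n1, n) - loss(p_star, n1 + d1, n + delta)
    approx = (delta / n) * (p_star * beta / alpha
                            + (1 - p_star) * (1 - beta) / (1 - alpha) - 1)
    print(n, exact, approx)
```

both columns shrink roughly like $1/N$, and the first-order approximation tightens as $\Delta/N \to 0$.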

the second (multiplicative) term is more complicated, and i find it easier to think of two extreme cases: $p^*=1$ and $p^*=0$. these cases are reasonable if we think of this exercise as a proxy for studying classification, where it’s often assumed that a given input belongs to either one (positive) class or the other (negative) class in an ideal world. when $p^*=1$, the second term reduces to

$$

\frac{\beta}{\alpha} - 1~~

\begin{cases}

> 0, & \text{if } \beta > \alpha \\

< 0, & \text{if } \beta < \alpha \\

= 0, & \text{if } \beta = \alpha

\end{cases}

$$

in other words, if the extra tosses reflect the true distribution better ($\beta > \alpha$, because the true positive rate is $1$,) the loss drops. otherwise, the loss increases ($\alpha > \beta$) or stays the same (i.e., no additional information has been added.) the other extreme case of $p^* = 0$ works similarly.

what’s important is that this second term largely dictates the sign of how the loss changes with the extra $\Delta$ tosses. since we are considering only the ratios of heads within sets of trials, and (suddenly!) assuming both $N$ and $\Delta$ are reasonably large, the magnitude of the change is instead largely determined by the ratio between $\Delta$ and $N$, with $N$ in the denominator.

so, this is how i arrived at my shallow take on twitter that these scaling laws may not have too much to do with whether we use a neural net parameterization or not, whether we are solving language modeling, machine translation, etc., or whether we are working with text, images or both. “i think” it arises naturally from the maximum entropy formulation (you can think of estimating the log-frequency of the heads above with sigmoid/softmax to turn it into the Bernoulli distribution) and the log loss.

of course, because i had to make a number of leaps of faith (or, to put it another way, a few unreasonable assumptions,) it’s possible that this actually doesn’t make much sense at the end of the day. furthermore, i’m super insecure about my math in general, and i’m about 99.9% sure there’s something wrong in the derivation above. hence, why “i think” the scaling law arises from the log loss (cross-entropy) and maximum entropy models.

it’s important for me to point out at this point that Henighan et al. did much more than what i’ve discussed in this post and provided a much more extensive set of very interesting findings. they looked not only at the effect of the data size, but also at the compute budget $C$ and model size $|\theta|$. in fact, they focus much more on the latter two than on the former, which was my focus here.

in the case of the model size, it’s quite straightforward to map it to the argument i made above regarding the number $N$ of observations. let’s consider the model size $|\theta|$, in this context of recovering a Bernoulli distribution, as the number of bits (with an arbitrary base, including $e$) allowed to represent $N$ and $N_1$ (and consequently, $\Delta$ and $\Delta_1$.) then, the maximum $N$ a model can count up to is $\exp(|\theta|)$, and by increasing the model size by $\delta$ (i.e., $|\theta|+\delta$,) we can toss the coin

$$

\exp(|\theta|) \exp(\delta) - \exp(|\theta|) = \exp(|\theta|) (\exp(\delta) - 1)

$$

more. in other words, increasing the size of the model, while assuming that we can run as many tosses as we can to saturate the model capacity, is equivalent to setting $\Delta$ above to $\exp(|\theta|) (\exp(\delta) – 1)$.

in this case, the first term in the change in the loss above reduces to

$$

\frac{\Delta}{N} = \frac{\exp(|\theta|) (\exp(\delta) - 1)}{\exp(|\theta|)} = \exp(\delta) - 1,

$$

which is weird, because the dependence on $N = \exp(|\theta|)$ disappeared. that is, the change in the loss w.r.t. the increase in the model size (the number of bits) is not dependent on the number of existing bits used by the model.

what is happening here? in my opinion, this implies that the way a neural net uses its parameters, or additional parameters, is **not** optimal in terms of compression.

what if we instead assume that only a polynomial number of trials can be compressed, i.e., $N=|\theta|^c$? in particular, for the sake of simplicity, let’s assume $c=2$. in this case,

$$

\frac{\Delta}{N} = \frac{(|\theta|+\delta)^2 - |\theta|^2}{|\theta|^2} = 2\frac{\delta}{|\theta|} + \left(\frac{\delta}{|\theta|}\right)^2,

$$

and voila! we recovered the dependence on the model size $|\theta|$, and this dependence is an inverse proportionality, as expected. by further assuming that $\delta \ll |\theta|$, we end up with

$$

\frac{\Delta}{N} \approx 2 \frac{\delta}{|\theta|}.

$$
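these two capacity assumptions can be contrasted numerically (a sketch with arbitrary constants of my choosing): under the exponential assumption $\Delta/N$ is flat in $|\theta|$, while under the polynomial one it decays like $1/|\theta|$:

```python
import math

def ratio_exp(theta, d):
    # exponential capacity: N = exp(theta), Delta = exp(theta + d) - exp(theta)
    return (math.exp(theta + d) - math.exp(theta)) / math.exp(theta)

def ratio_poly(theta, d, c=2):
    # polynomial capacity: N = theta**c, Delta = (theta + d)**c - theta**c
    return ((theta + d) ** c - theta ** c) / theta ** c

for theta in (10, 20, 40):
    print(theta, ratio_exp(theta, 1.0), ratio_poly(theta, 1.0))
```

the first column stays at $\exp(\delta)-1$ no matter the model size, while the second shrinks like $2\delta/|\theta|$, recovering the diminishing return.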

so, what does it say about the observation by Henighan et al. that there is a scaling law w.r.t. the model size? i suspect that their observation is telling us that the deep nets we use are far from optimal in the sense of compressing data. it could be due to the choice of architectures, due to our choice of learning algorithms or even due to the regularization techniques we use. it’ll be interesting to pinpoint what’s behind this sub-optimality.

as i was writing the last paragraph, i was reminded of an earlier paper by Yann Dauphin & Yoshua Bengio from the workshop track of ICLR’13, titled “Big Neural Networks Waste Capacity.” in this work, they observed the “rapidly decreasing return on investment for capacity in big networks” and conjectured that this is due to the “failure of first order gradient descent.” perhaps Yann was onto something, although i don’t think he’s followed up on this.

in the case of the compute budget, i have absolutely no idea, but i wonder if a similar argument as the model size could be made. the number of SGD steps largely dictates the maximum magnitude of the weights in a neural net. the resolution (?) of the computed probability is largely determined by the maximum magnitude of (or the variance of individual weights in) the final weight matrix (that feeds into the final softmax). perhaps we can connect these two to show that more SGD updates allow our neural net to more precisely identify the target probability. of course, this suggests that different optimization strategies may result in radically different scaling laws.

assuming what i wrote above makes even the slightest bit of sense, this raises two interesting questions, in my opinion. first, is counting examples all a sophisticated neural net does? the strict answer is no, because it both counts and compresses. it however looks as if it’s compression without any interesting emergent property (such as systematic generalization). second, how does this property change when we move away from the maximum entropy formulation and log loss? i’ve pointed out two directions that look promising in an earlier tweet: the margin ranking loss by Collobert & Weston and the entmax series by Martins and co. if this property does change, will it change in a desirable direction?

let me wrap up by thanking Henighan et al. and Kaplan & McCandlish et al. for the thought-provoking pieces that have made me think about the models and problems i’ve been working with all along from a very different angle.

(*) of course, the other (more positive) way to look at it is that there’s always more to be learned if we are ready to invest as much as we have invested already.

**Detour**: Before I continue to talk about this award, let me briefly share with you my experience of having lived abroad, over the past ten years or so, in three different places (Helsinki, Montreal and NYC) that speak three different languages (Finnish, French and English), as an expat and in particular as a student expat. In short, it’s not easy. It’s not easy in many ways, but the one I felt was most challenging was this feeling I had whenever I moved to a new place that I had to stay alert, watch my account balance and prepare for the worst until I fully settled down and got used to the new city and country. Even then, there’s a nagging feeling that I am only a temporary resident here and that I must be prepared to leave immediately, without any hesitation, if I’m forced to or decide to.

You can literally see this stress in newly arriving students, or more broadly expats, who are not financially well off. They have a difficult time appreciating the beauty and joy of a new place, not to mention enjoying them. Even if the new town is filled with awesome restaurants, they wouldn’t fancy the idea of dining at those restaurants. Even if the city is surrounded by amazing tourist destinations, they wouldn’t spare the time to visit them unless their parents come to visit. Their places are often light on furniture, and even the furniture they get is on the cheapest end of the spectrum: in fact, a lot of them don’t even buy a full bed but just a cheap mattress placed on the floor.

Even in my case, where I have been relatively well off financially for a newly arriving student/postdoc, i’ve never bought a couch since i left my parents’ place (don’t worry, i’m planning to do so shortly,) and i bought a bed with a box spring for the first time only when I moved to NYC as a new faculty member. It took my parents’ visit after my second year in Finland for me to travel to Rovaniemi and other touristic destinations in Finland and neighbouring countries (and let me tell you: there aren’t so many.) It took a workshop at NRC Canada for me to visit Ottawa when I was in Montreal, and an invitation by Hugo Larochelle to visit U. Sherbrooke for me to visit Quebec City (I know.. it’s not on the way to Sherbrooke, but I took a detour.) Even when I could afford it, it took several walk-bys before I could mentally prepare myself to decide to dine in at a reasonably fancy (but not that much…) place, and it still does.

That’s the weirdest thing: most of these I could afford back then and can certainly afford now. However, even if I could afford it, even if I knew it would improve how I live, and even if I knew that would make my days more comfortable, a lot of things felt much less accessible and looked overly and unnecessarily luxurious. I’ve experienced this stress, although I’ve thoroughly enjoyed and never regretted moving to and living in these places, been financially stable for most of my expat years and haven’t had any dependent to support. One begins to wonder how challenging it must be for others (and you!) who may be in worse situations.

**Back to the award**: this award comes with a generous $30,000 USD monetary prize^{1} (!) And, no, it’s not paid to the university for me to use to support my research; it is the prize paid directly to me. In other words, I’m free to do whatever i want with this $30,000 that sprang out of nowhere. should i finally buy a couch? well, i could, but i can buy it without this prize money. should i buy a car? well, i live in manhattan. should i go on a luxury vacation? well, pandemic…

After a brief period of pondering, i’ve decided to donate the prize money^{2} to Mila, where I was a postdoc for 1.5y + a visiting student for 0.5y. More specifically, i’ve decided to donate the prize money to Mila on the condition that it is used to provide a *one-time cash supplement* of up to $1,500 CAD to each incoming *female* student/postdoc arriving from *Latin America*, *Africa*, *South Asia*, *South East Asia* or *Korea*, until the donation runs out. I hope this supplement gives students, who have just arrived in Montreal to start a new chapter of their lives, a bit of breathing room. Perhaps they can use it to enjoy a dinner at a nice restaurant in Montreal. Perhaps they can go out with their new friends and family for beer. Perhaps they can buy not just a mattress but a proper bed. it’s not for me to determine what lets them relax a bit in the midst of settling down in a new environment, and I just hope this will be helpful in whatever way suits them best.

I thoroughly enjoyed my time at Mila (which was, to be precise, called Lisa back then,) and have greatly benefited from spending my time there as a postdoc. i cannot imagine where i would be had i not been a postdoc at Mila. And, I hope this small gesture of mine helps a diverse group of incoming students/postdocs from all corners of the world have a more enjoyable time at Mila and benefit from their time there as much as, if not more than, i have.

**Why female students from these regions (Latin America, Africa, South Asia, South East Asia and Korea)?** our field has an issue of representation in many aspects. we have an issue of gender representation. we have an issue of geographical representation. we have an issue of educational background/discipline representation. we have many more issues of representation in other aspects. All these issues of representation are equally important and critical, and I know that these are not just pipeline issues, based on my experiences of meeting amazing talents while teaching at Deep Learning Indaba 2018, Khipu.AI 2019, SEAML 2019, Deep Learning UB 2019 and the African Master’s Programme in Machine Intelligence (AMMI). these issues are often ones of opportunities and support. I believe we need to take even a little action at a time rather than waiting to address all of them simultaneously. in this particular case, I decided to take a minuscule shot at addressing a couple of these issues: the lack of female representation and the limited representation of researchers and students from Latin America, Africa, South Asia and South East Asia (I added Korea because the prize came from a Korean company :))

Also, perhaps a bit selfishly, i want to make sure there’ll be a role model my niece can look up to in the field of AI when she’s older.

(1) they also sent me this awesome plaque, but i don’t think Mila would appreciate it as a donation.

(2) i’ve decided to donate $35,000 CAD after setting aside a bit for tax. after all, i’ve been paying more federal tax than the president for quite some time already and am expecting to pay some more this coming tax season.

**Background:** Right before COVID-19 struck NY heavily this past Spring, K-12 teachers from Busan, Korea stopped by NYC on their trip to the US to study various AI education strategies, and asked me for a short meeting. Frankly, i was quite skeptical about this meeting and assumed it was their vacation in disguise. This skepticism of mine completely melted away when I met them in their hotel’s meeting room and began to hear what they’ve done and are doing at their schools, covering primary (1-6y), middle (7-9y) and high schools (10-12y), to teach their students what AI is, what these students can already do with it, and what they will be able to do with it in the future. it was eye-opening and has since made me realize how outdated my view of K-12 education (be it in Korea or elsewhere) is and how much K-12 education can be updated to keep up with the latest developments in society when teachers are enthusiastic and given opportunities.

This trip was part of their effort to create teaching material for AI education aimed at K-12 teachers. I heard back from them a few months later that this material was ready to be published as a series of four books, and i was asked to write an opening remark. I was of course more than glad to write one for them. Because I’m not too comfortable writing about AI in Korean (i mean.. when have i ever written anything about AI in Korean?) i went ahead in English, and one of the participating teachers translated it into Korean.

Today (Nov 21 2020), i received the pdf copies of these four books and was able to take a more careful look at the content. it’s filled with fun activities teachers can help students go through to learn about AI by experiencing a diverse set of sub-disciplines, including robotics, computer vision, natural language processing, machine learning, data science, etc. i’m so envious of these kids who will get to experience and have fun with all these activities and projects and ultimately become AI-native, unlike any of us.

And, without further ado, here it is.

**Foreword:** Intelligence is one of the last remaining mysteries of this universe and of ourselves that has evaded our collective attempt at uncovering its underlying mechanisms. We think every day, every hour, every minute, if not every second, effortlessly, without realizing that there are 86 billion neurons interacting with each other in a manner both highly coordinated and highly chaotic behind this process of thinking. We perceive the surrounding world, which consists of our family, our friends and everything we can imagine and interact with each day, effortlessly, even though the surrounding world never stays idle but dynamically changes its appearance non-stop. Based on our perception and pondering, we act in the surrounding environment effortlessly, although there are infinitely many possible ways in which our action could go wrong. Intelligence is behind these seemingly facile activities, driving each and every one of us from one moment to another, but intelligence has largely evaded our interrogation and investigation even until now.

Despite the “artificial” in artificial intelligence, artificial intelligence (AI) is a scientific discipline in which intelligence in general, not necessarily an artificial one, is studied. As the first step in this direction, AI scientists ask what intelligence is. To answer this question, some are inspired by biological intelligence. To answer this question, some look into psychology. To answer this question, some look into philosophy. To answer this question, some look into mathematics. To answer this question, some, like myself, look into computer science, which has a good track record of rigorously defining and understanding traditionally elusive concepts, such as information and computation, thanks to Claude Shannon, Alan Turing, who originally “propose[d] to consider the question, ‘Can machines think?’” in 1950, and the like.

In this scientific pursuit of (artificial) intelligence, “learning” has been found to be a central concept to intelligence. Intelligence is not merely a bag of algorithms and knowledge for solving a fixed set of problems, but it is rather the process of learning to solve a new problem by creating a new algorithm. Every time a new problem or a variant of a known problem is given, a machine, either biological or not, must “learn” to solve it and acquire a set of sophisticated skills in this process. The question of “what is intelligence?” has suddenly morphed itself into the question of whether we can build a machine that can learn to solve any problem. If we could build one, that machine would be intelligent, and this machine itself would be our answer to the ultimate question of “what is intelligence”. Machine learning is a sub-discipline in computer science that has pursued this direction of building a learning machine to figure out what intelligence is.

Machine learning has made rapid progress in recent years, thanks to theoretical and empirical advances in learning algorithms, increased availability of data, wide adoption of open-source software and incredible advances in computing systems. A few years ago, a deep neural network learned to listen to speech in a quiet room and transcribe it almost as well as an average person could. This was quickly followed by a deep convolutional network which could detect an incredible number of different objects in a picture, rivaling humans in object recognition. A couple of years later, a deep recurrent neural network was trained to translate news articles between English and Chinese and ended up translating almost as well as average bilingual speakers could. All these results were openly shared in forms of open-access publications and open-source software packages, which led to an unprecedented level of adoption of these new technologies. Industry has rapidly implemented and deployed these AI systems in various products, including voice assistants, real-time machine translators, automatic image tagging, content recommendation, driving assistance and even automated tutoring. These AI technologies are being deployed in increasingly more challenging domains, such as healthcare, medicine and automation.

Unfortunately positive is not the only way to describe this rapid advance and wide adoption of machine learning and thereby artificial intelligence in recent years. These AI systems have been silently tested and deployed in the society, touching many, if not most, of us often without our realization. These silent, and often premature, tests have sadly revealed negative sides of AI.

Billions of people use social media regularly, and social media companies extensively use AI technology to personalize individual users’ experience, effectively censoring the flow of information. Billions of people use video streaming services and news aggregation services every day, and the providers of these services use AI to decide not only what to but also what not to recommend and display to individual users, effectively shaping the users’ opinions without their own realization. This mass adoption of AI-based content filtering has unintentionally but unmistakably resulted in deepening polarization in many societies all over the world, sometimes resulting in fatal incidents and destabilization of otherwise stable, democratic societies.

Hastily developed and prematurely deployed AI systems, such as face recognition, automated exam proctoring and automated interviewer assessment, have been found to amplify undesirable societal biases and inequalities, such as racial bias, gender bias, income inequality and geographical inequality. For instance, an incorrect identification by a face recognition system, of a kind repeatedly found to disproportionately flag black people and people of colour as threatening, recently led police in the US to wrongfully arrest an innocent black man. The world’s largest e-commerce company recently had to drop an AI-based recruiting system, because it was giving female candidates unjustifiable disadvantages for software engineering roles. A recent study has uncovered that commercial object recognition systems’ accuracies significantly drop when presented with pictures taken in poorer countries.

For AI to truly benefit us and the society, these shortcomings must be addressed and addressed fully. Technical advances alone, often made by a small group of elite scientists, will not be enough to make AI safe, fair and beneficial for all. Safe, fair and beneficial AI will only be possible when the whole society, consisting of both AI scientists and others, is aware of AIโs capability, adoption and deployment. The society must continue to carefully watch and monitor AIโs impact on the society, and be ready to rise and intervene against unsafe, unfair and unjust use of AI. This awareness of capability, limitations and underlying technology of AI is necessary for the society to benefit from AI.

Such awareness in the society of a new technology, in particular when it is an enabling technology, does not happen overnight. It must happen carefully and patiently over many years, if not decades, to ensure the whole society possesses a rational and coherent view of AI technology and its use. For this to happen, we must go beyond the status quo in which discourse on AI happens within and across universities and industry. We must start discourse and education on AI already with K-12 students who will be the first generation in the history of humanity to grow to live in a society where AI is not a novelty but an everyday reality. As the first step toward this goal, we must educate teachers of all levels to be familiar with and comfortable with the technologies and implications of AI, and must immediately start preparing educational materials and systems for teaching AI.

I thus applaud this effort by the Busan Metropolitan City Office of Education in preparing a new curriculum and accompanying educational materials on AI for both students and teachers. In doing so, the team from the Office of Education has struck a perfect balance between theory and application, between history and modern practice, and between technology and ethics. I am envious of students in Busan who will learn to be native in AI according to this curriculum, and am now hopeful rather than worried about the future of AI and its impact on society.

Many aspects of OpenAI’s GPT-3 have fascinated and continue to fascinate people, including myself. these aspects include the sheer scale, in terms of the number of parameters, the amount of compute and the size of data; the amazing infrastructure technology that has enabled training this massive model; etc. of course, among all these fascinating aspects, meta-learning, or few-shot learning, seems to be the one that fascinates people most.

the idea behind this observation of GPT-3 as a meta-learner is relatively straightforward. GPT-3 in its essence computes the conditional distribution over all possible next tokens (from a predefined vocabulary) given a prefix: $p(x’ | x_1, \ldots, x_t)$. this conditional distribution can be chained to form a conditional distribution over sequences given the prefix: $p(x’_1, \ldots, x’_{t’} | x_1, \ldots, x_t) = \prod_{t”=1}^{t’} p(x’_{t”} | x’_{<t”}, x_1, \ldots, x_t)$. this makes GPT-3 subsume a so-called sequence-to-sequence or encoder-decoder model, allowing one to use GPT-3 to find an answer $(x’_1, \ldots, x’_{t’})$ given a question (often referred to as a “prompt”, which comes together with a couple of known examples) $(x_1, \ldots, x_t)$ by solving

\[

\arg\max_{x’_1, \ldots, x’_{t’}} \log p(x’_1, \ldots, x’_{t’} | x_1, \ldots, x_t).

\]

This problem turned out to be intractable, and people have been using an approximate search algorithm, such as greedy search or top-$k$ sampling, to find an answer given a prompt. In the GPT-3 paper, the authors present an impressive set of experimental results highlighting this meta-learning aspect of GPT-3.
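for concreteness, here is a minimal sketch of top-$k$ sampling over a toy next-token distribution (the vocabulary and probabilities are made up; a real decoder would get these log-probabilities from the model’s softmax at each step):

```python
import math
import random

def top_k_sample(logprobs, k, rng):
    # keep only the k most probable tokens, renormalize, and sample one
    top = sorted(logprobs, key=logprobs.get, reverse=True)[:k]
    weights = [math.exp(logprobs[t]) for t in top]
    return rng.choices(top, weights=weights, k=1)[0]

rng = random.Random(0)
# toy stand-in for the model's next-token distribution p(x' | prefix)
logprobs = {"the": math.log(0.5), "a": math.log(0.3),
            "cat": math.log(0.15), "<eos>": math.log(0.05)}
samples = [top_k_sample(logprobs, k=2, rng=rng) for _ in range(1000)]
print(samples.count("the") / 1000)  # roughly 0.5 / (0.5 + 0.3)
```

greedy search is the $k=1$ special case, which always returns the single most probable token.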

But, then, you start to wonder: in particular for me, i began to wonder about this just today over our research group’s weekly meeting, when Elman Mansimov presented a few recent papers that have followed up on this meta-learning aspect of a language model, of which GPT-3 greatly increased the awareness. What do i wonder? I wonder if it’s meta-learning, as we conceptually think of meta-learning, that drives this phenomenon, or if there is actually a simpler mechanism behind this observation.

let’s imagine a wonderful hypothetical world in which I can train another GPT-3 on the same data myself at NYU, but this time i will make one slight tweak. that is, i will train this new GPT-3, to which i refer as GPT-E, after reversing the order of all documents in the original dataset. that is, GPT-E computes the conditional distribution over all possible previous tokens given a suffix: $p(x | x’_t, x’_{t-1}, \ldots)$. since OpenAI has successfully trained GPT-3, you’d trust that i would be able to train this model in this hypothetical, but happy, world. I will also assume that in this happy parallel universe, i can hire all the amazing talents who worked on GPT-3 at NYU, perhaps as postdocs or even as PhD students, so that the quality of GPT-E rivals that of GPT-3.

but, then, something weird happens. if we believe in GPT-3’s meta-learning capability, GPT-E can do something as amazing as (if not more amazing than) what GPT-3 can do. It takes as input a test question-answer pair and can output the prompt, which contains both a few training examples and a test question (!) of course, assuming the amounts of information on both sides are comparable (which should be the case for zero-shot or few-shot learning.)

Do you see what I am getting at? yes, we can now alternate between GPT-3 and GPT-E to sequentially create an encyclopedia of all the knowledge in the world (well, at least the knowledge that was represented in the training set.) We start from a random factoid and call it $(Q_0,A_0)$. We can find a reasonable “prompt” by feeding GPT-E with $(r(A_0), r(Q_0))$, where $r$ reverses a string, and sampling $P_0 \sim p(x_1, \ldots, x_t | A_0, Q_0)$, preferably using top-$k$ sampling to reduce noise while maintaining some stochasticity. this prompt $P_0$ would consist of a (noisy) description of the task that corresponds to this factoid and a few noisy examples that are not exactly $(Q_0,A_0)$, in addition to the next question $Q_1$. We switch to GPT-3 and now sample another factoid $(Q_1, A_1)$ based on $P_0$. We alternate between these two steps, or more like between GPT-3 (real) and GPT-E (hypothetical), for as long as we want and accumulate $(Q_n, A_n)$ to create the encyclopedia of world knowledge. Beautiful, isn’t it?
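the alternation can be sketched as a loop. everything below is hypothetical: both models are replaced by trivial stand-in functions over a three-entry toy "world knowledge" (since GPT-E doesn’t exist), and the sketch is only meant to show the Gibbs-like structure of the procedure:

```python
import random

# toy "world knowledge"; in the thought experiment the two samplers below
# would be GPT-3 (forward model) and the hypothetical GPT-E (reversed model)
FACTS = {"Q0": "A0", "Q1": "A1", "Q2": "A2"}

def gpt3_sample(prompt, rng):
    # GPT-3 stand-in: answer the question at the end of the prompt
    q = prompt.split()[-1]
    return q, FACTS[q]

def gpte_sample(answer, question, rng):
    # GPT-E stand-in: given a (reversed) QA pair, propose a noisy prompt
    # whose last token is the next question; the rng is the injected noise
    next_q = rng.choice(list(FACTS))
    return f"task description and noisy examples ... {next_q}"

rng = random.Random(0)
q, a = "Q0", FACTS["Q0"]
encyclopedia = [(q, a)]
for _ in range(5):
    prompt = gpte_sample(a, q, rng)  # GPT-E: (A_n, Q_n) -> P_n
    q, a = gpt3_sample(prompt, rng)  # GPT-3: P_n -> (Q_{n+1}, A_{n+1})
    encyclopedia.append((q, a))
print(encyclopedia)
```

the noise injected by GPT-E’s sampler is what lets the loop wander away from the starting factoid instead of repeating it forever.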

But, hold on. Where did meta-learning go? where is meta-learning in this Gibbs-like sampling procedure? is meta-learning just “noise” injected in each round of alternation between GPT-3 and GPT-E, allowing this Gibbs-like procedure to explore the space of knowledge effectively? if i wanted to put a positive, promising spin on it: is meta-learning how such noise is shaped by a large neural net so that it only spans relevant directions in the high-dimensional space corresponding to the knowledge manifold?

as I warned you at the beginning, there’s neither a “wow” moment nor a “wow” conclusion in this post. this is just one thought i had about GPT-3, one that got me even more confused about all things machine learning (meta-learning, generative modeling, denoising, Gibbs sampling, etc.)

P.S. i’m waiting for big tech firms with deep pockets (Amazon, Google, FB, etc. i’m looking at you) to train GPT-E for me so that i can test this idea.

P.P.S. you see why it was called GPT-E?

There have been a series of news articles in Korea about AI and its applications that have been worrying me for some time. I’ve often ranted about them on social media, but I was told that my rant alone is not enough, because it does not tell others why I ranted about those news articles. Indeed, that is true. Why would anyone trust my judgement when it is delivered without even a grain of supporting evidence? So, I decided to write a short post on Facebook (shared on Twitter) and, perhaps surprisingly, in Korean (!) This may have been the first AI/ML-related (though very casual) post I’ve ever written in Korean, and it is definitely not the best written piece from me, although I hope it clarifies why I’ve been fuming about those news articles.

This post is quite casual and not academic. If I’m missing any important references for the general public that you want me to include here, please drop me a line. As I’m not in any way an expert on this topic, I’m sure I’ve missed many important references, discussions and points.

That said, I realized that it’s not only Korean speakers who engage with this post (via Google Translate, etc.) and that the automatic translation of this post into English is awful (thanks to a hat tip from my colleague Ernest Davis at NYU.) Since it’s a pretty short post, I’ve decided to put an English version here in my blog; it follows immediately.

Although it’s a topic that’s actively discussed both in academic settings and on social media, such as Twitter and FB, I haven’t seen much discussion on the social impact & bias of AI in Korean. To contribute even minimally to addressing this lack of discussion, here is a list of a few points that are relevant to this topic. It’s possible that I have simply failed to find discussions surrounding this topic in Korean; if there are any, please kindly point me to them.

[My apologies for the unprofessional writing. It’s not every day that I write anything in Korean.]

*Amplification*

It is true that technology reflects the society. It is however also true that such technology is used within the society, and that it inevitably amplifies what has been reflected onto it. It’s illuminating to read <Automating Inequality> by Virginia Eubanks and <Race after Technology> by Ruha Benjamin to see how such amplification harms people (https://www.nytimes.com/2018/05/04/books/review/automating-inequality-virginia-eubanks.html, https://us.macmillan.com/books/9781250074317, https://www.ruhabenjamin.com/race-after-technology). This amplification of negative aspects of the society is precisely why I fumed over the recent news articles on the wide adoption of AI interviews in Korea. You may think you’re not the one who’ll suffer from such amplification, but it eventually gets to everyone without any intervention. Have you ever considered the possibility that your kid may not have received the job offer because they didn’t attend a primary school in Gangnam when they were small?

Even if one imagines a perfect AI system, the issue of amplification still exists. Consider a hypothetically perfect AI system that has determined a candidate to be a 60% fit for the company, where this 60% is perfectly calibrated. As soon as a user of this system simply thresholds at 50% to make a hiring decision, we end up with the same issue of amplification, because in practice users of such an AI system inevitably overrule the supposedly perfect uncertainty estimated by the system.
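A tiny simulation makes the point concrete. The numbers below are mine, not from any real system: every candidate receives a perfectly calibrated score of 0.6, and the user thresholds it at 0.5.

```python
import random

random.seed(0)

n = 10_000
scores = [0.6] * n                                    # perfectly calibrated score: each candidate
succeeds = [random.random() < 0.6 for _ in range(n)]  # truly succeeds 60% of the time

hired = [s > 0.5 for s in scores]  # the user's deterministic 50% threshold

hire_rate = sum(hired) / n       # 1.0: the reported 40% risk has been discarded
success_rate = sum(succeeds) / n # roughly 0.6: reality keeps the risk
```

Thresholding turns a calibrated probability into a certainty, which is exactly the over-ruling of uncertainty described above.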

*Opaqueness* of a model

Although it has been quite some time since so-called AI/ML systems were first put into practice, it is relatively recent that their complexity has greatly increased. When a system in practice exhibits such a high level of complexity, it is important for its provider, its users and those who are influenced by it to be aware of the principles behind it. Unfortunately, this need for awareness is often dismissed with a variety of excuses: it is difficult to know the full details of the working principles, the working principles are under active research, or they are a corporate secret. Of course it is a difficult scientific issue on its own, but what is needed in terms of transparency is not every single scientific and engineering detail, but a high-level description of the working principles behind such systems and an understanding of their impact on the society (think of how ridiculous it would be if a car manufacturer refused to tell you the horsepower of a car you are considering because there’s no way you could understand all the details of the car, such as the minute details of its internal combustion engine.) Unless these (even high-level) details are provided together with these AI systems, the negative impact of such systems on the society will only be discovered once (potentially irreversible) damage has been done.

One promising direction I have observed in recent years is the proposal of model cards and datasheets for datasets: https://dl.acm.org/doi/abs/10.1145/3287560.3287596 and https://arxiv.org/abs/1803.09010. I wonder how many CEOs/CTOs and developers can answer the questions suggested for the model cards and datasheets about the AI systems they tout, as well as about the data used to build those systems. I’m not particularly a good example myself, but I believe the bar should be even higher for those who tout and deploy AI systems in the society.

*Selection bias* of data

It’s quite related to the previous point: it is important to think of how the data used for building an AI system was collected and created. Unfortunately, and perhaps surprisingly, this aspect of data has received relatively little attention compared to other adjacent areas, but the research community has begun to pay more attention to data itself and to notice various issues behind widely used datasets. For instance, Prabhu & Birhane (https://arxiv.org/abs/2006.16923) identified serious flaws behind one of the most widely used image datasets, called TinyImages, from which the widely used CIFAR-10 was created. This has led to the removal of the TinyImages dataset roughly a decade after it was created and released. Although it’s now removed, you must wonder how many AI systems had been built using this data, without any thought given to its flaws, and deployed in practice by then. Gururangan et al. (https://arxiv.org/abs/1803.02324) found various issues (or artifacts, as they called them) in the Stanford natural language inference (SNLI) data, stemming from the process of data collection. These findings are the result of combining state-of-the-art AI/ML techniques with individual researchers’ manual efforts.

It’s not difficult to find news articles and academic papers bragging about the awesomeness of their AI systems. It is however more important for users and people who are being (either intentionally or unintentionally) judged by such systems to know the properties and characteristics of these systems and to be able to trust the quality of the data and its collection process. It is thus imperative to invest more in this aspect of quality assurance than in the actual development of AI systems, in addition to continued research.

A recent work from FB demonstrates well the impact and importance of data and its collection: https://openaccess.thecvf.com/content_CVPRW_2019/html/cv4gc/de_Vries_Does_Object_Recognition_Work_for_Everyone_CVPRW_2019_paper.html. In this paper, the authors demonstrated that the accuracies of commercial object recognition systems correlate with the income levels of the regions in which pictures were taken. Hopefully, it doesn’t mean that the OCR service from Naver is less accurate for those who live in Jeollanam-do (which has the lowest per-capita GDP in Korea according to http://kosis.kr/statHtml/statHtml.do?orgId=101&tblId=DT_1C65) because the OCR system was trained mainly using data from Seoul and its metropolitan area (to be honest, I have no idea how Naver OCR is implemented, but I’m quite sure the majority of data used for building the system were collected from Seoul and its surrounding regions.)

To me, the human-and-machine-in-the-loop paradigm looks quite promising: https://arxiv.org/abs/1909.12434, https://arxiv.org/abs/1910.14599 and https://openreview.net/forum?id=H1g8p1BYvS. Although promising, it’s important to keep in mind that the outcome of such a paradigm depends heavily on how it is implemented, not to mention that some may suffer from its implementation. See for instance https://www.theverge.com/2019/2/25/18229714/cognizant-facebook-content-moderator-interviews-trauma-working-conditions-arizona.

*Correlation vs. Causation* & *systematic generalization*

Often we see people who claim this is *not* a problem of technology. Such a claim often arises from a lack of understanding of the fundamental goal of AI/ML. In particular, some equate the goal of AI/ML with estimating sufficient statistics from given data, which is simply not true.

In general, the goal of AI/ML is inductive inference, and according to Vapnik (https://www.wiley.com/en-us/Statistical+Learning+Theory-p-9780471030034), it is “an informal act [with] technical assistance from statisticians” (paraphrased). More recently, Arjovsky et al. (https://arxiv.org/abs/1907.02893) explicitly stated that “minimizing training error leads machines into recklessly absorbing all the correlations found in training data” and that this makes “machine learning [fail] to fulfill the promises of artificial intelligence.” In short, the goal of AI is to identify an underlying mechanism that is invariant to changing environments (often, but not always, a causal one) and to successfully generalize to a new environment, which is often referred to as out-of-domain (or systematic) generalization.

Sadly, most of the existing (widely used) ML algorithms fall short in this aspect. See the first part of my recent talk for an example: https://drive.google.com/file/d/1CrkxcaQs5sD8K2HL2AWCMnrMRpFoquij/view. In order to overcome this inability, new paradigms have been proposed, such as meta-learning and invariant risk minimization, and there is an on-going effort in marrying causal inference from observational data with machine learning. See e.g. https://arxiv.org/abs/1911.10500, https://arxiv.org/abs/1901.10912 and https://arxiv.org/abs/1805.06826.

If you still insist that it is not an issue of the algorithm, which has merely faithfully captured correlations that exist in the data, I suggest you think once more about what AI/ML is and what its goal is.

TL;DR: after all, isn’t $k$-NN all we do?

in my course, i use $k$-NN as a bridge between a linear softmax classifier and a deep neural net, via an adaptive radial basis function network. until this year, i had considered only the special case of $k=1$, i.e., 1-NN, and moved from there to the adaptive radial basis function network. this year, however, i decided to show them how $k$-NN with $k > 1$ could be implemented as a sequence of computational layers, hoping that this would help students understand the spectrum spanning between linear softmax classification and deep learning.

we are given $D=\left\{ (x_1, y_1), \ldots, (x_N, y_N) \right\}$, where $x_n \in \mathbb{R}^d$ and $y_n$ is an associated label represented as a one-hot vector. let us construct a layer that computes the nearest neighbour of a new input $x$. this can be implemented by first computing the activation of each training instance:

\begin{align*}

h^1_n =

\frac{\exp(-\beta | x_n - x |^2)}

{\sum_{n'=1}^N \exp(-\beta | x_{n'} - x |^2)}.

\end{align*}

in the limit of $\beta \to \infty$, we notice that this activation saturates to either $0$ or $1$:

\begin{align*}

h^1_n \to_{\beta \to \infty}

\begin{cases}

1, &\text{if $x_n$ is the nearest neighbour of $x$} \\

0, &\text{otherwise}

\end{cases}

\end{align*}

the output from this 1-NN is then computed as

\begin{align*}

\hat{y}^1 = \sum_{n=1}^N h^1_n y_n = Y^\top h^1,

\end{align*}

where $h^1$ is a vector stacking $h^1_n$’s and

\begin{align*}

Y=\left[

\begin{array}{c}

y_1 \\

\vdots \\

y_N

\end{array}

\right].

\end{align*}
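as a sanity check, this 1-NN layer is a few lines of NumPy. the toy data below is made up purely for illustration:

```python
import numpy as np

def soft_1nn(X, Y, x, beta):
    """one soft 1-NN layer: a softmax over negative squared distances
    to the training inputs, followed by a weighted sum of the labels."""
    d2 = ((X - x) ** 2).sum(axis=1)   # |x_n - x|^2 for every n
    a = -beta * d2
    a -= a.max()                      # numerical stabilisation
    h1 = np.exp(a) / np.exp(a).sum()  # h^1_n
    return h1, Y.T @ h1               # (activations, y-hat^1)

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])   # training inputs
Y = np.array([[1, 0], [0, 1], [0, 1]], dtype=float)  # one-hot labels
h1, yhat = soft_1nn(X, Y, np.array([0.1, 0.1]), beta=50.0)
# as beta grows, h1 saturates to the one-hot indicator of the nearest neighbour
```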

this was relatively straightforward with 1-NN. how do we extend it to 2-NN? to do so, we define a new computational layer that computes the following activation for each training instance:

\begin{align*}

h^2_n =

\frac{\exp(-\beta (| x_n - x |^2 + \gamma h^1_n))}

{\sum_{n'=1}^N \exp(-\beta (| x_{n'} - x |^2 + \gamma h^1_{n'}))}.

\end{align*}

now we consider the limit of both $\beta\to \infty$ and $\gamma \to \infty$, at which this new activation also saturates to either 0 or 1:

\begin{align*}

h^2_n \to_{\beta, \gamma \to \infty}

\begin{cases}

1, &\text{if $x_n$ is the second nearest neighbour of $x$} \\

0, &\text{otherwise}

\end{cases}

\end{align*}

this magical property comes from the fact that $\gamma h_n^1$ effectively kills the *first* nearest neighbour’s activation when $\gamma \to \infty$. this term does not affect any non-nearest neighbour instances, because $h_n^1=0$ for those instances.

the output from this 2-NN is then

\begin{align*}

\hat{y}^2 = \frac{1}{2} \sum_{k=1}^2 \sum_{n=1}^N h^k_n y_n.

\end{align*}

now you see where i’m going with this, right? let me generalize this to the $k$-th nearest neighbour:

\begin{align*}

h^k_n = \frac{

\exp(-\beta (| x_n - x |^2 + \gamma \sum_{k'=1}^{k-1} h^{k'}_n))

} {

\sum_{n'=1}^N \exp(-\beta (| x_{n'} - x |^2 + \gamma \sum_{k'=1}^{k-1} h^{k'}_{n'}))

},

\end{align*}

where we see some resemblance to residual connections (the previous layers’ activations are added directly.)

In the limit of $\beta\to\infty$ and $\gamma \to \infty$,

\begin{align*}

h^k_n \to_{\beta, \gamma \to \infty}

\begin{cases}

1, &\text{if $x_n$ is the $k$-th nearest neighbour of $x$} \\

0, &\text{otherwise}

\end{cases}

\end{align*}

the output from this $K$-NN is then

\begin{align*}

\hat{y}^K = \frac{1}{K} \sum_{k=1}^K \sum_{n=1}^N h_n^k y_n,

\end{align*}

which is reminiscent of so-called deeply supervised nets from a few years back.

it is not difficult to imagine not taking the infinite limits of $\beta$ and $\gamma$, which leads to a soft $k$-NN.

In summary, soft $k$-NN consists of $k$ nonlinear layers. each layer consists of radial basis functions with the training instances as bases (nonlinear activation), and further takes as input the sum of the previous layers’ activations (residual connection.) each layer’s activation is used to compute a softmax output (self-normalized) using the one-hot label vectors associated with the training instances, and we average the predictions from all the layers (deeply supervised).
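putting the pieces together, the whole construction fits in a short NumPy function. this is only a sketch with made-up toy data; with large $\beta$ and $\gamma$ it recovers exact $k$-NN voting.

```python
import numpy as np

def soft_knn(X, Y, x, K, beta, gamma):
    """K stacked soft layers: layer k penalises the instances already
    selected by layers 1..k-1 (the residual-style sum of activations),
    so in the beta, gamma -> infinity limit h^k picks out the k-th
    nearest neighbour; per-layer predictions are averaged (deeply
    supervised)."""
    d2 = ((X - x) ** 2).sum(axis=1)  # |x_n - x|^2
    prev = np.zeros(len(X))          # sum_{k' < k} h^{k'}
    yhat = np.zeros(Y.shape[1])
    for _ in range(K):
        a = -beta * (d2 + gamma * prev)
        a -= a.max()                 # numerical stabilisation
        h = np.exp(a) / np.exp(a).sum()
        yhat += Y.T @ h
        prev += h                    # "residual" input to the next layer
    return yhat / K

X = np.array([[0.0], [1.0], [2.0], [10.0]])
Y = np.array([[1, 0], [0, 1], [1, 0], [0, 1]], dtype=float)
yhat = soft_knn(X, Y, np.array([0.2]), K=3, beta=100.0, gamma=100.0)
# the 3 nearest neighbours of 0.2 vote [1,0], [0,1], [1,0] -> about [2/3, 1/3]
```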

of course, this perspective naturally leads us to think of generalization in which we replace training instances with learnable bases across all $k$ layers and learn them using backpropagation. this is what we call *deep learning*.

[NOTE: I became aware that an extremely similar idea (though with some differences in how 1-NN is generalized to $k$-NN) was proposed in 2018 by Plötz and Roth at NeurIPS’18: https://papers.nips.cc/paper/7386-neural-nearest-neighbors-networks]


What is intelligence?

It turned out that there’s a great list of speakers scheduled after my quick lightning talk, covering a broad set of topics spanning mathematics, computer science, the natural sciences, healthcare and medicine all the way to law. Each speaker will without a doubt tell us about the latest and greatest research in the direction they pursue and how it’s connected to data science and, perhaps more broadly, artificial intelligence.

As for me, i’m going through a bit of a research identity crisis at the moment, and i thought i would spend a brief moment talking about why i decided to join the NYU Center for Data Science as one of the earliest so-called core faculty members in 2015.

My background is in computer science. I have received all of my degrees in computer science. The reason i decided to pursue computer science was simple: i was fascinated by the idea that we can pose and answer the question “what is computation?” This seemingly straightforward question has a lot of implications. First, it turns an abstract notion of computation into a scientifically well-founded concept that we can characterize and study. Second, this investigation into what computation is has led to practical solutions to many problems that were not even straightforward to define before we had a definition of computation. What started as a formal, mathematical journey into understanding computation had already become a major scientific discipline touching every corner of the society by the time i started my undergraduate years. Look around yourself and think of what you do every day, both personally and professionally. It is pretty much impossible nowadays to find a single activity that does not involve the outcome of computer science, and computer science continues to make progress in answering the question “what is computation?”

Then, what is the next question we should and must ask? In my opinion, it is this: “what is intelligence?” or, perhaps equivalently, “what is knowledge?”

This question asks us what key concepts are needed for defining a sophisticated problem and a solution to such a problem, how these concepts could be scientifically and rigorously defined and characterized, and how they should be combined and searched through for us to automatically find a solution, or algorithm, to solve complicated, real-world problems. In answering this question, two things have emerged as crucial components: learning and data.

“Learning” refers in this context to a process by which we automatically construct an algorithm to solve a problem. In other words, it’s a meta-algorithm that automatically builds a new algorithm. This “learning” process heavily relies on the availability of “data”, be it collected by humans, by other algorithms or by the learning process itself. From data, it identifies underlying rules and regularities that can be exploited to solve a problem efficiently and effectively. This is precisely why we refer to this whole new discipline as data science.

We study the mathematical and computational aspects surrounding this core concept of “data” behind intelligence and knowledge. What is the correct way to characterize data? What is the correct way to automate the collection of data in order to maximize the effectiveness and efficiency of “learning”? What is the correct way for a learning algorithm to maximally extract underlying rules and regularities from this data in order to construct an algorithm that solves a problem? All these questions point to the ultimate question of what intelligence and knowledge are, and, along the way, help solve many real-world problems based on data and learning.

This is the reason why I decided to join the Center for Data Science, in addition to computer science, in 2015, even when the center was in its early years. I haven’t had a single moment of regret since I joined CDS, especially looking at the trajectory we have been taking.

Now let me tell you briefly about my own research in this context. One particular aspect of intelligence that sets us (humans) apart from other seemingly intelligent animals, such as other mammals and insects, is our use of sophisticated language. This use of sophisticated language presents a unique opportunity for us to push the boundaries of our experience. Although none of us in this room (i’m quite certain) has ever been to Antarctica ourselves, we somehow all know that there are penguins in Antarctica. Although none of us in this room (i’m 100% certain) has ever been to ancient Greece in person, we somehow know a lot about ancient Greece, probably more than average ancient Greeks who lived it themselves. Both of these are possible because we use language to share experiences and broaden our boundaries both spatially and temporally, which sets us clearly apart from any other intelligent being on this earth. Together with our unique level of intelligence, this makes me believe we must study language carefully in order to answer the question “what is intelligence?”

There are two parts to studying and designing learning algorithms for natural language. One is to build a learning algorithm that focuses on extracting underlying semantics of language in order to solve problems that require in-depth knowledge expressed in text. This direction is pursued mainly by Sam Bowman and He He at NYU CDS, and I will skip it here myself. The other is to build a learning algorithm that knows how to generate a well-formed text, and this is my main research direction.

The problem of text generation belongs to the wider category of structured output prediction. In structured output prediction, the set of possible outcomes is very large; technically speaking, its size grows exponentially w.r.t. the input size. In other words, it is not possible for our learning algorithm to naively test each and every possible configuration, and the learning algorithm must extract and exploit underlying structures that are often not apparent. Once a good set of regularities has been extracted, learning provides us with an efficient algorithm to rapidly search for a good configuration/sentence in this exponentially large space.

One particular approach I’ve been exploring since 2014 is called neural autoregressive models with attention, which have become the de facto standard, not only in academia but also in industry, for building machine translation, speech recognition and speech synthesis systems. This approach has recently been found, by others as well as by my own group, to be generally applicable to structured object generation, where structured objects refer to generic graphs. One quick example is conditional molecule design. Together with Prof. Kang from SKKU, who was visiting the NYU Center for Data Science on his sabbatical, I was able to demonstrate the effectiveness of recurrent nets with attention mechanisms and latent variables in the controllable generation of molecular hypotheses. This effort, which started in late 2017, has now been expanded to using graph neural networks (about which I believe Joan Bruna will tell you a more exciting story) to better capture the graph-likeness of molecules and proteins.

We are a very long way from answering the questions “what is intelligence” and “what is knowledge” in a rigorous manner. We have barely taken a step toward this goal, and if the history of any scientific discipline is any indication, it will take many correct and incorrect steps over decades, if not centuries, before we can claim to have taken even a peek at the answer.

One thing that is certain, however, is that we have been successfully building an environment here at the Center for Data Science by bringing in and hiring people with the expertise necessary for us to advance toward answering this ultimate question. My research has certainly benefited from having a diverse set of colleagues of world-class caliber. I have designed and proposed a unified framework of online learning algorithms for recurrent networks together with Cristina Savin, which will be a crucial component for building an intelligent agent that lives indefinitely. I have worked with Sam Bowman to better understand and characterize these language understanding neural networks. I have studied the applicability of deep learning to physics and biology by working together with Kyle Cranmer and Rich Bonneau. I have been spending my effort on building a deep learning based diagnostic system for early-stage breast cancer screening together with Krzysztof Geras, who was a postdoc at the NYU Center for Data Science and is now an assistant professor in Radiology. I have even had the pleasure of investigating the impact of uncertainty-aware word embeddings in political science together with Arthur Spirling.

Thanks for listening to me, and I’ll be happy to chat more about any of these topics as well as how my experience with CDS has been so far.

<Rebooting AI> is a well-written piece (somewhat hastily) summarizing the current state of artificial intelligence (or perhaps more like machine learning) in terms of both research and deployment. Those who have not been in the field themselves will appreciate the effort of the authors in gathering various recent (and old) findings that succinctly describe what we could and should expect from the current technology and what we cannot. To me, and perhaps to some of my colleagues in the field of deep learning (and, slightly more broadly, machine learning), which is often the target of skepticism from the authors (to be fair, the authors demonstrate healthy skepticism toward every other existing technology in machine learning and artificial intelligence as well,) the book feels relatively light despite its grand reception expressed by various folks on social media.

Why do I feel this way? Perhaps it's because I could classify the set of failure modes of the current technology, which are presented in this book as surprising findings, into two categories. The first category of these failure cases almost exclusively consists of what has been reported by machine learning researchers. That is, unlike what I felt the book was implying (either implicitly or explicitly), it is machine learning researchers who are at the frontier of discovering, investigating and trying their best to address these weaknesses of the current technology. The second category consists of failures that were found largely by the authors themselves manually playing around with (or more seriously testing) some of the products or demos that boast of having employed the latest technology. Whether this limited interaction (just because everyone has 24 hours a day without exception) is enough depends on what kind of argument these failure cases are used to support, and I see some cases in this book that I find refreshing, as these examples do clearly demonstrate weak aspects of those systems. It is however the empirical side of me that finds it a bit less satisfying to see a scientific argument made based on a few manually selected examples. In summary, unlike the authors' implication, these problems are known and are being actively discovered by AI researchers (in particular ML researchers), and we are actively seeking to tackle them, although it's rare for journalists or pundits to talk about these compared to other fancier news, e.g., silicon valley acquisitions/mergers/funding of supposedly-AI companies.

Yet another reason might be that the book does not really provide a clear, verified (or even verifiable) way to "reboot" AI, or even a way we would think of approaching the problem of AI. In short, there were too many "seems to", "will need to", "should", "is pretty clear", and other uncertain, perhaps risk-avoiding terms in the book whenever the authors tried to argue the importance (or more like necessity) of a certain direction or method they "pretty clearly" believe a general AI system "seems to" require. The empirical side of me struck again and again whenever I ran into these statements; that is, if we can neither prove something somewhat rigorously nor demonstrate it empirically and convincingly, my scientific trust in the argument tends to go down. Especially in the latter case (empirical demonstration), the level at which the demonstration is convincing almost directly correlates with my trust, and sadly I could not find much of that in this book. For instance, I was much more convinced of the importance of common sense, which was emphasized over and over by the authors, by Yejin Choi of UW, who showed me, over beer in Chicago two days ago, her latest work on natural language based learning of common sense, than by the arguments in this book. This is of course not to say that the authors' proposals/arguments are incorrect or entirely unconvincing. It is just that, as I mentioned earlier, they feel lighter than what I would've expected from the title <Rebooting AI> and the weight of the authors, Gary Marcus and Ernest Davis, both of whom I know in person.

This brief note on what I thought of <Rebooting AI> has concentrated mostly on the first part (which arguably takes up most of the book) that is mainly about the technological side of AI. For me, it was more enjoyable to read the second part (or the last part) that discusses the true danger/consequence of AI, as perceived by the authors, beyond the usual straw-man argument about humanity's extinction by super-intelligence. I wonder what researchers in AI safety or the ethical use of AI/ML think of this second part. Would they also find it too light, as I have found the first part of the book, though without sacrificing correctness? If so, that would ironically imply that the authors have done a commendable job of summarizing various latest developments (and non-developments) in AI/ML, while nicely blending in their own views and research, so as to pique the interest of bystanders and educate them to a level at which they are aware of these developments and potential consequences/concerns.

<Rebooting AI> reads a bit too light for my taste, but it's almost certainly due to my own involvement in the field of AI as a researcher and educator. Taking a small step back from my current position, I believe it was necessary and perhaps timely for some book to succinctly summarize both the up- and down-sides of the current state of AI for laypersons (as in anyone who is not necessarily following the non-stop flood of academic papers in the field of AI), and it is not easy to imagine a better person (or a better team of people) than Gary and Ernie.

In short, I would recommend <Rebooting AI> to my parents (when a Korean translation becomes available), although they might feel sad my name wasn't mentioned even once when the improvement in Google Translate was described ;). If your parents are not AI researchers, I'd suggest you recommend it to them as well. I would not however find it necessary for AI researchers themselves to read this book, unless you want a short but interesting discussion on trustworthy AI toward the end of the book. Of course, if you want to have a Twitter or Facebook debate with Gary, I guess it wouldn't hurt giving the book a quick look (although I don't find it too necessary.)

Let’s consider the following meta-optimization objective function:

$$\mathcal{L}'(D’; \theta_0 – \eta \nabla_{\theta} \mathcal{L}(D; \theta_0))$$

which we want to minimize w.r.t. θ₀. it has become popular recently thanks to the success of MAML and its earlier and more recent variants to use gradient descent to minimize such a meta-optimization objective function. the gradient can be written down as*

$$\nabla_{\theta_0} \mathcal{L}'(D'; \theta_0 - \eta \nabla_\theta \mathcal{L}(D; \theta_0)) = \nabla_{\theta'} \mathcal{L}'(D'; \theta_0 - \eta \nabla_\theta \mathcal{L}(D; \theta_0)) \times (1-\eta \nabla_{\theta_0} \nabla_{\theta} \mathcal{L}(D; \theta_0)),$$

where ฮธ’ is the updated parameter set. in this derivation, what we see is that the gradient w.r.t. the original parameter setย ฮธโ is propagated from the outer objective function L’ viaย ฮธ’ which was computed using the gradient of the inner objective function L w.r.t. ฮธ evaluated at the original parameter setย ฮธโ.

so far so good, but what if the inner optimization procedure was *stochastic*?

that is, what if the meta-optimization objective function was:

$$\mathcal{L}'(D’; \theta_0 – \eta \mathbb{E}_z \nabla_\theta \mathcal{L}_z(D; \theta_0)),$$

where $z$ is used to absorb any stochasticity present in this gradient descent procedure. for instance, $z$ could be used to sample a subset of $D$ to build a minibatch gradient. after all, this is often what we do in deep learning rather than full-batch, deterministic gradient descent as shown above.

in this case, the gradient of the meta-objective function w.r.t. θ₀ looks slightly different from above:*

$$\nabla_{\theta_0} \mathcal{L}'(D'; \theta_0 - \eta \mathbb{E}_z \nabla_\theta \mathcal{L}_z(D; \theta_0)) = \nabla_{\theta'}\mathcal{L}'(D'; \theta_0 - \eta \mathbb{E}_z \nabla_\theta \mathcal{L}_z(D; \theta_0)) \times (1-\eta \mathbb{E}_z \nabla_{\theta_0} \nabla_{\theta} \mathcal{L}_z(D; \theta_0)).$$

what’s really important to notice here is that there are suddenly **two** expectations rather than just one expectation in the meta-objective function. this might make a difference, because we now need two independent sets of samples of z to estimate the meta-objective gradient w.r.t. θ₀.

how would this be implemented in practice? we first draw one minibatch and update θ₀ up to θ'. we then draw another minibatch and update θ₀ up to θ'' (notice the double prime here!) we draw a validation minibatch D' to evaluate θ' using the meta-objective function L'. then we backprop up until θ' (using the same validation minibatch). we then suddenly switch to θ'' and backprop through it until θ₀. in other words, we use two separate paths until θ' for forward and backward passes, which is pretty different from the usual practice.
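the recipe above can be sketched in the same scalar toy setting as before. everything here is my own illustrative assumption (not code from any MAML implementation): the per-minibatch curvature `a_z` stands in for the stochastic choice z, and the key line is that the Jacobian factor in the backward pass uses an independent draw from the one used in the forward pass.

```python
import random

random.seed(0)
eta, theta0, target = 0.1, 0.5, 1.0

# illustrative per-minibatch inner loss L_z(theta) = 0.5 * a_z * theta^2,
# where the curvature a_z is drawn per minibatch; drawing a_z stands in for sampling z
def sample_curvature():
    return 2.0 + 0.5 * random.gauss(0.0, 1.0)

def outer_grad(theta):  # dL'/dtheta for an outer loss L'(theta) = 0.5*(theta - target)^2
    return theta - target

# forward pass: an inner step with one minibatch (z1) gives theta'
a1 = sample_curvature()
theta_prime = theta0 - eta * a1 * theta0

# backward pass: the Jacobian factor uses an INDEPENDENT minibatch (z2), i.e. theta'''s path
a2 = sample_curvature()
meta_grad = outer_grad(theta_prime) * (1.0 - eta * a2)

# the usual practice would reuse z1 in both passes
naive_grad = outer_grad(theta_prime) * (1.0 - eta * a1)
assert meta_grad != naive_grad  # the two estimators differ whenever a1 != a2
```

note that averaging `meta_grad` over many independent (z1, z2) pairs estimates the product of the two expectations in the equation above, whereas averaging `naive_grad` estimates the expectation of the product, which is a different quantity.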

what does this imply? what it implies is that **correct meta-objective optimization looks for θ₀ that is robust to the optimization trajectory taken due to the inherent stochasticity in SGD**. in order to do so it must consider what would have happened had a different optimization trajectory been used, and this can be estimated well by using separate minibatches for forward and backward passes. i believe Ferenc Huszar made a similar argument in the “What is missing? Stochasticity” section of his recent blog post.

an interesting question here is what z is and what kind of distribution we should impose on z. for instance, can we fold the choice of optimization algorithm into z in addition to other stochastic behaviours such as data permutation, dropout and others? if so, can we extend MAML to find a more robust initialization that would not only be robust to the stochasticity behind a selected optimization algorithm but also to the choice of optimization algorithm itself?

(*) i’m being massively sloppy with scalars, vectors, matrices, gradients and jacobians, and my apologies in advance. you could simply think of scalars only and the whole argument still largely holds.