NYU ran blended instruction this fall. Before the semester started, each course was classified as remote, in-person, or blended based on its size (number of students and lectures per week) and its nature (whether in-person attendance was essential), and I taught in blended mode. In blended mode, lectures were held in person, and lab sessions were multiplied to 2-3x the usual number so that both in-person and remote sessions were offered. All lectures and labs were livestreamed over Zoom, so that students who could not come to New York, including those who enrolled at one of NYU’s global campuses instead, had no trouble attending. Students attending any lecture or lab session in person were assigned in advance to a specific week and a specific seat, and masks were required in all NYU facilities. Each lecture room admitted only 1/4-1/3 of its maximum capacity, and instructors, with no exception, taught wearing masks at all times. My lecture room could hold up to 25-30 people at a time, but in practice about 3-10 showed up, with the rest attending via the Zoom livestream.

At the same time, each department reshuffled office assignments and desk arrangements so that faculty, postdocs, and PhD students could return to their labs as needed. At the NYU Center for Data Science, office and meeting-room reassignments gave every faculty member, postdoc, and PhD student a room of their own, an effort to provide an environment in which students with relatively poor housing situations could focus on research in peace.

Undergraduates who wanted to could move into the residence halls for the semester, and even then residence hall reconfiguration minimized unnecessary contact among students. All on-campus dining (used mostly by undergraduates) switched to pick-up only, and every desk and study space on campus required an advance reservation.

As Prof. Jihoon Jeong wrote in his post, in order to build this environment in the middle of Manhattan and avoid a covid-19 outbreak, NYU had every student and employee coming to the New York campus take a PCR test 1-2 times during the two weeks before the semester started. Employees living in New York were tested at the NYU Langone medical center, while students were tested en masse in tents set up near the residence halls.

After the semester started, every member of the community was required to take a saliva-based test once every two weeks (https://www.nyu.edu/life/safety-health-wellness/coronavirus-information/safety-and-health/coronavirus-testing/ongoing-testing.html). A reminder email arrives every other week; that week, you go to one of the 4-5 test collection points set up on campus, pick up a test kit, spit into it at home or in your office, and return it to a collection point. The result is available online 1-3 days later, and if no result has been recorded, card-key access to NYU buildings is blocked.

When someone tests positive, they go into isolation immediately and the university begins contact tracing. Unfortunately, the university’s contact tracing is limited to campus; New York State handles contact tracing beyond it. That the latter barely works at all is, of course, an open secret. Early in the semester there were signs of an outbreak in a residence hall: 2-3 floors were quarantined wholesale and everyone in them was tested twice, and a larger outbreak was avoided.

Through this process, about 15,000 of the roughly 60,000 members of the university returned to campus this semester, and as of last week the semester ended without interruption. According to the dashboard that has been updated in real time (https://www.nyu.edu/life/safety-health-wellness/coronavirus-information/nyc-covid-19-testing-data.html), a total of 199,870 tests were run since August 1, 758 cases came back positive, and including areas beyond New York, about 1,000 cases were positive. That is a positivity rate of 0.38%, dramatically lower than New York City’s.

This past spring in New York City... the big hospitals were hell, and outside the hospitals it was a ghost city. Even now, New York State records more than 10,000 new cases and more than 100 deaths every day. That NYU nonetheless kept educating its students without interrupting the semester, and that New York City’s public schools also ran the semester apart from one temporary two-week closure, is in some small way a comfort. This coming spring, and next fall as well, I hope the school stays open and students are properly taught, whatever it takes, and as a member of the university community I intend to do my best toward that.

New York, other US states, Korea, Canada, Europe; in most of the places whose news I follow to some degree, there is plenty of worry about big corporations, landlords, real estate, and the wealthy. These economic considerations, plus, as we have seen in the US, political calculations, heavily influence how well, or how badly, each place weathers this pandemic. Sadly, under these complicated considerations, education is easily buried. The pandemic will end, but how long will the effects linger for a generation that, over these 1-3 years, received a worse education than the generations before and after it?

I wonder whether that refreshing beer we enjoyed outside last week came at the cost of society’s future…

Aalto University (in particular the School of Science within it) and Finland just keep on giving, and I feel like I continue to receive without giving anything back. I will have to think of some way to pay back all that I have received from them.

Kiitos paljon!

Of course, the whole event was virtual, and due to the time difference, I could not attend myself. Instead, I sent the video recording of my greetings. You can watch it at https://youtu.be/074nhA9SQvA. I’m also attaching the script I used for recording this video below.

hi,

i received admission to the international master’s program in machine learning and data mining, which was called back then Macadamia, from Helsinki University of Technology in the spring of 2009.

although i applied to the program myself, finland was largely a land of mystery to me. perhaps this mysterious nature of the country was one of the major motivations for me to apply for this program in the first place. in my mind back then, finland was associated with just a couple of things, such as Nokia and the Helsinki Olympics. i must confess that i wasn’t even aware that finland shared a border with russia. unsurprisingly, going to finland to study was definitely not what i had in mind until one of my friends handed me the brochure of the Macadamia program in the winter of 2008.

the very first lecture i attended at helsinki university of technology, which was back then about to be merged with two other universities to form aalto university, was of the course “Machine Learning: Basic Principles”. this course was taught by Tapani Raiko, who would advise and mentor me for the next five years and whom i still admire and keep in touch with. in that very first lecture, i could immediately tell that i had made the right choice to be there to study machine learning. and, to this day, i still believe i made the right choice to be there at Aalto University to study machine learning and data mining.

as a part of the Macadamia program, some students were assigned to some of the labs within the department, which was back then information and computer science (ICS), to assist in research one day a week for a small stipend. the master’s program was still free to anyone from anywhere in the world back then in finland, which, i sadly learned recently, is not the case anymore. without tuition-free education, my decision to come to finland to study at Aalto University may have taken a very different course.

anyways, i was assigned to the Bayes group, which i do not believe exists anymore and which, despite its name, had a longer history of research in neural networks. the group back then was led by Prof. Juha Karhunen, who i believe has recently retired, together with Tapani and Prof. Alexander Ilin, who recently made a comeback to Aalto to re-build the Bayes group, though under a new name, “Deep Learning”. this part-time research gig at the then-Bayes group, which started in September 2009, was the beginning of my research career that is still on-going.

i often wonder what i would’ve become had it not been for this program, called the “honours program” then if i remember correctly, had i not been assigned to the Bayes group, or had i not been advised by Tapani and Alexander. it’s simply unimaginable. five years later, in March 2014, i defended my doctoral dissertation against my “opponent” Prof. Nando de Freitas, in front of my friends, colleagues and supervisors from the then-newly-formed Department of Computer Science of the Aalto University School of Science.

over those five years, i spent many days and nights in Maarintalo, studying for exams and working on projects. over those five years, i spent many days and nights in the computer science building, working toward my dissertation. over those five years, i had an uncountable number of lunches at the cafeterias in the computer science building as well as the main building. over those five years, i met so many friends and colleagues, many of whom i still keep in touch with.

Aalto University gave me an enormous opportunity by bringing me to Finland and giving me rigorous education on machine learning. Furthermore, Aalto University had successfully created an international environment in which I could immerse myself among talents from all over the world and be inspired by them. These were just the beginning of the series of opportunities Aalto University School of Science had given me over those five years.

my phd years were generously supported by FICS (the finnish doctoral programme in computational sciences), which has since been discontinued and, i believe, replaced by HICT. near the end of my phd programme, i was given a chance, supported by FICS and Prof. Erkki Oja, to spend six months visiting the University of Montreal to broaden my view and to learn further from the very best in the world.

this research visit opened my eyes to a broader set of topics in machine learning, and in particular this visit was how and when i began to seriously delve into studying how machine learning and more broadly AI could be used for and improve natural language processing and machine translation. this research visit led me to join the University of Montreal as a postdoc in a lab which was called Lisa back then and is now called Mila, immediately after i defended my dissertation.

And, now, i am an associate professor of computer science & data science at New York University, running my own research lab and teaching machine learning to aspiring students from all over the world.

in my opinion, one of the most important roles served by higher education is to bring out the best in each student. what this implies is that higher education cannot simply shove knowledge into students, and education cannot simply show students easy, comfortable and convenient ways forward. education must strive to provide as diverse and broad a set of opportunities and perspectives to students as possible in order to ensure each and every student has a chance to discover their way forward.

What i experienced during my years at Helsinki University of Technology, which became Aalto University School of Science and Technology and eventually Aalto University School of Science, was precisely this: rigorous and thorough education, and a string of educational and extra-curricular opportunities within and beyond the walls of the university and even the country’s borders.

It is truly my honour to be named the alumnus of the year, and to be frank I am quite unsure whether i deserve it. off the top of my head, i can think of Prof. Alexander Ilin, who is now back at Aalto University, as a more deserving candidate. Dr. Tapani Raiko, who is now at Apple, is another obvious candidate. and, no, it’s a totally objective list. they just happened to have mentored me throughout my years at Aalto.

let me wrap it up by dusting off my finnish: Kiitos paljon!

i enjoyed answering those questions, because they made me think quite a bit about them myself. of course, as usual i ended up leaving only a short answer to each, but i thought i’d share them here in case any students in the future run into the same questions. although my answers are all quite speculative and based on experience rather than rigorously justified, what’s the fun in rigorously proven and well-known answers?

of course, there were many more questions asked and answered during the live lectures and in the chatrooms, but i just cannot recall all of them easily, nor am i energetic enough after this unprecedented semester to go through the whole chat log to dig out interesting questions. i just ask you to trust me that the list of questions below is a tiny subset of the interesting ones.

i will paraphrase/shorten the answers below and remove any identifying information (if any):

- Why was backprop controversial? Yann mentioned that one of the big things that made the use of ConvNets in various applications controversial was the use of backpropagation. backprop is just an application of the chain rule, so why would anyone be suspicious of using it?
- Professor LeCun said that mini-batch has no advantage over single-batch SGD besides being easier to parallelize, and online SGD is actually superior. Is there any other theoretical reason why single-batch is preferable?
- Why would we do batch normalization instead of normalizing the whole dataset all at once at first? Is it for when normalizing the whole dataset is too computationally expensive? I understood that normalization makes the optimization process easier through making the eigenvalues equal. However, if you’re only normalizing over the batch, your normalization for each batch is subject to noise and might still lead to bad learning rates for each dimension.
- Batch normalization in VAE: While implementing the convolutional VAE model, I noticed that removing these BatchNorm layers enabled the model to train as expected. I was wondering why does BatchNorm cause this issue in the VAE model?
- In semi-supervised VAE, how do we decide the embedding dimensions for the class? Also, BERT used position embedding to represent the position, so how do we determine the position embedding dimensions in BERT?
- Why do we divide the input to the softmax in dot product attention by the square root of the dimensionality?
- DL appears to add double descent as a caveat in addition to bias-variance tradeoff learned earlier. Do you have any insights on how we should think about double-descent?
- In your opinion, will we achieve AGI?

**1. Why was backprop controversial? Yann mentioned that one of the big things that made the use of ConvNets in various applications controversial was the use of backpropagation. backprop is just an application of the chain rule, so why would anyone be suspicious of using it?**

when yann said it was controversial to use backprop earlier, i believe he meant it in two different ways: (1) backprop itself to compute the gradient of the loss function w.r.t. the parameters and (2) backprop to refer to gradient-based optimization. i’ll explain a bit of each below, but neither of them is considered a serious argument against using backprop anymore.

(1) backprop was controversial and is under great scrutiny when artificial neural nets (what we learn) are compared against biological neural nets (what we have). it’s quite clear due to biological constraints that backprop is not implemented in brains, as it is in our deep learning toolkits (see e.g., https://openreview.net/forum?id=HJgPEXtIUS for some of the interesting biological constraints/properties that should be satisfied by any biologically plausible learning algorithm.) to some people, this is a make-or-break kind of issue, because there seems to exist a learning algorithm that results in a superior neural net (human brains!). of course, this could just mean that a biological brain approximates the gradient computation as well as it can under the constraints, but it’s not easy to verify this (see, e.g., https://www.youtube.com/watch?v=VIRCybGgHts for how a brain might implement backprop.)

another criticism or objection along this line is that biological brains seem to have either zero or multiple objectives that are being optimized simultaneously. this is unlike our usual practice in deep learning where we start by defining one clear objective function to minimize.

(2) gradient-based optimization often refers to a set of techniques developed for (constrained/unconstrained) convex optimization. when such a technique is used for a non-convex problem, we are often working with a local quadratic approximation, that is, given any point in the space, the underlying non-convex objective function can be approximated by a convex quadratic function ($\frac{1}{2}\theta^\top H \theta + g^\top \theta + c$.) under this assumption, gradient-based optimization is attracted toward the minimum of this local quadratic approximation, regardless of whether there exists a better minimum far away from the current point in the space. this is often used as a reason for criticizing the use of gradient-based optimization with a non-convex objective function, and thereby for criticizing the use of backprop. see e.g. http://leon.bottou.org/publications/pdf/online-1998.pdf for an extensive study of the convergence properties of SGD.
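as a small illustration of this attraction to the nearby minimum, here is a sketch with a made-up one-dimensional objective (not anything from the lecture): gradient descent converges to whichever local minimum happens to be nearest its starting point, even though a better minimum exists elsewhere.

```python
# a toy non-convex objective with two minima: a poorer one near x = +1.9
# and a better one near x = -2.1 (the +x term breaks the symmetry)
def f(x):
    return 0.5 * x**4 - 4.0 * x**2 + x

def grad(x):
    return 2.0 * x**3 - 8.0 * x + 1.0

def gradient_descent(x0, lr=0.01, steps=1000):
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# each run is pulled toward the minimum of its local quadratic approximation
x_right = gradient_descent(3.0)   # settles near +1.9: the worse minimum
x_left = gradient_descent(-3.0)   # settles near -2.1: the better minimum
print(x_right, f(x_right))
print(x_left, f(x_left))
```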

this criticism however requires one big assumption: that there is a big gap in quality between one of the nearby local minima (we’ll talk about this in a few weeks in the course) and the global minimum. if there is a big gap, this would indeed be trouble, but what if there isn’t?

it turns out we’ve known for a few decades already that most local minima are of reasonable quality (in terms of both training and test accuracies) as long as we make neural nets larger than necessary. let me quote Rumelhart, Hinton & Williams (1986):

“The most obvious drawback of the learning procedure is that the error-surface may contain local minima so that gradient descent is not guaranteed to find a global minimum. However, experience with many tasks shows that the network very rarely gets stuck in poor local minima that are significantly worse than the global minimum. We have only encountered this undesirable behaviour in networks that have just enough connections to perform the task. Adding a few more connections creates extra dimensions in weight-space and these dimensions provide paths around the barriers that create poor local minima in the lower dimensional subspaces.”

— <Learning representations by back-propagating errors> by Rumelhart, Hinton & Williams (1986)

this phenomenon has been and is being studied quite extensively from various angles. if you’re interested in this topic, see e.g. http://papers.nips.cc/paper/5486-identifying-and-attacking-the-saddle-point-problem-in-high-dimensional-non-convex-optimization and https://arxiv.org/abs/1803.03635 for some recent directions. or, if you feel lazy, you can see my slides at https://drive.google.com/file/d/1YxHbQ0NeSaAANaFEmlo9H5fUsZRsiGJK/view which i prepared recently.

**2. Professor LeCun said that mini-batch has no advantage over single-batch SGD besides being easier to parallelize, and online SGD is actually superior. Is there any other theoretical reason why single-batch is preferable?**

this is an interesting & important question, and the answer varies from one expert to another, including Yann and myself, based on what is implicitly assumed and what criteria are used to tell which is preferred (computational efficiency, generalization accuracy, etc.)

Yann’s view is that the noise in SGD greatly helps generalization, because it prevents learning from getting stuck at a sharp local minimum and drives learning toward a flatter local minimum. a flatter minimum implies that the final neural net is more robust to perturbations of the parameters, which naturally translates into robustness to perturbations of the input, implying that it would generalize better. under this perspective, you want to maximize the level of noise, as long as it roughly cancels out on average across all the stochastic gradients computed from the training examples. that corresponds to using just one training example for computing each stochastic gradient.

of course, the amount of noise, which is proportional to the variance of the stochastic gradient, does impact the speed at which learning happens. in recent years, we (as the community of deep learning researchers) have found that certain network architectures require stochastic gradients computed using large minibatches (though it’s unclear what large means, as it’s quite relative to the size of the training set) to be trained at all. in these cases, it looks like a high level of noise sometimes prevents any progress in learning, especially in the early stage.
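the relationship between the minibatch size and the noise level is easy to check numerically. in the sketch below, per-example gradients are drawn i.i.d. from a normal distribution with variance 4 (the numbers are made up, purely for illustration), and the variance of the averaged minibatch gradient shrinks as $1/B$:

```python
import numpy as np

rng = np.random.default_rng(0)

# pretend these are per-example stochastic gradients of a scalar parameter:
# mean 1.0 (the full-batch gradient), standard deviation 2.0 (variance 4)
per_example_grads = rng.normal(loc=1.0, scale=2.0, size=100_000)

# a minibatch gradient is the mean of B per-example gradients,
# so its variance should be roughly 4 / B
variances = {}
for B in [1, 10, 100]:
    usable = (len(per_example_grads) // B) * B
    minibatch_grads = per_example_grads[:usable].reshape(-1, B).mean(axis=1)
    variances[B] = minibatch_grads.var()
    print(B, variances[B])
```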

so, in short, it’s still an open question. yann’s perspective may turn out to be the correct one (and that wouldn’t be the first time this happened,) or we may find a completely different explanation in the future.

**3. Why would we do batch normalization instead of normalizing the whole dataset all at once at first? Is it for when normalizing the whole dataset is too computationally expensive? I understood that normalization makes the optimization process easier through making the eigenvalues equal. However, if you’re only normalizing over the batch, your normalization for each batch is subject to noise and might still lead to bad learning rates for each dimension.**

there are three questions/points here. let me address each separately below:

“*normalization makes the optimization process easier through making the eigenvalues equal*“

we need to specify what kind of normalization we are referring to, but in general, it’s not possible to make the hessian the identity by simply normalizing the input. this is only possible when we are considering a linear network with a specific loss function (e.g., l2 loss for regression and cross-entropy for classification.) however, it is known empirically, and in some cases rigorously as well, that normalizing the input variables to be zero-mean and unit-variance brings the condition number (the ratio between the largest and smallest eigenvalues of the hessian matrix) close to 1 (which is good.)
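as a concrete sketch (the feature scales below are made up), we can compute the condition number of $X^\top X$, which is proportional to the hessian of the l2 loss of a linear model, before and after normalizing the input dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# two input features on wildly different scales, one with a large offset
X = np.stack([rng.normal(0.0, 100.0, size=n) + 50.0,
              rng.normal(0.0, 1.0, size=n)], axis=1)

def condition_number(X):
    # the hessian of the l2 loss of a linear model y = Xw is proportional
    # to X^T X; the condition number is the ratio of its extreme eigenvalues
    eigvals = np.linalg.eigvalsh(X.T @ X / len(X))
    return eigvals.max() / eigvals.min()

cond_raw = condition_number(X)

# normalize each input dimension to zero mean and unit variance
Xn = (X - X.mean(axis=0)) / X.std(axis=0)
cond_norm = condition_number(Xn)

print(cond_raw)   # enormous: gradient descent would need a tiny learning rate
print(cond_norm)  # close to 1
```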

“*why would we do batch normalization instead of normalizing the whole dataset all at once at first?*“

now, in the case of a network with multiple layers, it turns out that we can maximize the benefit of normalization by normalizing the input to each layer to be zero-mean and unit-variance. unfortunately, this is not trivial, because the input to each layer changes as the lower layers’ weights and biases evolve. in other words, if we wanted to normalize the input to each layer, we would need to sweep through the entire dataset every time we update the weight matrices and bias vectors, which would be intolerably expensive. furthermore, renormalizing the input at a lower layer changes the input to the upper layers, ultimately causing the loss function to change dramatically each time we renormalize all the layers, likely making learning impossible. though, this is addressable to a certain degree (see http://www.jmlr.org/proceedings/papers/v22/raiko12/raiko12.pdf by Tapani Raiko, my phd advisor, and Yann LeCun.)

“*your normalization for each batch is subject to noise*“

this is indeed true, and that’s precisely why it’s a customary practice to keep the running averages of the mean and variance of each dimension in batch normalization. assuming that the parameters of the network evolve slowly, such practice ultimately converges to the population mean and variance.
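here is a minimal numpy sketch of that customary practice, assuming a simple feature-wise batchnorm; the learnable scale and shift are omitted for brevity, and the momentum value is an arbitrary choice:

```python
import numpy as np

class BatchNorm1d:
    """batch normalization with running statistics (no learnable scale/shift)."""

    def __init__(self, dim, momentum=0.1, eps=1e-5):
        self.running_mean = np.zeros(dim)
        self.running_var = np.ones(dim)
        self.momentum = momentum
        self.eps = eps

    def forward(self, x, training=True):
        if training:
            mean, var = x.mean(axis=0), x.var(axis=0)
            # exponential moving averages: if the network's parameters evolve
            # slowly, these converge to the population mean and variance
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            mean, var = self.running_mean, self.running_var
        return (x - mean) / np.sqrt(var + self.eps)

rng = np.random.default_rng(0)
bn = BatchNorm1d(4)
for _ in range(500):
    # minibatches whose per-dimension population mean is 3 and variance is 4
    bn.forward(rng.normal(3.0, 2.0, size=(32, 4)))

print(bn.running_mean)  # close to 3
print(bn.running_var)   # close to 4
```

at test time, `forward(..., training=False)` normalizes with these accumulated statistics instead of the (noisy) per-batch ones.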

**4. Batch normalization in VAE: While implementing the convolutional VAE model, I noticed that removing these BatchNorm layers enabled the model to train as expected. I was wondering why does BatchNorm cause this issue in the VAE model?**

i don’t have a clear answer unfortunately, but can speculate a bit on why this is the case. my answer will depend on where batchnorm was used. of course, before reading the answer below, make sure your implementation of batchnorm doesn’t have a bug.

if batchnorm was used in the approximate posterior (encoder), it shouldn’t really matter, since the approximate posterior can be anything by definition. it can depend not only on the current observation $x$, but on anything else that helps minimize the KL divergence from this approximate posterior to the true posterior. so, i wouldn’t be surprised if it’s totally fine leaving batchnorm in the encoder.

if batchnorm was used in the decoder, it may matter, as the likelihood distribution (generative distribution) is over the observation space $\mathcal{X}$ conditioned on the latent variable configuration $z$. with batchnorm, instead, the decoder is conditioned on the entire minibatch of latent variable configurations, that is, the latent variable configurations of the other examples. this may hinder optimization in the early stage of learning (in the later stage of learning, it shouldn’t really matter much, though.)

in general, batchnorm is a tricky technique and makes it difficult to analyze SGD, because it introduces correlation across per-example stochastic gradients within each minibatch.

**5. In semi-supervised VAE, how do we decide the embedding dimensions for the class? Also, BERT used position embedding to represent the position, so how do we determine the position embedding dimensions in BERT?**

this question can be answered from two angles.

a. network size

the embedding dimensionality is a part of a neural net, and choosing it can be thought of as part of determining the size of your neural network. it’s a good rule of thumb to use as large a neural net as you can within your computational and financial budget to maximize your gain in terms of generalization. this might sound counter-intuitive if you have learned from earlier courses that we want to choose the most succinct model (according to the principle of occam’s razor,) but in neural nets, it’s not simply the size of the model, but the choice of optimization and regularization that matters, perhaps even more. in particular, as we will learn next week, because SGD inherently works in a low-dimensional subspace of the parameter space and cannot explore the whole space of the parameters, a larger network does not imply that it’s more prone to overfitting.

b. why more than one dimension?

let’s think of the class embedding (though the same argument applies to the positional embedding.) take as an example handwritten digit classification, where our classes consist of 0, 1, 2, .., 9. it seems quite natural that there’s a clear one-dimensional structure behind these classes, and we would only need a one-dimensional embedding. why then do we need a multi-dimensional class embedding?

it turns out that there are multiple degrees of similarity among these classes, and that the similarity among these classes is context-dependent. that is, depending on what we see as an input, the class similarity changes. for instance, when the input is a slanted 3 (a 3 significantly rotated clockwise), it looks like either 3 or 2, but not 8 or 0. when the input is an upright 3, it looks like either 3 or 8, but not 2. in other words, the classes 3 and 2 are similar to each other when the input is a slanted 3, while the classes 3 and 8 are similar to each other when the input is an upright 3.

having multiple dimensions to represent each class allows us to capture these different degrees of similarity among classes. a few dimensions in the class embeddings of 3 and 2 will point toward a similar direction, while a few other dimensions in the class embeddings of 3 and 8 will point toward another similar direction. when the input is a slanted 3, the feature extractor (a convolutional net) will output a vector that will emphasize the first few dimensions and suppress the other dimensions to exploit the similarity between 3 and 2. a similar mechanism would lead to a feature vector of an upright 3 that would suppress the first few dimensions and emphasize the latter few to exploit the similarity between 3 and 8.

it’s impossible to tell in advance how many such degrees of similarity exist and how to encode them. that’s why we need to use as high-dimensional an embedding as possible for encoding any discrete, one-hot input.
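here is a toy numerical sketch of this mechanism. the 4-dimensional class embeddings and the two feature vectors below are entirely made up for illustration: dims 0-1 stand for the “2-vs-3” direction of similarity and dims 2-3 for the “3-vs-8” direction:

```python
import numpy as np

# hypothetical 4-d class embeddings (made-up numbers, illustration only)
emb = {
    2: np.array([1.0, 1.0, 0.0, 0.0]),  # shares dims 0-1 with 3
    3: np.array([1.0, 1.0, 1.0, 1.0]),  # like 2 in dims 0-1, like 8 in dims 2-3
    8: np.array([0.0, 0.0, 1.0, 1.0]),  # shares dims 2-3 with 3
}

# a feature extractor might emphasize dims 0-1 for a slanted 3
# and dims 2-3 for an upright 3
features = {
    "slanted 3": np.array([1.0, 1.0, 0.1, 0.1]),
    "upright 3": np.array([0.1, 0.1, 1.0, 1.0]),
}

logits_by_input = {}
for name, feat in features.items():
    logits_by_input[name] = {c: float(feat @ e) for c, e in emb.items()}
    print(name, logits_by_input[name])
```

for the slanted input, the logits of 2 and 3 end up close together while 8 is far behind; for the upright input, 3 and 8 are close while 2 is behind, i.e., the similarity structure is context-dependent.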

**6. Why do we divide the input to the softmax in dot product attention by the square root of the dimensionality?**

This question was asked at one of the office hours, and Richard Pang (one of the TAs) and i attempted to reverse-engineer the motivation behind the scaled dot-product attention from the transformers.

assume each key vector $k \in \mathbb{R}^d$ is a sample drawn from a multivariate, standard Normal distribution, i.e., $k_i \sim \mathcal{N}(0, 1^2).$ given a query vector $q \in \mathbb{R}^d$, we can now compute the variance of the dot product between the query and key vectors as $\mathbb{V}[q^\top k] = \mathbb{V}[\sum_{i=1}^d q_i k_i] = \sum_{i=1}^d q_i^2 \mathbb{V}[k_i] = \sum_{i=1}^d q_i^2$. in other words, the variance of each logit is the squared norm of the query vector.

assume the query vector $q$ is also a sample drawn from a multivariate, standard Normal distribution, i.e., $q_i \sim \mathcal{N}(0, 1^2)$. in other words, $\mathbb{E}[q_i]=0$ and $\mathbb{V}[q_i]=\mathbb{E}_{q_i} \left[(q_i - \mathbb{E}[q_i])^2\right] = \mathbb{E}_{q_i} \left[ q_i^2 \right] = 1$. then, the expected variance of the logit ends up being $\mathbb{E}_{q} \left[ \mathbb{V}[q^\top k] \right] = \mathbb{E}_{q} \left[ \sum_{i=1}^d q_i^2 \right] = \sum_{i=1}^d \mathbb{E}_{q_i} \left[ q_i^2 \right] = \sum_{i=1}^d 1 = d.$

we can now standardize the logit to be $0$-mean and unit-variance (or more precisely, we make the logit’s scale to be invariant to the dimensionality of the key and query vectors) by dividing it with the standard deviation $\sqrt{\mathbb{E}_q \mathbb{V}[q^\top k]}=\sqrt{d}.$
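these two quantities are easy to verify with a quick simulation (the choice of $d = 64$ below is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 100_000

# i.i.d. standard Normal query and key vectors
q = rng.standard_normal((n, d))
k = rng.standard_normal((n, d))

logits = (q * k).sum(axis=1)            # dot products q^T k
v_raw = logits.var()                    # close to d = 64
v_scaled = (logits / np.sqrt(d)).var()  # close to 1 after scaling
print(v_raw, v_scaled)
```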

these assumptions of Normality do not hold in reality, but as we talked about earlier, Normality is one of the safest things to assume when we don’t know much about the underlying process.

As Ilya Kulikov kindly pointed out, this explanation doesn’t answer “why” and instead answers “what” the scaling does. “why” is a bit more difficult to answer (perhaps unsurprisingly,) but one answer is that softmax saturates as the logits (the input to softmax) grow in magnitude, which may slow down learning due to vanishing gradients. though, it’s unclear what the right way to quantify this is.

**7. DL appears to add double descent as a caveat in addition to the bias-variance tradeoff learned early on. Do you have any insights about how we should think about double descent?**

The so-called double descent phenomenon is a relatively recently popularized concept that’s still being studied heavily (though it was observed and reported by Yann already in the early 90s; see, e.g., https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.66.2396 and also https://iopscience.iop.org/article/10.1088/0305-4470/25/5/020 by Krogh and Hertz.) The issue I have with double descent in deep neural nets is that it’s unclear how we define model capacity. the # of parameters is certainly not the best proxy, because the parameters are all heavily correlated and redundant. perhaps it should be the number of SGD steps, because we learned that the size of the hypothesis space is in fact a function of the number of SGD steps.

One particular proxy I find interesting and convincing is the fraction of positive eigenvalues of the Hessian at a solution. With this proxy, it looks like the apparent double descent phenomenon often lessens. see e.g. https://arxiv.org/abs/2003.02139.

So, in short, the model capacity is a key to understanding the bias-variance trade-off or more generally generalization in machine learning, but is not a simple concept to grasp with deep neural networks.

**8. In your opinion, will we achieve AGI?**

Of course, I’m far from being qualified to answer this question well. Instead, let me quote Yann:

<An executive primer on artificial general intelligence> by Federico Berruti, Pieter Nel, and Rob Whiteman

Yann LeCun, a professor at the Courant Institute of Mathematical Sciences at New York University (NYU), is much more direct: “It’s hard to explain to non-specialists that AGI is not a ‘thing’, and that most venues that have AGI in their name deal in highly speculative and theoretical issues…

[Updated on Nov 30 2020] added a section about the scaling law w.r.t. the model size, per request from Felix Hill.

[Updated on Dec 1 2020] added a paragraph referring to Dauphin & Bengio’s “Big Neural Networks Waste Capacity“.

this is a short post on why i **thought** (or more like imagined) the scaling laws from <scaling laws for autoregressive generative modeling> by Henighan et al. “[is] inevitable from using log loss (the reducible part of KL(p||q))” when “the log loss [was used] with a max entropy model“, which was my response to Tim Dettmers’s tweet on “why people are not talking more about the OpenAI scaling law papers“. thanks to João Guilherme for bringing this to my attention. it’s given me a chance to run some fun thought experiments over the weekend, although most, if not all, of them failed, as usual with any ideas and experiments i have. anyhow, i thought i’d leave here why i thought so, particularly from the perspective of dataset size.

- The scaling law for Bernoulli w.r.t. the dataset size
- The scaling law for Bernoulli w.r.t. the model size
- The scaling law for Bernoulli w.r.t. the compute amount
- Final thoughts

instead of considering a grand neural autoregressive model, i’ll simply consider estimating the mean of a Bernoulli variable after $N$ trials, and compare the log loss at this point against the log loss computed after $N+\Delta$ trials. let’s start by writing down the loss value after $N$ trials:

$$

-L(N) = p^* \log \frac{N_1}{N} + (1-p^*) \log \frac{N-N_1}{N} =

p^* \log N_1 + (1-p^*) \log (N-N_1) - \log N,

$$

where $p^*$ is the true ratio of heads and $N_1 < N$ is the number of heads from the $N$ trials.

let’s now consider tossing the coin $\Delta$ more times. i will use $\Delta_1 < \Delta$ as the number of additional heads after these additional trials. what’s the loss after $N+\Delta$ trials?

$$

-L(N+\Delta) = p^* \log (N_1 + \Delta_1) + (1-p^*) \log (N+\Delta - N_1 - \Delta_1) - \log (N+\Delta).

$$

so far so good. now, what kind of relationship between these two quantities $L(N)$ and $L(N+\Delta)$ do i want to get? in my mind, one way to say there’s a power-law-like structure behind $L$ is to show that the amount of improvement i get by running $\Delta$ more trials decreases as the number of existing trials $N$ increases. that is, there’s a diminishing return from a unit of effort as more effort has already been put in.*

then, let’s look at their difference by starting from the loss at $N+\Delta$, while assuming that $\Delta \ll N$ (and naturally $\Delta_1 \ll N_1$ as well) so that i can use $\log (1+x) \approx x$ when $x$ is small:

$$

\begin{align*}

-L(N+\Delta) =& p^* \log (N_1 + \Delta_1) + (1-p^*)\log(N+\Delta - N_1 - \Delta_1) - \log (N+\Delta)

\\

=&

p^* \log N_1 (1+ \frac{\Delta_1}{N_1}) + (1-p^*) \log(N-N_1)(1 + \frac{\Delta - \Delta_1}{N-N_1}) - \log N(1+ \frac{\Delta}{N})

\\

\approx

&

\underbrace{p^* \log N_1 + (1-p^*) \log (N-N_1) - \log N}_{=-L(N)} + p^* \frac{\Delta_1}{N_1} + (1-p^*)\frac{\Delta - \Delta_1}{N-N_1} - \frac{\Delta}{N}.

\end{align*}

$$

The decrease in the loss by running $\Delta$ more trials can now be written as

$$

L(N) - L(N+\Delta) = p^* \frac{\Delta_1}{N_1} + (1-p^*)\frac{\Delta - \Delta_1}{N-N_1} - \frac{\Delta}{N}.

$$

since $\Delta_1 < \Delta$ and $N_1 < N$, let’s rewrite them as $\Delta_1 = \beta \Delta$ and $N_1 = \alpha N$, where $\alpha \in [0,1]$ and $\beta \in [0,1]$. then,

$$

L(N) - L(N+\Delta) = p^* \frac{\beta \Delta}{\alpha N} + (1-p^*) \frac{(1-\beta)\Delta}{(1-\alpha)N} - \frac{\Delta}{N} = \frac{\Delta}{N} \left(p^* \frac{\beta}{\alpha} + (1-p^*)\frac{1-\beta}{1-\alpha} - 1\right)

$$

this says that the change from the loss at $N$ to the loss at $N+\Delta$ is inversely proportional to $N$ itself, which is what i wanted to see from the beginning. although there were a few leaps of faith along the way, it looks like the more tosses i have made (i.e., the larger $N$), the smaller the change i can make to my loss with a constant number of extra tosses.
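as a quick numerical sanity check of this derivation (the particular values of $p^*$, $\alpha$, $\beta$, $N$ and $\Delta$ below are my arbitrary picks, with $\Delta \ll N$ so the $\log(1+x) \approx x$ approximation applies):

```python
import math

# compare the exact loss decrease against the first-order approximation.
# all the constants here are arbitrary illustrative choices, not from any paper.

def neg_L(p_star, n_heads, n):
    # -L(N) as defined above: p* log(N1/N) + (1-p*) log((N-N1)/N)
    return p_star * math.log(n_heads / n) + (1.0 - p_star) * math.log((n - n_heads) / n)

p_star, alpha, beta = 0.7, 0.6, 0.9
N, Delta = 100_000, 100          # Delta << N
N1, Delta1 = alpha * N, beta * Delta

# exact decrease: L(N) - L(N+Delta) = (-L(N+Delta)) - (-L(N))
exact = neg_L(p_star, N1 + Delta1, N + Delta) - neg_L(p_star, N1, N)

# first-order approximation from the last equation above
approx = (Delta / N) * (p_star * beta / alpha
                        + (1.0 - p_star) * (1.0 - beta) / (1.0 - alpha) - 1.0)

print(exact, approx)  # the two agree to first order in Delta/N
```

with these values the second (multiplicative) term is positive ($\beta > \alpha$ while $p^* = 0.7$), so the loss indeed drops, and by an amount scaling like $\Delta/N$.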

the second (multiplicative) term is more complicated, and i find it easier to think of two extreme cases: $p^*=1$ and $p^*=0$. these cases are reasonable if we think of this exercise as a proxy for studying classification, where it’s often assumed that a given input belongs either to one (positive) or the other (negative) class in an ideal world. when $p^*=1$, the second term reduces to

$$

\frac{\beta}{\alpha} - 1~~

\begin{cases}

> 0, & \text{if } \beta > \alpha \\

< 0, & \text{if } \beta < \alpha \\

= 0, & \text{if } \beta = \alpha

\end{cases}

$$

in other words, if the extra tosses reflect the true distribution better ($\beta > \alpha$, because the true positive rate is $1$,) the loss drops. otherwise, the loss increases ($\alpha > \beta$) or stays the same (i.e., no additional information has been added.) the other extreme case of $p^* = 0$ works similarly.

what’s important is that this second term largely dictates the sign of how the loss changes with the extra $\Delta$ tosses. since we are considering only the ratios of the heads within sets of trials and (suddenly!) assume both $N$ and $\Delta$ are reasonably large, the magnitude of change is instead largely determined by the ratio between $\Delta$ and $N$, with $N$ in the denominator.

so, this is how i arrived at my shallow take on twitter that these scaling laws may not have too much to do with whether we use neural net parameterization or not, whether we are solving language modeling, machine translation, etc., nor whether we are working with text, image or both. “i think” it arises naturally from the maximum entropy formulation (you can think of estimating the log-frequency of the heads above with sigmoid/softmax to turn it into the Bernoulli distribution) and the log loss.

of course, because i had to make a number of leaps of faith (or to put it another way, a few unreasonable assumptions,) it’s possible that this actually doesn’t make much sense at the end of the day. furthermore, i’m super insecure about my math in general, and i’m about 99.9% sure there’s something wrong in the derivation above. hence, why “i think” the scaling law arises from log loss (cross-entropy) and maximum entropy models.

it’s important for me to point out at this point that Henighan et al. did much more than what i’ve discussed in this post and provided a much more extensive set of very interesting findings. they looked not only at the effect of the data size, but also at the compute budget $C$ and the model size $|\theta|$. in fact, they focus much more on the latter two than on the former, which was my focus here.

in the case of the model size, it’s quite trivial to map it to the argument i made above regarding the number $N$ of observations. let’s consider the model size $|\theta|$, in this context of recovering a Bernoulli distribution, as the number of bits (with an arbitrary base, including $e$) allowed to represent $N$ and $N_1$ (and consequently, $\Delta$ and $\Delta_1$.) then, the maximum $N$ a model can count up to is $\exp(|\theta|)$, and by increasing the model size by $\delta$ (i.e., $|\theta|+\delta$,) we can toss the coin

$$

\exp(|\theta|) \exp(\delta) - \exp(|\theta|) = \exp(|\theta|) (\exp(\delta) - 1)

$$

more. in other words, increasing the size of the model, while assuming that we can run as many tosses as we can to saturate the model capacity, is equivalent to setting $\Delta$ above to $\exp(|\theta|) (\exp(\delta) - 1)$.

in this case, the first term in the change in the loss above reduces to

$$

\frac{\Delta}{N} = \frac{\exp(|\theta|) (\exp(\delta) - 1)}{\exp(|\theta|)} = \exp(\delta) - 1,

$$

which is weird, because the dependence on $N = \exp(|\theta|)$ disappeared. that is, the change in the loss w.r.t. the increase in the model size (the number of bits) is not dependent on the number of existing bits used by the model.

what is happening here? in my opinion, this implies that the # of parameters in a neural net, or increasing it, is **not** optimally done in terms of compression.

what if we instead assume that only a polynomial number of trials can be compressed, i.e., $N=|\theta|^c$? in particular, for the sake of simplicity, let’s assume $c=2$. in this case,

$$

\frac{\Delta}{N} = \frac{(|\theta|+\delta)^2 - |\theta|^2}{|\theta|^2} = 2\frac{\delta}{|\theta|} + \left(\frac{\delta}{|\theta|}\right)^2,

$$

and voila! we recovered the dependence on the model size $|\theta|$, and this dependence is inversely proportional, as expected. by further assuming that $\delta \ll |\theta|$, we end up with

$$

\frac{\Delta}{N} \approx 2 \frac{\delta}{|\theta|}.

$$
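plugging in some made-up numbers (nothing below depends on these particular choices) confirms the first-order behaviour under the quadratic-capacity assumption:

```python
theta, delta = 10_000, 10            # illustrative model "bits" and a small increment
N = theta ** 2                       # trials compressible under N = |theta|^2
Delta = (theta + delta) ** 2 - N     # extra trials bought by delta more bits
ratio = Delta / N                    # exact: 2*delta/theta + (delta/theta)^2
approx = 2 * delta / theta           # dominant first-order term when delta << theta
print(ratio, approx)
```

the exact ratio exceeds the first-order term only by $(\delta/|\theta|)^2$, which is negligible when $\delta \ll |\theta|$.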

so, what does it say about the observation by Henighan et al. that there is a scaling law w.r.t. the model size? i suspect that their observation is telling us that the deep nets we use are far from optimal in the sense of compressing data. it could be due to the choice of architectures, our choice of learning algorithms or even the regularization techniques we use. it’ll be interesting to pinpoint what’s behind this sub-optimality.

as i was writing the last paragraph, i was reminded of this earlier workshop paper by Yann Dauphin & Yoshua Bengio from the workshop track of ICLR’13, titled “Big Neural Networks Waste Capacity.” in this work, they observed the “rapidly decreasing return on investment for capacity in big networks” and conjectured this is due to the “failure of first order gradient descent.” perhaps, Yann was onto something, although i don’t think he’s followed up on this.

in the case of the compute budget, i have absolutely no idea, but i wonder if a similar argument as the model size could be made. the number of SGD steps largely dictates the maximum magnitude of the weights in a neural net. the resolution (?) of the computed probability is largely determined by the maximum magnitude of (or the variance of individual weights in) the final weight matrix (that feeds into the final softmax). perhaps we can connect these two to show that more SGD updates allow our neural net to more precisely identify the target probability. of course, this suggests that different optimization strategies may result in radically different scaling laws.

assuming what i wrote above makes even the slightest bit of sense, this raises two interesting questions, in my opinion. first, is counting examples all a sophisticated neural net does? the strict answer is no, because it both counts and compresses. it does, however, look as if it’s compression without any interesting emergent property (such as systematic generalization). second, how does this property change when we move away from the maximum entropy formulation and log loss? i’ve pointed out two directions that look promising in a tweet earlier: the margin ranking loss by Collobert & Weston and the entmax series by Martins and co. if so, will the change be in a desirable direction?

let me wrap up by thanking Henighan et al. and Kaplan & McCandlish et al. for the thought-provoking pieces that have made me think about these models and problems i’ve been working with all along from a very different angle.

(*) of course the other (more positive) way to look at it is that there’s always more to be learned if we are ready to invest as much as we have invested already.

**Detour**: Before I continue to talk about this award, let me just briefly share with you my experience of living abroad in three different places (Helsinki, Montreal and NYC) that speak three different languages (Finnish, French and English) as an expat, and in particular as a student expat, over the past ten years or so. In short, it’s not easy. It’s not easy in many ways, but the one I found most challenging was this feeling I had whenever I moved to a new place: that I have to stay alert, watch my account balance and prepare for the worst until I fully settle down and get used to the new city and country. Even then, there’s a nagging feeling that I am only a temporary resident here and that I must be prepared to leave immediately, without any hesitation, if I’m forced to or decide to.

You can literally see this stress in newly arriving students or, more broadly, expats who are not financially well off. They have a difficult time appreciating the beauty and joy of a new place, not to mention enjoying them. Even if this new town is filled with awesome restaurants, they wouldn’t fancy the idea of dining at those restaurants. Even if the city is surrounded by amazing tourist destinations, they wouldn’t spare the time to visit them unless their parents come visit. Their places are often light on furniture, and even the furniture they get is on the cheapest end of the spectrum: in fact, a lot of them don’t even buy a full bed but just a cheap mattress placed on the floor.

Even in my case, where I have been relatively well off financially for a newly arriving student/postdoc, i’ve never bought a couch since i left my parents’ place (don’t worry, i’m planning to do so shortly,) and i bought a bed with a box spring for the first time only when I moved to NYC as a new faculty member. It took my parents’ visit after my second year in Finland for me to travel to Rovaniemi and other tourist destinations in Finland and neighbouring countries (and let me tell you: there aren’t that many.) It took a workshop at NRC Canada for me to visit Ottawa when I was in Montreal, and an invitation by Hugo Larochelle to U. Sherbrooke for me to visit Quebec City (I know.. it’s not on the way to Sherbrooke, but I took a detour.) Even when I could afford it, it took several walk-bys before I could mentally prepare myself to dine in at a reasonably fancy (but not that much…) place, and it still does.

That’s the weirdest thing: most of these I could afford back then and can certainly afford now. However, even if I could afford it, even if I knew it would improve how I live, and even if I knew that would make my days more comfortable, a lot of things felt much less accessible and looked overly and unnecessarily luxurious. I’ve experienced this stress, although I’ve thoroughly enjoyed and never regretted moving to and living in these places, been financially stable for most of my expat years and haven’t had any dependent to support. One begins to wonder how challenging it must be for others (and you!) who may be in worse situations.

**Back to the award**: this award comes with a generous $30,000 USD monetary prize^{1} (!) And, no, it’s not paid to the university for me to use to support my research; it is paid directly to me. In other words, I’m free to do whatever i want with this $30,000 that sprang out of nowhere. should i finally buy a couch? well, i could, but i can buy one without this prize money. should i buy a car? well, i live in manhattan. should i go on a luxury vacation? well, pandemic…

After a brief period of pondering, i’ve decided to donate the prize money^{2} to Mila, where I was a postdoc for 1.5y + a visiting student for 0.5y. More specifically, i’ve decided to donate the prize money to Mila on the condition that it is used to provide a *one-time cash supplement* of up to $1,500 CAD to each incoming *female* student/postdoc arriving from *Latin America*, *Africa*, *South Asia*, *South East Asia* or *Korea*, until the donation runs out. I hope this supplement gives students, who have just arrived in Montreal to start a new chapter of their lives, a bit of room to breathe. Perhaps they can use it to enjoy a dinner at a nice restaurant in Montreal. Perhaps they can go out with their new friends and family for beer. Perhaps they can buy not just a mattress but a proper bed. it’s not for me to determine what lets them relax a bit in the midst of settling down in a new environment, and i just hope it helps in whatever way suits them best.

I thoroughly enjoyed my time at Mila (which was, to be precise, called Lisa back then,) and have greatly benefited from spending my time there as a postdoc. i cannot imagine where i would be had i not been a postdoc at Mila. And, I hope this small gesture of mine helps a diverse group of incoming students/postdocs from all corners of the world have a more enjoyable time at Mila and benefit from their time there as much as, if not more than, i have.

**Why female students from these regions (Latin America, Africa, South Asia, South East Asia and Korea)?** our field has an issue of representation in many aspects. we have an issue of gender representation. we have an issue of geographical representation. we have an issue of educational background/discipline representation. we have many more issues of representation in other aspects. All these issues of representation are equally important and critical, and I know that these are not just pipeline issues, based on my experience of meeting amazing talents while teaching at Deep Learning Indaba 2018, Khipu.AI 2019, SEAML 2019, Deep Learning UB 2019 and the African Master’s Programme in Machine Intelligence (AMMI). these issues are often ones of opportunity and support. I believe we need to take even a little action at a time rather than waiting to address all of them simultaneously. in this particular case, I decided to take a minuscule shot at addressing a couple of these issues: the lack of female representation and the limited representation of researchers and students from Latin America, Africa, South Asia and South East Asia (I added Korea because the prize came from a Korean company :))

Also, perhaps a bit selfishly, i want to make sure there’ll be a role model my niece can look up to in the field of AI when she’s older.

(1) they also sent me this awesome plaque, but i don’t think Mila would appreciate it as a donation.

(2) i’ve decided to donate $35,000 CAD after setting aside a bit for tax. after all, i’ve been paying more federal tax than the president for quite some time already and am expecting to pay some more this coming tax season.

**Background:** Right before COVID-19 struck NY heavily this past Spring, K-12 teachers from Busan, Korea stopped by NYC on their trip to the US to study various AI education strategies, and asked me for a short meeting. Frankly, i was quite skeptical about this meeting and assumed it was their vacation in disguise. This skepticism of mine completely melted away when I met them in their hotel’s meeting room and began to hear what they’ve done and are doing at their schools, covering primary (1-6y), middle (7-9y) and high schools (10-12y), to teach their students what AI is, what these students can already do with it, and what they would be able to do with it in the future. it was eye-opening and has since made me realize how outdated my view of K-12 education (be it in Korea or elsewhere) is and how much K-12 education can be updated to keep up with the latest developments in society when teachers are enthusiastic and given opportunities.

This trip was a part of their effort to create teaching material for AI education aimed at K-12 teachers. I heard back from them a few months later that this material was ready to be published as a series of four books, and I was asked to write an opening remark. I was of course more than glad to write one for them. Because I’m not too comfortable writing about AI in Korean (i mean.. when have i ever written anything about AI in Korean?) i went ahead with English, and one of the participating teachers translated it into Korean.

Today (Nov 21 2020), i received the pdf copies of these four books and was able to take a more careful look at the content. it’s filled with fun activities teachers can use to help students learn about AI by experiencing a diverse set of sub-disciplines, including robotics, computer vision, natural language processing, machine learning, data science, etc. i’m so envious of these kids who will get to experience and have fun with all these activities and projects and ultimately become AI-native, unlike any of us.

And, without further ado, here it is.

**Foreword:** Intelligence is one of the last remaining mysteries of this universe and of ourselves that has evaded our collective attempt at uncovering its underlying mechanisms. We think every day, every hour, every minute, if not every second, effortlessly, without realizing that there are 86 billion neurons interacting with each other in a manner both highly coordinated and highly chaotic behind this process of thinking. We perceive the surrounding world, which consists of our family, our friends and everything we can imagine and interact with each day, effortlessly, even though the surrounding world never stays idle but dynamically changes its appearance non-stop. Based on our perception and pondering, we act in the surrounding environment effortlessly, although there are infinitely many possible ways in which our action could go wrong. Intelligence is behind these seemingly facile activities, driving each and every one of us from one moment to another, but intelligence has largely evaded our interrogation and investigation even until now.

Despite the “artificial” in artificial intelligence, artificial intelligence (AI) is a scientific discipline in which intelligence in general, not necessarily an artificial one, is studied. As the first step in this direction, AI scientists ask what intelligence is. To answer this question, some are inspired by biological intelligence. To answer this question, some look into psychology. To answer this question, some look into philosophy. To answer this question, some look into mathematics. To answer this question, some, like myself, look into computer science, which has a good track record of rigorously defining and understanding traditionally illusory concepts, such as information and computation, thanks to Claude Shannon, Alan Turing, who originally “propose[d] to consider the question, ‘Can machines think?’” in 1950, and the like.

In this scientific pursuit of (artificial) intelligence, “learning” has been found to be a central concept to intelligence. Intelligence is not merely a bag of algorithms and knowledge for solving a fixed set of problems, but it is rather the process of learning to solve a new problem by creating a new algorithm. Every time a new problem or a variant of a known problem is given, a machine, either biological or not, must “learn” to solve it and acquire a set of sophisticated skills in this process. The question of “what is intelligence?” has suddenly morphed itself into the question of whether we can build a machine that can learn to solve any problem. If we could build one, that machine would be intelligent, and this machine itself would be our answer to the ultimate question of “what is intelligence”. Machine learning is a sub-discipline in computer science that has pursued this direction of building a learning machine to figure out what intelligence is.

Machine learning has made rapid progress in recent years, thanks to theoretical and empirical advances in learning algorithms, increased availability of data, wide adoption of open-source software and incredible advances in computing systems. A few years ago, a deep neural network learned to listen to speech in a quiet room and transcribe it almost as well as an average person could. This was quickly followed by a deep convolutional network which could detect an incredible number of different objects in a picture, rivaling humans in object recognition. A couple of years later, a deep recurrent neural network was trained to translate news articles between English and Chinese and ended up translating almost as well as average bilingual speakers could. All these results were openly shared in forms of open-access publications and open-source software packages, which led to an unprecedented level of adoption of these new technologies. Industry has rapidly implemented and deployed these AI systems in various products, including voice assistants, real-time machine translators, automatic image tagging, content recommendation, driving assistance and even automated tutoring. These AI technologies are being deployed in increasingly more challenging domains, such as healthcare, medicine and automation.

Unfortunately, positive is not the only way to describe this rapid advance and wide adoption of machine learning, and thereby artificial intelligence, in recent years. These AI systems have been silently tested and deployed in society, touching many, if not most, of us, often without our realization. These silent, and often premature, tests have sadly revealed the negative sides of AI.

Billions of people use social media regularly, and social media companies extensively use AI technology to personalize individual users’ experience, effectively censoring the flow of information. Billions of people use video streaming services and news aggregation services every day, and the providers of these services use AI to decide not only what to but also what not to recommend and display to individual users, effectively shaping the users’ opinions without their own realization. This mass adoption of AI-based content filtering has unintentionally but unmistakably resulted in deepening polarization in many societies all over the world, sometimes resulting in fatal incidents and destabilization of otherwise stable, democratic societies.

Hastily developed and prematurely deployed AI systems, such as face recognition, automated exam proctoring and automated interviewer assessment, have been found to amplify undesirable societal biases and inequalities, such as racial bias, gender bias, income inequality and geographical inequality. For instance, incorrect identification of a face recognition system, which has repeatedly been found to disproportionately associate black people and people of colour as threatening, by police in the US has recently led to the wrongful arrest of an innocent black male. The world’s largest e-commerce company recently had to drop an AI-based recruiting system, because it was giving female candidates unjustifiable disadvantages for software engineering roles. A recent study has uncovered that commercial object recognition systems’ accuracies significantly drop when presented with pictures taken from poorer countries.

For AI to truly benefit us and the society, these shortcomings must be addressed and addressed fully. Technical advances alone, often made by a small group of elite scientists, will not be enough to make AI safe, fair and beneficial for all. Safe, fair and beneficial AI will only be possible when the whole society, consisting of both AI scientists and others, is aware of AI’s capability, adoption and deployment. The society must continue to carefully watch and monitor AI’s impact on the society, and be ready to rise and intervene against unsafe, unfair and unjust use of AI. This awareness of capability, limitations and underlying technology of AI is necessary for the society to benefit from AI.

Such awareness in the society of a new technology, in particular when it is an enabling technology, does not happen overnight. It must happen carefully and patiently over many years, if not decades, to ensure the whole society possesses a rational and coherent view of AI technology and its use. For this to happen, we must go beyond the status quo in which discourse on AI happens within and across universities and industry. We must start discourse and education on AI already with K-12 students who will be the first generation in the history of humanity to grow to live in a society where AI is not a novelty but an everyday reality. As the first step toward this goal, we must educate teachers of all levels to be familiar with and comfortable with the technologies and implications of AI, and must immediately start preparing educational materials and systems for teaching AI.

I thus applaud this effort by the Busan Metropolitan City Office of Education in preparing a new curriculum and accompanying educational materials on AI for both students and teachers. In doing so, the team from the Office of Education has struck a perfect balance between theory and application, between history and modern practices, and between technology and ethics. I am envious of students in Busan who will learn to be native in AI according to this curriculum, and am now hopeful rather than worried about the future of AI and its impact on society.

Many aspects of OpenAI’s GPT-3 have fascinated and continue to fascinate people, including myself. these aspects include the sheer scale, in terms of the number of parameters, the amount of compute and the size of data, the amazing infrastructure technology that has enabled training this massive model, etc. of course, among all these fascinating aspects, meta-learning, or few-shot learning, seems to be the one that fascinates people most.

the idea behind this observation of GPT-3 as a meta-learner is relatively straightforward. GPT-3 in its essence computes the conditional distribution over all possible next tokens (from a predefined vocabulary) given a prefix: $p(x' | x_1, \ldots, x_t)$. this conditional distribution can be chained to form a conditional distribution over sequences given the prefix: $p(x'_1, \ldots, x'_{t'} | x_1, \ldots, x_t) = \prod_{t''=1}^{t'} p(x'_{t''} | x'_{<t''}, x_1, \ldots, x_t)$. this makes GPT-3 subsume a so-called sequence-to-sequence or encoder-decoder model, allowing one to use GPT-3 to find an answer $(x'_1, \ldots, x'_{t'})$ given a question (often referred to as a “prompt”, which comes together with a couple of known examples) $(x_1, \ldots, x_t)$ by solving

\[

\arg\max_{x'_1, \ldots, x'_{t'}} \log p(x'_1, \ldots, x'_{t'} | x_1, \ldots, x_t).

\]
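to make the chaining concrete, here is a tiny runnable sketch in which a made-up bigram model stands in for GPT-3 (the token set, the probabilities and the function names are all my own illustration, not anything from the paper):

```python
import math

# toy stand-in for GPT-3's next-token distribution p(x' | prefix);
# everything here (tokens "a"/"b", the probabilities) is made up for illustration.
def next_token_probs(prefix):
    last = prefix[-1] if prefix else "a"
    return {"a": 0.9, "b": 0.1} if last == "a" else {"a": 0.4, "b": 0.6}

def log_prob(continuation, prefix):
    # chain rule: log p(x'_1 .. x'_t' | x_1 .. x_t) is the sum of
    # per-token conditional log-probabilities
    lp, ctx = 0.0, list(prefix)
    for tok in continuation:
        lp += math.log(next_token_probs(ctx)[tok])
        ctx.append(tok)
    return lp
```

e.g., `log_prob(["a", "b"], ["a"])` returns $\log 0.9 + \log 0.1$, the chained conditional log-probability of the continuation given the prefix.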

This problem turned out to be intractable, and people have been using an approximate search algorithm, such as greedy search or top-$k$ sampling, to find an answer given a prompt. In the GPT-3 paper, the authors present an impressive set of experimental results highlighting this meta-learning aspect of GPT-3.
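for concreteness, here is a minimal sketch of top-$k$ sampling, one of the approximate decoding procedures mentioned above (the toy logits are made up; in a real system they would come from the model):

```python
import numpy as np

def top_k_sample(logits, k, rng):
    # keep only the k highest-scoring tokens, renormalize among them, sample one
    top = np.argsort(logits)[-k:]                    # ids of the k most likely tokens
    probs = np.exp(logits[top] - logits[top].max())  # softmax over the survivors
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))

rng = np.random.default_rng(0)
logits = np.array([2.0, 0.1, -1.0, 1.5, 0.0])  # toy next-token scores
token = top_k_sample(logits, k=2, rng=rng)
# with k=2, only token ids 0 and 3 (the two largest logits) can ever be drawn
```

greedy search is the degenerate case $k=1$; larger $k$ trades determinism for diversity.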

But, then, you start to wonder: in particular, i began to wonder about this just today during our research group’s weekly meeting, when Elman Mansimov presented a few recent papers that have followed up on this meta-learning aspect of a language model, of which GPT-3 greatly increased the awareness. What do i wonder? I wonder if it’s meta-learning, as we think of meta-learning conceptually, that drives this phenomenon, or if there is actually a simpler mechanism behind this observation.

let’s imagine a wonderful hypothetical world in which I can train another GPT-3 on the same data myself at NYU, but this time i will make one slight tweak. that is, i will train this new GPT-3, to which i refer as GPT-E, after reversing the order of all documents in the original dataset. that is, GPT-E computes the conditional distribution over all possible previous tokens given a suffix: $p(x | x'_t, x'_{t-1}, \ldots)$. since OpenAI has successfully trained GPT-3, you’d trust that i would be able to train this model in this hypothetical, but happy, world. I will also assume that in this happy parallel universe, i can hire all the amazing talents who worked on GPT-3 at NYU, perhaps as postdocs or even as PhD students, so that the quality of GPT-E rivals that of GPT-3.

but, then, something weird happens. if we believe in GPT-3’s meta-learning capability, GPT-E does something as amazing as (if not more amazing than) what GPT-3 can do. It takes as input a test question-answer pair and can output the prompt, which contains both a few training examples and a test question (!) of course, assuming the amounts of information on both sides are comparable (which should be the case for zero-shot or few-shot learning.)

Do you see what I am getting at? yes, we can now alternate between GPT-3 and GPT-E to sequentially create an encyclopedia of all the knowledge in the world (well, at least the knowledge that was represented in the training set.) We start from a random factoid and call it $(Q_0,A_0)$. We can find a reasonable “prompt” by feeding GPT-E with $(r(A_0), r(Q_0))$, where $r$ reverses a string, and sampling from $P_0 \sim p(x_1, \ldots, x_t | A_0, Q_0)$, preferably using top-$k$ sampling to reduce noise while maintaining some stochasticity. this prompt $P_0$ would consist of a (noisy) description of the task that corresponds to this factoid and a few noisy examples that are not exactly $(Q_0,A_0)$, in addition to the next question $Q_1$. We then switch to GPT-3 and sample another factoid $(Q_1, A_1)$ based on $P_0$. We alternate between these two steps, or more like between GPT-3 (real) and GPT-E (hypothetical), for as long as we want and accumulate $(Q_n, A_n)$ to create the encyclopedia of world knowledge. Beautiful, isn’t it?
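the alternation above can be sketched as a simple loop; to keep the sketch runnable, GPT-3 and GPT-E are replaced by trivial stand-in functions (every function name and canned response here is hypothetical, not a real API):

```python
def r(s):
    # reverse a string, as in the post
    return s[::-1]

def sample_gpt3(prompt):
    # hypothetical stand-in for GPT-3: prompt -> a (question, answer) factoid
    return ("Q from " + prompt, "A from " + prompt)

def sample_gpte(rev_answer, rev_question):
    # hypothetical stand-in for GPT-E: reversed (A, Q) -> a new prompt
    return "prompt built from " + r(rev_question)

def build_encyclopedia(q0, a0, rounds):
    q, a, facts = q0, a0, [(q0, a0)]
    for _ in range(rounds):
        prompt = sample_gpte(r(a), r(q))  # GPT-E proposes the next prompt
        q, a = sample_gpt3(prompt)        # GPT-3 answers it
        facts.append((q, a))
    return facts

facts = build_encyclopedia("Q0", "A0", rounds=3)
```

the real procedure would, of course, sample stochastically (e.g., with top-$k$) at both steps, which is exactly where the Gibbs-like flavour comes from.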

But, hold on. Where did meta-learning go? where is meta-learning in this Gibbs-like sampling procedure? is meta-learning just “noise” injected in each round of alternating between GPT-3 and GPT-E, for this Gibbs-like procedure to explore the space of knowledge effectively? If i wanted to put some positive, promising spin: is meta-learning how such noise is shaped by a large neural net so that it only spans relevant directions in this high-dimensional space corresponding to the knowledge manifold?

as I warned you at the beginning, there’s no “wow” moment nor “wow” conclusion in this post. this is just one piece of thought i had about GPT-3 that got me even more confused about all things machine learning (meta-learning, generative modeling, denoising, gibbs sampling, etc.)

P.S. i’m waiting for big tech firms with deep pockets (Amazon, Google, FB, etc. i’m looking at you) to train GPT-E for me to test this idea

P.P.S. you see why it was called GPT-E?

]]>There have been a series of news articles in Korea about AI and its applications that have been worrying me for some time. I’ve often ranted about them on social media, but I was told that my rant alone is not enough, because it does not tell others why I ranted about those news articles. Indeed, that is true. Why would anyone trust my judgement delivered without even a grain of supporting evidence? So, I decided to write a short post on Facebook (shared on Twitter) and, perhaps surprisingly, in Korean (!) This may be the first AI/ML-related (though very casual) post I’ve ever written in Korean, and it is definitely not the best-written piece from me, although I hope it clarifies why I’ve been fuming about those news articles.

This post is quite casual and not academic. If I’m missing any important references for the general public that you want me to include here, please drop me a line. As I’m not in any way an expert on this topic, I’m sure I’ve missed many important references, discussions and points.

That said, I realized that it’s not only Korean speakers who engage with this post (via Google Translate, etc.) and that the automatic translation of this post into English is awful (thanks to the hat tip by my colleague Ernest Davis at NYU.) Since it’s a pretty short post, I’ve decided to put its English version along with the original Korean version here in my blog. The version in Korean comes first, and the one in English follows immediately.

Twitter와 FB를 비롯한 social media 및 학계에서 많이 논의가 되지만 한국어로 된 논의는 크게 없어 보여서 아주 간단히 Social impact & bias of AI 라는 주제에서 중요하다 생각되는, 밀접히 연관된 point 몇 개를 아래 리스트업 합니다. 아마 있는데 제가 못 찾은 것일 수도 있고, 혹시 관련된 한국어로된 연구 또는 논의가 있으면 답글에 남겨주시기 바랍니다.

[아무래도 한국어로 글을 안 써 버릇해서 영 읽기 불편해 보입니다. 양해 부탁드립니다.]

*Amplification*

기술은 사회를 반영하는 것이 맞습니다. 다만 그렇게 반영된 사회의 특징이 기술을 통해 같은 사회 안에서 증폭이 됩니다. Virginia Eubanks의 <Automating Inequality> 또는 Ruha Benjamin의 <Race after Technology>를 읽어보면 어떻게 이런 증폭이 사람들에게 해를 가하는지 알게 됩니다 (https://www.nytimes.com/2018/05/04/books/review/automating-inequality-virginia-eubanks.html, https://us.macmillan.com/books/9781250074317, https://www.ruhabenjamin.com/race-after-technology). 최근에 제가 AI 인터뷰가 많이 쓰인다는 기사를 보고 열을 냈던 이유 중 하나로, 다들 내 얘기는 아니겠거니 하지만 이런 증폭된 부정적인 면은 궁극적으로 모두를 해하게 됩니다. 혹시 본인의 자녀가 어린 시절 잠깐 강남이 아닌 곳에서 초등학교를 다니는 바람에 AI 인터뷰에서 자동적으로 떨어진 건 아닐까요?

심지어는 완벽한 AI 시스템이 존재해도 amplification 문제는 여전히 존재합니다. 만약 AI 시스템에서 면접 보는 사람이 60%의 확률로 성공적일 것이라고 하고, 실제로 60%가 완벽한 (un)certainty라면 어떻게 할까요? 아마 모두 합격이라고 결정할 것 입니다. AI 시스템이 실전에 사용되면 해당 시스템의 uncertainty를 넘어서는 결정을 내리게 되고 amplification이 더 심해집니다.

*Opaqueness* of a model

AI/ML 시스템이 현업에서 집중적으로 쓰이기 시작한 것은 꽤 오래된 일이지만 이러한 시스템의 complexity가 급격히 높아진 것은 상대적으로 최근입니다. 이런 highly complex한 시스템을 deploy하는 입장과 사용하는 입장 그리고 적용받는 입장에서는 해당 시스템의 특징에 대해 알아야 합니다. 아쉽게도 동작 원리를 알아내는 것은 어렵고 연구 중 또는 기업기밀이라는 핑계 아래 이런 필요성이 무시당하곤 합니다. 당연히 어렵고 연구 중인 내용이긴 하지만 실제로 사용자 그리고 적용받는 입장에서는 세세한 과학적 원리를 요구하는 게 아니고 해당 시스템의 높은 수준의 동작 원리, 사회적 영향 등을 필요로 할 뿐입니다 (환경을 생각해서 자동차 배기량이 얼마나 되는지 알고 싶은데 갑자기 내연기관의 원리 및 해당 차종의 모든 디테일을 알지 못하면 배기량을 아는 것은 의미가 없다면 말이 안 되겠죠.) 이런 내용들이 고지되지 않으면 앞서 말한 amplification으로 인한 부정적인 영향을 이미 돌이킬 수 없는 상황이 되어서야 알 수 있습니다.

이를 위해서는 model card (https://dl.acm.org/doi/abs/10.1145/3287560.3287596) 및 datasheets for datasets (https://arxiv.org/abs/1803.09010) 등이 이제 겨우 시작이지만 좋은 방향으로 여겨집니다. 과연 자사 AI 시스템을 자랑하는 CEO/CTO 또는 개발자 중 model card와 dataset datasheet에서 추천하는 질문을 자사 시스템에 대해 했을 때 답할 수 있는 사람이 얼마나 될까요? 저 스스로도 잘 못 합니다만 특히나 AI 시스템을 deploy하는 입장에서는 이런 문제에 대한 답을 꼭 할 수 있어야 합니다.

*Selection bias* of data

위의 내용도 밀접하게 연결되는 내용으로 AI 시스템을 만드는데 사용되는 데이타가 어떻게 만들어지는지가 큰 문제입니다. 다만 이에 대한 논의는 데이타를 많이 사용하는 다른 분야에 비해 (예, survey) 상대적으로 잘 이뤄지지 않습니다. 최근 들어 AI/ML에 대한 관심이 높아지면서 다행히 data에 대한 관심도 많이 높아지고 있고 이에 따라 기존에 눈치 채지 못했던 다양한 문제들이 드러나고 있습니다. 예를 들어 Prabhu & Birhane (https://arxiv.org/abs/2006.16923) 는 CIFAR-10이란 매우 유명한 데이타셋을 만드는데 사용되었던 TinyImages dataset의 심각한 문제점들을 발견했고 이를 통해 TinyImages dataset이 take-down되었습니다. 지금이야 take-down되었지만 과연 그전까지 해당 데이타를 사용한 AI/ML 시스템들이 데이타의 문제를 고민하지 않고 만들어진 후 얼마나 현실에 적용되었는지 생각해보지 않을 수 없습니다. Gururangan et al. (https://arxiv.org/abs/1803.02324) 은 자연어처리 분야에서 굉장히 넓게 사용되는 Stanford NLI 데이타 안에 들어있는 문제점을 발견했고, 해당 문제점이 데이타 수집 과정에서 생겼다는 것을 보였습니다. 이런 문제점 발견에는 최신 AI/ML 기술 및 연구자 개개인의 manual한 노력이 필요했습니다.

일반적으로 AI 시스템이 얼마나 잘 동작하는지 자랑하는 기사 및 논문을 보는 것은 어렵지 않습니다. 하지만 사용자 및 AI 시스템의 판단을 받는 사람으로서 더 중요한 것은 과연 해당 시스템이 어떤 특징을 갖고 있는지, 그리고 해당 AI 시스템을 만드는데 사용된 데이타가 얼마나 잘 수집되고 정제되었는지 입니다. 이를 위해 더 많은 연구가 필요하고 현업에서는 실제 AI 시스템 개발보다도 더 큰 투자와 노력을 기울여야 합니다.

최근 FB에서 나온 연구 결과를 보면 데이타의 영향이 얼마나 큰지 알 수 있습니다 (https://openaccess.thecvf.com/content_CVPRW_2019/html/cv4gc/de_Vries_Does_Object_Recognition_Work_for_Everyone_CVPRW_2019_paper.html). 이 논문에서는 상용 object recognition 시스템의 정확도가 사진이 찍힌 지역의 소득과 correlate한다는 것을 보였습니다. 혹시 전라남도에 살면 서울에서 모인 데이타가 압도적으로 많이 쓰인 네이버 OCR이 덜 정확한건 아니겠죠? (http://kosis.kr/statHtml/statHtml.do?orgId=101&tblId=DT_1C65, 사실 네이버 OCR이 어떻게 만들어지는지 모릅니다. 다만 서울/경기에서 모인 데이타가 대부분일 것으로 생각되긴 하네요.)

이를 해결하기 위한 방향으로는 human-and-machine-in-the-loop이라는 패러다임이 promising해 보입니다: https://arxiv.org/abs/1909.12434, https://arxiv.org/abs/1910.14599, https://openreview.net/forum?id=H1g8p1BYvS. 다만 이런 패러다임은 어떻게 구현을 하느냐에 따라 결과가 크게 달라질 수 있고, 구현하는 과정에서 피해를 보는 사람들이 생길 수도 있습니다 (예를 들면 https://www.theverge.com/2019/2/25/18229714/cognizant-facebook-content-moderator-interviews-trauma-working-conditions-arizona).

*Correlation vs. Causation* & *systematic generalization*

종종 이런 문제는 기술의 문제가 아니라고 주장하는 사람들이 있습니다. 이런 주장은 보통 AI/ML의 근본적인 목표를 이해하지 못해서 하는 것 입니다. 특히나 AI/ML의 목표와 주어진 데이타의 sufficient statistics를 뽑아내는 것을 동일하게 보는 경우가 있는데, 이건 사실이 아닙니다.

AI/ML의 목표는 일반적으로 inductive inference고, Vapnik에 의하면 이것은 “an informal act [with] technical assistance from statisticians” (paraphrase) 입니다. 조금 더 최근에 나온 Arjovsky et al. (2019; invariant risk minimization https://arxiv.org/abs/1907.02893)에서는 좀 더 분명하게 “minimizing training error leads machines into recklessly absorbing all the correlations found in training data” 하여 “machine learning fails to fulfill the promises of artificial intelligence” 라고 합니다. 한 마디로 AI의 목표는 데이타 수집 환경에 구애 받지 않는 mechanism (언제나는 아니지만 많은 경우 causal) 을 찾아내서 out-of-domain (또는 systematic) generalization을 성공적으로 수행하는 것입니다.

안타깝게도 기존에 사용되는 대부분의 ML algorithm들은 이런 면이 부족합니다 (이런 예가 궁금하면 최근 제 발표의 초반을 보면 됩니다: https://drive.google.com/file/d/1CrkxcaQs5sD8K2HL2AWCMnrMRpFoquij/view) 이를 극복하기 위해 meta-learning과 IRM 등의 새로운 paradigm도 제시되고 causal inference from observational data를 ML에 적용시키는 연구도 많이 진행되고 있습니다 (예를 들면 https://arxiv.org/abs/1911.10500, https://arxiv.org/abs/1901.10912, https://arxiv.org/abs/1805.06826.)

단순히 데이타에 있는 correlated feature를 알고리즘이 찾은 것인데 어째서 그것이 문제이냐 묻는다면 일단 AI/ML이 무엇인지에 대한 고민부터 다시 해야 합니다.

Although it’s a topic that’s actively discussed both in academic settings and social media, such as Twitter and FB, I haven’t seen much discussion on the Social Impact & Bias of AI in Korean. To contribute even minimally to addressing this lack of discussion, here’s the list of a few points that are relevant to this topic. It’s possible that I simply have failed to find discussions surrounding this topic in Korean, and if there’s any, please kindly point me to them.

[My apologies for unprofessional writing. It’s not really everyday I write anything in Korean.]

*Amplification*

It is true that technology reflects the society. It is, however, also true that such technology is then used within the same society, and that it inevitably amplifies what has been reflected in it. It’s illuminating to read <Automating Inequality> by Virginia Eubanks and <Race after Technology> by Ruha Benjamin to see how such amplification harms people. (https://www.nytimes.com/2018/05/04/books/review/automating-inequality-virginia-eubanks.html, https://us.macmillan.com/books/9781250074317, https://www.ruhabenjamin.com/race-after-technology) This amplification of the negative aspects of the society is precisely why I fumed over the recent news articles on the wide adoption of AI interviews in Korea. You may think you’re not the one who’ll suffer from such amplification, but it eventually gets to everyone unless there is some intervention. Have you ever considered the possibility that your kid may not have received the job offer simply because they didn’t attend a primary school in Gangnam as a child?

Even if one imagines a perfect AI system, the issue of amplification still exists. Consider a hypothetically perfect AI system that has determined a candidate to be 60% fit for the company, where this 60% is perfectly calibrated. As soon as a user of this system simply thresholds at 50% to make a hiring decision, it ends up with the same issue of amplification, because in practice users of such an AI system inevitably overrule the supposedly perfect uncertainty estimated by the system.
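Here is a tiny simulation (my own toy illustration, not from the articles discussed) of this point: a perfectly calibrated score of 60% becomes a 100% acceptance rate the moment it is thresholded, even though only about 60% of those hired actually succeed.

```python
import random

def amplification_demo(n=10_000, score=0.6, threshold=0.5, seed=0):
    """Every candidate receives a perfectly calibrated score of 0.6,
    i.e., 60% of them will actually succeed; thresholding at 0.5
    turns that calibrated uncertainty into a deterministic 'hire'."""
    rng = random.Random(seed)
    n_hired = sum(1 for _ in range(n) if score > threshold)   # all n are hired
    n_success = sum(1 for _ in range(n) if rng.random() < score)
    return n_hired, n_success

hired, successes = amplification_demo()
# all 10,000 candidates are hired, yet only roughly 60% turn out successful
```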

*Opaqueness* of a model

Although it has been quite some time since so-called AI/ML systems were put into practice, it is relatively recent that their complexity has greatly increased. When a system in practice exhibits such a high level of complexity, it is important for the provider, the users and those who are affected by such a system to be aware of the principles behind it. Unfortunately, this need for awareness is often dismissed with a variety of excuses: that it is difficult to know the full details of the working principles, that figuring them out is still under active research, or that they are a corporate secret. Of course it is a difficult scientific issue on its own, but what is needed in terms of transparency is not every single scientific and engineering detail but a high-level description of the working principles behind such systems and an understanding of their impact on the society (think of how ridiculous it would be if a car manufacturer refused to tell you the horsepower of a car you are considering on the grounds that you cannot possibly know all the details of the car, such as the minute details of its internal combustion engine.) Unless these (even high-level) details are provided together with these AI systems, the negative impact of such systems on the society will only be discovered once potentially irreversible damage has been done.

One promising direction I have observed in recent years is the proposal of model cards and datasheets for datasets: https://dl.acm.org/doi/abs/10.1145/3287560.3287596 and https://arxiv.org/abs/1803.09010. I wonder how many CEOs, CTOs and developers can answer the questions suggested in those papers about the AI systems they tout and the data used to build them. I’m not a particularly good example myself, but I believe the bar should be even higher for those who tout and deploy AI systems in the society.

*Selection bias* of data

It’s quite related to the previous point: it is important to think of how the data used for building an AI system was collected and created. Unfortunately, and perhaps surprisingly, this aspect of data has received relatively little attention compared to other adjacent, data-heavy fields (e.g., survey research), but the research community has begun to pay more attention to data itself and to notice various issues behind widely used datasets. For instance, Prabhu & Birhane (https://arxiv.org/abs/2006.16923) identified serious flaws and issues behind one of the most widely used image datasets, called TinyImages, from which the widely used CIFAR-10 was created. This has led to the removal of the TinyImages dataset, more than a decade after it was created and released. Although it is now removed, one must wonder how many AI systems were built using this data and deployed in practice. Gururangan et al. (https://arxiv.org/abs/1803.02324) found various issues (or artifacts, as they called them) in the Stanford natural language inference (SNLI) data, stemming from the process of data collection. These findings are the result of the combination of state-of-the-art AI/ML techniques and individual researchers’ manual efforts.

It’s not difficult to find news articles and academic papers bragging about the awesomeness of their AI systems. It is, however, more important for users and for people who are being (either intentionally or unintentionally) judged by such systems to know the properties and characteristics of these systems and to be able to trust the quality of the data and its collection process. It is thus imperative to invest more in this aspect of quality assurance than in the actual development of AI systems, in addition to continued research.

A recent work from FB demonstrates well the impact and importance of data and its collection: https://openaccess.thecvf.com/content_CVPRW_2019/html/cv4gc/de_Vries_Does_Object_Recognition_Work_for_Everyone_CVPRW_2019_paper.html. In this paper, the authors demonstrated that the accuracies of commercial object recognition systems correlate with the income levels of the regions in which pictures were taken. Hopefully, it doesn’t mean that the OCR service from Naver is less accurate for those who live in Jeollanam-do (which has the lowest per-capita GDP in Korea according to http://kosis.kr/statHtml/statHtml.do?orgId=101&tblId=DT_1C65) because the OCR system was trained mainly using data from Seoul and its metropolitan area (to be honest, I have no idea how Naver OCR is implemented, but I’m quite sure the majority of data used for building the system were collected from Seoul and its surrounding regions.)

To me, the human-and-machine-in-the-loop paradigm looks quite promising: https://arxiv.org/abs/1909.12434, https://arxiv.org/abs/1910.14599 and https://openreview.net/forum?id=H1g8p1BYvS. Although promising, it’s important to keep in mind that the outcome of such a paradigm heavily depends on how it is implemented, not to mention that some may suffer from its implementation. See for instance https://www.theverge.com/2019/2/25/18229714/cognizant-facebook-content-moderator-interviews-trauma-working-conditions-arizona.

*Correlation vs. Causation* & *systematic generalization*

Often we see people who claim this is *not* a problem of technology. Such a claim often arises from a lack of understanding of the fundamental goal of AI/ML. In particular, some equate the goal of AI/ML with estimating sufficient statistics from given data, which is simply not true.

In general, the goal of AI/ML is inductive inference, and according to Vapnik (https://www.wiley.com/en-us/Statistical+Learning+Theory-p-9780471030034), it is “an informal act [with] technical assistance from statisticians” (paraphrased). More recently, Arjovsky et al. (https://arxiv.org/abs/1907.02893) explicitly stated that “minimizing training error leads machines into recklessly absorbing all the correlations found in training data” and that this makes “machine learning [fail] to fulfill the promises of artificial intelligence.” In short, the goal of AI is to identify an underlying mechanism that is independent of (or invariant to) changing environments (often, but not always, a causal one) and to successfully generalize to a new environment, which is often referred to as out-of-domain (or systematic) generalization.

Sadly, most of the existing (widely used) ML algorithms fall short in this respect. See the first part of my recent talk for an example: https://drive.google.com/file/d/1CrkxcaQs5sD8K2HL2AWCMnrMRpFoquij/view. To overcome this inability, new paradigms have been proposed, such as meta-learning and invariant risk minimization, and there is an ongoing effort to marry causal inference from observational data with machine learning. See e.g. https://arxiv.org/abs/1911.10500, https://arxiv.org/abs/1901.10912 and https://arxiv.org/abs/1805.06826.

If you still insist that it is not an issue of the algorithm, which has merely faithfully captured correlations that exist in the data, I suggest you think once more about what AI/ML is and what its goal is.

]]>TL;DR: after all, isn’t $k$-NN all we do?

in my course, i use $k$-NN as a bridge between a linear softmax classifier and a deep neural net via an adaptive radial basis function network. until this year, i had considered only the special case of $k=1$, i.e., 1-NN, and from there moved on to the adaptive radial basis function network. this year, however, i decided to show them how $k$-NN with $k > 1$ could be implemented as a sequence of computational layers, hoping that this would help students understand the spectrum spanning between linear softmax classification and deep learning.

we are given $D=\left\{ (x_1, y_1), \ldots, (x_N, y_N) \right\}$, where $x_n \in \mathbb{R}^d$ and $y_n$ is an associated label represented as a one-hot vector. let us construct a layer that computes the nearest neighbour of a new input $x$. this can be implemented by first computing the activation of each training instance:

\begin{align*}
h^1_n =
\frac{\exp(-\beta \| x_n - x \|^2)}
{\sum_{n'=1}^N \exp(-\beta \| x_{n'} - x \|^2)}.
\end{align*}

in the limit of $\beta \to \infty$, we notice that this activation saturates to either $0$ or $1$:

\begin{align*}
h^1_n \to_{\beta \to \infty}
\begin{cases}
1, &\text{if $x_n$ is the nearest neighbour of $x$} \\
0, &\text{otherwise}
\end{cases}
\end{align*}

the output from this 1-NN is then computed as

\begin{align*}

\hat{y}^1 = \sum_{n=1}^N h^1_n y_n = Y^\top h^1,

\end{align*}

where $h^1$ is a vector stacking $h^1_n$’s and

\begin{align*}

Y=\left[

\begin{array}{c}

y_1 \\

\vdots \\

y_N

\end{array}

\right].

\end{align*}
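this 1-NN layer is easy to check numerically. a minimal numpy sketch (my own illustration, approximating the $\beta \to \infty$ limit with a large finite $\beta$):

```python
import numpy as np

def soft_1nn(X, Y, x, beta=100.0):
    """one soft 1-NN layer: radial-basis activations over the training
    set, followed by the label average yhat = Y^T h^1."""
    d2 = ((X - x) ** 2).sum(axis=1)   # squared distances to all x_n
    logits = -beta * d2
    logits -= logits.max()            # for numerical stability
    h1 = np.exp(logits)
    h1 /= h1.sum()                    # softmax over training instances
    return h1, Y.T @ h1

# toy check: with a large beta, h^1 saturates to a one-hot vector at the
# nearest neighbour, so the prediction is that neighbour's label
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
Y = np.eye(3)                         # one-hot labels
h1, yhat = soft_1nn(X, Y, np.array([0.9, 0.9]))
```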

this was relatively straightforward with 1-NN. how do we extend it to 2-NN? to do so, we define a new computational layer that computes the following activation for each training instance:

\begin{align*}
h^2_n =
\frac{\exp(-\beta (\| x_n - x \|^2 + \gamma h^1_n))}
{\sum_{n'=1}^N \exp(-\beta (\| x_{n'} - x \|^2 + \gamma h^1_{n'}))}.
\end{align*}

now we consider the limit of both $\beta\to \infty$ and $\gamma \to \infty$, at which this new activation also saturates to either 0 or 1:

\begin{align*}
h^2_n \to_{\beta, \gamma \to \infty}
\begin{cases}
1, &\text{if $x_n$ is the second nearest neighbour of $x$} \\
0, &\text{otherwise}
\end{cases}
\end{align*}

this magical property comes from the fact that $\gamma h_n^1$ effectively kills the *first* nearest neighbour’s activation when $\gamma \to \infty$. this term does not affect any non-nearest neighbour instances, because $h_n^1=0$ for those instances.

the output from this 2-NN is then

\begin{align*}

\hat{y}^2 = \frac{1}{2} \sum_{k=1}^2 \sum_{n=1}^N h^k_n y_n.

\end{align*}
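the suppression mechanism is easy to verify numerically. a small sketch (toy numbers of my own) showing that adding the $\gamma h^1_n$ term inside the exponent makes the second layer’s activation saturate at the second nearest neighbour:

```python
import numpy as np

def next_nn_layer(d2, h_prev, beta=100.0, gamma=1e6):
    """one layer of the construction: instances already selected by
    earlier layers are pushed away by the gamma * h penalty."""
    logits = -beta * (d2 + gamma * h_prev)
    logits -= logits.max()
    h = np.exp(logits)
    return h / h.sum()

d2 = np.array([1.62, 0.02, 2.42])      # toy squared distances |x_n - x|^2
h1 = next_nn_layer(d2, np.zeros(3))    # first layer: nearest neighbour (index 1)
h2 = next_nn_layer(d2, h1)             # second layer: index 1 is suppressed
```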

now you see what i’m getting at, right? let me generalize this to the $k$-th nearest neighbour:

\begin{align*}
h^k_n = \frac{
\exp(-\beta (\| x_n - x \|^2 + \gamma \sum_{k'=1}^{k-1} h^{k'}_n))
} {
\sum_{n'=1}^N \exp(-\beta (\| x_{n'} - x \|^2 + \gamma \sum_{k'=1}^{k-1} h^{k'}_{n'}))
},
\end{align*}

where we see some resemblance to residual connections (the previous layers’ activations are added in directly.)

In the limit of $\beta\to\infty$ and $\gamma \to \infty$,

\begin{align*}
h^k_n \to_{\beta, \gamma \to \infty}
\begin{cases}
1, &\text{if $x_n$ is the $k$-th nearest neighbour of $x$} \\
0, &\text{otherwise}
\end{cases}
\end{align*}

the output from this $K$-NN is then

\begin{align*}
\hat{y}^K = \frac{1}{K} \sum_{k=1}^K \sum_{n=1}^N h_n^k y_n,
\end{align*}

which is reminiscent of so-called deeply supervised nets from a few years back.

it is not difficult to imagine not taking the infinite limits of $\beta$ and $\gamma$, which leads to soft $k$-NN.

In summary, soft $k$-NN consists of $k$ nonlinear layers. Each nonlinear layer consists of radial basis functions with training instances as bases (nonlinear activation), and further takes as input the sum of the previous layers’ activations (residual connection.) each layer’s activation is used to compute the softmax output (self-normalized) using the one-hot label vectors associated with the training instances, and we average the predictions from all the layers (deeply supervised).
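putting the pieces together, here is a compact numpy sketch of the whole construction (my own illustration; the infinite limits are approximated with large finite $\beta$ and $\gamma$, so the toy distances are kept well separated to let the activations saturate):

```python
import numpy as np

def soft_knn(X, Y, x, K, beta=100.0, gamma=1e6):
    """K stacked soft layers: each layer takes the running sum of the
    previous layers' activations as a penalty (residual connection),
    and the predictions of all layers are averaged (deep supervision)."""
    d2 = ((X - x) ** 2).sum(axis=1)
    h_sum = np.zeros(len(X))           # sum of h^1, ..., h^{k-1}
    yhat = np.zeros(Y.shape[1])
    for _ in range(K):
        logits = -beta * (d2 + gamma * h_sum)
        logits -= logits.max()
        h = np.exp(logits)
        h /= h.sum()
        yhat += Y.T @ h                # each layer votes with its labels
        h_sum += h
    return yhat / K

# toy check: the three nearest neighbours of the origin are all class 0,
# so 3-NN should predict class 0 with (near-)unit weight
X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.0, 1.1]])
Y = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
yhat = soft_knn(X, Y, np.array([0.0, 0.0]), K=3)
```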

of course, this perspective naturally leads us to think of generalization in which we replace training instances with learnable bases across all $k$ layers and learn them using backpropagation. this is what we call *deep learning*.

[NOTE: I became aware that an extremely similar idea (though with some differences in how 1-NN is generalized to $k$-NN) was proposed in 2018 by Plötz and Roth at NeurIPS’18: https://papers.nips.cc/paper/7386-neural-nearest-neighbors-networks]

]]>