This time, this random stuff is contrastive learning. my thoughts on this were sparked by Lerrel Pinto’s message on #random in our group’s Slack, responding to the question “*What is wrong with contrastive learning?*” thrown out by Andrew Gordon Wilson. Lerrel said,

Lerrel Pinto (2021)

My understanding is that getting negatives for contrastive learning is difficult.

i haven’t worked on the (post-)modern version of contrastive learning, but every time i hear of “*negative samples*” i am reminded of my phd years, during which i mainly worked on the restricted Boltzmann machine, which defines a distribution over the observation space as

$$p(x; W, b, c) \propto \exp(x^\top b) \prod_{j=1}^J (1+\exp(x^\top w_{\cdot, j} + c_j)),$$

where $W$, $b$ and $c$ are the weight matrix, visible bias and hidden bias, respectively. for simplicity, i’ll assume the visible bias is $0$, which is equivalent to saying that the input is in expectation an all-zero vector. this makes the definition above a bit simpler, especially when we look at the log-probability:

$$\log p(x; W, c) = \sum_{j=1}^J \log (1+\exp(x^\top w_{\cdot, j} + c_j)) - \log Z,$$

where $\log Z$ is the log-partition function or log-normalization constant.
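as a concrete toy sketch of the unnormalized part of this log-probability (the shapes and values below are made up for illustration, and $\log Z$ is simply left out since it’s intractable in general):

```python
import numpy as np

def softplus(z):
    # numerically stable log(1 + exp(z))
    return np.logaddexp(0.0, z)

def unnormalized_log_prob(x, W, c):
    """sum_j log(1 + exp(x^T w_j + c_j)), i.e. log p(x; W, c) + log Z,
    with the visible bias assumed to be zero. the intractable
    log-partition function log Z is omitted here."""
    return softplus(x @ W + c).sum()

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))                 # 5 visible units, 3 hidden units (experts)
c = np.zeros(3)
x = rng.integers(0, 2, size=5).astype(float)
ulp = unnormalized_log_prob(x, W, c)
```

each hidden unit contributes one softplus term, which is exactly the “product of experts” view of the first equation above.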

the goal of learning with a restricted Boltzmann machine is then to maximize the log-probabilities of the observations (training examples):

$$\max_{W, c} \mathbb{E}_{x \sim D} [\log p(x; W, c)],$$

using stochastic gradient descent with the stochastic gradient derived to be

$$g_{\theta} = \sum_{j=1}^J \nabla_\theta \log (1+\exp(x^\top w_{\cdot,j} + c_j)) - \mathbb{E}_{x_- \sim p(x; W,c)} \left[\sum_{j=1}^J \nabla_\theta \log (1+\exp(x_-^\top w_{\cdot,j} + c_j))\right].$$

the first term ensures that each hidden unit (or expert) $j$ is well aligned with the correct observation $x$ drawn from the data distribution (or training set.) not too surprising, since the alignment (dot product) between the expert weight $w_{\cdot, j}$ and a given observation gives rise to the probability of $x$.

the second term corresponds to computing the expected negative energy (ugh, i hate this discrepancy; we maximize the probability but we minimize the energy) over all possible observations according to the model distribution. what this term does is to look for all input configurations $x_-$ that are good under our current model and to make sure the hidden units (or experts) are not well aligned with them.

you can imagine this as playing whac-a-mole. we try to pull out our favourite moles, while we “whac” any mole that’s favoured by the whac-a-mole machine.
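in a toy numpy sketch (binary RBM, zero visible bias, made-up shapes), the two phases of this gradient look like the following, using the fact that the gradient of $\log(1+\exp(x^\top w_{\cdot,j} + c_j))$ w.r.t. $w_{\cdot,j}$ is $\sigma(x^\top w_{\cdot,j} + c_j)\, x$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rbm_gradient(x_pos, x_neg, W, c):
    """stochastic gradient of log p(x): the positive phase pulls the
    experts toward a training example x_pos, while the negative phase
    "whacs" a sample x_neg favoured by the current model."""
    h_pos = sigmoid(x_pos @ W + c)   # expert activations on the data
    h_neg = sigmoid(x_neg @ W + c)   # expert activations on the negative sample
    gW = np.outer(x_pos, h_pos) - np.outer(x_neg, h_neg)
    gc = h_pos - h_neg
    return gW, gc

rng = np.random.default_rng(0)
W, c = rng.normal(size=(5, 3)), np.zeros(3)
x_pos = rng.integers(0, 2, size=5).astype(float)
x_neg = rng.integers(0, 2, size=5).astype(float)
gW, gc = rbm_gradient(x_pos, x_neg, W, c)
```

here `x_neg` is assumed to come from somewhere; the whole difficulty, as discussed next, is how to actually draw it from the model distribution.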

in training a restricted boltzmann machine, the major difficulty lies in how to efficiently and effectively draw negative samples from the model distribution. a lot of bright minds at the University of Toronto and the University of Montreal back then (somewhere between 2006 and 2013) spent years figuring this out. unfortunately, we (as a field) never got it to work well, which is probably not surprising, since we’re talking about sampling from an unnormalized (often discrete) distribution over hundreds if not thousands of dimensions. if it were easy, we would’ve solved most of the problems in ML already.
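one common (and only approximate) answer is block Gibbs sampling, as in contrastive divergence or its persistent variant; a minimal sketch for a binary RBM with zero visible bias, with made-up shapes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_step(v, W, c, rng):
    """one block Gibbs step: sample the hidden units given the
    visibles, then the visibles given the hiddens. chaining k such
    steps from a training example (CD-k) or from a persistent chain
    (PCD) yields approximate negative samples; exact samples would
    need far longer chains, which is precisely the difficulty."""
    h = (rng.random(W.shape[1]) < sigmoid(v @ W + c)).astype(float)
    v = (rng.random(W.shape[0]) < sigmoid(h @ W.T)).astype(float)
    return v

rng = np.random.default_rng(0)
W, c = rng.normal(size=(5, 3)), np.zeros(3)
v = rng.integers(0, 2, size=5).astype(float)
for _ in range(10):                  # k = 10 Gibbs steps
    v = gibbs_step(v, W, c, rng)
```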

let’s consider a stochastic transformation $T: \mathcal{X} \to \mathcal{X}$, where $\mathcal{X}$ is the input space. given any input $x \in \mathcal{X}$, this transformation outputs $\tilde{x} \sim T(x)$ that with high probability maintains the semantics of the original $x$. this is often used for data augmentation, which has been found to be a critical component of contrastive learning (or, as a matter of fact, of any so-called self-supervised learning algorithm).

imagine a widely used set of input transformations in e.g. computer vision. $T$ would include (limited) translation, (limited) rotation, (limited) color distortion, (limited) elastic distortion, etc. we often know these transformations in advance, and they are often domain/problem-specific.

what we will now do is to create a very large set of hidden units (or experts) by drawing transformed inputs from the stochastic transformation $T$ for one particular input $x$. that is, we have $J$-many $\tilde{x}_j \sim T(x)$. in the case of computer vision, we’ll have $J$-many possible distortions of $x$ that largely maintain the semantics of $x$.
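for a toy vector-valued example (the transformation below is made up; in vision it would be crops, flips, color jitter, etc.), drawing the $J$ experts looks like:

```python
import numpy as np

def T(x, rng):
    """a toy stochastic transformation: random scaling plus small
    additive noise, standing in for domain-specific augmentations."""
    return rng.uniform(0.9, 1.1) * x + 0.01 * rng.normal(size=x.shape)

rng = np.random.default_rng(0)
x = rng.normal(size=8)
J = 16
experts = np.stack([T(x, rng) for _ in range(J)])  # J transformed copies of x
# each row now plays the role of one hidden unit's weight vector w_j
```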

these hidden units then define a restricted Boltzmann machine and allow us to compute the probability of any input $x'$:

$$\log p(x' | \tilde{x}_1, \ldots, \tilde{x}_J) = \sum_{j=1}^J \log (1+\exp(s(x',\tilde{x}_j))) - \log Z,$$

where i’m now using a compatibility function $s: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ instead of the dot-product for more generality.

starting from here, we’ll make two changes (one relaxation and one restriction). first, we don’t want to use only $J$ transformed copies of the input $x$; we want to in fact use all possible transformed versions of $x$ out of $T$. in other words, we want to relax the constraint that this restricted Boltzmann machine has a finite number of hidden units. this turns the equation above into:

$$\log p(x' | x, T) = \mathbb{E}_{\tilde{x} \sim T(x)}\left[ \log (1+\exp(s(x',\tilde{x})))\right] - \log Z.$$

second, we will assume that the input space $\mathcal{X}$ coincides with the training set $D$, which has a finite number of training examples, i.e., $D=\left\{ x_1, \ldots, x_N \right\}$. this second change affects only the second term (the log-partition function):

$$\log p(x' | T(x)) = \mathbb{E}_{\tilde{x} \sim T(x)}\left[ \log (1+\exp(s(x',\tilde{x})))\right] - \log \sum_{n=1}^N \exp\left(\mathbb{E}_{\tilde{x} \sim T(x)}\left[ \log (1+\exp(s(x_n,\tilde{x})))\right]\right).$$
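a small numpy sketch of this quantity, with a made-up dot-product compatibility and a made-up noise transformation as stand-ins; the log-partition term is computed as a log-sum-exp over the training examples, so that exponentiating the per-example terms recovers weights that play the role of $p(x_n | T(x))$:

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)

def log_prob(x_prime, x, X_train, s, T, rng, M=32):
    """Monte Carlo estimate of log p(x' | T(x)) with support restricted
    to the training set: M transformed copies of x approximate the
    expectation over T(x), and the log-partition term is a log-sum-exp
    of the same quantity over all training examples."""
    copies = [T(x, rng) for _ in range(M)]
    def term(x_n):
        return np.mean([softplus(s(x_n, xt)) for xt in copies])
    log_Z = np.logaddexp.reduce([term(x_n) for x_n in X_train])
    return term(x_prime) - log_Z

# toy setup: hypothetical compatibility function and transformation
s = lambda a, b: float(a @ b)
T = lambda x, rng: x + 0.1 * rng.normal(size=x.shape)
rng = np.random.default_rng(0)
X_train = [rng.normal(size=4) for _ in range(6)]
lp = log_prob(X_train[0], X_train[0], X_train, s, T, rng)
```

since the support is the training set, `lp` is a proper (estimated) log-probability: it is never positive when `x_prime` is one of the training examples.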

to summarize what we’ve done so far: we build one restricted Boltzmann machine for a given input $x \in \mathcal{X}$ by drawing the hidden units (or experts) from the transformation distribution $\tilde{x} \sim T(x)$. the support of this restricted Boltzmann machine is restricted (pun intended) to the training set.

what would be a good training criterion for one such restricted Boltzmann machine? the answer is almost always maximum likelihood! in this particular case, we want to ensure that the original example $x$ is most likely under the restricted Boltzmann machine induced by itself:

$$\max_{\theta} \log p(x | T(x)),$$

where $\theta$ denotes the parameters of the compatibility function $s$ above.

we do so for all $N$ restricted Boltzmann machines induced from $N$ training examples:

$$\max_{\theta} \frac{1}{N} \sum_{n=1}^N \log p(x_n | T(x_n)).$$

since it’s decomposed over the training examples, let’s consider only one example $x \in D$. we then train the induced restricted Boltzmann machine with stochastic gradient descent, following

$$\frac{1}{M} \sum_{m=1}^M \nabla_{\theta} \log (1+\exp(s(x, \tilde{x}_m; \theta))) - \frac{1}{M} \sum_{m=1}^M \sum_{n=1}^N p(x_n|T(x)) \nabla_{\theta} \log (1+\exp(s(x_n, \tilde{x}_m; \theta))),$$

where we use $M$ transformed copies to approximate the two expectations over $T(x)$, but not $p(x_n|T(x))$. we should probably use another, independent set of $M$ transformed copies to get an unbiased estimate.

this does look quite similar to the more recently popular variants of contrastive learning. we start from a training example $x$, generate a transformed version $\tilde{x}$, maximize the compatibility between $x$ and $\tilde{x}$, and minimize the compatibility between $\tilde{x}$ and all the training examples (including $x$). there are minor differences, such as the choice of nonlinearity, but at a high level, it turns out we can derive contrastive learning from the restricted Boltzmann machine.

perhaps the only major difference is that this formulation gives us a clear guideline on how we should pick the negative examples. that is, according to this formula, we should either use all the training examples weighted according to how likely they are under this $x$-induced restricted Boltzmann machine or use a subset of training examples drawn according to the $x$-induced restricted Boltzmann machine without further weighting. of course, another alternative is to use uniformly-selected training examples as negative samples but weight them according to their probabilities under the $x$-induced restricted Boltzmann machine, *à la* importance sampling.
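the first of these options can be sketched directly (again with a hypothetical compatibility function and transformation as stand-ins): weight each candidate negative by its probability under the $x$-induced restricted Boltzmann machine, i.e. a softmax over the per-example terms:

```python
import numpy as np

def negative_weights(x, X_train, s, T, rng, M=8):
    """estimate p(x_n | T(x)) for every training example: a softmax
    over the M-sample estimates of E[log(1 + exp(s(x_n, x~)))].
    these weights can reweight all negatives, serve as a proposal
    for sampling a subset of hard negatives, or act as importance
    weights on uniformly drawn negatives."""
    copies = [T(x, rng) for _ in range(M)]
    terms = np.array([np.mean([np.logaddexp(0.0, s(x_n, xt)) for xt in copies])
                      for x_n in X_train])
    return np.exp(terms - np.logaddexp.reduce(terms))  # softmax

s = lambda a, b: float(a @ b)                          # hypothetical compatibility
T = lambda x, rng: x + 0.1 * rng.normal(size=x.shape)  # hypothetical transformation
rng = np.random.default_rng(0)
X_train = [rng.normal(size=4) for _ in range(6)]
w = negative_weights(X_train[0], X_train, s, T, rng)
# e.g. draw 2 hard negatives: rng.choice(len(X_train), size=2, p=w)
```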

so, yes, contrastive learning can be derived from restricted Boltzmann machines, and this is advantageous, because it tells us how we should pick negative examples. in fact, as i was writing this blog post (and an earlier internal slack message), i was reminded of a recent workshop i attended together with Yoshua Bengio. there was a talk on how to choose *hard* negative samples for contrastive learning (or representation learning) on graphs, and after the talk was over, Yoshua raised his hand and made this remark:

Yoshua Bengio (2019, paraphrased)

That’s called Boltzmann machine learning!

Indeed…

based on this exercise of deriving modern contrastive learning from restricted Boltzmann machines, we now have a meta-framework for coming up with a contrastive learning recipe. any recipe must consist of three major ingredients:

- **A per-example density estimator**: i used the restricted Boltzmann machine, but you may very well use variational autoencoders, independent component analysis, principal component analysis, sparse coding, etc. these will give rise to different variants of self-supervised learning. the latter three are particularly interesting, because they are fully described by a set of basis vectors and don’t require any negative samples for learning. i’m almost 100% certain you can derive all these non-contrastive learning algorithms by choosing one of these three.
- **A compatibility function** $s$: this is the part where we design a network “architecture”, and decide how the output from this network is used to compute a scalar that indicates how similar a pair of examples is. it looks like the current practice is to use a deep neural net with a cosine similarity to implement this compatibility function.
- **A stochastic transformation generator**: this generator effectively generates a density estimator for each example. this is very important, since it defines the set of bases used by these density estimators. any aspect of the data cannot be modelled if these generated bases do not cover it.

we have a pretty good idea of what kind of density estimator is suitable for various purposes. we have a pretty good idea of the best way to measure the similarity between two highly complex, high-dimensional inputs (thanks, deep learning!) but we cannot know what the right stochastic transformation generator should be, because it is heavily dependent on the problem and domain. for instance, the optimal transformation generator for static, natural images won’t be optimal for e.g. natural language text.

so, my sense is that the success of using contrastive learning (or any self-supervised learning) for any given problem will ultimately boil down to the choice and design of stochastic transformation, since there’s a chance that we may find a near-optimal pair of the first two (density estimator and compatibility function) that works well across multiple problems and domains.

- Ho-Am Prize & Scholarship for Macadamia at Aalto University
- Ho-Am Prize & 백규고전학술상 (Baek-Gyu Scholarly Award for Classics)
- Ho-Am Prize & Lim Mi-Sook Scholarship (임미숙 장학금) at KAIST

i graduated from the Korea Advanced Institute of Science and Technology (KAIST) with a Bachelor of Science (B.Sc.) degree. i majored in computer science, the subject i’ve never left since, having become a professor of computer science (and data science) in 2015. although my undergraduate years, in terms of education, were closer to failure than success (which is extremely visible on my transcript,) i thoroughly enjoyed my days at KAIST and have fond memories of the years i spent there.

although the whole field, including myself, has become much more aware of the issue of gender imbalance in computer science in recent years, it was already super-clear that this issue existed in computer science when i was in my undergraduate years. my memory is definitely failing me, but i recall there were fewer than five, if not four, female students out of approximately 60-70 students in my cohort. of course, this awareness did not mean that i felt any issue with it, nor was i compelled to do something about it. it just felt only natural back then that boys majored in computer science and girls in biology (yes, i’m simplifying quite a bit here, but this is how it seemed to me back then.)

perhaps this is precisely what my mom and others in the family felt back when i was born. before i was born, my mom was a teacher in a (junior) high school, teaching Korean. my mom and dad graduated from the same university for their undergraduate degrees, after which my mom became a teacher and my dad decided to pursue higher degrees, eventually becoming a professor of korean literature. clearly both of them had the same level of education up until a certain point, but at that point, mom gave up her career to raise me and my younger brother, who was born less than 2 years after me. again, i’m sure this was the choice that felt only natural back then.

unfortunately, it’s been about 20 years since i started my undergrad years at KAIST, and the issue of gender balance in computer science hasn’t gotten any better. in fact, this issue, which i didn’t even realize existed back then, turned out to be just the tip of the iceberg. the field of computer science, or perhaps more narrowly machine learning, is riddled with imbalances; gender imbalance, geographical imbalance (over-representation of north america, europe and east asia over other parts of the world), imbalance across races (6 black researchers out of more than 5,000 attendees of NeurIPS 2017, noticed by Timnit Gebru), and many more.†

these issues are somehow “discovered” each day, but the truth is that we are barely freeing ourselves from the social constructs that have blinded us, or have convinced us that these imbalances are only natural. this is just like how i never thought it was an issue that all boys majored in computer science while all girls majored in biology when i was a sophomore. this is just like why my mom quit her job to raise me and my brother more than 35 years ago, a choice i’m sure no one questioned then.

i don’t have any solution to this issue of social blindness, but one thing i have become aware of is that one cannot see what is not there for them to see. when i was one of the 90% or more of boys who majored in computer science 18 or so years ago, i couldn’t see the problem. when i was one of the 90% or so of non-black, male researchers attending ICML and NeurIPS over many years, i couldn’t see the problem. i mean, i was having beer, tequila, etc. non-stop together with Yann Dauphin, but i couldn’t see this near-complete lack of black researchers as a troubling trend at all. i only started to see these problems of equal access, equity, etc. when people started raising these issues and bringing them to my attention. in other words, the one remedy i know and have experienced myself is to create a diverse environment in which each individual can see and interact with diverse individuals and hear their stories.

so, as a small effort toward helping build such diverse environments, i have decided to donate approximately ₩100,000,000 KRW (≈ \$91,000 USD) to the Department of Computer Science, School of Computing at KAIST to create a small scholarship named after my mom (Lim Mi-Sook 임미숙) that will provide a small amount of supplement (≈ \$900) each to a small group of female students who major in computer science, at the beginning of each semester, until the fund runs out.^{∘} it’s not a lot, but it never hurts to have some extra allowance at the beginning of each semester. they might use it for buying a new iPad for either taking better notes in their classes or watching Netflix more comfortably. they might use it to hang out with their friends and have some nice meals. they might use it to pay for their hobbies.^{⊚} however they spend it, i only hope this would encourage them to continue their study in computer science and to encourage others to join computer science in the future, thereby contributing toward building a more diverse community of computer scientists (so that my little niece will eventually want to study computer science and be a computer scientist.) furthermore, i wish this will help us, including myself, more easily and readily see and break ourselves free from these social constructs/biases that unfairly disadvantage and harm subsets of population.

finally, here’s why i named it after my mom: although i structured this scholarship to be from my mom, it won’t let me or my mom answer how her career would’ve turned out had she not given up on it when i was born. it will, however, make all of us think more about the burden of raising children, which is often placed disproportionately on mothers, and how it should be better distributed among parents, relatives and society, in order to ensure and maximize equity in education, career development and advancement.

† more and more organizations and initiatives are founded to address these challenges, including Women in Machine Learning, Black in AI, etc. (see e.g. the Diversity, Equity and Inclusion page of ICLR’21.) these are organizations that make me proud to be a part of this research community.

∘ oh, and i asked the department to arrange a lunch between my parents and these students each semester. i think my parents will love talking with them, and i hope the students will also enjoy the lunch.

⊚ see my earlier post <Giving thanks: Samsung AI Researcher of the Year Award and Donation to Mila> for more of my thoughts on this.

- Ho-Am Prize & Scholarship for Macadamia at Aalto University
- Ho-Am Prize & 백규고전학술상 (Baek-Gyu Scholarly Award for Classics)
- Ho-Am Prize & Lim Mi-Sook Scholarship (임미숙 장학금) at KAIST

i’ve rarely mentioned my father in this blog, for no particular reason, but perhaps it’s a good time to talk about him briefly in this post.

his name is Kyu-Ick Cho (조규익), and he’s a professor of Korean Language and Literature at Soong-sil University in Seoul, Korea. perhaps unsurprisingly, i don’t know much about Korean language or literature, not to mention Korean *classical* literature and art, in which he is one of the world’s experts. i only know a few things i picked up here and there about his research as i grew up. unfortunately, i’m way out of my depth & breadth to even list what he has worked on, done and continues to work on, although i can point you to his homepage (http://kicho.pe.kr/), where you can find the ever-growing list of books and papers he has authored (warning: all in Korean).

one thing i can talk about is that, just from watching my father from the side, it’s helped me see the stark difference between how things work in engineering/science and in the humanities. when it comes to Korean classical literature and art research, intellectual curiosity and perhaps intellectual responsibility truly matter. you do not build anything new that may change the world. you do not discover something that may change the world. you do not learn skills that may make you valuable to for-profit organizations. your research is probably not supported by deep-pocketed industry, and if it is supported by the government, it is at a level that barely keeps you alive. it’s pretty much all about fulfilling your intellectual curiosity and carrying out your duty and responsibility as an academic.

although the korean economy has grown tremendously, this doesn’t necessarily translate to increased investment in humanities research, especially in those areas of the humanities that do not immediately translate to economic value. korean classical literature and art is clearly one such area, where no one expects any *return* on investment at any time. after all, it is *literature* and *art*, and perhaps worse yet, it is *classic*.

there are many negative consequences of such plateaued or shrinking investment that i’d love to talk about. in this post, however, let me stick to just one. that is, such lack of investment discourages (if not outright prevents) researchers from pursuing their intellectual curiosity and responsibility, thereby effectively serving as a death sentence for the field. to understand what i mean, imagine how you’d react if your kid announced they’ll pursue a PhD in Korean Literature.

perhaps surprisingly, i find it quite disturbing that we may be looking at a serious chance that there won’t be anyone who’ll study and research korean classical literature and art at some point not too far in the future. out of a few things that set us (humans) apart from other intelligent species, literature and art, which are closely related to each other with their boundary becoming fuzzier as we go back further in time, are clearly at the forefront of these unique features of us, and if we can’t afford to spare our effort & time in creating, enjoying and preserving these artifacts ourselves, what are we really doing here?

of course, despite this shrinking investment in korean classical literature & art research, researchers in this field have not given up, including my father. in order to build an environment that accommodates more junior and less established researchers in the field of korean classical literature & art, he founded a research center at Soong-Sil University, named the Center for Korean Literature & Art, in 2006 and has run it ever since. this research center has its own journal that publishes 3-4 issues each year. it hosts annual conferences gathering a small number of researchers dedicated to korean literature & art. it publishes many books each year. as far as i can tell, the center is not growing in terms of the number of people, but its activities, as well as the coverage of research areas within Korean classical literature and art, have steadily grown over the past decades.

so, yes, he is really trying hard together with a small number of his colleagues and peers. in fact, he’s been doing so ever since he started his career as a professor of korean language and literature in the mid-80’s, although from what little i’ve seen from the side, this has been an uphill battle. and now, with his retirement a year away, the future of korean classical literature and art does not look particularly bright.

when i was a kid, i recall one year (1996) when my father received two highly respected awards. one was the Do-Nam Award for Korean Literature Research (도남국문학상), and the other was the Seong-San Award ~~for Korean Classical Poetry Research~~ (성산~~시조~~학술상). obviously i wasn’t aware of how big a deal these awards were back then, nor do i know how big a deal they are even now. i could however feel that they must have been a big deal, because i could sense the pride in my father’s eyes when he broke the news. i even remember attending the ceremony for one of these awards (not sure if i attended both, though; my memory is failing me here.)

that was 25 years ago, when my father was still considered junior (i mean… it’s the field of Korean *classical* literature and art, where everyone is forever junior.) these prizes must’ve meant quite a bit, in that they not only recognized his own research but also encouraged him to advance it further. noticing that these two awards are always mentioned in his bios as well as his CVs, i presume i’m not too wrong about this.

unfortunately, it doesn’t look like either of these awards exists anymore. i could trace the Do-Nam award up to 2008, but i couldn’t find any information about it after that. in fact, i couldn’t even find the list of awardees from a few minutes of Googling (and Navering). the same goes for the Seong-San award: i could trace it up to 2003 or so, but again can’t find anything substantial about it. it’s quite a shame. two prominent ways to recognize and encourage researchers in the relatively narrow field of korean classical literature and art seem to have been lost over time (although these awards were not only for classical literature & art but recognized achievements in the broader field of korean literature.)

no individual will be able to save the whole field of korean classical literature and art. it’ll have to be the whole society’s effort to save this field, and along the way our soul as well. my father has devoted his entire career to this cause and will continue to do so even after his retirement, although his forecast becomes gloomier each time i talk with him. to this end, i’ve decided to contribute just a little myself to the effort of saving and perhaps even growing research in Korean classical literature and art by donating ₩100,000,000 (approx. $90,000 USD) to the Center for Korean Literature and Art, with the stipulation that it be used to create an award for Korean classical literature and art.

this award will be given to 1-2 researchers each year, with approximately $2,000-5,000 each (to be determined by the Center’s Board each year) until the fund runs out, with the hope that it can be used to recognize the achievements of, and encourage the future endeavors of, researchers in the field of Korean classical literature and art, just like what those two awards above did for my father and what the Ho-Am Prize is doing for me.

oh, right, i almost forgot to mention: i’ve also added one small condition that this award be named after my father’s pen name,^{*} 백규 (Baek-Gyu, 白圭). so this award, which will hopefully start being awarded next year (2022), will be called the Baek-Gyu Award in the field of Korean Classical Literature and Art (백규고전학술상).

* 호; i’m not sure what the right translation of this is in English. it’s a kind of nickname given by another, often a teacher or fatherly figure.

**Note**: This is the first in a series of up to three posts related to the Ho-Am Prize I was awarded this year.

- Ho-Am Prize & Scholarship for Macadamia at Aalto University
- Ho-Am Prize & 백규고전학술상 (Baek-Gyu Scholarly Award for Classics)
- Ho-Am Prize & Lim Mi-Sook Scholarship (임미숙 장학금) at KAIST

What an honour it has been to be a recipient of the Samsung Ho-Am Prize in Engineering this year (2021)! The Ho-Am Prize is one of the biggest and perhaps most recognized awards in Korea. Quoting the Ho-Am Foundation directly:

The Prize is presented each year to individuals who have contributed to academics, the arts, and social development, or who have furthered the welfare of humanity through distinguished accomplishments in their respective professional fields.

In particular, the Ho-Am Prize in Engineering is awarded to “*people of Korean heritage whose accomplishments have contributed to the development of industry for greater prosperity for humanity.*“

i’m quite certain that what i’ve done so far is nothing remotely close to contributing to either the development of industry or greater prosperity for humanity. but i take it that this Prize was awarded to me not for my individual achievement but to recognize “*what we have been able to collectively achieve over many decades in the field of deep learning and more broadly artificial intelligence and data science.*”^{*}

regardless of whether the Prize celebrates my own achievement or the set of achievements we have made collectively, it turned out that i am the one who receives a “*cash prize of KRW 300 million (approx. 275,000 USD)*”. I KNOW! this is the biggest cash prize i’ve ever received. in fact, i could even say this is by far the biggest chunk of money i’ve ever received at once, and the second largest one doesn’t even come close.

since i take it that this Prize recognizes our field rather than myself as an individual, i’ve decided to use this enormous cash prize not for myself but to serve the broader society. because it’s a pretty hefty prize, i’ll spend it in 2-4 distinct ways over the next few months, and in this post, i’ll share with you my first attempt at giving away this cash prize.

one of the most fortunate moments in my career so far was one day in Fall 2008. my friend (Yongwook) and i were taking a course designed for freshman students in a non-computer science major, when both of us were very, very, very far from our freshman years. perhaps obviously, we were always sitting at the very back of a large lecture hall with the sole goal of finally graduating from the university at some point. one day, Yongwook showed up a bit late, rushed into the lecture hall and sat down next to me. he then showed me a brochure (possibly the ugliest ever) he had picked up in front of the department office on his way to the lecture hall. it was a brochure sent to KAIST Computer Science by Aalto University (back then Helsinki University of Technology) about the (relatively) new international master’s program in **mac**hine learning **a**nd **da**ta **mi**ning. the program was named “**Macadami**a” (no idea where the final “a” comes from.)^{∘}

until then, i had never planned to continue my studies beyond my undergraduate degree, never thought of going abroad to study further, and never even imagined moving to Finland. but, somehow, there it was: the pamphlet from Finland, telling me about this master’s program in machine learning and data mining. within a few months, i was on a Finnair flight on my way to Helsinki (though i’ve never “lived” in Helsinki, only ever in Espoo.) and, until now, this was one of the best decisions, if not *the* best one, i’ve ever made in my whole life.

i still cherish the years i spent in Finland.

internationalization matters. just by talking with, hanging out with and simply listening to people from all over the world, we not only learn how others live, but we ourselves experience, understand and accept how others live all over the world. in doing so, we become more tolerant and open-minded. so, yes, internationalization matters, and we must strive to actively create an environment in which no group of people is marginalized and in which everyone is welcome and can interact with each other.

representation matters, from at least two aspects. first, representation self-reinforces. for instance, it’s quite difficult for me to imagine my little niece dreaming of becoming an AI researcher, because it’s not easy for me to see how she would find the field of artificial intelligence welcoming when the whole field is pretty much dominated by men. the only way to break this is to make sure all, truly all, are represented. second, representation is a path toward safety, equity and fairness in engineering and science. i might sound a bit like a broken record at this point, but, for instance, quite a few issues arising from deploying AI/ML systems could have been caught before deployment had those systems been developed and vetted by a team of developers that properly represents the diversity of society (see here for a few examples and pointers to original sources.) so, yes, representation matters in ensuring the safe, equitable and fair development and deployment of the systems we build.

compared to my experience back in Korea prior to joining Aalto University, Aalto University provided me with an environment that was much better internationalized and generally had better representation across various aspects. this greatly helped me broaden my view and perspective on a diverse set of topics, and really changed how i perceive the world in general. looking back, however, i must unfortunately say that my bar was very low.

Aalto University, and the Finnish society more broadly, also suffers from a (relative) lack of internationalization and diversity. i was in the “international” master’s program, which was taught (almost) entirely in English (if i recall correctly, Finnish 1 was required, and was perhaps unsurprisingly taught in a mix of English & Finnish) and attracted talents from all over the world. indeed, in my cohort, if i recall correctly, either all or all but one of my peers were from abroad, which allowed me to interact with them, learn from them and become friends with them. however, outside this program, along with a few other international master’s programs, it was reasonably rare to find non-Finnish students at Aalto University (well, at least in the School of Science and Engineering, back then.) there were certainly more non-Finnish, European students who were spending their exchange years, although there weren’t too many of them either.

Furthermore, within my cohort of Macadamia, if i recall correctly, there was one female student out of 12 or so students.^{#} this balance seems particularly bad, but the balance wasn’t much better among students or faculty members in computer science in general. i have no statistics in my hands now, but my personal experience tells me that gender balance was definitely better at Aalto CS than at KAIST CS, where i studied computer science in my undergrad years. this however does not mean that it was any good at Aalto CS, but just that my bar was very low.

as i’ve explored beyond Finland, i’ve seen, experienced and enjoyed places that are more internationalized and have a more balanced representation of a diverse population. Aalto University can and should do better, serving its students as well as Finland and, more broadly, the world, by further improving its internationalization and building an even more diverse campus.

here are two sides of my feeling toward Aalto University:

- my experience at Aalto University and Finland was simply amazing, and i want to contribute to making this experience available to a broader group of students from all over the world.
- Aalto, and more broadly Finland, could benefit even more from having a more diverse set of students, so that the whole society, and its members, continue to stay (and become even more) open-minded and tolerant.^{@}

these are not mutually exclusive nor mutually independent. in fact, one may say that these are essentially the same thing.

to this end, i’ve decided to donate €30,000, using the prize from the Ho-Am Award, to Aalto University School of Science, with the condition that it be used to support *female* students from *non-EU countries* who are entering the *Macadamia* program.^{$} similarly to my earlier donation to Mila, i’ve asked Aalto University to provide a one-time supplement of €1,000 each to approximately five such students each year. see here for the official announcement from Aalto University.

this is my small gesture of thanks to them for coming to Aalto University and Finland to study, which in turn improves internationalization and diversity at Aalto University and in Finland and makes this place even more awesome. €1,000 in Finland is definitely not much (thus a small if not tiny gesture), but i hope it helps, even a tiny bit, students enjoy Aalto University and Finland, just like i did many years back.

hello,

until i took the <Probabilistic Robotics> and <Artificial Intelligence> courses for the first time 13 years ago, i had never even heard of terms like machine learning, natural language processing, machine translation or artificial intelligence. it just so happened that a brochure for the machine learning master’s program at Helsinki University of Technology in Finland was lying in front of the department office; a senior student who was taking those courses with me handed it to me, and i took off to study in Finland without much of a plan.

the master’s program i entered assigned incoming students, at random and without any application, to labs within the department so that they could gain research experience one day a week. by chance i ended up in a group then called the Bayes Group which, despite its name, worked on neural networks. at the time i didn’t even know what a neural net was, nor what one could do with it. still, just belonging to a lab one day a week, learning how to do research and watching the other researchers work over their shoulders, was tremendously exciting.

perhaps because deep learning and artificial intelligence had not yet taken off the way they have now, even after finishing my master’s on this topic and continuing to a phd in the same department and the same group, i spent my graduate years happily, studying whatever i was curious about and trying new things myself, without any ambition of doing great research, writing great papers or making great inventions.

near the end of my graduate studies, i happened to attend a newly created AI conference called ICLR. as i remember it, it was a modest gathering of only 40-60 attendees. at breakfast on the first day, i happened to sit next to Prof. Yoshua Bengio of Montreal, and that breakfast led me to the University of Montreal as a postdoctoral researcher.

the day after i arrived in Montreal, Yoshua came over to where i was sitting and handed me four research topics. one of them was machine translation, and knowing nothing about it, i said i would work on it purely because it sounded fun. eight years have passed since, and all these chance choices have piled up to bring me here, receiving an award far beyond what i deserve.

preparing these remarks, it strikes me how large a role “chance” and “luck” have played in my studies and research career.

what if, 13 years ago, Yong-Wook had not happened to pick up that brochure and bring it to me? what if, 12 years ago, i had not happened to be assigned to the Bayes group? what if, 8 years ago, Yoshua Bengio had not happened to be sitting next to me at breakfast? what if, 8 years ago, i had picked a more familiar topic instead of machine translation out of the blue?

pondering the answers to these questions, i come to think that what i have achieved so far is not my achievement as an individual. i keep arriving at the conclusion that my accomplishments are merely one tiny piece among the countless accomplishments made together by all the researchers in the field that goes by many names: artificial intelligence, machine learning, data science and more.

considering that, within the great current of AI research, i have merely achieved, through a string of fortunate coincidences, something slightly more visible than others’ yet still infinitely small, i can only feel endlessly apologetic toward the senior, fellow and junior scientists of AI research for receiving such an undeserved award as an individual.

the ultimate goal of AI research is to answer questions that once seemed impossible to answer scientifically: what is intelligence, what is reason. it may look as though tremendous progress has been made in AI over the past years or decades, but we still have a long way to go before answering these fundamental questions, and frankly we are often at a loss as to which direction to push forward in.

nevertheless, i thank the Ho-Am Foundation for giving such a great award to our field, the field of AI research.

rather than a celebration of what has been achieved so far, i take it as encouragement and support for the professors, researchers, engineers and students, myself included, who are working day and night on AI research, to push further ahead.

daring to speak on behalf of everyone in the field of AI, i would like to thank the Ho-Am Foundation once again.

thank you.

* i know it’s weird to quote myself from another blog post, but i think i said it pretty well when i was asked about how i feel about this Prize earlier.

∘ if you want to know more about the origin and original design of the Macadamia program, see this report.

# i might be off by $N$ here. if any of my peers remembers the correct number, drop me a line so that i can fix it.

@ sadly Finland, as a whole, does not seem to share this goal with me. a few years back (a few years after i left Finland,) Finland introduced tuition for non-EU students enrolled in programs that are mainly taught in English, breaking its amazing tradition of providing free education to *all*. i seriously believe this was a mistake.

$ the name of the Macadamia program was changed to the “Master’s Programme in Computer, Communication and Information Sciences – Machine Learning, Data Science and Artificial Intelligence”. yes, they changed the name of Macadamia to include “Artificial Intelligence”, which would be the least surprising decision ever.

this is a slightly expanded version of my fb post: https://www.facebook.com/cho.k.hyun/posts/10216267975445626.

i’ve lived in three countries (finland, canada and the US) over the past 12 years as an expat/immigrant myself, which makes me pretty well aware of the issues and challenges faced by immigrants, in particular east asian ones, in these countries. this made me *incorrectly* believe that i knew the challenges and issues faced by immigrants everywhere beyond these three countries, including korea, where i was born and raised as a korean national and had lived for 20+ years. this was until i saw this post by Alice, where she shared a link to the homepage of the “*Hanmaum Education Volunteer Corp who helps children of immigrant families in challenging environments by providing free education*” (my own translation of an excerpt from the original post.)

how did i miss this? this glaringly obvious omission of immigrant kids from all those years i was growing up in korea. somehow, in all the schools i attended, i never had even a single peer who was a child of an immigrant family. realizing this was, and still is, quite a shock, considering that the number of immigrants, immigrant families and their children has only been growing over the past decades.

then, i realized it’s because i was born and raised near the center of the society. this has made me pretty much blind to the corners of the society, and all these immigrant moms (it’s also a bit concerning that it’s disproportionately immigrant “moms”) and their children were and are in those corners. it was this post by Alice and this effort by Emeritus Professor Byung-Gyu Choi of KAIST that barely made me take a glimpse at this corner. what a blind fool i have been, and what else am i blind to..?

last november (2020), i was invited to give an opening talk at SK ICT Tech Summit 2020, perhaps unsurprisingly together with Alice (i’m a huge fan!), and talked about my on-going project on breast cancer screening (see the recording of the talk here). SKT generously paid me a $6,000 lecture fee (and yes, it was super-generous; i rarely receive any lecture fee for my invited talks.) i’ve been thinking about how to spend it, and have decided to donate the entire sum to the Hanmaum education volunteer corp.

it’s not a lot, and it doesn’t come anywhere close to what the students and other volunteers on the ground contribute by educating these moms and kids. i hope however that this small gesture of mine will help immigrant parents & kids receive the education they truly deserve.

p.s. i’m quite proud to see my former visiting student and current good friend, Keunwoo, following my lead and showing others what to do.

NYU ran blended instruction this fall. each course was designated remote, in-person or blended before the semester started, taking into account its size (the number of students and of lectures per week) and its nature (whether in-person attendance was essential), and i taught mine in blended mode. in blended mode, lectures were held in person, and the number of lab sessions was increased to 2-3x the usual so that both in-person and remote sessions could be offered. all lectures and labs were livestreamed over zoom, so that students who could not come to New York and enrolled at one of NYU’s global campuses instead had no trouble following the course. students attending a lecture or lab session in person were assigned, before the semester, to pre-designated weeks and pre-designated seats, and masks were mandatory in all NYU facilities. each lecture admitted only 1/4-1/3 of the room’s maximum capacity, and instructors, without exception, wore masks while lecturing. my lectures could hold up to 25-30 students at a time, but in practice about 3-10 came in person and the rest attended via the zoom livestream.

at the same time, each department reassigned offices and rearranged desks so that faculty, postdocs and PhD students could return to their labs as needed. the NYU Center for Data Science, by reassigning offices and meeting rooms, gave every faculty member, postdoc and PhD student a single-occupancy room, trying to provide an environment in which students with relatively poor housing situations could comfortably focus on research.

undergraduates who wanted to could also move into the residence halls for the semester, and there too, residence hall reconfiguration minimized unnecessary contact among students. all on-campus dining (used mostly by undergraduates) was switched to pick-up only, and a system was put in place so that no desk or study space on campus could be used without a reservation.

as Prof. Jihoon Jeong wrote in his post, in order to build such an environment in the middle of Manhattan and avoid a covid-19 outbreak, NYU had every student and employee coming to the New York campus take PCR tests once or twice during the two weeks before the semester. employees living in New York were tested at the NYU Langone medical center, and students were tested en masse in tents set up near the residence halls.

after the semester started, every member of the community was required to take a saliva-based test once every two weeks (https://www.nyu.edu/life/safety-health-wellness/coronavirus-information/safety-and-health/coronavirus-testing/ongoing-testing.html). a reminder email arrives every two weeks; during that week you go to one of the 4-5 test collection points set up on campus, pick up a test kit, spit into it at home or in your office, and return it to a collection point. the result is available online 1-3 days later, and if no result is on record, card-key access to NYU buildings is blocked.

whenever someone tests positive, they immediately go into isolation and the university begins contact tracing. unfortunately the university’s contact tracing is limited to campus, and New York State handles tracing outside the university. that the latter does not work at all is, of course, a secret everyone knows. early in the semester there were signs of an outbreak in a residence hall, so 2-3 entire floors were quarantined and everyone was tested twice, and a larger outbreak was avoided.

through this process, about 15,000 of the roughly 60,000 members of the community returned to campus this semester, and the semester ended without interruption as of last week. according to the dashboard that has been updated in real time (https://www.nyu.edu/life/safety-health-wellness/coronavirus-information/nyc-covid-19-testing-data.html), a total of 199,870 tests have been run since August 1, of which 758 came back positive; including areas outside New York, about 1,000 cases were positive. the positivity rate of 0.38% is far lower than that of New York City itself.

this past spring in New York City, the big hospitals were hell, and outside the hospitals it was a ghost town. even now, New York State sees more than ten thousand confirmed cases and more than a hundred deaths every day. nevertheless, NYU kept educating its students without interrupting the semester, and New York City’s public schools also ran theirs apart from one temporary two-week pause, which puts my mind somewhat at ease. i hope that this coming spring, and next fall as well, schools will do whatever it takes to stay open and keep teaching properly, and as one member of a university i intend to do my best.

in New York, in other US states, in Korea, in Canada, in Europe, in most of the places whose news i follow to some degree, much of the worry goes to large corporations, building owners, real estate and the rich. these economic considerations, plus the kind of political calculations we have seen in the US, have had a large influence on how well, or how badly, each place weathers this pandemic. sadly, under these complicated considerations, education gets easily buried. the pandemic will end, but how long will the effects last on a generation that, during these 1-3 years, did not receive the education other generations did?

i wonder whether, with that one refreshing beer we enjoyed outside last week, we may have sacrificed the future of our society…

Aalto University (in particular the School of Science) and Finland just keep on giving, and I feel like I continue to receive without giving anything back. I will have to think of some way to pay back all that I have received from them.

Kiitos paljon!

Of course, the whole event was virtual, and due to the time difference, I could not attend myself. Instead, I sent the video recording of my greetings. You can watch it at https://youtu.be/074nhA9SQvA. I’m also attaching the script I used for recording this video below.

hi,

i received admission to the international master’s program in machine learning and data mining, which was called back then Macadamia, from Helsinki University of Technology in the spring of 2009.

although i applied to the program myself, finland was largely a land of mystery to me. perhaps this mysterious nature of the country may have been one of the major motivations for me to apply for this program in the first place. in my mind back then, finland was associated with just a couple of things, such as Nokia and Helsinki Olympics. i must confess that i wasn’t even aware that finland shared a border with russia. unsurprisingly, going to finland to study was definitely not what i had in my mind until one of my friends then handed me the brochure of the Macadamia program in the winter of 2008.

the very first lecture i attended at helsinki university of technology, which back then was about to be merged with two other universities to form aalto university, was for the course “Machine Learning: Basic Principles”. this course was taught by Tapani Raiko, who advised and mentored me for the next five years and whom i still admire and keep in touch with. in that very first lecture, i could immediately tell that i had made the right choice to be there to study machine learning. and, to this day, i still believe i made the right choice to be there at Aalto University to study machine learning and data mining.

as a part of the Macadamia program, some students were assigned to some of the labs within the department, which was back then information and computer science (ICS), to assist in research one day a week with a small stipend. the master’s program was back then still free to anyone from anywhere in the world, which, i sadly learned recently, is no longer the case. without tuition-free education, my decision to come to finland to study at Aalto University might have taken a very different course.

anyways, i was assigned to the Bayes group, which i do not believe exists anymore and which, despite its name, had a longer history of research in neural networks. the group was led back then by Prof. Juha Karhunen, who i believe has recently retired, together with Tapani and Prof. Alexander Ilin, who recently made a comeback to Aalto to re-build the Bayes group, though under a new name, “Deep Learning”. this part-time research gig at the then-Bayes group, which started in September 2009, was the beginning of my research career, which is still on-going.

i often wonder what i would’ve become had it not been for this program (called the “honours program” then, if i remember correctly), had i not been assigned to the Bayes group, or had i not been advised by Tapani and Alexander. it’s simply unimaginable. five years later, in March 2014, i defended my doctoral dissertation against my “opponent” Prof. Nando de Freitas, in front of my friends, colleagues and supervisors from the then-newly-formed Department of Computer Science of Aalto University School of Science.

over those five years, i spent many days and nights in Maarintalo, studying for exams and working on projects. over those five years, i spent many days and nights in the computer science building, working toward my dissertation. over those five years, i had an uncountable number of lunches at the cafeterias in the computer science building as well as the main building. over those five years, i met so many friends and colleagues, many of whom i still keep in touch with.

Aalto University gave me an enormous opportunity by bringing me to Finland and giving me rigorous education on machine learning. Furthermore, Aalto University had successfully created an international environment in which I could immerse myself among talents from all over the world and be inspired by them. These were just the beginning of the series of opportunities Aalto University School of Science had given me over those five years.

my phd years were generously supported by FICS (the finnish doctoral programme in computational sciences), which has since been discontinued and, i believe, replaced by HICT. near the end of my phd programme, i was given a chance, supported by FICS and Prof. Erkki Oja, to spend six months visiting the University of Montreal to broaden my view and to further learn from the very best in the world.

this research visit opened my eyes to a broader set of topics in machine learning, and in particular this visit was how and when i began to seriously delve into studying how machine learning and more broadly AI could be used for and improve natural language processing and machine translation. this research visit led me to join the University of Montreal as a postdoc in a lab which was called Lisa back then and is now called Mila, immediately after i defended my dissertation.

And, now, i am an associate professor of computer science & data science at New York University, running my own research lab and teaching machine learning to aspiring students from all over the world.

in my opinion, one of the most important roles served by higher education is to bring the best out of each student. what this implies is that higher education cannot simply shove down knowledge into students, and education cannot simply show easy, comfortable and convenient ways forward to students. education must strive to provide as diverse and broad a set of opportunities and perspectives to students as possible in order to ensure each and every student has a chance to discover their way forward.

What i experienced during my years at Helsinki University of Technology, which became Aalto University School of Science and Technology and eventually Aalto University School of Science, was precisely this: rigorous and thorough education, and a string of educational and extra-curricular opportunities within and beyond the walls of the university and even the country’s borders.

It is truly my honour to be named the alumnus of the year, and to be frank I am quite unsure whether i deserve it. off the top of my head, i can think of Prof. Alexander Ilin, who is now back at Aalto University. Dr. Tapani Raiko, who is now at Apple, is another obvious candidate. and, no, it’s a totally objective list. they just happened to have mentored me throughout my years at Aalto.

let me wrap it up by dusting off my finnish: Kiitos paljon!

i enjoyed answering those questions, because they made me think quite a bit about them myself. of course, as usual i ended up leaving only a short answer to each, but i thought i’d share them here in case any students in the future run into the same questions. although my answers are all quite speculative and based on experience rather than rigorously justified, what’s the fun in rigorously proven and well-known answers?

of course, there were many more questions asked and answered during the live lectures and in the chatrooms, but i cannot easily recall all of them, nor am i energetic enough after this unprecedented semester to go through the whole chat log to dig out the interesting ones. i just ask you to trust me that the list of questions below is a tiny subset of the interesting questions.

i will paraphrase/shorten the answers below and remove any identifying information (if any):

- Why was backprop controversial? Yann mentioned that one of the big things that made the use of ConvNets in various applications controversial was the use of backpropagation. backprop is just an application of the chain rule, so why would anyone be suspicious of using it?
- Professor LeCun said that mini-batch has no advantage over single-batch SGD besides being easier to parallelize, and online SGD is actually superior. Is there any other theoretical reason why single-batch is preferable?
- Why would we do batch normalization instead of normalizing the whole dataset all at once at first? Is it for when normalizing the whole dataset is too computationally expensive? I understood that normalization makes the optimization process easier through making the eigenvalues equal. However, if you’re only normalizing over the batch, your normalization for each batch is subject to noise and might still lead to bad learning rates for each dimension.
- Batch normalization in VAE: While implementing the convolutional VAE model, I noticed that removing these BatchNorm layers enabled the model to train as expected. I was wondering why does BatchNorm cause this issue in the VAE model?
- In semi-supervised VAE, how do we decide the embedding dimensions for the class? Also, BERT used position embedding to represent the position, so how do we determine the position embedding dimensions in BERT?
- Why do we divide the input to the softmax in dot product attention by the square root of the dimensionality?
- DL appears to add double descent as a caveat in addition to bias-variance tradeoff learned earlier. Do you have any insights on how we should think about double-descent?
- In your opinion, will we achieve AGI?

**1. Why was backprop controversial? Yann mentioned that one of the big things that made the use of ConvNets in various applications controversial was the use of backpropagation. backprop is just an application of the chain rule, so why would anyone be suspicious of using it?**

when yann said it was controversial to use backprop earlier, i believe he meant it in two different ways: (1) backprop itself, as a way to compute the gradient of the loss function w.r.t. the parameters, and (2) backprop as a shorthand for gradient-based optimization. i’ll explain a bit of each below, but neither is considered a serious argument against using backprop anymore.
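before going into each, it is easy to check numerically that backprop really is nothing but the chain rule. here is a minimal sketch (all numbers and names below are made up for illustration) that computes the gradient of a tiny two-layer net by writing out the chain rule layer by layer, and verifies one entry against a finite-difference estimate:

```python
import numpy as np

# a tiny two-layer net with loss L = 0.5 * (w2 . tanh(w1 x) - y)^2
rng = np.random.default_rng(0)
x, y = rng.normal(size=3), 1.0
w1, w2 = rng.normal(size=(4, 3)), rng.normal(size=4)

def loss(w1, w2):
    return 0.5 * (w2 @ np.tanh(w1 @ x) - y) ** 2

# forward pass, keeping the intermediate values
h_pre = w1 @ x                 # pre-activation
h = np.tanh(h_pre)             # hidden activation
err = w2 @ h - y               # residual

# backward pass: nothing but the chain rule, applied layer by layer
g_w2 = err * h                 # dL/dw2
g_h = err * w2                 # dL/dh
g_pre = g_h * (1.0 - h ** 2)   # through the derivative of tanh
g_w1 = np.outer(g_pre, x)      # dL/dw1

# finite-difference check on one entry of w1
eps = 1e-6
w1p = w1.copy()
w1p[0, 0] += eps
fd = (loss(w1p, w2) - loss(w1, w2)) / eps
assert abs(fd - g_w1[0, 0]) < 1e-4
```

the backward pass simply multiplies local derivatives in the reverse order of the forward pass; there is nothing mysterious about the mechanics themselves, which is why the controversy is about the two points below rather than about the chain rule.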

(1) backprop was controversial, and is still under great scrutiny, when artificial neural nets (what we learn) are compared against biological neural nets (what we have). it’s quite clear, due to biological constraints, that backprop is not implemented in brains the way it is in our deep learning toolkits (see e.g. https://openreview.net/forum?id=HJgPEXtIUS for some of the interesting biological constraints/properties that should be satisfied by any biologically plausible learning algorithm.) to some people, this is a make-or-break issue, because there seems to exist a learning algorithm that results in a superior neural net (human brains!) of course, this could just mean that a biological brain approximates the gradient computation as well as it can under its constraints, but it’s not easy to verify this (see, e.g., https://www.youtube.com/watch?v=VIRCybGgHts for how a brain might implement backprop.)

another criticism or objection along this line is that biological brains seem to have either zero or multiple objectives that are being optimized simultaneously. this is unlike our usual practice in deep learning where we start by defining one clear objective function to minimize.

(2) gradient-based optimization often refers to a set of techniques developed for (constrained/unconstrained) convex optimization. when such a technique is used for a non-convex problem, we are often working with a local quadratic approximation; that is, around any point in the space, the underlying non-convex objective function can be approximated by a convex quadratic function ($\theta^\top H \theta + g^\top \theta + c$.) under this assumption, gradient-based optimization is attracted toward the minimum of this local quadratic approximation, regardless of whether there exists a better minimum far away from the current point. this is often used as a reason for criticizing the use of gradient-based optimization with a non-convex objective function, and thereby the use of backprop. see e.g. http://leon.bottou.org/publications/pdf/online-1998.pdf for an extensive study of the convergence properties of SGD.
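this attraction toward a nearby minimum is easy to see in one dimension. here is a toy sketch (a made-up non-convex function, chosen only for illustration) where gradient descent, started on the side of the worse of two minima, settles there and never sees the better minimum across the barrier:

```python
# f has two minima: a shallow one near t ≈ +0.96 and the global one near t ≈ -1.04
f = lambda t: (t ** 2 - 1) ** 2 + 0.3 * t
df = lambda t: 4 * t * (t ** 2 - 1) + 0.3

t = 0.5                  # start on the side of the worse minimum
for _ in range(500):
    t -= 0.05 * df(t)    # plain gradient descent

assert 0.9 < t < 1.0     # settled into the nearby local minimum...
assert f(t) > f(-1.04)   # ...which is worse than the global one far away
```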

this criticism however rests on one big assumption: that there is a big gap in quality between the nearby local minima (we’ll talk about this in a few weeks in the course) and the global minimum. if there is a big gap, this would indeed be trouble, but what if there isn’t?

it turns out that we’ve known for a few decades already that most local minima are of reasonable quality (in terms of both training and test accuracies), as long as we make neural nets larger than necessary. let me quote Rumelhart, Hinton & Williams (1986):

“The most obvious drawback of the learning procedure is that the error-surface may contain local minima so that gradient descent is not guaranteed to find a global minimum. However, experience with many tasks shows that the network very rarely gets stuck in poor local minima that are significantly worse than the global minimum. We have only encountered this undesirable behaviour in networks that have just enough connections to perform the task. Adding a few more connections creates extra dimensions in weight-space and these dimensions provide paths around the barriers that create poor local minima in the lower dimensional subspaces.” (from <Learning representations by back-propagating errors>)

this phenomenon has been and is being studied quite extensively from various angles. if you’re interested in this topic, see e.g. http://papers.nips.cc/paper/5486-identifying-and-attacking-the-saddle-point-problem-in-high-dimensional-non-convex-optimization and https://arxiv.org/abs/1803.03635 for some recent directions. or, if you feel lazy, you can see my slides at https://drive.google.com/file/d/1YxHbQ0NeSaAANaFEmlo9H5fUsZRsiGJK/view which i prepared recently.

**2. Professor LeCun said that mini-batch has no advantage over single-batch SGD besides being easier to parallelize, and online SGD is actually superior. Is there any other theoretical reason why single-batch is preferable?**

this is an interesting & important question, and the answer varies from one expert to another, Yann and myself included, based on what is implicitly assumed and what criteria are used to decide which is preferred (computational efficiency, generalization accuracy, etc.)

Yann’s view is that the noise in SGD greatly helps generalization: it prevents learning from getting stuck at a sharp local minimum and drives learning toward a flatter local minimum, which implies that the final neural net is more robust to perturbations of the parameters. this naturally translates to robustness to perturbations of the input, implying better generalization. under this perspective, you want to maximize the level of noise, as long as the noise roughly cancels out on average across all the stochastic gradients computed from the training examples. that corresponds to using just one training example for computing each stochastic gradient.

of course, the amount of noise, which is proportional to the variance of the stochastic gradient, does impact the speed at which learning happens. in recent years, we (the community of deep learning researchers) have found that certain network architectures require stochastic gradients computed from large minibatches (though it’s unclear what “large” means, as it’s quite relative to the size of the training set) to be trained at all. in these cases, it looks like a high level of noise sometimes prevents any progress in learning, especially in the early stage.
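the dependence of the gradient noise on the minibatch size is easy to demonstrate numerically. below is a toy sketch (a made-up one-dimensional quadratic loss) showing that the variance of the minibatch stochastic gradient shrinks roughly like 1/B as the batch size B grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy per-example losses l_i(w) = 0.5 * (w - x_i)^2, so the per-example
# gradient at w is simply (w - x_i); the full-batch gradient is w - mean(x)
data = rng.normal(size=100_000)
w = 0.0

def grad_variance(batch_size, n_draws=2_000):
    # variance of the minibatch stochastic gradient across many random draws
    gs = [np.mean(w - rng.choice(data, size=batch_size)) for _ in range(n_draws)]
    return np.var(gs)

v1, v64 = grad_variance(1), grad_variance(64)
# noise variance shrinks roughly like 1/B: batch 64 is far less noisy than batch 1
assert v64 < v1 / 16
```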

so, in short, it’s still an open question. yann’s perspective may turn out to be the correct one (and that wouldn’t be the first time,) or we may find a completely different explanation in the future.

**3. Why would we do batch normalization instead of normalizing the whole dataset all at once at first? Is it for when normalizing the whole dataset is too computationally expensive? I understood that normalization makes the optimization process easier through making the eigenvalues equal. However, if you’re only normalizing over the batch, your normalization for each batch is subject to noise and might still lead to bad learning rates for each dimension.**

there are three questions/points here. let me address each separately below:

“*normalization makes the optimization process easier through making the eigenvalues equal*“

we need to specify what kind of normalization is meant, but in general, it’s not possible to make the hessian the identity by simply normalizing the input. this is only possible when we are considering a linear network with a specific loss function (e.g., l2 loss for regression or cross-entropy for classification.) however, it is empirically known, and in some cases rigorously shown, that normalizing the input variables to be zero-mean and unit-variance brings the condition number (the ratio between the largest and smallest real eigenvalues of the hessian matrix) close to 1 (which is good.)
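a quick numerical illustration of this effect, using a made-up linear regression design (for the l2 loss of a linear model, the hessian is $X^\top X / n$): standardizing badly scaled inputs brings the condition number down to nearly 1:

```python
import numpy as np

rng = np.random.default_rng(0)
# inputs with wildly different scales and a nonzero mean
X = rng.normal(size=(1000, 3)) * np.array([1.0, 10.0, 100.0]) + 5.0

def cond(X):
    # hessian of the l2 loss of a linear model is X^T X / n
    H = X.T @ X / len(X)
    eig = np.linalg.eigvalsh(H)      # eigenvalues in ascending order
    return eig[-1] / eig[0]

Xn = (X - X.mean(0)) / X.std(0)      # zero-mean, unit-variance per dimension
assert cond(Xn) < cond(X)            # conditioning improves dramatically
assert cond(Xn) < 2.0                # close to 1 for roughly independent inputs
```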

“*why we would do batch normalization instead of normalizing the whole dataset all at once at first?*“

now, in the case of a network with multiple layers, it turns out that we can maximize the benefit of normalization by normalizing the input to each layer to be zero-mean and unit-variance. unfortunately, this is not trivial, because the input to each layer changes as the lower layers’ weights and biases evolve. in other words, if we wanted to keep the input to each layer normalized, we would need to sweep through the entire dataset every time we update the weight matrices and bias vectors, which would be intolerably expensive. furthermore, renormalizing the input at a lower layer changes the input to the upper layers, ultimately causing the loss function to change dramatically each time we renormalize all the layers, likely making learning impossible. this is, though, addressable to a certain degree (see http://www.jmlr.org/proceedings/papers/v22/raiko12/raiko12.pdf by Tapani Raiko, my phd advisor, and Yann LeCun.)

“*your normalization for each batch is subject to noise*“

this is indeed true, and that’s precisely why it’s customary practice to keep running averages of the mean and variance of each dimension in batch normalization. assuming that the parameters of the network evolve slowly, these running averages ultimately converge to the population mean and variance.
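for concreteness, here is a minimal batchnorm sketch keeping such running averages (no learnable scale/shift, and all hyperparameters below are made up for illustration): streamed over many minibatches, the running statistics approach the population mean and variance.

```python
import numpy as np

rng = np.random.default_rng(0)

class BatchNorm1d:
    """minimal batchnorm (no learnable scale/shift) with running statistics."""

    def __init__(self, dim, momentum=0.1, eps=1e-5):
        self.run_mean = np.zeros(dim)
        self.run_var = np.ones(dim)
        self.momentum, self.eps = momentum, eps

    def forward(self, x, training=True):
        if training:
            mu, var = x.mean(0), x.var(0)
            # exponential moving averages, to be used at test time
            self.run_mean += self.momentum * (mu - self.run_mean)
            self.run_var += self.momentum * (var - self.run_var)
        else:
            mu, var = self.run_mean, self.run_var
        return (x - mu) / np.sqrt(var + self.eps)

bn = BatchNorm1d(4)
for _ in range(500):  # a stream of "training" minibatches with mean 3, variance 4
    bn.forward(rng.normal(loc=3.0, scale=2.0, size=(32, 4)))

# the running statistics settle near the population mean and variance
assert np.allclose(bn.run_mean, 3.0, atol=0.4)
assert np.allclose(bn.run_var, 4.0, atol=1.0)
```

at test time one calls `forward(x, training=False)`, so each example is normalized with the accumulated statistics rather than with whatever minibatch it happens to arrive in.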

**4. Batch normalization in VAE: While implementing the convolutional VAE model, I noticed that removing these BatchNorm layers enabled the model to train as expected. I was wondering why does BatchNorm cause this issue in the VAE model?**

i don’t have a clear answer unfortunately, but i can speculate a bit on why this might be the case. my answer depends on where batchnorm was used. of course, before reading the answer below, make sure your implementation of batchnorm doesn’t have a bug.

if batchnorm was used in the approximate posterior (encoder), it shouldn’t really matter, since the approximate posterior can be anything by definition. it can depend not only on the current observation $x$, but on anything else that helps minimize the KL divergence from this approximate posterior to the true posterior. so, i wouldn’t be surprised if it’s totally fine leaving batchnorm in the encoder.

if batchnorm was used in the decoder, it may matter, as the likelihood distribution (generative distribution) is over the observation space $\mathcal{X}$ conditioned on the latent variable configuration $z$. with batchnorm, instead, the decoder is conditioned on the entire minibatch of latent variable configurations, that is, also on the latent variable configurations of the other examples. this may hinder optimization in the early stage of learning (in the later stage, it shouldn’t really matter much.)

in general, batchnorm is a tricky technique and makes it difficult to analyze SGD, because it introduces correlation across per-example stochastic gradients within each minibatch.

**5. In semi-supervised VAE, how do we decide the embedding dimensions for the class? Also, BERT used position embedding to represent the position, so how do we determine the position embedding dimensions in BERT?**

this question can be answered from two angles.

a. network size

the embedding dimensionality is a part of a neural net, and choosing it can be thought of as part of determining the size of your neural network. it’s a good rule of thumb to use as large a neural net as you can within your computational and financial budget to maximize your gain in terms of generalization. this might sound counter-intuitive if you have learned from earlier courses that we want to choose the most succinct model (according to the principle of occam’s razor,) but in neural nets, it’s not simply the size of the model, but the choice of optimization and regularization that matters perhaps even more. in particular, as we will learn next week, SGD inherently works in a low-dimensional subspace of the parameter space and cannot explore the whole space of the parameters, so a larger network does not imply that it’s more prone to overfitting.

b. why more than one dimension?

let’s think of the class embedding (though, the same argument applies to positional embedding.) take as an example handwritten digit classification, where our classes consist of 0, 1, 2, .., 9. it seems quite natural that there’s a clear one-dimensional structure behind these classes, and we would only need a one-dimensional embedding. why then do we need a multi-dimensional class embedding?

it turns out that there are multiple degrees of similarity among these classes, and that the similarity among these classes is context-dependent. that is, depending on what we see as an input, the class similarity changes. for instance, when the input is a slanted 3 (a 3 significantly rotated clock-wise), it looks like either 3 or 2 but not 8 nor 0. when the input is a straight-standing 3, it looks like either 3 or 8 but not 2. in other words, the classes 3 and 2 are similar to each other when the input is a slanted 3, while the classes 3 and 8 are similar to each other when the input is an upright 3.

having multiple dimensions to represent each class allows us to capture these different degrees of similarity among classes. a few dimensions in the class embeddings of 3 and 2 will point toward a similar direction, while a few other dimensions in the class embeddings of 3 and 8 will point toward another similar direction. when the input is a slanted 3, the feature extractor (a convolutional net) will output a vector that will emphasize the first few dimensions and suppress the other dimensions to exploit the similarity between 3 and 2. a similar mechanism would lead to a feature vector of an upright 3 that would suppress the first few dimensions and emphasize the latter few to exploit the similarity between 3 and 8.

it’s impossible to tell in advance how many such degrees of similarity exist and how to encode them. that’s why we need to use as high-dimensional an embedding as possible for encoding any discrete, one-hot input.
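to make this mechanism concrete, here is a toy illustration with entirely made-up numbers (nothing here is learned): the first two embedding dimensions encode the “slanted-stroke” similarity between 3 and 2, and the last two encode the “double-loop” similarity between 3 and 8, and the feature extractor’s output gates which dimensions matter.

```python
# hypothetical 4-dimensional class embeddings (made-up numbers):
# dims 0-1 encode "slanted-stroke" similarity (3 ~ 2),
# dims 2-3 encode "double-loop" similarity (3 ~ 8)
emb = {
    2: [0.9, 0.8, 0.0, 0.1],
    3: [0.8, 0.9, 0.9, 0.8],
    8: [0.1, 0.0, 0.8, 0.9],
}

def scores(feature):
    # dot product between the extracted feature and each class embedding
    return {c: sum(f * e for f, e in zip(feature, v)) for c, v in emb.items()}

# a slanted 3: the feature emphasizes the first two dimensions,
# so 3 and 2 score high while 8 scores low
slanted = scores([1.0, 1.0, 0.1, 0.1])

# an upright 3: the feature emphasizes the last two dimensions,
# so 3 and 8 score high while 2 scores low
upright = scores([0.1, 0.1, 1.0, 1.0])
```

in both cases the correct class 3 scores highest, but which *other* class it is confusable with flips depending on the input, which a one-dimensional embedding could not express.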

**6. Why do we divide the input to the softmax in dot product attention by the square root of the dimensionality?**

This question was asked at one of the office hours, and Richard Pang (one of the TA’s) and i attempted to reverse-engineer the motivation behind the scaled dot-product attention in the transformer.

assume each key vector $k \in \mathbb{R}^d$ is a sample drawn from a multivariate, standard Normal distribution, i.e., $k_i \sim \mathcal{N}(0, 1^2).$ given a query vector $q \in \mathbb{R}^d$, we can now compute the variance of the dot product between the query and key vectors as $\mathbb{V}[q^\top k] = \mathbb{V}[\sum_{i=1}^d q_i k_i] = \sum_{i=1}^d q_i^2 \mathbb{V}[k_i] = \sum_{i=1}^d q_i^2$. in other words, the variance of each logit is the squared norm of the query vector.

assume the query vector $q$ is also a sample drawn from a multivariate, standard Normal distribution, i.e., $q_i \sim \mathcal{N}(0, 1^2)$. in other words, $\mathbb{E}[q_i]=0$ and $\mathbb{V}[q_i]=\mathbb{E}_{q_i} \left[(q_i - \mathbb{E}[q_i])^2\right] = \mathbb{E}_{q_i} \left[ q_i^2 \right] = 1$. then, the expected variance of the logit ends up being $\mathbb{E}_{q} \left[ \mathbb{V}[q^\top k] \right] = \mathbb{E}_{q} \sum_{i=1}^d q_i^2 = \sum_{i=1}^d \mathbb{E}_{q_i} \left[ q_i^2 \right] = \sum_{i=1}^d 1 = d.$

we can now standardize the logit to be $0$-mean and unit-variance (or more precisely, we make the logit’s scale to be invariant to the dimensionality of the key and query vectors) by dividing it with the standard deviation $\sqrt{\mathbb{E}_q \mathbb{V}[q^\top k]}=\sqrt{d}.$
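a quick Monte Carlo sanity check of this argument, under the same Normality assumptions (plain Python, with an arbitrarily chosen $d$):

```python
import math
import random

random.seed(0)

d = 64       # dimensionality of the key/query vectors
n = 5000     # number of Monte Carlo samples

# sample q and k from standard Normals and record the raw and scaled dot products
dots, scaled = [], []
for _ in range(n):
    q = [random.gauss(0.0, 1.0) for _ in range(d)]
    k = [random.gauss(0.0, 1.0) for _ in range(d)]
    s = sum(qi * ki for qi, ki in zip(q, k))
    dots.append(s)
    scaled.append(s / math.sqrt(d))

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

var_raw = variance(dots)       # grows linearly with d (close to 64 here)
var_scaled = variance(scaled)  # close to 1, regardless of d
```

changing `d` changes `var_raw` proportionally, while `var_scaled` stays near 1, which is exactly the invariance the $\sqrt{d}$ scaling buys us.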

these assumptions of Normality do not hold in reality, but as we talked about it earlier, Normality is one of the safest things to assume when we don’t know much about the underlying process.

As Ilya Kulikov kindly pointed out, this explanation doesn’t answer “why” and instead answers “what” scaling does. “why” is a bit more difficult to answer (perhaps unsurprisingly,) but one answer is that softmax saturates as the logits (the input to softmax) grow in their magnitudes, which may slow down learning due to the vanishing gradient. though, it’s unclear what’s the right way to quantify it.

**7. DL appears to add double descent as a caveat in addition to the bias-variance tradeoff learned early on. Do you have any insights about how we should think about double descent?**

The so-called double descent phenomenon is a relatively recently popularized concept that’s still being studied heavily (though, it was observed and reported by Yann already in the early 90s. see, e.g., https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.66.2396 and also https://iopscience.iop.org/article/10.1088/0305-4470/25/5/020 by Krogh and Hertz.) The issue I have with double descent in deep neural nets is that it’s unclear how we define model capacity. the # of parameters is certainly not the best proxy, because the parameters are all heavily correlated and redundant. perhaps it should be the number of SGD steps, because we learned that the size of the hypothesis space is in fact a function of the number of SGD steps.

One particular proxy I find interesting and convincing is the fraction of positive eigenvalues of the Hessian at a solution. With this proxy, it looks like the apparent double descent phenomenon often lessens. see e.g. https://arxiv.org/abs/2003.02139.

So, in short, the model capacity is a key to understanding the bias-variance trade-off or more generally generalization in machine learning, but is not a simple concept to grasp with deep neural networks.

**8. In your opinion, will we achieve AGI?**

Of course, I’m far from being qualified to answer this question well. Instead, let me quote Yann:

<An executive primer on artificial general intelligence> by Federico Berruti, Pieter Nel, and Rob Whiteman

Yann LeCun, a professor at the Courant Institute of Mathematical Sciences at New York University (NYU), is much more direct: “It’s hard to explain to non-specialists that AGI is not a ‘thing’, and that most venues that have AGI in their name deal in highly speculative and theoretical issues…

[Updated on Nov 30 2020] added a section about the scaling law w.r.t. the model size, per request from Felix Hill.

[Updated on Dec 1 2020] added a paragraph referring to Dauphin & Bengio’s “Big Neural Networks Waste Capacity“.

[Updated on Feb 8 2021] see “Learning Curve Theory” by Marcus Hutter for a better exposition of the scaling law and where it might be coming from.

this is a short post on why i **thought** (or more like imagined) the scaling laws from <scaling laws for autoregressive generative modeling> by Henighan et al. “[are] inevitable from using log loss (the reducible part of KL(p||q))” when “the log loss [was used] with a max entropy model“, which was my response to Tim Dettmers’s tweet on “why people are not talking more about the OpenAI scaling law papers“. thanks to João Guilherme for bringing this to my attention. it’s given me a chance to run some fun thought experiments over the weekend, although most, if not all, of them failed as usual with any ideas and experiments i have. anyhow, i thought i’d leave here why i thought so, particularly from the perspective of dataset size.

- The scaling law for Bernoulli w.r.t. the dataset size
- The scaling law for Bernoulli w.r.t. the model size
- The scaling law for Bernoulli w.r.t. the compute amount
- Final thoughts

instead of considering a grand neural autoregressive model, i’ll simply consider estimating the mean of a Bernoulli variable after $N$ trials, and compare the log loss at this point against the log loss computed after $N+\Delta$ trials. let’s start by writing down the loss value after $N$ trials:

$$

-L(N) = p^* \log \frac{N_1}{N} + (1-p^*) \log \frac{N-N_1}{N} = p^* \log N_1 + (1-p^*) \log (N-N_1) - \log N,

$$

where $p^*$ is the true ratio of heads and $N_1 < N$ is the number of heads from the $N$ trials.

let’s now consider tossing the coin $\Delta$ more times. i will use $\Delta_1 < \Delta$ as the number of additional heads after these additional trials. what’s the loss after $N+\Delta$ trials?

$$

-L(N+\Delta) = p^* \log (N_1 + \Delta_1) + (1-p^*)\log(N+\Delta - N_1 - \Delta_1) - \log (N+\Delta).

$$

so far so good. now, what kind of relationship between these two quantities $L(N)$ and $L(N+\Delta)$ do i want to get? in my mind, one way to say there’s a power-law-like structure behind $L$ is to show that the amount of improvement i get by running $\Delta$ more trials decreases as the number of existing trials $N$ increases. that is, there’s a diminishing return from a unit effort as more effort has been put in.*

then, let’s look at their difference by starting from the loss at $N+\Delta$, while assuming that $\Delta \ll N$ (and naturally $\Delta_1 \ll N_1$ as well) so that i can use $\log (1+x) \approx x$ when $x$ is small:

$$

\begin{align*}

-L(N+\Delta) =& p^* \log (N_1 + \Delta_1) + (1-p^*)\log(N+\Delta - N_1 - \Delta_1) - \log (N+\Delta)
\\
=& p^* \log \left[ N_1 \left(1+ \frac{\Delta_1}{N_1}\right) \right] + (1-p^*) \log\left[(N-N_1)\left(1 + \frac{\Delta - \Delta_1}{N-N_1}\right)\right] - \log \left[ N \left(1+ \frac{\Delta}{N}\right) \right]
\\
\approx& \underbrace{p^* \log N_1 + (1-p^*) \log (N-N_1) - \log N}_{=-L(N)} + p^* \frac{\Delta_1}{N_1} + (1-p^*)\frac{\Delta - \Delta_1}{N-N_1} - \frac{\Delta}{N}.

\end{align*}

$$

The decrease in the loss by running $\Delta$ more trials can now be written as

$$

L(N) - L (N+\Delta) = p^* \frac{\Delta_1}{N_1} + (1-p^*)\frac{\Delta - \Delta_1}{N-N_1} - \frac{\Delta}{N}.

$$

since $\Delta_1 < \Delta$ and $N_1 < N$, let’s rewrite them as $\Delta_1 = \beta \Delta$ and $N_1 = \alpha N$, where $\alpha \in [0,1]$ and $\beta \in [0,1]$. then,

$$

L(N) - L (N+\Delta) = p^* \frac{\beta \Delta}{\alpha N} + (1-p^*) \frac{(1-\beta)\Delta}{(1-\alpha)N} - \frac{\Delta}{N} = \frac{\Delta}{N} \left(p^* \frac{\beta}{\alpha} + (1-p^*)\frac{1-\beta}{1-\alpha} - 1\right).

$$

this says that the change from the loss at $N$ to the loss at $N+\Delta$ is inversely proportional to $N$ itself, which is what i wanted to see from the beginning. there were a few leaps of faith along the way, but it looks like the more tosses i have made (i.e., large $N$), the smaller the change i can make to my loss with a constant number of extra tosses.
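the closed form above is easy to check numerically. in the sketch below (plain Python, with arbitrarily chosen $p^*$, $\alpha$ and $\beta$), the drop in loss divided by $\Delta/N$ should approach the constant $p^* \frac{\beta}{\alpha} + (1-p^*)\frac{1-\beta}{1-\alpha} - 1$ as $N$ grows, i.e., the drop itself shrinks like $1/N$:

```python
import math

def loss(p_true, heads, total):
    """log loss of the empirical estimate heads/total under the true Bernoulli(p_true)."""
    p_hat = heads / total
    return -(p_true * math.log(p_hat) + (1.0 - p_true) * math.log(1.0 - p_hat))

# arbitrarily chosen numbers for illustration:
# the extra tosses reflect the truth a bit better (beta > alpha, p_true > alpha)
p_true, alpha, beta = 0.7, 0.65, 0.70
delta = 100.0

# the predicted multiplicative constant from the derivation above
const = p_true * beta / alpha + (1.0 - p_true) * (1.0 - beta) / (1.0 - alpha) - 1.0

ratios = []
for n in [1e4, 1e5, 1e6]:
    drop = loss(p_true, alpha * n, n) - loss(p_true, alpha * n + beta * delta, n + delta)
    ratios.append(drop / (delta / n))  # the drop, in units of delta/N
# as N grows, each ratio approaches `const`, so the raw drop decays like 1/N
```

with these numbers the constant is about 0.011, and the computed ratios converge to it as $N$ increases, so the raw improvement from the same $\Delta$ extra tosses decays like $1/N$, as derived.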

the second (multiplicative) term is more complicated, and i find it easier to think of two extreme cases; $p^*=1$ and $p^*=0$. these cases are reasonable if we think of this exercise as a proxy to studying classification, where it’s often assumed that a given input either belongs to one (positive) or the other (negative) class in an ideal world. when $p^*=1$, the second term reduces to

$$

\frac{\beta}{\alpha} - 1~~

\begin{cases}

> 0, & \text{if } \beta > \alpha \\

< 0, & \text{if } \beta < \alpha \\

= 0, & \text{if } \beta = \alpha

\end{cases}

$$

in other words, if the extra tosses reflected the true distribution better ($\beta > \alpha$, because the true positive rate is $1$,) the loss dropped. otherwise, the loss increased ($\alpha > \beta$) or stayed the same (i.e., no additional information was added.) the other extreme case of $p^* = 0$ works similarly.

what’s important is that this second term largely dictates the sign of how the loss changes with the extra $\Delta$ tosses. since we are considering only the ratios of the heads within sets of trials and (suddenly!) assume both $N$ and $\Delta$ are reasonably large, the magnitude of change is instead largely determined by the ratio between $\Delta$ and $N$, with $N$ in the denominator.

so, this is how i arrived at my shallow take on twitter that these scaling laws may not have too much to do with whether we use neural net parameterization or not, whether we are solving language modeling, machine translation, etc., nor whether we are working with text, image or both. “i think” it arises naturally from the maximum entropy formulation (you can think of estimating the log-frequency of the heads above with sigmoid/softmax to turn it into the Bernoulli distribution) and the log loss.

of course, because i had to make a number of leaps of faith (or to put it another way, a few unreasonable assumptions,) it’s possible that this actually doesn’t make much sense at the end of the day. furthermore, i’m super insecure about my math in general, and i’m about 99.9% sure there’s something wrong in the derivation above. hence, why “i think” the scaling law arises from log loss (cross-entropy) and maximum entropy models.

it’s important for me to point out at this point that Henighan et al. did much more than what i’ve discussed in this post and provide a much more extensive set of very interesting findings. they looked not only at the effect of the data size, but also at the compute budget $C$ and the model size $|\theta|$. in fact, they focus much more on the latter two than the former, which was my focus here.

in the case of the model size, it’s quite trivial to map it to the argument i made above regarding the number $N$ of observations. let’s consider the model size $|\theta|$ in this context of recovering Bernoulli as the number of bits (with an arbitrary base, including $e$) allowed to represent $N$ and $N_1$ (and consequently, $\Delta$ and $\Delta_1$.) then, the maximum $N$ a model can count up to is $\exp(|\theta|)$, and by increasing the model size by $\delta$ (i.e., $|\theta|+\delta$,) we can toss the coin

$$

\exp(|\theta|) \exp(\delta) - \exp(|\theta|) = \exp(|\theta|) (\exp(\delta) - 1)

$$

more. in other words, increasing the size of the model, while assuming that we can run as many tosses as we can to saturate the model capacity, is equivalent to setting $\Delta$ above to $\exp(|\theta|) (\exp(\delta) – 1)$.

in this case, the first term in the change in the loss above reduces to

$$

\frac{\Delta}{N} = \frac{\exp(|\theta|) (\exp(\delta) - 1)}{\exp(|\theta|)} = \exp(\delta) - 1,

$$

which is weird, because the dependence on $N = \exp(|\theta|)$ disappeared. that is, the change in the loss w.r.t. the increase in the model size (the number of bits) is not dependent on the number of existing bits used by the model.

what is happening here? in my opinion, this implies that the # of parameters in a neural net, or increasing it, is **not** optimally done in terms of compression.

what if we instead assume that only a polynomial number of trials can be compressed, i.e., $N=|\theta|^c$? in particular, for the sake of simplicity, let’s assume $c=2$. in this case,

$$

\frac{\Delta}{N} = \frac{(|\theta|+\delta)^2 - |\theta|^2}{|\theta|^2} = 2\frac{\delta}{|\theta|} + \left(\frac{\delta}{|\theta|}\right)^2,

$$

and voila! we recovered the dependence on the model size $|\theta|$, and this dependence is inversely proportional, as expected. by further assuming that $\delta \ll |\theta|$, we end up with

$$

\frac{\Delta}{N} \approx 2 \frac{\delta}{|\theta|}.

$$
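here is a tiny numerical check of the two capacity assumptions (the values of $|\theta|$ and $\delta$ are arbitrary): under the exponential assumption the relative gain $\Delta/N$ does not depend on $|\theta|$ at all, while under the quadratic assumption it decays roughly like $2\delta/|\theta|$:

```python
import math

def gain_exp(theta, delta):
    # N = exp(theta): relative gain Delta/N is exp(delta) - 1, independent of theta
    return (math.exp(theta + delta) - math.exp(theta)) / math.exp(theta)

def gain_poly(theta, delta):
    # N = theta**2: relative gain Delta/N is 2*delta/theta + (delta/theta)**2
    return ((theta + delta) ** 2 - theta ** 2) / theta ** 2

delta = 1.0
# exponential capacity: the same gain no matter how big the model already is
g_small, g_big = gain_exp(10.0, delta), gain_exp(100.0, delta)
# polynomial capacity: the gain shrinks as the model grows
p_small, p_big = gain_poly(10.0, delta), gain_poly(100.0, delta)
```

`g_small` and `g_big` are both $e - 1$, while `p_big` is ten times smaller than `p_small` (up to the second-order term), matching the two closed forms above.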

so, what does it say about the observation by Henighan et al. that there is a scaling law w.r.t. the model size? i suspect that their observation is telling us that the deep nets we use are far from optimal in the sense of compressing data. it could be due to the choice of architectures, due to our choice of learning algorithms or even due to the regularization techniques we use. it’ll be interesting to pinpoint what’s behind this sub-optimality.

as i was writing the last paragraph, i was reminded of this earlier workshop paper by Yann Dauphin & Yoshua Bengio from the workshop track of ICLR’13, titled “Big Neural Networks Waste Capacity.” in this work, they observed the “rapidly decreasing return on investment for capacity in big networks” and conjectured this is due to the “failure of first order gradient descent.” perhaps, Yann was onto something, although i don’t think he’s followed up on this.

in the case of the compute budget, i have absolutely no idea, but i wonder if a similar argument to the one for the model size could be made. the number of SGD steps largely dictates the maximum magnitude of the weights in a neural net. the resolution (?) of the computed probability is largely determined by the maximum magnitude of (or the variance of individual weights in) the final weight matrix (that feeds into the final softmax). perhaps we can connect these two to show that more SGD updates allow our neural net to more precisely identify the target probability. of course, this suggests that different optimization strategies may result in radically different scaling laws.

assuming what i wrote above makes even the slightest bit of sense, this raises two interesting questions, in my opinion. first, is all a sophisticated neural net does counting examples? the strict answer is no, because it both counts and compresses. it however looks as if it’s compression without any interesting emergent property (such as systematic generalization). second, how does this property change when we move away from the maximum entropy formulation and log loss? i’ve pointed out two directions that look promising in a tweet earlier: the margin ranking loss by Collobert & Weston and the entmax series by Martins and co. if the scaling law changes, will it change in a desirable direction?

let me wrap up by thanking Henighan et al. and Kaplan & McCandlish et al. for thought-provoking pieces that have made me look at these models and problems i’ve been working with all along from a very different angle.

(*) of course the other (more positive) way to look at it is that there’s always more to be learned if we are ready to invest as much as we have invested already.

Earlier this month (Nov 2020) at the Samsung AI Forum 2020 I was one of the five recipients of the inaugural Samsung AI Researcher of the Year Award by the Samsung Advanced Institute of Technology (SAIT). Samsung has been supporting my research ever since I was a postdoc at Mila in Montreal, and without their support I wouldn’t have been able to support all my PhD students (NSF, i’m looking at you!) Because of this prolonged support, I had been already grateful to Samsung even before this award, and I am now even more thankful. It was also a humbling experience for me because of my fellow awardees, Seth Flaxman, Chelsea Finn, Cho-Jui Hsieh, and Jiajun Wu, who are so much more awesome than I am. Thanks to Seth’s suggestion, we are now all on each other’s whatsapp, which is another perk I got out of this award.

**Detour**: Before I continue to talk about this award, let me just briefly share with you my experience as having been living abroad in three different places (Helsinki, Montreal and NYC) that speak three different languages (Finnish, French and English) as an expat and in particular as a student expat, over the past ten years or so. In short, it’s not easy. It’s not easy in many ways, but one that I felt as most challenging was this feeling I had whenever I moved to a new place that I have to stay alert, watch my account balance and prepare for the worst until I fully settle down and get used to this new city and country. Even then, there’s a nagging feeling that I am only a temporary resident here and that I must be prepared to leave immediately without any hesitation if I’m forced to or decide to.

You can literally see this stress in newly arriving students or more broadly expats who are not financially well off. They have a difficult time appreciating beauty and joy in a new place, not to mention enjoying them. Even if this new town is filled with awesome restaurants, they wouldn’t fancy the idea of dining at those restaurants. Even if the city is surrounded by amazing tourist destinations, they wouldn’t spare their time to visit them unless their parents come visit. Their places are often light on furniture, and even the furniture they get is on the cheapest end of the spectrum: in fact, a lot of them don’t even buy a full bed but just a cheap mattress placed on the floor.

Even in my case, where I have been relatively well off financially for a newly arriving student/postdoc, i’ve never bought a couch ever since i left my parents’ place (don’t worry, i’m planning to do so shortly,) and i bought a bed with a box spring for the first time only when I moved to NYC as a new faculty member. It took my parents’ visit after my second year in Finland for me to travel to Rovaniemi and other tourist destinations in Finland and neighbouring countries (and let me tell you: there aren’t so many.) It took a workshop at NRC Canada for me to visit Ottawa when I was in Montreal, and an invitation by Hugo Larochelle to U. Sherbrooke for me to visit Quebec City (I know.. it’s not on the way to Sherbrooke, but I took a detour.) Even when I could afford it, it took several walk-by’s before I could mentally prepare myself to dine in at a reasonably fancy (but not that fancy…) place, and it still does.

That’s the weirdest thing: most of these I could afford back then and can certainly afford now. However, even if I could afford it, even if I knew it would improve how I live, and even if I knew that would make my days more comfortable, a lot of things felt much less accessible and looked overly and unnecessarily luxurious. I’ve experienced this stress, although I’ve thoroughly enjoyed and never regretted moving to and living in these places, been financially stable for most of my expat years and haven’t had any dependent to support. One begins to wonder how challenging it must be for others (and you!) who may be in worse situations.

**Back to the award**: this award comes with a generous $30,000 USD monetary prize^{1} (!) And, no, it’s not paid to the university for me to use to support my research; it is paid directly to me. In other words, I’m free to do whatever i want with this $30,000 that sprang out of nowhere. should i finally buy a couch? well, i could, but i can buy it without this prize money. should i buy a car? well, i live in manhattan. should i go on a luxury vacation? well, pandemic…

After a brief period of pondering, i’ve decided to donate the prize money^{2} to Mila, where I was a postdoc for 1.5y + a visiting student for 0.5y. More specifically, i’ve decided to donate the prize money to Mila on the condition that it is used to provide a *one-time cash supplement* of up to $1,500 CAD to each incoming *female* student/postdoc arriving from *Latin America*, *Africa*, *South Asia*, *South East Asia* or *Korea*, until the donation runs out. I hope this supplement gives students, who have just arrived in Montreal to start a new chapter of their lives, a bit of breathing room. Perhaps they can use it to enjoy a dinner at a nice restaurant in Montreal. Perhaps they can go out with their new friends and family for beer. Perhaps they can buy not just a mattress but a proper bed. it’s not for me to determine what lets them relax a bit in the midst of settling down in a new environment, and I just hope this is helpful in whatever way suits them best.

I thoroughly enjoyed my time at Mila (which was, to be precise, called Lisa back then,) and have greatly benefited from spending my time there as a postdoc. i cannot imagine where i would be had i not been a postdoc at Mila. And, I hope this small gesture of mine helps a diverse group of incoming students/postdocs from all corners of the world have a more enjoyable time at Mila and benefit from their time there as much as, if not more than, i have.

**Why female students from these regions (Latin America, Africa, South Asia, South East Asia and Korea)?** our field has an issue of representation in many aspects. we have an issue of gender representation. we have an issue of geographical representation. we have an issue of educational background/discipline representation. we have many more issues of representation in other aspects. All these issues of representation are equally important and critical, and I know that these are not just pipeline issues, based on my experiences of meeting amazing talents while teaching at Deep Learning Indaba 2018, Khipu.AI 2019, SEAML 2019, Deep Learning UB 2019 and the African Master’s Programme in Machine Intelligence (AMMI). these are often issues of opportunity and support. I believe we need to take even a little action at a time rather than waiting to address all of them simultaneously. in this particular case, I decided to give a minuscule shot at addressing a couple of these issues; the lack of female representation and the limited representation of researchers and students from Latin America, Africa, South Asia and South East Asia (I added Korea because the prize came from a Korean company :))

Also, perhaps a bit selfishly, i want to make sure there’ll be a role model my niece can look up to in the field of AI when she’s older.

(1) they also sent me this awesome plaque, but i don’t think Mila would appreciate it as a donation.

(2) i’ve decided to donate $35,000 CAD after setting aside a bit for tax. after all, i’ve been paying more federal tax than the president for quite some time already and am expecting to pay some more this coming tax season.
