My grandmother, always hearty and cheerful, would tell me the same two stories every time we spoke on the phone. The first was about how, as a baby, I cried so plaintively, day and night, without rest. I was too young to remember any of it, but how much must I have cried for this story to survive while her more recent memories faded with the years? Seeing how my uncles and aunts, now grandparents themselves, still remind me of how much I cried whenever they see me, I really must have cried a lot.

The other story is a single scene from amid all that crying. When I finally began to walk on my own, my grandmother opened the door for the first time and let me walk around the yard. Something tells me it was a fine day with a clear blue sky. She said I walked slowly, looking up at the sky and down at the ground, mumbling nonstop something no one could make out. To her it sounded as if I were saying: how fascinating the world is, this is amazing, and that is amazing too. Of course, I was just a baby and remember none of it, but every time I hear this story, I wonder what I was mumbling about.

I have heard these stories dozens of times, and yet whenever my grandmother tells them, I want to hear them again ..

with these numbers, we anticipate a much lower reviewing load for each area chair/reviewer this year. it is however impossible for us to be certain, especially in the case of senior area chairs and area chairs. that is why we don’t provide any option for senior area chairs and area chairs to preemptively reduce their reviewing load. instead, if there’s any particular request, we ask senior area chairs and area chairs to reach out to the program chairs directly to discuss the right way to adjust their reviewing load individually.

this is however not the case with reviewers. we already provide reviewers with an option to request a reduced reviewing load on OpenReview. unfortunately this option is less visible, evident from the non-stop stream of request emails we’re receiving (yes… my inbox is now … totally filled up and overflowing ..) so, here are detailed instructions on how you can request a reduced reviewing load yourself *without* emailing me.

1. Decline the initial reviewer invite: in order to request a lower reviewing load, you need to click the “DECLINE” link in the original invite email, as shown in the screenshot below:

2. Click “OK” when prompted by OpenReview with “You have chosen to decline this invitation. Do you want to continue?”

3. You will be redirected to a landing page. click “Request reduced load” at the bottom of the landing page.

4. You can then choose the reduced load from {1, 2, 3, 4} in the following page. You can choose one that best suits you and click “Submit”.

And, that’s it!

We strive to ensure no reviewer is overloaded with a huge number of assignments and also all submissions receive a proper level of attention from reviewers, area chairs and senior area chairs. This cannot be done without your service, and we greatly appreciate it.

to be specific, i will use $p(y|x)$ to indicate that this is a distribution over all possible answers $\mathcal{Y}$ returned by a machine learning model $f$ given an input $x$. this is distinguished from a predictive distribution computed directly by that model $f$, which i will denote by $p_f(y|x)$. these two, $p(y|x)$ and $p_f(y|x)$, differ in that the former takes into account uncertainty that cannot be captured by the machine learning model $f$, while the latter captures only the uncertainty that $f$ itself can capture. this distinction will be made clearer later in this post.

of course, neither of these two needs to be an actual probability given an input and an arbitrary target $(x,y)$; it can just be an arbitrary scalar $-E(x,y) \in \mathbb{R}$, since we can turn this into a probability by

$$p(y|x) = \frac{\exp(-E(x,y))}{\int_{\mathcal{Y}} \exp(-E(x,y')) \mathrm{d}y'}.$$
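for a discrete $\mathcal{Y}$, this normalization is just a softmax over negative energies. a minimal sketch (the function name and numbers are made up for illustration):

```python
import numpy as np

def energy_to_prob(energies):
    """turn arbitrary scalar energies E(x, y) over a discrete Y into p(y|x).

    p(y|x) is proportional to exp(-E(x, y)); we subtract the max for
    numerical stability before exponentiating (log-sum-exp trick)."""
    neg_e = -np.asarray(energies, dtype=float)
    neg_e -= neg_e.max()
    p = np.exp(neg_e)
    return p / p.sum()

p = energy_to_prob([1.0, 2.0, 3.0])   # lower energy -> higher probability
```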

what does it mean for us to use a predictive distribution rather than a single-point prediction? this is equivalent to saying that there are a set of answers that we consider likely. then, here comes a natural, follow-up question: why are there many likely answers, not only one? one way to answer this question is to say that there exists uncertainty in the answer. then, here’s the next follow-up question: *where does this uncertainty come from*? this is the question i’ll try to answer by enumerating what i can imagine as the sources of uncertainty in this post.

instead of talking about irreducible (was it aleatoric?) and reducible (was it epistemic?) uncertainty, i’ll just be very much down to earth and talk about some of the sources of uncertainty that i believe we should think of.

before i continue, let me clarify what i mean by $y$ here. $y$ is one of all possible answers. in the case of classification, $y$ is one of all possible classes. in the case of multi-label classification (many binary classifiers,) $y$ is one of all possible combinations, i.e., $y \in \{0, 1\} \times \cdots \times \{0, 1\}$. in other words, we do not have to worry too much about dependencies between different dimensions of $y$, although this makes it tricky to think of continuous $y$ (it does reveal what i’m interested in, doesn’t it?)

under this setup of $y$, i will care about the probability assigned to $y$ rather than the variance of the probability assigned to $y$. this arises from my desire to consider only those $y$’s that receive reasonably high probabilities. among these reasonably probable $y$’s, those that are more highly probable are also the ones that tend to (but not always for sure) have lower variance (just because the probability is bounded between $0$ and $1$.) in other words, we care about how many highly plausible answers there are and what they are.

of course, you can replace $\mathbb{E}$ with $\mathbb{V}$ below to get the variance of the probability assigned to $y$ rather than the average. perhaps it’s a good idea then to use some combination of the mean and variance, similarly to using e.g. upper-confidence bound in various active learning setups. but, well, i’m writing a blog post not a book here.

the **first** source of noise that comes to my mind is our use of a finite number of examples, for both training and evaluation. even if there exists a single correct answer $y^*$ for an input $x$, it may be impossible to precisely identify this correct answer $y^*$ given only a finite number of examples from which our machine learning model learns. even worse, different answers may look more likely when different sets of training examples are used.

it is always reasonable to assume our learning algorithm can only work with a finite number of examples (even in the most optimal case, it will be bounded by how long Google thrives and survives …) let’s say we always use $K$-many training examples drawn from a single data distribution $p_{\mathrm{data}}$. the uncertainty arising from this finite nature of data can be written as

$$p(y|x) = \mathbb{E}_{(x^1, y^1), \ldots, (x^K, y^K) \sim \underbrace{p_{\mathrm{data}} \times \cdots \times p_{\mathrm{data}}}_{K}} \left[\mathrm{LEARN}((x^1, y^1), \ldots, (x^K, y^K))(x)\right],$$

where $\mathrm{LEARN}$ is a learning algorithm that returns a trained model. in other words, we need to try training as many models as possible with size-$K$ subsets and see how the predictions from these models vary.

it makes sense to a certain degree, but of course, this is not tractable in general, because we are often given a single set of training examples to work with, instead of the full data distribution from which we can freely sample a new set of training examples. yes, yes, you’re right that sometimes we have (expensive) access to the data distribution, but let’s assume this is not the case in our case.

instead, it is possible to generate pseudo-training sets by re-sampling multiple sets from this single training set. this is what we often refer to as *bootstrap resampling*. this is a nice way to capture the variation/uncertainty caused by sampling of data, but it is often intractable to use this methodology in deep learning, as the size of a data set needed to train a model is pretty huge.
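here is a minimal sketch of this idea with a toy stand-in for $\mathrm{LEARN}$ (a nearest-class-mean classifier; everything here is illustrative, not how you’d do this with a deep net):

```python
import numpy as np

rng = np.random.default_rng(0)

# a toy training set: 1-d inputs, binary labels
x_train = rng.normal(loc=np.repeat([-1.0, 1.0], 20))
y_train = np.repeat([0, 1], 20)

def learn(x, y):
    """a stand-in for LEARN: fit per-class means, predict by the nearest mean."""
    mu0, mu1 = x[y == 0].mean(), x[y == 1].mean()
    return lambda q: int(abs(q - mu1) < abs(q - mu0))

def bootstrap_predictive(x, y, query, n_boot=200):
    """approximate p(y|x) by averaging the predictions of models trained
    on bootstrap-resampled pseudo-training sets."""
    votes = np.zeros(2)
    n = len(x)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample with replacement
        votes[learn(x[idx], y[idx])(query)] += 1.0
    return votes / n_boot

p = bootstrap_predictive(x_train, y_train, query=0.9)
```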

of course, such a resampling strategy can be used to measure the uncertainty in the test accuracy given a single model $f$:

$$\mathbb{V}\left[\mathrm{EVAL}_{p_f(y|x)}\right] = \mathbb{V}_{(x^1, y^1), \ldots, (x^K, y^K) \sim \underbrace{p_{\mathrm{test}} \times \cdots \times p_{\mathrm{test}}}_{K}} \left[\mathrm{EVAL}_{p_f(y|x)}((x^1, y^1), \ldots, (x^K, y^K))\right],$$

where $p_f(y|x)$ is the predictive distribution from one particular model $f$.
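as a sketch, assuming we already have the per-example correctness of a single fixed model on one test set (the numbers here are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

# pretend these are per-example correctness indicators of a fixed model f
# on a single test set of K examples (True = correct)
correct = rng.random(500) < 0.8   # a model with roughly 80% accuracy

def bootstrap_accuracy_std(correct, n_boot=1000):
    """estimate the sampling uncertainty of test accuracy by re-sampling
    the test set with replacement."""
    k = len(correct)
    accs = [correct[rng.integers(0, k, size=k)].mean() for _ in range(n_boot)]
    return float(np.mean(accs)), float(np.std(accs))

mean_acc, std_acc = bootstrap_accuracy_std(correct)
# for a binomial proportion this std is roughly sqrt(p(1-p)/K)
```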

the **second** source of uncertainty is noisy measurement. this is closely related to the sampling-induced uncertainty above, except that we now split the data distribution $p_{\mathrm{data}}$ into two parts: data generation and noise injection. the process by which a single pair $(x,y)$ is sampled is

- true measurement: $(\hat{x}, \hat{y}) \sim p_{\mathrm{data}}(x, y)$
- noisy measurement of the input: $x \sim C_x(x | \hat{x})$
- noisy measurement of the output: $y \sim C_y(y|\hat{y})$

$C_x$ and $C_y$ are the noisy measurement processes for $x$ and $y$, respectively.

let’s first consider the input noise $C_x$. we assume that $C_x$ is symmetric (i.e., $C_x(\hat{x}|x) = C_x(x|\hat{x})$.) this symmetry tells us that we can draw plausible samples of a _clean_ version $x$ given the noisy measurement $\hat{x}$ from $C_x(x | \hat{x})$. we don’t know exactly which of these samples is the original version $x$, but they are all largely plausible. the uncertainty arising from our inability to perfectly denoise the noisy measurement shows up as:

$$p(y|\hat{x}) = \mathbb{E}_{x \sim C_x(x | \hat{x})} \left[ p_f(y|x) \right],$$

where $p_f$ is the predictive distribution returned by a classifier $f$. just like above, this classifier may return an unnormalized scalar, in which case we turn it into a probability by softmax normalization. in other words, the uncertainty is in how the prediction varies across plausible original versions of $\hat{x}$ according to the symmetric noisy measurement process $C_x$.
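a sketch of this expectation with a hypothetical classifier and a gaussian stand-in for $C_x$ (this is just monte-carlo averaging over plausible clean inputs):

```python
import numpy as np

rng = np.random.default_rng(2)

def p_f(x):
    """a stand-in predictive distribution p_f(y|x) over two classes:
    a logistic model with a sharp decision boundary at x = 0."""
    p1 = 1.0 / (1.0 + np.exp(-10.0 * x))
    return np.array([1.0 - p1, p1])

def p_given_noisy(x_hat, sigma=0.5, n_samples=1000):
    """monte-carlo estimate of p(y|x_hat) = E_{x ~ C_x(x|x_hat)}[ p_f(y|x) ]
    under a symmetric gaussian stand-in for the measurement noise C_x."""
    xs = x_hat + sigma * rng.normal(size=n_samples)  # plausible clean inputs
    return np.stack([p_f(x) for x in xs]).mean(axis=0)

sharp = p_f(0.1)               # the model alone is fairly certain
smoothed = p_given_noisy(0.1)  # accounting for measurement noise, less so
```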

this implies that we can reduce this particular type of uncertainty if we knew the (potentially irreversible) $C_x$, by maximizing $\log p(y^*|\hat{x})$ above rather than $\log p_f(y^*|\hat{x})$. of course, $C_x$ is often (if not always) unknown, and people often resort to manually crafting a proxy corruption process that mimics a reasonable noisy measurement process $C_x$ and sampling from it during training to approximate the expectation above. this practice is nowadays referred to as *data augmentation*. of course, smart ones (yes, like my awesome collaborators) learn a proxy to $C_x$ from unlabelled data, as we have done with SSMBA recently for natural language processing.

now, let’s quickly consider the output noise $C_y$. “quickly”, because it doesn’t really differ much from the input noise $C_x$. the major difference is that it is often only useful at training time, since $y$ is not known at test time.

with this in our mind, let’s consider a particular noisy measurement process $C_y(y | \hat{y}) = \alpha \delta(y|\hat{y}) + (1-\alpha) \mathcal{U}(y; \{1,2, \ldots, L \})$, where $\delta$ is a Dirac delta distribution, $\mathcal{U}$ is a uniform distribution, and $\alpha \in [0, 1]$ is a mixing coefficient. with the probability $\alpha$, there is no noise, and with the probability $1-\alpha$, we switch the label to one of all possible labels uniformly.

we can now express the uncertainty by

$$\mathbb{V}_{y \sim C_y(y|\hat{y})} \left[ \log p_f (y|x) \right]$$

of course, we don’t really have access to $\hat{y}$, but we can flip $\hat{y}$ and $y$ above, because $C_y$ is symmetric; with the probability $\alpha$ the clean answer would’ve been $y$ itself and otherwise it could’ve been anything. in other words, we can reduce this uncertainty by minimizing

$$\mathbb{V}_{y' \sim C_y(y'|y)} \left[ \log p_f (y'|x) \right].$$

an interesting observation here is that $\log p_f (y'|x)$ is bounded from above by $0$. this means that we can indirectly minimize this variance by maximizing each $\log p_f (y'|x)$ for $y' \sim C_y(y'|y)$:

$$\mathbb{E}_{y' \sim C_y(y'|y)} \left[ \log p_f (y'|x) \right],$$

which can be rewritten with this particular $C_y$ as

$$\sum_{y' \in \mathcal{Y}} I(y=y') \left(\alpha + \frac{1-\alpha}{|\mathcal{Y}|}\right) \log p_f (y|x) + I(y\neq y') \frac{1-\alpha}{|\mathcal{Y}|} \log p_f (y'|x).$$

this reminds us of the widely-used technique of *label smoothing*, which ensures that a model assigns some non-zero probability to incorrect classes while maximizing the log-probability of the observed label $y$. so, one way to think of what label smoothing does is that it reduces the uncertainty arising from the noisy measurement of labels.
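with this particular $C_y$, the expectation puts probability $\alpha + \frac{1-\alpha}{|\mathcal{Y}|}$ on the observed label and $\frac{1-\alpha}{|\mathcal{Y}|}$ on every other label, which is exactly a label-smoothing target. a small sketch (names and numbers are made up):

```python
import numpy as np

def smoothed_target(y, n_classes, alpha):
    """the distribution C_y(y'|y) = alpha * delta(y'=y) + (1 - alpha) * uniform,
    which doubles as the label-smoothing target."""
    t = np.full(n_classes, (1.0 - alpha) / n_classes)
    t[y] += alpha
    return t

def smoothing_objective(log_p, y, alpha):
    """E_{y' ~ C_y(y'|y)}[ log p_f(y'|x) ]: the label-smoothing objective."""
    return float(smoothed_target(y, len(log_p), alpha) @ np.asarray(log_p))

log_p = np.log([0.7, 0.2, 0.1])   # a model's log-probabilities for one input
t = smoothed_target(0, 3, alpha=0.9)
obj = smoothing_objective(log_p, 0, alpha=0.9)
```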

noise in the output $y$ is however trickier when it is *not* noise. what does it mean? it means that there may be genuinely multiple correct answers, and that we cannot tell by looking at $y$ alone whether it is a noisy version of $\hat{y}$ or just one of many possible clean answers. this is an interesting observation to think about: so-called irreducible noise is often indistinguishable from so-called reducible noise in practice!

if we somehow know that there is genuine ambiguity in the output (which is quite common, such as in machine translation and any other structured prediction problems,) we can deal with it by introducing stochastic hidden variables into our model, such as in *stochastic feedforward networks* as well as *conditional RBM/NADE*. of course, such a powerful conditional density model will inevitably capture not only genuine ambiguity but also genuine measurement noise, potentially leading to the issue of overfitting.

the **third** source of uncertainty is the stochasticity of learning itself. let $\epsilon$ be some arbitrary random variable from which we can sample numbers in order to make some arbitrary decisions in our learning algorithm. there are so many things we often need to make arbitrary decisions for. some of them are:

- minibatch construction: how do we build a minibatch?
  - if we are building a minibatch on the fly, which subset of training examples do we use?
  - if we are taking the next chunk of training examples, which order do we sort the training examples in?
- parameter initialization: how do we initialize the parameters of our model?
- dropout (or any stochastic regularizer): which hidden units do we drop to $0$?
- underlying compute engine: e.g., in which order do parallel floating-point operations get reduced?

furthermore, some learning algorithms intentionally rely on such randomness. a representative example is *policy gradient* in which noise is added to smooth out the super-difficult optimization problem of

$$\max_{\pi} R(\arg\max_a \pi(a|s))$$

into a slightly-less-difficult problem of

$$\max_{\pi} \mathbb{E}_{a \sim \pi(a|s)} R(a).$$

this smoothing is done by arbitrarily choosing the action among many plausible actions according to $\pi$ at state $s$. such sampling is often implemented by transforming a series of random numbers (e.g. drawn from $\epsilon$) into a single sample from $\pi(\cdot|s)$.
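a tiny sketch of this smoothing with a three-action toy problem (the reward table and logits are made up):

```python
import numpy as np

rng = np.random.default_rng(3)

R = np.array([0.0, 1.0, 0.2])   # reward for each of three actions

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def smoothed_objective(logits, n_samples=5000):
    """monte-carlo estimate of E_{a ~ pi}[ R(a) ], the smoothed objective;
    actions are sampled by transforming random numbers (the role of epsilon)."""
    pi = softmax(logits)
    actions = rng.choice(len(pi), size=n_samples, p=pi)
    return R[actions].mean()

logits = np.array([0.0, 0.5, 0.0])
exact = softmax(logits) @ R      # the expectation in closed form
approx = smoothed_objective(logits)
```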

we can now abstract out these details and make $\epsilon$ an additional input to $\mathrm{LEARN}$ function above. this learning algorithm takes as input the training set as well as this source of randomness. then, for each input $x$, we can check the uncertainty of our prediction by considering multiple (potentially infinitely many) models arising from the variance induced by $\epsilon$:

$$p(y|x) = \mathbb{E}_{\tilde{\epsilon} \sim \epsilon} \left[\mathrm{LEARN}((x^1, y^1), \ldots, (x^K, y^K), \tilde{\epsilon})(x)\right],$$

which looks exactly like the bootstrap resampling version above. this is only natural, because dataset sampling itself can be thought of as an arbitrary selection of a subset of all possible examples. it is however informative to think of these two separately, since noise in stochastic learning is what we often can explicitly control and noise in dataset sampling is what we often don’t have much control over (perhaps except in active learning.)

let $\tilde{\epsilon} = (\epsilon^1, \ldots, \epsilon^M) \sim \epsilon$ be a series of random numbers drawn from $\epsilon$ for a single training run. one may be tempted to choose the $\tilde{\epsilon}$ with the best validation accuracy and deploy the corresponding model. but, in an application where it is important to find more than one answer with reasonable estimates of their probabilities, it is a much better idea to bag all of them for deployment. this is also why you do not want to and should not *tune* a random seed.

it is understandably quite expensive to use many models arising due to $\epsilon$ in real life, unfortunately. it is however an attractive feature of this approach to have a full distribution over $y$ that reflects varying degrees of likelihood of each $y$. it is thus a usual practice to use knowledge distillation: train another model not on the targets $y^n$ from the data but on the entire predictive distribution $p(y|x^n)$.
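a sketch of bagging over seeds, with a toy stand-in for $\mathrm{LEARN}$ in which the seed plays the role of $\tilde{\epsilon}$; the averaged predictive distribution at the end is what one would distill a single student model on (all names here are made up):

```python
import numpy as np

rng = np.random.default_rng(4)

def learn(x_train, y_train, seed):
    """a stand-in for LEARN(..., epsilon): the seed stands for all arbitrary
    decisions (initialization, shuffling, dropout, ...)."""
    r = np.random.default_rng(seed)
    shift = 0.05 * r.normal()                 # 'initialization' noise
    mu0 = x_train[y_train == 0].mean() + shift
    mu1 = x_train[y_train == 1].mean() + shift
    def predict_proba(q):                     # a soft nearest-mean classifier
        d = np.array([-(q - mu0) ** 2, -(q - mu1) ** 2])
        z = np.exp(d - d.max())
        return z / z.sum()
    return predict_proba

x_train = np.concatenate([rng.normal(-1, 1, 20), rng.normal(1, 1, 20)])
y_train = np.repeat([0, 1], 20)

# bag the models over many seeds: p(y|x) ~ E_eps[ LEARN(D, eps)(x) ]
models = [learn(x_train, y_train, seed) for seed in range(50)]
distill_target = np.mean([m(0.3) for m in models], axis=0)
```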

the **fourth** source of uncertainty is hyperparameter tuning. it is not only the training procedure but also the hyperparameter search procedure that relies extensively on random numbers sampled from $\epsilon$, because we almost always cannot perform exhaustive search nor deterministic line search due to an ever-increasing number of hyperparameters. this is similar to the uncertainty arising from stochastic learning above, in that our choice of hyperparameters, which directly affects learning, has its own noise, e.g. arising from random search, which results in predictive uncertainty.

furthermore, an uncertainty similar to the data sampling uncertainty above exists with hyperparameter tuning as well, as it is a common practice to use a fixed, finite set of validation (held-out) examples in hyperparameter tuning. each time we use a different set of validation examples drawn from the true data distribution (whatever that is), we would end up with a hyperparameter configuration that leads to a different model that makes slightly different predictions. we would aggregate these to understand how uncertain any particular prediction is with respect to this validation-set sampling noise. one could imagine that bootstrap resampling would work well here too, but we often do not bother in practice.

the first type of uncertainty from hyperparameter tuning, the one that arises from random search, is one of my favourites, along with the smoothing technique we often use in learning. for it is the kind of uncertainty that does not exist in the ideal world in which we have all the time in the world and all the compute in the world. even if there is a deterministic mapping from a hyperparameter to an individual prediction, our inability introduces the uncertainty. now, this is what people call reducible uncertainty, but is it really reducible? i don’t think so.

finally, i want to spend just one paragraph on one particular case of uncertainty inherent to a problem itself. we already talked about it above when we discussed noisy measurement of the output. that is, what if there are genuinely multiple possible answers?

it turns out that there may be many different reasons why there are multiple possible answers inherent to the problem in our hands. among those many reasons, i want to talk briefly about one particular scenario. in this scenario, there exists a set of *unobserved* variables $u$ that affect the target variable in some way (it is not really important how, in this high-level, light blog post.) that is, the true function that determines the target takes as input not only the observed input $x$ but also $u$, i.e., $y = g(x, u)$. because we do not observe $u$, given only $x$, there are multiple correct answers.

this can be indeed handled to a certain degree by stochastic feedforward networks as well as conditional density models, but it’s pretty impossible to ensure our choice of such a model can model the unobserved $u$. after all, it’s unobserved, and we do not even know what it is and even more so whether it exists.

in this blog post, i enumerated a few sources of uncertainty in machine learning that i could immediately think of off the top of my head. they include finite data sampling, measurement noise, stochastic learning, hyperparameter tuning and unobserved variables. this doesn’t include other potential sources, such as uncontrollable shift in the environment between training and test times. furthermore, it does not cover other paradigms, such as online learning, active learning, etc., because i literally don’t know them well.

now… time to take all those sources into account for quantifying and using uncertainty!

this whole post was motivated by my (continuing) discussion with our wonderful members at Prescient Design: Ji Won Park, Natasha Tagasovska, Jae Hyeon Lee and Stephen Ra. Oh, also, we are hiring!

but, then, i realized i don’t know what uncertainty is at a high level (!) which is somewhat weird, since i think i can often follow specific details of any paper that talks about uncertainty and what to do with it. so, as someone who dies for attention (pun intended, of course), i’ve decided to write a blog post on how i think i (should) view uncertainty. this view has almost no practical implication, but it helps me think of predictive uncertainty (aside from all those crazy epistemic vs. alleatoric uncertainty, which i’m sure i mistyped.)

in my mind, i **start with** the following binary indicator:

$$U(p, y, \tau) = I\left( \sum_{y' \in \mathcal{Y}} I(p(y') > p(y))\, p(y') \leq \tau\right).$$

if we are considering a continuous $y$, we replace the summation with an integration:

$$U(p, y, \tau) = I\left( \int_{y' \in \mathcal{Y}} I(p(y') > p(y))\, p(y') \mathrm{d}y' \leq \tau\right).$$

$\mathcal{Y}$ is a set/space of all possible $y$’s. $I(\cdot)$ is an indicator function, i.e., it returns $1$ if true and otherwise $0$. $p$ is a predictive distribution under which we want to measure the uncertainty (e.g., a categorical distribution returned by a softmax classifier.) $y$ is a particular value of interest, and $\tau$ is a user-provided threshold.

this binary indicator tells us whether a particular value $y$ is within top-$(100 \times \tau)$% values under $p$. this can be used for a number of purposes.

**first**, we can use it to check how certain any particular prediction $\hat{y}$ is under our predictive distribution. let $p(y|x)$ be the predictive distribution returned by our classifier. we can solve the following optimization problem:

$$\min_{\tau \in [0, 1]} \tau$$

subject to

$$U(p(\cdot|x), \hat{y}, \tau) = 1.$$

in other words, we try to find the smallest threshold $\tau$ such that $\hat{y}$ is included. we refer to the solution of this optimization by $\hat{\tau}$.

there is a brute-force approach to solving this optimization problem, which sheds a bit of light on what it does (and a bit on why i started with $U$ above,) although this only works for a discrete $y$. first, we enumerate all possible $y$’s and sort them in decreasing order of $p(y|x)$. let us call this sorted list $(y^{(1)}, y^{(2)}, \ldots, y^{(N)})$, where $N = |\mathcal{Y}|$. then, we search for $\hat{y}$ in this sorted list, i.e., $\hat{i} = \min \{ i : y^{(i)} = \hat{y} \}$. then, $\hat{\tau} = \sum_{j=1}^{\hat{i}-1} p(y^{(j)}|x)$. in short, we look at how much probability mass is taken up by predictions that are strictly more probable than $\hat{y}$, which seems (to me at least) to be the right way to think of the uncertainty assigned to $\hat{y}$.
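a small sketch of this procedure; since $U$ uses a strict inequality, the minimal $\hat{\tau}$ is simply the total mass of predictions strictly more probable than $\hat{y}$, so we do not even need to sort explicitly (the numbers are made up):

```python
import numpy as np

def tau_hat(p, y_hat):
    """smallest tau such that U(p, y_hat, tau) = 1: the total probability
    mass of predictions strictly more probable than y_hat."""
    p = np.asarray(p, dtype=float)
    return float(p[p > p[y_hat]].sum())

p = [0.5, 0.3, 0.15, 0.05]
tau_mode = tau_hat(p, 0)    # the mode has no mass above it
tau_third = tau_hat(p, 2)   # 0.5 and 0.3 are strictly more probable
```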

**second**, we can use it to enumerate all predictions that should be considered under a given threshold $\tau$ beyond one best prediction by solving the following optimization problem:

$$\max_{Y \subseteq \mathcal{Y}} |Y|$$

subject to

$$U(p(\cdot|x), y, \tau) = 1,~\forall y \in Y.$$

in other words, we look at the largest subset $Y$ such that each and every element within $Y$ is certain under the predictive distribution $p(\cdot|x)$ with the certainty $\tau$.

this is a natural problem to solve and return the answer of, especially when we know that the problem has inherent uncertainty. in the case of machine translation, for instance, there is generally more than one equally good translation given a source sentence, and it is only natural to return the top-$(100 \times \tau)$% translations rather than the one best translation (though, we don’t do that in practice unfortunately.)

the same brute-force solution from the first problem is equally applicable here. once we have a sorted list, we find the largest $\hat{i}$ such that $\sum_{j=1}^{\hat{i}-1} p(y^{(j)}|x) \leq \tau$ and simply return $Y = \{y^{(1)}, y^{(2)}, \ldots, y^{(\hat{i})}\}$. this is too brute-force and is not tractable (nor applicable) in many situations (precisely why we don’t return multiple possible translations in machine translation, in practice.)
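the same idea as a sketch: we keep every $y$ whose strictly-more-probable mass does not exceed $\tau$ (again with made-up numbers):

```python
import numpy as np

def prediction_set(p, tau):
    """largest Y such that U(p, y, tau) = 1 for every y in Y: keep each y
    whose strictly-more-probable mass does not exceed tau."""
    p = np.asarray(p, dtype=float)
    return [y for y in range(len(p)) if p[p > p[y]].sum() <= tau]

p = [0.5, 0.3, 0.15, 0.05]
only_mode = prediction_set(p, 0.0)   # only the mode survives
top_set = prediction_set(p, 0.85)    # the top-85% predictions
```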

**third**, we can use $U$ to calibrate a given predictive distribution toward any criterion. for instance, our calibration criterion could be

$$J(\hat{p}; \tau) = \left|\frac{\mathbb{E}_{x, y^* \sim p^*} \left[I(|\hat{\tau}(\hat{p}, \hat{y}) - \tau| < \epsilon)\, I(|y^* - \hat{y}| - \delta < 0)\right]}{\mathbb{E}_{x, y^* \sim p^*} \left[I(|\hat{\tau}(\hat{p}, \hat{y}) - \tau| < \epsilon)\right]} - \tau \right|,$$

and the calibration criterion is that $J(\hat{p}; \tau) < \epsilon$ for all $\tau \in [0, 1]$,

where $\hat{p}$ is a monotonic transformation of $p$, and $\hat{y}=\arg\max_y p(y|x)$. you can think of $\hat{p}$ as a target distribution after we calibrate $p$ to satisfy the inequality above.

this criterion looks a bit confusing, but let’s parse it out. the two expectations effectively correspond to drawing true examples $(x, y^*)$’s from the ground-truth distribution $p^*$. for each $x$, we compute how often the prediction $\arg\max_y \hat{p}(y|x)$ is within the confidence threshold $\tau$. among those cases that satisfy this criterion, we check how good the prediction is (i.e., $|y^* - \hat{y}| - \delta < 0$). the proportion of such good predictions (the ratio above) should be within a close neighbourhood of the confidence threshold $\tau$.

with this criterion, we can solve the following optimization problem for calibration:

$$\min_{F} \int_{0}^1 J(F(p); \tau) \mathrm{d}\tau + \lambda \mathcal{R}(F),$$

where $\mathcal{R}(F)$ is some measure of the complexity of the monotonic transformation $F$ with the regularization coefficient $\lambda > 0$.

we can think of this optimization problem as finding *minimal* changes we need to make to the original predictive distribution $p$ to maximally satisfy the criterion above. of course, we can use different formulations, such as using a margin loss, but the same idea holds regardless.

there can be many other criteria. for instance, we may care only that the true value $y^*$ be within $\tau$. in this case, the optimization problem simplifies to:

$$\min_F \mathbb{E}_{x, y^*} \left[ 1- U(F(p), y^*, \tau) \right] + \lambda \mathcal{R}(F).$$

so, how does it relate to all the discussions on **reducible** (our inability) and **irreducible** (the universe’s inability) **uncertainty**? in my view, which is often extremely pragmatic, it’s almost a moot point to distinguish these two too strongly when we consider the uncertainty of prediction coming out of our system, assuming we’ve tried our best to minimize our inability (reducible uncertainty). with a finite number of training examples, which are almost never enough, and with our inability to tell whether there’s a model mismatch (the answer is almost always yes,) we cannot really even tell between reducible and irreducible uncertainty. then, why bother distinguishing these two rather than just lumping them together into $p(\cdot|x)$?

**anyhow**, the post got longer than i planned but stays as empty as i planned. none of these use cases of the binary indicator $U$ are actionable immediately nor tractably. they need to be polished and specialized for each case by carefully inspecting $p$, $\mathcal{Y}$, etc. but, at least this is how i began to view the problem of uncertainty in machine learning.

this whole post was motivated by my discussion with our wonderful members at Prescient Design: Ji Won Park, Natasha Tagasovska and Jae Hyeon Lee. Oh, also, we are hiring!

if we consider the case of regression (oh i hate this name “regression” so much..) we can write this down as maximizing

$$-\frac{1}{2} \| \alpha y + (1-\alpha) y' - G(F(\alpha x + (1-\alpha) x'))\|^2,$$

where \((x,y)\) and \((x',y')\) are two training examples, and \(\alpha \in [0, 1]\) is a mixing ratio. nothing more to explain than to simply look at this loss function: we want our regressor \(G \circ F\) to linearly interpolate between any pair \((x,x')\).
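a minimal sketch of this loss (written as a squared error to minimize, i.e., the negative of the objective above), with a linear toy model for which the loss is exactly zero (all names made up):

```python
import numpy as np

def mixup_pair(x, y, x2, y2, alpha):
    """input mixup: interpolate both the inputs and the targets."""
    return alpha * x + (1 - alpha) * x2, alpha * y + (1 - alpha) * y2

def mixup_loss(model, x, y, x2, y2, alpha):
    """squared-error mixup loss for a regressor `model` (= G o F)."""
    x_mix, y_mix = mixup_pair(x, y, x2, y2, alpha)
    return 0.5 * np.sum((y_mix - model(x_mix)) ** 2)

model = lambda x: 2.0 * x   # a linear model interpolates exactly
x, y = np.array([1.0]), np.array([2.0])
x2, y2 = np.array([3.0]), np.array([6.0])
loss = mixup_loss(model, x, y, x2, y2, alpha=0.3)
```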

Manifold mixup followed up on the original mixup ([1806.05236] Manifold Mixup: Better Representations by Interpolating Hidden States (arxiv.org)) by proposing to interpolate in the hidden space (after \(F\) above,) similarly to my own work on interpolating the hidden representations of retrieved examples (see here, done together with Jake Zhao.) although there are quite a few details in the case of manifold mixup, such as randomly selecting the layer at which mixup is done, let me ignore those and just consider the key loss function:

$$L^{\mathrm{mmix}} = -\frac{1}{2} \| \alpha y + (1-\alpha) y' - G(\alpha F(x) + (1-\alpha) F(x'))\|^2.$$

a natural inclination is to think of this as ensuring that \(G\) interpolates linearly between two points in the space induced by \(F\). that is probably what the authors of manifold mixup meant by saying that “*Manifold Mixup Flattens Representations*“, although their theory (§3.1) doesn’t seem to have anything to do with this phenomenon of flattening. their theory seems to be largely about universal approximation (which doesn’t really tell us much about linear interpolation) and that classes eventually become linearly separable (again doesn’t tell us much about linear interpolation.)

one thing that’s emphasized in the manifold mixup paper is that it “*backpropagates gradients through the earlier parts of the network*” (i.e. \(F\) above). totally understandable to any deep learner, as the motto we live and die by is end-to-end learning. but if \(F\) changes, it changes the space over which \(G\) linearly interpolates, or \(G\) can linearly interpolate in the space induced by \(F\) by adapting \(F\) rather than \(G\). furthermore, the tie between two training examples implied by linear interpolation can change dramatically as the nonlinear \(F\) changes. so… confusing…

let’s look at the gradient of this loss function w.r.t. \(F\) above ourselves, after assuming that \(y \in \mathbb{R}\) and \(G\) is a linear function (similar to sentMixup in [1905.08941] Augmenting Data with Mixup for Sentence Classification: An Empirical Study (arxiv.org)) for simplicity.

$$\frac{\partial L^{\mathrm{mmix}}}{\partial F} = (\alpha y + (1-\alpha) y' - G(\alpha F(x) + (1-\alpha) F(x'))) \frac{\partial G}{\partial Z} \frac{\partial Z}{\partial F},$$

where $Z = \alpha F(x) + (1-\alpha) F(x')$ and

$$\frac{\partial Z}{\partial F} = \alpha F'(x) + (1-\alpha) F'(x').$$

because $G$ is linear,

$$(\alpha y + (1-\alpha) y' - G(\alpha F(x) + (1-\alpha) F(x'))) = (\alpha y + (1-\alpha) y' - \alpha G(F(x)) - (1-\alpha) G(F(x'))).$$

combining all these together,

$$\frac{\partial L^{\mathrm{mmix}}}{\partial F} = \left( \alpha (y-G(F(x))) + (1-\alpha) (y' - G(F(x'))) \right) \left( \alpha \frac{\partial G}{\partial F}(x) + (1-\alpha) \frac{\partial G}{\partial F}(x') \right).$$

what you notice here is that there are essentially four terms after expanding this multiplication. two terms are usual gradients we get from making $G \circ F$ predict $y$ given $x$ and $y’$ given $x’$, just like any regression:

- $\alpha^2 (y-G(F(x))) \frac{\partial G}{\partial F}(x)$
- $(1-\alpha)^2 (y'-G(F(x'))) \frac{\partial G}{\partial F}(x')$

the other two terms are quite unusual:

- $\alpha(1-\alpha) (y-G(F(x))) \frac{\partial G}{\partial F}(x')$
- $\alpha(1-\alpha) (y'-G(F(x'))) \frac{\partial G}{\partial F}(x)$

in other words, the direction and scale of the update of $F$ given $x$ is determined by the regression error for $x'$ (!) and that given $x'$ by the error for $x$ (!).
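we can check this expansion numerically with scalar toys: a nonlinear $F(x) = \tanh(wx)$ and a linear $G(h) = vh$ (all names and numbers made up for illustration). the four-term gradient matches finite differences, and it vanishes whenever the model fits the two original pairs, for any $\alpha$:

```python
import numpy as np

def F(x, w):          # a toy nonlinear feature extractor
    return np.tanh(w * x)

def dF_dw(x, w):      # its derivative w.r.t. the parameter w
    return x * (1.0 - np.tanh(w * x) ** 2)

def L_mmix(w, v, x, y, x2, y2, a):
    """manifold-mixup objective with linear G(h) = v * h (to be maximized)."""
    z = a * F(x, w) + (1 - a) * F(x2, w)   # mix in the hidden space
    return -0.5 * (a * y + (1 - a) * y2 - v * z) ** 2

def grad_w(w, v, x, y, x2, y2, a):
    """the gradient w.r.t. w, written as the four terms in the text."""
    r1, r2 = y - v * F(x, w), y2 - v * F(x2, w)    # per-example residuals
    g1, g2 = v * dF_dw(x, w), v * dF_dw(x2, w)
    return (a ** 2 * r1 * g1 + (1 - a) ** 2 * r2 * g2          # usual terms
            + a * (1 - a) * r1 * g2 + a * (1 - a) * r2 * g1)   # cross terms

w, v = 0.7, 1.3
x, x2 = 0.5, -1.2

# the four-term expansion agrees with finite differences on a non-fitted model
y, y2, a, eps = 1.0, -0.5, 0.3, 1e-6
num = (L_mmix(w + eps, v, x, y, x2, y2, a)
       - L_mmix(w - eps, v, x, y, x2, y2, a)) / (2 * eps)
assert abs(num - grad_w(w, v, x, y, x2, y2, a)) < 1e-6

# once the model fits both original pairs, the gradient vanishes for every
# alpha, no matter how 'unflattened' the space induced by F is
y, y2 = v * F(x, w), v * F(x2, w)
for a in (0.1, 0.5, 0.9):
    assert abs(grad_w(w, v, x, y, x2, y2, a)) < 1e-12
```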

one could think of these two terms as the ones that flatten the representation space induced by $F$, but one also notices that the regression error terms are shared between the two usual terms and the two unusual terms. in other words, the gradient is zero when regression on the original pairs $(x,y)$ and $(x’,y’)$ is solved, regardless of how *flattened* the space induced by $F$ is.

this is unlike the original mixup (or input mixup), where the contributions of $x$ and $x'$ cannot be separated anywhere in the network ($G \circ F$). in manifold mixup, because the contributions of $x$ and $x'$ can be separated out at the level of $F$ (though not at $G$), there is room for $F$ to make linear interpolation pretty much meaningless.

in fact, this may be what the authors already pointed out in their theory of manifold mixup: “*In the more general case with larger $\mathrm{dim} (H)$, the majority of directions in H-space will be empty in the class-conditional manifold.*” there is no meaningful interpolation between these class-conditional manifolds, because the majority of directions that would otherwise connect them are empty (pretty much meaningless from $G$’s perspective.)

another way to put it is that the feature extractor $F$ can easily give up on inducing a space that meaningfully interpolates between any pair of training examples, since it stops changing as long as the model $G \circ F$ predicts the original training examples very well. in other words, there is no reason why $F$ should induce a space over which $G$ linearly interpolates in a meaningful way.

this leaves us with a BIG mystery: why does manifold mixup work well? it worked well for the authors of the original manifold mixup, and since then, various authors have claimed that it works well (see, e.g., sentMixup as well as TMix). what do those two unusual terms above in the gradient do to make the final model generalize better?

until this mystery is resolved, my suggestion is to stick to a much more explicit way to ensure the representation is *flattened*: ensure that small changes in the input space indeed map to small changes in the representation space. this can be done, e.g., by making the representation predictive of the input (see, e.g., [1306.3874] Classifying and Visualizing Motion Capture Sequences using Deep Neural Networks (arxiv.org), http://machinelearning.org/archive/icml2008/papers/601.pdf, [1207.4404] Better Mixing via Deep Representations (arxiv.org), etc.) or by explicitly making the representation linear using some auxiliary information such as time (see, e.g., [1506.03011] Learning to Linearize Under Uncertainty (arxiv.org)). of course, i need to plug my latest work on learning to interpolate in an unsupervised way as well: [2112.13969] LINDA: Unsupervised Learning to Interpolate in Natural Language Processing (arxiv.org).

you can watch the whole event (1.5hr long) at SW WELCOMES GIRLS 8TH – YouTube. below is the script i used to record my talk, translated into English from the original Korean:

Hello!

Thank you for inviting me to such a wonderful event.

Let me begin with a brief self-introduction.

I am Kyunghyun Cho, currently a professor at New York University’s Courant Institute of Mathematical Sciences and Center for Data Science. Since August of this year, I have also been serving concurrently as Senior Director of Frontier Research at Genentech.

My research area is machine learning, and within it I explore the many domains where artificial neural networks are used. For the past seven years or so I have worked on applying machine learning to natural language processing and machine translation, and more recently I have been looking into a more diverse set of problems.

Over the past decade I have given talks and presentations in many settings, including the conferences of my field. Unfortunately, I have almost never presented to audiences whose specialties do not overlap with mine. So when I received the invitation to this event, I agonized over what I could possibly talk about.

I therefore decided to collect questions in advance through the organizers. I worried whether anyone would actually submit questions, but thankfully many of you did, and I have prepared a short message based on them.

Before getting to that, let me first tell you about the path that brought me here.

I attended Dongjak Middle School and Kyungmoon High School, both within walking distance of home, and then entered KAIST.

At KAIST I had a great time, studying a variety of subjects and taking part in all sorts of extracurricular activities. I have never taught in Korea, but watching the undergraduate and graduate students here at NYU study so intensely, I sometimes think I may have been part of the last generation that could enjoy college life so leisurely.

Perhaps because of that, as graduation approached I had no big plans, or even much thought, about what to do next. Knowing truly nothing about artificial intelligence or machine learning, I went to Finland for a master’s degree after seeing a pamphlet that a senior classmate had happened to pick up in front of the department office.

If I had deliberated carefully at the time, cautiously and thoroughly investigating the various options and making the best possible choice, I probably would not have gone to Finland. If I hadn’t gone to Finland, what choices would I have made, and where would I be now? I honestly cannot imagine.

A month or two after starting my master’s, I began spending one day a week doing research in a lab, not one I volunteered for, but one the department assigned me to. This lab, which no longer exists, worked on artificial neural networks, and there I too began participating in machine learning and deep learning research. If the department had assigned me to a different lab, what would I be doing now? Again, I cannot imagine.

After finishing my PhD, I went to the University of Montreal as a postdoctoral researcher. When I finally arrived at the lab in Montreal and settled into my seat, Professor Yoshua Bengio, who had hired me, came over and asked me what research I wanted to do. Until then, I had vaguely assumed I would simply continue the research from my PhD, but in fact there was no reason to.

Among the four research topics Yoshua suggested to me, one was machine translation. It was the most unfamiliar; I didn’t even know it was a research field… but it just sounded like so much fun. If Yoshua hadn’t suggested machine translation then, or if I had picked a more familiar topic instead, where would I be now? Once again, I can’t imagine.

Usually, when starting research in a new field, one carefully studies the existing results and methods, figuring out what has been done well and what is missing. But as someone fresh out of a PhD, and one I completed almost without peers, relying on the help of two postdocs, I had a level of confidence and courage that I cannot comprehend in hindsight. While occasionally reading textbooks on the side, I spent most of my time wondering what a machine translation system built from scratch with neural nets would look like, and actually implementing one.

Looking back now, it might appear as if I had a grand vision at the time, and that by following that vision I applied a new approach to machine translation and made a small contribution to the many recent advances. But of course that can’t be true. Having decided to take on a field I knew nothing about, and intoxicated with useless courage and confidence, I never examined the problem carefully, so every day was a series of trials and errors.

Having never handled text data before, I struggled even with how to store and load it. In the end, I settled on compressing plain text with gzip and reading it line by line.

Until my postdoc, I had implemented everything myself using Matlab and a Python library that a friend in Germany had built as a hobby. In Montreal everyone used Theano, now discontinued, and wanting to learn something new, I switched to it too. It was a completely new paradigm, so I struggled in all sorts of ways. When experiments went badly, it turned out to be a bug I had created by not understanding Theano; when experiments went suspiciously well, it was, again, a bug I had created by not understanding Theano. Just as I finally got comfortable with Theano and started gaining confidence, Theano was discontinued…

After two years in Montreal, I moved to New York University in the fall of 2015, and I have lived in New York ever since. I never particularly intended to become a professor. In fact, at the time people studying deep learning were still a small minority, and companies like Google and DeepMind were aggressively recruiting from that minority, so I had naturally assumed that, like my friends, I would go work at one of those companies.

What changed my mind was a chance encounter, on the way to a conference, with Professor Nando de Freitas, who had chaired my PhD defense: he asked me whether I had thought about a faculty position, and I realized I could become a university professor rather than join a corporate lab. Until then, apart from regularly showing up at the lab in Montreal for about a year, I had no idea how a graduate research lab was run or what a professor had to do to build and operate one. But… since I was told it was possible, and since it seemed like a path somewhat different from that of my friends who had joined Google, DeepMind, Facebook and the like, I decided on the spot to apply for faculty positions.

Of course, the job search was not easy. I applied to about forty universities in the US, Canada, the UK, Finland, Switzerland and elsewhere, received interview requests from six to eight of them, and actual offers from three. NYU was one of the three, and I chose it because I wanted to live in New York (and I enjoy city life), and because NYU looked like the most fun at the time.

Along the way I also worked part-time as a research scientist at Facebook AI Research for about three years, and more recently, after selling a protein-design company to Genentech, I now also serve as Genentech’s Senior Director of Frontier Research. But since 2015 I have remained a professor at NYU and a resident of New York.

My self-introduction has run rather long. But in telling it, I think I have already answered many of the questions you sent in.

Some of you asked about the trials and errors I have been through, especially after I began research on neural machine translation. I have already answered: yes, there were a great many, and there still are.

Everyone listening to this talk, everyone working in fields like mine, and I myself… we are engineers, and an engineer’s job is to create things that do not yet exist in the world. That new thing may be a new research field, a new product, or a way to improve an existing product. To accomplish something that does not yet exist, something humanity has not yet figured out, I think trial and error is only to be expected.

I often tell my PhD students: if you come up with a hundred ideas, perhaps one or two of them will be correct, feasible, researchable ideas. If you come up with a hundred ideas and every one of them is correct, feasible and researchable, then one of three things is true. First, you may be a genius unlike any the world has seen; I hear the odds are very low, but it’s not impossible. Second, you are only looking for easy, simple, to put it bluntly, obvious ideas. Third, and I hope not, you are committing fraud.

Unfortunately, trial and error seems unavoidable when pioneering a path no one has walked. But with good communities like the one we see today, where we support one another and understand each other’s trials and errors, I think it becomes more and more doable.

Several of you asked what prompted me, and with what mindset, to leave for Finland, or to take on the field I work in now. Unfortunately, I have no better answer than that I was truly lucky, and that it was a chain of coincidences. If my senior Yong-Wook hadn’t brought me that pamphlet, I would never even have thought of Finland. If the department at Aalto University in Finland hadn’t assigned me to a group doing neural net research, I would never have considered the field of deep learning. If Yoshua Bengio hadn’t happened to be thinking about machine translation at the exact moment I arrived in Montreal, I couldn’t have imagined doing machine translation research. And if I hadn’t worked on machine translation, I wouldn’t have attended the natural language processing conference in Doha in 2014, and Professor Nando would never have asked me whether I was interested in a faculty job.

And at each and every one of these moments (there are in fact many more coincidences I haven’t mentioned), it’s not as if I was confident I would do well. I simply felt curious and wanted to try whenever I heard about such a new option.

One of the most important principles in reinforcement learning, one of the major subfields of machine learning, is “optimism in the face of uncertainty”: when confronted with an uncertain situation, make the optimistic choice. Looking back, it was precisely because I knew so little that I naturally made optimistic choices, and with a lot of luck and coincidence I ended up here.

Now that I’ve said it out loud, that’s not much of an answer. I apologize.

Some of you wondered what led me to do the kind of work I do and think the way I think.

I can’t point to any single moment; looked at another way, every moment since I was born has made me who I am now, so perhaps the answer is: all of it.

That said, when I think hard about when the big shifts in my thinking and in my life occurred, they were mostly the moments when I stepped outside my familiar, comfortable space.

A few weeks after arriving in Finland, as I gradually got used to Finnish university life, learned about Finnish society, and made friends with Finnish students, other European students, and students who had come to Finland from all over the world, I realized that the center of the world shifts depending on where you live. Had I not gone to Finland, I probably would never have known how important Estonia and Sweden are in international affairs.

When I told Yoshua in Montreal that I wanted to do machine translation research… I didn’t even know what machine translation was, but saying those words and starting that research, I felt my narrow view of machine learning, and more broadly of artificial intelligence, open wide.

Timnit Gebru once posted on Twitter a photo from the NeurIPS conference in 2014 or 2015. Before reading her tweet, I stared at the photo for a long while without much thought, just looking for myself in it. The moment I read Timnit’s short tweet, I suddenly began to see what I hadn’t been seeing; or rather, I began to see what was *not* in the photo. Only then did I realize that among those countless attendees, there were hardly any women, and no Black researchers at all.

It’s a bit of a tangent, but in this sense I am always strongly in favor of going abroad.

Finally… I saw one truly amusing question:

“You have become a world-famous scientist. How does it feel?”

Thank you for thinking so generously of me. Through a chain of coincidences, I happened to be in good places, with good people, at good times, and ended up here rather comfortably.

But there is a thought that crosses my mind every day.

When I talk with the deep learning researchers of the generation before mine, for example Geoff Hinton, Yann LeCun, Yoshua, Juergen Schmidhuber, and follow their work, I see that they had a genuine vision: even in times when the environment was far too hostile for deep learning to show real results, they never stopped, and they carved out the field we now call deep learning. Compared to them, I started deep learning research without knowing what AI or deep learning even was, because a senior happened to hand me a pamphlet and a department happened to assign me to a lab; and with lucky timing, I finished my PhD just as deep learning took off and comfortably landed a professorship.

So far I have gotten here on luck… but can I keep looking ahead and pressing on with research the way those pioneers did? It worries me a great deal.

Of course, it is not only the generation ahead of me. Early in my graduate years, at machine learning conferences like NeurIPS, you could count the deep learning papers on your fingers; that is how few students were working on artificial neural networks. Now, not just at machine learning conferences but at conferences in any field even remotely related to artificial intelligence, most papers are about deep learning. That many students are fiercely pursuing research in this field.

Through advising students at NYU and taking part in admissions, I have come to know these students quite well.

They are truly formidable. I started studying deep learning knowing nothing, without even the underlying math and statistics, whereas today’s students are superbly prepared academically and often come with research and development experience on top of that. And yet, because competition is so fierce, they all struggle, researching and studying under far harsher conditions than I ever did.

Honestly… when I look at these students and my juniors, all I feel is apologetic. I swagger around with the title of professor… but do I really deserve it? If I were to start a master’s and PhD all over again today, could I make it back to where I am now? Probably not.

How does it feel? It’s agonizing.

Once again, thank you for inviting me to such a wonderful occasion. It’s a little late here, but I will see you all online shortly.

first, CIFAR started a program named “Neural Computation & Adaptive Perception” (NCAP) in 2004, supporting research in artificial neural networks, which has since become a dominant paradigm in machine learning and, more broadly, in artificial intelligence and all adjacent areas, including natural language processing and computer vision. i started my graduate study in 2009 with a focus on restricted Boltzmann machines and graduated in 2014 with a PhD degree, which makes me perhaps *the* one who has benefited *most* from this success of deep learning. since this success was fostered by CIFAR’s NCAP program starting as early as 2004, i could even attribute a large part of my career to CIFAR and its NCAP program. i often wonder what would’ve happened to me and my career after graduate school, had CIFAR decided to start and support another program instead. i can only guess it would’ve been very different, and that i would certainly have been worse off.^{@}

second, CIFAR sponsored the very first publicly-open summer school on deep learning, hosted by UCLA IPAM in 2012. i was a graduate student at Aalto University in Finland back then. due to a number of reasons, political, financial and technical, the Bayes group, to which i belonged back then and which was actually a “neural net” group despite its name, had by then pretty much stopped taking in new students and postdocs. i was in desperate need of meeting peers and talking with them about neural net research (i still wasn’t too familiar with the term “deep learning”, just like many others back then,) not to mention that i really needed to take some courses and learn about various technical aspects of deep learning beyond the limited selection of courses offered at Aalto (i mean… the neural net group was essentially on the brink of being dissolved, although this is for another post.) i then learned about this “Graduate Summer School: Deep Learning, Feature Learning” and did not hesitate a second to apply for a seat. it was a three-week-long program filled with a series of amazing lectures and lab sessions, allowing me to finally get a bigger picture and learn the technical details behind various algorithms and paradigms.* it was pretty intense, but it was just the right level of intensity that i needed back then. i wonder what my PhD thesis would’ve looked like had i not attended this summer school, or even worse, had CIFAR not sponsored it in the first place. what a scary thought!

third, in 2014, as a postdoc at the University of Montreal, i attended the annual summer school organized by CIFAR NCAP (now called Learning in Machines and Brains (LMB)), hosted at the University of Toronto. it was a very exciting summer school, following up on the series of CIFAR NCAP summer schools organized ever since NCAP was created in 2004. the entire summer school fit in one reasonably small lecture room at U. Toronto, with a series of lectures and student talks. because we were all crammed into a single lecture room (talk about pre-pandemic!) it was intensely interactive, and i was learning so much during those 2-3 days. at this summer school, i presented on-going work on machine translation (so did Ilya Sutskever, who gave a much better, slicker and more prophetic talk). this is where i coined the term “*neural machine translation*“, which i believe may be the only lasting contribution i’ve made to the field of machine translation (and i’m proud of myself for it!) in fact, after the school that day, we all went to one of the dive bars where UT grad students used to hang out (i can’t recall the name anymore..) and toasted to “neural machine translation”.^{#}

finally, CIFAR has been running a number of programs aimed at the scientific and social aspects of research, such as the CIFAR Azrieli Global Scholars Program, sponsored by the Azrieli Foundation, and the AI Catalyst program. the Global Scholars program provides a set of opportunities for early-career scholars from a diverse set of disciplines, spanning from political science all the way to cosmology, to not only advance their science but also interact with peers from various disciplines and build a broader view, not only within science but across society. the AI Catalyst program, on the other hand, provides funding for proof-of-concept, exploratory and blue-sky projects in order to continue to fuel scientific & societal innovation. i’ve benefited from both of these programs. i was a CIFAR Azrieli Global Scholar from 2017 to 2019 and thoroughly enjoyed my interaction with peer Global Scholars from a diverse set of disciplines, including cosmology, quantum physics, journalism, biology, etc. i received a Catalyst grant last year (2020), which has allowed me to work with Prof. Jimmy Lin at U Waterloo to build Neural Covidex, a specialized search engine for COVID-19 related literature, and make it publicly available at https://covidex.ai/. truly, these programs have enabled me to go above and beyond my comfort zone both scientifically and socially.

it’s pretty clear i have tremendously benefited from CIFAR over the past decade or so, and perhaps only naturally i want others to experience and benefit from being part of CIFAR both scientifically and socially. in particular, i want scientists from a diverse set of backgrounds and disciplines to enjoy such opportunities, in line with how CIFAR is “*committed to creating a more diverse, equitable, and inclusive environment*.”

going beyond wanting and wishing this, i’ve decided to contribute more directly to this cause by donating $50,000 USD to CIFAR so that CIFAR can “*provide funding resources in support of women and researchers from underrepresented groups to attend professional development opportunities.*” it is certainly not a lot, and the impact of this donation on its own will be quite limited. i only wish it would nudge people, including organizations such as governments and companies, to think once more about the important roles played by CIFAR and its likes in supporting innovation and promoting diversity, inclusion and equity in science.

(@) well.. perhaps most objectively, i wouldn’t have been a Fellow of the Learning in Machines and Brains (LMB) program of CIFAR

(*) oh, i forgot to mention this even more important tidbit: Geoff Hinton “announced” the success of deep convolutional nets for ImageNet and “described” dropout at this summer school approximately five months ahead of NeurIPS 2012.

(#) these toasts were mainly led by Jamie Kiros who has become my dear friend ever since.

This time, this random stuff is contrastive learning. my thought on it was sparked by Lerrel Pinto’s message on #random in our group’s Slack, responding to the question “*What is wrong with contrastive learning?*” thrown by Andrew Gordon Wilson. Lerrel said,

Lerrel Pinto (2021)

My understanding is that getting negatives for contrastive learning is difficult.

i haven’t worked on the (post-)modern version of contrastive learning, but every time i hear of “*negative samples*” i am reminded of my phd years. during my phd years, i mainly worked on the restricted Boltzmann machine, which defines a distribution over the observation space as

$$p(x; W, b, c) \propto \exp(x^\top b) \prod_{j=1}^J (1+\exp(x^\top w_{\cdot, j} + c_j)),$$

where $W$, $b$ and $c$ are the weight matrix, visible bias and hidden bias. for simplicity, i’ll assume the visible bias is $0$, which is equivalent to saying that the input is on expectation an all-zero vector. This makes the definition above a bit simpler, and especially so when we look at the log-probability:

$$\log p(x; W, c) = \sum_{j=1}^J \log (1+\exp(x^\top w_{\cdot, j} + c_j)) - \log Z,$$
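as a concrete sketch, the unnormalized part of this log-probability takes only a few lines to compute (a toy RBM with dimensions of my own choosing; $\log Z$ is intractable and simply omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

J, d = 4, 5                    # number of hidden units (experts) and input dim
W = rng.normal(size=(d, J))    # weight matrix; column j is the expert weight w_j
c = rng.normal(size=J)         # hidden biases (visible bias b assumed 0, as above)

def log_unnorm_prob(x):
    # sum_j log(1 + exp(x^T w_j + c_j)); np.logaddexp(0, a) = log(1 + exp(a))
    return np.sum(np.logaddexp(0.0, x @ W + c))

x = rng.integers(0, 2, size=d).astype(float)
print(log_unnorm_prob(x))      # log p(x; W, c) up to the constant -log Z
```

`np.logaddexp` is used instead of a literal `log(1 + exp(.))` only for numerical stability with large activations.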

where $\log Z$ is the log-partition function or log-normalization constant.

the goal of learning with a restricted Boltzmann machine is then to maximize the log-probabilities of the observations (training examples):

$$\max_{W, c} \mathbb{E}_{x \sim D} [\log p(x; W, c)],$$

using stochastic gradient descent with the stochastic gradient derived to be

$$g_{\theta} = \sum_{j=1}^J \nabla_\theta \log (1+\exp(x^\top w_{\cdot,j} + c_j)) - \mathbb{E}_{x_- \sim p(x; W,c)} \left[\sum_{j=1}^J \nabla_\theta \log (1+\exp({x_-}^\top w_{\cdot,j} + c_j))\right].$$

the first term ensures that each hidden unit (or expert) $j$ is well aligned with the correct observation $x$ drawn from the data distribution (or training set.) not too surprising, since the alignment (dot product) between the expert weight $w_{\cdot, j}$ and a given observation gives rise to the probability of $x$.

the second term corresponds to computing the expected negative energy (ugh, i hate this discrepancy; we maximize the probability but we minimize the energy) over all possible observations according to the model distribution. what this term does is to look for all input configurations $x_-$ that are good under our current model and to make sure the hidden units (or experts) are not well aligned with them.

you can imagine this as playing whac-a-mole. we try to pull out our favourite moles, while we “whac” any mole that’s favoured by the whac-a-mole machine.

in training a restricted boltzmann machine, the major difficulty lies in how to efficiently and effectively draw negative samples from the model distribution. a lot of bright minds at the University of Toronto and the University of Montreal back then (somewhere between 2006 and 2013) spent years figuring this out. unfortunately, we (as a field) never got it to work well, which is probably not surprising, since we’re talking about sampling from an unnormalized (often discrete) distribution over hundreds if not thousands of dimensions. if it were easy, we would’ve solved most of the problems in ML already.

let’s consider a stochastic transformation $T: \mathcal{X} \to \mathcal{X}$, where $\mathcal{X}$ is the input space. given any input $x \in \mathcal{X}$, this transformation outputs $\tilde{x} \sim T(x)$ that very likely maintains the same semantics as the original $x$. this is often used for data augmentation, which has been found to be a critical component of contrastive learning (or, as a matter of fact, of any so-called self-supervised learning algorithm).

imagine a widely used set of input transformations in e.g. computer vision. $T$ would include (limited) translation, (limited) rotation, (limited) color distortion, (limited) elastic distortion, etc. we know these transformations often in advance, and these are often domain/problem-specific.

what we will now do is to create a very large set of hidden units (or experts) by drawing transformed inputs from the stochastic transformation $T$ for one particular input $x$. that is, we have $J$-many $\tilde{x}_j \sim T(x)$. in the case of computer vision, we’ll have $J$-many possible distortions of $x$ that largely maintain the semantics of $x$.

these hidden units then define a restricted Boltzmann machine and allow us to compute the probability of any input $x'$:

$$\log p(x' | \tilde{x}_1, \ldots, \tilde{x}_J) = \sum_{j=1}^J \log (1+\exp(s(x',\tilde{x}_j))) - \log Z,$$

where i’m now using a compatibility function $s: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ instead of the dot-product for more generality.

starting from here, we’ll make two changes (one relaxation and one restriction). first, we don’t want to use only $J$ transformed copies of the input $x$; we want to use all possible transformed versions of $x$ out of $T$. in other words, we want to relax the construction that this restricted Boltzmann machine has a finite number of hidden units. this turns the equation above into:

$$\log p(x' | T(x)) = \mathbb{E}_{\tilde{x} \sim T(x)}\left[ \log (1+\exp(s(x',\tilde{x})))\right] - \log Z.$$

second, we will assume that the input space $\mathcal{X}$ coincides with the training set $D$, which has a finite number of training examples, i.e., $D=\left\{ x_1, \ldots, x_N \right\}$. this second change affects only the second term (the log-partition function):

$$\log p(x' | T(x)) = \mathbb{E}_{\tilde{x} \sim T(x)}\left[ \log (1+\exp(s(x',\tilde{x})))\right] - \log \frac{1}{N} \sum_{n=1}^N \exp\left( \mathbb{E}_{\tilde{x} \sim T(x)}\left[ \log (1+\exp(s(x_n,\tilde{x})))\right] \right).$$

to summarize what we’ve done so far: we build one restricted Boltzmann machine for a given input $x \in \mathcal{X}$ by drawing the hidden units (or experts) from the transformation distribution $\tilde{x} \sim T(x)$. the support of this restricted Boltzmann machine is restricted (pun intended) to be a training set.

what would be a good training criterion for one such restricted Boltzmann machine? the answer is almost always maximum likelihood! in this particular case, we want to ensure that the original example $x$ is most likely under the restricted Boltzmann machine induced by itself:

$$\max_{\theta} \log p(x | T(x)),$$

where $\theta$ is the parameters for defining the compatibility function $s$ from above.

we do so for all $N$ restricted Boltzmann machines induced from $N$ training examples:

$$\max_{\theta} \frac{1}{N} \sum_{n=1}^N \log p(x_n | T(x_n)).$$

since it’s decomposed over the training examples, let’s consider only one example $x \in D$. we then train the induced restricted Boltzmann machine with stochastic gradient descent, following

$$\frac{1}{M} \sum_{m=1}^M \nabla_{\theta} \log (1+\exp(s(x, \tilde{x}_m; \theta))) - \frac{1}{M} \sum_{m=1}^M \sum_{n=1}^N p(x_n|T(x)) \nabla_{\theta} \log (1+\exp(s(x_n, \tilde{x}_m; \theta))),$$

where we use $M$ transformed copies to approximate the two expectations over $T(x)$ but not $p(x_n|T(x))$. we probably should use another set of $M$ transformed copies to get the unbiased estimate.
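putting the pieces together, here is a hedged numpy sketch of this per-example objective on a toy dataset (the gaussian-noise “augmentation” `T`, the cosine-similarity compatibility `s`, and all names and dimensions below are stand-ins of my own choosing, not a canonical recipe):

```python
import numpy as np

rng = np.random.default_rng(0)

N, d, M = 8, 5, 16                 # training-set size, input dim, Monte-Carlo copies
D = rng.normal(size=(N, d))        # toy training set

def T(x):
    # stand-in stochastic transformation: small additive gaussian noise
    return x + 0.1 * rng.normal(size=(M, d))

def s(a, B):
    # cosine-similarity compatibility between one point a and a batch B
    a = a / np.linalg.norm(a)
    B = B / np.linalg.norm(B, axis=-1, keepdims=True)
    return B @ a

def expert_score(x_prime, x_tildes):
    # Monte-Carlo estimate of E_{x~ ~ T(x)}[ log(1 + exp(s(x', x~))) ]
    return np.mean(np.logaddexp(0.0, s(x_prime, x_tildes)))

def log_p(i):
    # log p(x_i | T(x_i)) with the support restricted to the training set
    x_tildes = T(D[i])
    scores = np.array([expert_score(D[n], x_tildes) for n in range(N)])
    return scores[i] - np.log(np.mean(np.exp(scores)))

print(log_p(0))                    # bounded above by log N
```

note that the negative weights $p(x_n | T(x))$ in the gradient are just the softmax of these per-example scores, which is exactly the weighting used in the second term above.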

this looks quite similar to the more recently popular variants of contrastive learning. we start from a training example $x$, generate a transformed version $\tilde{x}$, maximize the compatibility between $x$ and $\tilde{x}$, and minimize the compatibility between $\tilde{x}$ and all the training examples (including $x$). there are minor differences, such as the choice of nonlinearity, but at a high level, it turns out we can derive contrastive learning from the restricted Boltzmann machine.

perhaps the only major difference is that this formulation gives us a clear guideline on how we should pick the negative examples. that is, according to this formula, we should either use all the training examples weighted according to how likely they are under this $x$-induced restricted Boltzmann machine or use a subset of training examples drawn according to the $x$-induced restricted Boltzmann machine without further weighting. of course, another alternative is to use uniformly-selected training examples as negative samples but weight them according to their probabilities under the $x$-induced restricted Boltzmann machine, *à la* importance sampling.

so, yes, contrastive learning can be derived from restricted Boltzmann machines, and this is advantageous, because it tells us how we should pick negative examples. in fact, as i was writing this blog post (and an earlier internal slack message,) i was reminded of a recent workshop i attended together with Yoshua Bengio. there was a talk on how to choose *hard* negative samples for contrastive learning (or representation learning) on graphs, and after the talk was over, Yoshua raised his hand and made this remark:

Yoshua Bengio (2019, paraphrased)

That’s called Boltzmann machine learning!

Indeed…

Based on this exercise of deriving modern contrastive learning from restricted Boltzmann machines, we can now have a meta-framework for coming up with a contrastive learning recipe. Any recipe must consist of three major ingredients:

- **A per-example density estimator**: i used the restricted Boltzmann machine, but you may very well use variational autoencoders, independent component analysis, principal component analysis, sparse coding, etc. these will give rise to different variants of self-supervised learning. the latter three are particularly interesting, because they are fully described by a set of basis vectors and don’t require any negative samples for learning. i’m almost 100% certain you can derive all these non-contrastive learning algorithms by choosing one of these three.
- **A compatibility function** $s$: this is the part where we design a network “architecture”, and decide how the output from this network is used to compute a scalar that indicates how similar a pair of examples is. it looks like the current practice is to use a deep neural net with a cosine similarity to implement this compatibility function.
- **A stochastic transformation generator**: this generator effectively generates a density estimator for each example. this is very important, since it defines the set of bases used by these density estimators. any aspect of the data cannot be modelled if these generated bases do not cover it.

we have a pretty good idea of what kind of density estimator is suitable for various purposes. we have a pretty good idea of the best way to measure the similarity between two highly-complex, high-dimensional inputs (thanks, deep learning!) but we cannot know what the right stochastic transformation generator should be, because it is heavily dependent on the problem and domain. for instance, the optimal transformation generator for static, natural images won’t be optimal for e.g. natural language text.

so, my sense is that the success of using contrastive learning (or any self-supervised learning) for any given problem will ultimately boil down to the choice and design of stochastic transformation, since there’s a chance that we may find a near-optimal pair of the first two (density estimator and compatibility function) that works well across multiple problems and domains.

- Ho-Am Prize & Scholarship for Macademia at Aalto University
- Ho-Am Prize & 백규고전학술상 (Baek-Gyu Scholarly Award for Classics)
- Ho-Am Prize & Lim Mi-Sook Scholarship (임미숙 장학금) at KAIST

i graduated from the Korea Advanced Institute of Science and Technology (KAIST) with a Bachelor of Science (B.Sc.) degree. i majored in computer science, a subject i’ve never left since, having become a professor of computer science (and data science) in 2015. although my undergraduate years were, in terms of education, closer to failure than success (which is extremely visible on my transcript,) i thoroughly enjoyed my days at KAIST and have fond memories of the years i spent there.

although the whole field, including myself, has become much more aware of the issue of gender imbalance in computer science in recent years, the issue was already super-clear when i was in my undergraduate years. my memory is definitely failing me, but i recall there were fewer than five female students out of approximately 60-70 students in my cohort. of course, this awareness did not mean that i felt any issue with it, nor that i was compelled to do something about it. it just felt only natural back then that boys majored in computer science and girls in biology (yes, i’m simplifying quite a bit here, but this is how it seemed to me back then.)

perhaps this is precisely what my mom and others in the family felt back when i was born. before i was born, my mom was a (junior) high school teacher, teaching Korean. my mom and dad graduated from the same university for their undergraduate degrees, after which my mom became a teacher and my dad decided to pursue higher degrees, eventually becoming a professor of korean literature. clearly both of them had the same level of education up until a certain point, but at that point, mom gave up her career to raise me and my younger brother, who was born less than 2 years after me. again, i’m sure this was a choice that seemed only natural back then.

unfortunately, it’s been about 20 years since i started my undergrad years at KAIST, and the issue of gender balance in computer science hasn’t gotten any better. in fact, this issue, which i didn’t even realize existed back then, turned out to be just the tip of the iceberg. the field of computer science, or perhaps more narrowly machine learning, is riddled with imbalances: gender imbalance, geographical imbalances (over-representation of north america, europe and east asia over other parts of the world), imbalance across races (6 black researchers out of more than 5,000 attendees of NeurIPS 2017, as noticed by Timnit Gebru), and many more.†

these issues are somehow “discovered” each day, but the truth is that we are barely freeing ourselves from the social constructs that have blinded us or have convinced us that these imbalances are only natural. this is just like how i never thought it was an issue that all boys majored in computer science while all girls majored in biology when i was in my sophomore year. this is just like why my mom quit her job to raise me and my brother more than 35 years ago, which i’m sure no one questioned then.

i don’t have any solution to this issue of social blindness, but one thing i have become aware of is that one cannot see what is not there for them to see. when i was one of the 90% or more of boys who majored in computer science 18 or so years ago, i couldn’t see the problem. when i was one of the 90% or so of non-black, male researchers attending ICML and NeurIPS over many years, i couldn’t see the problem. i mean, i was having beer, tequila, etc. non-stop together with Yann Dauphin, but i couldn’t see this near-complete absence of black researchers as a troubling trend at all. i started to see these problems of equal access, equity, etc. only when people began raising these issues and bringing them to my attention. in other words, the one remedy i know and have experienced myself is to create a diverse environment in which each individual can see and interact with diverse individuals and hear their stories.

so, as a small effort toward helping build such diverse environments, i have decided to donate approximately ₩100,000,000 (≈ \$91,000 USD) to the Department of Computer Science, School of Computing at KAIST to create a small scholarship named after my mom (Lim Mi-Sook 임미숙) that will provide a small supplement (≈ \$900) each to a small group of female students majoring in computer science, at the beginning of each semester, until the fund runs out.^{∘} it’s not a lot, but it never hurts to have some extra allowance at the beginning of each semester. they might use it to buy a new iPad, either for taking better notes in their classes or for watching Netflix more comfortably. they might use it to hang out with their friends and have some nice meals. they might use it to pay for their hobbies.^{⊚} however they spend it, i only hope this will encourage them to continue their study in computer science and encourage others to join computer science in the future, thereby contributing toward building a more diverse community of computer scientists (so that my little niece will eventually want to study computer science and be a computer scientist.) furthermore, i wish this will help us, including myself, more easily and readily see, and break ourselves free from, the social constructs/biases that unfairly disadvantage and harm subsets of the population.

finally, here’s why i named it after my mom: although i structured this scholarship to be from my mom, it won’t let me or my mom answer how her career would’ve gone had she not given up on it when i was born. it will, however, make all of us think more about the burden of raising children, which is so often placed disproportionately on mothers, and how it should be better distributed among parents, relatives and society, in order to ensure and maximize equity in education, career development and advancement.

† more and more organizations and initiatives have been founded to address these challenges, including Women in Machine Learning, Black in AI, etc. (see e.g. the Diversity, Equity and Inclusion page of ICLR’21.) these are organizations that make me proud to be a part of this research community.

∘ oh, and i asked the department to arrange a lunch between my parents and these students each semester. i think my parents will love talking with them, and i hope the students will also enjoy the lunch.

⊚ see my earlier post <Giving thanks: Samsung AI Researcher of the Year Award and Donation to Mila> for more of my thoughts on this.

- Ho-Am Prize & Scholarship for Macademia at Aalto University
- Ho-Am Prize & 백규고전학술상 (Baek-Gyu Scholarly Award for Classics)
- Ho-Am Prize & Lim Mi-Sook Scholarship (임미숙 장학금) at KAIST

i’ve rarely mentioned my father in this blog, for no particular reason, but perhaps it’s a good time to talk about him briefly in this post.

his name is Kyu-Ick Cho (조규익), and he’s a professor of Korean Language and Literature at Soong-sil University in Seoul, Korea. perhaps unsurprisingly, i don’t know much about Korean language or literature, not to mention Korean *classical* literature and art, in which he is one of the world’s leading experts. i only know a few things i picked up here and there about his research as i grew up. unfortunately i’m way out of my depth to even list what he’s worked on, done and continues to work on, although i can point you to his homepage (http://kicho.pe.kr/), where you can find the ever-growing list of books and papers he has authored (warning: all in Korean).

one thing i can talk about is that watching my father from the side has helped me see the stark difference between how things work in engineering/science and in the humanities. when it comes to research in Korean classical literature and art, intellectual curiosity and perhaps intellectual responsibility truly matter. you do not build anything new that may change the world. you do not discover something that may change the world. you do not learn skills that may make you valuable to for-profit organizations. your research is probably not supported by deep-pocketed industry, and if it is supported by the government, it is at a level that barely keeps you alive. it’s pretty much all about fulfilling your intellectual curiosity and carrying out your duty and responsibility as an academic.

although the economy in korea has grown tremendously, this doesn’t necessarily translate into increased investment in humanities research, especially in those areas of the humanities that do not translate immediately into economic value. korean classical literature and art is clearly one such area, where no one expects any *return* on investment at any time. after all, it is *literature* and *art*, and perhaps worse yet, it is *classical*.

there are many negative consequences of such plateaued or shrinking investment that i’d love to talk about at length. in this post, however, let me stick to just one particular consequence: such lack of investment discourages (if not outright prevents) researchers from pursuing their intellectual curiosity and responsibility, thereby effectively serving as a death sentence for the field. to understand what i mean here immediately, imagine how you’d react if your kid announced they’ll pursue a PhD in Korean Literature.

perhaps unsurprisingly then, i find it quite disturbing that we may be looking at a serious chance that, at some point not too far in the future, there won’t be anyone left to study and research korean classical literature and art. out of the few things that set us (humans) apart from other intelligent species, literature and art, which are closely related to each other, with their boundary becoming fuzzier the further back in time we go, are clearly at the forefront, and if we can’t afford to spare the effort & time to create, enjoy and preserve these artifacts ourselves, what are we really doing here?

of course, despite this shrinking investment in korean classical literature & art research, researchers in this field have not given up, my father included. in order to build an environment that accommodates more junior and less established researchers in the field, he founded a research center at Soong-Sil University, the Center for Korean Literature & Art, in 2006 and has run it ever since. the center has its own journal that publishes 3-4 issues each year. it hosts annual conferences gathering a small number of researchers dedicated to korean literature & art. it publishes many books each year. as far as i can tell, the center is not growing in terms of the number of people, but its activities, as well as the coverage of research areas within Korean classical literature and art, have steadily grown since its founding.

so, yes, he is really trying hard together with a small number of colleagues and peers. in fact, he’s been doing so ever since he began his career as a professor of korean language and literature in the mid-80’s, although from what little i’ve seen from the side, this has been an uphill battle. and now, with his retirement one year away, the future of korean classical literature and art does not look particularly bright.

when i was a kid, i recall one year (1996) when my father received two highly respected awards. one was the Do-Nam Award for Korean Literature Research (도남국문학상), and the other was the Seong-San Award ~~for Korean Classical Poetry Research~~ (성산~~시조~~학술상). obviously i wasn’t aware of how big a deal these awards were back then, nor do i know how big a deal they are even now. i could however feel that they must have been a big deal, because i could sense the pride in my father’s eyes when he broke the news. i even remember attending the ceremony for one of these awards (i’m not sure if i attended both, though; my memory is failing me here.)

that was 25 years ago, when my father was still considered junior (i mean… it’s the field of Korean *classical* literature and art, where everyone’s supposed to stay junior forever.) these prizes must’ve meant quite a bit, in that they not only recognized his research but also encouraged him to advance it further. noticing that these two awards are always mentioned in his bios as well as his CVs, i presume i’m not too far off here.

unfortunately, it doesn’t look like either of these awards exists anymore. i could trace the Do-Nam award up to 2008, but i couldn’t find any information about it after that. in fact, i couldn’t even find the list of awardees from a few minutes of Googling (and Navering). the same goes for the Seong-San award: i could trace it up to 2003 or so, but again can’t find anything substantial about it. it’s quite a shame. two prominent ways to recognize and encourage researchers in this relatively narrow field of korean classical literature and art seem to have been lost over time (although these awards were not only for classical literature & art but recognized achievements in the broader field of korean literature.)

no individual will be able to save the whole field of korean classical literature and art. it will have to be the whole society’s effort to save this field, and along the way our soul as well. my father has devoted his entire career to this cause and will continue to do so even after his retirement, although his forecast becomes gloomier each time i talk with him. to this end, i’ve decided to contribute just a little myself to this effort of saving, and perhaps even growing, research in Korean classical literature and art by donating ₩100,000,000 (approx. $90,000 USD) to the Center for Korean Literature and Art, with the stipulation that it be used to create an award for Korean classical literature and art.

this award will be given to 1-2 researchers each year, with approximately $2,000-5,000 each (to be determined by the Center’s Board each year), until the fund runs out, with the hope that it can recognize the achievements of, and encourage the future endeavors of, researchers in the field of Korean classical literature and art, just like what those two awards above did for my father and what the Ho-Am Prize is doing for me.

oh, right, i almost forgot to mention: i’ve also put in one small condition, that this award be named after my father’s pen name^{*} 백규 (Baek-Gyu, 白圭). so this award, which will hopefully start being given out next year (2022), will be called the Baek-Gyu Award in the field of Korean Classical Literature and Art (백규고전학술상).

* 호 (ho); i’m not sure what the right translation of this is in English. it’s a kind of nickname given by another, often a teacher or fatherly figure.
