in the previous post (How to think of uncertainty and calibration …), i described a high-level function $U(y, p, \tau)$ that can be used for various purposes, such as (1) retrieving all predictions above some level of certainty and (2) calibrating the predictive distribution. of course, one thing that was hidden under the rug was what this predictive distribution $p$ was. in this short, follow-up post, i’d like to give some thoughts about what this $p$ is.
to be specific, i will use $p(y|x)$ to indicate that this is a distribution over all possible answers $\mathcal{Y}$ returned by a machine learning model $f$ given an input $x$. this is distinguished from a predictive distribution computed directly by that model $f$, which i will denote as $p_f(y|x)$. these two, $p(y|x)$ and $p_f(y|x)$, are different from each other in that the former takes into account uncertainty that cannot be captured by the machine learning model $f$ while the latter captures at most the uncertainty it can capture. this distinction will be made clearer later in this post.
of course, neither of these two needs to be an actual probability given an input and an arbitrary target $(x,y)$ but can just be an arbitrary scalar $-E(x,y) \in \mathbb{R}$, as we can turn this into the probability by
$$p(y|x) = \frac{\exp(-E(x,y))}{\int_{\mathcal{Y}} \exp(-E(x,y’)) \mathrm{d}y’}.$$
what does it mean for us to use a predictive distribution rather than a single-point prediction? this is equivalent to saying that there are a set of answers that we consider likely. then, here comes a natural, follow-up question: why are there many likely answers, not only one? one way to answer this question is to say that there exists uncertainty in the answer. then, here’s the next follow-up question: where does this uncertainty come from? this is the question i’ll try to answer by enumerating what i can imagine as the sources of uncertainty in this post.
instead of talking about irreducible (was it aleatoric?) and reducible (was it epistemic?) uncertainty, i’ll just be very much down to earth and talk about some of the sources of uncertainty that i believe we should think of.
before i continue, let me clarify what i mean by $y$ here. $y$ is one of all possible answers. in the case of classification, $y$ is one of all possible classes. in the case of multi-label classification (many binary classifiers,) $y$ is one of the all possible combinations, i.e., $y \in \{0, 1\} \times \cdots \times \{0, 1\}$. in other words, we do not have to worry too much about dependencies between different dimensions of $y$, although this makes it tricky to think of continuous $y$ (it does reveal what i’m interested in, doesn’t it?)
under this setup of $y$, i will care about the probability assigned to $y$ rather than the variance of the probability assigned to $y$. this arises from my desire to consider only those $y$’s that receive reasonably high probabilities. among these reasonably probable $y$’s, those that are more highly probable are also the ones that tend to (but not always for sure) have lower variance (just because the probability is bounded between $0$ and $1$.) in other words, we care about how many highly plausible answers there are and what they are.
of course, you can replace $\mathbb{E}$ with $\mathbb{V}$ below to get the variance of the probability assigned to $y$ rather than the average. perhaps it’s a good idea then to use some combination of the mean and variance, similarly to using e.g. upper-confidence bound in various active learning setups. but, well, i’m writing a blog post not a book here.
Finite data
the first source of noise that comes to my mind is our use of a finite number of examples, for both training and evaluation. even if there exists a single correct answer $y^*$ for an input $x$, it is possible that it may be impossible to precisely identify this correct answer $y^*$ given only a finite number of examples from which our machine learning model learns. even worse, different answers may look more likely when different sets of training examples are used.
it is always reasonable to assume our learning algorithm can only work with a finite number of examples (even in the most optimal case, it will be bounded by how long Google thrives and survives …) let’s say we always use $K$-many training examples drawn from a single data distribution $p_{\mathrm{data}}$. the uncertainty arising from this finite nature of data can be written as
$$p(y|x) = \mathbb{E}_{(x^1, y^1), \ldots, (x^K, y^K) \sim \underbrace{p_{\mathrm{data}} \times \cdots \times p_{\mathrm{data}}}_{K}} \left[\mathrm{LEARN}((x^1, y^1), \ldots, (x^K, y^K))(x)\right],$$
where $\mathrm{LEARN}$ is a learning algorithm that returns a trained model. in other words, we need to try training as many models as possible with size-$K$ subsets and see how the predictions from these models vary.
it makes sense to a certain degree, but of course, this is not tractable in general, because we are often given a single set of training examples to work with, instead of the full data distribution from which we can freely sample a new set of training examples. yes, yes, you’re right that sometimes we have (expensive) access to the data distribution, but let’s assume this is not the case in our case.
instead, it is possible to generate pseudo-training sets by re-sampling multiple sets from this single training set. this is what we often refer to as bootstrap resampling. this is a nice way to capture the variation/uncertainty caused by sampling of data, but it is often intractable to use this methodology in deep learning, as the size of a data set needed to train a model is pretty huge.
of course, such a resampling strategy can be used to measure the uncertainty in the test accuracy given a single model $f$:
$$\mathrm{Var}(\mathrm{EVAL}{p(y|x)}) = \mathbb{V}_{(x^1, y^1), \ldots, (x^K, y^K) \sim \underbrace{p_{\mathrm{test}} \times \cdots \times p_{\mathrm{test}}}_{K}} \left[\mathrm{EVAL}_{p_f(y|x)}((x^1, y^1), \ldots, (x^K, y^K))\right],$$
where $p_f(y|x)$ is the predictive distribution from one particular model $f$.
Measurement noise
the second source of uncertainty is noisy measurement. this is closely related to the sampling-induced uncertainty above, except that we now split the data distribution $p_{\mathrm{data}}$ into two parts; data generation and noise injection. the process by which a single pair $(x,y)$ is sampled is
- true measurement: $(\hat{x}, \hat{y}) \sim p_{\mathrm{data}}(x, y)$
- noisy measurement of the input: $x \sim C_x(x | \hat{x})$
- noisy measurement of the output: $y \sim C_y(y|\hat{y})$
$C_x$ and $C_y$ are the noisy measurement processes for $x$ and $y$, respectively.
let’s first consider the input noise $C_x$. we assume that $C_x$ is symmetric (i.e., $C_x(\hat{x}|x) = C_x(x|\hat{x})$.) this symmetry tells us that we can draw plausible samples of a _clean_ version $x$ given the noisy measurement $\hat{x}$ from $C_x(x | \hat{x})$. we don’t know exactly which of these samples is the original version $x$, but they are all largely plausible. the uncertainty arising from our inability to perfectly denoise the noisy measurement shows up as:
$$p(y|\hat{x}) = \mathbb{E}_{x \sim C_x(x | \hat{x})} \left[ p_f(y|x) \right],$$
where $p_f$ is the predictive distribution returned by a classifier $f$. just like above, this classifier may return an unnormalized scalar, in which case we turn it into the probability by softmax normalization. in other words, the uncertainty is in how the prediction varies across plausible original version of $\hat{x}$ according to the symmetric noisy measurement process $C_x$.
this implies that we can reduce this particular type of uncertainty if we knew (potentially irreversible) $C_x$, by maximizing $\log p(y^*|\hat{x})$ above rather than $p_f(y^*|\hat{x})$. of course, $C_x$ is often (if not always) unknown, and people often resort to manually crafting a proxy corruption process that mimics a reasonable noisy measurement process $C_x$ and sample from it during training to approximate the expectation above. this practice is nowadays referred to as data augmentation. of course, smart ones (yes, like my awesome collaborators 😉) learn a proxy to $C_x$ from unlabelled data, such as we have done with SSMBA recently for natural language processing.
now, let’s quickly consider the output noise $C_y$. “quickly”, because it doesn’t really differ much from the input noise $C_x$. the major difference is that it is often only useful in the training time, since $y$ is not known in the test time.
with this in our mind, let’s consider a particular noisy measurement process $C_y(y | \hat{y}) = \alpha \delta(y|\hat{y}) + (1-\alpha) \mathcal{U}(y; \{1,2, \ldots, L \})$, where $\delta$ is a Dirac delta distribution, $\mathcal{U}$ is a uniform distribution, and $\alpha \in [0, 1]$ is a mixing coefficient. with the probability $\alpha$, there is no noise, and with the probability $1-\alpha$, we switch the label to one of all possible labels uniformly.
we can now express the uncertainty by
$$\mathbb{V}_{y \sim C_y(y|\hat{y})} \left[ \log p_f (y|x) \right]$$
of course, we don’t really have access to $\hat{y}$, but we can flip $\hat{y}$ and $y$ above, because $C_y$ is symmetric; with the probability $\alpha$ the clean answer would’ve been $y$ itself and otherwise it could’ve been anything. in other words, we can reduce this uncertainty by minimizing
$$\mathbb{V}_{y \sim C_y(y’|y)} \left[ \log p_f (y’|x) \right].$$
an interesting observation here is that $\log p_f (y’|x)$ is bounded from above by $0$. this means that we can indirectly minimize this variance by maximizing each $\log p_f (y’|x)$ for $y’ \sim C_y(y’|y)$:
$$\mathbb{E}_{y’ \sim C_y(y’|y)} \left[ \log p_f (y’|x) \right],$$
which can be rewritten with this particular $C_y$ as
$$
\sum_{y’ \in \mathcal{Y}} I(y=y’) \left(1-\frac{1-\alpha}{|\mathcal{Y}|}\right) \log p_f (y|x) + I(y\neq y’) \frac{1-\alpha}{|\mathcal{Y}|} \log p_f (y’|x).
$$
this reminds us of a widely-used technique of label smoothing which always ensure that a model assigns some non-zero probabilities to incorrect classes while maximizing the log-probability of the observed label $y$. so, one way to think of what label smoothing does is that it reduces the uncertainty arising from the noisy measurement of labels.
noise in the output $y$ is however trickier when it is not noise. what does it mean? it means that there may be genuinely multiple correct answers, and that we cannot tell by looking at $y$ alone whether it is noisy version of $\hat{y}$ or that it is just one of many possible clean answers. this is an interesting observation to think about: so-called irreducible noise is often indistinguishable from so-called reducible noise in practice!
if we somehow know that there are genuine ambiguity in the output (which is quite common, such as in machine translation and any other structured prediction problems,) we can deal with it by introducing stochastic hidden variables in our model, such as in stochastic feedforward networks as well as conditional RBM/NADE. of course, such a powerful conditional density model will inevitably capture not only genuine ambiguity but also genuine measurement noise, potentially leading to the issue of overfitting.
Stochastic learning
let $\epsilon$ be some arbitrary random variable from which we can sample numbers in order to make some arbitrary decisions in our learning algorithm. there are so many things we often need to make arbitrary decisions for. some of them are:
- how do we build a minibatch?
- if we are building a minibatch on the fly, which subset of training examples do we use?
- if we are taking the next chunk of training examples, which order do we sort the training examples in?
- parameter initialization
- how do we initialize the parameters of our model?
- dropout (or any stochastic regularizer)
- which hidden units do we drop to $0$?
- Underlying compute engine
furthermore, some learning algorithms intentionally rely on such randomness. a representative example is policy gradient in which noise is added to smooth out the super-difficult optimization problem of
$$\max_{\pi} R(\arg\max_a \pi(a|s))$$
into a slightly-less-difficult problem of
$$\max_{\pi} \mathbb{E}_{a \sim \pi(a|s)} R(a).$$
this smoothing is done by arbitrarily choosing the action among many plausible actions according to $\pi$ at state $s$. such sampling is often implemented by transforming a series of random numbers (e.g. drawn from $\epsilon$) into a single sample from $\pi(\cdot|s)$.
we can now abstract out these details and make $\epsilon$ an additional input to $\mathrm{LEARN}$ function above. this learning algorithm takes as input the training set as well as this source of randomness. then, for each input $x$, we can check the uncertainty of our prediction by considering multiple (potentially infinitely many) models arising from the variance induced by $\epsilon$:
$$p(y|x) = \mathbb{E}_{\tilde{\epsilon} \sim \epsilon} \left[\mathrm{LEARN}((x^1, y^1), \ldots, (x^N, y^N), \tilde{\epsilon})(x)\right],$$
which looks exactly like the bootstrap resampling version above. this is only natural, because dataset sampling itself can be thought of as an arbitrary selection of a subset of all possible examples. it is however informative to think of these two separately, since noise in stochastic learning is what we often can explicitly control and noise in dataset sampling is what we often don’t have much control over (perhaps except in active learning.)
let $\tilde{\epsilon} = (\epsilon^1, \ldots, \epsilon^M) \sim \epsilon$ be a series of random numbers drawn from $\epsilon$ for a single training run. one may be tempted to choose $\tilde{\epsilon}$ with the best validation accuracy and deploy the corresponding model. but, in an application where it is important to find more than one answers with reasonable estimates of their probabilities, it is a much better idea to bag all of them for deployment. this is also why you do not want to and should not tune a random seed.
it is understandably quite expensive to use many models arising due to $\epsilon$ in real life, unfortunately. it is however an attractive feature of this approach to have a full distribution over $y$ that reflects varying degrees of likelihood of each $y$. it is a usual practice to use the idea of knowledge distillation, to train another model that is not trained on the targets from the data $y^n$ but on the entirely predictive distribution $p(y|x^n)$.
Hyperparameter Tuning
in fact, it is not only the training procedure but also the hyperparameter search procedure that relies extensively on those random numbers sampled from $\epsilon$. it is because we almost always cannot perform exhaustive search nor deterministic line search due to an ever-increasing number of hyperparameters. this is similar to the uncertainty arising from stochastic learning above, in that our choice of hyperparameters which directly affect learning has its own noise, e.g. arising from random search, which results in predictive uncertainty.
furthermore, the uncertainty, that is similar to the data sampling uncertainty above, exists with hyperparameter tuning as well, as it is a common practice to use a fixed, finite set of validation (held-out) examples in hyperparameter tuning. each time we use a different set of validation examples drawn from the true data distribution (whatever that is), we would end up with a hyperparameter configuration that leads to a different model that ends up making slightly different prediction. we would aggregate them to understand how uncertain we are of any particular prediction over this validation set sampling noise. one could imagine that bootstrap resampling would work well, but we often
the first type of uncertainty from hyperparameter tuning, that arises from random search, is one of my favourite along with the smoothing technique we often use in learning. for it is the kind of uncertainty that does not exist in the ideal world in which we have all the time in the world and all the compute in the world. even if it is a deterministic mapping from a hyperparameter to an individual prediction, our inability introduces the uncertainty. now, this is what people call as reducible uncertainty, but is it really reducible? i don’t think so.
Unobserved variables
finally, i want to spend just one paragraph on one particular case of uncertainty inherent to a problem itself. we already talked about it above when we discussed noisy measurement of the output. that is, what if there are genuinely multiple possible answers?
it turned out that there may be many different reasons why there are multiple possible answers inherently to the problem in our hands. among those many reasons, i want to talk briefly about one particular scenario. in this scenario, there exists a set of unobserved variables $u$ that affect the target variable in some way (it is not really important how, in this high-level, light blog post.) that is, the true function that determines the target takes as input not only the observed input $x$ but also $u$, i.e., $y = g(x, u)$. because we do not observe $u$, given only $x$, there are multiple correct answers.
this can be indeed handled to a certain degree by stochastic feedforward networks as well as conditional density models, but it’s pretty impossible to ensure our choice of such a model can model the unobserved $u$. after all, it’s unobserved, and we do not even know what it is and even more so whether it exists.
Closing
in this blog post, i enumerated a few sources of uncertainty in machine learning that i could immediately think of off the top of my head. they include finite data sampling, measurement noise, stochastic learning, hyperparameter tuning and unobserved variables. this doesn’t include other potential sources, such as uncontrollable shift in the environment between training and test times. furthermore, it does not cover other paradigms, such as online learning, active learning, etc., because i literally don’t know them well.
now… time to take all those sources into account for quantifying and using uncertainty!
Acknowledgement
this whole post was motivated by my (continuing) discussion with our wonderful members at Prescient Design: Ji Won Park, Natasha Tagasovska, Jae Hyeon Lee and Stephen Ra. Oh, also, we are hiring!