How to think of uncertainty and calibration …

since i started Prescient Design almost exactly a year ago, and Prescient Design joined Genentech about 4 months ago, i've begun thinking about (but not taking any action on) uncertainty and what it means. as our goal is to research and develop a new framework for de novo protein design that includes not only a computational component but also a wet-lab component, we want to ensure that we balance exploration and exploitation carefully. in doing so, one way that feels natural is to use the level of uncertainty our algorithm assigns to a design (a novel protein proposed by our algorithm) as a knob to control the trade-off between these two, in addition to other properties predicted by various (pseudo-)oracles.

but, then, i realized i don't know what uncertainty is at a high level (!), which is somewhat weird, since i think i can often follow the specific details of any paper that talks about uncertainty and what to do with it. so, as someone who dies for attention (pun intended, of course), i've decided to write a blog post on how i think i (should) view uncertainty. this view has almost no practical implication, but it helps me think of predictive uncertainty (aside from all those crazy epistemic vs. alleatoric uncertainties, which i'm sure i mistyped.)

in my mind, i start with the following binary indicator:

$$U(p, y, \tau) = I\left( \sum_{y' \in \mathcal{Y}} I(p(y') \geq p(y))\, p(y') \leq \tau\right).$$

if we are considering a continuous $y$, we replace the summation with an integral:

$$U(p, y, \tau) = I\left( \int_{y' \in \mathcal{Y}} I(p(y') \geq p(y))\, p(y')\, \mathrm{d}y' \leq \tau\right).$$

$\mathcal{Y}$ is a set/space of all possible $y$'s. $I(\cdot)$ is an indicator function, i.e., it returns $1$ if its argument is true and $0$ otherwise. $p$ is a predictive distribution under which we want to measure the uncertainty (e.g., a categorical distribution returned by a softmax classifier.) $y$ is a particular value of interest, and $\tau$ is a user-provided threshold.

this binary indicator tells us whether a particular value $y$ is within the top-$(100 \times \tau)\%$ of values under $p$, i.e., whether the probability mass of values at least as probable as $y$ is at most $\tau$. this can be used for a number of purposes.
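to make this concrete, here is a minimal sketch of the indicator for a discrete $y$, assuming $p$ is given as a plain array of class probabilities (the function name and the toy numbers are made up for illustration):

```python
# a minimal sketch of the binary indicator U for a discrete y, assuming p is
# given as a 1-d array of class probabilities (e.g., a softmax output).
import numpy as np

def U(p, y_idx, tau):
    """1 if the mass of outcomes at least as probable as y_idx is at most tau."""
    return int(p[p >= p[y_idx]].sum() <= tau)

p = np.array([0.5, 0.3, 0.15, 0.05])  # a toy categorical predictive distribution
print(U(p, 0, 0.6))   # 1: mass at least as probable as class 0 is 0.5 <= 0.6
print(U(p, 1, 0.6))   # 0: mass at least as probable as class 1 is 0.8 >  0.6
```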

first, we can use it to check how certain any particular prediction $\hat{y}$ is under our predictive distribution. let $p(y|x)$ be the predictive distribution returned by our classifier. we can solve the following optimization problem:

$$\min_{\tau \in [0, 1]} \tau$$

subject to

$$U(p(\cdot|x), \hat{y}, \tau) = 1.$$

in other words, we try to find the smallest threshold $\tau$ such that $\hat{y}$ is included in the top-$(100 \times \tau)\%$. we refer to the solution of this optimization problem as $\hat{\tau}$.

there is a brute-force approach to solving this optimization problem, which sheds a bit of light on what it does (and a bit on why i started with $U$ above,) although this only works for a discrete $y$. first, we enumerate all possible $y$'s and sort them in decreasing order of the corresponding $p(y|x)$'s. let us call this sorted list $(y^{(1)}, y^{(2)}, \ldots, y^{(N)})$, where $N = |\mathcal{Y}|$. then, we search for $\hat{y}$ in this sorted list, i.e., $\hat{i} = \min \{ i \in \{1, \ldots, N\} : y^{(i)} = \hat{y} \}$. then, $\hat{\tau} = \sum_{j=1}^{\hat{i}} p(y^{(j)}|x)$. in short, we look at how much probability mass is taken up by predictions that are at least as probable as $\hat{y}$ (including $\hat{y}$ itself), which seems (to me at least) to be the right way to think of the uncertainty assigned to $\hat{y}$.
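as a sanity check, this is how the brute-force computation of $\hat{\tau}$ might look for a categorical $p(\cdot|x)$; the numbers are made up and ties are ignored:

```python
# a minimal sketch of the brute-force solution for tau-hat, assuming a
# discrete y and no ties in p(y|x).
import numpy as np

def tau_hat(p, y_hat_idx):
    """smallest tau such that y_hat_idx is within the top-(100*tau)% under p."""
    order = np.argsort(-p)                            # sort outcomes by decreasing p(y|x)
    i_hat = int(np.where(order == y_hat_idx)[0][0])   # position of y_hat in the sorted list
    return p[order[:i_hat + 1]].sum()                 # mass at least as probable as y_hat

p = np.array([0.5, 0.3, 0.15, 0.05])
print(tau_hat(p, 0))   # 0.5:  for the arg-max prediction, tau-hat equals p(y_hat|x)
print(tau_hat(p, 2))   # 0.95: almost the entire mass sits at or above this outcome
```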

second, we can use it to enumerate all predictions that should be considered under a given threshold $\tau$, beyond the single best prediction, by solving the following optimization problem:

$$\max_{Y \subseteq \mathcal{Y}} |Y|$$

subject to

$$U(p(\cdot|x), y, \tau) = 1,~\forall y \in Y.$$

in other words, we look at the largest subset $Y$ such that each and every element within $Y$ is certain under the predictive distribution $p(\cdot|x)$ with the certainty $\tau$.

this is a useful problem to solve and return the answer of, especially when we know that the problem has inherent uncertainty. in the case of machine translation, for instance, there is generally more than one equally good translation of a given source sentence, and it is only natural to return the top-$(100 \times \tau)\%$ translations rather than the single best translation (though we don't do that in practice, unfortunately.)

the same brute-force solution from the first problem is equally applicable here. once we have the sorted list, we find the largest $\hat{i}$ such that $\sum_{j=1}^{\hat{i}} p(y^{(j)}|x) \leq \tau$ and simply return $Y = \{y^{(1)}, y^{(2)}, \ldots, y^{(\hat{i})}\}$. this is, however, too brute-force and is not tractable (nor applicable) in many situations (precisely why we don't return multiple possible translations in machine translation in practice.)
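for completeness, a minimal sketch of this enumeration for a categorical $p(\cdot|x)$ (again with made-up numbers and ties ignored):

```python
# a minimal sketch of enumerating the prediction set Y for a given tau,
# assuming a discrete y: keep every outcome for which U(p, y, tau) = 1.
import numpy as np

def prediction_set(p, tau):
    order = np.argsort(-p)       # outcomes in decreasing order of p(y|x)
    cum = np.cumsum(p[order])    # mass of outcomes at least as probable as each one
    return order[cum <= tau].tolist()

p = np.array([0.5, 0.3, 0.15, 0.05])
print(prediction_set(p, 0.6))    # [0]       -- only the best prediction fits under tau = 0.6
print(prediction_set(p, 0.95))   # [0, 1, 2] -- the top three predictions fit under tau = 0.95
```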

third, we can use $U$ to calibrate a given predictive distribution toward any criterion. for instance, we can measure, at each confidence level $\tau$, how far the proportion of good predictions deviates from $\tau$:

$$J(\hat{p}; \tau) = \left|\frac{
\mathbb{E}_{x, y^* \sim p^*} \left[I(|\hat{\tau}(\hat{p}, \hat{y}) - \tau| < \epsilon)\, I(|y^* - \hat{y}| < \delta)\right]
}
{
\mathbb{E}_{x, y^* \sim p^*} \left[I(|\hat{\tau}(\hat{p}, \hat{y}) - \tau| < \epsilon)\right]
}
- \tau \right|,$$

where $\hat{p}$ is a monotonic transformation of $p$, and $\hat{y}=\arg\max_y p(y|x)$. our calibration criterion is then that $J(\hat{p}; \tau) < \epsilon$ for all $\tau \in [0, 1]$. you can think of $\hat{p}$ as a target distribution obtained by calibrating $p$ to satisfy this inequality.

this criterion looks a bit confusing, but let's parse it out. the two expectations effectively correspond to drawing true examples $(x, y^*)$ from the ground-truth distribution $p^*$. for each $x$, we check whether the confidence $\hat{\tau}$ assigned to the prediction $\hat{y}=\arg\max_y \hat{p}(y|x)$ falls near the confidence level $\tau$. among the cases that do, we check how good the prediction is (i.e., whether $|y^* - \hat{y}| < \delta$). the proportion of such good predictions (the ratio above) should be within a close neighbourhood of the confidence level $\tau$.
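to make the ratio above concrete, here is a minimal sketch of estimating $J(\hat{p}; \tau)$ on a held-out set, assuming discrete labels (so a good prediction simply means $y^* = \hat{y}$) and using $\hat{\tau}$ of the arg-max prediction, which (ignoring ties) reduces to $\hat{p}(\hat{y}|x)$ itself; the array names and the tolerance are made up for illustration:

```python
# a minimal sketch of an empirical estimate of J(p_hat; tau), assuming
# discrete labels; probs is an (n, k) array of predicted probabilities and
# y_true is an (n,) array of integer labels.
import numpy as np

def J_empirical(probs, y_true, tau, eps=0.05):
    y_hat = probs.argmax(axis=1)        # the single best prediction per example
    tau_hat = probs.max(axis=1)         # tau-hat of the arg-max prediction (ties ignored)
    near = np.abs(tau_hat - tau) < eps  # examples whose confidence is close to tau
    if not near.any():
        return 0.0                      # no examples fall in this confidence bucket
    accuracy = (y_hat[near] == y_true[near]).mean()
    return abs(accuracy - tau)          # deviation from perfect calibration at level tau
```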

with this criterion, we can solve the following optimization problem for calibration:

$$\min_{F} \int_{0}^1 J(F(p); \tau) \mathrm{d}\tau + \lambda \mathcal{R}(F),$$

where $\mathcal{R}(F)$ is some measure of the complexity of the monotonic transformation $F$ with the regularization coefficient $\lambda > 0$.

we can think of this optimization problem as finding minimal changes we need to make to the original predictive distribution $p$ to maximally satisfy the criterion above. of course, we can use different formulations, such as using a margin loss, but the same idea holds regardless.

there can be many other criteria. for instance, we may only care that the true value $y^*$ falls within the top-$(100 \times \tau)\%$ under the calibrated distribution. in this case, the optimization problem simplifies to:

$$\min_F \mathbb{E}_{x, y^* \sim p^*} \left[ 1 - U(F(p), y^*, \tau) \right] + \lambda \mathcal{R}(F).$$
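as one concrete (and admittedly crude) instance, we could restrict $F$ to temperature scaling and grid-search the temperature; the grid, the value of $\lambda$, and the choice of $\mathcal{R}(F) = (\log T)^2$ below are all my own made-up choices for illustration:

```python
# a minimal sketch of the simpler criterion with F restricted to temperature
# scaling; probs is an (n, k) array of predicted probabilities and y_true is
# an (n,) array of integer labels.
import numpy as np

def temperature_scale(probs, T):
    """a monotonic transformation of p: sharpen/flatten each row and renormalize."""
    scaled = probs ** (1.0 / T)
    return scaled / scaled.sum(axis=1, keepdims=True)

def miscoverage(probs, y_true, tau):
    """empirical E[1 - U(p, y*, tau)]: how often y* falls outside the top-tau set."""
    outside = [p[p >= p[y]].sum() > tau for p, y in zip(probs, y_true)]
    return float(np.mean(outside))

def fit_temperature(probs, y_true, tau=0.9, lam=0.01):
    grid = np.linspace(0.25, 4.0, 64)   # candidate temperatures
    objective = [miscoverage(temperature_scale(probs, T), y_true, tau)
                 + lam * np.log(T) ** 2 for T in grid]
    return grid[int(np.argmin(objective))]
```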

so, how does all of this relate to the discussions on reducible (our inability) and irreducible (the universe's inability) uncertainty? in my view, which is often extremely pragmatic, it's almost a moot point to distinguish these two too strongly when we consider the uncertainty of a prediction coming out of our system, assuming we've tried our best to minimize our own inability (the reducible uncertainty). with a finite number of training examples, which are almost never enough, and with our inability to tell whether there's a model mismatch (the answer is almost always yes,) we cannot really tell reducible from irreducible uncertainty anyway. then, why bother distinguishing the two rather than just lumping them together into $p(\cdot|x)$?

anyhow, this post got longer than i planned but stays as empty as i planned. none of these use cases of the binary indicator $U$ is immediately actionable nor tractable. they need to be polished and specialized for each case by carefully inspecting $p$, $\mathcal{Y}$, etc. but, at least, this is how i began to view the problem of uncertainty in machine learning.

Acknowledgement

this whole post was motivated by my discussion with our wonderful members at Prescient Design: Ji Won Park, Natasha Tagasovska and Jae Hyeon Lee. Oh, also, we are hiring!
