i often find myself embarrassed when i learn of a concept in machine learning that i should’ve known as a professor of machine learning but had never even heard of before. the latest example was expectile regression; i ran into this concept while studying Kostrikov et al. (2021) on implicit Q learning for offline reinforcement learning together with Daekyu, who is visiting me from Samsung.
in their paper, Kostrikov et al. present the following loss function to estimate the $\tau$-th expectile of a random variable $X$:
$$\arg\min_{m_{\tau}} \mathbb{E}_{x \sim X}\left[ L_2^\tau (x - m_{\tau}) \right],$$
where $L_2^\tau(u) = | \tau - \mathbb{1}(u < 0) | u^2$ and $\tau \in (0.5, 1]$.
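just to make this loss concrete for myself, here is a minimal numpy sketch of $L_2^\tau$ (the function name `l2_tau` and the choice of numpy are mine, not from the paper):

```python
import numpy as np

def l2_tau(u, tau):
    # asymmetric squared loss: |tau - 1(u < 0)| * u^2
    return np.abs(tau - (u < 0.0).astype(float)) * u ** 2

# with tau = 0.9, negative residuals are down-weighted relative to positive ones
print(l2_tau(np.array([-1.0, 1.0]), 0.9))  # -> [0.1, 0.9]
```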
i couldn’t tell where this loss function came from, and together with Daekyu i tried to reason our way toward it. to be frank, i had never heard of the term “expectile” before this …
first, i decided to figure out the definition of “expectile” and found it inside the scipy.stats.expectile documentation. based on the documentation, the $\tau$-th expectile $m_{\tau}$ satisfies
$$\tau \mathbb{E}_{x \sim X} \left[ \max(0, x - m_\tau) \right] = (1-\tau) \mathbb{E}_{x \sim X} \left[ \max(0, m_\tau - x) \right].$$
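while i was at it, i also did a quick monte-carlo check of this defining equation (this assumes a recent scipy, 1.10 or later, where scipy.stats.expectile is available):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
tau = 0.9

m_tau = stats.expectile(x, alpha=tau)

# both sides of the defining equation, estimated by sample averages
lhs = tau * np.mean(np.maximum(0.0, x - m_tau))
rhs = (1.0 - tau) * np.mean(np.maximum(0.0, m_tau - x))
print(m_tau, lhs, rhs)  # lhs and rhs should roughly agree
```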
now, let’s rewrite this equation a bit by first moving the right hand side to the left hand side:
$$\tau \mathbb{E}_{x \sim X} \left[ \max(0, x - m_\tau) \right] + (\tau - 1)\mathbb{E}_{x \sim X} \left[ \max(0, m_\tau - x) \right] = 0.$$
i love expectation (not expectile) because it is linear:
$$\mathbb{E}_{x \sim X} \left[ \tau \max(0, x - m_\tau) + (\tau - 1) \max(0, m_\tau - x) \right] = 0.$$
let’s use the indicator function $\mathbb{1}(a) = 1$ if $a$ is true and $0$ otherwise, together with the fact that $\max(0, u) = \mathbb{1}(u > 0)\, u$ (the boundary case $x = m_\tau$ can go into either indicator, since that term is zero anyway):
$$\mathbb{E}_{x \sim X} \left[ \mathbb{1}(x > m_{\tau}) \tau (x - m_\tau) - \mathbb{1}(x \leq m_{\tau}) (\tau - 1) (x - m_\tau) \right] = 0.$$
moving things around a bit, i end up with
$$\mathbb{E}_{x \sim X} \left[ \left(\mathbb{1}(x > m_{\tau}) \tau - \mathbb{1}(x \leq m_{\tau}) (\tau - 1)\right) (x - m_\tau) \right] = 0.$$
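as a quick numerical sanity check of my own (not from the paper), the term inside this expectation should average to roughly zero when $m_\tau$ is the expectile computed by scipy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
tau = 0.9
m_tau = stats.expectile(x, alpha=tau)  # assumes a recent scipy

# the weighted residual from the equation above, averaged over samples
w = (x > m_tau) * tau - (x <= m_tau) * (tau - 1.0)
print(np.mean(w * (x - m_tau)))  # approximately 0
```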
at this point, i can see that for this equation to hold, i need to make $m_\tau$ very close to $x$ in expectation. being a proud deep learner, i naturally want to minimize $(x - m_\tau)^2$. but then, i notice that i don’t want to make $m_{\tau}$ close to $x$ equally across all $x$. rather, there is a weighting factor:
$$\mathbb{1}(x > m_{\tau}) \tau - \mathbb{1}(x \leq m_{\tau}) (\tau - 1)$$
if $x > m_{\tau}$, the weight term is the same as $\tau$. otherwise, it is $1 - \tau$, which is equivalent to $|\tau - 1|$ because $\tau \in [0, 1]$. also because of this condition, $\tau = |\tau| = |\tau - 0|$. in other words, we can combine these two cases into:
$$| \tau - \mathbb{1}(x \leq m_{\tau})|.$$
finally, by multiplying the $L_2$ loss $(x - m_\tau)^2$ with this weighting coefficient, we end up with the loss function from Kostrikov et al. (2021), whose gradient with respect to $m_\tau$, set to zero, recovers (up to a factor of $-2$) the equation above:
$$\mathbb{E}_{x \sim X} \left[ | \tau - \mathbb{1}(x \leq m_{\tau})| (x - m_\tau)^2 \right].$$
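to convince myself that minimizing this loss indeed recovers the expectile, here is a small gradient-descent sketch in plain numpy (the learning rate and the number of steps are arbitrary choices of mine):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
tau = 0.9

# minimize E[|tau - 1(x <= m)| (x - m)^2] over m with plain gradient descent
m = 0.0
lr = 0.5
for _ in range(1_000):
    w = np.abs(tau - (x <= m).astype(float))
    grad = np.mean(-2.0 * w * (x - m))  # derivative of the weighted squared loss w.r.t. m
    m -= lr * grad

print(m, stats.expectile(x, alpha=tau))  # the two should roughly agree
```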
ugh … why did i derive it myself, without trusting our proud alumnus Ilya, and then decide to write a blog post …? a waste of time … but it was fun.