in this blog post, i will write out the conditions that i believe any watermarking technique should satisfy in order to be useful and effective.

let $x \in \mathcal{X}$ be the observation we want to watermark with a marker $m \in \mathcal{M}$. we will use $F: \mathcal{X} \times \mathcal{M} \to \mathcal{X}$ as a watermarking function. the first condition $F$ needs to satisfy, called *perceptual indistinguishability*, is

$$d_{\mathrm{perceptual}}(x, F(x, m)) \leq \delta,$$

which states that the perceptual difference between $x$ and its watermarked version $F(x, m)$ must be very small (smaller than $\delta$). that is, we shouldn’t be able to distinguish between the original and watermarked observations.

the second condition is *marker verifiability*. there must be a tractable way (i will get to why it needs to be tractable shortly when discussing the third condition) to tell whether a certain marker was applied. given a verification function $G: \mathcal{X} \times \mathcal{M} \to \left\{0, 1\right\}$, this can be written down as

$$\frac{\mathrm{Pr}(G(F(x,m), m)=1)}{\mathrm{Pr}(G(F(x,m), m)=0)} > 1$$

and

$$\frac{\mathrm{Pr}(G(F(x,m), m)=1)}{\mathrm{Pr}(G(F(x,m), m')=1)} > 1~\forall m \neq m'.$$

that is, $G$ must be able to tell whether the watermark was applied and which watermark was applied as well.
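to make these first two conditions concrete, here is a toy sketch in python (entirely my own, hypothetical least-significant-bit scheme; `F` and `G` merely mirror the notation above): the marker deterministically drives a pseudo-random bit per pixel, $F$ writes these bits into the least-significant bits, and $G$ checks the fraction of matching bits.

```python
import hashlib

def F(x, m):
    """toy watermark: set each pixel's least-significant bit to a bit derived from marker m.

    x: list of 8-bit pixel values; m: marker string. purely illustrative.
    """
    bits = [hashlib.sha256(f"{m}:{i}".encode()).digest()[0] & 1 for i in range(len(x))]
    return [(p & ~1) | b for p, b in zip(x, bits)]

def G(xw, m):
    """verify: fraction of LSBs matching the marker-derived bits, thresholded at 0.9."""
    bits = [hashlib.sha256(f"{m}:{i}".encode()).digest()[0] & 1 for i in range(len(xw))]
    matches = sum((p & 1) == b for p, b in zip(xw, bits))
    return 1 if matches / len(xw) > 0.9 else 0

x = list(range(50, 150))          # a toy "image"
xw = F(x, "marker-A")

# perceptual indistinguishability: every pixel changes by at most 1
assert max(abs(a - b) for a, b in zip(x, xw)) <= 1
# marker verifiability: the right marker verifies, a wrong one does not
assert G(xw, "marker-A") == 1
assert G(xw, "marker-B") == 0
```

note that this toy scheme badly fails the third condition discussed next: anyone can simply overwrite the least-significant bits in linear time, without ever knowing the marker.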

the third condition, to which i will refer as *marker irreversibility*, states that a watermarked version cannot easily be reverted to the original version. this is as important as the first two conditions, but it is often overlooked, which renders many watermarking techniques pretty much useless. for instance, you have probably noticed that pirated TV shows often have the top 5% of each frame cropped off; this is done in order to remove the TV station logo, which is sometimes used as a watermark to track whether a TV show from that particular station was pirated.

we can write this condition in the context of computational complexity. for instance, we want to ensure that the inverse watermarking function $F^{-1}: \mathcal{X} \times \mathcal{M} \to \mathcal{X}$ takes exponential time w.r.t. the sizes of the watermarked object $F(x, m) \in \mathcal{X}$ and the marker $m \in \mathcal{M}$, i.e., $O(e^{\max\{|F(x,m)|, |m|\}})$.

then, we must think a bit about what this complexity should be; is exponential complexity enough? an interesting observation here is that what is enough is relative to the second condition of *marker verifiability* above. if the computational complexities of verification $G$ and removal $F^{-1}$ were of the same order, e.g. both took linear time $O(\max\{|F(x,m)|, |m|\})$, watermarking would be a bit of a moot point, since anyone who wants to break tracking by watermarking could simply remove the marker from the content before watching and forwarding it to others, while spending the same amount of computation as any verifier would.

in other words, this condition of *marker irreversibility* is defined w.r.t. the condition of *marker verifiability*. that is, the marker removal and verification must reside in different levels in the polynomial hierarchy, with the verification on a lower level.

in summary, there are three conditions that must be met by any reasonable watermarking technique:

- *perceptual indistinguishability*: a watermarked object must be (almost) perceptually indistinguishable from the original object.
- *marker verifiability*: we must be able to tractably verify that a given object was watermarked with a particular marker and not with another.
- *marker irreversibility*: it must be intractable for anyone to remove the marker from a watermarked object to recover the original object.

up until this point, i have assumed the marker $m$ is openly available, as in the airline marker on movies on airplanes. this is the reason why we wanted to ensure that marker reversal was significantly more complex than marker verification. if we however can (it’s a big CAN) keep the marker $m$ secret, it becomes trivial to separate marker verification and reversibility in terms of computational complexity, as the marker space can be made arbitrarily large. this is however a bit unrealistic, and it may be that the marker will leak via multiple watermarked objects eventually.

based on these conditions, let’s examine the airline’s watermarking strategy:

- *perceptual indistinguishability*: largely satisfied **✓**. those airline markers show up only rarely throughout the whole show, so they don’t really bother me or most viewers.
- *marker verifiability*: largely satisfied **✓**. i’m not sure how to implement it effectively off the top of my head, but it feels pretty straightforward to do so with rudimentary image processing tools.
- *marker irreversibility*: not satisfied **×**. it’s quite trivial to remove such a marker, especially with the recent advances in machine learning for image processing and generation; this can be done in almost linear time w.r.t. the length of the show.

so, it’s not really a great watermarking technology from the technical perspective, although the simplicity of such an approach is quite attractive from both business and maintenance perspectives. if i were a hollywood studio executive, i would ask for a stronger watermarking strategy, as any leak affects my studio much more than the airline via which the show was leaked.

that said, *how does your novel watermarking algorithm fare?*

since then, a lot has changed. the then-raging pandemic is largely behind us. Prescient Design, which i co-founded at the very beginning of 2021, has been fully acquired by Genentech, and we have already spent two years as part of gRED. Putin started an unjustifiable and cruel war against Ukraine. Korea has a new president. my father retired after 30+ years as a professor of korean literature and language. i served as a program chair of both NeurIPS and ICML. i finally saw mountain gorillas in the wild. i have graduated 11 PhD students since June 2021. i am two years older than i was then, as is everyone else who was already born by then. i really can’t list all that happened, but so many infuriating, sad, happy and joyous moments have come and gone.

despite all these ups and downs, i want to make sure at least one thing about me stays constant; that is, i want to contribute, however little i can, to education for the next generation of students by supporting higher-education institutions. to this end, i’ve just made the following donations to my alma maters (though, it’s unclear whether a postdoc institution counts as an alma mater):

- Mila: \$20,000 USD
  - to support the AI4Good Lab, a 7-week program that equips women and under-served populations in STEM with the skills to build ML projects.
- Aalto University: \$20,000 USD
  - the first half to support Aalto Junior, which offers art, science, technology and entrepreneurship education for children, young people and teachers; i was given a tour of Aalto Junior earlier this year and could not have been more impressed by their programs and efforts.
  - the other half to support various programs within Aalto at their discretion, including the special course on rebuilding Ukraine hosted at the School of Arts, Design, and Architecture.
- KAIST: ₩15,000,000 KRW (approximately \$12,000 USD)
  - to support building a scholarship at the School of Computing for students who are excluded from existing scholarship schemes for various (often family and personal) reasons.

it’s much smaller than 2 years ago, and probably much smaller than what i should and can afford to donate. but i hope this compels me to donate more often in the future.

also, i’d like to ask my fellow alumni of these institutes and my colleagues to join me in supporting the education of our future colleagues, friends and family.

in their paper, Kostrikov et al. present the following loss function to estimate the $\tau$-th expectile of a random variable $X$:

$$\arg\min_{m_{\tau}} \mathbb{E}_{x \sim X}\left[ L_2^\tau (x - m_{\tau}) \right],$$

where $L_2^\tau(u) = | \tau - \mathbb{1}(u < 0) | u^2$ and $\tau \in (0.5, 1]$.

i couldn’t tell where this loss function came from, and together with Daekyu i tried to reason our way toward it. to be frank, i had never heard of the term “expectile” before this …

first, i decided to figure out the definition of “expectile” and found it inside the scipy.stats.expectile documentation. based on the documentation, the $\tau$-th expectile $m_{\tau}$ satisfies

$$\tau \mathbb{E}_{x \sim X} \left[ \max(0, x - m_\tau) \right] = (1-\tau) \mathbb{E}_{x \sim X} \left[ \max(0, m_\tau-x) \right].$$

now, let’s rewrite this equation a bit by first moving the right hand side to the left hand side:

$$\tau \mathbb{E}_{x \sim X} \left[ \max(0, x - m_\tau) \right] + (\tau - 1)\mathbb{E}_{x \sim X} \left[ \max(0, m_\tau-x) \right] = 0.$$

i love expectation (not expectile) because it is linear:

$$\mathbb{E}_{x \sim X} \left[ \tau \max(0, x - m_\tau) + (\tau - 1) \max(0, m_\tau-x) \right] = 0.$$

let’s use the indicator function $\mathbb{1}(a) = 1$ if $a$ is true and $0$ otherwise:

$$\mathbb{E}_{x \sim X} \left[ \mathbb{1}(x > m_{\tau}) \tau(x - m_\tau) - \mathbb{1}(x \leq m_{\tau}) (\tau - 1) (x-m_\tau) \right] = 0.$$

moving things around a bit, i end up with

$$\mathbb{E}_{x \sim X} \left[ \left(\mathbb{1}(x > m_{\tau}) \tau - \mathbb{1}(x \leq m_{\tau}) (\tau - 1)\right) (x-m_\tau) \right] = 0.$$

at this point, i can see that for this equation to hold, i need to make $m_\tau$ very very close to $x$ on expectation. being a proud deep learner, i naturally want to minimize $(x – m_\tau)^2$. but then, i notice that i don’t want to make $m_{\tau}$ close to $x$ equally across all $x$. rather, there is a weighting factor:

$$\mathbb{1}(x > m_{\tau}) \tau - \mathbb{1}(x \leq m_{\tau}) (\tau - 1)$$

if $x > m_{\tau}$, the weight is simply $\tau$. otherwise, it is $1 - \tau$, which is equivalent to $| \tau - 1|$ because $\tau \in [0, 1]$. also because of this condition, $\tau = |\tau|$. in other words, we can combine these two cases into:

$$| \tau - \mathbb{1}(x \leq m_{\tau})|.$$

finally, by multiplying the $L_2$ loss $(x – m_\tau)^2$ with this weighting coefficient, we end up with the loss function from Kostrikov et al. (2021):

$$\mathbb{E}_{x \sim X} \left[ | \tau - \mathbb{1}(x \leq m_{\tau})| (x - m_\tau)^2 \right].$$
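the derivation above can also be checked numerically (my own sketch, not from Kostrikov et al.): minimize the derived loss over $m_\tau$ on a gaussian sample, and verify that the minimizer satisfies the defining expectile equation from the scipy documentation.

```python
import random

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(1000)]
tau = 0.8

def loss(m):
    # empirical version of E[ |tau - 1(x <= m)| (x - m)^2 ]
    return sum(abs(tau - (x <= m)) * (x - m) ** 2 for x in xs) / len(xs)

# the loss is convex in m, so a simple ternary search finds its minimizer
lo, hi = -5.0, 5.0
for _ in range(200):
    m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
    if loss(m1) < loss(m2):
        hi = m2
    else:
        lo = m1
m_tau = (lo + hi) / 2

# the minimizer should satisfy the defining expectile equation:
#   tau E[max(0, x - m)] = (1 - tau) E[max(0, m - x)]
lhs = tau * sum(max(0.0, x - m_tau) for x in xs) / len(xs)
rhs = (1 - tau) * sum(max(0.0, m_tau - x) for x in xs) / len(xs)
assert abs(lhs - rhs) < 1e-3
```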

ugh … why did i derive it myself without trusting our proud alumnus Ilya, and why did i decide to write a blog post …? waste of time … but it was fun.

i do not want to discuss any particular paper/tweet/blog, because this topic seems to attract a weird set of people arguing for weird things, when in fact there are just a couple of different views into a single phenomenon, which is only natural in science and engineering. that said, if anyone’s interested in this recent (non-)controversy, these two papers seem to be the ones to take a look at: Wei et al. [2022 TMLR] and Schaeffer et al. [2023 arXiv].

in this blog post, let me instead define *emergence* in my own words so that i can point anyone to this post when i end up talking about *emergence* with them. as the first step, here are three variables we must keep in mind:

- $x \in \mathbb{R}$: the quantity that we vary ourselves to study emergence. some examples are # of parameters given a particular parametrization scheme, # of data points sampled from a particular distribution, etc. these are all discrete quantities, but we can imagine these as points sampled from the real line.
- $z \in \mathcal{Z}$: the quantity that we can’t/don’t control or sometimes don’t even observe while varying $x$. some examples include bit flip by cosmic ray. we often want to marginalize this out.
- because we often can’t control nor observe $z$, we assume $z$ follows a distribution $p_Z$.

- $y \in \mathbb{R}$: the quantity that we observe given $x$ and $z$. some examples are accuracy (average 0-1 loss), average negative log-probability (tight upperbound to the average 0-1 loss), etc.

with these variables, i can think of the very first definition of emergence:

Definition 1 [Weak subjective emergence of $y$]. Given $y = \mathbb{E}_z f(x, z)$, $\delta > 0$ and $\epsilon > 0$, there exists $x' \in \mathbb{R}$ such that $\left| \mathbb{E}_z \frac{\partial f}{\partial x}(x', z) \right| > \left| \mathbb{E}_z \frac{\partial f}{\partial x}(\tilde{x}, z) \right| + \delta$ for all $\left| \tilde{x} - x'\right| > \epsilon$.

in words, this definition says that emergence is the existence of a point $x'$ at which the change in $y$ is greater than at any other point $\tilde{x}$ outside its $\epsilon$-neighbourhood. this can be further strengthened to include all higher-order derivatives instead of only the first-order derivative, but let me just stop here for now.

to measure whether this *subjective emergence* happens in a neural net of a particular architecture w.r.t. the number of parameters, we can follow the steps below:

- given the number of parameters $x$, train the neural net multiple times while varying random seeds in order to account for $z$. let the average validation accuracy be $y(x)$.
- $f$ then corresponds to training a neural net and measuring its accuracy on a held-out validation set.

- repeat this while varying the number of parameters.
- find the pair of consecutive $x$'s between which the validation accuracy changes the most; call its mid-point $x'$.
- if this validation accuracy change is greater than that of any other consecutive pair by a meaningful amount $\delta$, we call it *weak subjective emergence*.
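the steps above can be sketched end-to-end on a synthetic example; here $f$ is a made-up accuracy curve with a sharp jump (a stand-in for actually training neural nets), and $\delta$ is my own arbitrary choice:

```python
import math, random

random.seed(0)

def sigmoid(t):
    # numerically stable logistic function
    return 1.0 / (1.0 + math.exp(-t)) if t >= 0 else math.exp(t) / (1.0 + math.exp(t))

def f(x, z):
    # made-up "validation accuracy" with a sharp jump at x = 57.5, plus noise z
    return sigmoid(10.0 * (x - 57.5)) + z

xs = list(range(0, 101, 5))                 # the swept quantity, e.g. # of parameters
ys = []
for x in xs:
    # "train" multiple times while varying random seeds to account for z
    runs = [f(x, random.gauss(0.0, 0.01)) for _ in range(20)]
    ys.append(sum(runs) / len(runs))

# find the pair of consecutive x's between which y changes the most
diffs = [abs(ys[i + 1] - ys[i]) for i in range(len(xs) - 1)]
i_star = max(range(len(diffs)), key=lambda i: diffs[i])
x_prime = (xs[i_star] + xs[i_star + 1]) / 2

# declare weak subjective emergence if the change at x' beats every other
# consecutive change by a (subjectively chosen) margin delta
delta = 0.1
second = max(d for i, d in enumerate(diffs) if i != i_star)
emergent = diffs[i_star] > second + delta
```

on this toy curve, the procedure finds $x' = 57.5$ and declares weak subjective emergence; whether that declaration means anything hinges exactly on the subjective choices this recipe leaves open.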

this sounds reasonable, but it raises a lot of questions. some of those questions include:

- why is the particular choice of $f$ meaningful?
- why is the number of parameters a meaningful quantity to use? what if we use the number of bits after compressing all the parameters using e.g. gzip after each update? what makes the former more interesting than the latter?
- why is the accuracy a meaningful quantity to use? what if we use the margin loss since we care about the quality of decision boundary beyond mere accuracy? what makes the former more interesting than the latter?

- why is the particular resolution of $x$ and $y$ meaningful?
- how do we decide on the meaningful amount $\delta$?
- how do we decide on the neighbourhood size $\epsilon$?

there are a few more questions i had, such as whether marginalization over $z$ is preferable to taking the max or min over $z$, but they seem rather minor compared to the questions above. though, i must emphasize that we have to take $z$ into account one way or another; it feels very weird to look at only one particular configuration of $z$.

these questions naturally answer why i called this particular notion of emergence *subjective*; it is subjective because we leave the answers to these critical questions to the one who declares *emergence* of a property. in other words, one can use their *subjective* choices of $f$, $\delta$ and $\epsilon$. furthermore, this emergence is *weak* in that one merely needs to choose *one particular choice* of $f$, $\delta$ and $\epsilon$ to show that emergence happens.

can we then define a stronger version of subjective emergence? i believe we can, but this requires us to introduce a few more concepts:

- $T_x \in \mathcal{T}_x$ with $T_x: \mathbb{R} \to \mathbb{R}$: this is a transformation that can be applied to $x$ to change e.g. its scale, magnitude, etc.
- one example of $\mathcal{T}_x$ is the set of all monotonic transformations of $x$, although we can imagine many other types of transformations.
- in the case of neural net training, another example is to simply enumerate all the things that change as the number of updates (or the number of parameters) changes. for instance, $T_x$ may map the number of updates to the $L_2$-norm of the parameters.

- $T_y \in \mathcal{T}_y$ with $T_y: \mathbb{R} \to \mathbb{R}$: this is a transformation that can be applied to $y$ to change e.g. its scale, magnitude, etc.
- for instance, $T_y$ can map the average accuracy to the logit of the true class.

we can now define a stronger version of subjective emergence:

Definition 2 [Strong subjective emergence of $y$]. For all $T_x \in \mathcal{T}_x$ and $T_y \in \mathcal{T}_y$, let $T_y(y) = \mathbb{E}_z f(T_x(x), z)$. Then, given $\delta_{T_x,T_y} > 0$, there exists $T_x(x') \in \mathbb{R}$ such that $\left| \mathbb{E}_z \frac{\partial f}{\partial T_x(x)}(T_x(x'), z) \right| > \left| \mathbb{E}_z \frac{\partial f}{\partial T_x(x)}(T_x(\tilde{x}), z) \right| + \delta_{T_x,T_y}$ for all $\left| T_x(\tilde{x}) - T_x(x')\right| > \epsilon_{T_x}$.

this is essentially identical to *weak subjective emergence* except that we now impose that emergence should hold over a set of possible transformations made to $x$ and $y$. that is, we cannot simply choose *one* particular choice of $x$ and $y$, observe emergence and declare that emergence happened. rather, we need to show that such emergence happens even if we transform $x$ and $y$ in many reasonable ways.

these two definitions collapse onto each other when $|\mathcal{T}_x|=1$ and $|\mathcal{T}_y|=1$; that is, if we only consider one particular combination of $x$ and $y$ without considering any other possible transformations of them.

this definition of emergence is still subjective, since it relies on the subjective choice of $\mathcal{T}_x$, $\mathcal{T}_y$, $\delta$ (for each combination $(T_x,T_y)$) and $\epsilon$ (again for each $(T_x,T_y)$). one may even say this is even more subjective, as we need to decide on more things here, including transformations of $x$ and $y$ as well as the tolerance and neighbourhood radius for each transformation combination. nevertheless, because the notion of emergence must hold over a larger set of how we define $x$ and $y$, i’d find emergence observed according to this definition to be stronger and much more interesting.

so, we want these transformation sets to be neither too narrow (in which case the two definitions collapse onto each other) nor too broad (in which case we would never observe strong emergence). what would be some possible transformation sets that fall in the middle (since the answer is almost always somewhere in the middle)?

in my view, a good choice of the transformation set (either $x$ or $y$) is a set of all (noisy) monotonic transformations. for instance, if we take $x$ to be the number of updates in neural net training, we should also consider the $L_2$-norm of the parameters, as it grows (almost) monotonically w.r.t. the number of updates. if the claimed weak emergence over the number of updates disappears when we transform it into the $L_2$-norm of the parameters, we can’t claim stronger emergence. in the case of $y$, an interesting transformation is the repeated application of $\log$. how many $\log$-transformations of $y$ does the claimed emergence withstand? this would give us a sense of the strength of observed emergence.
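to see how fragile a claimed jump can be under $\mathcal{T}_y$, here is a tiny synthetic check (my own toy curve, not a real experiment): under the identity transformation, the largest consecutive change of a logistic curve sits at the jump, but after a single $\log$ transformation of $y$ it moves to the left boundary, i.e., the "emergence" disappears.

```python
import math

def sigmoid(t):
    # numerically stable logistic function
    return 1.0 / (1.0 + math.exp(-t)) if t >= 0 else math.exp(t) / (1.0 + math.exp(t))

xs = list(range(0, 101))
ys = [sigmoid(x - 50.5) for x in xs]       # "accuracy" rising sharply around x = 50.5

def argmax_diff(vals):
    # index of the consecutive pair where the curve changes the most
    diffs = [abs(vals[i + 1] - vals[i]) for i in range(len(vals) - 1)]
    return max(range(len(diffs)), key=lambda i: diffs[i])

assert argmax_diff(ys) == 50               # the jump: between x = 50 and x = 51

# T_y = log: the largest change now sits at the left end; the jump is gone
log_ys = [math.log(y) for y in ys]
assert argmax_diff(log_ys) == 0
```

here the jump survives zero applications of $\log$; a curve whose jump survived several would, by the argument above, constitute stronger evidence of emergence.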

finally, can there be *objective emergence*? i believe so, although such emergence would be very narrow in that there is essentially no room for any choice or interpretation. for instance, earlier together with Laura Graesser and Douwe Kiela, we demonstrated that a symmetric pair-wise protocol only emerges among communicating agents if there are at least three agents (it’s a bit obvious, though.) in this case, this emergence is objective, in that there’s no other transformation to choose (i.e., the number of agents is just the number of agents, and the communication success is defined as 0-1 and no other way) nor any other definition of tolerance or neighbourhood. in other words, *objective emergence* would be identical to *subjective emergence* except that the problem setup is extremely constrained to the point that there is no room for subjective choice nor interpretation, which makes it less interesting in general.

that wraps up yet another post of my random thoughts that would never make it to papers. have a nice day!

**Acknowledgement**:

- Thank you, Prof. Ernest Davis, for pointing out that the emergence should be defined w.r.t. $y$. this comment has been reflected.
- Thanks to Daniel Paleka’s comment, i clarified in the second definition that $\delta$ and $\epsilon$ are dependent on the choice of transformations.

for instance, imagine training a face detector for your phone’s camera in order to determine which filter to apply (one optimized for portraits and the other for other types of pictures). if most of the training examples for building such a face detector were taken in bright daylight, one often says without hesitation that this face detector would work better on pictures taken in bright daylight than on pictures taken indoors. this sounds pretty reasonable *until* you start thinking of some simplified scenarios. and that started for me a few years ago, which eventually led me to write this blog post.

so, let’s consider a very simple binary classification setup. let $D=\{(x_n, y_n)\}_{n=1}^N$ be the training set. $f(x)$ returns the number of occurrences of $x$ within $D$, that is,

$$f(x) = \sum_{n=1}^N I(\mathrm{dist}(x_n, x) \leq \epsilon),$$

where $\mathrm{dist}$ is a distance metric, $\epsilon$ is a distance threshold, and $I$ is an indicator function. if we set $\epsilon=0$ and $\mathrm{dist}(a,b) = I(a \neq b)$, $f(x)$ literally counts the number of duplicates of $x$ within the training set.

we assume that the training set is *separable*, which makes everything so much easier to imagine in our head and also reason through.

in this simple setup, what is really interesting (to me, at least) is that the number of duplicates $f(x_n)$ of any $x_n \in D$ does not affect a separating decision boundary. as soon as one of the duplicates is correctly classified (i.e., on the right side of the decision boundary), all the other duplicates are equally well classified and would not affect our choice of the decision boundary.

this is most clearly demonstrated by the perceptron learning rule, which is defined as

$$w^n = \begin{cases}
w^{n-1}, &\text{if } y_n (w^{n-1} \cdot x_n) > 0 \\
w^{n-1} + y_n x_n, &\text{otherwise}.
\end{cases}$$

that is, the decision boundary defined by $w^{n-1}$ is only updated if $x_n$ is incorrectly classified, i.e., $y_n (w^{n-1} \cdot x_n) \leq 0$. once $x_n$ is correctly classified, all the subsequent duplicates of $x_n$ do not contribute to the decision boundary.
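this invariance is easy to see with a toy example (a made-up separable dataset, my own sketch): appending many duplicates of an already correctly-classified example leaves the learned perceptron weights untouched, since those copies never trigger an update.

```python
# a tiny separable training set: y = +1 above the line x1 + x2 = 0, -1 below
D = [((1.0, 1.0), 1), ((0.5, 1.5), 1), ((-1.0, -0.5), -1), ((-1.5, -1.0), -1)]

def train(examples, epochs=20):
    """run the perceptron learning rule over the examples for a few epochs."""
    w = [0.0, 0.0]
    for _ in range(epochs):
        for x, y in examples:
            if y * (w[0] * x[0] + w[1] * x[1]) <= 0:   # misclassified (or on the boundary)
                w[0] += y * x[0]                        # perceptron update
                w[1] += y * x[1]
    return w

w_plain = train(D)

# append many duplicates of an already-present example: once it is correctly
# classified, the extra copies never trigger an update
D_dup = D + [((1.0, 1.0), 1)] * 100
w_dup = train(D_dup)

assert w_plain == w_dup
```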

another example is a max-margin classifier, such as a support vector machine. in this case, we can think of how the margin of a (separating) decision boundary is defined: it is the sum of the distances to the nearest correctly-classified examples from the two classes (positive and negative), respectively. in other words, the only examples that matter for determining the optimal decision boundary are the nearest correctly-classified ones (at least two; they are called *support vectors*), and all the other examples that are correctly classified and far from the decision boundary (recall the separability assumption) do not contribute to it. it really doesn’t matter whether there are many duplicate copies of any particular example, as that group of examples either contributes equally to the margin or does not contribute at all.

Then, does it mean that the existence of duplicates of each training example does not matter when it comes to learning a classifier? Or, better put, why do we think the existence of duplicates changes how our classifiers work?

every now and then, i stumble upon a discussion on the difference between parametric and non-parametric methods. every time, i believe i have found the answer to this question in a way that is explainable to my students and colleagues, but quite rapidly my belief in that answer fades away, and i start to doubt myself as a computer scientist. the last episode was pretty recent, and people responded with insightful answers.

it turned out that this seemingly naive and dumb question connects to this issue of whether/how duplicates of training examples impact classification. what do i mean by that?

instead of the perceptron and the support vector machine above, which can be thought of as parametric approaches, since their discovered decision boundaries are described *without* referring to the training examples, i.e., on their own, let us consider one of the simplest and perhaps most powerful non-parametric classifiers, whose decision boundary is a function of the training examples and whose complexity grows as we include more training examples: the $k$-nearest neighbour classifier ($k$NN).

given a new example $x$ we want to classify using our $k$NN classifier, let $(x_n,y_n)$ be the nearest neighbour of $x$. given the number of duplicates in the training set $f(x_n)$, we can now tell how many other neighbours are considered by this $k$NN; the answer is $k - f(x_n)$. that is, the probability of this new example $x$ belonging to $y_n$ is written down as:

$$p(y=y_n| x) = \frac{\min(k, f(x_n))}{k} + \frac{1}{k} \sum_{(x',y') \in \mathcal{N}_k(x)} I(x' \neq x_n) I(y' = y_n),$$

where $\mathcal{N}_k(x)$ is a set of $k$ nearest neighbours of $x$. as $f(x_n)$ grows, the first term dominates, and the chance of classifying $x$ into $y_n$ consequently grows as well. that is, the more duplicates we have of $x_n$ the higher probability for $y_n$. that is, the region corresponding to $(x_n,y_n)$ grows as the number of its duplicates increases, which is precisely what a non-parametric classifier does.
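the contrast with the perceptron above can be sketched in a few lines (a toy 1-d dataset of my own): duplicating the nearest neighbour flips the $k$NN vote, whereas we saw that duplicates leave a separating boundary untouched.

```python
from collections import Counter

def knn_predict(train, x, k):
    # sort training points by distance to x and take a majority vote among the k nearest
    neighbours = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return Counter(y for _, y in neighbours).most_common(1)[0][0]

# 1-d toy set: one positive example near the query, a cluster of negatives a bit further
train = [(0.10, 1), (0.20, 0), (0.21, 0), (0.22, 0)]
x = 0.0

# with k = 3 and a single copy of the positive example, the negatives win the vote
assert knn_predict(train, x, k=3) == 0

# duplicating the nearest (positive) example flips the prediction
train_dup = train + [(0.10, 1), (0.10, 1)]
assert knn_predict(train_dup, x, k=3) == 1
```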

so, what does this tell us? the impact of duplicates in the training set differs between parametric and non-parametric approaches. it is not only in classification, but also in generative modeling, since much of generative modeling can be thought of as supervised learning in disguise. if we are dealing with non-parametric methods, we probably want to take into account duplicates in the training set and either keep them as they are or de-duplicate them. this decision will have to be made for each problem separately. if we are working with parametric methods, we probably don’t need to worry about these duplicates beyond the computational concern.

how does this observation connect with the urban legend/myth on the impact of duplicates? i believe this simply tells us that the classifiers we use in practice are often non-parametric, including $k$NN, neural nets and random forests. in other words, it wasn’t really about whether duplicates matter; it was more about what is common practice in modern machine learning, that is, that we use non-parametric classifiers.

there’s nothing serious nor insightful here, but i enjoyed this thought experiment!

in my mind, there are three ways to define sparse coding.

- **code sparsity**: the code is sparse, i.e., $|z|_0 = O(1)$.
- **computational sparsity**: the computation is sparse, i.e., $x = \sum_{k=1}^K z_k w_k$, where $K = O(1)$ and $w_k \in \mathbb{R}^d$.
- **noise robustness**: the computation is robust to perturbation of the parameters: let $\tilde{w} = w + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2 I_{|w|})$. the MSE between $x$ and $\tilde{x}$, i.e., $|\sum_{k=1}^K z_k w_k - \sum_{k=1}^K z_k \tilde{w}_k|_2^2$, is $O(d \times \sigma^2)$ rather than $O(d' \times d \times \sigma^2)$, because $K \ll d'$ is a constant w.r.t. $d'$.

these are equivalent if we constrain the decoder to be linear (i.e., $x = \sum_{i=1}^{d'} z_i w_i$), but they are not with a nonlinear decoder. in particular, let us consider a neural net decoder with a single hidden layer such that $x = u \max(0, w z),$ where $u \in \mathbb{R}^{d \times d_h}$ and $w \in \mathbb{R}^{d_h \times d'}$. we can then think of how these different notions of sparsity manifest themselves and how we could encourage these different types of sparsity when training a neural net.

the amount of computation is then $O(d \times d_h + d_h \times d')$, which reduces to $O(d \times d')$ assuming $d_h = O(d')$. even if we impose code sparsity on $z$, the overall computation ($O(d \times d_h + d_h \times K)$) remains $O(d \times d')$. in other words, code sparsity does not imply computational sparsity, unlike in the linear case.
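to make this concrete, here is a toy operation counter (my own sketch; i count only multiplications involving a nonzero operand as a crude proxy for computation): making the code $z$ sparse shrinks the first-layer cost but leaves the second-layer cost at $O(d \times d_h)$.

```python
def decode_count(z, w, u):
    """compute x = u max(0, w z) while counting multiplications whose
    operand is nonzero (a crude proxy for the amount of computation)."""
    ops = 0
    h = []
    for row in w:                       # first layer: h = max(0, w z)
        s = 0.0
        for wi, zi in zip(row, z):
            if zi != 0.0:
                s += wi * zi
                ops += 1
        h.append(max(0.0, s))
    x = []
    for row in u:                       # second layer: x = u h
        s = 0.0
        for ui, hi in zip(row, h):
            if hi != 0.0:
                s += ui * hi
                ops += 1
        x.append(s)
    return x, ops

dp, dh, d = 8, 8, 4                     # d' (code), hidden and output dimensionalities
w = [[1.0] * dp for _ in range(dh)]     # all-positive weights keep every hidden unit active
u = [[1.0] * dh for _ in range(d)]

_, ops_dense = decode_count([1.0] * dp, w, u)                      # dense code
_, ops_sparse = decode_count([0.0] * 3 + [1.0] + [0.0] * 4, w, u)  # one nonzero entry

assert ops_dense == dh * dp + d * dh    # 64 + 32 = 96
assert ops_sparse == dh * 1 + d * dh    # 8 + 32: the second-layer cost is unchanged
```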

based on this observation, one can imagine imposing sparsity on all odd-numbered layers (counting the $z$ as the first layer) and the penultimate layer (one before $x$) in order to satisfy **computational sparsity** with a nonlinear decoder. in the example above, this implies that the sparsity should be imposed on both $z$ and $\max(0, wz)$.

this naive approach to computational sparsity implies noise robustness, as the number of parameters used in computation is restricted by construction. it does not mean, however, that there aren’t other ways to impose noise robustness. in particular, we can rewrite the whole problem of sparse coding as

$$\min_{z, w, u} \frac{1}{N} \sum_{n=1}^N |x^n - u \max(0, w z^n)|^2$$

subject to $$| \text{Jac}_{w,u} u \max(0, w z^n) |_F^2 < K d~\text{for all}~n=1,\ldots, N.$$

in other words, the influence of perturbing the parameters on the output must be bounded by a constant multiple of the output dimensionality.

of course it is not tractable to solve this problem exactly, but we can write a regularized proxy problem:

$$\min_{z, w, u} \frac{1}{N} \sum_{n=1}^N |x^n - u \max(0, w z^n)|^2 + \lambda | \text{Jac}_{w, u} u \max (0, wz^n) |_F^2,$$

where $\lambda$ is a regularization strength. in other words, we find the parameters, $w$ and $u$, that are **robust to perturbation** in terms of the output.

*So, which sparsity are we referring to and do we desire when talking about sparsity in neural networks?*

Delip Rao then retweeted this and said that he does not “buy his lossy compression analogy for LMs”, in particular in the context of JPEG compression. Delip and i exchanged a few tweets earlier today, and i thought i’d lay out here in a blog post, as i described in the following tweet, why i think LM and JPEG share the same conceptual background:

one way in which *i* view a compression algorithm is that it (the algorithm $F$) produces a concise description of a distribution $p_{compressed}$ that closely mimics the original distribution $p_{true}$. that is, the goal of $F$ is to turn the description of $p_{true}$ (i.e., $d(p_{true})$) into the description of $p_{compressed}$ (i.e., $d(p_{compressed})$) such that (1) $p_{true}$ and $p_{compressed}$ are similar to each other, and (2) $d(p_{true}) \gg d(p_{compressed})$.

then, how can JPEG be viewed from this angle? in JPEG, there is a compression-decompression routine that can be thought of as a conditional distribution over the JPEG encoded/decoded images given the original image, i.e., $p_{JPEG}(x' | x)$, where $x$ and $x'$ are both images. it is almost always deterministic, so it may be considered a Dirac delta distribution. then, given the true natural image distribution $p_{true}$, we can get the following compressed distribution:

$$p_{compressed}(x') = \sum_{x \in \mathcal{X}_{image}} p_{JPEG}(x'|x) p_{true}(x).$$

that is, we convolve all the images with the JPEG conditional distribution to obtain the compressed distribution.

why is this compression? because JPEG loses many fine details about the original image, many original images map to a single image with JPEG-induced artifacts. this makes the number of probable modes under $p_{compressed}$ fewer than that under the original distribution, leading to a lower entropy. this in turn means we need fewer bits to describe this distribution, hence, *compression*.
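this entropy argument can be checked on a toy example (a crude uniform quantizer standing in for JPEG, my own sketch): a deterministic many-to-one map collapses distinct inputs and strictly lowers the entropy of the induced distribution, i.e., fewer bits suffice to describe it.

```python
import math
from collections import Counter

def entropy_bits(counts):
    # shannon entropy (in bits) of the empirical distribution given by counts
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# "p_true": a uniform distribution over 256 pixel intensities
originals = list(range(256))

# a JPEG-like deterministic, many-to-one map: quantize to multiples of 16
compressed = [(v // 16) * 16 for v in originals]

h_true = entropy_bits(Counter(originals))     # 8.0 bits
h_comp = entropy_bits(Counter(compressed))    # 4.0 bits
assert h_comp < h_true
```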

when there is a mismatch between $p_{true}$ and $p_{compressed}$, we can imagine two scenarios. one is that we lose a probable configuration under $p_{true}$ in $p_{compressed}$, which is often referred to as *mode collapse*. the other is $p_{true}(x) \downarrow$ when $p_{compressed}(x) \uparrow$, which is often referred to as *hallucination*. the latter is not really desirable in the case of JPEG compression, as we do not want it to produce an image that has nothing to do with any original image, but this is at the heart of generalization.

combining these two cases we end up with what we mean by *lossy* compression. in other words, any mismatch between $p_{true}$ and $p_{compressed}$ is what we mean by *lossy*.

in language modeling, we start with a vast amount of training examples, which i will collectively consider to constitute $p_{true}$, and our compression algorithm is regularized maximum likelihood (yeah, yeah, RLHF, instructions, blah blah). this compression algorithm (LM training, if you prefer) results in $p_{compressed}$, which we represent with a trained neural net (though this does not imply that this is the most concise representation of $p_{compressed}$.)

just like JPEG, LM training inevitably results in a discrepancy between $p_{true}$ (i.e., the training set under my definition above) and $p_{compressed}$ due to a number of factors, including the use of finite data as well as our imperfect parametrization. this mismatch however turned out to be a *blessing* in this particular case, as it implies *generalization*. that is, $p_{compressed}$ is able to assign a high probability to an input configuration that was not seen during training, and such a highly probable input often turns out to look amazing to us (humans!)

in summary, both JPEG compression and LM training turn the original distributions of natural images and human-written text, respectively, into their *compressed* versions. in doing so, an inevitable mismatch arises between these two distributions in each case, and this is why we refer to this process as *lossy* compression. this lossy nature ends up assigning non-zero probabilities to unseen input configurations, and this is *generalization*. in the case of JPEG, such generalization is often undesirable, while desirable generalization happens with LM’s thanks to decades of innovations that have culminated in modern language models.

so, yes, both are lossy compression with comparable if not identical underlying conceptual frameworks. the real question is however not whether lossy compression makes LM’s less or more interesting, but rather which of the ingredients we have found to build these large-scale LM’s contribute to such *desirable* generalization, and how.

a major part of running the reviewing process is to ensure that all the reviews, meta-reviews, decisions as well as decision agreements are received in time, so that we find a set of quality papers to be presented at the conference and so that the authors of these accepted papers as well as participants are given enough time to prepare their travels to the conference. in a sense, everyone, including us program chairs, agrees to serve a role, either as a reviewer, area chair or senior area chair, at the time of invitation, and we might naively expect all to stick to the timeline. this is however not the case due to the sheer scale of the conference’s main track, with more than 13,000 submitted abstracts and more than 10,000 reviewers we recruited. what is the chance that every single reviewer is fully available without any personal or professional emergencies over the summer, which is the period over which NeurIPS reviewing happens? if we assume a 0.01% chance of a personal/professional emergency for each individual reviewer, the chance that everyone is fully available over this period is less than 40% …
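the arithmetic behind that last figure can be checked in a couple of lines (a back-of-the-envelope sketch, assuming emergencies strike reviewers independently):

```python
# chance that all ~10,000 reviewers stay fully available over the summer,
# given each has a 0.01% chance of a personal/professional emergency
p_emergency = 0.0001
n_reviewers = 10_000
p_all_available = (1 - p_emergency) ** n_reviewers
print(p_all_available)  # ~0.368, i.e., less than 40%
```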

now of course on top of that, we are all humans and simply make mistakes, for instance by forgetting to put various deadlines on our calendars or simply over-committing ourselves. these mistakes can however be mitigated to some degree by reminders, or at least that was my thought back in the summer of 2022.

as part of this effort of politely but strongly reminding reviewers as well as area chairs of upcoming deadlines, i decided to finally benefit from the reasonably large number of followers i have on twitter (as of Dec 12 2022, i have 42.5k followers). who knew i would ever use twitter for my own benefit (and, i want to say, for the community’s benefit)? but, the time had arrived …

i decided to piggy-back on people’s liking of memes on twitter and started to post NeurIPS’22 reviewing memes quite regularly. it started on Jun 23 2022 and continued until July 14 2022. here, i’ll list all of them for you to easily see how quickly my mind spiraled into a dark abyss over time … i am in fact unsure if i’ve ever gotten out of this dark abyss i fell through …

… with even all these tweets, we failed to collect all necessary reviews in time …

in this campaign’s page, they cited one news piece from SBS in which 21 young people were surveyed about their situations, to illustrate how the starting points for young people in Korean society vary dramatically across individuals, despite our illusion of fair and equal treatment. it’s nothing rigorous and quite anecdotal, but thought-provoking, as it starkly “shows” these differences: https://www.youtube.com/watch?v=AaLZ3bmCb_k. the participants were asked 56 questions, and out of these, the campaign page listed a few (some of which are pretty specific to Korea, i must say):

- if you have had to move every 1-2 years, take a step back. 어쩔 수 없이 1,2년 단위로 집을 옮겨야 한다면 / 옮겨 다니고 있다면 한 발 뒤로
- if you are not covered by the four major social insurance programs, take a step back. 4대 보험을 받지 못한다면 한 발 뒤로
- if you have to explain your family situations or lifestyle choices frequently to others, take a step back. 내가 취하고 있는 가족 구성원 형태 또는 삶의 형태에 대해 사람들에게 종종 설명을 해야 한다면 한발 뒤로
- if you’ve ever missed paying utility bills, take a step back. 돈이 부족해서 공과금을 연체해 본 적이 있다면 한 발 뒤로
- if you had to take a leave of absence from school to earn money for tuition, take a step back. 등록금 때문에 휴학하고 돈을 벌어야 했다면 한 발 뒤로
- if you can always use mom’s or dad’s credit card when you need to, take a step forward. 필요할 때 언제든 엄카, 아카를 쓸 수 있다면 한 발 앞으로
- if you had to prove your disability or financial hardship to receive financial aid, take a step back. 경제적 지원을 받기 위해 장애나 소득을 증명한 적이 있다면 한 발 뒤로
- if you had extracurricular education during your school years, take a step forward. 학창 시절 과외를 받아본 적이 있다면 한 발 앞으로
- if you could read as many books as you wanted when you were younger, take a step forward. 어렸을 때 원하는 책을 마음껏 읽을 수 있었다면 한 발 앞으로
- if you can have whatever you want to eat delivered whenever you’re home alone, take a step forward. 혼자 있을 때 어느 시간 때고 마음 놓고 배달음식을 시켜 먹을 수 있다면 한 발 앞으로

and, you know what? when i asked myself these questions, i never took a step back and was always taking steps forward.

according to the campaign’s homepage, these children who graduate out of the group homes when they turn 18 are provided with a one-time support of \$4,000 or so (5M KRW) and a monthly support of \$250 or so (300K KRW). for those who decide to continue their studies in college, this has never been enough. it has become even more of an issue during the pandemic, as our educational system began to ask students for even more, just for them to participate: they need good broadband, a quiet place free of distraction, and a good laptop to participate in remote lectures, download necessary materials and submit their assignments.

so, i wanted to donate a bit to this campaign, but it turned out this was done via Kakao’s platform and required a Kakao account, which i don’t have. and, yes, i know the pain of creating an account for a Korean website, especially if i want to connect it to my credit card. so, i gave up on donating via this specific campaign but emailed them directly to arrange a quick phone call.

they were super quick in giving me a call on the same day and gave me a quick walk-through of their programs. by the end of this short call, i had already promised to donate approximately \$27,000 (30M KRW) for any of their operations. it’s not a lot of money, but i hope it can buy a few more laptops for them to support these kids and also raise awareness of this largely hidden issue. hopefully this little gesture of mine helps students, even a tiny bit, to take a smaller step back than before.

because i’m generally a show-off, i had to write this blog post to show off this little donation, but there are those who are truly contributing to making the world better. in particular, the Center’s various programs are run by the staff members of the Center as well as many activists and volunteers (some of whom are from these group homes themselves). i’ve been reading and watching some of the materials on their homepage, and i could not have been more impressed and moved by them. also, there are a lot of regular donors to this Center (http://jaripcare.com/bbs/board.php?bo_table=support) who are really making a difference, unlike a one-time donor like me who shows up, boasts and disappears. a huge thanks to all these people who are literally making sure fewer people take fewer steps back in our society.

would you join me in supporting these kids so that they take a step forward instead of back?

the proposition party consisted of Sella Nevo, Maya R. Gupta and François Charton. Been Kim was unfortunately unable to participate, although she would’ve been a great addition to the proposition party. the proposition party argued that progress towards achieving AI will be mostly driven by engineering not science.

the opposition party (i guess … my party) consisted of Ida Momennejad, Pulkit Agrawal, Sujoy Ganguly and yours truly. the opposition party (perhaps obviously) opposed the proposition’s stance and argued that progress towards achieving AI will be mostly driven by science not engineering.

if you’re registered at ICML 2022, you can watch the recording of the debate at https://icml.cc/virtual/2022/social/20780. i don’t know if this will be released publicly when the conference is over, but i will update it here if and when that happens.

the debate was fun and was full of many interesting and thought-provoking ideas and points. i won’t try to summarize those points here, as that would require a huge amount of effort, and i shouldn’t have had that much beer over the past 4 days …

instead, i’ll share my opening statement here. a distinct advantage i had as the opposition leader was that i could prepare my statement in advance, and now i can share it here. my main goal was to leave enough room for the other members of the party to delve deeper into their own views/expertise and also to expand on various aspects to address the proposition’s follow-up arguments.

here you go!

The opposition believes that progress toward achieving AI will be mostly driven by science not engineering.

Recent progress in large-scale models, such as language models and language-conditional image generation models, easily gives the impression that what we find impressive is largely the product of impressive engineering that has allowed us to scale up our systems effectively and efficiently. This impression is not what we oppose here.

Such impressive progress however has begun to give the incorrect impression that such a stellar level of engineering is what drives (if not the only way to drive) progress in AI research toward building a truly intelligent system. This impression is what we oppose here.

Instead of arguing here how engineering alone would not be enough for future progress toward achieving AI, I’d like to focus on more concrete examples of how engineering alone has not been enough to have arrived at even the current state of AI, which I believe most of us agree is not at all close to the ultimate goal of truly intelligent machines.

As the first and perhaps most salient example today, I would like to talk about these super-impressive large-scale language models, represented by GPT-3 and many even more impressive follow-up models such as PaLM, BLOOM, etc. Despite their differences, there are a few core concepts shared by all these models that are critical to their existence.

First, they all rely heavily on the concept of maximum likelihood with autoregressive modeling. These two concepts together end up building a classifier that predicts the next token given all the preceding tokens (words in many cases, but the details do not matter much). And doing so corresponds to estimating an upper bound on the true entropy of the distribution underlying the gigantic amount of text we use.

By building a machine that predicts the next word correctly, taking into account both short- and long-term dependencies (unlike what many critics say), we approximate the text/language distribution very well and sample/generate extremely well-formed text and images from these distributions.
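As a quick numerical aside to the statement (a toy example of my own, with made-up distributions, not part of the original argument), the average negative log-probability minimized by maximum likelihood is the cross-entropy between the true distribution and the model, and it indeed upper-bounds the true entropy:

```python
import math

# toy next-token distribution p_true and an imperfect model q of it
p_true = {"the": 0.5, "a": 0.3, "an": 0.2}
q_model = {"the": 0.4, "a": 0.4, "an": 0.2}

# true entropy of p_true, in bits
entropy = -sum(p * math.log2(p) for p in p_true.values())
# expected negative log-probability under the model: the maximum-likelihood objective
cross_entropy = -sum(p * math.log2(q_model[w]) for w, p in p_true.items())

print(entropy)        # ~1.485 bits
print(cross_entropy)  # ~1.522 bits: >= entropy, with equality iff q matches p
```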

Where did this idea come from? Has this idea benefited from superb engineering? Yes, superb engineering, including software and hardware, has dramatically pushed the boundary of the said technique, but the birth and full formalization of next-word prediction can be traced all the way back to Claude Shannon’s paper from 1950.

This same idea was revived and pushed dramatically since late 80’s when folks from IBM, including Peter Brown and Bob Mercer, built the first statistical machine translation system where a large-scale (yes! it was already large then) target-side language model was a critical component.

The very same idea was revived or rejuvenated multiple times even after that, including late 90’s with Yoshua Bengio’s neural language models, around 2010 with Alex Graves’ and Tomas Mikolov’s recurrent language models, and now with attention-based models.

Better engineering, in terms of better software and better hardware, has indeed pushed the boundary of what we can do with this next-word prediction, but the seed of what we see now was already planted by “science” in the 50’s.

Second, I’d like to talk about all the “techniques” or “tricks” that facilitate learning. Although it may look like faster hardware and better software frameworks are the main drivers of recent advances in large-scale language models, it is highly questionable whether we could have trained any reasonable model had we not found a series of techniques that enable us to do so.

For instance, non-saturating nonlinearities, such as rectified linear units, are workhorses of modern neural networks, including large-scale language models. It is only natural to use ReLU or its variants now, but it wasn’t so until around 2010, when two papers, one from U. Toronto and the other from U. Montreal, demonstrated the potential effectiveness of ReLU from two different perspectives. For example, the first one, by Nair & Hinton, derived ReLU for restricted Boltzmann machines by viewing it as an approximation to having infinitely many replicated binary hidden units that share a weight vector but differ in their biases.

Furthermore, the potential for using ReLU-like nonlinearities was studied extensively in (computational) neuroscience, which has inspired many to consider this in the context of artificial neural network research for many decades.

Would engineering alone have allowed us to jump from the much more widely used sigmoid nonlinearities to ReLU? With exhaustive hyperparameter tuning using an excessive amount of resources, engineering might have ended up with a very particular way of initializing parameters and a very particular optimization setup that makes the sigmoid nonlinearity work, but it is unclear whether that would’ve happened at all, because the community might have already given up on investing further in this direction.

Of course, the last example I want to bring up is shortcut connections, which reflects a bit of my personal preference. Shortcut connections, which include residual connections as well as gated connections in LSTM’s and GRU’s, are what we, the research community, had to spend decades coming up with in order to address the issue of vanishing gradients, or long-range credit assignment. It started with mathematical analyses by Sepp Hochreiter and Yoshua Bengio in the early 90’s, followed by further empirical analysis by many people since then, and by proposals, such as leaky units, some of which were successful and others less so.

Eventually, this was identified as a way to propagate gradients properly across many nonlinear layers of both recurrent and feedforward networks, as is evident from the near-universal presence of residual blocks or connections in modern neural networks, including large-scale language models built as transformers.
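As another aside to the statement (a toy sketch of my own, with small random matrices standing in for layer Jacobians), one can see numerically how a plain deep chain shrinks a backward signal layer after layer, while a residual chain keeps a direct path for it:

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width = 50, 64

# push a gradient-like signal backwards through many layers.
# plain chain: repeatedly multiply by a small per-layer Jacobian;
# residual chain: each layer's Jacobian is (I + J), preserving an identity path.
grad_plain = grad_residual = np.ones(width)
for _ in range(depth):
    J = 0.05 * rng.standard_normal((width, width))  # small Jacobian of one layer
    grad_plain = J @ grad_plain                         # shrinks at every layer
    grad_residual = grad_residual + J @ grad_residual   # (I + J) @ grad_residual

print(np.linalg.norm(grad_plain))     # vanishingly small after 50 layers
print(np.linalg.norm(grad_residual))  # stays at a usable magnitude
```

the identity term in (I + J) is exactly what residual connections contribute: the backward signal never has to pass exclusively through the shrinking Jacobians.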

However small they may seem, we could get to this point only because of all these science-driven (or perhaps mathematics-driven) innovations. More properly, I could say that it was science that put us on this path so that engineering could push us forward along it.

It may not look like this will happen anytime soon, but I can assure you that very soon the bandwagon driven by engineering on this path laid out by science will find itself at the next crossroads. Engineering won’t tell us which road to take next; it will be science that tells us which path we can and should take in order to move us closer to AI.
