i do not want to discuss any particular paper/tweet/blog, because this topic seems to attract a weird set of people arguing for weird things, when in fact there are just a couple of different views into a single phenomenon, which is only natural in science and engineering. that said, if anyone’s interested in this recent (non-)controversy, these two papers seem to be the ones to take a look at: Wei et al. [2022 TMLR] and Schaeffer et al. [2023 arXiv].

in this blog, let me instead define *emergence* in my own words so that i can point anyone to this blog when i end up talking about *emergence* with them. as the first step, here are three variables we must keep in mind:

- $x \in \mathbb{R}$: the quantity that we vary ourselves to study emergence. some examples are # of parameters given a particular parametrization scheme, # of data points sampled from a particular distribution, etc. these are all discrete quantities, but we can imagine these as points sampled from the real line.
- $z \in \mathcal{Z}$: the quantity that we can’t/don’t control or sometimes don’t even observe while varying $x$. one example is a bit flip caused by a cosmic ray. we often want to marginalize this out.
  - because we often can neither control nor observe $z$, we assume $z$ follows a distribution $p_Z$.
- $y \in \mathbb{R}$: the quantity that we observe given $x$ and $z$. some examples are accuracy (average 0-1 loss), average negative log-probability (tight upper bound to the average 0-1 loss), etc.

with these variables, i can think of the very first definition of emergence:

Definition 1 [Weak subjective emergence of $y$]. Given $y = \mathbb{E}_z f(x, z)$, $\delta > 0$ and $\epsilon > 0$, there exists $x' \in \mathbb{R}$ such that $\left| \mathbb{E}_z \frac{\partial f}{\partial x}(x', z) \right| > \left| \mathbb{E}_z \frac{\partial f}{\partial x}(\tilde{x}, z) \right| + \delta$ for all $\tilde{x}$ with $\left| \tilde{x} - x'\right| > \epsilon$.

in words, this definition says that emergence is the existence of a point $x'$ at which $y$ changes more rapidly, by a margin of at least $\delta$, than at any other point $\tilde{x}$ outside the $\epsilon$-neighbourhood of $x'$. this can be further strengthened to include all higher order derivatives instead of only the first order derivative, but let me just stop here for now.

to measure whether this *subjective emergence* happens in a neural net of a particular architecture w.r.t. the number of parameters, we can follow the steps below:

- given the number of parameters $x$, train the neural net multiple times while varying random seeds in order to account for $z$. let the average validation accuracy be $y(x)$.
  - $f$ then corresponds to training a neural net and measuring its accuracy on a held-out validation set.
- repeat this while varying the number of parameters.
- find a pair of consecutive $x$’s between which the validation accuracy changes most; call the mid-point $x'$.
- if this validation accuracy change is greater than that of any other consecutive pair by a meaningful amount $\delta$, we call it *weak subjective emergence*.
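the steps above can be sketched in a few lines of python. this is only a minimal sketch under loud assumptions: `mean_accuracy` is a made-up stand-in for actually training a neural net (a sigmoid plus seed noise), `weak_emergence` is a name i made up, and the derivative in Definition 1 is replaced by finite differences between consecutive grid points.

```python
import math
import random


def mean_accuracy(x, num_seeds=20):
    """Hypothetical stand-in for y(x): average accuracy over random seeds (z).

    A real experiment would train a net with x (say, million) parameters per
    seed. Here we fake it with a sigmoid centred at x = 5.5 plus seed noise.
    """
    clean = 0.1 + 0.8 / (1.0 + math.exp(-4.0 * (x - 5.5)))
    runs = []
    for seed in range(num_seeds):
        rng = random.Random(1000 * seed + int(x))  # fixed seeds: reproducible
        runs.append(clean + rng.gauss(0.0, 0.01))
    return sum(runs) / len(runs)


def weak_emergence(xs, ys, delta, eps):
    """Finite-difference check of weak subjective emergence.

    Returns (x_prime, found): x_prime is the mid-point of the consecutive pair
    with the largest |dy/dx|; found is True if that slope exceeds, by more than
    delta, the slope of every pair whose mid-point is farther than eps away.
    """
    n = len(xs) - 1
    slopes = [abs((ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])) for i in range(n)]
    mids = [(xs[i] + xs[i + 1]) / 2.0 for i in range(n)]
    best = max(range(n), key=lambda i: slopes[i])
    found = all(
        slopes[best] > slopes[i] + delta
        for i in range(n)
        if abs(mids[i] - mids[best]) > eps
    )
    return mids[best], found


xs = [float(x) for x in range(1, 11)]  # the grid over which we vary x
ys = [mean_accuracy(x) for x in xs]    # seed-averaged accuracy at each x
x_prime, found = weak_emergence(xs, ys, delta=0.1, eps=0.5)
print(x_prime, found)
```

note that the grid of `xs`, `delta` and `eps` are all picked by hand here, which is exactly where the subjectivity enters.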

this sounds reasonable, but it raises a lot of questions. some of those questions include:

- why is the particular choice of $f$ meaningful?
  - why is the number of parameters a meaningful quantity to use? what if we use the number of bits after compressing all the parameters using e.g. gzip after each update? what makes the former more interesting than the latter?
  - why is the accuracy a meaningful quantity to use? what if we use the margin loss since we care about the quality of decision boundary beyond mere accuracy? what makes the former more interesting than the latter?
- why is the particular resolution of $x$ and $y$ meaningful?
  - how do we decide on the meaningful amount $\delta$?
  - how do we decide on the neighbourhood size $\epsilon$?

there are a few more questions i had, such as whether marginalization of $z$ is desirable over max or min over $z$, but they seem rather minor, compared to these questions above. though, i must emphasize that we have to take into account $z$ one way or another, and it feels very weird to look at only one particular configuration $z$.

these questions naturally answer why i called this particular notion of emergence *subjective*; it is subjective because we leave the answers to these critical questions to the one who declares *emergence* of a property. in other words, one can use their *subjective* choices of $f$, $\delta$ and $\epsilon$. furthermore, this emergence is *weak* in that one merely needs to choose *one particular choice* of $f$, $\delta$ and $\epsilon$ to show that emergence happens.

can we then define a stronger version of subjective emergence? i believe we can, but this requires us to introduce a few more concepts:

- $T_x: \mathbb{R} \to \mathbb{R} \in \mathcal{T}_x$: this is a transformation that can be applied to $x$ to change e.g. its scale, magnitude, etc.
  - one example of $\mathcal{T}_x$ is the set of all monotonic transformations on $x$, although we can imagine many other types of transformations.
  - in the case of neural net training, another example is to simply enumerate all the things that change as the number of updates (or the number of parameters) changes. for instance, $T_x$ may map the number of updates to the $L_2$-norm of the parameters.
- $T_y: \mathbb{R} \to \mathbb{R} \in \mathcal{T}_y$: this is a transformation that can be applied to $y$ to change e.g. its scale, magnitude, etc.
  - for instance, $T_y$ can map the average accuracy to the logit of the true class.

we can now define a stronger version of subjective emergence:

Definition 2 [Strong subjective emergence of $y$]. For all $T_x \in \mathcal{T}_x$ and $T_y \in \mathcal{T}_y$, let $T_y(y) = \mathbb{E}_z f(T_x(x), z)$. Then, given $\delta_{T_x,T_y} > 0$ and $\epsilon_{T_x} > 0$, there exists $T_x(x') \in \mathbb{R}$ such that $\left| \mathbb{E}_z \frac{\partial f}{\partial T_x(x)}(T_x(x'), z) \right| > \left| \mathbb{E}_z \frac{\partial f}{\partial T_x(x)}(T_x(\tilde{x}), z) \right| + \delta_{T_x,T_y}$ for all $T_x(\tilde{x})$ with $\left| T_x(\tilde{x}) - T_x(x')\right| > \epsilon_{T_x}$.

this is essentially identical to *weak subjective emergence* except that we now impose that emergence should hold over a set of possible transformations made to $x$ and $y$. that is, we cannot simply choose *one* particular choice of $x$ and $y$, observe emergence and declare that emergence happened. rather, we need to show that such emergence happens even if we transform $x$ and $y$ in many reasonable ways.
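to make this concrete, here is a minimal sketch of how one could check emergence over a small, hand-picked set of transformations. everything here is an assumption for illustration: the accuracies are a noiseless sigmoid standing in for seed-averaged results, the transformation sets (`identity`/`log` for $x$, `identity`/`logit` for $y$) and the `(delta, eps)` values in `settings` are my own subjective picks, and derivatives are again replaced by finite differences.

```python
import math


def has_emergence(xs, ys, delta, eps):
    """Weak check: does the steepest consecutive pair beat all pairs farther than eps?"""
    n = len(xs) - 1
    slopes = [abs((ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])) for i in range(n)]
    mids = [(xs[i] + xs[i + 1]) / 2.0 for i in range(n)]
    best = max(range(n), key=lambda i: slopes[i])
    return all(
        slopes[best] > slopes[i] + delta
        for i in range(n)
        if abs(mids[i] - mids[best]) > eps
    )


def strong_emergence(xs, ys, x_transforms, y_transforms, settings):
    """Run the weak check for every (T_x, T_y) pair.

    settings maps (x_name, y_name) to that pair's own (delta, eps).
    Returns (holds_for_all_pairs, per_pair_results).
    """
    results = {}
    for x_name, t_x in x_transforms.items():
        for y_name, t_y in y_transforms.items():
            delta, eps = settings[(x_name, y_name)]
            results[(x_name, y_name)] = has_emergence(
                [t_x(x) for x in xs], [t_y(y) for y in ys], delta, eps
            )
    return all(results.values()), results


# hypothetical seed-averaged accuracies: a noiseless sigmoid in x
xs = [float(x) for x in range(1, 11)]
ys = [1.0 / (1.0 + math.exp(-4.0 * (x - 5.5))) for x in xs]

x_transforms = {"identity": lambda x: x, "log": math.log}
y_transforms = {"identity": lambda y: y, "logit": lambda y: math.log(y / (1.0 - y))}
# subjective choices of (delta, eps) for each combination
settings = {
    ("identity", "identity"): (0.1, 0.5),
    ("identity", "logit"): (0.1, 0.5),
    ("log", "identity"): (0.1, 0.05),
    ("log", "logit"): (0.1, 0.05),
}
is_strong, results = strong_emergence(xs, ys, x_transforms, y_transforms, settings)
print(is_strong)
for pair, holds in results.items():
    print(pair, holds)
```

in this toy case the sharp jump seen in raw accuracy is exactly a straight line after the logit transformation, so the weak emergence under the identity transformations does not survive as strong emergence.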

these two definitions collapse onto each other when $|\mathcal{T}_x|=1$ and $|\mathcal{T}_y|=1$; that is, if we only consider one particular combination of $x$ and $y$ without considering any other possible transformations of them.

this definition of emergence is still subjective, since it relies on the subjective choice of $\mathcal{T}_x$, $\mathcal{T}_y$, $\delta$ (for each combination $(T_x,T_y)$) and $\epsilon$ (again for each $(T_x,T_y)$). one may even say this is even more subjective, as we need to decide on more things here, including transformations of $x$ and $y$ as well as the tolerance and neighbourhood radius for each transformation combination. nevertheless, because the notion of emergence must hold over a larger set of how we define $x$ and $y$, i’d find emergence observed according to this definition to be stronger and much more interesting.

so, we want these transformation sets to be neither so narrow that these two definitions collapse nor so broad that we never observe strong emergence. what would be some possible transformation sets that fall in the middle (since almost always the answer is somewhere in the middle)?

in my view, a good choice of the transformation set (for either $x$ or $y$) is a set of all (noisy) monotonic transformations. for instance, if we take $x$ to be the number of updates in neural net training, we should also consider the $L_2$-norm of the parameters, as it grows (almost) monotonically w.r.t. the number of updates. if the claimed weak emergence over the number of updates disappears when we transform it into the $L_2$-norm of the parameters, we can’t claim stronger emergence. in the case of $y$, an interesting transformation is the repeated application of $\log$. how many $\log$-transformations of $y$ does the claimed emergence withstand? this would give us a sense of the strength of observed emergence.

finally, can there be *objective emergence*? i believe so, although such emergence would be very narrow in that there is essentially no room for any choice or interpretation. for instance, earlier together with Laura Graesser and Douwe Kiela, we demonstrated that a symmetric pair-wise protocol only emerges among communicating agents if there are at least three agents (it’s a bit obvious, though.) in this case, this emergence is objective, in that there’s no other transformation to choose (i.e., the number of agents is just the number of agents, and the communication success is defined as 0-1 and no other way) nor any other definition of tolerance or neighbourhood. in other words, *objective emergence* would be identical to *subjective emergence* except that the problem setup is extremely constrained to the point that there is no room for subjective choice nor interpretation, which makes it less interesting in general.

that wraps up yet another post of my random thoughts that would never make it to papers. have a nice day!

**Acknowledgement**:

- Thank you, Prof. Ernest Davis, for pointing out that the emergence should be defined w.r.t. $y$. this comment has been reflected.
- Thanks to Daniel Paleka’s comment, i clarified in the second definition that $\delta$ and $\epsilon$ are dependent on the choice of transformations.

for instance, imagine training a face detector for your phone’s camera in order to determine which filter to apply (one optimized for portraits and the other for other types of pictures). if most of the training examples for building such a face detector were taken in bright daylight, one often says without hesitation that this face detector would work better on pictures taken under bright daylight than on pictures taken indoors. this sounds pretty reasonable *until* you start thinking of some simplified scenarios. and, that started for me a few years ago, which eventually led me to write this blog post.

so, let’s consider a very simple binary classification setup. let $D=\{(x_n, y_n)\}_{n=1}^N$ be the training set. $f(x)$ returns the number of occurrences of $x$ within $D$, that is,

$$f(x) = \sum_{n=1}^N I(sim(x_n, x) \geq \epsilon),$$

where $sim$ is a similarity metric, and $\epsilon$ is a similarity threshold. $I$ is an indicator function. if we set $\epsilon=1$ and $sim(a,b) = I(a=b)$, $f(x)$ literally counts the duplicates of $x$ within the training set.

we assume that the training set is *separable*, which makes everything so much easier to imagine in our head and also reason through.

in this simple setup, what is really interesting (to me, at least) is that the number of duplicates $f(x_n)$ of any $x_n \in D$ does not affect a separating decision boundary. as soon as one of the duplicates is correctly classified (i.e., on the right side of the decision boundary), all the other duplicates are equally well classified and would not affect our choice of the decision boundary.

this is most clearly demonstrated by the perceptron learning rule which is defined as

$$w^n = \begin{cases}
w^{n-1}, &\text{if } y_n (w^{n-1} \cdot x_n) > 0 \\
w^{n-1} + y_n x_n, &\text{otherwise}.
\end{cases}$$

that is, the decision boundary defined by $w^{n-1}$ is only updated if $x_n$ is incorrectly classified, i.e., $y_n (w^{n-1} \cdot x_n) \leq 0$. once $x_n$ is correctly classified, all the subsequent duplicates of $x_n$ do not contribute to the decision boundary.
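here is a minimal sketch of this behaviour with a plain perceptron in python. the toy dataset and the function names are made up for illustration; the last feature is a constant 1 that plays the role of a bias. in general the final weights can depend on the order in which examples are presented, so treat this as one illustrative run rather than a proof.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def perceptron(data, max_epochs=100):
    """Train a perceptron; returns (weights, total number of updates)."""
    w = [0.0] * len(data[0][0])
    updates = 0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            if y * dot(w, x) <= 0:  # misclassified (or exactly on the boundary)
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        updates += mistakes
        if mistakes == 0:  # every example is on the right side: stop
            break
    return w, updates


# toy separable set; the last feature is a constant 1 acting as a bias
unique = [
    ((2.0, 1.0, 1.0), +1),
    ((1.0, 3.0, 1.0), +1),
    ((-1.0, -2.0, 1.0), -1),
    ((-2.0, -0.5, 1.0), -1),
]
# the same set plus 1000 extra copies of the first example
with_duplicates = unique + [unique[0]] * 1000

w_unique, n_unique = perceptron(unique)
w_dup, n_dup = perceptron(with_duplicates)
print(w_unique, n_unique)
print(w_dup, n_dup)
```

the thousand extra copies trigger no additional updates here, because the example they duplicate is already on the right side of the boundary by the time they are visited.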

another example is a max-margin classifier, such as a support vector machine. in this case, we can think of how the margin of a (separating) decision boundary is defined. the margin is defined as the sum of the distances to the nearest correctly-classified examples from the two classes (positive and negative). in other words, the only examples that matter for determining the optimal decision boundary are the nearest correctly-classified ones (at least two; they are called *support vectors*), and all the other examples that are correctly classified and far from the decision boundary (recall the separability assumption) do not contribute to the optimal decision boundary. that is, it really doesn’t matter whether there are many duplicate copies of any particular example, as that group of examples either contributes equally to the margin or does not contribute at all.
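the simplest case in which to see this is one-dimensional: for separable 1-d data with all negatives to the left of all positives, the max-margin boundary is the mid-point between the two nearest opposite-class examples. the sketch below assumes exactly this restricted setup (it is not a general SVM solver), and the function name and data are made up.

```python
def max_margin_threshold_1d(points):
    """Hard-margin boundary for separable 1-d data with negatives left of positives.

    points: list of (x, y) with y in {-1, +1}. Returns (threshold, margin),
    where margin is the sum of the distances from the threshold to the
    nearest example of each class.
    """
    nearest_neg = max(x for x, y in points if y == -1)
    nearest_pos = min(x for x, y in points if y == +1)
    assert nearest_neg < nearest_pos, "data must be separable in this orientation"
    return (nearest_neg + nearest_pos) / 2.0, nearest_pos - nearest_neg


data = [(-3.0, -1), (-1.0, -1), (2.0, +1), (5.0, +1)]
# duplicate one far-away example and one support vector a hundred times each
duplicated = data + [(5.0, +1)] * 100 + [(-1.0, -1)] * 100

print(max_margin_threshold_1d(data))
print(max_margin_threshold_1d(duplicated))
```

duplicating a far-away example or a support vector leaves both the threshold and the margin untouched, since only the positions of the nearest examples enter the computation.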

Then, does it mean that the existence of duplicates of each training example does not matter when it comes to learning a classifier? Or, better put, why do we think the existence of duplicates changes how our classifiers work?

every now and then, i stumble upon discussion on the difference between parametric and non-parametric methods. every time i believe i found the answer to this question in a way that is explainable to my students and colleagues, but quite rapidly my belief on that answer fades away, and i start to doubt myself as a computer scientist. the last episode was pretty recent, and you can find people’s responses and insightful answers at

it turned out that this seemingly naive and dumb question connects to this issue of whether/how duplicates of training examples impact classification. what do i mean by that?

instead of perceptron and support vector machine above, which can be thought of as parametric approaches, since their discovered decision boundaries are described *without* referring to the training examples, i.e., on their own, let us consider one of the simplest and perhaps most powerful non-parametric classifiers, whose decision boundary is a function of the training examples and whose complexity grows as we include more training examples. this classifier is the $k$-nearest neighbour classifier ($k$NN).

given a new example $x$ we want to classify using our $k$NN classifier, let $(x_n,y_n)$ be the nearest neighbour of $x$. given the number of duplicates in the training set $f(x_n)$, we can now tell how many other neighbours are considered by this $k$NN; the answer is $\max(0, k - f(x_n))$. that is, the probability of this new example $x$ belonging to $y_n$ is written down as:

$$p(y=y_n| x) = \frac{\min(k, f(x_n))}{k} + \frac{1}{k} \sum_{(x',y') \in \mathcal{N}_k(x)} I(x' \neq x_n) I(y' = y_n),$$

where $\mathcal{N}_k(x)$ is the set of $k$ nearest neighbours of $x$. as $f(x_n)$ grows, the first term dominates, and the chance of classifying $x$ into $y_n$ consequently grows as well. that is, the more duplicates we have of $x_n$ the higher probability for $y_n$. that is, the region corresponding to $(x_n,y_n)$ grows as the number of its duplicates increases, which is precisely what a non-parametric classifier does.
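a tiny sketch of this effect, assuming 1-d inputs, absolute distance and a made-up training set in which the point nearest to the query is labelled "A" and the next two are labelled "B":

```python
def knn_probability(train, query, label, k):
    """Fraction of the k nearest training points (1-d inputs) carrying `label`."""
    neighbours = sorted(train, key=lambda xy: abs(xy[0] - query))[:k]
    return sum(1 for _, y in neighbours if y == label) / k


def make_train(num_copies):
    # num_copies duplicates of the point nearest to the query, plus two "B" points
    return [(0.0, "A")] * num_copies + [(1.0, "B"), (1.2, "B")]


for copies in (1, 2, 3, 5):
    print(copies, knn_probability(make_train(copies), query=0.4, label="A", k=3))
```

with $k=3$, each extra copy of the nearest neighbour pushes one "B" point out of the neighbourhood until the copies fill all $k$ slots, which is the $\min(k, f(x_n))/k$ term above.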

so, what does this tell us? the impact of duplicates in the training set differs between parametric and non-parametric approaches. it is not only in classification, but also in generative modeling, since much of generative modeling can be thought of as supervised learning in disguise. if we are dealing with non-parametric methods, we probably want to take into account duplicates in the training set and either keep them as they are or de-duplicate them. this decision will have to be made for each problem separately. if we are working with parametric methods, we probably don’t need to worry about these duplicates beyond the computational concern.

how does this observation connect with the urban legend/myth on the impact of duplicates? i believe this simply tells us that classifiers we use often in practice are often non-parametric, including $k$NN, neural nets and random forests. in other words, it wasn’t really about whether duplicates matter but it was more about what is a common practice in modern machine learning; that is, we use non-parametric classifiers.

there’s nothing serious nor insightful here, but i enjoyed this thought experiment!

in my mind, there are three ways to define sparse coding.

- **code sparsity**: the code is sparse, i.e., $|z|_0 = O(1)$.
- **computational sparsity**: the computation is sparse, i.e., $x = \sum_{k=1}^K z_k w_k$, where $K = O(1)$ and $w_k \in \mathbb{R}^d$.
- **noise robustness**: the computation is robust to perturbation to the parameters: let $\tilde{w} = w + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2 1_{|w|})$. the MSE between $x$ and $\tilde{x}$, i.e., $|\sum_{k=1}^K z_k w_k - \sum_{k=1}^K z_k \tilde{w}_k|_2^2$, is $O(d \times \sigma^2)$ not $O(d' \times d \times \sigma^2)$, because $K \ll d'$ is a constant w.r.t. $d'$, the dimensionality of the code $z$.

these are equivalent if we constrain the decoder to be linear (i.e., $x = \sum_{i=1}^{d'} z_i w_i$), but they are not with a nonlinear decoder. in particular, let us consider a neural net decoder with a single hidden layer such that $x = u \max(0, w z),$ where $u \in \mathbb{R}^{d \times d_h}$ and $w \in \mathbb{R}^{d_h \times d'}$. we can then think of how these different notions of sparsity manifest themselves and how we could encourage these different types of sparsity when training a neural net.

the amount of computation is then $O(d \times d_h + d_h \times d')$ which reduces to $O(d \times d')$ assuming $d_h = O(d')$ and $d' = O(d)$. even if we impose the code sparsity on $z$ with $k$ non-zero entries, the overall computation barely changes ($O(d \times d_h + d_h \times k)$) and remains $O(d \times d')$. in other words, code sparsity does not imply computational sparsity, unlike in linear sparse coding.
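a quick back-of-the-envelope count of multiply-adds makes the gap visible. the sizes below ($d=64$, $d_h=d'=256$, $k=4$) are arbitrary numbers i picked for illustration, and the count assumes we can skip every multiplication by an exact zero at no cost.

```python
def linear_decoder_cost(d, code_nonzeros):
    """Multiply-adds for x = sum_i z_i w_i, skipping zero entries of z."""
    return d * code_nonzeros


def relu_decoder_cost(d, d_h, code_nonzeros, hidden_nonzeros=None):
    """Multiply-adds for x = u max(0, w z), skipping zero entries.

    hidden_nonzeros=None means the hidden layer is treated as dense.
    """
    if hidden_nonzeros is None:
        hidden_nonzeros = d_h
    return d * hidden_nonzeros + d_h * code_nonzeros


d, d_h, d_code, k = 64, 256, 256, 4  # illustrative sizes only

print("linear, dense code: ", linear_decoder_cost(d, d_code))
print("linear, sparse code:", linear_decoder_cost(d, k))
print("relu, dense code:   ", relu_decoder_cost(d, d_h, d_code))
print("relu, sparse code:  ", relu_decoder_cost(d, d_h, k))
print("relu, sparse code and hidden:", relu_decoder_cost(d, d_h, k, hidden_nonzeros=k))
```

with the linear decoder a sparse code cuts the cost by a factor of $d'/k$, whereas with the one-hidden-layer decoder the $d \times d_h$ term stays untouched unless the hidden layer is made sparse as well.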

based on this observation, one can imagine imposing sparsity on all odd-numbered layers (counting the $z$ as the first layer) and the penultimate layer (one before $x$) in order to satisfy **computational sparsity** with a nonlinear decoder. in the example above, this implies that the sparsity should be imposed on both $z$ and $\max(0, wz)$.

this naive approach to computational sparsity implies noise robustness, as the number of parameters used in computation is restricted by construction. it does not mean however that there isn’t any other way to impose noise robustness. in particular, we can rewrite the whole problem of sparse coding as

$$\min_{z, w, u} \frac{1}{N} \sum_{n=1}^N |x^n - u \max(0, w z^n)|^2$$

subject to $$| \text{Jac}_{w,u} u \max(0, w z^n) |_F^2 < k d~\text{for all}~n=1,\ldots, N.$$

in other words, the influence of perturbing the parameters on the output must be bounded by a constant multiple of the output dimensionality.

of course it is not tractable to solve this problem exactly, but we can write a regularized proxy problem:

$$\min_{z, w, u} \frac{1}{N} \sum_{n=1}^N \left( |x^n - u \max(0, w z^n)|^2 + \lambda | \text{Jac}_{w, u} u \max (0, wz^n) |_F^2 \right),$$

where $\lambda$ is a regularization strength. in other words, we find the parameters, $w$ and $u$, that are **robust to perturbation** in terms of the output.
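for this particular one-hidden-layer decoder the penalty can be worked out by hand. with $a = wz$ and $h = \max(0, a)$, we have $\partial x_i / \partial u_{ij} = h_j$ and $\partial x_i / \partial w_{jk} = u_{ij} I(a_j > 0) z_k$, so the squared Frobenius norm is $d |h|^2 + |z|^2 \sum_j I(a_j > 0) \sum_i u_{ij}^2$. the sketch below computes this closed form for a single example and sanity-checks it against finite differences; the function names and the toy numbers are mine, and it assumes no pre-activation sits exactly at the kink of the ReLU.

```python
def forward(u, w, z):
    """x = u max(0, w z) with u: d x d_h, w: d_h x d', z: d'."""
    a = [sum(wjk * zk for wjk, zk in zip(row, z)) for row in w]
    h = [max(0.0, aj) for aj in a]
    x = [sum(uij * hj for uij, hj in zip(row, h)) for row in u]
    return x, a, h


def jacobian_penalty(u, w, z):
    """Closed-form squared Frobenius norm of the Jacobian of x w.r.t. (w, u)."""
    _, a, h = forward(u, w, z)
    d = len(u)
    wrt_u = d * sum(hj * hj for hj in h)
    z_sq = sum(zk * zk for zk in z)
    wrt_w = z_sq * sum(
        sum(row[j] ** 2 for row in u) for j in range(len(a)) if a[j] > 0
    )
    return wrt_u + wrt_w


def jacobian_penalty_fd(u, w, z, step=1e-6):
    """Same quantity by forward finite differences, as a sanity check."""
    x0, _, _ = forward(u, w, z)
    total = 0.0
    for mat in (u, w):
        for row in mat:
            for c in range(len(row)):
                old = row[c]
                row[c] = old + step  # perturb one parameter in place
                x1, _, _ = forward(u, w, z)
                row[c] = old  # restore it
                total += sum(((b - a) / step) ** 2 for a, b in zip(x0, x1))
    return total


# toy sizes: d = 2, d_h = 3, d' = 2
u = [[1.0, -2.0, 0.5], [0.0, 1.0, 3.0]]
w = [[1.0, 0.0], [-1.0, 1.0], [0.5, 0.5]]
z = [2.0, 1.0]

print(jacobian_penalty(u, w, z), jacobian_penalty_fd(u, w, z))
```

hidden units that are switched off contribute nothing to either term, which is one way to see why sparsity in $\max(0, wz)$ and noise robustness are related.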

*So, which sparsity are we referring to and do we desire when talking about sparsity in neural networks?*

Delip Rao then retweeted and said that he does not “buy his lossy compression analogy for LMs”, in particular in the context of JPEG compression. Delip and i exchanged a few tweets earlier today, and i thought i’d restate here in a blog post what i described in the following tweet, namely why i think LM and JPEG have the same conceptual background:

one way in which *I* view a compression algorithm is that it (the algorithm $F$) produces a concise description of a distribution $p_{compressed}$ that closely mimics the original distribution $p_{true}$. that is, the goal of $F$ is to turn the description of $p_{true}$ (i.e., $d(p_{true})$) into the description of $p_{compressed}$ (i.e., $d(p_{compressed})$) such that (1) $p_{true}$ and $p_{compressed}$ are similar to each other, and (2) $d(p_{true}) \gg d(p_{compressed})$. now, this is only

then, how can JPEG be viewed from this angle? in JPEG, there is a compression-decompression routine that can be thought of as a conditional distribution over the JPEG encoded/decoded images given the original image, i.e., $p_{JPEG}(x' | x)$, where $x$ and $x'$ are both images. it is almost always deterministic, and this may be considered as a Dirac delta distribution. then, given the true natural image distribution $p_{true}$, we can get the following compressed distribution:

$$p_{compressed}(x') = \sum_{x \in \mathcal{X}_{image}} p_{JPEG}(x'|x) p_{true}(x).$$

that is, we convolve all the images with the JPEG conditional distribution to obtain the compressed distribution.

why is this compression? because JPEG loses many fine details about the original image, there are many original images that map to a single image with JPEG-induced artifacts. this makes the number of probable modes under $p_{compressed}$ smaller than that under the original distribution, leading to a lower entropy. this in turn means we need fewer bits to describe this distribution, hence, *compression*.
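a toy version of this argument, with heavy simplifying assumptions: "images" are just the integers 0 to 7, $p_{true}$ is uniform over them, and the stand-in for JPEG is a deterministic codec that drops the two lowest bits. none of this is how JPEG actually works; it only illustrates how a many-to-one map lowers entropy.

```python
import math
from collections import defaultdict


def entropy_bits(p):
    """Shannon entropy (in bits) of a distribution given as {outcome: probability}."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)


def push_forward(p_true, codec):
    """p_compressed(x') = sum_x 1[codec(x) = x'] p_true(x) for a deterministic codec."""
    p_compressed = defaultdict(float)
    for x, prob in p_true.items():
        p_compressed[codec(x)] += prob
    return dict(p_compressed)


# toy "images": integers 0..7, uniformly likely; the toy codec drops the two low bits
p_true = {x: 1.0 / 8 for x in range(8)}
p_compressed = push_forward(p_true, codec=lambda x: 4 * (x // 4))

print(entropy_bits(p_true), entropy_bits(p_compressed))
```

eight equally likely inputs collapse onto two outputs, so the entropy drops from three bits to one.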

when there is a mismatch between $p_{true}$ and $p_{compressed}$, we can imagine two scenarios. one is that we lose a probable configuration under $p_{true}$ in $p_{compressed}$, which is often referred to as *mode collapse*. the other is $p_{true}(x) \downarrow$ when $p_{compressed}(x) \uparrow$, which is often referred to as *hallucination*. the latter is not really desirable in the case of JPEG compression, as we do not want it to produce an image that has nothing to do with any original image, but this is at the heart of generalization.

combining these two cases we end up with what we mean by *lossy* compression. in other words, any mismatch between $p_{true}$ and $p_{compressed}$ is what we mean by *lossy*.

in language modeling, we start with a vast amount of training examples, which i will collectively consider to constitute $p_{true}$, and our compression algorithm is regularized maximum likelihood (yeah, yeah, RLHF, instructions, blah blah). this compression algorithm (LM training, if you prefer) results in $p_{compressed}$, which we use a trained neural net to represent (though, this does not imply that this is the most concise representation of $p_{compressed}$.)

just like JPEG, LM training inevitably results in a discrepancy between $p_{true}$ (i.e., the training set under my definition above) and $p_{compressed}$ due to a number of factors, including the use of finite data as well as our imperfect parametrization. this mismatch however turned out to be a *blessing* in this particular case, as this implies *generalization*. that is, $p_{compressed}$ is able to assign a high probability to an input configuration that was not seen during training, and such a highly probable input turns out to look amazing to us (humans!)
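a deliberately tiny illustration of this point, using a bigram model instead of a neural net. the three-sentence corpus is made up, and add-alpha smoothing stands in for "regularized maximum likelihood" (it is one simple form of regularization, not what anyone uses to train a modern LM).

```python
from collections import Counter

BOS, EOS = "<s>", "</s>"


def fit_bigram(corpus):
    """Count bigrams and their contexts; returns (pair_counts, context_counts, vocab)."""
    pair_counts, context_counts, vocab = Counter(), Counter(), {EOS}
    for sentence in corpus:
        tokens = [BOS] + sentence.split() + [EOS]
        vocab.update(tokens[1:])
        for prev, cur in zip(tokens, tokens[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    return pair_counts, context_counts, vocab


def sentence_prob(sentence, model, alpha=0.0):
    """Probability under the bigram model; alpha > 0 turns on add-alpha smoothing."""
    pair_counts, context_counts, vocab = model
    tokens = [BOS] + sentence.split() + [EOS]
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        num = pair_counts[(prev, cur)] + alpha
        den = context_counts[prev] + alpha * len(vocab)
        prob *= (num / den) if den > 0 else 0.0
    return prob


corpus = ["the cat sat", "the dog sat", "the cat ran"]  # plays the role of p_true
model = fit_bigram(corpus)

seen, unseen = "the cat sat", "the dog ran"
print(sentence_prob(seen, model), sentence_prob(unseen, model))
print(sentence_prob(seen, model, alpha=1.0), sentence_prob(unseen, model, alpha=1.0))
```

without smoothing the unseen sentence gets probability zero; with smoothing it gets some probability mass, taken away from the sentences that were actually seen. that transfer of mass is both the *loss* and the *generalization*.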

in summary, both JPEG compression and LM training turn the original distributions of natural images and human-written text, respectively, into their *compressed* versions. in doing so, there is an inevitable mismatch between the two distributions in each case, and this is why we refer to this process as *lossy* compression. this lossy nature ends up assigning non-zero probabilities to unseen input configurations, and this is *generalization*. in the case of JPEG, such generalization is often undesirable, while desirable generalization happens with LM thanks to our decades of innovations that have culminated in modern language models.

so, yes, both are lossy compression with comparable if not identical underlying conceptual frameworks. the real question is however not about whether lossy compression makes LM’s less or more interesting, but more like which ingredients we have found to build these large-scale LM’s contribute to such *desirable* generalization and how.

a major part of running the reviewing process is to ensure all the reviews, meta-reviews, decisions as well as decision agreements are received in time, in order to ensure that we find a set of quality papers to be presented at the conference and that the authors of these accepted papers as well as participants are given enough time to prepare their travels to the conference. in a sense, everyone, including program chairs ourselves, agrees to serve a role, either as a reviewer, area chair or senior area chair, at the time of invitation, and we might naively expect all to stick to the timeline. it is however not the case due to the sheer scale of the conference’s main track, with more than 13,000 submitted abstracts and more than 10,000 reviewers we recruited. what is the chance that every single reviewer is fully available without any personal or professional emergencies over the summer, which is the period over which NeurIPS reviewing happens? if we assume a 0.01% chance of a personal/professional emergency for each individual reviewer, the chance that every one of the 10,000 reviewers is fully available over this period is $(1 - 0.0001)^{10000} \approx 37\%$, i.e., less than 40% …
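the arithmetic behind that number, as a two-line sketch. it assumes emergencies are independent across reviewers, which is of course a simplification, and uses the round figure of 10,000 reviewers.

```python
def prob_all_available(num_reviewers, p_emergency):
    """Chance that nobody has an emergency, assuming independence across reviewers."""
    return (1.0 - p_emergency) ** num_reviewers


p = prob_all_available(num_reviewers=10_000, p_emergency=0.0001)
print(f"{p:.4f}")
```

the result is close to $1/e$, since $(1 - 1/n)^n$ approaches $e^{-1}$ as $n$ grows.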

now of course on top of that, we are all humans and simply make mistakes by for instance forgetting to put various deadlines on our calendars or simply over-committing ourselves. these mistakes can be however mitigated to some degree by reminders, or at least that was my thought back this summer (2022).

as part of this effort of politely but strongly reminding reviewers as well as area chairs of upcoming deadlines, i decided to finally benefit from a reasonably large number of followers i have on twitter (as of Dec 12 2022, i have 42.5k followers). who knew i would ever use twitter for my own benefit (and i want to say, for the community’s benefit)? but, the time had finally arrived …

i decided to piggy-back on people’s liking of memes on Twitter and started to post quite regularly NeurIPS’22 reviewing memes. It started on Jun 23 2022 and then continued until July 14 2022. here, i’ll list all of them for you to easily see how quickly my mind has sprawled into a dark abyss over time … i am in fact unsure if i’ve ever gotten out of this dark abyss i fell through …

… with even all these tweets, we failed to collect all necessary reviews in time …

in this campaign’s page, they cited one news piece from SBS where they surveyed 21 young people about their situations to illustrate how the starting points for young people in the Korean society dramatically vary across individuals, despite our illusion of fair and equal treatment. it’s nothing rigorous and quite anecdotal, but quite thought-provoking, as it starkly “shows” these differences: https://www.youtube.com/watch?v=AaLZ3bmCb_k. the participants were asked 56 questions, and out of these, the campaign page listed a few (some of these are pretty specific to Korea, i must say, though):

- if you have had to move every 1-2 years, take a step back. 어쩔 수 없이 1,2년 단위로 집을 옮겨야 한다면 / 옮겨 다니고 있다면 한 발 뒤로
- if you don’t have insurance, take a step back. 4대 보험을 받지 못한다면 한 발 뒤로
- if you have to explain your family situations or lifestyle choices frequently to others, take a step back. 내가 취하고 있는 가족 구성원 형태 또는 삶의 형태에 대해 사람들에게 종종 설명을 해야 한다면 한발 뒤로
- if you’ve ever missed paying utility bills, take a step back. 돈이 부족해서 공과금을 연체해 본 적이 있다면 한 발 뒤로
- if you had to take a leave of absence from school to earn money for tuition, take a step back. 등록금 때문에 휴학하고 돈을 벌어야 했다면 한 발 뒤로
- if you can always call mom or dad for financial support, take a step forward. 필요할 때 언제든 엄카, 아카를 쓸 수 있다면 한 발 앞으로
- if you had to prove your disability or financial hardship to receive financial aid, take a step back. 경제적 지원을 받기 위해 장애나 소득을 증명한 적이 있다면 한 발 뒤로
- if you had extracurricular education during your school years, take a step forward. 학창 시절 과외를 받아본 적이 있다면 한 발 앞으로
- if you could read as many books as you wanted when you’re younger, take a step forward. 어렸을 때 원하는 책을 마음껏 읽을 수 있었다면 한 발 앞으로
- if you can have whatever you want to eat delivered whenever you’re home alone, take a step forward. 혼자 있을 때 어느 시간 때고 마음 놓고 배달음식을 시켜 먹을 수 있다면 한 발 앞으로

and, you know what? when i asked myself these questions, i never took a step back and was always taking steps forward.

according to the campaign’s homepage, these children who graduate from the group homes as they turn 18 are provided with one-time support of \$4,000 or so (5M KRW) and monthly support of \$250 or so (300K KRW). for those who decide to continue their study in a college, this has never been enough. it has become even more of an issue during the pandemic, as our educational system began to ask students for even more, just for them to participate; they need to have good broadband to participate in remote lectures, they need to have some place quiet to participate in remote lectures without distraction, and they need to have a good laptop to participate in remote lectures, download necessary materials and submit their assignments.

so, i wanted to donate a bit to this campaign, but it turned out this was done via Kakao’s platform and required having a Kakao account which i don’t have. and, yes, i know the pain of creating an account for a Korean website, especially if i want to connect it with my credit card. so, i’ve given up on doing so via this specific campaign but emailed them directly to have a quick phone call.

they were super quick in giving me a call on the same day and gave a quick walk-through of their programs. by the end of this short call, i already promised to donate approximately \$27,000 (30M KRW) for any operation. it’s not a lot of money but i hope this can buy a few more laptops for them to support these kids and also to raise awareness of this issue, which is largely hidden. hopefully this little gesture of mine helps students even a tiny bit to take a smaller step back than before.

because i’m generally a show-off, i had to write this blog post to show off this little donation, but there are those who are truly contributing to making the world better. in particular, the Center’s various programs are run by the staff members of the Center as well as many activists and volunteers (some of whom are from these group homes themselves). i’ve been reading and watching some of the materials on their homepage, and i could not have been more impressed and moved by them. also, there are a lot of regular donors to this Center (http://jaripcare.com/bbs/board.php?bo_table=support) who are really making differences, unlike a one-time donor like me who shows up, boasts and disappears. a huge thanks to all these people who are literally making sure fewer people take fewer steps back in the society.

would you join me in supporting these kids to take a step forward instead of back?

the proposition party consisted of Sella Nevo, Maya R. Gupta and François Charton. Been Kim was unfortunately unable to participate, although she would’ve been a great addition to the proposition party. the proposition party argued that progress towards achieving AI will be mostly driven by engineering not science.

the opposition party (i guess … my party) consisted of Ida Momennejad, Pulkit Agrawal, Sujoy Ganguly and yours truly. the opposition party (perhaps obviously) opposed the proposition’s stance and argued that progress towards achieving AI will be mostly driven by science not engineering.

if you’re registered at ICML 2022, you can watch the recording of the debate at https://icml.cc/virtual/2022/social/20780. i don’t know if this will be released publicly when the conference is over, but i will update it here if and when that happens.

the debate was fun and was full of many interesting and thought-provoking ideas and points. i won’t try to summarize those points here, as that would require a huge amount of effort and i shouldn’t have had that much beer over the past 4 days …

instead, i’ll share my opening statement here. a distinct advantage i had as the opposition leader was that i could prepare my statement in advance, and now i can share it here. my main goal was to leave enough room for the other members of the party to delve deeper into their own views/expertise and also to expand on various aspects to address the proposition’s follow-up arguments.

here you go!

The opposition believes that progress toward achieving AI will be mostly driven by science not engineering.

Recent progress in large-scale models, such as language models and language-conditional image generation models, easily gives us the impression that what we see as impressive is largely the product of impressive engineering that has allowed us to effectively and efficiently scale up our systems. This impression is not what we oppose here.

Such impressive progress, however, has begun to give the incorrect impression that such a stellar level of engineering is what drives (if not the only way to drive) progress in AI research toward building a truly intelligent system. This impression is what we oppose here.

Instead of arguing here how engineering alone would not be enough for future progress toward achieving AI, I’d like to focus on more concrete examples of how engineering alone has not been enough to arrive at even the current state of AI, which I believe most of us agree is not at all close to the ultimate goal of truly intelligent machines.

As the first and perhaps most salient example today, I would like to talk about these super-impressive large-scale language models, represented by GPT-3 and many follow-up even more impressive models such as PaLM, BLOOM, etc. Despite their differences, there are a few core concepts shared by all these models that are critical to their existence.

First, they all rely heavily on the concept of maximum likelihood with autoregressive modeling. These two concepts together amount to building a classifier that predicts the next token given all the preceding tokens (words in many cases but the details do not matter much). And, doing so corresponds to estimating an upper bound on the true entropy of the distribution underlying the gigantic amount of text we use.

By building a machine to predict the next word correctly, which takes into account both short- and long-term dependencies (contrary to what many critics say), we approximate the text/language distribution very well and sample/generate extremely well-formed text and images from these distributions.
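As a minimal sketch of this point, assume a hypothetical toy "language" that is a three-token Markov chain (the vocabulary, the transition table `TRUE_NEXT` and all function names below are made up for illustration). A count-based next-token model is fit to sampled text, and its expected negative log-probability is compared with the true conditional entropy; by Gibbs' inequality the former can never fall below the latter.

```python
import math
import random

# hypothetical toy "language": a first-order Markov chain over three tokens
VOCAB = ["a", "b", "c"]
TRUE_NEXT = {
    "a": {"a": 0.1, "b": 0.6, "c": 0.3},
    "b": {"a": 0.5, "b": 0.2, "c": 0.3},
    "c": {"a": 0.4, "b": 0.4, "c": 0.2},
}


def sample_text(n_tokens, rng):
    """Sample a token sequence from the toy chain."""
    tokens = [rng.choice(VOCAB)]
    for _ in range(n_tokens - 1):
        probs = TRUE_NEXT[tokens[-1]]
        weights = [probs[v] for v in VOCAB]
        tokens.append(rng.choices(VOCAB, weights=weights)[0])
    return tokens


def fit_next_token_model(tokens):
    """Count-based estimate of p(next | previous).

    Add-one smoothing is an assumption made here so that no probability is zero.
    """
    counts = {u: {v: 1.0 for v in VOCAB} for u in VOCAB}
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1.0
    return {
        u: {v: counts[u][v] / sum(counts[u].values()) for v in VOCAB}
        for u in VOCAB
    }


def expected_nll(model, context):
    """Expected negative log-probability of the next token under the true chain."""
    return -sum(
        TRUE_NEXT[context][v] * math.log(model[context][v]) for v in VOCAB
    )


rng = random.Random(0)
model = fit_next_token_model(sample_text(5000, rng))
for context in VOCAB:
    entropy = expected_nll(TRUE_NEXT, context)   # true conditional entropy
    cross_entropy = expected_nll(model, context)  # what the model achieves
    print(context, round(entropy, 4), round(cross_entropy, 4))
```

The gap between the two printed numbers for each context is the KL divergence between the true and the estimated next-token distributions, which shrinks as more text is used for fitting.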

Where did this idea come from? Has this idea benefited from superb engineering? Yes, superb engineering, including software and hardware, has dramatically pushed the boundary of the said technique but the birth and full formalization of next-word prediction can be traced back all the way to Claude Shannon’s paper from 1950.

This same idea was revived and pushed dramatically from the late 80’s when folks from IBM, including Peter Brown and Bob Mercer, built the first statistical machine translation system where a large-scale (yes! it was already large then) target-side language model was a critical component.

The very same idea was revived or rejuvenated multiple times even after that, including in the late 90’s with Yoshua Bengio’s neural language models, around 2010 with Alex Graves’ and Tomas Mikolov’s recurrent language models, and now with attention-based models.

Better engineering, in terms of better software and better hardware, has indeed pushed the boundary of what we can do with this next-word prediction, but the seed of what we see now was already planted by “science” in the 50’s.

Second, I’d like to talk about all the “techniques” or “tricks” that facilitate learning. Although it may look like faster hardware and better software frameworks are the main drivers of recent advances in large-scale language models, it is highly questionable whether we could have trained any reasonable model had we not found a series of techniques that enable us to do so.

For instance, non-saturating nonlinearities, such as rectified linear units, are workhorses of modern neural networks, including large-scale language models. It is only natural to use ReLU or its variants now, but it wasn’t so until around 2010 when there were two papers, one from U. Toronto and the other from U. Montreal, that demonstrated the potential effectiveness of ReLU from two different perspectives. As an example, the first one, Nair & Hinton, derived the ReLU for restricted Boltzmann machines by viewing it as an approximation to having infinitely many replicated binary hidden units that share the weight vector but differ in their biases.
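A small numerical sketch of that approximation: the expected total activity of many binary units that share the same input but have biases shifted by -0.5, -1.5, -2.5, ... is close to the softplus log(1 + e^x), which in turn is close to max(0, x) away from zero. The function names and the number of copies below are illustrative choices, not taken from the paper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def replicated_binary_units(x, n_copies=200):
    """Expected total activity of n_copies binary units sharing input x,
    with bias offsets -0.5, -1.5, -2.5, ..."""
    return sum(sigmoid(x - i + 0.5) for i in range(1, n_copies + 1))


def softplus(x):
    return math.log1p(math.exp(x))


def relu(x):
    return max(0.0, x)


for x in [-3.0, -1.0, 0.0, 1.0, 3.0, 5.0]:
    print(x, round(replicated_binary_units(x), 4), round(softplus(x), 4), relu(x))
```

The three columns stay close to each other except around zero, where the softplus (and the sum of sigmoids) is smooth while the ReLU has its kink.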

Furthermore, the potential for using ReLU-like nonlinearities was studied extensively in (computational) neuroscience, which has inspired many to consider this in the context of artificial neural network research for many decades.

Would engineering alone have allowed us to jump from much more widely used sigmoid nonlinearities to ReLU? With exhaustive hyperparameter tuning using an excessive amount of resources, engineering may have ended up with a very particular way of initializing parameters and a very particular setup for optimization that makes sigmoid nonlinearity work, but it is unclear if that would’ve happened at all, because the community would’ve already given up on investing further in this direction.

Of course, the last example I want to bring up is shortcut connections, which reflects a bit of my personal preference. Shortcut connections, which include residual connections as well as gated connections in LSTMs and GRUs, are what we, the research community, had to spend decades to come up with in order to address the issue of vanishing gradient or long-range credit assignment. It started with mathematical analysis by Sepp Hochreiter and Yoshua Bengio in the early 90’s, some further empirical analysis by many people since then, and some proposals, such as leaky units, and others, of which some were successful and others were not as much.

Eventually, this was identified as a way to propagate gradient properly across many nonlinear layers of both recurrent and feedforward networks, as is evident from the near-universal presence of residual blocks or connections in modern neural networks, including large-scale language models that are built as transformers.
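To make the vanishing-gradient point concrete, here is a toy sketch with a scalar "network" of repeated tanh layers; the weight 0.5, the depth 50 and the function name are arbitrary assumptions for illustration. Without a shortcut, the gradient with respect to the input is a product of factors that are each at most 0.5; with an identity shortcut, each factor becomes one plus that term and the gradient no longer vanishes.

```python
import math


def input_gradient(depth, w=0.5, x=1.0, shortcut=False):
    """Gradient of the final activation w.r.t. the input of a scalar tanh stack."""
    h, grad = x, 1.0
    for _ in range(depth):
        pre = w * h
        local = w * (1.0 - math.tanh(pre) ** 2)  # d tanh(w * h) / d h
        if shortcut:
            h = h + math.tanh(pre)   # residual layer: h + f(h)
            grad *= 1.0 + local      # identity path keeps the factor >= 1
        else:
            h = math.tanh(pre)       # plain layer: f(h)
            grad *= local            # factor <= w, so the product shrinks
    return grad


plain = input_gradient(50)
residual = input_gradient(50, shortcut=True)
print(plain, residual)
```

In this toy setting the plain stack's gradient is bounded above by 0.5 to the power of the depth, while the residual stack's gradient stays at or above one.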

However small they seem and are, we could get to this point only because of all these science (or perhaps mathematics) driven innovations. More properly, I could say that it was science that has put us on this path so that engineering could push us forward following this path.

It may not look like this will happen anytime soon, but I can assure you that very soon the bandwagon driven by engineering on this path laid out by science will find itself at the next crossroad. Engineering won’t tell us which road to take next; it will be science that tells us which path we can and should take next in order to move us closer to AI.

First, go to your assigned submission. Here, I’m using an already-accepted paper at TMLR. On the submission page, you will see a “Show Revisions” button below the title:

If you click “Show Revisions”, you are directed to a page showing the list of “Revision History”. The revision history includes not only the changes that included the pdf file but also any changes that were made to the metadata.

If you want to compare two versions from the revision history, click the “Compare Revisions” button on the top right corner of this page. Then, you will be able to choose two different versions from the revision history for comparison. As an example, we choose the camera-ready version and the initial submission version in this case:

Scroll all the way up and click “View Differences” on the top right corner. This will lead you to the “Revision Comparison” page. On this page, the difference in the metadata shows up first:

If both revisions contained PDF files, at the bottom of the “Revision Comparison” page will be a “Document comparison” section that highlights the difference between the two versions of the pdf files:

Happy reviewing!

when you go to https://openreview.net/, you see “TMLR” as one of the active venues, as shown in the screenshot below. if not, you can go directly to the TMLR page by going to https://openreview.net/group?id=TMLR.

when you log in to Openreview at TMLR, you will see a link to your own console. if you’re a reviewer of TMLR, you’ll see a link to “Reviewer Console“. if you’re an action editor of TMLR, you’ll see a link to “Action Editor Console“. if you don’t see this link on the page, please click the link here directly to see if you can access it.

in the respective console, right before you see the list of your assignments, there’s an “Assignment Availability” box you can use to set your availability. here are two screenshots below:

by default, your availability is set to “Available”. this feature was implemented to allow reviewers and action editors to proactively set their availability, e.g., during their vacations.

if you plan to go on summer vacation, please visit Openreview and set your availability to “Unavailable”. but perhaps much more importantly, do not forget to set your availability to “Available” when you’re back. TMLR in its infancy needs all your help and support!

this learning-to-review-by-reading-one’s-own-reviews strategy has some downsides. a major one is that people are often left with a bitter taste after reading reviews of their own work, because reviewers need to (and often are instructed to) point out both up & downsides of any submission under review. it is this list of downsides that leaves a bitter taste, and these authors end up being overly critical of others’ work when they start reviewing.

perhaps, a reasonably easy first-step fix would be to expose new reviewers as well as prospective reviewers to reviews of 3rd-party papers, neither their own reviews nor reviews of their own papers. the openreview movement (i’m calling it a movement rather than Openreview itself, as Openreview does support closed-door reviewing, which is increasingly being adopted, such as by NeurIPS) enables this, although this is rarer than it should be, in my opinion, and is highly focused on a small number of areas.

so, i thought i’d start by sharing a random sampler of the reviews i’ve written in the past year or so. i understand that some authors may notice these reviews were for their own papers, which were either accepted or rejected. i hope they understand that i didn’t know their identity (i truly rarely do …) and just did my job as well as i could. i spent approximately 1-5 hours reviewing each 8-to-12-page-long submission.

by the way, i’m in no way saying that these are *good* reviews. reading these reviews again myself, i’m realizing how bad i am at reviewing myself, and that i also would’ve benefited a lot from learning to review. i mean … my … i do ask authors to cite my work a lot …

Let me start this review by saying that I like the authors’ idea and their motivations behind their proposal on modifying the (self-)attention mechanism to cope with long sequences better with computationally more efficient relative positional embedding. there are however three points i’d like to request the authors to address, after which i will strongly advocate for the manuscript’s acceptance. i’ll describe these issues after giving a quick summary of the authors’ contribution in the submission, from my own perspective (the authors are more than welcome to incorporate any of these in future revisions, if needed.)

in the multi-headed attention mechanism, there are two major components; one is to extract multiple values from each input vector (V) and the other is to compute the attention weights for each input vector per head (Q & K). because we often use softmax (or as the authors refer to it, L1-normalized exp), the attention weight from each head (per input vector) tends to focus on another vector that is at a particular distance away from the input vector. the authors cleverly exploit this phenomenon by computing the attention weights once using L2-normalized sigmoid (to encourage more evenly spread out attention weights) and selecting a subset of these spread-out attention weights using location-sensitive (but value-agnostic) masks (N) to form the attention weight for each attention head. in order to ensure that each such subset is computed with awareness of the associated attention head (or location), they further add a single vector bias (weighted sum of as many bias vectors as there are heads) to each vector when computing the attention weights. this is a clever approach that is well-motivated and well-executed. that said, the current manuscript is not without issues, which i will describe below.

first, one of the major motivations from the authors’ perspective is that this approach is better than existing relative positional embeddings or relative positional bias approaches, because it does not “change the memory configuration of these tensors in the accelerator in a less optimized manner”. i can roughly see their argument and why this may be the case, but as this is one of the major motivations, the authors need to explain this much more carefully. for instance, i’d suggest the authors set aside an entire section contrasting what kind of “memory rearrangement” is needed for the three approaches (RPE, RAB and Shatter) and discussing how Shatter is more efficient on which accelerators. of course, another way would be to remove this as a major motivation (could be mentioned only as a nice side-effect) and to refocus the motivation on other aspects of efficiency, such as fewer parameters and fewer 3-D tensors to maintain.

second, the authors emphasize that the proposed approach has fewer parameters to tune compared to e.g. XLNet and other relative positional embedding approaches. this is convincing from one angle; the authors’ clever strategy to share the attention weight computation across multiple heads does reduce the number of parameters involved in computing the attention weights for multiple heads.* it is however not convincing from another angle where the focus is on the lack of parameters in relative positional embedding (N). it is not about whether N does or does not have any trainable parameters, but that the choice of how to construct N is quite important, as the authors point out themselves in A.1: “the pretraining loss and finetuned performance are sensitive to the choice of the partition of unity”. it is almost like the authors worked themselves to tune the parameters behind N, which could be done for other approaches as well (see, e.g., https://arxiv.org/abs/2108.12409 where the bias is a linear function w.r.t. the distance |i-j| to induce some kind of relative distance attention.) that said, this is a minor point and can be fixed by rephrasing text here and there.

third, this is not really the issue of this manuscript but a general issue of most of the papers where they report finetuned accuracies on GLUE tasks, etc. if i understood the authors correctly when they stated “we conduct several runs to show one run better than average (i.e., if the number on some task is worse than average, we will re-run it and show a better one.)” (to be frank, i couldn’t understand what this means at all,) some of the runs with low accuracies are thrown away, and there are multiple runs for each task. unfortunately, i can’t understand the rationale behind throwing away some of those “worse” runs, nor why there is only 1 number (accuracy) per task/configuration pair in all the tables. i totally understand that the authors want to have “fair comparison” (though, it’s unclear in which aspect..) but this simply obscures how well the proposed approach works. to this end, i request the authors to report either both the mean and std. dev. (if there were enough runs) or max/med/min (if there were only a few runs) for each setup without throwing any results away for the runs they’ve run themselves. they can always report a single accuracy for each task-method pair according to what others have done separately as well.
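a minimal sketch of the kind of reporting requested here, using only the python standard library. the function name, the threshold of 5 runs for "enough runs" and the accuracies are all made up for illustration; none of them come from the manuscript under review.

```python
import statistics


def summarize_runs(accuracies, min_runs_for_std=5):
    """Summarize all runs of one task/method pair without discarding any."""
    n = len(accuracies)
    if n >= min_runs_for_std:
        # enough runs: report mean and sample standard deviation
        return {
            "mean": statistics.mean(accuracies),
            "std": statistics.stdev(accuracies),
            "n": n,
        }
    # only a few runs: report max / median / min instead
    return {
        "max": max(accuracies),
        "median": statistics.median(accuracies),
        "min": min(accuracies),
        "n": n,
    }


# made-up accuracies for illustration only, not results from any paper
few_runs = [0.80, 0.84, 0.82]
many_runs = [0.80, 0.84, 0.82, 0.79, 0.85]

print(summarize_runs(few_runs))
print(summarize_runs(many_runs))
```

the point of the sketch is only that every run enters the summary; which statistics to show is a choice that depends on how many runs are available.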

(*) by the way, i do not believe it is correct to say that Shatter is “single-headed self-attention”, since it does result in the vectors from multiple heads. it’s only that some parts of the computation of attention weights are cleverly shared. i’d suggest the authors refrain from saying so.

In this manuscript, the authors propose a variant of a transformer (or as a matter of fact any neural machine translation system) that is claimed to better capture styles of translation. during training, this proposed model, called an LSTransformer, finds the most similar reference sentence from a minibatch and uses this surrogate reference to compute the weighted sum of the latent style token embeddings. this style code is then appended to the source token embeddings before the source sentence is fed to the transformer for translation. at translation time, it looks like (i say so, because it wasn’t specified explicitly) the LSTransformer considers the entire test set together as if it’s a minibatch at training time to find a surrogate reference (based on the source sentence, which is possible because the embedding tables are shared between the encoder and decoder) based on which style embedding is computed and translation is done.

unfortunately the authors do not specify explicitly what they mean by “style”, and where the proposed LSTransformer finds information about “style”, which makes it pretty much impossible for me to understand what the proposed LSTransformer is supposed to do. this gets worse as in the experiments, the “style” of translation is almost equated with the domain from which test sentences were drawn, which is quite different from what i expect style to be based on the authors’ discussion of formality, etc. earlier. independently from my other points below, it’ll be critical for the authors to restructure the main part of the paper by first clearly defining what they mean by style and how the proposed approach captures such style (e.g., why does finding a random reference sentence from a minibatch consisting of i.i.d. samples help the LSTransformer capture a style? what if no reference sentence within the minibatch matches the style of the true reference?) and only then empirically demonstrating the effectiveness of the proposed approach.

In the training procedure, there’s some issue that i cannot wrap my head around. a major innovation the authors proposed is to use the so-called surrogate reference, which is the reference that’s most similar to the true reference within a minibatch. the cnn-based sentence encoder is used to compute the embedding for retrieving a surrogate reference. but, then if the minibatches were truly constructed to include sentences selected uniformly at random from the training set, isn’t the choice of the surrogate reference effectively the same as choosing any reference from the training set in expectation? that is, as training continues, the effect of choosing the “most similar surrogate” reference translation disappears, because every minibatch consists of i.i.d. samples from the training set (or training distribution). in one of the extreme cases, consider minibatches of size 2 each and using all possible size-2 subsets of the training set for training: every other training example serves as a surrogate reference. how does this procedure help with capturing “style”?
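the extreme case above can be checked with a small enumeration. this is only a sketch of the argument, not of the authors' method: the similarity function, the number of examples and all names are hypothetical stand-ins (the manuscript uses a cnn-based sentence encoder instead).

```python
from collections import Counter
from itertools import combinations


def surrogate(target, batch, similarity):
    """Most similar *other* example within the minibatch."""
    others = [i for i in batch if i != target]
    return max(others, key=lambda i: similarity(target, i))


def surrogate_counts(n_examples, batch_size, similarity):
    """Enumerate every minibatch of the given size and count, per target,
    how often each other example ends up as its surrogate."""
    counts = {i: Counter() for i in range(n_examples)}
    for batch in combinations(range(n_examples), batch_size):
        for target in batch:
            counts[target][surrogate(target, batch, similarity)] += 1
    return counts


def toy_similarity(i, j):
    # hypothetical similarity: examples with closer indices are more similar
    return -abs(i - j)


counts_size2 = surrogate_counts(6, 2, toy_similarity)
counts_size5 = surrogate_counts(6, 5, toy_similarity)
print(dict(counts_size2[0]))  # every other example is chosen equally often
print(dict(counts_size5[0]))  # with large minibatches the nearest one dominates
```

with size-2 minibatches the similarity function has no influence at all on which example becomes the surrogate; it only starts to matter as the minibatch grows relative to the training set.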

this questionable aspect of the proposed training procedure is only amplified when the inference/generation procedure is discussed. it is because the proposed LSTransformer does not use any references or sources from the same document or domain to capture and use the style, but simply uses the given source sentence. this is a weird setup for translation with style. consider the case where the target language exhibits more fine-grained levels of formality than the source language does. how does one expect a model to decide on the formality of the translation by looking only at the source sentence? perhaps the authors had something different in their mind, when they talked about style of translation, and as i pointed out earlier, it’ll be immensely helpful if the authors restructure the text by defining style to start with.

Along this line of thought, one notices that experiments, in particular comparison to vanilla transformers, are actually not too informative. there are two aspects of the proposed LSTransformer that differ from the vanilla Transformer. the first is the training procedure i re-described above, which has some questionable aspect, and the other is the network architecture. at inference time, there’s no weird surrogate reference retrieval or any related loss functions, and the proposed LSTransformer simply becomes a different parametrization of the so-called deliberation network (https://www.microsoft.com/en-us/research/publication/deliberation-networks-sequence-generation-beyond-one-pass-decoding/) or generative NMT (https://papers.nips.cc/paper/7409-generative-neural-machine-translation.pdf). that is, the proposed LSTransformer defines a distribution over the target sentence space given a pair of an imperfect reference sentence and the source sentence. This is clearly unlike the vanilla Transformer which maps from the source sentence alone to the target sentence. Then, a natural question is not whether the proposed LSTransformer works better than the vanilla Transformer but whether this particular parametrization is more beneficial than other approaches to parametrizing such a “refinement” distribution. for instance, if the authors train the network without the surrogate references but simply by feeding the concatenation of the source sentence and the first translation (that is, the translation from the same network with two copies of source sentences provided), would it work worse than the LSTransformer?

Of course, once such an experimental setting is set up, the authors can finally ask the questions whether those style tokens introduced by the authors are indeed capturing styles and how this aspect of capturing style helps translation. unfortunately, due to the lack of the definition of styles and also due to the lack of proper points of comparison, these questions are only touched upon without clear answers.

I do believe there are interesting findings and insights within this manuscript. It is just that the current version of the manuscript does not reveal what those are to the degree that warrants its publication. it’s possible that my suggestion above, when followed, might reveal results that do not align with what the authors have expected/wanted, but i trust the findings and insights revealed from the authors’ efforts will be greatly appreciated by the community.

P.S. yeah.. the authors’ tSNE visualizations are pretty meaningless. Fig. 2 is totally meaningless, as the authors have realized themselves (see footnote 13.) what i see from Fig. 2 is that the style token embeddings are not doing anything, and it’s the balance loss that simply makes the style embeddings orthogonal. after all, it’s VERY easy to have 10 orthogonal vectors in a 512-d space. Fig. 3 doesn’t really encode anything. sadly it shows that the style encoding is 1-dimensional (not even 2-dimensional.) it actually implies that my suspicion above about the issue of surrogate references selected from randomly constructed minibatches might be correct.
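a quick sketch of why near-orthogonality is cheap in high dimensions: even 10 random, untrained unit vectors in a 512-d space already have small pairwise cosine similarities. the vectors here are random gaussian draws with a fixed seed, not the authors' style embeddings, and the function names are made up.

```python
import math
import random


def random_unit_vectors(n_vectors, dim, rng):
    """Draw gaussian vectors and normalize each to unit length."""
    vectors = []
    for _ in range(n_vectors):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        vectors.append([x / norm for x in v])
    return vectors


def max_abs_cosine(vectors):
    """Largest absolute cosine similarity over all distinct pairs."""
    worst = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            cos = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            worst = max(worst, abs(cos))
    return worst


style_like = random_unit_vectors(10, 512, random.Random(0))
print(max_abs_cosine(style_like))
```

a typical pairwise cosine for such vectors is on the order of 1/sqrt(512), so an orthogonality-looking plot on its own says little about whether the embeddings learned anything.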

This submission consists of two almost independent contributions. The first contribution is a procedure to create a new challenge set for NLI classifiers based on the idea of monotonicity reasoning in NLI, and the second contribution is an algorithm purported to reveal the modular internal structure of a neural net based NLI classifier. I found the first part to consist of interesting findings and perhaps a bit of insight, while the second part to be confusing with frequent self-contradictory statements and perhaps missing answers to many obvious questions.

I’ll go through the paper section-by-section below and leave some major comments first:

Sec. 2: all the neural net (deep learning) based NLP references go from 2018 onward. I find it difficult to believe that this should be the case. For instance, the authors fail to cite <Intriguing properties of neural networks> by Szegedy et al. from 2014 in which the name “adversarial examples” was coined. They also fail to cite <Does string-based neural MT learn source syntax?> by Shi et al. from 2016 which used logistic regression to check whether syntactic labels could be predicted from a neural net hidden state. Let me suggest a bit more literature review.

Sec. 5.1: this is a pattern i have observed over and over where the authors state something that is either wrong or at best controversial and correct themselves immediately in the same paragraph or in the section. This is simply confusing. For instance, in this section the authors start by “use MoNLI as an adversarial test dataset” and say in the same paragraph “is not especially adversarial”. Indeed, I would not call the proposed dataset adversarial, since it’s not adversarial to any particular model or family and was constructed on its own. I believe a better name would be a challenge set, but any name that is not confusing (even within the authors themselves) would be better.

Sec. 5.2: the observation here is quite interesting in that the models simply fail to flip the labels of these downward monotonically transformed examples, which suggests at least to me that these models are highly insensitive to functional words. I believe this is highly related to the investigation from Gururangan et al. 2018 (https://arxiv.org/abs/1803.02324, which is missing from the references) and also with others’ investigation of NLI models and data earlier. How does your observation agree/disagree with their earlier conclusions? This must be discussed, as it is not the first time the community has learned that NLI models have particular weaknesses (and sometimes surprising strengths.)

Sec. 5.3: the final paragraph is confusing, because the first sentence ends saying “this is a failing of the data rather than the models”, while I could not tell why this is so. Then, the authors jumped to another speculation that is not necessarily supported by (nor was expected to be supported by) the experiments in this section: “models can solve a systematic generalization task … only … if they implement a modular representation of lexical entailment relations.“ In reality, all i saw from this section is that the models trained on SNLI work horribly on NMoNLI (they did amazingly on PMoNLI), and that it is a question whether this is due to the models themselves or due to the data on which they were trained. There was no evidence supporting either of these.

Sec. 6: this section starts with the authors declaring that “we intuitively believe any model that can generalize from the training set to the test set will implement a modular representation of lexical entailment.” Unfortunately my intuition does not agree with the authors, or simply i do not have any intuition on this. Perhaps the main cause of this discrepancy may be that the authors have not defined the “modular representation” (or “modular internal structure”) in the context of neural nets that are being tested in this paper. It is possible that I may be a bit of an outsider in these studies on systematic generalization, but it would be good to have it defined clearly somewhere so that the reader can readily go see the definition and understand why such would be intuitive (or counter-intuitive.)

Figure 2: I need to insist that each experiment in these plots be run multiple times by varying random seeds (which would impact the order of training example presentation, etc.) It looks to me as if there are a few outliers that are more likely statistical flukes than true trends, such as NMoNLI Test with 300 examples and BERT, SNLI/NMoNLI test with 800 examples and ESIM, and SNLI/NMoNLI tests with up to 200 examples and DECOMP. I don’t believe the authors’ conclusions from these plots will change much, but these outlier points make it difficult to trust the overall trends.

The plot titles in Fig. 2 are confusing, because these are models finetuned with inoculation not models trained solely on SNLI.

Sec. 6.4: The main point “every model was able to solve our generalization task” doesn’t seem to hold for ESIM, as it barely solved the challenge test set except for one particular case when 800 examples were used for training.

Sec. 7: unfortunately this section, which is supposed to describe one of the two main contributions, has quite a few issues that have ultimately convinced me not to recommend this manuscript to be accepted. I’ll go over why below.

Sec. 7.1: this section is quite difficult to follow, because there’s quite a bit of discussion at the beginning that requires the knowledge of the algorithm Infer, but this algorithm is only explained in the final paragraph of the section.

Sec. 7.2: the section starts with the statement “BERT implements a modular representation of lexical entailment if there is a map M from MoNLI examples to model-internal vectors in BERT such that the model internal-vectors satisfy the counterfactual claims ascribed to the variable lexrel.” There are two major issues with this statement. First, I just don’t see in this manuscript why this conditional holds. This is probably because there has not been a clear definition of modular representation of lexical entailment in the context of BERT or any other neural net. Second, what is this “model internal vector”? If i simply concatenate all the vectors present inside BERT and flatten into a single vector, would that correspond to a model internal vector? If so, does it mean that any BERT that satisfies this condition implements modular representation of lexical entailment? It looks like this makes this statement not a conditional but a definition, in which case it’s a bit of a moot point to state so.

Sec. 7.3-4: Instead of Infer (which requires a two-line equation as its definition,) the procedure in this section requires an algorithm box with much more careful description, because i was totally lost following the proposed algorithm and experimental procedure. For instance, what do the authors mean by “every example is mapped to a vector at the same location”? What actually happens when the authors say “we randomly conducted interchange experiments to partially construct each of the 36 graphs”? What were the random variables here? Why are some output-unchanging edges necessarily non-causal?

Sec. 7.5: i find the random edge graph to be quite uninteresting as a baseline, as it is not clear what 50% of having an edge means and whether it is a reasonable baseline. For instance, if we use ESIM and DECOMP from Fig. 2, what kind of numbers would we get? Will ESIM be worse than BERT, because they are less “modular”? What if ESIM was trained only on SNLI? Will it be also worse than ESIM finetuned with inoculation, because it generalizes worse to MoNLI? These are much better baselines to compare against, and without these it’s difficult to put the sizes of the cliques obtained from BERT finetuned with inoculation in context. I’m sure the random graph serves as a lower-bound, but it looks too loose to be informative at all.

Sec. 7.6: The authors conclude that “this is conclusive evidence that … BERT implements a modular representation of the lexical entailment relations between substituted words”. I cannot agree with this because of the reasons provided by the authors themselves in the third paragraph. With all these caveats in the proposed algorithm, what is the right way to draw a conclusion? Do we know that these issues are not significant? If so, how do we know so? Perhaps, the biggest issue again is that it’s unclear what modular representation of lexical entailment means in the context of BERT, and because of that, we cannot really tell whether this proposed procedure indeed captures such a notion.

Sec. 7.7: According to the probing experiments, the authors demonstrate that the first and third layers are equivalent in terms of their linguistic and control tasks, and then the authors continue to conclude that “probes cannot diagnose whether a given representation has a causal impact on a model’s output behaviour.” But, does this imply anything about the authors’ approach? How do we know that this probing is any worse than the authors’ approach, other than through the procedure described above with a lot of approximations that the authors themselves warn the reader about?

This concludes my review of this submission. My suggestion to the authors is to focus on the MoNLI data as a new challenge set for NLI classifiers and to carefully analyze what this challenge set reveals about the existing NLI classifiers (it’ll be even better if the authors could identify and fix the identified weaknesses.) In this case, it’s probably fine to drop any discussion and claim on “modular internal structure” of these neural nets.

If the authors feel they need to keep the second contribution, I suggest they significantly revise these sections: first, describe the algorithm more clearly; second, discuss the various approximations that were made to cope with the intractability and demonstrate that those approximations are reasonable; third, run the algorithm on multiple models and more informative baselines; and fourth (and perhaps most importantly), demonstrate convincingly that these models do indeed exhibit a carefully defined notion of modular internal structure and that the proposed metric does compute the degree of modularity of that internal structure.

in this paper, the authors test a series of modifications to the now-standard transformer, including gated self-attention, convolution as self-attention, attention with a fixed span and attention with a learned span, on SCAN, which was manually constructed to test the ability of a sequence-to-sequence model to capture compositionality. They demonstrate that their observations on the impact of these modifications on SCAN are not indicative of their impacts on more realistic problems, such as machine translation.

the biggest issue i see with this manuscript is the main motivation behind this investigation, that is, “it is not clear if SCAN should be used as a guidance when building new models.” if i did not misunderstand [Bastings et al., 2018] to which a whole paragraph was dedicated in S8 Related Work, Bastings et al. [2018] already demonstrated that SCAN is not a realistic benchmark, and that the improvement in SCAN in fact negatively correlates with the improvement in MT (i just opened Bastings et al. [2018] to see if i recalled incorrectly, but it seems to be the case.) in fact, Bastings et al. [2018] suggest a reason why SCAN is not realistic: “any algorithm with strong long-term dependency modeling capabilities can be detrimental.” in other words, i almost feel like it has been clear for quite some time that SCAN should not be used (on its own) for guiding any development for real-world problems.

of course, it’s a good idea to (1) renew/reconfirm the earlier finding based on a more modern practice, such as using transformers as opposed to using LSTM/GRU and (2) investigate aspects that were not investigated earlier, such as network architectures rather than parametrization of the decoder (autoregressive,) but i believe it’s important for such an effort to be framed and discussed as an update/complement to the existing knowledge rather than as an attempt to establish a standalone finding. along this line, i believe it would’ve been more interesting if the manuscript could tell how transformers changed the earlier conclusion on the capability of these neural sequence models on SCAN and its consequences.

a second issue is that it’s very difficult for me to understand why these four variants of attention are relevant to this investigation: why should i be curious about these four particular versions of attention in knowing the relationship between the performance on SCAN and the performance on MT? had the authors chosen another set of architectural modifications (e.g., perhaps change the number of feedforward layers within each transformer block, perhaps change the softmax to sigmoid or any other monotonic transformation for attention weight, perhaps change the final softmax to the mixture of softmaxes, etc.) would they have arrived at a different conclusion? because of this degree of freedom, i believe it is important to start with a statement on what the authors believe is an important axis and why so, before drawing any empirical conclusion. the best i could read from the manuscript is that the authors “start from an observation that convolution-based seq2seq models (ConvS2S) (Gehring et al., 2017) perform very well on it (Dessi and Baroni, 2019).” this is not a convincing reason why we want to test those four variants of attention (it does explain why we want to test replacing attention with convolution, albeit weakly.)

finally, it is unclear whether the observation that attention with a learned span helps on the compositional en-fr task had to be drawn from this investigation. yes, the authors did arrive at this conclusion in this manuscript, but it looks like this could’ve been a completely separate investigation, perhaps motivated better by stating that “it is not clear if” the existing transformers can correctly bridge the difference in the compositional rules between source and target languages. i believe their data will be useful in the future for evaluating this particular aspect of a machine translation system. unfortunately, this particular data on its own adds only little to the main investigation in this manuscript. perhaps, as the authors stated, this will be a part of a more extensive and more useful benchmark in the future when “more realistic SCAN-like datasets based on natural language data” are created.

Unfortunately i have some issues with the authors’ choices of algorithms and how they use them.

first, few-shot learning algorithms are designed to work best for scenarios in which examples must be classified into “novel” classes that were not present during training time, which is not the case for the problem in this paper. one could argue that many of these few-shot learning algorithms are variants of nearest-neighbour classifiers, and that they tend to work better for rare classes because of their non-parametric nature. this is however not what the authors claim nor argue. what the authors should’ve done, and should do in a future iteration of the manuscript, is to modify e.g. the prototypical network so that it drops the few-shot constraint and uses all the training instances (or a subset for computational efficiency).
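to make this suggestion concrete, here is a minimal sketch of a prototype-based classifier without the few-shot constraint: each class prototype is the average of all the training embeddings of that class, and a query is assigned to the nearest prototype. this is my own illustration and not the authors’ implementation; the function names (`class_prototypes`, `predict`), the toy two-dimensional embeddings and the choice of euclidean distance are all assumptions, and in practice the embeddings would come from a trained encoder.

```python
from collections import defaultdict
from math import dist


def class_prototypes(embeddings, labels):
    """Average ALL training embeddings of each class into one prototype."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(embeddings, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def predict(prototypes, query):
    """Return the label whose prototype is closest in Euclidean distance."""
    return min(prototypes, key=lambda label: dist(prototypes[label], query))


# toy, made-up embeddings: a common class with three examples, a rare one with one
train_x = [[0.0, 0.0], [0.2, 0.0], [0.1, 0.3], [2.0, 2.0]]
train_y = ["common", "common", "common", "rare"]

prototypes = class_prototypes(train_x, train_y)
print(predict(prototypes, [1.8, 1.9]))
```

note that the number of examples per class only affects how well each prototype is estimated; it does not enter the decision rule, which is one way to read the non-parametric argument above.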

second, the authors claim and demonstrate the effectiveness of these class reweighting approaches, which I find hard to believe not due to the construction of those algorithms but due to the evaluation metric the authors have chosen to work with. when a neural net classifier, or even logistic regression classifier, is trained, it is trained to capture p(y|x), which is proportional to the product of the class-conditional likelihood p(x|y) and the class prior p(y). the latter is often the reason it looks like a trained classifier prefers more common classes when we simply look at the top-1 predicted class. an interesting consequence of this observation is that reweighting based on the class proportion (which is mainly what the authors have tried either via actual reweighting or resampling) only changes p(y) and does not impact p(x|y). that is, if you estimate p(y) from data and divide the neural net’s prediction p(y|x) by it, the effect of class imbalance largely disappears (of course, up to miscalibration of neural net predictive distributions.)
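here is a minimal sketch of this prior correction. it estimates p(y) from the training label frequencies, divides a predicted p(y|x) by it and renormalizes, so that the result is proportional to p(x|y). the function names (`estimate_prior`, `remove_prior`), the 90/10 class split and the predicted probabilities are made up for illustration, and the sketch assumes the classifier’s predictive distribution is reasonably calibrated.

```python
def estimate_prior(labels):
    """Estimate the class prior p(y) from training label frequencies."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    n = len(labels)
    return {y: c / n for y, c in counts.items()}


def remove_prior(posterior, prior):
    """Divide p(y|x) by p(y) and renormalize; the result is proportional to p(x|y)."""
    scores = {y: posterior[y] / prior[y] for y in posterior}
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}


# toy, made-up setup: 90% of the training labels are "common", 10% are "rare"
train_labels = ["common"] * 90 + ["rare"] * 10
prior = estimate_prior(train_labels)

# a hypothetical prediction p(y|x) for a single test example
posterior = {"common": 0.7, "rare": 0.3}
corrected = remove_prior(posterior, prior)

# the top-1 prediction flips from "common" to "rare" once the prior is removed
print(max(posterior, key=posterior.get), max(corrected, key=corrected.get))
```

in this toy case the classifier assigns “rare” three times its prior probability, so removing the prior makes it the top-1 class even though no reweighting or resampling was done during training.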

lastly, i’m not entirely sure whether it’s a good idea to frame this problem as (single-label) classification. instead, i believe this problem should ideally be framed as multi-label classification in which each condition is predicted to be present (positive) or not. this is arguably a significantly more minor point than the issues above.

with all these issues, it’s difficult for me to see what i should get out of reading this manuscript. it’s not surprising that existing few-shot learning algorithms do not work well, because the target problem was not a conventional few-shot learning problem. it’s perhaps not surprising that the baseline seems to work better for more common classes but not for rare classes, because there was no description (which implies no effort) of recalibrating the predictive distribution to remove the class prior.

since all the algorithms have been implemented, i believe a bit of effort in re-designing the experiments and tweaking the algorithms would make the manuscript much stronger.