on this campaign’s page, they cite a news piece from SBS that surveyed 21 young people about their situations, to illustrate how the starting points for young people in Korean society vary dramatically across individuals, despite our illusion of fair and equal treatment. it’s nothing rigorous and quite anecdotal, but thought-provoking, as it starkly “shows” these differences: https://www.youtube.com/watch?v=AaLZ3bmCb_k. the participants were asked 56 questions, and the campaign page listed a few of them (some are pretty specific to Korea, i must say):

- if you have had to move every 1-2 years, take a step back. 어쩔 수 없이 1,2년 단위로 집을 옮겨야 한다면 / 옮겨 다니고 있다면 한 발 뒤로
- if you are not covered by the four major social insurance programs, take a step back. 4대 보험을 받지 못한다면 한 발 뒤로
- if you frequently have to explain your family structure or lifestyle choices to others, take a step back. 내가 취하고 있는 가족 구성원 형태 또는 삶의 형태에 대해 사람들에게 종종 설명을 해야 한다면 한발 뒤로
- if you’ve ever missed paying utility bills because you were short on money, take a step back. 돈이 부족해서 공과금을 연체해 본 적이 있다면 한 발 뒤로
- if you had to take a leave of absence from school to earn money for tuition, take a step back. 등록금 때문에 휴학하고 돈을 벌어야 했다면 한 발 뒤로
- if you can use mom’s or dad’s credit card whenever you need to, take a step forward. 필요할 때 언제든 엄카, 아카를 쓸 수 있다면 한 발 앞으로
- if you have had to prove your disability or income to receive financial aid, take a step back. 경제적 지원을 받기 위해 장애나 소득을 증명한 적이 있다면 한 발 뒤로
- if you received private tutoring during your school years, take a step forward. 학창 시절 과외를 받아본 적이 있다면 한 발 앞으로
- if you could read as many books as you wanted when you were younger, take a step forward. 어렸을 때 원하는 책을 마음껏 읽을 수 있었다면 한 발 앞으로
- if you can order whatever delivery food you want at any hour when you’re home alone, take a step forward. 혼자 있을 때 어느 시간 때고 마음 놓고 배달음식을 시켜 먹을 수 있다면 한 발 앞으로

and, you know what? when i asked myself these questions, i never took a step back and was always taking steps forward.

according to the campaign’s homepage, children graduating from these group homes when they turn 18 are provided with a one-time support of \$4,000 or so (5M KRW) and a monthly support of \$250 or so (300K KRW). for those who decide to continue their studies in college, this has never been enough. it has become even more of an issue during the pandemic, as our educational system began to ask students for even more just to participate: they need good broadband to join remote lectures, a quiet place to attend them without distraction, and a decent laptop to participate, download the necessary materials and submit their assignments.

so, i wanted to donate a bit to this campaign, but it turned out this was done via Kakao’s platform and required a Kakao account, which i don’t have. and, yes, i know the pain of creating an account on a Korean website, especially if i want to connect it with my credit card. so, i gave up on donating via this specific campaign and instead emailed them directly to set up a quick phone call.

they were super quick in giving me a call the same day and gave a quick walkthrough of their programs. by the end of this short call, i had already promised to donate approximately \$27,000 (30M KRW) for their operations. it’s not a lot of money, but i hope it can buy a few more laptops to support these kids and also raise awareness of this issue, which is largely hidden. hopefully this little gesture of mine helps these students, even a tiny bit, take a smaller step back than before.

because i’m generally a show-off, i had to write this blog post to show off this little donation, but there are those who are truly contributing to making the world better. in particular, the Center’s various programs are run by its staff members as well as many activists and volunteers (some of whom are from these group homes themselves). i’ve been reading and watching some of the materials on their homepage, and i could not have been more impressed and moved by them. there are also many regular donors to this Center (http://jaripcare.com/bbs/board.php?bo_table=support) who are really making a difference, unlike a one-time donor like me who shows up, boasts and disappears. a huge thanks to all these people who are making sure fewer people take fewer steps back in society.

would you join me in supporting these kids to take a step forward instead of back?

the proposition party consisted of Sella Nevo, Maya R. Gupta and François Charton. Been Kim was unfortunately unable to participate, although she would’ve been a great addition to the proposition party. the proposition party argued that progress toward achieving AI will be mostly driven by engineering, not science.

the opposition party (i guess … my party) consisted of Ida Momennejad, Pulkit Agrawal, Sujoy Ganguly and yours truly. the opposition party (perhaps obviously) opposed the proposition’s stance and argued that progress toward achieving AI will be mostly driven by science, not engineering.

if you’re registered at ICML 2022, you can watch the recording of the debate at https://icml.cc/virtual/2022/social/20780. i don’t know if this will be released publicly when the conference is over, but i will update it here if and when that happens.

the debate was fun and full of many interesting and thought-provoking ideas and points. i won’t try to summarize them here, as that would require a huge amount of effort, and i shouldn’t have had that much beer over the past 4 days …

instead, i’ll share my opening statement here. a distinct advantage i had as the opposition leader was that i could prepare my statement in advance, and now i can share it here. my main goal was to leave enough room for the other members of the party to delve deeper into their own views/expertise and to expand on various aspects to address the proposition’s follow-up arguments.

here you go!

The opposition believes that progress toward achieving AI will be mostly driven by science not engineering.

Recent progress in large-scale models, such as language models and language-conditional image generation models, easily gives the impression that what we see as impressive is largely the product of impressive engineering that has allowed us to effectively and efficiently scale up our systems. This impression is not what we oppose here.

Such impressive progress, however, has begun to give off the incorrect impression that such a stellar level of engineering is what drives (if not the only way to drive) progress in AI research toward building a truly intelligent system. This impression is what we oppose here.

Instead of arguing how engineering alone would not be enough for future progress toward achieving AI, I’d like to focus on more concrete examples of how engineering alone has not been enough to arrive at even the current state of AI, which I believe most of us agree is not at all close to the ultimate goal of truly intelligent machines.

As the first and perhaps most salient example today, I would like to talk about these super-impressive large-scale language models, represented by GPT-3 and many even more impressive follow-up models such as PaLM, BLOOM, etc. Despite their differences, there are a few core concepts shared by all these models that are critical to their existence.

First, they all rely heavily on the concept of maximum likelihood with autoregressive modeling. Together, these two concepts amount to building a classifier that predicts the next token given all the preceding tokens (words in many cases, but the details do not matter much). Doing so corresponds to estimating an upper bound on the true entropy of the distribution underlying a gigantic amount of text.

By building a machine that predicts the next word correctly, taking into account both short- and long-term dependencies (contrary to what many critics say), we approximate the text/language distribution very well and can sample/generate extremely well-formed text and images from these distributions.
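To make next-word prediction concrete, here is a minimal toy sketch of my own (a smoothed bigram model on a made-up corpus, nothing like a production LM): the average next-token negative log-likelihood it computes is the empirical cross-entropy, which in expectation upper-bounds the entropy of the underlying distribution.

```python
import math
from collections import defaultdict

# toy corpus; any real LM would use vastly more data and a richer model
corpus = "the cat sat on the mat . the cat ate the fish .".split()
vocab = set(corpus)

# autoregressive model of order 1: P(w_t | w_{t-1}) from bigram counts
counts = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_token_prob(prev, nxt, alpha=1.0):
    """Add-alpha smoothed next-token probability."""
    total = sum(counts[prev].values()) + alpha * len(vocab)
    return (counts[prev][nxt] + alpha) / total

# average next-token NLL = empirical cross-entropy (nats per token);
# in expectation this upper-bounds the true entropy rate of the source
nll = -sum(
    math.log(next_token_prob(p, n)) for p, n in zip(corpus, corpus[1:])
) / (len(corpus) - 1)
```

The same quantity, computed with a transformer instead of bigram counts and averaged over trillions of tokens, is exactly what these large-scale language models minimize.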

Where did this idea come from? Has it benefited from superb engineering? Yes, superb engineering, in both software and hardware, has dramatically pushed the boundary of this technique, but the birth and full formalization of next-word prediction can be traced all the way back to Claude Shannon’s paper from 1950.

This same idea was revived and pushed dramatically starting in the late ’80s, when folks at IBM, including Peter Brown and Bob Mercer, built the first statistical machine translation system, in which a large-scale (yes, it was already large then!) target-side language model was a critical component.

The very same idea was revived and rejuvenated multiple times even after that: in the late ’90s with Yoshua Bengio’s neural language models, around 2010 with Alex Graves’ and Tomas Mikolov’s recurrent language models, and now with attention-based models.

Better engineering, in terms of better software and better hardware, has indeed pushed the boundary of what we can do with next-word prediction, but the seed of what we see now was already planted by “science” in the ’50s.

Second, I’d like to talk about all the “techniques” or “tricks” that facilitate learning. Although it may look like faster hardware and better software frameworks are the main drivers of recent advances in large-scale language models, it is highly questionable whether we could have trained any reasonable model had we not found a series of techniques that enable us to do so.

For instance, non-saturating nonlinearities, such as rectified linear units, are workhorses of modern neural networks, including large-scale language models. It is only natural to use ReLU or its variants now, but it wasn’t so until around 2010, when two papers, one from U. Toronto and the other from U. Montreal, demonstrated the potential effectiveness of ReLU from two different perspectives. The first, Nair & Hinton, derived the ReLU for restricted Boltzmann machines by viewing it as an approximation to infinitely many replicated binary hidden units that share a weight vector but differ in their biases.
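That view can be checked numerically with a toy sketch of my own (not the paper’s code): the expected total activity of many replicated binary units, whose biases are offset by -0.5, -1.5, -2.5, and so on, closely tracks softplus, log(1 + e^x), which in turn approaches ReLU away from zero.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def replicated_units(x, n=100):
    # expected total activity of n binary units sharing the same input x
    # but with biases offset by -0.5, -1.5, -2.5, ...
    return sum(sigmoid(x - i + 0.5) for i in range(1, n + 1))

for x in [-4.0, -1.0, 0.0, 1.0, 4.0]:
    softplus = math.log1p(math.exp(x))
    relu = max(0.0, x)
    # replicated_units(x) ~= softplus(x) ~= ReLU(x) away from zero
```

In other words, ReLU was not an arbitrary engineering trick but the limiting behavior of a principled probabilistic model.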

Furthermore, the potential of ReLU-like nonlinearities was studied extensively in (computational) neuroscience, which inspired many to consider them in the context of artificial neural network research over many decades.

Would engineering alone have allowed us to jump from the much more widely used sigmoid nonlinearities to ReLU? With exhaustive hyperparameter tuning and an excessive amount of resources, engineering might have ended up with a very particular parameter initialization and a very particular optimization setup that make the sigmoid nonlinearity work, but it is unclear whether that would have happened at all, because the community might already have given up on investing further in this direction.

Of course, the last example I want to bring up, reflecting a bit of my personal preference, is shortcut connections. Shortcut connections, which include residual connections as well as the gated connections in LSTMs and GRUs, are what we, the research community, spent decades coming up with in order to address the issue of vanishing gradients, or long-range credit assignment. It started with mathematical analyses by Sepp Hochreiter and Yoshua Bengio in the early ’90s, followed by further empirical analysis by many people and by various proposals, such as leaky units, some of which were successful and others less so.

Eventually, this was identified as a way to propagate gradients properly across many nonlinear layers of both recurrent and feedforward networks, as is evident from the near-universal presence of residual blocks or connections in modern neural networks, including large-scale language models built as transformers.
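To make the vanishing-gradient point concrete, here is a tiny scalar sketch of my own (no training, just the chain rule): each layer of a plain sigmoid chain multiplies the gradient by sigmoid'(h) <= 0.25, so it shrinks geometrically with depth, while a shortcut connection keeps the per-layer factor at 1 + sigmoid'(h) >= 1.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

depth = 20
h_plain = h_short = 0.0
g_plain = g_short = 1.0  # accumulated d h_L / d h_0 via the chain rule

for _ in range(depth):
    # plain chain: h <- sigmoid(h); per-layer gradient factor <= 0.25
    s = sigmoid(h_plain)
    g_plain *= s * (1.0 - s)
    h_plain = s

    # shortcut chain: h <- h + sigmoid(h); the identity path keeps the
    # per-layer factor at 1 + sigmoid'(h) >= 1
    s = sigmoid(h_short)
    g_short *= 1.0 + s * (1.0 - s)
    h_short = h_short + s
```

After only 20 layers, the plain chain’s gradient is vanishingly small while the shortcut chain’s gradient stays at or above 1, which is precisely why the identity path matters.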

However small they may seem, we could get to this point only because of all these science- (or perhaps mathematics-) driven innovations. More properly, I would say that it was science that put us on this path so that engineering could push us forward along it.

It may not look like this will happen anytime soon, but I can assure you that very soon the bandwagon driven by engineering along this path laid out by science will find itself at the next crossroads. Engineering won’t tell us which road to take next; it will be science that tells us which path we can and should take to move closer to AI.

First, go to your assigned submission. Here, I’m using an already-accepted paper at TMLR. On the submission page, you will see a “Show Revisions” button below the title:

If you click “Show Revisions”, you are directed to a page showing the “Revision History”. The revision history includes not only the changes to the pdf file but also any changes made to the metadata.

If you want to compare two versions from the revision history, click the “Compare Revisions” button in the top right corner of this page. Then you will be able to choose two different versions from the revision history for comparison. As an example, we choose the camera-ready version and the initial submission in this case:

Scroll all the way up and click “View Differences” in the top right corner. This leads you to the “Revision Comparison” page, where the difference in the metadata shows up first:

If both revisions contain PDF files, at the bottom of the “Revision Comparison” page there will be a “Document comparison” that highlights the differences between the two versions of the pdf:

Happy reviewing!

when you go to https://openreview.net/, you see “TMLR” as one of the active venues, as shown in the screenshot below. if not, you can go directly to the TMLR page at https://openreview.net/group?id=TMLR.

when you log in to Openreview at TMLR, you will see a link to your own console. if you’re a reviewer of TMLR, you’ll see a link to the “Reviewer Console”. if you’re an action editor of TMLR, you’ll see a link to the “Action Editor Console”. if you don’t see this link on the page, please click the link here directly to see if you can access it.

in the respective console, right before the list of your assignments, there’s an “Assignment Availability” box you can use to set your availability. here are two screenshots below:

by default, your availability is set to “Available”. this feature was implemented to allow reviewers and action editors to proactively set their availability, e.g., during their vacations.

if you plan to go on summer vacation, please visit Openreview and set your availability to “Unavailable”. but perhaps much more importantly, do not forget to set your availability to “Available” when you’re back. TMLR in its infancy needs all your help and support!

this learning-to-review-by-reading-one’s-own-reviews strategy has some downsides. a major one is that people are often left with a bitter taste after reading reviews of their own work, because reviewers need to (and often are instructed to) point out both the up- and downsides of any submission under review. it is this list of downsides that leaves a bitter taste, and these authors end up being overly critical of others’ work when they start reviewing.

perhaps a reasonably easy first-step fix would be to expose new and prospective reviewers to reviews of third-party papers: neither their own reviews nor reviews of their own papers. the openreview movement (i’m calling it a movement rather than Openreview itself, as Openreview does support closed-door reviewing, which is increasingly being adopted, such as by NeurIPS) enables this, although it is rarer than it should be, in my opinion, and is highly concentrated in a small number of areas.

so, i thought i’d start by sharing a random sample of the reviews i’ve written in the past year or so. i understand that some authors may recognize these reviews as being of their own papers, which were either accepted or rejected. i hope they understand that i didn’t know their identities (i truly rarely do …) and just did my job as well as i could. i spent approximately 1-5 hours reviewing each 8-to-12-page submission.

by the way, i’m in no way saying that these are *good* reviews. reading these reviews again myself, i’m realizing how bad i am at reviewing, and that i too would’ve benefited a lot from learning to review. i mean … my … i do ask authors to cite my work a lot …

Let me start this review by saying that I like the authors’ idea and their motivation behind the proposal to modify the (self-)attention mechanism to cope with long sequences better via a computationally more efficient relative positional embedding. there are however three points i’d like to request the authors to address, after which i will strongly advocate for the manuscript’s acceptance. i’ll describe these issues after giving a quick summary of the authors’ contribution in the submission, from my own perspective (the authors are more than welcome to incorporate any of this in future revisions, if needed.)

in the multi-headed attention mechanism, there are two major components: one extracts multiple values from each input vector (V), and the other computes the attention weights for each input vector per head (Q & K). because we often use softmax (or, as the authors refer to it, an L1-normalized exp), the attention weight from each head (per input vector) tends to focus on another vector that is a particular distance away from the input vector. the authors cleverly exploit this phenomenon by computing the attention weights once using an L2-normalized sigmoid (to encourage more evenly spread-out attention weights) and selecting a subset of these spread-out weights using location-sensitive (but value-agnostic) masks (N) to form the attention weights for each head. to ensure that each such subset is computed aware of the associated attention head (or location), they further add a single bias vector (a weighted sum of as many bias vectors as there are heads) to each vector when computing the attention weights. this is a clever approach that is well-motivated and well-executed. that said, the current manuscript is not without issues, which i will describe below.
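to make sure we are on the same page, here is my rough numpy sketch of this mechanism (the shapes, the crude distance-based masks and all names are my own invention for illustration, not the authors’ implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, H = 6, 8, 2  # sequence length, model dim, heads (invented sizes)

X = rng.normal(size=(T, d))
Wq, Wk = rng.normal(size=(d, d)), rng.normal(size=(d, d))

# one shared set of attention weights: sigmoid instead of exp, then
# L2 normalization per query, encouraging more spread-out weights
scores = 1.0 / (1.0 + np.exp(-((X @ Wq) @ (X @ Wk).T) / np.sqrt(d)))
A = scores / np.linalg.norm(scores, axis=-1, keepdims=True)

# location-sensitive, value-agnostic masks carve head-specific subsets
# out of the shared weights (here: a crude near/far split on |i - j|)
dist = np.abs(np.arange(T)[:, None] - np.arange(T)[None, :])
masks = np.stack([dist <= 2, dist > 2])  # one 0/1 mask per head
per_head = masks * A                     # (H, T, T) head-specific weights
```

the point being that only one (T, T) set of weights is ever computed, and the heads differ purely through the positional masks.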

first, one of the major motivations from the authors’ perspective is that this approach is better than existing relative positional embedding or relative attention bias approaches because it does not “change the memory configuration of these tensors in the accelerator in a less optimized manner”. i can roughly see their argument and why this may be the case, but as this is one of the major motivations, the authors need to explain it much more carefully. for instance, i’d suggest the authors set aside an entire section contrasting what kind of “memory rearrangement” is needed for the three approaches (RPE, RAB and Shatter) and discussing how Shatter is more efficient on which accelerators. of course, another way would be to demote this from a major motivation (it could be mentioned as a nice side-effect) and to refocus on other aspects of efficiency, such as fewer parameters and fewer 3-D tensors to maintain.

second, the authors emphasize that the proposed approach has fewer parameters to tune compared to e.g. XLNet and other relative positional embedding approaches. this is convincing from one angle: the authors’ clever strategy of sharing the attention weight computation across multiple heads does reduce the number of parameters involved in computing the attention weights for multiple heads.* it is however not convincing from another angle, where the focus is on the lack of parameters in the relative positional embedding (N). the issue is not whether N does or does not have trainable parameters, but that the choice of how to construct N is quite important, as the authors point out themselves in A.1: “the pretraining loss and finetuned performance are sensitive to the choice of the partition of unity”. it is almost as if the authors tuned the parameters behind N themselves, which could be done for other approaches as well (see, e.g., https://arxiv.org/abs/2108.12409, where the bias is a linear function of the distance |i-j| to induce a kind of relative-distance attention.) that said, this is a minor point and can be fixed by rephrasing the text here and there.

third, this is not really an issue with this manuscript specifically but a general issue with most papers that report finetuned accuracies on GLUE tasks, etc. if i understood the authors correctly when they stated “we conduct several runs to show one run better than average (i.e., if the number on some task is worse than average, we will re-run it and show a better one.)” (to be frank, i couldn’t understand what this means at all,) some of the runs with low accuracies are thrown away, and there are multiple runs for each task. unfortunately, i can’t understand the rationale behind throwing away some of those “worse” runs, nor why there is only one number (accuracy) per task/configuration pair in all the tables. i totally understand that the authors want a “fair comparison” (though it’s unclear in which respect..) but this simply obscures how well the proposed approach works. to this end, i request the authors report either the mean and std. dev. (if there were enough runs) or max/med/min (if there were only a few runs) for each setup, without throwing away any of the runs they’ve run themselves. they can always separately report a single accuracy for each task-method pair following what others have done.

(*) by the way, i do not believe it is correct to call Shatter “single-headed self-attention”, since it does produce vectors from multiple heads. it’s only that some parts of the attention weight computation are cleverly shared. i’d suggest the authors refrain from saying so.

In this manuscript, the authors propose a variant of the transformer (or, as a matter of fact, of any neural machine translation system) that is claimed to better capture styles of translation. during training, the proposed model, called the LSTransformer, finds the most similar reference sentence within a minibatch and uses this surrogate reference to compute a weighted sum of the latent style token embeddings. this style code is then appended to the source token embeddings before the source sentence is fed to the transformer for translation. at translation time, it appears (i say so, because it wasn’t specified explicitly) that the LSTransformer treats the entire test set as if it were a training minibatch to find a surrogate reference (based on the source sentence, which is possible because the embedding tables are shared between the encoder and decoder), from which the style embedding is computed and translation is done.

unfortunately, the authors never specify explicitly what they mean by “style”, nor where the proposed LSTransformer finds information about “style”, which makes it pretty much impossible for me to understand what the LSTransformer is supposed to do. this gets worse in the experiments, where the “style” of translation is almost equated with the domain from which test sentences were drawn, which is quite different from what i expect style to be based on the authors’ earlier discussion of formality, etc. independently of my other points below, it will be critical for the authors to restructure the main part of the paper by first clearly defining what they mean by style and how the proposed approach captures it (e.g., why does finding a random reference sentence from a minibatch consisting of i.i.d. samples help the LSTransformer capture a style? what if no reference sentence within the minibatch matches the style of the true reference?) and only then empirically demonstrating the effectiveness of the proposed approach.

there is also an issue in the training procedure that i cannot wrap my head around. a major innovation the authors propose is the so-called surrogate reference, the reference most similar to the true reference within a minibatch. a cnn-based sentence encoder is used to compute the embedding for retrieving this surrogate reference. but if minibatches are truly constructed from sentences selected uniformly at random from the training set, isn’t the choice of surrogate references effectively a choice of any reference from the training set, in expectation? that is, as training continues, the effect of choosing the “most similar” surrogate reference disappears, because every minibatch consists of i.i.d. samples from the training set (or training distribution). in one extreme case, consider minibatches of size 2 and using all possible size-2 subsets of the training set for training: every other training example serves as a surrogate reference. how does this procedure help with capturing “style”?
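the size-2 extreme case is easy to simulate (a toy sketch of my own, with integer indices standing in for sentences): whenever an example appears in a uniformly random size-2 minibatch, its “most similar” surrogate is just a uniform draw over the rest of the training set.

```python
import random
from collections import Counter

random.seed(0)
n, trials = 10, 20000  # toy "training set" of 10 examples
surrogate_of_0 = Counter()

for _ in range(trials):
    batch = random.sample(range(n), 2)    # uniformly random size-2 minibatch
    if 0 in batch:                        # whenever example 0 is in the batch,
        other = batch[1 - batch.index(0)] # its only possible surrogate is the
        surrogate_of_0[other] += 1        # other element: a uniform draw

# every other example serves as example 0's surrogate about equally often
freqs = [surrogate_of_0[i] for i in range(1, n)]
```

so, over the course of training, the surrogate carries no style-specific signal beyond what any random reference would.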

this questionable aspect of the proposed training procedure is only amplified when the inference/generation procedure is considered, because the proposed LSTransformer does not use any references or sources from the same document or domain to capture and use the style, but simply uses the given source sentence. this is a weird setup for translation with style. consider the case where the target language exhibits more fine-grained levels of formality than the source language. how can a model be expected to decide on the formality of the translation by looking only at the source sentence? perhaps the authors had something different in mind when they talked about style of translation, and, as i pointed out earlier, it’ll be immensely helpful if the authors restructure the text by defining style to start with.

Along this line of thought, one notices that the experiments, in particular the comparison to the vanilla Transformer, are actually not too informative. there are two aspects in which the proposed LSTransformer differs from the vanilla Transformer. the first is the training procedure i re-described above, which has some questionable aspects, and the other is the network architecture. at inference time, there is no surrogate reference retrieval nor any related loss function, and the proposed LSTransformer simply becomes a different parametrization of the so-called deliberation network (https://www.microsoft.com/en-us/research/publication/deliberation-networks-sequence-generation-beyond-one-pass-decoding/) or generative NMT (https://papers.nips.cc/paper/7409-generative-neural-machine-translation.pdf). that is, the proposed LSTransformer defines a distribution over the target sentence space given a pair of an imperfect reference sentence and the source sentence. this is clearly unlike the vanilla Transformer, which maps from the source sentence alone to the target sentence. then, the natural question is not whether the proposed LSTransformer works better than the vanilla Transformer but whether this particular parametrization is more beneficial than other ways of parametrizing such a “refinement” distribution. for instance, if the authors train the network without the surrogate references but simply by feeding the concatenation of the source sentence and the first translation (that is, the translation from the same network given two copies of the source sentence), would it work worse than the LSTransformer?

Of course, once such an experimental setting is in place, the authors can finally ask whether the style tokens they introduce indeed capture styles and how this aspect of capturing style helps translation. unfortunately, due to the lack of a definition of style and the lack of proper points of comparison, these questions are only touched upon, without clear answers.

I do believe there are interesting findings and insights within this manuscript. It is just that the current version does not reveal them to a degree that warrants publication. it’s possible that my suggestions above, when followed, might reveal results that do not align with what the authors expected/wanted, but i trust the findings and insights revealed by the authors’ efforts will be greatly appreciated by the community.

P.S. yeah.. the authors’ tSNE visualizations are pretty meaningless. Fig. 2 is totally meaningless, as the authors have realized themselves (see footnote 13.) what i see from Fig. 2 is that the style token embeddings are not doing anything, and it’s the balance loss that simply makes the style embeddings orthogonal. after all, it’s VERY easy to have 10 orthogonal vectors in a 512-d space. Fig. 3 doesn’t really encode anything either. sadly, it shows that the style encoding is 1-dimensional (not even 2-dimensional.) it actually suggests that my suspicion above about surrogate references selected from randomly constructed minibatches might be correct.

This submission consists of two almost independent contributions. The first contribution is a procedure to create a new challenge set for NLI classifiers based on the idea of monotonicity reasoning in NLI, and the second contribution is an algorithm purported to reveal the modular internal structure of a neural net based NLI classifier. I found the first part to contain interesting findings and perhaps a bit of insight, while the second part is confusing, with frequent self-contradictory statements and missing answers to many obvious questions.

I’ll go through the paper section-by-section below and leave some major comments first:

Sec. 2: all the neural net (deep learning) based NLP references go from 2018 onward. I find it difficult to believe that this should be the case. For instance, the authors fail to cite <Intriguing properties of neural networks> by Szegedy et al. from 2014, in which the name “adversarial examples” was coined. They also fail to cite <Does string-based neural MT learn source syntax?> by Shi et al. from 2016, which used logistic regression to check whether syntactic labels could be predicted from a neural net hidden state. Let me suggest a bit more literature review.

Sec. 5.1: this is a pattern i have observed over and over, where the authors state something that is either wrong or at best controversial and then correct themselves immediately in the same paragraph or section. This is simply confusing. For instance, in this section the authors start with “use MoNLI as an adversarial test dataset” and say in the same paragraph that it “is not especially adversarial”. Indeed, I would not call the proposed dataset adversarial, since it is not adversarial to any particular model or family and was constructed on its own. I believe a better name would be a challenge set, but any name that is not confusing (even to the authors themselves) would be better.

Sec. 5.2: the observation here is quite interesting in that the models simply fail to flip the labels of these downward monotonically transformed examples, which suggests, at least to me, that these models are highly insensitive to function words. I believe this is highly related to the investigation by Gururangan et al. 2018 (https://arxiv.org/abs/1803.02324, which is missing from the references) and to others’ earlier investigations of NLI models and data. How does your observation agree/disagree with their earlier conclusions? This must be discussed, as it is not the first time the community has learned that NLI models have particular weaknesses (and sometimes surprising strengths.)

Sec. 5.3: the final paragraph is confusing, because the first sentence ends saying “this is a failing of the data rather than the models”, while I could not tell why this is so. Then the authors jump to another speculation that is not necessarily supported by (nor was expected to be supported by) the experiments in this section: “models can solve a systematic generalization task … only … if they implement a modular representation of lexical entailment relations.“ In reality, all i saw from this section is that the models trained on SNLI work horribly on NMoNLI (they did amazingly on PMoNLI), and that it is an open question whether this is due to the models themselves or the data on which they were trained. There was no evidence supporting either.

Sec. 6: this section starts with the authors declaring that “we intuitively believe any model that can generalize from the training set to the test set will implement a modular representation of lexical entailment.” Unfortunately my intuition does not agree with the authors, or simply i do not have any intuition on this. Perhaps the main cause of this discrepancy may be that the authors have not defined the “modular representation” (or “modular internal structure”) in the context of neural nets that are being tested in this paper. It is possible that I may be a bit of an outsider in these studies on systematic generalization, but it would be good to have it defined clearly somewhere so that the reader can readily go see the definition and understand why such would be intuitive (or counter-intuitive.)

Figure 2: I need to insist that each experiment in these plots be run multiple times with varying random seeds (which would affect the order in which training examples are presented, etc.) It looks to me as if there are a few outliers that are more likely statistical flukes than true trends, such as NMoNLI Test with 300 examples and BERT, SNLI/NMoNLI test with 800 examples and ESIM, and SNLI/NMoNLI tests with up to 200 examples and DECOMP. I don't believe the authors' conclusions from these plots would change much, but these outlier points make it difficult to trust the overall trends.

The plot titles in Fig. 2 are confusing, because these are models finetuned with inoculation not models trained solely on SNLI.

Sec. 6.4: The main point “every model was able to solve our generalization task” doesn’t seem to hold for ESIM, as it barely solved the challenge test set except for one particular case when 800 examples were used for training.

Sec. 7: unfortunately this section, which is supposed to describe one of the two main contributions, has quite a bit of issues that have ultimately convinced me not to recommend this manuscript to be accepted. I’ll go over why below.

Sec. 7.1: this section is quite difficult to follow, because there’s quite a bit of discussion at the beginning that requires the knowledge of the algorithm Infer, but this algorithm is only explained in the final paragraph of the section.

Sec. 7.2: the section starts with the statement “BERT implements a modular representation of lexical entailment if there is a map M from MoNLI examples to model-internal vectors in BERT such that the model-internal vectors satisfy the counterfactual claims ascribed to the variable lexrel.” There are two major issues with this statement. First, I just don't see in this manuscript why this conditional holds. This is probably because there has not been a clear definition of a modular representation of lexical entailment in the context of BERT or any other neural net. Second, what is a “model-internal vector”? If I simply concatenate all the vectors present inside BERT and flatten them into a single vector, would that count as a model-internal vector? If so, does it mean that any BERT satisfying this condition implements a modular representation of lexical entailment? It looks like this makes the statement not a conditional but a definition, in which case it is a bit of a moot point to state it as such.

Sec. 7.3-4: Unlike Infer (which requires only a two-line equation as its definition,) the procedure in this section requires an algorithm box and a much more careful description, because i was totally lost following the proposed algorithm and experimental procedure. For instance, what do the authors mean by “every example is mapped to a vector at the same location”? What actually happens when the authors say “we randomly conducted interchange experiments to partially construct each of the 36 graphs”? What were the random variables here? Why are some output-unchanging edges necessarily non-causal?

Sec. 7.5: i find the random edge graph to be quite uninteresting as a baseline, as it is not clear what a 50% chance of having an edge means and whether it is a reasonable baseline. For instance, if we use ESIM and DECOMP from Fig. 2, what kind of numbers would we get? Would ESIM be worse than BERT, because it is less “modular”? What about ESIM trained only on SNLI? Would it also be worse than ESIM finetuned with inoculation, because it generalizes worse to MoNLI? These are much better baselines to compare against, and without them it is difficult to put the sizes of the cliques obtained from BERT finetuned with inoculation in context. I'm sure the random graph serves as a lower bound, but it looks too loose to be informative at all.

Sec. 7.6: The authors conclude that “this is conclusive evidence that … BERT implements a modular representation of the lexical entailment relations between substituted words”. I cannot agree with this, for the reasons provided by the authors themselves in the third paragraph. With all these caveats in the proposed algorithm, what is the right way to draw a conclusion? Do we know that these issues are not significant? If so, how do we know? Perhaps the biggest issue, again, is that it is unclear what a modular representation of lexical entailment means in the context of BERT; because of that, we cannot really tell whether the proposed procedure indeed captures such a notion.

Sec. 7.7: In the probing experiments, the authors demonstrate that the first and third layers are equivalent in terms of their linguistic and control tasks, and then go on to conclude that “probes cannot diagnose whether a given representation has a causal impact on a model's output behaviour.” But does this imply anything about the authors' own approach? How do we know that probing is any worse than the authors' approach, given that the procedure described above involves many approximations that the authors themselves warn the reader about?

This concludes my review of this submission. My suggestion to the authors is to focus on the MoNLI data as a new challenge set for NLI classifiers and to carefully analyze what this challenge set reveals about existing NLI classifiers (it'll be even better if the authors could identify and fix the revealed weaknesses.) In this case, it is probably fine to drop any discussion of and claims about the “modular internal structure” of these neural nets.

If the authors feel they need to keep the second contribution, I suggest they significantly revise these sections: first, describe the algorithm more clearly; second, discuss the various approximations that were made to cope with the intractability and demonstrate that those approximations are reasonable; third, run the algorithm on multiple models and more informative baselines; and fourth (and perhaps most importantly), demonstrate convincingly that these models do indeed exhibit a carefully defined notion of modular internal structure and that the proposed metric does compute the degree of modularity of internal structure.

in this paper, the authors test a series of modifications to the now-standard transformer, including gated self-attention, convolution as self-attention, attention with a fixed span and attention with a learned span, on SCAN which was manually constructed to test the ability of a sequence-to-sequence model in capturing compositionality. They demonstrate that their observations in the impact of these modifications on SCAN are not indicative of their impacts on more realistic problems, such as machine translation.

the biggest issue i see with this manuscript is the main motivation behind this investigation, that is, “it is not clear if SCAN should be used as a guidance when building new models.” if i have not misunderstood [Bastings et al., 2018], to which a whole paragraph is dedicated in S8 Related Work, Bastings et al. [2018] already demonstrated that SCAN is not a realistic benchmark, and that improvement on SCAN in fact negatively correlates with improvement in MT (i just reopened Bastings et al. [2018] to check my recollection, and this seems to be the case.) in fact, Bastings et al. [2018] suggest a reason why SCAN is not realistic: “any algorithm with strong long-term dependency modeling capabilities can be detrimental.” in other words, i almost feel like it has been clear for quite some time that SCAN should not be used (on its own) for guiding any development for real-world problems.

of course, it's a good idea to (1) renew/reconfirm the earlier finding under a more modern practice, such as using transformers as opposed to LSTM/GRU, and (2) investigate aspects that were not investigated earlier, such as network architectures rather than the parametrization of the decoder (autoregressive,) but i believe it's important for such an effort to be framed and discussed as updating/complementing the existing knowledge rather than as an attempt to establish the finding as a standalone one. along this line, i believe it would've been more interesting if the manuscript could tell how transformers changed the earlier conclusions on the capability of these neural sequence models on SCAN and their consequences.

a second issue is that it's very difficult for me to understand why these four variants of attention are relevant to this investigation: why should i be curious about these four particular versions of attention when asking about the relationship between performance on SCAN and performance on MT? had the authors chosen another set of architectural modifications (e.g., changing the number of feedforward layers within each transformer block, changing the softmax to a sigmoid or any other monotonic transformation for the attention weights, changing the final softmax to a mixture of softmaxes, etc.) would they have arrived at a different conclusion? because of this degree of freedom, i believe it is important to start with a statement on what the authors believe is the important axis and why, before drawing any empirical conclusion. the best i could find in the manuscript is that the authors “start from an observation that convolution-based seq2seq models (ConvS2S) (Gehring et al., 2017) perform very well on it (Dessi and Baroni, 2019).” this is not a convincing reason to test those four variants of attention (it does explain why we want to test replacing attention with convolution, albeit weakly.)

finally, it is unclear whether the observation that attention with a learned span helps on the compositional en-fr task had to be drawn from this investigation. yes, the authors did arrive at this conclusion in this manuscript, but it looks like this could've been a completely separate investigation, perhaps better motivated by stating that “it is not clear if” existing transformers can correctly bridge the difference in compositional rules between source and target languages. i believe their data will be useful in the future for evaluating this particular aspect of a machine translation system. unfortunately, on its own this particular data adds only a little to the main investigation in this manuscript. perhaps, as the authors stated, it will become part of an extensive and more useful benchmark in the future when “more realistic SCAN-like datasets based on natural language data” are created.

Unfortunately i have some issues with the authors’ choices of algorithms and how they use them.

first, few-shot learning algorithms are designed to work best in scenarios where examples must be classified into “novel” classes that were not present at training time, which is not the case for the problem in this paper. one could argue that many of these few-shot learning algorithms are variants of nearest-neighbour classifiers, and that they tend to work better for rare classes because of their non-parametric nature. this is however not what the authors claim or argue. what the authors should've done, and should do in a future iteration of the manuscript, is to modify, e.g., the prototypical network to remove the few-shot constraint by using all the training instances (or a subset, for computational efficiency).

second, the authors claim and demonstrate the effectiveness of these class reweighting approaches, which I find hard to believe, not due to the construction of those algorithms but due to the evaluation metric the authors have chosen to work with. when a neural net classifier, or even a logistic regression classifier, is trained, it is trained to capture p(y|x), which is proportional to the product of the class-conditional likelihood p(x|y) and the class prior p(y). the latter is often the reason a trained classifier appears to prefer more common classes when we simply look at the top-1 predicted class. an interesting consequence of this observation is that reweighting based on the class proportions (which is mainly what the authors have tried, either via actual reweighting or via resampling) only changes p(y) and does not impact p(x|y). that is, if you estimate p(y) from data and divide the neural net's prediction p(y|x) by it, the effect of class imbalance largely disappears (of course, up to miscalibration of neural net predictive distributions.)
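to make the point concrete, here is a minimal numpy sketch (all numbers made up for illustration) of the recalibration described above: estimate p(y) from the training labels, divide the model's predictive distribution by it, and renormalize.

```python
import numpy as np

def remove_class_prior(p_y_given_x, class_prior):
    """Divide the classifier's p(y|x) by the class prior p(y) and renormalize."""
    adjusted = p_y_given_x / class_prior
    return adjusted / adjusted.sum(axis=1, keepdims=True)

# a classifier trained on imbalanced data leans toward the common class 0
p = np.array([[0.55, 0.45]])   # raw p(y|x): class 0 barely wins
prior = np.array([0.9, 0.1])   # 90% of training labels were class 0
q = remove_class_prior(p, prior)
# after dividing out the prior, the rare class 1 wins
```

this is only a sketch of the argument, of course: the quality of the correction depends on how well-calibrated the classifier's predictive distribution is.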

lastly, i’m not entirely sure whether it’s a good idea to frame this problem as classification. instead, i believe this problem should ideally be framed as multi-label classification in which each condition is predicted to be present (positive) or not. this is arguably a significantly more minor point than the issues above.

with all these issues, it's difficult for me to see what i should get out of reading this manuscript. it's not surprising that existing few-shot learning algorithms do not work well, because the target problem was not a conventional few-shot learning problem. it's perhaps not surprising that the baseline seems to work better for more common classes but not for rare classes, because there was no description of (which suggests no effort at) recalibrating the predictive distribution to remove the class prior.

since all the algorithms have been implemented, i believe a bit of effort in re-designing the experiments and tweaking the algorithms would make the manuscript much stronger.

My grandmother, who was always hearty and cheerful, would without fail bring up two stories whenever we spoke on the phone. One was about how, when I was little, I cried so bitterly, day and night, without stopping. I was a baby then, so I don't remember it myself, but I must have cried an awful lot for this story to have survived while her more recent memories faded with the years. Seeing how my uncles and aunts, who are now grandfathers and grandmothers themselves, still say "this one cried so much as a child" whenever they see me, I suppose I really did cry a lot.

The other story is a single scene from amid all that crying. When I finally began to walk on my own, my grandmother opened the door for the first time and let me walk around the yard by myself. Somehow I imagine it was a fine day with a clear blue sky. She said I walked slowly, looking up at the sky and down at the ground, mumbling nonstop something no one could make out. To her it sounded as if I were saying: ah, the world is so fascinating; this is fascinating, and that is fascinating too. Of course I was just an infant and remember none of it, but every time I heard this story I wondered what I could have been mumbling about.

These are stories I have heard dozens of times, and yet I still wish my grandmother could tell them to me again ..

with these numbers, we anticipate a much lower reviewing load for each area chair/reviewer this year. it is however impossible for us to be certain, especially in the case of senior area chairs and area chairs. that is the reason why we don't provide senior area chairs and area chairs with any option to preemptively reduce their reviewing load. instead, if there's any particular request, we ask senior area chairs and area chairs to reach out to the program chairs directly to discuss the right way to adjust their reviewing load individually.

this is however not the case with reviewers. we already provide reviewers with an option to request a reduced reviewing load on OpenReview. Unfortunately this option is less visible, as is evident from the non-stop stream of request emails we're receiving (yes… my inbox is now … totally filled up and overflowing ..) so, here are detailed instructions on how you can request a reduced reviewing load yourself *without* emailing me.

1. Decline the initial reviewer invite: in order to request lower reviewing load, you need to click “DECLINE” link in the original invite email, as shown in the screenshot below:

2. Click “OK” when prompted by OpenReview with “You have chosen to decline this invitation. Do you want to continue?”

3. You will be redirected to a landing page. click “Request reduced load” at the bottom of the landing page.

4. You can then choose the reduced load from {1, 2, 3, 4} in the following page. You can choose one that best suits you and click “Submit”.

And, that’s it!

We strive to ensure no reviewer is overloaded with a huge number of assignments and also all submissions receive a proper level of attention from reviewers, area chairs and senior area chairs. This cannot be done without your service, and we greatly appreciate it.

to be specific, i will use $p(y|x)$ to indicate that this is a distribution over all possible answers $\mathcal{Y}$ returned by a machine learning model $f$ given an input $x$. this is distinguished from the predictive distribution computed directly by that model $f$, which i will denote as $p_f(y|x)$. these two, $p(y|x)$ and $p_f(y|x)$, differ from each other in that the former takes into account uncertainty that cannot be captured by the machine learning model $f$, while the latter captures only the uncertainty the model itself can express. this distinction will be made clearer later in this post.

of course, neither of these two needs to be an actual probability given an input and an arbitrary target $(x,y)$ but can just be an arbitrary scalar $-E(x,y) \in \mathbb{R}$, as we can turn this into the probability by

$$p(y|x) = \frac{\exp(-E(x,y))}{\int_{\mathcal{Y}} \exp(-E(x,y')) \mathrm{d}y'}.$$
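for a discrete $\mathcal{Y}$ the integral becomes a sum and this is just a softmax; here is a tiny sketch with made-up energies:

```python
import numpy as np

def energy_to_prob(energies):
    """Turn arbitrary scalars E(x, y) into a distribution over a discrete Y
    via the softmax form above (the integral becomes a sum)."""
    scores = -energies
    scores = scores - scores.max()   # subtract max for numerical stability
    unnorm = np.exp(scores)
    return unnorm / unnorm.sum()

E = np.array([1.2, -0.3, 4.0])       # E(x, y) for three candidate answers
p = energy_to_prob(E)
# lower energy -> higher probability; probabilities sum to one
```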

what does it mean for us to use a predictive distribution rather than a single-point prediction? this is equivalent to saying that there are a set of answers that we consider likely. then, here comes a natural, follow-up question: why are there many likely answers, not only one? one way to answer this question is to say that there exists uncertainty in the answer. then, here’s the next follow-up question: *where does this uncertainty come from*? this is the question i’ll try to answer by enumerating what i can imagine as the sources of uncertainty in this post.

instead of talking about irreducible (was it aleatoric?) and reducible (was it epistemic?) uncertainty, i’ll just be very much down to earth and talk about some of the sources of uncertainty that i believe we should think of.

before i continue, let me clarify what i mean by $y$ here. $y$ is one of all possible answers. in the case of classification, $y$ is one of all possible classes. in the case of multi-label classification (many binary classifiers,) $y$ is one of all possible combinations, i.e., $y \in \{0, 1\} \times \cdots \times \{0, 1\}$. in other words, we do not have to worry too much about dependencies between different dimensions of $y$, although this makes it tricky to think of continuous $y$ (it does reveal what i'm interested in, doesn't it?)

under this setup of $y$, i will care about the probability assigned to $y$ rather than the variance of the probability assigned to $y$. this arises from my desire to consider only those $y$’s that receive reasonably high probabilities. among these reasonably probable $y$’s, those that are more highly probable are also the ones that tend to (but not always for sure) have lower variance (just because the probability is bounded between $0$ and $1$.) in other words, we care about how many highly plausible answers there are and what they are.

of course, you can replace $\mathbb{E}$ with $\mathbb{V}$ below to get the variance of the probability assigned to $y$ rather than the average. perhaps it’s a good idea then to use some combination of the mean and variance, similarly to using e.g. upper-confidence bound in various active learning setups. but, well, i’m writing a blog post not a book here.

the **first** source of noise that comes to my mind is our use of a finite number of examples, for both training and evaluation. even if there exists a single correct answer $y^*$ for an input $x$, it is possible that it may be impossible to precisely identify this correct answer $y^*$ given only a finite number of examples from which our machine learning model learns. even worse, different answers may look more likely when different sets of training examples are used.

it is always reasonable to assume our learning algorithm can only work with a finite number of examples (even in the most optimal case, it will be bounded by how long Google thrives and survives …) let’s say we always use $K$-many training examples drawn from a single data distribution $p_{\mathrm{data}}$. the uncertainty arising from this finite nature of data can be written as

$$p(y|x) = \mathbb{E}_{(x^1, y^1), \ldots, (x^K, y^K) \sim \underbrace{p_{\mathrm{data}} \times \cdots \times p_{\mathrm{data}}}_{K}} \left[\mathrm{LEARN}((x^1, y^1), \ldots, (x^K, y^K))(x)\right],$$

where $\mathrm{LEARN}$ is a learning algorithm that returns a trained model. in other words, we need to try training as many models as possible with size-$K$ subsets and see how the predictions from these models vary.

it makes sense to a certain degree, but of course, this is not tractable in general, because we are often given a single set of training examples to work with, instead of the full data distribution from which we can freely sample a new set of training examples. yes, yes, you’re right that sometimes we have (expensive) access to the data distribution, but let’s assume this is not the case in our case.

instead, it is possible to generate pseudo-training sets by re-sampling multiple sets from this single training set. this is what we often refer to as *bootstrap resampling*. this is a nice way to capture the variation/uncertainty caused by sampling of data, but it is often intractable to use this methodology in deep learning, as the size of a data set needed to train a model is pretty huge.
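as an illustration of the idea (not a recipe for deep learning), here is a toy sketch of bootstrap resampling, with a 1-nearest-neighbour predictor standing in for $\mathrm{LEARN}$ (my choice, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def learn_1nn(train_x, train_y):
    """A stand-in for LEARN: returns a 1-nearest-neighbour predictor."""
    def predict(x):
        return train_y[np.argmin(np.abs(train_x - x))]
    return predict

# a toy 1-d training set whose label flips around x = 0
train_x = rng.normal(size=50)
train_y = (train_x > 0).astype(int)

# approximate p(y|x) by averaging predictions of models trained on
# bootstrap-resampled training sets
x_query, n_boot, K = 0.05, 200, 50
votes = []
for _ in range(n_boot):
    idx = rng.integers(0, len(train_x), size=K)   # resample with replacement
    model = learn_1nn(train_x[idx], train_y[idx])
    votes.append(model(x_query))
p_y1 = np.mean(votes)   # fraction of resampled models predicting y = 1
```

near the decision boundary, the resampled models disagree, and `p_y1` reflects that disagreement; far from the boundary they agree and the distribution collapses onto one answer.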

of course, such a resampling strategy can be used to measure the uncertainty in the test accuracy given a single model $f$:

$$\mathrm{Var}(\mathrm{EVAL}_{p_f(y|x)}) = \mathbb{V}_{(x^1, y^1), \ldots, (x^K, y^K) \sim \underbrace{p_{\mathrm{test}} \times \cdots \times p_{\mathrm{test}}}_{K}} \left[\mathrm{EVAL}_{p_f(y|x)}((x^1, y^1), \ldots, (x^K, y^K))\right],$$

where $p_f(y|x)$ is the predictive distribution from one particular model $f$.
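this test-set variant is cheap, since it only needs the per-example correctness of one fixed model; a small sketch with simulated correctness indicators:

```python
import numpy as np

rng = np.random.default_rng(0)

# fixed predictions from one model f on a test set of K examples;
# here we simulate a model that is right on ~80% of examples
K = 1000
correct = rng.random(K) < 0.8

# bootstrap-resample the test set to estimate the variance of EVAL
accs = []
for _ in range(500):
    idx = rng.integers(0, K, size=K)   # resample the test set with replacement
    accs.append(correct[idx].mean())
acc_std = np.std(accs)
# for accuracy p on K examples, the binomial std is sqrt(p(1-p)/K),
# roughly 0.013 here, and the bootstrap estimate should land nearby
```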

the **second** source of uncertainty is noisy measurement. this is closely related to the sampling-induced uncertainty above, except that we now split the data distribution $p_{\mathrm{data}}$ into two parts; data generation and noise injection. the process by which a single pair $(x,y)$ is sampled is

- true measurement: $(\hat{x}, \hat{y}) \sim p_{\mathrm{data}}(x, y)$
- noisy measurement of the input: $x \sim C_x(x | \hat{x})$
- noisy measurement of the output: $y \sim C_y(y|\hat{y})$

$C_x$ and $C_y$ are the noisy measurement processes for $x$ and $y$, respectively.

let’s first consider the input noise $C_x$. we assume that $C_x$ is symmetric (i.e., $C_x(\hat{x}|x) = C_x(x|\hat{x})$.) this symmetry tells us that we can draw plausible samples of a _clean_ version $x$ given the noisy measurement $\hat{x}$ from $C_x(x | \hat{x})$. we don’t know exactly which of these samples is the original version $x$, but they are all largely plausible. the uncertainty arising from our inability to perfectly denoise the noisy measurement shows up as:

$$p(y|\hat{x}) = \mathbb{E}_{x \sim C_x(x | \hat{x})} \left[ p_f(y|x) \right],$$

where $p_f$ is the predictive distribution returned by a classifier $f$. just like above, this classifier may return an unnormalized scalar, in which case we turn it into the probability by softmax normalization. in other words, the uncertainty is in how the prediction varies across plausible original version of $\hat{x}$ according to the symmetric noisy measurement process $C_x$.

this implies that we can reduce this particular type of uncertainty if we knew the (potentially irreversible) $C_x$, by maximizing $\log p(y^*|\hat{x})$ above rather than $p_f(y^*|\hat{x})$. of course, $C_x$ is often (if not always) unknown, and people often resort to manually crafting a proxy corruption process that mimics a reasonable noisy measurement process $C_x$ and sampling from it during training to approximate the expectation above. this practice is nowadays referred to as *data augmentation*. of course, smart ones (yes, like my awesome collaborators) learn a proxy to $C_x$ from unlabelled data, as we have done recently with SSMBA for natural language processing.
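the expectation above can also be approximated at test time by averaging predictions over sampled corruptions (often called test-time augmentation); here is a sketch with a made-up classifier and Gaussian noise standing in for $C_x$:

```python
import numpy as np

rng = np.random.default_rng(0)

def predict(x):
    """A stand-in binary classifier p_f(y|x) (an assumption for this sketch)."""
    p1 = 1.0 / (1.0 + np.exp(-3.0 * x))
    return np.array([1.0 - p1, p1])

def predict_marginal(x_hat, sigma=0.5, n_samples=100):
    """p(y|x_hat) = E_{x ~ C_x(x|x_hat)}[p_f(y|x)], with Gaussian C_x,
    approximated by Monte Carlo sampling."""
    samples = x_hat + sigma * rng.normal(size=n_samples)
    return np.mean([predict(x) for x in samples], axis=0)

p = predict_marginal(0.2)
# averaging over plausible clean versions of x_hat softens the prediction
```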

now, let's quickly consider the output noise $C_y$. “quickly”, because it doesn't really differ much from the input noise $C_x$. the major difference is that it is often only useful at training time, since $y$ is not known at test time.

with this in mind, let's consider a particular noisy measurement process $C_y(y | \hat{y}) = \alpha \delta(y|\hat{y}) + (1-\alpha) \mathcal{U}(y; \{1,2, \ldots, L \})$, where $\delta$ is a Dirac delta distribution, $\mathcal{U}$ is a uniform distribution, and $\alpha \in [0, 1]$ is a mixing coefficient. with probability $\alpha$, there is no noise, and with probability $1-\alpha$, we switch the label to one of all possible labels uniformly at random.

we can now express the uncertainty by

$$\mathbb{V}_{y \sim C_y(y|\hat{y})} \left[ \log p_f (y|x) \right]$$

of course, we don’t really have access to $\hat{y}$, but we can flip $\hat{y}$ and $y$ above, because $C_y$ is symmetric; with the probability $\alpha$ the clean answer would’ve been $y$ itself and otherwise it could’ve been anything. in other words, we can reduce this uncertainty by minimizing

$$\mathbb{V}_{y' \sim C_y(y'|y)} \left[ \log p_f (y'|x) \right].$$

an interesting observation here is that $\log p_f (y'|x)$ is bounded from above by $0$. this means that we can indirectly minimize this variance by maximizing each $\log p_f (y'|x)$ for $y' \sim C_y(y'|y)$:

$$\mathbb{E}_{y' \sim C_y(y'|y)} \left[ \log p_f (y'|x) \right],$$

which can be rewritten with this particular $C_y$ as

$$
\sum_{y' \in \mathcal{Y}} I(y=y') \left(\alpha + \frac{1-\alpha}{|\mathcal{Y}|}\right) \log p_f (y|x) + I(y\neq y') \frac{1-\alpha}{|\mathcal{Y}|} \log p_f (y'|x).
$$

this reminds us of the widely-used technique of *label smoothing*, which ensures that a model always assigns some non-zero probability to incorrect classes while maximizing the log-probability of the observed label $y$. so, one way to think of what label smoothing does is that it reduces the uncertainty arising from the noisy measurement of labels.
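here is a quick numerical check (with made-up numbers) that the expectation above is exactly the usual label-smoothing objective, where the smoothing mass $1-\alpha$ is spread uniformly over all $L$ labels:

```python
import numpy as np

L, alpha = 4, 0.9
log_p = np.log(np.array([0.7, 0.1, 0.1, 0.1]))   # p_f(y'|x); y = 0 observed

# direct expectation over y' ~ C_y(y'|y): keep y with probability alpha,
# otherwise draw uniformly from all L labels (including y itself)
prob_y_prime = np.full(L, (1 - alpha) / L)
prob_y_prime[0] += alpha
expected_log_p = (prob_y_prime * log_p).sum()

# the usual label-smoothing objective: smoothed targets times log-probs
targets = alpha * np.eye(L)[0] + (1 - alpha) * np.full(L, 1.0 / L)
smoothed_objective = (targets * log_p).sum()
# the two quantities coincide: the smoothed target *is* C_y(y'|y)
```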

noise in the output $y$ is however trickier when it is *not* noise. what does this mean? it means that there may genuinely be multiple correct answers, and that we cannot tell by looking at $y$ alone whether it is a noisy version of $\hat{y}$ or just one of many possible clean answers. this is an interesting observation to think about: so-called irreducible noise is often indistinguishable from so-called reducible noise in practice!

if we somehow know that there is genuine ambiguity in the output (which is quite common, such as in machine translation and other structured prediction problems,) we can deal with it by introducing stochastic hidden variables into our model, as in *stochastic feedforward networks* as well as *conditional RBM/NADE*. of course, such a powerful conditional density model will inevitably capture not only genuine ambiguity but also genuine measurement noise, potentially leading to the issue of overfitting.

the **third** source of uncertainty is stochasticity in the learning algorithm itself. let $\epsilon$ be some arbitrary random variable from which we can sample numbers in order to make some arbitrary decisions in our learning algorithm. there are so many things we often need to make arbitrary decisions for. some of them are:

- minibatch construction: how do we build a minibatch?
  - if we are building a minibatch on the fly, which subset of training examples do we use?
  - if we are taking the next chunk of training examples, in which order do we sort the training examples?
- parameter initialization
  - how do we initialize the parameters of our model?
- dropout (or any stochastic regularizer)
  - which hidden units do we drop to $0$?
- underlying compute engine
  - e.g., non-deterministic ordering of floating-point operations on parallel hardware

furthermore, some learning algorithms intentionally rely on such randomness. a representative example is *policy gradient* in which noise is added to smooth out the super-difficult optimization problem of

$$\max_{\pi} R(\arg\max_a \pi(a|s))$$

into a slightly-less-difficult problem of

$$\max_{\pi} \mathbb{E}_{a \sim \pi(a|s)} R(a).$$

this smoothing is done by arbitrarily choosing the action among many plausible actions according to $\pi$ at state $s$. such sampling is often implemented by transforming a series of random numbers (e.g. drawn from $\epsilon$) into a single sample from $\pi(\cdot|s)$.
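here is a minimal numpy sketch of this smoothing, using the score-function (REINFORCE) estimator on a made-up one-state problem with three actions:

```python
import numpy as np

rng = np.random.default_rng(0)

# a one-state problem: three actions with rewards R(a); the smoothed
# objective E_{a ~ pi}[R(a)] is ascended with the score-function
# (REINFORCE) gradient estimate R(a) * grad log pi(a)
R = np.array([1.0, 0.2, 0.1])
logits = np.zeros(3)

for _ in range(2000):
    pi = np.exp(logits - logits.max())
    pi = pi / pi.sum()                     # softmax policy
    a = rng.choice(3, p=pi)                # sample an action (the smoothing)
    grad_log_pi = np.eye(3)[a] - pi        # d log pi(a) / d logits
    logits += 0.1 * R[a] * grad_log_pi     # stochastic gradient ascent

pi = np.exp(logits - logits.max())
pi = pi / pi.sum()
# the policy concentrates on the highest-reward action
```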

we can now abstract out these details and make $\epsilon$ an additional input to $\mathrm{LEARN}$ function above. this learning algorithm takes as input the training set as well as this source of randomness. then, for each input $x$, we can check the uncertainty of our prediction by considering multiple (potentially infinitely many) models arising from the variance induced by $\epsilon$:

$$p(y|x) = \mathbb{E}_{\tilde{\epsilon} \sim \epsilon} \left[\mathrm{LEARN}((x^1, y^1), \ldots, (x^N, y^N), \tilde{\epsilon})(x)\right],$$

which looks exactly like the bootstrap resampling version above. this is only natural, because dataset sampling itself can be thought of as an arbitrary selection of a subset of all possible examples. it is however informative to think of these two separately, since noise in stochastic learning is what we often can explicitly control and noise in dataset sampling is what we often don’t have much control over (perhaps except in active learning.)

let $\tilde{\epsilon} = (\epsilon^1, \ldots, \epsilon^M) \sim \epsilon$ be a series of random numbers drawn from $\epsilon$ for a single training run. one may be tempted to choose the $\tilde{\epsilon}$ with the best validation accuracy and deploy the corresponding model. but, in an application where it is important to find more than one answer with reasonable estimates of their probabilities, it is a much better idea to bag all of them for deployment. this is also why you do not want to and should not *tune* a random seed.

it is understandably quite expensive to use the many models arising from $\epsilon$ in real life, unfortunately. it is however an attractive feature of this approach that it gives a full distribution over $y$ reflecting the varying degrees of likelihood of each $y$. it is a usual practice to use the idea of knowledge distillation here: train another model not on the targets $y^n$ from the data but on the entire predictive distribution $p(y|x^n)$.
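a toy sketch of such a seed ensemble: several copies of a small logistic regression (my stand-in model, not anything from this post) differ only in the seed controlling initialization and data order, and their averaged predictive distribution could serve as the distillation target:

```python
import numpy as np

def train_logreg(x, y, seed, epochs=30, lr=0.5):
    """Toy 1-d logistic regression; the seed is tilde-epsilon, controlling
    both the initialization and the order of training examples."""
    rng = np.random.default_rng(seed)
    w, b = rng.normal(), rng.normal()
    for _ in range(epochs):
        for i in rng.permutation(len(x)):
            p = 1.0 / (1.0 + np.exp(-(w * x[i] + b)))
            w -= lr * (p - y[i]) * x[i]
            b -= lr * (p - y[i])
    return w, b

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = (x + 0.3 * rng.normal(size=100) > 0).astype(float)

def p1(w, b, xq):
    return 1.0 / (1.0 + np.exp(-(w * xq + b)))

# each member of the ensemble is one draw of tilde-epsilon
members = [train_logreg(x, y, seed) for seed in range(10)]
x_query = 0.1
ensemble_p1 = np.mean([p1(w, b, x_query) for w, b in members])
# ensemble_p1 is the distillation target p(y=1|x_query) for a student model
```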

in fact, it is not only the training procedure but also the hyperparameter search procedure that relies extensively on random numbers sampled from $\epsilon$. this is because we can almost never perform an exhaustive search, nor a deterministic line search, due to the ever-increasing number of hyperparameters. this is similar to the uncertainty arising from stochastic learning above, in that our choice of hyperparameters, which directly affects learning, has its own noise, e.g. arising from random search, which results in predictive uncertainty.

furthermore, an uncertainty similar to the data sampling uncertainty above exists in hyperparameter tuning as well, as it is a common practice to use a fixed, finite set of validation (held-out) examples for hyperparameter tuning. each time we use a different set of validation examples drawn from the true data distribution (whatever that is), we would end up with a hyperparameter configuration that leads to a different model making slightly different predictions. we could aggregate these to understand how uncertain we are of any particular prediction under this validation-set sampling noise. one could imagine that bootstrap resampling would work well here, but we often skip it in practice.

the first type of uncertainty from hyperparameter tuning, the one that arises from random search, is one of my favourites, along with the smoothing technique we often use in learning. for it is the kind of uncertainty that does not exist in an ideal world in which we have all the time in the world and all the compute in the world. even though the mapping from a hyperparameter configuration to an individual prediction is deterministic, our inability to search exhaustively introduces uncertainty. now, this is what people call reducible uncertainty, but is it really reducible? i don't think so.

finally, i want to spend just one paragraph on one particular case of uncertainty inherent to a problem itself. we already talked about it above when we discussed noisy measurement of the output. that is, what if there are genuinely multiple possible answers?

it turns out that there may be many different reasons why there are multiple possible answers inherent to the problem in our hands. among those many reasons, i want to talk briefly about one particular scenario. in this scenario, there exists a set of *unobserved* variables $u$ that affect the target variable in some way (how exactly is not really important in this high-level, light blog post.) that is, the true function that determines the target takes as input not only the observed input $x$ but also $u$, i.e., $y = g(x, u)$. because we do not observe $u$, there are multiple correct answers given only $x$.
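a tiny numpy sketch of this scenario (the function $g$ and all numbers are made up): with $u$ an unobserved coin flip, the conditional distribution of $y$ given $x$ alone has two modes:

```python
import numpy as np

rng = np.random.default_rng(1)

def g(x, u):
    # hypothetical true function: the unobserved u shifts the target.
    return x + 3.0 * u

x = 0.5
u = rng.integers(0, 2, size=10000)  # the unobserved binary variable
y = g(x, u)

# given only x = 0.5, both 0.5 and 3.5 are "correct" answers:
# the conditional p(y|x) is bimodal.
modes = np.unique(y)
```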

this can indeed be handled to a certain degree by stochastic feedforward networks as well as conditional density models, but it’s pretty much impossible to ensure that our choice of such a model can capture the unobserved $u$. after all, it’s unobserved: we do not even know what it is, and even less whether it exists.

in this blog post, i enumerated a few sources of uncertainty in machine learning that i could immediately think of off the top of my head. they include finite data sampling, measurement noise, stochastic learning, hyperparameter tuning and unobserved variables. this doesn’t include other potential sources, such as uncontrollable shift in the environment between training and test times. furthermore, it does not cover other paradigms, such as online learning, active learning, etc., because i literally don’t know them well.

now… time to take all those sources into account for quantifying and using uncertainty!

this whole post was motivated by my (continuing) discussion with our wonderful members at Prescient Design: Ji Won Park, Natasha Tagasovska, Jae Hyeon Lee and Stephen Ra. Oh, also, we are hiring!

]]>but, then, i realized i don’t know what uncertainty is at a high level (!) which is somewhat weird, since i think i can often follow specific details of any paper that talks about uncertainty and what to do with it. so, as someone who dies for attention (pun intended, of course), i’ve decided to write a blog post on how i think i (should) view uncertainty. this view has almost no practical implication, but it helps me think of predictive uncertainty (aside from all those crazy epistemic vs. alleatoric uncertainty, which i’m sure i mistyped.)

in my mind, i **start with** the following binary indicator:

$$U(p, y, \tau) = I\left( \sum_{y' \in \mathcal{Y}} I(p(y') > p(y))\, p(y') \leq \tau \right).$$

if we are considering a continuous $y$, we replace the summation with the integration:

$$U(p, y, \tau) = I\left( \int_{y' \in \mathcal{Y}} I(p(y') > p(y))\, p(y')\, \mathrm{d}y' \leq \tau \right).$$

$\mathcal{Y}$ is a set/space of all possible $y$’s. $I(\cdot)$ is an indicator function, i.e., it returns $1$ if true and otherwise $0$. $p$ is a predictive distribution under which we want to measure the uncertainty (e.g., a categorical distribution returned by a softmax classifier.) $y$ is a particular value of interest, and $\tau$ is a user-provided threshold.

this binary indicator tells us whether a particular value $y$ is within top-$(100 \times \tau)$% values under $p$. this can be used for a number of purposes.
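for a discrete $y$, this indicator is a few lines of python (a sketch; the dict-based representation of $p$ is just for illustration):

```python
def U(p, y, tau):
    # binary indicator: is y within the top-(100 * tau)% of the
    # probability mass under the discrete distribution p?
    # p is a dict mapping each value in Y to its probability.
    mass_above = sum(q for yp, q in p.items() if q > p[y])
    return int(mass_above <= tau)

# a hypothetical categorical predictive distribution.
p = {"a": 0.5, "b": 0.3, "c": 0.2}
```

the most probable value "a" is in for any threshold, while "c" only enters once $\tau$ covers the 0.8 of mass sitting above it.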

**first**, we can use it to check how certain any particular prediction $\hat{y}$ is under our predictive distribution. let $p(y|x)$ be the predictive distribution returned by our classifier. we can solve the following optimization problem:

$$\min_{\tau \in [0, 1]} \tau$$

subject to

$$U(p(\cdot|x), \hat{y}, \tau) = 1.$$

in other words, we try to find the smallest threshold $\tau$ such that $\hat{y}$ is included. we refer to the solution of this optimization by $\hat{\tau}$.

there is a brute-force approach to solving this optimization problem, which sheds a bit of light on what it does (and a bit on why i started with $U$ above,) although this only works for a discrete $y$. first, we enumerate all possible $y$’s and sort them in decreasing order of the corresponding $p(y|x)$’s. let us call this sorted list $(y^{(1)}, y^{(2)}, \ldots, y^{(N)})$, where $N = |\mathcal{Y}|$. then, we search for $\hat{y}$ in this sorted list, i.e., $\hat{i} = \min \{ i \in \{1, \ldots, N\} : y^{(i)} = \hat{y} \}$. then, $\hat{\tau} = \sum_{j=1}^{\hat{i}} p(y^{(j)}|x)$. in short, we look at how much probability mass is taken up by predictions that are more probable than $\hat{y}$, which seems (to me at least) to be the right way to think of the uncertainty assigned to $\hat{y}$.
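this brute-force search can be sketched in a few lines of numpy (ties between equally probable values are ignored for simplicity):

```python
import numpy as np

def tau_hat(probs, y_hat):
    # brute-force solution: sort all values in decreasing order of
    # probability, locate y_hat (an index into probs), and accumulate
    # the probability mass down to and including it.
    order = np.argsort(-probs)  # the sorted list y^(1), ..., y^(N)
    i_hat = int(np.where(order == y_hat)[0][0])
    return float(probs[order[: i_hat + 1]].sum())

# a hypothetical predictive distribution over four outcomes.
probs = np.array([0.1, 0.6, 0.25, 0.05])
```

the most probable outcome gets $\hat{\tau} = 0.6$, the second gets $0.85$, and the least probable one $1.0$, matching the intuition that improbable predictions carry large $\hat{\tau}$.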

**second**, we can use it to enumerate all predictions that should be considered under a given threshold $\tau$ beyond one best prediction by solving the following optimization problem:

$$\max_{Y \subseteq \mathcal{Y}} |Y|$$

subject to

$$U(p(\cdot|x), y, \tau) = 1,~\forall y \in Y.$$

in other words, we look at the largest subset $Y$ such that each and every element within $Y$ is certain under the predictive distribution $p(\cdot|x)$ with the certainty $\tau$.

this is a natural problem to solve, and a natural answer to return, especially when we know that the problem has inherent uncertainty. in the case of machine translation, for instance, there is generally more than one equally good translation given a source sentence, and it is only natural to return the top-$(100 \times \tau)$% translations rather than the single best translation (though, we don’t do that in practice unfortunately.)

the same brute-force solution from the first problem is equally applicable here. once we have the sorted list, we find the largest index $\hat{i}$ such that $U(p(\cdot|x), y^{(\hat{i})}, \tau) = 1$ and simply return $Y = (y^{(1)}, y^{(2)}, \ldots, y^{(\hat{i})})$. this is too brute-force and is not tractable (nor applicable) in many situations (precisely why we don’t return multiple possible translations in machine translation, in practice.)
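the same sorted list gives us the set $Y$ directly; a sketch in numpy (again ignoring ties between equally probable values):

```python
import numpy as np

def prediction_set(probs, tau):
    # the largest subset Y such that every element is within the
    # top-(100 * tau)% of the probability mass: walk down the sorted
    # list and keep y^(i) while the mass strictly above it is at most tau.
    order = np.argsort(-probs)
    mass_above = np.cumsum(probs[order]) - probs[order]
    return set(int(i) for i in order[mass_above <= tau])

# a hypothetical predictive distribution over four outcomes.
probs = np.array([0.5, 0.3, 0.15, 0.05])
```

with $\tau = 0$ only the single best prediction survives, and the set grows monotonically with $\tau$ until it covers all of $\mathcal{Y}$.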

**third**, we can use $U$ to calibrate a given predictive distribution toward any criterion. for instance, our calibration criterion could be

$$J(\hat{p}; \tau) = \left| \frac{\mathbb{E}_{x, y^* \sim p^*} \left[ I(|\hat{\tau}(\hat{p}, \hat{y}) - \tau| < \epsilon)\, I(|y^* - \hat{y}| - \delta < 0) \right]}{\mathbb{E}_{x, y^* \sim p^*} \left[ I(|\hat{\tau}(\hat{p}, \hat{y}) - \tau| < \epsilon) \right]} - \tau \right| < \epsilon,~\forall \tau \in [0, 1],$$

where $\hat{p}$ is a monotonic transformation of $p$, and $\hat{y}=\arg\max_y p(y|x)$. you can think of $\hat{p}$ as a target distribution after we calibrate $p$ to satisfy the inequality above.

this criterion looks a bit confusing, but let’s parse it out. the two expectations effectively correspond to drawing true examples $(x, y^*)$ from the ground-truth distribution $p^*$. for each $x$, we check whether the confidence $\hat{\tau}$ of the prediction $\hat{y} = \arg\max_y \hat{p}(y|x)$ falls within $\epsilon$ of the threshold $\tau$. among the cases that satisfy this, we check how often the prediction is good (i.e., $|y^* - \hat{y}| - \delta < 0$). the proportion of such good predictions (the ratio above) should be within a close neighbourhood of the confidence threshold $\tau$.

with this criterion, we can solve the following optimization problem for calibration:

$$\min_{F} \int_{0}^1 J(F(p); \tau) \mathrm{d}\tau + \lambda \mathcal{R}(F),$$

where $\mathcal{R}(F)$ is some measure of the complexity of the monotonic transformation $F$ with the regularization coefficient $\lambda > 0$.

we can think of this optimization problem as finding *minimal* changes we need to make to the original predictive distribution $p$ to maximally satisfy the criterion above. of course, we can use different formulations, such as using a margin loss, but the same idea holds regardless.
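one familiar member of such a family of monotonic transformations $F$ is temperature scaling; a quick sketch (this is just one convenient parametric choice, not the only one):

```python
import numpy as np

def temperature_transform(probs, T):
    # a one-parameter monotonic transformation F of the predictive
    # distribution: it preserves the ranking of the y's while sharpening
    # (T < 1) or flattening (T > 1) the distribution.
    logits = np.log(probs + 1e-12) / T
    logits -= logits.max()
    e = np.exp(logits)
    return e / e.sum()

# a hypothetical predictive distribution to be calibrated.
p = np.array([0.7, 0.2, 0.1])
sharp = temperature_transform(p, 0.5)
flat = temperature_transform(p, 2.0)
```

because the transformation is monotonic, the $\arg\max$ prediction never changes; only the confidence $\hat{\tau}$ attached to each $y$ does, which is exactly the knob calibration needs.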

there can be many other criteria. for instance, we may only care that the true value $y^*$ be within the top-$(100 \times \tau)$%. in this case, the optimization problem simplifies to:

$$\min_F \mathbb{E}_{x, y^*} \left[ 1- U(F(p), y^*, \tau) \right] + \lambda \mathcal{R}(F).$$

so, how does it relate to all the discussions on **reducible** (our inability) and **irreducible** (the universe’s inability) **uncertainty**? in my view, which is often extremely pragmatic, it’s almost a moot point to distinguish these two too strongly when we consider the uncertainty of prediction coming out of our system, assuming we’ve tried our best to minimize our inability (reducible uncertainty). with a finite number of training examples, which are almost never enough, and with our inability to tell whether there’s a model mismatch (the answer is almost always yes,) we cannot really even tell between reducible and irreducible uncertainty. then, why bother distinguishing these two rather than just lumping them together into $p(\cdot|x)$?

**anyhow**, the post got longer than i planned but stays as empty as i planned. none of these use cases of the binary indicator $U$ are actionable immediately nor tractably. they need to be polished and specialized for each case by carefully inspecting $p$, $\mathcal{Y}$, etc. but, at least this is how i began to view the problem of uncertainty in machine learning.

this whole post was motivated by my discussion with our wonderful members at Prescient Design: Ji Won Park, Natasha Tagasovska and Jae Hyeon Lee. Oh, also, we are hiring!

]]>if we consider the case of regression (oh i hate this name “regression” so much..) we can write this down as maximizing the (Gaussian) log-likelihood

$$-\frac{1}{2} \| \alpha y + (1-\alpha) y' - G(F(\alpha x + (1-\alpha) x'))\|^2,$$

where \((x,y)\) and \((x',y')\) are two training examples, and \(\alpha \in [0, 1]\) is a mixing ratio. nothing more to explain than to simply look at this loss function: we want our regressor \(G \circ F\) to linearly interpolate between any pair \((x,x')\).
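a quick numpy check of this reading (made-up numbers): when $G \circ F$ is exactly linear, the mixup residual vanishes for any pair and any $\alpha$:

```python
import numpy as np

rng = np.random.default_rng(2)

def mixup_pair(x, y, x2, y2, alpha):
    # input mixup: interpolate the inputs and the targets with the same
    # mixing ratio alpha.
    return alpha * x + (1 - alpha) * x2, alpha * y + (1 - alpha) * y2

# a regressor that is exactly linear interpolates perfectly between any
# pair of examples, so the mixup residual vanishes for every alpha.
w = rng.normal(size=3)
regressor = lambda x: float(x @ w)
x, x2 = rng.normal(size=3), rng.normal(size=3)
y, y2 = regressor(x), regressor(x2)

x_mix, y_mix = mixup_pair(x, y, x2, y2, alpha=0.3)
residual = y_mix - regressor(x_mix)
```

a nonlinear regressor would generally incur a nonzero residual on the mixed pair even when it fits both endpoints, which is precisely what the mixup loss penalizes.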

Manifold mixup ([1806.05236] Manifold Mixup: Better Representations by Interpolating Hidden States (arxiv.org)) followed up on the original mixup by proposing to interpolate in the hidden space (after \(F\) above,) similarly to my own work on interpolating the hidden representations of retrieved examples (see here, done together with Jake Zhao.) there are quite a few details in the case of manifold mixup, such as randomly selecting the layer at which mixup is done, but let me ignore those and just consider the key loss function:

$$L^{\mathrm{mmix}} = -\frac{1}{2} \| \alpha y + (1-\alpha) y' - G(\alpha F(x) + (1-\alpha) F(x'))\|^2.$$

a natural inclination is to think of this as ensuring that \(G\) interpolates linearly between two points in the space induced by \(F\). that is probably what the authors of manifold mixup meant by saying that “*Manifold Mixup Flattens Representations*”, although their theory (§3.1) doesn’t seem to have anything to do with this phenomenon of flattening. their theory seems to be largely about universal approximation (which doesn’t really tell us much about linear interpolation) and that classes eventually become linearly separable (again doesn’t tell us much about linear interpolation.)

one thing that’s emphasized in the manifold mixup paper is that it “*backpropagates gradients through the earlier parts of the network*” (i.e. \(F\) above). totally understandable to any deep learner, as the motto we live and die by is end-to-end learning, but if \(F\) changes, it changes the space over which \(G\) linearly interpolates, or \(G\) can linearly interpolate in the space induced by \(F\) by adapting \(F\) rather than \(G\). furthermore, the tie between the linear interpolation between two training examples can dramatically change as the nonlinear \(F\) changes. so… confusing…

let’s look at the gradient of this loss function w.r.t. \(F\) above ourselves, after assuming that \(y \in \mathbb{R}\) and \(G\) is a linear function (similar to sentMixup in [1905.08941] Augmenting Data with Mixup for Sentence Classification: An Empirical Study (arxiv.org)) for simplicity.

$$\frac{\partial L^{\mathrm{mmix}}}{\partial F} = (\alpha y + (1-\alpha) y' - G(\alpha F(x) + (1-\alpha) F(x'))) \frac{\partial G}{\partial Z} \frac{\partial Z}{\partial F},$$

where $Z = \alpha F(x) + (1-\alpha) F(x')$ and

$$\frac{\partial Z}{\partial F} = \alpha F'(x) + (1-\alpha) F'(x').$$

because $G$ is linear,

$$\alpha y + (1-\alpha) y' - G(\alpha F(x) + (1-\alpha) F(x')) = \alpha y + (1-\alpha) y' - \alpha G(F(x)) - (1-\alpha) G(F(x')).$$

combining all these together,

$$\frac{\partial L^{\mathrm{mmix}}}{\partial F} = \left( \alpha (y-G(F(x))) + (1-\alpha) (y' - G(F(x'))) \right) \left( \alpha \frac{\partial G}{\partial F}(x) + (1-\alpha) \frac{\partial G}{\partial F}(x') \right).$$

what you notice here is that there are essentially four terms after expanding this multiplication. two terms are usual gradients we get from making $G \circ F$ predict $y$ given $x$ and $y’$ given $x’$, just like any regression:

- $\alpha^2 (y-G(F(x))) \frac{\partial G}{\partial F}(x)$
- $(1-\alpha)^2 (y'-G(F(x'))) \frac{\partial G}{\partial F}(x')$

the other two terms are quite unusual:

- $\alpha(1-\alpha) (y-G(F(x))) \frac{\partial G}{\partial F}(x')$
- $\alpha(1-\alpha) (y'-G(F(x'))) \frac{\partial G}{\partial F}(x)$

in other words, the direction and scale of the update of $F$ given $x$ is determined by the regression error for $x'$ (!) and that given $x'$ by the error for $x$ (!).
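this four-term decomposition is easy to verify numerically with a scalar toy model (all numbers here are made up):

```python
# a scalar toy model: F(x) = theta * x and a linear readout G(h) = g * h,
# so that G(F(x)) = g * theta * x and the four-term expansion can be
# checked exactly.
theta, g = 0.7, 1.3
x, x2 = 0.5, -1.2
y, y2 = 2.0, -0.4
alpha = 0.3

# L^mmix = -1/2 (t - G(Z))^2 with Z = alpha F(x) + (1 - alpha) F(x').
t = alpha * y + (1 - alpha) * y2
Z = alpha * theta * x + (1 - alpha) * theta * x2
resid = t - g * Z

# direct gradient of L^mmix with respect to theta.
grad = resid * g * (alpha * x + (1 - alpha) * x2)

# the two usual terms plus the two unusual cross terms, in which x's
# update is scaled by x''s regression error and vice versa.
e1, e2 = y - g * theta * x, y2 - g * theta * x2
usual = alpha**2 * e1 * g * x + (1 - alpha) ** 2 * e2 * g * x2
cross = alpha * (1 - alpha) * (e1 * g * x2 + e2 * g * x)
```

the direct gradient and the sum of the four expanded terms agree, and setting both per-example errors to zero kills all four terms at once.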

one could think of these two terms as the ones that flatten the representation space induced by $F$, but one also notices that the regression error terms are shared between the two usual terms and the two unusual terms. in other words, the gradient is zero when regression on the original pairs $(x,y)$ and $(x',y')$ is solved, regardless of how *flattened* the space induced by $F$ is.

this is unlike the original mixup (or input mixup) where the contributions of $x$ and $x'$ cannot be separated throughout the entire network ($G \circ F$). in manifold mixup, because the contributions of $x$ and $x'$ can be separated out at the level of $F$ (not at $G$, though,) there is room for $F$ to make linear interpolation pretty much meaningless.

in fact, this may be what the authors pointed out with the theory of manifold mixup already: “*In the more general case with larger $\mathrm{dim} (H)$, the majority of directions in H-space will be empty in the class-conditional manifold.*” there is no meaningful interpolation between these class-conditional manifolds, because a majority of the directions that would otherwise connect them are empty (pretty much meaningless from $G$’s perspective.)

another way to put it is that the feature extractor $F$ can easily give up on inducing a space in which interpolation between any pair of training examples is meaningful, since it stops changing as long as the model $G \circ F$ predicts the original training examples very well. in other words, there is no reason why $F$ should induce a space over which $G$ linearly interpolates in a meaningful way.

this leaves us with a BIG mystery: why does manifold mixup work well? it worked well for the authors of the original manifold mixup, and since then, various authors have claimed that it works well (see, e.g., sentMixup as well as TMix). what do those two unusual terms above in the gradient do to make the final model generalize better?

until this mystery is resolved, my suggestion is to stick to a much more explicit way of ensuring that the representation is *flattened*, by ensuring that small changes in the input space indeed map to small changes in the representation space. this can be done by e.g. making the representation predictive of the input (see, e.g., [1306.3874] Classifying and Visualizing Motion Capture Sequences using Deep Neural Networks (arxiv.org), http://machinelearning.org/archive/icml2008/papers/601.pdf, [1207.4404] Better Mixing via Deep Representations (arxiv.org), etc.) or explicitly making the representation linear using some auxiliary information such as time (see, e.g., [1506.03011] Learning to Linearize Under Uncertainty (arxiv.org)). of course, i need to plug my latest work on learning to interpolate in an unsupervised way as well: [2112.13969] LINDA: Unsupervised Learning to Interpolate in Natural Language Processing (arxiv.org).