it’s typically not a part of any formal training of PhD students to learn how to write a review. certainly there are materials online that aim to address this issue by providing various tips & tricks of writing a review, such as Reviewing Advice – ACL-IJCNLP 2021 (aclweb.org), but it’s not easy to learn to write something off of a bullet-point list of what should be written. it’s thus often left for student authors to learn to review by reading the reviews of their own papers.
this learning-to-review-by-reading-one’s-own-reviews strategy has some downsides. a major one is that people are often left with a bitter taste after reading reviews of their own work, because reviewers need to (and often are instructed to) point out both the up & downsides of any submission under review. it is this list of downsides that leaves the bitter taste, and these authors end up being overly critical of others’ work when they start reviewing.
perhaps, a reasonably easy first-step fix would be to expose new reviewers as well as prospective reviewers to reviews of 3rd-party papers, that is, neither reviews they wrote themselves nor reviews of their own papers. the openreview movement (i’m calling it a movement rather than Openreview itself, as Openreview does support closed-door reviewing, which is increasingly being adopted, such as by NeurIPS) enables this, although this is rarer than it should be, in my opinion, and is highly focused on a small number of areas.
so, i thought i’d start by sharing a random sampler of the reviews i’ve written in the past year or so. i understand that some authors may notice these reviews were for their own papers, which were either accepted or rejected. i hope they understand that i didn’t know their identity (i truly rarely do …) and just did my job as well as i could. i spent approximately 1-5 hours reviewing each 8-to-12-page-long submission.
by the way, i’m in no way saying that these are good reviews. reading these reviews again myself, i’m realizing how bad i am at reviewing myself, and that i also would’ve benefited a lot from learning to review. i mean … my … i do ask authors to cite my work a lot …
Let me start this review by saying that I like the authors’ idea and their motivations behind their proposal on modifying the (self-)attention mechanism to cope better with long sequences using a computationally more efficient relative positional embedding. there are however three points i’d like to request the authors to address, after which i will strongly advocate for the manuscript’s acceptance. i’ll describe these issues after giving a quick summary of the authors’ contribution in the submission, from my own perspective (the authors are more than welcome to incorporate any of these in future revisions, if needed.)
in the multi-headed attention mechanism, there are two major components; one is to extract multiple values from each input vector (V) and the other is to compute the attention weights for each input vector per head (Q & K). because we often use softmax (or as the authors refer to it, L1-normalized exp), the attention weight from each head (per input vector) tends to focus on another vector that is at a particular distance from the input vector. the authors cleverly exploit this phenomenon by computing the attention weights once using L2-normalized sigmoid (to encourage more evenly spread out attention weights) and selecting a subset of these spread-out attention weights using location-sensitive (but value-agnostic) masks (N) to form the attention weight for each attention head. in order to ensure that each such subset is computed with awareness of the associated attention head (or location), they further add a single vector bias (weighted sum of as many bias vectors as there are heads) to each vector when computing the attention weights. this is a clever approach that is well-motivated and well-executed. that said, the current manuscript is not without issues, which i will describe below.
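to make the summary above concrete, here is a minimal sketch of how i read the mechanism. it is not the authors’ implementation: the scores, the two masks and all function names are made up for illustration, and the masks are assumed to sum to one at each position so that the per-head weights add back up to the shared ones.

```python
import math


def softmax(scores):
    """L1-normalized exp: positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def l2_normalized_sigmoid(scores):
    """Sigmoid of each score, rescaled so the squared weights sum to 1."""
    sig = [1.0 / (1.0 + math.exp(-s)) for s in scores]
    norm = math.sqrt(sum(v * v for v in sig))
    return [v / norm for v in sig]


def per_head_weights(shared_weights, masks):
    """Multiply the shared weights by one position-only mask per head."""
    return [[w * m for w, m in zip(shared_weights, mask)] for mask in masks]


# hypothetical scores of one query against six keys, ordered by distance
scores = [4.0, 1.0, 0.5, 0.0, -0.5, -1.0]
peaked = softmax(scores)                # concentrates on the first key
shared = l2_normalized_sigmoid(scores)  # spreads out much more evenly

# hypothetical masks: they depend on position only, never on the values
masks = [
    [1.0, 1.0, 0.5, 0.0, 0.0, 0.0],  # head 0: nearby positions
    [0.0, 0.0, 0.5, 1.0, 1.0, 1.0],  # head 1: distant positions
]
heads = per_head_weights(shared, masks)
```

the point of the sketch is only that the expensive part (scoring query against keys) happens once, and each head is a cheap masked view of the same weights.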
first, one of the major motivations from the authors’ perspective is that this approach is better than existing relative positional embeddings or relative positional bias approaches, because it does not “change the memory configuration of these tensors in the accelerator in a less optimized manner”. i can roughly see their argument and why this may be the case, but as this is one of the major motivations, the authors need to explain this much more carefully. for instance, i’d suggest the authors set aside an entire section contrasting what kinds of “memory rearrangement” are needed for the three approaches (RPE, RAB and Shatter) and discussing on which accelerators Shatter is more efficient and how. of course, another way would be to remove this as a major motivation (could be mentioned only as a nice side-effect) and to refocus the motivation on other aspects of efficiency, such as fewer parameters and fewer 3-D tensors to maintain.
second, the authors emphasize that the proposed approach has fewer parameters to tune compared to e.g. XLNet and other relative positional embedding approaches. this is convincing from one angle; the authors’ clever strategy to share the attention weight computation across multiple heads does reduce the number of parameters involved in computing the attention weights for multiple heads.* it is however not convincing from another angle where the focus is on the lack of parameters in relative positional embedding (N). it is not about whether N does or does not have any trainable parameters, but that the choice of how to construct N is quite important, as the authors point out themselves in A.1: “the pretraining loss and finetuned performance are sensitive to the choice of the partition of unity”. it is almost like the authors themselves tuned the parameters behind N by hand, which could be done for other approaches as well (see, e.g., https://arxiv.org/abs/2108.12409 where the bias is a linear function w.r.t. the distance |i-j| to induce some kind of relative distance attention.) that said, this is a minor point and can be fixed by rephrasing text here and there.
third, this is not really an issue of this manuscript alone but a general issue of most papers that report finetuned accuracies on GLUE tasks, etc. if i understood the authors correctly when they stated “we conduct several runs to show one run better than average (i.e., if the number on some task is worse than average, we will re-run it and show a better one.)” (to be frank, i couldn’t understand what this means at all,) some of the runs with low accuracies are thrown away, and there are multiple runs for each task. unfortunately, i can’t understand the rationale behind throwing away some of those “worse” runs, nor why there is only 1 number (accuracy) per task/configuration pair in all the tables. i totally understand that the authors want to have “fair comparison” (though, it’s unclear in which aspect..) but this simply obscures how well the proposed approach works. to this end, i request the authors to report either both the mean and std. dev. (if there were enough runs) or max/med/min (if there were only a few runs) for each setup without throwing any results away for the runs they’ve run themselves. they can always report a single accuracy for each task-method pair according to what others have done separately as well.
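as a small illustration of the kind of parameter-free but hand-designed bias i have in mind, here is a sketch of a bias that is linear in the distance |i-j|. this is my own toy version, not the code from the paper linked above; the function name and the slope value are assumptions.

```python
def linear_distance_bias(seq_len, slope):
    """Return a seq_len x seq_len matrix with entries -slope * |i - j|.

    The bias has no trainable parameters, but the slope still has to be
    chosen by someone, much like the construction of N.
    """
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]


# hypothetical slope; added to the attention scores before normalization
bias = linear_distance_bias(4, slope=0.5)
```

nothing here is trained, yet the slope plays the same role as the choice of partition behind N: it is a design decision that somebody had to tune.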
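the reporting i’m asking for takes only a few lines. the sketch below is one possible way to do it, with a hypothetical threshold for what counts as “only a few runs” and made-up accuracies; the key property is that every run goes in and none is dropped.

```python
import statistics


def summarize_runs(accuracies, few_runs=5):
    """Summarize every run of one task/method pair without discarding any."""
    if not accuracies:
        raise ValueError("need at least one run")
    if len(accuracies) <= few_runs:
        # too few runs for a meaningful std. dev.: report max/med/min
        return {
            "n": len(accuracies),
            "max": max(accuracies),
            "median": statistics.median(accuracies),
            "min": min(accuracies),
        }
    return {
        "n": len(accuracies),
        "mean": statistics.mean(accuracies),
        "std": statistics.stdev(accuracies),  # sample standard deviation
    }


# made-up accuracies, for illustration only
few = summarize_runs([0.84, 0.81, 0.86])
many = summarize_runs([0.84, 0.81, 0.86, 0.83, 0.85, 0.82, 0.84, 0.80])
```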
(*) by the way, i do not believe it is correct to say that the Shatter is “single-headed self-attention”, since it does result in the vectors from multiple heads. it’s only that some parts of the computation of attention weights are cleverly shared. i’d suggest the authors refrain from saying so.
In this manuscript, the authors propose a variant of a transformer (or as a matter of fact any neural machine translation system) that is claimed to better capture styles of translation. during training, this proposed model, called an LSTransformer, finds the most similar reference sentence from a minibatch and uses this surrogate reference to compute the weighted sum of the latent style token embeddings. this style code is then appended to the source token embeddings before the source sentence is fed to the transformer for translation. at translation time, it looks like (i say so, because it wasn’t specified explicitly) the LSTransformer treats the entire test set as if it were a training-time minibatch in order to find a surrogate reference (based on the source sentence, which is possible because the embedding tables are shared between the encoder and decoder), from which the style embedding is computed and translation is done.
unfortunately the authors do not specify explicitly what they mean by “style”, and where the proposed LSTransformer finds information about “style”, which makes it pretty much impossible for me to understand what the proposed LSTransformer is supposed to do. this gets worse as in the experiments, the “style” of translation is almost equated with the domain from which test sentences were drawn, which is quite different from what i expect style to be based on the authors’ discussion of formality, etc. earlier. independently from my other points below, it’ll be critical for the authors to restructure the main part of the paper by first clearly defining what they mean by style and how the proposed approach captures such style (e.g., why does finding a random reference sentence from a minibatch consisting of i.i.d. samples help the LSTransformer capture a style? what if no reference sentence within the minibatch matches the style of the true reference?) and only then empirically demonstrating the effectiveness of the proposed approach.
In the training procedure, there’s an issue that i cannot wrap my head around. a major innovation the authors proposed is to use the so-called surrogate reference, which is the reference that’s most similar to the true reference within a minibatch. the cnn-based sentence encoder is used to compute the embedding for retrieving a surrogate reference. but, then if the minibatches were truly constructed from sentences selected uniformly at random from the training set, isn’t choosing a surrogate reference effectively the same as choosing any reference from the training set in expectation? that is, as training continues, the effect of choosing the “most similar surrogate” reference translation disappears, because every minibatch consists of i.i.d. samples from the training set (or training distribution). in one of the extreme cases, consider minibatches of size 2 each and using all possible size-2 subsets of the training set for training: every other training example serves as a surrogate reference. how does this procedure help with capturing “style”?
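the extreme case above can be checked by brute force. the sketch below is my own toy enumeration, not anything from the manuscript: with minibatches of size 2, the “most similar” reference in the batch is simply the only other example, so no similarity function is needed at all.

```python
from collections import Counter
from itertools import combinations


def surrogate_counts(num_examples):
    """Count how often each example is the surrogate of each other example
    when training on every possible minibatch of size 2."""
    counts = {i: Counter() for i in range(num_examples)}
    for a, b in combinations(range(num_examples), 2):
        # in a size-2 batch the nearest other reference is the only other one
        counts[a][b] += 1
        counts[b][a] += 1
    return counts


# hypothetical training set of five examples
counts = surrogate_counts(5)
```

every example ends up paired exactly once with every other example, that is, the surrogate is uniform over the training set and carries no information about similarity, let alone “style”.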
this questionable aspect of the proposed training procedure is only amplified when the inference/generation procedure is discussed. it is because the proposed LSTransformer does not use any references or sources from the same document or domain to capture and use the style, but simply uses the given source sentence. this is a weird set up for translation with style. consider the case where the target language exhibits more fine-grained levels of formality than the source language does. how does one expect a model to decide on the formality of the translation by looking only at the source sentence? perhaps the authors had something different in their mind, when they talked about style of translation, and as i pointed out earlier, it’ll be immensely helpful if the authors restructure the text by defining style to start with.
Along this line of thought, one notices that experiments, in particular comparison to vanilla transformers, are actually not too informative. there are two aspects of the proposed LSTransformer that differ from the vanilla Transformer. the first is the training procedure i re-described above, which has some questionable aspects, and the other is the network architecture. at inference time, there’s no weird surrogate reference retrieval or any related loss functions, and the proposed LSTransformer simply becomes a different parametrization of the so-called deliberation network (https://www.microsoft.com/en-us/research/publication/deliberation-networks-sequence-generation-beyond-one-pass-decoding/) or generative NMT (https://papers.nips.cc/paper/7409-generative-neural-machine-translation.pdf). that is, the proposed LSTransformer defines a distribution over the target sentence space given a pair of an imperfect reference sentence and the source sentence. This is clearly unlike the vanilla Transformer which maps from the source sentence alone to the target sentence. Then, a natural question is not whether the proposed LSTransformer works better than the vanilla Transformer but whether this particular parametrization is more beneficial than other approaches to parametrizing such a “refinement” distribution. for instance, if the authors train the network without the surrogate references but simply by feeding the concatenation of the source sentence and the first translation (that is, the translation from the same network with two copies of source sentences provided), would it work worse than the LSTransformer?
Of course, once such an experimental setting is set up, the authors can finally ask the questions whether those style tokens introduced by the authors are indeed capturing styles and how this aspect of capturing style helps translation. unfortunately, due to the lack of the definition of styles and also due to the lack of proper points of comparison, these questions are only touched upon without clear answers.
I do believe there are interesting findings and insights within this manuscript. It is just that the current version of the manuscript does not reveal what those are to the degree that warrants its publication. it’s possible that my suggestion above, when followed, might reveal results that do not align with what the authors have expected/wanted, but i trust the findings and insights revealed from the authors’ efforts will be greatly appreciated by the community.
P.S. yeah.. the authors’ tSNE visualizations are pretty meaningless. Fig. 2 is totally meaningless, as the authors have realized themselves (see footnote 13.) what i see from Fig. 2 is that the style token embeddings are not doing anything, and it’s the balance loss that simply makes the style embeddings orthogonal. after all, it’s VERY easy to have 10 orthogonal vectors in a 512-d space. Fig. 3 doesn’t really encode anything. sadly it shows that the style encoding is 1-dimensional (not even 2-dimensional.) it actually implies that my suspicion above about the issue of surrogate references selected from randomly constructed minibatches might be correct.
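to back up the remark about 512-d space: even vectors drawn at random, with no training and no balance loss, are already close to orthogonal. the sketch below uses random Gaussian vectors as a stand-in, not the authors’ style embeddings, and the seed and function names are my own choices.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def max_abs_cosine(num_vectors=10, dim=512, seed=0):
    """Largest |cosine| over all pairs of random Gaussian vectors."""
    rng = random.Random(seed)
    vectors = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_vectors)]
    return max(
        abs(cosine(vectors[i], vectors[j]))
        for i in range(num_vectors)
        for j in range(i + 1, num_vectors)
    )


worst = max_abs_cosine()  # worst-case pair among 10 random 512-d vectors
```

each pairwise cosine has a standard deviation of roughly 1/sqrt(512), about 0.04, so a loss that nudges 10 such vectors to exact orthogonality has very little work to do, and the resulting picture says little about style.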
This submission consists of two almost independent contributions. The first contribution is a procedure to create a new challenge set for NLI classifiers based on the idea of monotonicity reasoning in NLI, and the second contribution is an algorithm purported to reveal the modular internal structure of a neural net based NLI classifier. I found the first part to consist of interesting findings and perhaps a bit of insight, while the second part to be confusing with frequent self-contradictory statements and perhaps missing answers to many obvious questions.
I’ll go through the paper section-by-section below and leave some major comments first:
Sec. 2: all the neural net (deep learning) based NLP references go from 2018 onward. I find it difficult to believe that this should be the case. For instance, the authors fail to cite <Intriguing properties of neural networks> by Szegedy et al. from 2014 in which the name “adversarial examples” was coined. They also fail to cite <Does string-based neural MT learn source syntax?> by Shi et al. from 2016 which used logistic regression to check whether syntactic labels could be predicted from a neural net hidden state. Let me suggest a more thorough literature review.
Sec. 5.1: this is a pattern i have observed over and over where the authors state something that is either wrong or at best controversial and correct themselves immediately in the same paragraph or in the section. This is simply confusing. For instance, in this section the authors start by “use MoNLI as an adversarial test dataset” and say in the same paragraph “is not especially adversarial”. Indeed, I would not call the proposed dataset adversarial, since it’s not adversarial to any particular model or family and was constructed on its own. I believe a better name would be a challenge set, but any name that is not confusing (even within the authors themselves) would be better.
Sec. 5.2: the observation here is quite interesting in that the models simply fail to flip the labels of these downward monotonically transformed examples, which suggests at least to me that these models are highly insensitive to function words. I believe this is highly related to the investigation from Gururangan et al. 2018 (https://arxiv.org/abs/1803.02324, which is missing from the references) and also to others’ earlier investigations of NLI models and data. How does your observation agree/disagree with their earlier conclusions? This must be discussed, as it is not the first time the community has learned that NLI models have particular weaknesses (and sometimes surprising strengths.)
Sec. 5.3: the final paragraph is confusing, because the first sentence ends saying “this is a failing of the data rather than the models”, while I could not tell why this is so. Then, the authors jumped to another speculation that is not necessarily supported by (nor was expected to be supported by) the experiments in this section: “models can solve a systematic generalization task … only … if they implement a modular representation of lexical entailment relations.” In reality, all i saw from this section is that the models trained on SNLI work horribly on NMoNLI (they did amazingly on PMoNLI), and that it is a question whether this is due to the models themselves or due to the data on which they were trained. There was no evidence supporting either of these.
Sec. 6: this section starts with the authors declaring that “we intuitively believe any model that can generalize from the training set to the test set will implement a modular representation of lexical entailment.” Unfortunately my intuition does not agree with the authors, or simply i do not have any intuition on this. Perhaps the main cause of this discrepancy may be that the authors have not defined the “modular representation” (or “modular internal structure”) in the context of neural nets that are being tested in this paper. It is possible that I may be a bit of an outsider in these studies on systematic generalization, but it would be good to have it defined clearly somewhere so that the reader can readily go see the definition and understand why such would be intuitive (or counter-intuitive.)
Figure 2: I need to insist that each experiment in these plots be run multiple times by varying random seeds (which would impact the order of training example presentation, etc.) It looks to me as if there are a few outliers that are more likely statistical flukes than true trends, such as NMoNLI Test with 300 examples and BERT, SNLI/NMoNLI test with 800 examples and ESIM, and SNLI/NMoNLI tests with up to 200 examples and DECOMP. I don’t believe the authors’ conclusions from these plots will change much, but these outlier points make it difficult to trust the overall trends.
The plot titles in Fig. 2 are confusing, because these are models finetuned with inoculation not models trained solely on SNLI.
Sec. 6.4: The main point “every model was able to solve our generalization task” doesn’t seem to hold for ESIM, as it barely solved the challenge test set except for one particular case when 800 examples were used for training.
Sec. 7: unfortunately this section, which is supposed to describe one of the two main contributions, has quite a few issues that have ultimately convinced me not to recommend this manuscript for acceptance. I’ll go over why below.
Sec. 7.1: this section is quite difficult to follow, because there’s quite a bit of discussion at the beginning that requires the knowledge of the algorithm Infer, but this algorithm is only explained in the final paragraph of the section.
Sec. 7.2: the section starts with the statement “BERT implements a modular representation of lexical entailment if there is a map M from MoNLI examples to model-internal vectors in BERT such that the model internal-vectors satisfy the counterfactual claims ascribed to the variable lexrel.” There are two major issues with this statement. First, I just don’t see in this manuscript why this conditional holds. This is probably because there has not been a clear definition of modular representation of lexical entailment in the context of BERT or any other neural net. Second, what is this “model internal vector”? If i simply concatenate all the vectors present inside BERT and flatten into a single vector, would that correspond to a model internal vector? If so, does it mean that any BERT that satisfies this condition implements modular representation of lexical entailment? It looks like this makes this statement not a conditional but a definition, in which case it’s a bit of a moot point to state so.
Sec. 7.3-4: Instead of Infer (which requires a two-line equation as its definition,) the procedure in this section requires an algorithm box with much more careful description, because i was totally lost following the proposed algorithm and experimental procedure. For instance, what do the authors mean by “every example is mapped to a vector at the same location”? What actually happens when the authors say “we randomly conducted interchange experiments to partially construct each of the 36 graphs”? What were the random variables here? Why are some output-unchanging edges necessarily non-causal?
Sec. 7.5: i find the random edge graph to be quite uninteresting as a baseline, as it is not clear what 50% of having an edge means and whether it is a reasonable baseline. For instance, if we use ESIM and DECOMP from Fig. 2, what kind of numbers would we get? Will ESIM be worse than BERT, because it is less “modular”? What if ESIM was trained only on SNLI? Will it be also worse than ESIM finetuned with inoculation, because it generalizes worse to MoNLI? These are much better baselines to compare against, and without these it’s difficult to put the sizes of the cliques obtained from BERT finetuned with inoculation into context. I’m sure the random graph serves as a lower-bound, but it looks too loose to be informative at all.
Sec. 7.6: The authors conclude that “this is conclusive evidence that … BERT implements a modular representation of the lexical entailment relations between substituted words”. I cannot agree with this because of the reasons provided by the authors themselves in the third paragraph. With all these caveats in the proposed algorithm, what is the right way to draw a conclusion? Do we know that these issues are not significant? If so, how do we know so? Perhaps, the biggest issue again is that it’s unclear what modular representation of lexical entailment means in the context of BERT, and because of that, we cannot really tell whether this proposed procedure indeed captures such a notion.
Sec. 7.7: According to the probing experiments, the authors demonstrate that the first and third layers are equivalent in terms of their linguistic and control tasks, and then the authors continue to conclude that “probes cannot diagnose whether a given representation has a causal impact on a model’s output behaviour.” But, does this imply anything about the authors’ approach? How do we know that this probing is any worse than the authors’ approach, other than through the procedure described above, with a lot of approximations that the authors themselves warn the reader about?
This concludes my review of this submission. My suggestion to the authors is to focus on the MoNLI data as a new challenge set for NLI classifiers and to carefully analyze what this challenge set reveals about the existing NLI classifiers (it’ll be even better if the authors could identify and fix the identified weaknesses.) In this case, it’s probably fine to drop any discussion and claim on “modular internal structure” of these neural nets.
If the authors feel they need to keep the second contribution, I suggest they significantly revise these sections; first, describe the algorithm more clearly, second, discuss various approximations that were made to cope with the intractability and demonstrate that those approximations are reasonable, third, run the algorithm on multiple models and more informative baselines, and fourth (and perhaps most importantly), demonstrate convincingly that these models do indeed exhibit a carefully defined notion of modular internal structures and that the proposed metric does compute the degree of the modularity of internal structures.
in this paper, the authors test a series of modifications to the now-standard transformer, including gated self-attention, convolution as self-attention, attention with a fixed span and attention with a learned span, on SCAN which was manually constructed to test the ability of a sequence-to-sequence model in capturing compositionality. They demonstrate that their observations in the impact of these modifications on SCAN are not indicative of their impacts on more realistic problems, such as machine translation.
the biggest issue i see with this manuscript is the main motivation behind this investigation, that is, “it is not clear if SCAN should be used as a guidance when building new models.” if i did not misunderstand [Bastings et al., 2018] to which a whole paragraph was dedicated in S8 Related Work, Bastings et al. already demonstrated that SCAN is not a realistic benchmark, and that the improvement in SCAN in fact negatively correlates with the improvement in MT (i just opened Bastings et al. to see if i recalled incorrectly, but it seems to be the case.) in fact, Bastings et al. suggest a reason why SCAN is not realistic: “any algorithm with strong long-term dependency modeling capabilities can be detrimental.” in other words, i almost feel like it has been clear for quite some time that SCAN should not be used (on its own) for guiding any development for real-world problems.
of course, it’s a good idea to (1) renew/reconfirm the earlier finding based on a more modern practice, such as using transformers as opposed to using LSTM/GRU and (2) investigate aspects that were not investigated earlier, such as network architectures rather than parametrization of the decoder (autoregressive,) but i believe it’s important for such an effort to be framed and discussed to update/complement the existing knowledge rather than as an attempt to establish the finding as a standalone finding. along this line, i believe it would’ve been more interesting if the manuscript could tell how transformers changed the earlier conclusion on the capability of these neural sequence models on SCAN and its consequences.
a second issue is that it’s very difficult for me to understand why these four variants of attention are relevant to this investigation: why should i be curious about these four particular versions of attention in knowing the relationship between the performance on SCAN and the performance on MT? had the authors chosen another set of architectural modifications (e.g., perhaps change the number of feedforward layers within each transformer block, perhaps change the softmax to sigmoid or any other monotonic transformation for attention weight, perhaps change the final softmax to the mixture of softmaxes, etc.) would they have arrived at a different conclusion? because of this degree of freedom, i believe it is important to start with a statement on what the authors believe is an important axis and why so, before drawing any empirical conclusion. the best i could read from the manuscript is that the authors “start from an observation that convolution-based seq2seq models (ConvS2S) (Gehring et al., 2017) perform very well on it (Dessi and Baroni, 2019).” this is not a convincing reason why we want to test those four variants of attention (it does explain why we want to test replacing attention with convolution, albeit weakly.)
finally, it is unclear whether the observation that attention with a learned span helps on the compositional en-fr task had to be drawn from this investigation. yes, the authors did arrive at this conclusion in this manuscript, but it looks like this could’ve been a completely separate investigation, perhaps motivated better by stating that “it is not clear if” the existing transformers can correctly bridge the difference in the compositional rules between source and target languages. i believe their data will be useful in the future for evaluating this particular aspect of a machine translation system. unfortunately, this particular data on its own adds only a little to the main investigation in this manuscript. perhaps, as the authors stated, this will be a part of a more extensive and more useful benchmark in the future when “more realistic SCAN-like datasets based on natural language data” are created.
Unfortunately i have some issues with the authors’ choices of algorithms and how they use them.
first, few-shot learning algorithms are designed to work best for scenarios in which examples must be classified into “novel” classes that were not present during training time, which is not the case for the problem in this paper. one could argue that many of these few-shot learning algorithms are variants of nearest-neighbour classifiers, and that they tend to work better for rare classes because of their non-parametric nature. this is however not what the authors claim nor argue. what the authors should’ve done and should do in a future iteration of the manuscript is to modify e.g. the prototypical network by dropping the few-shot constraint and using all the training instances (or a subset for computational efficiency).
second, the authors claim and demonstrate the effectiveness of these class reweighting approaches, which I find hard to believe not due to the construction of those algorithms but due to the evaluation metric the authors have chosen to work with. when a neural net classifier, or even logistic regression classifier, is trained, it is trained to capture p(y|x), which is proportional to the product of the class-conditional likelihood p(x|y) and the class prior p(y). the latter is often the reason it looks like a trained classifier prefers more common classes when we simply look at the top-1 predicted class. an interesting consequence of this observation is that reweighting based on the class proportion (which is mainly what the authors have tried either via actual reweighting or resampling) only changes p(y) and does not impact p(x|y). that is, if you estimate p(y) from data and divide the neural net’s prediction p(y|x) by it, the effect of class imbalance largely disappears (of course, up to miscalibration of neural net predictive distributions.)
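the correction described above is a one-liner in practice. here is a minimal sketch with made-up numbers, assuming the class prior has been estimated from the training label frequencies; the function name is hypothetical and none of this comes from the manuscript.

```python
def remove_class_prior(posterior, prior):
    """Divide p(y|x) by an estimate of p(y) and renormalize.

    The result is proportional to p(x|y), i.e. the prediction one would
    get under a uniform class prior.
    """
    if len(posterior) != len(prior):
        raise ValueError("posterior and prior must have the same length")
    scores = [p / q for p, q in zip(posterior, prior)]
    total = sum(scores)
    return [s / total for s in scores]


# made-up numbers: class 0 is common (90%), class 1 is rare (10%)
posterior = [0.6, 0.4]  # classifier output p(y|x) for one example
prior = [0.9, 0.1]      # estimated from training label counts
corrected = remove_class_prior(posterior, prior)
```

in this toy example the top-1 prediction is the common class before the correction and the rare class after it, without any reweighting or resampling during training.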
lastly, i’m not entirely sure whether it’s a good idea to frame this problem as classification. instead, i believe this problem should ideally be framed as multi-label classification in which each condition is predicted to be present (positive) or not. this is arguably a significantly more minor point than the issues above.
with all these issues, it’s difficult for me to see what i should get out of reading this manuscript. it’s not surprising that existing few-shot learning algorithms do not work well, because the target problem was not a conventional few-shot learning problem. it’s perhaps not surprising that the baseline seems to work better for more common classes but not for rare classes, because there was no description (which implies no effort) in recalibrating the predictive distribution to remove the class prior.
since all the algorithms have been implemented, i believe a bit of effort in re-designing the experiments and tweaking the algorithms would make the manuscript much stronger.