BERT has a Mouth and must Speak, but it is not an MRF

Update on June 9, 2021: I still don't know the fate of the hypothetical manuscript by Chandel et al., but I've noticed that Kartik Goyal, Chris Dyer, and Taylor Berg-Kirkpatrick have fixed the issue raised in this blog post (https://arxiv.org/abs/2106.02736) by using BERT's single-token conditional as a proposal distribution in Metropolis-Hastings, in order to sample from the distribution defined by the potentials built from the logits of BERT's single-token conditionals.
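As an aside, here is a minimal sketch of that general recipe, written by me for illustration rather than taken from Goyal et al.'s code. It assumes the Hugging Face transformers library and bert-base-uncased (both my choices), and it is far too slow to be practical, since each step re-scores the whole sequence:

```python
import math

import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()


@torch.no_grad()
def log_potential_sum(ids):
    """Unnormalized target: mask each position in turn and read off the
    logit of the actual token, i.e., sum_t 1h(x_t)^T f_theta(X without t)."""
    total = 0.0
    for t in range(1, ids.size(1) - 1):  # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[0, t] = tokenizer.mask_token_id
        total += model(masked).logits[0, t, ids[0, t]].item()
    return total


@torch.no_grad()
def mh_step(ids):
    """One Metropolis-Hastings step: propose a new token at a random position
    from BERT's single-token conditional, then accept or reject against the
    product-of-potentials target."""
    t = torch.randint(1, ids.size(1) - 1, (1,)).item()
    masked = ids.clone()
    masked[0, t] = tokenizer.mask_token_id
    probs = torch.softmax(model(masked).logits[0, t], dim=-1)
    new_tok = torch.multinomial(probs, 1).item()
    proposal = ids.clone()
    proposal[0, t] = new_tok
    # The masked context is identical in both directions, so the forward and
    # reverse proposal probabilities come from the same softmax.
    log_alpha = (log_potential_sum(proposal) - log_potential_sum(ids)
                 + math.log(probs[ids[0, t]].item())
                 - math.log(probs[new_tok].item()))
    accept_prob = math.exp(min(log_alpha, 0.0))
    return proposal if torch.rand(1).item() < accept_prob else ids


ids = tokenizer("the cat sat on the mat", return_tensors="pt").input_ids
for _ in range(20):
    ids = mh_step(ids)
print(tokenizer.decode(ids[0]))
```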

It was pointed out by our colleagues at NYU, Chandel, Joseph, and Ranganath, that there is an error in the recent technical report "BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model" written by Alex Wang and me. The mistake was entirely mine, not Alex's. There is an upcoming paper by Chandel, Joseph, and Ranganath (2019) with a much better and correct interpretation and analysis of BERT, which I will share and refer to in an updated version of our technical report as soon as it appears publicly.

Here, I would like to briefly point out this mistake for the readers of our technical report.
In Eq. 1 of Wang & Cho (2019), the log-potential was defined with the index $t$, as shown below:
$$\log \phi_t(X) = \begin{cases} \text{1h}(x_t)^\top f_\theta(X_{\backslash t}), & \text{if } \text{MASK} \notin X_{1:t-1} \cup X_{t+1:T} \\ 0, & \text{otherwise} \end{cases}$$
Based on this formulation, I mistakenly thought that $x_t$ would not be affected by the other log-potentials, i.e., $\log \phi_{t' \neq t}(X)$. This is clearly not true, because $x_t$ is used as an input to the BERT $f_\theta$ in every other potential.
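To spell out the missing step: since every $f_\theta(X_{\backslash t'})$ with $t' \neq t$ receives $x_t$ as an input, the true conditional of this MRF must include all $T$ potentials, not just $\phi_t$ (writing $X = (x_t, X_{\backslash t})$):
$$p(x_t | X_{\backslash t}) = \frac{\prod_{t'=1}^{T} \phi_{t'}(x_t, X_{\backslash t})}{\sum_{\hat{x}_t} \prod_{t'=1}^{T} \phi_{t'}(\hat{x}_t, X_{\backslash t})},$$
and this does not reduce to the single-potential expression below.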
In other words, the following equation (Eq. 3 in the technical report) is not a conditional distribution of the MRF defined with the log-potential above:
$$p(x_t | X_{\backslash t}) = \frac{1}{Z(X_{\backslash t})} \exp\left(\text{1h}(x_t)^\top f_{\theta}(X_{\backslash t})\right).$$
It is, however, a conditional distribution over the $t$-th token given all the other tokens, although there may not be a joint distribution from which these conditionals can be (easily) derived. I believe this characterization of what BERT learns will be key to Chandel, Joseph, and Ranganath (2019), and I will update this blog post (along with the technical report) when it becomes available.
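For concreteness, here is a minimal sketch of computing this per-token conditional, again assuming the Hugging Face transformers library and bert-base-uncased; the sentence and position are arbitrary:

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()


@torch.no_grad()
def token_conditional(ids, t):
    """p(x_t | X without t): replace the t-th token with [MASK], run BERT,
    and normalize the logits at that position."""
    masked = ids.clone()
    masked[0, t] = tokenizer.mask_token_id
    return torch.softmax(model(masked).logits[0, t], dim=-1)


ids = tokenizer("the cat sat on the mat", return_tensors="pt").input_ids
probs = token_conditional(ids, t=2)  # conditional over the second word, "cat"
print(probs[ids[0, 2]].item())       # probability BERT assigns to "cat"
```

Each such conditional is perfectly well defined on its own; the problem is that nothing forces the set of them, one per position, to be the conditionals of any single joint distribution.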
Apologies to everyone who read our technical report and thought of BERT as an MRF. It is a generative model and must speak, but it is not an MRF. Sincere apologies again.
