BERT has a Mouth and must Speak, but it is not an MRF

Update on June 9, 2021: I still don't know the fate of the hypothetical manuscript by Chandel et al., but I've noticed that Kartik Goyal, Chris Dyer, and Taylor Berg-Kirkpatrick have fixed the issue raised in this blog post (https://arxiv.org/abs/2106.02736) by using BERT's single-token conditional as a proposal distribution in Metropolis-Hastings, in order to sample from the distribution defined by the potentials built from the logits of BERT's single-token conditionals.
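As an aside, here is a minimal sketch of that general recipe, written by me for illustration rather than taken from Goyal et al.'s code. It assumes the Hugging Face transformers library and bert-base-uncased (both my choices), and it is far too slow to be practical, since each step re-scores the whole sequence:

```python
import math

import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()


@torch.no_grad()
def log_potential_sum(ids):
    """Unnormalized target: mask each position in turn and read off the
    logit of the actual token, i.e., sum_t 1h(x_t)^T f_theta(X without t)."""
    total = 0.0
    for t in range(1, ids.size(1) - 1):  # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[0, t] = tokenizer.mask_token_id
        total += model(masked).logits[0, t, ids[0, t]].item()
    return total


@torch.no_grad()
def mh_step(ids):
    """One Metropolis-Hastings step: propose a new token at a random position
    from BERT's single-token conditional, then accept or reject against the
    product-of-potentials target."""
    t = torch.randint(1, ids.size(1) - 1, (1,)).item()
    masked = ids.clone()
    masked[0, t] = tokenizer.mask_token_id
    probs = torch.softmax(model(masked).logits[0, t], dim=-1)
    new_tok = torch.multinomial(probs, 1).item()
    proposal = ids.clone()
    proposal[0, t] = new_tok
    # The masked context is identical in both directions, so the forward and
    # reverse proposal probabilities come from the same softmax.
    log_alpha = (log_potential_sum(proposal) - log_potential_sum(ids)
                 + math.log(probs[ids[0, t]].item())
                 - math.log(probs[new_tok].item()))
    accept_prob = math.exp(min(log_alpha, 0.0))
    return proposal if torch.rand(1).item() < accept_prob else ids


ids = tokenizer("the cat sat on the mat", return_tensors="pt").input_ids
for _ in range(20):
    ids = mh_step(ids)
print(tokenizer.decode(ids[0]))
```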

It was pointed out by our colleagues at NYU, Chandel, Joseph, and Ranganath, that there is an error in the recent technical report "BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model" written by Alex Wang and me. The mistake was entirely mine, not Alex's. There is an upcoming paper by Chandel, Joseph, and Ranganath (2019) with a much better and correct interpretation and analysis of BERT, which I will share and refer to in an updated version of our technical report as soon as it appears publicly.

Here, I would like to briefly point out this mistake for the readers of our technical report.
In Eq. 1 of Wang & Cho (2019), the log-potential was defined with the index $t$, as shown below:
$$\log \phi_t(X) = \begin{cases} \text{1h}(x_t)^\top f_\theta(X_{\backslash t}), & \text{if } \text{MASK} \notin X_{1:t-1} \cup X_{t+1:T} \\ 0, & \text{otherwise} \end{cases}$$
Based on this formulation, I mistakenly thought that $x_t$ would not be affected by the other log-potentials, i.e., $\log \phi_{t' \neq t}(X)$. This is clearly not true, because $x_t$ is used as an input to the BERT $f_\theta$ in every other potential.
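To spell out the missing step: since every $f_\theta(X_{\backslash t'})$ with $t' \neq t$ receives $x_t$ as an input, the true conditional of this MRF must include all $T$ potentials, not just $\phi_t$ (writing $X = (x_t, X_{\backslash t})$):
$$p(x_t | X_{\backslash t}) = \frac{\prod_{t'=1}^{T} \phi_{t'}(x_t, X_{\backslash t})}{\sum_{\hat{x}_t} \prod_{t'=1}^{T} \phi_{t'}(\hat{x}_t, X_{\backslash t})},$$
and this does not reduce to the single-potential expression below.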
In other words, the following equation (Eq. 3 in the technical report) is not a conditional distribution of the MRF defined with the log-potential above:
$$p(x_t | X_{\backslash t}) = \frac{1}{Z(X_{\backslash t})} \exp\left(\text{1h}(x_t)^\top f_{\theta}(X_{\backslash t})\right).$$
It is, however, a conditional distribution over the $t$-th token given all the other tokens, although there may not be a joint distribution from which these conditionals can be (easily) derived. I believe this characterization of what BERT learns will be key to Chandel, Joseph, and Ranganath (2019), and I will update this blog post (along with the technical report) when it becomes available.
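For concreteness, here is a minimal sketch of computing this per-token conditional, again assuming the Hugging Face transformers library and bert-base-uncased; the sentence and position are arbitrary:

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()


@torch.no_grad()
def token_conditional(ids, t):
    """p(x_t | X without t): replace the t-th token with [MASK], run BERT,
    and normalize the logits at that position."""
    masked = ids.clone()
    masked[0, t] = tokenizer.mask_token_id
    return torch.softmax(model(masked).logits[0, t], dim=-1)


ids = tokenizer("the cat sat on the mat", return_tensors="pt").input_ids
probs = token_conditional(ids, t=2)  # conditional over the second word, "cat"
print(probs[ids[0, 2]].item())       # probability BERT assigns to "cat"
```

Each such conditional is perfectly well defined on its own; the problem is that nothing forces the set of them, one per position, to be the conditionals of any single joint distribution.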
Apologies to everyone who read our technical report and thought of BERT as an MRF. It is a generative model and must speak, but it is not an MRF. Sincere apologies again.
