# BERT has a Mouth and must Speak, but it is not an MRF

Our colleagues at NYU, Chandel, Joseph, and Ranganath, pointed out that there is an error in the recent technical report "BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model," written by Alex Wang and me. The mistake was entirely mine, not Alex's. There is an upcoming paper by Chandel, Joseph, and Ranganath (2019) with a much better and correct interpretation and analysis of BERT, which I will share and refer to in an updated version of our technical report as soon as it appears publicly.

Here, I would like to briefly point out this mistake for the readers of our technical report.
In Eq. 1 of Wang & Cho (2019), the log-potential was defined with the index t, as shown below:
$$\log \phi_t(X) = \begin{cases} \text{1h}(x_t)^\top f_\theta(X_{\backslash t}),&\text{ if } \text{MASK} \notin X_{1:t-1} \cup X_{t+1:T} \\ 0, &\text{ otherwise} \end{cases}$$
Based on this formulation, I mistakenly thought that $x_t$ would not be affected by the other log-potentials, i.e., $\log \phi_{t' \neq t}(X)$. This is clearly not true, because $x_t$ is used as an input to BERT's $f_\theta$ in those other potentials.
In other words, the following equation (Eq. 3 in the technical report) is not a conditional distribution of the MRF defined with the log-potential above:
$$p(x_t | X_{\backslash t}) = \frac{1}{Z(X_{\backslash t})} \exp\left(\text{1h}(x_t)^\top f_{\theta}(X_{\backslash t})\right).$$
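To make the gap between the two quantities concrete, here is a minimal numerical sketch. It uses a random lookup table as a hypothetical stand-in for BERT's $f_\theta$ (a real $f_\theta$ is a transformer; the MASK case of Eq. 1 is ignored for simplicity). The true MRF conditional must re-evaluate every log-potential for each candidate value of $x_t$, because $x_t$ appears in all of them; renormalizing $\phi_t$ alone, as in the equation above, generally gives a different distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
V, T = 5, 3  # toy vocabulary size and sequence length

# Hypothetical stand-in for BERT's f_theta: a random table mapping the
# T-1 other tokens to a vector of V logits for the held-out position.
table = rng.normal(size=(T, V, V, V))

def f_theta(X, t):
    """Logits for position t computed from the other tokens (x_t excluded)."""
    a, b = [X[s] for s in range(T) if s != t]
    return table[t, a, b]

def log_potential(X, t):
    """log phi_t(X) = 1h(x_t)^T f_theta(X without x_t)  (MASK case ignored)."""
    return f_theta(X, t)[X[t]]

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def true_mrf_conditional(X, t):
    """The MRF conditional: x_t enters EVERY log-potential, so all T of
    them must be recomputed for each candidate value of x_t."""
    scores = np.empty(V)
    for v in range(V):
        Xv = list(X)
        Xv[t] = v
        scores[v] = sum(log_potential(Xv, s) for s in range(T))
    return softmax(scores)

def naive_conditional(X, t):
    """Eq. 3 of the report: renormalizes phi_t alone, wrongly assuming x_t
    does not affect the other log-potentials."""
    return softmax(f_theta(X, t))

X = [1, 2, 3]
print(true_mrf_conditional(X, 0))
print(naive_conditional(X, 0))  # the two distributions differ
```

With a random table the two conditionals essentially never coincide; they agree only in the degenerate case where $x_t$ happens not to influence the other potentials.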
It is, however, a conditional distribution over the t-th token given all the other tokens, although there may not be a joint distribution from which these conditionals can be (easily) derived. I believe this characterization of what BERT learns will be a key to Chandel, Joseph, and Ranganath (2019), and I will update this blog post (along with the technical report) when it becomes available.
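A toy example, with made-up numbers, shows how a set of per-variable conditionals can fail to come from any joint. For two binary variables, any joint $q$ fixes the odds ratio $q(0,0)q(1,1)/(q(0,1)q(1,0))$, and that ratio can be read off from either conditional alone, since the marginals cancel. If the two conditionals imply different odds ratios, no joint is consistent with both:

```python
import numpy as np

# Two binary variables with separately specified conditionals, loosely
# analogous to the per-token conditionals BERT learns by masking.
# Rows index the conditioning value, columns the predicted value.
p1_given_2 = np.array([[0.9, 0.1],   # p(x1 | x2 = 0)
                       [0.2, 0.8]])  # p(x1 | x2 = 1)
p2_given_1 = np.array([[0.5, 0.5],   # p(x2 | x1 = 0)
                       [0.5, 0.5]])  # p(x2 | x1 = 1)

# If a joint q existed, q(x1, x2) = p(x1|x2) m(x2) = p(x2|x1) n(x1), so the
# odds ratio q(0,0)q(1,1) / (q(0,1)q(1,0)) computed from either conditional
# alone (the marginals m, n cancel) would have to agree.
or_from_1 = (p1_given_2[0, 0] * p1_given_2[1, 1]) / (p1_given_2[0, 1] * p1_given_2[1, 0])
or_from_2 = (p2_given_1[0, 0] * p2_given_1[1, 1]) / (p2_given_1[0, 1] * p2_given_1[1, 0])
print(or_from_1, or_from_2)  # 36.0 vs 1.0: no joint is consistent with both
```

Nothing stops a model trained on per-position masking objectives from learning exactly this kind of mutually inconsistent set of conditionals.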
Apologies to everyone who read our technical report and thought of BERT as an MRF. It is a generative model and must speak, but it is not an MRF. Sincere apologies again.