It was pointed out by our colleagues at NYU, Chandel, Joseph, and Ranganath, that there is an error in the recent technical report "BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model" written by Alex Wang and me. The mistake was entirely mine, not Alex's. There is an upcoming paper by Chandel, Joseph, and Ranganath (2019) with a much better and correct interpretation and analysis of BERT, which I will share and refer to in an updated version of our technical report as soon as it appears publicly.

Here, I would like to briefly point out this mistake for the readers of our technical report.

In Eq. 1 of Wang & Cho (2019), the log-potential was defined with the index *t*, as shown below:

$$\log \phi_t(X) = \begin{cases} \text{1h}(x_t)^\top f_\theta(X_{\backslash t}), &\text{ if } \text{MASK} \notin X_{1:t-1} \cup X_{t+1:T} \\ 0, &\text{ otherwise} \end{cases}$$

Based on this formulation, I mistakenly thought that $x_t$ would not be affected by the other log-potentials, i.e., $\log \phi_{t' \neq t}(X)$. This is clearly not true, because $x_t$ is used as an *input* to the BERT $f_\theta$ in those potentials. In other words, the following equation (Eq. 3 in the technical report) is **not** a conditional distribution of the **MRF** defined with the log-potential above:

$$p(x_t | X_{\backslash t}) = \frac{1}{Z(X_{\backslash t})} \exp(\text{1h}(x_t)^\top f_{\theta}(X_{\backslash t})).$$
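To make the mistake concrete, here is a small numerical sketch. It uses a hypothetical lookup-table stand-in for BERT's $f_\theta$ (the table, vocabulary size, and sequence length are made up for illustration): for a toy MRF with the potentials above, the true conditional $p(x_t | X_{\backslash t})$, obtained by renormalizing the joint over $x_t$, differs from the softmax in Eq. 3, precisely because $x_t$ also enters the contexts $X_{\backslash t'}$ of the other potentials.

```python
# Toy check (hypothetical f_theta, NOT the actual BERT): the true conditional
# of the MRF with potentials log phi_t(X) = 1h(x_t)^T f_theta(X_{\t}) differs
# from softmax(f_theta(X_{\t})) (Eq. 3), because x_t also appears inside the
# contexts of the other potentials.
import numpy as np

V, T = 3, 3  # toy vocabulary size and sequence length (made up)
rng = np.random.default_rng(0)
table = rng.normal(size=(V,) * (T - 1) + (V,))  # f_theta as a random lookup table

def f_theta(context):      # context: tuple of T-1 tokens, i.e. X_{\t}
    return table[context]  # length-V vector of logits over x_t

def log_joint(X):          # unnormalized log p(X) = sum_t log phi_t(X)
    return sum(f_theta(X[:t] + X[t + 1:])[X[t]] for t in range(T))

X, t = (0, 1, 2), 1
ctx = X[:t] + X[t + 1:]

# Eq. 3: softmax over f_theta(X_{\t}) alone -- ignores the other potentials.
eq3 = np.exp(f_theta(ctx))
eq3 /= eq3.sum()

# True MRF conditional: renormalize the full joint over x_t.
joint = np.exp([log_joint(X[:t] + (v,) + X[t + 1:]) for v in range(V)])
true_cond = joint / joint.sum()

print(eq3, true_cond)  # generally not equal
```

Both vectors are valid distributions over $x_t$, but they coincide only if the other potentials happened not to depend on $x_t$, which is exactly the assumption that fails here.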

It is, however, a conditional distribution over the *t*-th token given all the other tokens, although there may not be a joint distribution from which these conditionals can be (easily) derived. I believe this characterization of what BERT learns will be a key part of Chandel, Joseph, and Ranganath (2019), and I will update this blog post (along with the technical report) when it becomes available.

Apologies to everyone who read our technical report and thought of BERT as an MRF. BERT is a **generative model and it must speak**, **but it is not an MRF**. Sincere apologies again.