Researchers at Google AI Language recently published a paper on BERT (Bidirectional Encoder Representations from Transformers). The paper caused a stir in the Machine Learning community by presenting state-of-the-art results on a wide range of NLP tasks, including Question Answering (SQuAD v1.1), Natural Language Inference (MNLI), and others.

BERT’s key technical innovation is applying bidirectional training of the Transformer, a popular attention model, to language modeling. This differs from previous approaches, which read a text sequence either from left to right or with combined left-to-right and right-to-left training. The paper shows that a bidirectionally trained language model develops a deeper sense of language context than a single-direction language model. To make this possible, the researchers describe a novel technique called Masked LM (MLM), which allows bidirectional training in models where it was previously impossible.

How BERT Works

BERT uses the Transformer, an attention mechanism that learns the contextual relationships between words (or sub-words) in a text. The vanilla Transformer includes two distinct mechanisms: an encoder and a decoder. Since BERT’s goal is to generate a language model, only the encoder mechanism is needed. Detailed information about the Transformer can be found in a paper by Google.

Unlike directional models, which read the text sequentially (left to right or right to left), the Transformer encoder reads the entire sequence of words at once. This allows the model to learn the context of a word from all of its surroundings, both to the left and to the right of the word. As a result, it is considered bidirectional, although it would be more accurate to describe it as non-directional.
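The difference between directional and non-directional reading can be illustrated with attention weights. The following is a minimal NumPy sketch, not BERT's actual implementation: the function name `attention_weights` and the uniform toy scores are assumptions made for illustration. A left-to-right model hides future positions from each word, while a Transformer encoder lets every word attend to the whole sequence.

```python
import numpy as np

def attention_weights(scores, causal=False):
    """Row-wise softmax over attention scores, optionally hiding future positions."""
    scores = scores.astype(float).copy()
    if causal:
        # Left-to-right model: position i may only attend to positions <= i.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores[future] = -np.inf
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

seq_len = 4
scores = np.zeros((seq_len, seq_len))  # uniform toy scores (illustrative only)

bidirectional = attention_weights(scores)               # encoder-style: sees all words
left_to_right = attention_weights(scores, causal=True)  # directional: sees only the past

print(bidirectional[0])   # first word attends to every position
print(left_to_right[0])   # first word attends only to itself
```

In the bidirectional case each row spreads its weight over all four positions, whereas in the left-to-right case row i places zero weight on every position after i.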

The chart below gives a high-level overview of the Transformer encoder. The input is a sequence of tokens, which are first embedded into vectors and then processed in the neural network. The output is a sequence of vectors of size H, in which each vector corresponds to the input token at the same index.
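To make the shapes concrete, here is a rough sketch of a single self-attention head in NumPy. It is a toy stand-in for the full encoder: the vocabulary size, hidden size H, token ids, and random weights are all assumptions chosen for illustration, and a real encoder stacks many such layers along with feed-forward blocks and normalization.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden = 100, 8            # toy sizes; hidden plays the role of H

token_ids = np.array([5, 17, 42, 9])   # hypothetical ids for a 4-token input
embedding = rng.normal(size=(vocab_size, hidden))
x = embedding[token_ids]               # embed tokens: shape (seq_len, H)

# One self-attention head with random (untrained) projection weights.
w_q, w_k, w_v = (rng.normal(size=(hidden, hidden)) for _ in range(3))
q, k, v = x @ w_q, x @ w_k, x @ w_v

scores = q @ k.T / np.sqrt(hidden)             # (seq_len, seq_len)
scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)

output = weights @ v                   # one H-dimensional vector per input token
print(x.shape, output.shape)           # (4, 8) (4, 8)
```

The output has the same number of rows as the input, so the vector at index i is the contextual representation of the token at index i.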

It can be difficult to define a prediction goal when training language models. In general, most models predict the next word in the sequence (e.g., “The child came home from __”). This is a directional approach that limits the ability to learn context. BERT overcomes this limitation by using two training strategies:

Masked LM (MLM)

Before word sequences are fed into BERT, a portion of the words in each sequence (15% in the paper) is replaced with a [MASK] token. Based on the context provided by the other, non-masked words in the sequence, the model then attempts to predict the original value of the masked words. Technically, the output words are predicted as follows:


  • Apply a classification layer on top of the encoder output.
  • Transform the output vectors into the vocabulary dimension by multiplying them by the embedding matrix.
  • Compute the probability of each word in the vocabulary using softmax.
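The masking step and the three prediction steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's exact recipe: the toy vocabulary, the helper names `mask_tokens` and `mlm_head`, the tanh activation in the classification layer, and the use of raw embeddings as a stand-in for real encoder output are all hypothetical. The 15% masking rate is the figure reported in the BERT paper.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["[PAD]", "[MASK]", "the", "child", "came", "home", "from", "school"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}
hidden = 8  # toy hidden size

def mask_tokens(token_ids, mask_id, mask_prob, rng):
    """Replace a random subset of tokens with mask_id; also return the chosen positions."""
    token_ids = np.asarray(token_ids)
    chosen = rng.random(token_ids.shape) < mask_prob
    return np.where(chosen, mask_id, token_ids), chosen

def mlm_head(encoder_output, w_cls, b_cls, embedding):
    """Classification layer, projection to vocabulary size, then softmax."""
    h = np.tanh(encoder_output @ w_cls + b_cls)   # step 1 (tanh is an assumption)
    logits = h @ embedding.T                      # step 2: (seq_len, H) x (H, V)
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)  # step 3: softmax

ids = np.array([token_to_id[t] for t in "the child came home from school".split()])
masked_ids, chosen = mask_tokens(ids, token_to_id["[MASK]"], 0.15, rng)

embedding = rng.normal(size=(len(vocab), hidden))
encoder_output = embedding[masked_ids]  # stand-in for the real encoder output
w_cls = rng.normal(size=(hidden, hidden))
b_cls = np.zeros(hidden)

probs = mlm_head(encoder_output, w_cls, b_cls, embedding)
print(probs.shape)  # (6, 8): one distribution over the vocabulary per token
```

During training, only the rows of `probs` at the masked positions would contribute to the loss, since those are the words the model is asked to recover.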

Next Sentence Prediction (NSP)

During training, BERT receives pairs of sentences as input and learns to predict whether the second sentence in the pair is the one that follows the first in the original document. In 50% of the input pairs, the second sentence is the actual subsequent sentence from the original document, while in the other 50% it is a random sentence from the corpus. The assumption is that the random sentence will be disconnected from the first sentence.
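One way such training pairs might be assembled is sketched below. The function name `make_nsp_pairs` and the example sentences are hypothetical, and the per-pair coin flip only approximates the 50/50 split on small inputs.

```python
import random

def make_nsp_pairs(document, corpus, rng):
    """Build (sentence_a, sentence_b, is_next) examples from consecutive sentences."""
    pairs = []
    for first, true_next in zip(document, document[1:]):
        if rng.random() < 0.5:
            # Positive example: the real following sentence.
            pairs.append((first, true_next, True))
        else:
            # Negative example: a random corpus sentence, assumed unrelated to `first`.
            candidates = [s for s in corpus if s != true_next]
            pairs.append((first, rng.choice(candidates), False))
    return pairs

document = [
    "The child came home from school.",
    "She dropped her bag by the door.",
    "Then she asked for a snack.",
]
corpus = [
    "Penguins cannot fly.",
    "The stock market closed higher.",
    "Rain is expected on Tuesday.",
]

pairs = make_nsp_pairs(document, corpus, random.Random(0))
for sentence_a, sentence_b, is_next in pairs:
    print(is_next, "|", sentence_a, "->", sentence_b)
```

Each tuple carries the label the model is trained to predict: `True` when the second sentence really follows the first, `False` when it was drawn at random.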

To help the model distinguish between the two sentences, the input is processed as follows before entering the model:

  • A [CLS] token is inserted at the beginning of the first sentence, and a [SEP] token is inserted at the end of each sentence.
  • A sentence embedding indicating Sentence A or Sentence B is added to each token. Sentence embeddings are similar in concept to token embeddings with a vocabulary of 2.
  • A positional embedding is added to each token to indicate its position in the sequence. The concept and implementation of positional embedding are described in the Transformer paper.
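The input processing described above can be sketched as follows. This is a minimal illustration under assumptions: the helper `build_input`, the example tokens, the toy sizes, and the random embedding tables (standing in for trained parameters) are all hypothetical. It shows the special tokens being inserted and the three embeddings being summed per token.

```python
import numpy as np

def build_input(sentence_a, sentence_b):
    """Insert [CLS]/[SEP] tokens and mark each token as Sentence A (0) or B (1)."""
    tokens = ["[CLS]"] + sentence_a + ["[SEP]"] + sentence_b + ["[SEP]"]
    segment_ids = [0] * (len(sentence_a) + 2) + [1] * (len(sentence_b) + 1)
    return tokens, segment_ids

tokens, segment_ids = build_input(
    ["my", "dog", "is", "cute"],
    ["he", "likes", "play", "##ing"],
)

# Toy vocabulary built from the example itself (illustrative only).
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
token_ids = np.array([vocab[t] for t in tokens])
positions = np.arange(len(tokens))

rng = np.random.default_rng(0)
hidden, max_len = 8, 16
token_emb = rng.normal(size=(len(vocab), hidden))
segment_emb = rng.normal(size=(2, hidden))        # "vocabulary" of 2: A or B
position_emb = rng.normal(size=(max_len, hidden))

# Each token's input vector is the sum of its three embeddings.
inputs = token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]

print(tokens)
print(segment_ids)
print(inputs.shape)  # (11, 8)
```

The segment table has only two rows, which is what "a vocabulary of 2" refers to: every token looks up either the Sentence A row or the Sentence B row.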