Devlin et al 2018 - BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Notes for devlin18_bert.

BERT is based on the transformer model from Vaswani et al 2017 - Attention Is All You Need (vaswani17_atten_is_all_you_need). However, only the encoder component of the transformer is relevant to BERT.
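As a concrete picture of the encoder-only setup, here is a minimal PyTorch sketch (not the authors' code; the hyperparameters are illustrative, not BERT-Base's configuration) in which a stack of self-attention encoder layers turns token ids into contextual embeddings:

  import torch
  import torch.nn as nn

  vocab_size, d_model, n_heads, n_layers, max_len = 30000, 256, 4, 4, 128

  token_emb = nn.Embedding(vocab_size, d_model)   # learned token embeddings
  pos_emb = nn.Embedding(max_len, d_model)        # learned positional embeddings

  encoder_layer = nn.TransformerEncoderLayer(
      d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model, batch_first=True
  )
  encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)

  tokens = torch.randint(0, vocab_size, (2, max_len))   # (batch, seq_len) of token ids
  positions = torch.arange(max_len).unsqueeze(0)        # (1, seq_len)
  h = encoder(token_emb(tokens) + pos_emb(positions))   # contextual embeddings h_i
  print(h.shape)                                        # torch.Size([2, 128, 256])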

1 Training

1.1 Masked LM (MLM)

For each input token \(w_i\), the transformer encoder produces a contextual embedding \(h_i\), the output of the stacked self-attention and feed-forward layers. Each \(h_i\) is fed through a classification layer to obtain a distribution \(o_i\) over the vocabulary. During MLM training, a random subset of the input tokens (15% in the paper) is masked, usually by substituting the [MASK] token. The model learns to predict the masked words: at each masked position the objective is the KL divergence between \(o_i\) and a one-hot encoding of the original word.
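A minimal sketch of this objective in PyTorch, assuming contextual embeddings of shape (batch, seq_len, d_model) from an encoder like the one above; the names MASK_ID and mask_prob are illustrative, not taken from the paper's code:

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  vocab_size, d_model, seq_len, MASK_ID, mask_prob = 30000, 256, 128, 103, 0.15

  tokens = torch.randint(0, vocab_size, (2, seq_len))   # original token ids
  mask = torch.rand(tokens.shape) < mask_prob           # positions to mask
  inputs = tokens.clone()
  inputs[mask] = MASK_ID                                # replace with the [MASK] token

  h = torch.randn(2, seq_len, d_model)                  # stand-in for encoder(inputs)

  to_vocab = nn.Linear(d_model, vocab_size)             # classification layer
  logits = to_vocab(h)                                  # o_i before the softmax

  # Cross-entropy against the original token at each masked position; with a
  # one-hot target this coincides with the KL divergence mentioned above.
  loss = F.cross_entropy(logits[mask], tokens[mask])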

1.2 Next Sentence Prediction

The model receives two sentences, which are either adjacent in the source text or unrelated (the second drawn at random from the corpus). The model must decide which is the case.

The input begins with a [CLS] token, and the two sentences are separated by a [SEP] token. The tokens of each sentence are additionally augmented with a learned sentence embedding (sentence A or sentence B). The encoder output at the [CLS] position is fed through a classifier layer whose output distribution predicts whether the sentences are sequential or unrelated.
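A sketch of how such an input pair could be assembled and classified, again in PyTorch; the token ids (CLS_ID, SEP_ID), sentence contents, and dimensions are illustrative stand-ins:

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  d_model, CLS_ID, SEP_ID = 256, 101, 102
  sent_a = torch.tensor([5, 17, 42])                 # token ids of sentence A
  sent_b = torch.tensor([7, 9])                      # token ids of sentence B

  tokens = torch.cat([torch.tensor([CLS_ID]), sent_a,
                      torch.tensor([SEP_ID]), sent_b,
                      torch.tensor([SEP_ID])])
  # Segment ids: 0 for [CLS] + sentence A + [SEP], 1 for sentence B + [SEP].
  segments = torch.cat([torch.zeros(len(sent_a) + 2, dtype=torch.long),
                        torch.ones(len(sent_b) + 1, dtype=torch.long)])
  segment_emb = nn.Embedding(2, d_model)
  seg_vectors = segment_emb(segments)                # added to the token embeddings

  h = torch.randn(1, len(tokens), d_model)           # stand-in for the encoder output
  cls_output = h[:, 0]                               # embedding at the [CLS] position

  nsp_head = nn.Linear(d_model, 2)                   # sequential vs. unrelated
  label = torch.tensor([1])                          # 1 = sentences are adjacent
  loss = F.cross_entropy(nsp_head(cls_output), label)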

2 Helpful links

Bibliography

  • [devlin18_bert] Devlin, Chang, Lee & Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, CoRR, (2018). link.
  • [vaswani17_atten_is_all_you_need] Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser & Polosukhin, Attention Is All You Need, CoRR, (2017). link.