
Lu et al. 2019 - ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

Notes for lu19_vilber.

ViLBERT is based on the BERT language model and learns joint representations of image content and natural language.

1 Input

  • A stream of image regions \(v_1, v_2, \ldots\), where \(v_0\) is the special <IMG> token
  • A stream of text tokens \(w_1, w_2, \ldots\), where \(w_0\) is the <CLS> token

Text input passes through stacked transformer layers. Image input passes through an image embedding layer.
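As a rough sketch of the image side of this step, the snippet below embeds detector region features together with a 5-d spatial encoding (normalized box corners plus the fraction of image area covered), roughly following the paper's description of the image embedding. The hidden sizes, the random projection matrices, the function name embed_image_regions, and the mean-pooled stand-in for \(v_0\) are illustrative assumptions, not the authors' code.

  import numpy as np

  rng = np.random.default_rng(0)

  D_IMG = 2048    # region feature size from the detector (Faster R-CNN in the paper)
  D_MODEL = 1024  # visual-stream hidden size (illustrative)

  # Hypothetical learned projections (randomly initialized stand-ins).
  W_region = rng.normal(0.0, 0.02, size=(D_IMG, D_MODEL))   # projects region features
  W_spatial = rng.normal(0.0, 0.02, size=(5, D_MODEL))      # projects 5-d box encoding

  def embed_image_regions(region_feats, boxes, image_wh):
      """region_feats: (N, D_IMG) detector features; boxes: (N, 4) as (x1, y1, x2, y2)
      in pixels; image_wh: (W, H). Returns an (N+1, D_MODEL) image stream whose first
      row plays the role of the <IMG> token v_0."""
      W, H = image_wh
      x1, y1, x2, y2 = boxes.T
      # 5-d spatial encoding: normalized corners plus fraction of image area covered.
      spatial = np.stack([x1 / W, y1 / H, x2 / W, y2 / H,
                          (x2 - x1) * (y2 - y1) / (W * H)], axis=1)
      v = region_feats @ W_region + spatial @ W_spatial
      v0 = v.mean(axis=0, keepdims=True)  # mean-pooled stand-in for the <IMG> token
      return np.concatenate([v0, v], axis=0)

  # Toy example: 36 detected regions in a 640x480 image.
  feats = rng.normal(size=(36, D_IMG))
  tl = rng.uniform(0, 400, size=(36, 2))                       # top-left corners
  boxes = np.concatenate([tl, tl + rng.uniform(20, 80, size=(36, 2))], axis=1)
  print(embed_image_regions(feats, boxes, image_wh=(640, 480)).shape)  # (37, 1024)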

2 Co-transformer

Next, the text representations pass through a co-transformer layer that attends to the image representations; likewise, the image representations pass through a co-transformer layer that attends to the text representations. Recall that in a transformer model, the \(l\)th layer takes as input the output of the \((l-1)\)th layer, \(h^{(l-1)}_0, h^{(l-1)}_1, \ldots\). These are multiplied by the key, query, and value projection matrices \(W^K\), \(W^Q\), and \(W^V\) to obtain \(K\), \(Q\), and \(V\). In the co-transformer layers, each stream has its own projections:

  • \(W^K_w\), \(W^Q_w\), and \(W^V_w\) for the text
  • \(W^K_v\), \(W^Q_v\), and \(W^V_v\) for the images

In the co-transformer for images, the image queries \(Q_v\) attend over the text keys and values, \(K_w\) and \(V_w\), to produce \(h_v\). The co-transformer for text is symmetric: the text queries \(Q_w\) attend over \(K_v\) and \(V_v\) to produce \(h_w\).

Both \(h_v\) and \(h_w\) then pass through a normal (within-modality) transformer layer as well.
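To make the key/value swap concrete, here is a minimal single-head numpy sketch of the co-attention step. The shared dimension D, the randomly initialized projection matrices, and the function names are my own choices; multi-head attention, residual connections, layer norm, and the subsequent within-modality transformer block are omitted (the last is noted in a comment).

  import numpy as np

  rng = np.random.default_rng(0)
  D = 256  # shared hidden size for this sketch (the two streams need not share a size in the full model)

  def softmax(x, axis=-1):
      x = x - x.max(axis=axis, keepdims=True)
      e = np.exp(x)
      return e / e.sum(axis=axis, keepdims=True)

  def attention(Q, K, V):
      """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
      return softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V

  def make_qkv():
      # Randomly initialized stand-ins for one stream's learned W^Q, W^K, W^V.
      return {name: rng.normal(0.0, 0.02, size=(D, D)) for name in ("Q", "K", "V")}

  P_v, P_w = make_qkv(), make_qkv()  # image-stream and text-stream projections

  def co_attention(h_v, h_w):
      """Single-head co-attention: each stream's queries attend over the OTHER
      stream's keys and values."""
      Q_v, K_v, V_v = (h_v @ P_v[n] for n in ("Q", "K", "V"))
      Q_w, K_w, V_w = (h_w @ P_w[n] for n in ("Q", "K", "V"))
      new_h_v = attention(Q_v, K_w, V_w)  # image representations conditioned on text
      new_h_w = attention(Q_w, K_v, V_v)  # text representations conditioned on image
      return new_h_v, new_h_w

  # Toy example: 37 image regions (incl. <IMG>) and 12 text tokens (incl. <CLS>).
  h_v, h_w = rng.normal(size=(37, D)), rng.normal(size=(12, D))
  h_v_out, h_w_out = co_attention(h_v, h_w)
  print(h_v_out.shape, h_w_out.shape)  # (37, 256) (12, 256)
  # In ViLBERT, each output then also passes through a standard within-modality
  # transformer block (attention, residuals, feed-forward), omitted here.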

Bibliography

  • [lu19_vilber] Lu, J., Batra, D., Parikh, D., & Lee, S. (2019). ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. CoRR.