Visual Question Answering using LXMERT

Akhil Kasare
7 min read · May 28, 2021

Vision-and-language reasoning requires an understanding of visual concepts and language semantics and, in particular, the ability to align and find relationships between these two modalities. The authors propose the LXMERT framework to learn these connections between language and vision. It contains three encoders: an object-relationship encoder, a language encoder, and a cross-modality encoder. To give the model the ability to link visual and language semantics, it is pre-trained with five diverse representative tasks:

(1) Masked cross-modality language modeling

(2) Masked object prediction via RoI-feature regression

(3) Masked object prediction via detected-label classification

(4) Cross-modality matching

(5) Image question answering

These multi-modal pre-training tasks help the model learn not only the connections within each modality, but also the connections across modalities.

1. Introduction

For visual-content understanding, several backbone models have been developed that show their effectiveness on large vision datasets; the generalizability of these pre-trained (especially ImageNet-pre-trained) backbones is demonstrated by fine-tuning them on different tasks. In terms of language understanding, strong progress has recently been made towards building a universal backbone model with large-scale contextualized language model pre-training. Despite these influential single-modality works, large-scale pre-training and fine-tuning studies for the modality pair of vision and language are still under-developed. In order to better learn the cross-modal alignments between vision and language, the model is pre-trained with five diverse representative tasks:

(1) masked cross-modality language modeling,

(2) masked object prediction via RoI-feature regression,

(3) masked object prediction via detected-label classification,

(4) cross-modality matching, and

(5) image question answering.

Further, to show the generalizability of our pre-trained model, we fine-tune LXMERT on a challenging visual reasoning task, Natural Language for Visual Reasoning for Real (NLVR2).

2. Model Architecture

We build our cross-modality model with self-attention and cross-attention layers, following recent progress in designing natural language processing models. As shown in the architecture figure, our model takes two inputs: an image and its related sentence (e.g., a caption or a question). Via careful design and combination of these self-attention and cross-attention layers, our model is able to generate language representations, image representations, and cross-modality representations from the inputs.

2.1 Input Embeddings

The input embedding layers in LXMERT convert the inputs (i.e., an image and a sentence) into two sequences of features: word-level sentence embeddings and object-level image embeddings. These embedding features will be further processed by the latter encoding layers.

Word-Level Sentence Embeddings:

A sentence is first split into words (w1, …, wn) of length n by the same WordPiece tokenizer used in BERT. Next, as shown in the figure, the word wi and its index i are projected to vectors by embedding sub-layers and then added together to produce the index-aware word embeddings.
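To make this concrete, here is a minimal PyTorch sketch of such an index-aware word embedding layer. The vocabulary size, hidden dimension, and maximum length below are illustrative assumptions, not LXMERT's exact configuration.

```python
import torch
import torch.nn as nn

class WordLevelEmbedding(nn.Module):
    def __init__(self, vocab_size=30522, hidden_dim=768, max_len=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden_dim)   # embeds the token w_i
        self.index_emb = nn.Embedding(max_len, hidden_dim)     # embeds the position i
        self.layer_norm = nn.LayerNorm(hidden_dim)

    def forward(self, token_ids):                               # token_ids: (batch, n)
        n = token_ids.size(1)
        positions = torch.arange(n, device=token_ids.device).unsqueeze(0)
        # Word and index embeddings are added to form index-aware word embeddings.
        return self.layer_norm(self.word_emb(token_ids) + self.index_emb(positions))
```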

Object-Level Image Embeddings:

Instead of using the feature map output by a convolutional neural network, we take the features of detected objects as the embeddings of the image. Each object is represented by its position feature (i.e., bounding box coordinates) and its 2048-dimensional region-of-interest (RoI) feature.

In addition to providing spatial information for visual reasoning, the inclusion of positional information is necessary for our masked object prediction pre-training task. Since the image embedding layer and the following attention layers are agnostic to the absolute indices of their inputs, the order of the objects is not specified.
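Below is a minimal sketch of this object-level embedding, assuming a 768-dimensional hidden size and following the paper's recipe of projecting the RoI feature and the box coordinates separately and averaging the two; layer names are my own.

```python
import torch
import torch.nn as nn

class ObjectLevelEmbedding(nn.Module):
    def __init__(self, roi_dim=2048, pos_dim=4, hidden_dim=768):
        super().__init__()
        self.feat_proj = nn.Linear(roi_dim, hidden_dim)   # projects the RoI feature f_j
        self.pos_proj = nn.Linear(pos_dim, hidden_dim)    # projects the box coordinates p_j
        self.feat_norm = nn.LayerNorm(hidden_dim)
        self.pos_norm = nn.LayerNorm(hidden_dim)

    def forward(self, roi_feats, boxes):
        # roi_feats: (batch, num_objects, 2048), boxes: (batch, num_objects, 4)
        return (self.feat_norm(self.feat_proj(roi_feats)) +
                self.pos_norm(self.pos_proj(boxes))) / 2
```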

2.2 Encoders

We build our encoders, i.e., the language encoder, the object-relationship encoder, and the cross-modality encoder, mostly on the basis of two kinds of attention layers: self-attention layers and cross-attention layers.

Single-Modality Encoders

After the embedding layers, we first apply two transformer encoders, i.e., a language encoder and an object-relationship encoder, each of which focuses on a single modality (i.e., language or vision). Different from BERT, which applies the transformer encoder only to language inputs, we apply it to vision inputs as well (and to cross-modality inputs, as described below).
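A minimal sketch of these two single-modality encoders is shown below: standard transformer encoder layers applied separately to the language and vision sequences. The layer counts (9 language layers, 5 object-relationship layers) follow the paper; the feed-forward size and activation are PyTorch defaults rather than LXMERT's BERT-style settings.

```python
import torch.nn as nn

hidden_dim, num_heads = 768, 12

# Language encoder: self-attention + feed-forward layers over word embeddings.
language_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(hidden_dim, num_heads, batch_first=True),
    num_layers=9,
)

# Object-relationship encoder: the same kind of layers over object embeddings.
object_relationship_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(hidden_dim, num_heads, batch_first=True),
    num_layers=5,
)

# lang_feats = language_encoder(word_embeddings)               # (batch, n, hidden)
# visn_feats = object_relationship_encoder(object_embeddings)  # (batch, m, hidden)
```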

Cross-Modality Encoder

Each cross-modality layer in the cross-modality encoder consists of two self-attention sub-layers, one bi-directional cross-attention sub-layer, and two feed-forward sub-layers. Inside the k-th layer, the bi-directional cross-attention sub-layer ("Cross") is applied first; it contains two unidirectional cross-attention sub-layers, one from language to vision and one from vision to language. The query and context vectors are the outputs of the (k-1)-th layer. The cross-attention sub-layer is used to exchange information and align the entities between the two modalities in order to learn joint cross-modality representations.
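The sketch below illustrates the dataflow of one such cross-modality layer using standard PyTorch attention modules. Residual connections and layer norms around the cross-attention are simplified here, so this is an illustration of the idea rather than the exact LXMERT implementation.

```python
import torch.nn as nn

class CrossModalityLayer(nn.Module):
    def __init__(self, hidden_dim=768, num_heads=12):
        super().__init__()
        # Bi-directional cross-attention: language->vision and vision->language.
        self.lang_cross = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.visn_cross = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        # Per-modality self-attention + feed-forward sub-layers.
        self.lang_self = nn.TransformerEncoderLayer(hidden_dim, num_heads, batch_first=True)
        self.visn_self = nn.TransformerEncoderLayer(hidden_dim, num_heads, batch_first=True)

    def forward(self, lang, visn):
        # Language queries attend to the vision context, and vice versa.
        lang_x, _ = self.lang_cross(query=lang, key=visn, value=visn)
        visn_x, _ = self.visn_cross(query=visn, key=lang, value=lang)
        # Self-attention and feed-forward applied separately to each modality.
        return self.lang_self(lang_x), self.visn_self(visn_x)
```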

2.3 Output Representations

The LXMERT cross-modality model has three outputs: one each for language, vision, and cross-modality. The language and vision outputs are the feature sequences produced by the cross-modality encoder, and the cross-modality output is taken from a special [CLS] token prepended to the sentence.

3. Pre-Training

In order to learn a better initialization which understands connections between vision and language, we pre-train our model with different modality pre-training tasks on a large aggregated dataset.

Language Task: Masked Cross-Modality LM

On the language side, we use the masked cross-modality language model (LM) task. The task setup is almost the same as in BERT: words are randomly masked with a probability of 0.15, and the model is asked to predict these masked words. However, unlike BERT, where masked words are predicted only from the non-masked words in the language modality, LXMERT, with its cross-modality architecture, can also predict masked words from the vision modality, which helps resolve ambiguity. Hence, the task helps build connections from the vision modality to the language modality, and we refer to it as the masked cross-modality LM to emphasize this difference.
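A simplified sketch of the 15% token masking is shown below. All selected tokens are replaced with [MASK] here (BERT's 80/10/10 replacement scheme is omitted), and the [MASK] id and ignore index follow BERT conventions; this is an assumption, not code from the LXMERT repository.

```python
import torch

def mask_tokens(token_ids, mask_token_id=103, mask_prob=0.15, ignore_index=-100):
    labels = token_ids.clone()
    # Pick roughly 15% of the positions at random.
    mask = torch.rand(token_ids.shape, device=token_ids.device) < mask_prob
    masked_ids = token_ids.clone()
    masked_ids[mask] = mask_token_id      # replace chosen tokens with [MASK]
    labels[~mask] = ignore_index          # only masked positions contribute to the loss
    return masked_ids, labels
```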

Vision Task: Masked Object Prediction

As shown in the top branch of the figure, we pre-train the vision side by randomly masking objects (i.e., masking RoI features with zeros) with a probability of 0.15 and asking the model to predict properties of these masked objects. Similar to the language task (i.e., the masked cross-modality LM), the model can infer the masked objects either from visible objects or from the language modality. Therefore, we perform two sub-tasks: RoI-feature regression regresses the masked object's RoI feature with an L2 loss, and detected-label classification learns the labels of masked objects with a cross-entropy loss.
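The sketch below shows the two masked-object-prediction heads and their losses. The head layers and names are illustrative assumptions, and the 1,600 detected-label classes (the object classes of the Faster R-CNN detector) are assumed rather than taken from the code.

```python
import torch.nn as nn
import torch.nn.functional as F

class MaskedObjectHeads(nn.Module):
    def __init__(self, hidden_dim=768, roi_dim=2048, num_detect_labels=1600):
        super().__init__()
        self.feat_head = nn.Linear(hidden_dim, roi_dim)              # RoI-feature regression
        self.label_head = nn.Linear(hidden_dim, num_detect_labels)   # detected-label classification

    def forward(self, masked_obj_states, target_feats, target_labels):
        # L2 loss on the regressed RoI features of the masked objects.
        reg_loss = F.mse_loss(self.feat_head(masked_obj_states), target_feats)
        # Cross-entropy loss on the detector labels of the masked objects.
        cls_loss = F.cross_entropy(self.label_head(masked_obj_states), target_labels)
        return reg_loss + cls_loss
```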

4. Fine-tuning

Fine-tuning is fast and robust. We only perform the modifications necessary to adapt the model to each task. We use a learning rate of 1e-5 or 5e-5, a batch size of 32, and fine-tune the model from the pre-trained parameters for 4 epochs.
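As a rough illustration, the sketch below fine-tunes the Hugging Face port of LXMERT with a VQA-style answer head on top of the pooled cross-modality output. It shows a single training step with random stand-ins for the detector output; in practice the RoI features and boxes come from a Faster R-CNN over the image, training runs for 4 epochs with batch size 32, and the 3,129-way answer vocabulary is an assumption (a common choice for VQA v2.0).

```python
import torch
import torch.nn.functional as F
from torch.optim import AdamW
from transformers import LxmertModel, LxmertTokenizer

tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
model = LxmertModel.from_pretrained("unc-nlp/lxmert-base-uncased")
# Answer classifier on top of the pooled cross-modality ([CLS]) representation.
answer_head = torch.nn.Linear(model.config.hidden_size, 3129)
optimizer = AdamW(list(model.parameters()) + list(answer_head.parameters()), lr=5e-5)

# One illustrative step with random stand-ins for the detector output.
enc = tokenizer(["What color is the cat?"], return_tensors="pt")
roi_feats = torch.randn(1, 36, 2048)   # 36 detected objects, 2048-d RoI features
boxes = torch.rand(1, 36, 4)           # normalized bounding-box coordinates
answer_label = torch.tensor([42])      # index into the answer vocabulary

out = model(input_ids=enc["input_ids"],
            attention_mask=enc["attention_mask"],
            visual_feats=roi_feats,
            visual_pos=boxes)
logits = answer_head(out.pooled_output)
loss = F.cross_entropy(logits, answer_label)
loss.backward()
optimizer.step()
```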

5. Datasets

We use three datasets for evaluating the LXMERT framework: VQA v2.0, GQA, and NLVR2.

6. Empirical Comparison Results

We compare our single-model results with the previous best published results on the VQA/GQA test-standard sets and the NLVR2 public test set. On NLVR2, our LXMERT model improves consistency ("Cons") to 42.1% (i.e., by 3.5 times).

7. Analysis

We analyze our LXMERT framework by comparing it with some alternative choices or by excluding certain model components/pre-training strategies.

8. Disadvantages of BERT

  • Pre-training and fine-tuning are inconsistent.
  • There is no masking in the fine-tuning phase.
  • Pre-training masks 15% of the input tokens.
  • The model file is very large.
  • On its own, BERT handles only the language modality, so it cannot perform visual question answering.

9. Conclusion

We presented a cross-modality framework, LXMERT, for learning the connections between vision and language. We build the model based on Transformer encoders and our novel cross-modality encoder. This model is then pre-trained with diverse pre-training tasks on a large-scale dataset of image-and-sentence pairs. Empirically, we show state-of-the-art results on two image QA datasets (i.e., VQA and GQA) and show the model generalizability with a 22% improvement on the challenging visual reasoning dataset of NLVR2. We also show the effectiveness of several model components and training methods via detailed analysis and ablation studies.


10. GitHub:

https://github.com/akhiilkasare/Visual-Question-Answering-Using-LXMERT
