SqueezeBERT: What can computer vision teach NLP about efficient neural networks?

Akhil Kasare
4 min read · Dec 26, 2020

INTRODUCTION:

The research paper that introduces SqueezeBERT describes a mobile NLP neural network architecture that runs 4.3x faster than BERT-base on a Google Pixel 3 smartphone while reaching accuracy comparable to MobileBERT on the GLUE benchmark tasks. A key difference between MobileBERT and SqueezeBERT, Iandola told VentureBeat in an interview, is the use of grouped convolutions to increase speed and efficiency, a technique first introduced in 2012.

However, today's highly accurate NLP neural network models such as BERT and RoBERTa are notably computationally expensive, with BERT-base taking 1.7 seconds to classify a text snippet on a Pixel 3 smartphone.

In short, what sets SqueezeBERT apart from MobileBERT is "the use of grouped convolutions to improve speed and efficiency."

Prerequisite: Basic understanding of BERT.

Core Idea:

  • SqueezeBERT builds on techniques from SqueezeNAS, a neural architecture search (NAS) model.
  • SqueezeBERT runs with lower latency on a mobile device (a Google Pixel 3 smartphone) than BERT-base, MobileBERT, and several other efficient NLP models, while maintaining competitive accuracy.

Detailed Description:

Conventionally, BERT-derived networks consist of three basic stages:

  • The embedding, which converts preprocessed words (represented as integer-valued tokens) into learned feature vectors of floating-point numbers;
  • The encoder, which consists of a stack of self-attention and other layers; and
  • The classifier, which produces the final output of the network.

The proposed architecture, SqueezeBERT, uses grouped convolutions. It is much like BERT-base, but its position-wise fully connected (PFC) layers are implemented as convolutions, and grouped convolutions are used for many of these layers. Each block of its encoder consists of a self-attention module containing three PFC layers, plus three more PFC layers that form the feed-forward network (FFN1, FFN2, and FFN3), where these layers have different channel dimensions.
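To make this concrete, here is a minimal PyTorch sketch (my own illustration, not the paper's code) of why a position-wise fully connected layer is mathematically equivalent to a 1D convolution with kernel size 1, and of how the `groups` argument turns that convolution into a grouped convolution; the group count of 4 is a hypothetical choice:

```python
import torch
import torch.nn as nn

batch, seq_len, hidden = 2, 128, 768
x = torch.randn(batch, seq_len, hidden)

# Position-wise fully connected (PFC) layer, as in BERT-base.
pfc = nn.Linear(hidden, hidden)

# Equivalent 1D convolution with kernel size 1; Conv1d expects (batch, channels, seq_len).
conv = nn.Conv1d(hidden, hidden, kernel_size=1)
conv.weight.data = pfc.weight.data.unsqueeze(-1)   # (out, in) -> (out, in, 1)
conv.bias.data = pfc.bias.data

out_pfc = pfc(x)                                    # (batch, seq_len, hidden)
out_conv = conv(x.transpose(1, 2)).transpose(1, 2)  # same shape, same values
print(torch.allclose(out_pfc, out_conv, atol=1e-5))  # True

# A grouped convolution splits the channels into groups and convolves each group
# independently, shrinking the weight count and FLOPs of that layer.
grouped = nn.Conv1d(hidden, hidden, kernel_size=1, groups=4)
print(sum(p.numel() for p in conv.parameters()),
      sum(p.numel() for p in grouped.parameters()))
```

Because each group mixes only the channels within its own slice, a grouped layer with g groups uses roughly 1/g of the weights and multiply-adds of the dense version, which is where the speedup comes from.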

Pre-Training Dataset:

For pretraining, a combination of Wikipedia and BooksCorpus is used, with 3% of the combined dataset held out as a test set. Masked Language Modeling (MLM) and Sentence Order Prediction (SOP) are used as the pretraining tasks.
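As a rough illustration of the MLM objective only (a generic sketch, not the paper's exact masking recipe, which BERT-style models extend by sometimes keeping or randomizing selected tokens), roughly 15% of input positions are replaced with a [MASK] token and the model is trained to recover the originals:

```python
import torch

def mask_tokens(input_ids, mask_token_id, mlm_prob=0.15):
    """Generic MLM masking sketch: pick ~15% of positions, replace them with
    [MASK], and return labels that are ignored (-100) everywhere else."""
    labels = input_ids.clone()
    masked = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~masked] = -100                 # only masked positions contribute to the loss
    corrupted = input_ids.clone()
    corrupted[masked] = mask_token_id      # (real BERT also keeps/randomizes some tokens)
    return corrupted, labels

# Toy usage with made-up token ids; 103 is [MASK] in BERT's WordPiece vocabulary.
ids = torch.randint(1000, 30000, (2, 16))
masked_ids, labels = mask_tokens(ids, mask_token_id=103)
print(masked_ids[0])
print(labels[0])
```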

Fine-Tuning Data:

SqueezeBERT and several baselines are fine-tuned and evaluated on the General Language Understanding Evaluation (GLUE) set of tasks. This benchmark covers a diverse collection of nine NLU tasks.

GLUE has become the standard evaluation benchmark for much of NLP research.

A model's performance across the GLUE tasks is likely to give a good estimate of how well that model generalizes, particularly to other text classification tasks.
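For reference, individual GLUE tasks can be pulled down with the Hugging Face datasets library; the snippet below uses MRPC as an arbitrary example and is tooling for reproducing the benchmark, not code from the paper:

```python
from datasets import load_dataset

# Each GLUE task is available as a named configuration; MRPC is just one of the nine.
mrpc = load_dataset("glue", "mrpc")
print(mrpc)               # train / validation / test splits
print(mrpc["train"][0])   # {'sentence1': ..., 'sentence2': ..., 'label': ..., 'idx': ...}
```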

Training Methodology:

Most recent research on efficient NLP networks reports results for models trained with bells and whistles such as distillation, adversarial training, and transfer learning across the GLUE tasks.

Because these training strategies are not standardized across papers, it is hard to separate the benefits of the model architecture from the contribution of the training approach to the final accuracy number.

So SqueezeBERT is first trained with a simple training recipe, and is then trained again with distillation and other applicable techniques, which makes it possible to attribute gains to the architecture or to the training tricks.
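The paper describes its own distillation recipe in detail; the sketch below only illustrates the general idea of soft-target distillation (a temperature-softened teacher distribution blended with the usual hard-label loss), with all hyperparameters hypothetical:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Generic knowledge-distillation loss: blend the cross-entropy on the
    ground-truth labels with a KL term that pushes the student toward the
    teacher's softened output distribution."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return alpha * hard + (1.0 - alpha) * soft

# Toy usage with random logits for a 3-class task.
s = torch.randn(8, 3, requires_grad=True)
t = torch.randn(8, 3)
y = torch.randint(0, 3, (8,))
loss = distillation_loss(s, t, y)
loss.backward()
```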

Results:

In short, SqueezeBERT runs 4.3x faster than BERT-base on the Pixel 3 smartphone while matching MobileBERT's accuracy on the GLUE tasks.

Tips:

  • SqueezeBERT is a model with absolute position embeddings so it’s usually advised to pad the inputs on the right rather than the left.
  • SqueezeBERT is similar to BERT and therefore relies on the masked language modeling (MLM) objective. It is therefore efficient at predicting masked tokens and at NLU in general, but is not optimal for text generation. Models trained with a causal language modeling (CLM) objective are better in that regard.
  • For best results when fine-tuning on sequence classification tasks, it is recommended to start from the squeezebert/squeezebert-mnli-headless checkpoint (see the sketch below).
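A minimal way to start from that checkpoint, assuming the Hugging Face transformers integration of SqueezeBERT, might look like this (the two-label setup is an arbitrary example; a fresh classification head is initialized on top of the headless weights):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the MNLI-pretrained, headless checkpoint recommended above as the
# starting point for fine-tuning on a sequence classification task.
name = "squeezebert/squeezebert-mnli-headless"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

# Right-padding (the tokenizer's default) matches the absolute-position-embedding tip above.
inputs = tokenizer("SqueezeBERT runs quickly on mobile devices.",
                   padding="max_length", truncation=True, max_length=64,
                   return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (1, 2); meaningless until the new head is fine-tuned
```

From here, the model can be fine-tuned on the target task as usual.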

Applications:

  • Token classification
  • Question answering
  • Text classification tasks

Conclusion:

To conclude, the paper studies how grouped convolutions, a popular technique in the design of efficient CV neural networks, can be applied to NLP. First, it shows that the position-wise fully connected layers of self-attention networks can be implemented with mathematically equivalent 1D convolutions. Building on this, it proposes SqueezeBERT, an efficient NLP model that implements most of the layers of its self-attention encoder with 1D grouped convolutions. Extensions of this idea, such as U-Nets, as well as modifying channel sizes (hidden size) instead of or in addition to sequence length, would also be promising directions. One especially promising direction is downsampling strategies that decrease the sequence length of the activations in the self-attention network as the layers progress.

References:

Iandola, F. N., Shaw, A. E., Krishna, R., & Keutzer, K. (2020). SqueezeBERT: What can computer vision teach NLP about efficient neural networks? arXiv:2006.11316.
