Overview of Transformer Architectures

June 23, 2024 · AI Insights · Anar Lavrenov

Introduction

Today, I want to explain the existing transformer architectures using the examples of three major language models: Google BERT, OpenAI GPT, and Google T5. But first, let me introduce a few words about Transformers.

A Few Words About Transformers

[Figure: general Transformer architecture diagram, source: AIML.com]

It’s important to understand that everything in the world of neural-network NLP before 2017 amounts to just 0.01% of what we see today; quality language modeling was not yet a serious topic of discussion. The turning point was the Attention mechanism. I discussed its first form in the article "Bahdanau Attention: The Layer that Changed the World of Neural Networks," and newer versions of the layer appeared later. In 2017 the Transformer architecture arrived, built on these Attention blocks, and it has been the driving force behind modern model development ever since.

The general architecture of a transformer looks as follows: there are two major blocks, the encoder and the decoder. The encoder receives the so-called source text as input. Its main task is to understand the text as deeply as possible: the relationships between the words and the meaning the text carries for forming a response. For example, consider the sentence: "Hello, I am a linguist, I know many languages, for instance..." We want the model to continue this sentence appropriately, not with "... spaghetti and dog." To do this, the input text is vectorized, passed through several Attention layers (to capture the relationships between words), and then through feed-forward layers that filter out irrelevant information.

At this stage, we are not generating anything yet, but we already understand the first part of the text very well.
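To make this concrete, here is a minimal sketch of a single encoder block in PyTorch: self-attention followed by a feed-forward sub-layer, with residual connections and layer normalization. The dimensions and layer choices below are illustrative, not those of any particular model.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """A simplified Transformer encoder block: self-attention + feed-forward."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Every token attends to every other token of the source text.
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)
        # The feed-forward sub-layer then transforms each position independently.
        x = self.norm2(x + self.ff(x))
        return x

# Example: a batch of 2 "sentences", 10 token vectors each, already embedded.
tokens = torch.randn(2, 10, 512)
encoded = EncoderBlock()(tokens)
print(encoded.shape)  # torch.Size([2, 10, 512])
```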

BERT

BERT (Bidirectional Encoder Representations from Transformers) is the first prominent representative of Transformers. It is built exclusively on encoder blocks (Base: 12 layers, Large: 24). It was pre-trained on a large corpus (English Wikipedia and BooksCorpus) for several days non-stop with two objectives:

  • Predicting a masked (missing) word in a sentence so that it fits the context well (masked language modeling).
  • Analyzing two sentences and determining whether the second one actually follows the first (next sentence prediction).

The result turned out very well because, thanks to its encoder blocks, the model understands the text and the relationships between words. The purpose of this model was never to generate long sequences of text, but to understand it. Many Data Scientists use BERT to obtain quality embeddings without having to build and train an embedding model themselves.
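For example, a common way to obtain such embeddings is through the Hugging Face transformers library with the publicly available bert-base-uncased checkpoint; the mean pooling at the end is just one popular choice, shown here for illustration.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")  # encoder-only, 12 layers

sentence = "Hello, I am a linguist, I know many languages."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per token, shape (1, seq_len, 768).
token_embeddings = outputs.last_hidden_state
# A simple sentence embedding: average the token vectors.
sentence_embedding = token_embeddings.mean(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])
```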

GPT

GPT (Generative Pre-trained Transformer), unlike BERT, excludes the use of encoder blocks and uses exclusively decoder blocks, which OpenAI calls transformer blocks. Let's understand the difference between these so-called transformer blocks and classic decoder blocks.

In the classic transformer architecture, the decoder block is responsible for the target text (i.e., the text to be generated). It is assisted by the encoder block, which provides information about the input text. This works as follows: the encoder’s output is fed as the keys and values into a MultiHeadAttention layer of the decoder (cross-attention). This lets the decoder look back at the source text while generating the next words in the sequence.
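A minimal sketch of that cross-attention step: the decoder states serve as the queries, while the encoder output supplies the keys and values (the dimensions below are purely illustrative).

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

encoder_output = torch.randn(1, 10, d_model)   # representation of the source text
decoder_states = torch.randn(1, 4, d_model)    # target tokens generated so far

# Queries come from the decoder; keys and values come from the encoder output.
context, weights = cross_attn(query=decoder_states,
                              key=encoder_output,
                              value=encoder_output)
print(context.shape)   # torch.Size([1, 4, 512]): one context vector per target token
print(weights.shape)   # torch.Size([1, 4, 10]): attention over the source tokens
```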

Since GPT completely lacks an encoder block, the MultiHeadAttention layer from the decoder to the encoder is also absent. The model in this form is autoregressive and was trained on the task of generating the next word in a sequence.

The absence of an encoder block, and consequently of the decoder-to-encoder attention layer, makes training this model significantly faster and allows for a much larger number of decoder layers than in the classic transformer architecture. While BERT Large has 24 encoder blocks, GPT-3 has 96 decoder blocks.
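A quick illustration of this autoregressive behavior, using the publicly available gpt2 checkpoint from Hugging Face as a stand-in for the GPT family: the model simply keeps predicting the next token given everything generated so far.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")           # decoder-only model
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Hello, I am a linguist, I know many languages, for instance"
inputs = tokenizer(prompt, return_tensors="pt")

# Autoregressive decoding: repeatedly predict the next token and append it.
output_ids = model.generate(**inputs,
                            max_new_tokens=15,
                            do_sample=False,
                            pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```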

T5

T5 (Text-to-Text Transfer Transformer), in turn, uses both encoder and decoder blocks. The number of blocks varies with the model size (Google released a range of T5 models from Small to XXL). The Base model includes 12 encoder and 12 decoder blocks, the Large model has 24 of each, and the larger models even more.

T5 was trained on many tasks, all framed as text-to-text: generation, summarization, translation, and so on, with the task specified as a prefix of the input text. It was then fine-tuned on even more tasks and texts.
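Because every task is cast as text-to-text, switching tasks only means changing the prefix of the input string. A minimal sketch with the public t5-small checkpoint (the translation prefix shown is one of those used in the original T5 setup):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")   # encoder + decoder

# The task is specified directly in the input text.
text = "translate English to German: I am a linguist and I know many languages."
inputs = tokenizer(text, return_tensors="pt")

output_ids = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

# Summarization uses the same model; only the prefix changes:
# "summarize: <long article text>"
```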

It is worth noting that the T5 architecture is the closest to the classic transformer architecture, which includes both encoder and decoder blocks.

In terms of raw results, it is safe to say that GPT significantly outperforms T5 in quality. However, this should not be attributed to the architecture alone. Building a transformer is a very ambitious undertaking in which the result depends not only on the chosen architecture but also on the amount and quality of the data, the language, the model parameters, text preprocessing, training settings, and monitoring of the results.

In this article, I have outlined from memory the main types of transformer architectures using three models as examples and compared them with one another.
