Bahdanau Attention: The Layer That Changed the World of Neural Networks

July 12, 2024 / AI Insights / Anar Lavrenov

Introduction

Today, I want to talk about a layer whose invention marked the end of the "old world" of neural networks and the beginning of the "new world," whose products you use every day: GPT, diffusion models, and so on.

To understand the importance of this layer in neural network architecture, we need to look at the technological flagships that preceded it. At the time, the flagship was recurrent networks, the best of which was the LSTM (long short-term memory) layer. These architectures were well suited for text processing: sentiment analysis, text generation, machine translation, and so on. And this is not surprising, because "under the hood" of one neuron (in an LSTM it is called a cell) is one of my favorite architectures, which looks like this:

There are four gates here, each with its own function:

  • Forget gate: takes the new word in the sequence plus the previous history of the text up to that word (the short-term memory). These are concatenated, multiplied by the corresponding weight matrix (trained during the network's training cycles), and passed through a sigmoid. The sigmoid here acts as a door that opens fully or partially. When the sigmoid equals 1, the gate opens fully and the signal passes on, in other words the information is kept in long-term memory. When the sigmoid equals 0, the information is not useful and is forgotten: this word is not important in the context of our task and contributes nothing to the model's accuracy.
  • Input gate & Update Gate: are responsible for remembering the hidden state of the cell. Here we try to understand: with what force we need to remember the word that is coming now. When a new word arrives, we concatenate it with the previous history of the text and get a matrix, which then passes through a sigmoid. The sigmoid also acts as a gate. Then the obtained result is added to the results of the Update Gate, which takes the result of the Input Gate, and also has its own matrix of trained weights. These gates are very important because they affect what will be added to the layer's long-term memory and what will be forgotten.
  • Output gate: is responsible for what goes into short-term memory based on the long-term memory. Short-term memory is also important because it plays a role in shaping long-term memory at the next step (a code sketch of all four gates follows below).

I apologize for the technical terminology, an occupational hazard :) The LSTM cell diagram is something I can look at for hours and admire :)
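For readers who prefer code, here is a minimal sketch of the gate arithmetic described above, written out by hand with hypothetical sizes and random weights. In a real network these weights are learned, and in practice you would simply use a library cell such as PyTorch's nn.LSTMCell, which implements the same gates:

```python
import torch

# Hand-written sketch of one LSTM cell step (hypothetical sizes, random weights).
torch.manual_seed(0)
input_size, hidden_size = 16, 32

# One trained weight matrix (and bias) per gate.
W_f = torch.randn(hidden_size, input_size + hidden_size)  # forget gate
W_i = torch.randn(hidden_size, input_size + hidden_size)  # input gate
W_g = torch.randn(hidden_size, input_size + hidden_size)  # update (candidate) gate
W_o = torch.randn(hidden_size, input_size + hidden_size)  # output gate
b_f, b_i, b_g, b_o = (torch.zeros(hidden_size) for _ in range(4))

x_t = torch.randn(input_size)       # embedding of the new word
h_prev = torch.zeros(hidden_size)   # short-term memory (hidden state)
c_prev = torch.zeros(hidden_size)   # long-term memory (cell state)

z = torch.cat([x_t, h_prev])            # concatenate new word + text history
f_t = torch.sigmoid(W_f @ z + b_f)      # forget gate: 1 = keep, 0 = forget
i_t = torch.sigmoid(W_i @ z + b_i)      # input gate: how strongly to remember
g_t = torch.tanh(W_g @ z + b_g)         # update gate: candidate values
o_t = torch.sigmoid(W_o @ z + b_o)      # output gate

c_t = f_t * c_prev + i_t * g_t          # new long-term memory
h_t = o_t * torch.tanh(c_t)             # new short-term memory
```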

However, this layer has some drawbacks, a couple of which are simply huge:

  1. Training a network based on RNNs, and LSTMs in particular, takes very long, since the gradients (i.e., what the network learns with) must pass through every one of the cells described above, and there can be a lot of them.
  2. Gradients can vanish, again because of the long chain of cells in the layer. No optimizer will help here, not even Adam; no one has ever fully solved this problem (see the small sketch after this list).
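To make the second drawback concrete, here is a small illustration of my own (not tied to any particular model): backpropagating through a long chain of recurrent tanh steps multiplies many factors smaller than one, so the gradient reaching the first step shrinks toward zero. The sizes and the weight scale are arbitrary, chosen only to make the effect visible:

```python
import torch

# Toy recurrence: 200 tanh steps with a small random recurrent matrix.
torch.manual_seed(0)
hidden_size, seq_len = 32, 200
W = torch.randn(hidden_size, hidden_size) * 0.05   # hypothetical recurrent weights

h0 = torch.randn(hidden_size, requires_grad=True)  # state at the very first step
state = h0
for _ in range(seq_len):
    state = torch.tanh(W @ state)

state.sum().backward()
print(h0.grad.norm())   # essentially zero: the gradient vanished on the way back
```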

And here the Bahdanau Attention layer comes to the rescue of recurrent networks.

Before diving into the details of this breakthrough layer, let me state its advantages over recurrent layers right away:

  • There are no recurrent cells separated in time. Everything is based on fully connected (linear) layers, which are quite fast.
  • And, most importantly, the quality of networks built with the Attention layer is significantly better, especially with large amounts of data.

Here is what the first version of the Attention layer consists of (follow my doodles in parallel :) )

To understand the importance of each individual token when decoding from encoder to decoder with recurrent networks, the following is done:

  • Instead of the state of the entire RNN encoder network, they take the output state of each cell (each responsible for a specific token), that is, the short-term memory of each RNN cell, which I wrote about above.
  • They compute a dot product between Sd1, the hidden state of the first RNN decoder cell, and each such output state of the encoder cells (he1, he2, ...).
  • This gives a vector of scores (real numbers), for example [1.4, 5.6, 2.3]. It is passed through a softmax, which yields a probability distribution: values between 0 and 1 that sum to 1. This distribution is also fed into the RNN decoder so that it understands the weight (importance) of each encoder token when generating (translating) new tokens. It answers the question: how close is each encoder token to the token that needs to be generated (or translated) at the current iteration?
  • Then each softmax weight is multiplied by the corresponding encoder hidden state and the results are summed: w1·he1 + w2·he2 + ... The resulting embedding is concatenated with the hidden state of the first RNN decoder cell, Sd1. This concatenation is then passed to a fully connected layer. That is the architecture of Bahdanau Attention for recurrent networks (see the sketch after this list).
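Here is a minimal code sketch of these steps, with hypothetical sizes; s_d1 plays the role of the decoder hidden state Sd1, and h_enc holds the per-token encoder states he1, he2, he3:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# One attention step for the first decoder position (hypothetical sizes).
torch.manual_seed(0)
hidden_size, src_len = 32, 3

h_enc = torch.randn(src_len, hidden_size)   # he1, he2, he3: per-token encoder states
s_d1 = torch.randn(hidden_size)             # Sd1: hidden state of the first decoder cell

scores = h_enc @ s_d1                       # dot product per encoder token, e.g. [1.4, 5.6, 2.3]
weights = F.softmax(scores, dim=0)          # probability distribution that sums to 1
context = weights @ h_enc                   # weighted sum: w1*he1 + w2*he2 + w3*he3

fc = nn.Linear(2 * hidden_size, hidden_size)   # fully connected layer
out = fc(torch.cat([context, s_d1]))           # concat(context, Sd1) -> linear
```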

Simply put, Bahdanau Attention changed the approach to using RNN networks. Imagine we are doing machine translation. We have an encoder that must extract all the information from the source text (in Ukrainian) and a decoder that must generate the new text (in English) word by word while "looking into" the encoder. This "looking into" the encoder is precisely what the Bahdanau Attention layer provides. The encoder and decoder are still based on the same recurrent networks, but they are significantly improved thanks to the Attention layer.
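To tie it together, here is a hypothetical usage sketch of the decoding loop: at every step the decoder "looks into" the encoder through fresh attention weights and then updates its own state. The names and the GRUCell stand-in are my illustration, not the article's exact setup, and for brevity the decoder receives only the context vector (a real decoder would also receive the previously generated target token):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical decoding loop: recompute attention over the encoder states
# at every target step, then update the recurrent decoder state.
torch.manual_seed(0)
hidden_size, src_len, tgt_len = 32, 3, 5

h_enc = torch.randn(src_len, hidden_size)             # encoder states for the source (Ukrainian) tokens
decoder_cell = nn.GRUCell(hidden_size, hidden_size)   # stand-in recurrent decoder cell

s_d = h_enc[-1]                                       # one common way to initialize the decoder
for _ in range(tgt_len):                              # generate tgt_len target (English) tokens
    weights = F.softmax(h_enc @ s_d, dim=0)           # how relevant each source token is right now
    context = weights @ h_enc                         # weighted sum of encoder states
    s_d = decoder_cell(context.unsqueeze(0), s_d.unsqueeze(0)).squeeze(0)
```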

This article was devoted only to the first versions of the Attention layer family. It does not yet mention transformers and large language models (LLMs). However, I found it interesting to share the technical side of what started all the magic.
