9  Attention for Neural Networks

These notes are based on Josh Starmer’s video Attention for Neural Networks, Clearly Explained!!!

Issues with the basic encoder-decoder architecture

When we unroll the LSTMs in the encoder-decoder architecture, we end up compressing the entire input sentence into a single context vector. This is fine for short phrases, but it can break down for longer input sentences: words that appear early in the input may be forgotten by the time the context vector is built. For example, this would be a big problem if the first word of “Don’t eat the pizza while petting the cat” were forgotten.
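
To make the bottleneck concrete, here is a minimal PyTorch sketch (the layer sizes, variable names, and the 7-word input are illustrative assumptions, not from the video): no matter how long the input sentence is, only the encoder’s final hidden and cell states are handed to the decoder.

```python
import torch
import torch.nn as nn

# Illustrative sizes only.
embedding_dim, hidden_dim = 8, 16
encoder = nn.LSTM(embedding_dim, hidden_dim)

# A 7-word input sentence, already embedded: (seq_len, batch, embedding_dim).
sentence = torch.randn(7, 1, embedding_dim)

outputs, (h_n, c_n) = encoder(sentence)

# Only the final hidden and cell states are passed to the decoder, so the
# whole sentence has to squeeze through these two fixed-size vectors.
print(h_n.shape, c_n.shape)   # torch.Size([1, 1, 16]) torch.Size([1, 1, 16])
```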

The Novelty of Attention

Instead of letting information be forgotten by the LSTMs, attention adds new paths from the earlier encoder units (those closer to the start of the input) directly to the decoder, so their outputs can serve as additional inputs at every decoding step. Adding these extra paths is not a simple process, but the general approach is shared by all encoder-decoder models with attention.

The Process of Attention

[Figure: Attention step added to Seq2Seq]

Just as with plain encoder-decoder networks, you keep feeding the predicted word back into the decoder and unrolling the network until the predicted word is <EOS>.
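
Below is a minimal sketch of one decoding step with attention, assuming PyTorch. All names, sizes, and the tiny 4-word vocabulary are made-up assumptions, and the decoder LSTM output is faked with random vectors so the snippet stays short; the point is the attention math (dot-product scores, softmax weights, weighted sum, concatenation) and the unroll-until-<EOS> loop.

```python
import torch
import torch.nn.functional as F

hidden_dim, vocab = 16, ["<EOS>", "don't", "eat", "pizza"]

# Pretend these came from an encoder LSTM run over a 3-word input sentence
# (batch dimension dropped for clarity): one output vector per input word.
encoder_outputs = torch.randn(3, hidden_dim)

# Fully connected layer mapping [context ; decoder output] to vocabulary scores.
fc = torch.nn.Linear(2 * hidden_dim, len(vocab))

def attention_step(decoder_output):
    # 1. Similarity between the current decoder output and every encoder
    #    output (here a plain dot product).
    scores = encoder_outputs @ decoder_output            # (3,)
    # 2. Softmax turns the scores into attention weights that sum to 1.
    weights = F.softmax(scores, dim=0)                   # (3,)
    # 3. Weighted sum of the encoder outputs = context vector for this step.
    context = weights @ encoder_outputs                  # (hidden_dim,)
    # 4. Concatenate the context with the decoder output and predict the
    #    next word through the fully connected layer.
    logits = fc(torch.cat([context, decoder_output]))    # (len(vocab),)
    return vocab[int(torch.argmax(logits))]

# Unroll: keep predicting (feeding each word back in) until <EOS> or a
# safety limit; the decoder LSTM itself is stubbed out with random vectors.
for _ in range(10):
    word = attention_step(torch.randn(hidden_dim))
    print(word)
    if word == "<EOS>":
        break
```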

However, now that attention has been added to the model, it turns out that the LSTM modules themselves are no longer strictly necessary.