I also referred to this implementation to understand some of the details.
This is the paper describing the Transformer, a sequence-to-sequence model based entirely on attention. I think it’s best described with pictures.
From this picture, I think the following things need explaining:
The wavelengths of the sinusoidal positional encodings form a geometric progression from \(2\pi\) to 10000 times that. In the encoder-decoder attention layers, \(Q\) comes from the previous masked attention layer and \(K\) and \(V\) come from the output of the encoder. Everywhere else uses self-attention, meaning that \(Q\), \(K\), and \(V\) are all the same. The “Mask (opt.)” block can be ignored there because that’s for masked attention, described above.
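For concreteness, here is a minimal NumPy sketch of those sinusoidal positional encodings (the function name and arguments are my own, not taken from any particular implementation):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings whose wavelengths form a geometric
    progression from 2*pi up to 10000 * 2*pi across the dimensions."""
    pos = np.arange(max_len)[:, None]                        # (max_len, 1)
    i = np.arange(d_model)[None, :]                          # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (i // 2)) / d_model)
    angles = pos * angle_rates                               # (max_len, d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                    # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                    # odd dimensions: cosine
    return pe
```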
On the left, we can see “dot-product” attention. Additive attention replaces the first matrix multiplication with a feed-forward network with one hidden layer. Dot-product attention is faster, so they used it instead. Unfortunately, with large dimensions the dot products become large, pushing the softmax into regions with very small gradients, so the “scale” block divides its input by the square root of the key dimension, \(\sqrt{d_k}\).
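A rough NumPy sketch of scaled dot-product attention, including the scaling and the optional mask (the function name and the mask convention are my own choices):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (..., seq_q, d_k), K: (..., seq_k, d_k), V: (..., seq_k, d_v).
    mask, if given, is boolean with False at positions that must be ignored."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)           # (..., seq_q, seq_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)                # blocked positions get ~zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                                       # (..., seq_q, d_v)
```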
Multi-head attention is just the concatenation of the results of performing attention on several different learned projections of the inputs \(V\), \(K\), and \(Q\). In this paper the projections reduced the dimension such that the resulting multi-head attention had the same overall complexity as regular attention on the unprojected inputs.
Multi-head attention is nice because it lets the output attend to multiple inputs at once, with each head free to focus on different positions.
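Here is a sketch of multi-head attention in the same spirit, reusing the scaled_dot_product_attention sketch above; the weight matrices are plain NumPy arrays and all the names are mine:

```python
import numpy as np

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o, num_heads):
    """Q, K, V: (seq, d_model). W_q, W_k, W_v, W_o: (d_model, d_model).
    Each head works on a d_model // num_heads slice of the projected inputs."""
    d_model = Q.shape[-1]
    d_head = d_model // num_heads

    def split_heads(x):                                      # (seq, d_model) -> (heads, seq, d_head)
        return x.reshape(-1, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(Q @ W_q), split_heads(K @ W_k), split_heads(V @ W_v)
    heads = scaled_dot_product_attention(q, k, v)            # (heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(-1, d_model)   # concatenate the heads
    return concat @ W_o                                      # final learned projection
```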
To understand the different dimensions of the tensors used for \(Q\), \(K\), and \(V\) in dot-product attention, check out the implementation here.
The benefits of this attention-only model are several: a self-attention layer is cheaper than a recurrent layer whenever the sequence length is smaller than the representation dimensionality (the usual case), it is far more parallelizable because there are no sequential recurrent steps, and the path between any two positions is short, which makes long-range dependencies easier to learn.
They used a byte-pair encoding, which is explained here and is basically the same as the word-piece model. They created a shared encoding for both the source and target languages.
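To make the byte-pair idea concrete, here is a toy sketch of how the merge table is learned — repeatedly fuse the most frequent adjacent pair of symbols. This is the general scheme, not the exact code used in the paper:

```python
from collections import Counter

def learn_bpe_merges(word_counts, num_merges):
    """word_counts maps each word, given as a tuple of symbols, to its corpus count.
    Repeatedly merge the most frequent adjacent symbol pair into a single symbol."""
    vocab = dict(word_counts)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)                     # most frequent adjacent pair
        merges.append(best)
        merged = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])  # fuse the pair
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            merged[key] = merged.get(key, 0) + count
        vocab = merged
    return merges
```

On a tiny corpus like {('l','o','w','&lt;/w&gt;'): 5, ('l','o','w','e','r','&lt;/w&gt;'): 2}, the first merges come out as ('l','o') and then ('lo','w').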
They used beam search as described in the GNMT paper. Remember, because of the auto-regressive property we can decode one token at a time: feed the tokens generated so far back into the decoder, read off the distribution for the next timestep, and pick the next token from it (just the argmax in the greedy case, while beam search keeps the few most probable partial outputs at each step).
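To make that decoding loop concrete, here is a greedy (argmax) sketch; beam search does the same thing but keeps the top few partial hypotheses instead of just one. The next_token_logits function is a stand-in for running the decoder on the prefix, not anything from the paper:

```python
import numpy as np

def greedy_decode(next_token_logits, start_id, end_id, max_len=50):
    """next_token_logits(prefix) returns logits over the vocabulary for the next token.
    Feed the tokens generated so far back in and take the argmax at each step."""
    prefix = [start_id]
    for _ in range(max_len):
        logits = next_token_logits(prefix)                   # shape: (vocab_size,)
        next_id = int(np.argmax(logits))                     # greedy choice
        prefix.append(next_id)
        if next_id == end_id:                                # stop at end-of-sequence
            break
    return prefix
```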