I also referred to this implementation to understand some of the details.
This is the paper describing the Transformer, a sequence-to-sequence model based entirely on attention. I think it’s best described with pictures.

From this picture, I think the following things need explaining:
The positional encoding is made of sinusoids whose wavelengths form a geometric progression from \(2\pi\) to 10000 times that.
In the encoder-decoder attention layers, \(Q\) comes from the previous masked attention layer and \(K\) and \(V\) come from the output of the encoder. Everywhere else uses self-attention, meaning that \(Q\), \(K\), and \(V\) are all the same.
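Here's a rough NumPy sketch of that sinusoidal positional encoding. The formula is the one from the paper; the function name and the `max_len`/`d_model` parameters are just my own placeholders (and I'm assuming an even `d_model`).

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
       PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    so the wavelengths form a geometric progression from 2*pi to 10000*2*pi."""
    positions = np.arange(max_len)[:, None]                # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]               # (1, d_model/2)
    angles = positions / np.power(10000, dims / d_model)   # (max_len, d_model/2)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions get cosine
    return pe
```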
The “Mask (opt.)” can be ignored because that’s for masked attention, described above.
On the left, we can see “dot-product” attention. Additive attention replaces the first matrix multiplication with a feed-forward network with one hidden layer. Dot-product attention is faster, so they used it instead. Unfortunately, with a large dimension \(d_k\) the dot products grow large in magnitude, pushing the softmax into regions with very small gradients, so the “scale” block divides them by \(\sqrt{d_k}\).
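A minimal sketch of scaled dot-product attention, including the optional mask from the figure. The shapes and the boolean-mask convention are my own assumptions, not lifted from the paper's reference code.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (..., n_q, d_k), K: (..., n_k, d_k), V: (..., n_k, d_v)."""
    d_k = Q.shape[-1]
    # MatMul + Scale: divide the dot products by sqrt(d_k) so they don't
    # grow with the dimension.
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)   # (..., n_q, n_k)
    if mask is not None:
        # Mask (opt.): push disallowed positions to -inf before the softmax,
        # e.g. future positions in the decoder's masked self-attention.
        scores = np.where(mask, scores, -1e9)
    # Softmax over the key axis, then take the weighted sum of the values.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V
```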
Multi-head attention is just the concatenation of the results of performing attention on several different learned projections of the inputs \(V\), \(K\), and \(Q\). In this paper the projections reduce the dimension such that the resulting multi-head attention has the same overall complexity as regular attention on the unprojected inputs.
Multi-head attention is nice because it allows each output position to more easily attend to several different input positions at once.
To understand the different dimensions for the tensors used in \(Q\), \(K\), and \(V\) for dot-product attention, check out the implementation here.
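Here's a rough sketch of multi-head attention that reuses the `scaled_dot_product_attention` function above. The per-head dimension \(d_k = d_{model} / h\) follows the paper; the `weights` dictionary and its `W_Q`/`W_K`/`W_V`/`W_O` keys are just placeholders I made up to stand in for the learned projections.

```python
import numpy as np

def multi_head_attention(Q, K, V, weights, h):
    """Q: (n_q, d_model), K and V: (n_k, d_model).

    `weights["W_Q"]`, `weights["W_K"]`, `weights["W_V"]` each hold h matrices
    of shape (d_model, d_k); `weights["W_O"]` has shape (h*d_k, d_model)."""
    d_model = Q.shape[-1]
    d_k = d_model // h  # reduced per-head dimension keeps the total cost similar
    assert weights["W_Q"][0].shape == (d_model, d_k)

    heads = []
    for i in range(h):
        # Project Q, K, V down to d_k dimensions for this head.
        q_i = Q @ weights["W_Q"][i]  # (n_q, d_k)
        k_i = K @ weights["W_K"][i]  # (n_k, d_k)
        v_i = V @ weights["W_V"][i]  # (n_k, d_k)
        heads.append(scaled_dot_product_attention(q_i, k_i, v_i))
    # Concatenate the per-head outputs and apply the final output projection.
    return np.concatenate(heads, axis=-1) @ weights["W_O"]  # (n_q, d_model)
```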
The benefits of this attention-only model are several: the per-layer complexity is lower than a recurrent layer whenever the sequence length is smaller than the representation dimension, there are no sequential dependencies across positions so training parallelizes much better than with RNNs, and the path between any two positions is short, which makes long-range dependencies easier to learn.
They used a byte-pair encoding, which is explained here and is basically the same as the word-piece model. They created a shared encoding for both the source and target languages.
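To make the byte-pair idea concrete, here's a toy sketch of a single BPE merge step: count adjacent symbol pairs across the corpus and fuse the most frequent one into a new symbol. This is a simplification for illustration, not the actual subword implementation they used.

```python
from collections import Counter

def bpe_merge_step(vocab):
    """One BPE merge over `vocab`, which maps a word (as a tuple of symbols)
    to its corpus frequency, e.g. {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2}."""
    # Count every adjacent pair of symbols, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    if not pairs:
        return vocab, None
    best = max(pairs, key=pairs.get)

    # Rewrite every word, fusing occurrences of the best pair into one symbol.
    merged = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged, best
```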
They used beam search as described in the GNMT paper. Remember, because of the auto-regressive property we predict one token at a time: feed the tokens generated so far back into the decoder and take the argmax (or, with beam search, the top few candidates) of the distribution for the next timestep.
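The auto-regressive loop is easiest to see with plain greedy decoding instead of beam search. In this sketch, `model`, `bos_id`, and `eos_id` are placeholders I'm assuming; `model(src, tgt)` is assumed to return one row of logits per target position.

```python
import numpy as np

def greedy_decode(model, src, bos_id, eos_id, max_len=100):
    """Greedy stand-in for beam search: feed the tokens generated so far
    back into the decoder and take the argmax for the next position."""
    tgt = [bos_id]
    for _ in range(max_len):
        logits = model(src, tgt)                  # (len(tgt), vocab_size)
        next_token = int(np.argmax(logits[-1]))   # distribution for the next timestep
        tgt.append(next_token)
        if next_token == eos_id:
            break
    return tgt
```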