Posted on 2020-06-30 at 08:22:30 UTC-0600

Tags: nlp, attention, machine-translation

This model was superseded by this one.

They did some careful engineering with residual connections to keep the deep LSTM stack trainable while remaining highly parallelizable, and they placed each LSTM layer on a separate GPU. They also designed the models for quantization: training used full floating-point computation, subject to a couple of restrictions, and the trained models were then converted to quantized versions.
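As a rough illustration of that layout, here is a minimal PyTorch-style sketch of an LSTM stack with residual connections between layers, where each layer lives on its own GPU. The class name, hyperparameters, and device assignment are my own assumptions for the example, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class ResidualLSTMStack(nn.Module):
    """Hypothetical sketch: stacked LSTMs with residual connections,
    one layer per GPU so consecutive layers can run as a pipeline."""

    def __init__(self, hidden_size: int, num_layers: int):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            layer = nn.LSTM(hidden_size, hidden_size, batch_first=True)
            self.layers.append(layer.to(f"cuda:{i}"))  # one GPU per layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for i, layer in enumerate(self.layers):
            x = x.to(f"cuda:{i}")  # move activations to this layer's GPU
            out, _ = layer(x)
            # Residual connection: add the layer's input back to its output
            # (applied after the first layer, where shapes already match).
            x = x + out if i > 0 else out
        return x
```

Because inputs and outputs of each layer have the same shape, the residual addition needs no projection, which is part of what makes stacking many layers practical.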
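The quantization step can be pictured as mapping each trained float tensor to 8-bit integers plus a scale. The restrictions they imposed during training aren't detailed here, so the sketch below only illustrates a generic float-to-int8 conversion with symmetric min-max scaling; the helper names are hypothetical.

```python
import torch

def quantize_to_int8(w: torch.Tensor):
    """Map a float tensor to int8 with a per-tensor symmetric scale."""
    scale = w.abs().max().clamp(min=1e-8) / 127.0  # avoid divide-by-zero
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximation of the original float tensor."""
    return q.to(torch.float32) * scale

# Usage: after float training, convert each weight tensor for inference.
w = torch.randn(4, 4)
q, s = quantize_to_int8(w)
w_hat = dequantize(q, s)  # close to w, up to rounding error
```

The point of the training-time restrictions is to keep values inside ranges where this rounding loses little, so the converted model's accuracy stays close to the float model's.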