The B is for bidirectional, and that’s a big deal. It makes it possible to do well on sentence-level (NLI, question answering) and token-level tasks (NER, POS tagging). In a unidirectional model, the word “bank” in a sentence like “I made a bank deposit.” has only “I made a” as its context, keeping useful information from the model.
Another cool thing is masked language model training (MLM). They train the model by blanking certain words in the sentence and asking the model to guess the missing word.
Read moreThis model was superseded by this one.
They did some careful things with residual connections to make sure it was very parallelizable. They put each LSTM layer on a separate GPU. They quantized the models such that they could train using full floating-point computations with a couple restrictions and then convert the models to quantized versions.
They use the word-piece model from “Japanese and Korean Voice Search”, with 32,000 word pieces. (This is a lot less than the 200,000 used in that paper.) They state in the paper that the shared word-piece model is very similar to Byte-Pair-Encoding, which was used for NMT in this paper by researchers at U of Edinburgh.
The model and training process are exactly as in Google’s earlier paper. It takes 3 weeks on 100 GPUs to train, even after increasing batch size and learning rate.
Read moreThis was mentioned in the paper on Google’s Multilingual Neural Machine Translation System. It’s regarded as the original paper to use the word-piece model, which is the focus of my notes here.
the WordPieceModel Here’s the WordPieceModel algorithm:
func WordPieceModel(D, chars, n, threshold) -> inventory: # D: training data # n: user-specified number of word units (often 200k) # chars: unicode characters used in the language (e.g. Kanji, Hiragana, Katakana, ASCII for Japanese) # threshold: stopping criterion for likelihood increase # inventory: the set of word units created by the model inventory := chars likelihood := +INF while len(inventory) < n && likelihood >= threshold: lm := LM(inventory, D) inventory += argmax_{combined word unit}(lm.
Read moreThey used a technique called CCA to combine hand-made features with NN representations. It didn’t do great on typological feature prediction, but it did do well with predicting a phylogenetic tree for Indo-European languages.