This paper was mentioned in the one on Google’s Multilingual Neural Machine Translation System; it’s regarded as the original paper to use the word-piece model, which is the focus of my notes here.
Here’s the WordPieceModel algorithm:
func WordPieceModel(D, chars, n, threshold) -> inventory:
    # D: training data
    # n: user-specified number of word units (often 200k)
    # chars: unicode characters used in the language (e.g. Kanji, Hiragana, Katakana, ASCII for Japanese)
    # threshold: stopping criterion, the minimum likelihood increase per added unit
    # inventory: the set of word units created by the model
    inventory := chars
    gain := +INF
    while len(inventory) < n && gain >= threshold:
        lm := LM(inventory, D)
        # pick the combined word unit (a pair of existing units) whose addition
        # to the inventory most increases the likelihood of D
        best := argmax_{combined word unit}(lm.likelihood_{inventory + combined word unit}(D))
        gain := lm.likelihood_{inventory + best}(D) - lm.likelihood_{inventory}(D)
        inventory += best
    return inventory
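To make the loop concrete, here is a small runnable Python sketch. It is my own simplification, not the paper's implementation: instead of retraining a language model over the inventory at every step, it approximates the likelihood gain of a merge with a unigram model (roughly count(ab) * (log p(ab) - log p(a) - log p(b))), and the corpus, n, and threshold values are toy placeholders.

from collections import Counter
import math

def build_inventory(corpus, n, threshold):
    # Start from single characters; keep each word as a list of current units.
    words = [list(w) for w in corpus]
    inventory = {c for w in words for c in w}
    while len(inventory) < n:
        unit_counts = Counter(u for w in words for u in w)
        pair_counts = Counter((w[i], w[i + 1]) for w in words for i in range(len(w) - 1))
        if not pair_counts:
            break
        total = sum(unit_counts.values())

        def gain(pair):
            # Approximate log-likelihood increase of adding the merged unit,
            # under a unigram LM over the current segmentation.
            a, b = pair
            return pair_counts[pair] * (
                math.log(pair_counts[pair] / total)
                - math.log(unit_counts[a] / total)
                - math.log(unit_counts[b] / total)
            )

        best = max(pair_counts, key=gain)
        if gain(best) < threshold:
            break
        new_unit = best[0] + best[1]
        inventory.add(new_unit)
        for w in words:               # re-segment: replace the pair everywhere
            i = 0
            while i < len(w) - 1:
                if (w[i], w[i + 1]) == best:
                    w[i:i + 2] = [new_unit]
                else:
                    i += 1
    return inventory

corpus = ["lower", "lowest", "newer", "newest"] * 10   # toy data
print(sorted(build_inventory(corpus, n=30, threshold=1.0)))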
The algorithm can be optimized in several ways; after these optimizations, building a 200k word-piece inventory can take a few hours on a single machine.
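Whatever the exact tricks are, one generic speed-up for this kind of greedy merging is to avoid rescanning the whole corpus after every merge: only the words that contain the merged pair can change, so only their counts need updating. A sketch reusing the names from the toy trainer above (my own illustration, not necessarily one of the paper's optimizations):

def merge_pair_incrementally(words, pair_counts, best, new_unit):
    # Update adjacent-pair counts in place, touching only the words that
    # actually contain `best`, instead of recounting the whole corpus.
    for w in words:
        if not any((w[i], w[i + 1]) == best for i in range(len(w) - 1)):
            continue
        for i in range(len(w) - 1):        # retire this word's old pairs
            pair_counts[(w[i], w[i + 1])] -= 1
        i = 0
        while i < len(w) - 1:              # apply the merge
            if (w[i], w[i + 1]) == best:
                w[i:i + 2] = [new_unit]
            else:
                i += 1
        for i in range(len(w) - 1):        # count the new pairs
            pair_counts[(w[i], w[i + 1])] += 1
    for pair in [p for p, c in pair_counts.items() if c <= 0]:
        del pair_counts[pair]              # drop pairs that no longer occur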
They also do something important to make sure the ASR output text has spaces formatted reasonably. It’s best explained in the following image from the paper:
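The usual convention (the one the later GNMT paper uses for its word pieces) is to attach a marker such as "_" to word-initial pieces, so spaces can be restored deterministically from the decoded piece sequence; I'm assuming the figure shows something along these lines. A minimal sketch, with a character-level stand-in for the real word-piece segmenter:

def segment(word):
    # Stand-in segmenter: a real system would segment against the learned
    # word-piece inventory; splitting into characters is enough to illustrate.
    return list(word)

def to_pieces(words):
    pieces = []
    for word in words:
        word_pieces = segment(word)
        word_pieces[0] = "_" + word_pieces[0]   # mark the word-initial piece
        pieces.extend(word_pieces)
    return pieces

def restore_spaces(pieces):
    # A "_" piece starts a new word; everything else glues onto what came before.
    return "".join(p.replace("_", " ") for p in pieces).strip()

pieces = to_pieces("Jet makers feud".split())   # ['_J', 'e', 't', '_m', ...]
print(restore_spaces(pieces))                   # -> "Jet makers feud"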
They use entropy-pruned 3- to 5-gram models with Katz back-off, after removing unwanted symbols as much as possible, similar to what is described in a previous voice search paper from Google.
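For reference, here is a toy sketch of the back-off mechanics behind such a model. It is not their LM: it's only a bigram, a fixed absolute discount stands in for the Good-Turing discounts that Katz back-off actually prescribes, and entropy pruning isn't shown.

from collections import Counter

def train_backoff_bigram(tokens, d=0.5):
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = sum(unigrams.values())

    def p_unigram(w):
        return unigrams[w] / total

    def backoff_weight(h):
        # Mass freed by discounting the seen successors of h, renormalized
        # over the successors that were never seen after h.
        seen = [w for (a, w) in bigrams if a == h]
        freed = d * len(seen) / unigrams[h]
        return freed / (1.0 - sum(p_unigram(w) for w in seen))

    def p(w, h):
        if (h, w) in bigrams:
            return (bigrams[(h, w)] - d) / unigrams[h]   # discounted ML estimate
        return backoff_weight(h) * p_unigram(w)          # back off to the unigram
    return p

p = train_backoff_bigram("the cat sat on the mat the cat ran".split())
print(p("sat", "cat"), p("ran", "the"))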
They used a hodge-podge of techniques to generate the pronunciation dictionaries.