This was mentioned in the paper on Google’s Multilingual Neural Machine Translation System. It’s regarded as the original paper to use the word-piece model, which is the focus of my notes here.the WordPieceModelHere’s the WordPieceModel algorithm:func WordPieceModel(D, chars, n, threshold) -> inventory:
 # D: training data
 # n: user-specified number of word units (often 200k)
 # chars: unicode characters used in the language (e.g. Kanji, Hiragana, Katakana, ASCII for Japanese)
 # threshold: stopping criterion for likelihood increase
 # inventory: the set of word units created by the model

 inventory := chars
 likelihood := +INF
 while len(inventory) < n && likelihood >= threshold:
 lm := LM(inventory, D)
 inventory += argmax_{combined word unit}(lm.likelihood_{inventory + combined word unit}(D))
 likelihood = lm.likelihood_{inventory}(D)
 return inventory
The algorithm can be optimized bytesting only word pairs that exist in the training datatesting only pairs with a significant chance of being the bestcombining several clustering steps into a single iteration (possible for groups of pairs that don’t affect each other)only modify the LM counts for affected entriesAfter these optimizations, building a 200k word piece inventory can take a few hours on a single machine.Dealing with spacesThey also do something important to make sure the ASR output text has spaces formatted reasonably. It’s best explained in the following image from the paper:LMThey use entropy-pruned 3- to 5-grams with Katz back-off after removing unwanted symbols etc. as much as possible similar to what is described in a previous voice search paper from Google.pronunciation dictionaryThey used a hodge-podge of various techniques to generate the pronunciation dictionaries.IME dataextractions of readings from the weba transliterator for loan wordsrule-based approachesreviewing by hand the most important groups of pronunciations