<title>Japanese and Korean voice search</title>
<link>https://kylrth.com/paper/word-piece-model/</link>
<pubDate>Wed, 24 Jun 2020 14:44:02 -0600</pubDate>
<guid>https://kylrth.com/paper/word-piece-model/</guid>
<description>This was mentioned in the paper on Google’s Multilingual Neural Machine Translation System. It’s regarded as the original paper to use the word-piece model, which is the focus of my notes here.
the WordPieceModel
Here’s the WordPieceModel algorithm:
func WordPieceModel(D, chars, n, threshold) -> inventory:
    # D: training data
    # n: user-specified number of word units (often 200k)
    # chars: unicode characters used in the language (e.g. Kanji, Hiragana, Katakana, ASCII for Japanese)
    # threshold: stopping criterion for likelihood increase
    # inventory: the set of word units created by the model
    inventory := chars
    likelihood := +INF
    while len(inventory) < n && likelihood >= threshold:
        lm := LM(inventory, D)
        inventory += argmax_{combined word unit}(lm.</description>