<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
<channel>
<title>machine-translation on Kyle Roth</title>
<link>https://kylrth.com/tags/machine-translation/</link>
<description>Recent content in machine-translation on Kyle Roth</description>
<generator>Hugo -- gohugo.io</generator>
<language>en-us</language>
<lastBuildDate>Tue, 04 Aug 2020 08:57:44 -0700</lastBuildDate>
<atom:link href="https://kylrth.com/tags/machine-translation/index.xml" rel="self" type="application/rss+xml"/>
<item>
<title>BERT: pre-training of deep bidirectional transformers for language understanding</title>
<link>https://kylrth.com/paper/bert/</link>
<pubDate>Tue, 04 Aug 2020 08:57:44 -0700</pubDate>
<guid>https://kylrth.com/paper/bert/</guid>
<description>The B is for bidirectional, and that&rsquo;s a big deal. It makes it possible to do well on both sentence-level tasks (NLI, question answering) and token-level tasks (NER, POS tagging). In a unidirectional model, the word &ldquo;bank&rdquo; in a sentence like &ldquo;I made a bank deposit.&rdquo; has only &ldquo;I made a&rdquo; as its context, withholding useful information from the model. Another cool thing is masked language model (MLM) training: they train the model by blanking out certain words in the sentence and asking the model to guess the missing words (a toy sketch of the masking step follows this item).</description>
</item>
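<!--
A minimal sketch (not the authors' code) of the masked-language-model objective described above: randomly blank out a fraction of the tokens and train the model to recover them. The 15% mask rate and the [MASK] token follow the BERT paper; the whitespace tokenizer is a stand-in for illustration.

import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]"):
    """Return (masked_tokens, labels): labels[i] holds the original token at
    masked positions and None elsewhere."""
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            masked.append(mask_token)
            labels.append(tok)   # the model is trained to predict this token
        else:
            masked.append(tok)
            labels.append(None)  # position is ignored by the MLM loss
    return masked, labels

sentence = "I made a bank deposit .".split()
print(mask_tokens(sentence))
-->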
<item>
<title>Google's neural machine translation system: bridging the gap between human and machine translation</title>
<link>https://kylrth.com/paper/google-nmt-2016/</link>
<pubDate>Tue, 30 Jun 2020 08:22:30 -0600</pubDate>
<guid>https://kylrth.com/paper/google-nmt-2016/</guid>
<description>This model was superseded by this one. They used residual connections carefully so that the model stayed highly parallelizable, and they placed each LSTM layer on a separate GPU. They also quantized the models: training used full floating-point computation (with a couple of restrictions), and the trained models were then converted to quantized versions (a rough sketch of that idea follows this item).</description>
</item>
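<!--
A hedged sketch of the "train in floating point, then convert to a quantized model" idea mentioned above: weights are kept in float32 during training and mapped afterwards to 8-bit integers with a per-tensor scale. This is a generic illustration, not the exact scheme from the paper.

import numpy as np

def quantize_int8(weights):
    """Map float weights to int8 plus a scale factor for dequantization."""
    scale = float(np.abs(weights).max()) / 127.0  # largest magnitude maps to 127
    q = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)  # stand-in for a trained weight matrix
q, s = quantize_int8(w)
print(np.abs(w - dequantize(q, s)).max())      # reconstruction error stays small
-->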
<item>
<title>Google's multilingual neural machine translation system</title>
<link>https://kylrth.com/paper/google-zero-shot/</link>
<pubDate>Fri, 26 Jun 2020 08:02:12 -0600</pubDate>
<guid>https://kylrth.com/paper/google-zero-shot/</guid>
<description>They use the word-piece model from &ldquo;Japanese and Korean Voice Search&rdquo;, with 32,000 word pieces (far fewer than the 200,000 used in that paper). They state that the shared word-piece model is very similar to byte-pair encoding (BPE), which was used for NMT in this paper by researchers at the University of Edinburgh (a toy BPE merge step is sketched after this item). The model and training process are exactly as in Google&rsquo;s earlier paper. It takes 3 weeks on 100 GPUs to train, even after increasing the batch size and learning rate.</description>
</item>
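<!--
A toy sketch of a single byte-pair-encoding merge step, to illustrate the BPE comparison above. The space-separated character segmentation and the tiny vocabulary are illustrative; the real systems learn 32,000 shared units from large multilingual corpora.

from collections import Counter

def most_frequent_pair(vocab):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its concatenation."""
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): freq for word, freq in vocab.items()}

# toy vocabulary: space-separated symbols with word frequencies
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
pair = most_frequent_pair(vocab)
print(pair, merge_pair(pair, vocab))
-->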
</channel>
</rss>