<title>attention on Kyle Roth</title>
<link>https://kylrth.com/tags/attention/</link>
<description>Recent content in attention on Kyle Roth</description>
<generator>Hugo -- gohugo.io</generator>
<language>en-us</language>
<lastBuildDate>Wed, 05 Aug 2020 12:37:42 -0700</lastBuildDate>
<atom:link href="https://kylrth.com/tags/attention/index.xml" rel="self" type="application/rss+xml"/>
<item>
<title>Attention is all you need</title>
<link>https://kylrth.com/paper/attention-all-you-need/</link>
<pubDate>Wed, 05 Aug 2020 12:37:42 -0700</pubDate>
<guid>https://kylrth.com/paper/attention-all-you-need/</guid>
<description>I also referred to this implementation to understand some of the details. This is the paper describing the Transformer, a sequence-to-sequence model based entirely on attention. I think it’s best described with pictures. model overview From this picture, I think the following things need explaining: embeddings these are learned embeddings that convert the input and output tokens to vectors of the model dimension. In this paper, they actually used the same weight matrix for input embedding, output embedding, and the final linear layer before the final softmax.</description>
...
</item>
<item>
<title>BERT: pre-training of deep bidirectional transformers for language understanding</title>
<link>https://kylrth.com/paper/bert/</link>
<pubDate>Tue, 04 Aug 2020 08:57:44 -0700</pubDate>
<guid>https://kylrth.com/paper/bert/</guid>
<description>The B is for bidirectional, and that’s a big deal. It makes it possible to do well on sentence-level (NLI, question answering) and token-level tasks (NER, POS tagging). In a unidirectional model, the word “bank” in a sentence like “I made a bank deposit.” has only “I made a” as its context, keeping useful information from the model. Another cool thing is masked language model training (MLM). They train the model by blanking certain words in the sentence and asking the model to guess the missing word.</description>
...
</item>
<item>
<title>Google's neural machine translation system: bridging the gap between human and machine translation</title>
<link>https://kylrth.com/paper/google-nmt-2016/</link>
<pubDate>Tue, 30 Jun 2020 08:22:30 -0600</pubDate>
<guid>https://kylrth.com/paper/google-nmt-2016/</guid>
<description>This model was superseded by this one. They did some careful things with residual connections to make sure it was very parallelizable. They put each LSTM layer on a separate GPU. They quantized the models such that they could train using full floating-point computations with a couple restrictions and then convert the models to quantized versions.</description>
...
</item>
<item>
<title>Google's multilingual neural machine translation system</title>
<link>https://kylrth.com/paper/google-zero-shot/</link>
<pubDate>Fri, 26 Jun 2020 08:02:12 -0600</pubDate>
<guid>https://kylrth.com/paper/google-zero-shot/</guid>
<description>They use the word-piece model from “Japanese and Korean Voice Search”, with 32,000 word pieces. (This is a lot less than the 200,000 used in that paper.) They state in the paper that the shared word-piece model is very similar to Byte-Pair-Encoding, which was used for NMT in this paper by researchers at U of Edinburgh. The model and training process are exactly as in Google’s earlier paper. It takes 3 weeks on 100 GPUs to train, even after increasing batch size and learning rate.</description>
...
</item>
...