<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
<channel>
<title>nlp on Kyle Roth</title>
<link>https://kylrth.com/tags/nlp/</link>
<description>Recent content in nlp on Kyle Roth</description>
<generator>Hugo -- gohugo.io</generator>
<language>en-us</language>
<lastBuildDate>Mon, 18 Apr 2022 23:08:52 -0400</lastBuildDate>
<atom:link href="https://kylrth.com/tags/nlp/index.xml" rel="self" type="application/rss+xml"/>
<item>
<title>WordBurner beta</title>
<link>https://kylrth.com/post/wordburner/</link>
<pubDate>Mon, 18 Apr 2022 23:08:52 -0400</pubDate>
<guid>https://kylrth.com/post/wordburner/</guid>
<description>Update 2022-04-27: The beta is over, but the apk is still installable with the instructions below and any feedback sent from inside the app will be received by me. I&rsquo;m going to be working on this more over the summer, and eventually publishing it on the app store. :) Ever since learning Spanish, it has been a dream of mine to create a vocabulary study app that meets my needs. Duolingo won&rsquo;t cover advanced vocabulary, Anki requires manually-generated decks, and other apps have expensive subscription plans.</description>
</item>
<item>
<title>PaLM</title>
<link>https://kylrth.com/paper/palm/</link>
<pubDate>Mon, 11 Apr 2022 12:17:25 -0400</pubDate>
<guid>https://kylrth.com/paper/palm/</guid>
<description>This was a paper I presented about in Bang Liu&rsquo;s research group meeting on 2022-04-11. You can view the slides I used here.</description>
</item>
<item>
<title>It's not just size that matters: small language models are also few-shot learners</title>
<link>https://kylrth.com/paper/not-just-size-that-matters/</link>
<pubDate>Fri, 18 Feb 2022 13:13:54 -0500</pubDate>
<guid>https://kylrth.com/paper/not-just-size-that-matters/</guid>
<description>We presented this paper as a mini-lecture in Bang Liu&rsquo;s IFT6289 course in winter 2022. You can view the slides we used here.</description>
</item>
<item>
<title>A sensitivity analysis of (and practitioners’ guide to) convolutional neural networks for sentence classification</title>
<link>https://kylrth.com/paper/cnn-sentence/</link>
<pubDate>Wed, 02 Feb 2022 15:35:00 -0500</pubDate>
<guid>https://kylrth.com/paper/cnn-sentence/</guid>
<description>This post was created as an assignment in Bang Liu&rsquo;s IFT6289 course in winter 2022. The structure of the post follows the structure of the assignment: summarization followed by my own comments. Paper summarization: word embeddings have gotten so good that state-of-the-art sentence classification can often be achieved with just a one-layer convolutional network on top of those embeddings. This paper dials in on the specifics of training that convolutional layer for this downstream sentence classification task.</description>
</item>
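<!--
A minimal sketch of the kind of model the summary above refers to: a single convolutional layer (with max-pooling) over word embeddings feeding a linear classifier. The class name, dimensions, and filter sizes are illustrative assumptions, not taken from the post or the paper.

import torch
import torch.nn as nn

class OneLayerCNN(nn.Module):
    def __init__(self, embed_dim=300, num_filters=100, filter_sizes=(3, 4, 5), num_classes=2):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in filter_sizes]
        )
        self.fc = nn.Linear(num_filters * len(filter_sizes), num_classes)

    def forward(self, embedded):             # embedded: (batch, seq_len, embed_dim)
        x = embedded.transpose(1, 2)         # Conv1d expects (batch, channels, seq_len)
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))
-->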
<item>
<title>Learning transferable visual models from natural language supervision (CLIP)</title>
<link>https://kylrth.com/paper/clip/</link>
<pubDate>Wed, 02 Feb 2022 12:35:03 -0500</pubDate>
<guid>https://kylrth.com/paper/clip/</guid>
<description>This post was created as an assignment in Irina Rish&rsquo;s neural scaling laws course (IFT6167) in winter 2022. The post contains no summarization, only questions and thoughts. This concept of wide vs. narrow supervision (rather than binary &ldquo;supervised&rdquo; and &ldquo;unsupervised&rdquo;) is an interesting and flexible way to think about the way these training schemes leverage data. The zero-shot CLIP matches the performance of 4-shot CLIP, which is a surprising result. What do the authors mean when they make this guess about zero-shot&rsquo;s advantage:</description>
</item>
<item>
<title>Distributed representations of words and phrases and their compositionality</title>
<link>https://kylrth.com/paper/distributed-representations/</link>
<pubDate>Tue, 01 Feb 2022 16:09:19 -0500</pubDate>
<guid>https://kylrth.com/paper/distributed-representations/</guid>
<description>This post was created as an assignment in Bang Liu&rsquo;s IFT6289 course in winter 2022. The structure of the post follows the structure of the assignment: summarization followed by my own comments. Paper summarization: this paper describes multiple improvements that are made to the original Skip-gram model. Decreasing the rate of exposure to common words improves the training speed and increases the model&rsquo;s accuracy on infrequent words. A new training target they call &ldquo;negative sampling&rdquo; improves the training speed and the model&rsquo;s accuracy on frequent words.</description>
</item>
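<!--
A small numpy sketch of the two ideas in the summary above: subsampling of frequent words (the paper's discard probability, restated here as a keep probability with the commonly used t=1e-5) and the negative-sampling objective. Function names are illustrative assumptions.

import numpy as np

def keep_probability(word_frequency, t=1e-5):
    """Chance of keeping an occurrence when subsampling frequent words."""
    return min(1.0, np.sqrt(t / word_frequency))

def negative_sampling_loss(center_vec, context_vec, negative_vecs):
    """Skip-gram loss with a handful of sampled negatives instead of a full softmax."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    positive = np.log(sigmoid(context_vec @ center_vec))
    negatives = sum(np.log(sigmoid(-(neg @ center_vec))) for neg in negative_vecs)
    return -(positive + negatives)
-->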
<item>
<title>Cross-lingual alignment of contextual word embeddings, with applications to zero-shot dependency parsing</title>
<link>https://kylrth.com/paper/cross-lingual-alignment-contextual/</link>
<pubDate>Fri, 11 Dec 2020 06:30:43 -0700</pubDate>
<guid>https://kylrth.com/paper/cross-lingual-alignment-contextual/</guid>
<description>Recent contextual word embeddings (e.g. ELMo) have been shown to be much better than &ldquo;static&rdquo; embeddings (where there&rsquo;s a one-to-one mapping from token to representation). This paper is exciting because the authors were able to create a multilingual embedding space built from contextual word embeddings. Each token has a &ldquo;point cloud&rdquo; of embedding values, one point for each context containing the token. They define the embedding anchor as the average of all those points for a particular token.</description>
</item>
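<!--
A brief sketch of the &quot;embedding anchor&quot; idea described above: average a token's contextual embeddings over every context it appears in. The embed() function and the corpus format are assumptions made for illustration.

from collections import defaultdict
import numpy as np

def compute_anchors(corpus, embed):
    """corpus: list of token lists. embed(sentence) returns an array of shape (len(sentence), dim)."""
    totals, counts = {}, defaultdict(int)
    for sentence in corpus:
        vectors = embed(sentence)                 # one contextual vector per token occurrence
        for token, vec in zip(sentence, vectors):
            totals[token] = totals.get(token, 0) + vec
            counts[token] += 1
    return {token: totals[token] / counts[token] for token in totals}
-->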
<item>
<title>SpanBERT: improving pre-training by representing and predicting spans</title>
<link>https://kylrth.com/paper/spanbert/</link>
<pubDate>Sat, 05 Dec 2020 16:08:03 -0700</pubDate>
<guid>https://kylrth.com/paper/spanbert/</guid>
<description>BERT optimizes the Masked Language Model (MLM) objective by masking word pieces uniformly at random in its training data and attempting to predict the masked values. With SpanBERT, spans of tokens are masked and the model is expected to predict the text in the spans from the representations of the words on the boundary. Span lengths follow a geometric distribution, and span start points are uniformly random. To predict each individual masked token, a two-layer feedforward network was provided with the boundary token representations plus the position embedding of the target token, and the output vector representation was used to predict the masked token and compute cross-entropy loss exactly as in standard MLM.</description>
</item>
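<!--
A sketch of the span selection described above: span lengths drawn from a geometric distribution (clipped) with uniformly random start positions. The settings p=0.2 and max_len=10 are the commonly cited SpanBERT values, used here as assumptions, and overlap handling is omitted.

import numpy as np

def sample_spans(seq_len, mask_budget, p=0.2, max_len=10, rng=None):
    """Pick (start, end) spans until mask_budget tokens are covered."""
    rng = rng or np.random.default_rng()
    spans, masked = [], 0
    while masked < mask_budget:
        length = int(min(rng.geometric(p), max_len))      # short spans are most likely
        start = int(rng.integers(0, seq_len - length + 1))
        spans.append((start, start + length))
        masked += length
    return spans
-->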
<item>
<title>Deep contextualized word representations</title>
<link>https://kylrth.com/paper/deep-contextualized-word-representations/</link>
<pubDate>Thu, 03 Dec 2020 12:01:43 -0700</pubDate>
<guid>https://kylrth.com/paper/deep-contextualized-word-representations/</guid>
<description>This is the original paper introducing Embeddings from Language Models (ELMo). Unlike most widely used word embeddings, ELMo word representations are functions of the entire input sentence. That&rsquo;s what makes ELMo great: they&rsquo;re contextualized word representations, meaning that they can express multiple possible senses of the same word. Specifically, ELMo representations are a learned linear combination of all layers of an LSTM encoding. The LSTM undergoes general semi-supervised pretraining, but the linear combination is learned specific to the task.</description>
</item>
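<!--
A minimal sketch of the learned linear combination described above: softmax-normalized task-specific weights over the biLM layers, scaled by a task-specific gamma. Array shapes are assumptions for illustration.

import numpy as np

def scalar_mix(layer_reps, scalar_params, gamma):
    """layer_reps: list of (seq_len, dim) arrays, one per biLM layer (token layer + LSTM layers).
    scalar_params: unnormalized task-specific weights; gamma: task-specific scale."""
    weights = np.exp(scalar_params) / np.sum(np.exp(scalar_params))  # softmax over layers
    return gamma * sum(w * h for w, h in zip(weights, layer_reps))
-->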
<item>
<title>Attention is all you need</title>
<link>https://kylrth.com/paper/attention-all-you-need/</link>
<pubDate>Wed, 05 Aug 2020 12:37:42 -0700</pubDate>
<guid>https://kylrth.com/paper/attention-all-you-need/</guid>
<description>I also referred to this implementation to understand some of the details. This is the paper describing the Transformer, a sequence-to-sequence model based entirely on attention. I think it&rsquo;s best described with pictures. Model overview: from this picture, I think the following things need explaining. Embeddings: these are learned embeddings that convert the input and output tokens to vectors of the model dimension. In this paper, they actually used the same weight matrix for the input embedding, the output embedding, and the final linear layer before the softmax.</description>
</item>
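<!--
A small numpy sketch of the weight sharing mentioned above: one matrix embeds both input and output tokens and, transposed, serves as the pre-softmax projection. The sqrt(d_model) scaling of embeddings follows the paper; the shapes and names are illustrative.

import numpy as np

vocab_size, d_model = 32000, 512
E = np.random.randn(vocab_size, d_model) * d_model ** -0.5   # the single shared weight matrix

def embed(token_ids):
    return E[token_ids] * np.sqrt(d_model)   # embedding weights are scaled by sqrt(d_model)

def output_logits(decoder_states):           # decoder_states: (seq_len, d_model)
    return decoder_states @ E.T              # E reused as the final linear projection
-->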
<item>
<title>BERT: pre-training of deep bidirectional transformers for language understanding</title>
<link>https://kylrth.com/paper/bert/</link>
<pubDate>Tue, 04 Aug 2020 08:57:44 -0700</pubDate>
<guid>https://kylrth.com/paper/bert/</guid>
<description>The B is for bidirectional, and that&rsquo;s a big deal. It makes it possible to do well on sentence-level (NLI, question answering) and token-level tasks (NER, POS tagging). In a unidirectional model, the word &ldquo;bank&rdquo; in a sentence like &ldquo;I made a bank deposit.&rdquo; has only &ldquo;I made a&rdquo; as its context, withholding useful information from the model. Another cool thing is masked language model (MLM) training. They train the model by blanking certain words in the sentence and asking the model to guess the missing words.</description>
</item>
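<!--
A sketch of the masked-LM corruption described above: choose a fraction of positions and blank most of them. The 15% rate and the 80/10/10 mask/random/keep split are BERT's published settings, assumed here.

import random

def mask_tokens(tokens, vocab, mask_rate=0.15, mask_token="[MASK]"):
    corrupted, targets = list(tokens), {}
    for i, token in enumerate(tokens):
        if random.random() < mask_rate:
            targets[i] = token                       # the model must recover the original token
            roll = random.random()
            if roll < 0.8:
                corrupted[i] = mask_token            # usually replace with [MASK]
            elif roll < 0.9:
                corrupted[i] = random.choice(vocab)  # sometimes a random token
            # otherwise leave the token unchanged
    return corrupted, targets
-->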
<item>
<title>Google's neural machine translation system: bridging the gap between human and machine translation</title>
<link>https://kylrth.com/paper/google-nmt-2016/</link>
<pubDate>Tue, 30 Jun 2020 08:22:30 -0600</pubDate>
<guid>https://kylrth.com/paper/google-nmt-2016/</guid>
<description>This model was superseded by this one. They did some careful things with residual connections to make sure it was very parallelizable. They put each LSTM layer on a separate GPU. They quantized the models in such a way that they could train using full floating-point computations (with a couple of restrictions) and then convert the trained models to quantized versions.</description>
</item>
<item>
<title>Google's multilingual neural machine translation system</title>
<link>https://kylrth.com/paper/google-zero-shot/</link>
<pubDate>Fri, 26 Jun 2020 08:02:12 -0600</pubDate>
<guid>https://kylrth.com/paper/google-zero-shot/</guid>
<description>They use the word-piece model from &ldquo;Japanese and Korean Voice Search&rdquo;, with 32,000 word pieces. (This is a lot less than the 200,000 used in that paper.) They state in the paper that the shared word-piece model is very similar to Byte-Pair-Encoding, which was used for NMT in this paper by researchers at U of Edinburgh. The model and training process are exactly as in Google&rsquo;s earlier paper. It takes 3 weeks on 100 GPUs to train, even after increasing batch size and learning rate.</description>
</item>
<item>
<title>Japanese and Korean voice search</title>
<link>https://kylrth.com/paper/word-piece-model/</link>
<pubDate>Wed, 24 Jun 2020 14:44:02 -0600</pubDate>
<guid>https://kylrth.com/paper/word-piece-model/</guid>
<description>This was mentioned in the paper on Google&rsquo;s Multilingual Neural Machine Translation System. It&rsquo;s regarded as the original paper to use the word-piece model, which is the focus of my notes here. The WordPieceModel: here&rsquo;s the WordPieceModel algorithm:
func WordPieceModel(D, chars, n, threshold) -&gt; inventory:
    # D: training data
    # n: user-specified number of word units (often 200k)
    # chars: unicode characters used in the language (e.g. Kanji, Hiragana, Katakana, ASCII for Japanese)
    # threshold: stopping criterion for likelihood increase
    # inventory: the set of word units created by the model
    inventory := chars
    likelihood := +INF
    while len(inventory) &lt; n &amp;&amp; likelihood &gt;= threshold:
        lm := LM(inventory, D)
        inventory += argmax_{combined word unit}(lm.</description>
</item>
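<!--
A hedged Python restatement of the (truncated) pseudocode in the item above: grow the inventory greedily, adding the merged unit that most increases training-data likelihood. LM and its likelihood_gain method are stand-ins for the language model the pseudocode assumes, not a real API.

def word_piece_model(data, chars, n, threshold):
    inventory = set(chars)
    while len(inventory) < n:
        lm = LM(inventory, data)                        # language model over current word units
        candidates = {a + b for a in inventory for b in inventory}
        best = max(candidates, key=lm.likelihood_gain)  # merge that most increases likelihood
        if lm.likelihood_gain(best) < threshold:        # stop once the gain is too small
            break
        inventory.add(best)
    return inventory
-->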
<item>
<title>Towards a multi-view language representation</title>
<link>https://kylrth.com/paper/multi-view-language-representation/</link>
<pubDate>Tue, 23 Jun 2020 08:40:04 -0600</pubDate>
<guid>https://kylrth.com/paper/multi-view-language-representation/</guid>
<description>They used a technique called CCA to combine hand-made features with NN representations. It didn&rsquo;t do great on typological feature prediction, but it did do well with predicting a phylogenetic tree for Indo-European languages.</description>
</item>
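<!--
A small sketch of combining two feature views with CCA as the summary above describes, using scikit-learn. The arrays, dimensions, and number of components are made up for illustration.

import numpy as np
from sklearn.cross_decomposition import CCA

hand_features = np.random.randn(200, 30)    # e.g. hand-made typological feature vectors
nn_features = np.random.randn(200, 100)     # e.g. learned neural language representations

cca = CCA(n_components=10)
hand_proj, nn_proj = cca.fit_transform(hand_features, nn_features)
combined = np.concatenate([hand_proj, nn_proj], axis=1)   # joint multi-view representation
-->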
</channel>
</rss>