<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
<channel>
<title>papers on Kyle Roth</title>
<link>https://kylrth.com/paper/</link>
<description>Recent content in papers on Kyle Roth</description>
<generator>Hugo -- gohugo.io</generator>
<language>en-us</language>
<lastBuildDate>Mon, 11 Apr 2022 12:17:25 -0400</lastBuildDate>
<atom:link href="https://kylrth.com/paper/index.xml" rel="self" type="application/rss+xml"/>
<item>
<title>PaLM</title>
<link>https://kylrth.com/paper/palm/</link>
<pubDate>Mon, 11 Apr 2022 12:17:25 -0400</pubDate>
<guid>https://kylrth.com/paper/palm/</guid>
<description>I presented this paper at Bang Liu’s research group meeting on 2022-04-11. You can view the slides I used here.</description>
</item>
<item>
<title>QA-GNN: reasoning with language models and knowledge graphs for question answering</title>
<link>https://kylrth.com/paper/qa-gnn/</link>
<pubDate>Tue, 05 Apr 2022 22:54:43 -0400</pubDate>
<guid>https://kylrth.com/paper/qa-gnn/</guid>
<description>This post was created as an assignment in Bang Liu’s IFT6289 course in winter 2022. The structure of the post follows the structure of the assignment: summarization followed by my own comments. The authors create a novel system for combining a language model (LM) and a knowledge graph (KG) by reasoning over a joint graph produced by the LM and the KG, which addresses the problem of irrelevant entities appearing in the knowledge graph and unifies the representations across the LM and KG.</description>
</item>
<item>
<title>Experienced well-being rises with income, even above $75,000 per year</title>
<link>https://kylrth.com/paper/experienced-well-being/</link>
<pubDate>Wed, 30 Mar 2022 13:34:53 -0400</pubDate>
<guid>https://kylrth.com/paper/experienced-well-being/</guid>
<description>Turns out that money does buy happiness. You may have heard that people’s average happiness stops improving once they make more than $75,000/year. Researchers ran a better survey with more data and found that this was not the case. They cited 5 methodological improvements over the older research, which had suggested that income didn’t matter above $75,000: They measured people’s happiness in real time, instead of having people try to remember past happiness levels.</description>
</item>
<item>
<title>Neural message passing for quantum chemistry</title>
<link>https://kylrth.com/paper/neural-message-passing/</link>
<pubDate>Fri, 25 Mar 2022 14:46:11 -0400</pubDate>
<guid>https://kylrth.com/paper/neural-message-passing/</guid>
<description>This post was created as an assignment in Bang Liu’s IFT6289 course in winter 2022. The structure of the post follows the structure of the assignment: summarization followed by my own comments. To summarize, the authors create a unifying framework for describing message-passing neural networks, which they apply to the problem of predicting the structural properties of chemical compounds in the QM9 dataset. Paper summarization: The authors first demonstrate that many recent works applying neural nets to this problem fit into a message-passing neural network (MPNN) framework.</description>
</item>
<item>
<title>The effect of model size on worst-group generalization</title>
<link>https://kylrth.com/paper/effect-of-model-size-on-worst-group-generalization/</link>
<pubDate>Thu, 17 Mar 2022 14:34:33 -0400</pubDate>
<guid>https://kylrth.com/paper/effect-of-model-size-on-worst-group-generalization/</guid>
<description>We presented this paper in Irina Rish’s neural scaling laws course (IFT6167) in winter 2022. You can view the slides we used here.</description>
</item>
<item>
<title>Scaling laws for the few-shot adaptation of pre-trained image classifiers</title>
<link>https://kylrth.com/paper/scaling-laws-few-shot-image-classifiers/</link>
<pubDate>Tue, 22 Feb 2022 13:19:12 -0500</pubDate>
<guid>https://kylrth.com/paper/scaling-laws-few-shot-image-classifiers/</guid>
<description>The unsurprising result here is that few-shot performance scales predictably with pre-training dataset size under traditional fine-tuning, matching network, and prototypical network approaches. The interesting result is that the exponents of these three approaches were substantially different (see Table 1 in the paper), which says to me that the few-shot inference approach matters a lot. The surprising result was that while more training on the “non-natural” Omniglot dataset did not improve few-shot accuracy on other datasets, training on “natural” datasets did improve accuracy on few-shot Omniglot.</description>
</item>
<item>
<title>Learning explanations that are hard to vary</title>
<link>https://kylrth.com/paper/learning-explanations-hard-to-vary/</link>
<pubDate>Tue, 22 Feb 2022 12:29:17 -0500</pubDate>
<guid>https://kylrth.com/paper/learning-explanations-hard-to-vary/</guid>
<description>The big idea here is to use the geometric mean instead of the arithmetic mean across samples in the batch when computing the gradient for SGD. This overcomes the situation where averaging produces optima that are not actually optimal for any individual samples, as demonstrated in their toy example below: In practice, the method the authors test is not exactly the geometric mean for numerical and performance reasons, but effectively accomplishes the same thing by avoiding optima that are &ldquo;inconsistent&rdquo; (meaning that gradients from relatively few samples actually point in that direction).</description>
</item>
<item>
<title>In search of robust measures of generalization</title>
<link>https://kylrth.com/paper/robust-measures-of-generalization/</link>
<pubDate>Mon, 21 Feb 2022 15:33:22 -0500</pubDate>
<guid>https://kylrth.com/paper/robust-measures-of-generalization/</guid>
<description>These authors define robust error as the least upper bound on the expected loss over a family of environmental settings (including dataset, model architecture, learning algorithm, etc.): \[\sup_{e\in\mathcal F}\mathbb E_{\omega\in P^e}\left[\ell(\phi,\omega)\right]\] The fact that this is an upper bound and not an average is very important and is what makes this work unique from previous work in this direction. Indeed, what we should be concerned about is not how poorly a model performs on the average sample but on the worst-case sample.</description>
</item>
<item>
<title>It's not just size that matters: small language models are also few-shot learners</title>
<link>https://kylrth.com/paper/not-just-size-that-matters/</link>
<pubDate>Fri, 18 Feb 2022 13:13:54 -0500</pubDate>
<guid>https://kylrth.com/paper/not-just-size-that-matters/</guid>
<description>We presented this paper as a mini-lecture in Bang Liu&rsquo;s IFT6289 course in winter 2022. You can view the slides we used here.</description>
</item>
<item>
<title>Scaling laws for transfer</title>
<link>https://kylrth.com/paper/scaling-laws-for-transfer/</link>
<pubDate>Wed, 16 Feb 2022 14:12:26 -0500</pubDate>
<guid>https://kylrth.com/paper/scaling-laws-for-transfer/</guid>
<description>This post was created as an assignment in Irina Rish&rsquo;s neural scaling laws course (IFT6167) in winter 2022. The post contains no summarization, only questions and thoughts. Sometimes these scaling laws can feel like pseudoscience because they&rsquo;re a post hoc attempt to place a trend line on data. How can we be confident that the trends we observe actually reflect the scaling laws that we&rsquo;re after? In the limitations section they mention that they didn&rsquo;t tune hyperparameters for fine-tuning or for the code data distribution.</description>
</item>
<item>
<title>Deep learning scaling is predictable, empirically</title>
<link>https://kylrth.com/paper/scaling-predictable-empirically/</link>
<pubDate>Mon, 14 Feb 2022 10:38:11 -0500</pubDate>
<guid>https://kylrth.com/paper/scaling-predictable-empirically/</guid>
<description>This was a paper we presented in Irina Rish&rsquo;s neural scaling laws course (IFT6167) in winter 2022. You can view the slides we used here. It&rsquo;s important to note that in the results for NMT (Figure 1) we would expect the lines in the graph on the left to curve as the capacity of the individual models is exhausted. That&rsquo;s why the authors fit the curves with an extra constant added.</description>
</item>
<item>
<title>Masked autoencoders are scalable vision learners</title>
<link>https://kylrth.com/paper/masked-autoencoders-are-scalable-vision-learners/</link>
<pubDate>Fri, 11 Feb 2022 14:18:30 -0500</pubDate>
<guid>https://kylrth.com/paper/masked-autoencoders-are-scalable-vision-learners/</guid>
<description>This post was created as an assignment in Irina Rish&rsquo;s neural scaling laws course (IFT6167) in winter 2022. The post contains no summarization, only questions and thoughts. In this paper they mention that the mask vector is learned, and it sounds like the positional embeddings are also learned. I remember in Attention is all you need they found that cosine positional embeddings worked better than learned ones, especially for sequences of longer length.</description>
</item>
<item>
<title>Data scaling laws in NMT: the effect of noise and architecture</title>
<link>https://kylrth.com/paper/data-scaling-laws-nmt/</link>
<pubDate>Wed, 09 Feb 2022 20:47:59 -0500</pubDate>
<guid>https://kylrth.com/paper/data-scaling-laws-nmt/</guid>
<description>This paper is all about trying a bunch of different changes to the training setup to see what affects the power law exponent over dataset size. Here are some of the answers: encoder-decoder size asymmetry: exponent not affected, but effective model capacity affected; architecture (LSTM vs. Transformer): exponent not affected, but effective model capacity affected; dataset quality (filtered vs. not): exponent and effective model capacity not affected, losses on smaller datasets affected; dataset source (ParaCrawl vs.</description>
</item>
<item>
<title>Parallel training of deep networks with local updates</title>
<link>https://kylrth.com/paper/parallel-training-with-local-updates/</link>
<pubDate>Wed, 09 Feb 2022 10:50:21 -0500</pubDate>
<guid>https://kylrth.com/paper/parallel-training-with-local-updates/</guid>
<description>This post was created as an assignment in Irina Rish&rsquo;s neural scaling laws course (IFT6167) in winter 2022. The post contains no summarization, only questions and thoughts. Once I learned how the loss functions worked for each chunk, my first question was whether the earlier chunks were going to be able to learn the low-level features that later chunks would need. Figure 7 seems to show that they do, although their quality apparently decreases with increasingly local updates.</description>
</item>
<item>
<title>A sensitivity analysis of (and practitioners’ guide to) convolutional neural networks for sentence classification</title>
<link>https://kylrth.com/paper/cnn-sentence/</link>
<pubDate>Wed, 02 Feb 2022 15:35:00 -0500</pubDate>
<guid>https://kylrth.com/paper/cnn-sentence/</guid>
<description>This post was created as an assignment in Bang Liu&rsquo;s IFT6289 course in winter 2022. The structure of the post follows the structure of the assignment: summarization followed by my own comments. paper summarization Word embeddings have gotten so good that state-of-the-art sentence classification can often be achieved with just a one-layer convolutional network on top of those embeddings. This paper dials in on the specifics of training that convolutional layer for this downstream sentence classification task.</description>
</item>
<item>
<title>Learning transferable visual models from natural language supervision (CLIP)</title>
<link>https://kylrth.com/paper/clip/</link>
<pubDate>Wed, 02 Feb 2022 12:35:03 -0500</pubDate>
<guid>https://kylrth.com/paper/clip/</guid>
<description>This post was created as an assignment in Irina Rish&rsquo;s neural scaling laws course (IFT6167) in winter 2022. The post contains no summarization, only questions and thoughts. This concept of wide vs. narrow supervision (rather than binary &ldquo;supervised&rdquo; and &ldquo;unsupervised&rdquo;) is an interesting and flexible way to think about the way these training schemes leverage data. The zero-shot CLIP matches the performance of 4-shot CLIP, which is a surprising result. What do the authors mean when they make this guess about zero-shot&rsquo;s advantage:</description>
</item>
<item>
<title>Distributed representations of words and phrases and their compositionality</title>
<link>https://kylrth.com/paper/distributed-representations/</link>
<pubDate>Tue, 01 Feb 2022 16:09:19 -0500</pubDate>
<guid>https://kylrth.com/paper/distributed-representations/</guid>
<description>This post was created as an assignment in Bang Liu&rsquo;s IFT6289 course in winter 2022. The structure of the post follows the structure of the assignment: summarization followed by my own comments. paper summarization This paper describes multiple improvements that are made to the original Skip-gram model: Decreasing the rate of exposure to common words improves the training speed and increases the model&rsquo;s accuracy on infrequent words. A new training target they call &ldquo;negative sampling&rdquo; improves the training speed and the model&rsquo;s accuracy on frequent words.</description>
</item>
<item>
<title>Deep learning</title>
<link>https://kylrth.com/paper/deep-learning/</link>
<pubDate>Thu, 20 Jan 2022 15:11:00 -0500</pubDate>
<guid>https://kylrth.com/paper/deep-learning/</guid>
<description>This post was created as an assignment in Bang Liu&rsquo;s IFT6289 course in winter 2022. The structure of the post follows the structure of the assignment: summarization followed by my own comments. paper summarization The authors use the example of distinguishing between a Samoyed and a white wolf to talk about the importance of learning to rely on very small variations while ignoring others. While shallow classifiers must rely on human-crafted features which are expensive to build and always imperfect, deep classifiers are expected to learn their own features by applying a &ldquo;general-purpose learning procedure&rdquo; to learn the features and the classification layer from the data simultaneously.</description>
</item>
<item>
<title>Cross-lingual alignment of contextual word embeddings, with applications to zero-shot dependency parsing</title>
<link>https://kylrth.com/paper/cross-lingual-alignment-contextual/</link>
<pubDate>Fri, 11 Dec 2020 06:30:43 -0700</pubDate>
<guid>https://kylrth.com/paper/cross-lingual-alignment-contextual/</guid>
<description>Recent contextual word embeddings (e.g. ELMo) have been shown to be much better than &ldquo;static&rdquo; embeddings (where there&rsquo;s a one-to-one mapping from token to representation). This paper is exciting because they were able to create a multi-lingual embedding space that used contextual word embeddings. Each token will have a &ldquo;point cloud&rdquo; of embedding values, one point for each context containing the token. They define the embedding anchor as the average of all those points for a particular token.</description>
</item>
<item>
<title>Inductive biases for deep learning of higher-level cognition</title>
<link>https://kylrth.com/paper/inductive-biases-higher-cognition/</link>
<pubDate>Tue, 08 Dec 2020 06:40:48 -0700</pubDate>
<guid>https://kylrth.com/paper/inductive-biases-higher-cognition/</guid>
<description>This is a long paper, so a lot of my writing here is an attempt to condense the discussion. I&rsquo;ve taken the liberty to pull exact phrases and structure from the paper without explicitly using quotes. Our main hypothesis is that deep learning succeeded in part because of a set of inductive biases, but that additional ones should be added in order to go from good in-distribution generalization in highly supervised learning tasks (or where strong and dense rewards are available), such as object recognition in images, to strong out-of-distribution generalization and transfer learning to new tasks with low sample complexity.</description>
</item>
<item>
<title>SpanBERT: improving pre-training by representing and predicting spans</title>
<link>https://kylrth.com/paper/spanbert/</link>
<pubDate>Sat, 05 Dec 2020 16:08:03 -0700</pubDate>
<guid>https://kylrth.com/paper/spanbert/</guid>
<description>BERT optimizes the Masked Language Model (MLM) objective by masking word pieces uniformly at random in its training data and attempting to predict the masked values. With SpanBERT, spans of tokens are masked and the model is expected to predict the text in the spans from the representations of the words on the boundary. Span lengths follow a geometric distribution, and span start points are uniformly random. To predict each individual masked token, a two-layer feedforward network was provided with the boundary token representations plus the position embedding of the target token, and the output vector representation was used to predict the masked token and compute cross-entropy loss exactly as in standard MLM.</description>
</item>
<item>
<title>Deep contextualized word representations</title>
<link>https://kylrth.com/paper/deep-contextualized-word-representations/</link>
<pubDate>Thu, 03 Dec 2020 12:01:43 -0700</pubDate>
<guid>https://kylrth.com/paper/deep-contextualized-word-representations/</guid>
<description>This is the original paper introducing Embeddings from Language Models (ELMo). Unlike most widely used word embeddings, ELMo word representations are functions of the entire input sentence. That&rsquo;s what makes ELMo great: they&rsquo;re contextualized word representations, meaning that they can express multiple possible senses of the same word. Specifically, ELMo representations are a learned linear combination of all layers of an LSTM encoding. The LSTM undergoes general semi-supervised pretraining, but the linear combination is learned specific to the task.</description>
</item>
<item>
<title>Overcoming catastrophic forgetting in neural networks</title>
<link>https://kylrth.com/paper/overcoming-catastrophic-forgetting/</link>
<pubDate>Thu, 01 Oct 2020 10:47:28 -0600</pubDate>
<guid>https://kylrth.com/paper/overcoming-catastrophic-forgetting/</guid>
<description>In the paper they use Bayes&rsquo; rule to show that the contribution of the first of two tasks is contained in the posterior distribution of model parameters over the first dataset. This is important because it means we can estimate that posterior to try to get a sense for which model parameters were most important for that first task. In this paper, they perform that estimation using a multivariate Gaussian distribution.</description>
</item>
<item>
<title>Learning neural causal models from unknown interventions</title>
<link>https://kylrth.com/paper/neural-causal-models/</link>
<pubDate>Tue, 22 Sep 2020 10:39:54 -0600</pubDate>
<guid>https://kylrth.com/paper/neural-causal-models/</guid>
<description>This is a follow-on to A meta-transfer objective for learning to disentangle causal mechanisms Here we describe an algorithm for predicting the causal graph structure of a set of visible random variables, each possibly causally dependent on any of the other variables. the algorithm There are two sets of parameters, the structural parameters and the functional parameters. The structural parameters compose a matrix where \(\sigma(\gamma_{ij})\) represents the belief that variable \(X_j\) is a direct cause of \(X_i\).</description>
</item>
<item>
<title>A meta-transfer objective for learning to disentangle causal mechanisms</title>
<link>https://kylrth.com/paper/meta-transfer-objective-for-causal-mechanisms/</link>
<pubDate>Mon, 21 Sep 2020 08:46:30 -0600</pubDate>
<guid>https://kylrth.com/paper/meta-transfer-objective-for-causal-mechanisms/</guid>
<description>Theoretically, models should be able to predict on out-of-distribution data if their understanding of causal relationships is correct. The toy problem they use in this paper is that of predicting temperature from altitude. If a model is trained on data from Switzerland, the model should ideally be able to correctly predict on data from the Netherlands, even though it hasn&rsquo;t seen elevations that low before. The main contribution of this paper is that they&rsquo;ve found that models tend to transfer faster to a new distribution when they learn the correct causal relationships, and when those relationships are sparsely represented, meaning they are represented by relatively few nodes in the network.</description>
</item>
<item>
<title>Deep learning generalizes because the parameter-function map is biased towards simple functions</title>
<link>https://kylrth.com/paper/parameter-function-map-biased-to-simple/</link>
<pubDate>Tue, 08 Sep 2020 07:29:09 -0600</pubDate>
<guid>https://kylrth.com/paper/parameter-function-map-biased-to-simple/</guid>
<description>The theoretical value in talking about the parameter-function map is that this map lets us talk about sets of parameters that produce the same function. In this paper they used some recently proven stuff from algorithmic information theory (AIT) to show that for neural networks the parameter-function map is biased toward functions with low Kolmogorov complexity, meaning that simple functions are more likely to appear given random choice of parameters. Since real world problems are also biased toward simple functions, this could explain the generalization/memorization results found by Zhang et al.</description>
</item>
<item>
<title>A closer look at memorization in deep networks</title>
<link>https://kylrth.com/paper/closer-look-at-memorization/</link>
<pubDate>Mon, 31 Aug 2020 11:52:35 -0600</pubDate>
<guid>https://kylrth.com/paper/closer-look-at-memorization/</guid>
<description>This paper builds on what we learned in &ldquo;Understanding deep learning requires rethinking generalization&rdquo;. In that paper they showed that DNNs are able to fit pure noise in the same amount of time as they can fit real data, which means that our optimization algorithm (SGD, Adam, etc.) is not what&rsquo;s keeping DNNs from overfitting. experiments for detecting easy/hard samples It looks like there are qualitative differences between a DNN that has memorized some data and a DNN that has seen real data.</description>
</item>
<item>
<title>A disciplined approach to neural network hyperparameters: part 1</title>
<link>https://kylrth.com/paper/disciplined-approach-to-hyperparameters/</link>
<pubDate>Fri, 28 Aug 2020 14:16:29 -0600</pubDate>
<guid>https://kylrth.com/paper/disciplined-approach-to-hyperparameters/</guid>
<description>The goal of hyperparameter tuning is to reach the point where test loss is horizontal on the graph over model complexity. Underfitting can be observed with a small learning rate, simple architecture, or complex data distribution. You can observe underfitting decrease by seeing more drastic results at the outset, followed by a more horizontal line further into training. You can use the LR range test to find a good learning rate range, and then use a cyclical learning rate to move up and down within that range.</description>
</item>
<item>
<title>Forward and reverse gradient-based hyperparameter optimization</title>
<link>https://kylrth.com/paper/gradient-based-hyperparameter-optimization/</link>
<pubDate>Fri, 28 Aug 2020 14:16:29 -0600</pubDate>
<guid>https://kylrth.com/paper/gradient-based-hyperparameter-optimization/</guid>
<description>In the area of hyperparameter optimization (HO), the goal is to optimize a response function of the hyperparameters. The response function is usually the average loss on a validation set. Gradient-based HO refers to iteratively finding the optimal hyperparameters using gradient updates, just as we do with neural network training itself. The gradient of the response function with respect to the hyperparameters is called the hypergradient. One of the great things about this work is that their framework allows for all kinds of hyperparameters.</description>
</item>
<item>
<title>Understanding deep learning requires rethinking generalization</title>
<link>https://kylrth.com/paper/understanding-requires-rethinking-generalization/</link>
<pubDate>Wed, 26 Aug 2020 08:42:58 -0600</pubDate>
<guid>https://kylrth.com/paper/understanding-requires-rethinking-generalization/</guid>
<description>It turns out that neural networks can reach training loss of 0 even on randomly labeled data, even when the data itself is random. It was previously thought that some implicit bias in the model architecture prevented (or regularized the model away from) overfitting to specific training examples, but that&rsquo;s obviously not true. They showed this empirically as just described, and also theoretically constructed a two-layer ReLU network with \(p=2n+d\) parameters to express any labeling of any sample of size \(n\) in \(d\) dimensions.</description>
</item>
<item>
<title>Why does unsupervised pre-training help deep learning?</title>
<link>https://kylrth.com/paper/why-unsupervised-helps/</link>
<pubDate>Mon, 24 Aug 2020 11:40:00 -0600</pubDate>
<guid>https://kylrth.com/paper/why-unsupervised-helps/</guid>
<description>They&rsquo;re pretty sure that it performs regularization by starting off the supervised training in a good spot, instead of by somehow improving the optimization path.</description>
</item>
<item>
<title>The consciousness prior</title>
<link>https://kylrth.com/paper/consciousness-prior/</link>
<pubDate>Fri, 14 Aug 2020 09:05:56 -0700</pubDate>
<guid>https://kylrth.com/paper/consciousness-prior/</guid>
<description>System 1 cognitive abilities are about low-level perception and intuitive knowledge. System 2 cognitive abilities can be described verbally, and include things like reasoning, planning, and imagination. In cognitive neuroscience, the &ldquo;Global Workspace Theory&rdquo; says that at each moment specific pieces of information become a part of working memory and become globally available to other unconscious computational processes. Relative to the unconscious state, the conscious state is low-dimensional, focusing on a few things.</description>
</item>
<item>
<title>Troubling trends in machine learning scholarship</title>
<link>https://kylrth.com/paper/troubling-trends-in-ml/</link>
<pubDate>Thu, 13 Aug 2020 10:36:05 -0700</pubDate>
<guid>https://kylrth.com/paper/troubling-trends-in-ml/</guid>
<description>The authors discuss four trends in AI research that have negative consequences for the community. problems explanation vs. speculation It&rsquo;s important to allow researchers to include speculation, because speculation is what allows ideas to form. But the paper has to carefully couch speculation inside a &ldquo;Motivations&rdquo; section or other verbiage to ensure the reader understands its place. It&rsquo;s extremely important to define concepts before using them. Terms like internal covariate shift or coverage sound like definitions without actually being such.</description>
</item>
<item>
<title>Attention is all you need</title>
<link>https://kylrth.com/paper/attention-all-you-need/</link>
<pubDate>Wed, 05 Aug 2020 12:37:42 -0700</pubDate>
<guid>https://kylrth.com/paper/attention-all-you-need/</guid>
<description>I also referred to this implementation to understand some of the details. This is the paper describing the Transformer, a sequence-to-sequence model based entirely on attention. I think it&rsquo;s best described with pictures. model overview From this picture, I think the following things need explaining: embeddings these are learned embeddings that convert the input and output tokens to vectors of the model dimension. In this paper, they actually used the same weight matrix for input embedding, output embedding, and the final linear layer before the final softmax.</description>
</item>
<item>
<title>BERT: pre-training of deep bidirectional transformers for language understanding</title>
<link>https://kylrth.com/paper/bert/</link>
<pubDate>Tue, 04 Aug 2020 08:57:44 -0700</pubDate>
<guid>https://kylrth.com/paper/bert/</guid>
<description>The B is for bidirectional, and that&rsquo;s a big deal. It makes it possible to do well on sentence-level (NLI, question answering) and token-level tasks (NER, POS tagging). In a unidirectional model, the word &ldquo;bank&rdquo; in a sentence like &ldquo;I made a bank deposit.&rdquo; has only &ldquo;I made a&rdquo; as its context, keeping useful information from the model. Another cool thing is masked language model training (MLM). They train the model by blanking certain words in the sentence and asking the model to guess the missing word.</description>
</item>
<item>
<title>Compositional generalization by factorizing alignment and translation</title>
<link>https://kylrth.com/paper/factorizing-alignment-and-translation/</link>
<pubDate>Mon, 27 Jul 2020 09:11:16 -0700</pubDate>
<guid>https://kylrth.com/paper/factorizing-alignment-and-translation/</guid>
<description>They had a biRNN with attention for alignment encoding, and then a single linear function of each one-hot encoded word for encoding that single word. Their reasoning was that by separating the alignment from the meaning of individual words the model could more easily generalize to unseen words.</description>
</item>
<item>
<title>Semi-supervised training for automatic speech recognition</title>
<link>https://kylrth.com/paper/semi-supervised-for-asr/</link>
<pubDate>Tue, 14 Jul 2020 08:06:00 -0600</pubDate>
<guid>https://kylrth.com/paper/semi-supervised-for-asr/</guid>
<description>This was Manohar&rsquo;s PhD dissertation at JHU. Chapter 2 provides a relatively clear overview of how chain and non-chain models work in Kaldi. In chapter 3 he tried using negative conditional entropy as the loss function for the unsupervised data, and it helped a bit. In chapter 4 Manohar uses CTC loss. In chapter 5, he discusses ways to do semi-supervised model training. It&rsquo;s nice when you have parallel data in different domains, because then you can do a student-teacher model.</description>
</item>
<item>
<title>Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks</title>
<link>https://kylrth.com/paper/ctc/</link>
<pubDate>Fri, 10 Jul 2020 09:14:59 -0600</pubDate>
<guid>https://kylrth.com/paper/ctc/</guid>
<description>RNNs generally require pre-segmented training data, but this avoids that need. Basically, you have the RNN output probabilities for each label (or a blank) for every frame, and then you find the most likely path across that lattice of probabilities. The section explaining the loss function was kind of complicated. They used their forward-backward algorithm (sort of like Viterbi) to get the probability of all paths corresponding to the output that go through each symbol at each time, and then they differentiated that to get the derivatives with respect to the outputs.</description>
</item>
<item>
<title>Google's neural machine translation system: bridging the gap between human and machine translation</title>
<link>https://kylrth.com/paper/google-nmt-2016/</link>
<pubDate>Tue, 30 Jun 2020 08:22:30 -0600</pubDate>
<guid>https://kylrth.com/paper/google-nmt-2016/</guid>
<description>This model was superseded by this one. They did some careful things with residual connections to make sure it was very parallelizable. They put each LSTM layer on a separate GPU. They quantized the models such that they could train using full floating-point computations with a couple restrictions and then convert the models to quantized versions.</description>
</item>
<item>
<title>Google's multilingual neural machine translation system</title>
<link>https://kylrth.com/paper/google-zero-shot/</link>
<pubDate>Fri, 26 Jun 2020 08:02:12 -0600</pubDate>
<guid>https://kylrth.com/paper/google-zero-shot/</guid>
<description>They use the word-piece model from &ldquo;Japanese and Korean Voice Search&rdquo;, with 32,000 word pieces. (This is a lot less than the 200,000 used in that paper.) They state in the paper that the shared word-piece model is very similar to Byte-Pair-Encoding, which was used for NMT in this paper by researchers at U of Edinburgh. The model and training process are exactly as in Google&rsquo;s earlier paper. It takes 3 weeks on 100 GPUs to train, even after increasing batch size and learning rate.</description>
</item>
<item>
<title>Japanese and Korean voice search</title>
<link>https://kylrth.com/paper/word-piece-model/</link>
<pubDate>Wed, 24 Jun 2020 14:44:02 -0600</pubDate>
<guid>https://kylrth.com/paper/word-piece-model/</guid>
<description>This was mentioned in the paper on Google&rsquo;s Multilingual Neural Machine Translation System. It&rsquo;s regarded as the original paper to use the word-piece model, which is the focus of my notes here. the WordPieceModel Here&rsquo;s the WordPieceModel algorithm: func WordPieceModel(D, chars, n, threshold) -&gt; inventory: # D: training data # n: user-specified number of word units (often 200k) # chars: unicode characters used in the language (e.g. Kanji, Hiragana, Katakana, ASCII for Japanese) # threshold: stopping criterion for likelihood increase # inventory: the set of word units created by the model inventory := chars likelihood := +INF while len(inventory) &lt; n &amp;&amp; likelihood &gt;= threshold: lm := LM(inventory, D) inventory += argmax_{combined word unit}(lm.</description>
</item>
<item>
<title>Towards a multi-view language representation</title>
<link>https://kylrth.com/paper/multi-view-language-representation/</link>
<pubDate>Tue, 23 Jun 2020 08:40:04 -0600</pubDate>
<guid>https://kylrth.com/paper/multi-view-language-representation/</guid>
<description>They used a technique called CCA to combine hand-made features with NN representations. It didn&rsquo;t do great on typological feature prediction, but it did do well with predicting a phylogenetic tree for Indo-European languages.</description>
</item>
<item>
<title>Universal phone recognition with a multilingual allophone system</title>
<link>https://kylrth.com/paper/universal-phone-recognition/</link>
<pubDate>Tue, 23 Jun 2020 08:33:48 -0600</pubDate>
<guid>https://kylrth.com/paper/universal-phone-recognition/</guid>
<description>These guys made sure to model allophones. They had an encoder that produced a universal phone set, and then language-specific decoders. This meant they could use data from various languages to train the system. The decoder has an allophone layer, followed by other dense trainable layers. The allophone layer is a single trainable dense layer, but was initialized by a bunch of linguists who sat down and described the phone sets belonging to each phoneme in each language present in the training set.</description>
</item>
</channel>
</rss>