<title>deep-learning on Kyle Roth</title>
<link>https://kylrth.com/tags/deep-learning/</link>
<description>Recent content in deep-learning on Kyle Roth</description>
<generator>Hugo -- gohugo.io</generator>
<language>en-us</language>
<lastBuildDate>Mon, 11 Apr 2022 12:17:25 -0400</lastBuildDate>
<atom:link href="https://kylrth.com/tags/deep-learning/index.xml" rel="self" type="application/rss+xml"/>
<item>
<title>PaLM</title>
<link>https://kylrth.com/paper/palm/</link>
<pubDate>Mon, 11 Apr 2022 12:17:25 -0400</pubDate>
<guid>https://kylrth.com/paper/palm/</guid>
<description>I presented this paper in Bang Liu’s research group meeting on 2022-04-11. You can view the slides I used here.</description>
...
</item>
<item>
<title>QA-GNN: reasoning with language models and knowledge graphs for question answering</title>
<link>https://kylrth.com/paper/qa-gnn/</link>
<pubDate>Tue, 05 Apr 2022 22:54:43 -0400</pubDate>
<guid>https://kylrth.com/paper/qa-gnn/</guid>
<description>This post was created as an assignment in Bang Liu’s IFT6289 course in winter 2022. The structure of the post follows the structure of the assignment: summarization followed by my own comments. The authors create a novel system for combining an LM and a knowledge graph by performing reasoning over a joint graph produced by the LM and the KG, thus solving the problem of irrelevant entities appearing in the knowledge graph and unifying the representations across the LM and KG.</description>
...
</item>
<item>
<title>Neural message passing for quantum chemistry</title>
<link>https://kylrth.com/paper/neural-message-passing/</link>
<pubDate>Fri, 25 Mar 2022 14:46:11 -0400</pubDate>
<guid>https://kylrth.com/paper/neural-message-passing/</guid>
<description>This post was created as an assignment in Bang Liu’s IFT6289 course in winter 2022. The structure of the post follows the structure of the assignment: summarization followed by my own comments. To summarize, the authors create a unifying framework for describing message-passing neural networks, which they apply to the problem of predicting the structural properties of chemical compounds in the QM9 dataset. paper summarization The authors first demonstrate that many of the recent works applying neural nets to this problem can fit into a message-passing neural network (MPNN) framework.</description>
...
</item>
<item>
<title>The effect of model size on worst-group generalization</title>
<link>https://kylrth.com/paper/effect-of-model-size-on-worst-group-generalization/</link>
<pubDate>Thu, 17 Mar 2022 14:34:33 -0400</pubDate>
<guid>https://kylrth.com/paper/effect-of-model-size-on-worst-group-generalization/</guid>
<description>We presented this paper in Irina Rish’s neural scaling laws course (IFT6167) in winter 2022. You can view the slides we used here.</description>
...
</item>
<item>
<title>Scaling laws for the few-shot adaptation of pre-trained image classifiers</title>
<link>https://kylrth.com/paper/scaling-laws-few-shot-image-classifiers/</link>
<pubDate>Tue, 22 Feb 2022 13:19:12 -0500</pubDate>
<guid>https://kylrth.com/paper/scaling-laws-few-shot-image-classifiers/</guid>
<description>The unsurprising result here is that few-shot performance scales predictably with pre-training dataset size under traditional fine-tuning, matching network, and prototypical network approaches. The interesting result is that the exponents of these three approaches were substantially different (see Table 1 in the paper), which says to me that the few-shot inference approach matters a lot. The surprising result was that while more training on the “non-natural” Omniglot dataset did not improve few-shot accuracy on other datasets, training on “natural” datasets did improve accuracy on few-shot Omniglot.</description>
...
</item>
<item>
<title>Learning explanations that are hard to vary</title>
<link>https://kylrth.com/paper/learning-explanations-hard-to-vary/</link>
<pubDate>Tue, 22 Feb 2022 12:29:17 -0500</pubDate>
<guid>https://kylrth.com/paper/learning-explanations-hard-to-vary/</guid>
<description>The big idea here is to use the geometric mean instead of the arithmetic mean across samples in the batch when computing the gradient for SGD. This overcomes the situation where averaging produces optima that are not actually optimal for any individual sample, as demonstrated in their toy example below: In practice, the method the authors test is not exactly the geometric mean for numerical and performance reasons, but effectively accomplishes the same thing by avoiding optima that are “inconsistent” (meaning that gradients from relatively few samples actually point in that direction).</description>
...
</item>
<item>
<title>In search of robust measures of generalization</title>
<link>https://kylrth.com/paper/robust-measures-of-generalization/</link>
<pubDate>Mon, 21 Feb 2022 15:33:22 -0500</pubDate>
<guid>https://kylrth.com/paper/robust-measures-of-generalization/</guid>
<description>These authors define robust error as the least upper bound on the expected loss over a family of environmental settings (including dataset, model architecture, learning algorithm, etc.): \[\sup_{e\in\mathcal F}\mathbb E_{\omega\in P^e}\left[\ell(\phi,\omega)\right]\] The fact that this is an upper bound and not an average is very important and is what distinguishes this work from previous work in this direction. Indeed, what we should be concerned about is not how poorly a model performs on the average sample but how poorly it performs on the worst-case sample.</description>
...
</item>
<item>
<title>It's not just size that matters: small language models are also few-shot learners</title>
<link>https://kylrth.com/paper/not-just-size-that-matters/</link>
<pubDate>Fri, 18 Feb 2022 13:13:54 -0500</pubDate>
<guid>https://kylrth.com/paper/not-just-size-that-matters/</guid>
<description>We presented this paper as a mini-lecture in Bang Liu’s IFT6289 course in winter 2022. You can view the slides we used here.</description>
...
</item>
<item>
<title>Scaling laws for transfer</title>
<link>https://kylrth.com/paper/scaling-laws-for-transfer/</link>
<pubDate>Wed, 16 Feb 2022 14:12:26 -0500</pubDate>
<guid>https://kylrth.com/paper/scaling-laws-for-transfer/</guid>
<description>This post was created as an assignment in Irina Rish’s neural scaling laws course (IFT6167) in winter 2022. The post contains no summarization, only questions and thoughts. Sometimes these scaling laws can feel like pseudoscience because they’re a post hoc attempt to place a trend line on data. How can we be confident that the trends we observe actually reflect the scaling laws that we’re after? In the limitations section they mention that they didn’t tune hyperparameters for fine-tuning or for the code data distribution.</description>
...
</item>
<item>
<title>Deep learning scaling is predictable, empirically</title>
<link>https://kylrth.com/paper/scaling-predictable-empirically/</link>
<pubDate>Mon, 14 Feb 2022 10:38:11 -0500</pubDate>
<guid>https://kylrth.com/paper/scaling-predictable-empirically/</guid>
<description>We presented this paper in Irina Rish’s neural scaling laws course (IFT6167) in winter 2022. You can view the slides we used here. It’s important to note that in the results for NMT (Figure 1) we would expect the lines in the graph on the left to curve as the capacity of the individual models is exhausted. That’s why the authors fit the curves with an extra constant added.</description>
...
</item>
<item>
<title>Masked autoencoders are scalable vision learners</title>
<link>https://kylrth.com/paper/masked-autoencoders-are-scalable-vision-learners/</link>
<pubDate>Fri, 11 Feb 2022 14:18:30 -0500</pubDate>
<guid>https://kylrth.com/paper/masked-autoencoders-are-scalable-vision-learners/</guid>
<description>This post was created as an assignment in Irina Rish’s neural scaling laws course (IFT6167) in winter 2022. The post contains no summarization, only questions and thoughts. In this paper they mention that the mask vector is learned, and it sounds like the positional embeddings are also learned. I remember in “Attention is all you need” they found that sinusoidal positional embeddings worked better than learned ones, especially for sequences of longer length.</description>
...
</item>
<item>
<title>Data scaling laws in NMT: the effect of noise and architecture</title>
<link>https://kylrth.com/paper/data-scaling-laws-nmt/</link>
<pubDate>Wed, 09 Feb 2022 20:47:59 -0500</pubDate>
<guid>https://kylrth.com/paper/data-scaling-laws-nmt/</guid>
<description>This paper is all about trying a bunch of different changes to the training setup to see what affects the power law exponent over dataset size. Here are some of the answers: encoder-decoder size asymmetry: exponent not affected, but effective model capacity affected; architecture (LSTM vs. Transformer): exponent not affected, but effective model capacity affected; dataset quality (filtered vs. not): exponent and effective model capacity not affected, losses on smaller datasets affected; dataset source (ParaCrawl vs.</description>
...
</item>
<item>
<title>Parallel training of deep networks with local updates</title>
<link>https://kylrth.com/paper/parallel-training-with-local-updates/</link>
<pubDate>Wed, 09 Feb 2022 10:50:21 -0500</pubDate>
<guid>https://kylrth.com/paper/parallel-training-with-local-updates/</guid>
<description>This post was created as an assignment in Irina Rish’s neural scaling laws course (IFT6167) in winter 2022. The post contains no summarization, only questions and thoughts. Once I learned how the loss functions worked for each chunk, my first question was whether the earlier chunks were going to be able to learn the low-level features that later chunks would need. Figure 7 seems to show that they do, although their quality apparently decreases with increasingly local updates.</description>
...
</item>
<item>
<title>A sensitivity analysis of (and practitioners’ guide to) convolutional neural networks for sentence classification</title>
<link>https://kylrth.com/paper/cnn-sentence/</link>
<pubDate>Wed, 02 Feb 2022 15:35:00 -0500</pubDate>
<guid>https://kylrth.com/paper/cnn-sentence/</guid>
<description>This post was created as an assignment in Bang Liu’s IFT6289 course in winter 2022. The structure of the post follows the structure of the assignment: summarization followed by my own comments. paper summarization Word embeddings have gotten so good that state-of-the-art sentence classification can often be achieved with just a one-layer convolutional network on top of those embeddings. This paper dials in on the specifics of training that convolutional layer for this downstream sentence classification task.</description>
...
</item>
<item>
<title>Learning transferable visual models from natural language supervision (CLIP)</title>
<link>https://kylrth.com/paper/clip/</link>
<pubDate>Wed, 02 Feb 2022 12:35:03 -0500</pubDate>
<guid>https://kylrth.com/paper/clip/</guid>
<description>This post was created as an assignment in Irina Rish’s neural scaling laws course (IFT6167) in winter 2022. The post contains no summarization, only questions and thoughts. This concept of wide vs. narrow supervision (rather than binary “supervised” and “unsupervised”) is an interesting and flexible way to think about the way these training schemes leverage data. The zero-shot CLIP matches the performance of 4-shot CLIP, which is a surprising result. What do the authors mean when they make this guess about zero-shot’s advantage:</description>
...
</item>
<item>
<title>Distributed representations of words and phrases and their compositionality</title>
<link>https://kylrth.com/paper/distributed-representations/</link>
<pubDate>Tue, 01 Feb 2022 16:09:19 -0500</pubDate>
<guid>https://kylrth.com/paper/distributed-representations/</guid>
<description>This post was created as an assignment in Bang Liu’s IFT6289 course in winter 2022. The structure of the post follows the structure of the assignment: summarization followed by my own comments. paper summarization This paper describes multiple improvements that are made to the original Skip-gram model: Decreasing the rate of exposure to common words improves the training speed and increases the model’s accuracy on infrequent words. A new training target they call “negative sampling” improves the training speed and the model’s accuracy on frequent words.</description>
...
</item>
<item>
<title>Deep learning</title>
<link>https://kylrth.com/paper/deep-learning/</link>
<pubDate>Thu, 20 Jan 2022 15:11:00 -0500</pubDate>
<guid>https://kylrth.com/paper/deep-learning/</guid>
<description>This post was created as an assignment in Bang Liu’s IFT6289 course in winter 2022. The structure of the post follows the structure of the assignment: summarization followed by my own comments. paper summarization The authors use the example of distinguishing between a Samoyed and a white wolf to talk about the importance of learning to rely on very small variations while ignoring others. While shallow classifiers must rely on human-crafted features which are expensive to build and always imperfect, deep classifiers are expected to learn their own features by applying a “general-purpose learning procedure” to learn the features and the classification layer from the data simultaneously.</description>
...
</item>
<item>
<title>avatarify</title>
<link>https://kylrth.com/post/avatarify/</link>
<pubDate>Wed, 24 Nov 2021 11:58:34 -0500</pubDate>
<guid>https://kylrth.com/post/avatarify/</guid>
<description>Avatarify is a cool project that lets you create a relatively realistic avatar that you can use during video meetings. It works by creating a fake video input device and passing your video input through a neural network in PyTorch. My laptop doesn’t have a GPU, so I used the server/client setup. setting up the server Be sure you’ve installed the Nvidia Docker runtime so that the Docker container can use the GPU.</description>
...
</item>
<item>
<title>Inductive biases for deep learning of higher-level cognition</title>
<link>https://kylrth.com/paper/inductive-biases-higher-cognition/</link>
<pubDate>Tue, 08 Dec 2020 06:40:48 -0700</pubDate>
<guid>https://kylrth.com/paper/inductive-biases-higher-cognition/</guid>
<description>This is a long paper, so a lot of my writing here is an attempt to condense the discussion. I’ve taken the liberty of pulling exact phrases and structure from the paper without explicitly using quotes. Our main hypothesis is that deep learning succeeded in part because of a set of inductive biases, but that additional ones should be added in order to go from good in-distribution generalization in highly supervised learning tasks (or where strong and dense rewards are available), such as object recognition in images, to strong out-of-distribution generalization and transfer learning to new tasks with low sample complexity.</description>
...
</item>
<item>
<title>A closer look at memorization in deep networks</title>
<link>https://kylrth.com/paper/closer-look-at-memorization/</link>
<pubDate>Mon, 31 Aug 2020 11:52:35 -0600</pubDate>
<guid>https://kylrth.com/paper/closer-look-at-memorization/</guid>
<description>This paper builds on what we learned in “Understanding deep learning requires rethinking generalization”. In that paper they showed that DNNs are able to fit pure noise in the same amount of time as they can fit real data, which means that our optimization algorithm (SGD, Adam, etc.) is not what’s keeping DNNs from overfitting. experiments for detecting easy/hard samples It looks like there are qualitative differences between a DNN that has memorized some data and a DNN that has seen real data.</description>
...
</item>
...