<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
<channel>
<title>optimization on Kyle Roth</title>
<link>https://kylrth.com/tags/optimization/</link>
<description>Recent content in optimization on Kyle Roth</description>
<generator>Hugo -- gohugo.io</generator>
<language>en-us</language>
<lastBuildDate>Wed, 09 Feb 2022 10:50:21 -0500</lastBuildDate>
<atom:link href="https://kylrth.com/tags/optimization/index.xml" rel="self" type="application/rss+xml"/>
<item>
<title>Parallel training of deep networks with local updates</title>
<link>https://kylrth.com/paper/parallel-training-with-local-updates/</link>
<pubDate>Wed, 09 Feb 2022 10:50:21 -0500</pubDate>
<guid>https://kylrth.com/paper/parallel-training-with-local-updates/</guid>
<description>This post was created as an assignment in Irina Rish&rsquo;s neural scaling laws course (IFT6167) in winter 2022. The post contains no summarization, only questions and thoughts. Once I learned how the loss functions worked for each chunk, my first question was whether the earlier chunks were going to be able to learn the low-level features that later chunks would need. Figure 7 seems to show that they do, although their quality apparently decreases with increasingly local updates.</description>
</item>
<item>
<title>SpanBERT: improving pre-training by representing and predicting spans</title>
<link>https://kylrth.com/paper/spanbert/</link>
<pubDate>Sat, 05 Dec 2020 16:08:03 -0700</pubDate>
<guid>https://kylrth.com/paper/spanbert/</guid>
<description>BERT optimizes the Masked Language Model (MLM) objective by masking word pieces uniformly at random in its training data and attempting to predict the masked values. With SpanBERT, contiguous spans of tokens are masked, and the model must predict the text in each span from the representations of the tokens at the span boundary. Span lengths follow a geometric distribution, and span start points are chosen uniformly at random. To predict each masked token, a two-layer feedforward network is given the boundary token representations plus the position embedding of the target token, and its output representation is used to predict the masked token, with cross-entropy loss computed exactly as in standard MLM.</description>
</item>
<item>
<title>Overcoming catastrophic forgetting in neural networks</title>
<link>https://kylrth.com/paper/overcoming-catastrophic-forgetting/</link>
<pubDate>Thu, 01 Oct 2020 10:47:28 -0600</pubDate>
<guid>https://kylrth.com/paper/overcoming-catastrophic-forgetting/</guid>
<description>In the paper they use Bayes&rsquo; rule to show that the contribution of the first of two tasks is contained in the posterior distribution of model parameters over the first dataset. This is important because it means we can estimate that posterior to get a sense for which model parameters were most important for that first task. They perform that estimation using a multivariate Gaussian distribution.</description>
</item>
<item>
<title>A disciplined approach to neural network hyperparameters: part 1</title>
<link>https://kylrth.com/paper/disciplined-approach-to-hyperparameters/</link>
<pubDate>Fri, 28 Aug 2020 14:16:29 -0600</pubDate>
<guid>https://kylrth.com/paper/disciplined-approach-to-hyperparameters/</guid>
<description>The goal of hyperparameter tuning is to reach the point where test loss is horizontal on the graph over model complexity. Underfitting can appear with a small learning rate, a simple architecture, or a complex data distribution. You can tell underfitting is decreasing when test loss drops sharply at the outset and then flattens further into training. You can use the LR range test to find a good learning rate range, and then use a cyclical learning rate to move up and down within that range.</description>
</item>
<item>
<title>Forward and reverse gradient-based hyperparameter optimization</title>
<link>https://kylrth.com/paper/gradient-based-hyperparameter-optimization/</link>
<pubDate>Fri, 28 Aug 2020 14:16:29 -0600</pubDate>
<guid>https://kylrth.com/paper/gradient-based-hyperparameter-optimization/</guid>
<description>In hyperparameter optimization (HO), the goal is to optimize a response function of the hyperparameters, usually the average loss on a validation set. Gradient-based HO iteratively updates the hyperparameters with gradient steps, just as we do in neural network training itself. The gradient of the response function with respect to the hyperparameters is called the hypergradient. One strength of this work is that the framework accommodates many kinds of hyperparameters.</description>
</item>
</channel>
</rss>