optimization
This post was created as an assignment in Irina Rish’s neural scaling laws course (IFT6167) in winter 2022. The post contains no summarization, only questions and thoughts.
Once I learned how the loss functions worked for each chunk, my first question was whether the earlier chunks were going to be able to learn the low-level features that later chunks would need. Figure 7 seems to show that they do, although their quality apparently decreases with increasingly local updates.
Read moreBERT optimizes the Masked Language Model (MLM) objective by masking word pieces uniformly at random in its training data and attempting to predict the masked values. With SpanBERT, spans of tokens are masked and the model is expected to predict the text in the spans from the representations of the words on the boundary. Span lengths follow a geometric distribution, and span start points are uniformly random.
To predict each individual masked token, a two-layer feedforward network was provided with the boundary token representations plus the position embedding of the target token, and the output vector representation was used to predict the masked token and compute cross-entropy loss exactly as in standard MLM.
Read moreIn the paper they use Bayes’ rule to show that the contribution of the first of two tasks is contained in the posterior distribution of model parameters over the first dataset. This is important because it means we can estimate that posterior to try to get a sense for which model parameters were most important for that first task.
In this paper, they perform that estimation using a multivariate Gaussian distribution.
Read moreThe goal of hyperparameter tuning is to reach the point where test loss is horizontal on the graph over model complexity.
Underfitting can be observed with a small learning rate, simple architecture, or complex data distribution. You can observe underfitting decrease by seeing more drastic results at the outset, followed by a more horizontal line further into training. You can use the LR range test to find a good learning rate range, and then use a cyclical learning rate to move up and down within that range.
Read moreIn the area of hyperparameter optimization (HO), the goal is to optimize a response function of the hyperparameters. The response function is usually the average loss on a validation set. Gradient-based HO refers to iteratively finding the optimal hyperparameters using gradient updates, just as we do with neural network training itself. The gradient of the response function with respect to the hyperparameters is called the hypergradient.
One of the great things about this work is that their framework allows for all kinds of hyperparameters.
Read more