papers

These are my notes from research papers I read. Each page’s title is also a link to the abstract or PDF.

Deep learning generalizes because the parameter-function map is biased towards simple functions

The theoretical value in talking about the parameter-function map is that this map lets us talk about sets of parameters that produce the same function. In this paper they used some recently proven stuff from algorithmic information theory (AIT) to show that for neural networks the parameter-function map is biased toward functions with low Komolgorov complexity, meaning that simple functions are more likely to appear given random choice of parameters. Since real world problems are also biased toward simple functions, this could explain the generalization/memorization results found by Zhang et al.
Read more

A closer look at memorization in deep networks

This paper builds on what we learned in “Understanding deep learning requires rethinking generalization”. In that paper they showed that DNNs are able to fit pure noise in the same amount of time as it can fit real data, which means that our optimization algorithm (SGD, Adam, etc.) is not what’s keeping DNNs from overfitting. experiments for detecting easy/hard samples It looks like there are qualitative differences between a DNN that has memorized some data and a DNN that has seen real data.
Read more

A disciplined approach to neural network hyperparameters: part 1

The goal of hyperparameter tuning is to reach the point where test loss is horizontal on the graph over model complexity. Underfitting can be observed with a small learning rate, simple architecture, or complex data distribution. You can observe underfitting decrease by seeing more drastic results at the outset, followed by a more horizontal line further into training. You can use the LR range test to find a good learning rate range, and then use a cyclical learning rate to move up and down within that range.
Read more

Forward and reverse gradient-based hyperparameter optimization

In the area of hyperparameter optimization (HO), the goal is to optimize a response function of the hyperparameters. The response function is usually the average loss on a validation set. Gradient-based HO refers to iteratively finding the optimal hyperparameters using gradient updates, just as we do with neural network training itself. The gradient of the response function with respect to the hyperparameters is called the hypergradient. One of the great things about this work is that their framework allows for all kinds of hyperparameters.
Read more

Understanding deep learning requires rethinking generalization

It turns out that neural networks can reach training loss of 0 even on randomly labeled data, even when the data itself is random. It was previously thought that some implicit bias in the model architecture prevented (or regularized the model away from) overfitting to specific training examples, but that’s obviously not true. They showed this empirically as just described, and also theoretically constructed a two-layer ReLU network with \(p=2n+d\) parameters to express any labeling of any sample of size \(n\) in \(d\) dimensions.
Read more