We presented this paper in Irina Rish’s neural scaling laws course (IFT6167) in winter 2022. You can view the slides we used here.

Note that in the NMT results (Figure 1), we would expect the lines in the left graph to curve: each individual model has finite capacity, and its error stops improving once that capacity is exhausted, no matter how much more data it sees. That’s why the authors fit these curves with an extra additive constant. The results in the right graph are curved for a different reason: as the data size grows, the optimal model size grows too, and it becomes increasingly difficult to find hyperparameters that train the model down to the optimal generalization error. (See the last paragraph of Section 4.1.)
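To see why an additive constant makes a line bend on a log-log plot, here is a minimal sketch. It assumes the fitted form is a power law plus a constant, `alpha * m**beta + gamma`, where `m` is the training set size. The function names and parameter values are made up for illustration and are not taken from the paper. The sketch compares the local log-log slope of a pure power law with that of the same law with a floor `gamma`.

```python
import math


def power_law_with_floor(m, alpha, beta, gamma):
    """Error at training set size m: alpha * m**beta + gamma (assumed form)."""
    return alpha * m ** beta + gamma


def local_loglog_slope(f, m, factor=2.0):
    """Slope of log f versus log m, measured between m and factor * m."""
    return (math.log(f(factor * m)) - math.log(f(m))) / math.log(factor)


# Hypothetical parameters, chosen only to make the effect visible.
ALPHA, BETA, GAMMA = 10.0, -0.3, 0.5


def pure(m):
    # No floor: a straight line on a log-log plot.
    return power_law_with_floor(m, ALPHA, BETA, 0.0)


def floored(m):
    # With a floor: the curve flattens as the error approaches GAMMA.
    return power_law_with_floor(m, ALPHA, BETA, GAMMA)


sizes = [10 ** k for k in range(2, 9)]
pure_slopes = [local_loglog_slope(pure, m) for m in sizes]
floored_slopes = [local_loglog_slope(floored, m) for m in sizes]

for m, p, f in zip(sizes, pure_slopes, floored_slopes):
    print(f"m={m:>10d}  pure slope={p:+.3f}  floored slope={f:+.3f}")
```

The pure power law keeps a constant slope equal to `beta` at every size, while the floored version starts near that slope and drifts toward zero as the power-law term shrinks relative to `gamma`. That drift is the kind of bend described above for the left graph.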