This paper tries a range of changes to the training setup to see what affects the power-law exponent over dataset size. Here are some of the answers:

- encoder-decoder size asymmetry: exponent not affected, but effective model capacity affected
- architecture (LSTM vs. Transformer): exponent not affected, but effective model capacity affected
- dataset quality (filtered vs. not): exponent and effective model capacity not affected, losses on smaller datasets affected
- dataset source (ParaCrawl vs. in-house dataset): exponent not affected
- adding independent noise: exponent not affected, but effective model capacity affected

Here are some other things to test that I thought of while reading:

- compare scaling with respect to language pairs (the architecture experiments saw p = 0.28 and p = 0.25 for en -> de and zh -> en respectively. Is that difference significant?)
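
As a rough sketch of what "fitting the exponent" means here: assuming the simple form L(D) = a * D^(-p) (the paper's actual fit may include an irreducible-loss term; this and the synthetic numbers below are my own illustration, not the paper's data), p can be estimated by a least-squares line in log-log space.

```python
import numpy as np

def fit_power_law_exponent(dataset_sizes, losses):
    """Fit log L = log a - p * log D by least squares; return (p, a)."""
    log_d = np.log(dataset_sizes)
    log_l = np.log(losses)
    slope, intercept = np.polyfit(log_d, log_l, 1)
    return -slope, np.exp(intercept)

# Synthetic example: data generated with a known p = 0.28 plus small
# multiplicative noise, then recovered by the fit.
rng = np.random.default_rng(0)
sizes = np.logspace(5, 8, 10)  # hypothetical dataset sizes, 1e5 .. 1e8
true_losses = 50.0 * sizes ** (-0.28)
noisy_losses = true_losses * np.exp(rng.normal(0.0, 0.01, sizes.shape))

p_hat, a_hat = fit_power_law_exponent(sizes, noisy_losses)
print(f"recovered exponent p = {p_hat:.3f}")  # should be close to 0.28
```

Whether the gap between p = 0.28 and p = 0.25 is significant would come down to the uncertainty on each fitted slope (e.g. bootstrap over the data points of each fit), which the setup above makes easy to check.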