papers

These are my notes from research papers I read. Each page’s title is also a link to the abstract or PDF.

Deep learning scaling is predictable, empirically

Posted on 2022-02-14 at 10:38:11 UTC-0500

This was a paper we presented in Irina Rish’s neural scaling laws course (IFT6167) in winter 2022. You can view the slides we used here. It’s important to note that in the results for NMT (Figure 1) we would expect the lines in the graph on the left to curve as the capacity of the individual models is exhausted. That’s why the authors fit the curves with an extra constant added.
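To make the point about the extra constant concrete, here is a minimal sketch. It assumes a loss of the form alpha * m^beta + gamma, where m is the training set size. The constants below are hypothetical values picked for illustration, not numbers from the paper. A pure power law is a straight line on a log-log plot, so its local slope is always beta. Once a constant is added, the slope flattens toward zero as m grows, which is the curving described above.

```python
import math

def power_law_loss(m, alpha, beta, gamma=0.0):
    """Loss predicted for training-set size m: alpha * m**beta + gamma."""
    return alpha * m ** beta + gamma

def local_log_slope(m, alpha, beta, gamma=0.0, factor=2.0):
    """Slope of log(loss) against log(m), measured between m and factor * m."""
    lo = power_law_loss(m, alpha, beta, gamma)
    hi = power_law_loss(factor * m, alpha, beta, gamma)
    return (math.log(hi) - math.log(lo)) / math.log(factor)

# Hypothetical constants, chosen only for illustration.
ALPHA, BETA, GAMMA = 10.0, -0.3, 0.5

sizes = [1e3, 1e5, 1e7, 1e9]
# Without the constant, the log-log slope equals BETA at every size.
pure_slopes = [local_log_slope(m, ALPHA, BETA) for m in sizes]
# With the constant, the slope shrinks toward zero as m grows.
offset_slopes = [local_log_slope(m, ALPHA, BETA, GAMMA) for m in sizes]

for m, p, o in zip(sizes, pure_slopes, offset_slopes):
    print(f"m={m:.0e}  pure slope={p:.3f}  slope with constant={o:.3f}")
```

Fitting without the constant would force a single straight line through points that bend, which would bias the estimated exponent.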

Masked autoencoders are scalable vision learners

Posted on 2022-02-11 at 14:18:30 UTC-0500

This post was created as an assignment in Irina Rish’s neural scaling laws course (IFT6167) in winter 2022. The post contains no summarization, only questions and thoughts. In this paper they mention that the mask vector is learned, and it sounds like the positional embeddings are also learned. I remember in Attention is all you need they found that sinusoidal positional embeddings performed about as well as learned ones, and they picked the sinusoidal version because it might extrapolate better to longer sequences.
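For reference, the fixed (non-learned) positional encoding from Attention is all you need can be sketched as below. This is a plain-Python illustration of that paper's sine/cosine formula, not code from the masked autoencoder paper. The function name and the example sizes are my own.

```python
import math

def sinusoidal_position_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of fixed positional encodings.

    Even dimensions hold sin(pos / 10000**(i / d_model)) and the following
    odd dimension holds the cosine of the same angle, where i is the even
    dimension index.
    """
    if d_model % 2 != 0:
        raise ValueError("d_model must be even")
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))
            row.append(math.cos(angle))
        table.append(row)
    return table

# Hypothetical sizes, for illustration only.
encoding = sinusoidal_position_encoding(num_positions=50, d_model=8)
print(encoding[0])  # position 0: sines are 0.0, cosines are 1.0
```

Because the table is a closed-form function of the position, it can be computed for positions never seen in training, whereas a learned embedding table only has rows for the positions it was trained on.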

Data scaling laws in NMT: the effect of noise and architecture

Posted on 2022-02-09 at 20:47:59 UTC-0500

This paper is all about trying a bunch of different changes to the training setup to see what affects the power law exponent over dataset size. Here are some of the answers: encoder-decoder size asymmetry: exponent not affected, but effective model capacity affected; architecture (LSTM vs. Transformer): exponent not affected, but effective model capacity affected; dataset quality (filtered vs. not): exponent and effective model capacity not affected, losses on smaller datasets affected; dataset source (ParaCrawl vs.

Parallel training of deep networks with local updates

Posted on 2022-02-09 at 10:50:21 UTC-0500

This post was created as an assignment in Irina Rish’s neural scaling laws course (IFT6167) in winter 2022. The post contains no summarization, only questions and thoughts. Once I learned how the loss functions worked for each chunk, my first question was whether the earlier chunks were going to be able to learn the low-level features that later chunks would need. Figure 7 seems to show that they do, although their quality apparently decreases with increasingly local updates.

A sensitivity analysis of (and practitioners’ guide to) convolutional neural networks for sentence classification

Posted on 2022-02-02 at 15:35:00 UTC-0500

This post was created as an assignment in Bang Liu’s IFT6289 course in winter 2022. The structure of the post follows the structure of the assignment: summarization followed by my own comments.

paper summarization

Word embeddings have gotten so good that state-of-the-art sentence classification can often be achieved with just a one-layer convolutional network on top of those embeddings. This paper dials in on the specifics of training that convolutional layer for this downstream sentence classification task.