papers

These are my notes from research papers I read. Each page’s title is also a link to the abstract or PDF.

Why does unsupervised pre-training help deep learning?

Posted on 2020-08-24 at 11:40:00 UTC-0600

The authors are fairly confident that unsupervised pre-training acts as a regularizer by starting off the supervised training in a good spot, rather than by somehow improving the optimization path.

The consciousness prior

Posted on 2020-08-14 at 09:05:56 UTC-0700

System 1 cognitive abilities are about low-level perception and intuitive knowledge. System 2 cognitive abilities can be described verbally, and include things like reasoning, planning, and imagination. In cognitive neuroscience, the “Global Workspace Theory” says that at each moment specific pieces of information become a part of working memory and become globally available to other unconscious computational processes. Relative to the unconscious state, the conscious state is low-dimensional, focusing on a few things.

Troubling trends in machine learning scholarship

Posted on 2020-08-13 at 10:36:05 UTC-0700

The authors discuss four trends in AI research that have negative consequences for the community.

problems

explanation vs. speculation

It’s important to allow researchers to include speculation, because speculation is what allows ideas to form. But the paper has to carefully couch speculation inside a “Motivations” section or other verbiage to ensure the reader understands its place. It’s extremely important to define concepts before using them. Terms like internal covariate shift or coverage sound like definitions without actually being such.

Attention is all you need

Posted on 2020-08-05 at 12:37:42 UTC-0700

I also referred to this implementation to understand some of the details. This is the paper describing the Transformer, a sequence-to-sequence model based entirely on attention. I think it’s best described with pictures.

model overview

From this picture, I think the following things need explaining:

embeddings: these are learned embeddings that convert the input and output tokens to vectors of the model dimension. In this paper, they actually used the same weight matrix for input embedding, output embedding, and the final linear layer before the final softmax.
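To make the weight sharing concrete, here is a minimal sketch in plain Python. It is not the paper's code: the helper names (`make_shared_matrix`, `embed`, `output_logits`), the tiny sizes, and the random initialization are all my own assumptions, and it assumes the source and target share one vocabulary. The scaling by the square root of the model dimension is from the paper.

```python
import math
import random

def make_shared_matrix(vocab_size, d_model, seed=0):
    """Build one vocab_size x d_model weight matrix (hypothetical helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d_model)]
            for _ in range(vocab_size)]

def embed(shared, token_ids):
    """Look up token vectors, scaled by sqrt(d_model) as in the paper."""
    scale = math.sqrt(len(shared[0]))
    return [[scale * w for w in shared[t]] for t in token_ids]

def output_logits(shared, hidden):
    """Pre-softmax linear layer that reuses the embedding matrix (transposed)."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in shared]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Assumed toy sizes: 5 tokens in the vocabulary, model dimension 4.
shared = make_shared_matrix(vocab_size=5, d_model=4)
src_vectors = embed(shared, [0, 3])  # input embedding
tgt_vectors = embed(shared, [1, 2])  # output embedding, same weights
# A real decoder would transform tgt_vectors first; here the last vector
# stands in for the decoder output fed to the final linear layer.
probs = softmax(output_logits(shared, tgt_vectors[-1]))
```

The point is just that all three places read from the single `shared` matrix, so there is one set of parameters to learn instead of three.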

BERT: pre-training of deep bidirectional transformers for language understanding

Posted on 2020-08-04 at 08:57:44 UTC-0700

The B is for bidirectional, and that’s a big deal. It makes it possible to do well on both sentence-level tasks (NLI, question answering) and token-level tasks (NER, POS tagging). In a unidirectional model, the word “bank” in a sentence like “I made a bank deposit” has only “I made a” as its context, which keeps useful information (here, the word “deposit”) from the model. Another cool thing is masked language model (MLM) training. They train the model by blanking certain words in the sentence and asking the model to guess the missing word.
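Here is a rough sketch of how the blanking step could look. The function name `mask_tokens` and the whitespace tokenization are my own assumptions. The 15% default masking rate comes from the BERT paper rather than from these notes, and the sketch leaves out the paper's extra step of sometimes swapping in a random token or keeping the original word instead of `[MASK]`.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Blank out random tokens and record what the model should guess.

    Returns (masked, labels). labels[i] holds the original token where
    position i was masked, and None everywhere else.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)   # training target for this position
        else:
            masked.append(tok)
            labels.append(None)  # position is not scored
    return masked, labels

# Toy example with a high masking rate so something is likely to be masked.
tokens = "I made a bank deposit".split()
masked, labels = mask_tokens(tokens, mask_prob=0.5, seed=0)
```

Because a masked position can sit anywhere in the sentence, the model has to use the words on both sides to guess it, which is what makes the bidirectional context trainable.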