papers

These are my notes from research papers I read. Each page’s title is also a link to the abstract or PDF.

Compositional generalization by factorizing alignment and translation

They used a biRNN with attention to encode the alignment, and a single linear function of each one-hot encoded word to encode that word’s meaning. Their reasoning was that, by separating the alignment from the meaning of individual words, the model could more easily generalize to unseen words.
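
To make the factorization concrete, here is a minimal PyTorch sketch of the idea as I read it (layer names and dimensions are my own, and the decoder is collapsed to a single output step): a biRNN-with-attention path decides where to attend, while the thing being attended over is only a linear encoding of each one-hot word.

```python
# Minimal sketch, not the authors' code: alignment and word meaning are
# computed by separate modules and only combined through attention weights.
import torch
import torch.nn as nn

class FactorizedModel(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        # Alignment path: a biRNN over the source decides *where* to attend.
        self.align_embed = nn.Embedding(vocab_size, emb_dim)
        self.align_rnn = nn.GRU(emb_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.align_score = nn.Linear(2 * hidden_dim, 1)
        # Translation path: a single linear map of the one-hot word decides *what* it means.
        # (An Embedding is exactly a bias-free linear layer applied to a one-hot vector.)
        self.word_meaning = nn.Embedding(vocab_size, emb_dim)
        self.out = nn.Linear(emb_dim, vocab_size)

    def forward(self, src):  # src: (batch, src_len) token ids
        align_states, _ = self.align_rnn(self.align_embed(src))      # (batch, src_len, 2*hidden)
        attn = torch.softmax(self.align_score(align_states), dim=1)  # attention over source positions
        meanings = self.word_meaning(src)                             # (batch, src_len, emb_dim)
        context = (attn * meanings).sum(dim=1)                        # alignment-weighted word meanings
        return self.out(context)                                      # logits for one output token
```

Because the alignment path never sees the word-meaning parameters, swapping in an unseen word only changes the meaning lookup, not the learned alignment behavior.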

Semi-supervised training for automatic speech recognition

This was Manohar’s PhD dissertation at JHU. Chapter 2 provides a relatively clear overview of how chain and non-chain models work in Kaldi. In chapter 3 he tries using negative conditional entropy as the loss function for the unsupervised data, and it helps a bit. In chapter 4 he uses [CTC loss](/paper/ctc/). In chapter 5 he discusses ways to do semi-supervised model training; it’s nice when you have parallel data in different domains, because then you can train a student-teacher model.
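
A simplified, frame-level version of that negative-conditional-entropy objective might look like the sketch below; Manohar’s actual implementation works on lattices inside Kaldi, so this is only meant to show the shape of the loss.

```python
# Frame-level sketch of an entropy-based loss for unlabeled data; the
# dissertation computes conditional entropy over lattices, not raw frames.
import torch

def conditional_entropy_loss(logits):
    """logits: (batch, frames, num_states) raw model outputs on unsupervised data.

    Minimizing this loss (i.e. maximizing negative conditional entropy)
    pushes the model toward confident, low-entropy posteriors.
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)  # H(y | x_t) per frame
    return entropy.mean()
```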

Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks

RNNs generally require pre-segmented training data, but CTC avoids that need. Basically, the RNN outputs a probability for each label (or a blank) at every frame, and then you find the most likely path through that lattice of probabilities. The section explaining the loss function was kind of complicated: they use a forward-backward algorithm (similar to Viterbi) to get, for each symbol at each time step, the total probability of all paths corresponding to the target that pass through it, and then differentiate that to get the derivatives with respect to the network outputs.
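
Here is a rough NumPy sketch of that forward pass, computing the total probability of all paths that collapse to a given label sequence (blank insertion and the skip rule follow the paper; the variable names are mine, and the real implementation works in log space for stability).

```python
# Sketch of the CTC forward ("alpha") recursion, assuming per-frame
# softmax outputs are already given. Not numerically stable as written.
import numpy as np

def ctc_forward(probs, labels, blank=0):
    """probs: (T, num_labels) per-frame label probabilities.
    labels: target label ids without blanks.
    Returns the total probability of all alignments that collapse to `labels`."""
    # Insert blanks between labels and at both ends (l' in the paper).
    ext = [blank]
    for l in labels:
        ext += [l, blank]
    S, T = len(ext), probs.shape[0]

    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, blank]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s] + (alpha[t - 1, s - 1] if s > 0 else 0.0)
            # A skip over the previous blank is allowed when the label changes.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, ext[s]]
    # Valid paths end on the final label or on the trailing blank.
    return alpha[T - 1, S - 1] + (alpha[T - 1, S - 2] if S > 1 else 0.0)
```

Running the same recursion backwards gives the betas, and the product of alphas and betas at each symbol and time is what gets differentiated to train the network.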

Google's multilingual neural machine translation system

They use the word-piece model from “Japanese and Korean Voice Search”, with 32,000 word pieces (far fewer than the 200,000 used in that paper). They state in the paper that the shared word-piece model is very similar to Byte-Pair-Encoding, which was used for NMT in a paper by researchers at the University of Edinburgh. The model and training process are exactly as in Google’s earlier NMT paper. It takes three weeks on 100 GPUs to train, even after increasing the batch size and learning rate.
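
Since the shared word-piece model is described as very similar to Byte-Pair-Encoding, here is a toy sketch of one BPE merge step (the actual word-piece algorithm chooses merges by likelihood rather than raw pair frequency, and this is not the paper’s code).

```python
# Toy BPE merge step: find the most frequent adjacent symbol pair and merge it.
# Repeating this until the vocabulary reaches ~32,000 symbols gives a word-piece-like inventory.
from collections import Counter

def bpe_merge_step(corpus):
    """corpus: list of words, each a list of string symbols (initially characters)."""
    pair_counts = Counter()
    for word in corpus:
        for a, b in zip(word, word[1:]):
            pair_counts[(a, b)] += 1
    if not pair_counts:
        return corpus, None
    (a, b), _ = pair_counts.most_common(1)[0]
    merged_corpus = []
    for word in corpus:
        new_word, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
                new_word.append(a + b)  # merge the pair into one symbol
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged_corpus.append(new_word)
    return merged_corpus, a + b
```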
