<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
<channel>
<title>speech on Kyle Roth</title>
<link>https://kylrth.com/tags/speech/</link>
<description>Recent content in speech on Kyle Roth</description>
<generator>Hugo -- gohugo.io</generator>
<language>en-us</language>
<lastBuildDate>Tue, 14 Jul 2020 08:06:00 -0600</lastBuildDate>
<atom:link href="https://kylrth.com/tags/speech/index.xml" rel="self" type="application/rss+xml"/>
<item>
<title>Semi-supervised training for automatic speech recognition</title>
<link>https://kylrth.com/paper/semi-supervised-for-asr/</link>
<pubDate>Tue, 14 Jul 2020 08:06:00 -0600</pubDate>
<guid>https://kylrth.com/paper/semi-supervised-for-asr/</guid>
<description>This was Manohar’s PhD dissertation at JHU. Chapter 2 provides a relatively clear overview of how chain and non-chain models work in Kaldi. In Chapter 3 he tried using negative conditional entropy as the loss function for the unsupervised data, and it helped a bit. In Chapter 4 Manohar uses [CTC loss](/paper/ctc/). In Chapter 5, he discusses ways to do semi-supervised model training. It’s nice when you have parallel data in different domains, because then you can do a student-teacher model.</description>
</item>
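Both objectives mentioned in the note are easy to sketch. Below is a minimal, hypothetical PyTorch version of an entropy-based unsupervised loss and a student-teacher (soft-target) loss; the function names, the temperature T, and the tensor shapes are my own for illustration, not Manohar's Kaldi implementation.

```python
import torch
import torch.nn.functional as F

def entropy_loss(logits):
    """Conditional entropy of the frame-level posteriors.

    Minimizing this term (equivalently, maximizing negative conditional
    entropy) pushes the model toward confident predictions on unlabeled
    frames. logits: (batch, time, num_labels) unnormalized scores.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    # H = -sum_y p(y|x) log p(y|x), averaged over frames
    return -(probs * log_probs).sum(dim=-1).mean()

def student_teacher_loss(student_logits, teacher_logits, T=2.0):
    """KL divergence from the teacher's soft frame-level targets to the student."""
    teacher = F.softmax(teacher_logits / T, dim=-1)
    student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean") * T * T
```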
<item>
<title>Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks</title>
<link>https://kylrth.com/paper/ctc/</link>
<pubDate>Fri, 10 Jul 2020 09:14:59 -0600</pubDate>
<guid>https://kylrth.com/paper/ctc/</guid>
<description>RNNs generally require pre-segmented training data, but CTC avoids that need. Basically, you have the RNN output probabilities for each label (or a blank) for every frame, and then you find the most likely path across that lattice of probabilities. The section explaining the loss function was kind of complicated. They used a forward-backward algorithm (sort of like Viterbi) to get the probability of all paths corresponding to the output that go through each symbol at each time, and then they differentiated that to get the derivatives with respect to the outputs.</description>
</item>
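As a concrete illustration (not the paper's original code), PyTorch ships a CTC loss whose gradients with respect to the per-frame outputs are computed by exactly this kind of forward-backward recursion. The shapes and label count below are made up for the example.

```python
import torch

# Hypothetical sizes: 50 frames, batch of 2, 20 labels plus the blank at index 0.
log_probs = torch.randn(50, 2, 21, requires_grad=True).log_softmax(-1)  # (T, N, C) frame posteriors
targets = torch.randint(1, 21, (2, 12))                 # unsegmented label sequences, no frame alignment
input_lengths = torch.full((2,), 50, dtype=torch.long)
target_lengths = torch.full((2,), 12, dtype=torch.long)

ctc = torch.nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # derivatives w.r.t. the per-frame outputs, via forward-backward
```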
<item>
<title>Universal phone recognition with a multilingual allophone system</title>
<link>https://kylrth.com/paper/universal-phone-recognition/</link>
<pubDate>Tue, 23 Jun 2020 08:33:48 -0600</pubDate>
<guid>https://kylrth.com/paper/universal-phone-recognition/</guid>
<description>These guys made sure to model allophones. They had an encoder that produced probabilities over a universal phone set, followed by language-specific decoders. This meant they could use data from various languages to train the system. Each decoder has an allophone layer, followed by other dense trainable layers. The allophone layer is a single trainable dense layer, but it was initialized by a bunch of linguists who sat down and described the set of phones belonging to each phoneme in each language present in the training set.</description>
</item>
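To make the allophone layer concrete, here is a hedged sketch of how a dense layer could be initialized from a linguist-written phone-to-phoneme table; the sizes and the example mapping are invented for illustration, not taken from the paper.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 40 universal phones, 30 phonemes for one language.
NUM_PHONES, NUM_PHONEMES = 40, 30

# allophone_map[i, j] = 1 if universal phone j is an allophone of phoneme i
# in this language (entries filled in from the linguists' descriptions).
allophone_map = torch.zeros(NUM_PHONEMES, NUM_PHONES)
allophone_map[0, [3, 7]] = 1.0  # e.g. phoneme 0 surfaces as phone 3 or phone 7

# A dense layer whose weights start at the linguist-specified mapping
# but remain trainable, matching the description in the note above.
allophone_layer = nn.Linear(NUM_PHONES, NUM_PHONEMES, bias=False)
allophone_layer.weight.data.copy_(allophone_map)

phone_logits = torch.randn(1, 100, NUM_PHONES)   # encoder output over universal phones
phoneme_logits = allophone_layer(phone_logits)   # language-specific phoneme scores
```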
</channel>
</rss>