<!-- toc -->
# Speech Recognition

broader speech processing

- wake-up word detection: binary classification
    - small window → low latency
    - multiple models, threshold & size
- echo cancellation: subtract (complex) echo of sound from speaker
- sound source localization: microphone array, delay

online/streaming mode: continuous conversion

omnidirectional/cardioid/hypercardioid microphone

sound sample rate

- need at least 2× the highest frequency wanted, else aliasing
    (Nyquist theorem; sketch after this list)
- 44.1 kHz for CD
- 16 kHz good enough for speech
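
A quick NumPy sketch of why sampling below the Nyquist rate fails; the 5 kHz/8 kHz numbers are only illustrative, not from the notes:

```python
# Aliasing sketch: a 5 kHz tone sampled at 8 kHz (below 2 * 5 kHz)
# produces exactly the same samples as a 3 kHz tone.
import numpy as np

sr = 8000                                # sample rate, violates Nyquist for 5 kHz
t = np.arange(16) / sr
tone_5k = np.cos(2 * np.pi * 5000 * t)
alias_3k = np.cos(2 * np.pi * 3000 * t)  # 8000 - 5000 = 3000 Hz alias
assert np.allclose(tone_5k, alias_3k)    # indistinguishable after sampling
```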

$\mu$-law curve

audio format: use PCM `.wav`

online endpointing modes: push-to-talk, hit-to-talk, continuous listening
## feature extraction

spectrogram: energy distribution over frequency vs time

the pipeline (sketched in code after this list):

1. preemphasize speech signal: boost high frequency

    $$
    s_{preemp}(n) = s(n) - \alpha s(n-1),\quad \alpha=0.95
    $$

1. spectrogram: from time series to frequency domain
    - discrete Fourier transform (DFT)
        - symmetric, only first half + midpoint meaningful (by Nyquist theorem)
        - magnitude: $0\sim\frac{S_R}{2}$ Hz, frequency $\frac{i}{M}S_R$ at point $i$
            - $S_R$: sample rate, $M$: number of samples
        - power: magnitude squared
    - flat noise from the jump at sample-window boundaries
        - solution: multiply input by a bell-shaped *windowing function*,
            and half-overlap windows to avoid losing information
    - before using fast Fourier transform (FFT): zero padding
        - causes fake interpolated detail
1. auditory perception: from frequency to Bark
    - mimic human ear, which distinguishes low frequencies better
    - frequency warping with Mel curve
    - filter bank: triangular filters on Mel curve to be multiplied
        - evenly spaced, half-overlap; 40 enough, 80 good
        - translated back to equal-area triangles on the frequency axis
            - not directly on Mel curve, for simplicity
1. log Mel spectrum: log of the integration of each filter bank
1. Mel cepstrum: discrete cosine transform (DCT) compression
    - reduces redundancy from filter bank overlap
    - subtract the mean to remove microphone/noise difference
        - microphone response is convolved with the speech
        - convolution -DFT→ multiplication -log→ summation
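
A minimal NumPy/SciPy sketch of this pipeline; the 512-point FFT, 10 ms hop at 16 kHz, Hamming window, HTK-style Mel formula, and 13 coefficients are illustrative choices, not values from the notes:

```python
# Sketch of the feature-extraction pipeline above; frame/filter sizes,
# the Mel formula, and normalization details are illustrative assumptions.
import numpy as np
from scipy.fft import dct

def mfcc_like(signal, sample_rate=16000, n_fft=512, hop=160,
              n_mels=40, n_ceps=13, alpha=0.95):
    # 1. preemphasis: boost high frequencies
    s = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # 2. windowed power spectrum; rfft keeps the first half + midpoint
    frames = np.lib.stride_tricks.sliding_window_view(s, n_fft)[::hop]
    power = np.abs(np.fft.rfft(frames * np.hamming(n_fft), n_fft)) ** 2
    # 3. triangular Mel filter bank, translated back to the frequency axis
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    inv_mel = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = inv_mel(np.linspace(0, mel(sample_rate / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sample_rate).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for j in range(n_mels):
        l, c, r = bins[j], bins[j + 1], bins[j + 2]
        fbank[j, l:c] = np.linspace(0, 1, c - l, endpoint=False)
        fbank[j, c:r] = np.linspace(1, 0, r - c, endpoint=False)
    # 4. log Mel spectrum: log of the energy under each filter
    log_mel = np.log(power @ fbank.T + 1e-10)
    # 5. DCT compression; subtract the mean to cancel the channel
    ceps = dct(log_mel, axis=1, norm="ortho")[:, :n_ceps]
    return ceps - ceps.mean(axis=0)
```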

longest common subsequence: dynamic programming, dummy char padding, streaming,
search trellis, sub-trellis, lexical tree

- pruning: hard threshold vs beam search

DTW (dynamic time warping): disallow skipping (vertical moves on trellis),
allow non-linear mapping (super diagonal); see the sketch after this list

- $P_{i,j}$: best path cost from origin to $(i,j)$
- $C_{i,j}$: local node cost (vector distance)
- global beam across templates for beam search
- combine multiple templates
    - template averaging: first align templates by DTW
    - template segmentation: compress chunks of the same phoneme
        - actual method: iterate from uniform segments to stabilize variance
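
A minimal DTW sketch matching the trellis above, assuming Euclidean local cost $C_{i,j}$ and the usual speech moves (time always advances; the template index stays, steps by one, or skips one, so there are no vertical moves):

```python
# DTW sketch: P[i, j] = best path cost from the origin to (i, j);
# the Euclidean local cost and these exact moves are assumptions.
import numpy as np

def dtw(template, query):
    n, m = len(template), len(query)
    P = np.full((n + 1, m + 1), np.inf)
    P[0, 0] = 0.0
    for j in range(1, m + 1):            # time always advances (no skipping)
        for i in range(1, n + 1):
            C = np.linalg.norm(template[i - 1] - query[j - 1])  # local cost
            skip = P[i - 2, j - 1] if i >= 2 else np.inf        # super diagonal
            P[i, j] = C + min(P[i, j - 1], P[i - 1, j - 1], skip)
    return P[n, m]
```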

Mahalanobis distance

$$
d(x,m_j) = (x-m_j)^TC_j^{-1}(x-m_j)
$$

- can be estimated with the negative Gaussian log likelihood

$$
d(x,m_j) = \frac{1}{2}\log\left(
(2\pi)^D|C_j|
\right)+\frac{1}{2}(x-m_j)^TC_j^{-1}(x-m_j)
$$

covariance

$$
\Sigma = \begin{pmatrix}
\sigma_{11} & \sigma_{12} & \cdots & \sigma_{1d}\\
\sigma_{21} & \sigma_{22} & \cdots & \sigma_{2d}\\
\vdots & \vdots & \ddots & \vdots\\
\sigma_{d1} & \sigma_{d2} & \cdots & \sigma_{dd}\\
\end{pmatrix}
$$

- estimated with a diagonal covariance matrix when elements are largely
    uncorrelated (sketch below)
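
A small NumPy sketch of both distances; the diagonal-covariance shortcut in the second function matches the last bullet, and all shapes are assumed:

```python
# Sketch: Mahalanobis distance and Gaussian negative log likelihood.
import numpy as np

def mahalanobis(x, m, C):
    d = x - m
    return d @ np.linalg.inv(C) @ d

def gaussian_nll(x, m, var_diag):
    """0.5 * [D log(2 pi) + log|C| + (x-m)^T C^-1 (x-m)] for diagonal C."""
    d = x - m
    return 0.5 * (x.size * np.log(2 * np.pi)
                  + np.log(var_diag).sum() + (d * d / var_diag).sum())
```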

self-transition penalty: model phoneme duration

MFCC (Mel frequency cepstrum coefficient)

dynamic time warping (DTW)

- align multiple training samples
    1. segment all model samples uniformly by #phonemes, and average
    1. align each sample against the model sample to get new segments
        and a new average
    1. iterate until convergence
- transition probability: #segment switches over #frames in segment

hidden Markov model for DTW

- use log probability so total score is log total probability
- simulation by transition probability matrix $T$ (each row sums to 1)
- initial probability $\pi$

expectation-maximization (EM) algorithm

1. initialize with k-means clustering
1. auxiliary function:
    conditional expectation of the complete-data log likelihood

    $$
    Q(\Theta,\Theta^{(t)}) = \sum_y p(y|X,\Theta^{(t)})\log p(X,y|\Theta)
    $$

1. evaluate $\Theta^{(t+1)}$

    $$
    \Theta^{(t+1)} = \arg\max_\Theta Q(\Theta,\Theta^{(t)})
    $$

1. iterate until convergence (GMM sketch after this list)
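
A compact sketch of EM for a 1-D Gaussian mixture: the E step computes the responsibilities inside $Q$, the M step maximizes it. The crude random initialization stands in for k-means, and all sizes are illustrative:

```python
# EM sketch for a 1-D Gaussian mixture.
import numpy as np

def em_gmm(x, k=2, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    mu = np.sort(rng.choice(x, k, replace=False))  # stand-in for k-means init
    var = np.full(k, x.var())
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E step: responsibilities p(y | X, Theta^(t))
        lik = (w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var)
               / np.sqrt(2 * np.pi * var))
        r = lik / lik.sum(axis=1, keepdims=True)
        # M step: Theta^(t+1) = argmax_Theta Q(Theta, Theta^(t))
        nk = r.sum(axis=0)
        w, mu = nk / len(x), (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var
```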

forward algorithm (sketch below)

1. initialize $\alpha(s,1)=\pi_s P(o_1|s)$
1. iterate $\alpha(s,t+1)=\sum_{s'}\alpha(s',t)P(s|s')P(o_{t+1}|s)$
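
A direct NumPy transcription of the two steps, assuming a row-stochastic transition matrix with $T[s',s]=P(s|s')$ and an emission matrix $B[s,o]=P(o|s)$ (these names are illustrative, not from the notes):

```python
# Forward algorithm sketch; obs is a sequence of observation indices.
import numpy as np

def forward(pi, T, B, obs):
    alpha = pi * B[:, obs[0]]    # alpha(s, 1) = pi_s * P(o_1 | s)
    for o in obs[1:]:
        # alpha(s, t+1) = sum_{s'} alpha(s', t) P(s | s') P(o_{t+1} | s)
        alpha = (alpha @ T) * B[:, o]
    return alpha.sum()           # total observation probability
```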

Baum-Welch: soft state alignment

log-domain math (base-2 logarithms; log-add sketch below)

$$
\log(x^y)=\log(x)\,2^{\log(y)}\\
\log(x+y)=\log(x)+\log\left(1+2^{\log(y)-\log(x)}\right)
$$
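
A sketch of the base-2 log-add above; keeping the larger term outside the exponent is a standard trick to avoid underflow when summing many small probabilities:

```python
# Log-domain addition in base 2, matching the formula above.
import numpy as np

def log2_add(lx, ly):
    """Return log2(x + y) given lx = log2(x), ly = log2(y)."""
    if lx < ly:                  # keep the larger term outside for stability
        lx, ly = ly, lx
    return lx + np.log2(1 + 2 ** (ly - lx))
```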

continuous text recognition: small-scale problem, e.g. voice commands

- wrap back from end of template to a start dummy variable
    - high wrap-back cost → discourage space
- lextree
- non-emitting state/null state: only for connecting, no self-transition
    - no transition time
- prior probability for word: add to the start of its HMM
- word transition probability for edge cost between word HMMs
- approximate sentence probability with best path

grammar: only focus on syntax, not semantics

- finite-state grammar (FSG)
- context-free grammar (CFG)
- backpointer: only word-level, additional script on word transition

training with continuous speech: bootstrap & iterate

- silence: silence model, $\varepsilon$ bypass arc, self-loop non-emitting state
    - loop needs to go through an emitting state, else infinite
N-gram

- N-gram assumption: $P(w_k|w_1,\dots,w_{k-1})=P(w_k|w_{k-(N-1)},\dots,w_{k-1})$
- start/end of sentence: `<s>`, `</s>`
- unigram, bigram; 5-gram good enough for commercial use
- N-gram with $D$ words needs $\sum_{i=1}^{N-1}D^i$ transition nodes
- Good-Turing smoothing
- Zipf's law
- backoff (bigram sketch after this list)
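
A toy bigram model with an (unnormalized) backoff to the unigram, as a sketch; the fixed discount is an illustrative stand-in for Good-Turing, not the real method:

```python
# Bigram counts with <s>/</s> markers and a crude backoff sketch.
from collections import Counter

def train_bigram(sentences):
    uni, bi = Counter(), Counter()
    for words in sentences:
        words = ["<s>"] + words + ["</s>"]
        uni.update(words)
        bi.update(zip(words, words[1:]))
    return uni, bi

def p_backoff(w, prev, uni, bi, discount=0.5):
    if uni[prev] and bi[(prev, w)] > 0:
        return (bi[(prev, w)] - discount) / uni[prev]  # discounted bigram
    return uni[w] / sum(uni.values())  # back off to unigram (unnormalized)
```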

state-of-the-art system

- unit of sound by clustering a large unlabelled dataset
- transfer learning with a small labeled dataset

phoneme: 39 in English; Mandarin also has tones

- few rare phonemes, much more common than rare words → beats Zipf's law
- defined by linguistics, or learned by clustering
- mono-phone/tri-phone: context-independent/dependent (on neighbors)
- absorbing/generating state → non-emitting
- locus: stable centers in spectrum

multiple pronunciations: multiple internal models + probability

mono-phone/context-independent (CI) model

di-phone: model previous and current phoneme; problem: cross-word effect

tri-phone: model multiple current phonemes based on previous and next phoneme

- many cases not seen, back off to mono-phone
- share Gaussians with mono-phone, different weights
- decision tree

inexact search: run n-gram on an (n-1)-gram model by applying word-transition
probability