all previous class notes;fix pandoc image dupe
SichangHe committed Mar 28, 2024
1 parent 6e76194 commit 620cf93
Showing 7 changed files with 1,105 additions and 2 deletions.
6 changes: 5 additions & 1 deletion src/SUMMARY.md
@@ -82,7 +82,11 @@
- [Xcode](notes/text_editor/xcode.md)
- [Unstructured Class Notes](notes/class_notes/index.md)
- [Algorithms and Databases](notes/class_notes/cs301.md)
- [Artificial Intelligence](notes/class_notes/cs402.md)
- [Cloud Computing](notes/class_notes/cs401.md)
- [Introduction to Databases](notes/class_notes/cs310.md)
- [Introduction to Programming and Data Structures](notes/class_notes/cs201.md)
- [Probability, Random Variables, and Stochastic Processes](notes/class_notes/stats210.md)
- [Probability and Statistics](notes/class_notes/math205.md)
- [Introduction to Programming and Data Structures](notes/class_notes/cs201.md)
- [Linear Algebra](notes/class_notes/math202.md)
- [Speech Recognition](notes/class_notes/cs304.md)
214 changes: 214 additions & 0 deletions src/notes/class_notes/cs304.md
@@ -0,0 +1,214 @@
<!-- toc -->
# Speech Recognition

broader speech processing

- wake-up word detection: binary classification
- small window → low latency
- multiple models, threshold & size
- echo cancellation: subtract (complex) echo of sound from speaker
- sound source localization: microphone array, delay

online/streaming mode: continuous conversion

omnidirectional/cardioid/hypercardioid microphone

sound sample rate

- need at least 2× the highest frequency wanted, else aliasing (Nyquist theorem)
- 44.1kHz for CD
- 16kHz good enough for speech

$\mu$-law curve (companding)

audio format: use PCM `.wav`

online endpointing format: push to talk, hit to talk, continuous listening

## feature extraction

spectrogram: energy distribution over frequency vs time

1. preemphasize speech signal: boost high frequency

$$
s_{preemp}(n) = s(n) - \alpha s(n-1)\\
\alpha=0.95
$$

1. spectrogram: from time series to frequency domain
- discrete Fourier transform (DFT)
- symmetric, only first half + midpoint meaningful (by Nyquist theorem)
- magnitude: $0\sim\frac{S_R}{2}$Hz, frequency $\frac{i}{M}S_R$ at point $i$
- $S_R$: sample rate, $M$: number of sample
- power: magnitude squared
- flat noise (spectral leakage) from the jump at sample window edges
- solution: multiply input by a bell-shaped *windowing function*
and half-overlap windows to avoid losing information
- before using fast Fourier transform (FFT): zero padding
- causes fake interpolated detail
1. auditory perception: from frequency to Bark
- mimic the human ear, which distinguishes low frequencies better
- frequency warping with Mel curve
- filter bank: triangular filters on the Mel curve, multiplied with the power spectrum
- evenly-spaced, half-overlap, 40 enough, 80 good
- translate back to equal-area triangles on frequency axis
- not directly on Mel curve for simplicity
1. log Mel spectrum: log of integration of each filter bank
1. Mel cepstrum discrete cosine transform (DCT) compression
- reduce redundancy from filter bank overlap
- subtract the mean to remove microphone/noise difference
- the microphone response is convolved with the speech signal
- convolution → multiplication after DFT → summation after log
(see the sketch after this list)
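
A minimal numpy sketch of the pipeline above; the helper name `mfcc`, the frame/hop sizes, and keeping 13 coefficients are illustrative assumptions, not values from the class.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sample_rate=16000, frame_len=400, hop=160,
         n_fft=512, n_filters=40, n_ceps=13, alpha=0.95):
    """signal: 1-D numpy array of samples; returns mean-subtracted cepstra."""
    # 1. preemphasis: boost high frequencies
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # 2. framing into half-overlapping windows, multiplied by a Hamming window
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)
    # power spectrum: only the first half + midpoint of the DFT is kept
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 3. triangular Mel filter bank, evenly spaced on the Mel scale
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[i, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    # 4. log Mel spectrum
    log_mel = np.log(power @ fbank.T + 1e-10)
    # 5. DCT compression; subtract the mean to remove microphone/noise difference
    ceps = dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_ceps]
    return ceps - ceps.mean(axis=0)
```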

longest common subsequence: dynamic programming, dummy char padding, streaming,
search trellis, sub-trellis, lexical tree

- pruning: hard threshold vs beam search

DTW: dynamic time warping: disallow skipping (vertical on trellis),
non-linear mapping (super diagonal)

- $P_{i,j}$: best path cost from origin to $(i,j)$
- $C_{i,j}$: local node cost (vector distance)
- global beam across templates for beam search
- combine multiple templates
- template averaging: first align templates by DTW
- template segmentation: compress chunks of same phoneme
- actual method: iterate from uniform segments to stabilize variance
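
A minimal sketch of the $P_{i,j}$ recursion over the search trellis; the specific move set (advance one input frame and 0, 1, or 2 template frames) is an assumption, not necessarily the course's exact transition rules.

```python
import numpy as np

def dtw_cost(input_feats, template_feats):
    """Best-path cost P[i, j] from the origin to node (i, j)."""
    n, m = len(input_feats), len(template_feats)
    P = np.full((n, m), np.inf)
    for i in range(n):
        for j in range(m):
            # local node cost C[i, j]: vector distance between feature frames
            c = np.linalg.norm(input_feats[i] - template_feats[j])
            if i == 0 and j == 0:
                P[i, j] = c
            elif i == 0:
                continue  # the path must start at the template origin
            else:
                # predecessors: same, previous, or two-back template frame
                prev = min(P[i - 1, j],
                           P[i - 1, j - 1] if j >= 1 else np.inf,
                           P[i - 1, j - 2] if j >= 2 else np.inf)
                P[i, j] = c + prev
    return P[-1, -1]  # cost of aligning the whole input against the template
```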

Mahalanobis distance

$$
d(x,m_j) = (x-m_j)^TC_j^{-1}(x-m_j)
$$

- can be estimated with negative Gaussian log likelihood

$$
d(x,m_j) = \frac{1}{2}\log \left(
(2\pi)^D|C_j|
\right)+\frac{1}{2}(x-m_j)^TC_j^{-1}(x-m_j)
$$
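
A small numpy transcription of this distance; `slogdet` is used only for numerical stability.

```python
import numpy as np

def gaussian_neg_log_likelihood(x, mean, cov):
    """Negative Gaussian log likelihood of x; the quadratic term is the
    squared Mahalanobis distance from the mean."""
    diff = x - mean
    dim = len(x)
    _, log_det = np.linalg.slogdet(cov)          # log |C_j|
    maha = diff @ np.linalg.inv(cov) @ diff      # Mahalanobis term
    return 0.5 * (dim * np.log(2 * np.pi) + log_det + maha)
```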

covariance

$$
\Sigma = \begin{pmatrix}
\sigma_{11} & \sigma_{12} & \cdots & \sigma_{1d}\\
\sigma_{21} & \sigma_{22} & \cdots & \sigma_{2d}\\
\vdots & \vdots & \ddots & \vdots\\
\sigma_{d1} & \sigma_{d2} & \cdots & \sigma_{dd}\\
\end{pmatrix}
$$

- estimated with diagonal covariance matrix when elements largely uncorrelated

self-transition penalty: model phoneme duration

MFCC (Mel frequency cepstrum coefficient)

dynamic time warping (DTW)

- align multiple training samples
1. segment all model samples uniformly into #phoneme segments, and average
1. align each sample against the model to get new segments and a new average
1. iterate until convergence
- transition probability: #segment switches over #frames in the segment

hidden Markov model for DTW

- use log probability so total score is log total probability
- simulation by transition probability matrix $T$ (each row sums to 1)
- initial probability $\pi$

expectation-maximization (EM) algorithm

1. initialize with k-means clustering
1. auxiliary function:
conditional expectation of the complete data log likelihood

$$
Q(\Theta,\Theta^{(t)}) = \sum_yp(y|X,\Theta^t)\log(p(X,y|\Theta))
$$

1. evaluate $\Theta^{t+1}$

$$
\Theta^{t+1} = \argmax_\Theta Q(\Theta,\Theta^t)
$$

1. iterate until convergence
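
The notes apply EM to HMM parameters; as a hedged illustration of the same E-step/M-step pattern, here is one EM iteration for a 1-D Gaussian mixture (the mixture setting is chosen only for brevity, not the course's HMM use).

```python
import numpy as np

def em_step_gmm(x, weights, means, variances):
    """One EM iteration for a 1-D Gaussian mixture with K components."""
    x = x[:, None]                                     # shape (N, 1)
    # E step: posterior responsibility p(component | x_i, current params)
    lik = (weights * np.exp(-0.5 * (x - means) ** 2 / variances)
           / np.sqrt(2 * np.pi * variances))           # shape (N, K)
    resp = lik / lik.sum(axis=1, keepdims=True)
    # M step: maximize Q by re-estimating weights, means, variances
    n_k = resp.sum(axis=0)
    new_weights = n_k / len(x)
    new_means = (resp * x).sum(axis=0) / n_k
    new_vars = (resp * (x - new_means) ** 2).sum(axis=0) / n_k
    return new_weights, new_means, new_vars
```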

forward algorithm

1. initialize $\alpha(s,1)=\pi_s P(o_1|s)$
1. iterate $\alpha(s,t+1)=\sum_{s'}\alpha(s',t)P(s|s')P(o_{t+1}|s)$
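
A direct numpy transcription of this recursion, assuming the emission probabilities $P(o_t|s)$ are precomputed into a matrix.

```python
import numpy as np

def forward(pi, T, emission):
    """alpha[s, t] = P(o_1..o_t, state at time t = s).

    pi[s]: initial probability, T[s_prev, s]: transition probability,
    emission[s, t] = P(o_t | s), assumed precomputed for this sketch.
    """
    n_states, n_frames = emission.shape
    alpha = np.zeros((n_states, n_frames))
    alpha[:, 0] = pi * emission[:, 0]                     # initialization
    for t in range(n_frames - 1):
        # alpha(s, t+1) = sum_{s'} alpha(s', t) P(s|s') P(o_{t+1}|s)
        alpha[:, t + 1] = (alpha[:, t] @ T) * emission[:, t + 1]
    return alpha                                          # P(O) = alpha[:, -1].sum()
```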

Baum-Welch: soft state alignment

log-domain math (base-2 logarithms)

$$
\log(x^y)=\log(x)2^{\log(y)}\\
\log(x+y)=\log(x)+\log(1+2^{\log(y)-\log(x)})
$$
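
A sketch of the log-add operation from the second identity, staying in the log domain so probabilities do not underflow.

```python
import numpy as np

def log_add(log_x, log_y):
    """log2(x + y) from log2(x) and log2(y), without leaving the log domain."""
    if log_x < log_y:                  # keep the larger term as the anchor
        log_x, log_y = log_y, log_x
    return log_x + np.log2(1.0 + 2.0 ** (log_y - log_x))
```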

continuous text recognition: small-scale problem, e.g. voice command

- wrap back from the end of a template to the start dummy node
- high wrap back cost → discourage space
- lextree
- non-emitting state/ null state: only for connecting, no self-transition
- no transition time
- prior probability for word: add to the start of its HMM
- word transition probability for edge cost between word HMM
- approximate sentence probability with best path
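
A minimal log-domain Viterbi sketch of the best-path approximation; word-level backpointers, non-emitting states, and the lextree are omitted for brevity.

```python
import numpy as np

def viterbi(log_pi, log_T, log_emission):
    """Best single state path in the log domain, approximating the
    sentence probability; log_emission[s, t] = log P(o_t | s)."""
    n_states, n_frames = log_emission.shape
    score = log_pi + log_emission[:, 0]
    back = np.zeros((n_states, n_frames), dtype=int)
    for t in range(1, n_frames):
        cand = score[:, None] + log_T              # cand[s_prev, s]
        back[:, t] = cand.argmax(axis=0)           # best predecessor of each s
        score = cand.max(axis=0) + log_emission[:, t]
    path = [int(score.argmax())]                   # backtrace from the best end state
    for t in range(n_frames - 1, 0, -1):
        path.append(int(back[path[-1], t]))
    return float(score.max()), path[::-1]
```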

grammar: only focus on syntax not semantics

- finite-state grammar (FSG)
- context-free grammar (CFG)
- backpointer: only word-level, additional script on word transition

training with continuous speech: bootstrap & iterate

- silence: silence model, $\varepsilon$ bypass arc, self-loop non-emitting state
- loops need to go through an emitting state, else infinite loop

N-gram

- N-gram assumption: $P(w_k|w_1,\dots,w_{k-1})=P(w_k|w_{k-(N-1)},\dots,w_{k-1})$
- start/end of sentence: `<s>`, `</s>`
- unigram, bigram, etc.; 5-gram is good enough for commercial use
- an N-gram model with $D$ words needs $\sum_{i=1}^{N-1}D^i$ transition nodes
- Good-Turing smoothing
- Zipf's law
- backoff
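
A tiny sketch of maximum-likelihood bigram estimation with the `<s>`/`</s>` markers; Good-Turing smoothing and backoff are omitted, so unseen bigrams get probability 0.

```python
from collections import Counter

def bigram_model(sentences):
    """Build P(word | previous word) from tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + list(words) + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))

    def prob(prev, word):
        return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

    return prob

# p = bigram_model([["good", "morning"], ["good", "night"]])
# p("good", "morning")  # -> 0.5
```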

state-of-the-art system

- units of sound learned by clustering a large unlabelled dataset
- transfer learning with small labeled dataset

phonemes: 39 in English; Mandarin adds tones

- few rare phonemes, much more common than rare words → beats Zipf's law
- defined by linguistics, or learned by clustering
- mono-phone/tri-phone: context-independent/dependent (on neighbors)
- absorbing/generating state → non-emitting
- locus: stable centers in spectrum

multiple pronunciation: multiple internal model + probability

mono-phone/ context-independent (CI) model

di-phone: model previous and current phoneme. problem: cross-word effect

tri-phone: model multiple variants of the current phoneme based on the previous and next phonemes

- many cases not seen, back off to mono-phone
- share Gaussian with mono-phone, different weight
- decision tree

inexact search: run N-gram on an (N-1)-gram model by applying word-transition probabilities