Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add the support of mfcc feature for DS2 #168

Merged
merged 4 commits into from
Jul 20, 2017
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 7 additions & 1 deletion deep_speech_2/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -38,7 +38,13 @@ python datasets/librispeech/librispeech.py --help
python compute_mean_std.py
```

`python compute_mean_std.py` computes mean and stdandard deviation for audio features, and save them to a file with a default name `./mean_std.npz`. This file will be used in both training and inferencing.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

python compute_mean_std.py computes --> "It will compute"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

`python compute_mean_std.py` computes mean and stdandard deviation for audio features, and save them to a file with a default name `./mean_std.npz`. This file will be used in both training and inferencing. The default feature of audio data is power spectrum, currently the mfcc feature is also supported. To train and infer based on mfcc feature, you can regenerate this file by
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently the mfcc feature is also supported,changing currently to and should be better? There's no need to tell the user that mfcc is added just now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can regenerate this file by,why regenerate when first running ds2?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. you can regenerate --> please regenerate
    2.spectrum or spectrogram ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


```
python compute_mean_std.py --specgram_type mfcc
```

and specify the ```specgram_type``` to ```mfcc``` in each step, including training, inference etc.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. “in each step, including training, inference etc.” --》 “ when running train.py, infer.py, evaluator.py or tune.py"
  2. specgram_type to mfcc --> --specgram_type mfcc

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


More help for arguments:

Expand Down
8 changes: 7 additions & 1 deletion deep_speech_2/compute_mean_std.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,12 @@

parser = argparse.ArgumentParser(
description='Computing mean and stddev for feature normalizer.')
parser.add_argument(
"--specgram_type",
default='linear',
type=str,
help="Feature type of audio data: 'linear' (power spectrum)"
" or 'mfcc'. (default: %(default)s)")
parser.add_argument(
"--manifest_path",
default='datasets/manifest.train',
Expand Down Expand Up @@ -39,7 +45,7 @@

def main():
augmentation_pipeline = AugmentationPipeline(args.augmentation_config)
audio_featurizer = AudioFeaturizer()
audio_featurizer = AudioFeaturizer(specgram_type=args.specgram_type)

def augment_and_featurize(audio_segment):
augmentation_pipeline.transform_audio(audio_segment)
Expand Down
48 changes: 45 additions & 3 deletions deep_speech_2/data_utils/featurizer/audio_featurizer.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,23 +6,26 @@
import numpy as np
from data_utils import utils
from data_utils.audio import AudioSegment
from python_speech_features import mfcc
from python_speech_features import delta


class AudioFeaturizer(object):
"""Audio featurizer, for extracting features from audio contents of
AudioSegment or SpeechSegment.

Currently, it only supports feature type of linear spectrogram.
Currently, it supports feature types of linear spectrogram and mfcc.

:param specgram_type: Specgram feature type. Options: 'linear'.
:type specgram_type: str
:param stride_ms: Striding size (in milliseconds) for generating frames.
:type stride_ms: float
:param window_ms: Window size (in milliseconds) for generating frames.
:type window_ms: float
:param max_freq: Used when specgram_type is 'linear', only FFT bins
:param max_freq: When specgram_type is 'linear', only FFT bins
corresponding to frequencies between [0, max_freq] are
returned.
returned; when specgram_type is 'mfcc', max_feq is the
highest band edge of mel filters.
:types max_freq: None|float
:param target_sample_rate: Audio are resampled (if upsampling or
downsampling is allowed) to this before
Expand Down Expand Up @@ -91,6 +94,9 @@ def _compute_specgram(self, samples, sample_rate):
return self._compute_linear_specgram(
samples, sample_rate, self._stride_ms, self._window_ms,
self._max_freq)
elif self._specgram_type == 'mfcc':
return self._compute_mfcc(samples, sample_rate, self._stride_ms,
self._window_ms, self._max_freq)
else:
raise ValueError("Unknown specgram_type %s. "
"Supported values: linear." % self._specgram_type)
Expand Down Expand Up @@ -142,3 +148,39 @@ def _specgram_real(self, samples, window_size, stride_size, sample_rate):
# prepare fft frequency list
freqs = float(sample_rate) / window_size * np.arange(fft.shape[0])
return fft, freqs

def _compute_mfcc(self,
samples,
sample_rate,
stride_ms=10.0,
window_ms=20.0,
max_freq=None):
"""Compute mfcc from samples."""
if max_freq is None:
max_freq = sample_rate / 2
if max_freq > sample_rate / 2:
raise ValueError("max_freq must be greater than half of "
"sample rate.")
if stride_ms > window_ms:
raise ValueError("Stride size must not be greater than "
"window size.")
# compute 13 cepstral coefficients, and the first one is replaced
# by log(frame energy)
mfcc_feat = mfcc(
signal=samples,
samplerate=sample_rate,
winlen=0.001 * window_ms,
winstep=0.001 * stride_ms,
highfreq=max_freq)
# Deltas
d_mfcc_feat = delta(mfcc_feat, 2)
# Deltas-Deltas
dd_mfcc_feat = delta(d_mfcc_feat, 2)
# concat above three features
concat_mfcc_feat = [
np.concatenate((mfcc_feat[i], d_mfcc_feat[i], dd_mfcc_feat[i]))
for i in xrange(len(mfcc_feat))
]
# transpose to be consistent with the linear specgram situation
concat_mfcc_feat = np.transpose(concat_mfcc_feat)
return concat_mfcc_feat
15 changes: 8 additions & 7 deletions deep_speech_2/data_utils/featurizer/speech_featurizer.py
Original file line number Diff line number Diff line change
Expand Up @@ -11,23 +11,24 @@ class SpeechFeaturizer(object):
"""Speech featurizer, for extracting features from both audio and transcript
contents of SpeechSegment.

Currently, for audio parts, it only supports feature type of linear
spectrogram; for transcript parts, it only supports char-level tokenizing
and conversion into a list of token indices. Note that the token indexing
order follows the given vocabulary file.
Currently, for audio parts, it supports feature types of linear
spectrogram and mfcc; for transcript parts, it only supports char-level
tokenizing and conversion into a list of token indices. Note that the
token indexing order follows the given vocabulary file.

:param vocab_filepath: Filepath to load vocabulary for token indices
conversion.
:type specgram_type: basestring
:param specgram_type: Specgram feature type. Options: 'linear'.
:param specgram_type: Specgram feature type. Options: 'linear', 'mfcc'.
:type specgram_type: str
:param stride_ms: Striding size (in milliseconds) for generating frames.
:type stride_ms: float
:param window_ms: Window size (in milliseconds) for generating frames.
:type window_ms: float
:param max_freq: Used when specgram_type is 'linear', only FFT bins
:param max_freq: When specgram_type is 'linear', only FFT bins
corresponding to frequencies between [0, max_freq] are
returned.
returned; when specgram_type is 'mfcc', max_freq is the
highest band edge of mel filters.
:types max_freq: None|float
:param target_sample_rate: Speech are resampled (if upsampling or
downsampling is allowed) to this before
Expand Down
2 changes: 1 addition & 1 deletion deep_speech_2/data_utils/normalizer.py
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,7 @@ class FeatureNormalizer(object):
if mean_std_filepath is provided (not None), the normalizer will directly
initilize from the file. Otherwise, both manifest_path and featurize_func
should be given for on-the-fly mean and stddev computing.

:param mean_std_filepath: File containing the pre-computed mean and stddev.
:type mean_std_filepath: None|basestring
:param manifest_path: Manifest of instances for computing mean and stddev.
Expand Down
7 changes: 7 additions & 0 deletions deep_speech_2/evaluate.py
Original file line number Diff line number Diff line change
Expand Up @@ -86,6 +86,12 @@
default=500,
type=int,
help="Width for beam search decoding. (default: %(default)d)")
parser.add_argument(
"--specgram_type",
default='linear',
type=str,
help="Feature type of audio data: 'linear' (power spectrum)"
" or 'mfcc'. (default: %(default)s)")
parser.add_argument(
"--decode_manifest_path",
default='datasets/manifest.test',
Expand All @@ -111,6 +117,7 @@ def evaluate():
vocab_filepath=args.vocab_filepath,
mean_std_filepath=args.mean_std_filepath,
augmentation_config='{}',
specgram_type=args.specgram_type,
num_threads=args.num_threads_data)

# create network config
Expand Down
7 changes: 7 additions & 0 deletions deep_speech_2/infer.py
Original file line number Diff line number Diff line change
Expand Up @@ -51,6 +51,12 @@
default=multiprocessing.cpu_count(),
type=int,
help="Number of cpu processes for beam search. (default: %(default)s)")
parser.add_argument(
"--specgram_type",
default='linear',
type=str,
help="Feature type of audio data: 'linear' (power spectrum)"
" or 'mfcc'. (default: %(default)s)")
parser.add_argument(
"--mean_std_filepath",
default='mean_std.npz',
Expand Down Expand Up @@ -118,6 +124,7 @@ def infer():
vocab_filepath=args.vocab_filepath,
mean_std_filepath=args.mean_std_filepath,
augmentation_config='{}',
specgram_type=args.specgram_type,
num_threads=args.num_threads_data)

# create network config
Expand Down
1 change: 1 addition & 0 deletions deep_speech_2/requirements.txt
Original file line number Diff line number Diff line change
Expand Up @@ -2,3 +2,4 @@ wget==3.2
scipy==0.13.1
resampy==0.1.5
https://github.com/kpu/kenlm/archive/master.zip
python_speech_features
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a version number.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No version number

7 changes: 7 additions & 0 deletions deep_speech_2/train.py
Original file line number Diff line number Diff line change
Expand Up @@ -53,6 +53,12 @@
default=True,
type=distutils.util.strtobool,
help="Use sortagrad or not. (default: %(default)s)")
parser.add_argument(
"--specgram_type",
default='linear',
type=str,
help="Feature type of audio data: 'linear' (power spectrum)"
" or 'mfcc'. (default: %(default)s)")
parser.add_argument(
"--max_duration",
default=27.0,
Expand Down Expand Up @@ -130,6 +136,7 @@ def data_generator():
augmentation_config=args.augmentation_config,
max_duration=args.max_duration,
min_duration=args.min_duration,
specgram_type=args.specgram_type,
num_threads=args.num_threads_data)

train_generator = data_generator()
Expand Down
7 changes: 7 additions & 0 deletions deep_speech_2/tune.py
Original file line number Diff line number Diff line change
Expand Up @@ -50,6 +50,12 @@
default=multiprocessing.cpu_count(),
type=int,
help="Number of cpu processes for beam search. (default: %(default)s)")
parser.add_argument(
"--specgram_type",
default='linear',
type=str,
help="Feature type of audio data: 'linear' (power spectrum)"
" or 'mfcc'. (default: %(default)s)")
parser.add_argument(
"--mean_std_filepath",
default='mean_std.npz',
Expand Down Expand Up @@ -133,6 +139,7 @@ def tune():
vocab_filepath=args.vocab_filepath,
mean_std_filepath=args.mean_std_filepath,
augmentation_config='{}',
specgram_type=args.specgram_type,
num_threads=args.num_threads_data)

# create network config
Expand Down