
Getting rid of need for alignments in iVector training in AMI #1514

Merged: 1 commit merged into kaldi-asr:master on Apr 2, 2017

Conversation

david-ryan-snyder (Contributor):

This only partially satisfies Issue #1123. I think it will be easier to commit support for this feature one recipe at a time. This starts with AMI.

  • Adds a script, steps/online/nnet2/get_pca_transform.sh, which estimates a PCA transform to be used as an alternative to LDA+MLLT for i-vectors (see the invocation sketch below)

In egs/ami/s5b:

  • Adds local/chain/tuning/run_tdnn_1d.sh, which is like local/chain/tuning/run_tdnn_1b.sh but uses PCA features for the i-vectors instead of LDA

  • Updates RESULTS_ihm to show results with the above tuning script (run_tdnn_1d)

  • Updates local/nnet3/run_ivector_common.sh to support either PCA or LDA features (LDA is the default)
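For reference, here is roughly how the new script is invoked from run_ivector_common.sh; the option values and paths below are illustrative, not necessarily the recipe's exact settings:

steps/online/nnet2/get_pca_transform.sh --cmd "$train_cmd" \
  --splice-opts "--left-context=3 --right-context=3" \
  --max-utts 10000 --subsample 2 \
  data/ihm/train_cleaned_sp_hires \
  exp/ihm/nnet3_cleaned/pca_transform   # output dir, used in place of the LDA+MLLT dir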

%WER 22.5 | 13098 94490 | 80.6 10.8 8.6 3.1 22.5 55.0 | 0.072 | exp/ihm/chain_cleaned/tdnn_sp_bi/decode_dev/ascore_10/dev_hires.ctm.filt.sys
%WER 22.5 | 12643 89978 | 80.3 12.5 7.2 2.7 22.5 53.1 | 0.149 | exp/ihm/chain_cleaned/tdnn_sp_bi/decode_eval/ascore_10/eval_hires.ctm.filt.sys

# cleanup + chain TDNN model. Uses PCA instead of LDA for ivector features.
Contributor (commenting on the RESULTS_ihm line above):

Could you rerun the baseline? I suspect something else may have changed since it was last run. And can you put the output of one of those compare_wer.sh scripts inside the 1d script, showing how it differs from 1b? If 1d is the best script so far, you could update the link.
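For context, the convention being asked for is a comparison table in the header of the tuning script, along the lines of the sketch below. The script name and invocation are assumptions, the 1b column is left unfilled pending the baseline rerun, and the 1d numbers are the ones quoted above:

# local/chain/compare_wer_general.sh tdnn1b_sp_bi tdnn1d_sp_bi   # hypothetical invocation
# System        tdnn1b_sp_bi  tdnn1d_sp_bi
# WER on dev       ...           22.5
# WER on eval      ...           22.5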

david-ryan-snyder (Author):

Yes. Starting it now.

david-ryan-snyder (Author) commented Mar 27, 2017:

@danpovey,

I'm not sure how we should handle the interaction between the older i-vector extractors used in the other nnet3 recipes in AMI and the one I'm updating (local/chain/run_tdnn.sh). To make my concern concrete:

If the user runs some nnet3 script in local/ (something that isn't local/chain/run_tdnn.sh), he will get an i-vector extractor in exp/ihm/nnet3_cleaned/ that is trained using the default LDA features. However, if the user then runs local/chain/run_tdnn.sh (the one updated in this PR) from --stage 4 or earlier, the old i-vector extractor will be obliterated (since they point to the same directory) and replaced with the one trained on PCA features.

A potential solution is to change nnet3_affix=_cleaned to nnet3_affix=1d_cleaned in local/chain/run_tdnn_1d.sh. This would stop the different versions of the i-vector extractor (e.g., the LDA vs. PCA versions) from overwriting each other. However, there are problems with this approach as well. Consider what happens when the user wants to train a new nnet3 recipe using the same PCA i-vectors: either he has to set nnet3_affix=1d_cleaned, which I think is confusing and will make the contents of exp/ uglier, or he has to train a new i-vector extractor for each new nnet3 system, which is unnecessarily wasteful.
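A minimal sketch of the collision, using the paths from the comment above (variable names as in the recipe):

# Both recipes derive the extractor directory from the same affix:
nnet3_affix=_cleaned
extractor=exp/ihm/nnet3${nnet3_affix}/extractor

# An LDA-based nnet3 recipe trains into $extractor first; running
# local/chain/run_tdnn.sh from --stage 4 or earlier then retrains a
# PCA-based extractor into the same $extractor path, clobbering it.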

danpovey (Contributor) commented Mar 27, 2017 via email

david-ryan-snyder (Author):

@danpovey,

OK, we can do that if that seems to be the best option.

But elsewhere you simply check whether some model or directory already exists, and if it does, you print something like "already exists, refusing to overwrite it." Why not do the same thing for the UBM and i-vector directories or models?
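A minimal sketch of that guard pattern (the message wording follows the quote above; this is not the actual script text):

if [ -d $dir/extractor ]; then
  # refuse to clobber a previously trained extractor
  echo "$0: $dir/extractor already exists, refusing to overwrite it."
  exit 1
fi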

danpovey (Contributor) commented Mar 27, 2017 via email

danpovey (Contributor) commented Mar 27, 2017 via email

david-ryan-snyder (Author):

OK. Backup it is.

This issue also applies to the UBM, and needs to be taken care of there as well, right?

danpovey (Contributor) commented Mar 27, 2017 via email

david-ryan-snyder (Author):

@danpovey, you can review this now when you get the chance.

The main feature is support for PCA instead of LDA for the i-vector features. For now, results are only reported in the AMI recipe for the default TDNN system. Once we get this through code review, I can start expanding it to other recipes.

In addition to the PCA vs. LDA work, there are also some changes to the i-vector scripts. These consist of the model backup we talked about and some minor changes so that the --cleanup option works correctly (see #1059).

danpovey (Contributor) left a review:

some cosmetic comments

num_threads_ubm=32
feat_type=lda
danpovey (commenting on feat_type=lda above):

Let's call this ivector_transform_type so as not to confuse it with the feat_type used in other scripts.

david-ryan-snyder (Author):
OK.
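For clarity, the rename amounts to something like this (a sketch; the default stays LDA):

ivector_transform_type=lda   # "lda" (LDA+MLLT, the default) or "pca"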

@@ -89,6 +89,15 @@ for f in $data/feats.scp "$online_cmvn_config" $srcdir/splice_opts $srcdir/final
[ ! -f "$f" ] && echo "$0: expecting file $f to exist" && exit 1
done

if [ -d "$dir" ]; then
bak_dir=$(mktemp -d ${dir}/backup.XXX);
echo "$0: Directory $dir already exists. Backing up models in ${bak_dir}";
danpovey (commenting on the hunk above):

models -> i-vector extractor

david-ryan-snyder (Author):

This is actually the script for training the diagonal UBM. So I can just refer to it as such here, and say "i-vector extractor" in the next script.
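A hedged sketch of how the complete backup block in train_diag_ubm.sh might read once this comment is addressed (the hunk above shows only its start); the exact list of files moved is an assumption:

if [ -d "$dir" ]; then
  bak_dir=$(mktemp -d ${dir}/backup.XXX);
  echo "$0: Directory $dir already exists. Backing up diagonal UBM in ${bak_dir}";
  # move (rather than delete) any previously trained files into the backup dir
  for f in $dir/final.dubm $dir/final.mat $dir/online_cmvn.conf; do
    [ -f "$f" ] && mv "$f" "${bak_dir}";
  done
fi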

danpovey (Contributor) commented Apr 2, 2017 via email

Commit: [scripts,egs] Adding options for using PCA instead of LDA+MLLT for ivectors used in ASR. Results are reported in the default TDNN recipe in AMI. Updating steps/online/nnet2/{train_diag_ubm.sh,train_ivector_extractor.sh} so that they now backup the contents of their destination directory if it already exists.
david-ryan-snyder (Author):

@danpovey, I addressed your last comments.

danpovey (Contributor) commented Apr 2, 2017:

Thanks! Merging.

danpovey merged commit e8ca375 into kaldi-asr:master on Apr 2, 2017