CUDA out of memory in decoding #70
Comments
A max-duration of 100 is still large. Maybe you can reduce max-duration to 50, 30, or even less. |
I have reduced max-duration to 1, but the error still exists. |
@csukuangfj We have followed your advice (1) and (3), but the problem is not solved. If you can give some other advice, thank you very much! |
You could also mess with the decoding parameters, e.g. reduce the
max-active and/or the beam.
|
icefall/egs/librispeech/ASR/tdnn_lstm_ctc/decode.py, lines 132 to 135 in adb068e
You can reduce search_beam, output_beam, or max_active_states. By the way, does CUDA out of memory abort your decoding process? Does it continue to decode after pruning? |
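For context, here is a minimal sketch of how those three parameters are typically fed into k2's pruned intersection when the decoding lattice is built. This is an illustrative toy example, not the actual decode.py code: the tiny decoding graph and the random network outputs are stand-ins.

import torch
import k2

# A toy decoding graph: the CTC topology over a 3-token vocabulary.
decoding_graph = k2.arc_sort(k2.ctc_topo(max_token=3, modified=False))

# Fake posteriors for one utterance: 10 frames over 4 symbols (blank + 3 tokens).
log_probs = torch.randn(1, 10, 4).log_softmax(dim=-1)
supervision_segments = torch.tensor([[0, 0, 10]], dtype=torch.int32)
dense_fsa_vec = k2.DenseFsaVec(log_probs, supervision_segments)

# Smaller beams and a lower max_active_states keep fewer arcs alive per frame,
# which directly reduces the memory needed for the lattice.
lattice = k2.intersect_dense_pruned(
    decoding_graph,
    dense_fsa_vec,
    search_beam=15.0,
    output_beam=5.0,
    min_active_states=30,
    max_active_states=7000,
)
print("lattice arcs:", lattice.num_arcs)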
Thanks! I will attempt to decode following your advice. The CUDA out of memory errors do not abort my decoding process; decoding can finish, but the results are very poor.
|
What do you mean by very poor? Is this your own data, or Librispeech?
The model quality and data quality can affect the memory used in decoding.
|
@danpovey @csukuangfj Thanks for your reply. We are new to icefall; we just ran the LibriSpeech recipe. We finished the training steps, and the above errors occurred in the decoding step. The decoding process can finish, but the WER on test-other is 59.41%.
The device we used is a V100 NVIDIA GPU (32 GB), and we followed csukuangfj's advice (1) and (3); the above errors still occur:
##############
2021-10-09 10:38:49,103 INFO [decode.py:387] Decoding started
2021-10-09 10:38:49,241 INFO [decode.py:388] {'exp_dir': PosixPath('tdnn_lstm_ctc/exp'), 'lang_dir': PosixPath('data/lang_phone'), 'lm_dir': PosixPath('data/lm'), 'feature_dim': 80, 'subsampling_factor': 3, 'search_beam': 15, 'output_beam': 5, 'min_active_states': 30, 'max_active_states': 7000, 'use_double_scores': True, 'epoch': 19, 'avg': 5, 'method': 'whole-lattice-rescoring', 'num_paths': 100, 'nbest_scale': 0.5, 'export': False, 'full_libri': True, 'feature_dir': PosixPath('data/fbank'), 'max_duration': 150, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2}
2021-10-09 10:38:50,467 INFO [lexicon.py:113] Loading pre-compiled data/lang_phone/Linv.pt
2021-10-09 10:38:52,312 INFO [decode.py:397] device: cuda
2021-10-09 10:40:48,429 INFO [decode.py:428] Loading pre-compiled G_4_gram.pt
2021-10-09 10:43:25,546 INFO [decode.py:458] averaging ['tdnn_lstm_ctc/exp/epoch-15.pt', 'tdnn_lstm_ctc/exp/epoch-16.pt', 'tdnn_lstm_ctc/exp/epoch-17.pt', 'tdnn_lstm_ctc/exp/epoch-18.pt', 'tdnn_lstm_ctc/exp/epoch-19.pt']
2021-10-09 10:44:14,941 INFO [decode.py:731] Caught exception:
CUDA out of memory. Tried to allocate 4.38 GiB (GPU 0; 31.75 GiB total capacity; 27.41 GiB already allocated; 365.75 MiB free; 30.23 GiB reserved in total by PyTorch)
2021-10-09 10:44:14,942 INFO [decode.py:732] num_arcs before pruning: 2061527
2021-10-09 10:44:14,977 INFO [decode.py:739] num_arcs after pruning: 113145
2021-10-09 10:44:16,184 INFO [decode.py:336] batch 0/?, cuts processed until now is 18
2021-10-09 10:44:16,944 INFO [decode.py:731] Caught exception:
CUDA out of memory. Tried to allocate 8.00 GiB (GPU 0; 31.75 GiB total capacity; 21.89 GiB already allocated; 4.36 GiB free; 26.23 GiB reserved in total by PyTorch)
2021-10-09 10:44:16,944 INFO [decode.py:732] num_arcs before pruning: 2814753
2021-10-09 10:44:16,982 INFO [decode.py:739] num_arcs after pruning: 120129
2021-10-09 10:44:18,624 INFO [decode.py:731] Caught exception:
CUDA out of memory. Tried to allocate 8.00 GiB (GPU 0; 31.75 GiB total capacity; 21.80 GiB already allocated; 1.54 GiB free; 29.05 GiB reserved in total by PyTorch)
#########################
We reduced search_beam (20->15) and max_active_states (10000->7000) a moment ago, but the error is the same. We suspect the error could be caused by processing G, and we may follow https://github.com/kaldi-asr/kaldi/pull/4594 to prune our G.
We can't pinpoint the cause of the error right now, so we need help. Thanks. |
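Aside: the "num_arcs before pruning / after pruning" messages in this log come from a recovery path that, after catching the OOM, prunes the lattice by arc posterior and retries. Below is a hedged sketch of that kind of pruning with k2; prune_on_arc_post is a real k2 function, but the wrapper and the threshold value are illustrative, not the exact icefall code.

import k2

def prune_lattice(lattice: k2.Fsa, threshold_prob: float = 1e-8) -> k2.Fsa:
    # Drop arcs whose posterior probability falls below threshold_prob,
    # shrinking the lattice so that later rescoring steps fit in GPU memory.
    print("num_arcs before pruning:", lattice.num_arcs)
    pruned = k2.prune_on_arc_post(lattice, threshold_prob, use_double_scores=True)
    print("num_arcs after pruning:", pruned.num_arcs)
    return pruned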
Hm, can you show the last part of one of the training logs or point to the tensorboard log (tensorboard dev upload --logdir blah/log)? I wonder whether the model is OK.
|
@danpovey OK, this is the training log of tdnn_lstm_ctc: tdnn-lstm-ctc-log-train.txt <https://github.com/k2-fsa/icefall/files/7315071/tdnn-lstm-ctc-log-train.txt> |
Your model did not converge; loss should be something like 0.005, not 0.5.
I believe when we ran it, we used --bucketing-sampler=True, that could
possibly be the reason.
Also we used several GPUs, but that should not really affect convergence I
think.
(Normally this script converges OK).
|
Thanks, I will modify the parameters and run again. The GPU device I used is an A100 NVIDIA GPU (40 GB), single GPU on a single machine. The parameters of the script ./tdnn_lstm_ctc/train.py were not modified. |
And please show us some sample decoding output; it is written somewhere (the aligned output vs. the reference text). I want to see how the model failed. Getting 59% WER is unusual; I'd expect it to be either 100% or close to 0.
|
OK, I chose the best results file (lm_scale_0.7) of the tdnn-lstm-ctc model, with decoding parameters { 'search_beam': 15, 'output_beam': 5, 'min_active_states': 30, 'max_active_states': 7000 }: errs-test-clean-lm_scale_0.7.txt |
@danpovey, I also ran the LibriSpeech Conformer CTC recipe, using an A100 NVIDIA GPU (40 GB) on a single-GPU, single-machine setup, and no parameters were modified during training. According to your comments, my model may not have converged. What should I change to make the loss converge? The last part of the training logs: |
@danpovey @csukuangfj Setting aside the loss convergence problem, the CUDA out of memory problems in decoding are still not solved. Could you give more help? |
The conformer model logs look normal. |
OK, I will do it.
|
@danpovey @csukuangfj Sorry to trouble you again. I just ran the decoding steps of conformer-ctc, and the same error occurred again (with reduced search_beam and max_active_states). Is this the same cause as with TDNN+LSTM+CTC, or is something wrong with our machine (we use a docker environment)?
2021-10-09 20:51:44,274 INFO [decode.py:732] num_arcs before pruning: 103742
2021-10-09 20:51:46,104 INFO [decode.py:732] num_arcs before pruning: 233253
2021-10-09 20:51:46,235 INFO [decode.py:732] num_arcs before pruning: 90555
2021-10-09 20:51:46,360 INFO [decode.py:732] num_arcs before pruning: 90414
2021-10-09 20:51:46,483 INFO [decode.py:732] num_arcs before pruning: 90366
2021-10-09 20:51:46,605 INFO [decode.py:732] num_arcs before pruning: 90366
2021-10-09 20:51:46,728 INFO [decode.py:732] num_arcs before pruning: 90366
2021-10-09 20:51:46,853 INFO [decode.py:732] num_arcs before pruning: 90366
2021-10-09 20:51:46,978 INFO [decode.py:732] num_arcs before pruning: 90366
2021-10-09 20:51:47,101 INFO [decode.py:732] num_arcs before pruning: 90366
2021-10-09 20:51:47,226 INFO [decode.py:732] num_arcs before pruning: 90366
2021-10-09 20:51:47,351 INFO [decode.py:732] num_arcs before pruning: 90366 |
I suggest you use CTC decoding to verify your model, following #71. If the results with CTC decoding are normal, maybe the problem is in your language model. |
Looks like it was attention decoding that failed; FST decoding worked, I think.
|
Following your advice, we used CTC decoding, but the decoding process is stuck. The logs are as follows; it has been in this state for more than 30 hours. ####### |
I am using NVIDIA Tesla V100 GPU with 32 GB RAM, Python 3.8 with torch 1.7.1
We have never encountered this issue before. Could you test the decoding script with a pre-trained model provided by us (see https://icefall.readthedocs.io/en/latest/recipes/librispeech/conformer_ctc.html#pre-trained-model)?
$ cd egs/librispeech/ASR
$ mkdir tmp
$ cd tmp
$ git lfs install
$ git clone https://huggingface.co/pkufool/icefall_asr_librispeech_conformer_ctc
$ cd ..
$ ln -s $PWD/tmp/icefall_asr_librispeech_conformer_ctc/exp/pretrained.pt conformer_ctc/exp/epoch-99.pt
And you can pass --epoch 99 --avg 1 when running ./conformer_ctc/decode.py.
If it still gets stuck, there is a higher chance that there are some problems with your configuration. |
Perhaps `nvidia-smi` would show something.
|
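In the same spirit, memory use can also be checked from inside the decoding process itself with standard torch.cuda calls. A small hedged helper (illustrative only, not part of icefall):

import torch

def log_gpu_memory(tag: str, device: int = 0) -> None:
    # Report how much GPU memory the current process has allocated and reserved.
    allocated = torch.cuda.memory_allocated(device) / 2**30
    reserved = torch.cuda.memory_reserved(device) / 2**30
    print(f"[{tag}] allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")

# Example usage: call it before and after building the lattice for one batch,
# e.g. log_gpu_memory("after intersect").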
You could perhaps debug by doing
export CUDA_LAUNCH_BLOCKING=1
gdb --args python3 [program and args]
(gdb) r
... and then pressing ctrl-c when it gets stuck. The backtrace may be useful, and more useful if you built k2 in debug mode.
|
I tested the decoding script with a pre-trained model and got the following error: ######### |
Please make sure that you have run git lfs install before cloning. Also, you can check the file size of the downloaded files (e.g. pretrained.pt and HLG.pt) to verify that git-lfs actually fetched them. |
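A quick way to do that file-size check from Python is sketched below. The paths follow the clone commands earlier in the thread, and the 1 MB threshold is an assumption: a git-lfs pointer file is only a few hundred bytes, while the real checkpoints are tens to hundreds of MB.

from pathlib import Path

repo = Path("tmp/icefall_asr_librispeech_conformer_ctc")
for name in ["exp/pretrained.pt", "data/lang_bpe/HLG.pt"]:
    f = repo / name
    size_mb = f.stat().st_size / 2**20
    status = "OK" if size_mb > 1 else "looks like a git-lfs pointer file"
    print(f"{f}: {size_mb:.1f} MB ({status})")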
We use the model tmp/icefall_asr_librispeech_conformer_ctc/exp/pretrained.pt and run:
python -m pdb conformer_ctc/decode.py --epoch 99 --avg 1 --method ctc-decoding --max-duration 50
These are the debug steps:
################
... and ctrl-c when it gets stuck:
^C
###########
Output of python3 -m k2.version:
Collecting environment information...
Output of nvidia-smi:
Mon Oct 11 17:38:22 2021 |
@Lzhang-hub Can you give the nvidia-smi output of the Tesla V100 and the A100-SXM4-40GB? |
|
OK, we will now try your suggestions. |
This is the result of running with GDB: in PruneTimeRange, begin_t=30, end_t=60. But we don't know what that tells us. |
Another thing you could do that would help debug this is, in intersect_dense_pruned.cu, around line 599 (it may have changed), print out arc_end_scores.RowSplits(1).Data() there. |
|
Ah yes, you're right.
…On Thu, Oct 21, 2021 at 8:57 PM Fangjun Kuang wrote:
arc_end_scores.RowSplits(1).Data() is a data pointer, should it be
arc_end_scores.RowSplits(1)?
|
@danpovey @csukuangfj These are the gdb logs; CUDA out of memory still occurs. |
Are you using the files from the pre-trained model directory? Just want to check that you are using the correct files. |
@cdxie Would you mind emailing me your QQ or WeChat? It is more efficient to discuss this problem there. We can report the solution here when it is fixed. You can find my email in my GitHub profile, thanks. |
OK, your output is not what I expected. The numbers of arcs being printed out are quite small. The large thing being allocated implies that there are about 100 million arcs, but the numbers you are printing indicate that there should be no more than 100,000 arcs active. Perhaps printing out old_states_offsets and old_arcs_offsets (after their data is written to) in PruneTimeRange would clarify things. |
Just wanted to chime in — I’m also seeing issues with CUDA memory usage in decoding. I had to set max_duration=5 to make ctc-decoding work. I’m also using a V100 GPU with 32GB RAM. |
Yeah I suspect a bug. Not sure what right now. I am working on something
so I am hoping the other guys can help figure it out. From those printed
numbers I don't think it is simply a case of lots of states being active--
the amount of memory it's trying to allocate seems to be too large for that.
Printing old_state_offsets and old_arcs_offsets in PruneTimeRange() may
help.
|
I will look into the issue. |
Turns out I can reproduce the issue using the pre-trained model downloaded from https://huggingface.co/pkufool/icefall_asr_librispeech_conformer_ctc When I created the pull request #58 supporting CTC decoding, the model that I used for testing was trained by myself, not the one downloaded from hugging face. I just re-tested CTC decoding using the pre-trained model from https://github.com/csukuangfj/icefall-asr-conformer-ctc-bpe-500 and everything works fine. The commands for reproducing are
The decoding logs are
The only difference between the model trained by me and the pre-trained model from Hugging Face is that the vocab size is changed from 5000 to 500. I suspect the OOM is caused by the large size of the CTC topo. If I switch to the modified CTC topo by changing
icefall/egs/librispeech/ASR/conformer_ctc/decode.py, lines 564 to 568 (at 712ead8)
to
H = k2.ctc_topo(
    max_token=max_token_id,
    modified=True,
    device=device,
)
then ctc-decoding works with the model downloaded from Hugging Face. The decoding logs are
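For context, a small hedged sketch of why the modified topology helps: the standard CTC topology grows roughly quadratically with the vocabulary size, while the modified one grows only linearly, so with a 5000-token vocabulary the graph fed into the intersection is far smaller. This is illustrative, not icefall code; note that the vocab_size=5000 case below builds a fairly large graph in CPU memory.

import k2

for vocab_size in (500, 5000):
    standard = k2.ctc_topo(max_token=vocab_size, modified=False)
    modified = k2.ctc_topo(max_token=vocab_size, modified=True)
    print(
        f"vocab={vocab_size}: standard topo has {standard.num_arcs} arcs, "
        f"modified topo has {modified.num_arcs} arcs"
    )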
Note: My model, i.e., the one from https://github.com/csukuangfj/icefall-asr-conformer-ctc-bpe-500, does not work well with modified CTC topo. The WERs for ctc-decoding using modified CTC topo degrade rapidly, see below:
|
As suggested by Piotr in #70 (comment),
|
Using the |
We use the same resources, the same parameters, and the same GPU device as you, but we can't decode completely, especially with --method attention-decoder or ctc-decoding and --max-duration 300. I think this is the most important thing to clarify, and we can take part in some verification. |
OK. I suspect there may be some logic in the code that is not correct, in that it might think it is measuring arcs but really be measuring states, or something like that. |
Have you tried that? |
Guys, I think I see the issue: the pruning beams are on min/max active states, but it's the arcs, not states, that are getting out of control. There are a few possible things we could do:
|
@csukuangfj We tried the modified CTC topo; ctc-decoding with --max-duration 300 is OK, using the https://huggingface.co/pkufool/icefall_asr_librispeech_conformer_ctc resource. whole-lattice-rescoring and attention-decoder still fail in decoding, maybe caused by loading G_4_gram.fst.txt:
2021-10-23 15:02:15,830 INFO [decode.py:538] Decoding started
2021-10-23 15:04:37,353 INFO [decode.py:474] batch 0/?, cuts processed until now is 70
2021-10-23 15:06:10,162 INFO [decode.py:681] Done! |
Could you show us the error log? |
@csukuangfj The error is OOM; that's what we mentioned the first time, and then you asked us to use ctc-decoding to see if there was any problem with the code. So, should we go back to the first error? |
Do you mean the log in #70 (comment)?
2021-10-14 11:02:43,243 INFO [pretrained.py:236] device: cuda:0
2021-10-14 11:02:43,243 INFO [pretrained.py:238] Creating model
2021-10-14 11:02:47,756 INFO [pretrained.py:255] Constructing Fbank computer
2021-10-14 11:02:47,758 INFO [pretrained.py:265] Reading sound files: ['./tmp/icefall_asr_librispeech_conformer_ctc/test_wavs/1089-134686-0001.flac', './tmp/icefall_asr_librispeech_conformer_ctc/test_wavs/1221-135766-0001.flac', './tmp/icefall_asr_librispeech_conformer_ctc/test_wavs/1221-135766-0002.flac']
2021-10-14 11:02:47,969 INFO [pretrained.py:271] Decoding started
2021-10-14 11:02:48,845 INFO [pretrained.py:327] Loading HLG from ./tmp/icefall_asr_librispeech_conformer_ctc/data/lang_bpe/HLG.pt
Segmentation fault (core dumped)
Could you figure out which line causes the segfault? As your code is in Python, you can use pdb to run the script step by step. |
A gdb stacktrace might also be helpful (complementary with the python one).
|
I think one advantage of CTC decoding is that it is super fast. |
Hi, I am new to icefall. I finished the training of tdnn_lstm_ctc, but when running the decoding steps I get the following error. I changed --max-duration, but there are still errors:
We set --max-duration=100 and use a Tesla V100-SXM; the GPU info follows:
Would you give me some advice? Thanks.