
[WIP] Pruned transducer stateless2 for WenetSpeech #314

Conversation

luomingshuang
Collaborator

This PR adds pruned transducer stateless2 for WenetSpeech. It is still in progress. I am currently training on 1400+ hours of data and will update the results in real time.

@luomingshuang
Collaborator Author

luomingshuang commented Apr 15, 2022

Here are some preliminary results using char as the modeling unit. When I train with the original pruned RNN-T, the pruned RNN-T loss becomes abnormal (too large in some iterations) after 23 epochs. The pruned RNN-T2 does not have this problem. BTW, some decoding results are as follows:

| Subset for training | Decoding method | Epoch | Avg | Pruned RNN-T (test-net/test-meeting) | Reworked model (dev/test-net/test-meeting) |
|---|---|---|---|---|---|
| 1400+ hours in L | greedy_search | 23 | 1 | 17.94/30.24 | 12.14/13.76/24.91 |
| 1400+ hours in L | greedy_search | 22 | 1 | 18.17/29.66 | 11.85/13.54/23.52 |
| 1400+ hours in L | greedy_search | 21 | 1 | 18.61/30.66 | 12.80/14.18/27.16 |
| L | greedy_search | 0 | 1 | -/- | 11.90/14.00/23.26 |
| L | greedy_search | 1 | 1 | -/- | 11.07/12.29/23.50 |
| L | greedy_search | 5 | 2 | -/- | 8.17/9.37/14.47 |
| L | modified_beam_search | 5 | 2 | -/- | 8.07/9.37/14.31 |
| L | fast_beam_search | 5 | 2 | -/- | 8.25/9.27/14.81 |
| M | greedy_search | 29 | 11 | -/- | 10.40/11.31/19.64 |
| M | modified_beam_search | 29 | 11 | -/- | 9.85/11.04/18.20 |
| M | fast_beam_search | 29 | 11 | -/- | 10.18/11.10/19.32 |
| S | greedy_search | 29 | 24 | -/- | 19.92/25.20/35.35 |
| S | modified_beam_search | 29 | 24 | -/- | 18.62/23.88/33.80 |
| S | fast_beam_search | 29 | 24 | -/- | 19.31/24.41/34.87 |
| S | / | / | / | kaldi | 11.70/17.47/37.27 |
| M | / | / | / | kaldi | 9.81/14.19/28.22 |
| L | / | / | / | kaldi | 9.07/12.83/24.72 |
| L | / | / | / | wenet | 8.88/9.70/15.59 |
| L | / | / | / | espnet | 9.70/8.90/15.90 |

These results show that the reworked model is better than the original one. Next, I will train the reworked model (char) on the full WenetSpeech data.

@danpovey
Collaborator

danpovey commented Apr 15, 2022

Cool... what WER should we be aiming for, i.e. what is the state of the art? Also please show the decoding configuration, e.g. decoding algorithm and how many models you averaged.

@csukuangfj
Collaborator

> BTW, some decoding results are as follows (set avg=1):

I would recommend using --avg=n where n > 1. Greedy search is very fast, so it should not take much time to find the n that leads to the lowest WER for a given --epoch.
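For instance, a quick sweep over --avg values could look like the following sketch. This is not part of the PR, and the decode.py flags shown are assumptions based on typical icefall recipes, so adjust them to match the actual script:

import subprocess

# Hypothetical sweep: decode with greedy search for each avg in 1..epoch and
# keep the value that gives the lowest WER for the chosen --epoch.
epoch = 5
for avg in range(1, epoch + 1):  # --avg cannot exceed --epoch
    subprocess.run(
        [
            "python", "./pruned_transducer_stateless2/decode.py",
            "--epoch", str(epoch),
            "--avg", str(avg),
            "--decoding-method", "greedy_search",
            "--exp-dir", "pruned_transducer_stateless2/exp",  # assumed path
        ],
        check=True,
    )
# Then compare the WERs reported in the decoding logs.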

@luomingshuang
Collaborator Author

OK, will add it.

> Cool... what WER should we be aiming for, i.e. what is the state of the art? Also please show the decoding configuration, e.g. decoding algorithm and how many models you averaged.

@pkufool
Collaborator

pkufool commented Apr 15, 2022

> Cool... what WER should we be aiming for, i.e. what is the state of the art? Also please show the decoding configuration, e.g. decoding algorithm and how many models you averaged.

From the WenetSpeech GitHub page (https://github.com/wenet-e2e/WenetSpeech):

[image: benchmark results table from the WenetSpeech page]

@danpovey
Collaborator

OK, when you report results please be clear about which subsets you are referring to!

@csukuangfj
Collaborator

[two screenshots, 2022-04-15]

@luomingshuang
Collaborator Author

I will update the results in real time.

@csukuangfj
Collaborator

Please use predefined training subsets. Otherwise, the results are not comparable.

@luomingshuang luomingshuang changed the title [WIP] pruned transducer stateless2 for WenetSpeech [WIP] Pruned transducer stateless2 for WenetSpeech Apr 18, 2022
else:
    cuts_train = CutSet.from_file(
        self.args.manifest_dir
        / "cuts_L.jsonl.gz"
        # use cuts_L_50_pieces.jsonl.gz for original experiments
    )
Collaborator

@csukuangfj csukuangfj Apr 18, 2022

What are "original experiments" in the comment?

Collaborator Author

It means that cuts_L_50_pieces.jsonl.gz contains the 1400+ hours subset. It refers to the results shown above.

log "Stage 8: Compute fbank for musan"
if [ $stage -le 12 ] && [ $stop_stage -ge 12 ]; then
log "Stage 12: Combine features for M"
if [ ! -f data/fbank/cuts_M.jsonl.gz ]; then
Collaborator

You don't need to combine them, I think.
For large datasets, the combining process is slow and the resulting file is large.

You can use

import glob
import logging
import re

import lhotse
from lhotse import CutSet


def train_XL_cuts(self) -> CutSet:
    """Combine the per-split XL manifests into one lazy CutSet."""
    logging.info("About to get train-XL cuts")
    filenames = list(
        glob.glob(f"{self.manifest_dir}/XL_split_2000/cuts_XL.*.jsonl.gz")
    )
    # Sort the split files numerically by index for a deterministic order.
    pattern = re.compile(r"cuts_XL.([0-9]+).jsonl.gz")
    idx_filenames = [(int(pattern.search(f).group(1)), f) for f in filenames]
    idx_filenames = sorted(idx_filenames, key=lambda x: x[0])
    sorted_filenames = [f[1] for f in idx_filenames]
    logging.info(f"Loading {len(sorted_filenames)} splits")
    # Lazily-loaded manifests are streamed from disk, so nothing large is
    # materialized in memory or written back out.
    return lhotse.combine(
        lhotse.load_manifest_lazy(p) for p in sorted_filenames
    )

as an example
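For instance, inside the same datamodule class, the returned lazy CutSet can be fed straight to a sampler. A sketch assuming lhotse's DynamicBucketingSampler (the max_duration value is illustrative):

from lhotse.dataset import DynamicBucketingSampler

# The combined CutSet is lazy, so the sampler streams cuts from the split
# manifests on disk; the big combined manifest never has to exist.
train_sampler = DynamicBucketingSampler(
    self.train_XL_cuts(),  # the lazy CutSet from the method above
    max_duration=200.0,    # illustrative batch budget in seconds of audio
    shuffle=True,
)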

@@ -5,6 +5,7 @@ set -eou pipefail
 nj=15
 stage=0
 stop_stage=100
+use_whole_text=True
Collaborator

Where is it used and what is its meaning?

Collaborator Author

Em... actually, I want to provide a choice between using the transcripts of the L subset and using the text of the whole dataset to generate text, token.txt, words.txt, and so on. But it is not finished yet. I will add more about it.

@luomingshuang
Collaborator Author

I have run into a problem. My process was interrupted by SIGKILL, so I loaded the nearest batch checkpoint checkpoint-xxxx.pt to continue training, but the code seems to take a very long time to load the model. It is now stuck in the following state (I started loading at 10:08, and it is now 10:35):
[image: training log stuck after loading the checkpoint]

@luomingshuang
Collaborator Author

luomingshuang commented Apr 21, 2022

It seems the code spends a lot of time on the following step.

    scaler = GradScaler(enabled=params.use_fp16)
    if checkpoints and "grad_scaler" in checkpoints:
        logging.info("Loading grad scaler state dict")
        scaler.load_state_dict(checkpoints["grad_scaler"])

If I instead load the model epoch-xxx.pt, this problem does not occur.

@csukuangfj
Collaborator

> It seems the code spends a lot of time on the following step.
>
>     scaler = GradScaler(enabled=params.use_fp16)
>     if checkpoints and "grad_scaler" in checkpoints:
>         logging.info("Loading grad scaler state dict")
>         scaler.load_state_dict(checkpoints["grad_scaler"])
>
> If I instead load the model epoch-xxx.pt, this problem does not occur.

That is expected.

for batch_idx, batch in enumerate(train_dl):
    if batch_idx < cur_batch_idx:
        continue
    cur_batch_idx = batch_idx

It has to skip as many batches as needed to reach the point where your last checkpoint was saved.
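For reference, a minimal sketch of timing this skip loop (train_dl and cur_batch_idx are the names from the snippet above; everything else is assumed), which shows the delay comes from the dataloader replaying batches, not from GradScaler.load_state_dict():

import time

t0 = time.time()
for batch_idx, batch in enumerate(train_dl):
    if batch_idx < cur_batch_idx:
        continue  # each skipped batch was still fully loaded and collated
    print(f"skipped {batch_idx} batches in {time.time() - t0:.1f}s")
    break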

@luomingshuang
Collaborator Author

OK.

@pzelasko
Collaborator

pzelasko commented Apr 21, 2022

I figured out how to add support for quickly restoring a dynamic sampler's state, but forgot to actually add it. You can try saving sampler.state_dict() and then calling sampler.load_state_dict(state_dict) with this PR and let me know if it works: lhotse-speech/lhotse#684

It should be much faster because it only iterates cuts rather than doing actual dataloading.
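A minimal sketch of how this could be wired into the checkpointing here, assuming the state_dict()/load_state_dict() sampler API from lhotse-speech/lhotse#684 (the checkpoint layout and paths below are illustrative, not the PR's actual code):

import torch

# When saving a batch checkpoint, persist the sampler state as well.
torch.save(
    {
        "model": model.state_dict(),
        "sampler": train_dl.sampler.state_dict(),
    },
    "checkpoint-40.pt",  # hypothetical path
)

# When resuming, restore it before iterating the dataloader.
ckpt = torch.load("checkpoint-40.pt", map_location="cpu")
model.load_state_dict(ckpt["model"])
train_dl.sampler.load_state_dict(ckpt["sampler"])
# Iterating train_dl now resumes from where the sampler left off, touching
# only the cut manifests rather than running the full dataloading pipeline.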

@pzelasko
Collaborator

... note that it's supposed to "just work" and pick up from where it was saved when you start iterating, so you won't need the if statement that checks batch_idx.

@luomingshuang
Collaborator Author

I have tested PR #684. It runs successfully. With this version, loading the batch checkpoint takes very little time. Also, I can remove the check on batch_idx.

> I figured out how to add support for quickly restoring a dynamic sampler's state, but forgot to actually add it. You can try saving sampler.state_dict() and then calling sampler.load_state_dict(state_dict) with this PR and let me know if it works: lhotse-speech/lhotse#684
>
> It should be much faster because it only iterates cuts rather than doing actual dataloading.

@pzelasko
Collaborator

Thanks. Please double-check that you are getting the expected number of steps in an epoch after resuming, to confirm that the implementation is correct. There might be num_workers * prefetching_factor - 1 fewer batches (typically fewer than 10), but that is expected.
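As a quick sanity check (a sketch; batches_per_epoch and start_batch are taken from the logs discussed below, and train_dl is the resumed dataloader), the resumed epoch should yield roughly this many batches:

# With num_workers=2 and a prefetch factor of 2, up to 2 * 2 - 1 = 3 batches
# may be missing; that is the expected discrepancy mentioned above.
batches_per_epoch = 924  # observed in a fresh run
start_batch = 800        # checkpoint used for resuming
remaining = sum(1 for _ in train_dl)  # consumes the resumed dataloader
print(f"remaining={remaining}, expected about {batches_per_epoch - start_batch}")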

@luomingshuang
Collaborator Author

I am testing the restore-dynamic-sampler #684 again.
[image: side-by-side training logs, fresh run vs. resumed run]

As shown in the picture above, the left part is my training process starting from epoch 0 and batch idx 0; I set it to save a checkpoint-xxx.pt every 200 batches and use a small portion of the data for testing. The batch idx for epoch 0 runs from 0 to 924. In the right part, I load checkpoint-800.pt to continue training. Its starting batch idx is 799, and for its epoch 0 the batch idx runs from 799 to 1723.

@pzelasko
Collaborator

Please clarify: did you expect epoch 0 to run for 124 more batches after step 800 (total 924), but it ran for 924 batches instead (total 1723) when resumed?

@luomingshuang
Collaborator Author

Yes. Is that a reasonable expectation?

> Please clarify: did you expect epoch 0 to run for 124 more batches after step 800 (total 924), but it ran for 924 batches instead (total 1723) when resumed?

@pzelasko
Collaborator

Yes, it's an issue, I need to debug that.

@pzelasko
Collaborator

Hey @luomingshuang, I was not able to reproduce your issue using mini librispeech. Please see the following notebook on Colab and let me know if anything is done differently than in your recipe. If you can reproduce it here, it will be easy for me to fix. Or maybe you're using this code in a way I haven't anticipated.

https://drive.google.com/file/d/1hZSSIN8K6VPCw12pIHwUtL5w0bOzyvct/view?usp=sharing

@luomingshuang
Collaborator Author

OK, thanks. I will look at the Colab notebook and check my code again.

@luomingshuang
Collaborator Author

BTW, when I train on subset M with char using the following commands:
[image: training commands]

an error occurs in my code; the details of the error are shown in the following pictures:
[images: error traceback screenshots]

@luomingshuang
Collaborator Author

As you can see, the code had been running normally for 6 epochs before the error. Somebody on the internet says it may be because the server does not have enough CPUs; I'm not sure.

@luomingshuang
Collaborator Author

Hi @pzelasko, I can reproduce your results with the provided Colab notebook using my current Lhotse. But when I use the code as in this PR (the sampler is used in asr_datamodule.py and train.py) with the same Lhotse for WenetSpeech, the problem I described above still exists. When I load checkpoint-xxx.pt, the first training epoch still runs over the whole dataloader regardless of the start batch. After the first epoch, it runs normally.

> Hey @luomingshuang, I was not able to reproduce your issue using mini librispeech. Please see the following notebook on Colab and let me know if anything is done differently than in your recipe. If you can reproduce it here, it will be easy for me to fix. Or maybe you're using this code in a way I haven't anticipated.
>
> https://drive.google.com/file/d/1hZSSIN8K6VPCw12pIHwUtL5w0bOzyvct/view?usp=sharing

@luomingshuang
Collaborator Author

I use a small part of WenetSpeech for debugging. My run command is as follows:

export CUDA_VISIBLE_DEVICES="7"

python ./pruned_transducer_stateless2/train.py \
  --lang-dir data/lang_char \
  --exp-dir pruned_transducer_stateless2/exp-char-S-debug-batch-checkpoint \
  --max-duration 100 \
  --valid-interval 100 \
  --start-batch 40 \
  --model-warm-step 100 \
  --world-size 1 \
  --save-every-n 20 \
  --training-subset S

I save a checkpoint every 20 iterations. One epoch has 66 iterations. I use checkpoint-40 for debugging. Some logs follow:

(k2-python) luomingshuang@de-74279-k2-train-3-0307202051-57dc848959-8tmmp:~/icefall/egs/wenetspeech/ASR$ ./run_debug_batch_checkpoint_load.sh
2022-04-28 19:01:22,826 INFO [train.py:819] Training started
2022-04-28 19:01:23.040117: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /ceph-sh1/fangjun/software/cuda-10.2.89/lib:/ceph-sh1/fangjun/software/cuda-10.2.89/lib64:
2022-04-28 19:01:23.040164: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2022-04-28 19:01:25,575 INFO [train.py:829] Device: cuda:0
2022-04-28 19:01:26,699 INFO [lexicon.py:176] Loading pre-compiled data/lang_char/Linv.pt
2022-04-28 19:01:26,789 INFO [train.py:848] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 1, 'reset_interval': 200, 'feature_dim': 80, 'subsampling_factor': 4, 'encoder_dim': 512, 'nhead': 8, 'dim_feedforward': 2048, 'num_encoder_layers': 12, 'decoder_dim': 512, 'joiner_dim': 512, 'env_info': {'k2-version': '1.14', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '4fb6b88661cca73e5f66f03df16e5a1d0c4886f8', 'k2-git-date': 'Fri Apr 8 18:29:32 2022', 'lhotse-version': '1.1.0', 'torch-version': '1.11.0', 'torch-cuda-available': True, 'torch-cuda-version': '10.2', 'python-version': '3.8', 'icefall-git-branch': 'wenetspeech-pruned-transducer-stateless2', 'icefall-git-sha1': '4b567e4-dirty', 'icefall-git-date': 'Wed Apr 27 13:43:54 2022', 'icefall-path': '/ceph-meixu/luomingshuang/icefall', 'k2-path': '/ceph-ms/luomingshuang/k2/k2/python/k2/__init__.py', 'lhotse-path': '/ceph-meixu/luomingshuang/anaconda3/envs/k2-python/lib/python3.8/site-packages/lhotse-1.1.0-py3.8.egg/lhotse/__init__.py', 'hostname': 'de-74279-k2-train-3-0307202051-57dc848959-8tmmp', 'IP address': '10.177.24.138'}, 'world_size': 1, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 30, 'start_epoch': 0, 'start_batch': 40, 'exp_dir': PosixPath('pruned_transducer_stateless2/exp-char-S-debug-batch-checkpoint'), 'lang_dir': PosixPath('data/lang_char'), 'token_type': 'char', 'initial_lr': 0.003, 'lr_batches': 5000, 'lr_epochs': 6, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'save_every_n': 20, 'keep_last_k': 20, 'use_fp16': False, 'valid_interval': 100, 'model_warm_step': 100, 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 100, 'bucketing_sampler': True, 'num_buckets': 300, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'lazy_load': True, 'training_subset': 'S', 'blank_id': 0, 'vocab_size': 5537}
2022-04-28 19:01:26,790 INFO [train.py:850] About to create model
2022-04-28 19:01:27,372 INFO [train.py:854] Number of model parameters: 88978927
2022-04-28 19:01:27,373 INFO [checkpoint.py:100] Loading checkpoint from pruned_transducer_stateless2/exp-char-S-debug-batch-checkpoint/checkpoint-40.pt
2022-04-28 19:01:31,180 INFO [train.py:869] Loading optimizer state dict
2022-04-28 19:01:31,907 INFO [train.py:877] Loading scheduler state dict
2022-04-28 19:01:31,907 INFO [asr_datamodule.py:427] About to get train cuts
2022-04-28 19:01:31,907 INFO [asr_datamodule.py:429] use lazy cuts
2022-04-28 19:01:31,908 INFO [asr_datamodule.py:445] About to get dev cuts
2022-04-28 19:01:32,964 INFO [asr_datamodule.py:359] About to create dev dataset
/ceph-meixu/luomingshuang/anaconda3/envs/k2-python/lib/python3.8/site-packages/lhotse-1.1.0-py3.8.egg/lhotse/dataset/sampling/dynamic_bucketing.py:129: UserWarning: You are using DynamicBucketingSampler with an eagerly read CutSet. You won't see any memory/speed benefits with that setup. Either use 'CutSet.from_jsonl_lazy' to read the CutSet lazily, or use a BucketingSampler instead.
2022-04-28 19:01:32,972 INFO [asr_datamodule.py:380] About to create dev dataloader
2022-04-28 19:01:32,972 INFO [asr_datamodule.py:220] About to get Musan cuts
2022-04-28 19:01:35,308 INFO [asr_datamodule.py:227] Enable MUSAN
2022-04-28 19:01:35,308 INFO [asr_datamodule.py:252] Enable SpecAugment
2022-04-28 19:01:35,308 INFO [asr_datamodule.py:253] Time warp factor: 80
2022-04-28 19:01:35,308 INFO [asr_datamodule.py:265] Num frame mask: 10
2022-04-28 19:01:35,308 INFO [asr_datamodule.py:278] About to create train dataset
2022-04-28 19:01:35,308 INFO [asr_datamodule.py:306] Using DynamicBucketingSampler.
2022-04-28 19:01:35,669 INFO [asr_datamodule.py:322] About to create train dataloader
2022-04-28 19:01:35,669 INFO [asr_datamodule.py:344] Loading sampler state dict
{'epoch': 0, 'world_size': 1, 'rank': 0, 'seed': 0, 'shuffle': True, 'diagnostics': {'current_epoch': 0, 'stats_per_epoch': {0: {'epoch': 0, 'kept_cuts': 1998, 'discarded_cuts': 0, 'kept_batches': 44, 'discarded_batches': 0}}}, 'max_duration': 100, 'max_cuts': None, 'drop_last': True, 'consistent_ids': True, 'buffer_size': 30000, 'num_cuts_for_bins_estimate': 10000, 'shuffle_buffer_size': 20000, 'strict': True}
2022-04-28 19:01:36,758 INFO [train.py:940] Loading grad scaler state dict
2022-04-28 19:01:38,665 INFO [train.py:757] Epoch 0, batch 0, loss[loss=0.455, simple_loss=0.91, pruned_loss=8.779, over 2490.00 frames.], tot_loss[loss=0.455, simple_loss=0.91, pruned_loss=8.779, over 2490.00 frames.], batch size: 30, lr: 3.00e-03
2022-04-28 19:01:39,249 INFO [train.py:757] Epoch 0, batch 1, loss[loss=0.4376, simple_loss=0.8752, pruned_loss=8.768, over 2491.00 frames.], tot_loss[loss=0.4463, simple_loss=0.8926, pruned_loss=8.774, over 4968.55 frames.], batch size: 47, lr: 3.00e-03
2022-04-28 19:01:39,795 INFO [train.py:757] Epoch 0, batch 2, loss[loss=0.4416, simple_loss=0.8832, pruned_loss=8.942, over 2496.00 frames.], tot_loss[loss=0.4447, simple_loss=0.8894, pruned_loss=8.83, over 7439.71 frames.], batch size: 32, lr: 3.00e-03
2022-04-28 19:01:40,335 INFO [train.py:757] Epoch 0, batch 3, loss[loss=0.3987, simple_loss=0.7974, pruned_loss=8.835, over 2403.00 frames.], tot_loss[loss=0.4334, simple_loss=0.8669, pruned_loss=8.831, over 9805.51 frames.], batch size: 15, lr: 3.00e-03
2022-04-28 19:01:40,872 INFO [train.py:757] Epoch 0, batch 4, loss[loss=0.46, simple_loss=0.92, pruned_loss=8.908, over 2464.00 frames.], tot_loss[loss=0.4388, simple_loss=0.8776, pruned_loss=8.847, over 12220.48 frames.], batch size: 44, lr: 3.00e-03
2022-04-28 19:01:41,428 INFO [train.py:757] Epoch 0, batch 5, loss[loss=0.4898, simple_loss=0.9796, pruned_loss=8.733, over 2480.00 frames.], tot_loss[loss=0.4474, simple_loss=0.8949, pruned_loss=8.827, over 14639.38 frames.], batch size: 80, lr: 3.00e-03
2022-04-28 19:01:42,013 INFO [train.py:757] Epoch 0, batch 6, loss[loss=0.4639, simple_loss=0.9278, pruned_loss=8.943, over 2484.00 frames.], tot_loss[loss=0.4498, simple_loss=0.8997, pruned_loss=8.844, over 17050.18 frames.], batch size: 54, lr: 3.00e-03
2022-04-28 19:01:42,563 INFO [train.py:757] Epoch 0, batch 7, loss[loss=0.4232, simple_loss=0.8464, pruned_loss=8.795, over 2496.00 frames.], tot_loss[loss=0.4464, simple_loss=0.8928, pruned_loss=8.838, over 19460.93 frames.], batch size: 39, lr: 3.00e-03
2022-04-28 19:01:43,121 INFO [train.py:757] Epoch 0, batch 8, loss[loss=0.4364, simple_loss=0.8727, pruned_loss=8.92, over 2457.00 frames.], tot_loss[loss=0.4453, simple_loss=0.8906, pruned_loss=8.847, over 21820.63 frames.], batch size: 39, lr: 3.00e-03
2022-04-28 19:01:43,688 INFO [train.py:757] Epoch 0, batch 9, loss[loss=0.4808, simple_loss=0.9616, pruned_loss=8.997, over 2451.00 frames.], tot_loss[loss=0.4489, simple_loss=0.8978, pruned_loss=8.862, over 24162.52 frames.], batch size: 43, lr: 3.00e-03
2022-04-28 19:01:44,235 INFO [train.py:757] Epoch 0, batch 10, loss[loss=0.4279, simple_loss=0.8558, pruned_loss=8.93, over 2448.00 frames.], tot_loss[loss=0.447, simple_loss=0.8939, pruned_loss=8.869, over 26489.71 frames.], batch size: 22, lr: 3.00e-03
2022-04-28 19:01:44,791 INFO [train.py:757] Epoch 0, batch 11, loss[loss=0.4915, simple_loss=0.983, pruned_loss=9.041, over 2450.00 frames.], tot_loss[loss=0.4507, simple_loss=0.9015, pruned_loss=8.883, over 28807.26 frames.], batch size: 50, lr: 3.00e-03
2022-04-28 19:01:45,343 INFO [train.py:757] Epoch 0, batch 12, loss[loss=0.4249, simple_loss=0.8497, pruned_loss=8.851, over 2470.00 frames.], tot_loss[loss=0.4487, simple_loss=0.8974, pruned_loss=8.881, over 31133.23 frames.], batch size: 38, lr: 3.00e-03
2022-04-28 19:01:45,888 INFO [train.py:757] Epoch 0, batch 13, loss[loss=0.4676, simple_loss=0.9352, pruned_loss=8.874, over 2496.00 frames.], tot_loss[loss=0.4501, simple_loss=0.9002, pruned_loss=8.88, over 33473.56 frames.], batch size: 48, lr: 3.00e-03
2022-04-28 19:01:46,459 INFO [train.py:757] Epoch 0, batch 14, loss[loss=0.4217, simple_loss=0.8435, pruned_loss=8.807, over 2442.00 frames.], tot_loss[loss=0.4482, simple_loss=0.8963, pruned_loss=8.875, over 35748.19 frames.], batch size: 33, lr: 3.00e-03
2022-04-28 19:01:47,012 INFO [train.py:757] Epoch 0, batch 15, loss[loss=0.4808, simple_loss=0.9617, pruned_loss=8.937, over 2494.00 frames.], tot_loss[loss=0.4503, simple_loss=0.9006, pruned_loss=8.879, over 38063.45 frames.], batch size: 58, lr: 3.00e-03
2022-04-28 19:01:47,576 INFO [train.py:757] Epoch 0, batch 16, loss[loss=0.4133, simple_loss=0.8267, pruned_loss=8.74, over 2450.00 frames.], tot_loss[loss=0.4481, simple_loss=0.8961, pruned_loss=8.871, over 40323.13 frames.], batch size: 35, lr: 3.00e-03
2022-04-28 19:01:48,104 INFO [train.py:757] Epoch 0, batch 17, loss[loss=0.4579, simple_loss=0.9158, pruned_loss=8.859, over 2470.00 frames.], tot_loss[loss=0.4486, simple_loss=0.8972, pruned_loss=8.87, over 42591.52 frames.], batch size: 65, lr: 3.00e-03
2022-04-28 19:01:48,650 INFO [train.py:757] Epoch 0, batch 18, loss[loss=0.4509, simple_loss=0.9017, pruned_loss=8.926, over 2450.00 frames.], tot_loss[loss=0.4487, simple_loss=0.8975, pruned_loss=8.873, over 44828.56 frames.], batch size: 49, lr: 3.00e-03
2022-04-28 19:01:49,199 INFO [checkpoint.py:67] Saving checkpoint to pruned_transducer_stateless2/exp-char-S-debug-batch-checkpoint/checkpoint-60.pt
2022-04-28 19:01:52,156 INFO [train.py:757] Epoch 0, batch 19, loss[loss=0.4387, simple_loss=0.8774, pruned_loss=8.853, over 2496.00 frames.], tot_loss[loss=0.4482, simple_loss=0.8964, pruned_loss=8.872, over 47100.42 frames.], batch size: 64, lr: 3.00e-03
2022-04-28 19:01:52,710 INFO [train.py:757] Epoch 0, batch 20, loss[loss=0.5755, simple_loss=1.151, pruned_loss=9.952, over 2484.00 frames.], tot_loss[loss=0.4546, simple_loss=0.9092, pruned_loss=8.926, over 49348.92 frames.], batch size: 69, lr: 3.00e-03
2022-04-28 19:01:53,245 INFO [train.py:757] Epoch 0, batch 21, loss[loss=0.3958, simple_loss=0.7917, pruned_loss=8.665, over 2436.00 frames.], tot_loss[loss=0.4518, simple_loss=0.9037, pruned_loss=8.914, over 51538.17 frames.], batch size: 29, lr: 3.00e-03
2022-04-28 19:01:53,787 INFO [train.py:757] Epoch 0, batch 22, loss[loss=0.458, simple_loss=0.916, pruned_loss=8.99, over 2450.00 frames.], tot_loss[loss=0.4521, simple_loss=0.9042, pruned_loss=8.918, over 53730.48 frames.], batch size: 49, lr: 3.00e-03
2022-04-28 19:01:54,351 INFO [train.py:757] Epoch 0, batch 23, loss[loss=0.3922, simple_loss=0.7843, pruned_loss=8.816, over 2449.00 frames.], tot_loss[loss=0.4495, simple_loss=0.899, pruned_loss=8.913, over 55910.83 frames.], batch size: 31, lr: 3.00e-03
2022-04-28 19:01:54,885 INFO [train.py:757] Epoch 0, batch 24, loss[loss=0.4274, simple_loss=0.8548, pruned_loss=9.072, over 2404.00 frames.], tot_loss[loss=0.4486, simple_loss=0.8972, pruned_loss=8.92, over 58035.27 frames.], batch size: 25, lr: 3.00e-03
2022-04-28 19:01:55,466 INFO [train.py:757] Epoch 0, batch 25, loss[loss=0.4462, simple_loss=0.8924, pruned_loss=9.121, over 2478.00 frames.], tot_loss[loss=0.4485, simple_loss=0.897, pruned_loss=8.928, over 60223.10 frames.], batch size: 59, lr: 3.00e-03
2022-04-28 19:01:56,020 INFO [train.py:757] Epoch 0, batch 26, loss[loss=0.4634, simple_loss=0.9268, pruned_loss=9.186, over 2478.00 frames.], tot_loss[loss=0.4491, simple_loss=0.8982, pruned_loss=8.938, over 62399.98 frames.], batch size: 42, lr: 3.00e-03
2022-04-28 19:01:56,577 INFO [train.py:757] Epoch 0, batch 27, loss[loss=0.4223, simple_loss=0.8447, pruned_loss=8.855, over 2479.00 frames.], tot_loss[loss=0.4481, simple_loss=0.8961, pruned_loss=8.935, over 64566.98 frames.], batch size: 67, lr: 3.00e-03
2022-04-28 19:01:57,145 INFO [train.py:757] Epoch 0, batch 28, loss[loss=0.4133, simple_loss=0.8266, pruned_loss=8.809, over 2482.00 frames.], tot_loss[loss=0.4468, simple_loss=0.8935, pruned_loss=8.93, over 66726.15 frames.], batch size: 73, lr: 3.00e-03
2022-04-28 19:01:57,706 INFO [train.py:757] Epoch 0, batch 29, loss[loss=0.4685, simple_loss=0.937, pruned_loss=9.057, over 2442.00 frames.], tot_loss[loss=0.4475, simple_loss=0.8951, pruned_loss=8.935, over 68834.52 frames.], batch size: 37, lr: 3.00e-03
2022-04-28 19:01:58,267 INFO [train.py:757] Epoch 0, batch 30, loss[loss=0.4155, simple_loss=0.8311, pruned_loss=8.896, over 2480.00 frames.], tot_loss[loss=0.4464, simple_loss=0.8928, pruned_loss=8.933, over 70970.34 frames.], batch size: 31, lr: 3.00e-03
2022-04-28 19:01:58,802 INFO [train.py:757] Epoch 0, batch 31, loss[loss=0.4737, simple_loss=0.9474, pruned_loss=8.863, over 2460.00 frames.], tot_loss[loss=0.4473, simple_loss=0.8947, pruned_loss=8.931, over 73075.49 frames.], batch size: 60, lr: 3.00e-03
2022-04-28 19:01:59,329 INFO [train.py:757] Epoch 0, batch 32, loss[loss=0.4207, simple_loss=0.8413, pruned_loss=8.837, over 2484.00 frames.], tot_loss[loss=0.4464, simple_loss=0.8929, pruned_loss=8.928, over 75194.11 frames.], batch size: 36, lr: 3.00e-03
2022-04-28 19:01:59,861 INFO [train.py:757] Epoch 0, batch 33, loss[loss=0.3861, simple_loss=0.7721, pruned_loss=8.577, over 2464.00 frames.], tot_loss[loss=0.4445, simple_loss=0.889, pruned_loss=8.917, over 77282.14 frames.], batch size: 32, lr: 3.00e-03
2022-04-28 19:02:00,428 INFO [train.py:757] Epoch 0, batch 34, loss[loss=0.4377, simple_loss=0.8754, pruned_loss=8.784, over 2496.00 frames.], tot_loss[loss=0.4443, simple_loss=0.8886, pruned_loss=8.913, over 79391.73 frames.], batch size: 52, lr: 3.00e-03
2022-04-28 19:02:00,952 INFO [train.py:757] Epoch 0, batch 35, loss[loss=0.4297, simple_loss=0.8594, pruned_loss=8.654, over 2479.00 frames.], tot_loss[loss=0.4439, simple_loss=0.8877, pruned_loss=8.905, over 81473.77 frames.], batch size: 37, lr: 3.00e-03
2022-04-28 19:02:01,479 INFO [train.py:757] Epoch 0, batch 36, loss[loss=0.427, simple_loss=0.8541, pruned_loss=8.595, over 2485.00 frames.], tot_loss[loss=0.4434, simple_loss=0.8867, pruned_loss=8.895, over 83551.41 frames.], batch size: 35, lr: 3.00e-03
2022-04-28 19:02:02,001 INFO [train.py:757] Epoch 0, batch 37, loss[loss=0.4691, simple_loss=0.9382, pruned_loss=8.604, over 2464.00 frames.], tot_loss[loss=0.4441, simple_loss=0.8882, pruned_loss=8.887, over 85597.65 frames.], batch size: 56, lr: 3.00e-03
2022-04-28 19:02:02,524 INFO [train.py:757] Epoch 0, batch 38, loss[loss=0.4511, simple_loss=0.9023, pruned_loss=8.393, over 2480.00 frames.], tot_loss[loss=0.4443, simple_loss=0.8886, pruned_loss=8.873, over 87649.66 frames.], batch size: 40, lr: 3.00e-03
2022-04-28 19:02:03,114 INFO [checkpoint.py:67] Saving checkpoint to pruned_transducer_stateless2/exp-char-S-debug-batch-checkpoint/checkpoint-80.pt
2022-04-28 19:02:06,052 INFO [train.py:757] Epoch 0, batch 39, loss[loss=0.4573, simple_loss=0.9147, pruned_loss=8.574, over 2460.00 frames.], tot_loss[loss=0.4447, simple_loss=0.8893, pruned_loss=8.865, over 89671.41 frames.], batch size: 41, lr: 3.00e-03
2022-04-28 19:02:06,611 INFO [train.py:757] Epoch 0, batch 40, loss[loss=0.5415, simple_loss=1.083, pruned_loss=8.671, over 2478.00 frames.], tot_loss[loss=0.4473, simple_loss=0.8946, pruned_loss=8.86, over 91701.05 frames.], batch size: 42, lr: 3.00e-03
2022-04-28 19:02:07,129 INFO [train.py:757] Epoch 0, batch 41, loss[loss=0.506, simple_loss=1.012, pruned_loss=8.688, over 2448.00 frames.], tot_loss[loss=0.4488, simple_loss=0.8976, pruned_loss=8.855, over 93690.55 frames.], batch size: 34, lr: 3.00e-03
2022-04-28 19:02:07,652 INFO [train.py:757] Epoch 0, batch 42, loss[loss=0.5631, simple_loss=1.126, pruned_loss=8.863, over 2491.00 frames.], tot_loss[loss=0.4518, simple_loss=0.9036, pruned_loss=8.855, over 95713.10 frames.], batch size: 53, lr: 3.00e-03
2022-04-28 19:02:08,217 INFO [train.py:757] Epoch 0, batch 43, loss[loss=0.4355, simple_loss=0.8709, pruned_loss=8.61, over 2490.00 frames.], tot_loss[loss=0.4514, simple_loss=0.9027, pruned_loss=8.849, over 97724.53 frames.], batch size: 83, lr: 3.00e-03
2022-04-28 19:02:08,742 INFO [train.py:757] Epoch 0, batch 44, loss[loss=0.4985, simple_loss=0.9971, pruned_loss=8.948, over 2432.00 frames.], tot_loss[loss=0.4525, simple_loss=0.905, pruned_loss=8.852, over 99667.91 frames.], batch size: 32, lr: 3.00e-03
2022-04-28 19:02:09,283 INFO [train.py:757] Epoch 0, batch 45, loss[loss=0.4393, simple_loss=0.8785, pruned_loss=8.664, over 2494.00 frames.], tot_loss[loss=0.4522, simple_loss=0.9044, pruned_loss=8.847, over 101663.57 frames.], batch size: 86, lr: 3.00e-03
2022-04-28 19:02:09,814 INFO [train.py:757] Epoch 0, batch 46, loss[loss=0.5027, simple_loss=1.005, pruned_loss=8.921, over 2485.00 frames.], tot_loss[loss=0.4534, simple_loss=0.9068, pruned_loss=8.849, over 103640.25 frames.], batch size: 71, lr: 3.00e-03
2022-04-28 19:02:10,353 INFO [train.py:757] Epoch 0, batch 47, loss[loss=0.5192, simple_loss=1.038, pruned_loss=9.001, over 2448.00 frames.], tot_loss[loss=0.4549, simple_loss=0.9099, pruned_loss=8.852, over 105570.05 frames.], batch size: 36, lr: 3.00e-03
2022-04-28 19:02:10,889 INFO [train.py:757] Epoch 0, batch 48, loss[loss=0.4594, simple_loss=0.9188, pruned_loss=8.893, over 2496.00 frames.], tot_loss[loss=0.455, simple_loss=0.9101, pruned_loss=8.853, over 107538.20 frames.], batch size: 78, lr: 3.00e-03
2022-04-28 19:02:11,413 INFO [train.py:757] Epoch 0, batch 49, loss[loss=0.5096, simple_loss=1.019, pruned_loss=8.959, over 2465.00 frames.], tot_loss[loss=0.4563, simple_loss=0.9125, pruned_loss=8.856, over 109465.51 frames.], batch size: 29, lr: 3.00e-03
2022-04-28 19:02:11,943 INFO [train.py:757] Epoch 0, batch 50, loss[loss=0.5682, simple_loss=1.136, pruned_loss=9.176, over 2440.00 frames.], tot_loss[loss=0.4587, simple_loss=0.9174, pruned_loss=8.863, over 111358.18 frames.], batch size: 40, lr: 3.00e-03
2022-04-28 19:02:12,479 INFO [train.py:757] Epoch 0, batch 51, loss[loss=0.4903, simple_loss=0.9807, pruned_loss=9.401, over 2484.00 frames.], tot_loss[loss=0.4594, simple_loss=0.9188, pruned_loss=8.874, over 113285.39 frames.], batch size: 54, lr: 3.00e-03
2022-04-28 19:02:12,995 INFO [train.py:757] Epoch 0, batch 52, loss[loss=0.4944, simple_loss=0.9889, pruned_loss=9.683, over 2430.00 frames.], tot_loss[loss=0.4602, simple_loss=0.9203, pruned_loss=8.891, over 115148.96 frames.], batch size: 30, lr: 3.00e-03
2022-04-28 19:02:13,507 INFO [train.py:757] Epoch 0, batch 53, loss[loss=0.523, simple_loss=1.046, pruned_loss=9.292, over 2448.00 frames.], tot_loss[loss=0.4615, simple_loss=0.9229, pruned_loss=8.9, over 117021.22 frames.], batch size: 48, lr: 3.00e-03
2022-04-28 19:02:14,044 INFO [train.py:757] Epoch 0, batch 54, loss[loss=0.5275, simple_loss=1.055, pruned_loss=9.379, over 2434.00 frames.], tot_loss[loss=0.4628, simple_loss=0.9256, pruned_loss=8.91, over 118870.11 frames.], batch size: 13, lr: 3.00e-03
2022-04-28 19:02:14,561 INFO [train.py:757] Epoch 0, batch 55, loss[loss=0.5001, simple_loss=1, pruned_loss=9.292, over 2480.00 frames.], tot_loss[loss=0.4636, simple_loss=0.9272, pruned_loss=8.917, over 120755.76 frames.], batch size: 62, lr: 3.00e-03
2022-04-28 19:02:15,074 INFO [train.py:757] Epoch 0, batch 56, loss[loss=0.5654, simple_loss=1.131, pruned_loss=9.509, over 2450.00 frames.], tot_loss[loss=0.4656, simple_loss=0.9312, pruned_loss=8.929, over 122601.98 frames.], batch size: 50, lr: 3.00e-03
2022-04-28 19:02:15,586 INFO [train.py:757] Epoch 0, batch 57, loss[loss=0.5121, simple_loss=1.024, pruned_loss=9.404, over 2484.00 frames.], tot_loss[loss=0.4665, simple_loss=0.9331, pruned_loss=8.939, over 124472.97 frames.], batch size: 46, lr: 3.00e-03
2022-04-28 19:02:16,101 INFO [train.py:757] Epoch 0, batch 58, loss[loss=0.5294, simple_loss=1.059, pruned_loss=9.26, over 2475.00 frames.], tot_loss[loss=0.4678, simple_loss=0.9356, pruned_loss=8.945, over 126325.61 frames.], batch size: 45, lr: 3.00e-03
2022-04-28 19:02:16,634 INFO [checkpoint.py:67] Saving checkpoint to pruned_transducer_stateless2/exp-char-S-debug-batch-checkpoint/checkpoint-100.pt
2022-04-28 19:02:19,620 INFO [train.py:757] Epoch 0, batch 59, loss[loss=9.547, simple_loss=0.9542, pruned_loss=9.07, over 2491.00 frames.], tot_loss[loss=0.6442, simple_loss=0.9359, pruned_loss=8.947, over 128184.98 frames.], batch size: 53, lr: 3.00e-03
2022-04-28 19:02:20,150 INFO [train.py:757] Epoch 0, batch 60, loss[loss=0.7267, simple_loss=1.002, pruned_loss=2.259, over 2460.00 frames.], tot_loss[loss=0.6458, simple_loss=0.9372, pruned_loss=8.821, over 130004.06 frames.], batch size: 30, lr: 3.00e-03
2022-04-28 19:02:20,678 INFO [train.py:757] Epoch 0, batch 61, loss[loss=0.6688, simple_loss=1.006, pruned_loss=1.657, over 2482.00 frames.], tot_loss[loss=0.6462, simple_loss=0.9385, pruned_loss=8.686, over 131836.04 frames.], batch size: 34, lr: 3.00e-03
2022-04-28 19:02:21,202 INFO [train.py:757] Epoch 0, batch 62, loss[loss=0.6806, simple_loss=0.9814, pruned_loss=1.899, over 2494.00 frames.], tot_loss[loss=0.6468, simple_loss=0.9393, pruned_loss=8.559, over 133670.85 frames.], batch size: 43, lr: 3.00e-03
2022-04-28 19:02:21,730 INFO [train.py:757] Epoch 0, batch 63, loss[loss=0.5746, simple_loss=0.8276, pruned_loss=1.608, over 2475.00 frames.], tot_loss[loss=0.6455, simple_loss=0.9372, pruned_loss=8.432, over 135477.50 frames.], batch size: 75, lr: 3.00e-03
2022-04-28 19:02:22,258 INFO [train.py:757] Epoch 0, batch 64, loss[loss=0.71, simple_loss=1.032, pruned_loss=1.942, over 2475.00 frames.], tot_loss[loss=0.6467, simple_loss=0.9389, pruned_loss=8.315, over 137275.11 frames.], batch size: 55, lr: 3.00e-03
2022-04-28 19:02:22,784 INFO [train.py:757] Epoch 0, batch 65, loss[loss=0.7158, simple_loss=1.024, pruned_loss=2.04, over 2475.00 frames.], tot_loss[loss=0.6479, simple_loss=0.9404, pruned_loss=8.204, over 139063.74 frames.], batch size: 33, lr: 3.00e-03
2022-04-28 19:02:23,328 INFO [train.py:757] Epoch 0, batch 66, loss[loss=0.6897, simple_loss=0.9923, pruned_loss=1.935, over 2475.00 frames.], tot_loss[loss=0.6486, simple_loss=0.9413, pruned_loss=8.094, over 140843.42 frames.], batch size: 45, lr: 3.00e-03
2022-04-28 19:02:23,438 INFO [checkpoint.py:67] Saving checkpoint to pruned_transducer_stateless2/exp-char-S-debug-batch-checkpoint/epoch-0.pt
2022-04-28 19:02:31,505 INFO [train.py:757] Epoch 1, batch 0, loss[loss=0.5422, simple_loss=0.8021, pruned_loss=1.411, over 2479.00 frames.], tot_loss[loss=0.5422, simple_loss=0.8021, pruned_loss=1.411, over 2479.00 frames.], batch size: 67, lr: 2.98e-03
2022-04-28 19:02:32,035 INFO [train.py:757] Epoch 1, batch 1, loss[loss=0.598, simple_loss=0.8457, pruned_loss=1.751, over 2442.00 frames.], tot_loss[loss=0.5699, simple_loss=0.8238, pruned_loss=1.58, over 4908.60 frames.], batch size: 37, lr: 2.98e-03
2022-04-28 19:02:32,596 INFO [train.py:757] Epoch 1, batch 2, loss[loss=0.5564, simple_loss=0.8402, pruned_loss=1.363, over 2465.00 frames.], tot_loss[loss=0.5654, simple_loss=0.8293, pruned_loss=1.508, over 7349.06 frames.], batch size: 29, lr: 2.98e-03
2022-04-28 19:02:33,107 INFO [train.py:757] Epoch 1, batch 3, loss[loss=0.6463, simple_loss=0.9044, pruned_loss=1.941, over 2460.00 frames.], tot_loss[loss=0.5858, simple_loss=0.8482, pruned_loss=1.617, over 9772.32 frames.], batch size: 30, lr: 2.98e-03
2022-04-28 19:02:33,646 INFO [train.py:757] Epoch 1, batch 4, loss[loss=0.4958, simple_loss=0.6516, pruned_loss=1.7, over 2449.00 frames.], tot_loss[loss=0.5677, simple_loss=0.8087, pruned_loss=1.633, over 12172.46 frames.], batch size: 31, lr: 2.98e-03
2022-04-28 19:02:34,195 INFO [train.py:757] Epoch 1, batch 5, loss[loss=0.5326, simple_loss=0.7646, pruned_loss=1.503, over 2475.00 frames.], tot_loss[loss=0.5617, simple_loss=0.8012, pruned_loss=1.611, over 14586.59 frames.], batch size: 75, lr: 2.98e-03
2022-04-28 19:02:34,723 INFO [train.py:757] Epoch 1, batch 6, loss[loss=0.6192, simple_loss=0.8415, pruned_loss=1.984, over 2491.00 frames.], tot_loss[loss=0.5701, simple_loss=0.8071, pruned_loss=1.666, over 17004.66 frames.], batch size: 53, lr: 2.98e-03
2022-04-28 19:02:35,249 INFO [train.py:757] Epoch 1, batch 7, loss[loss=0.5565, simple_loss=0.7731, pruned_loss=1.699, over 2470.00 frames.], tot_loss[loss=0.5684, simple_loss=0.8028, pruned_loss=1.67, over 19389.64 frames.], batch size: 65, lr: 2.98e-03
2022-04-28 19:02:35,759 INFO [train.py:757] Epoch 1, batch 8, loss[loss=0.5227, simple_loss=0.7147, pruned_loss=1.654, over 2457.00 frames.], tot_loss[loss=0.5632, simple_loss=0.7928, pruned_loss=1.668, over 21749.69 frames.], batch size: 39, lr: 2.98e-03
2022-04-28 19:02:36,315 INFO [train.py:757] Epoch 1, batch 9, loss[loss=0.4584, simple_loss=0.6226, pruned_loss=1.471, over 2436.00 frames.], tot_loss[loss=0.5526, simple_loss=0.7756, pruned_loss=1.648, over 24076.94 frames.], batch size: 29, lr: 2.98e-03
2022-04-28 19:02:36,860 INFO [train.py:757] Epoch 1, batch 10, loss[loss=0.5217, simple_loss=0.7574, pruned_loss=1.43, over 2460.00 frames.], tot_loss[loss=0.5498, simple_loss=0.7739, pruned_loss=1.628, over 26416.56 frames.], batch size: 41, lr: 2.98e-03
2022-04-28 19:02:37,411 INFO [train.py:757] Epoch 1, batch 11, loss[loss=0.5145, simple_loss=0.7641, pruned_loss=1.324, over 2496.00 frames.], tot_loss[loss=0.5467, simple_loss=0.773, pruned_loss=1.602, over 28780.47 frames.], batch size: 39, lr: 2.98e-03
2022-04-28 19:02:37,939 INFO [checkpoint.py:67] Saving checkpoint to pruned_transducer_stateless2/exp-char-S-debug-batch-checkpoint/checkpoint-120.pt
2022-04-28 19:02:40,653 INFO [train.py:757] Epoch 1, batch 12, loss[loss=0.4436, simple_loss=0.652, pruned_loss=1.176, over 2464.00 frames.], tot_loss[loss=0.5385, simple_loss=0.7635, pruned_loss=1.568, over 31100.57 frames.], batch size: 32, lr: 2.98e-03
2022-04-28 19:02:41,176 INFO [train.py:757] Epoch 1, batch 13, loss[loss=0.5084, simple_loss=0.7567, pruned_loss=1.301, over 2464.00 frames.], tot_loss[loss=0.5363, simple_loss=0.763, pruned_loss=1.548, over 33409.07 frames.], batch size: 44, lr: 2.98e-03
2022-04-28 19:02:41,729 INFO [train.py:757] Epoch 1, batch 14, loss[loss=0.584, simple_loss=0.9049, pruned_loss=1.315, over 2475.00 frames.], tot_loss[loss=0.5396, simple_loss=0.7728, pruned_loss=1.532, over 35717.02 frames.], batch size: 55, lr: 2.98e-03
2022-04-28 19:02:42,267 INFO [train.py:757] Epoch 1, batch 15, loss[loss=0.5094, simple_loss=0.7771, pruned_loss=1.208, over 2484.00 frames.], tot_loss[loss=0.5376, simple_loss=0.7731, pruned_loss=1.511, over 38022.44 frames.], batch size: 69, lr: 2.98e-03
2022-04-28 19:02:42,789 INFO [train.py:757] Epoch 1, batch 16, loss[loss=0.4621, simple_loss=0.6861, pruned_loss=1.19, over 2450.00 frames.], tot_loss[loss=0.533, simple_loss=0.7678, pruned_loss=1.491, over 40282.32 frames.], batch size: 35, lr: 2.98e-03
2022-04-28 19:02:43,343 INFO [train.py:757] Epoch 1, batch 17, loss[loss=0.4779, simple_loss=0.7472, pruned_loss=1.043, over 2490.00 frames.], tot_loss[loss=0.5298, simple_loss=0.7666, pruned_loss=1.465, over 42570.91 frames.], batch size: 83, lr: 2.98e-03
2022-04-28 19:02:43,855 INFO [train.py:757] Epoch 1, batch 18, loss[loss=0.6089, simple_loss=0.9468, pruned_loss=1.355, over 2440.00 frames.], tot_loss[loss=0.5341, simple_loss=0.7764, pruned_loss=1.459, over 44798.06 frames.], batch size: 40, lr: 2.98e-03
2022-04-28 19:02:44,385 INFO [train.py:757] Epoch 1, batch 19, loss[loss=0.5269, simple_loss=0.8062, pruned_loss=1.237, over 2448.00 frames.], tot_loss[loss=0.5337, simple_loss=0.778, pruned_loss=1.448, over 47022.07 frames.], batch size: 36, lr: 2.98e-03
2022-04-28 19:02:44,910 INFO [train.py:757] Epoch 1, batch 20, loss[loss=0.4733, simple_loss=0.7047, pruned_loss=1.209, over 2405.00 frames.], tot_loss[loss=0.5308, simple_loss=0.7744, pruned_loss=1.436, over 49191.96 frames.], batch size: 25, lr: 2.98e-03
2022-04-28 19:02:45,440 INFO [train.py:757] Epoch 1, batch 21, loss[loss=0.4643, simple_loss=0.7353, pruned_loss=0.9665, over 2494.00 frames.], tot_loss[loss=0.5276, simple_loss=0.7725, pruned_loss=1.413, over 51440.00 frames.], batch size: 86, lr: 2.98e-03
2022-04-28 19:02:45,961 INFO [train.py:757] Epoch 1, batch 22, loss[loss=0.4971, simple_loss=0.7709, pruned_loss=1.116, over 2448.00 frames.], tot_loss[loss=0.5262, simple_loss=0.7724, pruned_loss=1.4, over 53630.80 frames.], batch size: 34, lr: 2.98e-03
2022-04-28 19:02:46,482 INFO [train.py:757] Epoch 1, batch 23, loss[loss=0.4704, simple_loss=0.7225, pruned_loss=1.091, over 2491.00 frames.], tot_loss[loss=0.5237, simple_loss=0.7702, pruned_loss=1.386, over 55853.64 frames.], batch size: 47, lr: 2.98e-03
2022-04-28 19:02:46,999 INFO [train.py:757] Epoch 1, batch 24, loss[loss=0.4702, simple_loss=0.7142, pruned_loss=1.131, over 2450.00 frames.], tot_loss[loss=0.5214, simple_loss=0.7678, pruned_loss=1.375, over 58024.38 frames.], batch size: 49, lr: 2.98e-03
2022-04-28 19:02:47,537 INFO [train.py:757] Epoch 1, batch 25, loss[loss=0.5503, simple_loss=0.8434, pruned_loss=1.286, over 2434.00 frames.], tot_loss[loss=0.5226, simple_loss=0.7709, pruned_loss=1.372, over 60168.25 frames.], batch size: 13, lr: 2.98e-03
2022-04-28 19:02:48,058 INFO [train.py:757] Epoch 1, batch 26, loss[loss=0.5376, simple_loss=0.842, pruned_loss=1.166, over 2460.00 frames.], tot_loss[loss=0.5232, simple_loss=0.7737, pruned_loss=1.363, over 62327.41 frames.], batch size: 60, lr: 2.98e-03
2022-04-28 19:02:48,609 INFO [train.py:757] Epoch 1, batch 27, loss[loss=0.5115, simple_loss=0.7685, pruned_loss=1.273, over 2478.00 frames.], tot_loss[loss=0.5227, simple_loss=0.7735, pruned_loss=1.36, over 64493.78 frames.], batch size: 42, lr: 2.98e-03
2022-04-28 19:02:49,140 INFO [train.py:757] Epoch 1, batch 28, loss[loss=0.5112, simple_loss=0.8095, pruned_loss=1.064, over 2496.00 frames.], tot_loss[loss=0.5223, simple_loss=0.7748, pruned_loss=1.349, over 66667.31 frames.], batch size: 78, lr: 2.98e-03
2022-04-28 19:02:49,687 INFO [train.py:757] Epoch 1, batch 29, loss[loss=0.4827, simple_loss=0.754, pruned_loss=1.057, over 2482.00 frames.], tot_loss[loss=0.5209, simple_loss=0.7741, pruned_loss=1.338, over 68815.97 frames.], batch size: 73, lr: 2.98e-03
2022-04-28 19:02:50,225 INFO [train.py:757] Epoch 1, batch 30, loss[loss=0.5203, simple_loss=0.8267, pruned_loss=1.07, over 2485.00 frames.], tot_loss[loss=0.5209, simple_loss=0.7759, pruned_loss=1.329, over 70956.89 frames.], batch size: 71, lr: 2.98e-03
2022-04-28 19:02:50,779 INFO [train.py:757] Epoch 1, batch 31, loss[loss=0.4374, simple_loss=0.6732, pruned_loss=1.008, over 2480.00 frames.], tot_loss[loss=0.518, simple_loss=0.7724, pruned_loss=1.318, over 73082.11 frames.], batch size: 80, lr: 2.98e-03
2022-04-28 19:02:51,296 INFO [checkpoint.py:67] Saving checkpoint to pruned_transducer_stateless2/exp-char-S-debug-batch-checkpoint/checkpoint-140.pt
2022-04-28 19:02:54,074 INFO [train.py:757] Epoch 1, batch 32, loss[loss=0.525, simple_loss=0.8292, pruned_loss=1.104, over 2475.00 frames.], tot_loss[loss=0.5183, simple_loss=0.7743, pruned_loss=1.311, over 75191.70 frames.], batch size: 33, lr: 2.98e-03
2022-04-28 19:02:54,628 INFO [train.py:757] Epoch 1, batch 33, loss[loss=0.433, simple_loss=0.6594, pruned_loss=1.033, over 2485.00 frames.], tot_loss[loss=0.5155, simple_loss=0.7706, pruned_loss=1.302, over 77300.74 frames.], batch size: 35, lr: 2.98e-03
2022-04-28 19:02:55,162 INFO [train.py:757] Epoch 1, batch 34, loss[loss=0.5497, simple_loss=0.8652, pruned_loss=1.171, over 2491.00 frames.], tot_loss[loss=0.5166, simple_loss=0.7736, pruned_loss=1.298, over 79405.23 frames.], batch size: 53, lr: 2.98e-03
2022-04-28 19:02:55,679 INFO [train.py:757] Epoch 1, batch 35, loss[loss=0.5064, simple_loss=0.7786, pruned_loss=1.172, over 2451.00 frames.], tot_loss[loss=0.5163, simple_loss=0.7737, pruned_loss=1.294, over 81459.21 frames.], batch size: 43, lr: 2.98e-03
2022-04-28 19:02:56,200 INFO [train.py:757] Epoch 1, batch 36, loss[loss=0.418, simple_loss=0.6208, pruned_loss=1.076, over 2480.00 frames.], tot_loss[loss=0.5134, simple_loss=0.7692, pruned_loss=1.288, over 83531.91 frames.], batch size: 31, lr: 2.98e-03
2022-04-28 19:02:56,719 INFO [train.py:757] Epoch 1, batch 37, loss[loss=0.4843, simple_loss=0.7611, pruned_loss=1.038, over 2496.00 frames.], tot_loss[loss=0.5125, simple_loss=0.7689, pruned_loss=1.28, over 85610.25 frames.], batch size: 64, lr: 2.98e-03
2022-04-28 19:02:57,246 INFO [train.py:757] Epoch 1, batch 38, loss[loss=0.4508, simple_loss=0.6906, pruned_loss=1.055, over 2479.00 frames.], tot_loss[loss=0.5108, simple_loss=0.7667, pruned_loss=1.274, over 87661.20 frames.], batch size: 37, lr: 2.98e-03
2022-04-28 19:02:57,758 INFO [train.py:757] Epoch 1, batch 39, loss[loss=0.4826, simple_loss=0.7311, pruned_loss=1.17, over 2450.00 frames.], tot_loss[loss=0.51, simple_loss=0.7658, pruned_loss=1.271, over 89672.89 frames.], batch size: 49, lr: 2.98e-03
2022-04-28 19:02:58,303 INFO [train.py:757] Epoch 1, batch 40, loss[loss=0.4558, simple_loss=0.705, pruned_loss=1.033, over 2470.00 frames.], tot_loss[loss=0.5085, simple_loss=0.7641, pruned_loss=1.265, over 91694.53 frames.], batch size: 38, lr: 2.98e-03
2022-04-28 19:02:58,844 INFO [train.py:757] Epoch 1, batch 41, loss[loss=0.5322, simple_loss=0.8432, pruned_loss=1.106, over 2482.00 frames.], tot_loss[loss=0.5092, simple_loss=0.7662, pruned_loss=1.261, over 93718.06 frames.], batch size: 34, lr: 2.98e-03
2022-04-28 19:02:59,400 INFO [train.py:757] Epoch 1, batch 42, loss[loss=0.4679, simple_loss=0.7189, pruned_loss=1.084, over 2418.00 frames.], tot_loss[loss=0.5081, simple_loss=0.765, pruned_loss=1.256, over 95667.47 frames.], batch size: 31, lr: 2.98e-03
2022-04-28 19:02:59,921 INFO [train.py:757] Epoch 1, batch 43, loss[loss=0.5645, simple_loss=0.9031, pruned_loss=1.13, over 2448.00 frames.], tot_loss[loss=0.5095, simple_loss=0.7685, pruned_loss=1.253, over 97637.13 frames.], batch size: 48, lr: 2.98e-03
2022-04-28 19:03:00,433 INFO [train.py:757] Epoch 1, batch 44, loss[loss=0.5084, simple_loss=0.7851, pruned_loss=1.158, over 2478.00 frames.], tot_loss[loss=0.5095, simple_loss=0.7689, pruned_loss=1.251, over 99626.94 frames.], batch size: 42, lr: 2.98e-03
2022-04-28 19:03:00,956 INFO [train.py:757] Epoch 1, batch 45, loss[loss=0.528, simple_loss=0.8236, pruned_loss=1.162, over 2496.00 frames.], tot_loss[loss=0.51, simple_loss=0.7702, pruned_loss=1.248, over 101624.81 frames.], batch size: 48, lr: 2.98e-03
2022-04-28 19:03:01,530 INFO [train.py:757] Epoch 1, batch 46, loss[loss=0.5577, simple_loss=0.8826, pruned_loss=1.164, over 2484.00 frames.], tot_loss[loss=0.5111, simple_loss=0.7729, pruned_loss=1.246, over 103600.69 frames.], batch size: 46, lr: 2.98e-03
2022-04-28 19:03:02,059 INFO [train.py:757] Epoch 1, batch 47, loss[loss=0.4114, simple_loss=0.6048, pruned_loss=1.09, over 2490.00 frames.], tot_loss[loss=0.5088, simple_loss=0.769, pruned_loss=1.243, over 105572.68 frames.], batch size: 30, lr: 2.98e-03
2022-04-28 19:03:02,589 INFO [train.py:757] Epoch 1, batch 48, loss[loss=0.4516, simple_loss=0.6946, pruned_loss=1.043, over 2480.00 frames.], tot_loss[loss=0.5074, simple_loss=0.7673, pruned_loss=1.238, over 107524.82 frames.], batch size: 40, lr: 2.98e-03
2022-04-28 19:03:03,117 INFO [train.py:757] Epoch 1, batch 49, loss[loss=0.394, simple_loss=0.575, pruned_loss=1.065, over 2403.00 frames.], tot_loss[loss=0.5049, simple_loss=0.763, pruned_loss=1.234, over 109390.19 frames.], batch size: 15, lr: 2.98e-03
2022-04-28 19:03:03,648 INFO [train.py:757] Epoch 1, batch 50, loss[loss=0.5132, simple_loss=0.8203, pruned_loss=1.03, over 2480.00 frames.], tot_loss[loss=0.5051, simple_loss=0.7643, pruned_loss=1.23, over 111323.24 frames.], batch size: 62, lr: 2.98e-03
2022-04-28 19:03:04,210 INFO [train.py:757] Epoch 1, batch 51, loss[loss=0.5458, simple_loss=0.8581, pruned_loss=1.168, over 2475.00 frames.], tot_loss[loss=0.506, simple_loss=0.7664, pruned_loss=1.228, over 113241.63 frames.], batch size: 45, lr: 2.98e-03
2022-04-28 19:03:04,727 INFO [checkpoint.py:67] Saving checkpoint to pruned_transducer_stateless2/exp-char-S-debug-batch-checkpoint/checkpoint-160.pt
2022-04-28 19:03:07,480 INFO [train.py:757] Epoch 1, batch 52, loss[loss=0.4643, simple_loss=0.7155, pruned_loss=1.065, over 2484.00 frames.], tot_loss[loss=0.5051, simple_loss=0.7653, pruned_loss=1.225, over 115159.42 frames.], batch size: 36, lr: 2.98e-03
2022-04-28 19:03:08,001 INFO [train.py:757] Epoch 1, batch 53, loss[loss=0.4886, simple_loss=0.7644, pruned_loss=1.064, over 2430.00 frames.], tot_loss[loss=0.5048, simple_loss=0.7652, pruned_loss=1.222, over 117013.62 frames.], batch size: 30, lr: 2.98e-03
2022-04-28 19:03:08,532 INFO [train.py:757] Epoch 1, batch 54, loss[loss=0.5295, simple_loss=0.8247, pruned_loss=1.171, over 2494.00 frames.], tot_loss[loss=0.5053, simple_loss=0.7665, pruned_loss=1.22, over 118922.55 frames.], batch size: 58, lr: 2.98e-03
2022-04-28 19:03:09,100 INFO [train.py:757] Epoch 1, batch 55, loss[loss=0.4524, simple_loss=0.6937, pruned_loss=1.055, over 2484.00 frames.], tot_loss[loss=0.5042, simple_loss=0.765, pruned_loss=1.217, over 120811.94 frames.], batch size: 54, lr: 2.98e-03
2022-04-28 19:03:09,646 INFO [train.py:757] Epoch 1, batch 56, loss[loss=0.4971, simple_loss=0.7641, pruned_loss=1.15, over 2450.00 frames.], tot_loss[loss=0.5041, simple_loss=0.765, pruned_loss=1.216, over 122657.88 frames.], batch size: 50, lr: 2.98e-03
2022-04-28 19:03:10,196 INFO [train.py:757] Epoch 1, batch 57, loss[loss=0.4817, simple_loss=0.7506, pruned_loss=1.064, over 2464.00 frames.], tot_loss[loss=0.5036, simple_loss=0.7647, pruned_loss=1.213, over 124508.59 frames.], batch size: 56, lr: 2.98e-03
2022-04-28 19:03:10,719 INFO [train.py:757] Epoch 1, batch 58, loss[loss=0.562, simple_loss=0.9062, pruned_loss=1.089, over 2494.00 frames.], tot_loss[loss=0.5048, simple_loss=0.7675, pruned_loss=1.21, over 126380.05 frames.], batch size: 43, lr: 2.98e-03
2022-04-28 19:03:11,240 INFO [train.py:757] Epoch 1, batch 59, loss[loss=0.4852, simple_loss=0.7527, pruned_loss=1.088, over 2432.00 frames.], tot_loss[loss=0.5044, simple_loss=0.7672, pruned_loss=1.208, over 128180.15 frames.], batch size: 32, lr: 2.98e-03
2022-04-28 19:03:11,806 INFO [train.py:757] Epoch 1, batch 60, loss[loss=0.4967, simple_loss=0.7763, pruned_loss=1.085, over 2475.00 frames.], tot_loss[loss=0.5043, simple_loss=0.7674, pruned_loss=1.206, over 130014.25 frames.], batch size: 45, lr: 2.98e-03
2022-04-28 19:03:12,363 INFO [train.py:757] Epoch 1, batch 61, loss[loss=0.4237, simple_loss=0.6344, pruned_loss=1.065, over 2448.00 frames.], tot_loss[loss=0.5028, simple_loss=0.7649, pruned_loss=1.203, over 131812.18 frames.], batch size: 22, lr: 2.98e-03
2022-04-28 19:03:12,893 INFO [train.py:757] Epoch 1, batch 62, loss[loss=0.4905, simple_loss=0.7587, pruned_loss=1.112, over 2484.00 frames.], tot_loss[loss=0.5025, simple_loss=0.7648, pruned_loss=1.201, over 133637.12 frames.], batch size: 54, lr: 2.98e-03
2022-04-28 19:03:13,426 INFO [train.py:757] Epoch 1, batch 63, loss[loss=0.5202, simple_loss=0.8047, pruned_loss=1.178, over 2450.00 frames.], tot_loss[loss=0.5028, simple_loss=0.7655, pruned_loss=1.201, over 135418.93 frames.], batch size: 50, lr: 2.98e-03
2022-04-28 19:03:13,932 INFO [train.py:757] Epoch 1, batch 64, loss[loss=0.4962, simple_loss=0.7707, pruned_loss=1.109, over 2448.00 frames.], tot_loss[loss=0.5027, simple_loss=0.7656, pruned_loss=1.199, over 137189.84 frames.], batch size: 51, lr: 2.98e-03
2022-04-28 19:03:14,466 INFO [train.py:757] Epoch 1, batch 65, loss[loss=0.4523, simple_loss=0.7021, pruned_loss=1.013, over 2442.00 frames.], tot_loss[loss=0.5018, simple_loss=0.7645, pruned_loss=1.196, over 138945.89 frames.], batch size: 33, lr: 2.98e-03
2022-04-28 19:03:14,975 INFO [train.py:757] Epoch 1, batch 66, loss[loss=0.5052, simple_loss=0.7925, pruned_loss=1.09, over 2478.00 frames.], tot_loss[loss=0.5019, simple_loss=0.765, pruned_loss=1.194, over 140729.16 frames.], batch size: 59, lr: 2.98e-03
2022-04-28 19:03:15,087 INFO [checkpoint.py:67] Saving checkpoint to pruned_transducer_stateless2/exp-char-S-debug-batch-checkpoint/epoch-1.pt
2022-04-28 19:03:22,265 INFO [train.py:757] Epoch 2, batch 0, loss[loss=0.3459, simple_loss=0.4939, pruned_loss=0.9891, over 2436.00 frames.], tot_loss[loss=0.3459, simple_loss=0.4939, pruned_loss=0.9891, over 2436.00 frames.], batch size: 29, lr: 2.92e-03
2022-04-28 19:03:22,833 INFO [train.py:757] Epoch 2, batch 1, loss[loss=0.3227, simple_loss=0.4379, pruned_loss=1.038, over 2490.00 frames.], tot_loss[loss=0.3341, simple_loss=0.4655, pruned_loss=1.014, over 4913.82 frames.], batch size: 30, lr: 2.92e-03
2022-04-28 19:03:23,367 INFO [train.py:757] Epoch 2, batch 2, loss[loss=0.4188, simple_loss=0.6471, pruned_loss=0.9526, over 2496.00 frames.], tot_loss[loss=0.3627, simple_loss=0.5269, pruned_loss=0.993, over 7385.25 frames.], batch size: 78, lr: 2.92e-03
2022-04-28 19:03:23,915 INFO [train.py:757] Epoch 2, batch 3, loss[loss=0.4409, simple_loss=0.6838, pruned_loss=0.9902, over 2485.00 frames.], tot_loss[loss=0.3825, simple_loss=0.5665, pruned_loss=0.9923, over 9833.32 frames.], batch size: 71, lr: 2.92e-03
2022-04-28 19:03:24,445 INFO [train.py:757] Epoch 2, batch 4, loss[loss=0.4122, simple_loss=0.632, pruned_loss=0.9615, over 2484.00 frames.], tot_loss[loss=0.3885, simple_loss=0.5798, pruned_loss=0.9861, over 12268.16 frames.], batch size: 69, lr: 2.92e-03
2022-04-28 19:03:24,979 INFO [checkpoint.py:67] Saving checkpoint to pruned_transducer_stateless2/exp-char-S-debug-batch-checkpoint/checkpoint-180.pt
2022-04-28 19:03:27,886 INFO [train.py:757] Epoch 2, batch 5, loss[loss=0.4081, simple_loss=0.6114, pruned_loss=1.025, over 2475.00 frames.], tot_loss[loss=0.3918, simple_loss=0.5851, pruned_loss=0.9926, over 14681.82 frames.], batch size: 45, lr: 2.92e-03
2022-04-28 19:03:28,427 INFO [train.py:757] Epoch 2, batch 6, loss[loss=0.342, simple_loss=0.4819, pruned_loss=1.01, over 2448.00 frames.], tot_loss[loss=0.3847, simple_loss=0.5703, pruned_loss=0.9951, over 17056.41 frames.], batch size: 22, lr: 2.92e-03
2022-04-28 19:03:28,971 INFO [train.py:757] Epoch 2, batch 7, loss[loss=0.5647, simple_loss=0.9177, pruned_loss=1.059, over 2478.00 frames.], tot_loss[loss=0.4076, simple_loss=0.6146, pruned_loss=1.003, over 19449.13 frames.], batch size: 59, lr: 2.92e-03
2022-04-28 19:03:29,501 INFO [train.py:757] Epoch 2, batch 8, loss[loss=0.3318, simple_loss=0.4535, pruned_loss=1.05, over 2480.00 frames.], tot_loss[loss=0.399, simple_loss=0.5963, pruned_loss=1.009, over 21831.88 frames.], batch size: 31, lr: 2.92e-03
2022-04-28 19:03:30,052 INFO [train.py:757] Epoch 2, batch 9, loss[loss=0.3004, simple_loss=0.3956, pruned_loss=1.026, over 2403.00 frames.], tot_loss[loss=0.3892, simple_loss=0.5763, pruned_loss=1.01, over 24125.72 frames.], batch size: 15, lr: 2.92e-03
2022-04-28 19:03:30,581 INFO [train.py:757] Epoch 2, batch 10, loss[loss=0.4296, simple_loss=0.6431, pruned_loss=1.081, over 2475.00 frames.], tot_loss[loss=0.393, simple_loss=0.5825, pruned_loss=1.017, over 26480.09 frames.], batch size: 33, lr: 2.92e-03
2022-04-28 19:03:31,142 INFO [train.py:757] Epoch 2, batch 11, loss[loss=0.4754, simple_loss=0.7211, pruned_loss=1.148, over 2496.00 frames.], tot_loss[loss=0.4001, simple_loss=0.5945, pruned_loss=1.028, over 28843.69 frames.], batch size: 48, lr: 2.92e-03
2022-04-28 19:03:31,674 INFO [train.py:757] Epoch 2, batch 12, loss[loss=0.3926, simple_loss=0.5728, pruned_loss=1.062, over 2450.00 frames.], tot_loss[loss=0.3995, simple_loss=0.5928, pruned_loss=1.031, over 31149.47 frames.], batch size: 50, lr: 2.92e-03
2022-04-28 19:03:32,229 INFO [train.py:757] Epoch 2, batch 13, loss[loss=0.4116, simple_loss=0.6194, pruned_loss=1.019, over 2448.00 frames.], tot_loss[loss=0.4004, simple_loss=0.5948, pruned_loss=1.03, over 33441.73 frames.], batch size: 34, lr: 2.92e-03
2022-04-28 19:03:32,765 INFO [train.py:757] Epoch 2, batch 14, loss[loss=0.3863, simple_loss=0.5573, pruned_loss=1.076, over 2484.00 frames.], tot_loss[loss=0.3994, simple_loss=0.5922, pruned_loss=1.033, over 35758.52 frames.], batch size: 54, lr: 2.92e-03
2022-04-28 19:03:33,301 INFO [train.py:757] Epoch 2, batch 15, loss[loss=0.4376, simple_loss=0.6576, pruned_loss=1.088, over 2482.00 frames.], tot_loss[loss=0.4019, simple_loss=0.5964, pruned_loss=1.037, over 38061.72 frames.], batch size: 34, lr: 2.92e-03
2022-04-28 19:03:33,837 INFO [train.py:757] Epoch 2, batch 16, loss[loss=0.3833, simple_loss=0.5743, pruned_loss=0.9611, over 2480.00 frames.], tot_loss[loss=0.4007, simple_loss=0.5951, pruned_loss=1.032, over 40351.42 frames.], batch size: 80, lr: 2.92e-03
2022-04-28 19:03:34,356 INFO [train.py:757] Epoch 2, batch 17, loss[loss=0.3972, simple_loss=0.5904, pruned_loss=1.02, over 2442.00 frames.], tot_loss[loss=0.4005, simple_loss=0.5948, pruned_loss=1.032, over 42591.66 frames.], batch size: 33, lr: 2.92e-03
2022-04-28 19:03:34,896 INFO [train.py:757] Epoch 2, batch 18, loss[loss=0.4188, simple_loss=0.6266, pruned_loss=1.055, over 2465.00 frames.], tot_loss[loss=0.4016, simple_loss=0.5965, pruned_loss=1.033, over 44843.70 frames.], batch size: 29, lr: 2.92e-03
2022-04-28 19:03:35,417 INFO [train.py:757] Epoch 2, batch 19, loss[loss=0.6011, simple_loss=0.9965, pruned_loss=1.029, over 2464.00 frames.], tot_loss[loss=0.412, simple_loss=0.6175, pruned_loss=1.033, over 47083.48 frames.], batch size: 56, lr: 2.92e-03
2022-04-28 19:03:35,974 INFO [train.py:757] Epoch 2, batch 20, loss[loss=0.4272, simple_loss=0.6597, pruned_loss=0.9735, over 2457.00 frames.], tot_loss[loss=0.4128, simple_loss=0.6196, pruned_loss=1.03, over 49305.07 frames.], batch size: 39, lr: 2.92e-03
2022-04-28 19:03:36,561 INFO [train.py:757] Epoch 2, batch 21, loss[loss=0.4563, simple_loss=0.6946, pruned_loss=1.09, over 2460.00 frames.], tot_loss[loss=0.4148, simple_loss=0.6232, pruned_loss=1.033, over 51518.54 frames.], batch size: 30, lr: 2.92e-03
2022-04-28 19:03:37,084 INFO [train.py:757] Epoch 2, batch 22, loss[loss=0.4913, simple_loss=0.7497, pruned_loss=1.164, over 2440.00 frames.], tot_loss[loss=0.4183, simple_loss=0.6289, pruned_loss=1.039, over 53700.95 frames.], batch size: 40, lr: 2.92e-03
2022-04-28 19:03:37,653 INFO [train.py:757] Epoch 2, batch 23, loss[loss=0.3689, simple_loss=0.5341, pruned_loss=1.019, over 2450.00 frames.], tot_loss[loss=0.4161, simple_loss=0.6248, pruned_loss=1.038, over 55882.44 frames.], batch size: 35, lr: 2.92e-03
2022-04-28 19:03:38,191 INFO [train.py:757] Epoch 2, batch 24, loss[loss=0.3865, simple_loss=0.5479, pruned_loss=1.126, over 2478.00 frames.], tot_loss[loss=0.4149, simple_loss=0.6215, pruned_loss=1.041, over 58081.03 frames.], batch size: 42, lr: 2.92e-03
2022-04-28 19:03:38,741 INFO [checkpoint.py:67] Saving checkpoint to pruned_transducer_stateless2/exp-char-S-debug-batch-checkpoint/checkpoint-200.pt
2022-04-28 19:03:41,452 INFO [train.py:757] Epoch 2, batch 25, loss[loss=1.149, simple_loss=0.4761, pruned_loss=0.9109, over 2464.00 frames.], tot_loss[loss=0.4449, simple_loss=0.6155, pruned_loss=1.036, over 60254.62 frames.], batch size: 32, lr: 2.92e-03
2022-04-28 19:03:41,997 INFO [train.py:757] Epoch 2, batch 26, loss[loss=1.462, simple_loss=0.666, pruned_loss=1.129, over 2442.00 frames.], tot_loss[loss=0.4847, simple_loss=0.6175, pruned_loss=1.04, over 62395.35 frames.], batch size: 37, lr: 2.92e-03
2022-04-28 19:03:42,527 INFO [train.py:757] Epoch 2, batch 27, loss[loss=1.312, simple_loss=0.5594, pruned_loss=1.032, over 2432.00 frames.], tot_loss[loss=0.5159, simple_loss=0.6153, pruned_loss=1.039, over 64515.38 frames.], batch size: 32, lr: 2.92e-03
2022-04-28 19:03:43,088 INFO [train.py:757] Epoch 2, batch 28, loss[loss=1.361, simple_loss=0.6637, pruned_loss=1.029, over 2491.00 frames.], tot_loss[loss=0.5475, simple_loss=0.6171, pruned_loss=1.039, over 66683.80 frames.], batch size: 47, lr: 2.92e-03
2022-04-28 19:03:43,620 INFO [train.py:757] Epoch 2, batch 29, loss[loss=1.268, simple_loss=0.6891, pruned_loss=0.9232, over 2475.00 frames.], tot_loss[loss=0.5734, simple_loss=0.6197, pruned_loss=1.035, over 68825.38 frames.], batch size: 75, lr: 2.92e-03
2022-04-28 19:03:44,166 INFO [train.py:757] Epoch 2, batch 30, loss[loss=1.257, simple_loss=0.6531, pruned_loss=0.9304, over 2490.00 frames.], tot_loss[loss=0.5974, simple_loss=0.6209, pruned_loss=1.031, over 70971.25 frames.], batch size: 83, lr: 2.92e-03
2022-04-28 19:03:44,688 INFO [train.py:757] Epoch 2, batch 31, loss[loss=1.43, simple_loss=0.6777, pruned_loss=1.091, over 2496.00 frames.], tot_loss[loss=0.6258, simple_loss=0.6228, pruned_loss=1.033, over 73112.40 frames.], batch size: 39, lr: 2.92e-03
2022-04-28 19:03:45,233 INFO [train.py:757] Epoch 2, batch 32, loss[loss=1.241, simple_loss=0.5176, pruned_loss=0.9825, over 2449.00 frames.], tot_loss[loss=0.6458, simple_loss=0.6194, pruned_loss=1.032, over 75195.83 frames.], batch size: 31, lr: 2.92e-03
2022-04-28 19:03:45,779 INFO [train.py:757] Epoch 2, batch 33, loss[loss=1.465, simple_loss=0.623, pruned_loss=1.153, over 2478.00 frames.], tot_loss[loss=0.6721, simple_loss=0.6195, pruned_loss=1.036, over 77297.85 frames.], batch size: 42, lr: 2.92e-03
2022-04-28 19:03:46,305 INFO [train.py:757] Epoch 2, batch 34, loss[loss=1.363, simple_loss=0.5622, pruned_loss=1.082, over 2402.00 frames.], tot_loss[loss=0.693, simple_loss=0.6178, pruned_loss=1.037, over 79313.37 frames.], batch size: 25, lr: 2.92e-03
2022-04-28 19:03:46,818 INFO [train.py:757] Epoch 2, batch 35, loss[loss=1.525, simple_loss=0.6418, pruned_loss=1.204, over 2450.00 frames.], tot_loss[loss=0.7181, simple_loss=0.6185, pruned_loss=1.042, over 81366.80 frames.], batch size: 50, lr: 2.92e-03
2022-04-28 19:03:47,349 INFO [train.py:757] Epoch 2, batch 36, loss[loss=1.492, simple_loss=0.7639, pruned_loss=1.11, over 2448.00 frames.], tot_loss[loss=0.7408, simple_loss=0.6228, pruned_loss=1.044, over 83407.96 frames.], batch size: 48, lr: 2.92e-03
2022-04-28 19:03:47,910 INFO [train.py:757] Epoch 2, batch 37, loss[loss=1.406, simple_loss=0.5978, pruned_loss=1.107, over 2451.00 frames.], tot_loss[loss=0.7599, simple_loss=0.6221, pruned_loss=1.046, over 85441.92 frames.], batch size: 43, lr: 2.92e-03
2022-04-28 19:03:48,433 INFO [train.py:757] Epoch 2, batch 38, loss[loss=1.403, simple_loss=0.7337, pruned_loss=1.036, over 2480.00 frames.], tot_loss[loss=0.7781, simple_loss=0.6252, pruned_loss=1.045, over 87494.72 frames.], batch size: 62, lr: 2.92e-03
2022-04-28 19:03:48,969 INFO [train.py:757] Epoch 2, batch 39, loss[loss=1.326, simple_loss=0.5734, pruned_loss=1.039, over 2480.00 frames.], tot_loss[loss=0.7933, simple_loss=0.6238, pruned_loss=1.045, over 89537.24 frames.], batch size: 40, lr: 2.92e-03
2022-04-28 19:03:49,506 INFO [train.py:757] Epoch 2, batch 40, loss[loss=1.375, simple_loss=0.6708, pruned_loss=1.04, over 2460.00 frames.], tot_loss[loss=0.8089, simple_loss=0.625, pruned_loss=1.045, over 91549.56 frames.], batch size: 60, lr: 2.92e-03
2022-04-28 19:03:50,032 INFO [train.py:757] Epoch 2, batch 41, loss[loss=1.318, simple_loss=0.5766, pruned_loss=1.03, over 2484.00 frames.], tot_loss[loss=0.8224, simple_loss=0.6238, pruned_loss=1.045, over 93575.81 frames.], batch size: 54, lr: 2.92e-03
2022-04-28 19:03:50,572 INFO [train.py:757] Epoch 2, batch 42, loss[loss=1.435, simple_loss=0.6666, pruned_loss=1.102, over 2491.00 frames.], tot_loss[loss=0.8384, simple_loss=0.6249, pruned_loss=1.046, over 95598.93 frames.], batch size: 53, lr: 2.92e-03
2022-04-28 19:03:51,112 INFO [train.py:757] Epoch 2, batch 43, loss[loss=1.229, simple_loss=0.6203, pruned_loss=0.9187, over 2494.00 frames.], tot_loss[loss=0.8484, simple_loss=0.6248, pruned_loss=1.043, over 97614.93 frames.], batch size: 86, lr: 2.92e-03
2022-04-28 19:03:51,651 INFO [train.py:757] Epoch 2, batch 44, loss[loss=1.517, simple_loss=0.8237, pruned_loss=1.105, over 2475.00 frames.], tot_loss[loss=0.865, simple_loss=0.6297, pruned_loss=1.045, over 99601.86 frames.], batch size: 55, lr: 2.92e-03
2022-04-28 19:03:52,195 INFO [checkpoint.py:67] Saving checkpoint to pruned_transducer_stateless2/exp-char-S-debug-batch-checkpoint/checkpoint-220.pt
2022-04-28 19:03:55,054 INFO [train.py:757] Epoch 2, batch 45, loss[loss=1.464, simple_loss=0.6611, pruned_loss=1.133, over 2464.00 frames.], tot_loss[loss=0.8795, simple_loss=0.6305, pruned_loss=1.047, over 101567.85 frames.], batch size: 44, lr: 2.92e-03
2022-04-28 19:03:55,578 INFO [train.py:757] Epoch 2, batch 46, loss[loss=1.43, simple_loss=0.7144, pruned_loss=1.073, over 2491.00 frames.], tot_loss[loss=0.8927, simple_loss=0.6325, pruned_loss=1.047, over 103551.01 frames.], batch size: 53, lr: 2.92e-03
2022-04-28 19:03:56,149 INFO [train.py:757] Epoch 2, batch 47, loss[loss=1.499, simple_loss=0.707, pruned_loss=1.145, over 2494.00 frames.], tot_loss[loss=0.9071, simple_loss=0.6342, pruned_loss=1.05, over 105527.26 frames.], batch size: 58, lr: 2.92e-03
2022-04-28 19:03:56,695 INFO [train.py:757] Epoch 2, batch 48, loss[loss=1.353, simple_loss=0.5844, pruned_loss=1.06, over 2430.00 frames.], tot_loss[loss=0.9171, simple_loss=0.6331, pruned_loss=1.05, over 107429.62 frames.], batch size: 30, lr: 2.92e-03
2022-04-28 19:03:57,245 INFO [train.py:757] Epoch 2, batch 49, loss[loss=1.499, simple_loss=0.6084, pruned_loss=1.195, over 2434.00 frames.], tot_loss[loss=0.9301, simple_loss=0.6326, pruned_loss=1.053, over 109326.47 frames.], batch size: 13, lr: 2.92e-03
2022-04-28 19:03:57,783 INFO [train.py:757] Epoch 2, batch 50, loss[loss=1.372, simple_loss=0.6486, pruned_loss=1.048, over 2479.00 frames.], tot_loss[loss=0.9399, simple_loss=0.6329, pruned_loss=1.053, over 111258.84 frames.], batch size: 37, lr: 2.92e-03
2022-04-28 19:03:58,313 INFO [train.py:757] Epoch 2, batch 51, loss[loss=1.44, simple_loss=0.752, pruned_loss=1.064, over 2494.00 frames.], tot_loss[loss=0.951, simple_loss=0.6355, pruned_loss=1.053, over 113196.54 frames.], batch size: 43, lr: 2.92e-03
2022-04-28 19:03:58,846 INFO [train.py:757] Epoch 2, batch 52, loss[loss=1.396, simple_loss=0.6077, pruned_loss=1.092, over 2448.00 frames.], tot_loss[loss=0.9604, simple_loss=0.635, pruned_loss=1.054, over 115078.56 frames.], batch size: 36, lr: 2.92e-03
2022-04-28 19:03:59,387 INFO [train.py:757] Epoch 2, batch 53, loss[loss=1.455, simple_loss=0.7198, pruned_loss=1.096, over 2460.00 frames.], tot_loss[loss=0.9708, simple_loss=0.6367, pruned_loss=1.055, over 116963.17 frames.], batch size: 41, lr: 2.92e-03
2022-04-28 19:03:59,926 INFO [train.py:757] Epoch 2, batch 54, loss[loss=1.394, simple_loss=0.7001, pruned_loss=1.044, over 2470.00 frames.], tot_loss[loss=0.9796, simple_loss=0.6381, pruned_loss=1.055, over 118848.35 frames.], batch size: 65, lr: 2.92e-03
2022-04-28 19:04:00,453 INFO [train.py:757] Epoch 2, batch 55, loss[loss=1.308, simple_loss=0.6408, pruned_loss=0.9873, over 2470.00 frames.], tot_loss[loss=0.9863, simple_loss=0.6381, pruned_loss=1.053, over 120724.11 frames.], batch size: 38, lr: 2.92e-03
2022-04-28 19:04:00,984 INFO [train.py:757] Epoch 2, batch 56, loss[loss=1.458, simple_loss=0.7026, pruned_loss=1.107, over 2475.00 frames.], tot_loss[loss=0.9959, simple_loss=0.6394, pruned_loss=1.054, over 122595.49 frames.], batch size: 45, lr: 2.92e-03
2022-04-28 19:04:01,522 INFO [train.py:757] Epoch 2, batch 57, loss[loss=1.331, simple_loss=0.6288, pruned_loss=1.016, over 2496.00 frames.], tot_loss[loss=1.003, simple_loss=0.6392, pruned_loss=1.054, over 124478.51 frames.], batch size: 32, lr: 2.92e-03
2022-04-28 19:04:02,052 INFO [train.py:757] Epoch 2, batch 58, loss[loss=1.448, simple_loss=0.7275, pruned_loss=1.084, over 2484.00 frames.], tot_loss[loss=1.011, simple_loss=0.6409, pruned_loss=1.054, over 126340.12 frames.], batch size: 46, lr: 2.92e-03
2022-04-28 19:04:02,587 INFO [train.py:757] Epoch 2, batch 59, loss[loss=1.22, simple_loss=0.5536, pruned_loss=0.9427, over 2485.00 frames.], tot_loss[loss=1.015, simple_loss=0.6392, pruned_loss=1.052, over 128193.42 frames.], batch size: 35, lr: 2.92e-03
2022-04-28 19:04:03,136 INFO [train.py:757] Epoch 2, batch 60, loss[loss=1.323, simple_loss=0.6086, pruned_loss=1.019, over 2450.00 frames.], tot_loss[loss=1.021, simple_loss=0.6387, pruned_loss=1.051, over 130002.45 frames.], batch size: 49, lr: 2.92e-03
2022-04-28 19:04:03,696 INFO [train.py:757] Epoch 2, batch 61, loss[loss=1.327, simple_loss=0.7028, pruned_loss=0.9756, over 2496.00 frames.], tot_loss[loss=1.027, simple_loss=0.6399, pruned_loss=1.05, over 131848.44 frames.], batch size: 64, lr: 2.92e-03
2022-04-28 19:04:04,229 INFO [train.py:757] Epoch 2, batch 62, loss[loss=1.313, simple_loss=0.66, pruned_loss=0.9829, over 2496.00 frames.], tot_loss[loss=1.032, simple_loss=0.6403, pruned_loss=1.049, over 133685.20 frames.], batch size: 52, lr: 2.92e-03
2022-04-28 19:04:04,759 INFO [train.py:757] Epoch 2, batch 63, loss[loss=1.336, simple_loss=0.6136, pruned_loss=1.029, over 2484.00 frames.], tot_loss[loss=1.038, simple_loss=0.6398, pruned_loss=1.048, over 135500.77 frames.], batch size: 36, lr: 2.92e-03
2022-04-28 19:04:05,275 INFO [train.py:757] Epoch 2, batch 64, loss[loss=1.245, simple_loss=0.6435, pruned_loss=0.9233, over 2482.00 frames.], tot_loss[loss=1.042, simple_loss=0.6398, pruned_loss=1.046, over 137305.27 frames.], batch size: 73, lr: 2.92e-03
2022-04-28 19:04:05,785 INFO [checkpoint.py:67] Saving checkpoint to pruned_transducer_stateless2/exp-char-S-debug-batch-checkpoint/checkpoint-240.pt
2022-04-28 19:04:08,629 INFO [train.py:757] Epoch 2, batch 65, loss[loss=1.292, simple_loss=0.5488, pruned_loss=1.018, over 2450.00 frames.], tot_loss[loss=1.046, simple_loss=0.6382, pruned_loss=1.046, over 139068.74 frames.], batch size: 49, lr: 2.92e-03
2022-04-28 19:04:09,143 INFO [train.py:757] Epoch 2, batch 66, loss[loss=1.241, simple_loss=0.6217, pruned_loss=0.9304, over 2479.00 frames.], tot_loss[loss=1.049, simple_loss=0.6379, pruned_loss=1.044, over 140852.40 frames.], batch size: 67, lr: 2.92e-03
2022-04-28 19:04:09,258 INFO [checkpoint.py:67] Saving checkpoint to pruned_transducer_stateless2/exp-char-S-debug-batch-checkpoint/epoch-2.pt

if sampler_state_dict is not None:
    logging.info("Loading sampler state dict")
    print(sampler_state_dict.keys())
    train_sampler.load_state_dict(sampler_state_dict)
Collaborator

Can you try to move this state dict loading to after creating the dataloader? Then use it like:

train_dl.sampler.load_state_dict(sampler_state_dict)

I think DataLoader might be calling iter() more than once on the sampler, but after it's initialized, it would call iter() just once.

Collaborator Author

I did it, but it seems it doesn't help.

@luomingshuang
Collaborator Author

Em, ok, I changed it. The asr_datamodule.py and train.py that I used are as follows (I moved the sampler state loading to after creating train_dl):
[screenshot of the modified asr_datamodule.py and train.py; image not rendered here]
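
Roughly, the reordered code looks like this; it is only a sketch with placeholder names (train_dataset, train_sampler, sampler_state_dict), not the exact patch:

import logging

from torch.utils.data import DataLoader


def make_train_dataloader(train_dataset, train_sampler, sampler_state_dict=None):
    # Build the DataLoader first ...
    train_dl = DataLoader(
        train_dataset,
        sampler=train_sampler,
        batch_size=None,
        num_workers=2,
    )
    # ... and only then restore the sampler's progress, via
    # train_dl.sampler, as suggested above.
    if sampler_state_dict is not None:
        logging.info("Loading sampler state dict")
        train_dl.sampler.load_state_dict(sampler_state_dict)
    return train_dl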

And the training log is as follows:

(k2-python) luomingshuang@de-74279-k2-train-3-0307202051-57dc848959-8tmmp:~/icefall/egs/wenetspeech/ASR$ ./run_debug_batch_checkpoint_load.sh
2022-04-28 21:16:07,783 INFO [train.py:817] Training started
2022-04-28 21:16:08.019820: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /ceph-sh1/fangjun/software/cuda-10.2.89/lib:/ceph-sh1/fangjun/software/cuda-10.2.89/lib64:
2022-04-28 21:16:08.019865: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2022-04-28 21:16:11,270 INFO [train.py:827] Device: cuda:0
2022-04-28 21:16:12,420 INFO [lexicon.py:176] Loading pre-compiled data/lang_char/Linv.pt
2022-04-28 21:16:12,497 INFO [train.py:846] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 1, 'reset_interval': 200, 'feature_dim': 80, 'subsampling_factor': 4, 'encoder_dim': 512, 'nhead': 8, 'dim_feedforward': 2048, 'num_encoder_layers': 12, 'decoder_dim': 512, 'joiner_dim': 512, 'env_info': {'k2-version': '1.14', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '4fb6b88661cca73e5f66f03df16e5a1d0c4886f8', 'k2-git-date': 'Fri Apr 8 18:29:32 2022', 'lhotse-version': '1.1.0', 'torch-version': '1.11.0', 'torch-cuda-available': True, 'torch-cuda-version': '10.2', 'python-version': '3.8', 'icefall-git-branch': 'wenetspeech-pruned-transducer-stateless2', 'icefall-git-sha1': '4b567e4-dirty', 'icefall-git-date': 'Wed Apr 27 13:43:54 2022', 'icefall-path': '/ceph-meixu/luomingshuang/icefall', 'k2-path': '/ceph-ms/luomingshuang/k2/k2/python/k2/__init__.py', 'lhotse-path': '/ceph-meixu/luomingshuang/anaconda3/envs/k2-python/lib/python3.8/site-packages/lhotse-1.1.0-py3.8.egg/lhotse/__init__.py', 'hostname': 'de-74279-k2-train-3-0307202051-57dc848959-8tmmp', 'IP address': '10.177.24.138'}, 'world_size': 1, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 30, 'start_epoch': 0, 'start_batch': 40, 'exp_dir': PosixPath('pruned_transducer_stateless2/exp-char-S-debug-batch-checkpoint'), 'lang_dir': PosixPath('data/lang_char'), 'token_type': 'char', 'initial_lr': 0.003, 'lr_batches': 5000, 'lr_epochs': 6, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'save_every_n': 20, 'keep_last_k': 20, 'use_fp16': False, 'valid_interval': 100, 'model_warm_step': 100, 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 100, 'bucketing_sampler': True, 'num_buckets': 300, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'lazy_load': True, 'training_subset': 'S', 'blank_id': 0, 'vocab_size': 5537}
2022-04-28 21:16:12,497 INFO [train.py:848] About to create model
2022-04-28 21:16:13,073 INFO [train.py:852] Number of model parameters: 88978927
2022-04-28 21:16:13,074 INFO [checkpoint.py:100] Loading checkpoint from pruned_transducer_stateless2/exp-char-S-debug-batch-checkpoint/checkpoint-40.pt
2022-04-28 21:16:22,099 INFO [train.py:867] Loading optimizer state dict
2022-04-28 21:16:23,041 INFO [train.py:875] Loading scheduler state dict
2022-04-28 21:16:23,041 INFO [asr_datamodule.py:427] About to get train cuts
2022-04-28 21:16:23,041 INFO [asr_datamodule.py:429] use lazy cuts
2022-04-28 21:16:23,042 INFO [asr_datamodule.py:445] About to get dev cuts
2022-04-28 21:16:24,114 INFO [asr_datamodule.py:359] About to create dev dataset
/ceph-meixu/luomingshuang/anaconda3/envs/k2-python/lib/python3.8/site-packages/lhotse-1.1.0-py3.8.egg/lhotse/dataset/sampling/dynamic_bucketing.py:129: UserWarning: You are using DynamicBucketingSampler with an eagerly read CutSet. You won't see any memory/speed benefits with that setup. Either use 'CutSet.from_jsonl_lazy' to read the CutSet lazily, or use a BucketingSampler instead.
2022-04-28 21:16:24,122 INFO [asr_datamodule.py:380] About to create dev dataloader
2022-04-28 21:16:24,123 INFO [asr_datamodule.py:220] About to get Musan cuts
2022-04-28 21:16:26,601 INFO [asr_datamodule.py:227] Enable MUSAN
2022-04-28 21:16:26,602 INFO [asr_datamodule.py:252] Enable SpecAugment
2022-04-28 21:16:26,602 INFO [asr_datamodule.py:253] Time warp factor: 80
2022-04-28 21:16:26,602 INFO [asr_datamodule.py:265] Num frame mask: 10
2022-04-28 21:16:26,602 INFO [asr_datamodule.py:278] About to create train dataset
2022-04-28 21:16:26,602 INFO [asr_datamodule.py:306] Using DynamicBucketingSampler.
2022-04-28 21:16:27,034 INFO [asr_datamodule.py:322] About to create train dataloader
2022-04-28 21:16:27,034 INFO [asr_datamodule.py:344] Loading sampler state dict
{'epoch': 0, 'world_size': 1, 'rank': 0, 'seed': 0, 'shuffle': True, 'diagnostics': {'current_epoch': 0, 'stats_per_epoch': {0: {'epoch': 0, 'kept_cuts': 1998, 'discarded_cuts': 0, 'kept_batches': 44, 'discarded_batches': 0}}}, 'max_duration': 100, 'max_cuts': None, 'drop_last': True, 'consistent_ids': True, 'buffer_size': 30000, 'num_cuts_for_bins_estimate': 10000, 'shuffle_buffer_size': 20000, 'strict': True}
2022-04-28 21:16:28,191 INFO [train.py:938] Loading grad scaler state dict
2022-04-28 21:16:31,929 INFO [train.py:755] Epoch 0, batch 0, loss[loss=0.455, simple_loss=0.91, pruned_loss=8.779, over 2490.00 frames.], tot_loss[loss=0.455, simple_loss=0.91, pruned_loss=8.779, over 2490.00 frames.], batch size: 30, lr: 3.00e-03
2022-04-28 21:16:34,350 INFO [train.py:755] Epoch 0, batch 1, loss[loss=0.4376, simple_loss=0.8752, pruned_loss=8.768, over 2491.00 frames.], tot_loss[loss=0.4463, simple_loss=0.8926, pruned_loss=8.774, over 4968.55 frames.], batch size: 47, lr: 3.00e-03
2022-04-28 21:16:36,917 INFO [train.py:755] Epoch 0, batch 2, loss[loss=0.4416, simple_loss=0.8832, pruned_loss=8.942, over 2496.00 frames.], tot_loss[loss=0.4447, simple_loss=0.8894, pruned_loss=8.83, over 7439.71 frames.], batch size: 32, lr: 3.00e-03
2022-04-28 21:16:39,165 INFO [train.py:755] Epoch 0, batch 3, loss[loss=0.3987, simple_loss=0.7974, pruned_loss=8.835, over 2403.00 frames.], tot_loss[loss=0.4334, simple_loss=0.8669, pruned_loss=8.831, over 9805.51 frames.], batch size: 15, lr: 3.00e-03
2022-04-28 21:16:41,498 INFO [train.py:755] Epoch 0, batch 4, loss[loss=0.46, simple_loss=0.92, pruned_loss=8.908, over 2464.00 frames.], tot_loss[loss=0.4388, simple_loss=0.8776, pruned_loss=8.847, over 12220.48 frames.], batch size: 44, lr: 3.00e-03
2022-04-28 21:16:44,087 INFO [train.py:755] Epoch 0, batch 5, loss[loss=0.4898, simple_loss=0.9796, pruned_loss=8.733, over 2480.00 frames.], tot_loss[loss=0.4474, simple_loss=0.8949, pruned_loss=8.827, over 14639.38 frames.], batch size: 80, lr: 3.00e-03
2022-04-28 21:16:46,404 INFO [train.py:755] Epoch 0, batch 6, loss[loss=0.4639, simple_loss=0.9278, pruned_loss=8.943, over 2484.00 frames.], tot_loss[loss=0.4498, simple_loss=0.8997, pruned_loss=8.844, over 17050.18 frames.], batch size: 54, lr: 3.00e-03
2022-04-28 21:16:48,737 INFO [train.py:755] Epoch 0, batch 7, loss[loss=0.4232, simple_loss=0.8464, pruned_loss=8.795, over 2496.00 frames.], tot_loss[loss=0.4464, simple_loss=0.8928, pruned_loss=8.838, over 19460.93 frames.], batch size: 39, lr: 3.00e-03
2022-04-28 21:16:51,078 INFO [train.py:755] Epoch 0, batch 8, loss[loss=0.4364, simple_loss=0.8727, pruned_loss=8.92, over 2457.00 frames.], tot_loss[loss=0.4453, simple_loss=0.8906, pruned_loss=8.847, over 21820.63 frames.], batch size: 39, lr: 3.00e-03
2022-04-28 21:16:53,580 INFO [train.py:755] Epoch 0, batch 9, loss[loss=0.4808, simple_loss=0.9616, pruned_loss=8.997, over 2451.00 frames.], tot_loss[loss=0.4489, simple_loss=0.8978, pruned_loss=8.862, over 24162.52 frames.], batch size: 43, lr: 3.00e-03
2022-04-28 21:16:56,110 INFO [train.py:755] Epoch 0, batch 10, loss[loss=0.4279, simple_loss=0.8558, pruned_loss=8.93, over 2448.00 frames.], tot_loss[loss=0.447, simple_loss=0.8939, pruned_loss=8.869, over 26489.71 frames.], batch size: 22, lr: 3.00e-03
2022-04-28 21:16:58,763 INFO [train.py:755] Epoch 0, batch 11, loss[loss=0.4915, simple_loss=0.983, pruned_loss=9.041, over 2450.00 frames.], tot_loss[loss=0.4507, simple_loss=0.9015, pruned_loss=8.883, over 28807.26 frames.], batch size: 50, lr: 3.00e-03
2022-04-28 21:17:01,211 INFO [train.py:755] Epoch 0, batch 12, loss[loss=0.4249, simple_loss=0.8497, pruned_loss=8.851, over 2470.00 frames.], tot_loss[loss=0.4487, simple_loss=0.8974, pruned_loss=8.881, over 31133.23 frames.], batch size: 38, lr: 3.00e-03
2022-04-28 21:17:03,486 INFO [train.py:755] Epoch 0, batch 13, loss[loss=0.4676, simple_loss=0.9352, pruned_loss=8.874, over 2496.00 frames.], tot_loss[loss=0.4501, simple_loss=0.9002, pruned_loss=8.88, over 33473.56 frames.], batch size: 48, lr: 3.00e-03
2022-04-28 21:17:05,978 INFO [train.py:755] Epoch 0, batch 14, loss[loss=0.4217, simple_loss=0.8435, pruned_loss=8.807, over 2442.00 frames.], tot_loss[loss=0.4482, simple_loss=0.8963, pruned_loss=8.875, over 35748.19 frames.], batch size: 33, lr: 3.00e-03
2022-04-28 21:17:08,429 INFO [train.py:755] Epoch 0, batch 15, loss[loss=0.4808, simple_loss=0.9617, pruned_loss=8.937, over 2494.00 frames.], tot_loss[loss=0.4503, simple_loss=0.9006, pruned_loss=8.879, over 38063.45 frames.], batch size: 58, lr: 3.00e-03
2022-04-28 21:17:10,921 INFO [train.py:755] Epoch 0, batch 16, loss[loss=0.4133, simple_loss=0.8267, pruned_loss=8.74, over 2450.00 frames.], tot_loss[loss=0.4481, simple_loss=0.8961, pruned_loss=8.871, over 40323.13 frames.], batch size: 35, lr: 3.00e-03
2022-04-28 21:17:13,431 INFO [train.py:755] Epoch 0, batch 17, loss[loss=0.4579, simple_loss=0.9158, pruned_loss=8.859, over 2470.00 frames.], tot_loss[loss=0.4486, simple_loss=0.8972, pruned_loss=8.87, over 42591.52 frames.], batch size: 65, lr: 3.00e-03
2022-04-28 21:17:15,943 INFO [train.py:755] Epoch 0, batch 18, loss[loss=0.4509, simple_loss=0.9017, pruned_loss=8.926, over 2450.00 frames.], tot_loss[loss=0.4487, simple_loss=0.8975, pruned_loss=8.873, over 44828.56 frames.], batch size: 49, lr: 3.00e-03
2022-04-28 21:17:18,561 INFO [checkpoint.py:67] Saving checkpoint to pruned_transducer_stateless2/exp-char-S-debug-batch-checkpoint/checkpoint-60.pt
2022-04-28 21:17:21,387 INFO [train.py:755] Epoch 0, batch 19, loss[loss=0.4387, simple_loss=0.8774, pruned_loss=8.853, over 2496.00 frames.], tot_loss[loss=0.4482, simple_loss=0.8964, pruned_loss=8.872, over 47100.42 frames.], batch size: 64, lr: 3.00e-03
2022-04-28 21:17:23,861 INFO [train.py:755] Epoch 0, batch 20, loss[loss=0.5755, simple_loss=1.151, pruned_loss=9.952, over 2484.00 frames.], tot_loss[loss=0.4546, simple_loss=0.9092, pruned_loss=8.926, over 49348.92 frames.], batch size: 69, lr: 3.00e-03
2022-04-28 21:17:26,328 INFO [train.py:755] Epoch 0, batch 21, loss[loss=0.3958, simple_loss=0.7917, pruned_loss=8.665, over 2436.00 frames.], tot_loss[loss=0.4518, simple_loss=0.9037, pruned_loss=8.914, over 51538.17 frames.], batch size: 29, lr: 3.00e-03
2022-04-28 21:17:28,302 INFO [train.py:755] Epoch 0, batch 22, loss[loss=0.458, simple_loss=0.916, pruned_loss=8.99, over 2450.00 frames.], tot_loss[loss=0.4521, simple_loss=0.9042, pruned_loss=8.918, over 53730.48 frames.], batch size: 49, lr: 3.00e-03
2022-04-28 21:17:30,585 INFO [train.py:755] Epoch 0, batch 23, loss[loss=0.3922, simple_loss=0.7843, pruned_loss=8.816, over 2449.00 frames.], tot_loss[loss=0.4495, simple_loss=0.899, pruned_loss=8.913, over 55910.83 frames.], batch size: 31, lr: 3.00e-03
2022-04-28 21:17:33,242 INFO [train.py:755] Epoch 0, batch 24, loss[loss=0.4274, simple_loss=0.8548, pruned_loss=9.072, over 2404.00 frames.], tot_loss[loss=0.4486, simple_loss=0.8972, pruned_loss=8.92, over 58035.27 frames.], batch size: 25, lr: 3.00e-03
2022-04-28 21:17:35,716 INFO [train.py:755] Epoch 0, batch 25, loss[loss=0.4462, simple_loss=0.8924, pruned_loss=9.121, over 2478.00 frames.], tot_loss[loss=0.4485, simple_loss=0.897, pruned_loss=8.928, over 60223.10 frames.], batch size: 59, lr: 3.00e-03
2022-04-28 21:17:38,331 INFO [train.py:755] Epoch 0, batch 26, loss[loss=0.4634, simple_loss=0.9268, pruned_loss=9.186, over 2478.00 frames.], tot_loss[loss=0.4491, simple_loss=0.8982, pruned_loss=8.938, over 62399.98 frames.], batch size: 42, lr: 3.00e-03
2022-04-28 21:17:40,724 INFO [train.py:755] Epoch 0, batch 27, loss[loss=0.4223, simple_loss=0.8447, pruned_loss=8.855, over 2479.00 frames.], tot_loss[loss=0.4481, simple_loss=0.8961, pruned_loss=8.935, over 64566.98 frames.], batch size: 67, lr: 3.00e-03
2022-04-28 21:17:43,237 INFO [train.py:755] Epoch 0, batch 28, loss[loss=0.4133, simple_loss=0.8266, pruned_loss=8.809, over 2482.00 frames.], tot_loss[loss=0.4468, simple_loss=0.8935, pruned_loss=8.93, over 66726.15 frames.], batch size: 73, lr: 3.00e-03
2022-04-28 21:17:45,738 INFO [train.py:755] Epoch 0, batch 29, loss[loss=0.4685, simple_loss=0.937, pruned_loss=9.057, over 2442.00 frames.], tot_loss[loss=0.4475, simple_loss=0.8951, pruned_loss=8.935, over 68834.52 frames.], batch size: 37, lr: 3.00e-03
2022-04-28 21:17:48,405 INFO [train.py:755] Epoch 0, batch 30, loss[loss=0.4156, simple_loss=0.8311, pruned_loss=8.896, over 2480.00 frames.], tot_loss[loss=0.4464, simple_loss=0.8928, pruned_loss=8.933, over 70970.34 frames.], batch size: 31, lr: 3.00e-03
2022-04-28 21:17:51,034 INFO [train.py:755] Epoch 0, batch 31, loss[loss=0.4737, simple_loss=0.9474, pruned_loss=8.863, over 2460.00 frames.], tot_loss[loss=0.4473, simple_loss=0.8947, pruned_loss=8.931, over 73075.49 frames.], batch size: 60, lr: 3.00e-03
2022-04-28 21:17:53,532 INFO [train.py:755] Epoch 0, batch 32, loss[loss=0.4207, simple_loss=0.8413, pruned_loss=8.837, over 2484.00 frames.], tot_loss[loss=0.4464, simple_loss=0.8929, pruned_loss=8.928, over 75194.11 frames.], batch size: 36, lr: 3.00e-03
2022-04-28 21:17:56,127 INFO [train.py:755] Epoch 0, batch 33, loss[loss=0.3861, simple_loss=0.7721, pruned_loss=8.577, over 2464.00 frames.], tot_loss[loss=0.4445, simple_loss=0.889, pruned_loss=8.917, over 77282.14 frames.], batch size: 32, lr: 3.00e-03
2022-04-28 21:17:58,544 INFO [train.py:755] Epoch 0, batch 34, loss[loss=0.4377, simple_loss=0.8754, pruned_loss=8.784, over 2496.00 frames.], tot_loss[loss=0.4443, simple_loss=0.8886, pruned_loss=8.913, over 79391.73 frames.], batch size: 52, lr: 3.00e-03
2022-04-28 21:18:01,140 INFO [train.py:755] Epoch 0, batch 35, loss[loss=0.4297, simple_loss=0.8594, pruned_loss=8.654, over 2479.00 frames.], tot_loss[loss=0.4439, simple_loss=0.8877, pruned_loss=8.905, over 81473.77 frames.], batch size: 37, lr: 3.00e-03
2022-04-28 21:18:03,735 INFO [train.py:755] Epoch 0, batch 36, loss[loss=0.427, simple_loss=0.8541, pruned_loss=8.595, over 2485.00 frames.], tot_loss[loss=0.4434, simple_loss=0.8867, pruned_loss=8.896, over 83551.41 frames.], batch size: 35, lr: 3.00e-03
2022-04-28 21:18:06,278 INFO [train.py:755] Epoch 0, batch 37, loss[loss=0.4691, simple_loss=0.9382, pruned_loss=8.604, over 2464.00 frames.], tot_loss[loss=0.4441, simple_loss=0.8882, pruned_loss=8.887, over 85597.65 frames.], batch size: 56, lr: 3.00e-03
2022-04-28 21:18:08,746 INFO [train.py:755] Epoch 0, batch 38, loss[loss=0.4511, simple_loss=0.9023, pruned_loss=8.393, over 2480.00 frames.], tot_loss[loss=0.4443, simple_loss=0.8886, pruned_loss=8.873, over 87649.66 frames.], batch size: 40, lr: 3.00e-03
2022-04-28 21:18:11,342 INFO [checkpoint.py:67] Saving checkpoint to pruned_transducer_stateless2/exp-char-S-debug-batch-checkpoint/checkpoint-80.pt
2022-04-28 21:18:13,960 INFO [train.py:755] Epoch 0, batch 39, loss[loss=0.4573, simple_loss=0.9147, pruned_loss=8.574, over 2460.00 frames.], tot_loss[loss=0.4447, simple_loss=0.8893, pruned_loss=8.865, over 89671.41 frames.], batch size: 41, lr: 3.00e-03
2022-04-28 21:18:16,339 INFO [train.py:755] Epoch 0, batch 40, loss[loss=0.5415, simple_loss=1.083, pruned_loss=8.671, over 2478.00 frames.], tot_loss[loss=0.4473, simple_loss=0.8946, pruned_loss=8.86, over 91701.05 frames.], batch size: 42, lr: 3.00e-03
2022-04-28 21:18:18,821 INFO [train.py:755] Epoch 0, batch 41, loss[loss=0.506, simple_loss=1.012, pruned_loss=8.688, over 2448.00 frames.], tot_loss[loss=0.4488, simple_loss=0.8976, pruned_loss=8.855, over 93690.55 frames.], batch size: 34, lr: 3.00e-03
2022-04-28 21:18:21,318 INFO [train.py:755] Epoch 0, batch 42, loss[loss=0.5631, simple_loss=1.126, pruned_loss=8.863, over 2491.00 frames.], tot_loss[loss=0.4518, simple_loss=0.9036, pruned_loss=8.855, over 95713.10 frames.], batch size: 53, lr: 3.00e-03
2022-04-28 21:18:23,755 INFO [train.py:755] Epoch 0, batch 43, loss[loss=0.4355, simple_loss=0.8709, pruned_loss=8.61, over 2490.00 frames.], tot_loss[loss=0.4514, simple_loss=0.9027, pruned_loss=8.849, over 97724.53 frames.], batch size: 83, lr: 3.00e-03
2022-04-28 21:18:26,292 INFO [train.py:755] Epoch 0, batch 44, loss[loss=0.4985, simple_loss=0.9971, pruned_loss=8.947, over 2432.00 frames.], tot_loss[loss=0.4525, simple_loss=0.905, pruned_loss=8.852, over 99667.91 frames.], batch size: 32, lr: 3.00e-03
2022-04-28 21:18:28,870 INFO [train.py:755] Epoch 0, batch 45, loss[loss=0.4393, simple_loss=0.8785, pruned_loss=8.664, over 2494.00 frames.], tot_loss[loss=0.4522, simple_loss=0.9044, pruned_loss=8.847, over 101663.57 frames.], batch size: 86, lr: 3.00e-03
2022-04-28 21:18:31,171 INFO [train.py:755] Epoch 0, batch 46, loss[loss=0.5027, simple_loss=1.005, pruned_loss=8.921, over 2485.00 frames.], tot_loss[loss=0.4534, simple_loss=0.9068, pruned_loss=8.849, over 103640.25 frames.], batch size: 71, lr: 3.00e-03
2022-04-28 21:18:33,722 INFO [train.py:755] Epoch 0, batch 47, loss[loss=0.5192, simple_loss=1.038, pruned_loss=9.001, over 2448.00 frames.], tot_loss[loss=0.4549, simple_loss=0.9099, pruned_loss=8.852, over 105570.05 frames.], batch size: 36, lr: 3.00e-03
2022-04-28 21:18:36,334 INFO [train.py:755] Epoch 0, batch 48, loss[loss=0.4594, simple_loss=0.9188, pruned_loss=8.893, over 2496.00 frames.], tot_loss[loss=0.455, simple_loss=0.9101, pruned_loss=8.853, over 107538.20 frames.], batch size: 78, lr: 3.00e-03
2022-04-28 21:18:38,862 INFO [train.py:755] Epoch 0, batch 49, loss[loss=0.5096, simple_loss=1.019, pruned_loss=8.959, over 2465.00 frames.], tot_loss[loss=0.4563, simple_loss=0.9125, pruned_loss=8.856, over 109465.51 frames.], batch size: 29, lr: 3.00e-03
2022-04-28 21:18:41,451 INFO [train.py:755] Epoch 0, batch 50, loss[loss=0.5682, simple_loss=1.136, pruned_loss=9.176, over 2440.00 frames.], tot_loss[loss=0.4587, simple_loss=0.9174, pruned_loss=8.863, over 111358.18 frames.], batch size: 40, lr: 3.00e-03
2022-04-28 21:18:44,037 INFO [train.py:755] Epoch 0, batch 51, loss[loss=0.4903, simple_loss=0.9807, pruned_loss=9.401, over 2484.00 frames.], tot_loss[loss=0.4594, simple_loss=0.9188, pruned_loss=8.874, over 113285.39 frames.], batch size: 54, lr: 3.00e-03
2022-04-28 21:18:46,479 INFO [train.py:755] Epoch 0, batch 52, loss[loss=0.4944, simple_loss=0.9889, pruned_loss=9.683, over 2430.00 frames.], tot_loss[loss=0.4602, simple_loss=0.9203, pruned_loss=8.891, over 115148.96 frames.], batch size: 30, lr: 3.00e-03
2022-04-28 21:18:49,054 INFO [train.py:755] Epoch 0, batch 53, loss[loss=0.523, simple_loss=1.046, pruned_loss=9.292, over 2448.00 frames.], tot_loss[loss=0.4615, simple_loss=0.9229, pruned_loss=8.9, over 117021.22 frames.], batch size: 48, lr: 3.00e-03
2022-04-28 21:18:50,766 INFO [train.py:755] Epoch 0, batch 54, loss[loss=0.5275, simple_loss=1.055, pruned_loss=9.379, over 2434.00 frames.], tot_loss[loss=0.4628, simple_loss=0.9256, pruned_loss=8.91, over 118870.11 frames.], batch size: 13, lr: 3.00e-03
2022-04-28 21:18:51,312 INFO [train.py:755] Epoch 0, batch 55, loss[loss=0.5001, simple_loss=1, pruned_loss=9.292, over 2480.00 frames.], tot_loss[loss=0.4636, simple_loss=0.9272, pruned_loss=8.918, over 120755.76 frames.], batch size: 62, lr: 3.00e-03
2022-04-28 21:18:53,609 INFO [train.py:755] Epoch 0, batch 56, loss[loss=0.5654, simple_loss=1.131, pruned_loss=9.509, over 2450.00 frames.], tot_loss[loss=0.4656, simple_loss=0.9312, pruned_loss=8.929, over 122601.98 frames.], batch size: 50, lr: 3.00e-03
2022-04-28 21:18:56,069 INFO [train.py:755] Epoch 0, batch 57, loss[loss=0.5121, simple_loss=1.024, pruned_loss=9.404, over 2484.00 frames.], tot_loss[loss=0.4665, simple_loss=0.9331, pruned_loss=8.939, over 124472.97 frames.], batch size: 46, lr: 3.00e-03
2022-04-28 21:18:58,533 INFO [train.py:755] Epoch 0, batch 58, loss[loss=0.5294, simple_loss=1.059, pruned_loss=9.26, over 2475.00 frames.], tot_loss[loss=0.4678, simple_loss=0.9356, pruned_loss=8.945, over 126325.61 frames.], batch size: 45, lr: 3.00e-03
2022-04-28 21:19:01,027 INFO [checkpoint.py:67] Saving checkpoint to pruned_transducer_stateless2/exp-char-S-debug-batch-checkpoint/checkpoint-100.pt
2022-04-28 21:19:03,826 INFO [train.py:755] Epoch 0, batch 59, loss[loss=9.546, simple_loss=0.9543, pruned_loss=9.069, over 2491.00 frames.], tot_loss[loss=0.6442, simple_loss=0.9359, pruned_loss=8.948, over 128184.98 frames.], batch size: 53, lr: 3.00e-03
2022-04-28 21:19:06,322 INFO [train.py:755] Epoch 0, batch 60, loss[loss=0.7267, simple_loss=1.002, pruned_loss=2.259, over 2460.00 frames.], tot_loss[loss=0.6458, simple_loss=0.9372, pruned_loss=8.821, over 130004.06 frames.], batch size: 30, lr: 3.00e-03
2022-04-28 21:19:08,927 INFO [train.py:755] Epoch 0, batch 61, loss[loss=0.6688, simple_loss=1.006, pruned_loss=1.657, over 2482.00 frames.], tot_loss[loss=0.6462, simple_loss=0.9385, pruned_loss=8.686, over 131836.04 frames.], batch size: 34, lr: 3.00e-03
2022-04-28 21:19:11,273 INFO [train.py:755] Epoch 0, batch 62, loss[loss=0.6806, simple_loss=0.9814, pruned_loss=1.899, over 2494.00 frames.], tot_loss[loss=0.6468, simple_loss=0.9393, pruned_loss=8.559, over 133670.85 frames.], batch size: 43, lr: 3.00e-03
2022-04-28 21:19:13,679 INFO [train.py:755] Epoch 0, batch 63, loss[loss=0.5747, simple_loss=0.8276, pruned_loss=1.609, over 2475.00 frames.], tot_loss[loss=0.6455, simple_loss=0.9372, pruned_loss=8.432, over 135477.50 frames.], batch size: 75, lr: 3.00e-03
2022-04-28 21:19:16,114 INFO [train.py:755] Epoch 0, batch 64, loss[loss=0.71, simple_loss=1.032, pruned_loss=1.941, over 2475.00 frames.], tot_loss[loss=0.6467, simple_loss=0.9389, pruned_loss=8.315, over 137275.11 frames.], batch size: 55, lr: 3.00e-03
2022-04-28 21:19:18,567 INFO [train.py:755] Epoch 0, batch 65, loss[loss=0.7158, simple_loss=1.024, pruned_loss=2.04, over 2475.00 frames.], tot_loss[loss=0.6479, simple_loss=0.9404, pruned_loss=8.204, over 139063.74 frames.], batch size: 33, lr: 3.00e-03
2022-04-28 21:19:20,992 INFO [train.py:755] Epoch 0, batch 66, loss[loss=0.6898, simple_loss=0.9922, pruned_loss=1.937, over 2475.00 frames.], tot_loss[loss=0.6486, simple_loss=0.9413, pruned_loss=8.094, over 140843.42 frames.], batch size: 45, lr: 3.00e-03
2022-04-28 21:19:21,129 INFO [checkpoint.py:67] Saving checkpoint to pruned_transducer_stateless2/exp-char-S-debug-batch-checkpoint/epoch-0.pt
2022-04-28 21:19:30,736 INFO [train.py:755] Epoch 1, batch 0, loss[loss=0.5418, simple_loss=0.8006, pruned_loss=1.415, over 2479.00 frames.], tot_loss[loss=0.5418, simple_loss=0.8006, pruned_loss=1.415, over 2479.00 frames.], batch size: 67, lr: 2.98e-03
2022-04-28 21:19:33,072 INFO [train.py:755] Epoch 1, batch 1, loss[loss=0.5983, simple_loss=0.8453, pruned_loss=1.757, over 2442.00 frames.], tot_loss[loss=0.5699, simple_loss=0.8228, pruned_loss=1.585, over 4908.60 frames.], batch size: 37, lr: 2.98e-03
2022-04-28 21:19:35,451 INFO [train.py:755] Epoch 1, batch 2, loss[loss=0.5561, simple_loss=0.8372, pruned_loss=1.375, over 2465.00 frames.], tot_loss[loss=0.5653, simple_loss=0.8277, pruned_loss=1.514, over 7349.06 frames.], batch size: 29, lr: 2.98e-03
2022-04-28 21:19:38,038 INFO [train.py:755] Epoch 1, batch 3, loss[loss=0.6433, simple_loss=0.9054, pruned_loss=1.906, over 2460.00 frames.], tot_loss[loss=0.5849, simple_loss=0.8472, pruned_loss=1.613, over 9772.32 frames.], batch size: 30, lr: 2.98e-03
2022-04-28 21:19:40,520 INFO [train.py:755] Epoch 1, batch 4, loss[loss=0.4931, simple_loss=0.651, pruned_loss=1.676, over 2449.00 frames.], tot_loss[loss=0.5664, simple_loss=0.8078, pruned_loss=1.626, over 12172.46 frames.], batch size: 31, lr: 2.98e-03
2022-04-28 21:19:43,000 INFO [train.py:755] Epoch 1, batch 5, loss[loss=0.5367, simple_loss=0.7672, pruned_loss=1.531, over 2475.00 frames.], tot_loss[loss=0.5614, simple_loss=0.8009, pruned_loss=1.61, over 14586.59 frames.], batch size: 75, lr: 2.98e-03
2022-04-28 21:19:45,475 INFO [train.py:755] Epoch 1, batch 6, loss[loss=0.6156, simple_loss=0.8408, pruned_loss=1.952, over 2491.00 frames.], tot_loss[loss=0.5693, simple_loss=0.8067, pruned_loss=1.66, over 17004.66 frames.], batch size: 53, lr: 2.98e-03
2022-04-28 21:19:47,606 INFO [train.py:755] Epoch 1, batch 7, loss[loss=0.5553, simple_loss=0.7734, pruned_loss=1.686, over 2470.00 frames.], tot_loss[loss=0.5676, simple_loss=0.8025, pruned_loss=1.663, over 19389.64 frames.], batch size: 65, lr: 2.98e-03
2022-04-28 21:19:50,185 INFO [train.py:755] Epoch 1, batch 8, loss[loss=0.5205, simple_loss=0.7157, pruned_loss=1.627, over 2457.00 frames.], tot_loss[loss=0.5622, simple_loss=0.7927, pruned_loss=1.659, over 21749.69 frames.], batch size: 39, lr: 2.98e-03
2022-04-28 21:19:52,634 INFO [train.py:755] Epoch 1, batch 9, loss[loss=0.4572, simple_loss=0.6235, pruned_loss=1.455, over 2436.00 frames.], tot_loss[loss=0.5516, simple_loss=0.7756, pruned_loss=1.638, over 24076.94 frames.], batch size: 29, lr: 2.98e-03
2022-04-28 21:19:55,091 INFO [train.py:755] Epoch 1, batch 10, loss[loss=0.5185, simple_loss=0.7531, pruned_loss=1.419, over 2460.00 frames.], tot_loss[loss=0.5485, simple_loss=0.7735, pruned_loss=1.618, over 26416.56 frames.], batch size: 41, lr: 2.98e-03
2022-04-28 21:19:57,316 INFO [train.py:755] Epoch 1, batch 11, loss[loss=0.5121, simple_loss=0.7654, pruned_loss=1.294, over 2496.00 frames.], tot_loss[loss=0.5454, simple_loss=0.7728, pruned_loss=1.59, over 28780.47 frames.], batch size: 39, lr: 2.98e-03
2022-04-28 21:19:59,750 INFO [checkpoint.py:67] Saving checkpoint to pruned_transducer_stateless2/exp-char-S-debug-batch-checkpoint/checkpoint-120.pt
2022-04-28 21:20:02,369 INFO [train.py:755] Epoch 1, batch 12, loss[loss=0.4419, simple_loss=0.6544, pruned_loss=1.147, over 2464.00 frames.], tot_loss[loss=0.5372, simple_loss=0.7634, pruned_loss=1.555, over 31100.57 frames.], batch size: 32, lr: 2.98e-03
2022-04-28 21:20:04,942 INFO [train.py:755] Epoch 1, batch 13, loss[loss=0.5071, simple_loss=0.758, pruned_loss=1.281, over 2464.00 frames.], tot_loss[loss=0.535, simple_loss=0.763, pruned_loss=1.535, over 33409.07 frames.], batch size: 44, lr: 2.98e-03
2022-04-28 21:20:07,447 INFO [train.py:755] Epoch 1, batch 14, loss[loss=0.5805, simple_loss=0.9071, pruned_loss=1.269, over 2475.00 frames.], tot_loss[loss=0.5381, simple_loss=0.773, pruned_loss=1.516, over 35717.02 frames.], batch size: 55, lr: 2.98e-03
2022-04-28 21:20:10,010 INFO [train.py:755] Epoch 1, batch 15, loss[loss=0.5049, simple_loss=0.7744, pruned_loss=1.177, over 2484.00 frames.], tot_loss[loss=0.5359, simple_loss=0.7731, pruned_loss=1.494, over 38022.44 frames.], batch size: 69, lr: 2.98e-03
2022-04-28 21:20:12,466 INFO [train.py:755] Epoch 1, batch 16, loss[loss=0.4685, simple_loss=0.6974, pruned_loss=1.198, over 2450.00 frames.], tot_loss[loss=0.5318, simple_loss=0.7685, pruned_loss=1.476, over 40282.32 frames.], batch size: 35, lr: 2.98e-03
2022-04-28 21:20:14,899 INFO [train.py:755] Epoch 1, batch 17, loss[loss=0.4758, simple_loss=0.7459, pruned_loss=1.029, over 2490.00 frames.], tot_loss[loss=0.5286, simple_loss=0.7671, pruned_loss=1.45, over 42570.91 frames.], batch size: 83, lr: 2.98e-03
2022-04-28 21:20:17,185 INFO [train.py:755] Epoch 1, batch 18, loss[loss=0.6077, simple_loss=0.9446, pruned_loss=1.354, over 2440.00 frames.], tot_loss[loss=0.5329, simple_loss=0.7768, pruned_loss=1.445, over 44798.06 frames.], batch size: 40, lr: 2.98e-03
2022-04-28 21:20:19,662 INFO [train.py:755] Epoch 1, batch 19, loss[loss=0.5222, simple_loss=0.7959, pruned_loss=1.242, over 2448.00 frames.], tot_loss[loss=0.5323, simple_loss=0.7778, pruned_loss=1.434, over 47022.07 frames.], batch size: 36, lr: 2.98e-03
2022-04-28 21:20:22,271 INFO [train.py:755] Epoch 1, batch 20, loss[loss=0.4736, simple_loss=0.7047, pruned_loss=1.213, over 2405.00 frames.], tot_loss[loss=0.5295, simple_loss=0.7742, pruned_loss=1.423, over 49191.96 frames.], batch size: 25, lr: 2.98e-03
2022-04-28 21:20:24,716 INFO [train.py:755] Epoch 1, batch 21, loss[loss=0.4646, simple_loss=0.7338, pruned_loss=0.9768, over 2494.00 frames.], tot_loss[loss=0.5263, simple_loss=0.7723, pruned_loss=1.402, over 51440.00 frames.], batch size: 86, lr: 2.98e-03
2022-04-28 21:20:27,233 INFO [train.py:755] Epoch 1, batch 22, loss[loss=0.4998, simple_loss=0.7782, pruned_loss=1.107, over 2448.00 frames.], tot_loss[loss=0.5251, simple_loss=0.7725, pruned_loss=1.388, over 53630.80 frames.], batch size: 34, lr: 2.98e-03
2022-04-28 21:20:29,722 INFO [train.py:755] Epoch 1, batch 23, loss[loss=0.4685, simple_loss=0.7178, pruned_loss=1.096, over 2491.00 frames.], tot_loss[loss=0.5226, simple_loss=0.7701, pruned_loss=1.375, over 55853.64 frames.], batch size: 47, lr: 2.98e-03
2022-04-28 21:20:32,226 INFO [train.py:755] Epoch 1, batch 24, loss[loss=0.4722, simple_loss=0.7149, pruned_loss=1.147, over 2450.00 frames.], tot_loss[loss=0.5204, simple_loss=0.7678, pruned_loss=1.366, over 58024.38 frames.], batch size: 49, lr: 2.98e-03
2022-04-28 21:20:34,666 INFO [train.py:755] Epoch 1, batch 25, loss[loss=0.5483, simple_loss=0.8351, pruned_loss=1.307, over 2434.00 frames.], tot_loss[loss=0.5216, simple_loss=0.7705, pruned_loss=1.363, over 60168.25 frames.], batch size: 13, lr: 2.98e-03
2022-04-28 21:20:37,243 INFO [train.py:755] Epoch 1, batch 26, loss[loss=0.5317, simple_loss=0.8315, pruned_loss=1.16, over 2460.00 frames.], tot_loss[loss=0.522, simple_loss=0.7729, pruned_loss=1.355, over 62327.41 frames.], batch size: 60, lr: 2.98e-03
2022-04-28 21:20:39,626 INFO [train.py:755] Epoch 1, batch 27, loss[loss=0.5005, simple_loss=0.7456, pruned_loss=1.277, over 2478.00 frames.], tot_loss[loss=0.5211, simple_loss=0.7719, pruned_loss=1.352, over 64493.78 frames.], batch size: 42, lr: 2.98e-03
2022-04-28 21:20:42,160 INFO [train.py:755] Epoch 1, batch 28, loss[loss=0.5141, simple_loss=0.818, pruned_loss=1.051, over 2496.00 frames.], tot_loss[loss=0.5209, simple_loss=0.7736, pruned_loss=1.341, over 66667.31 frames.], batch size: 78, lr: 2.98e-03
2022-04-28 21:20:44,375 INFO [train.py:755] Epoch 1, batch 29, loss[loss=0.4768, simple_loss=0.7457, pruned_loss=1.039, over 2482.00 frames.], tot_loss[loss=0.5193, simple_loss=0.7726, pruned_loss=1.33, over 68815.97 frames.], batch size: 73, lr: 2.98e-03
2022-04-28 21:20:46,516 INFO [train.py:755] Epoch 1, batch 30, loss[loss=0.5176, simple_loss=0.8242, pruned_loss=1.055, over 2485.00 frames.], tot_loss[loss=0.5192, simple_loss=0.7744, pruned_loss=1.32, over 70956.89 frames.], batch size: 71, lr: 2.98e-03
2022-04-28 21:20:49,008 INFO [train.py:755] Epoch 1, batch 31, loss[loss=0.4425, simple_loss=0.683, pruned_loss=1.01, over 2480.00 frames.], tot_loss[loss=0.5166, simple_loss=0.7713, pruned_loss=1.31, over 73082.11 frames.], batch size: 80, lr: 2.98e-03
2022-04-28 21:20:51,709 INFO [checkpoint.py:67] Saving checkpoint to pruned_transducer_stateless2/exp-char-S-debug-batch-checkpoint/checkpoint-140.pt
2022-04-28 21:20:54,427 INFO [train.py:755] Epoch 1, batch 32, loss[loss=0.5182, simple_loss=0.8189, pruned_loss=1.087, over 2475.00 frames.], tot_loss[loss=0.5167, simple_loss=0.7729, pruned_loss=1.303, over 75191.70 frames.], batch size: 33, lr: 2.98e-03
2022-04-28 21:20:57,023 INFO [train.py:755] Epoch 1, batch 33, loss[loss=0.4327, simple_loss=0.6603, pruned_loss=1.025, over 2485.00 frames.], tot_loss[loss=0.514, simple_loss=0.7692, pruned_loss=1.294, over 77300.74 frames.], batch size: 35, lr: 2.98e-03
2022-04-28 21:20:59,554 INFO [train.py:755] Epoch 1, batch 34, loss[loss=0.5478, simple_loss=0.8634, pruned_loss=1.16, over 2491.00 frames.], tot_loss[loss=0.515, simple_loss=0.7722, pruned_loss=1.289, over 79405.23 frames.], batch size: 53, lr: 2.98e-03
2022-04-28 21:21:01,998 INFO [train.py:755] Epoch 1, batch 35, loss[loss=0.5058, simple_loss=0.7779, pruned_loss=1.169, over 2451.00 frames.], tot_loss[loss=0.5148, simple_loss=0.7724, pruned_loss=1.286, over 81459.21 frames.], batch size: 43, lr: 2.98e-03
2022-04-28 21:21:04,365 INFO [train.py:755] Epoch 1, batch 36, loss[loss=0.4164, simple_loss=0.6189, pruned_loss=1.069, over 2480.00 frames.], tot_loss[loss=0.5118, simple_loss=0.7678, pruned_loss=1.279, over 83531.91 frames.], batch size: 31, lr: 2.98e-03
2022-04-28 21:21:06,788 INFO [train.py:755] Epoch 1, batch 37, loss[loss=0.4767, simple_loss=0.7475, pruned_loss=1.029, over 2496.00 frames.], tot_loss[loss=0.5108, simple_loss=0.7672, pruned_loss=1.272, over 85610.25 frames.], batch size: 64, lr: 2.98e-03
2022-04-28 21:21:09,373 INFO [train.py:755] Epoch 1, batch 38, loss[loss=0.4473, simple_loss=0.686, pruned_loss=1.044, over 2479.00 frames.], tot_loss[loss=0.509, simple_loss=0.7649, pruned_loss=1.266, over 87661.20 frames.], batch size: 37, lr: 2.98e-03
2022-04-28 21:21:11,969 INFO [train.py:755] Epoch 1, batch 39, loss[loss=0.482, simple_loss=0.7313, pruned_loss=1.164, over 2450.00 frames.], tot_loss[loss=0.5083, simple_loss=0.764, pruned_loss=1.263, over 89672.89 frames.], batch size: 49, lr: 2.98e-03
2022-04-28 21:21:14,510 INFO [train.py:755] Epoch 1, batch 40, loss[loss=0.4556, simple_loss=0.7052, pruned_loss=1.03, over 2470.00 frames.], tot_loss[loss=0.5069, simple_loss=0.7624, pruned_loss=1.257, over 91694.53 frames.], batch size: 38, lr: 2.98e-03
2022-04-28 21:21:17,017 INFO [train.py:755] Epoch 1, batch 41, loss[loss=0.5291, simple_loss=0.8369, pruned_loss=1.107, over 2482.00 frames.], tot_loss[loss=0.5075, simple_loss=0.7644, pruned_loss=1.253, over 93718.06 frames.], batch size: 34, lr: 2.98e-03
2022-04-28 21:21:19,611 INFO [train.py:755] Epoch 1, batch 42, loss[loss=0.467, simple_loss=0.7155, pruned_loss=1.093, over 2418.00 frames.], tot_loss[loss=0.5064, simple_loss=0.7632, pruned_loss=1.249, over 95667.47 frames.], batch size: 31, lr: 2.98e-03
2022-04-28 21:21:22,078 INFO [train.py:755] Epoch 1, batch 43, loss[loss=0.5672, simple_loss=0.906, pruned_loss=1.142, over 2448.00 frames.], tot_loss[loss=0.508, simple_loss=0.7667, pruned_loss=1.246, over 97637.13 frames.], batch size: 48, lr: 2.98e-03

@pzelasko
Collaborator

Interesting. Can you take the code from the notebook, and replace mini LibriSpeech with your test data, and see if you can replicate that issue? I suspect there is a difference between the usage in the notebook and the usage in the training script and we have to close in on that.

@luomingshuang
Collaborator Author

Of course. I will do it right now.

Interesting. Can you take the code from the notebook, and replace mini LibriSpeech with your test data, and see if you can replicate that issue? I suspect there is a difference between the usage in the notebook and the usage in the training script and we have to close in on that.

@luomingshuang
Collaborator Author

Hi @pzelasko, @csukuangfj suggested that I set num-workers=0 and re-run the code. I did, and there is a difference: with num-workers=0, the batch index for the first epoch after loading checkpoint-xxx.pt is normal (in my original experiments num-workers was 2). The loaded sampler state says kept_batches is 44, and the resumed run does 22 more batches in the first epoch; 44 + 22 = 66, which is correct. See the sketch and the training log below.
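
For reference, the only change relative to the earlier runs is the num-workers value; roughly (again a sketch with placeholder names, not the exact icefall code):

from torch.utils.data import DataLoader

train_dl = DataLoader(
    train_dataset,            # placeholder for the dataset object
    sampler=train_sampler,    # placeholder for the bucketing sampler
    batch_size=None,
    num_workers=0,  # was 2; with 2 workers the resumed batch index was wrong
)
train_dl.sampler.load_state_dict(sampler_state_dict)

The training log is as follows: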

2022-04-29 10:30:55,718 INFO [train.py:818] Training started
2022-04-29 10:30:55.950484: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /ceph-sh1/fangjun/software/cuda-10.2.89/lib:/ceph-sh1/fangjun/software/cuda-10.2.89/lib64:
2022-04-29 10:30:55.950525: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2022-04-29 10:30:58,851 INFO [train.py:828] Device: cuda:0
2022-04-29 10:30:59,958 INFO [lexicon.py:176] Loading pre-compiled data/lang_char/Linv.pt
2022-04-29 10:31:00,050 INFO [train.py:847] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 1, 'reset_interval': 200, 'feature_dim': 80, 'subsampling_factor': 4, 'encoder_dim': 512, 'nhead': 8, 'dim_feedforward': 2048, 'num_encoder_layers': 12, 'decoder_dim': 512, 'joiner_dim': 512, 'env_info': {'k2-version': '1.14', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '4fb6b88661cca73e5f66f03df16e5a1d0c4886f8', 'k2-git-date': 'Fri Apr 8 18:29:32 2022', 'lhotse-version': '1.1.0', 'torch-version': '1.11.0', 'torch-cuda-available': True, 'torch-cuda-version': '10.2', 'python-version': '3.8', 'icefall-git-branch': 'wenetspeech-pruned-transducer-stateless2', 'icefall-git-sha1': '4b567e4-dirty', 'icefall-git-date': 'Wed Apr 27 13:43:54 2022', 'icefall-path': '/ceph-meixu/luomingshuang/icefall', 'k2-path': '/ceph-ms/luomingshuang/k2/k2/python/k2/__init__.py', 'lhotse-path': '/ceph-meixu/luomingshuang/anaconda3/envs/k2-python/lib/python3.8/site-packages/lhotse-1.1.0-py3.8.egg/lhotse/__init__.py', 'hostname': 'de-74279-k2-train-3-0307202051-57dc848959-8tmmp', 'IP address': '10.177.24.138'}, 'world_size': 1, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 30, 'start_epoch': 0, 'start_batch': 40, 'exp_dir': PosixPath('pruned_transducer_stateless2/exp-char-S-debug-batch-checkpoint'), 'lang_dir': PosixPath('data/lang_char'), 'token_type': 'char', 'initial_lr': 0.003, 'lr_batches': 5000, 'lr_epochs': 6, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'save_every_n': 20, 'keep_last_k': 20, 'use_fp16': False, 'valid_interval': 100, 'model_warm_step': 100, 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 100, 'bucketing_sampler': True, 'num_buckets': 300, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 0, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'lazy_load': True, 'training_subset': 'S', 'blank_id': 0, 'vocab_size': 5537}
2022-04-29 10:31:00,051 INFO [train.py:849] About to create model
2022-04-29 10:31:00,577 INFO [train.py:853] Number of model parameters: 88978927
2022-04-29 10:31:00,577 INFO [checkpoint.py:100] Loading checkpoint from pruned_transducer_stateless2/exp-char-S-debug-batch-checkpoint/checkpoint-40.pt
2022-04-29 10:31:08,411 INFO [train.py:868] Loading optimizer state dict
2022-04-29 10:31:09,243 INFO [train.py:876] Loading scheduler state dict
2022-04-29 10:31:09,243 INFO [asr_datamodule.py:427] About to get train cuts
2022-04-29 10:31:09,243 INFO [asr_datamodule.py:429] use lazy cuts
2022-04-29 10:31:09,243 INFO [asr_datamodule.py:445] About to get dev cuts
2022-04-29 10:31:10,310 INFO [asr_datamodule.py:359] About to create dev dataset
/ceph-meixu/luomingshuang/anaconda3/envs/k2-python/lib/python3.8/site-packages/lhotse-1.1.0-py3.8.egg/lhotse/dataset/sampling/dynamic_bucketing.py:129: UserWarning: You are using DynamicBucketingSampler with an eagerly read CutSet. You won't see any memory/speed benefits with that setup. Either use 'CutSet.from_jsonl_lazy' to read the CutSet lazily, or use a BucketingSampler instead.
2022-04-29 10:31:10,318 INFO [asr_datamodule.py:380] About to create dev dataloader
2022-04-29 10:31:10,319 INFO [asr_datamodule.py:220] About to get Musan cuts
2022-04-29 10:31:12,615 INFO [asr_datamodule.py:227] Enable MUSAN
2022-04-29 10:31:12,616 INFO [asr_datamodule.py:252] Enable SpecAugment
2022-04-29 10:31:12,616 INFO [asr_datamodule.py:253] Time warp factor: 80
2022-04-29 10:31:12,616 INFO [asr_datamodule.py:265] Num frame mask: 10
2022-04-29 10:31:12,616 INFO [asr_datamodule.py:278] About to create train dataset
2022-04-29 10:31:12,616 INFO [asr_datamodule.py:306] Using DynamicBucketingSampler.
2022-04-29 10:31:12,963 INFO [asr_datamodule.py:322] About to create train dataloader
2022-04-29 10:31:12,964 INFO [asr_datamodule.py:344] Loading sampler state dict
{'epoch': 0, 'world_size': 1, 'rank': 0, 'seed': 0, 'shuffle': True, 'diagnostics': {'current_epoch': 0, 'stats_per_epoch': {0: {'epoch': 0, 'kept_cuts': 1998, 'discarded_cuts': 0, 'kept_batches': 44, 'discarded_batches': 0}}}, 'max_duration': 100, 'max_cuts': None, 'drop_last': True, 'consistent_ids': True, 'buffer_size': 30000, 'num_cuts_for_bins_estimate': 10000, 'shuffle_buffer_size': 20000, 'strict': True}
2022-04-29 10:31:14,001 INFO [train.py:939] Loading grad scaler state dict
2022-04-29 10:31:17,699 INFO [train.py:756] Epoch 0, batch 0, loss[loss=0.5521, simple_loss=1.104, pruned_loss=8.767, over 2432.00 frames.], tot_loss[loss=0.5521, simple_loss=1.104, pruned_loss=8.767, over 2432.00 frames.], batch size: 32, lr: 3.00e-03
2022-04-29 10:31:21,003 INFO [train.py:756] Epoch 0, batch 1, loss[loss=0.474, simple_loss=0.948, pruned_loss=8.545, over 2494.00 frames.], tot_loss[loss=0.5125, simple_loss=1.025, pruned_loss=8.654, over 4913.84 frames.], batch size: 86, lr: 3.00e-03
2022-04-29 10:31:24,395 INFO [train.py:756] Epoch 0, batch 2, loss[loss=0.5441, simple_loss=1.088, pruned_loss=8.671, over 2485.00 frames.], tot_loss[loss=0.5231, simple_loss=1.046, pruned_loss=8.66, over 7374.27 frames.], batch size: 71, lr: 3.00e-03
2022-04-29 10:31:27,772 INFO [train.py:756] Epoch 0, batch 3, loss[loss=0.5618, simple_loss=1.124, pruned_loss=8.818, over 2448.00 frames.], tot_loss[loss=0.5328, simple_loss=1.066, pruned_loss=8.699, over 9785.40 frames.], batch size: 36, lr: 3.00e-03
2022-04-29 10:31:31,042 INFO [train.py:756] Epoch 0, batch 4, loss[loss=0.5015, simple_loss=1.003, pruned_loss=8.766, over 2496.00 frames.], tot_loss[loss=0.5264, simple_loss=1.053, pruned_loss=8.713, over 12232.47 frames.], batch size: 78, lr: 3.00e-03
2022-04-29 10:31:34,352 INFO [train.py:756] Epoch 0, batch 5, loss[loss=0.5348, simple_loss=1.07, pruned_loss=8.814, over 2465.00 frames.], tot_loss[loss=0.5278, simple_loss=1.056, pruned_loss=8.73, over 14636.31 frames.], batch size: 29, lr: 3.00e-03
2022-04-29 10:31:37,538 INFO [train.py:756] Epoch 0, batch 6, loss[loss=0.585, simple_loss=1.17, pruned_loss=8.931, over 2440.00 frames.], tot_loss[loss=0.536, simple_loss=1.072, pruned_loss=8.759, over 17003.13 frames.], batch size: 40, lr: 3.00e-03
2022-04-29 10:31:41,011 INFO [train.py:756] Epoch 0, batch 7, loss[loss=0.5305, simple_loss=1.061, pruned_loss=8.822, over 2484.00 frames.], tot_loss[loss=0.5353, simple_loss=1.071, pruned_loss=8.767, over 19402.11 frames.], batch size: 54, lr: 3.00e-03
2022-04-29 10:31:44,180 INFO [train.py:756] Epoch 0, batch 8, loss[loss=0.522, simple_loss=1.044, pruned_loss=8.849, over 2430.00 frames.], tot_loss[loss=0.5338, simple_loss=1.068, pruned_loss=8.776, over 21735.10 frames.], batch size: 30, lr: 3.00e-03
2022-04-29 10:31:47,587 INFO [train.py:756] Epoch 0, batch 9, loss[loss=0.558, simple_loss=1.116, pruned_loss=8.94, over 2448.00 frames.], tot_loss[loss=0.5363, simple_loss=1.073, pruned_loss=8.793, over 24074.43 frames.], batch size: 48, lr: 3.00e-03
2022-04-29 10:31:50,477 INFO [train.py:756] Epoch 0, batch 10, loss[loss=0.5489, simple_loss=1.098, pruned_loss=9.008, over 2434.00 frames.], tot_loss[loss=0.5375, simple_loss=1.075, pruned_loss=8.813, over 26388.05 frames.], batch size: 13, lr: 3.00e-03
2022-04-29 10:31:53,794 INFO [train.py:756] Epoch 0, batch 11, loss[loss=0.546, simple_loss=1.092, pruned_loss=8.864, over 2480.00 frames.], tot_loss[loss=0.5382, simple_loss=1.076, pruned_loss=8.817, over 28736.11 frames.], batch size: 62, lr: 3.00e-03
2022-04-29 10:31:56,886 INFO [train.py:756] Epoch 0, batch 12, loss[loss=0.565, simple_loss=1.13, pruned_loss=9.024, over 2450.00 frames.], tot_loss[loss=0.5403, simple_loss=1.081, pruned_loss=8.833, over 31042.43 frames.], batch size: 50, lr: 3.00e-03
2022-04-29 10:32:00,078 INFO [train.py:756] Epoch 0, batch 13, loss[loss=0.5525, simple_loss=1.105, pruned_loss=8.993, over 2484.00 frames.], tot_loss[loss=0.5412, simple_loss=1.082, pruned_loss=8.845, over 33371.22 frames.], batch size: 46, lr: 3.00e-03
2022-04-29 10:32:03,622 INFO [train.py:756] Epoch 0, batch 14, loss[loss=0.5679, simple_loss=1.136, pruned_loss=9.02, over 2475.00 frames.], tot_loss[loss=0.5431, simple_loss=1.086, pruned_loss=8.857, over 35679.37 frames.], batch size: 45, lr: 3.00e-03
2022-04-29 10:32:06,750 INFO [train.py:756] Epoch 0, batch 15, loss[loss=0.5192, simple_loss=1.038, pruned_loss=8.932, over 2491.00 frames.], tot_loss[loss=0.5415, simple_loss=1.083, pruned_loss=8.862, over 37991.97 frames.], batch size: 53, lr: 3.00e-03
2022-04-29 10:32:09,987 INFO [train.py:756] Epoch 0, batch 16, loss[loss=0.5254, simple_loss=1.051, pruned_loss=8.875, over 2460.00 frames.], tot_loss[loss=0.5405, simple_loss=1.081, pruned_loss=8.863, over 40262.01 frames.], batch size: 30, lr: 3.00e-03
2022-04-29 10:32:13,218 INFO [train.py:756] Epoch 0, batch 17, loss[loss=0.5323, simple_loss=1.065, pruned_loss=8.882, over 2482.00 frames.], tot_loss[loss=0.54, simple_loss=1.08, pruned_loss=8.864, over 42542.70 frames.], batch size: 34, lr: 3.00e-03
2022-04-29 10:32:16,584 INFO [train.py:756] Epoch 0, batch 18, loss[loss=0.5239, simple_loss=1.048, pruned_loss=8.748, over 2494.00 frames.], tot_loss[loss=0.5391, simple_loss=1.078, pruned_loss=8.858, over 44823.99 frames.], batch size: 43, lr: 3.00e-03
2022-04-29 10:32:19,826 INFO [checkpoint.py:67] Saving checkpoint to pruned_transducer_stateless2/exp-char-S-debug-batch-checkpoint/checkpoint-60.pt
2022-04-29 10:32:22,901 INFO [train.py:756] Epoch 0, batch 19, loss[loss=0.4379, simple_loss=0.8757, pruned_loss=8.512, over 2475.00 frames.], tot_loss[loss=0.5338, simple_loss=1.068, pruned_loss=8.84, over 47074.87 frames.], batch size: 75, lr: 3.00e-03
2022-04-29 10:32:26,228 INFO [train.py:756] Epoch 0, batch 20, loss[loss=0.5446, simple_loss=1.089, pruned_loss=8.702, over 2475.00 frames.], tot_loss[loss=0.5344, simple_loss=1.069, pruned_loss=8.833, over 49314.49 frames.], batch size: 55, lr: 3.00e-03
2022-04-29 10:32:29,579 INFO [train.py:756] Epoch 0, batch 21, loss[loss=0.5392, simple_loss=1.078, pruned_loss=8.65, over 2475.00 frames.], tot_loss[loss=0.5346, simple_loss=1.069, pruned_loss=8.824, over 51542.92 frames.], batch size: 33, lr: 3.00e-03
2022-04-29 10:32:32,787 INFO [train.py:756] Epoch 0, batch 22, loss[loss=0.5197, simple_loss=1.039, pruned_loss=8.67, over 2475.00 frames.], tot_loss[loss=0.5339, simple_loss=1.068, pruned_loss=8.817, over 53760.20 frames.], batch size: 45, lr: 3.00e-03
2022-04-29 10:32:32,814 INFO [checkpoint.py:67] Saving checkpoint to pruned_transducer_stateless2/exp-char-S-debug-batch-checkpoint/epoch-0.pt
2022-04-29 10:32:42,766 INFO [train.py:756] Epoch 1, batch 0, loss[loss=0.4488, simple_loss=0.8977, pruned_loss=8.438, over 2479.00 frames.], tot_loss[loss=0.4488, simple_loss=0.8977, pruned_loss=8.438, over 2479.00 frames.], batch size: 67, lr: 2.98e-03
2022-04-29 10:32:46,163 INFO [train.py:756] Epoch 1, batch 1, loss[loss=0.4989, simple_loss=0.9978, pruned_loss=8.7, over 2442.00 frames.], tot_loss[loss=0.4737, simple_loss=0.9475, pruned_loss=8.568, over 4908.60 frames.], batch size: 37, lr: 2.98e-03
2022-04-29 10:32:49,477 INFO [train.py:756] Epoch 1, batch 2, loss[loss=0.4289, simple_loss=0.8579, pruned_loss=8.506, over 2465.00 frames.], tot_loss[loss=0.4587, simple_loss=0.9174, pruned_loss=8.547, over 7349.06 frames.], batch size: 29, lr: 2.98e-03
2022-04-29 10:32:52,576 INFO [train.py:756] Epoch 1, batch 3, loss[loss=0.4554, simple_loss=0.9108, pruned_loss=8.589, over 2460.00 frames.], tot_loss[loss=0.4579, simple_loss=0.9157, pruned_loss=8.558, over 9772.32 frames.], batch size: 30, lr: 2.98e-03
2022-04-29 10:32:55,960 INFO [train.py:756] Epoch 1, batch 4, loss[loss=0.4035, simple_loss=0.807, pruned_loss=8.49, over 2449.00 frames.], tot_loss[loss=0.4469, simple_loss=0.8939, pruned_loss=8.544, over 12172.46 frames.], batch size: 31, lr: 2.98e-03
2022-04-29 10:32:59,221 INFO [train.py:756] Epoch 1, batch 5, loss[loss=0.4049, simple_loss=0.8099, pruned_loss=8.472, over 2475.00 frames.], tot_loss[loss=0.4398, simple_loss=0.8796, pruned_loss=8.532, over 14586.59 frames.], batch size: 75, lr: 2.98e-03
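For reference, the batch-level resume this log exercises comes down to restoring four pieces of state: model, optimizer, scheduler, and sampler (plus the grad scaler when fp16 is enabled). A minimal sketch in plain PyTorch; the function name and checkpoint keys here are illustrative, not icefall's exact API:

```python
import torch


def resume_from_batch_checkpoint(
    ckpt_path, model, optimizer, scheduler, sampler, scaler=None
):
    """Restore training state saved mid-epoch (illustrative sketch)."""
    ckpt = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    scheduler.load_state_dict(ckpt["scheduler"])
    # Restoring the sampler state is what lets the dataloader resume
    # mid-epoch instead of replaying cuts that were already seen.
    sampler.load_state_dict(ckpt["sampler"])
    if scaler is not None and ckpt.get("grad_scaler") is not None:
        scaler.load_state_dict(ckpt["grad_scaler"])
    return ckpt.get("batch_idx_train", 0)
```

Note that in the log the resumed epoch 0 only iterates the remaining batches (23 of the 44 recorded in the sampler state) before rolling into epoch 1, which suggests the mid-epoch restore works here with num_workers=0.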

@pzelasko
Copy link
Collaborator

pzelasko commented May 1, 2022

@luomingshuang num_workers > 0 was indeed the issue. I added it to the test cases and fixed it; please check now.

lhotse-speech/lhotse#692
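
For anyone reproducing it, the pattern under test is roughly the sketch below; the manifest path is illustrative, and this is not the exact lhotse test case:

```python
# Save/restore DynamicBucketingSampler state with worker processes,
# the combination that previously failed.
from torch.utils.data import DataLoader
from lhotse import CutSet
from lhotse.dataset import K2SpeechRecognitionDataset
from lhotse.dataset.sampling.dynamic_bucketing import DynamicBucketingSampler

# Lazy manifest, as the UserWarning in the log above advises.
cuts = CutSet.from_jsonl_lazy("data/fbank/cuts_S.jsonl.gz")
sampler = DynamicBucketingSampler(
    cuts, max_duration=100, shuffle=True, drop_last=True
)

# ... training consumes some batches, then the state is persisted ...
state = sampler.state_dict()

# On resume: rebuild an identical sampler, restore its state before
# creating the dataloader, then iterate with num_workers > 0.
resumed = DynamicBucketingSampler(
    cuts, max_duration=100, shuffle=True, drop_last=True
)
resumed.load_state_dict(state)
dl = DataLoader(
    K2SpeechRecognitionDataset(),
    sampler=resumed,
    batch_size=None,
    num_workers=2,  # the previously failing configuration
)
```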

@luomingshuang
Copy link
Collaborator Author

OK, I will check it.

@csukuangfj
Copy link
Collaborator

> OK, I will check it.

Please update train.py in icefall to remove cur_batch_idx.
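
For context, `cur_batch_idx` was the manual fast-forward kept in the training loop while mid-epoch sampler restoration was unreliable; with the lhotse fix, the restored sampler already skips the consumed cuts, so the counter becomes dead code. A rough before/after sketch (not the exact icefall diff; `train_one_batch` is a hypothetical helper):

```python
# Before: replay the dataloader and discard batches already consumed in
# the interrupted epoch. Skipped batches are still loaded and collated,
# so resuming late in an epoch wastes time.
for batch_idx, batch in enumerate(train_dl):
    if batch_idx < cur_batch_idx:
        continue
    train_one_batch(batch)

# After: sampler.load_state_dict() has already fast-forwarded the
# sampler, so the loop sees exactly the remaining batches.
for batch_idx, batch in enumerate(train_dl):
    train_one_batch(batch)
```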
