val_wer #7450
@stevehuang52, could you take a look?
The bugfix PR is here: #7505
This PR #7505 is under review; it needs a few more updates before merging.
This PR #7505 was closed. I will create another PR to fix the logging issue with PTL 2.0.
@leilei183, the current NeMo main branch is moving to PTL 2.0, where a lot of APIs need to be updated, and we're still working on that. Could you please use the stable NeMo r1.20 release for now?
Hi @stevehuang52, thank you very much for your answer. I will use the stable version first.
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been inactive for 7 days since being marked as stale. |
When I change the strategy from ddp to auto, training starts, but then I hit the following new error. What do you suggest?
error executing job with overrides: []
Traceback (most recent call last):
File "/home/zhengbeida/code/NeMo-main/NeMo-main/NeMo-main/examples/slu/speech_intent_slot/run_speech_intent_slot_train.py", line 119, in main
trainer.fit(model)
File "/home/zhengbeida/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 520, in fit
call._call_and_handle_interrupt(
File "/home/zhengbeida/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/zhengbeida/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 559, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/zhengbeida/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 935, in _run
results = self._run_stage()
File "/home/zhengbeida/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 978, in _run_stage
self.fit_loop.run()
File "/home/zhengbeida/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 201, in run
self.advance()
File "/home/zhengbeida/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 354, in advance
self.epoch_loop.run(self._data_fetcher)
File "/home/zhengbeida/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 134, in run
self.on_advance_end()
File "/home/zhengbeida/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 248, in on_advance_end
self.val_loop.run()
File "/home/zhengbeida/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/loops/utilities.py", line 174, in _decorator
return loop_run(self, *args, **kwargs)
File "/home/zhengbeida/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 122, in run
return self.on_run_end()
File "/home/zhengbeida/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 258, in on_run_end
self._on_evaluation_end()
File "/home/zhengbeida/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 303, in _on_evaluation_end
call._call_callback_hooks(trainer, hook_name, *args, **kwargs)
File "/home/zhengbeida/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 190, in _call_callback_hooks
fn(trainer, trainer.lightning_module, *args, **kwargs)
File "/home/zhengbeida/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 311, in on_validation_end
self._save_topk_checkpoint(trainer, monitor_candidates)
File "/home/zhengbeida/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 358, in _save_topk_checkpoint
raise MisconfigurationException(m)
lightning_fabric.utilities.exceptions.MisconfigurationException:
ModelCheckpoint(monitor='val_wer')
could not find the monitored key in the returned metrics: ['train_loss', 'learning_rate_g0', 'learning_rate_g1', 'train_backward_timing', 'train_step_timing', 'training_batch_wer', 'epoch', 'step']. HINT: Did you call `log('val_wer', value)` in the `LightningModule`?
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
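The error itself points at the likely cause: `ModelCheckpoint(monitor='val_wer')` can only watch a key that the `LightningModule` actually logs during the validation loop, and the metrics list in the traceback contains only training-side keys. Below is a minimal sketch of the required logging pattern. The class and values are hypothetical stand-ins (a tiny fake logger is used in place of `pytorch_lightning.LightningModule` so the snippet runs on its own); in real code, `validation_step` would belong to your Lightning module and `self.log` would be Lightning's own method.

```python
# Sketch: ModelCheckpoint(monitor="val_wer") needs a matching self.log("val_wer", ...)
# call during validation. Hypothetical stand-in class so this runs without
# pytorch_lightning installed.

class FakeLightningModule:
    """Stand-in that records the keys passed to self.log(), mimicking
    what Lightning collects into its callback metrics."""

    def __init__(self):
        self.logged_metrics = {}

    def log(self, name, value):
        # In real PTL this registers the metric for callbacks like ModelCheckpoint.
        self.logged_metrics[name] = value

    def validation_step(self, batch, batch_idx):
        wer = 0.25  # placeholder word error rate for this batch
        # The key must match ModelCheckpoint's `monitor` argument exactly.
        # If this call never runs (or uses a different key), Lightning raises
        # the MisconfigurationException seen in the traceback above.
        self.log("val_wer", wer)


module = FakeLightningModule()
module.validation_step(batch=None, batch_idx=0)
print("val_wer" in module.logged_metrics)  # the monitored key is now present
```

If the metric is being logged under a different name (the list above shows `training_batch_wer`), either rename the logged key or point the checkpoint callback's `monitor` at the key that is actually produced during validation.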