val_wer #7450

Closed · leilei183 opened this issue Sep 18, 2023 · 8 comments
Labels: bug (Something isn't working), stale
@leilei183

When I change the strategy from ddp to auto, training starts, but it then fails with the error below. What do you suggest?
```
Error executing job with overrides: []
Traceback (most recent call last):
  File "/home/zhengbeida/code/NeMo-main/NeMo-main/NeMo-main/examples/slu/speech_intent_slot/run_speech_intent_slot_train.py", line 119, in main
    trainer.fit(model)
  File "/home/zhengbeida/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 520, in fit
    call._call_and_handle_interrupt(
  File "/home/zhengbeida/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/zhengbeida/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 559, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/zhengbeida/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 935, in _run
    results = self._run_stage()
  File "/home/zhengbeida/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 978, in _run_stage
    self.fit_loop.run()
  File "/home/zhengbeida/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 201, in run
    self.advance()
  File "/home/zhengbeida/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 354, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/home/zhengbeida/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 134, in run
    self.on_advance_end()
  File "/home/zhengbeida/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 248, in on_advance_end
    self.val_loop.run()
  File "/home/zhengbeida/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/loops/utilities.py", line 174, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/home/zhengbeida/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 122, in run
    return self.on_run_end()
  File "/home/zhengbeida/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 258, in on_run_end
    self._on_evaluation_end()
  File "/home/zhengbeida/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 303, in _on_evaluation_end
    call._call_callback_hooks(trainer, hook_name, *args, **kwargs)
  File "/home/zhengbeida/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 190, in _call_callback_hooks
    fn(trainer, trainer.lightning_module, *args, **kwargs)
  File "/home/zhengbeida/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 311, in on_validation_end
    self._save_topk_checkpoint(trainer, monitor_candidates)
  File "/home/zhengbeida/anaconda3/envs/nemo/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 358, in _save_topk_checkpoint
    raise MisconfigurationException(m)
lightning_fabric.utilities.exceptions.MisconfigurationException: ModelCheckpoint(monitor='val_wer') could not find the monitored key in the returned metrics: ['train_loss', 'learning_rate_g0', 'learning_rate_g1', 'train_backward_timing', 'train_step_timing', 'training_batch_wer', 'epoch', 'step']. HINT: Did you call log('val_wer', value) in the LightningModule?
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
```
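For readers hitting the same error: the metrics listed in the exception are all training-time keys, which means nothing named `val_wer` was ever logged during validation. For reference, a minimal sketch of the checkpoint side of the contract; the parameter values below are illustrative, not NeMo's defaults:

```python
from pytorch_lightning.callbacks import ModelCheckpoint

# ModelCheckpoint can only monitor keys that the LightningModule actually logs;
# the monitor string must match a self.log('val_wer', ...) call made during validation.
checkpoint_callback = ModelCheckpoint(
    monitor="val_wer",
    mode="min",       # lower word error rate is better
    save_top_k=3,     # illustrative value
)
```

If the monitored key is never logged, this exception is raised at the end of the first validation run, exactly as in the traceback above.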

leilei183 added the bug label Sep 18, 2023
@titu1994 (Collaborator)

@stevehuang52 could you take a look?

@XuesongYang (Collaborator)

The bugfix PR is here: #7505

@stevehuang52 (Collaborator)

PR #7505 is under review; it needs a few more updates before merging.

@stevehuang52 (Collaborator)

PR #7505 was closed; I will create another PR to fix the logging issue with PTL 2.0.
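For anyone tracking this: one of the PTL 2.0 changes involved here is that the `validation_epoch_end(outputs)` hook was removed, so a module now has to accumulate its validation outputs itself and log them in `on_validation_epoch_end`. A minimal sketch of the 2.0 pattern follows; this is illustrative, not NeMo's actual fix, and `_compute_wer` is a placeholder:

```python
import torch
import pytorch_lightning as pl

class SLUModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self._val_wers = []  # PTL 2.0: the module accumulates outputs itself

    def _compute_wer(self, batch):
        # Placeholder WER computation, for illustration only.
        return torch.tensor(0.0)

    def validation_step(self, batch, batch_idx):
        wer = self._compute_wer(batch)
        self._val_wers.append(wer)
        return wer

    def on_validation_epoch_end(self):
        # Replaces the PTL 1.x validation_epoch_end(outputs) hook.
        # Logging here is what makes 'val_wer' visible to
        # ModelCheckpoint(monitor='val_wer').
        self.log("val_wer", torch.stack(self._val_wers).mean(), sync_dist=True)
        self._val_wers.clear()
```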

@stevehuang52 (Collaborator)

@leilei183 the current NeMo main branch is migrating to PTL 2.0; many APIs need to be updated, and we're still working on that. Could you please use the stable NeMo r1.20 release for now?
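One way to switch to the stable release, assuming a pip install (1.20.0 is the PyPI version corresponding to the r1.20 branch):

```bash
# Install the stable NeMo 1.20 release instead of the main branch
pip install "nemo_toolkit[all]==1.20.0"
```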

@leilei183 (Author)

Hi @stevehuang52, thank you very much for your answer. I will use the stable version for now.

@github-actions (Contributor)

This issue is stale because it has been open for 30 days with no activity. Remove the stale label or comment, or this will be closed in 7 days.

github-actions bot added the stale label Oct 28, 2023
github-actions bot commented Nov 4, 2023

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions bot closed this as not planned Nov 4, 2023