Issues around Resumed Runs #12290

Open
aflah02 opened this issue Feb 20, 2025 · 2 comments
Labels
bug Something isn't working

Comments

aflah02 commented Feb 20, 2025

Hi

I was testing resumed runs to make sure they work correctly and noticed two issues:

  • Suppose the checkpoint frequency is every 10 steps and the last saved checkpoint is at step 79 (NeMo uses 0-based indexing, to the best of my understanding). Up to that point, all wandb logs land at steps 9, 19, 29, and so on, but the resumed run logs at steps 80, 90, and so on. The spacing is therefore inconsistent with the initial run.
  • Val loss is logged against step rather than global step, so the wandb curves for every resumed run start from the same point and do not look like a continuation. Other metrics, such as reduced train loss, use global step, so resumed runs fit into the same plot nicely.
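
To make the first point concrete, here is a toy sketch that reproduces the step numbers described above. It is not NeMo's logging code: the function `logged_steps` and its `count_completed` flag are hypothetical, and the two counting conventions are only an assumption chosen because they yield the observed numbers.

```python
def logged_steps(first_step, last_step, interval, count_completed):
    """Return the 0-based steps in [first_step, last_step] at which a log fires.

    count_completed=True  -> fire when (step + 1) is a multiple of interval
    count_completed=False -> fire when step itself is a multiple of interval
    Both conventions are assumptions used to mimic the reported behaviour.
    """
    steps = []
    for step in range(first_step, last_step + 1):
        counter = step + 1 if count_completed else step
        if counter % interval == 0:
            steps.append(step)
    return steps


# Initial run: steps 0..79, logging every 10 completed steps.
original = logged_steps(0, 79, 10, count_completed=True)
# Resumed run: continues at step 80, but fires on multiples of the step index.
resumed = logged_steps(80, 99, 10, count_completed=False)

print(original)  # [9, 19, 29, 39, 49, 59, 69, 79]
print(resumed)   # [80, 90]
```

Under these assumptions the last point of the initial run (79) and the first point of the resumed run (80) are one step apart instead of ten, which is the inconsistent spacing described in the first bullet.
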
aflah02 added the bug label Feb 20, 2025

aflah02 commented Feb 20, 2025

The code is the same as in this issue: #12284


aflah02 commented Feb 20, 2025

I was able to fix issue 2 by tinkering in the wandb console, but the inconsistent step offsets after resuming a run remain an issue.
