Issues around Resumed Runs #12290

aflah02 · 2025-02-20T17:58:26Z

Hi

I was testing resumed runs to make sure they work correctly and noticed 2 issues:

1. Suppose the checkpoint frequency is every 10 steps and the last saved checkpoint is at step 79 (NeMo uses 0-based indexing, to the best of my understanding). Up to that point, all wandb logs land at steps 9, 19, 29, and so on, but the resumed run logs at steps 80, 90, and so on. This makes the spacing inconsistent with the initial run.
2. Val loss is logged against step instead of global step, so the wandb curves for resumed runs all start from the same starting point and do not look like a continuation of the original run. Other metrics, such as reduced train loss, use global step, so resumed runs fit into the same plot nicely.
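
To make the spacing problem in the first issue concrete, here is a minimal sketch in plain Python. It only reproduces the step pattern described above and is not NeMo's actual logging logic. The helper `logged_steps`, its `offset` parameter, and the end step of 119 for the resumed run are assumptions made for illustration.

```python
def logged_steps(first_step, last_step, every, offset):
    """Return the steps in [first_step, last_step] where (step + offset) % every == 0.

    Hypothetical helper: offset=1 reproduces logs at 9, 19, 29, ...
    and offset=0 reproduces logs at 80, 90, ...
    """
    return [s for s in range(first_step, last_step + 1) if (s + offset) % every == 0]


# Initial run: 0-based steps 0..79, with logs observed at 9, 19, ..., 79.
initial = logged_steps(0, 79, every=10, offset=1)

# Resumed run: continues at step 80, with logs observed at 80, 90, ...
# (119 is an arbitrary end step chosen for this sketch.)
resumed = logged_steps(80, 119, every=10, offset=0)

# Gaps between consecutive logged steps across both runs.
combined = initial + resumed
gaps = [b - a for a, b in zip(combined, combined[1:])]

print(initial)
print(resumed)
print(gaps)
```

The gap between the last log of the initial run (step 79) and the first log of the resumed run (step 80) is 1 instead of 10, which is the inconsistency visible in the wandb plot.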

aflah02 · 2025-02-20T17:59:43Z

The code is the same as in this issue: #12284

aflah02 · 2025-02-20T18:07:47Z

I was able to fix issue 2 by tinkering in the wandb console, but the inconsistent step start points remain an issue after resuming a run.

aflah02 added the bug label on Feb 20, 2025
