I was testing resuming runs to make sure it works well and noticed two issues:
1. Suppose I set the checkpoint frequency to every 10 steps and the last saved checkpoint is at step 79 (NeMo uses 0-based step indexing, to the best of my understanding). Up to that point, all wandb logs land on steps 9, 19, 29, and so on, but the resumed run logs at steps 80, 90, and so on. This makes the logging spacing inconsistent with the initial run.
2. Val loss is logged against step instead of global step, so the val loss curve of a resumed run starts from the same point as the original run and does not look like a continuation. Other metrics, such as reduced train loss, use global step, so resumed runs fit into the same plot nicely.
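To make the two observations concrete, here is a small pure-Python sketch of the arithmetic. It only models the behaviour reported above and is not NeMo's actual logging code; the function names `logged_steps` and `to_global_step`, and the two modulo rules, are assumptions made for illustration.

```python
def logged_steps(start_step, end_step, every, shifted):
    """Return the steps in [start_step, end_step) at which a metric is logged.

    shifted=True  models logging when (step + 1) % every == 0 (9, 19, 29, ...).
    shifted=False models logging when step % every == 0 (80, 90, ...).
    Both rules are assumptions that reproduce the reported step values.
    """
    if shifted:
        return [s for s in range(start_step, end_step) if (s + 1) % every == 0]
    return [s for s in range(start_step, end_step) if s % every == 0]


def to_global_step(local_step, last_checkpoint_step):
    """Map a step counted from the resume point onto the global step axis.

    Assumes 0-based steps, so the first resumed step is last_checkpoint_step + 1.
    """
    return last_checkpoint_step + 1 + local_step


# Issue 1: spacing before and after resuming from the step-79 checkpoint.
initial = logged_steps(0, 80, every=10, shifted=True)
resumed = logged_steps(80, 110, every=10, shifted=False)
gap_at_resume = resumed[0] - initial[-1]  # a gap of 1 instead of 10

# Issue 2: val loss plotted against a local step restarts at 0 after resume,
# while the same points on the global step axis continue from 80.
local_val_steps = [0, 10, 20]
global_val_steps = [to_global_step(s, last_checkpoint_step=79) for s in local_val_steps]
```

With a consistent rule on both sides of the resume, the first resumed log would be expected at step 89 rather than 80, and plotting val loss against the global step would place the resumed points after the original ones instead of on top of them.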