Description
We've experienced a problem with wandb logging during a Trinity run:
As far as we can tell, we just ran the same Trinity config on a new set of machines, but for some reason no Trinity metrics were uploaded to wandb. We waited for 20 hours. According to the Ray Dashboard, Trinity had completed some initial validation and then some training steps, yet nothing was uploaded to wandb except system metrics.
I propose logging some sort of heartbeat to wandb (and maybe adding a config option to force commit=True), so that some custom data is logged to wandb almost immediately.
Then, if nothing gets logged or uploaded, we can immediately assume or confirm a problem with wandb. Currently, with eval_on_startup: true, nothing is logged until _finish_explore_step(...) and _finish_eval_step(...), which can take a while.
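To make the proposal concrete, here is a rough sketch of what such a heartbeat could look like. It is not Trinity code: the `Heartbeat` class, the `force_commit` flag, and the `heartbeat/...` metric keys are all hypothetical names chosen for illustration. The logger is passed in as a callable so the sketch runs without wandb installed; in practice `wandb.log` could be passed as `log_fn`.

```python
import time
from typing import Any, Callable, Dict


class Heartbeat:
    """Logs a tiny counter so a run shows custom data early (hypothetical helper)."""

    def __init__(
        self,
        log_fn: Callable[..., Any],          # e.g. wandb.log (assumption: accepts data and commit=)
        force_commit: bool = True,           # hypothetical config option from the proposal
        key: str = "heartbeat/count",        # hypothetical metric name
    ) -> None:
        self._log_fn = log_fn
        self._force_commit = force_commit
        self._key = key
        self._count = 0

    def beat(self) -> Dict[str, float]:
        """Send one heartbeat record and return the payload that was logged."""
        self._count += 1
        payload = {self._key: self._count, "heartbeat/unix_time": time.time()}
        # commit=None leaves the decision to the logger's default behaviour;
        # commit=True asks it to flush this record right away.
        commit = True if self._force_commit else None
        # Deliberately no step= argument here.
        self._log_fn(payload, commit=commit)
        return payload
```

Calling `Heartbeat(wandb.log).beat()` once right after `wandb.init(...)`, before the startup eval begins, would be enough to tell "wandb is not receiving anything" apart from "Trinity has not reached its first logging point yet". Whether extra committed heartbeat rows would interfere with Trinity's own step numbering would need checking before adopting something like this.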
Also, wandb has had a bug where step= was forcing it to ignore commit=True (not sure if this bug is still present):