Log some sort of heartbeats to wandb (both from trainer and explorer): both at startup and in loops #238

@vadimkantorov

We've experienced a problem with wandb logging during a Trinity run:

We ran what appears to be the same Trinity config on a new set of machines, but for some reason no Trinity metrics were uploaded to wandb. We waited for 20 hours; according to the Ray Dashboard, Trinity had completed some initial validation and then some training steps, yet nothing was uploaded to wandb except system metrics.

I propose logging some sort of heartbeat to wandb (and maybe allowing the config to force commit=True), so that some custom data is logged to wandb almost immediately.

Then, if nothing gets logged or uploaded, a problem with wandb can be assumed or confirmed right away. Currently, with eval_on_startup: true, nothing is logged until _finish_explore_step(...) and _finish_eval_step(...), which can take a while.
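
To make the proposal concrete, here is a minimal sketch of what such a heartbeat helper could look like. The names `make_heartbeat`, `role`, `stage`, and the `heartbeat/...` metric keys are hypothetical and not part of Trinity; the only assumption about wandb is that the logging callable accepts `wandb.log`-style arguments (`data`, `step=`, `commit=`).

```python
import time
from typing import Any, Callable, Dict, Optional


def make_heartbeat(log_fn: Callable[..., Any], role: str) -> Callable[..., Dict[str, float]]:
    """Build a heartbeat function for one component (e.g. "trainer" or "explorer").

    log_fn is assumed to take wandb.log-style arguments: (data, step=..., commit=...).
    """
    start = time.time()
    count = 0

    def heartbeat(stage: str, step: Optional[int] = None) -> Dict[str, float]:
        nonlocal count
        count += 1
        payload = {
            # Running heartbeat counter, keyed by where it was emitted.
            f"heartbeat/{role}/{stage}": float(count),
            # Seconds since this component created its heartbeat function.
            f"heartbeat/{role}/uptime_s": time.time() - start,
        }
        # Always request an immediate commit; pass step= only when the caller gives one.
        kwargs: Dict[str, Any] = {"commit": True}
        if step is not None:
            kwargs["step"] = step
        log_fn(payload, **kwargs)
        return payload

    return heartbeat


# Hypothetical usage inside the trainer or explorer:
#   import wandb
#   beat = make_heartbeat(wandb.log, role="explorer")
#   beat("startup")        # right after wandb.init(...)
#   beat("loop")           # at the top of every loop iteration
```

One thing to watch with this approach: as far as I know, wandb expects step values to only increase, so heartbeats logged without step= advance its internal step counter and could clash with later calls that pass an explicit step=. Passing the current training step to the heartbeat, or giving heartbeat metrics their own x-axis, may be needed.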

Also, wandb has had a bug where step= was forcing it to ignore commit=True (not sure if this bug is still present):
