Graph events inconsistently evicted after session log START event #4743

@wchargin


The following event file (posted with permission) renders incorrectly in
the graphs dashboard:
graph_failure.tfevents.zip

The /info route thinks that a graph exists, but the /graph route
404s when trying to fetch it.

The salient property of this event file is that it has a graph_def
event followed by a session_log { status: START } event. The latter
event signals that all preceding events should be purged across all
tags. The docstring for that purge is wrong in two ways:

  • it says “purge all previously seen events with larger steps”, but
    the code actually purges events with larger-or-equal steps, which
    matters because both the graph_def and the session_log are at
    step 0; and
  • it says “all” events, but actually only purges tensors.

The fact that this purge can happen at all means that it is possible for
a time series to have summary metadata but not actually have any data,
which is something that we assumed could not happen (since a normal
preemption event always leaves at least one point in the reservoir). And
the fact that it only purges tensors means that this only started
affecting graphs after #4470, which changed the read path from Graph()
to Tensors("__run_graph__"). These have equivalent input streams due
to dataclass_compat, but only Tensors(...) gets purged.

We should fix the data provider implementation to not list blob
sequences that do not actually have tensor data, so that the graphs
dashboard is internally consistent. We should probably also consider
the broader implications of this.
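One way to picture the proposed fix is a listing step that skips tags with no surviving data. This is only a sketch under stated assumptions: `list_nonempty_blob_sequences` and its two dict arguments are hypothetical and do not reflect the real data provider's signature.

```python
def list_nonempty_blob_sequences(metadata, tensors):
    """Return tags that have metadata AND at least one data point.

    Hypothetical helper: `metadata` maps tag -> summary metadata and
    `tensors` maps tag -> list of (step, payload) points.
    """
    return sorted(tag for tag in metadata if tensors.get(tag))


# "__run_graph__" kept its metadata but lost its only point to a purge.
listed = list_nonempty_blob_sequences(
    metadata={"__run_graph__": {}, "other_blob": {}},
    tensors={"__run_graph__": [], "other_blob": [(0, b"payload")]},
)
```

With this filter, the listing and the read path agree: a tag is only advertised when a read for it can return data.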

(This does not affect --load_fast, because RustBoard does not care
about session_log events.)
