diff --git a/tensorboard/data/provider.py b/tensorboard/data/provider.py index 25f2d7b715..4c4180fe0c 100644 --- a/tensorboard/data/provider.py +++ b/tensorboard/data/provider.py @@ -33,6 +33,56 @@ class DataProvider(object): providers may be asked to implement more filtering mechanisms, such as downsampling strategies or domain restriction by step or wall time. + The data provider interface specifies three *data classes*: scalars, + tensors, and blob sequences. All data is stored in *time series* for + one of these data classes. A time series is identified by run name and + tag name (each a non-empty text string), as well as an experiment ID + and plugin name (see below). Points in a time series are uniquely + indexed by *step*, an arbitrary non-negative integer. Each point in a + time series also has an associated wall time, plus its actual value, + which is drawn from the corresponding data class. + + Each point in a scalar time series contains a single scalar value, as + a 64-bit floating point number. Scalars are "privileged" rather than + being subsumed under tensors because there are useful operations on + scalars that don't make sense in the general tensor case: e.g., "list + all scalar time series with tag name `accuracy` whose exponentially + weighted moving average is at least 0.999". + + Each point in a tensor time series contains a tensor of arbitrary + dtype (including byte strings and text strings) and shape (including + rank-0 tensors, a.k.a. scalars). Each tensor is expected to be + "reasonably small" to accommodate common database cell size limits. + For instance, a histogram with a bounded number of buckets (say, 30) + occupies about 500 bytes, and a PR curve with a bounded number of + thresholds (say, 201) occupies about 5000 bytes. These are both well + within typical database tolerances (Google Cloud Spanner: 10 MiB; + MySQL: 64 KiB), and would be appropriate to store as tensors. By + contrast, image, audio, or model graph data may easily be multiple + megabytes in size, and so should be stored as blobs instead. The + tensors at each step in a time series need not have the same dtype or + shape. + + Each point in a blob sequence time series contains an ordered sequence + of zero or more blobs, which are arbitrary data with no tensor + structure. These might represent PNG-encoded image data, protobuf wire + encodings of TensorFlow graphs, or PLY-format 3D mesh data, for some + examples. This data class provides blob *sequences* rather than just + blobs because it's common to want to take multiple homogeneous samples + of a given time series: say, "show me the bounding box classifications + for 3 random inputs from this batch". A single blob can of course be + represented as a blob sequence that always has exactly one element. + + Every time series belongs to a specific experiment and is owned by a + specific plugin. (Thus, the "primary key" for a time series has four + components: experiment, plugin, run, tag.) The experiment ID is an + arbitrary URL-safe non-empty text string, whose interpretation is at + the discretion of the data provider. As a special case, the empty + string as an experiment ID denotes that no experiment was given. Data + providers may or may not fully support an empty experiment ID. The + plugin name should correspond to the `plugin_data.plugin_name` field + of the `SummaryMetadata` proto passed to `tf.summary.write`. + Unless otherwise noted, any methods on this class may raise errors defined in `tensorboard.errors`, like `tensorboard.errors.NotFoundError`. """