Skip to content

Conversation

@wchargin
Copy link
Contributor

@wchargin wchargin commented Apr 14, 2020

Summary:
This patch teaches the audio plugin to support generic data providers.

This requires a few more list_blob_sequences queries than we’d like,
due to the current structure of the frontend. Some have been avoided by
pushing the content type of the waveform into the query string. A bit of
care is required to ensure that a malicious request cannot turn this
into an XSS hole; we do so by simply whitelisting a (singleton) set of
audio content types, which means that the safety properties are
unchanged by this patch.

Test Plan:
The audio dashboard now works even if the multiplexer is removed. The
audio dashboard doesn’t have any “extras” (data download links, etc.),
so just checking the basic functionality suffices. Unit tests updated.

wchargin-branch: audio-generic

wchargin added 9 commits April 9, 2020 11:15
Summary:
Most versions of the audio summary APIs only support writing audio data,
but the TB 1.x `tensorboard.summary.audio` ops also support attaching a
“label” (Markdown text) to each audio clip. This feature is not present
in any version of the TensorFlow summary APIs, including `tf.contrib`,
and is not present in the TB 2.x `tensorboard.summary.audio` API. It
hasn’t been widely adopted, and doesn’t integrate nicely in its current
form with upcoming efforts like migrating the audio dashboard to use the
generic data APIs.

This commit causes labels to no longer be read by TensorBoard. Labels
are still written by the TB 1.x summary ops if specified, but now print
a warning. Because we do not change the write paths, it is important
that `audio.metadata` *not* set `DATA_CLASS_BLOB_SEQUENCE`, because
otherwise audio summaries will not be touched by `dataclass_compat`.
Tests for `dataclass_compat` cover this.

We don’t think that this feature has wide adoption. If this change gets
significant pushback, we can look into restoring labels with a different
implementation, likely as a parallel tensor time series.

Test Plan:
Unit tests updated. As a manual test, TensorBoard still works on both
legacy audio data and audio data in the new form (that’s not actually
written to disk yet), and simply does not display any labels.

wchargin-branch: audio-no-labels
Summary:
We initially used `dataclass_compat` to perform filtering of large
graphs as a stopgap mechanism. This commit moves that filtering into the
uploader, which is the only surface in which it’s actually used. As a
result, `dataclass_compat` no longer takes extra arguments and so can be
moved into `EventFileLoader` in a future change.

Test Plan:
Unit tests added to the uploader for the small graph, large graph, and
corrupt graph cases.

wchargin-branch: uploader-graph-filtering
wchargin-source: 00af6ebe200fe60aacbc23f468eb44126d2d33f7
Summary:
The `data_compat` and `dataclass_compat` transforms are now effected by
the `EventFileLoader` class. The event accumulator and uploader
therefore no longer need to effect these transformations manually.

Test Plan:
Unit tests that use mocking have been updated to apply compat transforms
manually. Using the uploader with `--plugin scalars,graphs` still works
as intended.

wchargin-branch: efl-compat-transforms
wchargin-source: d5b86d851f5dff6f24ddc8cfac52c740ab8af1b3
Summary:
The `dataclass_compat` transform dispatches on plugin name, which is
stored in the summary metadata. But the semantics of summary metadata is
that it need only be set in the first event of each time series. Thus,
until now, events past the first step of a time series have not been
modified by `dataclass_compat`. This has been fine because all
transformations have been either (a) only metadata transforms (which
don’t matter after the first step anyway) or (b) tensor transforms to
hparams data (which only ever has one step anyway). But an upcoming
change will need to transform every step of each audio time series, so
we need to identify which summaries have audio data even when they don’t
have plugin name set on the same event.

To do so, we let `dataclass_compat` update a stateful map from tag name
to the first `SummaryMetadata` seen for that time series (scoped to a
particular run).

Test Plan:
Some unit tests included now; the tests for migration of audio summaries
will be more complete because they have more interesting things to test.

wchargin-branch: dataclass-compat-initial-metadata
wchargin-source: 364d3b33f32452e8d3ac9468e6bcca4de876963a
wchargin-branch: uploader-graph-filtering
wchargin-source: 9179acc76158380d9f254df1d0a0aae1ec82df6c
wchargin-branch: efl-compat-transforms
wchargin-source: 4fa593a34147c7b05c7f4c55f3ce9279242a7414
wchargin-branch: audio-no-labels
wchargin-source: 7c7844bff26668d32d692ef7faa0dba3ecbaf138

# Conflicts:
#	tensorboard/dataclass_compat.py
wchargin-branch: audio-no-labels
wchargin-source: 7c7844bff26668d32d692ef7faa0dba3ecbaf138
Summary:
This patch teaches the audio plugin to support generic data providers.

This requires a few more `list_blob_sequences` queries than we’d like,
due to the current structure of the frontend. Some have been avoided by
pushing the content type of the waveform into the query string. A bit of
care is required to ensure that a malicious request cannot turn this
into an XSS hole; we do so by simply whitelisting a (singleton) set of
audio content types, which means that the safety properties are
unchanged by this patch.

Test Plan:
With `--generic_data true`, the audio dashboard now works even if the
multiplexer is removed. The audio dashboard doesn’t have any “extras”
(data download links, etc.), so just checking the basic functionality
suffices. Unit tests also updated.

wchargin-branch: audio-generic
wchargin-source: 5b223026c6cb8e22bfe5d8d9654459f61fa3a45e
wchargin added 19 commits April 14, 2020 01:18
wchargin-branch: uploader-graph-filtering
wchargin-source: b89f6da9cd740080aefecec49194752b0de85dd4
wchargin-branch: uploader-graph-filtering
wchargin-source: b89f6da9cd740080aefecec49194752b0de85dd4
wchargin-branch: efl-compat-transforms
wchargin-source: 8abfd3ec5c7394e7b1930b4bf42d2de816e67da4

# Conflicts:
#	tensorboard/uploader/uploader.py
wchargin-branch: efl-compat-transforms
wchargin-source: 8abfd3ec5c7394e7b1930b4bf42d2de816e67da4
wchargin-branch: dataclass-compat-initial-metadata
wchargin-source: cfd81441162803b6bab104bc76d93c3b00e8c381
wchargin-branch: audio-no-labels
wchargin-source: eb55e6cf2dcb2dc50a10201296e04c3a030bca3d
wchargin-branch: audio-generic
wchargin-source: 0d4ed8098d4fdd3e7353ce81e45237abed8ffd0b
Summary:
Fixes an oversight in #3507: we can’t assert that the raw bytes are what
was expected because the code under test does a proto serialization
roundtrip, which is permitted to permute keys.

Test Plan:
Tests still pass, and this should fix an internal sync error.

wchargin-branch: uploader-test-proto-equal
wchargin-branch: uploader-graph-filtering
wchargin-source: 30f0764698d66cdaf481c797dc56bf6a34ae3ae0
wchargin-branch: efl-compat-transforms
wchargin-source: f18ea5648a96010b48237fe7743e5f5d0d535acb
wchargin-branch: dataclass-compat-initial-metadata
wchargin-source: 9f516c3b25569b4e007bb929bb456541e5433f52
wchargin-branch: audio-no-labels
wchargin-source: 7a379545e5195047d46d162de36606433a04cf18
wchargin-branch: audio-generic
wchargin-source: fc3b8e088f8b71b092396755275fb8a895755dba
wchargin-branch: uploader-graph-filtering
wchargin-source: 63adea8ca2f0a85bd9f061aee72435f04d1d56d8

# Conflicts:
#	tensorboard/uploader/uploader_test.py
wchargin-branch: uploader-graph-filtering
wchargin-source: 63adea8ca2f0a85bd9f061aee72435f04d1d56d8
wchargin-branch: efl-compat-transforms
wchargin-source: 4fdc21c37611fab6a00384e9e824cd60302f6def
wchargin-branch: efl-compat-transforms
wchargin-source: 124d838ae368a9d9c9f061a39a72056554c89790

# Conflicts:
#	tensorboard/uploader/uploader.py
#	tensorboard/uploader/uploader_test.py
wchargin-branch: efl-compat-transforms
wchargin-source: 124d838ae368a9d9c9f061a39a72056554c89790
wchargin-branch: efl-compat-transforms
wchargin-source: 41caa7efc7790458360d47924ce80e0d8f0ef248
wchargin-branch: dataclass-compat-initial-metadata
wchargin-source: 9bceb864dae1a69432349daf80ebb4db6e19278b
wchargin-branch: dataclass-compat-initial-metadata
wchargin-source: 9bceb864dae1a69432349daf80ebb4db6e19278b
wchargin-branch: audio-no-labels
wchargin-source: f0ba8b1511c55b4596450e49d7dbe69d2170fdcc
wchargin-branch: audio-generic
wchargin-source: be1113153b02927ea35b953598e5f37866028801
@wchargin wchargin requested a review from nfelt April 16, 2020 17:49
wchargin-branch: dataclass-compat-initial-metadata
wchargin-source: e87f73c1f892413fb38f913c3da0db230f9d8c61

# Conflicts:
#	tensorboard/backend/event_processing/event_file_loader.py
#	tensorboard/backend/event_processing/plugin_event_accumulator_test.py
#	tensorboard/uploader/uploader_test.py
wchargin-branch: dataclass-compat-initial-metadata
wchargin-source: e87f73c1f892413fb38f913c3da0db230f9d8c61
wchargin-branch: audio-no-labels
wchargin-source: 3ef10438c3ca65d6cff434e277b72807fd27cf36
wchargin-branch: audio-generic
wchargin-source: a8e5986d9788550a07672f1bf203d27d8c289a50
return bool(
self._multiplexer.PluginRunToTagToContent(metadata.PLUGIN_NAME)
)
return False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add the # `list_plugins` as called by TB core suffices comment here as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

experiment = plugin_util.experiment_id(request.environ)
mime_type = request.args["content_type"]
if mime_type not in _ALLOWED_MIME_TYPES:
raise errors.InvalidArgumentError(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this possible issue! Can we add a test that exercises this to ensure that we refuse to return audio as, say, text/javascript?

I think the Werkzeug client get() API takes a content_type parameter so this should be straightforward.

In the longer term it would be nice if we didn't have to rely on the client behavior for this. Perhaps we should add a per-blob content_type onto the blob data model? We would have to go around adding it to the various data providers and being sure to set it correctly on upload, but it seems worth doing to properly handle cases like this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure: done, I think, though I didn’t understand what you meant about the
Werkzeug client API taking a content_type parameter. (Isn’t that for
the content type of a POST request body?)

Perhaps we should add a per-blob content_type onto the blob data
model?

Perhaps. Presumably the source of truth would be a new field on the
SummaryMetadata proto (string content_type = 5;), required to be
homogeneous within a time series across all steps and indices, and
defaulting to application/octet-stream if empty for backward
compatibility? I’ll keep it in mind.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I was confused and misread this and somehow thought we were getting the content type from a header that the browser was populating (even though yes, that isn't a thing for GET requests) rather than the query parameter. Never mind :P

Re: per-blob content types, yeah, I guess it could have a field in SummaryMetadata, although it's perhaps a little odd to have a field that is returned per-value stored in the per-tag SummaryMetadata (even if for now we don't anticipate allowing it to vary per-value).

wchargin-branch: dataclass-compat-initial-metadata
wchargin-source: ba0efc9d5159b2f74982e63b6ca884c7f2380495
wchargin-branch: audio-no-labels
wchargin-source: dc72cdd9d4dd9383e196b556b6a91917cfd9b7dd

# Conflicts:
#	tensorboard/dataclass_compat.py
wchargin-branch: audio-no-labels
wchargin-source: dc72cdd9d4dd9383e196b556b6a91917cfd9b7dd
wchargin-branch: audio-generic
wchargin-source: 2ec65f28a77e56f5cf5c786d7020f652a2cdde36
wchargin-branch: audio-generic
wchargin-source: 2ec65f28a77e56f5cf5c786d7020f652a2cdde36
experiment = plugin_util.experiment_id(request.environ)
mime_type = request.args["content_type"]
if mime_type not in _ALLOWED_MIME_TYPES:
raise errors.InvalidArgumentError(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I was confused and misread this and somehow thought we were getting the content type from a header that the browser was populating (even though yes, that isn't a thing for GET requests) rather than the query parameter. Never mind :P

Re: per-blob content types, yeah, I guess it could have a field in SummaryMetadata, although it's perhaps a little odd to have a field that is returned per-value stored in the per-tag SummaryMetadata (even if for now we don't anticipate allowing it to vary per-value).

wchargin-branch: audio-no-labels
wchargin-source: adbc9272e97c6ba2a70d84cf055baaab7d166f25

# Conflicts:
#	tensorboard/dataclass_compat.py
wchargin-branch: audio-no-labels
wchargin-source: adbc9272e97c6ba2a70d84cf055baaab7d166f25
wchargin-branch: audio-generic
wchargin-source: cfcd8b069dc99defbedfe71f93f5444ad62d66da
@wchargin wchargin changed the base branch from wchargin-audio-no-labels to master April 17, 2020 01:44
wchargin-branch: audio-generic
wchargin-source: ee6bb07e2331836178cd4329427286f69abe43ad

# Conflicts:
#	tensorboard/plugins/audio/audio_plugin.py
wchargin-branch: audio-generic
wchargin-source: ee6bb07e2331836178cd4329427286f69abe43ad
@wchargin wchargin merged commit b85fd87 into master Apr 17, 2020
@wchargin wchargin deleted the wchargin-audio-generic branch April 17, 2020 03:57
caisq pushed a commit to caisq/tensorboard that referenced this pull request May 19, 2020
Summary:
This patch teaches the audio plugin to support generic data providers.

This requires a few more `list_blob_sequences` queries than we’d like,
due to the current structure of the frontend. Some have been avoided by
pushing the content type of the waveform into the query string. A bit of
care is required to ensure that a malicious request cannot turn this
into an XSS hole; we do so by simply whitelisting a (singleton) set of
audio content types, which means that the safety properties are
unchanged by this patch.

Test Plan:
The audio dashboard now works even if the multiplexer is removed. The
audio dashboard doesn’t have any “extras” (data download links, etc.),
so just checking the basic functionality suffices. Unit tests updated.

wchargin-branch: audio-generic
caisq pushed a commit that referenced this pull request May 27, 2020
Summary:
This patch teaches the audio plugin to support generic data providers.

This requires a few more `list_blob_sequences` queries than we’d like,
due to the current structure of the frontend. Some have been avoided by
pushing the content type of the waveform into the query string. A bit of
care is required to ensure that a malicious request cannot turn this
into an XSS hole; we do so by simply whitelisting a (singleton) set of
audio content types, which means that the safety properties are
unchanged by this patch.

Test Plan:
The audio dashboard now works even if the multiplexer is removed. The
audio dashboard doesn’t have any “extras” (data download links, etc.),
so just checking the basic functionality suffices. Unit tests updated.

wchargin-branch: audio-generic
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants