Conversation

Contributor

@caisq caisq commented Dec 18, 2019

  • Add unit tests with dummy tfdbg2 data
  • The run name (if it exists) is hard-coded to a magic string ("__default_debugger_run__"), as tfdbg v2 currently assumes each directory contains at most one DebugEvent file set.
  • The plugin's serving routes are based on a newly created stub DataProvider implementation.
  • The implementation is specialized for local DebugEvent file sets and serves the debugger v2 plugin specifically. It can later be integrated with a DataProvider implementation that supports a more diverse set of plugins.

- Add unit tests with dummy tfdbg2 data
Contributor Author

caisq commented Dec 19, 2019

This is WIP. I'll add the DataProvider abstraction and fix the tests before sending the PR out for review.

@stephanwlee
Contributor

This is WIP. I'll add the DataProvider abstraction and fix the tests before sending the PR out for review

Consider opening a draft PR in the future.

Contributor Author

caisq commented Dec 19, 2019

@stephanwlee OK. I didn't know about GitHub's draft-PR feature; now I know. But just to confirm: by default, PRs opened without assigning reviewers are assumed to be drafts, right?

@stephanwlee
Contributor

PRs opened without assigning reviewers are assumed to be drafts, right?

Yes, but I've found people reading the code anyway, so I thought it would be nice to be a little more explicit.

Contributor Author

caisq commented Dec 19, 2019

@stephanwlee OK. I'll use the draft-PR feature going forward.

@wchargin
Contributor

I'll add the DataProvider abstraction and fix the tests before sending
the PR out for review.

Maybe we should talk offline, but I’m not sure what you mean by this
comment. There’s still some infrastructural work that needs to be done
before the data provider can be used for blob data by other plugins:
essentially, de-hacking what’s in the graph implementation of the
multiplexer data provider. This is high-interest debt, and I’d strongly
encourage not building on top of it. There wasn’t too much flexibility
for the graph given that the data format is old, but for the debugger
plugin we have more freedom.

@wchargin
Contributor

The outstanding work that I mention is at the top of my priority list
for exactly this reason, and I plan to resume working on it after this
fix-it week.

    def serve_runs(self, request):
        runs = []
        try:
            # pylint:disable=g-import-not-at-top
Contributor

drive-by: We removed all these suppressions in #3022; you can drop them.

Contributor

@wchargin wchargin left a comment

High-level feedback: #3051 (comment)

# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""An implementation of DataProvider that serves tfdbg v2 data.
Contributor

Please explicitly document that the existence of this module is only a
short-term hack to unblock debugger work, and that it is not intended to
ever be used in production.

Contributor Author

Done. Also note that this file has moved to debug_data_provider.py.

Comment on lines 34 to 35
In this implementation, `experiment_id` is assumed to be the path to the
logdir that contains the DebugEvent file set.
Contributor

In our discussion, I said that I’d be fine with drafting a temporary
data provider to unblock development of this plugin while we figure out
the data loading story. But it’s important that that short-term provider
be a valid implementation of the DataProvider interface and that the
client interact with it according to the documented APIs, so that we can
seamlessly replace this by a real data provider, which will have no
special logic for the debugger plugin. So, for instance:

  • The log directory shouldn’t be communicated via the experiment ID,
    because the logdir is given via a command-line flag while the
    experiment ID is given by a URL component. (For this reason, the
    experiment ID also cannot contain slashes, which makes it a poor
    container for a filesystem path.) Instead, it probably makes more
    sense to mirror the MultiplexerDataProvider and accept the logdir
    at data provider initialization time, then start a background thread
    to reload under that directory, like the multiplexer does. Then an
    instance of the data provider is always associated with a logdir,
    and list_runs can list the runs under that logdir that have debug
    data.

  • Methods need to be independent. Calling list_runs must not affect
    the future behavior of (say) list_blob_sequences. Methods may
    depend on state that’s being mutated by a background thread, as in
    the case of the multiplexer, but the methods shouldn’t themselves
    modify state.

  • The result of list_runs must be a list of provider.Run values,
    not a list of strings.

  • How is the “default debugger run name” behavior going to map to a
    real provider? Perhaps that bit’s just not implemented yet, in which
    case it’s fine, but I don’t see the plan forward.

Does this make sense? Ideally, once the normal loading flow is complete,
the only change necessary will be to change the self._data_provider
assignment to read from context._data_provider instead of constructing
a new instance, and everything should Just Work™.
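The constraints above (logdir bound at construction time, stateless methods, `provider.Run` values rather than strings) can be illustrated with a minimal sketch. `Run` below is a hypothetical namedtuple standing in for `tensorboard.data.provider.Run`, and `SketchDebuggerDataProvider` is an invented name, not the PR's actual class:

```python
import collections
import os

# Hypothetical stand-in for tensorboard.data.provider.Run; the real class
# carries structured fields rather than being a bare string.
Run = collections.namedtuple("Run", ["run_id", "run_name", "start_time"])

# Mirrors the magic string discussed in this PR.
DEFAULT_DEBUGGER_RUN_NAME = "__default_debugger_run__"


class SketchDebuggerDataProvider(object):
    """Illustrative sketch, not the PR's implementation."""

    def __init__(self, logdir):
        # The logdir is bound at initialization time (it comes from a
        # command-line flag), not smuggled through the experiment ID,
        # which is a URL component and cannot contain slashes.
        self._logdir = logdir

    def list_runs(self, experiment_id):
        # Stateless: reads current state but mutates nothing, so calling
        # it cannot affect the future behavior of other methods.
        del experiment_id  # Unused in this single-logdir sketch.
        if not os.path.isdir(self._logdir):
            return []
        # Returns Run-like values, not bare strings.
        return [
            Run(
                run_id=DEFAULT_DEBUGGER_RUN_NAME,
                run_name=DEFAULT_DEBUGGER_RUN_NAME,
                start_time=0.0,
            )
        ]
```

A real provider would enumerate runs from disk (likely via a background reload thread, as the multiplexer does); the single hard-coded run here just mirrors the one-file-set-per-logdir limitation discussed in this thread.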

Contributor Author

@wchargin Thanks for the thoughtful review!

The log directory shouldn’t be communicated via the experiment ID,
because the logdir is given via a command-line flag while the
experiment ID is given by a URL component. (For this reason, the
experiment ID also cannot contain slashes, which makes it a poor
container for a filesystem path.) Instead, it probably makes more
sense to mirror the MultiplexerDataProvider and accept the logdir
at data provider initialization time, then start a background thread
to reload under that directory, like the multiplexer does. Then an
instance of the data provider is always associated with a logdir,
and list_runs can list the runs under that logdir that have debug
data.

I have created a new class DebuggerV2EventMultiplexer in the new file debug_event_multiplexer.py. It loosely mirrors EventMultiplexer. As you suggested, it is constructed with a logdir; experiment_id no longer serves as a substitute logdir.

Methods need to be independent. Calling list_runs must not affect
the future behavior of (say) list_blob_sequences. Methods may
depend on state that’s being mutated by a background thread, as in
the case of the multiplexer, but the methods shouldn’t themselves
modify state.

I've made changes so that list_runs() calls are "stateless" and do not modify state. The read_blob_sequences() and read_blob() methods, whose implementation has not started in this PR, will also follow this principle of independence.

The result of list_runs must be a list of provider.Run values,
not a list of strings.

Done.

How is the “default debugger run name” behavior going to map to a
real provider? Perhaps that bit’s just not implemented yet, in which
case it’s fine, but I don’t see the plan forward.

The default debugger run name is a short-term hack necessitated by the fact that a logdir currently cannot contain more than one DebugEvent file set (the reader class in tensorflow will error out if more than one set is found). Going forward, we'll work on allowing multiple file sets in the same logdir. The newly added TODO item reflects and tracks that:

# TODO(cais): When tfdbg2 allows there to be multiple DebugEvent file sets in
# the same logdir, replace this magic string with actual run names.

        self.assertTrue(plugin)

    def setUp(self):
        super(DebuggerV2PluginTest, self).setUp()
        self.logdir = tempfile.mkdtemp()
Contributor

Can just use self.get_temp_dir(), which is automatically cleaned up
after the test: no need for tearDown.
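`self.get_temp_dir()` is a helper on the TensorFlow test base class. A stdlib sketch of the same "the framework cleans up, no tearDown needed" pattern, with hypothetical class names:

```python
import os
import tempfile
import unittest


class PluginTestSketch(unittest.TestCase):
    """Hypothetical test case illustrating automatic temp-dir cleanup."""

    def setUp(self):
        super(PluginTestSketch, self).setUp()
        # addCleanup plays the role that get_temp_dir()'s automatic
        # cleanup plays in the TensorFlow test base: no tearDown needed.
        tmpdir = tempfile.TemporaryDirectory()
        self.addCleanup(tmpdir.cleanup)
        self.logdir = tmpdir.name

    def test_logdir_is_usable(self):
        self.assertTrue(os.path.isdir(self.logdir))
```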

Contributor Author

Done.

"application/json", response.headers.get("content-type")
)
self.assertEqual(
json.loads(response.get_data()), ["__default_debugger_run__"]
Contributor

Consider using local_data_provider.DEFAULT_DEBUGGER_RUN_NAME here to
avoid skew?

Contributor Author

Since this is a test, I'm not so concerned about skew. In fact, I'm inclined to apply the "DAMP" principle here and let the test code be descriptive. This way, if the value of DEFAULT_DEBUGGER_RUN_NAME in the tested module changes inadvertently, the tests will catch it.
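The DRY-vs-DAMP trade-off here can be shown in miniature (all names below are illustrative, not the PR's code):

```python
# Sketch of the module under test: exposes a run-name constant.
DEFAULT_DEBUGGER_RUN_NAME = "__default_debugger_run__"


def list_run_names():
    """Sketch of the code path whose output the test checks."""
    return [DEFAULT_DEBUGGER_RUN_NAME]


# DRY style: asserting against the constant always passes, even if the
# constant's value changes inadvertently.
assert list_run_names() == [DEFAULT_DEBUGGER_RUN_NAME]

# DAMP style: asserting against the literal catches such a change.
assert list_run_names() == ["__default_debugger_run__"]
```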

@caisq caisq requested a review from wchargin December 23, 2019 04:19
Contributor

@wchargin wchargin left a comment

Only blocking comment is regarding the experiment_id choice:
#3051 (comment)

try:
from tensorflow.python.debug.lib import debug_events_reader

# TODO(cais): Switch DebugDataReader when available in tensorflow.
Contributor

This TODO looks done; we can drop it now, right?

Contributor Author

This TODO is about switching from the old DebugEventsReader to the latest DebugDataReader, which has not happened yet (due to missing metadata-reading support in the latter). So I'll keep this TODO item for now.

return {}
finally:
if reader:
reader.close()
Contributor

Looks like DebugEventsReader implements the context manager protocol;
can this be simplified and the try scope restricted as something like

    def Runs(self):
        from tensorflow.python.debug.lib import debug_events_reader

        try:
            reader = debug_events_reader.DebugEventsReader(self._logdir)
        except ValueError:
            # Occurs when no DebugEvent file set exists in the logdir
            return {}
        with reader:
            return {DEFAULT_DEBUGGER_RUN_NAME: ...}

Contributor Author

Done.

BTW, this method now uses the newer DebugDataReader, because its logic doesn't rely on the missing metadata support. I resolved the TODO item in this method.

    @wrappers.Request.application
    def serve_runs(self, request):
        runs = self._data_provider.list_runs(
            debug_data_provider.DUMMY_DEBUGGER_EXPERIMENT_ID
Contributor

I still don’t see why this needs to pass a dummy experiment ID to the
data provider rather than passing plugin_util.experiment_id(...) like
all the other plugins do. This will be the only plugin for which
navigating to /experiment/foo/#debugger_v2 doesn’t actually thread
experiment_id="foo" down to the data provider. Why is this necessary?

Contributor Author

You're right. I didn't fully understand how plugin_util.experiment_id() worked; now I see that it should work. I've made changes to use it instead of the dummy experiment ID.
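The change can be sketched as follows. `extract_experiment_id` is a hypothetical stand-in for `plugin_util.experiment_id(environ)`, which reads the ID that TensorBoard middleware parsed out of the `/experiment/<id>/` URL component; `SketchPlugin` and the environ key are likewise invented for illustration:

```python
def extract_experiment_id(environ):
    # Hypothetical stand-in: the real plugin_util.experiment_id reads a
    # value that middleware stored in the WSGI environ.
    return environ.get("tensorboard.experiment_id", "")


class SketchPlugin(object):
    """Illustrative plugin skeleton, not the PR's actual class."""

    def __init__(self, data_provider):
        self._data_provider = data_provider

    def serve_runs(self, environ):
        # Thread the real experiment ID down to the provider instead of
        # a dummy constant, so that navigating to
        # /experiment/foo/#debugger_v2 actually passes "foo" through.
        experiment_id = extract_experiment_id(environ)
        return self._data_provider.list_runs(experiment_id)
```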

Args:
experiment_id: currently unused, because the backing
LocalDebuggerV2DataProvideer does not accommodate multiple experiments.
Contributor

Should this say “backing DebuggerV2EventMultiplexer”?

(If not, sp.: s/Provideer/Provider/.)

Contributor Author

Yes, it should be DebuggerV2EventMultiplexer. Correction is made.

        raise TypeError("DebugDataMultiplexer does not support Images().")

    def Audio(self, run, tag):
        raise TypeError("DebugDataMultiplexer does not support Images().")
Contributor

IMHO there’s no need to implement these functions just to say that
they’re not implemented. The built-in error message is plenty clear
(“AttributeError: 'DebuggerV2EventMultiplexer' object has no attribute
'Audio'”), and there’s no common interface that this class and the
standard plugin_event_multiplexer.EventMultiplexer need to conform to.

(If you do keep them around: s/Images/Audio/.)

Contributor Author

Got it. The default error works for me. I removed the following methods: Audio, CompressedHistograms, Histograms, Images, RunMetadata, Scalars, SerializedGraph, SummaryMetadata, and Tensors. The code is definitely clearer this way.
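The point about the built-in error being clear enough can be seen with a minimal stand-in (the class name is invented):

```python
class MultiplexerSketch(object):
    """Sketch: defines only what it supports; no not-implemented stubs."""

    def Runs(self):
        return {"__default_debugger_run__": {}}


mux = MultiplexerSketch()
assert "__default_debugger_run__" in mux.Runs()

# An unsupported lookup fails naturally with a clear built-in message,
# e.g. "'MultiplexerSketch' object has no attribute 'Audio'", so stub
# methods that only raise TypeError add nothing.
try:
    mux.Audio("run", "tag")
except AttributeError as error:
    assert "Audio" in str(error)
```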

# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""A wrapper around DebugDataReader used for retrieving tfdbg v2 data."""
Contributor

s/DebugDataReader/DebugEventsReader/ ?

Contributor Author

DebugDataReader is actually the latest reader class, which we intend to build the plugin on. So I'll keep this docstring for the (near) future.

Contributor Author

@caisq caisq left a comment

Thanks again for the review! PTAL.

@caisq caisq requested a review from wchargin January 1, 2020 03:49
Contributor

@wchargin wchargin left a comment

Thanks for the revisions. LGTM modulo inline.

            metadata_iterator, _ = reader.metadata_iterator()
            return next(metadata_iterator).wall_time
        finally:
            reader.close()
Contributor

Won’t this be an UnboundLocalError if the reader fails to initialize
(which, looking at the implementation, seems possible)? Should this code
use the same with reader pattern as the other call site?
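The hazard and the fix can be sketched with a stub. `StubReader` is invented; the point is that a constructor may raise before the name `reader` is ever bound, so `reader.close()` in a bare `finally` would itself raise UnboundLocalError:

```python
class StubReader(object):
    """Invented stand-in for a reader whose constructor may raise."""

    def __init__(self, logdir, fail=False):
        if fail:
            # Mimics the reader failing to initialize on a bad logdir.
            raise ValueError("no DebugEvent file set in %r" % logdir)
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()

    def close(self):
        self.closed = True

    def first_wall_time(self):
        return 1576627200.0  # Arbitrary stub value.


def starting_wall_time(logdir, fail=False):
    # Constructing inside `try` and entering `with` only after success
    # means `reader` is always bound before anyone tries to close it.
    try:
        reader = StubReader(logdir, fail=fail)
    except ValueError:
        return None
    with reader:
        return reader.first_wall_time()
```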

Contributor Author

Good catch. Done.

@caisq caisq merged commit 1d37469 into tensorflow:master Jan 2, 2020