-
Notifications
You must be signed in to change notification settings - Fork 1.7k
[DebuggerV2] Implement /runs route #3051
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- Add unit tests with dummy tfdbg2 data
|
This is WIP. I'll add the DataProvider abstraction and fix the tests before sending the PR out for review. |
Consider opening a draft PR in the future. |
|
@stephanwlee OK. I didn't know the draft-PR feature of GitHub. Now I know. But just to confirm: by default, PRs opened without assigning reviewers are assumed to be drafts, right? |
Yes, but I found people reading the code anyways so I thought it would be nice to be a little bit more explicit. |
|
@stephanwlee OK. I'll use the draft-PR feature going forward. |
Maybe we should talk offline, but I’m not sure what you mean by this |
|
The outstanding work that I mention is at the top of my priority list |
| def serve_runs(self, request): | ||
| runs = [] | ||
| try: | ||
| # pylint:disable=g-import-not-at-top |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
drive-by: We removed all these suppressions in #3022; you can drop them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
High-level feedback: #3051 (comment)
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
| # ============================================================================== | ||
| """An implementation of DataProvider that serves tfdbg v2 data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please explicitly document that the existence of this module is only a
short-term hack to unblock debugger work, and that it is not intended to
ever be used in production.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Also note that this file has moved to debug_data_provider.py.
| In this implementation, `experiment_id` is assumed to be the path to the | ||
| logdir that contains the DebugEvent file set. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In our discussion, I said that I’d be fine with drafting a temporary
data provider to unblock development of this plugin while we figure out
the data loading story. But it’s important that that short-term provider
be a valid implementation of the DataProvider interface and that the
client interact with it according to the documented APIs, so that we can
seamlessly replace this by a real data provider, which will have no
special logic for the debugger plugin. So, for instance:
-
The log directory shouldn’t be communicated via the experiment ID,
because the logdir is given via a command-line flag while the
experiment ID is given by a URL component. (For this reason, the
experiment ID also cannot contain slashes, which makes it a poor
container for a filesystem path.) Instead, it probably makes more
sense to mirror theMultiplexerDataProviderand accept the logdir
at data provider initialization time, then start a background thread
to reload under that directory, like the multiplexer does. Then an
instance of the data provider is always associated with a logdir,
andlist_runscan list the runs under that logdir that have debug
data. -
Methods need to be independent. Calling
list_runsmust not affect
the future behavior of (say)list_blob_sequences. Methods may
depend on state that’s being mutated by a background thread, as in
the case of the multiplexer, but the methods shouldn’t themselves
modify state. -
The result of
list_runsmust be a list ofprovider.Runvalues,
not a list of strings. -
How is the “default debugger run name” behavior going to map to a
real provider? Perhaps that bit’s just not implemented yet, in which
case it’s fine, but I don’t see the plan forward.
Does this make sense? Ideally, once the normal loading flow is complete,
the only change necessary will be to change the self._data_provider
assignment to read from context._data_provider instead of constructing
a new instance, and everything should Just Work™.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wchargin Thanks for the thoughtful review!
The log directory shouldn’t be communicated via the experiment ID,
because the logdir is given via a command-line flag while the
experiment ID is given by a URL component. (For this reason, the
experiment ID also cannot contain slashes, which makes it a poor
container for a filesystem path.) Instead, it probably makes more
sense to mirror the MultiplexerDataProvider and accept the logdir
at data provider initialization time, then start a background thread
to reload under that directory, like the multiplexer does. Then an
instance of the data provider is always associated with a logdir,
and list_runs can list the runs under that logdir that have debug
data.
I have created a new class DebuggerV2EventMultiplexer in the new file debug_event_multiplexer.py. It loosely mirroring EventMultiplexer. As you suggested, it is constructed by using a logdir. experiment_id no longer serve as the substitute logdir.
Methods need to be independent. Calling list_runs must not affect
the future behavior of (say) list_blob_sequences. Methods may
depend on state that’s being mutated by a background thread, as in
the case of the multiplexer, but the methods shouldn’t themselves
modify state.
I've made changes such that list_runs() calls is "stateless" and does. The read_blob_sequences() a read_blob() methods, the implementation of which has not started in this PR, will also follow this principle of independence.
The result of list_runs must be a list of provider.Run values,
not a list of strings.
Done.
How is the “default debugger run name” behavior going to map to a
real provider? Perhaps that bit’s just not implemented yet, in which
case it’s fine, but I don’t see the plan forward.
The default debugger run name is a short-term hack necessitated by the fact that each logdir currently cannot consist of >1 DebugEvent file sets (the reader class in tensorflow will error out if >1 sets are found.) Going forward, we'll work on improving this by allowing there to be multiple file sets in the same logdir. The newly added TODO item reflects / tracks that:
# TODO(cais): When tfdbg2 allows there to be multiple DebugEvent file sets in
# the same logdir, replace this magic string with actual run names.| self.assertTrue(plugin) | ||
| def setUp(self): | ||
| super(DebuggerV2PluginTest, self).setUp() | ||
| self.logdir = tempfile.mkdtemp() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can just use self.get_temp_dir(), which is automatically cleaned up
after the test: no need for tearDown.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| "application/json", response.headers.get("content-type") | ||
| ) | ||
| self.assertEqual( | ||
| json.loads(response.get_data()), ["__default_debugger_run__"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using local_data_provider.DEFAULT_DEBUGGER_RUN_NAME here to
avoid skew?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is a test, I'm not so concerned about skew. In fact, I'm inclined toward applying the "DAMP" principle here. Let the test code be descriptive. This way, when the value of DEAULT_DEBUGGER_RUN_NAME in the tested module changes inadvertently, we can catch that with the tests.
wchargin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only blocking comment is regarding the experiment_id choice:
#3051 (comment)
| try: | ||
| from tensorflow.python.debug.lib import debug_events_reader | ||
|
|
||
| # TODO(cais): Switch DebugDataReader when available in tensorflow. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This TODO looks done; we can drop it now, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This TODO is about switching the old DebugEventsReader to the latest DebugDataReader, has not happened yet (due to a missing metadata reading support in the latter). So I'll keep this TODO item for now.
| return {} | ||
| finally: | ||
| if reader: | ||
| reader.close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like DebugEventsReader implements the context manager protocol;
can this be simplified and the try scope restricted as something like
def Runs(self):
from tensorflow.python.debug.lib import debug_events_reader
try:
reader = debug_events_reader.DebugEventsReader(self._logdir)
except ValueError:
# Occurs when no DebugEvent file set exists in the logdir
return {}
with reader:
return {DEFAULT_DEBUGGER_RUN_NAME: ...}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
BTW, this method is using the newer DebugDataReader now, because its logic doesn't rely on the missing metadata support. Resolvee the TODO item in this method.
| @wrappers.Request.application | ||
| def serve_runs(self, request): | ||
| runs = self._data_provider.list_runs( | ||
| debug_data_provider.DUMMY_DEBUGGER_EXPERIMENT_ID |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don’t see why this needs to pass a dummy experiment ID to the
data provider rather than passing plugin_util.experiment_id(...) like
all the other plugins do. This will be the only plugin for which
navigating to /experiment/foo/#debugger_v2 doesn’t actually thread
experiment_id="foo" down to the data provider. Why is this necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. I didn't fully understand how plugin_util.experiment_id() worked. Now I see. It should work. I made changes to use it instead of the dummy experiment ID now.
| Args: | ||
| experiment_id: currently unused, because the backing | ||
| LocalDebuggerV2DataProvideer does not accommodate multiple experiments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this say “backing DebuggerV2EventMultiplexer”?
(If not, sp.: s/Provideer/Provider/.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it should be DebuggerV2EventMultiplexer. Correction is made.
| raise TypeError("DebugDataMultiplexer does not support Images().") | ||
|
|
||
| def Audio(self, run, tag): | ||
| raise TypeError("DebugDataMultiplexer does not support Images().") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO there’s no need to implement these functions just to say that
they’re not implemented. The built-in error message is plenty clear
(“AttributeError: 'DebuggerV2EventMultiplexer' object has no attribute
'Audio'”), and there’s no common interface that this class and the
standard plugin_event_multiplexer.EventMultiplexer need to conform to.
(If you do keep them around: s/Images/Audio/.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. The default error works for me. I removed them. I removed the following methods: Audio, CompressedHistorgrams, Histograms, Image, RunMetadata, Scalars, SerializedGraph, SummaryMetadata and Tensors. The code is definitely clearer this way.
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
| # ============================================================================== | ||
| """A wrapper around DebugDataReader used for retrieving tfdbg v2 data.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/DebugDataReader/DebugEventsReader/ ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DebugDataReader is actually the latest feature that the plugin that we intend to build on. So I'll keep this doc string for the (near) future.
caisq
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again for the review! PTAL.
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
| # ============================================================================== | ||
| """A wrapper around DebugDataReader used for retrieving tfdbg v2 data.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DebugDataReader is actually the latest feature that the plugin that we intend to build on. So I'll keep this doc string for the (near) future.
| try: | ||
| from tensorflow.python.debug.lib import debug_events_reader | ||
|
|
||
| # TODO(cais): Switch DebugDataReader when available in tensorflow. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This TODO is about switching the old DebugEventsReader to the latest DebugDataReader, has not happened yet (due to a missing metadata reading support in the latter). So I'll keep this TODO item for now.
| return {} | ||
| finally: | ||
| if reader: | ||
| reader.close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
BTW, this method is using the newer DebugDataReader now, because its logic doesn't rely on the missing metadata support. Resolvee the TODO item in this method.
| Args: | ||
| experiment_id: currently unused, because the backing | ||
| LocalDebuggerV2DataProvideer does not accommodate multiple experiments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it should be DebuggerV2EventMultiplexer. Correction is made.
| @wrappers.Request.application | ||
| def serve_runs(self, request): | ||
| runs = self._data_provider.list_runs( | ||
| debug_data_provider.DUMMY_DEBUGGER_EXPERIMENT_ID |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. I didn't fully understand how plugin_util.experiment_id() worked. Now I see. It should work. I made changes to use it instead of the dummy experiment ID now.
| raise TypeError("DebugDataMultiplexer does not support Images().") | ||
|
|
||
| def Audio(self, run, tag): | ||
| raise TypeError("DebugDataMultiplexer does not support Images().") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. The default error works for me. I removed them. I removed the following methods: Audio, CompressedHistorgrams, Histograms, Image, RunMetadata, Scalars, SerializedGraph, SummaryMetadata and Tensors. The code is definitely clearer this way.
wchargin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the revisions. LGTM modulo inline.
| metadata_iterator, _ = reader.metadata_iterator() | ||
| return next(metadata_iterator).wall_time | ||
| finally: | ||
| reader.close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Won’t this be a UnboundLocalError if the reader fails to initialize
(which, looking at the implementation, seems possible)? Should this code
use the same with reader pattern as the other call site?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Done.
"__default_debugger_run__"), as currently tfdbg v2 assumes each directory has <= 1 DebugEvent file sets.DataProviderimplementation.