Conversation

Contributor

@caisq caisq commented Dec 18, 2019

  • Add unit tests with dummy tfdbg2 data
  • The run name (if it exists) is hard-coded to a magic string ("__default_debugger_run__"), as tfdbg v2 currently assumes each directory contains at most one DebugEvent file set.
  • The plugin's serving routes are based on a newly created stub DataProvider implementation.
  • The implementation is specialized for local DebugEvent file sets and serves the debugger v2 plugin specifically. It can later be integrated with a DataProvider implementation that supports a more diverse set of plugins.

- Add unit tests with dummy tfdbg2 data
Contributor Author

caisq commented Dec 19, 2019

This is WIP. I'll add the DataProvider abstraction and fix the tests before sending the PR out for review.

@stephanwlee
Contributor

This is WIP. I'll add the DataProvider abstraction and fix the tests before sending the PR out for review

Consider opening a draft PR in the future.

Contributor Author

caisq commented Dec 19, 2019

@stephanwlee OK. I didn't know about GitHub's draft-PR feature; now I know. But just to confirm: by default, PRs opened without assigning reviewers are assumed to be drafts, right?

@stephanwlee
Contributor

PRs opened without assigning reviewers are assumed to be drafts, right?

Yes, but I've found people reading the code anyway, so I thought it would be nice to be a little more explicit.

Contributor Author

caisq commented Dec 19, 2019

@stephanwlee OK. I'll use the draft-PR feature going forward.

@wchargin
Contributor

I'll add the DataProvider abstraction and fix the tests before sending
the PR out for review.

Maybe we should talk offline, but I’m not sure what you mean by this
comment. There’s still some infrastructural work that needs to be done
before the data provider can be used for blob data by other plugins:
essentially, de-hacking what’s in the graph implementation of the
multiplexer data provider. This is high-interest debt, and I’d strongly
encourage not building on top of it. There wasn’t too much flexibility
for the graph given that the data format is old, but for the debugger
plugin we have more freedom.

@wchargin
Contributor

The outstanding work that I mention is at the top of my priority list
for exactly this reason, and I plan to resume working on it after this
fix-it week.

    def serve_runs(self, request):
        runs = []
        try:
            # pylint:disable=g-import-not-at-top
Contributor

drive-by: We removed all these suppressions in #3022; you can drop them.

Contributor

@wchargin wchargin left a comment

High-level feedback: #3051 (comment)

# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""An implementation of DataProvider that serves tfdbg v2 data.
Contributor

Please explicitly document that the existence of this module is only a
short-term hack to unblock debugger work, and that it is not intended to
ever be used in production.

Contributor Author

Done. Also note that this file has moved to debug_data_provider.py.

Comment on lines 34 to 35
In this implementation, `experiment_id` is assumed to be the path to the
logdir that contains the DebugEvent file set.
Contributor

In our discussion, I said that I’d be fine with drafting a temporary
data provider to unblock development of this plugin while we figure out
the data loading story. But it’s important that that short-term provider
be a valid implementation of the DataProvider interface and that the
client interact with it according to the documented APIs, so that we can
seamlessly replace this by a real data provider, which will have no
special logic for the debugger plugin. So, for instance:

  • The log directory shouldn’t be communicated via the experiment ID,
    because the logdir is given via a command-line flag while the
    experiment ID is given by a URL component. (For this reason, the
    experiment ID also cannot contain slashes, which makes it a poor
    container for a filesystem path.) Instead, it probably makes more
    sense to mirror the MultiplexerDataProvider and accept the logdir
    at data provider initialization time, then start a background thread
    to reload under that directory, like the multiplexer does. Then an
    instance of the data provider is always associated with a logdir,
    and list_runs can list the runs under that logdir that have debug
    data.

  • Methods need to be independent. Calling list_runs must not affect
    the future behavior of (say) list_blob_sequences. Methods may
    depend on state that’s being mutated by a background thread, as in
    the case of the multiplexer, but the methods shouldn’t themselves
    modify state.

  • The result of list_runs must be a list of provider.Run values,
    not a list of strings.

  • How is the “default debugger run name” behavior going to map to a
    real provider? Perhaps that bit’s just not implemented yet, in which
    case it’s fine, but I don’t see the plan forward.

Does this make sense? Ideally, once the normal loading flow is complete,
the only change necessary will be to change the self._data_provider
assignment to read from context._data_provider instead of constructing
a new instance, and everything should Just Work™.
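The constraints above (logdir bound at construction time, stateless methods, `provider.Run` values rather than strings) can be illustrated with a minimal sketch. `Run` below is a hypothetical namedtuple standing in for `tensorboard.data.provider.Run`, and `SketchDebuggerDataProvider` is an invented name, not the PR's actual class:

```python
import collections
import os

# Hypothetical stand-in for tensorboard.data.provider.Run; the real class
# carries structured fields rather than being a bare string.
Run = collections.namedtuple("Run", ["run_id", "run_name", "start_time"])

# Mirrors the magic string discussed in this PR.
DEFAULT_DEBUGGER_RUN_NAME = "__default_debugger_run__"


class SketchDebuggerDataProvider(object):
    """Illustrative sketch, not the PR's implementation."""

    def __init__(self, logdir):
        # The logdir is bound at initialization time (it comes from a
        # command-line flag), not smuggled through the experiment ID,
        # which is a URL component and cannot contain slashes.
        self._logdir = logdir

    def list_runs(self, experiment_id):
        # Stateless: reads current state but mutates nothing, so calling
        # it cannot affect the future behavior of other methods.
        del experiment_id  # Unused in this single-logdir sketch.
        if not os.path.isdir(self._logdir):
            return []
        # Returns Run-like values, not bare strings.
        return [
            Run(
                run_id=DEFAULT_DEBUGGER_RUN_NAME,
                run_name=DEFAULT_DEBUGGER_RUN_NAME,
                start_time=0.0,
            )
        ]
```

A real provider would enumerate runs from disk (likely via a background reload thread, as the multiplexer does); the single hard-coded run here just mirrors the one-file-set-per-logdir limitation discussed in this thread.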

Contributor Author

@wchargin Thanks for the thoughtful review!

The log directory shouldn’t be communicated via the experiment ID,
because the logdir is given via a command-line flag while the
experiment ID is given by a URL component. (For this reason, the
experiment ID also cannot contain slashes, which makes it a poor
container for a filesystem path.) Instead, it probably makes more
sense to mirror the MultiplexerDataProvider and accept the logdir
at data provider initialization time, then start a background thread
to reload under that directory, like the multiplexer does. Then an
instance of the data provider is always associated with a logdir,
and list_runs can list the runs under that logdir that have debug
data.

I have created a new class DebuggerV2EventMultiplexer in the new file debug_event_multiplexer.py. It loosely mirrors EventMultiplexer. As you suggested, it is constructed with a logdir; experiment_id no longer serves as a substitute logdir.

Methods need to be independent. Calling list_runs must not affect
the future behavior of (say) list_blob_sequences. Methods may
depend on state that’s being mutated by a background thread, as in
the case of the multiplexer, but the methods shouldn’t themselves
modify state.

I've made changes so that list_runs() calls are "stateless" and do not modify state. The read_blob_sequences() and read_blob() methods, whose implementation has not started in this PR, will also follow this principle of independence.

The result of list_runs must be a list of provider.Run values,
not a list of strings.

Done.

How is the “default debugger run name” behavior going to map to a
real provider? Perhaps that bit’s just not implemented yet, in which
case it’s fine, but I don’t see the plan forward.

The default debugger run name is a short-term hack necessitated by the fact that a logdir currently cannot contain more than one DebugEvent file set (the reader class in tensorflow will error out if more than one set is found). Going forward, we'll work on allowing multiple file sets in the same logdir. The newly added TODO item reflects and tracks that:

# TODO(cais): When tfdbg2 allows there to be multiple DebugEvent file sets in
# the same logdir, replace this magic string with actual run names.

        self.assertTrue(plugin)

    def setUp(self):
        super(DebuggerV2PluginTest, self).setUp()
        self.logdir = tempfile.mkdtemp()
Contributor

Can just use self.get_temp_dir(), which is automatically cleaned up
after the test: no need for tearDown.
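`self.get_temp_dir()` is a helper on the TensorFlow test base class. A stdlib sketch of the same "the framework cleans up, no tearDown needed" pattern, with hypothetical class names:

```python
import os
import tempfile
import unittest


class PluginTestSketch(unittest.TestCase):
    """Hypothetical test case illustrating automatic temp-dir cleanup."""

    def setUp(self):
        super(PluginTestSketch, self).setUp()
        # addCleanup plays the role that get_temp_dir()'s automatic
        # cleanup plays in the TensorFlow test base: no tearDown needed.
        tmpdir = tempfile.TemporaryDirectory()
        self.addCleanup(tmpdir.cleanup)
        self.logdir = tmpdir.name

    def test_logdir_is_usable(self):
        self.assertTrue(os.path.isdir(self.logdir))
```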

Contributor Author

Done.

"application/json", response.headers.get("content-type")
)
self.assertEqual(
json.loads(response.get_data()), ["__default_debugger_run__"]
Contributor

Consider using local_data_provider.DEFAULT_DEBUGGER_RUN_NAME here to
avoid skew?

Contributor Author

Since this is a test, I'm not so concerned about skew. In fact, I'm inclined to apply the "DAMP" principle here and let the test code be descriptive. This way, if the value of DEFAULT_DEBUGGER_RUN_NAME in the tested module changes inadvertently, the tests will catch it.
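The DRY-vs-DAMP trade-off here can be shown in miniature (all names below are illustrative, not the PR's code):

```python
# Sketch of the module under test: exposes a run-name constant.
DEFAULT_DEBUGGER_RUN_NAME = "__default_debugger_run__"


def list_run_names():
    """Sketch of the code path whose output the test checks."""
    return [DEFAULT_DEBUGGER_RUN_NAME]


# DRY style: asserting against the constant always passes, even if the
# constant's value changes inadvertently.
assert list_run_names() == [DEFAULT_DEBUGGER_RUN_NAME]

# DAMP style: asserting against the literal catches such a change.
assert list_run_names() == ["__default_debugger_run__"]
```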

@caisq caisq requested a review from wchargin December 23, 2019 04:19
Contributor

@wchargin wchargin left a comment

Only blocking comment is regarding the experiment_id choice:
#3051 (comment)

try:
from tensorflow.python.debug.lib import debug_events_reader

# TODO(cais): Switch DebugDataReader when available in tensorflow.
Contributor

This TODO looks done; we can drop it now, right?

Contributor Author

This TODO is about switching from the old DebugEventsReader to the latest DebugDataReader, which has not happened yet (due to missing metadata-reading support in the latter). So I'll keep this TODO item for now.

return {}
finally:
if reader:
reader.close()
Contributor

Looks like DebugEventsReader implements the context manager protocol;
can this be simplified and the try scope restricted as something like

    def Runs(self):
        from tensorflow.python.debug.lib import debug_events_reader

        try:
            reader = debug_events_reader.DebugEventsReader(self._logdir)
        except ValueError:
            # Occurs when no DebugEvent file set exists in the logdir
            return {}
        with reader:
            return {DEFAULT_DEBUGGER_RUN_NAME: ...}

Contributor Author

Done.

BTW, this method now uses the newer DebugDataReader, because its logic doesn't rely on the missing metadata support. I resolved the TODO item in this method.

    @wrappers.Request.application
    def serve_runs(self, request):
        runs = self._data_provider.list_runs(
            debug_data_provider.DUMMY_DEBUGGER_EXPERIMENT_ID
Contributor

I still don’t see why this needs to pass a dummy experiment ID to the
data provider rather than passing plugin_util.experiment_id(...) like
all the other plugins do. This will be the only plugin for which
navigating to /experiment/foo/#debugger_v2 doesn’t actually thread
experiment_id="foo" down to the data provider. Why is this necessary?

Contributor Author

You're right. I didn't fully understand how plugin_util.experiment_id() worked; now I see that it should work. I've made changes to use it instead of the dummy experiment ID.
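The change can be sketched as follows. `extract_experiment_id` is a hypothetical stand-in for `plugin_util.experiment_id(environ)`, which reads the ID that TensorBoard middleware parsed out of the `/experiment/<id>/` URL component; `SketchPlugin` and the environ key are likewise invented for illustration:

```python
def extract_experiment_id(environ):
    # Hypothetical stand-in: the real plugin_util.experiment_id reads a
    # value that middleware stored in the WSGI environ.
    return environ.get("tensorboard.experiment_id", "")


class SketchPlugin(object):
    """Illustrative plugin skeleton, not the PR's actual class."""

    def __init__(self, data_provider):
        self._data_provider = data_provider

    def serve_runs(self, environ):
        # Thread the real experiment ID down to the provider instead of
        # a dummy constant, so that navigating to
        # /experiment/foo/#debugger_v2 actually passes "foo" through.
        experiment_id = extract_experiment_id(environ)
        return self._data_provider.list_runs(experiment_id)
```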

Args:
experiment_id: currently unused, because the backing
LocalDebuggerV2DataProvideer does not accommodate multiple experiments.
Contributor

Should this say “backing DebuggerV2EventMultiplexer”?

(If not, sp.: s/Provideer/Provider/.)

Contributor Author

Yes, it should be DebuggerV2EventMultiplexer. Correction is made.

        raise TypeError("DebugDataMultiplexer does not support Images().")

    def Audio(self, run, tag):
        raise TypeError("DebugDataMultiplexer does not support Images().")
Contributor

IMHO there’s no need to implement these functions just to say that
they’re not implemented. The built-in error message is plenty clear
(“AttributeError: 'DebuggerV2EventMultiplexer' object has no attribute
'Audio'”), and there’s no common interface that this class and the
standard plugin_event_multiplexer.EventMultiplexer need to conform to.

(If you do keep them around: s/Images/Audio/.)

Contributor Author

Got it. The default error works for me. I removed the following methods: Audio, CompressedHistograms, Histograms, Images, RunMetadata, Scalars, SerializedGraph, SummaryMetadata, and Tensors. The code is definitely clearer this way.
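The point about the built-in error being clear enough can be seen with a minimal stand-in (the class name is invented):

```python
class MultiplexerSketch(object):
    """Sketch: defines only what it supports; no not-implemented stubs."""

    def Runs(self):
        return {"__default_debugger_run__": {}}


mux = MultiplexerSketch()
assert "__default_debugger_run__" in mux.Runs()

# An unsupported lookup fails naturally with a clear built-in message,
# e.g. "'MultiplexerSketch' object has no attribute 'Audio'", so stub
# methods that only raise TypeError add nothing.
try:
    mux.Audio("run", "tag")
except AttributeError as error:
    assert "Audio" in str(error)
```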

# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""A wrapper around DebugDataReader used for retrieving tfdbg v2 data."""
Contributor

s/DebugDataReader/DebugEventsReader/ ?

Contributor Author

DebugDataReader is actually the latest reader class, which we intend to build the plugin on. So I'll keep this docstring for the (near) future.

Contributor Author

@caisq caisq left a comment

Thanks again for the review! PTAL.

@caisq caisq requested a review from wchargin January 1, 2020 03:49
Contributor

@wchargin wchargin left a comment

Thanks for the revisions. LGTM modulo inline.

            metadata_iterator, _ = reader.metadata_iterator()
            return next(metadata_iterator).wall_time
        finally:
            reader.close()
Contributor

Won’t this be an UnboundLocalError if the reader fails to initialize
(which, looking at the implementation, seems possible)? Should this code
use the same with reader pattern as the other call site?
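The hazard and the fix can be sketched with a stub. `StubReader` is invented; the point is that a constructor may raise before the name `reader` is ever bound, so `reader.close()` in a bare `finally` would itself raise UnboundLocalError:

```python
class StubReader(object):
    """Invented stand-in for a reader whose constructor may raise."""

    def __init__(self, logdir, fail=False):
        if fail:
            # Mimics the reader failing to initialize on a bad logdir.
            raise ValueError("no DebugEvent file set in %r" % logdir)
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()

    def close(self):
        self.closed = True

    def first_wall_time(self):
        return 1576627200.0  # Arbitrary stub value.


def starting_wall_time(logdir, fail=False):
    # Constructing inside `try` and entering `with` only after success
    # means `reader` is always bound before anyone tries to close it.
    try:
        reader = StubReader(logdir, fail=fail)
    except ValueError:
        return None
    with reader:
        return reader.first_wall_time()
```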

Contributor Author

Good catch. Done.

@caisq caisq merged commit 1d37469 into tensorflow:master Jan 2, 2020