Ensure EventFileLoader only uses no-TF stub when required #3194

Conversation
if len(header_str) < 8:
    raise self._truncation_error("header")
AFAIK `read` is allowed to return fewer bytes than were requested, even
if those are available in the stream. (For instance, this can occur when
a signal handler is entered during the read.) From Python’s `read` docs,
it’s not clear to me whether a short result is guaranteed to imply that
EOF is imminent. In practice, it looks like CPython 3.7 and 3.9 both do
read the full requested amount*. Do you think it’s worth retrying the
read until it completes or `read` returns b"", or is that not necessary?
* Test program (tested on CPython 3.7.5rc1 and `667b91a5e2` on Debian):
import os
import signal
import threading
import time

def handle_sigusr1(signalnum, frame):
    print("Got SIGUSR1 (%d)" % signalnum)

def send_delayed_sigusr1(delay):
    time.sleep(delay)
    os.kill(os.getpid(), signal.SIGUSR1)

signal.signal(signal.SIGUSR1, handle_sigusr1)
threading.Thread(target=send_delayed_sigusr1, args=(0.1,)).start()

with open("/dev/zero", "rb") as infile:
    zeros = infile.read(int(1e10))
    print(len(zeros))  # seems to always give 1e10 in my testing
Thanks for bringing this up and testing it, I hadn't thought about that. That said, it seems fine to me to not retry the read immediately, since A) practically speaking, with this change the higher level code will retry on the next reload cycle anyway and B) under the circumstances you describe, it's not obvious to me that read returning zero bytes would actually be a stronger indication of EOF (unless the spec is that it always reads at least 1 byte if available, but might return fewer than available if interrupted by a signal).
“unless the spec is that it always reads at least 1 byte if available,
but might return fewer than available if interrupted by a signal”
Yeah, this is basically the spec in C (which is why I ask); Python is
not very clear about what its spec is.
Good point that the higher level code will retry anyway, so the worst
case is that there’ll be a spurious “truncated record” message. Keeping
it as is sounds good to me, then.
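For reference, the retry option discussed above would look something like the sketch below. This is purely illustrative (the change deliberately does not do this), and the helper name is hypothetical.

def _read_exact(infile, n):
    # Illustrative sketch of "retry until the read is complete or read()
    # returns b''"; not part of this change.
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = infile.read(remaining)
        if not chunk:  # b"" here means EOF (or no more data written yet)
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)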
if event_crc_calc != crc_event[0]:
    raise errors.DataLossError(
        None, None, "{} failed event crc32 check".format(self.filename),
    )
We’re raising the same error for a CRC failure as for a truncated stream.
How does upstream code know to retry on the latter but skip this record
on the former? From a cursory read it looks to me like this would
infinitely loop raising DataLossErrors once we find a corrupt record.
For better or for worse, this is the same behavior that the TF C++ RecordReader has always had:
https://github.com/tensorflow/tensorflow/blob/a34091e538540aad57a7a941575538944f38db24/tensorflow/core/lib/io/record_reader.cc#L99
https://github.com/tensorflow/tensorflow/blob/a34091e538540aad57a7a941575538944f38db24/tensorflow/core/lib/io/record_reader.cc#L105
So I'm not too inclined to try to fix it here given that we can't really fix it in the normal path other than by attempting to parse the string error message on the resulting status, which seems too brittle. I think corrupted checksums are infrequent enough in practice that looping on them, while not ideal, is an acceptable behavior. It's the same "infinite loop" that we would have at EOF anyway, so not really a big deal IMO.
Wow, okay, confirmed: wrote an event file with five steps, corrupted the
third one, and observed that only the first two steps are ever loaded.
That’s not great, but you’re right that it’s not a regression.
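To make that behavior concrete, here is a small self-contained simulation of the reload pattern being described. Every name in it is hypothetical (it is not TensorBoard’s actual API), but it mirrors the five-record experiment above: the reader raises the same error for a corrupt record as for a truncated one and keeps its offset, so the corrupt record is retried on every reload cycle and nothing after it is ever loaded.

class DataLossError(Exception):
    """Stand-in for the error raised on both truncation and CRC failure."""

def make_reader(records, corrupt_index):
    # Simulated reader: preserves its offset across failures, and always
    # raises at the corrupt record without advancing past it.
    state = {"offset": 0}
    def read_record():
        i = state["offset"]
        if i >= len(records):
            raise StopIteration()  # clean EOF
        if i == corrupt_index:
            raise DataLossError("failed event crc32 check")
        state["offset"] = i + 1
        return records[i]
    return read_record

def load_all(read_record, reload_cycles=3):
    # Simulated higher-level loop: on any DataLossError, give up for now and
    # retry from the preserved offset on the next reload cycle.
    loaded = []
    for _ in range(reload_cycles):
        try:
            while True:
                loaded.append(read_record())
        except (StopIteration, DataLossError):
            pass
    return loaded

print(load_all(make_reader(["r0", "r1", "r2", "r3", "r4"], corrupt_index=2)))
# -> ['r0', 'r1']: only the first two records are ever loaded.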
nfelt left a comment
PTAL
This addresses concern 2 from #1711 (comment), namely that the changes in #3185 were sufficient to fall back correctly if `pywrap_tensorflow` no longer contained a `PyRecordReader_New` symbol, but not actually sufficient in the eventual case that `pywrap_tensorflow` disappears entirely and can no longer be imported. The previous code was using the `tensorboard.compat._pywrap_tensorflow` lazy-loader, which would in that case have automatically started using the no-TF stub implementation of pywrap_tensorflow instead, which is undesirable relative to using `tf_record_iterator`.

We fix that by getting rid of `tensorboard.compat._pywrap_tensorflow` entirely, since it was unused elsewhere (the projector plugin imported it but did not use it; it was leftover cruft from #2096), and just explicitly handling the no-TF case up front.

I added a no-TF build target for the test to confirm that the fallback to the stub implementation works when forcing no-TF mode. The stub implementation didn't previously support truncated records at all (it would fail in struct unpacking when truncated partway through a fixed-length field), let alone read-offset preservation, so I also added that functionality to the stub so that the set of strengthened tests from #3185 passes.
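As a rough illustration of what “explicitly handling the no-TF case up front” amounts to, the sketch below decides once whether real TF is importable and only falls back to a stub reader when it is not. The names and import paths here are assumptions for illustration, not the exact code in this change.

def _choose_record_iterator():
    # Simplified sketch; names and paths below are assumptions, not the
    # actual implementation in this change.
    try:
        import tensorflow.compat.v1 as tf  # succeeds only if real TF is installed
    except ImportError:
        tf = None
    if tf is not None:
        # TF is available: use its own record iterator; never touch the stub.
        return tf.python_io.tf_record_iterator
    # TF truly cannot be imported: fall back to the pure-Python stub reader
    # (hypothetical entry point shown here).
    from tensorboard.compat import tensorflow_stub
    return tensorflow_stub.pywrap_tensorflow.tf_record_iterator  # assumed name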