Ensure EventFileLoader only uses no-TF stub when required #3194

Conversation
if len(header_str) < 8:
    raise self._truncation_error("header")
AFAIK `read` is allowed to return fewer bytes than were requested, even
if those are available in the stream. (For instance, this can occur when
a signal handler is entered during the read.) From Python’s `read` docs,
it’s not clear to me whether a short result is guaranteed to imply that
EOF is imminent. In practice, it looks like CPython 3.7 and 3.9 both do
read the full requested amount*. Do you think it’s worth retrying the
read until it completes or `read` returns b"", or is that not necessary?
* Test program (tested on CPython 3.7.5rc1 and `667b91a5e2` on Debian):
import os
import signal
import threading
import time

def handle_sigusr1(signalnum, frame):
    print("Got SIGUSR1 (%d)" % signalnum)

def send_delayed_sigusr1(delay):
    time.sleep(delay)
    os.kill(os.getpid(), signal.SIGUSR1)

signal.signal(signal.SIGUSR1, handle_sigusr1)
threading.Thread(target=send_delayed_sigusr1, args=(0.1,)).start()

with open("/dev/zero", "rb") as infile:
    zeros = infile.read(int(1e10))
    print(len(zeros))  # seems to always give 1e10 in my testing
Thanks for bringing this up and testing it, I hadn't thought about that. That said, it seems fine to me to not retry the read immediately, since A) practically speaking, with this change the higher level code will retry on the next reload cycle anyway and B) under the circumstances you describe, it's not obvious to me that read returning zero bytes would actually be a stronger indication of EOF (unless the spec is that it always reads at least 1 byte if available, but might return fewer than available if interrupted by a signal).
“unless the spec is that it always reads at least 1 byte if available,
but might return fewer than available if interrupted by a signal”
Yeah, this is basically the spec in C (which is why I ask); Python is
not very clear about what its spec is.
Good point that the higher level code will retry anyway, so the worst
case is that there’ll be a spurious “truncated record” message. Keeping
it as is sounds good to me, then.
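For reference, the retry option discussed above would look something like the sketch below. This is purely illustrative (the change deliberately does not do this), and the helper name is hypothetical.

def _read_exact(infile, n):
    # Illustrative sketch of "retry until the read is complete or read()
    # returns b''"; not part of this change.
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = infile.read(remaining)
        if not chunk:  # b"" here means EOF (or no more data written yet)
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)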
if event_crc_calc != crc_event[0]:
    raise errors.DataLossError(
        None, None, "{} failed event crc32 check".format(self.filename),
    )
We’re raising the same error for a CRC failure as for a truncated stream.
How does upstream code know to retry on the latter but skip this record
on the former? From a cursory read it looks to me like this would
infinitely loop raising DataLossErrors once we find a corrupt record.
For better or for worse, this is the same behavior that the TF C++ RecordReader has always had:
https://github.com/tensorflow/tensorflow/blob/a34091e538540aad57a7a941575538944f38db24/tensorflow/core/lib/io/record_reader.cc#L99
https://github.com/tensorflow/tensorflow/blob/a34091e538540aad57a7a941575538944f38db24/tensorflow/core/lib/io/record_reader.cc#L105
So I'm not too inclined to try to fix it here given that we can't really fix it in the normal path other than by attempting to parse the string error message on the resulting status, which seems too brittle. I think corrupted checksums are infrequent enough in practice that looping on them, while not ideal, is an acceptable behavior. It's the same "infinite loop" that we would have at EOF anyway, so not really a big deal IMO.
Wow, okay, confirmed: wrote an event file with five steps, corrupted the
third one, and observed that only the first two steps are ever loaded.
That’s not great, but you’re right that it’s not a regression.
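To make that behavior concrete, here is a small self-contained simulation of the reload pattern being described. Every name in it is hypothetical (it is not TensorBoard’s actual API), but it mirrors the five-record experiment above: the reader raises the same error for a corrupt record as for a truncated one and keeps its offset, so the corrupt record is retried on every reload cycle and nothing after it is ever loaded.

class DataLossError(Exception):
    """Stand-in for the error raised on both truncation and CRC failure."""

def make_reader(records, corrupt_index):
    # Simulated reader: preserves its offset across failures, and always
    # raises at the corrupt record without advancing past it.
    state = {"offset": 0}
    def read_record():
        i = state["offset"]
        if i >= len(records):
            raise StopIteration()  # clean EOF
        if i == corrupt_index:
            raise DataLossError("failed event crc32 check")
        state["offset"] = i + 1
        return records[i]
    return read_record

def load_all(read_record, reload_cycles=3):
    # Simulated higher-level loop: on any DataLossError, give up for now and
    # retry from the preserved offset on the next reload cycle.
    loaded = []
    for _ in range(reload_cycles):
        try:
            while True:
                loaded.append(read_record())
        except (StopIteration, DataLossError):
            pass
    return loaded

print(load_all(make_reader(["r0", "r1", "r2", "r3", "r4"], corrupt_index=2)))
# -> ['r0', 'r1']: only the first two records are ever loaded.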
nfelt left a comment
PTAL
This addresses concern 2 from #1711 (comment), namely that the changes in #3185 were sufficient to fall back correctly if `pywrap_tensorflow` no longer contained a `PyRecordReader_New` symbol, but not actually sufficient in the eventual case that `pywrap_tensorflow` disappears entirely and can no longer be imported. The previous code was using the `tensorboard.compat._pywrap_tensorflow` lazy-loader, which would in that case have automatically started using the no-TF stub implementation of pywrap_tensorflow instead, which is undesirable relative to using `tf_record_iterator`.

We fix that by getting rid of `tensorboard.compat._pywrap_tensorflow` entirely, since it was unused elsewhere (the projector plugin imported it but did not use it; it was leftover cruft from #2096), and just explicitly handling the no-TF case up front.

I added a no-TF build target for the test to confirm that the fallback to the stub implementation works when forcing no-TF mode. The stub implementation didn't previously support truncated records at all (it would fail in struct unpacking when truncated partway through a fixed-length field), let alone read-offset preservation, so I also added that functionality to the stub so that the set of strengthened tests from #3185 passes.
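As a rough illustration of what “explicitly handling the no-TF case up front” amounts to, the sketch below decides once whether real TF is importable and only falls back to a stub reader when it is not. The names and import paths here are assumptions for illustration, not the exact code in this change.

def _choose_record_iterator():
    # Simplified sketch; names and paths below are assumptions, not the
    # actual implementation in this change.
    try:
        import tensorflow.compat.v1 as tf  # succeeds only if real TF is installed
    except ImportError:
        tf = None
    if tf is not None:
        # TF is available: use its own record iterator; never touch the stub.
        return tf.python_io.tf_record_iterator
    # TF truly cannot be imported: fall back to the pure-Python stub reader
    # (hypothetical entry point shown here).
    from tensorboard.compat import tensorflow_stub
    return tensorflow_stub.pywrap_tensorflow.tf_record_iterator  # assumed name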