Ensure EventFileLoader only uses no-TF stub when required #3194
Changes from all commits: c8927ff, c1569c9, 872fd0d, 07afe24, d3adaa7
```diff
@@ -197,19 +197,29 @@ def __init__(
         self.status = status
         self.curr_event = None
         self.file_handle = gfile.GFile(self.filename, "rb")
+        # Maintain a buffer of partially read records, so we can recover from
+        # truncated records upon a retry.
+        self._buffer = b""
+        self._buffer_pos = 0

     def GetNext(self):
+        # Each new read should start at the beginning of any partial record.
+        self._buffer_pos = 0
         # Read the header
         self.curr_event = None
-        header_str = self.file_handle.read(8)
-        if len(header_str) != 8:
+        header_str = self._read(8)
+        if not header_str:
             # Hit EOF so raise and exit
             raise errors.OutOfRangeError(None, None, "No more events to read")
+        if len(header_str) < 8:
+            raise self._truncation_error("header")
         header = struct.unpack("Q", header_str)

         # Read the crc32, which is 4 bytes, and check it against
         # the crc32 of the header
-        crc_header_str = self.file_handle.read(4)
+        crc_header_str = self._read(4)
+        if len(crc_header_str) < 4:
+            raise self._truncation_error("header crc")
         crc_header = struct.unpack("I", crc_header_str)
         header_crc_calc = masked_crc32c(header_str)
         if header_crc_calc != crc_header[0]:

@@ -220,25 +230,59 @@ def GetNext(self):
         # The length of the header tells us how many bytes the Event
         # string takes
         header_len = int(header[0])
-        event_str = self.file_handle.read(header_len)
+        event_str = self._read(header_len)
+        if len(event_str) < header_len:
+            raise self._truncation_error("data")

         event_crc_calc = masked_crc32c(event_str)

         # The next 4 bytes contain the crc32 of the Event string,
-        # which we check for integrity. Sometimes, the last Event
-        # has no crc32, in which case we skip.
-        crc_event_str = self.file_handle.read(4)
-        if crc_event_str:
-            crc_event = struct.unpack("I", crc_event_str)
-            if event_crc_calc != crc_event[0]:
-                raise errors.DataLossError(
-                    None,
-                    None,
-                    "{} failed event crc32 check".format(self.filename),
-                )
+        # which we check for integrity.
+        crc_event_str = self._read(4)
+        if len(crc_event_str) < 4:
+            raise self._truncation_error("data crc")
+        crc_event = struct.unpack("I", crc_event_str)
+        if event_crc_calc != crc_event[0]:
+            raise errors.DataLossError(
+                None, None, "{} failed event crc32 check".format(self.filename),
+            )
```
Contributor: We're raising the same error for a CRC failure as for a truncated stream.

Contributor (author): For better or for worse, this is the same behavior that the TF C++ RecordReader has always had, so I'm not too inclined to try to fix it here, given that we can't really fix it in the normal path other than by attempting to parse the string error message on the resulting status, which seems too brittle. I think corrupted checksums are infrequent enough in practice that looping on them, while not ideal, is an acceptable behavior. It's the same "infinite loop" that we would have at EOF anyway, so not really a big deal IMO.

Contributor: Wow, okay, confirmed: wrote an event file with five steps, corrupted the …
```diff
         # Set the current event to be read later by record() call
         self.curr_event = event_str
+        # Clear the buffered partial record since we're done reading it.
+        self._buffer = b""

+    def _read(self, n):
+        """Read up to n bytes from the underlying file, with buffering.
+
+        Reads are satisfied from a buffer of previous data read starting at
+        `self._buffer_pos` until the buffer is exhausted, and then from the
+        actual underlying file. Any new data is added to the buffer, and
+        `self._buffer_pos` is advanced to the point in the buffer past all
+        data returned as part of this read.
+
+        Args:
+          n: non-negative number of bytes to read
+
+        Returns:
+          bytestring of data read, up to n bytes
+        """
+        result = self._buffer[self._buffer_pos : self._buffer_pos + n]
+        self._buffer_pos += len(result)
+        n -= len(result)
+        if n > 0:
+            new_data = self.file_handle.read(n)
+            result += new_data
+            self._buffer += new_data
+            self._buffer_pos += len(new_data)
+        return result
+
+    def _truncation_error(self, section):
+        return errors.DataLossError(
+            None,
+            None,
+            "{} has truncated record in {}".format(self.filename, section),
+        )

     def record(self):
         return self.curr_event
```
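The recovery mechanism in `_read` can be exercised in isolation. Below is a minimal sketch (the `BufferedRecordReader` class name and the in-memory `io.BytesIO` stand-in for `gfile.GFile` are hypothetical, not part of this PR) showing how resetting `_buffer_pos` lets a retry return the previously buffered prefix plus newly arrived bytes:

```python
import io


class BufferedRecordReader:
    """Minimal sketch of the buffered-read recovery added in this PR."""

    def __init__(self, file_handle):
        self.file_handle = file_handle
        self._buffer = b""    # bytes read so far for the current record
        self._buffer_pos = 0  # read cursor within the buffer

    def _read(self, n):
        # Satisfy the read from the buffer first, then from the file;
        # anything newly read from the file is appended to the buffer.
        result = self._buffer[self._buffer_pos : self._buffer_pos + n]
        self._buffer_pos += len(result)
        n -= len(result)
        if n > 0:
            new_data = self.file_handle.read(n)
            result += new_data
            self._buffer += new_data
            self._buffer_pos += len(new_data)
        return result


# A writer appends a record in two stages; the reader retries in between.
stream = io.BytesIO(b"abc")  # only 3 of 8 header bytes written so far
reader = BufferedRecordReader(stream)

first = reader._read(8)  # short read: the record is still truncated
assert first == b"abc"

stream.write(b"defgh")  # the writer finishes the record...
stream.seek(3)          # ...and the file handle resumes where it left off
reader._buffer_pos = 0  # a retry restarts at the beginning of the record
second = reader._read(8)
assert second == b"abcdefgh"  # buffered prefix + newly read suffix
```

On a retry after a truncated record, the bytes already consumed from the file are replayed from the buffer, so the reader never loses the start of a partially written record.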
Contributor: AFAIK `read` is allowed to return fewer bytes than were requested, even if those are available in the stream. (For instance, this can occur when a signal handler is entered during the read.) From Python's `read` docs, it's not clear to me whether a short result is guaranteed to imply that EOF is imminent. In practice, it looks like cPython 3.7 and 3.9 both do exhaust the stream.* Do you think that it's worth retrying the `read` until done or until `read` returns `b""`, or is that not necessary?

* Test program (tested on cPython 3.7.5rc1 and `667b91a5e2` on Debian)

Contributor (author): Thanks for bringing this up and testing it; I hadn't thought about that. That said, it seems fine to me not to retry the `read` immediately, since (a) practically speaking, with this change the higher-level code will retry on the next reload cycle anyway, and (b) under the circumstances you describe, it's not obvious to me that `read` returning zero bytes would actually be a stronger indication of EOF (unless the spec is that it always reads at least 1 byte if available, but might return fewer than available if interrupted by a signal).

Contributor: Yeah, this is basically the spec in C (which is why I ask); Python is not very clear about what its spec is. Good point that the higher-level code will retry anyway, so the worst case is that there'll be a spurious "truncated record" message. Keeping it as is sounds good to me, then.
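The alternative discussed here, retrying `read` until the request is satisfied or an empty read signals EOF, could be sketched as follows (a hypothetical helper, not code from this PR; `TrickleFile` is a made-up fake that simulates short reads):

```python
import io


def read_fully(file_handle, n):
    """Keep calling read() until n bytes arrive or read() returns b"".

    This treats an empty read as the only authoritative EOF signal
    (the C-style semantics discussed above); a short but nonempty
    read is retried rather than taken to mean the data is missing.
    """
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = file_handle.read(remaining)
        if not chunk:  # b"" means EOF under these semantics
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


class TrickleFile:
    """Fake file that returns at most 1 byte per read() call."""

    def __init__(self, data):
        self._data = data
        self._pos = 0

    def read(self, n):
        chunk = self._data[self._pos : self._pos + min(n, 1)]
        self._pos += len(chunk)
        return chunk


assert read_fully(io.BytesIO(b"abcdef"), 4) == b"abcd"
assert read_fully(io.BytesIO(b"ab"), 4) == b"ab"        # EOF before 4 bytes
assert TrickleFile(b"wxyz").read(4) == b"w"             # one read is short
assert read_fully(TrickleFile(b"wxyz"), 4) == b"wxyz"   # retry loop recovers
```

As the thread concludes, this loop was judged unnecessary for the PR: a spurious short read at worst produces one "truncated record" error, and the next reload cycle retries from the buffered position anyway.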