Conversation

nfelt (Contributor) commented Jan 29, 2020

This addresses concern 2 from #1711 (comment), namely that the changes in #3185 were sufficient to fall back correctly if pywrap_tensorflow no longer contained a PyRecordReader_New symbol, but not actually sufficient in the eventual case that pywrap_tensorflow disappears entirely and can no longer be imported. The previous code was using the tensorboard.compat._pywrap_tensorflow lazy-loader which would in that case have automatically started using the no-TF stub implementation of pywrap_tensorflow instead, which is undesirable relative to using tf_record_iterator.

We fix that by getting rid of tensorboard.compat._pywrap_tensorflow entirely, since it was unused elsewhere (the projector plugin imported it but did not use it; it was leftover cruft from #2096), and just explicitly handling the no-TF case up front.

I added a no-TF build target for the test to confirm that the fallback to the stub implementation works when forcing no-TF mode. The stub implementation didn't previously support truncated records at all (it would fail in struct unpacking when truncated partway through a fixed-length field), let alone read offset preservation, so I also added that functionality to the implementation so that the set of strengthened tests from #3185 passes.
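
Roughly, "handling the no-TF case up front" amounts to choosing the reader once, based on whether TensorFlow imports at all. A minimal sketch of that idea (the stub module path and its record-iterator API below are assumptions for illustration, not the actual code in this change):

try:
    import tensorflow
except ImportError:
    tensorflow = None  # TF is entirely absent, not merely missing a symbol

def iterate_records(path):
    if tensorflow is not None:
        # Real TF is importable: prefer its native TFRecord iterator.
        return tensorflow.compat.v1.io.tf_record_iterator(path)
    # No TF at all: fall back to the pure-Python stub implementation.
    from tensorboard.compat.tensorflow_stub import pywrap_tensorflow  # assumed path
    return pywrap_tensorflow.tf_record_iterator(path)  # assumed stub API

With tensorboard.compat._pywrap_tensorflow gone, an installed TF that merely lacks the PyRecordReader_New symbol no longer silently falls back to the stub; that case keeps using tf_record_iterator as intended.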

Comment on lines +214 to +215
if len(header_str) < 8:
    raise self._truncation_error("header")

Contributor:

AFAIK read is allowed to return fewer bytes than were requested, even if those bytes are available in the stream. (For instance, this can occur when a signal handler is entered during the read.) From Python's read docs, it's not clear to me whether a short result is guaranteed to imply that EOF is imminent. In practice, it looks like CPython 3.7 and 3.9 both do exhaust the requested count*. Do you think it's worth retrying the read until it completes or read returns b"" (see the sketch below the test program), or is that not necessary?

* Test program:

Source (tested on CPython 3.7.5rc1 and `667b91a5e2` on Debian)
import os
import signal
import threading
import time


def handle_sigusr1(signalnum, frame):
    print("Got SIGUSR1 (%d)" % signalnum)


def send_delayed_sigusr1(delay):
    time.sleep(delay)
    os.kill(os.getpid(), signal.SIGUSR1)


# Deliver SIGUSR1 partway through the blocking read to check whether read()
# still returns all of the requested bytes after being interrupted.
signal.signal(signal.SIGUSR1, handle_sigusr1)
threading.Thread(target=send_delayed_sigusr1, args=(0.1,)).start()
with open("/dev/zero", "rb") as infile:
    zeros = infile.read(int(1e10))
print(len(zeros))  # seems to always give 1e10 in my testing
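
For concreteness, the retry-until-done variant I have in mind would look roughly like this (read_fully is a hypothetical helper, just to illustrate; it is not part of this change):

def read_fully(infile, n):
    """Keep reading until n bytes are collected or read() returns b"" (EOF)."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = infile.read(remaining)
        if not chunk:
            # b"" is the only unambiguous EOF signal; a short but nonempty
            # read (e.g. one interrupted by a signal) simply loops and reads again.
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)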

nfelt (Author):

Thanks for bringing this up and testing it, I hadn't thought about that. That said, it seems fine to me to not retry the read immediately, since A) practically speaking, with this change the higher level code will retry on the next reload cycle anyway and B) under the circumstances you describe, it's not obvious to me that read returning zero bytes would actually be a stronger indication of EOF (unless the spec is that it always reads at least 1 byte if available, but might return fewer than available if interrupted by a signal).

Contributor:

> unless the spec is that it always reads at least 1 byte if available,
> but might return fewer than available if interrupted by a signal

Yeah, this is basically the spec in C (which is why I ask); Python is
not very clear about what its spec is.

Good point that the higher level code will retry anyway, so the worst
case is that there’ll be a spurious “truncated record” message. Keeping
it as is sounds good to me, then.
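
To make the "higher level code will retry anyway" point concrete: the read-offset preservation added in this change means a truncated read leaves the file position at the start of the incomplete record, so the next reload cycle can pick up where it left off. A minimal sketch of that pattern (class and method names are hypothetical, not the actual stub code):

import struct

class ResumableRecordReader:
    """Sketch: rewind to the record start whenever the record is truncated."""

    def __init__(self, f):
        self._f = f  # a seekable binary file object

    def next_record(self):
        # Remember where this record starts so a truncated read can be retried
        # later, once the writer has flushed the rest of the bytes.
        start = self._f.tell()
        header = self._f.read(12)  # uint64 length + uint32 length CRC
        if len(header) < 12:
            self._f.seek(start)  # preserve the offset: consume nothing
            raise EOFError("truncated record header")
        (length,) = struct.unpack("<Q", header[:8])
        payload = self._f.read(length + 4)  # data plus its trailing CRC field
        if len(payload) < length + 4:
            self._f.seek(start)
            raise EOFError("truncated record data")
        return payload[:length]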

if event_crc_calc != crc_event[0]:
    raise errors.DataLossError(
        None, None, "{} failed event crc32 check".format(self.filename),
    )

Contributor:

We’re raising the same error for a CRC failure as a truncated stream.
How does upstream code know to retry on the latter but skip this record
on the former? From a cursory read it looks to me like this would
infinitely loop raising DataLossErrors once we find a corrupt record.

nfelt (Author):

For better or for worse, this is the same behavior that the TF C++ RecordReader has always had:
https://github.com/tensorflow/tensorflow/blob/a34091e538540aad57a7a941575538944f38db24/tensorflow/core/lib/io/record_reader.cc#L99
https://github.com/tensorflow/tensorflow/blob/a34091e538540aad57a7a941575538944f38db24/tensorflow/core/lib/io/record_reader.cc#L105

So I'm not too inclined to try to fix it here given that we can't really fix it in the normal path other than by attempting to parse the string error message on the resulting status, which seems too brittle. I think corrupted checksums are infrequent enough in practice that looping on them, while not ideal, is an acceptable behavior. It's the same "infinite loop" that we would have at EOF anyway, so not really a big deal IMO.
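
For context on what that check verifies: a TFRecord is framed on disk as a little-endian uint64 length, a masked CRC-32C of those length bytes, the record data, and a masked CRC-32C of the data. A sketch of the check (the crc32c argument stands in for a CRC-32C implementation, assumed to be supplied by a library):

import struct

def masked_crc32c(data, crc32c):
    # TFRecord stores CRC-32C values in this "masked" form.
    crc = crc32c(data) & 0xFFFFFFFF
    rot = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rot + 0xA282EAD8) & 0xFFFFFFFF

def read_checked_record(f, crc32c):
    header = f.read(12)  # uint64 length + uint32 masked CRC of the length
    (length,) = struct.unpack("<Q", header[:8])
    (length_crc,) = struct.unpack("<I", header[8:12])
    if masked_crc32c(header[:8], crc32c) != length_crc:
        raise ValueError("failed header crc32 check")
    data = f.read(length)
    (data_crc,) = struct.unpack("<I", f.read(4))
    if masked_crc32c(data, crc32c) != data_crc:
        # Same failure mode discussed above: corruption, not truncation.
        raise ValueError("failed event crc32 check")
    return data

Since both the truncation path and this corruption path surface as DataLossError in the stub (mirroring the C++ reader), a caller can't distinguish them without inspecting the message, which is the limitation discussed here.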

Contributor:

Wow, okay, confirmed: wrote an event file with five steps, corrupted the
third one, and observed that only the first two steps are ever loaded.
That’s not great, but you’re right that it’s not a regression.

nfelt (Author) left a comment:

PTAL

nfelt merged commit 73b4df9 into tensorflow:master Jan 30, 2020
nfelt deleted the pyrecordreader-stub branch January 30, 2020 01:31