Replace tf_record_iterator with a non-deprecated way to read TFRecords

TensorBoard currently relies on tf.pywrap_tensorflow.PyRecordReader_New() to read tfrecords files (aka event files).  Not only is this bad because it's not strictly speaking a public API symbol, it's also being hidden in TF 2.0.

There is a reimplementation of record reading logic in pure python in our tensorflow_stub here:
https://github.com/tensorflow/tensorboard/blob/3f055ba9c057ad2fbacfbd2c61b6e27471a51396/tensorboard/compat/tensorflow_stub/pywrap_tensorflow.py

I haven't benchmarked this, but I expect that it's considerably slower than the C++ version from TensorFlow.  If so and if we can't get the performance to be comparable to the C++ version, then when TF is available we should prefer the C++ code.  To use that code, however, we'll need to expose it under a real public python API.

What exists today is:

* [`tf.train.summary_iterator`](https://www.tensorflow.org/versions/r1.12/api_docs/python/tf/train/summary_iterator) - deprecated, will go away entirely in TF 2.0

* [`tf.TFRecordReader`](https://www.tensorflow.org/versions/r1.12/api_docs/python/tf/TFRecordReader) - deprecated queue-based reader being replaced by tf.data

* [`tf.io.tf_record_iterator`](https://www.tensorflow.org/versions/r1.12/api_docs/python/tf/io/tf_record_iterator) - deprecated in favor of tf.data.TFRecordDataset for tf.data, and not enough for our needs

* [`tf.data.TFRecordDataset`](https://www.tensorflow.org/versions/r1.12/api_docs/python/tf/data/TFRecordDataset) - tf.data-compatible Dataset, but ultimately still an iterator, and not enough for our needs

* [`tf.io.TFRecordWriter`](https://www.tensorflow.org/versions/r1.12/api_docs/python/tf/io/TFRecordWriter) - writer side (so not immediately useful,  but an example of what does still exist)

An iterator API isn't powerful enough for us because we need the ability to read to the end of a file and then wait for new data, preserving our read offset.  As far as I can tell, all the above iterator approaches will abort iteration when they reach the end of the file, so reading new data would require re-reading all the previously consumed data.  Also, I believe we currently rely on being able to handle cases where a record isn't fully flushed (e.g. the file appears to end in a corrupted record, but it's really just that the data isn't all there yet, and the appropriate response is to wait and retry), which the iterator APIs also don't allow.

So it's likely that we would need to basically create a new TFRecordReader.  I think it would make the most sense to live at tf.io.TFRecordReader alongside the writer; we'd just have to clearly message that most users will actually want the TFRecordDataset (and that this reader is somewhat different from the old tf.TFRecordReader).



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Replace tf_record_iterator with a non-deprecated way to read TFRecords #1711

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Replace tf_record_iterator with a non-deprecated way to read TFRecords #1711

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions