
Keras callback creating .profile-empty file blocks loading data #2084

Closed
wchargin opened this issue Apr 3, 2019 · 12 comments · Fixed by #3108

Comments

@wchargin (Contributor) commented Apr 3, 2019

Repro steps:

  1. Create a virtualenv with tf-nightly-2.0-preview==2.0.0.dev20190402
    and open two terminals in this environment.

  2. In one terminal, run the following simple Python script (but
    continue to the next step while this script is still running):

    from __future__ import absolute_import
    from __future__ import division
    from __future__ import print_function
    
    import tensorflow as tf
    
    
    DATASET = tf.keras.datasets.mnist
    INPUT_SHAPE = (28, 28)
    OUTPUT_CLASSES = 10
    
    
    def model_fn():
      model = tf.keras.models.Sequential([
          tf.keras.layers.Input(INPUT_SHAPE),
          tf.keras.layers.Flatten(),
          tf.keras.layers.Dense(128, activation="relu"),
          tf.keras.layers.BatchNormalization(),
          tf.keras.layers.Dense(256, activation="relu"),
          tf.keras.layers.Dropout(0.2),
          tf.keras.layers.Dense(OUTPUT_CLASSES, activation="softmax"),
      ])
      model.compile(
          loss="sparse_categorical_crossentropy",
          optimizer="adagrad",
          metrics=["accuracy"],
      )
      return model
    
    
    def main():
      model = model_fn()
      ((x_train, y_train), (x_test, y_test)) = DATASET.load_data()
      model.fit(
          x=x_train,
          y=y_train,
          validation_data=(x_test, y_test),
          callbacks=[tf.keras.callbacks.TensorBoard()],
          epochs=5,
      )
    
    
    if __name__ == "__main__":
      main()
  3. Wait for (say) epoch 2/5 to finish training. Then, in the other
    terminal, launch tensorboard --logdir ./logs.

  4. Open TensorBoard and observe that both training and validation runs
    appear with two epochs’ worth of data:

    Screenshot just after launching TensorBoard

  5. As training continues, refresh TensorBoard and/or reload the page.
    Observe that validation data continues to appear, but training data
    has stalled. Even well after training has completed, the plot
    remains incomplete:

    Screenshot of bad state

  6. Kill the TensorBoard process and restart it. Note that the data
    appears as desired:

    Screenshot of good state after TensorBoard relaunch

The same problem occurs in tf-nightly (non-2.0-preview), but
manifests differently: because there is only one run (named .) instead
of separate train/validation, all data stops being displayed after
the epoch in which TensorBoard is opened.

As a special case, note that if TensorBoard is already running before
training starts, then train data may not appear at all:

Screenshot of validation-only data

@nfelt (Contributor) commented Apr 3, 2019

What is the event file structure on disk after running this? This sounds like a case where there are multiple active event files in the run and TensorBoard advances to the last one too quickly and thus misses new data from the earlier one.

@wchargin (Contributor Author) commented Apr 3, 2019

Yep, I suspect so, too:

$ find logs/
logs/
logs/validation
logs/validation/events.out.tfevents.1554257194.HOSTNAME.5031.18577.v2
logs/train
logs/train/events.out.tfevents.1554257189.HOSTNAME.5031.1212.v2
logs/train/events.out.tfevents.1554257189.HOSTNAME.profile-empty
logs/train/plugins
logs/train/plugins/profile
logs/train/plugins/profile/2019-04-02_19-06-29
logs/train/plugins/profile/2019-04-02_19-06-29/local.trace

Note that the fake .profile-empty events file only appears in the
train/ directory.

@wchargin (Contributor Author) commented Apr 3, 2019

So, related: #1867.

@nfelt (Contributor) commented Apr 3, 2019

Yeah, that must be it. I thought the code that writes out the .profile-empty file was supposed to check whether any events.out.tfevents files already existed in the directory, which doesn't totally prevent this (due to races) but could at least reduce the incidence. Perhaps that isn't actually the case. Another hacky solution until #1867 is actually ready could be changing the suffix to something that sorts earlier in the alphabet.

@wchargin (Contributor Author) commented Apr 3, 2019

> supposed to check if there were any events.out.tfevents files
> pre-existing in the directory

It only checks whether a .profile-empty events file exists:

_EVENT_FILE_SUFFIX = '.profile-empty'

# …

  for file_name in gfile.ListDirectory(logdir):
    if file_name.endswith(_EVENT_FILE_SUFFIX):
      return
  # TODO(b/127330388): Use summary_ops_v2.create_file_writer instead.
  event_writer = pywrap_tensorflow.EventsWriter(
      compat.as_bytes(os.path.join(logdir, 'events')))
  event_writer.InitWithSuffix(compat.as_bytes(_EVENT_FILE_SUFFIX))

(from python/eager/profiler.py).
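
For comparison, the stronger guard described above might look like this minimal sketch (hypothetical; not what the profiler actually does), skipping the sentinel whenever any events file already exists:

```python
# Hypothetical stronger guard: bail out if the logdir already contains
# any tfevents file, not just a previously written sentinel.
for file_name in gfile.ListDirectory(logdir):
  if 'tfevents' in file_name:
    return
```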

@wchargin (Contributor Author) commented:

Just to note explicitly: setting profile_batch=0 in the Keras callback
options is a workaround that disables profiling entirely.
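
Applied to the repro script above, the workaround looks like this:

```python
# Setting profile_batch=0 disables profiling, so the callback never
# writes the .profile-empty sentinel file.
model.fit(
    x=x_train,
    y=y_train,
    validation_data=(x_test, y_test),
    callbacks=[tf.keras.callbacks.TensorBoard(profile_batch=0)],
    epochs=5,
)
```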

wchargin added a commit that referenced this issue Jul 18, 2019
Summary:
Resolves #2440. See #1998 for discussion.

Test Plan:
The hparams demo still does not specify trial IDs (intentionally, as
this is the usual path). But apply the following patch—

```diff
diff --git a/tensorboard/plugins/hparams/hparams_demo.py b/tensorboard/plugins/hparams/hparams_demo.py
index ac4e762b..d0279f27 100644
--- a/tensorboard/plugins/hparams/hparams_demo.py
+++ b/tensorboard/plugins/hparams/hparams_demo.py
@@ -63,7 +63,7 @@ flags.DEFINE_integer(
 )
 flags.DEFINE_integer(
     "num_epochs",
-    5,
+    1,
     "Number of epochs per trial.",
 )
 
@@ -160,7 +160,7 @@ def model_fn(hparams, seed):
   return model
 
 
-def run(data, base_logdir, session_id, hparams):
+def run(data, base_logdir, session_id, trial_id, hparams):
   """Run a training/validation session.
 
   Flags must have been parsed for this function to behave.
@@ -179,7 +179,7 @@ def run(data, base_logdir, session_id, hparams):
       update_freq=flags.FLAGS.summary_freq,
       profile_batch=0,  # workaround for issue #2084
   )
-  hparams_callback = hp.KerasCallback(logdir, hparams)
+  hparams_callback = hp.KerasCallback(logdir, hparams, trial_id=trial_id)
   ((x_train, y_train), (x_test, y_test)) = data
   result = model.fit(
       x=x_train,
@@ -235,6 +235,7 @@ def run_all(logdir, verbose=False):
           data=data,
           base_logdir=logdir,
           session_id=session_id,
+          trial_id="trial-%d" % group_index,
           hparams=hparams,
       )
 
```

—and then run `//tensorboard/plugins/hparams:hparams_demo`, and observe
that the HParams dashboard renders a “Trial ID” column with the
specified IDs:

![Screenshot of new version of HParams dashboard][1]

[1]: https://user-images.githubusercontent.com/4317806/61491024-1fb01280-a963-11e9-8a47-35e0a01f3691.png

wchargin-branch: hparams-trial-id
wchargin added a commit that referenced this issue Jul 19, 2019
@nfelt nfelt changed the title Keras train data frozen at launch, while validation continues to update Keras callback creating .profile-empty file blocks loading data Sep 26, 2019
@sergei-mironov commented:

FYI: the problem is still present as of 1cf0898dd of TF v2.0.0. The workaround above works.

@caisq (Contributor) commented Dec 22, 2019

@qqfish Just checking if you are working on this issue, as you are the current assignee. If you're not, I'll be happy to take it.

cc @psybuzz

@qqfish (Contributor) commented Dec 30, 2019

Thank you Shanqing. Please take it.

@caisq caisq assigned caisq and unassigned qqfish Dec 30, 2019
@caisq (Contributor) commented Dec 30, 2019

@qqfish Done.

@nfelt (Contributor) commented Jan 3, 2020

Okay, an easy fix for this suggested by @wchargin would be to just update TensorBoard to ignore any events file that ends in .profile-empty. This is simpler than changing TensorFlow, and it means that updating TB will fix this even for existing logdirs.

Background: right now whether something is an events file is determined by the IsTensorFlowEventsFile predicate here:

def IsTensorFlowEventsFile(path):
  """Check the path name to see if it is probably a TF Events file.

  Args:
    path: A file path to check if it is an event file.

  Raises:
    ValueError: If the path is an empty string.

  Returns:
    If path is formatted like a TensorFlowEventsFile.
  """
  if not path:
    raise ValueError("Path must be a nonempty string")
  return "tfevents" in tf.compat.as_str_any(os.path.basename(path))

The predicate is used for (A) the code that finds sub-directories within a root logdir that contain any event files (which become the run directories), which happens here:

if any(IsTensorFlowEventsFile(f) for f in files)

However, the predicate is also used for (B) the code that lists event files within a run directory, which happens here:

if io_wrapper.IsTensorFlowEventsFile(path):
  return event_file_loader.EventFileLoader(path)
elif event_file_active_filter:
  return directory_loader.DirectoryLoader(
      path,
      event_file_loader.TimestampedEventFileLoader,
      path_filter=io_wrapper.IsTensorFlowEventsFile,
      active_filter=event_file_active_filter,
  )
else:
  return directory_watcher.DirectoryWatcher(
      path,
      event_file_loader.EventFileLoader,
      io_wrapper.IsTensorFlowEventsFile,
  )

The point of the .profile-empty files is to trigger the detection in (A) so that directories that otherwise only contain non-tfevents profile data files are still considered to be TensorBoard run directories. The problematic part is that these files are also detected by (B). So we could fix this issue just by splitting these into two predicates, maybe an unchanged IsTensorFlowEventsFile for (A) and then something new like IsSummaryEventsFile for (B).
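
A minimal sketch of that split (the constant name below is hypothetical; the semantics match what eventually landed in #3108):

```python
_PROFILE_EMPTY_SUFFIX = ".profile-empty"  # hypothetical constant name

def IsSummaryEventsFile(path):
  """True iff `path` is a tfevents file that may contain summaries."""
  return IsTensorFlowEventsFile(path) and not tf.compat.as_str_any(
      path).endswith(_PROFILE_EMPTY_SUFFIX)
```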

For the record, here are the commits where .profile-empty was introduced:
tensorflow/tensorflow@23d8e38
tensorflow/tensorflow@c66b603

@caisq (Contributor) commented Jan 5, 2020

Thanks for the detailed write-up of the root cause of the issue and the proposal for how to fix it. I've opened #3108 to implement the suggested fix.

caisq added a commit that referenced this issue Jan 6, 2020
…3108)

* Motivation for features / changes
  * Fix #2084
  * The approach is suggested by @wchargin and @nfelt in #2084 (comment)

* Technical description of changes
  * Split the logic of `IsTensorFlowEventsFile()` into two separate functions
    1. An unchanged `IsTensorFlowEventsFile()` function, which only checks the existence of the 'tfevents' substring in the path string.
    2. A new `IsSummaryEventsFile()` function, which returns `True` if and only if  `IsTensorFlowEventsFile()` returns `True` for the input path name *and* the path name does not end in the special suffix `.profile-empty`.
  * This prevents the `EventAccumulator` implementation from picking up the empty events.tfevents.*.profile-empty files, which, under the single-event-file mode, cause the TensorBoard backend to stop reading the latest summaries in the other (i.e., main, non-profiler-generated) events.* file. The *.profile-empty file was designed to make the TensorBoard backend recognize the subfolder created by the Profile plugin as a valid sub-logdir, even when it contains no other events files.

* Detailed steps to verify changes work correctly (as executed by you)
  * Added new unit tests
  * Manually verified that #2084 is resolved by running `bazel run -c opt tensorboard -- --logdir /path/to/logdir` using the reproduction code in #2084. 
    * Screenshot: 
![image](https://user-images.githubusercontent.com/16824702/71785529-9e127680-2fce-11ea-98e0-b2efecd99880.png)
kenhktsui added a commit to Caro-app/calamari that referenced this issue Feb 13, 2020
Add learning rate as configurable
Tensorboard not refreshing Workaround tensorflow/tensorboard#2084