
Keras callback creating .profile-empty file blocks loading data #2084

Closed
wchargin opened this issue Apr 3, 2019 · 12 comments · Fixed by #3108

Comments

@wchargin (Contributor) commented Apr 3, 2019

Repro steps:

  1. Create a virtualenv with tf-nightly-2.0-preview==2.0.0.dev20190402
    and open two terminals in this environment.

  2. In one terminal, run the following simple Python script (but
    continue to the next step while this script is still running):

    from __future__ import absolute_import
    from __future__ import division
    from __future__ import print_function
    
    import tensorflow as tf
    
    
    DATASET = tf.keras.datasets.mnist
    INPUT_SHAPE = (28, 28)
    OUTPUT_CLASSES = 10
    
    
    def model_fn():
      model = tf.keras.models.Sequential([
          tf.keras.layers.Input(INPUT_SHAPE),
          tf.keras.layers.Flatten(),
          tf.keras.layers.Dense(128, activation="relu"),
          tf.keras.layers.BatchNormalization(),
          tf.keras.layers.Dense(256, activation="relu"),
          tf.keras.layers.Dropout(0.2),
          tf.keras.layers.Dense(OUTPUT_CLASSES, activation="softmax"),
      ])
      model.compile(
          loss="sparse_categorical_crossentropy",
          optimizer="adagrad",
          metrics=["accuracy"],
      )
      return model
    
    
    def main():
      model = model_fn()
      ((x_train, y_train), (x_test, y_test)) = DATASET.load_data()
      model.fit(
          x=x_train,
          y=y_train,
          validation_data=(x_test, y_test),
          callbacks=[tf.keras.callbacks.TensorBoard()],
          epochs=5,
      )
    
    
    if __name__ == "__main__":
      main()
  3. Wait for (say) epoch 2/5 to finish training. Then, in the other
    terminal, launch tensorboard --logdir ./logs.

  4. Open TensorBoard and observe that both training and validation runs
    appear with two epochs’ worth of data:

    Screenshot just after launching TensorBoard

  5. As training continues, refresh TensorBoard and/or reload the page.
    Observe that validation data continues to appear, but training data
    has stalled. Even well after training has completed, the plot
    remains incomplete:

    Screenshot of bad state

  6. Kill the TensorBoard process and restart it. Note that the data
    appears as desired:

    Screenshot of good state after TensorBoard relaunch

The same problem occurs in tf-nightly (non-2.0-preview), but
manifests differently: because there is only one run (named .) instead
of separate train/validation, all data stops being displayed after
the epoch in which TensorBoard is opened.

As a special case, note that if TensorBoard is already running before
training starts, then train data may not appear at all:

Screenshot of validation-only data

@nfelt (Contributor) commented Apr 3, 2019

What is the event file structure on disk after running this? This sounds like a case where there are multiple active event files in the run and TensorBoard advances to the last one too quickly and thus misses new data from the earlier one.

@wchargin (Contributor Author) commented Apr 3, 2019

Yep, I suspect so, too:

$ find logs/
logs/
logs/validation
logs/validation/events.out.tfevents.1554257194.HOSTNAME.5031.18577.v2
logs/train
logs/train/events.out.tfevents.1554257189.HOSTNAME.5031.1212.v2
logs/train/events.out.tfevents.1554257189.HOSTNAME.profile-empty
logs/train/plugins
logs/train/plugins/profile
logs/train/plugins/profile/2019-04-02_19-06-29
logs/train/plugins/profile/2019-04-02_19-06-29/local.trace

Note that the fake .profile-empty events file only appears in the
train/ directory.

@wchargin (Contributor Author) commented Apr 3, 2019

So, related: #1867.

@nfelt (Contributor) commented Apr 3, 2019

Yeah, that must be it. I thought the code that writes out the .profile-empty file was supposed to check whether any events.out.tfevents files already existed in the directory, which doesn't totally prevent this (due to races) but could at least reduce the incidence. Perhaps that isn't actually the case. Another hacky solution until #1867 is actually ready could be changing the suffix to something that sorts earlier in the alphabet.

@wchargin (Contributor Author) commented Apr 3, 2019

> supposed to check if there were any events.out.tfevents files
> pre-existing in the directory

It only checks whether a .profile-empty events file exists:

_EVENT_FILE_SUFFIX = '.profile-empty'

# …

  for file_name in gfile.ListDirectory(logdir):
    if file_name.endswith(_EVENT_FILE_SUFFIX):
      return
  # TODO(b/127330388): Use summary_ops_v2.create_file_writer instead.
  event_writer = pywrap_tensorflow.EventsWriter(
      compat.as_bytes(os.path.join(logdir, 'events')))
  event_writer.InitWithSuffix(compat.as_bytes(_EVENT_FILE_SUFFIX))

(from python/eager/profiler.py).
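
For comparison, the stronger guard described above might look like this minimal sketch (hypothetical; not what the profiler actually does), skipping the sentinel whenever any events file already exists:

```python
# Hypothetical stronger guard: bail out if the logdir already contains
# any tfevents file, not just a previously written sentinel.
for file_name in gfile.ListDirectory(logdir):
  if 'tfevents' in file_name:
    return
```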

@wchargin (Contributor Author) commented:

Just to note explicitly: setting profile_batch=0 in the Keras callback
options is a workaround that disables profiling entirely.
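
Applied to the repro script above, the workaround looks like this:

```python
# Setting profile_batch=0 disables profiling, so the callback never
# writes the .profile-empty sentinel file.
model.fit(
    x=x_train,
    y=y_train,
    validation_data=(x_test, y_test),
    callbacks=[tf.keras.callbacks.TensorBoard(profile_batch=0)],
    epochs=5,
)
```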

wchargin added a commit that referenced this issue Jul 18, 2019
Summary:
Resolves #2440. See #1998 for discussion.

Test Plan:
The hparams demo still does not specify trial IDs (intentionally, as
this is the usual path). But apply the following patch—

```diff
diff --git a/tensorboard/plugins/hparams/hparams_demo.py b/tensorboard/plugins/hparams/hparams_demo.py
index ac4e762b..d0279f27 100644
--- a/tensorboard/plugins/hparams/hparams_demo.py
+++ b/tensorboard/plugins/hparams/hparams_demo.py
@@ -63,7 +63,7 @@ flags.DEFINE_integer(
 )
 flags.DEFINE_integer(
     "num_epochs",
-    5,
+    1,
     "Number of epochs per trial.",
 )
 
@@ -160,7 +160,7 @@ def model_fn(hparams, seed):
   return model
 
 
-def run(data, base_logdir, session_id, hparams):
+def run(data, base_logdir, session_id, trial_id, hparams):
   """Run a training/validation session.
 
   Flags must have been parsed for this function to behave.
@@ -179,7 +179,7 @@ def run(data, base_logdir, session_id, hparams):
       update_freq=flags.FLAGS.summary_freq,
       profile_batch=0,  # workaround for issue #2084
   )
-  hparams_callback = hp.KerasCallback(logdir, hparams)
+  hparams_callback = hp.KerasCallback(logdir, hparams, trial_id=trial_id)
   ((x_train, y_train), (x_test, y_test)) = data
   result = model.fit(
       x=x_train,
@@ -235,6 +235,7 @@ def run_all(logdir, verbose=False):
           data=data,
           base_logdir=logdir,
           session_id=session_id,
+          trial_id="trial-%d" % group_index,
           hparams=hparams,
       )
 
```

—and then run `//tensorboard/plugins/hparams:hparams_demo`, and observe
that the HParams dashboard renders a “Trial ID” column with the
specified IDs:

![Screenshot of new version of HParams dashboard][1]

[1]: https://user-images.githubusercontent.com/4317806/61491024-1fb01280-a963-11e9-8a47-35e0a01f3691.png

wchargin-branch: hparams-trial-id
wchargin added a commit that referenced this issue Jul 19, 2019
@nfelt nfelt changed the title Keras train data frozen at launch, while validation continues to update Keras callback creating .profile-empty file blocks loading data Sep 26, 2019
@sergei-mironov commented:

FYI: the problem is still present as of 1cf0898dd of TF v2.0.0. The workaround above works.

@caisq (Contributor) commented Dec 22, 2019

@qqfish Just checking if you are working on this issue, as you are the current assignee. If you're not, I'll be happy to take it.

cc @psybuzz

@qqfish (Contributor) commented Dec 30, 2019

Thank you Shanqing. Please take it.

@caisq caisq assigned caisq and unassigned qqfish Dec 30, 2019
@caisq (Contributor) commented Dec 30, 2019

@qqfish Done.

@nfelt (Contributor) commented Jan 3, 2020

Okay, an easy fix for this suggested by @wchargin would be to just update TensorBoard to ignore any events file that ends in .profile-empty. This is simpler than changing TensorFlow, and it means that updating TB will fix this even for existing logdirs.

Background: right now whether something is an events file is determined by the IsTensorFlowEventsFile predicate here:

def IsTensorFlowEventsFile(path):
  """Check the path name to see if it is probably a TF Events file.

  Args:
    path: A file path to check if it is an event file.

  Raises:
    ValueError: If the path is an empty string.

  Returns:
    If path is formatted like a TensorFlowEventsFile.
  """
  if not path:
    raise ValueError("Path must be a nonempty string")
  return "tfevents" in tf.compat.as_str_any(os.path.basename(path))

The predicate is used for (A) the code that finds sub-directories within a root logdir that contain any event files (which become the run directories), which happens here:

if any(IsTensorFlowEventsFile(f) for f in files)

However, the predicate is also used for (B) the code that lists event files within a run directory, which happens here:

if io_wrapper.IsTensorFlowEventsFile(path):
  return event_file_loader.EventFileLoader(path)
elif event_file_active_filter:
  return directory_loader.DirectoryLoader(
      path,
      event_file_loader.TimestampedEventFileLoader,
      path_filter=io_wrapper.IsTensorFlowEventsFile,
      active_filter=event_file_active_filter,
  )
else:
  return directory_watcher.DirectoryWatcher(
      path,
      event_file_loader.EventFileLoader,
      io_wrapper.IsTensorFlowEventsFile,
  )

The point of the .profile-empty files is to trigger the detection in (A) so that directories that otherwise only contain non-tfevents profile data files are still considered to be TensorBoard run directories. The problematic part is that these files are also detected by (B). So we could fix this issue just by splitting these into two predicates, maybe an unchanged IsTensorFlowEventsFile for (A) and then something new like IsSummaryEventsFile for (B).
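
A minimal sketch of that split (the constant name below is hypothetical; the semantics match what eventually landed in #3108):

```python
_PROFILE_EMPTY_SUFFIX = ".profile-empty"  # hypothetical constant name

def IsSummaryEventsFile(path):
  """True iff `path` is a tfevents file that may contain summaries."""
  return IsTensorFlowEventsFile(path) and not tf.compat.as_str_any(
      path).endswith(_PROFILE_EMPTY_SUFFIX)
```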

For the record, here are the commits where .profile-empty was introduced:
tensorflow/tensorflow@23d8e38
tensorflow/tensorflow@c66b603

@caisq (Contributor) commented Jan 5, 2020

Thanks for the detailed write-up of the root cause of the issue and the proposal for how to fix it. I've opened #3108 to implement the suggested fix.

caisq added a commit that referenced this issue Jan 6, 2020
…3108)

* Motivation for features / changes
  * Fix #2084
  * The approach is suggested by @wchargin and @nfelt in #2084 (comment)

* Technical description of changes
  * Split the logic of `IsTensorFlowEventsFile()` into two separate functions
    1. An unchanged `IsTensorFlowEventsFile()` function, which only checks the existence of the 'tfevents' substring in the path string.
    2. A new `IsSummaryEventsFile()` function, which returns `True` if and only if  `IsTensorFlowEventsFile()` returns `True` for the input path name *and* the path name does not end in the special suffix `.profile-empty`.
  * This prevents the `EventAccumulator` implementation from picking up the empty events.tfevents.*.profile-empty files, which, under the single-event-file mode, cause the TensorBoard backend to stop reading the latest summaries in the other (i.e., main, non-profiler-generated) events.* file. The *.profile-empty file was designed to make the TensorBoard backend recognize the subfolder created by the Profile plugin as a valid sub-logdir, even when it contains no other events files.

* Detailed steps to verify changes work correctly (as executed by you)
  * Added new unit tests
  * Manually verified that #2084 is resolved by running `bazel run -c opt tensorboard -- --logdir /path/to/logdir` using the reproduction code in #2084. 
    * Screenshot: 
![image](https://user-images.githubusercontent.com/16824702/71785529-9e127680-2fce-11ea-98e0-b2efecd99880.png)
kenhktsui added a commit to Caro-app/calamari that referenced this issue Feb 13, 2020
Add learning rate as configurable
Tensorboard not refreshing Workaround tensorflow/tensorboard#2084