# Keras callback creating `.profile-empty` file blocks loading data (#2084)
What is the event file structure on disk after running this? This sounds like a case where there are multiple active event files in the run and TensorBoard advances to the last one too quickly and thus misses new data from the earlier one.
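For context, the run-directory state this question is getting at looks roughly like the following (timestamps and hostname are hypothetical):

```
logs/train/
├── events.out.tfevents.1554240944.myhost                 # main summaries, still being appended to
└── events.out.tfevents.1554240946.myhost.profile-empty   # empty marker file written by the profiler
```

Because the marker file sorts after the main events file, a loader that only reads the "latest" file in a run directory stops seeing new summaries.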
Yep, I suspect so, too:

Note that the fake
So, related: #1867.
Yeah, that must be it. I thought the code that was writing out the
It only checks whether a file ending in the suffix already exists:

```python
_EVENT_FILE_SUFFIX = '.profile-empty'
# …
for file_name in gfile.ListDirectory(logdir):
  if file_name.endswith(_EVENT_FILE_SUFFIX):
    return
# TODO(b/127330388): Use summary_ops_v2.create_file_writer instead.
event_writer = pywrap_tensorflow.EventsWriter(
    compat.as_bytes(os.path.join(logdir, 'events')))
event_writer.InitWithSuffix(compat.as_bytes(_EVENT_FILE_SUFFIX))
```
Just to note explicitly: setting
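The workaround that appears later in this thread (the `profile_batch=0,  # workaround for issue #2084` line in the demo patch) amounts to configuring the Keras TensorBoard callback like this; the `./logs` path and `update_freq` value are illustrative, and the callback construction itself is left commented out so the sketch stands alone:

```python
# Sketch of the workaround: pass profile_batch=0 to disable profiling,
# so the .profile-empty marker file is never written.
tensorboard_kwargs = dict(
    log_dir="./logs",       # illustrative path
    update_freq="epoch",    # illustrative value
    profile_batch=0,        # workaround for issue #2084
)
# callback = tf.keras.callbacks.TensorBoard(**tensorboard_kwargs)
# model.fit(..., callbacks=[callback])
```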
Summary: Resolves #2440. See #1998 for discussion.

Test Plan: The hparams demo still does not specify trial IDs (intentionally, as this is the usual path). But apply the following patch—

```diff
diff --git a/tensorboard/plugins/hparams/hparams_demo.py b/tensorboard/plugins/hparams/hparams_demo.py
index ac4e762b..d0279f27 100644
--- a/tensorboard/plugins/hparams/hparams_demo.py
+++ b/tensorboard/plugins/hparams/hparams_demo.py
@@ -63,7 +63,7 @@ flags.DEFINE_integer(
 )
 flags.DEFINE_integer(
     "num_epochs",
-    5,
+    1,
     "Number of epochs per trial.",
 )
@@ -160,7 +160,7 @@ def model_fn(hparams, seed):
   return model
 
 
-def run(data, base_logdir, session_id, hparams):
+def run(data, base_logdir, session_id, trial_id, hparams):
   """Run a training/validation session.
 
   Flags must have been parsed for this function to behave.
@@ -179,7 +179,7 @@ def run(data, base_logdir, session_id, hparams):
       update_freq=flags.FLAGS.summary_freq,
       profile_batch=0,  # workaround for issue #2084
   )
-  hparams_callback = hp.KerasCallback(logdir, hparams)
+  hparams_callback = hp.KerasCallback(logdir, hparams, trial_id=trial_id)
   ((x_train, y_train), (x_test, y_test)) = data
   result = model.fit(
       x=x_train,
@@ -235,6 +235,7 @@ def run_all(logdir, verbose=False):
             data=data,
             base_logdir=logdir,
             session_id=session_id,
+            trial_id="trial-%d" % group_index,
             hparams=hparams,
         )
```

—and then run `//tensorboard/plugins/hparams:hparams_demo`, and observe that the HParams dashboard renders a “Trial ID” column with the specified IDs:

![Screenshot of new version of HParams dashboard][1]

[1]: https://user-images.githubusercontent.com/4317806/61491024-1fb01280-a963-11e9-8a47-35e0a01f3691.png

wchargin-branch: hparams-trial-id
`train` data frozen at launch, while `validation` continues to update
FYI: The problem is still here, in 1cf0898dd of TF v2.0.0. Workaround above works.
Thank you Shanqing. Please take it.
@qqfish Done.
Okay, an easy fix for this suggested by @wchargin would be to just update TensorBoard to ignore any events file that ends in `.profile-empty`.

Background: right now, whether something is an events file is determined by the `IsTensorFlowEventsFile()` predicate in `tensorboard/backend/event_processing/io_wrapper.py` (lines 47 to 61 at d0abd84).
The predicate is used for (A) the code that finds sub-directories within a root logdir that contain any event files (which become the run directories).
However, the predicate is also used for (B) the code that lists event files within a run directory, in `tensorboard/backend/event_processing/plugin_event_accumulator.py` (lines 627 to 641 at d0abd84).
The point of the

For the record, here are the commits where
Thanks for the detailed write-up of the root cause of the issue and the proposal for how to fix it. I've opened #3108 to implement the suggested fix.
…3108)

* Motivation for features / changes
  * Fix #2084
  * The approach is suggested by @wchargin and @nfelt in #2084 (comment)
* Technical description of changes
  * Split the logic of `IsTensorFlowEventsFile()` into two separate functions:
    1. An unchanged `IsTensorFlowEventsFile()` function, which only checks for the existence of the 'tfevents' substring in the path string.
    2. A new `IsSummaryEventsFile()` function, which returns `True` if and only if `IsTensorFlowEventsFile()` returns `True` for the input path name *and* the path name does not end in the special suffix `.profile-empty`.
  * This prevents the `EventAccumulator` implementation from picking up the empty `events.tfevents.*.profile-empty` files, which, under the single-event-file mode, cause the TensorBoard backend to stop reading the latest summaries in the other (i.e., main, non-profiler-generated) `events.*` file. The `*.profile-empty` file was designed to make the TensorBoard backend recognize the subfolder created by the Profile plugin as a valid sub-logdir, even when it contains no other events files.
* Detailed steps to verify changes work correctly (as executed by you)
  * Added new unit tests
  * Manually verified that #2084 is resolved by running `bazel run -c opt tensorboard -- --logdir /path/to/logdir` using the reproduction code in #2084.
  * Screenshot: ![image](https://user-images.githubusercontent.com/16824702/71785529-9e127680-2fce-11ea-98e0-b2efecd99880.png)
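A minimal sketch of the split described above (snake_case stand-ins for the CamelCase functions named in the PR; the real implementation lives in `io_wrapper.py` and is elided here):

```python
import os

_PROFILE_EMPTY_SUFFIX = ".profile-empty"

def is_tensorflow_events_file(path):
    # Unchanged behavior: anything whose basename contains "tfevents"
    # counts as a TensorFlow events file (used for run discovery).
    return "tfevents" in os.path.basename(path)

def is_summary_events_file(path):
    # New, stricter predicate for the EventAccumulator: exclude the
    # empty marker files written by the profiler.
    return (is_tensorflow_events_file(path)
            and not path.endswith(_PROFILE_EMPTY_SUFFIX))

print(is_summary_events_file("logs/train/events.out.tfevents.1554240944.myhost"))
print(is_summary_events_file("logs/train/events.out.tfevents.1554240946.myhost.profile-empty"))
```

With this split, run discovery still sees the `.profile-empty` marker (so the Profile plugin's sub-logdir is recognized), but the accumulator never treats it as a summaries file.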
Repro steps:

1. Create a virtualenv with `tf-nightly-2.0-preview==2.0.0.dev20190402` and open two terminals in this environment.
2. In one terminal, run the following simple Python script (but continue to the next step while this script is still running):
3. Wait for (say) epoch 2/5 to finish training. Then, in the other terminal, launch `tensorboard --logdir ./logs`. Open TensorBoard and observe that both training and validation runs appear with two epochs’ worth of data:
4. As training continues, refresh TensorBoard and/or reload the page. Observe that validation data continues to appear, but training data has stalled: even well after the training has completed, the plot is incomplete:
5. Kill the TensorBoard process and restart it. Note that the data appears as desired:
The same problem occurs in `tf-nightly` (non-`2.0-preview`), but manifests differently: because there is only one run (named `.`) instead of separate `train`/`validation`, all data stops being displayed after the epoch in which TensorBoard is opened.
Note as a special case of this that if TensorBoard is running before training starts, then `train` data may not appear at all.