Add option to read multiple active event files per directory #1867
Conversation
The time-based predicate and overall design look okay to me. A few questions:
- When can the timestamp from TimestampedEventFileLoader be None?
- It looks like once an event file hasn't been written to within inactive_secs, it will never have a chance to be visited again. That's unfortunate for the case where an epoch takes inactive_secs + 1s. Is there a way to discern whether a file is currently opened by another process?
/me submits an official “review” to stop GitHub from nagging me. :-)
…ti-file directory loading behavior
I've finally updated the PR with a different approach to the flags (rather than a single one that forces the user to set a threshold, I now have a flag to enable the behavior and a separate flag to customize the threshold). I also added tests :) And I moved the new code into a new file. PTAL :)
def testInactiveSecsNegative(self):
  flags = FakeFlags('logdir', reload_multifile=True,
                    reload_multifile_inactive_secs=-1)
  filter = application._get_eventfile_active_filter(flags)
Nit: can we call this filter_ to avoid shadowing the builtin? It makes the highlighting a bit confusing.
Renamed to filter_fn, if that's okay.
Certainly.
tensorboard/backend/application.py (outdated)
@@ -106,11 +106,13 @@ def standard_tensorboard_wsgi(flags, plugin_loaders, assets_zip_provider):
  :type plugin_loaders: list[base_plugin.TBLoader]
  :rtype: TensorBoardWSGI
  """
  eventfile_active_filter = _get_eventfile_active_filter(flags)
Global nit: We’ve generally called them “event files” (two words)
throughout the codebase:
$ git grep -c 'events\?file'
tensorboard/plugins/profile/profile_demo.py:1
tensorboard/plugins/profile/profile_plugin_test.py:1
$ git grep 'events\?[ _]file' | wc -l
347
Shall we stick with that instead of adding “eventfile” as an alternate
term?
Fair enough, changed to two words.
tensorboard/backend/application.py (outdated)
"""Returns a predicate for whether an eventfile load timestamp is active. | ||
|
||
Returns: | ||
A predicate function accepting a single UNIX timestamp float argument. |
…or None if multi-file loading is not enabled.
Done.
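For concreteness, here is a minimal sketch of a factory with that contract. This is not the PR's actual implementation: the handling of zero and negative thresholds below is an assumption, and only the flag names come from this PR.

import time

def _get_eventfile_active_filter(flags):
  """Returns a predicate for whether an event file load timestamp is active.

  Returns:
    A predicate function accepting a single UNIX timestamp float argument,
    or None if multi-file loading is not enabled.
  """
  if not flags.reload_multifile:
    return None
  inactive_secs = flags.reload_multifile_inactive_secs
  if inactive_secs == 0:
    return None
  if inactive_secs < 0:
    # Assumption in this sketch: a negative threshold means "never inactive".
    return lambda timestamp: True
  return lambda timestamp: timestamp + inactive_secs >= time.time()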
except tf.errors.OpError:
  if not tf.io.gfile.exists(self._directory):
    raise directory_watcher.DirectoryDeletedError(
        'Directory %s has been permanently deleted' % self._directory)
Do we want else: raise here? Or are there OpErrors that we want to ignore? (Which ones?) If this is intended to be else: pass, can we briefly comment why?
The intent was to avoid a behavior change here relative to the existing logic in DirectoryWatcher. I'm a little reluctant to change to else: raise without some understanding of which errors in the wild would now kill the backend reload thread, but I've changed it to at least log at INFO so we could potentially collect that data - WDYT?
SGTM.
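To make the agreed-on behavior concrete, a self-contained sketch follows (not the PR's actual code): DirectoryDeletedError and load_fn are stand-ins for the real classes, and only the error handling mirrors the discussion above.

import logging

import tensorflow as tf

logger = logging.getLogger(__name__)


class DirectoryDeletedError(Exception):
  """Raised when the watched directory has been permanently deleted."""


def load_once(directory, load_fn):
  """Runs one load pass, tolerating transient filesystem OpErrors."""
  try:
    for value in load_fn():
      yield value
  except tf.errors.OpError as e:
    if not tf.io.gfile.exists(directory):
      raise DirectoryDeletedError(
          'Directory %s has been permanently deleted' % directory)
    # Swallow other OpErrors to match DirectoryWatcher's existing behavior,
    # but log at INFO so we can learn which errors occur in the wild before
    # deciding whether to make them fatal.
    logger.info('Ignoring error while loading %s: %s', directory, e)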
self._loaders[path] = loader
logger.info('Loading data from path %s', path)
for timestamp, value in loader.Load():
  if max_timestamp is None or (timestamp is not None
Isn’t loader.Load() always supposed to yield (float, value) pairs? In what case would timestamp be None?
Good point, removed the guard. Somehow I was thinking if omitted in the proto it could be None, but it's 0.0.
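A quick illustration of that point, using the TF 1.x-style constructor that appears in a one-liner later in this review (tf.compat.v1.Event under TF 2.x): a proto field that is never set reads back as its default value, so wall_time comes through as 0.0 rather than None.

import tensorflow as tf

event = tf.Event(file_version='brain.Event:2')
print(event.wall_time)  # 0.0, because wall_time was never set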
# d: empty file (should be considered active in absence of timestamps).
self._WriteToFile('a', ['A1', 'A2'], [1, 2])
self._WriteToFile('b', ['B1'], [1])
self._WriteToFile('c', ['C2', 'C1', 'C0'], [2, 1, 0])
Is this something that we see in practice? My understanding was that the
timestamps are added by the C++ writer itself (and users have no control
over them). I suppose third-party writers can do whatever they want, and
I appreciate the test case anyway; mostly just curious.
(I’m deliberately ignoring non-monotonic clocks around DST transitions…)
Regrettably, users can indeed control wall_time, because the TF 1.x FileWriter API supports writing arbitrary Events:
python -c 'import glob; import tensorflow as tf; w = tf.summary.FileWriter("/tmp/foo"); w.add_event(tf.Event(wall_time=42.0)); w.flush(); print(list(tf.train.summary_iterator(glob.glob("/tmp/foo/event*")[0])))'
[wall_time: 1564101194.0
file_version: "brain.Event:2"
, wall_time: 42.0
]
FWIW, I deliberately kept that ability out of the TF 2.0 API: tensorflow/tensorflow@1ea3483
I'm optimistic that not many people actually make use of this ability (and obviously if users are doing anything exotic with wall_time then the resource limiting approach here may backfire), but I thought it was at least worth confirming that we do the right thing in a case where e.g. the user takes an eventfile and re-sorts the events, then writes it back out.
Ah, good point. Agreed, then.
all_paths = io_wrapper.ListDirectoryAbsolute(self._directory)
paths = sorted(p for p in all_paths if self._path_filter(p))
for path in paths:
  for value in self._LoadPath(path):
Bit of a race condition here: if listdir returns an event file that is subsequently deleted before we invoke _LoadPath on it, the exception will break us out of the whole Load loop. The race window may be arbitrarily large, because it spans a coroutine boundary.

def testFileDeletedBeforeLoaderOpened(self):
  self._WriteToFile('a', 'a')
  self._WriteToFile('b', 'b')
  load_generator = self._loader.Load()
  self.assertEqual('a', next(load_generator))
  os.unlink(os.path.join(self._directory, 'b'))
  self.assertEqual(list(load_generator), [])  # fails: IOError(ENOENT)

Is this the desired behavior? (There might be a case for that.) I could also see silently skipping the file, or maybe opening all loaders eagerly (for loader in [self._LoadPath(path) for path in paths]:), though maybe we don't like the performance characteristics of that.
Done. Note that this race across the coroutine boundary existed in DirectoryWatcher too; I think the exception was being swallowed by the except clause below. I went with skipping the file, since opening all loaders eagerly is worse for performance (right now, we close inactive ones when we're done, so when reading an old logdir we only ever have one loader open at a time, which is a nice property to retain) and there's still the non-coroutine race in that case anyway.
SGTM; skipping made the most sense to me, too. Thanks!
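For reference, a rough sketch of the skip-on-deletion behavior settled on here (not the PR's actual code): path_filter and load_path_fn stand in for DirectoryLoader's path filter and per-file loading generator, and the exact exception raised for a vanished file is an assumption about the filesystem layer.

import logging
import os

import tensorflow as tf

logger = logging.getLogger(__name__)


def load_directory(directory, path_filter, load_path_fn):
  """Loads each matching file in sorted order, skipping deleted ones."""
  names = tf.io.gfile.listdir(directory)
  paths = sorted(os.path.join(directory, n) for n in names if path_filter(n))
  for path in paths:
    try:
      for value in load_path_fn(path):
        yield value
    except tf.errors.NotFoundError:
      # The file was deleted between the directory listing and the read
      # (possibly across a coroutine boundary); skip it rather than letting
      # the exception abort the whole load pass.
      logger.info('Skipping %s: deleted before it could be read', path)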
# Create two separate event files, using filename suffix to ensure a
# deterministic sort order, and then simulate a write to file A, then
# to file B, then another write to file A (with reloads after each).
with test_util.FileWriter(run_path, filename_suffix='.a') as writer_a:
Looks like this is failing in the TF 2.0 test environment.
Fixed by updating the test file writer to avoid the guard against use under eager mode.
Great!
Thanks for the review - PTAL
Two high-level questions:

- If I understand correctly, the new DirectoryLoader may mark a file as inactive even if it's the only file in the directory. Consider, for instance, a long-running job that writes summaries every two hours; previous versions of TensorBoard would continue updating forever, while under the new semantics we'll only see it update once. Is this correct? It seems at least a little unfortunate. Did we consider not marking a file inactive when it's the only active file?

  Test case, in case the above description is unclear:

  def testFilter_onlyOneLoader(self):
    """A sole loader can be marked as inactive."""
    loader_registry = []
    loader_factory = functools.partial(
        _TimestampedByteLoader, registry=loader_registry)
    threshold = 0
    active_filter = lambda timestamp: timestamp >= threshold
    self._loader = directory_loader.DirectoryLoader(
        self._directory, loader_factory, active_filter=active_filter)
    def assertLoadersForPaths(paths):
      paths = [os.path.join(self._directory, path) for path in paths]
      self.assertEqual(loader_registry, paths)
    #
    self._WriteToFile('a', ['A1', 'A2'], [1, 2])
    threshold = 2
    # First load pass should leave file C marked inactive.
    self.assertLoaderYields(['A1', 'A2'])
    assertLoadersForPaths(['a'])
    threshold = 3
    # Second load pass should mark file A as inactive (due to newly
    # increased threshold) and thus skip reading data from it.
    self.assertLoaderYields([])
    assertLoadersForPaths([])
    # Even when we get more data and it's the only file. :-(
    self._WriteToFile('a', ['A3', 'A4'], [3, 4])
    self.assertLoaderYields([])
    assertLoadersForPaths([])

- The two names DirectoryWatcher and DirectoryLoader tripped me up a bit in the course of this review. They both watch, and they both load; the primary difference is in their multi-file support, yes? Can we find clearer names? :-)
next(generator)  # Ignore the file_version event.
event = next(generator)
self.assertEqual('a', event.summary.value[0].tag)
os.remove(glob.glob(os.path.join(self._directory, '*.b'))[0])
Probably fine here, but I really wish glob.glob had a directory kwarg so that we could be sure that no globs are expanded in the first argument to os.path.join. :-( No action required.
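As a purely optional aside on that nit, one way to get the same effect without risking glob expansion in the directory component is to match basenames with fnmatch; this is a hedged workaround, not something this PR needs.

import fnmatch
import os

def glob_in_dir(directory, pattern):
  """Returns paths in `directory` whose basenames match `pattern`."""
  return [os.path.join(directory, name)
          for name in sorted(os.listdir(directory))
          if fnmatch.fnmatch(name, pattern)]

# e.g. glob_in_dir(self._directory, '*.b')[0] in place of the glob.glob call.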
Yes, the new loading algorithm will stop polling for new data if no files are active (it will still poll for new runs, to be clear). Honestly, I see this as a feature - it means loading up an old experiment will only read each file once and then only monitor for new files. To continue polling the last file in the directory even when it's inactive would prevent us from getting that nice property, unless we set a separate inactive threshold for that case, but I honestly think that would just be more confusing for the user. I'm open to changing the default age threshold to be longer (maybe even 24 hours), but I was thinking of starting off more aggressive and then walking it back if it's an issue, to be more ambitious about curtailing resource usage; I think it's overall easier to lengthen the threshold than to shorten it (the result being higher resource usage, but not unexpectedly absent data).

Re: Watcher and Loader, I originally named it MultiFileDirectoryWatcher, but it's a mouthful and I wasn't crazy about that being the long-term name (assuming we eventually migrate to only this implementation). Also, since the EventAccumulator logic can actually use these in place of an EventFileLoader, they essentially implement a "Loader" contract, so the "Loader" suffix seemed better (and people in the past complained that DirectoryWatcher suggested equivalence with an internal Watcher concept that doesn't have the same semantics). In an admittedly fuzzy sense, the old logic is more like just "watching" the latest file rather than truly loading the whole directory, whereas the new logic really does load the entire directory. That said, I am sympathetic to the similarity in the names being somewhat confusing. Do you have a suggestion in mind?
> Honestly, I see this as a feature […]

Yep, this is pretty reasonable. As long as we message the change clearly, it makes sense to me.

> I'm open to changing the default age threshold to be longer

I’ve got no problem with the proposed default (the “two hours” in my example was just “whatever the default is + ε”). Agreed that it’s easier to revise upward.

> Do you have a suggestion in mind?

Nope!
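To make the polling semantics from this exchange concrete, here is a rough sketch of the kind of bookkeeping described, with illustrative names only: loaders maps path to loader, max_timestamps tracks the newest wall time seen per path, and a Close() method on the loader is an assumption, not the PR's actual API.

def prune_inactive_loaders(loaders, max_timestamps, active_filter):
  """Drops loaders whose newest event wall time no longer passes the filter."""
  for path in list(loaders):
    newest = max_timestamps.get(path)
    # A file with no timestamped events yet stays active, matching the
    # "empty file" case discussed in the tests above.
    if newest is not None and not active_filter(newest):
      loaders.pop(path).Close()  # This file will no longer be polled.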
PTAL, I've updated the FAQ blurbs to be clearer that the option changes the polling criteria, rather than just expanding them (which the previous blurbs implied).
This adds an experimental option --reload_multifile=true that addresses #1063 by polling for new data from multiple active event files within a given run directory, rather than only polling the most recent event file. (See #1063 for why that behavior has been problematic.)

To avoid runaway resource usage, we don't unconditionally poll all event files, since in most cases many of them are no longer receiving new data. This PR instead introduces a notion of an "inactive age" (settable via --reload_multifile_inactive_secs) beyond which an event file with no event wall time more recent than that age will be considered "inactive" and no longer polled. This means, for example, that users who accumulate many runs over a period of weeks in the same logdir will only poll new data from runs that have had data written within a recent window. (In this respect, the new code can significantly improve resource usage against logdirs in which many runs are entirely static.) The default value of this flag is set to 4000 (somewhat over an hour) to accommodate hourly update frequencies while still representing a fairly aggressive cutoff. In theory we could compute a smarter threshold by taking into account the length of time taken by previous reloads as well as the --reload_interval value, but this seems reasonable for now.

Enabling the behavior may still result in a large increase in memory usage and disk reads / network requests for logdirs that have directories with many recently written files, since those will now all be polled. Possible further efficiency improvements include: