Support multiple file writers writing at once. #1063
Just jotting down what we discussed while it's fresh in my head. We keep the essential structure of DirectoryWatcher - Load() spools events from the current _loader until it's exhausted, and then moves on to a new file. However, we change the algorithm for moving on to a new file to do the following:
We're proposing that MAX_INACTIVE_AGE be set to 24 hours (or perhaps slightly more, to allow for files written exactly once every 24 hours). OOO_WRITE_CHECK_COUNT is currently 20, but I think it could be lower (like 1 or 2, really).

The goal of this new algorithm is that we still proceed with loading files in timestamp order, but once we get to the end of that order, we loop back to the beginning and re-read files again. This handles the case where there are multiple active streams of data in different files. To keep from having to re-read every file, we mark files as "dead" in two different ways: 1) if a file hasn't been written to in MAX_INACTIVE_AGE (as determined by last event time), and 2) if a file is preempted by a newer-timestamped file with the same suffix. However, files that are dead of type 2 we still keep around to optionally check whether they have been modified, in case we want to detect out-of-order writes.

The end result should be that for a static set of files (e.g. a terminated experiment), we do read all the data, but once we've gone through the initial set of files, we only reload data from a set of active "head" files of the streams that might still be open, which is one stream per suffix for files written to in the last 24 hours. This keeps us from having to keep event file readers open and querying for new data for every single event file, even for very old logdirs.
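The bookkeeping described above could be sketched roughly like this. This is a minimal toy model of the proposal, not TensorBoard code; `FileState`, `pick_active_files`, and the constant values are hypothetical names chosen for illustration:

```python
# Toy sketch of the proposed re-reading algorithm (hypothetical names; not
# TensorBoard code). Files are visited in filename-timestamp order; at the
# end of the list we loop back, skipping files marked dead.
MAX_INACTIVE_AGE = 24 * 60 * 60   # proposed: 24 hours, in seconds
OOO_WRITE_CHECK_COUNT = 2         # extra passes to check for out-of-order writes

class FileState:
    def __init__(self, path, timestamp, suffix):
        self.path = path
        self.timestamp = timestamp      # timestamp embedded in the filename
        self.suffix = suffix            # e.g. the hostname part of the filename
        self.last_event_time = timestamp
        self.dead = False
        self.ooo_checks_left = 0        # >0: dead, but still checked for late writes

def pick_active_files(files, now):
    """Return the files to (re)load on this pass, marking dead ones.

    `files` must be sorted by filename timestamp. A file dies when (1) its
    last event is older than MAX_INACTIVE_AGE, or (2) a newer file with the
    same suffix preempts it; type-2 files are re-checked for a few more
    passes in case of out-of-order writes.
    """
    newest_by_suffix = {}
    for f in files:
        newest_by_suffix[f.suffix] = f  # sorted order, so the last one wins
    active = []
    for f in files:
        if f.dead and f.ooo_checks_left <= 0:
            continue                     # permanently dead: never re-read
        if now - f.last_event_time > MAX_INACTIVE_AGE:
            f.dead = True                # dead, type 1: inactive too long
            continue
        if newest_by_suffix[f.suffix] is not f:
            if not f.dead:
                f.dead = True            # dead, type 2: preempted by newer file
                f.ooo_checks_left = OOO_WRITE_CHECK_COUNT
            else:
                f.ooo_checks_left -= 1
            active.append(f)             # still re-check for out-of-order writes
        else:
            active.append(f)             # live "head" file of its stream
    return active
```

On each pass, a preempted file stays in the reload set for OOO_WRITE_CHECK_COUNT more passes before being dropped for good, while each suffix's newest file remains a live head until it goes inactive.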
@nfelt, your plan as outlined above, plus adding tests, sounds great. We probably want to merge this change soon to unblock the migration to tf.contrib summaries. Do you want to take over this effort, since I'm OOO?
@chihuahua oh shoot, I forgot you were out starting tomorrow... hmm, we probably should have started the code in a separate file so we could check it in and then keep working on tests, etc. I guess I'll just grab a copy of the code from your PR and make a new PR? Unless you have time to quickly change the current PR to target a separate file, in which case we could merge it as work-in-progress and I could take over from there.
Ah yes. Within #1064, I moved the changes to a new suffix_based_directory_watcher.py file (the name is definitely tentative). We can check it in and iterate. Or, you could merge the changes into your own new PR. Thanks for the idea.
Is this feature still planned to be supported? |
I've merged #1867, which introduces an experimental flag. If you've been waiting on this functionality, thanks for your patience, and please try out the new flag.
Just in time!
Since in tf-2.0 beta1 the tensorflow.keras.callbacks.TensorBoard() callback seems to rely on that feature, it might be a good idea to enable it by default or (if that's not an option) to mention the feature in the E0821 warning.
Tensorboard does not work well when multiple file writers write into the same directory; see [related issue](tensorflow/tensorboard#1063). The monitor creates a file writer with every summary. This pull request resolves the issue by creating a single writer per run and depending on it.
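The "single writer per run" fix can be illustrated with a toy sketch. The `FileWriter` class and `writer_for_run` helper below are hypothetical stand-ins, not TensorFlow's real API; the point is only that caching writers by run directory means repeated summary calls reuse one writer (and thus one events file) instead of creating a new writer per summary:

```python
import functools

# Hypothetical sketch: cache one writer per run directory so repeated
# summary calls append to the same events file. FileWriter is a stand-in
# counting how many events files would have been created.
class FileWriter:
    created = 0  # counts writer (i.e. events-file) creations

    def __init__(self, logdir):
        FileWriter.created += 1
        self.logdir = logdir
        self.events = []

    def add_summary(self, value):
        self.events.append(value)

@functools.lru_cache(maxsize=None)
def writer_for_run(logdir):
    # One writer per run directory, created lazily on first use.
    return FileWriter(logdir)

for step in range(5):
    writer_for_run("runs/exp1").add_summary(step)  # same writer every call
```

With a writer created per summary instead, five summaries would have produced five events files in the run directory, which is exactly the pattern the single-pass watcher cannot follow.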
Closing this issue since the PR is merged. Please feel free to reopen if this still exists. Thanks
This issue should stay open, since the flag is still experimental and not on by default yet.
Background
TensorBoard assigns one `DirectoryWatcher` per run. The `DirectoryWatcher` watches for new events being written to the directory and loads them into TensorBoard's multiplexer. Today, the `DirectoryWatcher` iterates through events files (for a run) in lexicographic order. After it has finished reading from an events file, it never reads from it again (and moves on to a subsequent events file).

This behavior means that TensorBoard does not support multiple file writers writing to the same run directory: TensorBoard would move on to a different events file B once it finishes reading events file A, and would never go back to reading A no matter how A is later updated.
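As a toy illustration of that single-pass behavior (a minimal sketch, not the actual `DirectoryWatcher` implementation):

```python
# Toy model of today's single-pass behavior (not the real DirectoryWatcher):
# events files are visited in sorted (lexicographic) order, and once the
# watcher has moved past a file it never reads it again, so events appended
# later by a second writer are silently dropped.
class SinglePassWatcher:
    def __init__(self):
        self.current = None   # name of the events file being read
        self.offset = 0       # events of `current` consumed so far

    def load(self, files):
        """files: dict of filename -> list of events. Yields unseen events."""
        for name in sorted(files):
            if self.current is not None and name < self.current:
                continue      # already finished this file; never look back
            if name != self.current:
                self.current, self.offset = name, 0
            events = files[name]
            while self.offset < len(events):
                yield events[self.offset]
                self.offset += 1
```

Once the watcher has advanced from file A to file B, any event a second writer appends to A is never yielded, which is precisely the failure mode this issue describes.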
In turn, that problem blocks TensorFlow+TensorBoard's transition towards migrating `Estimator`, tflearn hooks, and other high-level TensorFlow constructs to using `tf.contrib.summary`-style summaries. `tf.contrib.summary`-style summaries are written via a type of `SummaryWriter` different from the one that writes traditional-style (`tf.summary.*`) summaries, so users that intermingle the two will necessarily have to use two summary writers concurrently. Many users throughout google3 intermingle the summaries, and the transition must happen soon.