What's the right way to discuss solutions here? I have one proposed solution in a draft PR, but I figure it's best to discuss with the maintainers first how best to solve this.
🚀 Feature
TensorBoard allows you to write to Google Cloud Storage, S3, HDFS, etc. by specifying paths with the right prefix, e.g.
logDir='hdfs://path/to/logs/'
However, the Lightning logger breaks this; see tensorboard.py#L99.
Motivation
Training often runs on remote clusters that do not persist the local disk once the job ends. The local disk is also inaccessible to outside tools, so TensorBoard cannot read the results while training is in progress.
Pitch
Replace all directory operations with a remote-aware tool. TensorBoard itself provides a gfile-compatible handle, and there are other options as well. Since TensorBoard supports these remote paths natively, maybe we can avoid local file operations entirely and rely on the TensorBoard library to write remotely.
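To make the pitch concrete, here is a minimal sketch of what "remote-aware" directory handling could look like, using only the standard library. The helper names (`get_scheme`, `join_path`, `ensure_dir`) are hypothetical and are not part of Lightning or TensorBoard. The sketch assumes that whatever writer is used for remote paths creates the locations it needs, so local directory calls are simply skipped for prefixed paths.

```python
import os
from urllib.parse import urlsplit


def get_scheme(path: str) -> str:
    """Return the URL scheme of `path`, or "file" for plain local paths."""
    scheme = urlsplit(path).scheme
    # A single-letter "scheme" is a Windows drive letter (e.g. C:\logs), not a URL.
    if len(scheme) <= 1:
        return "file"
    return scheme


def join_path(base: str, *parts: str) -> str:
    """Join path components without mangling a remote prefix such as hdfs://."""
    if get_scheme(base) == "file":
        return os.path.join(base, *parts)
    # Remote paths always use forward slashes, regardless of the local OS.
    return "/".join([base.rstrip("/"), *(p.strip("/") for p in parts)])


def ensure_dir(path: str) -> None:
    """Create `path` only when it lives on the local filesystem."""
    if get_scheme(path) == "file":
        os.makedirs(path, exist_ok=True)
    # Assumption: for remote paths the remote-aware writer creates what it needs,
    # so this sketch deliberately does nothing here.
```

With helpers like these, `join_path("hdfs://path/to/logs/", "version_0")` keeps the `hdfs://` prefix intact, and `ensure_dir` never tries to create a local directory named after a remote URL.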
Alternatives
Another option would be to write locally and add hooks that sync the logs to remote storage.
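As a rough illustration of this alternative, the sketch below walks a local log directory and hands new or changed files to an upload callback. Everything here is an assumption made for illustration: `sync_logs` is a hypothetical helper, and `upload` stands in for whatever client would actually copy a file to remote storage. A hook in the training loop could call it periodically and once more when the job ends.

```python
import os
from typing import Callable, Dict, Tuple


def sync_logs(
    local_dir: str,
    upload: Callable[[str, str], None],
    seen: Dict[str, Tuple[int, int]],
) -> int:
    """Upload files under `local_dir` that are new or changed since the last call.

    `upload(local_path, relative_path)` is a caller-supplied (hypothetical) function
    that copies one file to remote storage. `seen` maps each relative path to the
    (size, mtime) pair recorded at its last upload and is updated in place.
    Returns the number of files uploaded by this call.
    """
    uploaded = 0
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            local_path = os.path.join(root, name)
            stat = os.stat(local_path)
            signature = (stat.st_size, stat.st_mtime_ns)
            # Use forward slashes so the relative path is valid as a remote key.
            rel = os.path.relpath(local_path, local_dir).replace(os.sep, "/")
            if seen.get(rel) != signature:
                upload(local_path, rel)
                seen[rel] = signature
                uploaded += 1
    return uploaded
```

The trade-off compared with writing remotely is that results only become visible after each sync, and anything written after the last sync is lost if the job dies before the final call.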