Is your feature request related to a problem? Please describe.
Regularly offloading checkpoints from local storage to object storage (e.g. FSx to S3) is painful, time-consuming, and error-prone. It would be great if we could automate this.
Describe the solution you'd like
Using fsspec might be a good way of doing this.
Describe alternatives you've considered
Open to alternatives for implementing this! Most importantly, we want a solution that is robust to interruption: it should report the failure and abort the run if a checkpoint fails to save, and it should not delete checkpoints until they are confirmed to be backed up.
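To make the requirement concrete, here is a rough sketch of the "upload, verify, then delete" ordering. It is only an illustration, not a proposed final implementation: `offload_checkpoint` and `CheckpointBackupError` are hypothetical names, and `fs` is assumed to be any fsspec-style filesystem object exposing `put_file` and `size` (for example one obtained from `fsspec.filesystem("s3")`). The size comparison is a cheap stand-in for verification; a checksum would be stronger.

```python
import os


class CheckpointBackupError(RuntimeError):
    """Hypothetical error type: the checkpoint could not be confirmed on remote storage."""


def offload_checkpoint(fs, local_path, remote_path, delete_local=True):
    """Upload one checkpoint file, verify it, and only then delete the local copy.

    `fs` is assumed to be an fsspec-style filesystem with `put_file(lpath, rpath)`
    and `size(path)`. Returns the verified remote size in bytes.
    """
    local_size = os.path.getsize(local_path)

    try:
        fs.put_file(local_path, remote_path)
        remote_size = fs.size(remote_path)
    except Exception as exc:
        # Surface the failure so the caller can abort the run; the local file is untouched.
        raise CheckpointBackupError(
            f"upload of {local_path} to {remote_path} failed"
        ) from exc

    # Assumption: matching byte counts are treated as "backed up".
    if remote_size != local_size:
        raise CheckpointBackupError(
            f"size mismatch for {remote_path}: local={local_size}, remote={remote_size}"
        )

    # Deletion happens last, so an interruption before this point keeps the local copy.
    if delete_local:
        os.remove(local_path)
    return remote_size
```

In a training loop, the caller would let `CheckpointBackupError` propagate (or log it and exit) so that a failed backup stops the run instead of silently continuing.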
Additional context
An example PR for OpenCLIP containing what we might want to implement is here: https://github.com/mlfoundations/open_clip/pull/319/files
I may work on this soon, TBD based on how much else I need to do.