Allow automatic saving / backing up checkpoints to object storage like S3 #781

Closed
haileyschoelkopf opened this issue Feb 4, 2023 · 4 comments
Labels: feature request

Comments

@haileyschoelkopf
Contributor

Is your feature request related to a problem? Please describe.
Regularly offloading checkpoints from local storage to object storage (e.g. FSx to S3) is very painful, time-consuming, and error-prone. It would be great if we could automate this.

Describe the solution you'd like
Fsspec might be a good way of doing this.

Describe alternatives you've considered
Open to alternatives for implementing this! Most importantly, we want a solution that is robust to interruption: it should report the failure and abort the run if a checkpoint fails to save, and it should not delete any checkpoint until it is confirmed to be backed up.
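
To make the requirement concrete, here is a rough sketch of what "upload, verify, and only then delete" could look like. It is not taken from gpt-neox or from the OpenCLIP PR; `backup_checkpoint` and `CheckpointBackupError` are hypothetical names, and the size comparison is a stand-in for whatever integrity check we end up wanting. The only fsspec calls it assumes are `url_to_fs`, `makedirs`, `put_file`, and `size`.

```python
import shutil
from pathlib import Path


class CheckpointBackupError(RuntimeError):
    """Hypothetical error type: a checkpoint could not be uploaded or verified."""


def backup_checkpoint(local_dir, remote_dir, fs=None, delete_local=False):
    """Upload every file under local_dir, verify it, then optionally delete local_dir.

    `fs` is any fsspec-compatible filesystem. If omitted, one is inferred from
    the remote URL (an "s3://..." URL assumes s3fs is installed).
    Returns the number of files uploaded.
    """
    local_dir = Path(local_dir)
    if fs is None:
        # Imported lazily so the helper can also be driven by a stub filesystem.
        from fsspec.core import url_to_fs
        fs, remote_dir = url_to_fs(str(remote_dir))
    remote_dir = str(remote_dir).rstrip("/")

    files = sorted(p for p in local_dir.rglob("*") if p.is_file())
    if not files:
        raise CheckpointBackupError(f"no files found under {local_dir}")

    try:
        for path in files:
            rel = path.relative_to(local_dir).as_posix()
            target = f"{remote_dir}/{rel}"
            fs.makedirs(target.rsplit("/", 1)[0], exist_ok=True)
            fs.put_file(str(path), target)
            # Assumption: matching byte counts are enough to trust the upload.
            if fs.size(target) != path.stat().st_size:
                raise CheckpointBackupError(f"size mismatch for {target}")
    except CheckpointBackupError:
        raise
    except Exception as exc:
        # Surface any storage error so the training loop can abort the run.
        raise CheckpointBackupError(f"backup of {local_dir} failed: {exc}") from exc

    # Only reached once every file has been uploaded and verified.
    if delete_local:
        shutil.rmtree(local_dir)
    return len(files)
```

The training loop would call this after each save and let `CheckpointBackupError` propagate, so a failed upload stops the run instead of being silently skipped. A checksum comparison would be stricter than the size check used here.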

Additional context
An example PR for OpenCLIP showing what we might want to implement is here: https://github.com/mlfoundations/open_clip/pull/319/files

I may work on this soon, TBD based on how much else I need to do.

@haileyschoelkopf added the "feature request" label on Feb 4, 2023
@Quentin-Anthony
Member

Tensorizer will be relevant: https://github.com/coreweave/tensorizer

@lintangsutawika self-assigned this on Jun 19, 2023
@lintangsutawika
Contributor

Definitely a feature we need moving forward.

After some digging, it looks like I need to make changes to both gpt-neox and DeeperSpeed.

@dashstander
Contributor

Isn't this issue closed by #1010? @Quentin-Anthony @haileyschoelkopf

@Quentin-Anthony
Member

> Isn't this issue closed by #1010? @Quentin-Anthony @haileyschoelkopf

Yep!

4 participants