Preupload lfs files before commiting #1699

What happens if there is a failure on one of the operations before the commit is created?

If there is a failure in one of the uploads, the script will crash (if not wrapped in a try/except). If the script is restarted, already uploaded files will not need to be re-uploaded (`preupload_lfs_files` will return instantly for them). However, as long as the `create_commit` operation has not completed, the files are not accessible to anyone and cannot be seen on the repo on the Hub.

Also, there was a question at some point about garbage-collecting the untracked files on S3 after some time (after 24h?) cc @Pierrci. I think it's not enabled yet, but it would mean that if the user waits too long before creating the commit, the uploaded data is lost.
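
For illustration, a minimal sketch of the flow described above, using the `preupload_lfs_files` / `create_commit` API from this PR; the repo id and file names are placeholders:

```python
from huggingface_hub import CommitOperationAdd, HfApi

api = HfApi()
repo_id = "username/my-dataset"  # placeholder repo

operations = [
    CommitOperationAdd(path_in_repo=f"data/shard-{i}.parquet", path_or_fileobj=f"shard-{i}.parquet")
    for i in range(10)
]

# Upload the LFS payloads first. If the script crashes here and is restarted,
# files that were already uploaded are recognized by their hash and skipped.
api.preupload_lfs_files(repo_id, additions=operations)

# Until this call completes, nothing is visible on the repo on the Hub.
api.create_commit(repo_id, operations=operations, commit_message="Upload shards")
```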

Is there a way to check if a certain file has been preuploaded based on its name? Or does it require the hash?

That would help implement fast `push_to_hub` resuming if it crashes mid-way.

It requires the hash, unfortunately. Filenames are just convenient aliases saved in git; what matters on S3 is the uniqueness of the files (i.e., their hash). For context, until the `create_commit` step, the filename is not even provided to the Hub.
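
For reference, that sha256 can be computed client-side; a small sketch, assuming the `upload_info` attribute that `CommitOperationAdd` fills in when instantiated:

```python
from huggingface_hub import CommitOperationAdd

op = CommitOperationAdd(path_in_repo="data/shard-0.parquet", path_or_fileobj="shard-0.parquet")

# upload_info holds the size and sha256 computed from the file contents.
# The path_in_repo alias only becomes meaningful at create_commit time.
print(op.upload_info.sha256.hex())
```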

Ok, I see. I guess we can do a commit every N files to allow resuming from there.
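
A rough sketch of that idea, reusing the placeholder names from the sketch above; committing every N files means a crash loses at most the uncommitted tail:

```python
from huggingface_hub import CommitOperationAdd, HfApi

api = HfApi()
repo_id = "username/my-dataset"  # placeholder repo
N = 50  # commit granularity: smaller N = finer-grained resuming

shards = [f"shard-{i}.parquet" for i in range(500)]  # placeholder file list

for start in range(0, len(shards), N):
    batch = [
        CommitOperationAdd(path_in_repo=f"data/{name}", path_or_fileobj=name)
        for name in shards[start : start + N]
    ]
    # Already-uploaded files in the batch are skipped based on their hash.
    api.preupload_lfs_files(repo_id, additions=batch)
    # Each commit makes its batch visible, so a restart only redoes the rest.
    api.create_commit(
        repo_id,
        operations=batch,
        commit_message=f"Upload shards {start}-{start + len(batch) - 1}",
    )
```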

@lhoestq Yes indeed.

But in general, what takes the most time in `push_to_hub`? Is it generating + hashing the shards, or uploading them? Because if generate + hash takes 5% of the upload time, for instance, resuming from a failed `push_to_hub` should still be very fast.

Our current "resuming" logic is to generate the shards and compute their "fingerprint" to check whether they are already present in the repo (this mimics hashing), so resuming from a failed `push_to_hub` should be fine: the generating + hashing step is pretty fast (I don't remember anyone complaining about it being slow).