Preupload LFS files before committing #1699
Conversation
The documentation is not available anymore as the PR was closed or merged.
Codecov Report

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1699      +/-   ##
==========================================
+ Coverage   81.52%   82.39%   +0.87%
==========================================
  Files          62       62
  Lines        7198     7226      +28
==========================================
+ Hits         5868     5954      +86
+ Misses       1330     1272      -58
```

☔ View full report in Codecov by Sentry.
As discussed offline, a good server-centric review of this PR would be required from e.g. @coyotte508 and/or @Pierrci
Thanks for pinging them 👍 Apart from the PR review, @lhoestq @mariosasko could you open a branch in `datasets`?
Looks great! I left a few nits/questions.
Thanks!
docs/source/en/guides/upload.md
```py
>>> from huggingface_hub import CommitOperationAdd, preupload_lfs_files, create_commit, create_repo

>>> repo_id = create_repo("test_preupload").repo_id

>>> operations = []  # List of all `CommitOperationAdd` objects that will be generated
>>> for i in range(5):
...     content = ...  # generate binary content
...     addition = CommitOperationAdd(path_in_repo=f"shard_{i}_of_5.bin", path_or_fileobj=content)
...     preupload_lfs_files(repo_id, additions=[addition])
...     operations.append(addition)

# Create commit
>>> create_commit(repo_id, operations=operations, commit_message="Commit all shards")
```
What happens if there is a failure on one of the operations before the commit is created?
If there is a failure in one of the uploads, the script will crash (if not wrapped in a try/except). If the script is restarted, already uploaded files will not need to be re-uploaded (`preupload_lfs_files` will be instant). However, as long as the `create_commit` operation is not completed, the files are not accessible to anyone and cannot be seen in the repo on the Hub.
Also, there was a question at some point about garbage-collecting the untracked files on S3 after some time (after 24h?) cc @Pierrci. I think it's not enabled yet, but it would mean that if the user waits too long before creating the commit, the uploaded files are lost.
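A minimal sketch (not from the PR) of how a caller could retry around this behavior, assuming the same `preupload_lfs_files` signature as in the docs snippet above; `upload_with_retry` and `max_retries` are hypothetical names:

```py
from huggingface_hub import preupload_lfs_files

def upload_with_retry(repo_id, addition, max_retries=3):
    """Pre-upload a single `CommitOperationAdd`, retrying on transient failures.

    A retry after a partial failure is cheap: content that already reached S3
    is not re-uploaded, so a repeated call returns almost instantly.
    """
    for attempt in range(1, max_retries + 1):
        try:
            preupload_lfs_files(repo_id, additions=[addition])
            return
        except Exception as err:  # e.g. a transient network error
            if attempt == max_retries:
                raise
            print(f"Upload failed ({err!r}), retrying ({attempt}/{max_retries})...")
```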
Is there a way to check if a certain file has been preuploaded based on its name? Or does it require the hash? That would help implement a fast `push_to_hub` resuming if it crashed mid-way.
> Is there a way to check if a certain file has been preuploaded based on its name? Or does it require the hash?

It requires the hash unfortunately. Filenames are just convenient aliases saved in git; what matters on S3 is the uniqueness of the files (i.e. based on hash). For context, until the `create_commit` step, the filename is not even provided to the Hub.
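To illustrate the point, a small sketch assuming the standard git-LFS convention where the object ID is the sha256 digest of the raw content:

```py
import hashlib

content = b"some shard bytes"

# The same bytes always produce the same object ID, whatever `path_in_repo` is used,
# so the filename plays no role in deduplication on the storage side.
oid_if_named_shard_0 = hashlib.sha256(content).hexdigest()
oid_if_named_shard_1 = hashlib.sha256(content).hexdigest()
assert oid_if_named_shard_0 == oid_if_named_shard_1
```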
Ok, I see. I guess we can do a commit every N files to allow resuming from there.
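A rough sketch of that idea (hypothetical names: `generate_shards`, the repo id, and the commit frequency `N` are not from the PR), committing every N shards so a crashed job can resume from the last commit:

```py
from huggingface_hub import CommitOperationAdd, create_commit, preupload_lfs_files

def generate_shards(num=100):
    """Placeholder generator yielding binary shard content."""
    for i in range(num):
        yield f"shard {i} content".encode()

repo_id = "username/my-dataset"  # hypothetical repo
N = 50  # commit every N shards (arbitrary)
operations = []

for i, content in enumerate(generate_shards()):
    addition = CommitOperationAdd(path_in_repo=f"shard_{i}.bin", path_or_fileobj=content)
    preupload_lfs_files(repo_id, additions=[addition])
    operations.append(addition)
    if len(operations) == N:
        create_commit(repo_id, operations=operations, commit_message=f"Upload shards up to {i}")
        operations = []

if operations:  # commit the remaining shards
    create_commit(repo_id, operations=operations, commit_message="Upload remaining shards")
```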
@lhoestq Yes indeed.
But in general, what takes the most time in `push_to_hub`? Is it generating + hashing the shards or uploading them? Because if generate+hash takes 5% of the upload time for instance, resuming from a failed `push_to_hub` should still be very fast.
Our current "resuming" logic is to generate the shards and compute their "fingerprint" to check if they are already present in the repo (mimics hashing), so "resuming from a failed `push_to_hub`" should be fine, as the "generating + hashing" step is pretty fast (I don't remember anyone complaining about it being slow).
Thanks for the review @LysandreJik! I have addressed your comments. We also got approval from the moon-landing side (cc @Pierrci @coyotte508), and @mariosasko created a draft PR to test it in `datasets`.
Didn't check the details of the code, but as discussed offline, the logic LGTM 👍
Thanks! The code looks good.
Some doc improvements:
LGTM, feel free to merge @Wauplin!
Co-authored-by: Mario Šaško <mario@huggingface.co>
Thanks everyone for the reviews and feedback! I'll merge it now and ship it in the upcoming v0.18 release :) 🚀
Related to huggingface/datasets#6257 (comment) cc @mariosasko @lhoestq.
The goal of this PR is to add the possibility to preupload files before making the `create_commit` call. This is expected to be a power-user feature, meant for users generating huge files on the fly (e.g. basically `datasets` when uploading sharded Parquet files). The current problem is that each file has to be committed one by one to avoid out-of-memory issues. This strategy leads to 1. users being rate-limited (too many commits) and 2. a messy git history.

The solution proposed in this PR is to add a `preupload_lfs_files` method. It takes a list of `CommitOperationAdd` objects and uploads them to S3 (if LFS). Once all the `CommitOperationAdd` objects are uploaded, a single `create_commit` call is needed.

To make this work, `CommitOperationAdd` is mutated during the `create_commit` process. Some internal attributes are set to keep track of the status (`_upload_mode`, `_is_uploaded`, `_is_committed`). This is the case whether or not the user calls `preupload_lfs_files`. This is a breaking change, but I don't expect it to break any current workflow. It is only a problem if a user defines a single `CommitOperationAdd` object to commit to 2 different repos.

Finally, `preupload_lfs_files` also mutates the `CommitOperationAdd` objects to remove the binary content from them. This is expected, as we want to free up the memory (this is the purpose of the whole PR 😄).

Documentation:

Here is what it would look like:
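(The snippet below is a sketch mirroring the docs example reviewed above; the exact example in the original PR description may differ slightly.)

```py
>>> from huggingface_hub import CommitOperationAdd, create_commit, create_repo, preupload_lfs_files

>>> repo_id = create_repo("test_preupload").repo_id

>>> operations = []  # list of `CommitOperationAdd` objects committed at the end
>>> for i in range(5):
...     content = ...  # generate binary content on the fly
...     addition = CommitOperationAdd(path_in_repo=f"shard_{i}_of_5.bin", path_or_fileobj=content)
...     preupload_lfs_files(repo_id, additions=[addition])  # upload to S3 and free the memory
...     operations.append(addition)

>>> create_commit(repo_id, operations=operations, commit_message="Commit all shards")
```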
TODO:
- `datasets` (Reduce the number of commits in `push_to_hub`, datasets#6269)