Preupload LFS files before committing #1699

Merged 9 commits into main on Oct 4, 2023
Conversation

Wauplin
Contributor

@Wauplin Wauplin commented Sep 28, 2023

Related to huggingface/datasets#6257 (comment) cc @mariosasko @lhoestq.

The goal of this PR is to add the possibility to preupload files before making the create_commit call. This is meant to be a power-user feature for users generating huge files on the fly (e.g. datasets uploading sharded Parquet files). The current problem is that each file has to be committed one by one to avoid out-of-memory issues. This strategy leads to (1) users being rate-limited (too many commits) and (2) a messy git history.

The solution proposed in this PR is to add a preupload_lfs_files method. It takes a list of CommitOperationAdd objects and uploads them to S3 (if LFS). Once all the CommitOperationAdd objects are uploaded, a single create_commit call is needed.

To make this work, CommitOperationAdd is mutated during the create_commit process. Some internal attributes are set to keep track of the status (_upload_mode, _is_uploaded, _is_committed). This happens whether or not the user uses preupload_lfs_files. It is a breaking change, but I don't expect it to break any current workflow; it is only a problem if a user reuses a CommitOperationAdd object to commit to 2 different repos.

Finally, preupload_lfs_files also mutates the CommitOperationAdd objects to remove the binary content from them. This is expected, as we want to free up the memory (this is the purpose of the whole PR 😄).

Documentation:


Here is how it looks:

>>> from huggingface_hub import CommitOperationAdd, preupload_lfs_files, create_commit, create_repo

>>> repo_id = create_repo("test_preupload").repo_id

>>> operations = [] # List of all `CommitOperationAdd` objects that will be generated
>>> for i in range(5):
...     content = ... # generate binary content
...     addition = CommitOperationAdd(path_in_repo=f"shard_{i}_of_5.bin", path_or_fileobj=content)
...     preupload_lfs_files(repo_id, additions=[addition])
...     operations.append(addition)

# Create commit
>>> create_commit(repo_id, operations=operations, commit_message="Commit all shards")

TODO:

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Sep 28, 2023

The documentation is not available anymore as the PR was closed or merged.

@codecov

codecov bot commented Sep 29, 2023

Codecov Report

Attention: 5 lines in your changes are missing coverage. Please review.

Comparison is base (89cc691) 81.52% compared to head (3c2d504) 82.39%.

❗ Current head 3c2d504 differs from pull request most recent head 61f0637. Consider uploading reports for the commit 61f0637 to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1699      +/-   ##
==========================================
+ Coverage   81.52%   82.39%   +0.87%     
==========================================
  Files          62       62              
  Lines        7198     7226      +28     
==========================================
+ Hits         5868     5954      +86     
+ Misses       1330     1272      -58     
Files Coverage Δ
src/huggingface_hub/__init__.py 75.75% <ø> (ø)
src/huggingface_hub/hf_api.py 86.40% <92.59%> (+0.04%) ⬆️
src/huggingface_hub/_commit_api.py 91.94% <78.57%> (-1.13%) ⬇️

... and 5 files with indirect coverage changes


@Wauplin Wauplin marked this pull request as ready for review September 29, 2023 14:12
@julien-c
Member

As discussed offline, a good server-centric review of this PR would be required from e.g. @coyotte508 and/or @Pierrci

@Wauplin
Contributor Author

Wauplin commented Sep 29, 2023

Thanks for pinging them 👍

Apart from the PR review, @lhoestq @mariosasko could you open a branch in datasets to test the workflow from this PR? It doesn't have to be merged on your side until we make a merge+release of huggingface_hub but I want to be 100% sure we are addressing your problem correctly (as we implement it mostly for the parquet shards use case).

Member

@LysandreJik LysandreJik left a comment


Looks great! I left a few nits/questions.

Thanks!

docs/source/en/guides/upload.md (resolved)
src/huggingface_hub/_commit_api.py (resolved)
src/huggingface_hub/hf_api.py (resolved)
Comment on lines 454 to 468
```py
>>> from huggingface_hub import CommitOperationAdd, preupload_lfs_files, create_commit, create_repo

>>> repo_id = create_repo("test_preupload").repo_id

>>> operations = [] # List of all `CommitOperationAdd` objects that will be generated
>>> for i in range(5):
...     content = ... # generate binary content
...     addition = CommitOperationAdd(path_in_repo=f"shard_{i}_of_5.bin", path_or_fileobj=content)
...     preupload_lfs_files(repo_id, additions=[addition])
...     operations.append(addition)

# Create commit
>>> create_commit(repo_id, operations=operations, commit_message="Commit all shards")
```
Member


What happens if there is a failure on one of the operations before the commit is created?

Contributor Author


If there is a failure in one of the uploads, the script will crash (if not wrapped in a try/except). If the script is restarted, already-uploaded files will not need to be re-uploaded (preupload_lfs_files will be instant for them). However, as long as the create_commit operation is not completed, the files are not accessible to anyone and cannot be seen in the repo on the Hub.

Also, there was a question at some point about garbage-collecting the untracked files on S3 after some time (after 24h?) cc @Pierrci. I think it's not enabled yet, but that would mean that if the user waits too long before creating the commit, the preuploaded files are lost.
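The crash-and-resume behavior described above can be sketched in plain Python. Everything here is a hypothetical stand-in, not the huggingface_hub API: `upload_fn` plays the role of the actual LFS upload and `already_uploaded` plays the role of the server-side dedup check.

```python
# Illustrative sketch of resuming after a crash: files the server already
# knows are skipped instantly, transient upload errors are retried, and a
# persistent failure still crashes the script (to be caught by the caller).
# All names are hypothetical, not huggingface_hub APIs.

def preupload_with_retry(files, upload_fn, already_uploaded, max_retries=3):
    for name, content in files:
        if name in already_uploaded:            # instant no-op on a re-run
            continue
        for attempt in range(max_retries):
            try:
                upload_fn(name, content)
                already_uploaded.add(name)
                break
            except ConnectionError:
                if attempt == max_retries - 1:  # out of retries: give up
                    raise
```

Until the final create_commit succeeds, none of these files would be visible on the Hub, regardless of how many loop iterations completed.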

Member

@lhoestq lhoestq Oct 2, 2023


Is there a way to check if a certain file has been preuploaded based on its name, or does it require the hash?

That would help implement fast push_to_hub resuming if it crashed midway.

Contributor Author


Is there a way to check if a certain file has been preuploaded based on its name, or does it require the hash?

It requires the hash, unfortunately. Filenames are just convenient aliases saved in git; what matters on S3 is the uniqueness of the files (i.e. based on hash). For context, until the create_commit step, the filename is not even provided to the Hub.

Member


Ok I see. I guess we can do a commit every N files to allow resuming from there
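The "commit every N files" idea can be sketched like this. It is a hypothetical helper, with `preupload` and `commit` standing in for preupload_lfs_files and create_commit:

```python
# Hypothetical sketch: preupload shards one by one but create a commit
# every `batch_size` files, so a crash loses at most the current batch
# and a restart can resume from the last committed checkpoint.

def upload_in_batches(shards, preupload, commit, batch_size=10):
    pending = []
    for shard in shards:
        preupload(shard)
        pending.append(shard)
        if len(pending) >= batch_size:
            commit(pending)   # checkpoint: these shards are now on the Hub
            pending = []
    if pending:               # commit the final partial batch
        commit(pending)
```

The trade-off is batch_size: larger batches mean fewer commits (less rate-limiting, cleaner history) but more re-uploaded work after a crash.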

Contributor Author


@lhoestq Yes indeed.
But in general, what takes the most time in push_to_hub? Is it generating + hashing the shards, or uploading them? Because if generating + hashing takes 5% of the upload time, for instance, resuming from a failed push_to_hub should still be very fast.

Contributor

@mariosasko mariosasko Oct 2, 2023


Our current "resuming" logic is to generate the shards and compute their "fingerprint" to check whether they are already present in the repo (this mimics hashing), so resuming from a failed push_to_hub should be fine: the generating + hashing step is pretty fast (I don't remember anyone complaining about it being slow).
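The fingerprint-based skip described above could be sketched as follows. `shards_to_upload` and `known_digests` are illustrative names, not the actual datasets implementation, and SHA-256 stands in for whatever fingerprint is used:

```python
import hashlib

# Hypothetical sketch: regenerate each shard, hash its bytes, and only
# plan an upload for shards whose digest the repo does not already know.

def shards_to_upload(shards, known_digests):
    todo = []
    for name, content in shards:
        digest = hashlib.sha256(content).hexdigest()
        if digest not in known_digests:
            todo.append(name)
    return todo
```

Since hashing is cheap relative to uploading, a resumed run spends almost all of its time only on the shards that were never uploaded.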

@Wauplin
Contributor Author

Wauplin commented Oct 2, 2023

Thanks for the review @LysandreJik! I have addressed your comments. We also got approval from the moon-landing side (cc @Pierrci @coyotte508), and @mariosasko created a draft PR to test it in datasets, so I think we are good to merge if everyone's fine? :)

@Wauplin Wauplin changed the title [WIP] Preupload LFS files before committing Preupload LFS files before committing Oct 2, 2023
Member

@Pierrci Pierrci left a comment


Didn't check the details of the code, but as discussed offline, the logic LGTM 👍

Contributor

@mariosasko mariosasko left a comment


Thanks! The code looks good.

Some doc improvements:

docs/source/en/guides/upload.md (4 resolved suggestions)
@LysandreJik
Member

LGTM, feel free to merge @Wauplin!

Co-authored-by: Mario Šaško <mario@huggingface.co>
@Wauplin
Contributor Author

Wauplin commented Oct 4, 2023

Thanks everyone for the reviews and feedback! I'll merge it now and ship it in the upcoming v0.18 release :) 🚀

@Wauplin Wauplin merged commit d4f6ac0 into main Oct 4, 2023
@Wauplin Wauplin deleted the preupload-files-before-commit branch October 4, 2023 09:07