Preupload LFS files before committing #1699
Conversation
The documentation is not available anymore as the PR was closed or merged.
Codecov Report

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1699      +/-   ##
==========================================
+ Coverage   81.52%   82.39%   +0.87%
==========================================
  Files          62       62
  Lines        7198     7226      +28
==========================================
+ Hits         5868     5954      +86
+ Misses       1330     1272      -58
```

☔ View full report in Codecov by Sentry.
As discussed offline, a good server-centric review of this PR would be required from e.g. @coyotte508 and/or @Pierrci
Thanks for pinging them 👍 Apart from the PR review, @lhoestq @mariosasko could you open a branch in `datasets`?
Looks great! I left a few nits/questions.
Thanks!
docs/source/en/guides/upload.md
```py
>>> from huggingface_hub import CommitOperationAdd, preupload_lfs_files, create_commit, create_repo

>>> repo_id = create_repo("test_preupload").repo_id

>>> operations = []  # List of all `CommitOperationAdd` objects that will be generated
>>> for i in range(5):
...     content = ...  # generate binary content
...     addition = CommitOperationAdd(path_in_repo=f"shard_{i}_of_5.bin", path_or_fileobj=content)
...     preupload_lfs_files(repo_id, additions=[addition])
...     operations.append(addition)

# Create commit
>>> create_commit(repo_id, operations=operations, commit_message="Commit all shards")
```
What happens if there is a failure on one of the operations before the commit is created?
If there is a failure in one of the uploads, the script will crash (if not wrapped in a try/except). If the script is restarted, already uploaded files will not need to be re-uploaded (`preupload_lfs_files` will be instant). However, as long as the `create_commit` operation is not completed, the files are not accessible to anyone and cannot be seen in the repo on the Hub.
Also, there was a question at some point about garbage-collecting the untracked files on S3 after some time (after 24h?) cc @Pierrci. I think it's not enabled yet, but it would mean that if the user waits too long before creating the commit, the uploaded files are lost.
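A minimal sketch (not from the PR) of how a caller could retry around this behavior, assuming the same `preupload_lfs_files` signature as in the docs snippet above; `upload_with_retry` and `max_retries` are hypothetical names:

```py
from huggingface_hub import preupload_lfs_files

def upload_with_retry(repo_id, addition, max_retries=3):
    """Pre-upload a single `CommitOperationAdd`, retrying on transient failures.

    A retry after a partial failure is cheap: content that already reached S3
    is not re-uploaded, so a repeated call returns almost instantly.
    """
    for attempt in range(1, max_retries + 1):
        try:
            preupload_lfs_files(repo_id, additions=[addition])
            return
        except Exception as err:  # e.g. a transient network error
            if attempt == max_retries:
                raise
            print(f"Upload failed ({err!r}), retrying ({attempt}/{max_retries})...")
```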
Is there a way to check if a certain file has been preuploaded based on its name? Or does it require the hash? That would help implement a fast `push_to_hub` resuming if it crashed mid-way.
> Is there a way to check if a certain file has been preuploaded based on its name? Or does it require the hash?

It requires the hash unfortunately. Filenames are just convenient aliases saved in git; what matters on S3 is the uniqueness of the files (i.e. based on hash). For context, until the `create_commit` step, the filename is not even provided to the Hub.
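To illustrate the point, a small sketch assuming the standard git-LFS convention where the object ID is the sha256 digest of the raw content:

```py
import hashlib

content = b"some shard bytes"

# The same bytes always produce the same object ID, whatever `path_in_repo` is used,
# so the filename plays no role in deduplication on the storage side.
oid_if_named_shard_0 = hashlib.sha256(content).hexdigest()
oid_if_named_shard_1 = hashlib.sha256(content).hexdigest()
assert oid_if_named_shard_0 == oid_if_named_shard_1
```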
Ok, I see. I guess we can do a commit every N files to allow resuming from there.
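A rough sketch of that idea (hypothetical names: `generate_shards`, the repo id, and the commit frequency `N` are not from the PR), committing every N shards so a crashed job can resume from the last commit:

```py
from huggingface_hub import CommitOperationAdd, create_commit, preupload_lfs_files

def generate_shards(num=100):
    """Placeholder generator yielding binary shard content."""
    for i in range(num):
        yield f"shard {i} content".encode()

repo_id = "username/my-dataset"  # hypothetical repo
N = 50  # commit every N shards (arbitrary)
operations = []

for i, content in enumerate(generate_shards()):
    addition = CommitOperationAdd(path_in_repo=f"shard_{i}.bin", path_or_fileobj=content)
    preupload_lfs_files(repo_id, additions=[addition])
    operations.append(addition)
    if len(operations) == N:
        create_commit(repo_id, operations=operations, commit_message=f"Upload shards up to {i}")
        operations = []

if operations:  # commit the remaining shards
    create_commit(repo_id, operations=operations, commit_message="Upload remaining shards")
```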
@lhoestq Yes indeed.
But in general, what takes the most time in `push_to_hub`? Is it generating + hashing the shards or uploading them? Because if generate+hash takes 5% of the upload time for instance, resuming from a failed `push_to_hub` should still be very fast.
Our current "resuming" logic is to generate the shards and compute their "fingerprint" to check if they are already present in the repo (mimics hashing), so "resuming from a failed `push_to_hub`" should be fine, as the "generating + hashing" step is pretty fast (I don't remember anyone complaining about it being slow).
Thanks for the review @LysandreJik! I have addressed your comments. We also got approval from the moon-landing side (cc @Pierrci @coyotte508), and @mariosasko created a draft PR to test it in `datasets`.
Didn't check the details of the code, but as discussed offline, the logic LGTM 👍
Thanks! The code looks good.
Some doc improvements:
LGTM, feel free to merge @Wauplin!
Co-authored-by: Mario Šaško <mario@huggingface.co>
Thanks everyone for the reviews and feedback! I'll merge it now and ship it in the upcoming v0.18 release :) 🚀
Related to huggingface/datasets#6257 (comment) cc @mariosasko @lhoestq.
The goal of this PR is to add the possibility to preupload files before making the `create_commit` call. This is expected to be a power-user feature, meant for users generating huge files on the fly (e.g. basically `datasets` when uploading sharded Parquet files). The current problem is that each file has to be committed one by one to avoid out-of-memory issues. This strategy leads to 1. users being rate-limited (too many commits) and 2. a messy git history.

The solution proposed in this PR is to add a `preupload_lfs_files` method. It takes a list of `CommitOperationAdd` objects and uploads them to S3 (if LFS). Once all the `CommitOperationAdd` objects are uploaded, a single `create_commit` call is needed.

To make this work, `CommitOperationAdd` is mutated during the `create_commit` process. Some internal attributes are set to keep track of the status (`_upload_mode`, `_is_uploaded`, `_is_committed`). This is the case whether or not the user calls `preupload_lfs_files`. This is a breaking change, but I don't expect it to break any current workflow. It is only a problem if a user defines a single `CommitOperationAdd` object to commit to 2 different repos.

Finally, `preupload_lfs_files` also mutates the `CommitOperationAdd` objects to remove the binary content from them. This is expected, as we want to free up the memory (this is the purpose of the whole PR 😄).

Documentation:

Here is what it would look like:
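(The snippet below is a sketch mirroring the docs example reviewed above; the exact example in the original PR description may differ slightly.)

```py
>>> from huggingface_hub import CommitOperationAdd, create_commit, create_repo, preupload_lfs_files

>>> repo_id = create_repo("test_preupload").repo_id

>>> operations = []  # list of `CommitOperationAdd` objects committed at the end
>>> for i in range(5):
...     content = ...  # generate binary content on the fly
...     addition = CommitOperationAdd(path_in_repo=f"shard_{i}_of_5.bin", path_or_fileobj=content)
...     preupload_lfs_files(repo_id, additions=[addition])  # upload to S3 and free the memory
...     operations.append(addition)

>>> create_commit(repo_id, operations=operations, commit_message="Commit all shards")
```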
TODO:
- `datasets` (Reduce the number of commits in `push_to_hub`, datasets#6269)