Preupload LFS files before committing #1699

Merged · 9 commits · Oct 4, 2023
Changes from 5 commits
44 changes: 44 additions & 0 deletions docs/source/en/guides/upload.md
@@ -431,6 +431,50 @@ In addition to [`upload_file`] and [`upload_folder`], the following functions al

For more detailed information, take a look at the [`HfApi`] reference.

### Preupload LFS files before commit

In some cases, you might want to upload huge files to S3 **before** making the commit call. For example, if you are
committing a dataset in several shards that are generated in-memory, you would need to upload the shards one by one
to avoid an out-of-memory issue. A solution is to upload each shard as a separate commit on the repo. While perfectly
valid, this solution has the drawback of potentially cluttering the git history by generating tens of commits.
To overcome this issue, you can upload your files one by one to S3 and then create a single commit at the end. This
is possible using [`preupload_lfs_files`] in combination with [`create_commit`].

<Tip warning={true}>

This is a power-user method. Directly using [`upload_file`], [`upload_folder`] or [`create_commit`] instead of handling
the low-level logic of pre-uploading files is the way to go in the vast majority of cases. The main caveat of
[`preupload_lfs_files`] is that until the commit is actually made, the uploaded files are not accessible on the repo on
the Hub. If you have a question, feel free to ping us on our Discord or in a GitHub issue.

</Tip>

Here is a simple example illustrating how to pre-upload files:

```py
>>> from huggingface_hub import CommitOperationAdd, preupload_lfs_files, create_commit, create_repo

>>> repo_id = create_repo("test_preupload").repo_id

>>> operations = [] # List of all `CommitOperationAdd` objects that will be generated
>>> for i in range(5):
... content = ... # generate binary content
... addition = CommitOperationAdd(path_in_repo=f"shard_{i}_of_5.bin", path_or_fileobj=content)
... preupload_lfs_files(repo_id, additions=[addition])
... operations.append(addition)

# Create commit
>>> create_commit(repo_id, operations=operations, commit_message="Commit all shards")
```
Member:

What happens if there is a failure on one of the operations before the commit is created?

Contributor (author):

If there is a failure in one of the uploads, the script will crash (if not in a try/except). If the script is restarted, already uploaded files will not need to be re-uploaded (`preupload_lfs_files` will be instant). However, as long as the `create_commit` operation is not completed, the files are not accessible to anyone and cannot be seen on the repo on the Hub.

Also, there was a question at some point about garbage-collecting the untracked files in S3 after some time (after 24h?) cc @Pierrci. I think it's not enabled yet, but that would mean that if the user waits too long before creating the commit, the uploaded data is lost.
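
A minimal sketch of this restart-to-resume behavior (the helper below is hypothetical and not part of the PR; it assumes shard generation is deterministic and cheap to redo):

```py
from huggingface_hub import CommitOperationAdd, create_commit, preupload_lfs_files

def push_shards(repo_id: str, shards: list) -> None:
    # Pre-upload each shard, then create a single commit at the end.
    operations = []
    for i, content in enumerate(shards):
        addition = CommitOperationAdd(path_in_repo=f"shard_{i}.bin", path_or_fileobj=content)
        preupload_lfs_files(repo_id, additions=[addition])  # near-instant if the content already reached S3
        operations.append(addition)
    create_commit(repo_id, operations=operations, commit_message="Commit all shards")

# If `push_shards(repo_id, shards)` crashes mid-way, calling it again re-creates the
# `CommitOperationAdd` objects but skips the actual transfer for shards already uploaded.
```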

@lhoestq (Member), Oct 2, 2023:

Is there a way to check if a certain file has been preuploaded based on its name? Or does it require the hash?

That would help implement fast `push_to_hub` resuming if it crashed mid-way.

Contributor (author):

> Is there a way to check if a certain file has been preuploaded based on its name? Or does it require the hash?

It requires the hash, unfortunately. Filenames are just convenient aliases saved in git, but what matters on S3 is the uniqueness of the files (i.e. based on hash). For context, until the `create_commit` step, the filename is not even provided to the Hub.

Member:

Ok, I see. I guess we can do a commit every N files to allow resuming from there.
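
A rough sketch of that idea (hypothetical helper, not part of the PR), flushing a commit every N shards so a crash loses at most the uncommitted tail:

```py
from huggingface_hub import CommitOperationAdd, create_commit, preupload_lfs_files

def generate_shards(num_shards: int = 5):
    # Placeholder generator: a real pipeline would yield large in-memory shards.
    for i in range(num_shards):
        yield f"dummy shard {i}".encode()

def push_in_batches(repo_id: str, commit_every: int = 2) -> None:
    operations = []
    for i, content in enumerate(generate_shards()):
        addition = CommitOperationAdd(path_in_repo=f"shard_{i}.bin", path_or_fileobj=content)
        preupload_lfs_files(repo_id, additions=[addition])
        operations.append(addition)
        if len(operations) == commit_every:
            create_commit(repo_id, operations=operations, commit_message=f"Commit shards up to {i}")
            operations = []
    if operations:  # commit whatever is left
        create_commit(repo_id, operations=operations, commit_message="Commit remaining shards")
```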

Contributor (author):

@lhoestq Yes indeed.
But in general, what takes the most time in `push_to_hub`? Is it generating + hashing the shards or uploading them? Because if generate+hash takes 5% of the upload time for instance, resuming from a failed `push_to_hub` should still be very fast.

@mariosasko (Contributor), Oct 2, 2023:

Our current "resuming" logic is to generate the shards and compute their "fingerprint" to check if they are already present in the repo (this mimics hashing), so "resuming from a failed `push_to_hub`" should be fine, as the "generating + hashing" step is pretty fast (I don't remember anyone complaining about it being slow).


First, we create the [`CommitOperationAdd`] objects one by one. In a real-world example, those would contain the
generated shards. Each file is uploaded before generating the next one. During the [`preupload_lfs_files`] step, **the
`CommitOperationAdd` object is mutated**. You should only use it to pass it directly to [`create_commit`]. The main
update of the object is that **the binary content is removed** from it, meaning that it will be garbage-collected if
you don't store another reference to it. This is expected as we don't want to keep in memory the content that is
already uploaded. Finally, we create the commit by passing all the operations to [`create_commit`]. You can pass
additional operations (add, delete or copy) that have not been processed yet and they will be handled correctly.
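
As a sketch (the repo id and file names below are hypothetical), mixing a pre-uploaded addition with an operation that has not been pre-processed, here a deletion, looks like this:

```py
>>> from huggingface_hub import CommitOperationAdd, CommitOperationDelete, create_commit, preupload_lfs_files

>>> repo_id = "username/test_preupload"  # hypothetical repo

# Pre-upload the new shard: its binary content is uploaded to S3 and removed from the object
>>> new_shard = CommitOperationAdd(path_in_repo="shard_0_of_1.bin", path_or_fileobj=b"binary content")
>>> preupload_lfs_files(repo_id, additions=[new_shard])

# Mix it with a deletion that `create_commit` will process itself
>>> operations = [new_shard, CommitOperationDelete(path_in_repo="old_shard.bin")]
>>> create_commit(repo_id, operations=operations, commit_message="Replace old shard")
```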

## Tips and tricks for large uploads

There are some limitations to be aware of when dealing with a large amount of data in your repo. Given the time it takes to stream the data,
2 changes: 2 additions & 0 deletions src/huggingface_hub/__init__.py
@@ -194,6 +194,7 @@
"model_info",
"move_repo",
"pause_space",
"preupload_lfs_files",
"rename_discussion",
"repo_exists",
"repo_info",
@@ -512,6 +513,7 @@ def __dir__():
model_info, # noqa: F401
move_repo, # noqa: F401
pause_space, # noqa: F401
preupload_lfs_files, # noqa: F401
rename_discussion, # noqa: F401
repo_exists, # noqa: F401
repo_info, # noqa: F401
38 changes: 24 additions & 14 deletions src/huggingface_hub/_commit_api.py
@@ -136,10 +136,23 @@ class CommitOperationAdd:
path_or_fileobj: Union[str, Path, bytes, BinaryIO]
upload_info: UploadInfo = field(init=False, repr=False)

# Internal attributes
_upload_mode: Optional[UploadMode] = None # set to "lfs" or "regular" once known
_is_uploaded: bool = False # set to True once the file has been uploaded as LFS
_is_committed: bool = False # set to True once the file has been committed

def __post_init__(self) -> None:
"""Validates `path_or_fileobj` and compute `upload_info`."""
self.path_in_repo = _validate_path_in_repo(self.path_in_repo)

# Validate `_is_uploaded` and `_upload_mode` cannot be set by user
if self._is_uploaded is not False:
raise ValueError("Attribute `_is_uploaded` cannot be set manually.")
if self._upload_mode is not None:
raise ValueError("Attribute `_upload_mode` cannot be set manually.")
if self._is_committed is not False:
raise ValueError("Attribute `_is_committed` cannot be set manually.")

# Validate `path_or_fileobj` value
if isinstance(self.path_or_fileobj, Path):
self.path_or_fileobj = str(self.path_or_fileobj)
@@ -427,10 +440,10 @@ def fetch_upload_modes(
revision: str,
endpoint: Optional[str] = None,
create_pr: bool = False,
) -> Dict[str, UploadMode]:
) -> None:
"""
Requests the Hub "preupload" endpoint to determine whether each input file
should be uploaded as a regular git blob or as git LFS blob.
Requests the Hub "preupload" endpoint to determine whether each input file should be uploaded as a regular git blob
or as git LFS blob. Input `additions` are mutated in-place with the upload mode.

Args:
additions (`Iterable` of :class:`CommitOperationAdd`):
@@ -446,9 +459,6 @@
revision (`str`):
The git revision to upload the files to. Can be any valid git revision.

Returns: `Dict[str, UploadMode]`
Key is the file path, value is the upload mode ("regular" or "lfs").

Raises:
[`~utils.HfHubHTTPError`]
If the Hub API returned an error.
@@ -483,14 +493,15 @@
preupload_info = _validate_preupload_info(resp.json())
upload_modes.update(**{file["path"]: file["uploadMode"] for file in preupload_info["files"]})

# Set upload mode for each addition operation
for addition in additions:
addition._upload_mode = upload_modes[addition.path_in_repo]

# Empty files cannot be uploaded as LFS (S3 would fail with a 501 Not Implemented)
# => empty files are uploaded as "regular" to still allow users to commit them.
for addition in additions:
if addition.upload_info.size == 0:
path = addition.path_in_repo
upload_modes[path] = "regular"

return upload_modes
addition._upload_mode = "regular"


@validate_hf_hub_args
@@ -557,7 +568,6 @@ def fetch_lfs_files_to_copy(

def prepare_commit_payload(
operations: Iterable[CommitOperation],
upload_modes: Dict[str, UploadMode],
files_to_copy: Dict[Tuple[str, Optional[str]], "RepoFile"],
commit_message: str,
commit_description: Optional[str] = None,
@@ -584,7 +594,7 @@
# 2. Send operations, one per line
for operation in operations:
# 2.a. Case adding a regular file
if isinstance(operation, CommitOperationAdd) and upload_modes.get(operation.path_in_repo) == "regular":
if isinstance(operation, CommitOperationAdd) and operation._upload_mode == "regular":
yield {
"key": "file",
"value": {
Expand All @@ -594,7 +604,7 @@ def prepare_commit_payload(
},
}
# 2.b. Case adding an LFS file
elif isinstance(operation, CommitOperationAdd) and upload_modes.get(operation.path_in_repo) == "lfs":
elif isinstance(operation, CommitOperationAdd) and operation._upload_mode == "lfs":
yield {
"key": "lfsFile",
"value": {
@@ -627,5 +637,5 @@ def prepare_commit_payload(
else:
raise ValueError(
f"Unknown operation to commit. Operation: {operation}. Upload mode:"
f" {upload_modes.get(operation.path_in_repo)}"
f" {getattr(operation, '_upload_mode', None)}"
)