
Prevent empty commits if files did not change #2389

Merged · 7 commits into main from 1411-prevent-empty-commmits · Jul 16, 2024

Conversation

@Wauplin (Contributor) commented Jul 12, 2024

Following a server-side update in https://github.com/huggingface-internal/moon-landing/pull/10547 (internal), we can now retrieve the oid of the existing files in the repo. This lets us determine which files have not changed since the last commit and therefore avoid creating empty commits.

This PR:

  • fetches the oid of the remote files in the /preupload call. If a file doesn't exist remotely, nothing is returned for it.
  • compares the remote and local oids
  • if they match => remove the CommitOperationAdd from the commit and log it (INFO level)
  • if no operations are left in the commit => return early and log a warning (WARNING level). Since returning None might break downstream libraries, we return the CommitInfo of the last commit on that revision.
  • if some operations are left => proceed normally
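The filtering step above can be sketched roughly as follows. This is an illustrative sketch, not the actual huggingface_hub internals; the `Op` class and `filter_unchanged` helper are hypothetical names standing in for `CommitOperationAdd` and the logic added by this PR:

```python
from dataclasses import dataclass

@dataclass
class Op:
    """Stand-in for a CommitOperationAdd: a path and its locally computed oid."""
    path: str
    local_oid: str

def filter_unchanged(operations, remote_oids):
    """Drop operations whose local oid matches the oid already on the remote."""
    kept = []
    for op in operations:
        remote = remote_oids.get(op.path)  # None if the file is new on the remote
        if remote is not None and remote == op.local_oid:
            continue  # unchanged since last commit: skip (logged at INFO level in the PR)
        kept.append(op)
    return kept
```

If `filter_unchanged` returns an empty list, the PR returns early with the last commit's `CommitInfo` instead of creating an empty commit.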

Notes:

  • sending the sha256 in the /preupload call turned out to be unnecessary, so I removed it. It is only needed for LFS uploads.
  • we considered adding a force_commit: bool parameter, but let's leave it out as long as users don't request it.

This PR closes #1411 (cc @HansBug @narugo1992).


Technically, the oid is not exactly the git-hash-object (the "git-sha1") defined by git. For LFS files, the git hash object is supposed to be the git-sha1 of the pointer file, not of the actual file. But for our use case, and for simplicity, the server returns:

  • the git-hash-object (git-sha1) of the file for regular files
  • the sha256 of the file for LFS files

This is not 100% compliant but it's convenient since sha256 is already computed client-side.
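The two oid flavors can be sketched like this (a minimal illustration using only hashlib; the function names are mine, not the library's):

```python
import hashlib

def git_blob_sha1(data: bytes) -> str:
    """git-hash-object oid for a regular file: sha1 over 'blob <len>\\0<content>'."""
    h = hashlib.sha1()
    h.update(b"blob %d\0" % len(data))
    h.update(data)
    return h.hexdigest()

def lfs_oid(data: bytes) -> str:
    """For LFS files, the server returns the plain sha256 of the content."""
    return hashlib.sha256(data).hexdigest()
```

For example, `git_blob_sha1(b"hello\n")` matches what `git hash-object` prints for a file containing `hello\n`.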

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment on lines +60 to +63:

```python
sha.update(b"blob ")
sha.update(str(len(data)).encode())
sha.update(b"\0")
sha.update(data)
```

👍

Co-authored-by: Julien Chaumond <julien@huggingface.co>
@Pierrci (Member) left a comment:

Nice nice nice!!

@LysandreJik (Member) left a comment:

Looks good to me! My only comment is about the time this adds to the eventual file upload.

```python
@@ -246,6 +251,29 @@ def b64content(self) -> bytes:
        with self.as_file() as file:
            return base64.b64encode(file.read())

    @property
    def _local_oid(self) -> Optional[str]:
```

out of curiosity, how long does this take to compute on larger files?


Asking as large file uploads (without going through your fantastic large-file-upload method) are already taking quite a bit of time before uploading anything

@julien-c (Member) replied:

For LFS files (so any file larger than 10MB for text files, or 1MB for binary files), it won't add any time, since we already compute and have the sha256.
For non-LFS files, I expect this to be quite fast unless there are very many small files.
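The split between the two cases could be sketched as below. Note this is an illustrative approximation: the thresholds are the ones quoted in this thread, while the real decision comes from the repo's `.gitattributes` / server configuration:

```python
def uses_lfs(size_bytes: int, is_binary: bool) -> bool:
    """Approximate LFS routing per this thread: >10MB for text, >1MB for binary."""
    limit = 1_000_000 if is_binary else 10_000_000
    return size_bytes > limit
```

Files routed to LFS already have a sha256 computed for the upload itself, so the oid comparison is free for them.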

@Wauplin (Contributor, Author) replied on Jul 16, 2024:

As @julien-c said, the cost should be negligible for regular files since they are small. In any case, all regular files are loaded in memory in the /commit payload, which cannot exceed 1GB, so having thousands of files is not a use case to optimize for.

@Wauplin (Contributor, Author) commented Jul 16, 2024

Thanks everyone for having a look at it!

@Wauplin Wauplin merged commit e370fa6 into main Jul 16, 2024
17 checks passed
@Wauplin Wauplin deleted the 1411-prevent-empty-commmits branch July 16, 2024 15:18
Pierrci added a commit to huggingface/huggingface.js that referenced this pull request Jul 17, 2024
Successfully merging this pull request may close these issues: Do not create empty commit.

5 participants