Prevent empty commits if files did not change #2389
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
sha.update(b"blob ")
sha.update(str(len(data)).encode())
sha.update(b"\0")
sha.update(data)
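For reference, the lines above follow git's blob hashing scheme: SHA-1 over the header `blob <size>\0` followed by the raw content. A small standalone sketch (illustration only, not the exact huggingface_hub code):

```python
import hashlib


def git_hash_object(data: bytes) -> str:
    # git blob oid: SHA-1 over "blob <size>\0" followed by the raw bytes
    sha = hashlib.sha1()
    sha.update(b"blob ")
    sha.update(str(len(data)).encode())
    sha.update(b"\0")
    sha.update(data)
    return sha.hexdigest()


# Matches `git hash-object` on a file containing "hello\n"
print(git_hash_object(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a
```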
👍
Co-authored-by: Julien Chaumond <julien@huggingface.co>
Nice nice nice!!
Looks good to me! My only comment is about the time it adds to the eventual file upload.
@@ -246,6 +251,29 @@ def b64content(self) -> bytes:
        with self.as_file() as file:
            return base64.b64encode(file.read())

    @property
    def _local_oid(self) -> Optional[str]:
out of curiosity, how long does this take to compute on larger files?
Asking as large file uploads (without going through your fantastic large-file-upload method) are already taking quite a bit of time before uploading anything
For LFS files (so any file larger than 10MB for text files or 1MB for binary files) it won't add any time, since we already compute and have the sha256.
For non-LFS files I expect this to be quite fast, unless there are many, many small files.
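On the cost question for larger files: the hash can be computed in a streaming fashion, so even a big file never needs to be read into memory at once. A hedged sketch of chunked blob hashing (illustration only, not the huggingface_hub implementation; `git_hash_object_file` is an invented name):

```python
import hashlib
import os


def git_hash_object_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    # Stream the file in 1MB chunks so large files never sit fully in memory;
    # the git blob header needs the size up front, taken from the filesystem.
    sha = hashlib.sha1()
    sha.update(b"blob ")
    sha.update(str(os.path.getsize(path)).encode())
    sha.update(b"\0")
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            sha.update(chunk)
    return sha.hexdigest()
```

In practice this is I/O-bound, so the added time is roughly one extra sequential read of the file.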
As @julien-c said, it should be negligible on regular files since they are small. In any case, all regular files are loaded in memory in the `/commit` payload and cannot be above 1GB, so having 1000s of files is not a use case to optimize for.

Thanks everyone for having a look at it!
Following server-side update in https://github.com/huggingface-internal/moon-landing/pull/10547 (internal), we can now retrieve the `oid` of the existing files in the repo. This allows us to determine which files did not change since the last commit and therefore avoid empty commits.

This PR:
- retrieves the `oid` of the remote files in the `/preupload` call. If the file doesn't exist, nothing is returned.
- if the local `oid` matches the remote one, removes the `CommitOperationAdd` from the commit + logs something (INFO level)
- if nothing is left to commit, since returning `None` might break downstream libraries, we return the `CommitInfo` of the last commit for that revision

Notes:
- returning the `sha256` in the `/preupload` call is not needed in the end, so I removed it. It's needed only for the LFS uploads.
- we could add a `force_commit: bool` parameter, but honestly let's stay without it as long as it's not requested by users.

This PR closes #1411 (cc @HansBug @narugo1992).
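The skip logic described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual huggingface_hub code: `Operation` and `filter_unchanged` are invented names standing in for `CommitOperationAdd` and the real filtering step.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical sketch of the "drop unchanged files" step: an operation is
# skipped when its local oid matches the oid already stored on the remote.


@dataclass
class Operation:
    path: str
    local_oid: str
    remote_oid: Optional[str]  # None if the file doesn't exist on the remote yet


def filter_unchanged(operations: List[Operation]) -> List[Operation]:
    kept = []
    for op in operations:
        if op.remote_oid is not None and op.remote_oid == op.local_oid:
            # Content identical to the last commit: skip it and log at INFO level.
            print(f"INFO: skipping upload of unchanged file {op.path}")
            continue
        kept.append(op)
    return kept
```

If `filter_unchanged` returns an empty list, the caller would fall back to returning the last commit's info instead of creating an empty commit.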
Technically the `oid` is not exactly the `git-hash-object` (the "git-sha1") defined by git. For LFS files, the git hash object is supposed to be the git-sha1 of the pointer file, not the actual file. But in our use case and for simplicity, the server returns:
- the `git-hash-object` (git-sha1) of the file for regular files
- the `sha256` of the file for LFS files

This is not 100% compliant but it's convenient since sha256 is already computed client-side.
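Under that convention, a client can predict the oid the server will report for a file. A minimal sketch, assuming the regular/LFS split described above (`expected_remote_oid` is a hypothetical helper, not part of huggingface_hub):

```python
import hashlib


def expected_remote_oid(data: bytes, is_lfs: bool) -> str:
    # Mirrors the server convention described above:
    # regular files -> git blob sha1, LFS files -> plain sha256 of the content.
    if is_lfs:
        return hashlib.sha256(data).hexdigest()
    header = f"blob {len(data)}\0".encode()
    return hashlib.sha1(header + data).hexdigest()
```

Comparing this value against the oid returned by `/preupload` is what lets the client decide whether an upload can be skipped entirely.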