Do not create empty commit #1411
Hi @HansBug, thanks for raising the question. At the moment the huggingface_hub library is not aware that the created commit is empty. One way of detecting this would be to compare the sha256 of each local file (which is already computed) with the sha256 of the remote files. The closest you can get is the location, commit hash, etag, and size of a file by using `get_hf_file_metadata`. But the sha256 is unfortunately not returned 😕 (FYI, if an LFS file is committed twice, the data is not re-uploaded; the problem is more for "regular" files).

@coyotte508 @Pierrci Do you think it would be possible in the `/preupload` endpoint (internal link) to also check whether the exact same file already exists in the repo (in addition to returning the regular/lfs status)? If not, I don't think we can prevent empty commits that easily. Refusing empty commits on the server side could also be an option.

@HansBug, something you can do for your use case is, each time you do a sync, create a PR, push all the commits to this PR, and then merge it. If there is no diff, you'll get an error and can just close the PR. If there is a diff, all the small commits will be squashed and you'll get a cleaner git history on the main branch. |
You can use
I think it makes sense to prevent empty commits. Maybe it would need something on the gitaly side cc @XciD @co42 , or a git hook? |
(not implemented in HF) |
We do give the |
I've done some investigation and it seems it is possible to recompute the git hash of a file. It uses SHA-1 over a `blob {size}\0` header followed by the file content. Here is a script that computes the git hash for two files and compares it with the value returned by the `paths-info` endpoint:

```python
import os
from hashlib import sha1

import requests
from huggingface_hub import hf_hub_download

# LFS pointer file content for gpt2's pytorch_model.bin
pytorch_model_raw = b"""version https://git-lfs.github.com/spec/v1
oid sha256:7c5d3f4b8b76583b422fcb9189ad6c89d5d97a094541ce8932dce3ecabde1421
size 548118077
"""


def git_hash(content: bytes) -> str:
    # Inspired by https://stackoverflow.com/a/7225329
    sha = sha1()
    file_size = len(content)
    sha.update(f"blob {file_size}\0".encode("utf-8"))
    sha.update(content)
    return sha.digest().hex()


def get_remote_oid(repo_id: str, filename: str) -> str:
    response = requests.post(
        f"https://huggingface.co/api/models/{repo_id}/paths-info/main",
        json={"paths": [filename]},
    )
    return response.json()[0]["oid"]


print("gpt2", "config.json")
with open(hf_hub_download("gpt2", "config.json"), "rb") as f:
    content = f.read()
print("From Hub:", get_remote_oid("gpt2", "config.json"))
print("Computed:", git_hash(content))

print("gpt2", "pytorch_model.bin")
print("From Hub:", get_remote_oid("gpt2", "pytorch_model.bin"))
print("Computed:", git_hash(pytorch_model_raw))
```

output:
|
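The git blob hashing scheme above can be sanity-checked offline, without any Hub call. A minimal sketch (the `git_hash` helper is repeated here so the snippet is self-contained), verified against the well-known object id of git's empty blob:

```python
from hashlib import sha1


def git_hash(content: bytes) -> str:
    # git blob object id: SHA-1 over "blob {size}\0" followed by the content
    sha = sha1()
    sha.update(f"blob {len(content)}\0".encode("utf-8"))
    sha.update(content)
    return sha.hexdigest()


# git's empty blob has a well-known object id
print(git_hash(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

This matches what `git hash-object` prints for an empty file.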
@Wauplin Is there any documentation for this API? |
@HansBug No, there isn't yet unfortunately. I'll add it officially to the documentation. What I can tell you:
|
There are valid use cases for empty commits though (I use them from time to time; I even have a bash alias for it). |
I have tested this code. Perhaps a more effective approach to check whether a large local file matches the LFS file in the repository is to obtain the LFS object ID (OID) and size, and then compare them with the local file. Below is an example:

```python
import os
from hashlib import sha256
from typing import Tuple

import requests
from huggingface_hub import hf_hub_download

filename = hf_hub_download(
    'HansBug/browser_drivers_mirror',
    'google/111.0.5563.64/chromedriver_linux64.zip'
)


def get_sha256(file, chunk=1 << 20):
    sha = sha256()
    with open(file, 'rb') as f:
        while True:
            data = f.read(chunk)
            if not data:
                break
            sha.update(data)
    return sha.hexdigest()


def get_remote_oid_lfs(repo_id: str, filename: str) -> Tuple[str, int]:
    response = requests.post(
        f"https://huggingface.co/api/models/{repo_id}/paths-info/main",
        json={"paths": [filename]},
    )
    lfs_data = response.json()[0]['lfs']
    return lfs_data['oid'], lfs_data['size']


if __name__ == '__main__':
    print((get_sha256(filename), os.path.getsize(filename)))
    print(get_remote_oid_lfs(
        'HansBug/browser_drivers_mirror',
        'google/111.0.5563.64/chromedriver_linux64.zip'
    ))
```

The output is
|
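As a side note, the chunked `get_sha256` helper above should produce exactly the same digest as hashing the whole file at once; only memory usage differs. A small offline sketch demonstrating this, using an in-memory payload instead of a real download:

```python
import io
from hashlib import sha256


def get_sha256_stream(f, chunk=1 << 20):
    # stream the file object through sha256 in fixed-size chunks
    sha = sha256()
    while True:
        data = f.read(chunk)
        if not data:
            break
        sha.update(data)
    return sha.hexdigest()


payload = bytes(range(256)) * 10_000  # ~2.5 MB, spans several chunks
streamed = get_sha256_stream(io.BytesIO(payload), chunk=1 << 16)
oneshot = sha256(payload).hexdigest()
print(streamed == oneshot)  # True
```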
Therefore, it may be beneficial for the default behaviour to be to skip creating an empty commit when the files are unchanged. However, if a user wants to create a commit regardless of whether it would be empty, they could specify a dedicated argument in the function. 😃 |
Here is my current version in use:

```python
import os
from hashlib import sha1, sha256

import requests
from huggingface_hub import hf_hub_download


def hf_resource_check(local_filename, repo_id: str, file_in_repo: str, repo_type='model', revision='main',
                      chunk_for_hash: int = 1 << 20):
    response = requests.post(
        f"https://huggingface.co/api/{repo_type}s/{repo_id}/paths-info/{revision}",
        json={"paths": [file_in_repo]},
    )
    metadata = response.json()[0]
    if 'lfs' in metadata:
        is_lfs, oid, filesize = True, metadata['lfs']['oid'], metadata['lfs']['size']
    else:
        is_lfs, oid, filesize = False, metadata['oid'], metadata['size']

    if filesize != os.path.getsize(local_filename):
        return False

    if is_lfs:
        # LFS files store the plain sha256 of the content
        sha = sha256()
    else:
        # regular files store the git blob sha1, which hashes a header first
        sha = sha1()
        sha.update(f'blob {filesize}\0'.encode('utf-8'))

    with open(local_filename, 'rb') as f:
        # read in chunks so big files will not cause OOM
        while True:
            data = f.read(chunk_for_hash)
            if not data:
                break
            sha.update(data)

    return sha.hexdigest() == oid


if __name__ == '__main__':
    local_lfs_file = hf_hub_download(
        'HansBug/browser_drivers_mirror',
        'google/111.0.5563.64/chromedriver_linux64.zip'
    )
    local_file = hf_hub_download('HansBug/browser_drivers_mirror', 'README.md')

    # chromedriver_linux64.zip vs chromedriver_linux64.zip
    print(hf_resource_check(
        local_lfs_file,
        'HansBug/browser_drivers_mirror',
        'google/111.0.5563.64/chromedriver_linux64.zip'
    ))
    # README.md vs README.md
    print(hf_resource_check(local_file, 'HansBug/browser_drivers_mirror', 'README.md'))
    # chromedriver_linux64.zip vs README.md
    print(hf_resource_check(local_lfs_file, 'HansBug/browser_drivers_mirror', 'README.md'))
    # README.md vs chromedriver_linux64.zip
    print(hf_resource_check(
        local_file,
        'HansBug/browser_drivers_mirror',
        'google/111.0.5563.64/chromedriver_linux64.zip'
    ))
```

The output is
|
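The branch logic in `hf_resource_check` (plain SHA-256 for LFS files, git blob SHA-1 for regular files) can be exercised offline against a temporary file. A minimal sketch; the "remote" metadata is simulated locally here rather than fetched from the `paths-info` endpoint:

```python
import os
import tempfile
from hashlib import sha1, sha256

data = b"example payload\n" * 100

# write a throwaway local file
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(data)
    path = f.name

filesize = os.path.getsize(path)

# regular-file branch: git blob sha1 hashes a "blob {size}\0" header first
git_sha = sha1()
git_sha.update(f"blob {filesize}\0".encode("utf-8"))
with open(path, "rb") as f:
    git_sha.update(f.read())
git_oid = git_sha.hexdigest()

# LFS branch: plain sha256 of the raw content, no header
with open(path, "rb") as f:
    lfs_oid = sha256(f.read()).hexdigest()

os.unlink(path)
print(git_oid != lfs_oid)  # True: the two schemes yield different digests
```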
Hey @HansBug, thanks for your snippet! I think we could use it indeed. I'll first implement the check in `huggingface_hub` (also, preventing empty commits server-side might be shipped before the `huggingface_hub` update :)). |
I identified an issue with the mentioned code, related to the file size returned by `os.path.getsize`. For instance, on both Linux and macOS platforms, the size of the example file matches the value reported by the Hub. However, on Windows, the file size returned by `os.path.getsize` is different. This discrepancy in the file size may cause the check to return `False` even when the contents match.

PS: This issue seems to occur only on non-LFS text files that contain non-ASCII content on Windows.

PS: I just tried:
All the above results are the same on Windows (github action log). Therefore, I believe this can now be recognized as a confirmed cross-platform compatibility issue specific to Windows. |
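The discrepancy can be reproduced on any platform by hashing the same text with LF and CRLF line endings. A small offline sketch (the `git_hash` helper mirrors the one earlier in the thread):

```python
from hashlib import sha1


def git_hash(content: bytes) -> str:
    # git blob object id: SHA-1 over "blob {size}\0" followed by the content
    sha = sha1()
    sha.update(f"blob {len(content)}\0".encode("utf-8"))
    sha.update(content)
    return sha.hexdigest()


lf_text = b"first line\nsecond line\n"
crlf_text = lf_text.replace(b"\n", b"\r\n")

# each newline gains one byte, so both the size and the hash change
print(len(crlf_text) - len(lf_text))             # 2
print(git_hash(lf_text) == git_hash(crlf_text))  # False
```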
Wow, thanks @narugo1992 for spotting this compatibility issue and for letting us know. I still haven't investigated further how we would like to integrate consistency checks in `huggingface_hub`. Overall it's a good reminder that we should be extra careful when implementing any logic strongly tied to git internals. |
@narugo1992 I found that:
So, I guess that, when the project is cloned on a Windows runner, the file is treated as a plain text file. As a result, line separators are converted to `\r\n`. However, when uploading a text file to Hugging Face, its `\r\n` line breaks are not preserved and are treated as `\n`. This means that this issue will only occur when a non-LFS text file is uploaded and uses `\r\n` line separators locally. |
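If one did want to guard a client-side check against this, one hypothetical mitigation (not something `huggingface_hub` does; just an assumption for illustration) is to normalize CRLF to LF before hashing non-LFS text files:

```python
def normalize_newlines(content: bytes) -> bytes:
    # assumption for illustration: treat CRLF as LF before hashing,
    # matching what the Hub appears to store for non-LFS text files
    return content.replace(b"\r\n", b"\n")


print(normalize_newlines(b"a\r\nb\r\n") == b"a\nb\n")  # True
```

Note this is lossy for files that genuinely contain CRLF bytes, which is one reason such a replacement is risky.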
@Wauplin Can you provide additional information about this new API? I received a temporary 405 error when trying to access https://huggingface.co/api/datasets/deepghs/few_shots/paths-info/main . As this error does not seem to be reproducible, would it be possible to provide more information on when it may occur? |
@narugo1992 |
Thank you very much for this information. It is highly appreciated and valuable to us. I recently implemented a library called narugo1992/hfmirror based on the interface mentioned in this issue, which can integrate and synchronize resources from the network (such as GitHub releases, danbooru, or any other custom resources) to the Hugging Face repository according to custom rules. It can be configured and mounted on platforms like GitHub Actions to achieve automatic synchronization similar to a mirror site, allowing datasets to be updated in the long term. It includes a batch upload and delete method (link) for repository files that does not create an empty commit when there are no substantive changes. However, there is still a problem with the Windows platform, as described in this issue: #1411 (comment) . Currently, hfmirror has been released as version |
Wow, that's an impressive library. Thanks for letting me know! Would you mind if we integrate some parts of it into the official
|
@Wauplin No problem, I'm happy to contribute my code to the huggingface repository. Regarding the

Regarding the use of Hugging Face, my understanding is:
I think:
|
Hey @narugo1992, I'm getting back to you and will try to be exhaustive regarding your message above. Some problems can be mitigated with the existing tools in `huggingface_hub`.

1.
Agree with you that we should be extra careful here. Replacing `\r\n` seems too risky. I can't promise yet how we will tackle this, unfortunately.

2.
Good news, this is now possible in the newest version of `huggingface_hub`!

3.
In this case what I would advise you to do is to configure where you want the cache on your machine. You can set the

4.

No official direct method indeed, but there is a custom error you can catch:

```python
try:
    # basically a HEAD call, handling redirection and raising custom errors
    # see https://huggingface.co/docs/huggingface_hub/v0.14.1/en/package_reference/file_download#huggingface_hub.get_hf_file_metadata
    _ = get_hf_file_metadata(hf_hub_url(...))
    # file exists
except huggingface_hub.utils.EntryNotFoundError:
    # file does not exist
    ...
```

If you use `hf_hub_download`, the same pattern applies:

```python
try:
    hf_hub_download(repo_id, filename, local_dir="path/to/local")
    # file exists
except huggingface_hub.utils.EntryNotFoundError:
    # file does not exist
    ...
```

5.
FYI, you can also use

6.
Yes, I saw that and it makes perfect sense IMO!

```python
import requests
from huggingface_hub import configure_http_backend, get_session


# Create a factory function that returns a Session with configured proxies
def backend_factory() -> requests.Session:
    session = requests.Session()
    (...)  # code from https://github.com/narugo1992/hfmirror/blob/main/hfmirror/utils/session.py#L26
    return session


# Set it as the default session factory
configure_http_backend(backend_factory=backend_factory)
```

7.
Yes definitely but will not be done short term I think (not in the next 2-3 weeks) as my plate is a bit full at the moment 😕 |
@Wauplin Does it have a method to download a single file directly to a specified local path? This is very inconvenient for some customized download services, so maybe a simpler download method would help. |
Not at the moment, no. As you saw, you can only provide a directory, and the remote repo structure is kept as-is. What you can do is download to a tmp directory and then move the file wherever you want on your computer. Keeping the repo structure is made on purpose and I don't think we want to move away from it. Having a new |
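The "download to a tmp directory, then move" workaround can be wrapped in a small helper. A sketch under assumptions: `fetch_to` and its `download` callback are hypothetical names, not part of `huggingface_hub`; `download` stands in for whatever fetch call you use:

```python
import os
import shutil
import tempfile


def fetch_to(dest_path: str, download) -> str:
    """Run `download(tmp_dir)` (which must return the downloaded file's path)
    inside a throwaway directory, then copy the result to `dest_path`."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        src = download(tmp_dir)
        shutil.copy(src, dest_path)  # copy out before tmp_dir is removed
    return dest_path
```

With `huggingface_hub`, `download` could be something like `lambda tmp: hf_hub_download(repo_id, filename, cache_dir=tmp)`.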
Here is a PR to (finally) address this: #2389. We rely on the remote |
Is your feature request related to a problem? Please describe.
I am using Hugging Face to build a mirrored repository of resources. As the size of the resources is huge, I need to use `upload_file` and `upload_folder` to upload resources one by one, in order to avoid filling up the disk space by downloading all the resources at once. However, when some of the resource files are completely identical to the existing version in the Hugging Face repository, the huggingface_hub library still creates an empty commit, which is very detrimental to tracking data changes and results in thousands of empty commits every time the data is updated. Therefore, I would like to add an option so that when `upload_file` or `upload_folder` is executed and there is no substantial content change, no empty commit is created.

Additional context
Here is my repository: https://huggingface.co/HansBug/browser_drivers_mirror/tree/main
Here is an empty commit created by huggingface_hub: https://huggingface.co/HansBug/browser_drivers_mirror/commit/22f931ed1acc5dc67a0b43b6d2479320b3ed210a

Alternative solution (maybe)
Or, if a new API can be provided to check whether a local file or path is exactly the same as a file or path in the repository, the problem can also be solved.