Opt-in for downloading without symlinks #1240
Comments
Hi @pcuenca, thanks for opening the issue. As you said, this is indeed quite a niche situation. It reminds me of a discussion (internal link) triggered by @philschmid when he wanted to download a model from the Hub without the cache structure (e.g. the blobs and symlinks) in order to build docker containers (cc @julien-c as well). The solution you proposed is quite good. It would just require making sure the cache directory is not populated before About having a flag to disable symlinks (and activate #1067), I'm not against it. I would just wait for more requests before making it a feature of |
Yeah, I supposed there might be other scenarios where users need an exact copy of the file structure. Building a docker container sounds like one of them (or deployment tasks, in general). Happy to propose a PR if you do decide to go this way. |
I'm having the same issue. Our ML platform syncs the HF cache folder to S3. Due to the use of symlinks, this doesn’t work correctly. We'd love it if HF allowed disabling symlinks for the cache. Otherwise, using HF by all of our users will require workarounds that are hard to discover and use. UPDATE: After giving it more thought, I realized that it's better to resolve this use-case without changing the implementation of the cache but rather by adding a |
Hmm, I'm still not against doing so (e.g. not using symlinks when downloading/caching some files from the Hub). The only problem I have with it is that with the current cache, symlinks allow the same file to be shared between several revisions of the same repo. The workaround we built in #1067 is to copy files when symlinks are not supported. It is ok-ish in the sense that "it works", but with really degraded disk space efficiency. It was ok because the issue was raised for "non-admin-non-dev Windows users" who will not play a lot with different versions of a repo. Now if we make this feature available as a "normal usage", I'm a bit more skeptical. Let's say a user downloads a model of 10GB and then the readme gets updated by the author. Next time the user wants to load the model, it will be re-downloaded even though only the readme was updated. Furthermore, the model weights will be duplicated on disk for each edit of any file of the repo. Once again, this is not too problematic in the case of students in a course, but for real development it's different. |
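To make the disk-space concern above concrete, here is a minimal stdlib-only sketch. It does not use `huggingface_hub`; the `blobs/` and `snapshots/` folder names, the revision names, and the 1 MB stand-in for model weights are all illustrative assumptions. It builds two revisions that differ only in their README, once with a shared blob plus symlinks and once with plain copies, and compares the bytes stored.

```python
import os
import tempfile

# Stand-in for a large weights file (assumption: 1 MB instead of 10 GB).
WEIGHTS = b"\0" * 1_000_000


def disk_usage(root: str) -> int:
    """Sum the sizes of regular files under root; symlinks are not counted."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def build_layout(root: str, use_symlinks: bool) -> int:
    """Create two revisions with identical weights but different READMEs."""
    blob = os.path.join(root, "blobs", "weights")
    if use_symlinks:
        # The weights are stored once and shared by every revision.
        os.makedirs(os.path.dirname(blob))
        with open(blob, "wb") as f:
            f.write(WEIGHTS)
    for revision, readme in (("rev-a", b"v1"), ("rev-b", b"v2, updated")):
        snapshot = os.path.join(root, "snapshots", revision)
        os.makedirs(snapshot)
        with open(os.path.join(snapshot, "README.md"), "wb") as f:
            f.write(readme)
        target = os.path.join(snapshot, "weights.bin")
        if use_symlinks:
            os.symlink(blob, target)
        else:
            # Without symlinks, each revision carries its own full copy.
            with open(target, "wb") as f:
                f.write(WEIGHTS)
    return disk_usage(root)


def compare_layouts() -> tuple:
    """Return (bytes used with symlinks, bytes used with copies)."""
    with tempfile.TemporaryDirectory() as tmp:
        linked = build_layout(os.path.join(tmp, "linked"), use_symlinks=True)
        copied = build_layout(os.path.join(tmp, "copied"), use_symlinks=False)
    return linked, copied
```

In this toy setup the copy-based layout stores the weights once per revision, so every additional revision adds another full copy even when only the README changed.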
Otherwise yes, we could have a method to download to an arbitrary folder and without any versioning. I'm a bit worried that if we do such a thing, some integrations will be messy (not versioning means no caching, no sharing between libs, no way to scan/delete the cache,...) @julien-c do you have an opinion here ? |
Reopening this issue after some talks with @philschmid (for docker-related purposes). I think the best way to tackle this is to allow users to download stuff independently from the cache-system. It feels safer than playing with a "disable_symlinks" flag. What I propose is to add a new parameter "destination" to both
# Download checkpoints/ folder without cache
snapshot_download("my-cool-model", allow_patterns="checkpoints/**", cache_dir=False, destination="/path/to/local/folder")
# Download model weights folder without cache
hf_hub_download("my-cool-model", "pytorch_model.bin", cache_dir=False, destination="/path/to/local/folder")
Having WDYT @philschmid @pcuenca @patrickvonplaten? EDIT: Actually, we can still use the internal cache just in case the entry exists. If it does, we do a simple shutil.copyfile instead of re-downloading. But if the entry is not there, we don't update the cache. |
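The behaviour described in the EDIT (copy from the cache when the entry exists, otherwise download straight to the destination without touching the cache) could look roughly like the sketch below. This is not the library's implementation: `fetch_to_destination`, `cached_path`, and the `download` callable are hypothetical names introduced only for illustration, and the actual download step is left to the caller.

```python
import os
import shutil
from typing import Callable, Optional


def fetch_to_destination(
    filename: str,
    destination: str,
    cached_path: Optional[str],
    download: Callable[[str], None],
) -> str:
    """Place `filename` under `destination` and return the resulting path.

    cached_path: path of an existing cache entry, or None (hypothetical lookup result).
    download: callable that writes the file to the path it receives (hypothetical).
    """
    target = os.path.join(destination, filename)
    os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
    if cached_path is not None and os.path.isfile(cached_path):
        # shutil.copyfile follows symlinks by default, so a symlinked
        # cache entry ends up as a regular file in the destination.
        shutil.copyfile(cached_path, target)
    else:
        # Cache miss: download directly to the destination.
        # The cache is deliberately left untouched.
        download(target)
    return target
```

The design choice here is that the cache is read-only from this code path: it is reused when it already has the file, but a miss never populates it.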
IMO the suggestion above is a good one: the user has full control, and when it makes sense, the cache still can be leveraged. |
what about still using the cache (unless another flag is passed) but copying (or .... symlinking?) to |
Ok, let's forget about the clunky
Seems to be a good compromise for everyone, right? |
i would personally say that'd be reasonable^ |
@Wauplin that sounds reasonable to me too. |
Sounds good to me as well ! Just regarding naming, if |
Yep sorry I did not re-post here. I made a PR for that (#1360). The naming is the following:
# pip install git+https://github.com/huggingface/huggingface_hub@1240-snapshot-download-to-destination
from huggingface_hub import snapshot_download
# Download and cache files
snapshot_download(repo_id)
# Download and cache files + add symlinks in "my-folder"
snapshot_download(repo_id, local_dir="my-folder")
# Duplicate files already existing in cache
# and/or
# Download missing files directly to "my-folder"
# => if run multiple times, files are re-downloaded
snapshot_download(repo_id, local_dir="my-folder", local_dir_use_symlinks=False)
I tried to be consistent with the existing naming (especially in |
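For the use cases raised in this thread (syncing a folder to S3, building a docker image, feeding files to a tool that dislikes symlinks), it can help to check that a target folder contains only regular files before using it. Below is a small stdlib-only sketch; `find_symlinks` is a hypothetical helper, not part of `huggingface_hub`.

```python
import os
from typing import List


def find_symlinks(folder: str) -> List[str]:
    """Return the sorted relative paths of all symlinks found under `folder`."""
    found = []
    # os.walk does not follow directory symlinks by default, so a linked
    # directory is reported once and never descended into.
    for dirpath, dirnames, filenames in os.walk(folder):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                found.append(os.path.relpath(path, folder))
    return sorted(found)
```

An empty result for "my-folder" would suggest the folder can be copied or uploaded as-is; any path it returns points at a file that still depends on a link target elsewhere on disk.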
Thanks, everyone, and especially @Wauplin for implementing this. A quick question though. I just noticed that there is no way to use the newly added |
@peterschmidt85 , glad you like the solution :) About having it natively handled in |
@Wauplin Done: huggingface/diffusers#2886 |
This is possibly a niche use case.
I recently found that some libraries (coremltools, in this case) don't play nice with symlinks even on Unix platforms 😲. This led me to replace this one-liner, which was intended for user communication:
With this one (taken from the blog post):
It's not the end of the world, but in this case I really wanted to stress how easy it was to download Core ML checkpoints from the hub and use them downstream for whatever purpose.
If this is something that only affects coremltools, then it's not worthwhile doing anything (I'll open a PR there when I look into the problem in more depth). I'm raising the issue in case somebody else has observed other use cases that could benefit from a flag to unconditionally use #1067 even if symlinks are supported by the underlying OS.