
Opt-in for downloading without symlinks #1240

Closed
pcuenca opened this issue Dec 2, 2022 · 16 comments · Fixed by #1360

Comments

@pcuenca
Member

pcuenca commented Dec 2, 2022

This is possibly a niche use case.

I recently found that some libraries (coremltools, in this case) don't play nice with symlinks even on Unix platforms 😲. This led me to replace this one-liner, which was intended for user communication:

from huggingface_hub import snapshot_download

repo_id = "apple/coreml-stable-diffusion-v1-4"
variant = "original/packages"

downloaded = snapshot_download(repo_id, allow_patterns=f"{variant}/*")

With this one (taken from the blog post):

from huggingface_hub import snapshot_download
from huggingface_hub.file_download import repo_folder_name
from pathlib import Path
import shutil

repo_id = "apple/coreml-stable-diffusion-v1-4"
variant = "original/packages"

def download_model(repo_id, variant, output_dir):
    destination = Path(output_dir) / (repo_id.split("/")[-1] + "_" + variant.replace("/", "_"))
    if destination.exists():
        raise Exception(f"Model already exists at {destination}")
    
    # Download and copy without symlinks
    downloaded = snapshot_download(repo_id, allow_patterns=f"{variant}/*", cache_dir=output_dir)
    downloaded_bundle = Path(downloaded) / variant
    shutil.copytree(downloaded_bundle, destination)

    # Remove all downloaded files.
    # Warning: this erases the whole cache folder created under output_dir,
    # so output_dir must not double as a real cache directory.
    cache_folder = Path(output_dir) / repo_folder_name(repo_id=repo_id, repo_type="model")
    shutil.rmtree(cache_folder)
    return destination

model_path = download_model(repo_id, variant, output_dir="./models")
print(f"Model downloaded at {model_path}")

It's not the end of the world, but in this case I really wanted to stress how easy it was to download Core ML checkpoints from the hub and use them downstream for whatever purpose.

If this is something that only affects coremltools, then it's not worthwhile doing anything (I'll open a PR there when I look into the problem in more depth). I'm raising the issue in case somebody else has observed other use cases that could benefit from a flag to unconditionally use #1067 even if symlinks are supported by the underlying os.

@Wauplin
Contributor

Wauplin commented Dec 2, 2022

Hi @pcuenca, thanks for opening the issue. As you said, this is indeed quite a niche situation. It reminds me of a discussion (internal link) triggered by @philschmid, who wanted to download a model from the Hub without the cache structure (i.e. the blobs and symlinks) in order to build Docker containers (cc @julien-c as well).

The solution you proposed is quite good. It would just require making sure the cache directory is not populated before calling download_model, as it is completely erased by shutil.rmtree (i.e. users need to know what they are doing 😄).
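One way to make that pattern safer (a hypothetical variant, not part of huggingface_hub; `snapshot_download` is passed in as an argument purely to keep the sketch self-contained) is to download into a throwaway cache directory, so the cleanup step can never touch a pre-existing cache:

```python
import shutil
import tempfile
from pathlib import Path

# Sketch of a safer download_model: a throwaway cache directory means the
# cleanup step can never erase a pre-existing cache under output_dir.
# `snapshot_download` is injected as an argument to keep the sketch testable.
def download_model(repo_id, variant, output_dir, snapshot_download):
    destination = Path(output_dir) / (repo_id.split("/")[-1] + "_" + variant.replace("/", "_"))
    if destination.exists():
        raise FileExistsError(f"Model already exists at {destination}")
    with tempfile.TemporaryDirectory() as tmp_cache:
        downloaded = snapshot_download(repo_id, allow_patterns=f"{variant}/*", cache_dir=tmp_cache)
        shutil.copytree(Path(downloaded) / variant, destination)
    # TemporaryDirectory removes only tmp_cache on exit; nothing else is deleted.
    return destination
```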

About having a flag to disable symlinks (and activate #1067), I'm not against it. I would just wait for more requests before making it a feature of hfh.

@pcuenca
Member Author

pcuenca commented Dec 2, 2022

Yeah, I supposed there might be other scenarios where users need an exact copy of the file structure. Building a Docker container sounds like one of them (or deployment tasks in general). snapshot_download is much better than git clone because you can specify branches or patterns (as in my example above) and don't have to download stuff you don't need or keep a humongous .git directory with all the LFS blobs. Admittedly, we tend to create heavily overloaded repos with multiple variants for different frameworks, floating-point precisions, etc.

Happy to propose a PR if you do decide to go this way.

@peterschmidt85

peterschmidt85 commented Dec 11, 2022

I'm having the same issue.

Our ML platform syncs the HF cache folder to S3. Due to the use of symlinks, this doesn’t work correctly. We'd love it if HF allowed disabling symlinks for the cache.

Otherwise, every one of our users will need workarounds that are hard to discover and use.

UPDATE: After giving it more thought, I realized that it's better to resolve this use case without changing the implementation of the cache, but rather by adding a save_dir argument to the from_pretrained function that would automatically save the model to the given folder (without using symlinks).
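To make the failure mode concrete, here is a toy reproduction (a simplified layout that mimics the cache's relative `snapshots/ -> blobs/` symlinks; none of this is real hfh code). Syncing or copying only a snapshot folder while preserving symlinks leaves dangling links at the target:

```python
import shutil
import tempfile
from pathlib import Path

# Toy cache layout mimicking the hub cache: snapshot files are relative
# symlinks into a sibling blobs/ directory.
tmp = Path(tempfile.mkdtemp())
blobs = tmp / "cache" / "blobs"
snapshot = tmp / "cache" / "snapshots" / "abc123"
blobs.mkdir(parents=True)
snapshot.mkdir(parents=True)
(blobs / "deadbeef").write_text("weights")
(snapshot / "model.bin").symlink_to(Path("../../blobs/deadbeef"))

# "Sync" only the snapshot folder, preserving symlinks (roughly what a
# naive S3 sync of the snapshot directory would do).
shutil.copytree(snapshot, tmp / "synced", symlinks=True)

copied = tmp / "synced" / "model.bin"
print(copied.is_symlink())  # True: the link was carried over verbatim
print(copied.exists())      # False: ../../blobs/deadbeef no longer resolves
```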

@Wauplin
Contributor

Wauplin commented Dec 12, 2022

Hmm, I'm still not against doing so (i.e. not using symlinks when downloading/caching some files from the Hub).

The only problem I have with it is that with the current cache, symlinks allow the same file to be shared between several revisions of the same repo. The workaround we built in #1067 is to copy files when symlinks are not supported. It is ok-ish in the sense that "it works", but with really degraded disk-space efficiency. That was acceptable because the issue was raised for "non-admin-non-dev Windows users" who will not play a lot with different versions of a repo.

Now if we make this feature available as "normal usage", I'm a bit more skeptical. Let's say a user downloads a model of 10GB and then the readme gets updated by the author. Next time the user wants to load the model, it will be re-downloaded even though only the readme was updated. Furthermore, the model weights will be duplicated on disk for each edit of any file of the repo. Once again, this is not too problematic for students in a course, but for real development it's different.
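The sharing argument can be illustrated with a toy version of the layout (simplified, not the actual cache code): two snapshot revisions point at one blob, so the weights exist once on disk no matter how many revisions reference them, whereas a plain-copy cache would hold one full copy per revision:

```python
import tempfile
from pathlib import Path

cache = Path(tempfile.mkdtemp())
(cache / "blobs").mkdir()
(cache / "blobs" / "weighthash").write_text("0123456789")  # stand-in for 10GB of weights

# Two revisions of the repo, e.g. before and after a README-only edit:
# both reference the same unchanged weights blob via a relative symlink.
for rev in ("rev1", "rev2"):
    d = cache / "snapshots" / rev
    d.mkdir(parents=True)
    (d / "model.bin").symlink_to(Path("../../blobs/weighthash"))

r1 = (cache / "snapshots" / "rev1" / "model.bin").resolve()
r2 = (cache / "snapshots" / "rev2" / "model.bin").resolve()
print(r1 == r2)  # True: both revisions resolve to the single on-disk blob
```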

@Wauplin
Contributor

Wauplin commented Dec 12, 2022

Otherwise yes, we could have a method to download to an arbitrary folder without any versioning. I'm a bit worried that if we do such a thing, some integrations will be messy (no versioning means no caching, no sharing between libs, no way to scan/delete the cache, ...).

@julien-c do you have an opinion here ?

Wauplin added this to the v0.13 milestone Feb 8, 2023
@Wauplin
Contributor

Wauplin commented Feb 17, 2023

Reopening this issue after some discussions with @philschmid (for Docker-related purposes). I think the best way to tackle this is to let users download files independently of the cache system. It feels safer than playing with a "disable_symlinks" flag.

What I propose is to add a new parameter "destination" to both hf_hub_download and snapshot_download. When destination is set, cache_dir must explicitly be set to False.

# Download checkpoints/ folder without cache
snapshot_download("my-cool-model", allow_patterns="checkpoints/**", cache_dir=False, destination="/path/to/local/folder")

# Download model weights folder without cache
hf_hub_download("my-cool-model", "pytorch_model.bin", cache_dir=False, destination="/path/to/local/folder")

Having cache_dir set to False is a bit redundant, but I think it makes it very clear to the user that no caching will be done. And cache_dir=True would have no meaning (we would raise an exception in that case).

WDYT @philschmid @pcuenca @patrickvonplaten?
@julien-c I already have an intuition that you will not like the idea (😄) but making it explicit should prevent misuse from users, shouldn't it? At least I prefer this solution to passing environment variables.


EDIT: Actually, we can still use the internal cache in case the entry exists. If it does, we do a simple shutil.copyfile instead of re-downloading. But if the entry is not there, we download directly and don't update the cache.

@peterschmidt85

> EDIT: Actually, maybe we should still use the internal cache in case the entry exists. If it does, we do a simple shutil.copyfile instead of re-downloading. But if the entry is not there, we don't update the cache.

IMO the suggestion above is a good one: the user has full control, and when it makes sense, the cache still can be leveraged.

@julien-c
Member

what about still using the cache (unless another flag is passed) but copying (or .... symlinking?) to destination at the end?

@Wauplin
Contributor

Wauplin commented Feb 21, 2023

Ok, let's forget about the clunky cache_dir=False that I suggested and add a use_symlinks input flag instead (default True):

  • if destination is set + use_symlinks=True => blobs are downloaded in the cache directory + symlinked to the destination folder
  • if destination is set + use_symlinks=False
    • if blob is already in cache directory => we copy-paste it in the destination directory
    • if blob is not cached => we download it directly to the destination directory. If the user does this in multiple locations, the file will be downloaded multiple times, but that would be the user's choice
  • if destination is not set + use_symlinks=False => raise ValueError (doesn't make sense)

Seems to be a good compromise for everyone, right?
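Under those rules, the branching could look roughly like this (a hedged sketch, not the actual hfh implementation; `blob_in_cache`, `ensure_cached`, and `download_fn` are hypothetical stand-ins for a cache lookup, a cached download, and a direct download):

```python
import os
import shutil
from pathlib import Path

# Sketch of the decision table above. All helpers are hypothetical:
# `blob_in_cache` is the cached blob path (or None), `ensure_cached`
# downloads into the cache and returns the blob path, `download_fn`
# downloads straight to a target path.
def deliver(blob_in_cache, destination, use_symlinks, ensure_cached, download_fn, filename):
    if destination is None:
        if not use_symlinks:
            raise ValueError("use_symlinks=False only makes sense with a destination")
        return ensure_cached(filename)            # normal cached download
    dest = Path(destination) / filename
    dest.parent.mkdir(parents=True, exist_ok=True)
    if use_symlinks:
        blob = ensure_cached(filename)            # blob goes to the cache...
        os.symlink(blob, dest)                    # ...and is symlinked to destination
    elif blob_in_cache is not None:
        shutil.copyfile(blob_in_cache, dest)      # cache hit: copy-paste
    else:
        download_fn(dest)                         # cache miss: direct download, cache untouched
    return dest
```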

@julien-c
Member

i would personally say that'd be reasonable^

@pcuenca
Member Author

pcuenca commented Feb 21, 2023

@Wauplin that sounds reasonable to me too.

@patrickvonplaten
Contributor

Sounds good to me as well! Just regarding naming: if destination always only contains symlinks (not 100% sure if that's the case), then symlink_to could maybe be a better name?

@Wauplin
Contributor

Wauplin commented Mar 1, 2023

Yep, sorry I did not re-post here. I made a PR for that (#1360). The naming is as follows:

# pip install git+https://github.com/huggingface/huggingface_hub@1240-snapshot-download-to-destination
from huggingface_hub import snapshot_download

# Download and cache files
snapshot_download(repo_id)

# Download and cache files + add symlinks in "my-folder"
snapshot_download(repo_id, local_dir="my-folder")

# Duplicate files already existing in cache
# and/or
# Download missing files directly to "my-folder"
#     => if ran multiple times, files are re-downloaded
snapshot_download(repo_id, local_dir="my-folder", local_dir_use_symlinks=False)

I tried to be consistent with the existing naming (especially in transformers and huggingface_hub.Repository) + made it explicit that we are talking about symlinks to the local dir (and not inside the cache).
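For anyone wanting to verify which mode they ended up with (on a platform where symlinks work at all), a small check over the local dir does it. This is a throwaway helper, not part of the library:

```python
from pathlib import Path

# Throwaway helper (not part of huggingface_hub): report whether a
# local dir holds symlinks into the cache or self-contained copies.
def contains_symlinks(local_dir):
    return any(p.is_symlink() for p in Path(local_dir).rglob("*"))
```

With the default behavior this should find symlinks in "my-folder"; with local_dir_use_symlinks=False it should not.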

@peterschmidt85

Thanks, everyone, and especially @Wauplin for implementing this.

A quick question though: I just noticed that there is no way to use the newly added local_dir and local_dir_use_symlinks with StableDiffusionPipeline.from_pretrained.
Do you think you could add the same parameters to StableDiffusionPipeline.from_pretrained as well?
This would allow using StableDiffusionPipeline.from_pretrained directly (which is very convenient).

@Wauplin
Contributor

Wauplin commented Mar 28, 2023

@peterschmidt85 , glad you like the solution :)

About having it natively handled in StableDiffusionPipeline.from_pretrained, I guess that should be doable. However, it needs adaptation in the diffusers library. Could you create a new issue in the diffusers repository to keep track of this? Thanks in advance!

@peterschmidt85

@Wauplin Done: huggingface/diffusers#2886
Thank you!
