
Opt-in for downloading without symlinks #1240

Closed
pcuenca opened this issue Dec 2, 2022 · 16 comments · Fixed by #1360

Comments

@pcuenca
Member

pcuenca commented Dec 2, 2022

This is possibly a niche use case.

I recently found that some libraries (coremltools, in this case) don't play nice with symlinks even on Unix platforms 😲. This led me to replace this one-liner, which was intended for user communication:

from huggingface_hub import snapshot_download

repo_id = "apple/coreml-stable-diffusion-v1-4"
variant = "original/packages"

downloaded = snapshot_download(repo_id, allow_patterns=f"{variant}/*")

With this one (taken from the blog post):

from huggingface_hub import snapshot_download
from huggingface_hub.file_download import repo_folder_name
from pathlib import Path
import shutil

repo_id = "apple/coreml-stable-diffusion-v1-4"
variant = "original/packages"

def download_model(repo_id, variant, output_dir):
    destination = Path(output_dir) / (repo_id.split("/")[-1] + "_" + variant.replace("/", "_"))
    if destination.exists():
        raise Exception(f"Model already exists at {destination}")
    
    # Download and copy without symlinks
    downloaded = snapshot_download(repo_id, allow_patterns=f"{variant}/*", cache_dir=output_dir)
    downloaded_bundle = Path(downloaded) / variant
    shutil.copytree(downloaded_bundle, destination)

    # Remove all downloaded files.
    # Warning: this erases the whole cache folder created under output_dir,
    # so output_dir must not double as a real cache directory.
    cache_folder = Path(output_dir) / repo_folder_name(repo_id=repo_id, repo_type="model")
    shutil.rmtree(cache_folder)
    return destination

model_path = download_model(repo_id, variant, output_dir="./models")
print(f"Model downloaded at {model_path}")

It's not the end of the world, but in this case I really wanted to stress how easy it was to download Core ML checkpoints from the hub and use them downstream for whatever purpose.

If this is something that only affects coremltools, then it's not worthwhile doing anything (I'll open a PR there when I look into the problem in more depth). I'm raising the issue in case somebody else has observed other use cases that could benefit from a flag to unconditionally use #1067 even if symlinks are supported by the underlying os.

@Wauplin
Contributor

Wauplin commented Dec 2, 2022

Hi @pcuenca, thanks for opening the issue. As you said, this is indeed quite a niche situation. It reminds me of a discussion (internal link) triggered by @philschmid, who wanted to download a model from the Hub without the cache structure (i.e. the blobs and symlinks) in order to build Docker containers (cc @julien-c as well).

The solution you proposed is quite good. It would just require making sure the cache directory is not populated before calling download_model, as it is completely erased by shutil.rmtree (i.e. users need to know what they are doing 😄).
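One way to make that pattern safer (a hypothetical variant, not part of huggingface_hub; `snapshot_download` is passed in as an argument purely to keep the sketch self-contained) is to download into a throwaway cache directory, so the cleanup step can never touch a pre-existing cache:

```python
import shutil
import tempfile
from pathlib import Path

# Sketch of a safer download_model: a throwaway cache directory means the
# cleanup step can never erase a pre-existing cache under output_dir.
# `snapshot_download` is injected as an argument to keep the sketch testable.
def download_model(repo_id, variant, output_dir, snapshot_download):
    destination = Path(output_dir) / (repo_id.split("/")[-1] + "_" + variant.replace("/", "_"))
    if destination.exists():
        raise FileExistsError(f"Model already exists at {destination}")
    with tempfile.TemporaryDirectory() as tmp_cache:
        downloaded = snapshot_download(repo_id, allow_patterns=f"{variant}/*", cache_dir=tmp_cache)
        shutil.copytree(Path(downloaded) / variant, destination)
    # TemporaryDirectory removes only tmp_cache on exit; nothing else is deleted.
    return destination
```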

About having a flag to disable symlinks (and activate #1067), I'm not against it. I would just wait for more requests before making it a feature of hfh.

@pcuenca
Member Author

pcuenca commented Dec 2, 2022

Yeah, I supposed there might be other scenarios where users need an exact copy of the file structure. Building a Docker container sounds like one of them (or deployment tasks in general). snapshot_download is much better than git clone because you can specify branches or patterns (as in my example above) and don't have to download stuff you don't need or keep a humongous .git directory with all the LFS blobs. Admittedly, we tend to create heavily overloaded repos with multiple variants for different frameworks, floating-point precisions, etc.

Happy to propose a PR if you do decide to go this way.

@peterschmidt85

peterschmidt85 commented Dec 11, 2022

I'm having the same issue.

Our ML platform syncs the HF cache folder to S3. Due to the use of symlinks, this doesn’t work correctly. We'd love it if HF allowed disabling symlinks for the cache.

Otherwise, every one of our users will need workarounds that are hard to discover and use.

UPDATE: After giving it more thought, I realized that it's better to resolve this use case without changing the implementation of the cache, but rather by adding a save_dir argument to the from_pretrained function that would automatically save the model to the given folder (without using symlinks).
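To make the failure mode concrete, here is a toy reproduction (a simplified layout that mimics the cache's relative `snapshots/ -> blobs/` symlinks; none of this is real hfh code). Syncing or copying only a snapshot folder while preserving symlinks leaves dangling links at the target:

```python
import shutil
import tempfile
from pathlib import Path

# Toy cache layout mimicking the hub cache: snapshot files are relative
# symlinks into a sibling blobs/ directory.
tmp = Path(tempfile.mkdtemp())
blobs = tmp / "cache" / "blobs"
snapshot = tmp / "cache" / "snapshots" / "abc123"
blobs.mkdir(parents=True)
snapshot.mkdir(parents=True)
(blobs / "deadbeef").write_text("weights")
(snapshot / "model.bin").symlink_to(Path("../../blobs/deadbeef"))

# "Sync" only the snapshot folder, preserving symlinks (roughly what a
# naive S3 sync of the snapshot directory would do).
shutil.copytree(snapshot, tmp / "synced", symlinks=True)

copied = tmp / "synced" / "model.bin"
print(copied.is_symlink())  # True: the link was carried over verbatim
print(copied.exists())      # False: ../../blobs/deadbeef no longer resolves
```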

@Wauplin
Contributor

Wauplin commented Dec 12, 2022

Hmm, I'm still not against doing so (i.e. not using symlinks when downloading/caching some files from the Hub).

The only problem I have with it is that with the current cache, symlinks allow the same file to be shared between several revisions of the same repo. The workaround we built in #1067 is to copy files when symlinks are not supported. It is ok-ish in the sense that "it works", but with really degraded disk-space efficiency. That was acceptable because the issue was raised for "non-admin-non-dev Windows users" who will not play a lot with different versions of a repo.

Now if we make this feature available as "normal usage", I'm a bit more skeptical. Let's say a user downloads a model of 10GB and then the readme gets updated by the author. Next time the user wants to load the model, it will be re-downloaded even though only the readme was updated. Furthermore, the model weights will be duplicated on disk for each edit of any file of the repo. Once again, this is not too problematic for students in a course, but for real development it's different.
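The sharing argument can be illustrated with a toy version of the layout (simplified, not the actual cache code): two snapshot revisions point at one blob, so the weights exist once on disk no matter how many revisions reference them, whereas a plain-copy cache would hold one full copy per revision:

```python
import tempfile
from pathlib import Path

cache = Path(tempfile.mkdtemp())
(cache / "blobs").mkdir()
(cache / "blobs" / "weighthash").write_text("0123456789")  # stand-in for 10GB of weights

# Two revisions of the repo, e.g. before and after a README-only edit:
# both reference the same unchanged weights blob via a relative symlink.
for rev in ("rev1", "rev2"):
    d = cache / "snapshots" / rev
    d.mkdir(parents=True)
    (d / "model.bin").symlink_to(Path("../../blobs/weighthash"))

r1 = (cache / "snapshots" / "rev1" / "model.bin").resolve()
r2 = (cache / "snapshots" / "rev2" / "model.bin").resolve()
print(r1 == r2)  # True: both revisions resolve to the single on-disk blob
```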

@Wauplin
Contributor

Wauplin commented Dec 12, 2022

Otherwise yes, we could have a method to download to an arbitrary folder without any versioning. I'm a bit worried that if we do such a thing, some integrations will be messy (no versioning means no caching, no sharing between libs, no way to scan/delete the cache, ...).

@julien-c do you have an opinion here ?

Wauplin added this to the v0.13 milestone Feb 8, 2023
@Wauplin
Contributor

Wauplin commented Feb 17, 2023

Reopening this issue after some discussions with @philschmid (for Docker-related purposes). I think the best way to tackle this is to let users download files independently of the cache system. It feels safer than playing with a "disable_symlinks" flag.

What I propose is to add a new parameter "destination" to both hf_hub_download and snapshot_download. When destination is set, cache_dir must explicitly be set to False.

# Download checkpoints/ folder without cache
snapshot_download("my-cool-model", allow_patterns="checkpoints/**", cache_dir=False, destination="/path/to/local/folder")

# Download model weights folder without cache
hf_hub_download("my-cool-model", "pytorch_model.bin", cache_dir=False, destination="/path/to/local/folder")

Having cache_dir set to False is a bit redundant, but I think it makes it very clear to the user that no caching will be done. And cache_dir=True would have no meaning (we would raise an exception in that case).

WDYT @philschmid @pcuenca @patrickvonplaten?
@julien-c I already have an intuition that you will not like the idea (😄) but making it explicit should prevent misuse from users, shouldn't it? At least I prefer this solution to passing environment variables.


EDIT: Actually, we can still use the internal cache in case the entry exists. If it does, we do a simple shutil.copyfile instead of re-downloading. But if the entry is not there, we download directly and don't update the cache.

@peterschmidt85

> EDIT: Actually, maybe we should still use the internal cache in case the entry exists. If it does, we do a simple shutil.copyfile instead of re-downloading. But if the entry is not there, we don't update the cache.

IMO the suggestion above is a good one: the user has full control, and when it makes sense, the cache still can be leveraged.

@julien-c
Member

what about still using the cache (unless another flag is passed) but copying (or .... symlinking?) to destination at the end?

@Wauplin
Contributor

Wauplin commented Feb 21, 2023

Ok, let's forget about the clunky cache_dir=False that I suggested and add a use_symlinks input flag instead (default True):

  • if destination is set + use_symlinks=True => blobs are downloaded in the cache directory + symlinked to the destination folder
  • if destination is set + use_symlinks=False
    • if blob is already in cache directory => we copy-paste it in the destination directory
    • if blob is not cached => we download it directly to the destination directory. If the user does this in multiple locations, the file will be downloaded multiple times, but that would be the user's choice
  • if destination is not set + use_symlinks=False => raise ValueError (doesn't make sense)

Seems to be a good compromise for everyone, right?
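Under those rules, the branching could look roughly like this (a hedged sketch, not the actual hfh implementation; `blob_in_cache`, `ensure_cached`, and `download_fn` are hypothetical stand-ins for a cache lookup, a cached download, and a direct download):

```python
import os
import shutil
from pathlib import Path

# Sketch of the decision table above. All helpers are hypothetical:
# `blob_in_cache` is the cached blob path (or None), `ensure_cached`
# downloads into the cache and returns the blob path, `download_fn`
# downloads straight to a target path.
def deliver(blob_in_cache, destination, use_symlinks, ensure_cached, download_fn, filename):
    if destination is None:
        if not use_symlinks:
            raise ValueError("use_symlinks=False only makes sense with a destination")
        return ensure_cached(filename)            # normal cached download
    dest = Path(destination) / filename
    dest.parent.mkdir(parents=True, exist_ok=True)
    if use_symlinks:
        blob = ensure_cached(filename)            # blob goes to the cache...
        os.symlink(blob, dest)                    # ...and is symlinked to destination
    elif blob_in_cache is not None:
        shutil.copyfile(blob_in_cache, dest)      # cache hit: copy-paste
    else:
        download_fn(dest)                         # cache miss: direct download, cache untouched
    return dest
```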

@julien-c
Member

i would personally say that'd be reasonable^

@pcuenca
Member Author

pcuenca commented Feb 21, 2023

@Wauplin that sounds reasonable to me too.

@patrickvonplaten
Contributor

Sounds good to me as well! Just regarding naming: if destination always only contains symlinks (not 100% sure if that's the case), then symlink_to could maybe be a better name?

@Wauplin
Contributor

Wauplin commented Mar 1, 2023

Yep, sorry I did not re-post here. I made a PR for that (#1360). The naming is as follows:

# pip install git+https://github.com/huggingface/huggingface_hub@1240-snapshot-download-to-destination
from huggingface_hub import snapshot_download

# Download and cache files
snapshot_download(repo_id)

# Download and cache files + add symlinks in "my-folder"
snapshot_download(repo_id, local_dir="my-folder")

# Duplicate files already existing in cache
# and/or
# Download missing files directly to "my-folder"
#     => if ran multiple times, files are re-downloaded
snapshot_download(repo_id, local_dir="my-folder", local_dir_use_symlinks=False)

I tried to be consistent with the existing naming (especially in transformers and huggingface_hub.Repository) + made it explicit that we are talking about symlinks to the local dir (and not inside the cache).
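For anyone wanting to verify which mode they ended up with (on a platform where symlinks work at all), a small check over the local dir does it. This is a throwaway helper, not part of the library:

```python
from pathlib import Path

# Throwaway helper (not part of huggingface_hub): report whether a
# local dir holds symlinks into the cache or self-contained copies.
def contains_symlinks(local_dir):
    return any(p.is_symlink() for p in Path(local_dir).rglob("*"))
```

With the default behavior this should find symlinks in "my-folder"; with local_dir_use_symlinks=False it should not.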

@peterschmidt85

Thanks, everyone, and especially @Wauplin for implementing this.

A quick question though: I just noticed that there is no way to use the newly added local_dir and local_dir_use_symlinks with StableDiffusionPipeline.from_pretrained.
Do you think you could add the same parameters to StableDiffusionPipeline.from_pretrained as well?
This would allow using StableDiffusionPipeline.from_pretrained directly (which is very convenient).

@Wauplin
Contributor

Wauplin commented Mar 28, 2023

@peterschmidt85 , glad you like the solution :)

About having it natively handled in StableDiffusionPipeline.from_pretrained, I guess that should be doable. However, it needs adaptation in the diffusers library. Could you create a new issue in the diffusers repository to keep track of this? Thanks in advance!

@peterschmidt85

@Wauplin Done: huggingface/diffusers#2886
Thank you!
