Llama.from_pretrained should work with HF_HUB_OFFLINE=1 #1801

Open
@davidgilbertson

Description

Is your feature request related to a problem? Please describe.
Even when the model is already downloaded, Llama.from_pretrained still makes a call to the Hugging Face Hub, which increases the load time.

From a quick scan of the logic here, it seems that the code just wants to check that the filename provided is in the repo provided.

Describe the solution you'd like
If from_pretrained skipped that check, assumed the file existed, and called hf_hub_download directly, that function would raise its own error if it couldn't find the file in the given repo.

The error may not be quite as focused, but init would run in a third of the time.

On my machine:

  • loading from cache takes 400ms
  • loading from cache with this additional check of available files in the repo takes 1,200ms
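
To make the suggestion concrete, here is a minimal sketch of the idea, not the library's actual implementation: hand the filename straight to hf_hub_download and let it report a missing file. The helper name resolve_model_file and its download parameter are hypothetical (the parameter exists only so the sketch can be exercised without network access), and it assumes filename is an exact file name in the repo rather than a pattern.

```python
from typing import Callable, Optional


def resolve_model_file(
    repo_id: str,
    filename: str,
    download: Optional[Callable[..., str]] = None,
) -> str:
    """Return a local path for `filename` without listing the repo first.

    `resolve_model_file` and `download` are hypothetical names for this sketch.
    """
    if download is None:
        # Imported lazily so the sketch can be read without the package installed.
        from huggingface_hub import hf_hub_download

        download = hf_hub_download
    # No pre-check of the repo's file list: if the file is missing,
    # the download call raises its own error.
    return download(repo_id=repo_id, filename=filename)
```

With HF_HUB_OFFLINE=1 set, hf_hub_download resolves the file from the local cache and raises if it is not there, so the same code path would cover the offline case.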

Describe alternatives you've considered
The workaround, if I want to do everything in Python, is to call from_pretrained once to download the appropriate file, then look up the cached file location and pass it as model_path to Llama directly, without using from_pretrained.
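
A rough sketch of the second half of that workaround, once the file is already in the cache. It uses hf_hub_download with local_files_only=True to look up the cached path, which is one way to do the lookup and an assumption of this sketch. The helper name load_from_cache and the download and loader parameters are hypothetical; they default to the real hf_hub_download and Llama.

```python
from typing import Any, Callable, Optional


def load_from_cache(
    repo_id: str,
    filename: str,
    download: Optional[Callable[..., str]] = None,
    loader: Optional[Callable[..., Any]] = None,
    **llama_kwargs: Any,
) -> Any:
    """Load an already-cached model file without going through from_pretrained.

    `load_from_cache`, `download`, and `loader` are hypothetical names.
    """
    if download is None:
        from huggingface_hub import hf_hub_download

        download = hf_hub_download
    if loader is None:
        from llama_cpp import Llama

        loader = Llama
    # local_files_only=True returns the cached path and raises if the file
    # has not been downloaded yet; no network request is made.
    model_path = download(repo_id=repo_id, filename=filename, local_files_only=True)
    # Extra keyword arguments (for example n_ctx) are passed through to Llama.
    return loader(model_path=model_path, **llama_kwargs)
```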

Additional context
When working with HF models, I set HF_HUB_OFFLINE=1 by default and only turn it off when I need a new model, because a few HF operations make model-info checks that require network requests even when the cache is primed. It would be great if llama-cpp-python were compatible with this.
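
For anyone who wants to set the flag from Python rather than the shell, a small sketch follows. The helper name set_hf_offline is hypothetical. It assumes huggingface_hub reads the variable when the package is first imported, so the helper should be called before that import; the return value flags the case where it was called too late.

```python
import os
import sys


def set_hf_offline(enabled: bool = True) -> bool:
    """Set HF_HUB_OFFLINE for the current process.

    Returns False if huggingface_hub was already imported, in which case the
    new value may not be picked up (assumption: the flag is read at import).
    `set_hf_offline` is a hypothetical helper name.
    """
    os.environ["HF_HUB_OFFLINE"] = "1" if enabled else "0"
    return "huggingface_hub" not in sys.modules
```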

Side note: I just started using this today and was delighted with how easy it was to install, with CUDA support, from a single pip command. Nice work.
