
[LLM] Can't access gated huggingface models with vLLM V1 #54812

@Aydin-ab


What happened + What you expected to happen

Hi, I'm trying to deploy the Llama 3 8B model. I've requested access to the gated repo and have a valid Hugging Face token (HF_TOKEN). I'm using a simple Ray Serve config, but when I run it, weight loading fails with a Hugging Face permission/authentication error (401 Unauthorized).

The issue only occurs when using vLLM V1. It does not happen when I fall back to vLLM V0 or when I use an older Ray version (where I assume vLLM V0 is the default?). See the Fix section below for details.
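To rule out the token itself, here is the kind of check I would run outside of Ray (a minimal sketch using the public huggingface_hub API; this script is not part of the deployment):

```python
# sanity_check_token.py -- hypothetical helper, not part of the Serve deployment.
# Checks that HF_TOKEN is valid and can see the gated repo, without Ray or vLLM involved.
import os

from huggingface_hub import HfApi

token = os.environ["HF_TOKEN"]
api = HfApi(token=token)

# whoami() fails fast if the token is invalid.
print("Authenticated as:", api.whoami()["name"])

# model_info() raises a gated-repo / 401 error if the token lacks access.
info = api.model_info("meta-llama/Meta-Llama-3-8B-Instruct")
print("Can see revision:", info.sha)
```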

Logs

Download the logs with: anyscale logs cluster --id ses_21xd987a267m97yc9v1avwyf6c --download --download-dir /tmp

Here is the relevant part of the dump:

 
Click to expand logs
```
(ProxyActor pid=12245, ip=10.0.155.99) INFO 2025-07-21 18:04:04,861 proxy 10.0.155.99 -- Proxy starting on node 10b7a33fa7707947cf45dbc768cb474f7053c5b9b1d228e6affbb901 (HTTP port: 8000).
(ProxyActor pid=12245, ip=10.0.155.99) INFO 2025-07-21 18:04:04,911 proxy 10.0.155.99 -- Got updated endpoints: {Deployment(name='LLMRouter', app='llama3-app'): EndpointInfo(route='/', app_is_cross_language=False)}.
(ProxyActor pid=12245, ip=10.0.155.99) INFO 2025-07-21 18:04:04,934 proxy 10.0.155.99 -- Started .
(pid=12615, ip=10.0.155.99) INFO 07-21 18:04:11 [__init__.py:244] Automatically detected platform cuda.
(_get_vllm_engine_config pid=12615, ip=10.0.155.99) INFO 07-21 18:04:19 [config.py:841] This model supports multiple tasks: {'generate', 'classify', 'embed', 'reward'}. Defaulting to 'generate'.
(_get_vllm_engine_config pid=12615, ip=10.0.155.99) INFO 07-21 18:04:19 [config.py:1472] Using max model len 8192
(_get_vllm_engine_config pid=12615, ip=10.0.155.99) INFO 07-21 18:04:19 [config.py:2285] Chunked prefill is enabled with max_num_batched_tokens=2048.
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) You are using a model of type llama to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) You are using a model of type llama to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) INFO 2025-07-21 18:04:20,494 llama3-app_LLMDeploymentllama-3-8B-instruct mjgdnawj -- Clearing the current platform cache ...
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) INFO 2025-07-21 18:04:20,499 llama3-app_LLMDeploymentllama-3-8B-instruct mjgdnawj -- Using executor class:
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) WARNING 07-21 18:04:21 [__init__.py:2662] We must use the `spawn` multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/usage/troubleshooting.html#python-multiprocessing for more information. Reason: In a Ray actor and can only be spawned
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) INFO 07-21 18:04:25 [__init__.py:244] Automatically detected platform cuda.
(ServeController pid=179381) WARNING 2025-07-21 18:04:26,608 controller 179381 -- Deployment 'LLMDeploymentllama-3-8B-instruct' in application 'llama3-app' has 1 replicas that have taken more than 30s to initialize.
(ServeController pid=179381) This may be caused by a slow __init__ or reconfigure method.
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) 2025-07-21 18:04:27,963 INFO worker.py:1606 -- Using address 10.0.94.29:6379 set in the environment variable RAY_ADDRESS
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) 2025-07-21 18:04:27,971 INFO worker.py:1747 -- Connecting to existing Ray cluster at address: 10.0.94.29:6379...
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) 2025-07-21 18:04:27,983 INFO worker.py:1918 -- Connected to Ray cluster. View the dashboard at https://session-21xd987a267m97yc9v1avwyf6c.i.anyscaleuserdata.com
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) INFO 07-21 18:04:27 [core.py:526] Waiting for init message from front-end.
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) INFO 07-21 18:04:27 [core.py:69] Initializing a V1 LLM engine (v0.9.2) with config: model='meta-llama/Meta-Llama-3-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Meta-Llama-3-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='xgrammar', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=llama-3-8B-instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) INFO 07-21 18:04:28 [ray_utils.py:313] Using the existing placement group
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) INFO 07-21 18:04:28 [ray_distributed_executor.py:177] use_ray_spmd_worker: True
(pid=12935, ip=10.0.155.99) INFO 07-21 18:04:32 [__init__.py:244] Automatically detected platform cuda.
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) (pid=12935) INFO 07-21 18:04:32 [__init__.py:244] Automatically detected platform cuda.
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) INFO 07-21 18:04:33 [ray_distributed_executor.py:353] non_carry_over_env_vars from config: set()
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) INFO 07-21 18:04:33 [ray_distributed_executor.py:355] Copying the following environment variables to workers: ['LD_LIBRARY_PATH', 'VLLM_USE_RAY_SPMD_WORKER', 'VLLM_USE_RAY_COMPILED_DAG', 'VLLM_WORKER_MULTIPROC_METHOD']
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) INFO 07-21 18:04:33 [ray_distributed_executor.py:358] If certain env vars should NOT be copied to workers, add them to /home/ray/.config/vllm/ray_non_carry_over_env_vars.json file
(ServeController pid=179381) WARNING 2025-07-21 18:04:34,721 controller 179381 -- Deployment 'LLMRouter' in application 'llama3-app' has 2 replicas that have taken more than 30s to initialize.
(ServeController pid=179381) This may be caused by a slow __init__ or reconfigure method.
(RayWorkerWrapper pid=12935, ip=10.0.155.99) INFO 07-21 18:04:36 [parallel_state.py:1076] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(RayWorkerWrapper pid=12935, ip=10.0.155.99) WARNING 07-21 18:04:36 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(RayWorkerWrapper pid=12935, ip=10.0.155.99) INFO 07-21 18:04:36 [gpu_model_runner.py:1770] Starting to load model meta-llama/Meta-Llama-3-8B-Instruct...
(RayWorkerWrapper pid=12935, ip=10.0.155.99) INFO 07-21 18:04:36 [gpu_model_runner.py:1775] Loading model from scratch...
(RayWorkerWrapper pid=12935, ip=10.0.155.99) INFO 07-21 18:04:36 [cuda.py:284] Using Flash Attention backend on V1 engine.
(RayWorkerWrapper pid=12935, ip=10.0.155.99) INFO 07-21 18:04:37 [weight_utils.py:292] Using model weights format ['*.safetensors']
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) Process EngineCore_0:
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) Traceback (most recent call last):
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) File "/home/ray/anaconda3/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) self.run()
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) File "/home/ray/anaconda3/lib/python3.11/multiprocessing/process.py", line 108, in run
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) self._target(*self._args, **self._kwargs)
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 590, in run_engine_core
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) raise e
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 577, in run_engine_core
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) engine_core = EngineCoreProc(*args, **kwargs)
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 404, in __init__
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) super().__init__(vllm_config, executor_class, log_stats,
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 75, in __init__
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) self.model_executor = executor_class(vllm_config)
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 287, in __init__
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) super().__init__(*args, **kwargs)
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 53, in __init__
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) self._init_executor()
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/executor/ray_distributed_executor.py", line 115, in _init_executor
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) self._init_workers_ray(placement_group)
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/executor/ray_distributed_executor.py", line 397, in _init_workers_ray
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) self._run_workers("load_model",
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/executor/ray_distributed_executor.py", line 522, in _run_workers
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) ray_worker_outputs = ray.get(ray_worker_outputs)
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) return fn(*args, **kwargs)
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) ^^^^^^^^^^^^^^^^^^^
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) return func(*args, **kwargs)
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) ^^^^^^^^^^^^^^^^^^^^^
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/worker.py", line 2858, in get
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/worker.py", line 958, in get_objects
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) raise value.as_instanceof_cause()
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) ray.exceptions.RayTaskError(GatedRepoError): ray::RayWorkerWrapper.execute_method() (pid=12935, ip=10.0.155.99, actor_id=77431ce61763e942f02949db0a000000, repr=)
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) File "/home/ray/anaconda3/lib/python3.11/site-packages/requests/models.py", line 1024, in raise_for_status
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) raise HTTPError(http_error_msg, response=self)
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/resolve/8afb486c1db24fe5011ec46dfbe5b5dccdb575c2/model-00001-of-00004.safetensors
```

Here is the whole dump
https://gist.github.com/Aydin-ab/031b939e454fe0fbed65115b8ced42a6

Notice this line, which might be relevant: HF_TOKEN is not among the environment variables being copied to the workers.

(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) INFO 07-21 18:04:33 [ray_distributed_executor.py:355] Copying the following environment variables to workers: ['LD_LIBRARY_PATH', 'VLLM_USE_RAY_SPMD_WORKER', 'VLLM_USE_RAY_COMPILED_DAG', 'VLLM_WORKER_MULTIPROC_METHOD']
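A rough way to see what a Ray worker process actually has in its environment (a diagnostic sketch that attaches to the running cluster; it doesn't go through vLLM's executor path, but it illustrates the kind of check I mean):

```python
# check_worker_env.py -- rough diagnostic sketch, not part of the repro itself.
# Compares HF_TOKEN visibility in the driver process vs. inside a Ray worker.
import os

import ray

ray.init(address="auto")  # attach to the running cluster


@ray.remote
def hf_token_visible() -> bool:
    # Runs inside a worker process, so this reflects the environment that
    # spawned worker code would actually see.
    return "HF_TOKEN" in os.environ


print("driver sees HF_TOKEN:", "HF_TOKEN" in os.environ)
print("worker sees HF_TOKEN:", ray.get(hf_token_visible.remote()))
```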

Fix
This works when I switch to vLLM V0, i.e. by setting VLLM_USE_V1 to "0" in the Serve config file:

```
applications:
- name: llama3-app
  route_prefix: "/"
  import_path: ray.serve.llm:build_openai_app
  args:
    llm_configs:
      - model_loading_config:
          model_id: llama-3-8B-instruct
          model_source: meta-llama/Meta-Llama-3-8B-Instruct
        accelerator_type: A10G
        deployment_config:
          autoscaling_config:
            min_replicas: 1
            max_replicas: 2
        runtime_env:
          env_vars:
            HF_TOKEN: <YOUR-TOKEN>
            VLLM_USE_V1: "0" # <--- This will work now
        engine_kwargs:
          tensor_parallel_size: 1
          max_model_len: 8192
```

It also works when I use an older image, anyscale/ray-llm:2.44.1-py311-cu124, which ships ray==2.44.1 and vllm==0.7.2 (with VLLM_USE_V1 left unset).

Versions / Dependencies

I'm running in an Anyscale Workspace using the image anyscale/ray-llm:2.48.0-py311-cu128, which includes ray==2.48.0 and vllm==0.9.2.

Reproduction script

Run your Anyscale workspace on the image anyscale/ray-llm:2.48.0-py311-cu128.

Make a Serve config file with a valid Hugging Face token:

```
# serve_llama3.yaml
applications:
- name: llama3-app
  route_prefix: "/"
  import_path: ray.serve.llm:build_openai_app
  args:
    llm_configs:
      - model_loading_config:
          model_id: llama-3-8B-instruct
          model_source: meta-llama/Meta-Llama-3-8B-Instruct
        accelerator_type: A10G
        deployment_config:
          autoscaling_config:
            min_replicas: 1
            max_replicas: 2
        runtime_env:
          env_vars:
            HF_TOKEN: <YOUR-TOKEN>
        engine_kwargs:
          tensor_parallel_size: 1
          max_model_len: 8192
```

Then run: serve run serve_llama3.yaml
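Once the app reports running, a quick smoke test against it (a sketch that assumes the default Serve HTTP port 8000 and the OpenAI-compatible route exposed by build_openai_app):

```python
# query_llama3.py -- sketch of a smoke test against the deployed app,
# assuming the OpenAI-compatible API is served at http://localhost:8000/v1.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="llama-3-8B-instruct",  # the model_id from the Serve config above
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```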

Adding VLLM_USE_V1: "0" to the env_vars should fix the issue.

Issue Severity

Low: It annoys or frustrates me.

Labels

bug (Something that is supposed to be working; but isn't), llm, regression, serve (Ray Serve Related Issues), stability, triage (Needs triage (eg: priority, bug/not-bug, and owning component))
