
[LLM] Can't access gated huggingface models with vLLM V1 #54812

@Aydin-ab


What happened + What you expected to happen

Hi, I'm trying to deploy the Llama 3 8B model. I've requested access to the gated repo and have a valid Hugging Face token (HF_TOKEN). I'm using a simple Ray Serve config, but when I run it, weight loading fails with a Hugging Face permission/authentication error (401 Unauthorized).

The issue only occurs when using vLLM V1. It does not happen when I fall back to vLLM V0 or when I use an older Ray version (where I assume vLLM V0 is the default?). See the Fix section below for details.
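To rule out the token itself, here is the kind of check I would run outside of Ray (a minimal sketch using the public huggingface_hub API; this script is not part of the deployment):

```python
# sanity_check_token.py -- hypothetical helper, not part of the Serve deployment.
# Checks that HF_TOKEN is valid and can see the gated repo, without Ray or vLLM involved.
import os

from huggingface_hub import HfApi

token = os.environ["HF_TOKEN"]
api = HfApi(token=token)

# whoami() fails fast if the token is invalid.
print("Authenticated as:", api.whoami()["name"])

# model_info() raises a gated-repo / 401 error if the token lacks access.
info = api.model_info("meta-llama/Meta-Llama-3-8B-Instruct")
print("Can see revision:", info.sha)
```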

Logs

Download the logs with: anyscale logs cluster --id ses_21xd987a267m97yc9v1avwyf6c --download --download-dir /tmp

Here is the relevant part of the dump:

 
Click to expand logs
```
(ProxyActor pid=12245, ip=10.0.155.99) INFO 2025-07-21 18:04:04,861 proxy 10.0.155.99 -- Proxy starting on node 10b7a33fa7707947cf45dbc768cb474f7053c5b9b1d228e6affbb901 (HTTP port: 8000).
(ProxyActor pid=12245, ip=10.0.155.99) INFO 2025-07-21 18:04:04,911 proxy 10.0.155.99 -- Got updated endpoints: {Deployment(name='LLMRouter', app='llama3-app'): EndpointInfo(route='/', app_is_cross_language=False)}.
(ProxyActor pid=12245, ip=10.0.155.99) INFO 2025-07-21 18:04:04,934 proxy 10.0.155.99 -- Started .
(pid=12615, ip=10.0.155.99) INFO 07-21 18:04:11 [__init__.py:244] Automatically detected platform cuda.
(_get_vllm_engine_config pid=12615, ip=10.0.155.99) INFO 07-21 18:04:19 [config.py:841] This model supports multiple tasks: {'generate', 'classify', 'embed', 'reward'}. Defaulting to 'generate'.
(_get_vllm_engine_config pid=12615, ip=10.0.155.99) INFO 07-21 18:04:19 [config.py:1472] Using max model len 8192
(_get_vllm_engine_config pid=12615, ip=10.0.155.99) INFO 07-21 18:04:19 [config.py:2285] Chunked prefill is enabled with max_num_batched_tokens=2048.
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) You are using a model of type llama to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) You are using a model of type llama to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) INFO 2025-07-21 18:04:20,494 llama3-app_LLMDeploymentllama-3-8B-instruct mjgdnawj -- Clearing the current platform cache ...
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) INFO 2025-07-21 18:04:20,499 llama3-app_LLMDeploymentllama-3-8B-instruct mjgdnawj -- Using executor class:
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) WARNING 07-21 18:04:21 [__init__.py:2662] We must use the `spawn` multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/usage/troubleshooting.html#python-multiprocessing for more information. Reason: In a Ray actor and can only be spawned
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) INFO 07-21 18:04:25 [__init__.py:244] Automatically detected platform cuda.
(ServeController pid=179381) WARNING 2025-07-21 18:04:26,608 controller 179381 -- Deployment 'LLMDeploymentllama-3-8B-instruct' in application 'llama3-app' has 1 replicas that have taken more than 30s to initialize.
(ServeController pid=179381) This may be caused by a slow __init__ or reconfigure method.
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) 2025-07-21 18:04:27,963 INFO worker.py:1606 -- Using address 10.0.94.29:6379 set in the environment variable RAY_ADDRESS
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) 2025-07-21 18:04:27,971 INFO worker.py:1747 -- Connecting to existing Ray cluster at address: 10.0.94.29:6379...
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) 2025-07-21 18:04:27,983 INFO worker.py:1918 -- Connected to Ray cluster. View the dashboard at https://session-21xd987a267m97yc9v1avwyf6c.i.anyscaleuserdata.com
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) INFO 07-21 18:04:27 [core.py:526] Waiting for init message from front-end.
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) INFO 07-21 18:04:27 [core.py:69] Initializing a V1 LLM engine (v0.9.2) with config: model='meta-llama/Meta-Llama-3-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Meta-Llama-3-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='xgrammar', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=llama-3-8B-instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) INFO 07-21 18:04:28 [ray_utils.py:313] Using the existing placement group
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) INFO 07-21 18:04:28 [ray_distributed_executor.py:177] use_ray_spmd_worker: True
(pid=12935, ip=10.0.155.99) INFO 07-21 18:04:32 [__init__.py:244] Automatically detected platform cuda.
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) (pid=12935) INFO 07-21 18:04:32 [__init__.py:244] Automatically detected platform cuda.
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) INFO 07-21 18:04:33 [ray_distributed_executor.py:353] non_carry_over_env_vars from config: set()
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) INFO 07-21 18:04:33 [ray_distributed_executor.py:355] Copying the following environment variables to workers: ['LD_LIBRARY_PATH', 'VLLM_USE_RAY_SPMD_WORKER', 'VLLM_USE_RAY_COMPILED_DAG', 'VLLM_WORKER_MULTIPROC_METHOD']
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) INFO 07-21 18:04:33 [ray_distributed_executor.py:358] If certain env vars should NOT be copied to workers, add them to /home/ray/.config/vllm/ray_non_carry_over_env_vars.json file
(ServeController pid=179381) WARNING 2025-07-21 18:04:34,721 controller 179381 -- Deployment 'LLMRouter' in application 'llama3-app' has 2 replicas that have taken more than 30s to initialize.
(ServeController pid=179381) This may be caused by a slow __init__ or reconfigure method.
(RayWorkerWrapper pid=12935, ip=10.0.155.99) INFO 07-21 18:04:36 [parallel_state.py:1076] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(RayWorkerWrapper pid=12935, ip=10.0.155.99) WARNING 07-21 18:04:36 [topk_topp_sampler.py:59] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(RayWorkerWrapper pid=12935, ip=10.0.155.99) INFO 07-21 18:04:36 [gpu_model_runner.py:1770] Starting to load model meta-llama/Meta-Llama-3-8B-Instruct...
(RayWorkerWrapper pid=12935, ip=10.0.155.99) INFO 07-21 18:04:36 [gpu_model_runner.py:1775] Loading model from scratch...
(RayWorkerWrapper pid=12935, ip=10.0.155.99) INFO 07-21 18:04:36 [cuda.py:284] Using Flash Attention backend on V1 engine.
(RayWorkerWrapper pid=12935, ip=10.0.155.99) INFO 07-21 18:04:37 [weight_utils.py:292] Using model weights format ['*.safetensors']
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) Process EngineCore_0:
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) Traceback (most recent call last):
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) File "/home/ray/anaconda3/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) self.run()
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) File "/home/ray/anaconda3/lib/python3.11/multiprocessing/process.py", line 108, in run
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) self._target(*self._args, **self._kwargs)
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 590, in run_engine_core
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) raise e
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 577, in run_engine_core
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) engine_core = EngineCoreProc(*args, **kwargs)
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 404, in __init__
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) super().__init__(vllm_config, executor_class, log_stats,
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 75, in __init__
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) self.model_executor = executor_class(vllm_config)
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 287, in __init__
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) super().__init__(*args, **kwargs)
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 53, in __init__
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) self._init_executor()
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/executor/ray_distributed_executor.py", line 115, in _init_executor
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) self._init_workers_ray(placement_group)
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/executor/ray_distributed_executor.py", line 397, in _init_workers_ray
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) self._run_workers("load_model",
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) File "/home/ray/anaconda3/lib/python3.11/site-packages/vllm/executor/ray_distributed_executor.py", line 522, in _run_workers
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) ray_worker_outputs = ray.get(ray_worker_outputs)
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) return fn(*args, **kwargs)
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) ^^^^^^^^^^^^^^^^^^^
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) return func(*args, **kwargs)
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) ^^^^^^^^^^^^^^^^^^^^^
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/worker.py", line 2858, in get
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/worker.py", line 958, in get_objects
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) raise value.as_instanceof_cause()
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) ray.exceptions.RayTaskError(GatedRepoError): ray::RayWorkerWrapper.execute_method() (pid=12935, ip=10.0.155.99, actor_id=77431ce61763e942f02949db0a000000, repr=)
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) File "/home/ray/anaconda3/lib/python3.11/site-packages/requests/models.py", line 1024, in raise_for_status
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) raise HTTPError(http_error_msg, response=self)
(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/resolve/8afb486c1db24fe5011ec46dfbe5b5dccdb575c2/model-00001-of-00004.safetensors
```

Here is the whole dump
https://gist.github.com/Aydin-ab/031b939e454fe0fbed65115b8ced42a6

Notice this line, which might be relevant: HF_TOKEN is not among the environment variables being copied to the workers.

(ServeReplica:llama3-app:LLMDeploymentllama-3-8B-instruct pid=12249, ip=10.0.155.99) INFO 07-21 18:04:33 [ray_distributed_executor.py:355] Copying the following environment variables to workers: ['LD_LIBRARY_PATH', 'VLLM_USE_RAY_SPMD_WORKER', 'VLLM_USE_RAY_COMPILED_DAG', 'VLLM_WORKER_MULTIPROC_METHOD']
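A rough way to see what a Ray worker process actually has in its environment (a diagnostic sketch that attaches to the running cluster; it doesn't go through vLLM's executor path, but it illustrates the kind of check I mean):

```python
# check_worker_env.py -- rough diagnostic sketch, not part of the repro itself.
# Compares HF_TOKEN visibility in the driver process vs. inside a Ray worker.
import os

import ray

ray.init(address="auto")  # attach to the running cluster


@ray.remote
def hf_token_visible() -> bool:
    # Runs inside a worker process, so this reflects the environment that
    # spawned worker code would actually see.
    return "HF_TOKEN" in os.environ


print("driver sees HF_TOKEN:", "HF_TOKEN" in os.environ)
print("worker sees HF_TOKEN:", ray.get(hf_token_visible.remote()))
```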

Fix
This works when I switch to vLLM V0, i.e. by setting VLLM_USE_V1 to "0" in the Serve config file:

```
applications:
- name: llama3-app
  route_prefix: "/"
  import_path: ray.serve.llm:build_openai_app
  args:
    llm_configs:
      - model_loading_config:
          model_id: llama-3-8B-instruct
          model_source: meta-llama/Meta-Llama-3-8B-Instruct
        accelerator_type: A10G
        deployment_config:
          autoscaling_config:
            min_replicas: 1
            max_replicas: 2
        runtime_env:
          env_vars:
            HF_TOKEN: <YOUR-TOKEN>
            VLLM_USE_V1: "0" # <--- This will work now
        engine_kwargs:
          tensor_parallel_size: 1
          max_model_len: 8192
```

It also works when I use an older image, anyscale/ray-llm:2.44.1-py311-cu124, which ships ray==2.44.1 and vllm==0.7.2 (with VLLM_USE_V1 left unset).

Versions / Dependencies

I'm running in an Anyscale Workspace using the image anyscale/ray-llm:2.48.0-py311-cu128, which includes ray==2.48.0 and vllm==0.9.2.

Reproduction script

Run your Anyscale workspace on the image anyscale/ray-llm:2.48.0-py311-cu128.

Make a Serve config file with a valid Hugging Face token:

```
# serve_llama3.yaml
applications:
- name: llama3-app
  route_prefix: "/"
  import_path: ray.serve.llm:build_openai_app
  args:
    llm_configs:
      - model_loading_config:
          model_id: llama-3-8B-instruct
          model_source: meta-llama/Meta-Llama-3-8B-Instruct
        accelerator_type: A10G
        deployment_config:
          autoscaling_config:
            min_replicas: 1
            max_replicas: 2
        runtime_env:
          env_vars:
            HF_TOKEN: <YOUR-TOKEN>
        engine_kwargs:
          tensor_parallel_size: 1
          max_model_len: 8192
```

Then run: serve run serve_llama3.yaml
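Once the app reports running, a quick smoke test against it (a sketch that assumes the default Serve HTTP port 8000 and the OpenAI-compatible route exposed by build_openai_app):

```python
# query_llama3.py -- sketch of a smoke test against the deployed app,
# assuming the OpenAI-compatible API is served at http://localhost:8000/v1.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="llama-3-8B-instruct",  # the model_id from the Serve config above
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```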

Adding VLLM_USE_V1: "0" to the env_vars should fix the issue.

Issue Severity

Low: It annoys or frustrates me.

Labels

bug (Something that is supposed to be working; but isn't), llm, regression, serve (Ray Serve Related Issues), stability, triage (Needs triage (eg: priority, bug/not-bug, and owning component))
