vllm-project/flash-attention #66 (Closed)
Labels: bug (Something isn't working)
Description
Your current environment
The output of `python collect_env.py`
INFO 04-22 16:50:53 [__init__.py:239] Automatically detected platform cuda.
Collecting environment information...
PyTorch version: 2.6.0+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A
OS: Rocky Linux 9.5 (Blue Onyx) (x86_64)
GCC version: (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5)
Clang version: Could not collect
CMake version: version 3.26.5
Libc version: glibc-2.34
Python version: 3.12.10 (main, Apr  9 2025, 04:03:51) [Clang 20.1.0 ] (64-bit runtime)
Python platform: Linux-5.15.0-1070-nvidia-x86_64-with-glibc2.34
Is CUDA available: True
CUDA runtime version: 12.2.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-40GB
GPU 1: NVIDIA A100-SXM4-40GB
GPU 2: NVIDIA A100-SXM4-40GB
GPU 3: NVIDIA A100-SXM4-40GB
GPU 4: NVIDIA A100-SXM4-40GB
GPU 5: NVIDIA A100-SXM4-40GB
GPU 6: NVIDIA A100-SXM4-40GB
GPU 7: NVIDIA A100-SXM4-40GB
Nvidia driver version: 550.144.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        43 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               256
On-line CPU(s) list:                  0-255
Vendor ID:                            AuthenticAMD
Model name:                           AMD EPYC 7742 64-Core Processor
CPU family:                           23
Model:                                49
Thread(s) per core:                   2
Core(s) per socket:                   64
Socket(s):                            2
Stepping:                             0
Frequency boost:                      enabled
CPU(s) scaling MHz:                   67%
CPU max MHz:                          2250.0000
CPU min MHz:                          1500.0000
BogoMIPS:                             4491.73
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es
Virtualization:                       AMD-V
L1d cache:                            4 MiB (128 instances)
L1i cache:                            4 MiB (128 instances)
L2 cache:                             64 MiB (128 instances)
L3 cache:                             512 MiB (32 instances)
NUMA node(s):                         8
NUMA node0 CPU(s):                    0-15,128-143
NUMA node1 CPU(s):                    16-31,144-159
NUMA node2 CPU(s):                    32-47,160-175
NUMA node3 CPU(s):                    48-63,176-191
NUMA node4 CPU(s):                    64-79,192-207
NUMA node5 CPU(s):                    80-95,208-223
NUMA node6 CPU(s):                    96-111,224-239
NUMA node7 CPU(s):                    112-127,240-255
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow:   Mitigation; safe RET
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-cusparselt-cu12==0.6.2
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] pyzmq==26.4.0
[pip3] torch==2.6.0
[pip3] torchaudio==2.6.0
[pip3] torchvision==0.21.0
[pip3] transformers==4.51.3
[pip3] triton==3.2.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.8.5.dev148+gd0f214e04
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8    NIC9    NIC10   NIC11   CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     48-63,176-191   3               N/A
GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     48-63,176-191   3               N/A
GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     16-31,144-159   1               N/A
GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     16-31,144-159   1               N/A
GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     112-127,240-255 7               N/A
GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     112-127,240-255 7               N/A
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     80-95,208-223   5               N/A
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     80-95,208-223   5               N/A
NIC0    PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS      X      PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC1    PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     PXB      X      SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC2    SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS      X      PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC3    SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     PXB      X      SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC4    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PIX     SYS     SYS     SYS     SYS     SYS     SYS
NIC5    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX      X      SYS     SYS     SYS     SYS     SYS     SYS
NIC6    SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PXB     SYS     SYS     SYS     SYS
NIC7    SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PXB      X      SYS     SYS     SYS     SYS
NIC8    SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PXB     SYS     SYS
NIC9    SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PXB      X      SYS     SYS
NIC10   SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PIX
NIC11   SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX      X
Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
NIC Legend:
  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8
  NIC9: mlx5_9
  NIC10: mlx5_10
  NIC11: mlx5_11
CUDA_VISIBLE_DEVICES=0,1
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY
🐛 Describe the bug
I built vLLM from source with MAIN_CUDA_VERSION = "12.2", based on commit fe742ae. Serving a model then fails at engine startup:
$ vllm serve google/gemma-3-12b-it
INFO 04-22 16:49:17 [__init__.py:239] Automatically detected platform cuda.
INFO 04-22 16:49:22 [api_server.py:1043] vLLM API server version 0.8.5.dev148+gd0f214e04
INFO 04-22 16:49:22 [api_server.py:1044] args: Namespace(subparser='serve', model_tag='google/gemma-3-12b-it', config='', host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='google/gemma-3-12b-it', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config={}, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', max_model_len=None, guided_decoding_backend='auto', reasoning_parser=None, logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, gpu_memory_utilization=0.9, swap_space=4, kv_cache_dtype='auto', num_gpu_blocks_override=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', cpu_offload_gb=0, calculate_kv_scales=False, disable_sliding_window=False, use_v2_block_manager=True, seed=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, 
tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, speculative_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, max_num_batched_tokens=None, max_num_seqs=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, enable_chunked_prefill=None, multi_step_stream_outputs=True, scheduling_policy='fcfs', disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, additional_config=None, enable_reasoning=False, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x7f7716de8a40>)
INFO 04-22 16:49:32 [config.py:715] This model supports multiple tasks: {'classify', 'generate', 'embed', 'reward', 'score'}. Defaulting to 'generate'.
INFO 04-22 16:49:35 [config.py:1997] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 04-22 16:49:41 [__init__.py:239] Automatically detected platform cuda.
INFO 04-22 16:49:46 [core.py:57] Initializing a V1 LLM engine (v0.8.5.dev148+gd0f214e04) with config: model='google/gemma-3-12b-it', speculative_config=None, tokenizer='google/gemma-3-12b-it', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=google/gemma-3-12b-it, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
ERROR 04-22 16:49:48 [core.py:390] EngineCore failed to start.
ERROR 04-22 16:49:48 [core.py:390] Traceback (most recent call last):
ERROR 04-22 16:49:48 [core.py:390]   File "/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 381, in run_engine_core
ERROR 04-22 16:49:48 [core.py:390]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 04-22 16:49:48 [core.py:390]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 16:49:48 [core.py:390]   File "/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 323, in __init__
ERROR 04-22 16:49:48 [core.py:390]     super().__init__(vllm_config, executor_class, log_stats,
ERROR 04-22 16:49:48 [core.py:390]   File "/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 63, in __init__
ERROR 04-22 16:49:48 [core.py:390]     self.model_executor = executor_class(vllm_config)
ERROR 04-22 16:49:48 [core.py:390]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 16:49:48 [core.py:390]   File "/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 52, in __init__
ERROR 04-22 16:49:48 [core.py:390]     self._init_executor()
ERROR 04-22 16:49:48 [core.py:390]   File "/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 45, in _init_executor
ERROR 04-22 16:49:48 [core.py:390]     self.collective_rpc("init_worker", args=([kwargs], ))
ERROR 04-22 16:49:48 [core.py:390]   File "/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 04-22 16:49:48 [core.py:390]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 04-22 16:49:48 [core.py:390]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 16:49:48 [core.py:390]   File "/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/vllm/utils.py", line 2428, in run_method
ERROR 04-22 16:49:48 [core.py:390]     return func(*args, **kwargs)
ERROR 04-22 16:49:48 [core.py:390]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 16:49:48 [core.py:390]   File "/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 558, in init_worker
ERROR 04-22 16:49:48 [core.py:390]     worker_class = resolve_obj_by_qualname(
ERROR 04-22 16:49:48 [core.py:390]                    ^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 16:49:48 [core.py:390]   File "/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/vllm/utils.py", line 2059, in resolve_obj_by_qualname
ERROR 04-22 16:49:48 [core.py:390]     module = importlib.import_module(module_name)
ERROR 04-22 16:49:48 [core.py:390]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 16:49:48 [core.py:390]   File "/home/user/.local/share/uv/python/cpython-3.12.10-linux-x86_64-gnu/lib/python3.12/importlib/__init__.py", line 90, in import_module
ERROR 04-22 16:49:48 [core.py:390]     return _bootstrap._gcd_import(name[level:], package, level)
ERROR 04-22 16:49:48 [core.py:390]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 16:49:48 [core.py:390]   File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
ERROR 04-22 16:49:48 [core.py:390]   File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
ERROR 04-22 16:49:48 [core.py:390]   File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
ERROR 04-22 16:49:48 [core.py:390]   File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
ERROR 04-22 16:49:48 [core.py:390]   File "<frozen importlib._bootstrap_external>", line 999, in exec_module
ERROR 04-22 16:49:48 [core.py:390]   File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
ERROR 04-22 16:49:48 [core.py:390]   File "/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 26, in <module>
ERROR 04-22 16:49:48 [core.py:390]     from vllm.v1.worker.gpu_model_runner import GPUModelRunner
ERROR 04-22 16:49:48 [core.py:390]   File "/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 32, in <module>
ERROR 04-22 16:49:48 [core.py:390]     from vllm.v1.attention.backends.flash_attn import FlashAttentionMetadata
ERROR 04-22 16:49:48 [core.py:390]   File "/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/vllm/v1/attention/backends/flash_attn.py", line 17, in <module>
ERROR 04-22 16:49:48 [core.py:390]     from vllm.vllm_flash_attn.fa_utils import (flash_attn_supports_fp8,
ERROR 04-22 16:49:48 [core.py:390]   File "/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/vllm/vllm_flash_attn/__init__.py", line 4, in <module>
ERROR 04-22 16:49:48 [core.py:390]     from .flash_attn_interface import (
ERROR 04-22 16:49:48 [core.py:390]   File "/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/vllm/vllm_flash_attn/flash_attn_interface.py", line 25, in <module>
ERROR 04-22 16:49:48 [core.py:390]     raise e
ERROR 04-22 16:49:48 [core.py:390]   File "/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/vllm/vllm_flash_attn/flash_attn_interface.py", line 21, in <module>
ERROR 04-22 16:49:48 [core.py:390]     from . import _vllm_fa3_C  # noqa: F401
ERROR 04-22 16:49:48 [core.py:390]     ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 16:49:48 [core.py:390] ImportError: cannot import name '_vllm_fa3_C' from partially initialized module 'vllm.vllm_flash_attn' (most likely due to a circular import) (/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/vllm/vllm_flash_attn/__init__.py)
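The "(most likely due to a circular import)" wording above is misleading here: CPython emits that same message whenever a `from . import X` inside a package's `__init__.py` fails because `X` is simply absent, since the package is still "partially initialized" at that point. A minimal sketch, using an invented package name `demo_pkg` (not part of vLLM), reproduces the exact error shape without any circular import:

```python
import os
import sys
import tempfile

# Create a throwaway package whose __init__.py imports a compiled
# extension that was never built, mirroring the failing
# vllm/vllm_flash_attn/__init__.py in shape only.
tmp = tempfile.mkdtemp()
pkg_dir = os.path.join(tmp, "demo_pkg")
os.makedirs(pkg_dir)
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write("from . import _vllm_fa3_C  # extension .so was never built\n")

sys.path.insert(0, tmp)
try:
    import demo_pkg  # noqa: F401
except ImportError as e:
    # CPython reports the missing submodule as coming from a
    # "partially initialized module ... (most likely due to a
    # circular import)", even though nothing circular is involved.
    err_msg = str(e)
print(err_msg)
```

So the message is consistent with the `_vllm_fa3_C` shared object simply being missing from the installed package, rather than a genuine import cycle.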
Process EngineCore_0:
Traceback (most recent call last):
  File "/home/user/.local/share/uv/python/cpython-3.12.10-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/user/.local/share/uv/python/cpython-3.12.10-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 394, in run_engine_core
    raise e
  File "/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 381, in run_engine_core
    engine_core = EngineCoreProc(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 323, in __init__
    super().__init__(vllm_config, executor_class, log_stats,
  File "/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 63, in __init__
    self.model_executor = executor_class(vllm_config)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 52, in __init__
    self._init_executor()
  File "/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 45, in _init_executor
    self.collective_rpc("init_worker", args=([kwargs], ))
  File "/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/vllm/utils.py", line 2428, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 558, in init_worker
    worker_class = resolve_obj_by_qualname(
                   ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/vllm/utils.py", line 2059, in resolve_obj_by_qualname
    module = importlib.import_module(module_name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/share/uv/python/cpython-3.12.10-linux-x86_64-gnu/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1331, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 935, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 999, in exec_module
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 26, in <module>
    from vllm.v1.worker.gpu_model_runner import GPUModelRunner
  File "/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 32, in <module>
    from vllm.v1.attention.backends.flash_attn import FlashAttentionMetadata
  File "/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/vllm/v1/attention/backends/flash_attn.py", line 17, in <module>
    from vllm.vllm_flash_attn.fa_utils import (flash_attn_supports_fp8,
  File "/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/vllm/vllm_flash_attn/__init__.py", line 4, in <module>
    from .flash_attn_interface import (
  File "/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/vllm/vllm_flash_attn/flash_attn_interface.py", line 25, in <module>
    raise e
  File "/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/vllm/vllm_flash_attn/flash_attn_interface.py", line 21, in <module>
    from . import _vllm_fa3_C  # noqa: F401
    ^^^^^^^^^^^^^^^^^^^^^^^^^
ImportError: cannot import name '_vllm_fa3_C' from partially initialized module 'vllm.vllm_flash_attn' (most likely due to a circular import) (/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/vllm/vllm_flash_attn/__init__.py)
Traceback (most recent call last):
  File "/home/user/work/uv-pvenv-3.12-work/bin/vllm", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 53, in main
    args.dispatch_function(args)
  File "/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 27, in cmd
    uvloop.run(run_server(args))
  File "/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/home/user/.local/share/uv/python/cpython-3.12.10-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/home/user/.local/share/uv/python/cpython-3.12.10-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1078, in run_server
    async with build_async_engine_client(args) as engine_client:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/share/uv/python/cpython-3.12.10-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/share/uv/python/cpython-3.12.10-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 178, in build_async_engine_client_from_engine_args
    async_llm = AsyncLLM.from_vllm_config(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 141, in from_vllm_config
    return cls(
           ^^^^
  File "/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 103, in __init__
    self.engine_core = core_client_class(
                       ^^^^^^^^^^^^^^^^^^
  File "/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 620, in __init__
    super().__init__(
  File "/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 395, in __init__
    self._wait_for_engine_startup()
  File "/home/user/work/uv-pvenv-3.12-work/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 421, in _wait_for_engine_startup
    raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above.
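Given the above, my guess is that the custom build (MAIN_CUDA_VERSION = "12.2") produced a wheel without the compiled FA3 extension. A hedged diagnostic sketch to confirm, checking whether the module named in the traceback is actually present in the environment (the helper name `ext_importable` is invented for illustration):

```python
import importlib.util

def ext_importable(name: str) -> bool:
    """True if module `name` (e.g. a compiled .so) can be located."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        return False  # a parent package is missing entirely

# Module name taken from the traceback; run this inside the
# environment where `vllm serve` fails.
for name in ("vllm", "vllm.vllm_flash_attn._vllm_fa3_C"):
    print(name, "OK" if ext_importable(name) else "MISSING")
```

If `_vllm_fa3_C` prints MISSING, the fix would be on the build side (ensuring the FA3 extension is compiled and packaged), not in the import logic.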