[Bug]: Does not run embedding model sergeyzh/rubert-tiny-turbo #25060

@spions

Description

The output of python collect_env.py
root@shark02:/opt/docker/vllm-openai# docker exec -it 2b199fcfdb8b vllm collect-env
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
INFO 09-17 05:41:35 [__init__.py:216] Automatically detected platform cuda.
Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version                : Could not collect
CMake version                : version 4.1.0
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.8.0+cu128
Is debug build               : False
CUDA used to build PyTorch   : 12.8
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.11 (main, Jun  4 2025, 08:56:18) [GCC 11.4.0] (64-bit runtime)
Python platform              : Linux-6.1.0-39-cloud-amd64-x86_64-with-glibc2.35

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.8.93
CUDA_MODULE_LOADING set to   : LAZY
GPU models and configuration : GPU 0: NVIDIA GeForce RTX 3090
Nvidia driver version        : 570.144
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           46 bits physical, 48 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  4
On-line CPU(s) list:                     0-3
Vendor ID:                               GenuineIntel
Model name:                              Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
CPU family:                              6
Model:                                   63
Thread(s) per core:                      1
Core(s) per socket:                      4
Socket(s):                               1
Stepping:                                2
BogoMIPS:                                4599.99
Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm cpuid_fault invpcid_single pti tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt arat umip arch_capabilities
Virtualization:                          VT-x
Hypervisor vendor:                       KVM
Virtualization type:                     full
L1d cache:                               128 KiB (4 instances)
L1i cache:                               128 KiB (4 instances)
L2 cache:                                16 MiB (4 instances)
L3 cache:                                16 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-3
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Mitigation; Aligned branch/return thunks
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Mitigation; PTE Inversion; VMX flush not necessary, SMT disabled
Vulnerability Mds:                       Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown:                  Mitigation; PTI
Vulnerability Mmio stale data:           Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Vulnerable
Vulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Retpoline
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.3.0
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-frontend==1.14.1
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-cufile-cu12==1.13.1.3
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-ml-py==13.580.82
[pip3] nvidia-nccl-cu12==2.27.3
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] pynvml==13.0.1
[pip3] pyzmq==27.1.0
[pip3] torch==2.8.0+cu128
[pip3] torchaudio==2.8.0+cu128
[pip3] torchvision==0.23.0+cu128
[pip3] transformers==4.56.1
[pip3] triton==3.4.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.10.2
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
  	GPU0	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	0-3	0		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

==============================
     Environment Variables
==============================
NVIDIA_REQUIRE_CUDA=cuda>=12.8 brand=unknown,driver>=470,driver<471 brand=grid,driver>=470,driver<471 brand=tesla,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=vapps,driver>=470,driver<471 brand=vpc,driver>=470,driver<471 brand=vcs,driver>=470,driver<471 brand=vws,driver>=470,driver<471 brand=cloudgaming,driver>=470,driver<471 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551 brand=unknown,driver>=560,driver<561 brand=grid,driver>=560,driver<561 brand=tesla,driver>=560,driver<561 brand=nvidia,driver>=560,driver<561 brand=quadro,driver>=560,driver<561 brand=quadrortx,driver>=560,driver<561 brand=nvidiartx,driver>=560,driver<561 brand=vapps,driver>=560,driver<561 brand=vpc,driver>=560,driver<561 brand=vcs,driver>=560,driver<561 brand=vws,driver>=560,driver<561 brand=cloudgaming,driver>=560,driver<561 brand=unknown,driver>=565,driver<566 brand=grid,driver>=565,driver<566 brand=tesla,driver>=565,driver<566 brand=nvidia,driver>=565,driver<566 brand=quadro,driver>=565,driver<566 brand=quadrortx,driver>=565,driver<566 brand=nvidiartx,driver>=565,driver<566 brand=vapps,driver>=565,driver<566 brand=vpc,driver>=565,driver<566 brand=vcs,driver>=565,driver<566 brand=vws,driver>=565,driver<566 brand=cloudgaming,driver>=565,driver<566
CUDA_VERSION=12.8.1
LD_LIBRARY_PATH=/usr/local/cuda/lib64
NVIDIA_VISIBLE_DEVICES=all
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NCCL_VERSION=2.25.1-1
NVIDIA_PRODUCT_NAME=CUDA
CUDA_HOME=/usr/local/cuda
CUDA_HOME=/usr/local/cuda
VLLM_USAGE_SOURCE=production-docker-image
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_root
VLLM_WORKER_MULTIPROC_METHOD=spawn
CUDA_MODULE_LOADING=LAZY

🐛 Describe the bug

Hi,
I am running the vllm/vllm-openai:v0.10.2 image in Docker.
I can't run the embedding model sergeyzh/rubert-tiny-turbo, while sergeyzh/LaBSE-ru-turbo works fine.

Both are BERT models, and the BertModel architecture is supported by vLLM.
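
The relevant difference seems to be the attention head size: the startup log below shows vLLM resolving head_size=26 for rubert-tiny-turbo before falling back to the FlexAttention backend. A quick way to compare the two models is a sketch like this (assuming both configs load via transformers; head_size is hidden_size divided by num_attention_heads):

from transformers import AutoConfig

# Compare attention head sizes of the two models (sketch; assumes both
# configs are available on the Hugging Face Hub).
for name in ("sergeyzh/rubert-tiny-turbo", "sergeyzh/LaBSE-ru-turbo"):
    cfg = AutoConfig.from_pretrained(name)
    print(name, "head_size =", cfg.hidden_size // cfg.num_attention_heads)

My docker-compose service definition: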


  vllm-openai-3:
    image: vllm/vllm-openai:v0.10.2
    container_name: vllm-openai-3.shark02.loc
    hostname: vllm-openai-3.shark02.loc
    ipc: host
    volumes:
      - vllm-openai-3:/root/.cache/huggingface
    command: --model sergeyzh/rubert-tiny-turbo
    deploy:
      resources:
        reservations:
          devices:
          - driver: nvidia
            capabilities: [gpu]
            count: all
    ports:
      - 8085:8000
    network_mode: bridge
    restart: always
    environment:
      - TORCHDYNAMO_VERBOSE=1

The following error is output:

vllm-openai-3.shark02.loc  |   import pynvml  # type: ignore[import]
 INFO 09-17 01:04:49 [__init__.py:216] Automatically detected platform cuda.
 (APIServer pid=1) INFO 09-17 01:04:52 [api_server.py:1896] vLLM API server version 0.10.2
 (APIServer pid=1) INFO 09-17 01:04:52 [utils.py:328] non-default args: {'model': 'sergeyzh/rubert-tiny-turbo'}
 (APIServer pid=1) INFO 09-17 01:04:53 [config.py:810] Found sentence-transformers tokenize configuration.
 (APIServer pid=1) INFO 09-17 01:05:04 [config.py:708] Found sentence-transformers modules configuration.
 (APIServer pid=1) INFO 09-17 01:05:04 [config.py:728] Found pooling configuration.
 (APIServer pid=1) INFO 09-17 01:05:04 [__init__.py:962] Resolved `--runner auto` to `--runner pooling`. Pass the value explicitly to silence this message.
 (APIServer pid=1) INFO 09-17 01:05:04 [__init__.py:742] Resolved architecture: BertModel
 (APIServer pid=1) `torch_dtype` is deprecated! Use `dtype` instead!
 (APIServer pid=1) INFO 09-17 01:05:04 [__init__.py:2764] Downcasting torch.float32 to torch.float16.
 (APIServer pid=1) INFO 09-17 01:05:05 [__init__.py:1815] Using max model len 2048
 (APIServer pid=1) INFO 09-17 01:05:05 [arg_utils.py:1639] (Disabling) chunked prefill by default
 (APIServer pid=1) INFO 09-17 01:05:05 [arg_utils.py:1642] (Disabling) prefix caching by default
 (APIServer pid=1) INFO 09-17 01:05:06 [__init__.py:3479] Only "last" pooling supports chunked prefill and prefix caching; disabling both.
 /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
   import pynvml  # type: ignore[import]
 INFO 09-17 01:05:13 [__init__.py:216] Automatically detected platform cuda.
 (EngineCore_DP0 pid=32) INFO 09-17 01:05:15 [core.py:654] Waiting for init message from front-end.
 (EngineCore_DP0 pid=32) INFO 09-17 01:05:15 [core.py:76] Initializing a V1 LLM engine (v0.10.2) with config: model='sergeyzh/rubert-tiny-turbo', speculative_config=None, tokenizer='sergeyzh/rubert-tiny-turbo', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=sergeyzh/rubert-tiny-turbo, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, pooler_config=PoolerConfig(pooling_type='CLS', normalize=True, dimensions=None, enable_chunked_processing=None, max_embed_len=None, activation=None, logit_bias=None, softmax=None, step_tag_id=None, returned_token_ids=None), compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":1,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null}
 (EngineCore_DP0 pid=32) W0917 01:05:17.467000 32 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
 (EngineCore_DP0 pid=32) W0917 01:05:17.467000 32 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
 [W917 01:05:19.984260103 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
 [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
 [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
 [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
 [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
 [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
 [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
 (EngineCore_DP0 pid=32) INFO 09-17 01:05:19 [parallel_state.py:1165] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
 (EngineCore_DP0 pid=32) INFO 09-17 01:05:19 [topk_topp_sampler.py:58] Using FlashInfer for top-p & top-k sampling.
 (EngineCore_DP0 pid=32) INFO 09-17 01:05:19 [gpu_model_runner.py:2338] Starting to load model sergeyzh/rubert-tiny-turbo...
 (EngineCore_DP0 pid=32) INFO 09-17 01:05:19 [gpu_model_runner.py:2370] Loading model from scratch...
 (EngineCore_DP0 pid=32) INFO 09-17 01:05:19 [cuda.py:379] Using FlexAttention backend for head_size=26 on V1 engine.
 (EngineCore_DP0 pid=32) INFO 09-17 01:05:22 [weight_utils.py:348] Using model weights format ['*.safetensors']
 (EngineCore_DP0 pid=32) INFO 09-17 01:05:23 [weight_utils.py:369] Time spent downloading weights for sergeyzh/rubert-tiny-turbo: 0.782355 seconds
 (EngineCore_DP0 pid=32) INFO 09-17 01:05:23 [weight_utils.py:406] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 200.77it/s]
 (EngineCore_DP0 pid=32)
 (EngineCore_DP0 pid=32) INFO 09-17 01:05:23 [default_loader.py:268] Loading weights took 0.05 seconds
 (EngineCore_DP0 pid=32) INFO 09-17 01:05:24 [gpu_model_runner.py:2392] Model loading took 0.0544 GiB and 4.035904 seconds
 (EngineCore_DP0 pid=32) INFO 09-17 01:05:26 [backends.py:539] Using cache directory: /root/.cache/vllm/torch_compile_cache/636c8f0e84/rank_0_0/backbone for vLLM's torch.compile
 (EngineCore_DP0 pid=32) INFO 09-17 01:05:26 [backends.py:550] Dynamo bytecode transform time: 2.02 s
 (EngineCore_DP0 pid=32) INFO 09-17 01:05:29 [backends.py:194] Cache the graph for dynamic shape for later use
 (EngineCore_DP0 pid=32) INFO 09-17 01:05:32 [backends.py:215] Compiling a graph for dynamic shape takes 5.44 s
 (EngineCore_DP0 pid=32) INFO 09-17 01:05:32 [monitor.py:34] torch.compile takes 7.46 s in total
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 67/67 [00:10<00:00,  6.15it/s]
 (EngineCore_DP0 pid=32) INFO 09-17 01:05:35 [gpu_model_runner.py:3118] Graph capturing finished in 12 secs, took 0.12 GiB
 (EngineCore_DP0 pid=32) INFO 09-17 01:05:35 [core.py:218] init engine (profile, create kv cache, warmup model) took 11.59 seconds
 (EngineCore_DP0 pid=32) INFO 09-17 01:05:36 [config.py:810] Found sentence-transformers tokenize configuration.
 (EngineCore_DP0 pid=32) INFO 09-17 01:05:36 [core.py:120] Disabling chunked prefill for model without KVCache
 (EngineCore_DP0 pid=32) INFO 09-17 01:05:36 [__init__.py:3479] Only "last" pooling supports chunked prefill and prefix caching; disabling both.
 (APIServer pid=1) INFO 09-17 01:05:36 [loggers.py:142] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 1
 (APIServer pid=1) INFO 09-17 01:05:36 [async_llm.py:180] Torch profiler disabled. AsyncLLM CPU traces will not be collected.
 (APIServer pid=1) INFO 09-17 01:05:37 [api_server.py:1692] Supported_tasks: ['embed', 'encode']
 (APIServer pid=1) INFO 09-17 01:05:37 [__init__.py:36] No IOProcessor plugins requested by the model
 (APIServer pid=1) INFO 09-17 01:05:37 [api_server.py:1971] Starting vLLM API server 0 on http://0.0.0.0:8000
 (APIServer pid=1) INFO 09-17 01:05:37 [launcher.py:36] Available routes are:
 (APIServer pid=1) INFO 09-17 01:05:37 [launcher.py:44] Route: /openapi.json, Methods: HEAD, GET
 (APIServer pid=1) INFO 09-17 01:05:37 [launcher.py:44] Route: /docs, Methods: HEAD, GET
 (APIServer pid=1) INFO 09-17 01:05:37 [launcher.py:44] Route: /docs/oauth2-redirect, Methods: HEAD, GET
 (APIServer pid=1) INFO 09-17 01:05:37 [launcher.py:44] Route: /redoc, Methods: HEAD, GET
 (APIServer pid=1) INFO 09-17 01:05:37 [launcher.py:44] Route: /health, Methods: GET
 (APIServer pid=1) INFO 09-17 01:05:37 [launcher.py:44] Route: /load, Methods: GET
 (APIServer pid=1) INFO 09-17 01:05:37 [launcher.py:44] Route: /ping, Methods: POST
 (APIServer pid=1) INFO 09-17 01:05:37 [launcher.py:44] Route: /ping, Methods: GET
 (APIServer pid=1) INFO 09-17 01:05:37 [launcher.py:44] Route: /tokenize, Methods: POST
 (APIServer pid=1) INFO 09-17 01:05:37 [launcher.py:44] Route: /detokenize, Methods: POST
 (APIServer pid=1) INFO 09-17 01:05:37 [launcher.py:44] Route: /v1/models, Methods: GET
 (APIServer pid=1) INFO 09-17 01:05:37 [launcher.py:44] Route: /version, Methods: GET
 (APIServer pid=1) INFO 09-17 01:05:37 [launcher.py:44] Route: /v1/responses, Methods: POST
 (APIServer pid=1) INFO 09-17 01:05:37 [launcher.py:44] Route: /v1/responses/{response_id}, Methods: GET
 (APIServer pid=1) INFO 09-17 01:05:37 [launcher.py:44] Route: /v1/responses/{response_id}/cancel, Methods: POST
 (APIServer pid=1) INFO 09-17 01:05:37 [launcher.py:44] Route: /v1/chat/completions, Methods: POST
 (APIServer pid=1) INFO 09-17 01:05:37 [launcher.py:44] Route: /v1/completions, Methods: POST
 (APIServer pid=1) INFO 09-17 01:05:37 [launcher.py:44] Route: /v1/embeddings, Methods: POST
 (APIServer pid=1) INFO 09-17 01:05:37 [launcher.py:44] Route: /pooling, Methods: POST
 (APIServer pid=1) INFO 09-17 01:05:37 [launcher.py:44] Route: /classify, Methods: POST
 (APIServer pid=1) INFO 09-17 01:05:37 [launcher.py:44] Route: /score, Methods: POST
 (APIServer pid=1) INFO 09-17 01:05:37 [launcher.py:44] Route: /v1/score, Methods: POST
 (APIServer pid=1) INFO 09-17 01:05:37 [launcher.py:44] Route: /v1/audio/transcriptions, Methods: POST
 (APIServer pid=1) INFO 09-17 01:05:37 [launcher.py:44] Route: /v1/audio/translations, Methods: POST
 (APIServer pid=1) INFO 09-17 01:05:37 [launcher.py:44] Route: /rerank, Methods: POST
 (APIServer pid=1) INFO 09-17 01:05:37 [launcher.py:44] Route: /v1/rerank, Methods: POST
 (APIServer pid=1) INFO 09-17 01:05:37 [launcher.py:44] Route: /v2/rerank, Methods: POST
 (APIServer pid=1) INFO 09-17 01:05:37 [launcher.py:44] Route: /scale_elastic_ep, Methods: POST
 (APIServer pid=1) INFO 09-17 01:05:37 [launcher.py:44] Route: /is_scaling_elastic_ep, Methods: POST
 (APIServer pid=1) INFO 09-17 01:05:37 [launcher.py:44] Route: /invocations, Methods: POST
 (APIServer pid=1) INFO 09-17 01:05:37 [launcher.py:44] Route: /metrics, Methods: GET
 (APIServer pid=1) INFO:     Started server process [1]
 (APIServer pid=1) INFO:     Waiting for application startup.
 (APIServer pid=1) INFO:     Application startup complete.

 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [dump_input.py:69] Dumping input data for V1 LLM engine (v0.10.2) with config: model='sergeyzh/rubert-tiny-turbo', speculative_config=None, tokenizer='sergeyzh/rubert-tiny-turbo', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=sergeyzh/rubert-tiny-turbo, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, pooler_config=PoolerConfig(pooling_type='CLS', normalize=True, dimensions=None, enable_chunked_processing=None, max_embed_len=None, activation=None, logit_bias=None, softmax=None, step_tag_id=None, returned_token_ids=None), compilation_config={"level":3,"debug_dump_path":"","cache_dir":"/root/.cache/vllm/torch_compile_cache/636c8f0e84","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":1,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":"/root/.cache/vllm/torch_compile_cache/636c8f0e84/rank_0_0/backbone"},
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [dump_input.py:76] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=embd-020b589347f844998b40a17466dcdfeb-0,prompt_token_ids_len=15,mm_kwargs=[],mm_hashes=[],mm_positions=[],sampling_params=None,block_ids=(),num_computed_tokens=0,lora_request=None)], scheduled_cached_reqs=CachedRequestData(req_ids=[], resumed_from_preemption=[], new_token_ids=[], new_block_ids=[], num_computed_tokens=[]), num_scheduled_tokens={embd-020b589347f844998b40a17466dcdfeb-0: 15}, total_num_scheduled_tokens=15, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[], finished_req_ids=[], free_encoder_mm_hashes=[], structured_output_request_ids={}, grammar_bitmask=null, kv_connector_metadata=null)
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [dump_input.py:79] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0), spec_decoding_stats=None, num_corrupted_reqs=0)
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720] EngineCore encountered a fatal error.
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720] Traceback (most recent call last):
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/symbolic_convert.py", line 1267, in step
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     self.dispatch_table[inst.opcode](self, inst)
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/symbolic_convert.py", line 1818, in RAISE_VARARGS
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     self._raise_exception_variable(val)
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/symbolic_convert.py", line 1795, in _raise_exception_variable
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     raise observed_exception_type(f"raised exception {val}")
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720] torch._dynamo.exc.ObservedValueErrorError: raised exception ExceptionVariable(<class 'ValueError'>)
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720] During handling of the above exception, another exception occurred:
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720] Traceback (most recent call last):
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 711, in run_engine_core
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     engine_core.run_busy_loop()
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 738, in run_busy_loop
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     self._process_engine_step()
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 764, in _process_engine_step
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     outputs, model_executed = self.step_fn()
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]                               ^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 292, in step
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     model_output = self.execute_model_with_error_logging(
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 278, in execute_model_with_error_logging
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     raise err
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 269, in execute_model_with_error_logging
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     return model_fn(scheduler_output)
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]            ^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 93, in execute_model
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     output = self.collective_rpc("execute_model",
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     answer = run_method(self.driver_worker, method, args, kwargs)
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 3060, in run_method
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     return func(*args, **kwargs)
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]            ^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     return func(*args, **kwargs)
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]            ^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 436, in execute_model
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     output = self.model_runner.execute_model(scheduler_output,
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     return func(*args, **kwargs)
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]            ^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2064, in execute_model
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     model_output = self.model(
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]                    ^^^^^^^^^^^
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     return self._call_impl(*args, **kwargs)
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     return forward_call(*args, **kwargs)
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/bert.py", line 467, in forward
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     return self.model(input_ids=input_ids,
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 312, in __call__
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     model_output = self.forward(*args, **kwargs)
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/bert.py", line 351, in forward
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     def forward(
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 375, in __call__
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     return super().__call__(*args, **kwargs)
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     return self._call_impl(*args, **kwargs)
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     return forward_call(*args, **kwargs)
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 929, in _fn
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     return fn(*args, **kwargs)
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]            ^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 848, in call_wrapped
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     return self._wrapped_call(self, *args, **kwargs)
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 424, in __call__
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     raise e
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 411, in __call__
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     return self._call_impl(*args, **kwargs)
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     return forward_call(*args, **kwargs)
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "<eval_with_key>.8", line 54, in forward
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     submod_1 = self.submod_1(getitem, s72, getitem_1, getitem_2, getitem_3);  getitem = getitem_1 = getitem_2 = submod_1 = None
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 848, in call_wrapped
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     return self._wrapped_call(self, *args, **kwargs)
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 424, in __call__
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     raise e
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 411, in __call__
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     return self._call_impl(*args, **kwargs)
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     return forward_call(*args, **kwargs)
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "<eval_with_key>.2", line 5, in forward
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     unified_attention_with_output = torch.ops.vllm.unified_attention_with_output(query, key, value, output_1, 'model.encoder.layer.0.attention.output.attn');  query = key = value = output_1 = unified_attention_with_output = None
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1243, in __call__
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     return self._op(*args, **kwargs)
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]            ^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/vllm/attention/layer.py", line 521, in unified_attention_with_output
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     self.impl.forward(self,
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/flex_attention.py", line 756, in forward
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     out = flex_attention_compiled(
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]           ^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 736, in compile_wrapper
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     return fn(*args, **kwargs)
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]            ^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/convert_frame.py", line 1495, in __call__
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     return self._torchdynamo_orig_callable(
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/convert_frame.py", line 629, in __call__
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     return _compile(
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]            ^^^^^^^^^
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/convert_frame.py", line 1111, in _compile
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     guarded_code = compile_inner(code, one_graph, hooks, transform)
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/torch/_utils_internal.py", line 97, in wrapper_function
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     return function(*args, **kwargs)
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]            ^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/convert_frame.py", line 793, in compile_inner
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     return _compile_inner(code, one_graph, hooks, transform)
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/convert_frame.py", line 832, in _compile_inner
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     out_code = transform_code_object(code, transform)
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/bytecode_transformation.py", line 1424, in transform_code_object
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     transformations(instructions, code_options)
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/convert_frame.py", line 267, in _fn
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     return fn(*args, **kwargs)
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]            ^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/convert_frame.py", line 753, in transform
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     tracer.run()
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/symbolic_convert.py", line 3497, in run
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     super().run()
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/symbolic_convert.py", line 1363, in run
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     while self.step():
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]           ^^^^^^^^^^^
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/symbolic_convert.py", line 1272, in step
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     self.exception_handler(e)
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/symbolic_convert.py", line 1935, in exception_handler
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     unimplemented_v2(
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/exc.py", line 528, in unimplemented_v2
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     raise Unsupported(msg)
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720] torch._dynamo.exc.Unsupported: Observed exception
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   Explanation: Dynamo found no exception handler at the top-level compiled function when encountering an exception. Exception will propagate outside the compiled region.
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   Hint: Dynamo has detected that tracing the code will result in an error when running in eager. Please double check that your code doesn't contain a similar error when actually running eager/uncompiled.
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   Hint: It may be possible to write Dynamo tracing rules for this code. Please report an issue to PyTorch if you encounter this graph break often and it is causing performance issues.
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]   Developer debug context: raised exception ExceptionVariable(<class 'ValueError'>)
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720] from user code:
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]    File "/usr/local/lib/python3.12/dist-packages/torch/nn/attention/flex_attention.py", line 1364, in flex_attention
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]     raise ValueError(
 (EngineCore_DP0 pid=32) ERROR 09-17 01:05:51 [core.py:720]
 (APIServer pid=1) ERROR 09-17 01:05:51 [async_llm.py:485] AsyncLLM output_handler failed.
 (APIServer pid=1) ERROR 09-17 01:05:51 [async_llm.py:485] Traceback (most recent call last):
 (APIServer pid=1) ERROR 09-17 01:05:51 [async_llm.py:485]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 444, in output_handler
 (APIServer pid=1) ERROR 09-17 01:05:51 [async_llm.py:485]     outputs = await engine_core.get_output_async()
 (APIServer pid=1) ERROR 09-17 01:05:51 [async_llm.py:485]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (APIServer pid=1) ERROR 09-17 01:05:51 [async_llm.py:485]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 845, in get_output_async
 (APIServer pid=1) ERROR 09-17 01:05:51 [async_llm.py:485]     raise self._format_exception(outputs) from None
 (APIServer pid=1) ERROR 09-17 01:05:51 [async_llm.py:485] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
 (APIServer pid=1) INFO:     192.168.1.92:39892 - "POST /v1/embeddings HTTP/1.1" 400 Bad Request
 (EngineCore_DP0 pid=32) Process EngineCore_DP0:
 (EngineCore_DP0 pid=32) Traceback (most recent call last):
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/symbolic_convert.py", line 1267, in step
 (EngineCore_DP0 pid=32)     self.dispatch_table[inst.opcode](self, inst)
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/symbolic_convert.py", line 1818, in RAISE_VARARGS
 (EngineCore_DP0 pid=32)     self._raise_exception_variable(val)
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/symbolic_convert.py", line 1795, in _raise_exception_variable
 (EngineCore_DP0 pid=32)     raise observed_exception_type(f"raised exception {val}")
 (EngineCore_DP0 pid=32) torch._dynamo.exc.ObservedValueErrorError: raised exception ExceptionVariable(<class 'ValueError'>)
 (EngineCore_DP0 pid=32)
 (EngineCore_DP0 pid=32) During handling of the above exception, another exception occurred:
 (EngineCore_DP0 pid=32)
 (EngineCore_DP0 pid=32) Traceback (most recent call last):
 (EngineCore_DP0 pid=32)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
 (EngineCore_DP0 pid=32)     self.run()
 (EngineCore_DP0 pid=32)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
 (EngineCore_DP0 pid=32)     self._target(*self._args, **self._kwargs)
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 722, in run_engine_core
 (EngineCore_DP0 pid=32)     raise e
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 711, in run_engine_core
 (EngineCore_DP0 pid=32)     engine_core.run_busy_loop()
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 738, in run_busy_loop
 (EngineCore_DP0 pid=32)     self._process_engine_step()
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 764, in _process_engine_step
 (EngineCore_DP0 pid=32)     outputs, model_executed = self.step_fn()
 (EngineCore_DP0 pid=32)                               ^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 292, in step
 (EngineCore_DP0 pid=32)     model_output = self.execute_model_with_error_logging(
 (EngineCore_DP0 pid=32)                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 278, in execute_model_with_error_logging
 (EngineCore_DP0 pid=32)     raise err
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 269, in execute_model_with_error_logging
 (EngineCore_DP0 pid=32)     return model_fn(scheduler_output)
 (EngineCore_DP0 pid=32)            ^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 93, in execute_model
 (EngineCore_DP0 pid=32)     output = self.collective_rpc("execute_model",
 (EngineCore_DP0 pid=32)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
 (EngineCore_DP0 pid=32)     answer = run_method(self.driver_worker, method, args, kwargs)
 (EngineCore_DP0 pid=32)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 3060, in run_method
 (EngineCore_DP0 pid=32)     return func(*args, **kwargs)
 (EngineCore_DP0 pid=32)            ^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
 (EngineCore_DP0 pid=32)     return func(*args, **kwargs)
 (EngineCore_DP0 pid=32)            ^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 436, in execute_model
 (EngineCore_DP0 pid=32)     output = self.model_runner.execute_model(scheduler_output,
 (EngineCore_DP0 pid=32)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
 (EngineCore_DP0 pid=32)     return func(*args, **kwargs)
 (EngineCore_DP0 pid=32)            ^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2064, in execute_model
 (EngineCore_DP0 pid=32)     model_output = self.model(
 (EngineCore_DP0 pid=32)                    ^^^^^^^^^^^
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
 (EngineCore_DP0 pid=32)     return self._call_impl(*args, **kwargs)
 (EngineCore_DP0 pid=32)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
 (EngineCore_DP0 pid=32)     return forward_call(*args, **kwargs)
 (EngineCore_DP0 pid=32)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/bert.py", line 467, in forward
 (EngineCore_DP0 pid=32)     return self.model(input_ids=input_ids,
 (EngineCore_DP0 pid=32)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 312, in __call__
 (EngineCore_DP0 pid=32)     model_output = self.forward(*args, **kwargs)
 (EngineCore_DP0 pid=32)                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/bert.py", line 351, in forward
 (EngineCore_DP0 pid=32)     def forward(
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 375, in __call__
 (EngineCore_DP0 pid=32)     return super().__call__(*args, **kwargs)
 (EngineCore_DP0 pid=32)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
 (EngineCore_DP0 pid=32)     return self._call_impl(*args, **kwargs)
 (EngineCore_DP0 pid=32)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
 (EngineCore_DP0 pid=32)     return forward_call(*args, **kwargs)
 (EngineCore_DP0 pid=32)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 929, in _fn
 (EngineCore_DP0 pid=32)     return fn(*args, **kwargs)
 (EngineCore_DP0 pid=32)            ^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 848, in call_wrapped
 (EngineCore_DP0 pid=32)     return self._wrapped_call(self, *args, **kwargs)
 (EngineCore_DP0 pid=32)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 424, in __call__
 (EngineCore_DP0 pid=32)     raise e
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 411, in __call__
 (EngineCore_DP0 pid=32)     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
 (EngineCore_DP0 pid=32)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
 (EngineCore_DP0 pid=32)     return self._call_impl(*args, **kwargs)
 (EngineCore_DP0 pid=32)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
 (EngineCore_DP0 pid=32)     return forward_call(*args, **kwargs)
 (EngineCore_DP0 pid=32)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32)   File "<eval_with_key>.8", line 54, in forward
 (EngineCore_DP0 pid=32)     submod_1 = self.submod_1(getitem, s72, getitem_1, getitem_2, getitem_3);  getitem = getitem_1 = getitem_2 = submod_1 = None
 (EngineCore_DP0 pid=32)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 848, in call_wrapped
 (EngineCore_DP0 pid=32)     return self._wrapped_call(self, *args, **kwargs)
 (EngineCore_DP0 pid=32)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 424, in __call__
 (EngineCore_DP0 pid=32)     raise e
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 411, in __call__
 (EngineCore_DP0 pid=32)     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
 (EngineCore_DP0 pid=32)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
 (EngineCore_DP0 pid=32)     return self._call_impl(*args, **kwargs)
 (EngineCore_DP0 pid=32)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1784, in _call_impl
 (EngineCore_DP0 pid=32)     return forward_call(*args, **kwargs)
 (EngineCore_DP0 pid=32)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32)   File "<eval_with_key>.2", line 5, in forward
 (EngineCore_DP0 pid=32)     unified_attention_with_output = torch.ops.vllm.unified_attention_with_output(query, key, value, output_1, 'model.encoder.layer.0.attention.output.attn');  query = key = value = output_1 = unified_attention_with_output = None
 (EngineCore_DP0 pid=32)                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1243, in __call__
 (EngineCore_DP0 pid=32)     return self._op(*args, **kwargs)
 (EngineCore_DP0 pid=32)            ^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/vllm/attention/layer.py", line 521, in unified_attention_with_output
 (EngineCore_DP0 pid=32)     self.impl.forward(self,
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/flex_attention.py", line 756, in forward
 (EngineCore_DP0 pid=32)     out = flex_attention_compiled(
 (EngineCore_DP0 pid=32)           ^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 736, in compile_wrapper
 (EngineCore_DP0 pid=32)     return fn(*args, **kwargs)
 (EngineCore_DP0 pid=32)            ^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/convert_frame.py", line 1495, in __call__
 (EngineCore_DP0 pid=32)     return self._torchdynamo_orig_callable(
 (EngineCore_DP0 pid=32)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/convert_frame.py", line 629, in __call__
 (EngineCore_DP0 pid=32)     return _compile(
 (EngineCore_DP0 pid=32)            ^^^^^^^^^
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/convert_frame.py", line 1111, in _compile
 (EngineCore_DP0 pid=32)     guarded_code = compile_inner(code, one_graph, hooks, transform)
 (EngineCore_DP0 pid=32)                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/torch/_utils_internal.py", line 97, in wrapper_function
 (EngineCore_DP0 pid=32)     return function(*args, **kwargs)
 (EngineCore_DP0 pid=32)            ^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/convert_frame.py", line 793, in compile_inner
 (EngineCore_DP0 pid=32)     return _compile_inner(code, one_graph, hooks, transform)
 (EngineCore_DP0 pid=32)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/convert_frame.py", line 832, in _compile_inner
 (EngineCore_DP0 pid=32)     out_code = transform_code_object(code, transform)
 (EngineCore_DP0 pid=32)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/bytecode_transformation.py", line 1424, in transform_code_object
 (EngineCore_DP0 pid=32)     transformations(instructions, code_options)
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/convert_frame.py", line 267, in _fn
 (EngineCore_DP0 pid=32)     return fn(*args, **kwargs)
 (EngineCore_DP0 pid=32)            ^^^^^^^^^^^^^^^^^^^
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/convert_frame.py", line 753, in transform
 (EngineCore_DP0 pid=32)     tracer.run()
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/symbolic_convert.py", line 3497, in run
 (EngineCore_DP0 pid=32)     super().run()
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/symbolic_convert.py", line 1363, in run
 (EngineCore_DP0 pid=32)     while self.step():
 (EngineCore_DP0 pid=32)           ^^^^^^^^^^^
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/symbolic_convert.py", line 1272, in step
 (EngineCore_DP0 pid=32)     self.exception_handler(e)
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/symbolic_convert.py", line 1935, in exception_handler
 (EngineCore_DP0 pid=32)     unimplemented_v2(
 (EngineCore_DP0 pid=32)   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/exc.py", line 528, in unimplemented_v2
 (EngineCore_DP0 pid=32)     raise Unsupported(msg)
 (EngineCore_DP0 pid=32) torch._dynamo.exc.Unsupported: Observed exception
 (EngineCore_DP0 pid=32)   Explanation: Dynamo found no exception handler at the top-level compiled function when encountering an exception. Exception will propagate outside the compiled region.
 (EngineCore_DP0 pid=32)   Hint: Dynamo has detected that tracing the code will result in an error when running in eager. Please double check that your code doesn't contain a similar error when actually running eager/uncompiled.
 (EngineCore_DP0 pid=32)   Hint: It may be possible to write Dynamo tracing rules for this code. Please report an issue to PyTorch if you encounter this graph break often and it is causing performance issues.
 (EngineCore_DP0 pid=32)
 (EngineCore_DP0 pid=32)   Developer debug context: raised exception ExceptionVariable(<class 'ValueError'>)
 (EngineCore_DP0 pid=32)
 (EngineCore_DP0 pid=32)
 (EngineCore_DP0 pid=32) from user code:
 (EngineCore_DP0 pid=32)    File "/usr/local/lib/python3.12/dist-packages/torch/nn/attention/flex_attention.py", line 1364, in flex_attention
 (EngineCore_DP0 pid=32)     raise ValueError(
 (EngineCore_DP0 pid=32)
 [rank0]:[W917 01:05:51.474984284 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
 (APIServer pid=1) INFO:     Shutting down
 (APIServer pid=1) INFO:     Waiting for application shutdown.
 (APIServer pid=1) INFO:     Application shutdown complete.
 (APIServer pid=1) INFO:     Finished server process [1]
vllm-openai-3.shark02.loc exited with code 0
 /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
   import pynvml  # type: ignore[import]
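
For anyone triaging this: the failure should be reproducible without the OpenAI server. Below is a minimal offline sketch, assuming the same vLLM build and the V1 engine as in the log above; `task="embed"` and `llm.embed()` are vLLM's standard offline pooling API, and the model name is the one from this report.

```python
# Minimal offline reproduction sketch (assumes the same vLLM build as the
# server in the log). vLLM's V1 engine selects the FlexAttention backend for
# this encoder-only BERT checkpoint, so the first forward pass (engine
# startup profiling or the first request) should reach the compiled
# torch.nn.attention.flex_attention call that raises the ValueError above.
from vllm import LLM

llm = LLM(model="sergeyzh/rubert-tiny-turbo", task="embed")

# If startup survives, the first embedding request exercises the same path.
outputs = llm.embed(["example text"])
print(len(outputs[0].outputs.embedding))
```

If this reproduces, it rules out anything specific to the Docker/OpenAI-server setup.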

Before submitting a new issue...

  • Make sure you have already searched for relevant issues and asked the chatbot at the bottom-right corner of the documentation page, which can answer many frequently asked questions.
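
Following the Dynamo hint in the traceback ("double check that your code doesn't contain a similar error when actually running eager/uncompiled"), a quick eager sanity check with plain `transformers` can confirm the checkpoint itself is healthy. This sketch is illustrative and not part of the original report:

```python
# Eager sanity check (illustrative): if this runs cleanly, the checkpoint is
# fine uncompiled, and the failure is specific to vLLM's FlexAttention /
# torch.compile path rather than the model weights or config.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("sergeyzh/rubert-tiny-turbo")
model = AutoModel.from_pretrained("sergeyzh/rubert-tiny-turbo").eval()

with torch.no_grad():
    enc = tok("example text", return_tensors="pt")
    hidden = model(**enc).last_hidden_state  # shape: (1, seq_len, hidden_size)

print(hidden.shape)
```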
