[Bug]: EngineCore died unexpectedly during Llama inference (`generate`)

### Your current environment

<details>
<summary>The output of <code>python collect_env.py</code></summary>

```text
==============================
        System Info
==============================
OS                           : Ubuntu 24.04.1 LTS (x86_64)
GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version                : Could not collect
CMake version                : version 3.28.3
Libc version                 : glibc-2.39

==============================
       PyTorch Info
==============================
PyTorch version              : 2.7.1+cu126
Is debug build               : False
CUDA used to build PyTorch   : 12.6
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.11 | packaged by Anaconda, Inc. | (main, Jun  5 2025, 13:09:17) [GCC 11.2.0] (64-bit runtime)
Python platform              : Linux-6.14.0-28-generic-x86_64-with-glibc2.39

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.8.61
CUDA_MODULE_LOADING set to   : LAZY
GPU models and configuration : 
GPU 0: NVIDIA GeForce RTX 4060 Ti
GPU 1: NVIDIA GeForce RTX 4060 Ti

Nvidia driver version        : 570.181
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           48 bits physical, 48 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  32
On-line CPU(s) list:                     0-31
Vendor ID:                               AuthenticAMD
Model name:                              AMD Ryzen 9 9950X3D 16-Core Processor
CPU family:                              26
Model:                                   68
Thread(s) per core:                      2
Core(s) per socket:                      16
Socket(s):                               1
Stepping:                                0
Frequency boost:                         enabled
CPU(s) scaling MHz:                      74%
CPU max MHz:                             5756.0000
CPU min MHz:                             600.0000
BogoMIPS:                                8599.99
Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx_vnni avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid bus_lock_detect movdiri movdir64b overflow_recov succor smca fsrm avx512_vp2intersect flush_l1d amd_lbr_pmc_freeze
Virtualization:                          AMD-V
L1d cache:                               768 KiB (16 instances)
L1i cache:                               512 KiB (16 instances)
L2 cache:                                16 MiB (16 instances)
L3 cache:                                128 MiB (2 instances)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-31
Vulnerability Gather data sampling:      Not affected
Vulnerability Ghostwrite:                Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Mitigation; IBPB on VMEXIT only
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                     Not affected
Vulnerability Tsx async abort:           Not affected

==============================
Versions of relevant libraries
==============================
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.6.4.1
[pip3] nvidia-cuda-cupti-cu12==12.6.80
[pip3] nvidia-cuda-nvrtc-cu12==12.6.77
[pip3] nvidia-cuda-runtime-cu12==12.6.77
[pip3] nvidia-cudnn-cu12==9.5.1.17
[pip3] nvidia-cufft-cu12==11.3.0.4
[pip3] nvidia-cufile-cu12==1.11.1.6
[pip3] nvidia-curand-cu12==10.3.7.77
[pip3] nvidia-cusolver-cu12==11.7.1.2
[pip3] nvidia-cusparse-cu12==12.5.4.2
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-nccl-cu12==2.26.2
[pip3] nvidia-nvjitlink-cu12==12.6.85
[pip3] nvidia-nvtx-cu12==12.6.77
[pip3] pyzmq==27.0.2
[pip3] torch==2.7.1
[pip3] torchaudio==2.7.1
[pip3] torchvision==0.22.1
[pip3] transformers==4.55.3
[pip3] triton==3.3.1
[conda] numpy                     2.2.6                    pypi_0    pypi
[conda] nvidia-cublas-cu12        12.6.4.1                 pypi_0    pypi
[conda] nvidia-cuda-cupti-cu12    12.6.80                  pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu12    12.6.77                  pypi_0    pypi
[conda] nvidia-cuda-runtime-cu12  12.6.77                  pypi_0    pypi
[conda] nvidia-cudnn-cu12         9.5.1.17                 pypi_0    pypi
[conda] nvidia-cufft-cu12         11.3.0.4                 pypi_0    pypi
[conda] nvidia-cufile-cu12        1.11.1.6                 pypi_0    pypi
[conda] nvidia-curand-cu12        10.3.7.77                pypi_0    pypi
[conda] nvidia-cusolver-cu12      11.7.1.2                 pypi_0    pypi
[conda] nvidia-cusparse-cu12      12.5.4.2                 pypi_0    pypi
[conda] nvidia-cusparselt-cu12    0.6.3                    pypi_0    pypi
[conda] nvidia-nccl-cu12          2.26.2                   pypi_0    pypi
[conda] nvidia-nvjitlink-cu12     12.6.85                  pypi_0    pypi
[conda] nvidia-nvtx-cu12          12.6.77                  pypi_0    pypi
[conda] pyzmq                     27.0.2                   pypi_0    pypi
[conda] torch                     2.7.1                    pypi_0    pypi
[conda] torchaudio                2.7.1                    pypi_0    pypi
[conda] torchvision               0.22.1                   pypi_0    pypi
[conda] transformers              4.55.3                   pypi_0    pypi
[conda] triton                    3.3.1                    pypi_0    pypi

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
Neuron SDK Version           : N/A
vLLM Version                 : 0.10.1rc2.dev101+g7be5d113d (git sha: 7be5d113d)
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    GPU1    NIC0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PHB     PHB     0-31    0               N/A
GPU1    PHB      X      PHB     0-31    0               N/A
NIC0    PHB     PHB      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0

==============================
     Environment Variables
==============================
LD_LIBRARY_PATH=/usr/local/cuda/lib64:
CUDA_HOME=/usr/local/cuda
CUDA_HOME=/usr/local/cuda
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY
```

</details>


### 🐛 Describe the bug

My script runs inference on the Llama 2 7B model (`llama2-7b-hf`) on 2 GPUs with tensor parallelism (`tensor_parallel_size=2`) and float16 precision. I observed two things:

When the `LLM` is instantiated, GPU memory usage jumps from about 7 GB to about 14 GB per GPU. I'm not sure why. Could it be due to CUDA graph capture or KV cache allocation?

Around token generation, an error is logged: "Engine core proc EngineCore_0 died unexpectedly, shutting down client." In the log below, it appears after the generated output has already been printed.

```python
from vllm import LLM
llm = LLM("/home/workspace/models/llama2-7b-hf",
          dtype="float16",
          tensor_parallel_size=2,
          max_seq_len_to_capture=256,
          max_model_len=256,
          gpu_memory_utilization=0.9,
          disable_log_stats=False)

output = llm.generate("hello, my gpu! How are you feeling today? please tell")
print(output)
```
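As a rough cross-check on the memory question, the per-GPU figures that vLLM reports in the log below can be added up. This is only a sketch: it sums the three logged values and is not a full memory accounting (activations, NCCL buffers, and the CUDA context are ignored). The Llama 2 7B shape used for the KV cache estimate (32 layers, 32 KV heads, head dimension 128) is an assumption and is not taken from the log.

```python
# Rough per-GPU memory accounting from the values vLLM logged for this run.
GIB = 1024 ** 3

weights_gib = 6.3102     # "Model loading took 6.3102 GiB"
kv_cache_gib = 7.14      # "Available KV cache memory: 7.14 GiB"
cuda_graph_gib = 0.70    # "Graph capturing finished in 5 secs, took 0.70 GiB"

logged_total_gib = weights_gib + kv_cache_gib + cuda_graph_gib

# ASSUMPTION (not from the log): Llama 2 7B has 32 layers, 32 KV heads, head dim 128.
num_layers, num_kv_heads, head_dim = 32, 32, 128
bytes_per_elem = 2       # float16
tp_size = 2              # tensor_parallel_size=2 splits the KV heads across GPUs

# One K and one V vector per layer, for the KV heads held by a single GPU.
kv_bytes_per_token_per_gpu = (
    2 * num_layers * (num_kv_heads // tp_size) * head_dim * bytes_per_elem
)

# How many tokens fit in the logged KV cache budget (before block rounding).
est_tokens = int(kv_cache_gib * GIB // kv_bytes_per_token_per_gpu)
logged_tokens = 29_216   # "GPU KV cache size: 29,216 tokens"
max_model_len = 256

print(f"weights + KV cache + CUDA graphs = {logged_total_gib:.2f} GiB per GPU")
print(f"KV bytes per token per GPU       = {kv_bytes_per_token_per_gpu}")
print(f"estimated KV capacity            = {est_tokens} tokens (logged: {logged_tokens})")
print(f"max concurrency                  = {logged_tokens / max_model_len:.3f}x")
```

If these assumptions hold, the estimate lands within a few dozen tokens of the logged KV cache size, and the three logged values sum to roughly 14 GiB. That would suggest most of the jump comes from the KV cache that vLLM pre-allocates under `gpu_memory_utilization=0.9`, with CUDA graph capture contributing about 0.70 GiB.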


<details>
<summary>The output of <code>python test.py</code></summary>

```text
python test.py 
INFO 08-25 11:01:16 [__init__.py:241] Automatically detected platform cuda.
INFO 08-25 11:01:16 [utils.py:326] non-default args: {'model': '/home/workspace/models/llama2-7b-hf', 'dtype': 'float16', 'max_model_len': 256, 'tensor_parallel_size': 2, 'max_seq_len_to_capture': 256}
INFO 08-25 11:01:19 [__init__.py:736] Resolved architecture: LlamaForCausalLM
INFO 08-25 11:01:19 [__init__.py:1777] Using max model len 256
INFO 08-25 11:01:19 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=8192.
(EngineCore_0 pid=675221) INFO 08-25 11:01:19 [core.py:644] Waiting for init message from front-end.
(EngineCore_0 pid=675221) INFO 08-25 11:01:19 [core.py:74] Initializing a V1 LLM engine (v0.10.1rc2.dev101+g7be5d113d) with config: model='/home/workspace/models/llama2-7b-hf', speculative_config=None, tokenizer='/home/workspace/models/llama2-7b-hf', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=256, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/home/workspace/models/llama2-7b-hf, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":1,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null}
(EngineCore_0 pid=675221) WARNING 08-25 11:01:19 [multiproc_worker_utils.py:273] Reducing Torch parallelism from 16 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_0 pid=675221) INFO 08-25 11:01:19 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1], buffer_handle=(2, 16777216, 10, 'psm_854adbea'), local_subscribe_addr='ipc:///tmp/fdee9f1c-1de6-4997-9d3f-b7fd1a25e7ea', remote_subscribe_addr=None, remote_addr_ipv6=False)
(EngineCore_0 pid=675221) (VllmWorker TP1 pid=675229) INFO 08-25 11:01:20 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_eb777793'), local_subscribe_addr='ipc:///tmp/2c46459a-d2e0-4b53-a8b9-050ae1f7b416', remote_subscribe_addr=None, remote_addr_ipv6=False)
(EngineCore_0 pid=675221) (VllmWorker TP0 pid=675227) INFO 08-25 11:01:20 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_73f5191a'), local_subscribe_addr='ipc:///tmp/76664b92-a96a-46c1-8bee-ac1c8959e6fc', remote_subscribe_addr=None, remote_addr_ipv6=False)
(EngineCore_0 pid=675221) (VllmWorker TP0 pid=675227) INFO 08-25 11:01:21 [__init__.py:1418] Found nccl from library libnccl.so.2
(EngineCore_0 pid=675221) (VllmWorker TP1 pid=675229) INFO 08-25 11:01:21 [__init__.py:1418] Found nccl from library libnccl.so.2
(EngineCore_0 pid=675221) (VllmWorker TP0 pid=675227) INFO 08-25 11:01:21 [pynccl.py:70] vLLM is using nccl==2.26.2
(EngineCore_0 pid=675221) (VllmWorker TP1 pid=675229) INFO 08-25 11:01:21 [pynccl.py:70] vLLM is using nccl==2.26.2
(EngineCore_0 pid=675221) (VllmWorker TP0 pid=675227) INFO 08-25 11:01:21 [custom_all_reduce.py:35] Skipping P2P check and trusting the driver's P2P report.
(EngineCore_0 pid=675221) (VllmWorker TP1 pid=675229) INFO 08-25 11:01:21 [custom_all_reduce.py:35] Skipping P2P check and trusting the driver's P2P report.
(EngineCore_0 pid=675221) (VllmWorker TP1 pid=675229) WARNING 08-25 11:01:21 [custom_all_reduce.py:147] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(EngineCore_0 pid=675221) (VllmWorker TP0 pid=675227) WARNING 08-25 11:01:21 [custom_all_reduce.py:147] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(EngineCore_0 pid=675221) (VllmWorker TP0 pid=675227) INFO 08-25 11:01:21 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_09664453'), local_subscribe_addr='ipc:///tmp/270f2538-fa87-4dc8-ba6b-7b89799e535b', remote_subscribe_addr=None, remote_addr_ipv6=False)
(EngineCore_0 pid=675221) (VllmWorker TP0 pid=675227) INFO 08-25 11:01:21 [parallel_state.py:1134] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_0 pid=675221) (VllmWorker TP1 pid=675229) INFO 08-25 11:01:21 [parallel_state.py:1134] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
(EngineCore_0 pid=675221) (VllmWorker TP0 pid=675227) WARNING 08-25 11:01:21 [topk_topp_sampler.py:61] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(EngineCore_0 pid=675221) (VllmWorker TP1 pid=675229) WARNING 08-25 11:01:21 [topk_topp_sampler.py:61] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(EngineCore_0 pid=675221) (VllmWorker TP1 pid=675229) INFO 08-25 11:01:21 [gpu_model_runner.py:1970] Starting to load model /home/workspace/models/llama2-7b-hf...
(EngineCore_0 pid=675221) (VllmWorker TP0 pid=675227) INFO 08-25 11:01:21 [gpu_model_runner.py:1970] Starting to load model /home/workspace/models/llama2-7b-hf...
(EngineCore_0 pid=675221) (VllmWorker TP1 pid=675229) INFO 08-25 11:01:21 [gpu_model_runner.py:2002] Loading model from scratch...
(EngineCore_0 pid=675221) (VllmWorker TP0 pid=675227) INFO 08-25 11:01:21 [gpu_model_runner.py:2002] Loading model from scratch...
(EngineCore_0 pid=675221) (VllmWorker TP1 pid=675229) INFO 08-25 11:01:21 [cuda.py:328] Using Flash Attention backend on V1 engine.
(EngineCore_0 pid=675221) (VllmWorker TP0 pid=675227) INFO 08-25 11:01:21 [cuda.py:328] Using Flash Attention backend on V1 engine.
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  3.56it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.68it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.83it/s]
(EngineCore_0 pid=675221) (VllmWorker TP0 pid=675227) 
(EngineCore_0 pid=675221) (VllmWorker TP1 pid=675229) INFO 08-25 11:01:23 [default_loader.py:267] Loading weights took 1.13 seconds
(EngineCore_0 pid=675221) (VllmWorker TP0 pid=675227) INFO 08-25 11:01:23 [default_loader.py:267] Loading weights took 1.13 seconds
(EngineCore_0 pid=675221) (VllmWorker TP1 pid=675229) INFO 08-25 11:01:23 [gpu_model_runner.py:2024] Model loading took 6.3102 GiB and 1.202797 seconds
(EngineCore_0 pid=675221) (VllmWorker TP0 pid=675227) INFO 08-25 11:01:23 [gpu_model_runner.py:2024] Model loading took 6.3102 GiB and 1.202734 seconds
(EngineCore_0 pid=675221) (VllmWorker TP1 pid=675229) INFO 08-25 11:01:26 [backends.py:548] Using cache directory: /home/congziyi/.cache/vllm/torch_compile_cache/13b4bc42ee/rank_1_0/backbone for vLLM's torch.compile
(EngineCore_0 pid=675221) (VllmWorker TP0 pid=675227) INFO 08-25 11:01:26 [backends.py:548] Using cache directory: /home/congziyi/.cache/vllm/torch_compile_cache/13b4bc42ee/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_0 pid=675221) (VllmWorker TP0 pid=675227) INFO 08-25 11:01:26 [backends.py:559] Dynamo bytecode transform time: 2.91 s
(EngineCore_0 pid=675221) (VllmWorker TP1 pid=675229) INFO 08-25 11:01:26 [backends.py:559] Dynamo bytecode transform time: 2.91 s
(EngineCore_0 pid=675221) (VllmWorker TP1 pid=675229) INFO 08-25 11:01:28 [backends.py:161] Directly load the compiled graph(s) for dynamic shape from the cache, took 1.735 s
(EngineCore_0 pid=675221) (VllmWorker TP0 pid=675227) INFO 08-25 11:01:28 [backends.py:161] Directly load the compiled graph(s) for dynamic shape from the cache, took 1.746 s
(EngineCore_0 pid=675221) (VllmWorker TP1 pid=675229) INFO 08-25 11:01:28 [monitor.py:34] torch.compile takes 2.91 s in total
(EngineCore_0 pid=675221) (VllmWorker TP0 pid=675227) INFO 08-25 11:01:28 [monitor.py:34] torch.compile takes 2.91 s in total
(EngineCore_0 pid=675221) (VllmWorker TP0 pid=675227) INFO 08-25 11:01:31 [gpu_worker.py:277] Available KV cache memory: 7.14 GiB
(EngineCore_0 pid=675221) (VllmWorker TP1 pid=675229) INFO 08-25 11:01:31 [gpu_worker.py:277] Available KV cache memory: 7.14 GiB
(EngineCore_0 pid=675221) INFO 08-25 11:01:31 [kv_cache_utils.py:849] GPU KV cache size: 29,216 tokens
(EngineCore_0 pid=675221) INFO 08-25 11:01:31 [kv_cache_utils.py:853] Maximum concurrency for 256 tokens per request: 114.12x
(EngineCore_0 pid=675221) INFO 08-25 11:01:31 [kv_cache_utils.py:849] GPU KV cache size: 29,216 tokens
(EngineCore_0 pid=675221) INFO 08-25 11:01:31 [kv_cache_utils.py:853] Maximum concurrency for 256 tokens per request: 114.12x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|████████████████████████████████████████████| 67/67 [00:05<00:00, 12.83it/s]
(EngineCore_0 pid=675221) (VllmWorker TP1 pid=675229) INFO 08-25 11:01:37 [gpu_model_runner.py:2724] Graph capturing finished in 5 secs, took 0.70 GiB
(EngineCore_0 pid=675221) (VllmWorker TP0 pid=675227) INFO 08-25 11:01:37 [gpu_model_runner.py:2724] Graph capturing finished in 5 secs, took 0.70 GiB
(EngineCore_0 pid=675221) INFO 08-25 11:01:37 [core.py:215] init engine (profile, create kv cache, warmup model) took 13.66 seconds
INFO 08-25 11:01:37 [llm.py:298] Supported_tasks: ['generate']
WARNING 08-25 11:01:37 [__init__.py:1652] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
Adding requests: 100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 822.25it/s]
Processed prompts: 100%|██████████████████████████████| 1/1 [00:00<00:00,  2.31it/s, est. speed input: 34.60 toks/s, output: 36.91 toks/s]
[RequestOutput(request_id=0, prompt='hello, my gpu! How are you feeling today? please tell', prompt_token_ids=[1, 22172, 29892, 590, 330, 3746, 29991, 1128, 526, 366, 11223, 9826, 29973, 3113, 2649], encoder_prompt=None, encoder_prompt_token_ids=None, prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=" me how much you're worth, because I'm not sure how to", token_ids=[592, 920, 1568, 366, 29915, 276, 7088, 29892, 1363, 306, 29915, 29885, 451, 1854, 920, 304], cumulative_logprob=None, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=None, lora_request=None, num_cached_tokens=0, multi_modal_placeholders={})]
ERROR 08-25 11:01:38 [core_client.py:562] Engine core proc EngineCore_0 died unexpectedly, shutting down client.
```

</details>

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
