
[Bug]: CPU offload not working for DeepSeek-V2-Lite-Chat #15871

Description

@ymcki

Your current environment

The output of `python collect_env.py`
INFO 04-01 17:28:14 [__init__.py:239] Automatically detected platform cuda.
Collecting environment information...
PyTorch version: 2.6.0+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.2 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version: Could not collect
CMake version: version 3.28.3
Libc version: glibc-2.39

Python version: 3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:50:58) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-6.8.0-56-generic-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA GeForce GT 1030
GPU 1: NVIDIA GeForce RTX 3090

Nvidia driver version: 550.127.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.5.1
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        46 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               6
On-line CPU(s) list:                  0-5
Vendor ID:                            GenuineIntel
Model name:                           Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz
CPU family:                           6
Model:                                62
Thread(s) per core:                   1
Core(s) per socket:                   6
Socket(s):                            1
Stepping:                             4
CPU(s) scaling MHz:                   73%
CPU max MHz:                          3400.0000
CPU min MHz:                          1200.0000
BogoMIPS:                             6804.00
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm arat pln pts vnmi md_clear flush_l1d
Virtualization:                       VT-x
L1d cache:                            192 KiB (6 instances)
L1i cache:                            192 KiB (6 instances)
L2 cache:                             1.5 MiB (6 instances)
L3 cache:                             12 MiB (1 instance)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-5
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          KVM: Mitigation: VMX disabled
Vulnerability L1tf:                   Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled
Vulnerability Mds:                    Mitigation; Clear CPU buffers; SMT disabled
Vulnerability Meltdown:               Mitigation; PTI
Vulnerability Mmio stale data:        Unknown: No mitigations
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines; IBPB conditional; IBRS_FW; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] fast-pytorch-kmeans==0.2.0.1
[pip3] flake8==7.0.0
[pip3] mypy==1.10.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] numpydoc==1.7.0
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-cusparselt-cu12==0.6.2
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] onnx==1.17.0
[pip3] onnxruntime==1.20.1
[pip3] pynvml==12.0.0
[pip3] pytorch_retinaface==0.1.0
[pip3] pyzmq==25.1.2
[pip3] sentence-transformers==2.7.0
[pip3] torch==2.6.0
[pip3] torchaudio==2.6.0
[pip3] torchsde==0.2.6
[pip3] torchvision==0.21.0
[pip3] transformers==4.48.2
[pip3] triton==3.2.0
[conda] _anaconda_depends         2024.06             py312_mkl_2  
[conda] blas                      1.0                         mkl  
[conda] cuda-cccl_linux-64        12.6.77                       0    nvidia
[conda] cuda-cudart-dev           12.4.127                      0    nvidia
[conda] cuda-cudart-dev_linux-64  12.6.77                       0    nvidia
[conda] cuda-cudart-static        12.6.77                       0    nvidia
[conda] cuda-cudart-static_linux-64 12.6.77                       0    nvidia
[conda] cuda-nvrtc                12.6.85                       0    nvidia
[conda] cuda-profiler-api         12.6.77                       0    nvidia
[conda] cuda-version              12.6                          3    nvidia
[conda] fast-pytorch-kmeans       0.2.0.1                  pypi_0    pypi
[conda] libblas                   3.9.0            20_linux64_mkl    conda-forge
[conda] libcblas                  3.9.0            20_linux64_mkl    conda-forge
[conda] libcublas                 12.6.4.1                      0    nvidia
[conda] libcublas-dev             12.6.4.1                      0    nvidia
[conda] libcurand                 10.3.7.77                     0    nvidia
[conda] libcurand-dev             10.3.7.77                     0    nvidia
[conda] libcusolver               11.7.1.2                      0    nvidia
[conda] libcusolver-dev           11.7.1.2                      0    nvidia
[conda] libcusparse               12.5.4.2                      0    nvidia
[conda] libcusparse-dev           12.5.4.2                      0    nvidia
[conda] libfaiss                  1.9.0           h4818125_0_cuda12.1.1_raft    pytorch
[conda] liblapack                 3.9.0            20_linux64_mkl    conda-forge
[conda] libnvjitlink              12.6.85                       0    nvidia
[conda] mkl                       2023.2.0         h84fe81f_50496    conda-forge
[conda] mkl-service               2.4.0           py312h5eee18b_1  
[conda] mkl_fft                   1.3.8           py312h5eee18b_0  
[conda] mkl_random                1.2.4           py312hdb19cb5_0  
[conda] numpy                     1.26.4          py312hc5e2394_0  
[conda] numpy-base                1.26.4          py312h0da6c21_0  
[conda] numpydoc                  1.7.0           py312h06a4308_0  
[conda] nvidia-cublas-cu12        12.4.5.8                 pypi_0    pypi
[conda] nvidia-cuda-cupti-cu12    12.4.127                 pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu12    12.4.127                 pypi_0    pypi
[conda] nvidia-cuda-runtime-cu12  12.4.127                 pypi_0    pypi
[conda] nvidia-cudnn-cu12         9.1.0.70                 pypi_0    pypi
[conda] nvidia-cufft-cu12         11.2.1.3                 pypi_0    pypi
[conda] nvidia-curand-cu12        10.3.5.147               pypi_0    pypi
[conda] nvidia-cusolver-cu12      11.6.1.9                 pypi_0    pypi
[conda] nvidia-cusparse-cu12      12.3.1.170               pypi_0    pypi
[conda] nvidia-cusparselt-cu12    0.6.2                    pypi_0    pypi
[conda] nvidia-ml-py              12.560.30                pypi_0    pypi
[conda] nvidia-nccl-cu12          2.21.5                   pypi_0    pypi
[conda] nvidia-nvjitlink-cu12     12.4.127                 pypi_0    pypi
[conda] nvidia-nvtx-cu12          12.4.127                 pypi_0    pypi
[conda] pynvml                    12.0.0                   pypi_0    pypi
[conda] pytorch-retinaface        0.1.0                    pypi_0    pypi
[conda] pyzmq                     25.1.2          py312h6a678d5_0  
[conda] sentence-transformers     2.7.0                    pypi_0    pypi
[conda] torch                     2.6.0                    pypi_0    pypi
[conda] torchaudio                2.6.0                    pypi_0    pypi
[conda] torchsde                  0.2.6                    pypi_0    pypi
[conda] torchvision               0.21.0                   pypi_0    pypi
[conda] transformers              4.48.2                   pypi_0    pypi
[conda] triton                    3.2.0                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.8.2
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0	GPU1	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	PHB	0-5	0		N/A
GPU1	PHB	 X 	0-5	0		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

CUDA_DEVICE_ORDER=PCI_BUS_ID
CUDA_VISIBLE_DEVICES=1
LD_LIBRARY_PATH=/usr/lib/libreoffice/program:/usr/local/cuda/lib64/:/usr/lib/x86_64-linux-gnu:/tank/ffmpeg/ffmpeg-7.1/libavdevice:/tank/ffmpeg/ffmpeg-7.1/libavfilter:/tank/ffmpeg/ffmpeg-7.1/libavformat:/tank/ffmpeg/ffmpeg-7.1/libavcodec:/tank/ffmpeg/ffmpeg-7.1/libpostproc:/tank/ffmpeg/ffmpeg-7.1/libswresample:/tank/ffmpeg/ffmpeg-7.1/libswscale:/tank/ffmpeg/ffmpeg-7.1/libavutil:
MKL_INTERFACE_LAYER=LP64,GNU
NCCL_CUMEM_ENABLE=0
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY

🐛 Describe the bug

I am running the script below on a machine with 32 GB of RAM and an RTX 3090, using vLLM 0.8.2. I am trying to offload 15 GB of weights to CPU RAM with cpu_offload_gb=15, but it crashes during CUDA graph capture, complaining that tensors must all be on the same device (cuda:0 vs. cpu). Does that mean DeepSeek-V2-Lite-Chat is not supported for CPU offloading?

This is the script vc.py I ran:

from vllm import LLM, SamplingParams
import sys
# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(model=sys.argv[1], trust_remote_code=True, max_model_len=32768, cpu_offload_gb=15)
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

#python3 vc.py deepseek-ai/DeepSeek-V2-Lite-Chat/
INFO 04-01 16:42:54 [__init__.py:239] Automatically detected platform cuda.
INFO 04-01 16:42:56 [config.py:208] Replacing legacy 'type' key with 'rope_type'
INFO 04-01 16:43:05 [config.py:585] This model supports multiple tasks: {'embed', 'reward', 'classify', 'score', 'generate'}. Defaulting to 'generate'.
WARNING 04-01 16:43:05 [arg_utils.py:1854] --cpu-offload-gb is not supported by the V1 Engine. Falling back to V0.
INFO 04-01 16:43:05 [llm_engine.py:241] Initializing a V0 LLM engine (v0.8.2) with config: model='/home/user/DeepSeek-V2-Lite-Chat/', speculative_config=None, tokenizer='/home/user/DeepSeek-V2-Lite-Chat/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/home/user/DeepSeek-V2-Lite-Chat/, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
INFO 04-01 16:43:07 [cuda.py:190] Using Triton MLA backend.
WARNING 04-01 16:43:07 [triton_decode_attention.py:44] The following error message 'operation scheduled before its operands' can be ignored.
INFO 04-01 16:43:08 [parallel_state.py:954] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-01 16:43:08 [model_runner.py:1110] Starting to load model /home/user/DeepSeek-V2-Lite-Chat/...
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:13<00:39, 13.23s/it]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:26<00:26, 13.18s/it]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:34<00:10, 10.65s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:45<00:00, 11.17s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:46<00:00, 11.51s/it]

INFO 04-01 16:44:04 [loader.py:447] Loading weights took 46.24 seconds
INFO 04-01 16:44:11 [model_runner.py:1146] Model loading took 14.2597 GB and 59.536244 seconds
WARNING 04-01 16:44:18 [fused_moe.py:881] Using default MoE config. Performance might be sub-optimal! Config file not found at /home/user/anaconda3/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=1408,device_name=NVIDIA_GeForce_RTX_3090.json
INFO 04-01 16:44:23 [worker.py:267] Memory profiling takes 12.44 seconds
INFO 04-01 16:44:23 [worker.py:267] the current vLLM instance can use total_gpu_memory (23.68GiB) x gpu_memory_utilization (0.90) = 21.32GiB
INFO 04-01 16:44:23 [worker.py:267] model weights take 14.26GiB; non_torch_memory takes 0.10GiB; PyTorch activation peak memory takes 3.15GiB; the rest of the memory reserved for KV Cache is 3.81GiB.
INFO 04-01 16:44:23 [executor_base.py:111] # cuda blocks: 8210, # CPU blocks: 8630
INFO 04-01 16:44:23 [executor_base.py:116] Maximum concurrency for 32768 tokens per request: 4.01x
INFO 04-01 16:44:31 [model_runner.py:1442] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing gpu_memory_utilization or switching to eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
Capturing CUDA graph shapes: 0%| | 0/35 [00:01<?, ?it/s]
[rank0]: Traceback (most recent call last):
[rank0]: File "/tank/ai/langchain/vc.py", line 31, in
[rank0]: llm = LLM(model=sys.argv[1], trust_remote_code=True, max_model_len=32768, cpu_offload_gb=15)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/anaconda3/lib/python3.12/site-packages/vllm/utils.py", line 1037, in inner
[rank0]: return fn(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/anaconda3/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 243, in init
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/anaconda3/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 520, in from_engine_args
[rank0]: return engine_cls.from_vllm_config(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/anaconda3/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 496, in from_vllm_config
[rank0]: return cls(
[rank0]: ^^^^
[rank0]: File "/home/user/anaconda3/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 283, in init
[rank0]: self._initialize_kv_caches()
[rank0]: File "/home/user/anaconda3/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 445, in _initialize_kv_caches
[rank0]: self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]: File "/home/user/anaconda3/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 122, in initialize_cache
[rank0]: self.collective_rpc("initialize_cache",
[rank0]: File "/home/user/anaconda3/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
[rank0]: answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/anaconda3/lib/python3.12/site-packages/vllm/utils.py", line 2255, in run_method
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/anaconda3/lib/python3.12/site-packages/vllm/worker/worker.py", line 308, in initialize_cache
[rank0]: self._warm_up_model()
[rank0]: File "/home/user/anaconda3/lib/python3.12/site-packages/vllm/worker/worker.py", line 338, in _warm_up_model
[rank0]: self.model_runner.capture_model(self.gpu_cache)
[rank0]: File "/home/user/anaconda3/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/anaconda3/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1560, in capture_model
[rank0]: graph_runner.capture(**capture_inputs)
[rank0]: File "/home/user/anaconda3/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1926, in capture
[rank0]: self.model(
[rank0]: File "/home/user/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/anaconda3/lib/python3.12/site-packages/vllm/model_executor/models/deepseek_v2.py", line 689, in forward
[rank0]: hidden_states = self.model(input_ids, positions, intermediate_tensors,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/anaconda3/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 172, in call
[rank0]: return self.forward(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/anaconda3/lib/python3.12/site-packages/vllm/model_executor/models/deepseek_v2.py", line 646, in forward
[rank0]: hidden_states, residual = layer(positions, hidden_states, residual)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/anaconda3/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 541, in forward
[rank0]: output = functional_call(module,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/anaconda3/lib/python3.12/site-packages/torch/_functorch/functional_call.py", line 148, in functional_call
[rank0]: return nn.utils.stateless._functional_call(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/anaconda3/lib/python3.12/site-packages/torch/nn/utils/stateless.py", line 300, in _functional_call
[rank0]: return module(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/anaconda3/lib/python3.12/site-packages/vllm/model_executor/models/deepseek_v2.py", line 559, in forward
[rank0]: hidden_states = self.self_attn(
[rank0]: ^^^^^^^^^^^^^^^
[rank0]: File "/home/user/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/anaconda3/lib/python3.12/site-packages/vllm/model_executor/models/deepseek_v2.py", line 477, in forward
[rank0]: return self.mla_attn(hidden_states_or_q_c,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/anaconda3/lib/python3.12/site-packages/vllm/attention/layer.py", line 229, in forward
[rank0]: return torch.ops.vllm.unified_attention(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/anaconda3/lib/python3.12/site-packages/torch/_ops.py", line 1123, in call
[rank0]: return self._op(*args, **(kwargs or {}))
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/anaconda3/lib/python3.12/site-packages/vllm/attention/layer.py", line 342, in unified_attention
[rank0]: return self.impl.forward(self, query, key, value, kv_cache, attn_metadata)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/user/anaconda3/lib/python3.12/site-packages/vllm/attention/backends/mla/common.py", line 1377, in forward
[rank0]: self._q_proj_and_k_up_proj(decode_hs_or_q_c)
[rank0]: File "/home/user/anaconda3/lib/python3.12/site-packages/vllm/attention/backends/mla/common.py", line 1069, in _q_proj_and_k_up_proj
[rank0]: ql_nope = torch.bmm(q_nope, self.W_UK_T)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)
[rank0]:[W401 16:44:41.218518878 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
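For reference, the failing call at the bottom of the traceback (torch.bmm(q_nope, self.W_UK_T)) can be reproduced outside vLLM: torch.bmm refuses to mix a CUDA operand with a CPU operand, which is what happens when cpu_offload_gb leaves the MLA weight W_UK_T on the CPU while the query tensor is on the GPU. A minimal standalone sketch of the underlying PyTorch error (tensor shapes here are arbitrary placeholders, not the model's real dimensions):

import torch

# Standalone illustration of the error in the traceback above;
# only the device mismatch matters, not the shapes.
q_nope = torch.randn(8, 16, 512, device="cuda")   # activation computed on the GPU
w_uk_t = torch.randn(8, 512, 128, device="cpu")   # weight left on the CPU by offload

try:
    torch.bmm(q_nope, w_uk_t)
except RuntimeError as e:
    # Prints: "Expected all tensors to be on the same device, but found
    # at least two devices, cuda:0 and cpu! ..."
    print(e)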

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Metadata

Assignees: No one assigned
Labels: bug (Something isn't working), stale (Over 90 days of inactivity)
    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions