Description
Your current environment (collect_env.py failed)
Environment Details
# wget https://raw.githubusercontent.com/vllm-project/vllm/main/collect_env.py
--2025-04-11 20:33:52-- https://raw.githubusercontent.com/vllm-project/vllm/main/collect_env.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26879 (26K) [text/plain]
Saving to: 'collect_env.py.4'
collect_env.py.4 100%[===========================================================================>] 26.25K --.-KB/s in 0.002s
2025-04-11 20:33:52 (15.0 MB/s) - 'collect_env.py.4' saved [26879/26879]
# python collect_env.py
/bin/sh: 2: python: not found
# python3 collect_env.py
DEBUG 04-11 20:34:16 [__init__.py:28] No plugins for group vllm.platform_plugins found.
DEBUG 04-11 20:34:16 [__init__.py:34] Checking if TPU platform is available.
DEBUG 04-11 20:34:16 [__init__.py:44] TPU platform is not available because: No module named 'libtpu'
DEBUG 04-11 20:34:16 [__init__.py:52] Checking if CUDA platform is available.
DEBUG 04-11 20:34:16 [__init__.py:72] Confirmed CUDA platform is available.
DEBUG 04-11 20:34:16 [__init__.py:100] Checking if ROCm platform is available.
DEBUG 04-11 20:34:16 [__init__.py:114] ROCm platform is not available because: No module named 'amdsmi'
DEBUG 04-11 20:34:16 [__init__.py:122] Checking if HPU platform is available.
DEBUG 04-11 20:34:16 [__init__.py:129] HPU platform is not available because habana_frameworks is not found.
DEBUG 04-11 20:34:16 [__init__.py:140] Checking if XPU platform is available.
DEBUG 04-11 20:34:16 [__init__.py:150] XPU platform is not available because: No module named 'intel_extension_for_pytorch'
DEBUG 04-11 20:34:16 [__init__.py:158] Checking if CPU platform is available.
DEBUG 04-11 20:34:16 [__init__.py:180] Checking if Neuron platform is available.
DEBUG 04-11 20:34:16 [__init__.py:187] Neuron platform is not available because: No module named 'transformers_neuronx'
DEBUG 04-11 20:34:16 [__init__.py:52] Checking if CUDA platform is available.
DEBUG 04-11 20:34:16 [__init__.py:72] Confirmed CUDA platform is available.
INFO 04-11 20:34:16 [__init__.py:239] Automatically detected platform cuda.
Collecting environment information...
Traceback (most recent call last):
File "/workspace/vllm/collect_env.py", line 785, in <module>
main()
File "/workspace/vllm/collect_env.py", line 764, in main
output = get_pretty_env_info()
File "/workspace/vllm/collect_env.py", line 759, in get_pretty_env_info
return pretty_str(get_env_info())
File "/workspace/vllm/collect_env.py", line 557, in get_env_info
pip_version, pip_list_output = get_pip_packages(run_lambda)
File "/workspace/vllm/collect_env.py", line 512, in get_pip_packages
out = run_with_pip()
File "/workspace/vllm/collect_env.py", line 508, in run_with_pip
return "\n".join(line for line in out.splitlines()
AttributeError: 'NoneType' object has no attribute 'splitlines'
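The crash in get_pip_packages means the pip subprocess produced no output. Given the "python: not found" just above, my guess (an assumption, not verified against the script) is that collect_env.py shells out to a python/pip binary this image simply does not have. A minimal workaround sketch on Ubuntu 22.04:
# python-is-python3 is a stock Ubuntu package that makes `python` resolve to python3
apt-get update && apt-get install -y python-is-python3
python3 -m pip list --format=freeze | head   # confirm pip itself works under python3
python3 collect_env.py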
# docker-compose exec vllm env
/bin/sh: 4: docker-compose: not found
# env
LIBRARY_PATH=/usr/local/cuda/lib64/stubs
NV_LIBCUBLAS_VERSION=12.8.4.1-1
NV_NVPROF_DEV_PACKAGE=cuda-nvprof-12-8=12.8.90-1
NCCL_DEBUG=INFO
NV_CUDA_NSIGHT_COMPUTE_VERSION=12.8.1-1
HOSTNAME=6b4c64495e8d
VLLM_FLASH_ATTN_VERSION=2
LD_LIBRARY_PATH=/usr/local/nccl/lib:/usr/local/cuda/lib64
NV_LIBNCCL_PACKAGE_VERSION=2.25.1-1
HOME=/root
VLLM_LOGGING_LEVEL=DEBUG
NV_LIBCUBLAS_DEV_VERSION=12.8.4.1-1
NV_CUDNN_PACKAGE_NAME=libcudnn9-cuda-12
VLLM_CUTLASS_SRC_DIR=/opt/cutlass
NV_LIBNCCL_DEV_PACKAGE_VERSION=2.25.1-1
NV_LIBNPP_PACKAGE=libnpp-12-8=12.3.3.100-1
CUDA_VERSION=12.8.1
NV_CUDNN_PACKAGE=libcudnn9-cuda-12=9.8.0.87-1
NV_NVPROF_VERSION=12.8.90-1
TORCH_CUDA_ARCH_LIST=12.0+PTX
NV_LIBCUBLAS_PACKAGE_NAME=libcublas-12-8
NVIDIA_REQUIRE_CUDA=cuda>=12.8 brand=unknown,driver>=470,driver<471 brand=grid,driver>=470,driver<471 brand=tesla,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=vapps,driver>=470,driver<471 brand=vpc,driver>=470,driver<471 brand=vcs,driver>=470,driver<471 brand=vws,driver>=470,driver<471 brand=cloudgaming,driver>=470,driver<471 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551 brand=unknown,driver>=560,driver<561 brand=grid,driver>=560,driver<561 brand=tesla,driver>=560,driver<561 brand=nvidia,driver>=560,driver<561 brand=quadro,driver>=560,driver<561 brand=quadrortx,driver>=560,driver<561 brand=nvidiartx,driver>=560,driver<561 brand=vapps,driver>=560,driver<561 brand=vpc,driver>=560,driver<561 brand=vcs,driver>=560,driver<561 brand=vws,driver>=560,driver<561 brand=cloudgaming,driver>=560,driver<561 brand=unknown,driver>=565,driver<566 brand=grid,driver>=565,driver<566 brand=tesla,driver>=565,driver<566 brand=nvidia,driver>=565,driver<566 brand=quadro,driver>=565,driver<566 brand=quadrortx,driver>=565,driver<566 brand=nvidiartx,driver>=565,driver<566 brand=vapps,driver>=565,driver<566 brand=vpc,driver>=565,driver<566 brand=vcs,driver>=565,driver<566 brand=vws,driver>=565,driver<566 brand=cloudgaming,driver>=565,driver<566
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NV_CUDA_LIB_VERSION=12.8.1-1
NV_LIBCUSPARSE_VERSION=12.5.8.93-1
NV_LIBNCCL_PACKAGE_NAME=libnccl2
TERM=xterm
NV_NVML_DEV_VERSION=12.8.90-1
NV_LIBNPP_DEV_PACKAGE=libnpp-dev-12-8=12.3.3.100-1
NCCL_P2P_DISABLE=1
NV_CUDA_CUDART_VERSION=12.8.90-1
NV_CUDNN_PACKAGE_DEV=libcudnn9-dev-cuda-12=9.8.0.87-1
NVCC_THREADS=8
PATH=/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
NVARCH=x86_64
NV_LIBCUBLAS_PACKAGE=libcublas-12-8=12.8.4.1-1
NV_LIBCUBLAS_DEV_PACKAGE_NAME=libcublas-dev-12-8
NV_LIBNCCL_PACKAGE=libnccl2=2.25.1-1+cuda12.8
NV_LIBCUSPARSE_DEV_VERSION=12.5.8.93-1
NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev
NVIDIA_PRODUCT_NAME=CUDA
NV_CUDA_CUDART_DEV_VERSION=12.8.90-1
NCCL_IB_DISABLE=1
MAX_JOBS=8
NV_LIBCUBLAS_DEV_PACKAGE=libcublas-dev-12-8=12.8.4.1-1
USE_CUDA=1
NV_CUDA_NSIGHT_COMPUTE_DEV_PACKAGE=cuda-nsight-compute-12-8=12.8.1-1
NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.25.1-1+cuda12.8
NV_NVTX_VERSION=12.8.90-1
NV_LIBNPP_VERSION=12.3.3.100-1
NV_CUDNN_VERSION=9.8.0.87-1
CUDA_HOME=/usr/local/cuda
PWD=/workspace/vllm
NVIDIA_VISIBLE_DEVICES=0,1
NCCL_VERSION=2.25.1-1
NVCC_GENCODE=-gencode=arch=compute_120,code=sm_120
NV_LIBNPP_DEV_VERSION=12.3.3.100-1
How you are installing vllm
For the last week, I have been scouring the vLLM issues, reading NVIDIA library documentation, and attempting to get vLLM working across two 5090s, with minimal success. Success for me would be running an FP8 model that leverages the 5090's native support for this quantization.
The issue arrives after graph compilation:
Compiling a graph for general shape takes 32.51 s
and appears to fail at:
torch.ops._C.cutlass_scaled_mm.default(buf10, buf0, arg4_1, buf5, arg5_1, arg6_1)
File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 776, in __call__
return self._op(*args, **kwargs)
I believe the main issue is that I am trying to use the newest versions of everything (sm120, CUDA 12.8, NCCL, and CUTLASS), which is causing a large mess. If I were to pin it on one problem, it would be CUTLASS, which I'm relying on vLLM to build from VLLM_CUTLASS_SRC_DIR as per the documentation:
vLLM Installation Documentation
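One change I would try first: pin CUTLASS to a tagged release instead of a moving master, since vLLM's build normally fetches a specific CUTLASS revision. A sketch for the clone step; the v3.8.0 tag is an assumption on my part, so check which tag vLLM's CMake files actually reference:
# Pin CUTLASS to a fixed tag rather than master
# (v3.8.0 is illustrative -- verify against the revision vLLM's CMake expects)
RUN git clone --depth 1 --branch v3.8.0 https://github.com/NVIDIA/cutlass.git /opt/cutlass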
I would appreciate an expert eye on this, as I have run out of ideas and am willing to try anything. I hope this can help anyone in the future with a similar setup.
Dockerfile:
FROM nvidia/cuda:12.8.1-cudnn-devel-ubuntu22.04

# Environment configuration
ENV MAX_JOBS=8 \
    NVCC_THREADS=8 \
    CUDA_HOME=/usr/local/cuda \
    LD_LIBRARY_PATH=/usr/local/nccl/lib:$LD_LIBRARY_PATH \
    TORCH_CUDA_ARCH_LIST="12.0+PTX" \
    USE_CUDA=1 \
    VLLM_FLASH_ATTN_VERSION=2 \
    VLLM_CUTLASS_SRC_DIR=/opt/cutlass \
    NCCL_P2P_DISABLE=1 \
    NCCL_IB_DISABLE=1 \
    NCCL_DEBUG=INFO \
    NVCC_GENCODE="-gencode=arch=compute_120,code=sm_120"

# System build dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    kmod git cmake ccache build-essential libhwloc-dev \
    python3 python3-pip python3-dev devscripts debhelper fakeroot dh-make lintian \
    && apt-get clean && rm -rf /var/lib/apt/lists/*

RUN python3 -m pip install --upgrade pip

# Install torch
RUN pip3 uninstall -y torch torchvision torchaudio && \
    pip3 install --pre torch torchvision torchaudio \
    --index-url https://download.pytorch.org/whl/nightly/cu128

# Install bitsandbytes
RUN pip3 install --no-cache-dir bitsandbytes

# Fetch CUTLASS from source
RUN git clone https://github.com/NVIDIA/cutlass.git /opt/cutlass

# Build and install NCCL from source (master branch, with SM120 support)
RUN git clone https://github.com/NVIDIA/nccl.git /opt/nccl && \
    cd /opt/nccl && \
    git checkout master && \
    make -j$MAX_JOBS src.build CUDA_HOME=${CUDA_HOME} && \
    make pkg.debian.build && \
    dpkg -i build/pkg/deb/libnccl*.deb

# Clone and build vLLM from source using container's PyTorch
RUN git clone https://github.com/vllm-project/vllm.git /workspace/vllm
WORKDIR /workspace/vllm
RUN python3 use_existing_torch.py && \
    pip3 install --no-cache-dir -r requirements/build.txt && \
    pip3 install --no-cache-dir setuptools_scm && \
    pip3 install -e . --no-build-isolation

# Expose port for vLLM's OpenAI-compatible server
EXPOSE 8000

# Entrypoint to start the vLLM server by default
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
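Before starting the server, a quick sanity check I run inside the built image (plain PyTorch, nothing vLLM-specific) to confirm the nightly wheel actually targets this card:
# Expect CUDA 12.8 and compute capability (12, 0) on an RTX 5090
python3 -c "import torch; print(torch.version.cuda, torch.cuda.get_device_capability(0))"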
docker-compose.yml:
services:
  vllm:
    build:
      context: .
      dockerfile: Dockerfile.py4
    command: [
      "--model", "JamAndTeaStudios/DeepSeek-R1-Distill-Qwen-14B-FP8-Dynamic",
      "--tensor-parallel-size", "2",
      "--gpu-memory-utilization", "0.80",
      "--max-model-len", "16000",
      "--trust-remote-code",
      "--served-model-name", "QwQ-32B-4bit"
    ]
    ports:
      - "8000:8000"
    environment:
      - NVIDIA_VISIBLE_DEVICES=0,1
      - VLLM_FLASH_ATTN_VERSION=2
      - VLLM_LOGGING_LEVEL=DEBUG
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      - ./model_cache:/models
    ipc: host
    privileged: true
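As an aside on the "docker-compose: not found" error in the transcript above: recent Docker releases ship Compose V2 as a CLI plugin, invoked with a space instead of a hyphen:
docker compose exec vllm env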
Thanks to my main sources of truth for getting this far:
#13306
#14452
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.