Serving Llama3-405B on 2 GPU nodes with 8xH100 following the example in the repository #184

Closed · Fixed by #185
liurupeng opened this issue Jul 31, 2024 · 1 comment
Labels: kind/bug (Categorizes issue or PR as related to a bug.)
Comments

@liurupeng (Collaborator)

What happened:
I followed the vllm+ray+lws example, but hit this error when using 2 GPU nodes with 8 H100 GPUs each. Here is my deployment:

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        containers:
          - name: vllm-leader
            # this image is built with the Dockerfile under ./build
            image: us-central1-docker.pkg.dev/gke-aishared-dev/vllm-ray-multihost/ray-vllm:v0.5.3.post1
            env:
              - name: RAY_CLUSTER_SIZE
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.annotations['leaderworkerset.sigs.k8s.io/size']
              - name: HUGGING_FACE_HUB_TOKEN
                value: ""
            command:
              - sh
              - -c
              - "/vllm-workspace/ray_init.sh leader --ray_cluster_size=$RAY_CLUSTER_SIZE; 
                 python3 -m vllm.entrypoints.openai.api_server --port 8080 --model meta-llama/Meta-Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline-parallel-size 2"
            resources:
              limits:
                nvidia.com/gpu: "8"
              requests:
                nvidia.com/gpu: "8"
            ports:
              - containerPort: 8080
    workerTemplate:
      spec:
        containers:
          - name: vllm-worker
            # this image is built with the Dockerfile under ./build
            image: us-central1-docker.pkg.dev/gke-aishared-dev/vllm-ray-multihost/ray-vllm:v0.5.3.post1
            command:
              - sh
              - -c
              - "/vllm-workspace/ray_init.sh worker --ray_address=$(LEADER_NAME).$(LWS_NAME).$(NAMESPACE).svc.cluster.local"
            resources:
              limits:
                nvidia.com/gpu: "8"
              requests:
                nvidia.com/gpu: "8"
            env:
              - name: LEADER_NAME
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.annotations['leaderworkerset.sigs.k8s.io/leader-name']
              - name: NAMESPACE
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.namespace
              - name: LWS_NAME
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.labels['leaderworkerset.sigs.k8s.io/name']
              - name: HUGGING_FACE_HUB_TOKEN
                value: ""

But I got this error: "ValueError: Ray does not allocate any GPUs on the driver node. Consider adjusting the Ray placement group or running the driver on a GPU node."
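
As a sanity check (just a sketch; the pod/container names follow the manifest above), I can exec into the leader and ask Ray what resources the cluster reports before vLLM builds its placement group:

# exec into the leader pod and print the Ray cluster's resource summary
kubectl exec -it vllm-0 -c vllm-leader -- ray status
# I would expect 2 nodes and 16 GPUs in total here; the ValueError suggests the
# placement group ends up with no GPU bundle on the driver (head) node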

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • LWS version (use git describe --tags --dirty --always):
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@liurupeng added the kind/bug label on Jul 31, 2024
@liurupeng (Collaborator, Author)

Seems related to vllm-project/vllm#2406. @gujingit, do you know if there is a workaround for this?
