I tried to deploy llama-3.1-8b-instruct:1.1.1 with KServe and modelcar on OpenShift AI.

What I have done?
1. Created a model store with the model files downloaded from NGC:
podman run --rm -e NGC_API_KEY=<API_KEY> -v /models:/opt/nim/.cache nvcr.io/nim/meta/llama-3.1-8b-instruct:1.1.1 create-model-store --profile <PROFILE> --model-store /opt/nim/.cache
2. Built a modelcar container image containing the model files from /models.
3. Deployed the ServingRuntime CR and set the NIM_MODEL_NAME environment variable to /mnt/models/, which is the path where the model files from the modelcar container are mounted.
4. Deployed the InferenceService CR and set the storageUri to use the modelcar image created in step 2.
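For context, the wiring looks roughly like this (a sketch only, not my exact manifests; the resource name, model format, runtime reference, and image registry below are placeholders):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-1-8b-instruct        # placeholder name
spec:
  predictor:
    model:
      modelFormat:
        name: nim-llm                # placeholder; must match the ServingRuntime's supportedModelFormats
      runtime: nim-serving-runtime   # placeholder; name of the ServingRuntime CR from step 3
      # Modelcar image built in step 2; KServe exposes its /models content at /mnt/models
      storageUri: oci://registry.example.com/llama-3.1-8b-instruct-modelcar:1.1.1
```

With these applied, the NIM container failed on startup with the following traceback: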
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/nim/llm/vllm_nvext/entrypoints/openai/api_server.py", line 702, in <module>
    engine = AsyncLLMEngineFactory.from_engine_args(engine_args, usage_context=UsageContext.OPENAI_API_SERVER)
  File "/opt/nim/llm/vllm_nvext/engine/async_trtllm_engine_factory.py", line 33, in from_engine_args
    engine = engine_cls.from_engine_args(engine_args, start_engine_loop, usage_context)
  File "/opt/nim/llm/vllm_nvext/engine/async_trtllm_engine.py", line 304, in from_engine_args
    return cls(
  File "/opt/nim/llm/vllm_nvext/engine/async_trtllm_engine.py", line 278, in __init__
    self.engine: _AsyncTRTLLMEngine = self._init_engine(*args, **kwargs)
  File "/opt/nim/llm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 505, in _init_engine
    return engine_class(*args, **kwargs)
  File "/opt/nim/llm/vllm_nvext/engine/async_trtllm_engine.py", line 136, in __init__
    self._tllm_engine = TrtllmModelRunner(
  File "/opt/nim/llm/vllm_nvext/engine/trtllm_model_runner.py", line 275, in __init__
    self._tllm_exec, self._cfg = self._create_engine(
  File "/opt/nim/llm/vllm_nvext/engine/trtllm_model_runner.py", line 569, in _create_engine
    return create_trt_executor(
  File "/opt/nim/llm/vllm_nvext/trtllm/utils.py", line 283, in create_trt_executor
    engine_size_bytes = _get_rank_engine_file_size_bytes(profile_dir)
  File "/opt/nim/llm/vllm_nvext/trtllm/utils.py", line 226, in _get_rank_engine_file_size_bytes
    engine_size_bytes = rank0_engine.stat().st_size
  File "/usr/lib/python3.10/pathlib.py", line 1097, in stat
    return self._accessor.stat(self, follow_symlinks=follow_symlinks)
FileNotFoundError: [Errno 2] No such file or directory: '/models/trtllm_engine/rank0.engine'
Issue:
The directory containing the model files in the sidecar (modelcar) container is correctly mounted into the NIM container via a symlink:
(Commands executed in a terminal inside the NIM container)
$ ls -al /mnt/models
lrwxrwxrwx. 1 1001090000 1001090000 20 Aug 7 20:34 /mnt/models -> /proc/76/root/models
$ ls -al /proc/76/root/models/trtllm_engine/rank0.engine
-rw-r--r--. 1 root root 16218123260 Jul 30 18:18 /proc/76/root/models/trtllm_engine/rank0.engine
The NIM container's code invokes the function _get_rank_engine_file_size_bytes in vllm_nvext/trtllm/utils.py, which calls Path.resolve() to resolve the symlink.
Path.resolve() follows symlinks in userspace via readlink(), and readlink() on the magic symlink /proc/76/root returns /. As a result, the path of the rank engine file (i.e. /proc/76/root/models/trtllm_engine/rank0.engine) is resolved to /models/trtllm_engine/rank0.engine, which does not exist in the NIM container's mount namespace.
The code therefore could not find the file /models/trtllm_engine/rank0.engine to get its size, and threw the error above.
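The failure mode can be reproduced with a few lines of Python (a hypothetical repro, assuming the sidecar's PID is 76 as in the listing above; run inside the NIM container):

```python
from pathlib import Path

link = Path("/mnt/models")

# The symlink target points into the sidecar's mount namespace.
print(link.readlink())  # /proc/76/root/models

# Following the symlink in the kernel works: the /proc/<pid>/root magic
# link is traversed into the sidecar's namespace during path lookup.
engine = link / "trtllm_engine" / "rank0.engine"
print(engine.stat().st_size)  # 16218123260

# Userspace resolution does not work: readlink("/proc/76/root") returns "/",
# so resolve() collapses the path to /models/..., which does not exist
# inside the NIM container.
print(engine.resolve())  # /models/trtllm_engine/rank0.engine
engine.resolve().stat()  # raises FileNotFoundError
```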
What I expect?
The NIM container should properly handle the symlink to the directory containing the model files, i.e. access the files through the symlink instead of resolving it to a path that does not exist in its own mount namespace.
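One possible direction for a fix (a minimal sketch only, assuming the file layout shown in the traceback; the actual helper in vllm_nvext/trtllm/utils.py contains more logic) is to stat() the engine file through the symlink and let the kernel follow /proc/<pid>/root, rather than resolving the path in userspace first:

```python
from pathlib import Path

def _get_rank_engine_file_size_bytes(profile_dir: Path) -> int:
    # Sketch: do NOT call profile_dir.resolve() here. stat() follows the
    # /proc/<pid>/root magic symlink in the kernel, so the size can be read
    # even though userspace resolution of the same path would fail.
    rank0_engine = profile_dir / "trtllm_engine" / "rank0.engine"
    return rank0_engine.stat().st_size
```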
@xieshenzh thanks for reporting this, I'm trying to do the exact same thing. I followed your procedure and got the same results with the nvidia-nim-llama-3.1-8b-instruct-1.1.2 image.
My overall thought is to pre-cache new NIM models as modelcars on each of my OpenShift nodes using the image puller, and let KServe do its thing for faster scale-up when necessary.