
Deploying NIM on k8s: with this custom-value.yaml, the Llama 3.1 8B model can be deployed but 70B fails #92

Open
SarielMa opened this issue Sep 25, 2024 · 0 comments


We followed the steps here: https://docs.nvidia.com/nim/large-language-models/latest/deploy-helm.html

After running helm install ....:

kubectl logs my-nim-01 --previous

...
{"level": "None", "time": "None", "file_name": "None", "file_path": "None", "line_number": "-1", "message": "[09-25 19:03:45.989 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:119] error sending request for url (https://api.ngc.nvidia.com/v2/org/nim/team/meta/models/llama-3_1-70b-instruct/hf-1d54af3-nim1.2/files)", "exc_info": "None", "stack_info": "None"}
{"level": "ERROR", "time": "None", "file_name": "None", "file_path": "None", "line_number": "-1", "message": "", "exc_info": "Traceback (most recent call last):\n File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main\n return _run_code(code, main_globals, None,\n File "/usr/lib/python3.10/runpy.py", line 86, in _run_code\n exec(code, run_globals)\n File "/opt/nim/llm/vllm_nvext/entrypoints/launch.py", line 99, in \n main()\n File "/opt/nim/llm/vllm_nvext/entrypoints/launch.py", line 42, in main\n inference_env = prepare_environment()\n File "/opt/nim/llm/vllm_nvext/entrypoints/args.py", line 155, in prepare_environment\n engine_args, extracted_name = inject_ngc_hub(engine_args)\n File "/opt/nim/llm/vllm_nvext/hub/ngc_injector.py", line 247, in inject_ngc_hub\n cached = repo.get_all()\nException: error sending request for url (https://api.ngc.nvidia.com/v2/org/nim/team/meta/models/llama-3_1-70b-instruct/hf-1d54af3-nim1.2/files)", "stack_info": "None"}
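The JSON log lines above already name the endpoint the container cannot reach. A small stdlib-only sketch (using an abbreviated copy of the first log entry from this issue) that pulls every failing URL out of the structured logs, which is handy when the traceback is long:

```python
import json
import re

# Abbreviated copy of the first structured log line from
# `kubectl logs my-nim-01 --previous` (only the "message" field is used).
log_lines = [
    '{"level": "None", "message": "[09-25 19:03:45.989 ERROR nim_sdk::hub::repo '
    'rust/nim-sdk/src/hub/repo.rs:119] error sending request for url '
    '(https://api.ngc.nvidia.com/v2/org/nim/team/meta/models/'
    'llama-3_1-70b-instruct/hf-1d54af3-nim1.2/files)"}',
]

# Collect every URL mentioned in an "error sending request for url (...)" message.
urls = set()
for line in log_lines:
    entry = json.loads(line)
    urls.update(re.findall(r"error sending request for url \((\S+)\)", entry["message"]))

for url in sorted(urls):
    print(url)
```

The extracted URL points at the NGC model registry, so the failure happens while the container downloads the 70B model files, before inference ever starts; that suggests checking the NGC_API_KEY secret and outbound connectivity from the pod rather than the model configuration itself.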

kubectl describe pod my-nim-01
...
Events:
  Type     Reason   Age                    From     Message
  ----     ------   ----                   ----     -------
  Warning  BackOff  5m3s (x90 over 102m)   kubelet  Back-off restarting failed container nim-llm in pod my-nim-0_default(ce8f1e3a-f0e6-4a95-9086-2901091b7a57)
  Normal   Pulled   4m52s (x15 over 116m)  kubelet  Container image "nvcr.io/nim/meta/llama-3.1-70b-instruct:latest" already present on machine

kubectl get pods -A
NAMESPACE   NAME       READY   STATUS    RESTARTS         AGE
default     my-nim-0   0/1     Running   14 (6m46s ago)   117m

vim custom-value.yaml

image:
  repository: "nvcr.io/nim/meta/llama-3.1-70b-instruct"  # container location
  tag: latest  # NIM version you want to deploy
model:
  ngcAPISecret: ngc-api  # name of a secret in the cluster that includes a key named NGC_API_KEY and is an NGC API key
imagePullSecrets:
  - name: ngc-secret  # name of a secret used to pull nvcr.io images, see https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/
persistence:
  enabled: true
  size: 800Gi
  accessMode: ReadWriteMany
  storageClass: ""
  annotations:
    helm.sh/resource-policy: "keep"
livenessProbe:
  initialDelaySeconds: 600
  periodSeconds: 60
  timeoutSeconds: 10
startupProbe:
  initialDelaySeconds: 600
  periodSeconds: 60
  timeoutSeconds: 10
  failureThreshold: 1500
resources:
  limits:
    nvidia.com/gpu: 4  # much more GPU memory is required
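The "much more GPU memory is required" comment points at the sizing gap between the two models. A back-of-envelope sketch (an assumption-laden estimate: fp16/bf16 weights only, ignoring KV cache, activations, and runtime overhead) of why the 70B model needs several large GPUs while 8B fits on one:

```python
# Rough weight-memory estimate; real requirements are higher because of
# KV cache, activations, and framework overhead.
BYTES_PER_PARAM = 2  # fp16/bf16

def weights_gib(n_params_billion: float) -> float:
    """GiB needed just to hold the model weights at 2 bytes/parameter."""
    return n_params_billion * 1e9 * BYTES_PER_PARAM / 2**30

for name, params in [("llama-3.1-8b", 8.0), ("llama-3.1-70b", 70.0)]:
    print(f"{name}: ~{weights_gib(params):.0f} GiB just for weights")
```

At ~130 GiB of weights alone, four 80 GB-class GPUs is a plausible floor for the 70B model; note, however, that the log above shows the pod failing while *downloading* the model from NGC, so GPU sizing is not the cause of this particular crash loop.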