
livenessProbe and readinessProbe fails pod to start #6469

Closed
wwolny opened this issue Oct 23, 2023 · 5 comments
Labels
bug Something isn't working

Comments

@wwolny

wwolny commented Oct 23, 2023

Description
I am trying to set up a Deployment for Triton Inference Server. When I add a livenessProbe or readinessProbe, the pod fails. When I run it without either of those probes, the pod starts successfully and I can send requests without any issues.

The 'model' is the only model in the /models/ directory. I got an error message similar to this closed issue:
#5786

Triton Information
What version of Triton are you using?
2.38.0

Are you using the Triton container or did you build it yourself?
I built it myself from the nvcr.io/nvidia/tritonserver:23.09-pyt-python-py3 image. In the Dockerfile I am only installing additional packages such as torch and the pip requirements.txt. I am also setting the HF_HOME, TRANSFORMERS_CACHE, HUGGINGFACE_HUB_CACHE and MPLCONFIGDIR env variables to cache paths that I create with mkdir.

When I try to add the following probes, the deployment fails:

    livenessProbe:
      httpGet:
        path: /v2/health/live
        port: http
    readinessProbe:
      initialDelaySeconds: 5
      periodSeconds: 5
      httpGet:
        path: /v2/health/ready
        port: http

When I comment them out, the deployment starts successfully and the pod runs without any issues. I see the same behavior when I change the kind from Deployment to Job.
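A guess at what is going on (my arithmetic, not taken from the logs): the livenessProbe above sets no timing fields, so the Kubernetes defaults apply (initialDelaySeconds 0, periodSeconds 10, failureThreshold 3), and a server that needs longer than that window to load its models gets killed mid-startup. A rough sketch of the window:

```python
# Back-of-the-envelope sketch (my numbers, not from the issue): how long the
# kubelet tolerates a failing livenessProbe before restarting the container.
# Defaults fill in any unset field: initialDelaySeconds=0, periodSeconds=10,
# failureThreshold=3.

def restart_window(initial_delay=0, period=10, failure_threshold=3):
    """Approximate seconds a never-live container survives: the initial
    delay plus one probe period per allowed failure."""
    return initial_delay + period * failure_threshold

# The livenessProbe above sets no timing fields, so the defaults give the
# server only about 30 seconds to come up before the first restart.
print(restart_window())
```

If model loading takes longer than that, the kubelet would restart the container before the server ever reports live, which would match the failure described here.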

Output
ERROR 2023-10-23T11:22:01.067582451Z [resource.labels.containerName: triton-server] +-----------+---------+--------+
ERROR 2023-10-23T11:22:01.067584751Z [resource.labels.containerName: triton-server] | Model | Version | Status |
ERROR 2023-10-23T11:22:01.067587571Z [resource.labels.containerName: triton-server] +-----------+---------+--------+
ERROR 2023-10-23T11:22:01.067590071Z [resource.labels.containerName: triton-server] | model | 1 | READY |
ERROR 2023-10-23T11:22:01.067592431Z [resource.labels.containerName: triton-server] +-----------+---------+--------+
ERROR 2023-10-23T11:22:01.067594561Z [resource.labels.containerName: triton-server] {}
INFO 2023-10-23T11:22:01.067980541Z [resource.labels.containerName: triton-server] Collecting CPU metrics
INFO 2023-10-23T11:22:01.068103831Z [resource.labels.containerName: triton-server] {"pid":"1"}
ERROR 2023-10-23T11:22:01.068153951Z [resource.labels.containerName: triton-server] +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
ERROR 2023-10-23T11:22:01.068159091Z [resource.labels.containerName: triton-server] | Option | Value |
ERROR 2023-10-23T11:22:01.068162151Z [resource.labels.containerName: triton-server] +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
ERROR 2023-10-23T11:22:01.068164531Z [resource.labels.containerName: triton-server] | server_id | triton |
ERROR 2023-10-23T11:22:01.068166831Z [resource.labels.containerName: triton-server] | server_version | 2.38.0 |
ERROR 2023-10-23T11:22:01.068169211Z [resource.labels.containerName: triton-server] | server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
ERROR 2023-10-23T11:22:01.068171311Z [resource.labels.containerName: triton-server] | model_repository_path[0] | /models/ |
ERROR 2023-10-23T11:22:01.068173531Z [resource.labels.containerName: triton-server] | model_control_mode | MODE_NONE |
ERROR 2023-10-23T11:22:01.068175871Z [resource.labels.containerName: triton-server] | strict_model_config | 0 |
ERROR 2023-10-23T11:22:01.068199571Z [resource.labels.containerName: triton-server] | rate_limit | OFF |
ERROR 2023-10-23T11:22:01.068202351Z [resource.labels.containerName: triton-server] | pinned_memory_pool_byte_size | 268435456 |
ERROR 2023-10-23T11:22:01.068204521Z [resource.labels.containerName: triton-server] | min_supported_compute_capability | 6.0 |
ERROR 2023-10-23T11:22:01.068206711Z [resource.labels.containerName: triton-server] | strict_readiness | 1 |
ERROR 2023-10-23T11:22:01.068208941Z [resource.labels.containerName: triton-server] | exit_timeout | 30 |
ERROR 2023-10-23T11:22:01.068211341Z [resource.labels.containerName: triton-server] | cache_enabled | 0 |
ERROR 2023-10-23T11:22:01.068213541Z [resource.labels.containerName: triton-server] +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
ERROR 2023-10-23T11:22:01.068215581Z [resource.labels.containerName: triton-server] {}
INFO 2023-10-23T11:22:01.068218741Z [resource.labels.containerName: triton-server] Waiting for in-flight requests to complete.
INFO 2023-10-23T11:22:01.068241131Z [resource.labels.containerName: triton-server] Timeout 30: Found 0 model versions that have in-flight inferences
INFO 2023-10-23T11:22:01.068246131Z [resource.labels.containerName: triton-server] All models are stopped, unloading models
INFO 2023-10-23T11:22:01.068248511Z [resource.labels.containerName: triton-server] Timeout 30: Found 1 live models and 0 in-flight non-inference requests
INFO 2023-10-23T11:22:02.068732922Z [resource.labels.containerName: triton-server] Timeout 29: Found 1 live models and 0 in-flight non-inference requests
INFO 2023-10-23T11:22:03.069179523Z [resource.labels.containerName: triton-server] Timeout 28: Found 1 live models and 0 in-flight non-inference requests
INFO 2023-10-23T11:22:03.330380370Z [resource.labels.containerName: triton-server] successfully unloaded 'model' version 1
INFO 2023-10-23T11:22:04.069446204Z [resource.labels.containerName: triton-server] Timeout 27: Found 0 live models and 0 in-flight non-inference requests
ERROR 2023-10-23T11:22:04.070221324Z [resource.labels.containerName: triton-server] error: creating server: Internal - failed to load all models
INFO 2023-10-23T11:22:08.780037735Z [resource.labels.containerName: triton-server] {}

Expected behavior
I expect pods to start with readinessProbe and livenessProbe.

@dyastremsky
Contributor

Can you get and share the verbose logs? Does your directory match the layout below?

    models/
      model_name/
        1/
        config.pbtxt

Does it start successfully when run locally on this folder? When you remove the liveness and readiness probes, does the server actually start up with the models loaded? Can you successfully run inference in that case?
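The layout above can be reproduced as a quick local sketch (placeholder names; the actual model artifact inside the version directory depends on the backend):

```shell
# Build a minimal Triton model-repository skeleton under /tmp (placeholder
# names; a real repository also needs the model file inside 1/).
mkdir -p /tmp/models/model_name/1
touch /tmp/models/model_name/config.pbtxt

# List the tree; the server would then be pointed at it with
#   tritonserver --model-repository=/tmp/models
find /tmp/models | sort
```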

@dyastremsky dyastremsky added the bug Something isn't working label Oct 23, 2023
@wwolny
Author

wwolny commented Oct 25, 2023

Yes, my /models/ directory matches this structure. When I start the server without the readiness and liveness probes, it starts successfully, I am able to run inference, and I get correct responses.
When I run the Triton server with the readiness and liveness probes, the pod managed by the Deployment keeps restarting with the following logs:
triton_logs.txt

@dyastremsky
Contributor

dyastremsky commented Oct 25, 2023

Can you try updating the values to extend the time for a successful probe? For example:

      livenessProbe:
        initialDelaySeconds: 15
        failureThreshold: 3
        periodSeconds: 10
        httpGet:
          path: /v2/health/live
          port: http
      readinessProbe:
        initialDelaySeconds: 5
        periodSeconds: 5
        failureThreshold: 3
        httpGet:
          path: /v2/health/ready
          port: http
      startupProbe:
        periodSeconds: 10
        failureThreshold: 30
        httpGet:
          path: /v2/health/ready
          port: http

Sourced from this pending PR.
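For anyone else landing here: the important addition is the startupProbe, since liveness and readiness checks are held off until it first succeeds. A rough sketch of the startup budget it buys (assuming the values in the snippet above; the formula is my approximation of kubelet behavior):

```python
# Sketch of the startup budget a startupProbe grants: the kubelet keeps
# retrying it, and only gives up after failureThreshold failed periods.

def startup_budget(period, failure_threshold, initial_delay=0):
    """Approximate seconds before the kubelet gives up on a container
    whose startup probe never succeeds."""
    return initial_delay + period * failure_threshold

# periodSeconds=10, failureThreshold=30 -> about 300 s for model loading,
# versus roughly 30 s under the default liveness settings.
print(startup_budget(period=10, failure_threshold=30))
```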

@wwolny
Author

wwolny commented Oct 26, 2023

Thank you! This setup worked.

@dyastremsky
Contributor

Any time, happy to hear it! We'll work on merging that PR soon.
