
livenessProbe and readinessProbe fails pod to start #6469

Closed
wwolny opened this issue Oct 23, 2023 · 5 comments
Labels
bug Something isn't working

Comments

@wwolny

wwolny commented Oct 23, 2023

Description
I am trying to set up a Deployment for Triton Inference Server. When I add a livenessProbe or readinessProbe, the pod fails. When I run it without either of those probes, the pod starts successfully and I can send requests without any issues.

The 'model' is the only model in the /models/ directory. I got an error message similar to this closed issue:
#5786

Triton Information
What version of Triton are you using?
2.38.0

Are you using the Triton container or did you build it yourself?
I built it myself from the nvcr.io/nvidia/tritonserver:23.09-pyt-python-py3 image. In the Dockerfile I am only installing additional packages such as torch and the pip requirements.txt. I am also setting the HF_HOME, TRANSFORMERS_CACHE, HUGGINGFACE_HUB_CACHE and MPLCONFIGDIR env variables to cache paths that I create with mkdir.

When I try to add the following probes, the deployment fails:

    livenessProbe:
      httpGet:
        path: /v2/health/live
        port: http
    readinessProbe:
      initialDelaySeconds: 5
      periodSeconds: 5
      httpGet:
        path: /v2/health/ready
        port: http

When I comment them out, the deployment starts successfully and the pod runs without any issues. I see the same behavior when I change the kind from Deployment to Job.
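A guess at what is going on (my arithmetic, not taken from the logs): the livenessProbe above sets no timing fields, so the Kubernetes defaults apply (initialDelaySeconds 0, periodSeconds 10, failureThreshold 3), and a server that needs longer than that window to load its models gets killed mid-startup. A rough sketch of the window:

```python
# Back-of-the-envelope sketch (my numbers, not from the issue): how long the
# kubelet tolerates a failing livenessProbe before restarting the container.
# Defaults fill in any unset field: initialDelaySeconds=0, periodSeconds=10,
# failureThreshold=3.

def restart_window(initial_delay=0, period=10, failure_threshold=3):
    """Approximate seconds a never-live container survives: the initial
    delay plus one probe period per allowed failure."""
    return initial_delay + period * failure_threshold

# The livenessProbe above sets no timing fields, so the defaults give the
# server only about 30 seconds to come up before the first restart.
print(restart_window())
```

If model loading takes longer than that, the kubelet would restart the container before the server ever reports live, which would match the failure described here.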

Output
ERROR 2023-10-23T11:22:01.067582451Z [resource.labels.containerName: triton-server] +-----------+---------+--------+
ERROR 2023-10-23T11:22:01.067584751Z [resource.labels.containerName: triton-server] | Model | Version | Status |
ERROR 2023-10-23T11:22:01.067587571Z [resource.labels.containerName: triton-server] +-----------+---------+--------+
ERROR 2023-10-23T11:22:01.067590071Z [resource.labels.containerName: triton-server] | model | 1 | READY |
ERROR 2023-10-23T11:22:01.067592431Z [resource.labels.containerName: triton-server] +-----------+---------+--------+
ERROR 2023-10-23T11:22:01.067594561Z [resource.labels.containerName: triton-server] {}
INFO 2023-10-23T11:22:01.067980541Z [resource.labels.containerName: triton-server] Collecting CPU metrics
INFO 2023-10-23T11:22:01.068103831Z [resource.labels.containerName: triton-server] {"pid":"1"}
ERROR 2023-10-23T11:22:01.068153951Z [resource.labels.containerName: triton-server] +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
ERROR 2023-10-23T11:22:01.068159091Z [resource.labels.containerName: triton-server] | Option | Value |
ERROR 2023-10-23T11:22:01.068162151Z [resource.labels.containerName: triton-server] +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
ERROR 2023-10-23T11:22:01.068164531Z [resource.labels.containerName: triton-server] | server_id | triton |
ERROR 2023-10-23T11:22:01.068166831Z [resource.labels.containerName: triton-server] | server_version | 2.38.0 |
ERROR 2023-10-23T11:22:01.068169211Z [resource.labels.containerName: triton-server] | server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
ERROR 2023-10-23T11:22:01.068171311Z [resource.labels.containerName: triton-server] | model_repository_path[0] | /models/ |
ERROR 2023-10-23T11:22:01.068173531Z [resource.labels.containerName: triton-server] | model_control_mode | MODE_NONE |
ERROR 2023-10-23T11:22:01.068175871Z [resource.labels.containerName: triton-server] | strict_model_config | 0 |
ERROR 2023-10-23T11:22:01.068199571Z [resource.labels.containerName: triton-server] | rate_limit | OFF |
ERROR 2023-10-23T11:22:01.068202351Z [resource.labels.containerName: triton-server] | pinned_memory_pool_byte_size | 268435456 |
ERROR 2023-10-23T11:22:01.068204521Z [resource.labels.containerName: triton-server] | min_supported_compute_capability | 6.0 |
ERROR 2023-10-23T11:22:01.068206711Z [resource.labels.containerName: triton-server] | strict_readiness | 1 |
ERROR 2023-10-23T11:22:01.068208941Z [resource.labels.containerName: triton-server] | exit_timeout | 30 |
ERROR 2023-10-23T11:22:01.068211341Z [resource.labels.containerName: triton-server] | cache_enabled | 0 |
ERROR 2023-10-23T11:22:01.068213541Z [resource.labels.containerName: triton-server] +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
ERROR 2023-10-23T11:22:01.068215581Z [resource.labels.containerName: triton-server] {}
INFO 2023-10-23T11:22:01.068218741Z [resource.labels.containerName: triton-server] Waiting for in-flight requests to complete.
INFO 2023-10-23T11:22:01.068241131Z [resource.labels.containerName: triton-server] Timeout 30: Found 0 model versions that have in-flight inferences
INFO 2023-10-23T11:22:01.068246131Z [resource.labels.containerName: triton-server] All models are stopped, unloading models
INFO 2023-10-23T11:22:01.068248511Z [resource.labels.containerName: triton-server] Timeout 30: Found 1 live models and 0 in-flight non-inference requests
INFO 2023-10-23T11:22:02.068732922Z [resource.labels.containerName: triton-server] Timeout 29: Found 1 live models and 0 in-flight non-inference requests
INFO 2023-10-23T11:22:03.069179523Z [resource.labels.containerName: triton-server] Timeout 28: Found 1 live models and 0 in-flight non-inference requests
INFO 2023-10-23T11:22:03.330380370Z [resource.labels.containerName: triton-server] successfully unloaded 'model' version 1
INFO 2023-10-23T11:22:04.069446204Z [resource.labels.containerName: triton-server] Timeout 27: Found 0 live models and 0 in-flight non-inference requests
ERROR 2023-10-23T11:22:04.070221324Z [resource.labels.containerName: triton-server] error: creating server: Internal - failed to load all models
INFO 2023-10-23T11:22:08.780037735Z [resource.labels.containerName: triton-server] {}

Expected behavior
I expect pods to start with readinessProbe and livenessProbe.

@dyastremsky
Contributor

Can you get and share the verbose logs? Does your directory match the layout below?

    models/
      model_name/
        1/
        config.pbtxt

Does it start successfully when run locally on this folder? When you remove the liveness and readiness probes, does the server actually start up with the models loaded? Can you successfully run inference in that case?
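The layout above can be reproduced as a quick local sketch (placeholder names; the actual model artifact inside the version directory depends on the backend):

```shell
# Build a minimal Triton model-repository skeleton under /tmp (placeholder
# names; a real repository also needs the model file inside 1/).
mkdir -p /tmp/models/model_name/1
touch /tmp/models/model_name/config.pbtxt

# List the tree; the server would then be pointed at it with
#   tritonserver --model-repository=/tmp/models
find /tmp/models | sort
```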

@dyastremsky dyastremsky added the bug Something isn't working label Oct 23, 2023
@wwolny
Author

wwolny commented Oct 25, 2023

Yes, my /models/ directory matches this structure. When I start the server without the readiness and liveness probes, it starts successfully, I am able to run inference, and I get correct responses.
When I run the Triton server with the readiness and liveness probes, the pod managed by the Deployment keeps restarting with the following logs:
triton_logs.txt

@dyastremsky
Contributor

dyastremsky commented Oct 25, 2023

Can you try updating the values to extend the time for a successful probe? For example:

      livenessProbe:
        initialDelaySeconds: 15
        failureThreshold: 3
        periodSeconds: 10
        httpGet:
          path: /v2/health/live
          port: http
      readinessProbe:
        initialDelaySeconds: 5
        periodSeconds: 5
        failureThreshold: 3
        httpGet:
          path: /v2/health/ready
          port: http
      startupProbe:
        periodSeconds: 10
        failureThreshold: 30
        httpGet:
          path: /v2/health/ready
          port: http

Sourced from this pending PR.
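For anyone else landing here: the important addition is the startupProbe, since liveness and readiness checks are held off until it first succeeds. A rough sketch of the startup budget it buys (assuming the values in the snippet above; the formula is my approximation of kubelet behavior):

```python
# Sketch of the startup budget a startupProbe grants: the kubelet keeps
# retrying it, and only gives up after failureThreshold failed periods.

def startup_budget(period, failure_threshold, initial_delay=0):
    """Approximate seconds before the kubelet gives up on a container
    whose startup probe never succeeds."""
    return initial_delay + period * failure_threshold

# periodSeconds=10, failureThreshold=30 -> about 300 s for model loading,
# versus roughly 30 s under the default liveness settings.
print(startup_budget(period=10, failure_threshold=30))
```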

@wwolny
Author

wwolny commented Oct 26, 2023

Thank you! This setup worked.

@dyastremsky
Contributor

Any time, happy to hear it! We'll work on merging that PR soon.
