livenessProbe and readinessProbe cause pod to fail to start #6469
Comments
Can you get and share the verbose logs? So your directory matches the below? Does it start successfully when run locally on this folder? When you remove the liveness and readiness probes, does the server actually start up with the models loaded? Can you successfully run inference in that case?
Yes, my /models/ directory matches this structure. When I start the server without readiness and liveness probes, it starts successfully and I am able to run inference and get correct responses.
Can you try updating the values to extend the time for a successful probe? For example:
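The exact values from the comment are not preserved in this copy of the thread; a sketch of what extended probe timings might look like (the numbers below are illustrative assumptions, not the values from the PR) is:

```yaml
# Illustrative values only -- tune initialDelaySeconds and
# failureThreshold to cover your model's actual load time.
livenessProbe:
  initialDelaySeconds: 15
  periodSeconds: 10
  failureThreshold: 3
  httpGet:
    path: /v2/health/live
    port: http
readinessProbe:
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 30
  httpGet:
    path: /v2/health/ready
    port: http
```

The key idea is to give the server enough time to finish loading models before a failed probe causes Kubernetes to kill the container.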
Sourced from this pending PR.
Thank you! This setup worked.
Any time, happy to hear it! We'll work on merging that PR soon. |
Description
I am trying to set up a Deployment for Triton Inference Server. When I add a livenessProbe or readinessProbe, the pod fails. When I run it without either of those probes, the pod starts successfully and I can send requests without any issues.
'model' is the only model in the /models/ directory. I got an error message similar to this closed issue:
#5786
Triton Information
What version of Triton are you using?
2.38.0
Are you using the Triton container or did you build it yourself?
I built it myself from the nvcr.io/nvidia/tritonserver:23.09-pyt-python-py3 image. In the Dockerfile I am only installing additional packages like torch and a pip requirements.txt. I am also setting the HF_HOME, TRANSFORMERS_CACHE, HUGGINGFACE_HUB_CACHE, and MPLCONFIGDIR environment variables to cache paths that I create using mkdir.
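For reference, a minimal sketch of the kind of Dockerfile described above (the package list and cache paths are illustrative assumptions, not the author's exact file):

```dockerfile
# Sketch only -- package names and cache paths are assumed for illustration.
FROM nvcr.io/nvidia/tritonserver:23.09-pyt-python-py3

# Writable cache directories for Hugging Face and matplotlib
RUN mkdir -p /opt/cache/hf /opt/cache/mpl
ENV HF_HOME=/opt/cache/hf \
    TRANSFORMERS_CACHE=/opt/cache/hf \
    HUGGINGFACE_HUB_CACHE=/opt/cache/hf \
    MPLCONFIGDIR=/opt/cache/mpl

# Additional Python dependencies
COPY requirements.txt /tmp/requirements.txt
RUN pip install torch && pip install -r /tmp/requirements.txt
```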
When I try to add the following probes, the deployment fails:
```yaml
livenessProbe:
  httpGet:
    path: /v2/health/live
    port: http
readinessProbe:
  initialDelaySeconds: 5
  periodSeconds: 5
  httpGet:
    path: /v2/health/ready
    port: http
```
When I comment them out, the deployment starts successfully and the pod runs without any issues. I see the same behavior when I change the kind from Deployment to Job.
Output
ERROR 2023-10-23T11:22:01.067582451Z [resource.labels.containerName: triton-server] +-----------+---------+--------+
ERROR 2023-10-23T11:22:01.067584751Z [resource.labels.containerName: triton-server] | Model | Version | Status |
ERROR 2023-10-23T11:22:01.067587571Z [resource.labels.containerName: triton-server] +-----------+---------+--------+
ERROR 2023-10-23T11:22:01.067590071Z [resource.labels.containerName: triton-server] | model | 1 | READY |
ERROR 2023-10-23T11:22:01.067592431Z [resource.labels.containerName: triton-server] +-----------+---------+--------+
ERROR 2023-10-23T11:22:01.067594561Z [resource.labels.containerName: triton-server] {}
INFO 2023-10-23T11:22:01.067980541Z [resource.labels.containerName: triton-server] Collecting CPU metrics
INFO 2023-10-23T11:22:01.068103831Z [resource.labels.containerName: triton-server] {"pid":"1"}
ERROR 2023-10-23T11:22:01.068153951Z [resource.labels.containerName: triton-server] +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
ERROR 2023-10-23T11:22:01.068159091Z [resource.labels.containerName: triton-server] | Option | Value |
ERROR 2023-10-23T11:22:01.068162151Z [resource.labels.containerName: triton-server] +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
ERROR 2023-10-23T11:22:01.068164531Z [resource.labels.containerName: triton-server] | server_id | triton |
ERROR 2023-10-23T11:22:01.068166831Z [resource.labels.containerName: triton-server] | server_version | 2.38.0 |
ERROR 2023-10-23T11:22:01.068169211Z [resource.labels.containerName: triton-server] | server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
ERROR 2023-10-23T11:22:01.068171311Z [resource.labels.containerName: triton-server] | model_repository_path[0] | /models/ |
ERROR 2023-10-23T11:22:01.068173531Z [resource.labels.containerName: triton-server] | model_control_mode | MODE_NONE |
ERROR 2023-10-23T11:22:01.068175871Z [resource.labels.containerName: triton-server] | strict_model_config | 0 |
ERROR 2023-10-23T11:22:01.068199571Z [resource.labels.containerName: triton-server] | rate_limit | OFF |
ERROR 2023-10-23T11:22:01.068202351Z [resource.labels.containerName: triton-server] | pinned_memory_pool_byte_size | 268435456 |
ERROR 2023-10-23T11:22:01.068204521Z [resource.labels.containerName: triton-server] | min_supported_compute_capability | 6.0 |
ERROR 2023-10-23T11:22:01.068206711Z [resource.labels.containerName: triton-server] | strict_readiness | 1 |
ERROR 2023-10-23T11:22:01.068208941Z [resource.labels.containerName: triton-server] | exit_timeout | 30 |
ERROR 2023-10-23T11:22:01.068211341Z [resource.labels.containerName: triton-server] | cache_enabled | 0 |
ERROR 2023-10-23T11:22:01.068213541Z [resource.labels.containerName: triton-server] +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
ERROR 2023-10-23T11:22:01.068215581Z [resource.labels.containerName: triton-server] {}
INFO 2023-10-23T11:22:01.068218741Z [resource.labels.containerName: triton-server] Waiting for in-flight requests to complete.
INFO 2023-10-23T11:22:01.068241131Z [resource.labels.containerName: triton-server] Timeout 30: Found 0 model versions that have in-flight inferences
INFO 2023-10-23T11:22:01.068246131Z [resource.labels.containerName: triton-server] All models are stopped, unloading models
INFO 2023-10-23T11:22:01.068248511Z [resource.labels.containerName: triton-server] Timeout 30: Found 1 live models and 0 in-flight non-inference requests
INFO 2023-10-23T11:22:02.068732922Z [resource.labels.containerName: triton-server] Timeout 29: Found 1 live models and 0 in-flight non-inference requests
INFO 2023-10-23T11:22:03.069179523Z [resource.labels.containerName: triton-server] Timeout 28: Found 1 live models and 0 in-flight non-inference requests
INFO 2023-10-23T11:22:03.330380370Z [resource.labels.containerName: triton-server] successfully unloaded 'model' version 1
INFO 2023-10-23T11:22:04.069446204Z [resource.labels.containerName: triton-server] Timeout 27: Found 0 live models and 0 in-flight non-inference requests
ERROR 2023-10-23T11:22:04.070221324Z [resource.labels.containerName: triton-server] error: creating server: Internal - failed to load all models
INFO 2023-10-23T11:22:08.780037735Z [resource.labels.containerName: triton-server] {}
Expected behavior
I expect pods to start with the readinessProbe and livenessProbe configured.