Added readiness and liveness probes #8488
Conversation
Important: Review skipped. Auto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI.

Walkthrough: The changes introduce a new management command, along with readiness and liveness probes across the Helm chart deployments.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant WorkerProbe
    participant QueueManager
    participant Worker
    User->>WorkerProbe: Run command with queue names
    WorkerProbe->>QueueManager: Validate queue names
    QueueManager->>WorkerProbe: Return valid queues
    WorkerProbe->>Worker: Check health status
    Worker->>WorkerProbe: Return health status
    WorkerProbe->>User: Report health status
```
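As a rough illustration of the flow in the diagram, here is a minimal Python sketch. The queue names, the heartbeat data structure, and the `check_workers` helper are all hypothetical assumptions for illustration, not the actual `workerprobe` implementation:

```python
import time

# Hypothetical set of known queues; the real command would read these
# from the application's queue configuration.
VALID_QUEUES = {"import", "export", "annotation", "webhooks",
                "quality_reports", "analytics_reports",
                "notifications", "cleaning"}

def check_workers(queue_names, heartbeats, max_age_seconds=120, now=None):
    """Return (ok, message). A queue is considered healthy if at least one
    worker listening on it sent a heartbeat within max_age_seconds."""
    now = time.time() if now is None else now
    unknown = [q for q in queue_names if q not in VALID_QUEUES]
    if unknown:
        return False, f"unknown queues: {', '.join(unknown)}"
    for q in queue_names:
        fresh = [t for t in heartbeats.get(q, []) if now - t <= max_age_seconds]
        if not fresh:
            return False, f"no live worker for queue '{q}'"
    return True, "ok"

if __name__ == "__main__":
    now = 1_000_000.0
    beats = {"import": [now - 30], "export": [now - 300]}
    print(check_workers(["import"], beats, now=now))  # fresh heartbeat
    print(check_workers(["export"], beats, now=now))  # stale heartbeat
    print(check_workers(["bogus"], beats, now=now))   # unknown queue
```

An exec probe wrapping such a command exits nonzero on the unhealthy paths, which is all Kubernetes needs to mark the pod not-ready or restart it.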
Actionable comments posted: 6
🧹 Outside diff range and nitpick comments (17)
helm-chart/Chart.yaml (1)
Line range hint 24-61: Consider using consistent versioning patterns for dependencies.

While the current dependency versioning works, there's an inconsistency in how versions are specified:
- Some use exact versions (e.g., nuclio: 0.19.0, traefik: 25.0.0)
- Others use version ranges (e.g., postgresql: "12.1.*", vector: "0.19.*")
Consider adopting a consistent approach across all dependencies. Using version ranges (e.g., "X.Y.*") for all dependencies could help automatically incorporate bug fixes while still maintaining control over major and minor version updates.
helm-chart/templates/cvat_frontend/deployment.yml (1)
Lines 48-59: LGTM! Consider differentiating readiness and liveness probes.

The addition of readiness and liveness probes for the frontend container is a great improvement for monitoring the health and availability of the service. The implementation is flexible, allowing for easy configuration through Helm values.
Some suggestions for potential improvements:
- Consider differentiating the readiness and liveness probes. While TCP checks are good, you might want to use an HTTP GET request for the readiness probe to ensure the application is fully initialized and ready to serve traffic.
- For the liveness probe, you could keep the TCP check or use a simple HTTP endpoint that doesn't put much load on the system.
Example:
```yaml
readinessProbe:
  httpGet:
    path: /
    port: 80
  # ... other configurations
livenessProbe:
  tcpSocket:
    port: 80
  # ... other configurations
```

This differentiation can provide more accurate information about the application's state.
helm-chart/templates/cvat_opa/deployment.yml (2)
Lines 56-61: Readiness probe configuration looks good.

The readiness probe is well-configured using a tcpSocket check on port 8181, which is appropriate for an OPA server. The conditional inclusion and use of the `toYaml` and `omit` functions provide flexibility for different deployment scenarios. Consider adding a comment explaining the purpose of the readiness probe and the significance of port 8181 for better maintainability. For example:
```yaml
# Readiness probe to ensure OPA server is listening on port 8181
{{- if .Values.cvat.opa.readinessProbe.enabled }}
readinessProbe:
  tcpSocket:
    port: 8181 # Default OPA server port
  {{- toYaml (omit .Values.cvat.opa.readinessProbe "enabled") | nindent 12 }}
{{- end }}
```
Lines 62-67: Liveness probe configuration is correct.

The liveness probe is well-configured, mirroring the readiness probe setup. Using a tcpSocket check on port 8181 is appropriate for monitoring the OPA server's health. For consistency with the readiness probe suggestion, consider adding a comment explaining the purpose of the liveness probe:
```yaml
# Liveness probe to ensure OPA server remains responsive
{{- if .Values.cvat.opa.livenessProbe.enabled }}
livenessProbe:
  tcpSocket:
    port: 8181 # Default OPA server port
  {{- toYaml (omit .Values.cvat.opa.livenessProbe "enabled") | nindent 12 }}
{{- end }}
```

helm-chart/templates/cvat_backend/worker_webhooks/deployment.yml (2)
Lines 63-72: Readiness probe implementation looks good.

The readiness probe is well-implemented:
- It's conditionally included, allowing flexibility in deployment.
- Uses an exec probe with a custom management command, which is appropriate for checking application-specific health.
- Allows for additional configuration through Helm values.
Consider adding a comment explaining the purpose of the `workerprobe webhooks` command for better maintainability.
Lines 73-82: Consider differentiating between readiness and liveness probes.

The liveness probe implementation is correct, but it's identical to the readiness probe. While this can be valid in some cases, it's often beneficial to have different checks for readiness and liveness:
- Readiness typically checks if the application is ready to receive traffic.
- Liveness checks if the application is functioning correctly.
Using the same check for both might lead to unnecessary restarts if the application is temporarily unable to handle requests but is still running.
If possible, consider implementing a separate check for the liveness probe that focuses on the overall health of the application rather than its readiness to handle requests.
helm-chart/templates/cvat_backend/worker_qualityreports/deployment.yml (2)
Lines 63-72: Readiness probe implementation looks good.

The readiness probe is well-implemented, using a custom management command `workerprobe` to check the worker's readiness for the `quality_reports` queue. The conditional inclusion and use of the `omit` function allow for flexible configuration through Helm values. Consider adding a comment explaining the purpose of the `workerprobe` command for better clarity:

```yaml
readinessProbe:
  exec:
    command:
      - python
      - manage.py
      - workerprobe # Custom command to check worker readiness
      - quality_reports
```
Lines 73-82: Liveness probe implementation is correct.

The liveness probe is correctly implemented, using the same custom management command `workerprobe` as the readiness probe. This is appropriate for checking the worker's ability to handle tasks in the `quality_reports` queue. For consistency with the readiness probe, consider adding a similar comment:

```yaml
livenessProbe:
  exec:
    command:
      - python
      - manage.py
      - workerprobe # Custom command to check worker liveness
      - quality_reports
```

helm-chart/templates/cvat_backend/worker_analyticsreports/deployment.yml (1)
Lines 64-83: Excellent addition of readiness and liveness probes!

The introduction of readiness and liveness probes for the CVAT backend worker handling analytics reports is a great improvement for reliability and observability. Here are some observations and suggestions:
- The conditional inclusion of probes based on their enabled status in the values file provides flexibility in configuration.
- Using the same command (`python manage.py workerprobe analytics_reports`) for both readiness and liveness probes ensures consistency in health checks.

Suggestions for consideration:
- Consider if different checks for readiness and liveness might be more appropriate. Readiness typically checks if the application is ready to handle requests, while liveness checks if it's functioning correctly.
- Document the implementation and behavior of the `workerprobe` command, as it's not visible in this file.
- Monitor the impact of these probes on worker performance, as frequent health checks might increase the load.
To further improve this implementation, consider the following:
- Document the `workerprobe` command's behavior and any potential side effects.
- Set up monitoring to track the frequency and duration of probe checks to ensure they don't impact worker performance significantly.
- Consider implementing different checks for readiness and liveness if the application's startup and runtime health criteria differ.
helm-chart/templates/cvat_backend/utils/deployment.yml (1)
Lines 63-73: Readiness probe configuration looks good; consider explicit key definitions.

The addition of the readiness probe is a good practice for ensuring the backend utilities are ready to handle requests. The use of a custom management command (`workerprobe`) to check specific worker tasks (notifications and cleaning) is appropriate for this deployment. Consider explicitly defining the keys you expect from `.Values.cvat.backend.worker.readinessProbe` instead of using `toYaml` for the entire object. For example:

```yaml
readinessProbe:
  exec:
    command:
      - python
      - manage.py
      - workerprobe
      - notifications
      - cleaning
  initialDelaySeconds: {{ .Values.cvat.backend.worker.readinessProbe.initialDelaySeconds }}
  periodSeconds: {{ .Values.cvat.backend.worker.readinessProbe.periodSeconds }}
  timeoutSeconds: {{ .Values.cvat.backend.worker.readinessProbe.timeoutSeconds }}
  successThreshold: {{ .Values.cvat.backend.worker.readinessProbe.successThreshold }}
  failureThreshold: {{ .Values.cvat.backend.worker.readinessProbe.failureThreshold }}
```

This approach would make it clearer which configuration options are available and expected.
helm-chart/templates/cvat_backend/worker_import/deployment.yml (1)
Lines 63-72: LGTM! Consider adding a comment for clarity.

The readiness probe configuration looks good. It's conditionally added and uses a custom management command to check the worker's readiness, which is a robust approach. Consider adding a brief comment explaining the purpose of the `workerprobe` command for better maintainability:

```yaml
readinessProbe:
  exec:
    command: # Check if the import worker is ready to process tasks
      - python
      - manage.py
      - workerprobe
      - import
```

helm-chart/templates/cvat_backend/worker_annotation/deployment.yml (1)
Lines 63-82: LGTM! Readiness and liveness probes added correctly.

The implementation of readiness and liveness probes for the worker annotation container looks good. A few observations:
- Both probes use the same command, `python manage.py workerprobe annotation`, which is appropriate for checking the worker's ability to process annotation tasks.
- The probes are conditionally included based on the `enabled` flags in the values file, providing flexibility in deployment configurations.
- The use of the `omit` function ensures that the `enabled` flag is not included in the actual probe configuration.

For improved clarity, consider adding a comment explaining the purpose of the `workerprobe` command and why it's suitable for both readiness and liveness checks. This would help future maintainers understand the rationale behind this implementation.

helm-chart/templates/cvat_backend/worker_export/deployment.yml (1)
Lines 64-73: LGTM! Consider adding a comment for clarity.

The addition of the readiness probe is well-implemented. It's conditionally included and uses a custom management command for health checking, which is a good practice for ensuring the export worker is ready to handle requests. Consider adding a brief comment explaining the purpose of the `workerprobe export` command for better maintainability. For example:

```yaml
readinessProbe:
  exec:
    command:
      - python
      - manage.py
      - workerprobe # Custom management command to check worker health
      - export # Specific to the export worker
```

helm-chart/values.yaml (4)
Lines 45-53: LGTM: Backend server probe configuration looks good.

The readiness and liveness probe configurations for the backend server are well-structured and reasonable. The longer initial delay (60s) for the liveness probe allows sufficient time for the server to start up, while the failure threshold of 10 provides a buffer against transient issues.
Consider adding a `failureThreshold` to the readiness probe as well, to provide more resilience against temporary network issues or brief spikes in server load. For example:

```yaml
readinessProbe:
  enabled: true
  periodSeconds: 15
  initialDelaySeconds: 15
  failureThreshold: 3
```
Lines 55-64: LGTM: Worker probe configuration is appropriate.

The readiness and liveness probe configurations for the worker are well-suited for background processes. The longer period (120s) allows for less frequent checks, which is appropriate for workers that might take longer to complete tasks.
Consider adding a `failureThreshold` to both probes to provide more resilience against temporary issues. For example:

```yaml
readinessProbe:
  enabled: true
  periodSeconds: 120
  initialDelaySeconds: 30
  timeoutSeconds: 10
  failureThreshold: 3
livenessProbe:
  enabled: true
  periodSeconds: 120
  initialDelaySeconds: 30
  timeoutSeconds: 10
  failureThreshold: 3
```
Lines 242-249: LGTM: OPA probe configuration looks good with room for minor improvements.

The readiness and liveness probe configurations for the Open Policy Agent (OPA) are well-structured. The frequent checks (every 15 seconds) ensure quick detection of any issues, and the initial delay of 15 seconds should be sufficient for OPA to start up.
Consider the following improvements:
- Add a `failureThreshold` to both probes for better resilience against transient issues.
- Include a `httpGet` section to specify the health check endpoint for OPA.

Here's a suggested configuration:

```yaml
readinessProbe:
  enabled: true
  httpGet:
    path: /health
    port: 8181
  periodSeconds: 15
  initialDelaySeconds: 15
  failureThreshold: 3
livenessProbe:
  enabled: true
  httpGet:
    path: /health
    port: 8181
  periodSeconds: 15
  initialDelaySeconds: 15
  failureThreshold: 3
```

These changes will make the probes more specific to OPA and provide better control over failure conditions.
Lines 300-307: LGTM: Kvrocks probe configuration is good with room for minor enhancements.

The readiness and liveness probe configurations for kvrocks are well-structured. The frequent checks (every 10 seconds) ensure quick detection of any issues, and the initial delay of 30 seconds should be sufficient for kvrocks to start up.
Consider the following improvements:
- Add a `failureThreshold` to both probes for better resilience against transient issues.
- Include a `tcpSocket` check instead of the default HTTP check, as kvrocks is a Redis-compatible key-value database.

Here's a suggested configuration:

```yaml
readinessProbe:
  enabled: true
  tcpSocket:
    port: 6666 # Default kvrocks port
  periodSeconds: 10
  initialDelaySeconds: 30
  failureThreshold: 3
livenessProbe:
  enabled: true
  tcpSocket:
    port: 6666 # Default kvrocks port
  periodSeconds: 10
  initialDelaySeconds: 30
  failureThreshold: 3
```

These changes will make the probes more specific to kvrocks and provide better control over failure conditions.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (15)
- cvat/apps/health/management/commands/workerprobe.py (1 hunks)
- helm-chart/Chart.yaml (1 hunks)
- helm-chart/templates/cvat_backend/server/deployment.yml (1 hunks)
- helm-chart/templates/cvat_backend/utils/deployment.yml (1 hunks)
- helm-chart/templates/cvat_backend/worker_analyticsreports/deployment.yml (1 hunks)
- helm-chart/templates/cvat_backend/worker_annotation/deployment.yml (1 hunks)
- helm-chart/templates/cvat_backend/worker_export/deployment.yml (1 hunks)
- helm-chart/templates/cvat_backend/worker_import/deployment.yml (1 hunks)
- helm-chart/templates/cvat_backend/worker_qualityreports/deployment.yml (1 hunks)
- helm-chart/templates/cvat_backend/worker_webhooks/deployment.yml (1 hunks)
- helm-chart/templates/cvat_frontend/deployment.yml (1 hunks)
- helm-chart/templates/cvat_kvrocks/statefulset.yml (1 hunks)
- helm-chart/templates/cvat_opa/deployment.yml (1 hunks)
- helm-chart/values.yaml (4 hunks)
- supervisord/utils.conf (1 hunks)
🔇 Additional comments (16)
supervisord/utils.conf (4)
Line 27: LGTM: Improved robustness for rqscheduler.

The addition of `autorestart=true` for the rqscheduler program is a good improvement. This ensures that the scheduler process will automatically restart if it exits unexpectedly, enhancing the overall reliability of the system.
Lines 34-35: LGTM: Enhanced scalability and reliability for rqworker-notifications.

The changes to the rqworker-notifications configuration are well-considered:

- Using `numprocs=%(ENV_NUMPROCS)s` allows for dynamic scaling of notification workers based on environment variables, improving deployment flexibility.
- Adding `autorestart=true` ensures that notification workers automatically restart if they exit unexpectedly, enhancing system reliability.
ensures that notification workers automatically restart if they exit unexpectedly, enhancing system reliability.These modifications will contribute to a more robust and scalable notification processing system.
Line range hint 44: LGTM: Improved reliability for rqworker-cleaning.

The addition of `autorestart=true` for the rqworker-cleaning program is a positive change. This ensures that the cleaning worker process will automatically restart if it exits unexpectedly, enhancing the overall reliability of the system. It's worth noting that the `numprocs` setting for this worker was already using the environment variable `%(ENV_NUMPROCS)s`, which is consistent with the change made to the notifications worker. This maintains a uniform approach to worker process scaling across different worker types.
Line range hint 27-44: Summary: Improved robustness and scalability of worker processes.

The changes made to the supervisord configuration file consistently enhance the reliability and scalability of the CVAT worker processes:

- All worker types (rqscheduler, rqworker-notifications, and rqworker-cleaning) now have `autorestart=true`, ensuring automatic recovery from unexpected exits.
- The rqworker-notifications configuration now uses a dynamic `numprocs` setting, aligning with the existing configuration of rqworker-cleaning.

These improvements contribute to a more robust and flexible CVAT application, aligning well with the PR objectives of enhancing reliability and introducing better health management. The consistent approach across different worker types is commendable and promotes maintainability.
helm-chart/Chart.yaml (1)
Line 19: LGTM: Version bump is appropriate for the changes.

The increment of the chart version from 0.13.2 to 0.14.0 is consistent with Semantic Versioning for adding new features (readiness and liveness probes). This change aligns well with the PR objectives.
As a reminder, please ensure that you've updated the changelog to reflect this version change and the new features added. You can verify this with the following script:
helm-chart/templates/cvat_opa/deployment.yml (1)
Lines 56-67: Summary: Readiness and liveness probes enhance OPA deployment reliability.

The addition of readiness and liveness probes to the OPA deployment is a valuable improvement. These probes will help Kubernetes better manage the lifecycle and health of the OPA container, ensuring that traffic is only routed to ready instances and that unresponsive instances are restarted.
Key points:
- Both probes use tcpSocket checks on port 8181, which is appropriate for the OPA server.
- The probes are conditionally included, allowing for flexible configuration.
- Additional probe settings can be easily added through the values file.
These changes align well with the PR objective of adding readiness and liveness probes to improve the reliability of the CVAT application in containerized environments.
helm-chart/templates/cvat_backend/worker_webhooks/deployment.yml (1)
Lines 63-82: Request for additional information and consistency check.

The implementation of the readiness and liveness probes looks good, but I have a few questions and suggestions:

Could you provide more information about the `workerprobe` command? It would be helpful to have a brief explanation of what it checks and how it determines the health of the webhooks worker.

Have similar probes been added to other deployments in the CVAT application? It would be good to ensure consistency across all components.
The configuration of the probes (like initial delay, timeout, etc.) is not visible in this file. Are these configured elsewhere, or are you using default values?
To check for consistency across the application, you can run the following command:
This will help verify if similar probes have been added to other components of the application.
✅ Verification successful
Probes Implemented Consistently Across Workers with Centralized Configuration
- `workerprobe` command: The `workerprobe` command is consistently used across all worker deployments (e.g., analytics_reports, annotation, export, import, quality_reports, webhooks) to perform health checks. This standardized approach ensures reliable monitoring of worker components.
- Consistency across deployments: Similar readiness and liveness probes are implemented in all relevant deployments within the CVAT application, maintaining consistency and reliability across different services.
- Centralized probe configuration: The configurations for the probes, such as `periodSeconds`, `initialDelaySeconds`, and `timeoutSeconds`, are centralized in the `helm-chart/values.yaml` file. This ensures that probe settings are consistent and easily manageable across the entire codebase.

🏁 Scripts executed
The following scripts were executed for the analysis:
Script:

```shell
#!/bin/bash
# Check for readiness and liveness probes in other deployment files
fd -e yml -e yaml | xargs grep -n -A 10 'readinessProbe\|livenessProbe'
```

Length of output: 33719
helm-chart/templates/cvat_backend/worker_qualityreports/deployment.yml (1)
Lines 63-82: Verify probe implementation and consider default parameters.

The readiness and liveness probes are well-configured, but there are two points to consider:
1. Verify that the `workerprobe` management command performs appropriate checks for both readiness and liveness of the `quality_reports` worker. Ensure it checks connectivity to required services and the ability to process tasks.
2. Consider specifying default values for probe parameters such as `initialDelaySeconds`, `periodSeconds`, `timeoutSeconds`, etc. This can help fine-tune the probe behavior and prevent potential issues during container startup or under high load.

Example:
```yaml
readinessProbe:
  exec:
    command:
      - python
      - manage.py
      - workerprobe
      - quality_reports
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
```

To verify the `workerprobe` command implementation, you can run:

This will help ensure that the probe performs appropriate checks for the worker's health.
helm-chart/templates/cvat_kvrocks/statefulset.yml (1)
Lines 63-71: Readiness probe implementation looks good!

The readiness probe for the kvrocks container is well-implemented:
- It uses an appropriate exec probe to check the Redis server's status.
- The command correctly checks for both "PONG" and "NOAUTH" responses, covering different server states.
- The probe is conditionally included, allowing for flexible deployment configurations.
- Additional configuration can be customized through the Values file, promoting reusability.
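The dual "PONG or NOAUTH" acceptance can be expressed as a tiny predicate. This is an illustrative Python sketch of the probe's decision logic (the function name is hypothetical; the actual probe is the shell pipeline in the template):

```python
def kvrocks_probe_ok(reply: str) -> bool:
    # "PONG"   -> the server answered the PING normally.
    # "NOAUTH" -> the server is up but requires authentication; for a health
    #             check, this still proves the process accepts connections.
    return "PONG" in reply or "NOAUTH" in reply

# Example replies a redis-cli PING might produce:
print(kvrocks_probe_ok("PONG"))                           # healthy
print(kvrocks_probe_ok("NOAUTH Authentication required."))  # healthy (auth on)
print(kvrocks_probe_ok(""))                               # connection failed
```

Accepting NOAUTH avoids false negatives when the probe command is not configured with the server password.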
helm-chart/templates/cvat_backend/server/deployment.yml (2)
Lines 68-74: Readiness probe implementation looks good.

The readiness probe is well-implemented:
- It's conditionally included, allowing for flexibility in deployment configurations.
- The `/api/server/about` endpoint is a suitable choice for checking readiness.
- The use of the `omit` function ensures clean configuration by excluding the `enabled` flag.
Lines 75-81: Liveness probe implementation is correct.

The liveness probe is correctly implemented:
- It's conditionally included, providing deployment flexibility.
- The `/api/server/health/` endpoint is an appropriate choice for checking the server's health.
- The `omit` function is used to exclude the `enabled` flag from the probe configuration.

helm-chart/templates/cvat_backend/utils/deployment.yml (1)
Line range hint 1-84: Overall, the addition of health checks improves deployment reliability.

The introduction of readiness and liveness probes to the CVAT backend utilities deployment is a positive change that enhances the reliability and manageability of the application in a Kubernetes environment. These probes allow for better health monitoring and automatic recovery in case of issues.
The rest of the deployment configuration remains unchanged, which maintains consistency with the existing setup. This focused change aligns well with the PR objectives of adding readiness and liveness probes to the CVAT project.
To further improve the deployment, consider the following:
- Implement different checks for readiness and liveness probes as suggested earlier.
- Ensure that the `workerprobe` management command is efficient and doesn't cause significant overhead when called frequently.
- Review and adjust the probe timing parameters (e.g., `initialDelaySeconds`, `periodSeconds`) based on your application's specific behavior and requirements.
- Document the purpose and expected behavior of these probes in your project documentation for future reference and maintenance.
helm-chart/templates/cvat_backend/worker_import/deployment.yml (2)
Lines 63-82: Summary: Improved worker health monitoring.

The addition of readiness and liveness probes enhances the reliability and observability of the worker deployment. These changes allow Kubernetes to better manage the lifecycle of the worker pods, ensuring they are ready to accept work and are functioning correctly.
Key points:
- Both probes use a custom `workerprobe` command, allowing for specific health checks.
- Probes are conditionally added, providing flexibility in configuration.
- The probes specifically target the 'import' queue, which is appropriate for this worker.
These changes align well with the PR objectives of adding readiness and liveness probes to the CVAT project.
Lines 73-82: LGTM! Verify the `workerprobe` command implementation.

The liveness probe configuration looks good and is consistent with the readiness probe. Since both readiness and liveness probes use the same `workerprobe` command, it's important to ensure that this command appropriately distinguishes between readiness and liveness checks. Please verify the implementation of the `workerprobe` command: ensure that it performs different checks for readiness (e.g., can accept new tasks) and liveness (e.g., is running and not deadlocked).
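One way such a distinction could look, as a minimal Python sketch. The names `is_alive`/`is_ready` and the heartbeat/capacity criteria are illustrative assumptions, not the actual CVAT `workerprobe` logic:

```python
def is_alive(last_heartbeat, now, max_age=120.0):
    # Liveness: the worker process is running and not deadlocked,
    # inferred from a recently refreshed heartbeat timestamp.
    return last_heartbeat is not None and (now - last_heartbeat) <= max_age

def is_ready(last_heartbeat, now, busy_jobs, max_concurrent, max_age=120.0):
    # Readiness: the worker is alive AND still has capacity for new tasks.
    return is_alive(last_heartbeat, now, max_age) and busy_jobs < max_concurrent

# A worker with a fresh heartbeat but a full job slot is alive yet not ready,
# so Kubernetes would stop routing work to it without restarting it:
print(is_alive(90.0, 100.0))                                   # True
print(is_ready(90.0, 100.0, busy_jobs=1, max_concurrent=1))    # False
```

Keeping the two predicates separate means a temporarily saturated worker fails only readiness, avoiding the unnecessary restarts the review warns about.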
helm-chart/templates/cvat_backend/worker_export/deployment.yml (1)
Lines 64-83: Summary: Good addition of health probes with room for refinement.

The introduction of readiness and liveness probes to the CVAT backend worker export deployment is a valuable improvement. These probes will enhance the reliability and manageability of the application in a Kubernetes environment. The implementation is flexible, allowing for easy configuration through Helm values.
Key points:
- Both probes are conditionally included, providing deployment flexibility.
- The use of a custom management command (`workerprobe`) allows for tailored health checks.
- The `omit` function is used correctly to exclude the `enabled` key from the final configuration.

Suggestions for further improvement:

- Differentiate between readiness and liveness probe commands or parameters for more accurate health monitoring.
- Add brief comments explaining the purpose of the `workerprobe` command and its parameters.

Overall, these changes will significantly improve the robustness of the CVAT deployment in Kubernetes.
helm-chart/values.yaml (1)
Line range hint 1-307: Overall, the addition of readiness and liveness probes is a significant improvement.

The introduction of readiness and liveness probes for the backend, worker, frontend, OPA, and kvrocks components is a valuable enhancement to the CVAT Helm chart. These probes will significantly improve the health monitoring and reliability of the deployed services.
Key points:
- The backend and worker configurations are well-thought-out, with appropriate timing parameters.
- The frontend probe configuration could benefit from more specific parameters.
- The OPA and kvrocks configurations are good but could be further improved with component-specific checks.
These changes will enable better orchestration of the CVAT components in a Kubernetes environment, allowing for more robust deployments and easier troubleshooting.
To further enhance this PR, consider implementing the suggested improvements for each component, particularly the frontend probe configuration. These refinements will ensure that each component has tailored health checks that accurately reflect its specific characteristics and requirements.
```yaml
{{- if .Values.cvat.kvrocks.livenessProbe.enabled }}
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - ./bin/redis-cli -p 6666 PING | grep -E '(PONG|NOAUTH)' || exit 1
  {{- toYaml (omit .Values.cvat.kvrocks.livenessProbe "enabled") | nindent 12 }}
{{- end }}
```
🛠️ Refactor suggestion
Consider differentiating the liveness probe from the readiness probe
While the liveness probe implementation is functional, it's identical to the readiness probe. This approach might not be ideal:

1. Readiness and liveness probes serve different purposes:
   - Readiness determines if the pod is ready to receive traffic.
   - Liveness determines if the pod needs to be restarted.
2. Using the same check for both could lead to unnecessary restarts if the pod is temporarily unable to serve requests but is still running.
3. The current check might be too strict for a liveness probe, as it requires the Redis server to be fully operational.
Consider implementing a simpler liveness probe that checks if the kvrocks process is running, rather than checking the Redis server's response. For example:
```yaml
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - pgrep kvrocks || exit 1
  {{- toYaml (omit .Values.cvat.kvrocks.livenessProbe "enabled") | nindent 12 }}
```
This change would make the liveness probe less likely to cause unnecessary restarts while still detecting if the kvrocks process has crashed.
```yaml
{{- if $localValues.readinessProbe.enabled }}
readinessProbe:
  httpGet:
    path: /api/server/about
    port: 8080
  {{- toYaml (omit $localValues.readinessProbe "enabled") | nindent 12 }}
{{- end }}
{{- if $localValues.livenessProbe.enabled }}
livenessProbe:
  httpGet:
    path: /api/server/health/
    port: 8080
  {{- toYaml (omit $localValues.livenessProbe "enabled") | nindent 12 }}
{{- end }}
```
💡 Codebase verification
Missing timing parameters in probe configurations.

The following `readinessProbe` and `livenessProbe` definitions are missing essential timing parameters (`initialDelaySeconds`, `periodSeconds`, `timeoutSeconds`, `successThreshold`, and `failureThreshold`):
- helm-chart/templates/cvat_kvrocks/statefulset.yml: `readinessProbe`, `livenessProbe`
- helm-chart/values.yaml: `readinessProbe`, `livenessProbe`
- helm-chart/templates/cvat_opa/deployment.yml: `readinessProbe`, `livenessProbe`
- helm-chart/templates/cvat_backend/worker_webhooks/deployment.yml: `readinessProbe`, `livenessProbe`
- helm-chart/templates/cvat_backend/worker_import/deployment.yml: `readinessProbe`, `livenessProbe`
- helm-chart/templates/cvat_backend/server/deployment.yml: `readinessProbe`, `livenessProbe`
- helm-chart/templates/cvat_backend/worker_qualityreports/deployment.yml: `readinessProbe`, `livenessProbe`
- helm-chart/templates/cvat_backend/worker_annotation/deployment.yml: `readinessProbe`, `livenessProbe`
- helm-chart/templates/cvat_backend/utils/deployment.yml: `readinessProbe`, `livenessProbe`
- helm-chart/templates/cvat_backend/worker_analyticsreports/deployment.yml: `readinessProbe`, `livenessProbe`
- helm-chart/templates/cvat_frontend/deployment.yml: `readinessProbe`, `livenessProbe`
- helm-chart/templates/cvat_backend/worker_export/deployment.yml: `readinessProbe`, `livenessProbe`
Please update each probe configuration to include the necessary timing parameters to ensure proper functionality and avoid potential deployment issues.
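As an illustrative sketch (the numbers below are placeholders, not tuned recommendations), a complete probe configuration in values.yaml could look like:

```yaml
readinessProbe:
  enabled: true
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  successThreshold: 1
  failureThreshold: 3
livenessProbe:
  enabled: true
  initialDelaySeconds: 60
  periodSeconds: 30
  timeoutSeconds: 10
  successThreshold: 1   # Kubernetes requires successThreshold=1 for liveness probes
  failureThreshold: 5
```

The right values depend on each service's startup time and load profile, so they should be chosen per component rather than copied verbatim.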
🔗 Analysis chain
Verify probe timing parameters.
The implementation of both readiness and liveness probes looks good. However, it's important to ensure that appropriate timing parameters (such as initialDelaySeconds
, periodSeconds
, timeoutSeconds
, successThreshold
, and failureThreshold
) are configured in the $localValues.readinessProbe
and $localValues.livenessProbe
objects.
These parameters are crucial for fine-tuning the behavior of the probes and avoiding premature restarts or false positives/negatives.
To confirm the presence of these parameters, you can run the following command:
Please review the output to ensure that appropriate timing parameters are set for both probes.
Probe configurations not found in `values.yaml`.
To locate and verify the timing parameters for both readiness and liveness probes, please run the following command:
Verify the presence of timing parameters in all probe configurations.
To ensure that both readiness and liveness probes have appropriate timing parameters configured, please run the following command:
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for probe timing parameters in values.yaml
# Test: Search for readinessProbe and livenessProbe configurations
rg --type yaml 'readinessProbe:|livenessProbe:' -A 10 values.yaml
Length of output: 119
Script:
#!/bin/bash
# Description: Search for readinessProbe and livenessProbe across all YAML files
rg --type yaml '(readinessProbe:|livenessProbe:)' --multiline
Length of output: 2373
Script:
#!/bin/bash
# Description: Check for timing parameters in all readinessProbe and livenessProbe definitions
# Define the timing parameters to check
TIMING_PARAMS=("initialDelaySeconds" "periodSeconds" "timeoutSeconds" "successThreshold" "failureThreshold")
# Search for readinessProbe and livenessProbe in all YAML files
probes=$(rg --type yaml '(readinessProbe:|livenessProbe:)' --multiline)
# Iterate over each probe found
echo "$probes" | while read -r line; do
file=$(echo "$line" | cut -d':' -f1)
probe=$(echo "$line" | cut -d':' -f2 | tr -d ' ')
echo "Checking $probe in $file"
for param in "${TIMING_PARAMS[@]}"; do
if ! rg --type yaml "$param:" "$file" >/dev/null; then
echo " - Missing $param in $probe"
fi
done
done
Length of output: 50850
{{- if .Values.cvat.backend.worker.livenessProbe.enabled }}
livenessProbe:
  exec:
    command:
      - python
      - manage.py
      - workerprobe
      - notifications
      - cleaning
  {{- toYaml (omit .Values.cvat.backend.worker.livenessProbe "enabled") | nindent 12 }}
{{- end }}
🛠️ Refactor suggestion
Consider differentiating between readiness and liveness probes.
While the addition of the liveness probe is beneficial, using the same command for both readiness and liveness probes might not be ideal. The liveness probe should check if the application is alive, while the readiness probe checks if it's ready to accept traffic.
Consider implementing a simpler check for the liveness probe that doesn't depend on the worker tasks. For example, you could create a simple endpoint that returns a 200 OK status if the application is running. This would prevent unnecessary restarts if the worker tasks are temporarily unavailable.
Example:
livenessProbe:
httpGet:
path: /healthz
port: 8000
initialDelaySeconds: {{ .Values.cvat.backend.worker.livenessProbe.initialDelaySeconds }}
periodSeconds: {{ .Values.cvat.backend.worker.livenessProbe.periodSeconds }}
timeoutSeconds: {{ .Values.cvat.backend.worker.livenessProbe.timeoutSeconds }}
successThreshold: {{ .Values.cvat.backend.worker.livenessProbe.successThreshold }}
failureThreshold: {{ .Values.cvat.backend.worker.livenessProbe.failureThreshold }}
This assumes you've implemented a `/healthz` endpoint in your application that performs a basic aliveness check.
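As a sketch of what such an endpoint could look like (everything here is hypothetical; CVAT's actual URL layout and dependency checks may differ):

```python
# Hypothetical minimal aliveness check for a /healthz endpoint.
def healthz(check_dependency=None):
    """Return (status_code, body) for a basic liveness endpoint.

    The default call answers 200 as long as the process can execute code
    at all; an optional callable lets you additionally ping one dependency.
    """
    try:
        if check_dependency is not None:
            check_dependency()
        return 200, "ok"
    except Exception:
        return 503, "unhealthy"

# In a Django project this would back a URL such as path("healthz", view),
# with the view returning HttpResponse(body, status=status_code).
```

The key design point is that the liveness check stays cheap and dependency-free by default, so a slow Redis or database does not trigger container restarts.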
{{- if .Values.cvat.backend.worker.livenessProbe.enabled }}
livenessProbe:
  exec:
    command:
      - python
      - manage.py
      - workerprobe
      - export
  {{- toYaml (omit .Values.cvat.backend.worker.livenessProbe "enabled") | nindent 12 }}
{{- end }}
🛠️ Refactor suggestion
Consider differentiating between readiness and liveness probes.
The liveness probe is well-implemented and follows the same pattern as the readiness probe. However, using the same command for both readiness and liveness probes might not be ideal.
Consider the following suggestions:
- Differentiate the commands used for readiness and liveness probes. For example, the readiness probe could check if the worker is ready to accept new tasks, while the liveness probe could check if the worker is still running and responsive.
- If different commands are not feasible, consider adding parameters to the `workerprobe` command to distinguish between readiness and liveness checks.
Example:
livenessProbe:
exec:
command:
- python
- manage.py
- workerprobe
- export
- --liveness # Add a parameter to distinguish liveness check
This differentiation will provide more accurate health monitoring for your Kubernetes deployment.
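Django management commands parse arguments with argparse, so the suggested flag could be wired up roughly like this (the `--liveness` flag and the branch bodies are hypothetical, not part of the PR):

```python
import argparse

def parse_probe_args(argv):
    """Parse workerprobe-style arguments; --liveness is a hypothetical flag."""
    parser = argparse.ArgumentParser(prog="workerprobe")
    parser.add_argument("queues", nargs="+", help="queue names to check")
    parser.add_argument(
        "--liveness",
        action="store_true",
        help="run a cheap 'process is alive' check instead of the readiness check",
    )
    return parser.parse_args(argv)

args = parse_probe_args(["export", "--liveness"])
if args.liveness:
    result = "check worker heartbeat only"   # placeholder for the cheap aliveness check
else:
    result = "check queue registration"      # placeholder for the fuller readiness check
```

In an actual `BaseCommand`, the same `add_argument` calls would go in `add_arguments(self, parser)` and the branch in `handle()`.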
helm-chart/values.yaml
Outdated
readinessProbe:
  enabled: true
livenessProbe:
  enabled: true
Frontend probe configuration needs more specific parameters.
While enabling readiness and liveness probes for the frontend is a good practice, the current configuration lacks specific parameters. This may lead to using default values that might not be optimal for a frontend service.
Consider adding more specific parameters to the frontend probes. Here's a suggested configuration:
readinessProbe:
enabled: true
httpGet:
path: /
port: 80
periodSeconds: 10
initialDelaySeconds: 30
failureThreshold: 3
livenessProbe:
enabled: true
httpGet:
path: /
port: 80
periodSeconds: 30
initialDelaySeconds: 60
failureThreshold: 3
These settings provide a more tailored approach for a frontend service, with appropriate paths, ports, and timing parameters.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@ Coverage Diff @@
## develop #8488 +/- ##
===========================================
- Coverage 74.31% 74.26% -0.05%
===========================================
Files 397 400 +3
Lines 43139 43165 +26
Branches 3905 3905
===========================================
Hits 32058 32058
- Misses 11081 11107 +26
I think you also need to add an `__init__.py` to `commands`.
Ping?
tests/python/shared/fixtures/init.py
Outdated
"sh",
"-c",
"for p in rq:finished:* rq:job:* rq:wip:* rq:finished:* rq:failed:*; "
'do redis-cli -e --scan --pattern "$p" | xargs -r redis-cli -e del ; done',
Could you explain what you did here?
Previously we reset the entire Redis DB, including the RQ keys about registered workers, which would cause the workers to restart on the next heartbeat; but in fact, to pass the tests we only need to clear the keys related to jobs and statuses.
Here I find all keys related to jobs and their statuses and then delete them, without deleting the whole DB.
In that case, I don't think you got all the keys that need to be removed. E.g. there's the scheduled jobs key, the export cache lock keys, the blocker keys. There could be others I missed.
In general, I don't think this approach is workable, because it's very easy to forget to update this function when writing code that uses new Redis keys. I think you should do the opposite; keep certain keys and delete everything else. And please add a comment to explain what the code is doing.
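The "keep certain keys, delete everything else" idea could be sketched as a small helper (the keep-list below is illustrative; the real set of keys worth preserving, such as worker registrations and scheduler state, has to come from the codebase):

```python
import fnmatch

# Hypothetical keep-list: patterns for keys that must survive the cleanup.
KEEP_PATTERNS = ["rq:worker:*", "rq:workers*", "rq:scheduler*"]

def keys_to_delete(all_keys, keep_patterns=KEEP_PATTERNS):
    """Delete-by-default: return every key that matches no keep pattern."""
    return [
        key for key in all_keys
        if not any(fnmatch.fnmatch(key, pattern) for pattern in keep_patterns)
    ]

# With redis-py, the cleanup might then run roughly as:
#   for key in keys_to_delete(k.decode() for k in client.scan_iter("*")):
#       client.delete(key)
```

With this inversion, a new Redis key introduced elsewhere in the code is deleted automatically unless someone deliberately adds it to the keep-list, which is the failure mode the reviewer prefers.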
Force-pushed from f65ccad to d1ac9ef.
Please add a changelog entry.
Quality Gate passed
Motivation and context
How has this been tested?
Checklist
develop branch
(cvat-canvas, cvat-core, cvat-data and cvat-ui)
License
Feel free to contact the maintainers if that's a concern.
Summary by CodeRabbit
Release Notes
New Features
`workerprobe` to check the health of workers in specified queues.

Chores
Version updated from 0.13.2 to 0.14.0.