chore(application-system): Increase cpu and mem limit and request to avoid crashloops #16792
Conversation
Walkthrough
The pull request introduces significant updates to the configuration of multiple services within the application system. Key changes include adjustments to resource limits and requests, particularly for CPU and memory, as well as the addition of new environment variables. The ingress configurations have been refined to accommodate different environments, and health check paths have been standardized across services. These modifications enhance the overall service setup, ensuring better resource management and operational consistency.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (3)
charts/islandis/values.staging.yaml (1)
764-768
: Resource limits adjustment to prevent CrashLoopBackoff
The changes align well with the PR objectives:
- CPU limit increased from 400m to 1000m (1 core)
- CPU request increased to 300m
- Memory limit decreased to 768Mi
- Memory request set equal to limit to prevent OOM issues
These changes should help prevent CPU throttling while maintaining efficient memory usage.
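For reference, a minimal sketch of what the described staging configuration would look like as a standard Kubernetes resources block (values taken from the bullet points above; the exact keys and indentation in values.staging.yaml may differ):

resources:
  limits:
    cpu: '1000m'    # raised from 400m so the API gets a full core before throttling
    memory: '768Mi' # lowered limit in line with the memory optimization goal
  requests:
    cpu: '300m'     # guaranteed CPU baseline
    memory: '768Mi' # request equal to limit so the pod is never scheduled onto a node short of memory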
Consider monitoring these metrics after deployment to ensure they meet the application's needs:
- CPU utilization patterns
- Memory usage trends
- Container restart counts
charts/islandis/values.dev.yaml (1)
Line range hint
2831-2836
: Consider increasing memory request for web service
While the CPU configuration looks good, the memory request (384Mi) being half of the limit (768Mi) could potentially lead to OOM issues under load. Consider:
- Increasing memory request closer to the limit (e.g., 614Mi) to ensure more stable memory allocation
- Or implementing a more gradual scaling strategy with the HPA
apps/application-system/api/infra/application-system-api.ts (1)
350-351
: Adjust Memory Requests for Optimal Resource Utilization
Currently, both memory requests and limits are set to 768Mi. It's generally a good practice to set the memory request lower than the limit to allow for resource elasticity and efficient scheduling. This adjustment can improve cluster utilization and reduce the likelihood of pod evictions due to resource pressure.
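As an illustration only, a block with the request set below the limit might look like the following; the 512Mi request is a hypothetical value for the sketch, not one proposed in this PR:

resources:
  limits:
    memory: '768Mi' # hard ceiling; exceeding it gets the container OOM-killed
  requests:
    memory: '512Mi' # hypothetical lower request, giving the scheduler room to bin-pack pods

The trade-off is that a pod using more memory than it requested becomes an earlier candidate for eviction under node memory pressure, which is presumably why this PR keeps the request equal to the limit.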
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
📒 Files selected for processing (4)
apps/application-system/api/infra/application-system-api.ts (1 hunks)
charts/islandis/values.dev.yaml (2 hunks)
charts/islandis/values.prod.yaml (2 hunks)
charts/islandis/values.staging.yaml (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
apps/application-system/api/infra/application-system-api.ts (1)
Pattern apps/**/*
: "Confirm that the code adheres to the following:
- NextJS best practices, including file structure, API routes, and static generation methods.
- Efficient state management and server-side rendering techniques.
- Optimal use of TypeScript for component and utility type safety."
🔇 Additional comments (5)
charts/islandis/values.staging.yaml (1)
614-614
: Memory limit adjustment in NODE_OPTIONS
The NODE_OPTIONS memory limit has been decreased from 921MB to 691MB. While this aligns with the overall memory optimization goal, ensure this won't negatively impact the application's performance.
✅ Verification successful
Memory limit adjustment appears safe based on service configuration patterns
Looking at the NODE_OPTIONS configurations across the codebase, the reduction from 921MB to 691MB for this service aligns with the memory patterns seen in similar services in the staging environment. Several other services operate successfully with similar or lower memory limits (460MB-691MB range).
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check for any CrashLoopBackOff events related to memory issues
kubectl logs -n application-system -l app=application-system-api --tail=1000 | grep -i "heap\|memory\|oom"
Length of output: 183
Script:
#!/bin/bash
# Let's check the resource configurations in the values files
fd "values.*\.ya?ml" --exec cat {} | grep -A 5 -B 5 "max-old-space-size"
# Also check for any memory-related configurations or issues in deployment files
fd "deployment.*\.ya?ml" --exec cat {} | grep -A 5 -B 5 "resources:\|memory:"
# Look for any related configuration or documentation
rg -i "memory|heap|node_options" -A 3 -B 3 "README|CHANGELOG|docs"
Length of output: 85727
charts/islandis/values.prod.yaml (1)
605-605
: Justify the specific NODE_OPTIONS memory value
The --max-old-space-size=691 seems like an oddly specific value. This should either be rounded to a more standard value or documentation should be added explaining why this exact number was chosen.
charts/islandis/values.dev.yaml (3)
767-771
: Resource limits are well-balanced for stability
The new resource configuration looks good with:
- CPU limit increased to 1 core, which will help prevent throttling
- Memory request matching limit (768Mi) helps prevent OOM kills
- CPU request of 300m provides good baseline performance
617-617
: NODE_OPTIONS memory limit properly aligned with container limits
The --max-old-space-size=691 setting is appropriately set below the container memory limit of 768Mi, leaving room for non-heap memory.
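To make the relationship concrete, the two settings sit roughly side by side in the values file like this (simplified sketch; only the relevant keys are shown and the surrounding structure is assumed):

env:
  NODE_OPTIONS: '--max-old-space-size=691 -r dd-trace/init' # V8 old-space heap capped at ~691 MB
resources:
  limits:
    memory: '768Mi' # container ceiling; roughly 10% is left for stack, buffers, and native/dd-trace overhead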
767-771
: Verify resource limits across dependent services
The resource limit changes look good but should be verified across dependent services to ensure they can handle the increased throughput.
Also applies to: 2831-2836
✅ Verification successful
Let me analyze the resource configurations across the services to verify consistency.
Resource limits are properly aligned with the system's resource tiers
Based on the analysis of resource configurations across the codebase, the specified limits (1000m CPU, 768Mi memory) align with the established resource tiers in the system:
- Small services: 200m-400m CPU, 256Mi-512Mi memory
- Medium services: 400m-800m CPU, 512Mi-1024Mi memory
- Large services: 800m-1200m CPU, 768Mi-4096Mi memory
The service falls into the large tier category, and its resource limits are consistent with other high-throughput services in the codebase that have similar resource requirements (e.g., other services with 800m-1200m CPU and 768Mi-3200Mi memory).
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check resource limits across all services to ensure they can handle the increased load
echo "Checking resource limits across services..."
grep -A 4 "resources:" . | grep -E "cpu:|memory:" || true
# Check if any pods are still experiencing CrashLoopBackOff
kubectl get pods --all-namespaces | grep -i "crash"
Length of output: 304
Script:
#!/bin/bash
# Search for resource configurations in yaml files
fd -e yaml -x grep -A 4 "resources:" {} \; 2>/dev/null
# Also check specifically in the values files
fd "values.*\.yaml$" -x grep -A 4 "resources:" {} \; 2>/dev/null
Length of output: 41160
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (1)
apps/application-system/api/infra/application-system-api.ts (1)
Line range hint
177-180
: Consider increasing worker resource limits to prevent throttling
Given the PR's objective to address CrashLoopBackoff issues, the current worker resource configuration might be insufficient:
- The CPU request (150m) is only 37.5% of the limit (400m), which could lead to throttling
- Consider increasing both CPU request and limit to match the service's configuration
.resources({
-  limits: { cpu: '400m', memory: '768Mi' },
-  requests: { cpu: '150m', memory: '384Mi' },
+  limits: { cpu: '600m', memory: '768Mi' },
+  requests: { cpu: '300m', memory: '384Mi' },
})
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
📒 Files selected for processing (1)
apps/application-system/api/infra/application-system-api.ts
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
apps/application-system/api/infra/application-system-api.ts (1)
Pattern apps/**/*
: "Confirm that the code adheres to the following:
- NextJS best practices, including file structure, API routes, and static generation methods.
- Efficient state management and server-side rendering techniques.
- Optimal use of TypeScript for component and utility type safety."
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (2)
charts/islandis/values.staging.yaml (1)
764-768
: Resource limits look appropriate for preventing CrashLoopBackoff issues.
The increased CPU limit (600m) and memory limit (1024Mi) along with the corresponding requests (200m CPU, 512Mi memory) should help prevent pods from being throttled or OOMKilled. The 3:1 CPU and 2:1 memory limit-to-request ratios provide good headroom for traffic spikes while maintaining efficient resource utilization.
Consider implementing Horizontal Pod Autoscaling (HPA) based on memory utilization in addition to the existing CPU-based scaling to better handle memory-intensive workloads.
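The chart drives autoscaling through its own hpa.scaling values block rather than raw manifests, so any such change would live there; purely as a generic illustration, a Kubernetes autoscaling/v2 HorizontalPodAutoscaler that scales on both CPU and memory utilization looks like this (resource names are placeholders, and the 80% memory target is an example value, not one from this PR):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: application-system-api        # placeholder
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: application-system-api      # placeholder
  minReplicas: 2
  maxReplicas: 60
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 90       # mirrors the existing cpuAverageUtilization setting
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80       # illustrative memory target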
charts/islandis/values.dev.yaml (1)
Line range hint
642-650
: Review health check timing configuration
The current health check configuration:

healthCheck:
  liveness:
    initialDelaySeconds: 3
    path: '/liveness'
    timeoutSeconds: 3
  readiness:
    initialDelaySeconds: 3
    path: '/liveness'
    timeoutSeconds: 3

The timing values seem too aggressive:
- initialDelaySeconds: 3 might be too short for the application to properly initialize
- timeoutSeconds: 3 might not provide enough buffer during high CPU utilization
- Both liveness and readiness probes use the same timing, which could cause cascading failures
Consider adjusting the timing to be more lenient:
healthCheck:
  liveness:
-   initialDelaySeconds: 3
+   initialDelaySeconds: 10
    path: '/liveness'
-   timeoutSeconds: 3
+   timeoutSeconds: 5
  readiness:
-   initialDelaySeconds: 3
+   initialDelaySeconds: 15
    path: '/liveness'
-   timeoutSeconds: 3
+   timeoutSeconds: 5
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
📒 Files selected for processing (3)
charts/islandis/values.dev.yaml (1 hunks)
charts/islandis/values.prod.yaml (1 hunks)
charts/islandis/values.staging.yaml (1 hunks)
🔇 Additional comments (2)
charts/islandis/values.prod.yaml (1)
755-759
: Resource configuration changes look appropriate
The adjustments to CPU and memory allocations are well-balanced (see the sketch after this list):
- CPU limit of 600m provides adequate headroom while preventing excessive resource usage
- Memory limit of 1024Mi allows for sufficient heap space
- The request-to-limit ratios (1:3 for CPU, 1:2 for memory) follow Kubernetes best practices for resource management
- These changes should help prevent CrashLoopBackoff issues caused by resource constraints
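In Helm values form, the prod block described above corresponds to roughly the following (values taken from the quoted configuration; the ratio notes are added as comments):

resources:
  limits:
    cpu: '600m'      # 3x the request: burst headroom before throttling
    memory: '1024Mi' # 2x the request
  requests:
    cpu: '200m'      # guaranteed baseline used for scheduling and HPA utilization math
    memory: '512Mi'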
charts/islandis/values.dev.yaml (1)
767-770
: Verify the memory configuration
The memory configuration shows:
- Limit: 1024Mi
- Request: 512Mi
This 2:1 ratio between limit and request is reasonable, but we should verify if this is sufficient for the application's needs.
✅ Verification successful
Let me try a different approach to verify the memory configuration by examining the repository for any historical configuration changes or related files.
Based on the search results, I can now provide a comprehensive analysis of the memory configuration. The search shows consistent memory allocation patterns across different services and environments, and I can verify that the 2:1 ratio between limits and requests is a common pattern in the codebase.
Memory configuration is aligned with best practices
The memory configuration of 1024Mi limit and 512Mi request is consistent with:
- Other similar services in the codebase using the same 2:1 ratio
- Common memory configurations for Node.js applications in the cluster
- Standard practices for Kubernetes resource management
The configuration appears appropriate because:
- The 2:1 ratio allows for memory bursts while maintaining efficient resource allocation
- Similar services in production are using comparable memory settings
- The values are in line with the template configurations found in the codebase
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for OOMKilled events in the application-system namespace
# This will help determine if the current memory limits are adequate
echo "Checking for OOMKilled events in application-system namespace..."
kubectl get events -n application-system --field-selector reason=OOMKilled
echo "Checking for pods that were terminated due to memory issues..."
kubectl get pods -n application-system -o json | jq '.items[] | select(.status.containerStatuses[].lastState.terminated.reason=="OOMKilled") | {name: .metadata.name, reason: .status.containerStatuses[].lastState.terminated.reason, exitCode: .status.containerStatuses[].lastState.terminated.exitCode}'
Length of output: 803
Script:
#!/bin/bash
# Search for memory configuration patterns in yaml files
echo "Searching for memory configurations in yaml files..."
rg -A 2 -B 2 'memory:.*Mi' --type yaml
# Look for any documentation about resource requirements
echo -e "\nSearching for resource requirements documentation..."
fd README.md | xargs rg -i "memory|resource|requirements"
# Check for any monitoring or metrics related configuration
echo -e "\nSearching for monitoring configurations..."
rg -A 2 -B 2 'monitoring|metrics|prometheus' --type yaml
Length of output: 109382
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (7)
charts/services/application-system-api/values.dev.yaml (2)
Line range hint
147-156
: Consider increasing health check timeouts.
The current health check configuration has very tight timing constraints:
- 3s initial delay
- 3s timeout
With the history of CPU throttling, these tight timeouts might contribute to failed health checks during high load periods.
Consider applying this adjustment:
healthCheck:
  liveness:
-   initialDelaySeconds: 3
+   initialDelaySeconds: 5
    path: '/liveness'
-   timeoutSeconds: 3
+   timeoutSeconds: 5
  readiness:
-   initialDelaySeconds: 3
+   initialDelaySeconds: 5
    path: '/liveness'
-   timeoutSeconds: 3
+   timeoutSeconds: 5
Line range hint
157-164
: Adjust HPA CPU target for better scaling.
The current HPA configuration with 90% CPU target utilization might be too aggressive, potentially leading to delayed scaling and CPU throttling.
Consider lowering the CPU target:
hpa:
  scaling:
    metric:
-     cpuAverageUtilization: 90
+     cpuAverageUtilization: 70
      nginxRequestsIrate: 5
    replicas:
      max: 60
      min: 2

charts/services/application-system-api/values.prod.yaml (1)
Line range hint
147-152
: Consider increasing health check timeouts
With the history of pods missing health checks, the current 3-second timeout might be too aggressive. Consider increasing timeoutSeconds to provide more buffer during high-load situations.

healthCheck:
  liveness:
    initialDelaySeconds: 3
    path: '/liveness'
-   timeoutSeconds: 3
+   timeoutSeconds: 5
  readiness:
    initialDelaySeconds: 3
    path: '/liveness'
-   timeoutSeconds: 3
+   timeoutSeconds: 5

charts/services/application-system-api/values.staging.yaml (2)
Line range hint
147-156
: Health check configuration needs improvement
Current health check configuration has several potential issues:
- Using the same path (/liveness) for both liveness and readiness probes
- Short timeout (3s) might lead to false positives
- Missing important probe parameters (periodSeconds, failureThreshold)
Consider applying these improvements:
healthCheck:
  liveness:
    initialDelaySeconds: 3
    path: '/liveness'
-   timeoutSeconds: 3
+   timeoutSeconds: 5
+   periodSeconds: 10
+   failureThreshold: 3
  readiness:
    initialDelaySeconds: 3
-   path: '/liveness'
-   timeoutSeconds: 3
+   path: '/readiness'
+   timeoutSeconds: 5
+   periodSeconds: 10
+   failureThreshold: 3
Line range hint
157-164
: Consider lowering HPA CPU utilization target
The current CPU utilization target of 90% is quite high, especially considering the recent CPU throttling issues. A lower target would provide more headroom for traffic spikes.
Consider this adjustment:
hpa:
  scaling:
    metric:
-     cpuAverageUtilization: 90
+     cpuAverageUtilization: 75
      nginxRequestsIrate: 5
    replicas:
      max: 60
      min: 2

charts/islandis/values.staging.yaml (1)
Line range hint
79-79
: Review NODE_OPTIONS memory limit
The Node.js heap size limit (921MB) set in NODE_OPTIONS is very close to the container memory limit (1024MB). This leaves little room for other memory usage and could lead to OOM kills. Consider reducing the heap size limit to ~70% of the container memory limit (~700MB).
Apply this change:
- NODE_OPTIONS: '--max-old-space-size=921 -r dd-trace/init'
+ NODE_OPTIONS: '--max-old-space-size=700 -r dd-trace/init'

charts/islandis/values.prod.yaml (1)
758-761
: Consider adjusting memory configuration for better burst capacity
While the memory values are reasonable, setting different limits and requests would provide better burst capacity:
resources:
  limits:
    cpu: '600m'
-   memory: '1024Mi'
+   memory: '1536Mi'
  requests:
    cpu: '200m'
    memory: '512Mi'

This 3:1 memory limit-to-request ratio (similar to CPU) would:
- Maintain the same guaranteed memory (512Mi)
- Allow more headroom for temporary spikes
- Better align with Kubernetes QoS best practices
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
📒 Files selected for processing (7)
apps/application-system/api/infra/application-system-api.ts (1 hunks)
charts/islandis/values.dev.yaml (1 hunks)
charts/islandis/values.prod.yaml (1 hunks)
charts/islandis/values.staging.yaml (1 hunks)
charts/services/application-system-api/values.dev.yaml (1 hunks)
charts/services/application-system-api/values.prod.yaml (1 hunks)
charts/services/application-system-api/values.staging.yaml (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
apps/application-system/api/infra/application-system-api.ts (1)
Pattern apps/**/*
: "Confirm that the code adheres to the following:
- NextJS best practices, including file structure, API routes, and static generation methods.
- Efficient state management and server-side rendering techniques.
- Optimal use of TypeScript for component and utility type safety."
🔇 Additional comments (13)
charts/services/application-system-api/values.dev.yaml (2)
Line range hint 187-195
: Init container resource configuration looks good.
The resource allocation for the migrations container is well-balanced:
- Reasonable CPU limits (200m) and requests (50m)
- Appropriate memory configuration for database migrations
213-216
: Resource configuration changes look appropriate.
The increased CPU limits (600m) and requests (200m) should help prevent CrashLoopBackoff issues by:
- Providing more CPU headroom for peak loads
- Maintaining a healthy request-to-limit ratio (1:3)
- Reducing the likelihood of CPU throttling
Let's verify the historical resource usage to confirm these limits are appropriate:
charts/services/application-system-api/values.prod.yaml (2)
213-216
: LGTM! Resource adjustments look appropriate.
The increase in CPU resources should help prevent CrashLoopBackoff issues:
- CPU limit (600m) provides adequate headroom for spikes
- CPU request (200m) ensures sufficient guaranteed resources
- The ratio between request and limit (1:3) allows for good burst capacity while maintaining resource efficiency
Line range hint 153-160
: Monitor HPA behavior with new resource settings
The HPA is configured with a 90% CPU target utilization. With the increased CPU resources, verify that:
- The scaling behavior remains appropriate
- The min/max replica counts (2-60) are still suitable for the new resource profile
✅ Verification successful
HPA configuration appears appropriate for the resource profile
The HPA configuration aligns well with the container's resource settings:
- The 90% CPU target utilization is appropriate for the container's resource limits (600m CPU) and requests (200m CPU); see the worked numbers after this list
- The min/max replica range (2-60) provides good scalability while maintaining high availability
- The configuration is consistent across all environments (dev/staging/prod)
- The secondary metric (nginxRequestsIrate: 5) provides additional scaling protection
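To make the numbers concrete (assuming the chart's cpuAverageUtilization maps onto the standard Kubernetes HPA CPU utilization target, which is measured against requests rather than limits), the quoted configuration works out as follows:

hpa:
  scaling:
    metric:
      cpuAverageUtilization: 90 # scale-out triggers around 90% of the 200m request, i.e. ~180m average per pod
      nginxRequestsIrate: 5     # secondary request-rate metric
    replicas:
      max: 60
      min: 2

Individual pods can still burst toward the 600m limit before being throttled, which sits well above the ~180m scale-out threshold.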
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check HPA metrics and scaling events after resource changes
# Get HPA status and recent scaling events
kubectl get hpa -n application-system application-system-api -o yaml
# Check recent pod events for any resource-related issues
kubectl get events -n application-system --sort-by='.lastTimestamp' | grep -E "application-system-api|HorizontalPodAutoscaler"
Length of output: 347
Script:
#!/bin/bash
# Let's analyze the Helm chart values to understand the resource configurations and HPA settings
# Check the resource configurations in values files
fd -e yaml -e yml . charts/services/application-system-api/ --exec cat {}
# Look for any HPA-related configurations in other files
rg -A 5 -B 5 "HorizontalPodAutoscaler|autoscaling" charts/
Length of output: 44595
Script:
#!/bin/bash
# Let's analyze the resource configurations to understand the scaling context
# Check resource configurations across environments
rg -A 5 "resources:|cpu:|memory:" charts/services/application-system-api/
# Look for any scaling-related configurations
rg -A 5 "scaling:|replicas:|cpu" charts/services/application-system-api/
Length of output: 13642
apps/application-system/api/infra/application-system-api.ts (3)
349-350
: Using same endpoint for health checks
Line range hint 132-159
: LGTM: Environment variables configuration
The environment variables for application recipients are well-structured with proper environment-specific values and follow security best practices.
Line range hint 160-240
: Verify new service integrations
New xroad services (Labor, HealthInsurance, NationalRegistryB2C) have been added. Let's verify their configuration and usage.
✅ Verification successful
Let me generate additional scripts to verify the usage of these services in the codebase.
Based on the code analysis, I can now provide a final response.
New xroad service integrations are properly configured
The new xroad services are correctly configured in the infrastructure:
- HealthInsurance: Configured with proper WSDL URLs for dev/staging/prod environments
- Labor: Set up with VMST parental leave API paths for all environments
- NationalRegistryB2C: Configured with appropriate client IDs for each environment
These services are properly integrated into the application system API and are used by various modules across the codebase. The configurations follow the established patterns and include all necessary environment-specific settings.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check for service configurations and dependencies
rg -g '*.ts' -A 5 'Labor|HealthInsurance|NationalRegistryB2C'
# Check for any existing implementations using these services
ast-grep --pattern 'class $_ {
$$$
constructor($$$) {
$$$
Labor|HealthInsurance|NationalRegistryB2C
$$$
}
$$$
}'
Length of output: 181752
Script:
#!/bin/bash
# Check for service configurations and dependencies in the application system API
rg -g '*.ts' -A 5 'XROAD_HEALTH_INSURANCE|XROAD_VMST|XROAD_NATIONAL_REGISTRY_B2C' apps/application-system/api/
# Check for service configurations in infrastructure
rg -g '*.ts' -A 5 'HealthInsurance|Labor|NationalRegistryB2C' infra/src/dsl/
# Check for service imports and usage in modules
ast-grep --pattern 'import { $_ } from $$$HealthInsurance$$$'
Length of output: 1488
charts/services/application-system-api/values.staging.yaml (2)
Line range hint 1-277
: Overall configuration is well-structured and secure
The configuration demonstrates good security practices with:
- Proper pod security context
- Well-defined resource limits
- Appropriate service account configuration
- Comprehensive environment variable management
213-217
: Resource allocation changes look appropriate
The increased CPU limits (600m) and requests (200m) should help prevent CrashLoopBackoff issues caused by CPU throttling. The 1:3 ratio between request and limit is within the recommended range for efficient resource utilization.
Let's verify the pod's resource usage patterns to ensure these limits are appropriate:
charts/islandis/values.staging.yaml (2)
767-770
: Resource limits look appropriate for preventing CPU throttling
The increased CPU limit (600m) and request (200m) values should help prevent the CrashLoopBackoff issues by:
- Providing more CPU headroom with the 600m limit
- Ensuring a base CPU allocation of 200m
- Maintaining a healthy request:limit ratio of 1:3
The memory configuration (512Mi request, 1024Mi limit) also looks appropriate for a Node.js application.
Line range hint 841-850
: Security configuration follows best practices
The security settings are well configured with:
- Appropriate pod security context
- Proper privilege restrictions
- Correctly configured service account with IAM role
charts/islandis/values.prod.yaml (1)
758-761
: CPU resource changes look good
The increased CPU limit (600m) and request (200m) should help prevent throttling and CrashLoopBackoff issues. The 3:1 limit-to-request ratio provides good balance between guaranteed resources and burst capacity.
charts/islandis/values.dev.yaml (1)
771-774
: Resource configuration may need further adjustment
The current configuration still has a large gap between requests and limits which could lead to resource contention:
- CPU: 200m request vs 600m limit (3x difference)
- Memory: 512Mi request vs 1024Mi limit (2x difference)
This configuration might still be susceptible to the CrashLoopBackoff issues mentioned in the PR objectives.
Consider adjusting the values to have a more balanced ratio between requests and limits:
resources:
  limits:
    cpu: '600m'
    memory: '1024Mi'
  requests:
-   cpu: '200m'
+   cpu: '400m'
    memory: '512Mi'
Let me verify the pod's resource utilization patterns to better inform these recommendations:
#!/bin/bash
# Description: Check pod resource utilization patterns in the application-system namespace
# Get pod metrics for the application-system namespace
kubectl top pods -n application-system --sort-by=cpu
kubectl top pods -n application-system --sort-by=memory
# Get any CrashLoopBackoff events in the last hour
kubectl get events -n application-system --field-selector type=Warning,reason=BackOff --sort-by='.lastTimestamp'
Datadog Report
All test runs ✅ 99 Total Test Services: 0 Failed, 97 Passed
This report shows up to 10 services.
🔻 Code Coverage Decreases vs Default Branch (1)
…avoid crashloops (#16792) * chore(application-system): Increase cpu and mem limit and request to avoid crashloops * chore: charts update dirty files * Decrease cpu resources and return memory to original value * chore: charts update dirty files * chore: charts update dirty files --------- Co-authored-by: andes-it <builders@andes.is> Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
* chore(application-system): Set initialDelay to 20 seconds (#16893) * chore(application-system): Set initialDelay to 20 seconds * chore: charts update dirty files --------- Co-authored-by: andes-it <builders@andes.is> Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com> * chore(application-system): Increase cpu and mem limit and request to avoid crashloops (#16792) * chore(application-system): Increase cpu and mem limit and request to avoid crashloops * chore: charts update dirty files * Decrease cpu resources and return memory to original value * chore: charts update dirty files * chore: charts update dirty files --------- Co-authored-by: andes-it <builders@andes.is> Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com> * Remove charts --------- Co-authored-by: andes-it <builders@andes.is> Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
...
Attach a link to issue if relevant
What
Increase the CPU limit to 600 millicores and the request to 200 millicores to try to eliminate crashloops due to missed health checks.
Why
We are seeing pods go into CrashLoopBackoff due to missed health checks, which seem to have been caused by heavy CPU throttling.
Screenshots / Gifs
Attach Screenshots / Gifs to help reviewers understand the scope of the pull request
Checklist:
Summary by CodeRabbit
New Features
Bug Fixes
Chores