
chore(application-system): Increase cpu and mem limit and request to avoid crashloops #16792

Merged
merged 7 commits into main from sa-increase-resource-request-and-limit on Nov 15, 2024

Conversation

norda-gunni
Member

@norda-gunni norda-gunni commented Nov 11, 2024

...

Attach a link to issue if relevant

What

Increase the CPU limit to 600 millicores and the request to 200 millicores to try to eliminate crashloops caused by missed health checks.

Why

We are seeing pods go into CrashLoopBackOff due to missed health checks, which in turn appear to be caused by heavy CPU throttling.
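For reference, the change in the chart values has roughly the following shape (a sketch based on the figures discussed in the review below; the memory values are the pre-existing ones, only the CPU figures change):

# charts/services/application-system-api/values.*.yaml (sketch, not the exact diff)
resources:
  limits:
    cpu: '600m'       # raised from the previous 400m to reduce CPU throttling
    memory: '1024Mi'
  requests:
    cpu: '200m'       # higher guaranteed baseline so health checks still get CPU time
    memory: '512Mi'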

Screenshots / Gifs

Attach Screenshots / Gifs to help reviewers understand the scope of the pull request
(two screenshots attached)

Checklist:

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • Formatting passes locally with my changes
  • I have rebased against main before asking for a review

Summary by CodeRabbit

  • New Features

    • Enhanced service configurations with additional environment variables for improved functionality.
    • Increased CPU and optimized memory allocations for various services to support better performance.
    • Introduced a new migration container to facilitate database migrations.
    • Standardized health check paths across services for consistent monitoring.
  • Bug Fixes

    • Adjusted resource limits and health check configurations to ensure reliable service operation.
  • Chores

    • Updated ingress configurations for improved traffic management and security across multiple environments.

@norda-gunni norda-gunni requested a review from a team as a code owner November 11, 2024 12:57
Contributor

coderabbitai bot commented Nov 11, 2024

Walkthrough

The pull request introduces significant updates to the configuration of multiple services within the application system. Key changes include adjustments to resource limits and requests, particularly for CPU and memory, as well as the addition of new environment variables. The ingress configurations have been refined to accommodate different environments, and health check paths have been standardized across services. These modifications enhance the overall service setup, ensuring better resource management and operational consistency.

Changes

File Path Change Summary
apps/application-system/api/infra/application-system-api.ts Updated workerSetup and serviceSetup functions with new resource limits and environment variables; modified ingress settings.
charts/islandis/values.dev.yaml Adjusted NODE_OPTIONS for multiple services; modified resource limits; standardized health check paths; updated ingress annotations.
charts/islandis/values.prod.yaml Similar updates as in values.dev.yaml, focusing on environment variables, resource management, and health checks.
charts/islandis/values.staging.yaml Updates to NODE_OPTIONS, resource limits, health checks, ingress annotations, and HPA settings across services.
charts/services/application-system-api/values.dev.yaml Increased CPU limits and requests; added migrations initContainer for database migrations.
charts/services/application-system-api/values.prod.yaml Similar updates as in values.dev.yaml, with added migrations initContainer for database migrations.
charts/services/application-system-api/values.staging.yaml Similar updates as in values.dev.yaml, with added migrations initContainer for database migrations.

Possibly related PRs

Suggested labels

high priority

Suggested reviewers

  • thorkellmani
  • robertaandersen
  • eirikurn


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (3)
charts/islandis/values.staging.yaml (1)

764-768: Resource limits adjustment to prevent CrashLoopBackoff

The changes align well with the PR objectives:

  1. CPU limit increased from 400m to 1000m (1 core)
  2. CPU request increased to 300m
  3. Memory limit decreased to 768Mi
  4. Memory request set equal to limit to prevent OOM issues

These changes should help prevent CPU throttling while maintaining efficient memory usage.

Consider monitoring these metrics after deployment to ensure they meet the application's needs:

  • CPU utilization patterns
  • Memory usage trends
  • Container restart counts
charts/islandis/values.dev.yaml (1)

Line range hint 2831-2836: Consider increasing memory request for web service

While the CPU configuration looks good, the memory request (384Mi) being half of the limit (768Mi) could potentially lead to OOM issues under load. Consider:

  • Increasing memory request closer to the limit (e.g., 614Mi) to ensure more stable memory allocation
  • Or implementing a more gradual scaling strategy with the HPA
apps/application-system/api/infra/application-system-api.ts (1)

350-351: Adjust Memory Requests for Optimal Resource Utilization

Currently, both memory requests and limits are set to 768Mi. It's generally a good practice to set the memory request lower than the limit to allow for resource elasticity and efficient scheduling. This adjustment can improve cluster utilization and reduce the likelihood of pod evictions due to resource pressure.
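As a generic illustration (hypothetical values, not this service's actual configuration), a memory request set below the limit looks like this:

resources:
  limits:
    memory: '768Mi'   # hard ceiling; exceeding it gets the container OOM-killed
  requests:
    memory: '512Mi'   # amount the scheduler reserves on the node for this pod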

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 129bd5f and ede8f22.

📒 Files selected for processing (4)
  • apps/application-system/api/infra/application-system-api.ts (1 hunks)
  • charts/islandis/values.dev.yaml (2 hunks)
  • charts/islandis/values.prod.yaml (2 hunks)
  • charts/islandis/values.staging.yaml (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
apps/application-system/api/infra/application-system-api.ts (1)

Pattern apps/**/*: "Confirm that the code adheres to the following:

  • NextJS best practices, including file structure, API routes, and static generation methods.
  • Efficient state management and server-side rendering techniques.
  • Optimal use of TypeScript for component and utility type safety."
🔇 Additional comments (5)
charts/islandis/values.staging.yaml (1)

614-614: Memory limit adjustment in NODE_OPTIONS

The NODE_OPTIONS memory limit has been decreased from 921MB to 691MB. While this aligns with the overall memory optimization goal, ensure this won't negatively impact the application's performance.

✅ Verification successful

Memory limit adjustment appears safe based on service configuration patterns

Looking at the NODE_OPTIONS configurations across the codebase, the reduction from 921MB to 691MB for this service aligns with the memory patterns seen in similar services in the staging environment. Several other services operate successfully with similar or lower memory limits (460MB-691MB range).

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for any CrashLoopBackOff events related to memory issues
kubectl logs -n application-system -l app=application-system-api --tail=1000 | grep -i "heap\|memory\|oom"

Length of output: 183


Script:

#!/bin/bash
# Let's check the resource configurations in the values files
fd "values.*\.ya?ml" --exec cat {} | grep -A 5 -B 5 "max-old-space-size"

# Also check for any memory-related configurations or issues in deployment files
fd "deployment.*\.ya?ml" --exec cat {} | grep -A 5 -B 5 "resources:\|memory:"

# Look for any related configuration or documentation
rg -i "memory|heap|node_options" -A 3 -B 3 "README|CHANGELOG|docs"

Length of output: 85727

charts/islandis/values.prod.yaml (1)

605-605: Justify the specific NODE_OPTIONS memory value

The --max-old-space-size=691 seems like an oddly specific value. This should either be rounded to a more standard value or documentation should be added explaining why this exact number was chosen.

charts/islandis/values.dev.yaml (3)

767-771: Resource limits are well-balanced for stability

The new resource configuration looks good with:

  • Increased CPU limit to 1 core will help prevent throttling
  • Memory request matching limit (768Mi) helps prevent OOM kills
  • CPU request of 300m provides good baseline performance

617-617: NODE_OPTIONS memory limit properly aligned with container limits

The --max-old-space-size=691 setting is appropriately set below the container memory limit of 768Mi, leaving room for non-heap memory.
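For context, the heap cap and container limit pair up roughly like this in the rendered values (a sketch; the layout is illustrative):

env:
  NODE_OPTIONS: '--max-old-space-size=691 -r dd-trace/init'  # V8 old-space heap capped at ~691 MB
resources:
  limits:
    memory: '768Mi'   # container ceiling; ~10% headroom left for stack, buffers, and native memory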


767-771: Verify resource limits across dependent services

The resource limit changes look good but should be verified across dependent services to ensure they can handle the increased throughput.

Also applies to: 2831-2836

✅ Verification successful

Let me analyze the resource configurations across the services to verify consistency.


Resource limits are properly aligned with the system's resource tiers

Based on the analysis of resource configurations across the codebase, the specified limits (1000m CPU, 768Mi memory) align with the established resource tiers in the system:

  • Small services: 200m-400m CPU, 256Mi-512Mi memory
  • Medium services: 400m-800m CPU, 512Mi-1024Mi memory
  • Large services: 800m-1200m CPU, 768Mi-4096Mi memory

The service falls into the large tier category, and its resource limits are consistent with other high-throughput services in the codebase that have similar resource requirements (e.g., other services with 800m-1200m CPU and 768Mi-3200Mi memory).

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check resource limits across all services to ensure they can handle the increased load
echo "Checking resource limits across services..."
grep -A 4 "resources:" . | grep -E "cpu:|memory:" || true

# Check if any pods are still experiencing CrashLoopBackOff
kubectl get pods --all-namespaces | grep -i "crash"

Length of output: 304


Script:

#!/bin/bash
# Search for resource configurations in yaml files
fd -e yaml -x grep -A 4 "resources:" {} \; 2>/dev/null

# Also check specifically in the values files
fd "values.*\.yaml$" -x grep -A 4 "resources:" {} \; 2>/dev/null

Length of output: 41160

charts/islandis/values.prod.yaml: outdated review comment, resolved
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (1)
apps/application-system/api/infra/application-system-api.ts (1)

Line range hint 177-180: Consider increasing worker resource limits to prevent throttling

Given the PR's objective to address CrashLoopBackoff issues, the current worker resource configuration might be insufficient:

  • The CPU request (150m) is only 37.5% of the limit (400m), which could lead to throttling
  • Consider increasing both CPU request and limit to match the service's configuration
 .resources({
-  limits: { cpu: '400m', memory: '768Mi' },
-  requests: { cpu: '150m', memory: '384Mi' },
+  limits: { cpu: '600m', memory: '768Mi' },
+  requests: { cpu: '300m', memory: '384Mi' },
 })
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between ede8f22 and 82b23f1.

📒 Files selected for processing (1)
  • apps/application-system/api/infra/application-system-api.ts (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
apps/application-system/api/infra/application-system-api.ts (1)

Pattern apps/**/*: "Confirm that the code adheres to the following:

  • NextJS best practices, including file structure, API routes, and static generation methods.
  • Efficient state management and server-side rendering techniques.
  • Optimal use of TypeScript for component and utility type safety."

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (2)
charts/islandis/values.staging.yaml (1)

764-768: Resource limits look appropriate for preventing CrashLoopBackoff issues.

The increased CPU limit (600m) and memory limit (1024Mi) along with the corresponding requests (200m CPU, 512Mi memory) should help prevent pods from being throttled or OOMKilled. The 3:1 ratio between limits and requests provides good headroom for traffic spikes while maintaining efficient resource utilization.

Consider implementing Horizontal Pod Autoscaling (HPA) based on memory utilization in addition to the existing CPU-based scaling to better handle memory-intensive workloads.
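If memory-based scaling were adopted, the rendered Kubernetes object could look like the sketch below. The chart's own hpa block uses a project-specific schema, so this only shows the equivalent autoscaling/v2 resource; the 80% memory target is an assumed value, not taken from the chart.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: application-system-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: application-system-api
  minReplicas: 2            # matches the chart's replicas.min
  maxReplicas: 60           # matches the chart's replicas.max
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 90   # matches cpuAverageUtilization in the chart
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80   # assumed target, not from the chart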

charts/islandis/values.dev.yaml (1)

Line range hint 642-650: Review health check timing configuration

The current health check configuration:

healthCheck:
  liveness:
    initialDelaySeconds: 3
    path: '/liveness'
    timeoutSeconds: 3
  readiness:
    initialDelaySeconds: 3
    path: '/liveness'
    timeoutSeconds: 3

The timing values seem too aggressive:

  1. initialDelaySeconds: 3 might be too short for the application to properly initialize
  2. timeoutSeconds: 3 might not provide enough buffer during high CPU utilization
  3. Both liveness and readiness probes use the same timing, which could cause cascading failures

Consider adjusting the timing to be more lenient:

healthCheck:
  liveness:
-   initialDelaySeconds: 3
+   initialDelaySeconds: 10
    path: '/liveness'
-   timeoutSeconds: 3
+   timeoutSeconds: 5
  readiness:
-   initialDelaySeconds: 3
+   initialDelaySeconds: 15
    path: '/liveness'
-   timeoutSeconds: 3
+   timeoutSeconds: 5
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 82b23f1 and a7cece9.

📒 Files selected for processing (3)
  • charts/islandis/values.dev.yaml (1 hunks)
  • charts/islandis/values.prod.yaml (1 hunks)
  • charts/islandis/values.staging.yaml (1 hunks)
🔇 Additional comments (2)
charts/islandis/values.prod.yaml (1)

755-759: Resource configuration changes look appropriate

The adjustments to CPU and memory allocations are well-balanced:

  • CPU limit of 600m provides adequate headroom while preventing excessive resource usage
  • Memory limit of 1024Mi allows for sufficient heap space
  • The request-to-limit ratios (1:3 for CPU, 1:2 for memory) follow Kubernetes best practices for resource management
  • These changes should help prevent CrashLoopBackoff issues caused by resource constraints
charts/islandis/values.dev.yaml (1)

767-770: Verify the memory configuration

The memory configuration shows:

  • Limit: 1024Mi
  • Request: 512Mi

This 2:1 ratio between limit and request is reasonable, but we should verify if this is sufficient for the application's needs.

✅ Verification successful

Let me try a different approach to verify the memory configuration by examining the repository for any historical configuration changes or related files.


Based on the search results, I can now provide a comprehensive analysis of the memory configuration. The search shows consistent memory allocation patterns across different services and environments, and I can verify that the 2:1 ratio between limits and requests is a common pattern in the codebase.

Memory configuration is aligned with best practices

The memory configuration of 1024Mi limit and 512Mi request is consistent with:

  • Other similar services in the codebase using the same 2:1 ratio
  • Common memory configurations for Node.js applications in the cluster
  • Standard practices for Kubernetes resource management

The configuration appears appropriate because:

  • The 2:1 ratio allows for memory bursts while maintaining efficient resource allocation
  • Similar services in production are using comparable memory settings
  • The values are in line with the template configurations found in the codebase
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for OOMKilled events in the application-system namespace
# This will help determine if the current memory limits are adequate

echo "Checking for OOMKilled events in application-system namespace..."
kubectl get events -n application-system --field-selector reason=OOMKilled

echo "Checking for pods that were terminated due to memory issues..."
kubectl get pods -n application-system -o json | jq '.items[] | select(.status.containerStatuses[].lastState.terminated.reason=="OOMKilled") | {name: .metadata.name, reason: .status.containerStatuses[].lastState.terminated.reason, exitCode: .status.containerStatuses[].lastState.terminated.exitCode}'

Length of output: 803


Script:

#!/bin/bash
# Search for memory configuration patterns in yaml files
echo "Searching for memory configurations in yaml files..."
rg -A 2 -B 2 'memory:.*Mi' --type yaml

# Look for any documentation about resource requirements
echo -e "\nSearching for resource requirements documentation..."
fd README.md | xargs rg -i "memory|resource|requirements"

# Check for any monitoring or metrics related configuration
echo -e "\nSearching for monitoring configurations..."
rg -A 2 -B 2 'monitoring|metrics|prometheus' --type yaml

Length of output: 109382

charts/islandis/values.dev.yaml: review comment, resolved
@AndesKrrrrrrrrrrr AndesKrrrrrrrrrrr added the automerge Merge this PR as soon as all checks pass label Nov 14, 2024
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (7)
charts/services/application-system-api/values.dev.yaml (2)

Line range hint 147-156: Consider increasing health check timeouts.

The current health check configuration has very tight timing constraints:

  • 3s initial delay
  • 3s timeout

With the history of CPU throttling, these tight timeouts might contribute to failed health checks during high load periods.

Consider applying this adjustment:

  healthCheck:
    liveness:
-     initialDelaySeconds: 3
+     initialDelaySeconds: 5
      path: '/liveness'
-     timeoutSeconds: 3
+     timeoutSeconds: 5
    readiness:
-     initialDelaySeconds: 3
+     initialDelaySeconds: 5
      path: '/liveness'
-     timeoutSeconds: 3
+     timeoutSeconds: 5

Line range hint 157-164: Adjust HPA CPU target for better scaling.

The current HPA configuration with 90% CPU target utilization might be too aggressive, potentially leading to delayed scaling and CPU throttling.

Consider lowering the CPU target:

  hpa:
    scaling:
      metric:
-       cpuAverageUtilization: 90
+       cpuAverageUtilization: 70
        nginxRequestsIrate: 5
      replicas:
        max: 60
        min: 2
charts/services/application-system-api/values.prod.yaml (1)

Line range hint 147-152: Consider increasing health check timeouts

With the history of pods missing health checks, the current 3-second timeout might be too aggressive. Consider increasing timeoutSeconds to provide more buffer during high-load situations.

  healthCheck:
    liveness:
      initialDelaySeconds: 3
      path: '/liveness'
-     timeoutSeconds: 3
+     timeoutSeconds: 5
    readiness:
      initialDelaySeconds: 3
      path: '/liveness'
-     timeoutSeconds: 3
+     timeoutSeconds: 5
charts/services/application-system-api/values.staging.yaml (2)

Line range hint 147-156: Health check configuration needs improvement

Current health check configuration has several potential issues:

  1. Using the same path (/liveness) for both liveness and readiness probes
  2. Short timeout (3s) might lead to false positives
  3. Missing important probe parameters (periodSeconds, failureThreshold)

Consider applying these improvements:

 healthCheck:
   liveness:
     initialDelaySeconds: 3
     path: '/liveness'
-    timeoutSeconds: 3
+    timeoutSeconds: 5
+    periodSeconds: 10
+    failureThreshold: 3
   readiness:
     initialDelaySeconds: 3
-    path: '/liveness'
-    timeoutSeconds: 3
+    path: '/readiness'
+    timeoutSeconds: 5
+    periodSeconds: 10
+    failureThreshold: 3

Line range hint 157-164: Consider lowering HPA CPU utilization target

The current CPU utilization target of 90% is quite high, especially considering the recent CPU throttling issues. A lower target would provide more headroom for traffic spikes.

Consider this adjustment:

 hpa:
   scaling:
     metric:
-      cpuAverageUtilization: 90
+      cpuAverageUtilization: 75
       nginxRequestsIrate: 5
     replicas:
       max: 60
       min: 2
charts/islandis/values.staging.yaml (1)

Line range hint 79-79: Review NODE_OPTIONS memory limit

The Node.js heap size limit (921MB) set in NODE_OPTIONS is very close to the container memory limit (1024MB). This leaves little room for other memory usage and could lead to OOM kills. Consider reducing the heap size limit to ~70% of container memory limit (~700MB).

Apply this change:

-    NODE_OPTIONS: '--max-old-space-size=921 -r dd-trace/init'
+    NODE_OPTIONS: '--max-old-space-size=700 -r dd-trace/init'
charts/islandis/values.prod.yaml (1)

758-761: Consider adjusting memory configuration for better burst capacity

While the memory values are reasonable, setting different limits and requests would provide better burst capacity:

  resources:
    limits:
      cpu: '600m'
-     memory: '1024Mi'
+     memory: '1536Mi'  
    requests:
      cpu: '200m'
      memory: '512Mi'

This 3:1 memory limit-to-request ratio (similar to CPU) would:

  • Maintain the same guaranteed memory (512Mi)
  • Allow more headroom for temporary spikes
  • Better align with Kubernetes QoS best practices
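For background on the QoS point: Kubernetes derives a pod's QoS class from how requests relate to limits, so the two patterns differ like this (illustrative values only):

# Guaranteed QoS: requests equal limits for every container and resource;
# evicted last under node pressure, but no burst headroom.
resources:
  limits:   { cpu: '600m', memory: '1024Mi' }
  requests: { cpu: '600m', memory: '1024Mi' }

# Burstable QoS: requests below limits; the pod can burst, but is evicted
# before Guaranteed pods when the node runs short of memory.
resources:
  limits:   { cpu: '600m', memory: '1536Mi' }
  requests: { cpu: '200m', memory: '512Mi' }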
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between a7cece9 and 31266c6.

📒 Files selected for processing (7)
  • apps/application-system/api/infra/application-system-api.ts (1 hunks)
  • charts/islandis/values.dev.yaml (1 hunks)
  • charts/islandis/values.prod.yaml (1 hunks)
  • charts/islandis/values.staging.yaml (1 hunks)
  • charts/services/application-system-api/values.dev.yaml (1 hunks)
  • charts/services/application-system-api/values.prod.yaml (1 hunks)
  • charts/services/application-system-api/values.staging.yaml (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
apps/application-system/api/infra/application-system-api.ts (1)

Pattern apps/**/*: "Confirm that the code adheres to the following:

  • NextJS best practices, including file structure, API routes, and static generation methods.
  • Efficient state management and server-side rendering techniques.
  • Optimal use of TypeScript for component and utility type safety."
🔇 Additional comments (13)
charts/services/application-system-api/values.dev.yaml (2)

Line range hint 187-195: Init container resource configuration looks good.

The resource allocation for the migrations container is well-balanced:

  • Reasonable CPU limits (200m) and requests (50m)
  • Appropriate memory configuration for database migrations

213-216: Resource configuration changes look appropriate.

The increased CPU limits (600m) and requests (200m) should help prevent CrashLoopBackoff issues by:

  • Providing more CPU headroom for peak loads
  • Maintaining a healthy request-to-limit ratio (1:3)
  • Reducing the likelihood of CPU throttling

Let's verify the historical resource usage to confirm these limits are appropriate:

charts/services/application-system-api/values.prod.yaml (2)

213-216: LGTM! Resource adjustments look appropriate.

The increase in CPU resources should help prevent CrashLoopBackoff issues:

  • CPU limit (600m) provides adequate headroom for spikes
  • CPU request (200m) ensures sufficient guaranteed resources
  • The ratio between request and limit (1:3) allows for good burst capacity while maintaining resource efficiency

Line range hint 153-160: Monitor HPA behavior with new resource settings

The HPA is configured with a 90% CPU target utilization. With the increased CPU resources, verify that:

  1. The scaling behavior remains appropriate
  2. The min/max replica counts (2-60) are still suitable for the new resource profile
✅ Verification successful

HPA configuration appears appropriate for the resource profile

The HPA configuration aligns well with the container's resource settings:

  • The 90% CPU target utilization is appropriate for the container's resource limits (600m CPU) and requests (200m CPU)
  • The min/max replica range (2-60) provides good scalability while maintaining high availability
  • The configuration is consistent across all environments (dev/staging/prod)
  • The secondary metric (nginxRequestsIrate: 5) provides additional scaling protection
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check HPA metrics and scaling events after resource changes

# Get HPA status and recent scaling events
kubectl get hpa -n application-system application-system-api -o yaml

# Check recent pod events for any resource-related issues
kubectl get events -n application-system --sort-by='.lastTimestamp' | grep -E "application-system-api|HorizontalPodAutoscaler"

Length of output: 347


Script:

#!/bin/bash
# Let's analyze the Helm chart values to understand the resource configurations and HPA settings

# Check the resource configurations in values files
fd -e yaml -e yml . charts/services/application-system-api/ --exec cat {}

# Look for any HPA-related configurations in other files
rg -A 5 -B 5 "HorizontalPodAutoscaler|autoscaling" charts/

Length of output: 44595


Script:

#!/bin/bash
# Let's analyze the resource configurations to understand the scaling context

# Check resource configurations across environments
rg -A 5 "resources:|cpu:|memory:" charts/services/application-system-api/

# Look for any scaling-related configurations
rg -A 5 "scaling:|replicas:|cpu" charts/services/application-system-api/

Length of output: 13642

apps/application-system/api/infra/application-system-api.ts (3)

349-350: Using same endpoint for health checks


Line range hint 132-159: LGTM: Environment variables configuration

The environment variables for application recipients are well-structured with proper environment-specific values and follow security best practices.


Line range hint 160-240: Verify new service integrations

New xroad services (Labor, HealthInsurance, NationalRegistryB2C) have been added. Let's verify their configuration and usage.

✅ Verification successful

Let me generate additional scripts to verify the usage of these services in the codebase.


Based on the code analysis, I can now provide a final response.

New xroad service integrations are properly configured

The new xroad services are correctly configured in the infrastructure:

  • HealthInsurance: Configured with proper WSDL URLs for dev/staging/prod environments
  • Labor: Set up with VMST parental leave API paths for all environments
  • NationalRegistryB2C: Configured with appropriate client IDs for each environment

These services are properly integrated into the application system API and are used by various modules across the codebase. The configurations follow the established patterns and include all necessary environment-specific settings.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for service configurations and dependencies
rg -g '*.ts' -A 5 'Labor|HealthInsurance|NationalRegistryB2C'

# Check for any existing implementations using these services
ast-grep --pattern 'class $_ {
  $$$
  constructor($$$) {
    $$$
    Labor|HealthInsurance|NationalRegistryB2C
    $$$
  }
  $$$
}'

Length of output: 181752


Script:

#!/bin/bash
# Check for service configurations and dependencies in the application system API
rg -g '*.ts' -A 5 'XROAD_HEALTH_INSURANCE|XROAD_VMST|XROAD_NATIONAL_REGISTRY_B2C' apps/application-system/api/

# Check for service configurations in infrastructure
rg -g '*.ts' -A 5 'HealthInsurance|Labor|NationalRegistryB2C' infra/src/dsl/

# Check for service imports and usage in modules
ast-grep --pattern 'import { $_ } from $$$HealthInsurance$$$'

Length of output: 1488

charts/services/application-system-api/values.staging.yaml (2)

Line range hint 1-277: Overall configuration is well-structured and secure

The configuration demonstrates good security practices with:

  • Proper pod security context
  • Well-defined resource limits
  • Appropriate service account configuration
  • Comprehensive environment variable management

213-217: Resource allocation changes look appropriate

The increased CPU limits (600m) and requests (200m) should help prevent CrashLoopBackoff issues caused by CPU throttling. The 1:3 ratio between request and limit is within the recommended range for efficient resource utilization.

Let's verify the pod's resource usage patterns to ensure these limits are appropriate:

charts/islandis/values.staging.yaml (2)

767-770: Resource limits look appropriate for preventing CPU throttling

The increased CPU limit (600m) and request (200m) values should help prevent the CrashLoopBackoff issues by:

  • Providing more CPU headroom with the 600m limit
  • Ensuring a base CPU allocation of 200m
  • Maintaining a healthy request:limit ratio of 1:3

The memory configuration (512Mi request, 1024Mi limit) also looks appropriate for a Node.js application.


Line range hint 841-850: Security configuration follows best practices

The security settings are well configured with:

  • Appropriate pod security context
  • Proper privilege restrictions
  • Correctly configured service account with IAM role
charts/islandis/values.prod.yaml (1)

758-761: CPU resource changes look good

The increased CPU limit (600m) and request (200m) should help prevent throttling and CrashLoopBackoff issues. The 3:1 limit-to-request ratio provides good balance between guaranteed resources and burst capacity.

charts/islandis/values.dev.yaml (1)

771-774: ⚠️ Potential issue

Resource configuration may need further adjustment

The current configuration still has a large gap between requests and limits which could lead to resource contention:

  • CPU: 200m request vs 600m limit (3x difference)
  • Memory: 512Mi request vs 1024Mi limit (2x difference)

This configuration might still be susceptible to the CrashLoopBackoff issues mentioned in the PR objectives.

Consider adjusting the values to have a more balanced ratio between requests and limits:

resources:
  limits:
    cpu: '600m'
    memory: '1024Mi'
  requests:
-   cpu: '200m'
+   cpu: '400m'
    memory: '512Mi'

Let me verify the pod's resource utilization patterns to better inform these recommendations:

#!/bin/bash
# Description: Check pod resource utilization patterns in the application-system namespace

# Get pod metrics for the application-system namespace
kubectl top pods -n application-system --sort-by=cpu
kubectl top pods -n application-system --sort-by=memory

# Get any CrashLoopBackoff events in the last hour
kubectl get events -n application-system --field-selector type=Warning,reason=BackOff --sort-by='.lastTimestamp'


codecov bot commented Nov 14, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 36.44%. Comparing base (d06732d) to head (31266c6).

Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##             main   #16792   +/-   ##
=======================================
  Coverage   36.44%   36.44%           
=======================================
  Files        6851     6851           
  Lines      143569   143569           
  Branches    40980    40980           
=======================================
  Hits        52330    52330           
  Misses      91239    91239           
Flag Coverage Δ
air-discount-scheme-backend 54.08% <ø> (ø)
air-discount-scheme-web 0.00% <ø> (ø)
api 3.34% <ø> (ø)
api-catalogue-services 77.85% <ø> (ø)
api-domains-air-discount-scheme 37.08% <ø> (ø)
api-domains-assets 26.71% <ø> (ø)
api-domains-auth-admin 48.48% <ø> (ø)
api-domains-communications 39.64% <ø> (ø)
api-domains-criminal-record 47.56% <ø> (ø)
api-domains-driving-license 44.50% <ø> (ø)
api-domains-education 30.62% <ø> (ø)
api-domains-health-insurance 34.43% <ø> (ø)
api-domains-mortgage-certificate 34.74% <ø> (ø)
api-domains-payment-schedule 41.31% <ø> (ø)
application-api-files 62.52% <ø> (ø)
application-core 71.06% <ø> (+0.31%) ⬆️
application-system-api 40.99% <ø> (ø)
application-template-api-modules 27.67% <ø> (+<0.01%) ⬆️
application-templates-accident-notification 28.98% <ø> (ø)
application-templates-car-recycling 3.12% <ø> (ø)
application-templates-criminal-record 25.87% <ø> (ø)
application-templates-driving-license 18.26% <ø> (ø)
application-templates-estate 12.14% <ø> (ø)
application-templates-example-payment 24.80% <ø> (ø)
application-templates-financial-aid 15.48% <ø> (ø)
application-templates-general-petition 23.07% <ø> (ø)
application-templates-inheritance-report 6.52% <ø> (ø)
application-templates-marriage-conditions 15.04% <ø> (ø)
application-templates-mortgage-certificate 43.34% <ø> (ø)
application-templates-parental-leave 29.86% <ø> (ø)
application-types 6.60% <ø> (ø)
application-ui-components 1.27% <ø> (ø)
application-ui-shell 20.83% <ø> (ø)
auth-admin-web 2.43% <ø> (ø)
auth-nest-tools 30.92% <ø> (ø)
auth-react 21.85% <ø> (ø)
auth-shared 75.00% <ø> (ø)
clients-charge-fjs-v2 24.11% <ø> (ø)
clients-driving-license 40.28% <ø> (ø)
clients-driving-license-book 43.50% <ø> (ø)
clients-financial-statements-inao 49.06% <ø> (ø)
clients-license-client 1.26% <ø> (ø)
clients-middlewares 73.56% <ø> (+0.33%) ⬆️
clients-regulations 42.36% <ø> (ø)
clients-rsk-company-registry 29.76% <ø> (ø)
clients-rsk-personal-tax-return 38.00% <ø> (ø)
clients-smartsolutions 12.77% <ø> (ø)
clients-syslumenn 49.23% <ø> (ø)
clients-zendesk 50.61% <ø> (ø)
cms 0.42% <ø> (ø)
cms-translations 38.94% <ø> (ø)
content-search-index-manager 95.65% <ø> (ø)
content-search-toolkit 8.14% <ø> (ø)
contentful-apps 4.69% <ø> (ø)
dokobit-signing 62.76% <ø> (ø)
download-service 44.16% <ø> (ø)
email-service 60.57% <ø> (ø)
feature-flags 90.62% <ø> (ø)
file-storage 46.09% <ø> (ø)
financial-aid-backend 51.29% <ø> (ø)
financial-aid-shared 17.81% <ø> (ø)
icelandic-names-registry-backend 54.34% <ø> (ø)
infra-nest-server 48.37% <ø> (ø)
infra-tracing 43.24% <ø> (ø)
island-ui-core 28.88% <ø> (ø)
judicial-system-api 19.76% <ø> (ø)
judicial-system-audit-trail 68.86% <ø> (ø)
judicial-system-backend 55.14% <ø> (ø)
judicial-system-formatters 79.26% <ø> (ø)
judicial-system-message 66.99% <ø> (ø)
judicial-system-message-handler 47.99% <ø> (ø)
judicial-system-scheduler 70.67% <ø> (ø)
judicial-system-types 43.58% <ø> (ø)
judicial-system-web 27.16% <ø> (ø)
license-api 42.58% <ø> (-0.02%) ⬇️
localization 10.15% <ø> (ø)
logging 48.43% <ø> (ø)
message-queue 68.79% <ø> (ø)
nest-audit 68.20% <ø> (ø)
nest-aws 54.03% <ø> (ø)
nest-config 78.00% <ø> (ø)
nest-core 43.54% <ø> (ø)
nest-feature-flags 51.09% <ø> (ø)
nest-problem 45.82% <ø> (ø)
nest-sequelize 94.44% <ø> (ø)
nest-swagger 51.71% <ø> (ø)
nova-sms 62.09% <ø> (ø)
portals-admin-regulations-admin 1.85% <ø> (ø)
portals-core 15.89% <ø> (ø)
reference-backend 49.74% <ø> (ø)
regulations 16.78% <ø> (ø)
residence-history 85.00% <ø> (ø)
services-auth-admin-api 52.50% <ø> (ø)
services-auth-delegation-api 58.22% <ø> (ø)
services-auth-ids-api 52.10% <ø> (ø)
services-auth-personal-representative 45.62% <ø> (ø)
services-auth-personal-representative-public 41.79% <ø> (ø)
services-auth-public-api 49.59% <ø> (ø)
services-documents 60.81% <ø> (ø)
services-endorsements-api 53.26% <ø> (ø)
services-sessions 65.44% <ø> (-0.05%) ⬇️
services-university-gateway 49.34% <ø> (+0.08%) ⬆️
services-user-notification 46.88% <ø> (ø)
services-user-profile 61.81% <ø> (ø)
shared-components 26.90% <ø> (ø)
shared-form-fields 31.26% <ø> (ø)
shared-mocking 60.89% <ø> (ø)
shared-pii 92.85% <ø> (ø)
shared-problem 87.50% <ø> (ø)
shared-utils 27.69% <ø> (ø)
skilavottord-ws 24.14% <ø> (ø)
web 1.80% <ø> (ø)

Flags with carried forward coverage won't be shown.


Continue to review full report in Codecov by Sentry.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d06732d...31266c6. Read the comment docs.

@datadog-island-is
Copy link

Datadog Report

All test runs 5c424e4 🔗

99 Total Test Services: 0 Failed, 97 Passed
🔻 Test Sessions change in coverage: 1 decreased (-0.01%), 2 increased, 197 no change

Test Services
This report shows up to 10 services
| Service Name | Failed | Known Flaky | New Flaky | Passed | Skipped | Total Time | Code Coverage Change | Test Service View |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| air-discount-scheme-backend | 0 | 0 | 0 | 82 | 0 | 33.38s | N/A | Link |
| air-discount-scheme-web | 0 | 0 | 0 | 2 | 0 | 9.06s | N/A | Link |
| api | 0 | 0 | 0 | 4 | 0 | 3.19s | N/A | Link |
| api-catalogue-services | 0 | 0 | 0 | 23 | 0 | 12.93s | N/A | Link |
| api-domains-air-discount-scheme | 0 | 0 | 0 | 6 | 0 | 21.21s | N/A | Link |
| api-domains-assets | 0 | 0 | 0 | 3 | 0 | 13.29s | N/A | Link |
| api-domains-auth-admin | 0 | 0 | 0 | 18 | 0 | 14.3s | N/A | Link |
| api-domains-communications | 0 | 0 | 0 | 5 | 0 | 35.95s | N/A | Link |
| api-domains-criminal-record | 0 | 0 | 0 | 5 | 0 | 10.17s | N/A | Link |
| api-domains-driving-license | 0 | 0 | 0 | 23 | 0 | 33.61s | N/A | Link |

🔻 Code Coverage Decreases vs Default Branch (1)

  • license-api - jest 33.91% (-0.01%) - Details

@kodiakhq kodiakhq bot merged commit 3249c77 into main Nov 15, 2024
27 checks passed
@kodiakhq kodiakhq bot deleted the sa-increase-resource-request-and-limit branch November 15, 2024 08:34
norda-gunni added a commit that referenced this pull request Nov 18, 2024
…avoid crashloops (#16792)

* chore(application-system): Increase cpu and mem limit and request to avoid crashloops

* chore: charts update dirty files

* Decrease cpu resources and return memory to original value

* chore: charts update dirty files

* chore: charts update dirty files

---------

Co-authored-by: andes-it <builders@andes.is>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
kodiakhq bot added a commit that referenced this pull request Nov 18, 2024
* chore(application-system): Set initialDelay to 20 seconds (#16893)

* chore(application-system): Set initialDelay to 20 seconds

* chore: charts update dirty files

---------

Co-authored-by: andes-it <builders@andes.is>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>

* chore(application-system): Increase cpu and mem limit and request to avoid crashloops (#16792)

* chore(application-system): Increase cpu and mem limit and request to avoid crashloops

* chore: charts update dirty files

* Decrease cpu resources and return memory to original value

* chore: charts update dirty files

* chore: charts update dirty files

---------

Co-authored-by: andes-it <builders@andes.is>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>

* Remove charts

---------

Co-authored-by: andes-it <builders@andes.is>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
jonnigs pushed a commit that referenced this pull request Nov 26, 2024
…avoid crashloops (#16792)

* chore(application-system): Increase cpu and mem limit and request to avoid crashloops

* chore: charts update dirty files

* Decrease cpu resources and return memory to original value

* chore: charts update dirty files

* chore: charts update dirty files

---------

Co-authored-by: andes-it <builders@andes.is>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>