
chore(application-system): Increase cpu and mem limit and request to avoid crashloops #16792

Merged
merged 7 commits into main from sa-increase-resource-request-and-limit on Nov 15, 2024

Conversation

norda-gunni
Member

@norda-gunni norda-gunni commented Nov 11, 2024

...

Attach a link to issue if relevant

What

Increase the CPU limit to 600 millicores and the request to 200 millicores to try to eliminate crashloops caused by missed health checks.

Why

We are seeing pods go into CrashLoopBackOff due to missed health checks, which in turn appear to be caused by heavy CPU throttling.
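For reference, the change in the chart values has roughly the following shape (a sketch based on the figures discussed in the review below; the memory values are the pre-existing ones, only the CPU figures change):

# charts/services/application-system-api/values.*.yaml (sketch, not the exact diff)
resources:
  limits:
    cpu: '600m'       # raised from the previous 400m to reduce CPU throttling
    memory: '1024Mi'
  requests:
    cpu: '200m'       # higher guaranteed baseline so health checks still get CPU time
    memory: '512Mi'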

Screenshots / Gifs

Attach Screenshots / Gifs to help reviewers understand the scope of the pull request
(two screenshots attached)

Checklist:

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • Formatting passes locally with my changes
  • I have rebased against main before asking for a review

Summary by CodeRabbit

  • New Features

    • Enhanced service configurations with additional environment variables for improved functionality.
    • Increased CPU and optimized memory allocations for various services to support better performance.
    • Introduced a new migration container to facilitate database migrations.
    • Standardized health check paths across services for consistent monitoring.
  • Bug Fixes

    • Adjusted resource limits and health check configurations to ensure reliable service operation.
  • Chores

    • Updated ingress configurations for improved traffic management and security across multiple environments.

@norda-gunni norda-gunni requested a review from a team as a code owner November 11, 2024 12:57
Contributor

coderabbitai bot commented Nov 11, 2024

Walkthrough

The pull request introduces significant updates to the configuration of multiple services within the application system. Key changes include adjustments to resource limits and requests, particularly for CPU and memory, as well as the addition of new environment variables. The ingress configurations have been refined to accommodate different environments, and health check paths have been standardized across services. These modifications enhance the overall service setup, ensuring better resource management and operational consistency.

Changes

File Path Change Summary
apps/application-system/api/infra/application-system-api.ts Updated workerSetup and serviceSetup functions with new resource limits and environment variables; modified ingress settings.
charts/islandis/values.dev.yaml Adjusted NODE_OPTIONS for multiple services; modified resource limits; standardized health check paths; updated ingress annotations.
charts/islandis/values.prod.yaml Similar updates as in values.dev.yaml, focusing on environment variables, resource management, and health checks.
charts/islandis/values.staging.yaml Updates to NODE_OPTIONS, resource limits, health checks, ingress annotations, and HPA settings across services.
charts/services/application-system-api/values.dev.yaml Increased CPU limits and requests; added migrations initContainer for database migrations.
charts/services/application-system-api/values.prod.yaml Similar updates as in values.dev.yaml, with added migrations initContainer for database migrations.
charts/services/application-system-api/values.staging.yaml Similar updates as in values.dev.yaml, with added migrations initContainer for database migrations.

Possibly related PRs

Suggested labels

high priority

Suggested reviewers

  • thorkellmani
  • robertaandersen
  • eirikurn


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (3)
charts/islandis/values.staging.yaml (1)

764-768: Resource limits adjustment to prevent CrashLoopBackoff

The changes align well with the PR objectives:

  1. CPU limit increased from 400m to 1000m (1 core)
  2. CPU request increased to 300m
  3. Memory limit decreased to 768Mi
  4. Memory request set equal to limit to prevent OOM issues

These changes should help prevent CPU throttling while maintaining efficient memory usage.

Consider monitoring these metrics after deployment to ensure they meet the application's needs:

  • CPU utilization patterns
  • Memory usage trends
  • Container restart counts
charts/islandis/values.dev.yaml (1)

Line range hint 2831-2836: Consider increasing memory request for web service

While the CPU configuration looks good, the memory request (384Mi) being half of the limit (768Mi) could potentially lead to OOM issues under load. Consider:

  • Increasing memory request closer to the limit (e.g., 614Mi) to ensure more stable memory allocation
  • Or implementing a more gradual scaling strategy with the HPA
apps/application-system/api/infra/application-system-api.ts (1)

350-351: Adjust Memory Requests for Optimal Resource Utilization

Currently, both memory requests and limits are set to 768Mi. It's generally a good practice to set the memory request lower than the limit to allow for resource elasticity and efficient scheduling. This adjustment can improve cluster utilization and reduce the likelihood of pod evictions due to resource pressure.
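As a generic illustration (hypothetical values, not this service's actual configuration), a memory request set below the limit looks like this:

resources:
  limits:
    memory: '768Mi'   # hard ceiling; exceeding it gets the container OOM-killed
  requests:
    memory: '512Mi'   # amount the scheduler reserves on the node for this pod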

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 129bd5f and ede8f22.

📒 Files selected for processing (4)
  • apps/application-system/api/infra/application-system-api.ts (1 hunks)
  • charts/islandis/values.dev.yaml (2 hunks)
  • charts/islandis/values.prod.yaml (2 hunks)
  • charts/islandis/values.staging.yaml (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
apps/application-system/api/infra/application-system-api.ts (1)

Pattern apps/**/*: "Confirm that the code adheres to the following:

  • NextJS best practices, including file structure, API routes, and static generation methods.
  • Efficient state management and server-side rendering techniques.
  • Optimal use of TypeScript for component and utility type safety."
🔇 Additional comments (5)
charts/islandis/values.staging.yaml (1)

614-614: Memory limit adjustment in NODE_OPTIONS

The NODE_OPTIONS memory limit has been decreased from 921MB to 691MB. While this aligns with the overall memory optimization goal, ensure this won't negatively impact the application's performance.

✅ Verification successful

Memory limit adjustment appears safe based on service configuration patterns

Looking at the NODE_OPTIONS configurations across the codebase, the reduction from 921MB to 691MB for this service aligns with the memory patterns seen in similar services in the staging environment. Several other services operate successfully with similar or lower memory limits (460MB-691MB range).

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for any CrashLoopBackOff events related to memory issues
kubectl logs -n application-system -l app=application-system-api --tail=1000 | grep -i "heap\|memory\|oom"

Length of output: 183


Script:

#!/bin/bash
# Let's check the resource configurations in the values files
fd "values.*\.ya?ml" --exec cat {} | grep -A 5 -B 5 "max-old-space-size"

# Also check for any memory-related configurations or issues in deployment files
fd "deployment.*\.ya?ml" --exec cat {} | grep -A 5 -B 5 "resources:\|memory:"

# Look for any related configuration or documentation
rg -i "memory|heap|node_options" -A 3 -B 3 "README|CHANGELOG|docs"

Length of output: 85727

charts/islandis/values.prod.yaml (1)

605-605: Justify the specific NODE_OPTIONS memory value

The --max-old-space-size=691 seems like an oddly specific value. This should either be rounded to a more standard value or documentation should be added explaining why this exact number was chosen.

charts/islandis/values.dev.yaml (3)

767-771: Resource limits are well-balanced for stability

The new resource configuration looks good with:

  • Increased CPU limit to 1 core will help prevent throttling
  • Memory request matching limit (768Mi) helps prevent OOM kills
  • CPU request of 300m provides good baseline performance

617-617: NODE_OPTIONS memory limit properly aligned with container limits

The --max-old-space-size=691 setting is appropriately set below the container memory limit of 768Mi, leaving room for non-heap memory.
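For context, the heap cap and container limit pair up roughly like this in the rendered values (a sketch; the layout is illustrative):

env:
  NODE_OPTIONS: '--max-old-space-size=691 -r dd-trace/init'  # V8 old-space heap capped at ~691 MB
resources:
  limits:
    memory: '768Mi'   # container ceiling; ~10% headroom left for stack, buffers, and native memory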


767-771: Verify resource limits across dependent services

The resource limit changes look good but should be verified across dependent services to ensure they can handle the increased throughput.

Also applies to: 2831-2836

✅ Verification successful

Let me analyze the resource configurations across the services to verify consistency.


Resource limits are properly aligned with the system's resource tiers

Based on the analysis of resource configurations across the codebase, the specified limits (1000m CPU, 768Mi memory) align with the established resource tiers in the system:

  • Small services: 200m-400m CPU, 256Mi-512Mi memory
  • Medium services: 400m-800m CPU, 512Mi-1024Mi memory
  • Large services: 800m-1200m CPU, 768Mi-4096Mi memory

The service falls into the large tier category, and its resource limits are consistent with other high-throughput services in the codebase that have similar resource requirements (e.g., other services with 800m-1200m CPU and 768Mi-3200Mi memory).

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check resource limits across all services to ensure they can handle the increased load
echo "Checking resource limits across services..."
grep -A 4 "resources:" . | grep -E "cpu:|memory:" || true

# Check if any pods are still experiencing CrashLoopBackOff
kubectl get pods --all-namespaces | grep -i "crash"

Length of output: 304


Script:

#!/bin/bash
# Search for resource configurations in yaml files
fd -e yaml -x grep -A 4 "resources:" {} \; 2>/dev/null

# Also check specifically in the values files
fd "values.*\.yaml$" -x grep -A 4 "resources:" {} \; 2>/dev/null

Length of output: 41160

charts/islandis/values.prod.yaml: outdated review comment, resolved
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (1)
apps/application-system/api/infra/application-system-api.ts (1)

Line range hint 177-180: Consider increasing worker resource limits to prevent throttling

Given the PR's objective to address CrashLoopBackoff issues, the current worker resource configuration might be insufficient:

  • The CPU request (150m) is only 37.5% of the limit (400m), which could lead to throttling
  • Consider increasing both CPU request and limit to match the service's configuration
 .resources({
-  limits: { cpu: '400m', memory: '768Mi' },
-  requests: { cpu: '150m', memory: '384Mi' },
+  limits: { cpu: '600m', memory: '768Mi' },
+  requests: { cpu: '300m', memory: '384Mi' },
 })
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between ede8f22 and 82b23f1.

📒 Files selected for processing (1)
  • apps/application-system/api/infra/application-system-api.ts (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
apps/application-system/api/infra/application-system-api.ts (1)

Pattern apps/**/*: "Confirm that the code adheres to the following:

  • NextJS best practices, including file structure, API routes, and static generation methods.
  • Efficient state management and server-side rendering techniques.
  • Optimal use of TypeScript for component and utility type safety."

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (2)
charts/islandis/values.staging.yaml (1)

764-768: Resource limits look appropriate for preventing CrashLoopBackoff issues.

The increased CPU limit (600m) and memory limit (1024Mi) along with the corresponding requests (200m CPU, 512Mi memory) should help prevent pods from being throttled or OOMKilled. The 3:1 ratio between limits and requests provides good headroom for traffic spikes while maintaining efficient resource utilization.

Consider implementing Horizontal Pod Autoscaling (HPA) based on memory utilization in addition to the existing CPU-based scaling to better handle memory-intensive workloads.
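If memory-based scaling were adopted, the rendered Kubernetes object could look like the sketch below. The chart's own hpa block uses a project-specific schema, so this only shows the equivalent autoscaling/v2 resource; the 80% memory target is an assumed value, not taken from the chart.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: application-system-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: application-system-api
  minReplicas: 2            # matches the chart's replicas.min
  maxReplicas: 60           # matches the chart's replicas.max
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 90   # matches cpuAverageUtilization in the chart
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80   # assumed target, not from the chart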

charts/islandis/values.dev.yaml (1)

Line range hint 642-650: Review health check timing configuration

The current health check configuration:

healthCheck:
  liveness:
    initialDelaySeconds: 3
    path: '/liveness'
    timeoutSeconds: 3
  readiness:
    initialDelaySeconds: 3
    path: '/liveness'
    timeoutSeconds: 3

The timing values seem too aggressive:

  1. initialDelaySeconds: 3 might be too short for the application to properly initialize
  2. timeoutSeconds: 3 might not provide enough buffer during high CPU utilization
  3. Both liveness and readiness probes use the same timing, which could cause cascading failures

Consider adjusting the timing to be more lenient:

healthCheck:
  liveness:
-   initialDelaySeconds: 3
+   initialDelaySeconds: 10
    path: '/liveness'
-   timeoutSeconds: 3
+   timeoutSeconds: 5
  readiness:
-   initialDelaySeconds: 3
+   initialDelaySeconds: 15
    path: '/liveness'
-   timeoutSeconds: 3
+   timeoutSeconds: 5
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 82b23f1 and a7cece9.

📒 Files selected for processing (3)
  • charts/islandis/values.dev.yaml (1 hunks)
  • charts/islandis/values.prod.yaml (1 hunks)
  • charts/islandis/values.staging.yaml (1 hunks)
🔇 Additional comments (2)
charts/islandis/values.prod.yaml (1)

755-759: Resource configuration changes look appropriate

The adjustments to CPU and memory allocations are well-balanced:

  • CPU limit of 600m provides adequate headroom while preventing excessive resource usage
  • Memory limit of 1024Mi allows for sufficient heap space
  • The request-to-limit ratios (1:3 for CPU, 1:2 for memory) follow Kubernetes best practices for resource management
  • These changes should help prevent CrashLoopBackoff issues caused by resource constraints
charts/islandis/values.dev.yaml (1)

767-770: Verify the memory configuration

The memory configuration shows:

  • Limit: 1024Mi
  • Request: 512Mi

This 2:1 ratio between limit and request is reasonable, but we should verify if this is sufficient for the application's needs.

✅ Verification successful

Let me try a different approach to verify the memory configuration by examining the repository for any historical configuration changes or related files.


Based on the search results, I can now provide a comprehensive analysis of the memory configuration. The search shows consistent memory allocation patterns across different services and environments, and I can verify that the 2:1 ratio between limits and requests is a common pattern in the codebase.

Memory configuration is aligned with best practices

The memory configuration of 1024Mi limit and 512Mi request is consistent with:

  • Other similar services in the codebase using the same 2:1 ratio
  • Common memory configurations for Node.js applications in the cluster
  • Standard practices for Kubernetes resource management

The configuration appears appropriate because:

  • The 2:1 ratio allows for memory bursts while maintaining efficient resource allocation
  • Similar services in production are using comparable memory settings
  • The values are in line with the template configurations found in the codebase
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for OOMKilled events in the application-system namespace
# This will help determine if the current memory limits are adequate

echo "Checking for OOMKilled events in application-system namespace..."
kubectl get events -n application-system --field-selector reason=OOMKilled

echo "Checking for pods that were terminated due to memory issues..."
kubectl get pods -n application-system -o json | jq '.items[] | select(.status.containerStatuses[].lastState.terminated.reason=="OOMKilled") | {name: .metadata.name, reason: .status.containerStatuses[].lastState.terminated.reason, exitCode: .status.containerStatuses[].lastState.terminated.exitCode}'

Length of output: 803


Script:

#!/bin/bash
# Search for memory configuration patterns in yaml files
echo "Searching for memory configurations in yaml files..."
rg -A 2 -B 2 'memory:.*Mi' --type yaml

# Look for any documentation about resource requirements
echo -e "\nSearching for resource requirements documentation..."
fd README.md | xargs rg -i "memory|resource|requirements"

# Check for any monitoring or metrics related configuration
echo -e "\nSearching for monitoring configurations..."
rg -A 2 -B 2 'monitoring|metrics|prometheus' --type yaml

Length of output: 109382

charts/islandis/values.dev.yaml: review comment, resolved
@AndesKrrrrrrrrrrr AndesKrrrrrrrrrrr added the automerge Merge this PR as soon as all checks pass label Nov 14, 2024
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (7)
charts/services/application-system-api/values.dev.yaml (2)

Line range hint 147-156: Consider increasing health check timeouts.

The current health check configuration has very tight timing constraints:

  • 3s initial delay
  • 3s timeout

With the history of CPU throttling, these tight timeouts might contribute to failed health checks during high load periods.

Consider applying this adjustment:

  healthCheck:
    liveness:
-     initialDelaySeconds: 3
+     initialDelaySeconds: 5
      path: '/liveness'
-     timeoutSeconds: 3
+     timeoutSeconds: 5
    readiness:
-     initialDelaySeconds: 3
+     initialDelaySeconds: 5
      path: '/liveness'
-     timeoutSeconds: 3
+     timeoutSeconds: 5

Line range hint 157-164: Adjust HPA CPU target for better scaling.

The current HPA configuration with 90% CPU target utilization might be too aggressive, potentially leading to delayed scaling and CPU throttling.

Consider lowering the CPU target:

  hpa:
    scaling:
      metric:
-       cpuAverageUtilization: 90
+       cpuAverageUtilization: 70
        nginxRequestsIrate: 5
      replicas:
        max: 60
        min: 2
charts/services/application-system-api/values.prod.yaml (1)

Line range hint 147-152: Consider increasing health check timeouts

With the history of pods missing health checks, the current 3-second timeout might be too aggressive. Consider increasing timeoutSeconds to provide more buffer during high-load situations.

  healthCheck:
    liveness:
      initialDelaySeconds: 3
      path: '/liveness'
-     timeoutSeconds: 3
+     timeoutSeconds: 5
    readiness:
      initialDelaySeconds: 3
      path: '/liveness'
-     timeoutSeconds: 3
+     timeoutSeconds: 5
charts/services/application-system-api/values.staging.yaml (2)

Line range hint 147-156: Health check configuration needs improvement

Current health check configuration has several potential issues:

  1. Using the same path (/liveness) for both liveness and readiness probes
  2. Short timeout (3s) might lead to false positives
  3. Missing important probe parameters (periodSeconds, failureThreshold)

Consider applying these improvements:

 healthCheck:
   liveness:
     initialDelaySeconds: 3
     path: '/liveness'
-    timeoutSeconds: 3
+    timeoutSeconds: 5
+    periodSeconds: 10
+    failureThreshold: 3
   readiness:
     initialDelaySeconds: 3
-    path: '/liveness'
-    timeoutSeconds: 3
+    path: '/readiness'
+    timeoutSeconds: 5
+    periodSeconds: 10
+    failureThreshold: 3

Line range hint 157-164: Consider lowering HPA CPU utilization target

The current CPU utilization target of 90% is quite high, especially considering the recent CPU throttling issues. A lower target would provide more headroom for traffic spikes.

Consider this adjustment:

 hpa:
   scaling:
     metric:
-      cpuAverageUtilization: 90
+      cpuAverageUtilization: 75
       nginxRequestsIrate: 5
     replicas:
       max: 60
       min: 2
charts/islandis/values.staging.yaml (1)

Line range hint 79-79: Review NODE_OPTIONS memory limit

The Node.js heap size limit (921MB) set in NODE_OPTIONS is very close to the container memory limit (1024MB). This leaves little room for other memory usage and could lead to OOM kills. Consider reducing the heap size limit to ~70% of container memory limit (~700MB).

Apply this change:

-    NODE_OPTIONS: '--max-old-space-size=921 -r dd-trace/init'
+    NODE_OPTIONS: '--max-old-space-size=700 -r dd-trace/init'
charts/islandis/values.prod.yaml (1)

758-761: Consider adjusting memory configuration for better burst capacity

While the memory values are reasonable, setting different limits and requests would provide better burst capacity:

  resources:
    limits:
      cpu: '600m'
-     memory: '1024Mi'
+     memory: '1536Mi'  
    requests:
      cpu: '200m'
      memory: '512Mi'

This 3:1 memory limit-to-request ratio (similar to CPU) would:

  • Maintain the same guaranteed memory (512Mi)
  • Allow more headroom for temporary spikes
  • Better align with Kubernetes QoS best practices
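For background on the QoS point: Kubernetes derives a pod's QoS class from how requests relate to limits, so the two patterns differ like this (illustrative values only):

# Guaranteed QoS: requests equal limits for every container and resource;
# evicted last under node pressure, but no burst headroom.
resources:
  limits:   { cpu: '600m', memory: '1024Mi' }
  requests: { cpu: '600m', memory: '1024Mi' }

# Burstable QoS: requests below limits; the pod can burst, but is evicted
# before Guaranteed pods when the node runs short of memory.
resources:
  limits:   { cpu: '600m', memory: '1536Mi' }
  requests: { cpu: '200m', memory: '512Mi' }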
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between a7cece9 and 31266c6.

📒 Files selected for processing (7)
  • apps/application-system/api/infra/application-system-api.ts (1 hunks)
  • charts/islandis/values.dev.yaml (1 hunks)
  • charts/islandis/values.prod.yaml (1 hunks)
  • charts/islandis/values.staging.yaml (1 hunks)
  • charts/services/application-system-api/values.dev.yaml (1 hunks)
  • charts/services/application-system-api/values.prod.yaml (1 hunks)
  • charts/services/application-system-api/values.staging.yaml (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
apps/application-system/api/infra/application-system-api.ts (1)

Pattern apps/**/*: "Confirm that the code adheres to the following:

  • NextJS best practices, including file structure, API routes, and static generation methods.
  • Efficient state management and server-side rendering techniques.
  • Optimal use of TypeScript for component and utility type safety."
🔇 Additional comments (13)
charts/services/application-system-api/values.dev.yaml (2)

Line range hint 187-195: Init container resource configuration looks good.

The resource allocation for the migrations container is well-balanced:

  • Reasonable CPU limits (200m) and requests (50m)
  • Appropriate memory configuration for database migrations

213-216: Resource configuration changes look appropriate.

The increased CPU limits (600m) and requests (200m) should help prevent CrashLoopBackoff issues by:

  • Providing more CPU headroom for peak loads
  • Maintaining a healthy request-to-limit ratio (1:3)
  • Reducing the likelihood of CPU throttling

Let's verify the historical resource usage to confirm these limits are appropriate:

charts/services/application-system-api/values.prod.yaml (2)

213-216: LGTM! Resource adjustments look appropriate.

The increase in CPU resources should help prevent CrashLoopBackoff issues:

  • CPU limit (600m) provides adequate headroom for spikes
  • CPU request (200m) ensures sufficient guaranteed resources
  • The ratio between request and limit (1:3) allows for good burst capacity while maintaining resource efficiency

Line range hint 153-160: Monitor HPA behavior with new resource settings

The HPA is configured with a 90% CPU target utilization. With the increased CPU resources, verify that:

  1. The scaling behavior remains appropriate
  2. The min/max replica counts (2-60) are still suitable for the new resource profile
✅ Verification successful

HPA configuration appears appropriate for the resource profile

The HPA configuration aligns well with the container's resource settings:

  • The 90% CPU target utilization is appropriate for the container's resource limits (600m CPU) and requests (200m CPU)
  • The min/max replica range (2-60) provides good scalability while maintaining high availability
  • The configuration is consistent across all environments (dev/staging/prod)
  • The secondary metric (nginxRequestsIrate: 5) provides additional scaling protection
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check HPA metrics and scaling events after resource changes

# Get HPA status and recent scaling events
kubectl get hpa -n application-system application-system-api -o yaml

# Check recent pod events for any resource-related issues
kubectl get events -n application-system --sort-by='.lastTimestamp' | grep -E "application-system-api|HorizontalPodAutoscaler"

Length of output: 347


Script:

#!/bin/bash
# Let's analyze the Helm chart values to understand the resource configurations and HPA settings

# Check the resource configurations in values files
fd -e yaml -e yml . charts/services/application-system-api/ --exec cat {}

# Look for any HPA-related configurations in other files
rg -A 5 -B 5 "HorizontalPodAutoscaler|autoscaling" charts/

Length of output: 44595


Script:

#!/bin/bash
# Let's analyze the resource configurations to understand the scaling context

# Check resource configurations across environments
rg -A 5 "resources:|cpu:|memory:" charts/services/application-system-api/

# Look for any scaling-related configurations
rg -A 5 "scaling:|replicas:|cpu" charts/services/application-system-api/

Length of output: 13642

apps/application-system/api/infra/application-system-api.ts (3)

349-350: Using same endpoint for health checks


Line range hint 132-159: LGTM: Environment variables configuration

The environment variables for application recipients are well-structured with proper environment-specific values and follow security best practices.


Line range hint 160-240: Verify new service integrations

New xroad services (Labor, HealthInsurance, NationalRegistryB2C) have been added. Let's verify their configuration and usage.

✅ Verification successful

Let me generate additional scripts to verify the usage of these services in the codebase.


Based on the code analysis, I can now provide a final response.

New xroad service integrations are properly configured

The new xroad services are correctly configured in the infrastructure:

  • HealthInsurance: Configured with proper WSDL URLs for dev/staging/prod environments
  • Labor: Set up with VMST parental leave API paths for all environments
  • NationalRegistryB2C: Configured with appropriate client IDs for each environment

These services are properly integrated into the application system API and are used by various modules across the codebase. The configurations follow the established patterns and include all necessary environment-specific settings.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for service configurations and dependencies
rg -g '*.ts' -A 5 'Labor|HealthInsurance|NationalRegistryB2C'

# Check for any existing implementations using these services
ast-grep --pattern 'class $_ {
  $$$
  constructor($$$) {
    $$$
    Labor|HealthInsurance|NationalRegistryB2C
    $$$
  }
  $$$
}'

Length of output: 181752


Script:

#!/bin/bash
# Check for service configurations and dependencies in the application system API
rg -g '*.ts' -A 5 'XROAD_HEALTH_INSURANCE|XROAD_VMST|XROAD_NATIONAL_REGISTRY_B2C' apps/application-system/api/

# Check for service configurations in infrastructure
rg -g '*.ts' -A 5 'HealthInsurance|Labor|NationalRegistryB2C' infra/src/dsl/

# Check for service imports and usage in modules
ast-grep --pattern 'import { $_ } from $$$HealthInsurance$$$'

Length of output: 1488

charts/services/application-system-api/values.staging.yaml (2)

Line range hint 1-277: Overall configuration is well-structured and secure

The configuration demonstrates good security practices with:

  • Proper pod security context
  • Well-defined resource limits
  • Appropriate service account configuration
  • Comprehensive environment variable management

213-217: Resource allocation changes look appropriate

The increased CPU limits (600m) and requests (200m) should help prevent CrashLoopBackoff issues caused by CPU throttling. The 1:3 ratio between request and limit is within the recommended range for efficient resource utilization.

Let's verify the pod's resource usage patterns to ensure these limits are appropriate:

charts/islandis/values.staging.yaml (2)

767-770: Resource limits look appropriate for preventing CPU throttling

The increased CPU limit (600m) and request (200m) values should help prevent the CrashLoopBackoff issues by:

  • Providing more CPU headroom with the 600m limit
  • Ensuring a base CPU allocation of 200m
  • Maintaining a healthy request:limit ratio of 1:3

The memory configuration (512Mi request, 1024Mi limit) also looks appropriate for a Node.js application.


Line range hint 841-850: Security configuration follows best practices

The security settings are well configured with:

  • Appropriate pod security context
  • Proper privilege restrictions
  • Correctly configured service account with IAM role
charts/islandis/values.prod.yaml (1)

758-761: CPU resource changes look good

The increased CPU limit (600m) and request (200m) should help prevent throttling and CrashLoopBackoff issues. The 3:1 limit-to-request ratio provides good balance between guaranteed resources and burst capacity.

charts/islandis/values.dev.yaml (1)

771-774: ⚠️ Potential issue

Resource configuration may need further adjustment

The current configuration still has a large gap between requests and limits which could lead to resource contention:

  • CPU: 200m request vs 600m limit (3x difference)
  • Memory: 512Mi request vs 1024Mi limit (2x difference)

This configuration might still be susceptible to the CrashLoopBackoff issues mentioned in the PR objectives.

Consider adjusting the values to have a more balanced ratio between requests and limits:

resources:
  limits:
    cpu: '600m'
    memory: '1024Mi'
  requests:
-   cpu: '200m'
+   cpu: '400m'
    memory: '512Mi'

Let me verify the pod's resource utilization patterns to better inform these recommendations:

#!/bin/bash
# Description: Check pod resource utilization patterns in the application-system namespace

# Get pod metrics for the application-system namespace
kubectl top pods -n application-system --sort-by=cpu
kubectl top pods -n application-system --sort-by=memory

# Get any CrashLoopBackoff events in the last hour
kubectl get events -n application-system --field-selector type=Warning,reason=BackOff --sort-by='.lastTimestamp'


codecov bot commented Nov 14, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 36.44%. Comparing base (d06732d) to head (31266c6).

Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##             main   #16792   +/-   ##
=======================================
  Coverage   36.44%   36.44%           
=======================================
  Files        6851     6851           
  Lines      143569   143569           
  Branches    40980    40980           
=======================================
  Hits        52330    52330           
  Misses      91239    91239           
Flag Coverage Δ
air-discount-scheme-backend 54.08% <ø> (ø)
air-discount-scheme-web 0.00% <ø> (ø)
api 3.34% <ø> (ø)
api-catalogue-services 77.85% <ø> (ø)
api-domains-air-discount-scheme 37.08% <ø> (ø)
api-domains-assets 26.71% <ø> (ø)
api-domains-auth-admin 48.48% <ø> (ø)
api-domains-communications 39.64% <ø> (ø)
api-domains-criminal-record 47.56% <ø> (ø)
api-domains-driving-license 44.50% <ø> (ø)
api-domains-education 30.62% <ø> (ø)
api-domains-health-insurance 34.43% <ø> (ø)
api-domains-mortgage-certificate 34.74% <ø> (ø)
api-domains-payment-schedule 41.31% <ø> (ø)
application-api-files 62.52% <ø> (ø)
application-core 71.06% <ø> (+0.31%) ⬆️
application-system-api 40.99% <ø> (ø)
application-template-api-modules 27.67% <ø> (+<0.01%) ⬆️
application-templates-accident-notification 28.98% <ø> (ø)
application-templates-car-recycling 3.12% <ø> (ø)
application-templates-criminal-record 25.87% <ø> (ø)
application-templates-driving-license 18.26% <ø> (ø)
application-templates-estate 12.14% <ø> (ø)
application-templates-example-payment 24.80% <ø> (ø)
application-templates-financial-aid 15.48% <ø> (ø)
application-templates-general-petition 23.07% <ø> (ø)
application-templates-inheritance-report 6.52% <ø> (ø)
application-templates-marriage-conditions 15.04% <ø> (ø)
application-templates-mortgage-certificate 43.34% <ø> (ø)
application-templates-parental-leave 29.86% <ø> (ø)
application-types 6.60% <ø> (ø)
application-ui-components 1.27% <ø> (ø)
application-ui-shell 20.83% <ø> (ø)
auth-admin-web 2.43% <ø> (ø)
auth-nest-tools 30.92% <ø> (ø)
auth-react 21.85% <ø> (ø)
auth-shared 75.00% <ø> (ø)
clients-charge-fjs-v2 24.11% <ø> (ø)
clients-driving-license 40.28% <ø> (ø)
clients-driving-license-book 43.50% <ø> (ø)
clients-financial-statements-inao 49.06% <ø> (ø)
clients-license-client 1.26% <ø> (ø)
clients-middlewares 73.56% <ø> (+0.33%) ⬆️
clients-regulations 42.36% <ø> (ø)
clients-rsk-company-registry 29.76% <ø> (ø)
clients-rsk-personal-tax-return 38.00% <ø> (ø)
clients-smartsolutions 12.77% <ø> (ø)
clients-syslumenn 49.23% <ø> (ø)
clients-zendesk 50.61% <ø> (ø)
cms 0.42% <ø> (ø)
cms-translations 38.94% <ø> (ø)
content-search-index-manager 95.65% <ø> (ø)
content-search-toolkit 8.14% <ø> (ø)
contentful-apps 4.69% <ø> (ø)
dokobit-signing 62.76% <ø> (ø)
download-service 44.16% <ø> (ø)
email-service 60.57% <ø> (ø)
feature-flags 90.62% <ø> (ø)
file-storage 46.09% <ø> (ø)
financial-aid-backend 51.29% <ø> (ø)
financial-aid-shared 17.81% <ø> (ø)
icelandic-names-registry-backend 54.34% <ø> (ø)
infra-nest-server 48.37% <ø> (ø)
infra-tracing 43.24% <ø> (ø)
island-ui-core 28.88% <ø> (ø)
judicial-system-api 19.76% <ø> (ø)
judicial-system-audit-trail 68.86% <ø> (ø)
judicial-system-backend 55.14% <ø> (ø)
judicial-system-formatters 79.26% <ø> (ø)
judicial-system-message 66.99% <ø> (ø)
judicial-system-message-handler 47.99% <ø> (ø)
judicial-system-scheduler 70.67% <ø> (ø)
judicial-system-types 43.58% <ø> (ø)
judicial-system-web 27.16% <ø> (ø)
license-api 42.58% <ø> (-0.02%) ⬇️
localization 10.15% <ø> (ø)
logging 48.43% <ø> (ø)
message-queue 68.79% <ø> (ø)
nest-audit 68.20% <ø> (ø)
nest-aws 54.03% <ø> (ø)
nest-config 78.00% <ø> (ø)
nest-core 43.54% <ø> (ø)
nest-feature-flags 51.09% <ø> (ø)
nest-problem 45.82% <ø> (ø)
nest-sequelize 94.44% <ø> (ø)
nest-swagger 51.71% <ø> (ø)
nova-sms 62.09% <ø> (ø)
portals-admin-regulations-admin 1.85% <ø> (ø)
portals-core 15.89% <ø> (ø)
reference-backend 49.74% <ø> (ø)
regulations 16.78% <ø> (ø)
residence-history 85.00% <ø> (ø)
services-auth-admin-api 52.50% <ø> (ø)
services-auth-delegation-api 58.22% <ø> (ø)
services-auth-ids-api 52.10% <ø> (ø)
services-auth-personal-representative 45.62% <ø> (ø)
services-auth-personal-representative-public 41.79% <ø> (ø)
services-auth-public-api 49.59% <ø> (ø)
services-documents 60.81% <ø> (ø)
services-endorsements-api 53.26% <ø> (ø)
services-sessions 65.44% <ø> (-0.05%) ⬇️
services-university-gateway 49.34% <ø> (+0.08%) ⬆️
services-user-notification 46.88% <ø> (ø)
services-user-profile 61.81% <ø> (ø)
shared-components 26.90% <ø> (ø)
shared-form-fields 31.26% <ø> (ø)
shared-mocking 60.89% <ø> (ø)
shared-pii 92.85% <ø> (ø)
shared-problem 87.50% <ø> (ø)
shared-utils 27.69% <ø> (ø)
skilavottord-ws 24.14% <ø> (ø)
web 1.80% <ø> (ø)

Flags with carried forward coverage won't be shown.


Continue to review full report in Codecov by Sentry.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d06732d...31266c6. Read the comment docs.

@datadog-island-is
Copy link

Datadog Report

All test runs 5c424e4 🔗

99 Total Test Services: 0 Failed, 97 Passed
🔻 Test Sessions change in coverage: 1 decreased (-0.01%), 2 increased, 197 no change

Test Services
This report shows up to 10 services
| Service Name | Failed | Known Flaky | New Flaky | Passed | Skipped | Total Time | Code Coverage Change | Test Service View |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| air-discount-scheme-backend | 0 | 0 | 0 | 82 | 0 | 33.38s | N/A | Link |
| air-discount-scheme-web | 0 | 0 | 0 | 2 | 0 | 9.06s | N/A | Link |
| api | 0 | 0 | 0 | 4 | 0 | 3.19s | N/A | Link |
| api-catalogue-services | 0 | 0 | 0 | 23 | 0 | 12.93s | N/A | Link |
| api-domains-air-discount-scheme | 0 | 0 | 0 | 6 | 0 | 21.21s | N/A | Link |
| api-domains-assets | 0 | 0 | 0 | 3 | 0 | 13.29s | N/A | Link |
| api-domains-auth-admin | 0 | 0 | 0 | 18 | 0 | 14.3s | N/A | Link |
| api-domains-communications | 0 | 0 | 0 | 5 | 0 | 35.95s | N/A | Link |
| api-domains-criminal-record | 0 | 0 | 0 | 5 | 0 | 10.17s | N/A | Link |
| api-domains-driving-license | 0 | 0 | 0 | 23 | 0 | 33.61s | N/A | Link |

🔻 Code Coverage Decreases vs Default Branch (1)

  • license-api - jest 33.91% (-0.01%) - Details

@kodiakhq kodiakhq bot merged commit 3249c77 into main Nov 15, 2024
27 checks passed
@kodiakhq kodiakhq bot deleted the sa-increase-resource-request-and-limit branch November 15, 2024 08:34
norda-gunni added a commit that referenced this pull request Nov 18, 2024
…avoid crashloops (#16792)

* chore(application-system): Increase cpu and mem limit and request to avoid crashloops

* chore: charts update dirty files

* Decrease cpu resources and return memory to original value

* chore: charts update dirty files

* chore: charts update dirty files

---------

Co-authored-by: andes-it <builders@andes.is>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
kodiakhq bot added a commit that referenced this pull request Nov 18, 2024
* chore(application-system): Set initialDelay to 20 seconds (#16893)

* chore(application-system): Set initialDelay to 20 seconds

* chore: charts update dirty files

---------

Co-authored-by: andes-it <builders@andes.is>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>

* chore(application-system): Increase cpu and mem limit and request to avoid crashloops (#16792)

* chore(application-system): Increase cpu and mem limit and request to avoid crashloops

* chore: charts update dirty files

* Decrease cpu resources and return memory to original value

* chore: charts update dirty files

* chore: charts update dirty files

---------

Co-authored-by: andes-it <builders@andes.is>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>

* Remove charts

---------

Co-authored-by: andes-it <builders@andes.is>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
jonnigs pushed a commit that referenced this pull request Nov 26, 2024
…avoid crashloops (#16792)

* chore(application-system): Increase cpu and mem limit and request to avoid crashloops

* chore: charts update dirty files

* Decrease cpu resources and return memory to original value

* chore: charts update dirty files

* chore: charts update dirty files

---------

Co-authored-by: andes-it <builders@andes.is>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>