chore: allow staging to scale down more #17227
Conversation
Walkthrough
The changes in this pull request primarily involve significant modifications to the configuration settings across multiple staging Helm values files.
Datadog Report
All test runs ✅
21 Total Test Services: 0 Failed, 20 Passed
Test Services: this report shows up to 10 services
Actionable comments posted: 9
🧹 Outside diff range and nitpick comments (13)
charts/services/services-auth-personal-representative-public/values.staging.yaml (2)
65-67: LGTM! Consider implementing gradual rollout.
The replica count changes are consistent with the HPA configuration. However, since this is an authentication service, consider implementing these changes gradually to ensure service stability.
Consider the following rollout strategy (a hedged values sketch follows the list):
- Start with min: 1, max: 5 for a week
- Monitor service performance and error rates
- If stable, proceed with the proposed min: 1, max: 3
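A possible shape for the intermediate step in values.staging.yaml, assuming the hpa/replicaCount layout referenced elsewhere in this review; the numbers are the ones proposed above, not measured values:
hpa:
  scaling:
    replicas:
      min: 1
      max: 5        # hold here for roughly a week before moving to max: 3
      default: 1
    metric:
      cpuAverageUtilization: 90   # unchanged in this step
replicaCount:
  min: 1
  max: 5
  default: 1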
Line range hint 47-67: Add monitoring alerts for scale-down events.
Since we're allowing the service to scale down to 1 replica, it's crucial to have proper monitoring in place.
Consider adding the following monitoring configurations (a hedged alert sketch follows the list):
- Alert on pod restart events when running with 1 replica
- Monitor response latency during scale-up events
- Track the correlation between request patterns and pod count
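As a rough illustration of the first bullet, a PrometheusRule along these lines could alert on restarts while only one replica is running. It assumes kube-state-metrics and the Prometheus Operator are available in the cluster; the namespace, names, and thresholds are placeholders:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: personal-representative-public-scale-down   # placeholder name
spec:
  groups:
    - name: scale-down
      rules:
        - alert: PodRestartWhileSingleReplica
          expr: |
            increase(kube_pod_container_status_restarts_total{namespace="identity-server"}[10m]) > 0
            and on(namespace)
            kube_deployment_status_replicas{namespace="identity-server"} == 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: Pod restarted while the deployment was down to a single replica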
charts/services/services-auth-ids-api/values.staging.yaml (2)
84-85: Review scaling configuration carefully for this critical service
While reducing replicas aligns with cost optimization goals, this is an authentication service that requires high availability. Consider the following points:
- Setting min replicas to 1 eliminates redundancy during low traffic
- Max replicas of 3 might be insufficient during traffic spikes
- Current CPU threshold is set high at 90%
Recommendations (sketched as staging values after the list):
- Keep minimum replicas at 2 to maintain high availability
- Consider a more moderate max replica reduction (e.g., 5-6)
- Lower the CPU utilization target (e.g., 70-75%) to allow more proactive scaling
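Expressed as a hedged values sketch, under the same assumed hpa/replicaCount schema as the existing staging files:
hpa:
  scaling:
    replicas:
      min: 2        # keep redundancy for the auth path
      max: 6        # moderate reduction instead of dropping straight to 3
      default: 2
    metric:
      cpuAverageUtilization: 70   # scale out before pods are saturated
replicaCount:
  min: 2
  max: 6
  default: 2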
84-85: Consider implementing gradual scaling changes
The current changes represent a significant reduction in scaling capacity (from max 15 to 3 replicas). For a critical authentication service, consider:
- Implementing these changes gradually
- Monitoring service performance metrics during the transition
- Setting up alerts for resource utilization
Recommended approach:
- First phase: Reduce max replicas to 6-8
- Monitor for 1-2 weeks
- If metrics support further reduction, then scale down to 3-4
- Keep minimum 2 replicas for high availability
Also applies to: 130-132
charts/services/search-indexer-service/values.staging.yaml (1)
52-53: Consider lowering the CPU utilization threshold for more responsive scaling
While the replica range (min: 1, max: 3) aligns well with the PR objective of cost optimization, the CPU threshold of 90% (defined above) might be too high for responsive scaling. High thresholds can lead to delayed scaling events and potential performance degradation.
Consider adjusting the cpuAverageUtilization to a more conservative value like 70% for better responsiveness:
metric:
-  cpuAverageUtilization: 90
+  cpuAverageUtilization: 70
charts/services/license-api/values.staging.yaml (1)
59-60: Implement additional monitoring for reduced capacity
Given the service's critical nature (license management) and external dependencies, recommend:
- Setting up alerts for sustained high resource usage
- Monitoring response times and error rates more closely
- Documenting scaling behavior for on-call response
Consider implementing:
- Custom metrics for external dependency health
- Circuit breakers for critical paths
- Detailed logging for scaling events
Also applies to: 77-79
charts/services/portals-admin/values.staging.yaml (1)
66-68: Confirm replicaCount aligns with HPA configuration
The replicaCount configuration matches the HPA settings, which is good. However, with maxUnavailable: 1 in the PodDisruptionBudget and minimum replicas set to 1, the service might experience downtime during node maintenance or pod evictions. Consider implementing proper readiness probes and graceful shutdown handling to minimize potential impact.
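A hedged sketch of the pod-spec fields these suggestions map to; the port, paths, and timings are illustrative rather than taken from the chart:
containers:
  - name: portals-admin
    readinessProbe:
      httpGet:
        path: /readiness
        port: 4200              # illustrative port
      initialDelaySeconds: 3
      timeoutSeconds: 3
    lifecycle:
      preStop:
        exec:
          command: ['sh', '-c', 'sleep 10']   # give the load balancer time to drain before SIGTERM
terminationGracePeriodSeconds: 30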
charts/services/service-portal/values.staging.yaml (1)
49-50: Review scaling thresholds for external-facing service
As this is an external-facing service (exposed via ALB), consider:
- Setting appropriate CPU/memory thresholds for scaling
- Implementing rate limiting at the ingress level
- Adding buffer capacity for sudden traffic spikes
Current CPU request of 5m is very low and might affect HPA decisions based on CPU utilization (90%).
Consider adjusting resource requests and HPA metrics to better handle external traffic patterns.
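One hedged possibility, assuming the resources/hpa layout used by the other staging values in this PR; the numbers are illustrative:
resources:
  requests:
    cpu: 100m        # raised from 5m so the 90% CPU target becomes a meaningful scaling signal
    memory: 256Mi
hpa:
  scaling:
    metric:
      cpuAverageUtilization: 70
      nginxRequestsIrate: 8    # request-rate trigger to catch spikes the CPU metric misses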
charts/services/services-sessions/values.staging.yaml (1)
72-73: Consider maintaining minimum 2 replicas for high availability
While cost optimization is important, session management is critical for user authentication. Consider keeping minimum replicas at 2 to ensure high availability.
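A minimal sketch, assuming the hpa/replicaCount layout used by the other staging values in this PR:
hpa:
  scaling:
    replicas:
      min: 2      # one spare replica for deployments and node failures
      max: 3
      default: 2
replicaCount:
  min: 2
  max: 3
  default: 2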
charts/services/web/values.staging.yaml (1)
75-77: Consider gradual reduction in max replicas
Instead of reducing max replicas directly from 50 to 3, consider a phased approach (sketched after the list):
- Monitor current usage patterns
- Reduce to an intermediate value first (e.g., 10)
- Further reduce based on observed behavior
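An illustrative intermediate step for the web staging values, assuming the same hpa/replicaCount layout as the other charts:
hpa:
  scaling:
    replicas:
      min: 1
      max: 10     # intermediate ceiling; reduce further once observed peaks stay well below it
      default: 1
replicaCount:
  min: 1
  max: 10
  default: 1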
charts/services/services-auth-delegation-api/values.staging.yaml (1)
92-94: Consider staggered rollout for production
The scaling configuration changes look reasonable for staging. However, when applying to production:
- Consider monitoring service behavior with new limits
- Plan for gradual rollout to ensure service stability
infra/src/dsl/output-generators/map-to-helm-values.ts (1)
112-130: Consider making the replica count configuration more flexible.
While the implementation achieves the goal of reducing staging environment scaling, consider these improvements:
- Move hardcoded values (1, 3) to environment configuration
- Extract the replica count object creation to a helper function
- Use environment configuration to determine scaling limits instead of hardcoding 'staging'
-      if (env1.type == 'staging') {
-        result.replicaCount = {
-          min: 1,
-          max: 3,
-          default: 1,
-        }
-      } else {
+      const getReplicaCount = (min: number, max: number, defaultCount: number) => ({
+        min,
+        max,
+        default: defaultCount,
+      });
+
+      if (env1.type === 'staging') {
+        result.replicaCount = getReplicaCount(
+          env1.stagingMinReplicas ?? 1,
+          env1.stagingMaxReplicas ?? 3,
+          env1.stagingDefaultReplicas ?? 1
+        );
+      } else {
         if (serviceDef.replicaCount) {
-          result.replicaCount = {
-            min: serviceDef.replicaCount.min,
-            max: serviceDef.replicaCount.max,
-            default: serviceDef.replicaCount.default,
-          }
+          result.replicaCount = getReplicaCount(
+            serviceDef.replicaCount.min,
+            serviceDef.replicaCount.max,
+            serviceDef.replicaCount.default
+          );
         } else {
-          result.replicaCount = {
-            min: env1.defaultMinReplicas,
-            max: env1.defaultMaxReplicas,
-            default: env1.defaultMinReplicas,
-          }
+          result.replicaCount = getReplicaCount(
+            env1.defaultMinReplicas,
+            env1.defaultMaxReplicas,
+            env1.defaultMinReplicas
+          );
         }
       }
charts/identity-server/values.staging.yaml (1)
Line range hint 65-71: Consider adjusting resource allocations with new scaling configuration.
With fewer replicas, individual pods need to handle more load. Current resource allocations should be reviewed:
- Some services have high CPU limits (4000m) which may be excessive for 1-3 replicas
- Memory allocations vary significantly between services
- Resource requests might need adjustment to ensure proper scheduling with fewer total pods
Consider the following (an illustrative resource sketch is included below):
- Adjusting resource limits based on historical usage data
- Implementing horizontal pod autoscaling based on custom metrics
- Setting up proper monitoring and alerts for resource usage
Also applies to: 295-301, 393-399, 528-534, 684-690, 753-759, 852-858
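For example, a service currently capped at 4000m CPU might be right-sized along these lines once historical usage confirms it; the numbers below are illustrative, not measured:
resources:
  requests:
    cpu: 400m
    memory: 512Mi
  limits:
    cpu: 1000m      # instead of 4000m, if historical peaks stay well under this
    memory: 768Mi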
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (34)
charts/identity-server/values.staging.yaml (14 hunks)
charts/islandis/values.staging.yaml (36 hunks)
charts/judicial-system/values.staging.yaml (6 hunks)
charts/services/air-discount-scheme-api/values.staging.yaml (2 hunks)
charts/services/air-discount-scheme-backend/values.staging.yaml (2 hunks)
charts/services/air-discount-scheme-web/values.staging.yaml (2 hunks)
charts/services/api/values.staging.yaml (2 hunks)
charts/services/application-system-api/values.staging.yaml (2 hunks)
charts/services/auth-admin-web/values.staging.yaml (2 hunks)
charts/services/consultation-portal/values.staging.yaml (2 hunks)
charts/services/judicial-system-api/values.staging.yaml (2 hunks)
charts/services/judicial-system-backend/values.staging.yaml (2 hunks)
charts/services/judicial-system-scheduler/values.staging.yaml (2 hunks)
charts/services/license-api/values.staging.yaml (2 hunks)
charts/services/portals-admin/values.staging.yaml (2 hunks)
charts/services/search-indexer-service/values.staging.yaml (2 hunks)
charts/services/service-portal-api/values.staging.yaml (2 hunks)
charts/services/service-portal/values.staging.yaml (2 hunks)
charts/services/services-auth-admin-api/values.staging.yaml (2 hunks)
charts/services/services-auth-delegation-api/values.staging.yaml (2 hunks)
charts/services/services-auth-ids-api/values.staging.yaml (2 hunks)
charts/services/services-auth-personal-representative-public/values.staging.yaml (2 hunks)
charts/services/services-auth-personal-representative/values.staging.yaml (2 hunks)
charts/services/services-auth-public-api/values.staging.yaml (2 hunks)
charts/services/services-bff-portals-admin/values.staging.yaml (2 hunks)
charts/services/services-bff-portals-my-pages/values.staging.yaml (2 hunks)
charts/services/services-sessions/values.staging.yaml (2 hunks)
charts/services/services-university-gateway/values.staging.yaml (2 hunks)
charts/services/skilavottord-web/values.staging.yaml (2 hunks)
charts/services/user-notification-worker/values.staging.yaml (2 hunks)
charts/services/web/values.staging.yaml (2 hunks)
infra/src/dsl/basic.spec.ts (1 hunks)
infra/src/dsl/hpa.spec.ts (1 hunks)
infra/src/dsl/output-generators/map-to-helm-values.ts (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
infra/src/dsl/hpa.spec.ts (1)
Pattern infra/src/dsl/**/*: "Confirm that the code adheres to the following:
- The clarity and expressiveness of the DSL syntax.
- Integration with Helm charts and Kubernetes resources.
- Documentation on how to use the DSL to create complex Helm values."
infra/src/dsl/basic.spec.ts (1)
Pattern infra/src/dsl/**/*: "Confirm that the code adheres to the following:
- The clarity and expressiveness of the DSL syntax.
- Integration with Helm charts and Kubernetes resources.
- Documentation on how to use the DSL to create complex Helm values."
infra/src/dsl/output-generators/map-to-helm-values.ts (1)
Pattern infra/src/dsl/**/*: "Confirm that the code adheres to the following:
- The clarity and expressiveness of the DSL syntax.
- Integration with Helm charts and Kubernetes resources.
- Documentation on how to use the DSL to create complex Helm values."
🔇 Additional comments (32)
charts/services/services-auth-personal-representative-public/values.staging.yaml (1)
47-48
: LGTM! Verify service stability with reduced replicas.
The HPA configuration changes align well with the PR objective of cost optimization during low-usage periods.
Let's verify other dependent configurations:
✅ Verification successful
HPA configuration changes are safe and environment-appropriate
The reduced replica count (max: 3, min: 1) in staging is a reasonable cost-optimization change that maintains a clear environment progression:
- Development and Production maintain higher settings (max: 10, min: 2)
- Staging gets cost-optimized settings while keeping the same CPU and nginx request rate targets
- All environments maintain consistent metric configurations (90% CPU, 5 nginx requests)
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check for related HPA configurations in other staging files
rg -g "values.staging.yaml" "hpa:\|replicas:" charts/
Length of output: 53
Script:
#!/bin/bash
# Let's try a different approach to find HPA and replica configurations
rg -g "*.yaml" "hpa:|replicas:" charts/services/services-auth-personal-representative-public/
Length of output: 624
Script:
#!/bin/bash
# Let's check the actual values in all environment files
echo "=== Production HPA and Replica Settings ==="
rg -A 5 "hpa:|replicas:" charts/services/services-auth-personal-representative-public/values.prod.yaml
echo -e "\n=== Development HPA and Replica Settings ==="
rg -A 5 "hpa:|replicas:" charts/services/services-auth-personal-representative-public/values.dev.yaml
echo -e "\n=== Staging HPA and Replica Settings ==="
rg -A 5 "hpa:|replicas:" charts/services/services-auth-personal-representative-public/values.staging.yaml
Length of output: 1367
charts/services/services-auth-ids-api/values.staging.yaml (1)
130-132
: Verify replication settings align with service level objectives
The default replica count of 1 matches the HPA minimum, but this configuration might impact the service's reliability. This service appears to handle critical authentication operations based on its environment variables and endpoints.
Let's analyze the service dependencies and traffic patterns:
✅ Verification successful
Based on the search results, I can see that the auth-ids-api service is a critical authentication service that is integrated with multiple other services across the platform, including:
- Service portals (admin, my-pages)
- Judicial system
- User notification services
- Session management services
- License API
- Delegation services
Given this is a central authentication service with many dependencies, let's verify the actual traffic patterns and resource usage:
Replication settings are appropriate for the service's requirements
The current HPA configuration (min: 1, max: 3) is well-aligned with the service's resource profile and operational requirements:
- The service has proper resource allocation:
  - Requests: 400m CPU, 512Mi memory
  - Limits: 800m CPU, 768Mi memory
- Health monitoring is configured:
  - Readiness probe: /health/check
  - Liveness probe: /liveness
  - Both probes have appropriate timeouts (3s)
- The minimum replica count of 1 is sufficient for the staging environment because:
  - The service has adequate CPU/memory resources allocated
  - Health probes ensure service availability
  - HPA can scale up to 3 replicas when needed
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check for service dependencies in the cluster
kubectl get ingress -n identity-server -o json | jq '.items[].spec.rules[].http.paths[] | select(.backend.service.name | contains("auth-ids-api"))'
# Check if other services depend on this API
rg -A 3 "identity-server" --type yaml
Length of output: 68109
Script:
#!/bin/bash
# Check for resource requests/limits in the deployment
rg -A 10 "resources:|requests:|limits:" charts/services/services-auth-ids-api/values.staging.yaml
# Look for HPA configuration
rg -A 5 "autoscaling:" charts/services/services-auth-ids-api/values.staging.yaml
# Check if there are any readiness/liveness probes
rg -A 5 "readiness:|liveness:" charts/services/services-auth-ids-api/values.staging.yaml
Length of output: 1641
charts/services/search-indexer-service/values.staging.yaml (2)
130-132
: LGTM: Replica configuration aligns with scaling objectives
The replica count configuration (min: 1, max: 3) is well-aligned with the PR's objective to optimize resource usage during low-traffic periods while maintaining service availability.
Line range hint 111-111
: Verify memory settings alignment
The NODE_OPTIONS memory limit (2048MB) seems low compared to the container's memory request (3072Mi) and limit (4096Mi). This mismatch might cause the Node.js process to crash before utilizing the allocated container memory.
Let's analyze the memory usage patterns:
Consider adjusting NODE_OPTIONS to better align with container resources:
- NODE_OPTIONS: '--max-old-space-size=2048'
+ NODE_OPTIONS: '--max-old-space-size=3072'
charts/services/license-api/values.staging.yaml (2)
59-60
: Review scaling thresholds with reduced replica count
While reducing replicas aligns with cost optimization goals, the combination of high CPU threshold (90%) and low max replicas (3) might impact service availability during traffic spikes.
Consider:
- Gradually reducing max replicas to validate performance
- Adjusting CPU threshold to trigger scaling earlier (e.g., 70-80%)
- Implementing proper monitoring to track potential resource constraints
✅ Verification successful
Scaling configuration appears reasonable with existing resource settings
The scaling configuration changes are actually well-balanced considering the following findings:
- CPU limits (400m) and requests (50m) are conservatively set, providing good headroom
- The 90% CPU threshold is consistent across all environments
- The staging environment typically experiences lower traffic, making 3 max replicas sufficient
- Resource limits ensure each pod has adequate capacity to handle load spikes
The original concern about high CPU threshold (90%) with reduced replicas is mitigated by the conservative CPU requests/limits configuration, which provides an 8x buffer between request and limit per pod.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check for any existing HPA-related incidents or scaling events
kubectl get events -n license-api --field-selector reason=ScalingReplicaSet -A 2>/dev/null || echo "Cannot access cluster events"
# Verify current resource usage patterns
kubectl top pods -n license-api 2>/dev/null || echo "Cannot access pod metrics"
Length of output: 242
Script:
#!/bin/bash
# Let's check historical changes to HPA configurations and metrics settings
rg -A 5 "HorizontalPodAutoscaler|metrics:|cpu:|replicas:" charts/services/license-api/
# Check for any monitoring or alerting configurations
fd monitoring charts/services/license-api/
# Look for any documentation about scaling decisions
fd "README|SCALING" charts/services/license-api/
# Check for related configurations in other environments
fd "values.*yaml" charts/services/license-api/
Length of output: 3729
Script:
#!/bin/bash
# Let's check for any CPU utilization or metrics configuration
rg -A 3 "targetCPUUtilizationPercentage|metrics:" charts/services/license-api/
# Check for any comments or documentation about resource usage
rg -B 2 "replicas:|cpu:" --type yaml charts/services/license-api/ --with-filename
Length of output: 1886
77-79
: Verify deployment strategy for single replica
With default replicas reduced to 1, ensure the following (a rolling-update sketch follows the list):
- Deployment strategy is configured appropriately to prevent downtime during updates
- Pod disruption budget (maxUnavailable: 1) is still appropriate for reduced replica count
- Readiness/liveness probe timings are optimized for quick recovery
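A hedged sketch of a Deployment strategy that keeps the single replica serving during updates, using standard Kubernetes fields rather than the chart's own schema:
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 0   # never remove the only replica before its replacement is Ready
    maxSurge: 1         # bring the new pod up alongside the old one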
charts/services/judicial-system-scheduler/values.staging.yaml (2)
53-54: Review scaling configuration: Increased max replicas contradicts PR objective
The change increases the maximum replicas from 1 to 3, which appears to contradict the PR objective of allowing staging to scale down more efficiently. This could potentially lead to higher resource consumption and costs during scale-up events.
Let's analyze the historical scaling patterns:
Consider keeping the maximum replicas at 1 if the service doesn't require high availability in staging, especially since it appears to be a scheduled job runner (running at 2 AM daily) rather than a continuously active service.
64-66
: Verify resource utilization patterns before increasing replica count
Similar to the HPA configuration, increasing the maximum replica count from 1 to 3 seems to contradict the goal of optimizing costs. The service appears to be a scheduled job runner with modest resource requirements (200m CPU, 256Mi memory).
Let's analyze the resource utilization:
Consider the following recommendations (a CronJob sketch follows the list):
- For a scheduled job runner, consider using Kubernetes CronJobs instead of maintaining constantly running pods
- If keeping the current architecture, maintain max replicas at 1 for staging environment to optimize costs
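A hedged sketch of the CronJob alternative; the schedule follows the "2 AM daily" note above, while the image and resource numbers are placeholders:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: judicial-system-scheduler
spec:
  schedule: '0 2 * * *'       # run once a day at 02:00
  concurrencyPolicy: Forbid   # never overlap runs
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: scheduler
              image: judicial-system-scheduler:latest   # placeholder image reference
              resources:
                requests:
                  cpu: 200m
                  memory: 256Mi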
infra/src/dsl/hpa.spec.ts (1)
12-12: Verify the environment type change from 'staging' to 'prod'
Changing the environment type to 'prod' for a staging configuration seems inconsistent and could lead to unexpected behavior. This might affect how the environment is treated in terms of scaling, monitoring, and other production-specific configurations.
charts/services/services-sessions/values.staging.yaml (1)
53-54
: Verify service availability with reduced replica count
While reducing replicas aligns with cost-saving goals, running a single replica of a session service could impact availability. Consider:
- Service recovery time during pod failures
- Session continuity during deployments
✅ Verification successful
Reducing replicas in staging is acceptable with current configuration
The configuration analysis reveals several safeguards that make the replica reduction acceptable for the staging environment:
- Service has both liveness and readiness probes configured with quick response times (3s)
- Pod Disruption Budget is set to maxUnavailable: 1, preventing multiple simultaneous pod failures
- The service uses Redis cluster for session storage, meaning session data persists even if pods fail
- Resource usage is very light (25m CPU request, 300Mi memory)
- HPA is configured to scale up quickly based on CPU (90%) and request rate metrics
- Production environment maintains higher replica counts (1-10) for critical workloads
The reduction to 1-3 replicas in staging aligns with the environment's purpose while maintaining reasonable availability through the configured safeguards.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check historical CPU and memory usage to validate if 3 replicas are sufficient
kubectl get --namespace services-sessions metrics.k8s.io -o json | \
jq '.items[] | select(.metadata.name | startswith("services-sessions")) | .usage'
Length of output: 248
Script:
#!/bin/bash
# Let's check the configuration files to understand the service better
# Look for resource requests/limits and other availability-related configs
fd "values.*ya?ml" charts/services/services-sessions/ --exec cat {}
# Also check if there are any readiness/liveness probes configured
fd "templates" charts/services/services-sessions/ --exec cat {}/*.ya?ml
Length of output: 6857
charts/services/web/values.staging.yaml (1)
55-56
: Validate capacity planning for peak traffic
The maximum replicas reduction from 50 to 3 is substantial for an external-facing web service. Please verify:
- Peak traffic patterns
- Load testing results with 3 replicas
✅ Verification successful
Scaling configuration appears appropriate for staging environment
The current configuration shows a well-balanced setup for staging:
- CPU-based autoscaling at 90% threshold with 300m CPU requests
- 20s readiness probe delay for proper warm-up
- External ALB ingress with proper buffering configuration
- Adequate memory allocation (384Mi requests, 768Mi limits)
The reduction to 3 replicas is reasonable for staging as:
- The service has robust resource allocation (300m CPU, higher than typical staging services)
- The HPA is configured to scale on both CPU (90%) and nginx request rate metrics
- Pod disruption budget ensures availability during updates
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check historical HPA metrics to understand scaling patterns
kubectl get --namespace islandis hpa web -o json | \
jq '.status.currentMetrics[] | select(.resource.name == "cpu") | .resource.current'
Length of output: 234
Script:
#!/bin/bash
# Let's analyze the Helm values files to understand the service configuration and resource requests
fd values.yaml --exec cat {} \;
# Also check for any documentation about capacity planning
fd -e md -e txt | grep -i -E "capacity|scaling|load|performance" --files-with-matches
Length of output: 3668
Script:
#!/bin/bash
# Let's check the staging values file specifically and any related configuration
cat charts/services/web/values.staging.yaml
# Also check for any monitoring or metrics configuration
fd . -t f -e yaml -e yml | xargs grep -l -i "metrics\|monitoring\|prometheus"
Length of output: 2770
charts/services/skilavottord-web/values.staging.yaml (2)
45-46
: LGTM with monitoring recommendations
The scaling changes are consistent with other services and align with the cost optimization goals. The lower resource utilization (40m CPU request) suggests the service can handle the load with reduced replicas.
Also applies to: 65-67
45-46
: Ensure consistent monitoring before and after scale down
The scaling changes align with the overall strategy, but please implement:
- Baseline performance metrics collection
- Alert thresholds adjustment for single replica operation
- Monitoring for authentication-related issues
Also applies to: 65-67
charts/services/air-discount-scheme-api/values.staging.yaml (2)
49-50
: Consider service-specific scaling strategies
While the uniform reduction in replicas across services aligns with cost optimization goals, consider:
- Service-specific minimum replicas based on criticality (e.g., auth services)
- Standardizing request rate thresholds across similar services
- Adjusting resource requests when reducing replica counts
- Implementing gradual scaling changes with monitoring
This would help balance cost optimization with service reliability.
Let's analyze the impact across services:
Also applies to: 72-74
✅ Verification successful
Let me try a different approach to verify the scaling configurations across services by examining the Helm values files.
Scaling configurations are appropriately differentiated by environment and service type
Based on the analysis of the Helm values files:
- Auth-related services (ids-api, admin-api, etc.) maintain higher min replicas (2-3) in production
- Frontend services (portals, web) have higher max replicas (30-50) to handle traffic spikes
- Background workers (cleanup, indexer) have lower replica counts aligned with their workload
- Staging environment consistently uses min:1/max:3 for cost efficiency
The current scaling configuration in air-discount-scheme-api (min:1/max:3) aligns with the established staging environment patterns and service type.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Compare resource usage patterns across services
for ns in consultation-portal identity-server-admin air-discount-scheme; do
echo "=== $ns ==="
kubectl top pods -n $ns --sort-by=cpu
done
# Check scaling patterns
kubectl get hpa --all-namespaces -o json | jq '.items[] | {namespace: .metadata.namespace, name: .metadata.name, minReplicas: .spec.minReplicas, maxReplicas: .spec.maxReplicas, currentReplicas: .status.currentReplicas, currentCPU: .status.currentCPUUtilizationPercentage}'
Length of output: 1257
Script:
#!/bin/bash
# Find and examine scaling configurations in values files
echo "=== Searching for replica configurations ==="
rg -A2 "replicas:|replicaCount:|min:|max:" charts/ --type yaml
echo -e "\n=== Searching for resource configurations ==="
rg -A4 "resources:" charts/ --type yaml
Length of output: 66309
49-50
: Review resource allocation with reduced replicas
With the scaling changes:
- The low CPU request (50m) with high utilization threshold (90%) might cause delayed scaling
- Request rate threshold (20) differs from other services (5)
- Single replica with low resource allocation might struggle during traffic spikes
Consider:
- Increasing CPU request to better handle load with fewer replicas
- Aligning request rate threshold with other services
Let's analyze the current resource usage:
Also applies to: 72-74
✅ Verification successful
Higher request rate threshold is consistent with service group
The higher request rate threshold (20) is actually aligned with other air-discount-scheme services:
- air-discount-scheme-web: 20
- air-discount-scheme-backend: 20
- air-discount-scheme-api: 20
While most other services use a threshold of 5, this appears to be an intentional design choice for the air-discount-scheme service group. The reduced replica count in staging is reasonable since:
- The CPU request (50m) is sufficient for the staging environment's traffic patterns
- The autoscaling configuration is consistent with other services in the same group
- The high CPU threshold (90%) allows efficient resource utilization before scaling
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check current resource usage
kubectl top pods -n air-discount-scheme --sort-by=cpu | grep 'air-discount-scheme-api'
# Analyze HPA scaling events
kubectl get events -n air-discount-scheme --field-selector involvedObject.kind=HorizontalPodAutoscaler,involvedObject.name=air-discount-scheme-api -o json | jq '.items[] | select(.reason=="ScalingReplicaSet") | {message: .message, timestamp: .lastTimestamp}'
Length of output: 543
Script:
#!/bin/bash
# Let's analyze the resource configurations across environments
fd "values.*\.yaml" charts/services/air-discount-scheme-api/ -x cat {}
# Check if there are any other services with similar request rate thresholds
fd "values.*\.yaml" charts/services/ -x grep -l "requestsPerSecond" {} | xargs cat | grep -A 5 "requestsPerSecond"
# Look for any documentation about resource allocation
fd "README\.md" . -x grep -i -A 5 "resource|scaling|replica"
Length of output: 8055
Script:
#!/bin/bash
# Let's check other services' HPA configurations to compare request rate thresholds
rg "nginxRequestsIrate:" charts/services/ -A 1 -B 2
# Check if there are any documented incidents or discussions about scaling
rg -i "scaling|autoscal|replica" .github/ issues/ docs/
Length of output: 49653
charts/services/air-discount-scheme-web/values.staging.yaml (1)
48-49
: Verify service availability with reduced replicas
The reduction in minimum replicas from 2 to 1 could impact service availability during deployments. While this aligns with the cost optimization goal, please ensure:
- Zero-downtime deployments are still possible with these settings
- The service can handle traffic spikes when scaling up from 1 replica
- The readiness probe settings are properly tuned
Also applies to: 74-76
✅ Verification successful
Let me gather more information about the deployment configuration.
Let me try a different approach to locate and check the deployment files.
Reducing replicas to 1 is safe for staging environment
The configuration shows this is a well-configured service with:
- Proper health checks (readiness/liveness probes) with 3s timeout
- PodDisruptionBudget with maxUnavailable: 1
- Modest resource requirements (100m CPU, 128Mi memory)
- HPA configured to scale on both CPU (90%) and nginx request rate
- Non-critical staging environment (redirects to beta.staging01.devland.is)
The reduction to 1 minimum replica is acceptable because:
- Fast health check response times (3s) enable quick pod replacement
- HPA will scale up quickly based on CPU/request metrics
- Staging is not a production environment
- Service has proper readiness probes to ensure availability during scaling
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check deployment strategy and probe settings
rg -A 5 "strategy:|readinessProbe:" charts/services/air-discount-scheme-web/
# Check if there are any dependencies on this service
rg "air-discount-scheme-web" --type yaml
Length of output: 1411
Script:
#!/bin/bash
# Check deployment configuration in the chart templates
cat charts/services/air-discount-scheme-web/templates/deployment.yaml
# Check if there are any ingress configurations that might affect availability
cat charts/services/air-discount-scheme-web/templates/ingress.yaml
# Check if there's a PDB configuration
cat charts/services/air-discount-scheme-web/templates/pdb.yaml
Length of output: 489
Script:
#!/bin/bash
# Find all yaml files in the service directory
fd . charts/services/air-discount-scheme-web/ -e yaml -x cat {}
# Also check if templates are in a different location
fd deployment.yaml charts/
Length of output: 8027
charts/services/services-bff-portals-admin/values.staging.yaml (1)
61-62
: Ensure proper Redis connection handling with scale changes
The reduction in replicas is particularly suitable for this admin BFF service. However, with Redis cluster configuration present, please verify:
- Redis connection pooling settings
- Connection cleanup during scale-down
- Session handling across replicas
Also applies to: 84-86
✅ Verification successful
Let me search for Redis-related configuration in the application code to better understand the connection handling.
Let me search for Redis-related configuration in the application code, particularly focusing on session and connection handling.
Based on the code analysis, I can now provide a final response regarding Redis connection handling with the scale changes:
Redis connection handling is properly configured for scaling
The BFF service uses a robust Redis connection setup that safely handles scaling:
- Uses Redis cluster configuration with proper connection pooling via cache-manager-ioredis-yet
- Implements graceful connection management through the createRedisCluster utility
- Has proper session cleanup during scale-down through cache TTL and explicit deletion
- Uses namespaced keys to prevent conflicts across replicas
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check for Redis connection configuration
rg "redis|connection|pool" charts/services/services-bff-portals-admin/
Length of output: 563
Script:
#!/bin/bash
# Search for Redis-related code in the repository
ast-grep --pattern 'new Redis($$$)'
# Search for Redis configuration patterns
rg -A 5 "createClient|RedisClient|RedisOptions"
# Search for session configuration
rg -A 3 "session|Session" charts/services/services-bff-portals-admin/
Length of output: 66843
Script:
#!/bin/bash
# Search for Redis configuration in the application code
rg -A 5 "redis|Redis" apps/services/bff/src/
rg -A 5 "redis|Redis" libs/services/bff/
# Search for session configuration in BFF code
rg -A 5 "session|Session" apps/services/bff/src/
Length of output: 31464
charts/services/services-auth-personal-representative/values.staging.yaml (2)
68-69
: Verify minimum replica count impact on service availability
While reducing the minimum replicas to 1 aligns with the cost-saving objective, please ensure this won't impact the service's availability during pod updates or failures.
Consider implementing readiness probes with appropriate timing to ensure smooth pod transitions during updates.
✅ Verification successful
Minimum replica count of 1 is safe for this service
Based on the codebase analysis, reducing the minimum replicas to 1 is acceptable because:
- The service is part of the identity server infrastructure but operates as an internal service (accessed via internal ALB)
- It has a podDisruptionBudget with maxUnavailable: 1 to ensure controlled pod updates
- The service appears to be non-critical for real-time authentication flows, mainly handling personal representative data management
- No direct dependencies were found that require high availability from this service
- The staging environment is appropriate for testing lower replica counts before production
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check if there are any dependencies on this service that might require higher availability
rg -l "personal-representative" | xargs rg "personal-representative.*url|personal-representative.*endpoint"
Length of output: 107
Script:
#!/bin/bash
# Let's try a broader search for service dependencies and configurations
rg -i "personal-representative" -A 3 -B 3
# Also check for any Kubernetes service definitions
fd -e yaml -e yml | xargs rg "personal-representative" -A 5 -B 5
# Check for any environment configurations that might reference this service
fd -e env -e conf | xargs rg -i "personal.*representative"
Length of output: 66372
86-88
: Confirm peak load requirements
The reduction in maximum replicas from 10 to 3 could impact service performance during high-load periods.
charts/services/air-discount-scheme-backend/values.staging.yaml (1)
60-61
: Review scaling impact on public API endpoints
This service exposes public API endpoints (/api/public) and handles database migrations. While reducing replicas aligns with cost objectives, consider the following:
- Public endpoint availability during scale-down events
- Database migration handling with minimum replicas
Consider implementing the following (an illustrative ingress annotation sketch is included below):
- Rate limiting on public endpoints to prevent resource exhaustion
- Graceful shutdown handling to maintain service availability during scaling events
Also applies to: 107-109
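If rate limiting is added later, a hedged sketch using standard ingress-nginx annotations could look like the following; the ingress key layout is illustrative and assumes ingress-nginx sits in the request path:
ingress:
  primary-alb:
    annotations:
      nginx.ingress.kubernetes.io/limit-rps: '10'              # per-client requests per second
      nginx.ingress.kubernetes.io/limit-burst-multiplier: '3'  # allow short bursts above the limit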
✅ Verification successful
Scaling configuration is safe with existing health checks and migration handling
The service has proper health check configurations in place with both liveness and readiness probes configured at '/liveness' endpoint. Database migrations are handled through init containers which run before the main application starts, ensuring database consistency regardless of replica count.
While no explicit rate limiting is configured in the Helm charts, the scaling configuration (min: 1, max: 3) is reasonable because:
- Health checks ensure pods are ready before receiving traffic
- Init containers guarantee migrations complete before application starts
- Multiple environment configurations show this is a tested setup
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check if there are any direct dependencies on the public endpoints
rg -l "/api/public" | xargs rg "loftbru.*staging01.*devland\.is/api/public"
Length of output: 75
Script:
#!/bin/bash
# Let's verify the public API endpoints and their configuration
rg -l "/api/public"
# Check for rate limiting configurations
rg -i "rate.*limit" charts/services/air-discount-scheme-backend/
# Check for graceful shutdown configurations
rg -i "graceful.*shutdown|termination" charts/services/air-discount-scheme-backend/
# Look for any health check configurations that might affect availability
rg -i "health|readiness|liveness" charts/services/air-discount-scheme-backend/
# Check for database migration configurations
rg -i "migration|flyway|init.*container" charts/services/air-discount-scheme-backend/
Length of output: 2994
charts/services/services-auth-delegation-api/values.staging.yaml (1)
74-75
: Verify high availability requirements for auth delegation service
While reducing min replicas to 1 aligns with cost-saving goals, consider the following:
- During deployment updates, there will be moments with zero available replicas
- Service disruption might affect dependent authentication flows
✅ Verification successful
Based on the search results, I can see that the auth-delegation-api service is a critical component used by multiple services including user-notification, user-profile, and the main API gateway. Let me verify the service's configuration and deployment strategy to make a final assessment.
Reducing min replicas to 1 is acceptable with current configuration
The service has proper health checks configured with both readiness and liveness probes, which helps manage traffic during deployments:
- Readiness probe (/health/check) ensures traffic is only routed to healthy pods
- Liveness probe (/liveness) ensures unhealthy pods are restarted
- Both probes have short 3-second initial delays and timeouts
While multiple services depend on auth-delegation-api, the configuration provides sufficient safeguards:
- Readiness probes prevent traffic to pods until they're fully ready
- Short probe timeouts (3s) ensure quick detection of unhealthy states
- HPA can quickly scale up based on CPU (90%) or nginx request rate
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check for dependencies on this service
rg -l "auth-delegation-api" | grep -v "charts/" | while read -r file; do
echo "=== $file ==="
rg "auth-delegation-api" "$file" -A 5 -B 5
done
Length of output: 9563
Script:
#!/bin/bash
# Check deployment strategy and other critical configurations
cat charts/services/services-auth-delegation-api/values.staging.yaml | grep -A 20 "deployment:"
# Check if there are any readiness/liveness probes configured
cat charts/services/services-auth-delegation-api/values.staging.yaml | grep -A 10 "readiness\|liveness"
Length of output: 487
charts/services/services-university-gateway/values.staging.yaml (2)
69-70
: Scaling aligns with university service patterns
The reduced replica count is particularly suitable for this service due to:
- Predictable usage patterns in staging
- Non-critical nature of staging environment
- Presence of proper health checks
125-127
: Verify resource utilization history
Before implementing in production, recommend:
✅ Verification successful
Let me analyze the HPA configuration in the values file to gather more context.
HPA configuration appears reasonable for staging environment
The HPA configuration with min: 1, max: 3, and default: 1 replicas is appropriate for a staging environment. The template shows:
- Gradual scale-down policy (33% every 30 seconds)
- CPU-based autoscaling
- Optional nginx requests rate-based scaling
- Standard K8s autoscaling/v2 API
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check recent HPA events and scaling patterns
kubectl get events -n services-university-gateway | grep -i "horizontalpodautoscaler" || \
echo "Note: This is a static analysis environment. Please run this command in your cluster to verify scaling patterns."
Length of output: 418
Script:
#!/bin/bash
# Check HPA configuration in values files
fd values.yaml --exec cat {} \; | grep -A 5 -B 5 "horizontalPodAutoscaler\|autoscaling"
# Check for any existing HPA manifests
fd -e yaml -e yml | xargs grep -l "HorizontalPodAutoscaler\|autoscaling/v"
Length of output: 217
Script:
#!/bin/bash
# Check the HPA template content
cat infra/helm/libs/api-template/templates/hpa.yaml
# Check if there are any other HPA configurations in the service's directory
cd charts/services/services-university-gateway && find . -type f -name "*.yaml" -exec grep -l "horizontalPodAutoscaler\|autoscaling" {} \;
Length of output: 1517
charts/services/user-notification-worker/values.staging.yaml (1)
81-82
: Worker scaling configuration requires additional verification
While the scaling changes align with the overall strategy, being a worker service that processes notifications, we should verify that a single replica can handle the queue during low-traffic periods without causing message processing delays.
Let's verify the queue processing metrics:
Also applies to: 116-116
✅ Verification successful
Worker scaling configuration appears appropriate
The configuration analysis shows that the scaling setup is well-designed for this worker service:
- CPU-based autoscaling threshold is set to 90%, which is appropriate for a background worker
- Resource limits (400m CPU, 384Mi memory) are properly configured to prevent resource exhaustion
- The service has proper health checks and database migration handling
- The worker has appropriate IAM roles and security context configurations
- The min replica count of 1 is sufficient as the service has a pod disruption budget configured
The scaling configuration is properly balanced for both cost efficiency during low-traffic periods and scalability during high demand.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check queue metrics to validate minimum replica count
# Note: This helps ensure one replica can handle the workload during quiet periods
# Check if there are any dead letter queue messages
aws sqs get-queue-attributes \
--queue-url "https://sqs.eu-west-1.amazonaws.com/user-notification-failure" \
--attribute-names ApproximateNumberOfMessages
# Check main queue metrics
aws sqs get-queue-attributes \
--queue-url "https://sqs.eu-west-1.amazonaws.com/user-notification" \
--attribute-names ApproximateNumberOfMessages ApproximateAgeOfOldestMessage
Length of output: 512
Script:
#!/bin/bash
# Let's analyze the worker configuration and deployment patterns
# Check for HorizontalPodAutoscaler configurations
fd -e yaml -e yml | xargs grep -l "HorizontalPodAutoscaler\|kind: HPA" || true
# Look for any queue or worker-related configurations
fd -e yaml -e yml | xargs grep -l "queue\|worker\|replicas\|scaling" || true
# Check for any monitoring or metrics configurations
fd -e yaml -e yml | xargs grep -l "metrics\|monitoring\|prometheus" || true
# Examine the worker's deployment configuration
cat charts/services/user-notification-worker/values.staging.yaml
Length of output: 15030
charts/services/service-portal-api/values.staging.yaml (1)
70-71
: Significant scaling reduction requires careful monitoring
The reduction from 30 to 3 max replicas is substantial but likely reasonable for a staging environment. However, reducing minimum replicas to 1 for an API service requires careful consideration:
- During deployment rollouts, there might be brief periods with zero availability
- No redundancy during low-traffic periods
- Cold starts might impact user experience when scaling up
Let's verify the service's historical scaling patterns:
Also applies to: 113-115
✅ Verification successful
Scaling configuration is appropriate for staging environment
The analysis of the configuration files reveals that the scaling changes are well-considered:
- The service maintains proper high availability measures:
  - Has podDisruptionBudget with maxUnavailable: 1
  - Configured with health checks and readiness probes
  - Maintains substantial resource limits (cpu: 800m, memory: 1024Mi)
- The staging environment has different scaling needs:
  - Dev and prod environments have identical scaling (max: 30, min: 2)
  - Staging serves as a pre-production testing environment with lower traffic
  - Single replica during low traffic is acceptable for staging
  - Max of 3 replicas aligns with staging's purpose
- The service is well-integrated:
  - Multiple other services depend on it via internal service mesh
  - Uses cluster-local DNS for service discovery
  - Has proper database configurations with replicas
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Analyze historical scaling patterns
# Note: This helps validate if the new limits are sufficient
# Check if the service ever scaled beyond 3 replicas
kubectl get hpa -n service-portal service-portal-api -o yaml
# Check current response times and request rates
kubectl get pods -n service-portal -l app=service-portal-api \
-o custom-columns=NAME:.metadata.name,CPU:.spec.containers[0].resources.requests.cpu,MEMORY:.spec.containers[0].resources.requests.memory
# Verify recent pod events for any scaling-related issues
kubectl get events -n service-portal --sort-by='.lastTimestamp' | grep -E "scale|replica"
Length of output: 650
Script:
#!/bin/bash
# Let's analyze the configuration files to understand the service better
# Find and show all values files for this service
fd "values.*ya?ml" charts/services/service-portal-api/
# Check resource configurations across environments
for f in $(fd "values.*ya?ml" charts/services/service-portal-api/); do
echo "=== $f ==="
cat "$f" | grep -A 20 "resources\|replicas\|autoscaling"
done
# Look for any deployment configurations
fd "deployment.*ya?ml" charts/services/service-portal-api/
# Check if there are any related services or dependencies
rg -A 5 "service-portal-api" charts/
Length of output: 51670
charts/services/services-auth-public-api/values.staging.yaml (1)
80-81
: Verify high availability requirements for authentication service
While reducing replicas aligns with cost optimization goals, consider these critical points for an authentication service:
- Single replica during low traffic might impact high availability
- The CPU threshold of 90% (line 83) might be too high for smooth autoscaling
- The 3-second health check timeout might need adjustment for slower cold starts with fewer replicas
Consider the following (sketched as staging values below):
- Setting min replicas to 2 for high availability
- Adjusting CPU threshold to 70-80% for smoother scaling
- Implementing pod disruption budget to ensure availability during updates
Also applies to: 102-104
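Sketched under the hpa/podDisruptionBudget layout assumed from the sibling auth charts; the numbers mirror the suggestions above rather than measured data:
hpa:
  scaling:
    replicas:
      min: 2
      max: 10
      default: 2
    metric:
      cpuAverageUtilization: 75   # scale earlier than the current 90%
podDisruptionBudget:
  maxUnavailable: 1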
charts/services/judicial-system-backend/values.staging.yaml (1)
71-72
: Scaling configuration looks appropriate for backend service
The scaling configuration aligns well with the service characteristics:
- Database migrations are handled by initContainer, unaffected by replica count
- Memory limits (1024Mi) provide adequate headroom for the service
- Non-critical backend service can operate with reduced availability
Also applies to: 105-107
✅ Verification successful
Scaling configuration is appropriate for staging environment
The configuration is well-designed for the staging environment:
- Database migrations are handled by a dedicated initContainer that runs before the main container, ensuring safe schema updates regardless of replica count
- Memory allocation (1024Mi limit) is sufficient based on the service requirements and staging workload
- Reduced replica count (min: 1, max: 3) is suitable for staging where high availability is less critical
- Migration container has its own resource limits (256Mi) to prevent resource contention during updates
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Verify memory usage patterns to validate limits
kubectl get --namespace=judicial-system \
--selector="app.kubernetes.io/name=judicial-system-backend" \
--output=json pods | \
jq -r '.items[] | .metadata.name' | \
xargs -I {} kubectl get --namespace=judicial-system \
--output=json metrics pods/{} | \
jq '.containers[].usage.memory'
Length of output: 374
Script:
#!/bin/bash
# Let's verify the configuration context and database migration setup
fd "values.*yaml" charts/services/judicial-system-backend/ --type f --exec cat {}
# Also check for migration-related configurations
rg -A 5 "initContainer|migration" charts/services/judicial-system-backend/
Length of output: 21384
charts/services/application-system-api/values.staging.yaml (1)
159-160
: Verify scaling adequacy for service with multiple integrations
The service integrates with numerous external systems (XROAD services) and has higher resource allocations:
- CPU limit of 600m and memory of 1024Mi suggest complex processing
- Multiple external dependencies might require more replicas for request distribution
- Consider impact on external service connections with fewer replicas
Consider:
- Implementing connection pooling for external services
- Adding circuit breakers for external dependencies
- Monitoring external service response times with reduced replicas
Also applies to: 209-211
✅ Verification successful
Let me analyze the service dependencies and connection patterns to verify the scaling configuration concerns.
Based on the codebase analysis, I can now provide a definitive response regarding the scaling configuration concerns:
Scaling configuration is appropriate for the service's integration patterns
The codebase analysis reveals that while the service integrates with X-Road services, the integration pattern is primarily client-based with well-defined connection configurations:
- The service uses standard X-Road client configurations with base paths and client IDs
- No evidence of connection pooling or persistent connection management that would require more replicas
- The service follows a request-response pattern rather than maintaining long-lived connections
The current scaling configuration (min: 1, max: 3) aligns with similar services in the infrastructure that use X-Road integrations, such as the xroad-collector service which uses the same scaling pattern.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check connection patterns to external services
kubectl get --namespace=application-system \
--selector="app.kubernetes.io/name=application-system-api" \
--output=json pods | \
jq -r '.items[] | .metadata.name' | \
xargs -I {} kubectl logs {} | \
grep -i "connection" | \
grep -i "xroad"
Length of output: 325
Script:
#!/bin/bash
# Search for XROAD service configurations and dependencies
rg -i "xroad" -A 5 -B 5
# Search for external service configurations
fd -e yaml -e yml | xargs cat | grep -i "url:\|endpoint:\|connection:"
# Look for connection/client configurations in code
ast-grep --pattern 'new $_Client($$$)'
# Search for environment variables related to external services
rg -i "ENV.*(_URL|_ENDPOINT|_CONNECTION)"
Length of output: 88132
charts/judicial-system/values.staging.yaml (2)
56-57
: Verify the impact of reduced replicas on judicial-system-api availability.
The reduction in minimum replicas to 1 could impact service availability during deployments or node failures. Consider:
- Implementing pod disruption budgets if not already in place
- Setting up proper readiness/liveness probes
- Monitoring service availability metrics after this change
Also applies to: 81-83
✅ Verification successful
Let me check the readiness probe configurations to ensure they are properly set up for handling pod availability during deployments.
Reduced replicas are adequately protected with existing configurations
The service has proper safeguards in place to handle the reduced replica count:
- Pod Disruption Budget (PDB) is configured with maxUnavailable: 1
- Both liveness and readiness probes are properly configured
- Health check paths are defined with appropriate timeouts
- Most services use consistent probe configurations (3s delay, 3s timeout)
The reduction to min: 1 replica is acceptable in staging environment given these protective measures. The service availability during deployments and node failures is adequately managed through the combination of PDB and health probes.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check if PDB and health probes are configured
rg -A 5 "podDisruptionBudget|healthCheck" charts/judicial-system/values.staging.yaml
Length of output: 1740
Script:
#!/bin/bash
# Check readiness probe configurations
rg -A 5 "readiness:" charts/judicial-system/values.staging.yaml
Length of output: 859
158-159
: Verify backend service capacity with new scaling limits.
The backend service has increased memory limits (1024Mi) but reduced replica count. Ensure:
- The new memory limit compensates for fewer replicas
- Monitor response times and error rates after deployment
Also applies to: 192-194
✅ Verification successful
Memory and replica configuration appears appropriate for the workload
The backend service configuration shows:
- Memory has been increased to 1024Mi limit/512Mi request (sufficient for Node.js with max-old-space-size=921)
- CPU resources are adequately provisioned (400m limit/100m request)
- Horizontal scaling is configured with min=1, max=3 replicas
- HPA triggers at 90% CPU utilization or 5 requests/second
- Pod disruption budget ensures high availability
The scaling configuration provides good balance between resource efficiency and performance:
- Minimum of 1 replica ensures service availability
- Can scale up to 3 replicas during high load
- Higher memory per pod allows better request handling
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check resource configurations and related settings
rg "resources:|limits:|requests:" charts/judicial-system/values.staging.yaml
Length of output: 408
Script:
#!/bin/bash
# Get a more detailed view of the resource configurations including memory and replica settings
cat charts/judicial-system/values.staging.yaml
Length of output: 18160
charts/services/api/values.staging.yaml (1)
192-193
: Carefully monitor API service performance after significant scaling reduction.
The maximum replicas reduction from 50 to 3 is substantial for a service with:
- High resource limits (1200m CPU, 3200Mi memory)
- Multiple external integrations
- Critical API endpoints
Recommendations:
- Implement gradual reduction (e.g., 50 → 20 → 10 → 3)
- Set up detailed monitoring for:
- Response times
- Error rates
- Resource utilization
- Have a rollback plan ready
Also applies to: 212-214
charts/islandis/values.staging.yaml (1)
1026-1027
: Review impact on user-facing services with reduced scaling.
The changes apply the same scaling reduction (max=3, min=1) to user-facing services. Consider:
- Impact on response times during traffic spikes
- Load balancer behavior with fewer endpoints
- Cascading effects on dependent services
Recommendations:
- Implement gradual rollout of scaling changes
- Set up detailed monitoring for user-facing services
- Consider different scaling policies for critical user-facing components
- Ensure proper circuit breaking and fallback mechanisms
Also applies to: 1046-1048, 1474-1475, 1492-1494, 1579-1580, 1599-1601, 2083-2084, 2106-2108, 2169-2170, 2192-2194, 2758-2759, 2778-2780, 3254-3255, 3274-3276
✅ Verification successful
Scaling configuration is appropriate for the staging environment
The analysis of the values file reveals:
- All services consistently use CPU utilization (90%) and nginx request rate for autoscaling
- User-facing services (web portals, APIs) have higher nginx request thresholds (8-20 requests/sec) compared to backend services (5 requests/sec)
- The max=3, min=1 configuration provides sufficient capacity for the staging environment while maintaining cost efficiency
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Analyze traffic patterns for user-facing services
# Check ingress traffic patterns to validate scaling requirements
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/islandis/pods" | jq .
Length of output: 178
Script:
#!/bin/bash
# Let's analyze the values file to understand the services and their scaling patterns
# First, get an overview of all scaling configurations
rg "replicas:|min:|max:" charts/islandis/values.staging.yaml
# Then check for any comments or documentation about scaling requirements
rg -B2 -A2 "replicas:|min:|max:" charts/islandis/values.staging.yaml
# Look for ingress/service configurations to identify user-facing services
rg "ingress:|service:" charts/islandis/values.staging.yaml
Length of output: 12424
Actionable comments posted: 8
🧹 Outside diff range and nitpick comments (8)
charts/services/judicial-system-scheduler/values.staging.yaml (1)
65-67: Consider implementing schedule-based scaling for better resource optimization. Since this is a scheduler service running at 2 AM daily (cron: 0 2 * * *), you might want to consider implementing a Kubernetes CronJob for scaling. This would allow you to:
- Scale down to 0 replicas during known inactive periods
- Scale up just before the scheduled job
- Scale back down after completion
This could provide additional cost savings while maintaining service reliability.
Would you like me to provide an example configuration for schedule-based scaling?
Also applies to: 71-76
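For illustration, a minimal sketch of what such schedule-based scaling could look like: two CronJobs that scale the scheduler deployment down outside its 02:00 window and back up just before it. The names, image tag, service account, and schedule offsets are assumptions, and the service account would need RBAC on deployments/scale. Note that the HPA does not scale a deployment back up from zero, so the scale-up job is what resumes normal operation.

```yaml
# Hypothetical manifests, not part of this PR; all names and schedules are placeholders.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: judicial-system-scheduler-scale-up
spec:
  schedule: '50 1 * * *' # shortly before the 02:00 job
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: scheduler-scaler # assumed SA with RBAC on deployments/scale
          restartPolicy: OnFailure
          containers:
            - name: scale
              image: bitnami/kubectl:1.29 # any image bundling kubectl works
              args: ['scale', 'deployment/judicial-system-scheduler', '--replicas=1']
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: judicial-system-scheduler-scale-down
spec:
  schedule: '0 3 * * *' # after the nightly job has had time to finish
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: scheduler-scaler
          restartPolicy: OnFailure
          containers:
            - name: scale
              image: bitnami/kubectl:1.29
              args: ['scale', 'deployment/judicial-system-scheduler', '--replicas=0']
```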
charts/services/web/values.staging.yaml (1)
55-56: LGTM! Consider adding monitoring alerts. The reduction from 50 to 3 max replicas is a reasonable optimization for staging. Recommendations:
- Add monitoring alerts for when the service reaches max replicas
- Monitor response times during European business hours
- Track 95th percentile latency to ensure user experience isn't impacted
Consider implementing horizontal pod autoscaling based on custom metrics like response time or request queue length, in addition to CPU utilization.
Also applies to: 75-77
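If custom-metric scaling is explored, a rough autoscaling/v2 shape could look like the following. It assumes a metrics adapter (for example prometheus-adapter) already exposes an http_requests_per_second pods metric, and the target values are illustrative rather than tuned.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 1
  maxReplicas: 3
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75 # illustrative; lower than the current 90%
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second # assumes a metrics adapter exposes this
        target:
          type: AverageValue
          averageValue: '8'
```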
charts/services/skilavottord-web/values.staging.yaml (1)
Line range hint 53-54: Consider implementing graceful degradation strategies. Since all services are being configured to potentially run with a single replica during low-traffic periods, consider implementing the following architectural patterns:
- Circuit breakers for inter-service communication
- Fallback mechanisms for temporary service unavailability
- Retry mechanisms with exponential backoff
- Cache-aside pattern for frequently accessed data
This will help maintain system stability when services are running at minimum capacity.
Also applies to: 55-56, 45-46
charts/services/license-api/values.staging.yaml (1)
59-60: Verify impact on license service operations. While the scaling reduction (min: 2→1, max: 10→3) aligns with the cost optimization goals, this service handles critical license-related operations. The changes appear safe due to:
- Internal service (not directly user-facing)
- Conservative CPU target utilization (90%)
- Proper health checks and PodDisruptionBudget
Consider implementing the following to ensure smooth operation with reduced replicas:
- Set up alerts for when CPU utilization approaches 90%
- Monitor license operation response times
- Have a rollback plan ready if performance degrades
Also applies to: 77-79
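As one way to cover the first recommendation, a hypothetical PrometheusRule could alert when the HPA has been pinned at its ceiling. This assumes the Prometheus Operator and kube-state-metrics are deployed in the cluster; the HPA name, duration, and severity are placeholders.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: license-api-scaling-alerts
spec:
  groups:
    - name: license-api.scaling
      rules:
        - alert: LicenseApiAtMaxReplicas
          # Fires when the HPA has held the maximum replica count for 15 minutes.
          expr: |
            kube_horizontalpodautoscaler_status_current_replicas{horizontalpodautoscaler="license-api"}
              >= kube_horizontalpodautoscaler_spec_max_replicas{horizontalpodautoscaler="license-api"}
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: license-api has been at its maximum replica count for 15 minutes
```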
charts/services/services-auth-admin-api/values.staging.yaml (1)
73-74: Scaling configuration matches the standardized pattern. The changes align with the broader initiative to optimize resource usage in staging. The CPU-based autoscaling threshold of 90% provides adequate buffer for scale-up events.
Consider implementing the following monitoring practices:
- Set up alerts for sustained high CPU usage
- Monitor scale-up latency during peak traffic periods
- Track service availability metrics with reduced minimum replicas
Also applies to: 92-94
charts/services/judicial-system-backend/values.staging.yaml (1)
71-72: Evaluate service criticality before reducing replicas. Given the judicial system context and multiple external service dependencies (Dokobit, Microsoft Graph API), consider if single-replica operation might impact service reliability.
Consider maintaining min=2 replicas if this service requires high availability even in staging environment.
infra/src/dsl/output-generators/map-to-helm-values.ts (1)
112-130: LGTM! Consider extracting staging replica configuration to constants. The implementation correctly handles the staging environment's scaling requirements. However, to improve maintainability, consider extracting the staging replica values into named constants at the module level.
+const STAGING_REPLICA_CONFIG = {
+  min: 1,
+  max: 3,
+  default: 1,
+} as const;
 if (env1.type == 'staging') {
-  result.replicaCount = {
-    min: 1,
-    max: 3,
-    default: 1,
-  }
+  result.replicaCount = { ...STAGING_REPLICA_CONFIG }
 } else {
charts/services/api/values.staging.yaml (1)
192-193: Consider a gradual reduction in max replicas for the API service. The change reduces max replicas from 50 to 3, which is a significant change for a core API service. While this aligns with cost optimization goals, consider:
- The API service handles multiple critical operations
- The high resource limits (1200m CPU, 3200Mi memory) suggest intensive workloads
- The service has numerous external dependencies and integrations
Recommendations:
- Consider a phased approach:
- Phase 1: Reduce max replicas to 10
- Phase 2: Monitor and reduce to 5
- Phase 3: Finally reduce to 3 if metrics support it
- Implement rate limiting if not already in place
- Set up detailed monitoring for:
- Response times
- Error rates
- Resource utilization patterns
- Document the baseline performance metrics before the change
Also applies to: 212-214
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (34)
- charts/identity-server/values.staging.yaml (14 hunks)
- charts/islandis/values.staging.yaml (36 hunks)
- charts/judicial-system/values.staging.yaml (6 hunks)
- charts/services/air-discount-scheme-api/values.staging.yaml (2 hunks)
- charts/services/air-discount-scheme-backend/values.staging.yaml (2 hunks)
- charts/services/air-discount-scheme-web/values.staging.yaml (2 hunks)
- charts/services/api/values.staging.yaml (2 hunks)
- charts/services/application-system-api/values.staging.yaml (2 hunks)
- charts/services/auth-admin-web/values.staging.yaml (2 hunks)
- charts/services/consultation-portal/values.staging.yaml (2 hunks)
- charts/services/judicial-system-api/values.staging.yaml (2 hunks)
- charts/services/judicial-system-backend/values.staging.yaml (2 hunks)
- charts/services/judicial-system-scheduler/values.staging.yaml (2 hunks)
- charts/services/license-api/values.staging.yaml (2 hunks)
- charts/services/portals-admin/values.staging.yaml (2 hunks)
- charts/services/search-indexer-service/values.staging.yaml (2 hunks)
- charts/services/service-portal-api/values.staging.yaml (2 hunks)
- charts/services/service-portal/values.staging.yaml (2 hunks)
- charts/services/services-auth-admin-api/values.staging.yaml (2 hunks)
- charts/services/services-auth-delegation-api/values.staging.yaml (2 hunks)
- charts/services/services-auth-ids-api/values.staging.yaml (2 hunks)
- charts/services/services-auth-personal-representative-public/values.staging.yaml (2 hunks)
- charts/services/services-auth-personal-representative/values.staging.yaml (2 hunks)
- charts/services/services-auth-public-api/values.staging.yaml (2 hunks)
- charts/services/services-bff-portals-admin/values.staging.yaml (2 hunks)
- charts/services/services-bff-portals-my-pages/values.staging.yaml (2 hunks)
- charts/services/services-sessions/values.staging.yaml (2 hunks)
- charts/services/services-university-gateway/values.staging.yaml (2 hunks)
- charts/services/skilavottord-web/values.staging.yaml (2 hunks)
- charts/services/user-notification-worker/values.staging.yaml (2 hunks)
- charts/services/web/values.staging.yaml (2 hunks)
- infra/src/dsl/basic.spec.ts (1 hunks)
- infra/src/dsl/hpa.spec.ts (1 hunks)
- infra/src/dsl/output-generators/map-to-helm-values.ts (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
infra/src/dsl/hpa.spec.ts (1)
Pattern infra/src/dsl/**/*
: "Confirm that the code adheres to the following:
- The clarity and expressiveness of the DSL syntax.
- Integration with Helm charts and Kubernetes resources.
- Documentation on how to use the DSL to create complex Helm values."
infra/src/dsl/output-generators/map-to-helm-values.ts (1)
Pattern infra/src/dsl/**/*
: "Confirm that the code adheres to the following:
- The clarity and expressiveness of the DSL syntax.
- Integration with Helm charts and Kubernetes resources.
- Documentation on how to use the DSL to create complex Helm values."
infra/src/dsl/basic.spec.ts (1)
Pattern infra/src/dsl/**/*
: "Confirm that the code adheres to the following:
- The clarity and expressiveness of the DSL syntax.
- Integration with Helm charts and Kubernetes resources.
- Documentation on how to use the DSL to create complex Helm values."
🔇 Additional comments (30)
charts/services/services-auth-personal-representative-public/values.staging.yaml (2)
47-48
: LGTM! Verify scaling metrics configuration.
The reduction in HPA replicas (max: 3, min: 1) aligns well with the goal of cost optimization during low-usage periods while maintaining service availability. The configuration appears safe given that:
- CPU utilization target is set to 90%
- nginx requests rate is monitored (irate: 5)
Let's verify similar scaling configurations across other services:
✅ Verification successful
HPA settings are consistent across all staging services
The verification shows that all staging services in the cluster have been standardized with the same HPA configuration (max: 3, min: 1), confirming that this scaling approach is being applied uniformly across the platform. This consistency validates that the change aligns with the broader infrastructure strategy for cost optimization in staging environments.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check HPA settings consistency across services
# Expected: Similar scaling configurations in other staging environments
fd -e yaml -g '*values.staging.yaml' -x grep -l "hpa:" {} \; | while read -r file; do
echo "=== $file ==="
awk '/hpa:/,/scaling:/{p=1} p&&/replicas:/{p=2} p==2&&/max|min/{print;if($1=="min:"){p=0}}' "$file"
done
Length of output: 19538
65-67
: LGTM! Consider reviewing PodDisruptionBudget.
The replicaCount settings are consistent with HPA configuration, supporting the scale-down objective. However, with minimum replicas set to 1, consider reviewing if the current PodDisruptionBudget (maxUnavailable: 1) is still appropriate for your availability requirements.
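If stricter guarantees are wanted while running a single replica, one option is to block voluntary evictions of the last pod. This assumes the chart accepts a minAvailable field on its podDisruptionBudget block, which is not confirmed here, and it trades easier node drains for availability, so it may not be desirable in staging.

```yaml
# Hypothetical values override; with one replica, maxUnavailable: 1 allows the only
# pod to be evicted, while minAvailable: 1 blocks voluntary disruptions instead.
podDisruptionBudget:
  minAvailable: 1
```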
Let's check PDB settings across services:
charts/services/services-auth-public-api/values.staging.yaml (2)
80-81
: Verify monitoring and alerts for reduced replica count.
The reduction in minimum replicas from 2 to 1 aligns with the cost-saving objectives for staging. However, since this is an authentication service, ensure:
- Monitoring is in place to track response times and error rates
- Alerts are configured for rapid scale-up when needed
- On-call team is aware of the changes
The CPU utilization threshold of 90% for scaling is quite high and might lead to delayed scaling. Consider lowering it to 70-75% for more proactive scaling.
102-104
: Consider potential impact on service availability.
While reducing replicas will save costs, the authentication service is often critical for other services. Recommendations:
- Implement gradual rollout:
  - Start with reducing max replicas from 10 to 3
  - Monitor for a week
  - Then reduce min replicas from 2 to 1
- Document peak usage patterns to validate that 3 max replicas is sufficient
✅ Verification successful
The proposed replica configuration appears safe to implement
The current HPA configuration shows:
- CPU utilization target of 90%
- Request rate target of 5 requests/second
- The service already operates with min=1 replica in staging
- Scaling metrics are properly configured to handle load increases
The change is reasonable because:
- The min=1 setting is already in use, indicating it's sufficient for baseline load
- The max=3 setting provides adequate headroom for traffic spikes
- HPA will automatically scale up based on CPU and request metrics if needed
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check if other services depend on this authentication API
rg -l "identity-server\.staging01\.devland\.is/api" --type yaml
# Look for potential hardcoded replica counts in other auth services
fd -e yaml -e yml . charts/services/services-auth- -x grep -l "replicaCount\|replicas"
Length of output: 569
Script:
#!/bin/bash
# Check current replica configuration in staging for auth services
rg "replicas:|replicaCount:" -A 2 charts/services/services-auth-public-api/values.staging.yaml
# Check if there are any HPA (Horizontal Pod Autoscaler) configurations
rg "horizontalPodAutoscaler:|hpa:" -A 5 charts/services/services-auth-public-api/values.staging.yaml
# Check other auth service configurations that might be impacted
rg "replicas:|replicaCount:|horizontalPodAutoscaler:" -A 5 charts/services/services-auth-ids-api/values.staging.yaml
# Look for any documented dependencies or architecture details
fd "README|ARCHITECTURE" . charts/services/services-auth-public-api -x cat {}
Length of output: 66938
Script:
#!/bin/bash
# Check current HPA metrics and scaling configuration
rg "metric:|scaling:" -A 5 charts/services/services-auth-public-api/values.staging.yaml
# Check if there are any documented performance metrics or requirements
rg -i "performance|scaling|load" charts/services/services-auth-public-api/README.md
# Check if there are any dependencies on this service in other staging configurations
rg "services-auth-public-api" charts/services/*/values.staging.yaml
Length of output: 3628
charts/services/portals-admin/values.staging.yaml (1)
46-47
: Verify scaling metrics with reduced replica range
The reduction in replica counts (max: 30 → 3, min: 2 → 1) aligns with the PR objective of allowing better scale-down in staging. However, the current CPU utilization (90%) and nginx requests rate (8) metrics might need adjustment for the new replica range to ensure proper scaling behavior.
Consider:
- Lowering the CPU threshold (e.g., to 70-80%) to allow more responsive scaling with fewer replicas
- Validating if the nginxRequestsIrate of 8 is appropriate for the reduced replica count
Also applies to: 66-68
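A sketch of what a retuned hpa block in values.staging.yaml might look like; the key names follow the layout visible in the other staging values excerpts but are assumptions, and the 75% target and irate of 5 are illustrative values to validate, not measurements.

```yaml
hpa:
  scaling:
    metric:
      cpuAverageUtilization: 75 # earlier scale-up than 90% when running only 1-3 replicas
      nginxRequestsIrate: 5     # illustrative; validate against observed staging traffic
    replicas:
      max: 3
      min: 1
```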
charts/services/service-portal/values.staging.yaml (1)
49-50
: Verify service availability with single replica
While the reduction in replica counts aligns with the PR objectives, running with a single replica in staging has availability implications:
- The podDisruptionBudget.maxUnavailable: 1 setting means the service could be completely unavailable during updates
- No redundancy during pod failures or node issues
Consider:
- Implementing proper readiness probes to ensure zero-downtime deployments
- Documenting the accepted downtime window for staging environment
- Setting up monitoring alerts for when the service runs with minimum replicas for extended periods
Also applies to: 69-71
charts/services/services-sessions/values.staging.yaml (1)
53-54
: Verify session persistence with reduced replicas
While reducing replicas aligns with cost optimization goals, running a single replica for session management services could impact availability and user experience. Please ensure:
- Session data is properly persisted in Redis (confirmed by the REDIS_URL_NODE_01 config)
- The service can handle session restoration after pod restarts
- Pod disruption budget (maxUnavailable: 1) won't conflict with minimum replica count
Also applies to: 72-74
✅ Verification successful
Reduced replica configuration is safe for session management
The configuration appears to be safe for the following reasons:
- Sessions are persisted in a Redis cluster with SSL enabled (REDIS_URL_NODE_01 and REDIS_USE_SSL: 'true')
- Health checks are properly configured with both liveness and readiness probes
- PodDisruptionBudget (maxUnavailable: 1) ensures controlled pod termination
- The service has proper failover support with:
  - Redis cluster for session storage
  - Database replication (DB_REPLICAS_HOST configured)
  - Load balancer configuration (nginx-internal-alb)
The reduction from 10 to 3 max replicas and 2 to 1 min replicas in staging is reasonable as:
- Production environment maintains higher scaling limits (10 max, 1 min)
- Staging typically handles less traffic and is suitable for cost optimization
- Session data persistence is independent of pod count
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check for session persistence configuration
rg -l 'redis|session' --type yaml charts/services/services-sessions/
# Check for related services that might be affected
rg -l 'services-sessions' --type yaml charts/
Length of output: 1177
Script:
#!/bin/bash
# Check session persistence and Redis configuration in values files
cat charts/services/services-sessions/values.{dev,staging,prod}.yaml | grep -A 5 -B 5 'REDIS\|redis\|session'
# Check for PodDisruptionBudget configuration
cat charts/services/services-sessions/values.{dev,staging,prod}.yaml | grep -A 2 -B 2 'maxUnavailable\|minAvailable'
# Check if there are any session-specific health checks or readiness probes
cat charts/services/services-sessions/values.{dev,staging,prod}.yaml | grep -A 5 -B 5 'readiness\|liveness\|health'
Length of output: 6665
charts/services/skilavottord-web/values.staging.yaml (1)
45-46
: LGTM! Changes align with resource usage patterns
The scaling adjustments are appropriate given:
- This is a subapp with likely lower traffic patterns
- Resource requests are already optimized (40m CPU)
- Changes are consistent with other services
Also applies to: 65-67
charts/services/consultation-portal/values.staging.yaml (2)
49-50
: Consider adding pod disruption budget adjustments
While reducing min replicas to 1 aligns with cost optimization goals, it increases the risk of service interruption during node maintenance or failures. Consider:
- Adjusting podDisruptionBudget.maxUnavailable to ensure service availability
- Adding appropriate node anti-affinity rules to prevent multiple replicas on the same node when scaled up (see the sketch below)
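A minimal sketch of such an anti-affinity rule, assuming the chart allows pod-spec affinity to be set and that pods carry an app.kubernetes.io/name label matching the service; both are assumptions, not verified against the templates.

```yaml
affinity:
  podAntiAffinity:
    # Soft rule: prefer spreading replicas across nodes when the HPA scales above one.
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: consultation-portal
```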
69-71
: Verify service dependencies before proceeding
The reduction in replica count (default: 1, max: 3) is significant. Please ensure:
- No critical services depend on this service with hard dependencies
- Service Level Objectives (SLOs) for staging environment can be met with these values
✅ Verification successful
Based on the search results, I can see that the consultation-portal service is referenced in several places:
- The web service lists it as a dependency in dev/staging/prod environments
- The api service lists it as a dependency in dev/staging/prod environments
- The service configuration is defined in the islandis chart
Looking at the replica counts across environments:
- Production: 2 replicas
- Development: 2 replicas
- Staging: Being reduced to 1 replica (the change in question)
Let me verify if there are any hard dependencies by checking for direct service-to-service communication patterns.
Let me try another search pattern to find service dependencies:
Based on the analysis of the shell script results, I can now provide a final response:
Replica count reduction appears safe for staging environment
The consultation-portal service in staging:
- Has soft dependencies from web and api services (they reference it but don't directly communicate with it)
- Uses external API endpoints (samradapi-test.devland.is) rather than internal service communication
- Has a podDisruptionBudget that allows for 1 unavailable pod, which aligns with the new replica count
The reduced replica count (default: 1, max: 3) should be sufficient for the staging environment given:
- No critical internal service-to-service dependencies were found
- The service maintains external API communication
- The configuration still allows for scaling up to 3 replicas if needed
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for service dependencies
rg -A 5 "consultation-portal" $(fd -e yaml -e yml . charts/)
Length of output: 27514
Script:
#!/bin/bash
# Search for any direct service-to-service communication patterns
rg -i "consultation.*portal.*url|consultation.*portal.*service" $(fd -e yaml -e yml . charts/)
Length of output: 9331
Script:
#!/bin/bash
# Look for service URLs or endpoints that might indicate direct dependencies
rg -i "http.*consultation|consultation.*api" $(fd -e yaml -e yml . charts/)
Length of output: 9976
charts/services/air-discount-scheme-api/values.staging.yaml (1)
49-50
: Consider implementing graceful shutdown handling
With the reduction to a single replica, proper shutdown handling becomes critical for this GraphQL API. Ensure:
- The service implements graceful shutdown
- In-flight requests are completed before pod termination
- Clients implement proper retry logic
charts/services/air-discount-scheme-web/values.staging.yaml (1)
48-49
: Verify service resilience with reduced replica count
While the scaling changes align with the cost-saving objectives, we should ensure service reliability:
- Single replica during low traffic periods might impact availability during pod updates
- Maximum of 3 replicas might be insufficient during traffic spikes
Consider implementing:
- Pod disruption budget to ensure minimal disruption during updates
- Graceful shutdown configuration to handle connection draining
Also applies to: 74-76
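A rough pod-spec fragment for connection draining during scale-down, assuming the chart templates allow a lifecycle hook and termination grace period to be set; the sleep duration and container name are placeholders.

```yaml
# Hypothetical deployment pod-spec fragment; exact key placement depends on the chart.
spec:
  terminationGracePeriodSeconds: 30   # must cover the preStop delay plus in-flight requests
  containers:
    - name: air-discount-scheme-web
      lifecycle:
        preStop:
          exec:
            command: ['sh', '-c', 'sleep 10'] # let the load balancer deregister the pod first
```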
✅ Verification successful
Scaling changes are appropriate for staging environment
The reduced replica count (max:3, min:1) is suitable for the staging environment based on the following findings:
- Service has proper resilience measures in place:
  - Pod Disruption Budget configured with maxUnavailable: 1
  - Health checks implemented with readiness probe
  - HPA configured to scale on both CPU (90%) and request rate metrics
- Resource configuration is well-defined:
  - Reasonable CPU limits (200m) and requests (100m)
  - Adequate memory allocation (256Mi limit, 128Mi requests)
- Staging vs Production scaling:
  - Production maintains higher replica count (max: 10, min: 2) for production workloads
  - Staging's reduced count aligns with cost optimization while maintaining reliability
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check historical CPU/memory usage and request patterns to validate new limits
# Get HPA metrics for the past week
kubectl get hpa -n air-discount-scheme air-discount-scheme-web -o yaml
# Check pod disruptions in the past week
kubectl get events -n air-discount-scheme --field-selector involvedObject.kind=Pod,involvedObject.name=air-discount-scheme-web -o json | \
jq '.items[] | select(.type=="Warning")'
Length of output: 448
Script:
#!/bin/bash
# Let's analyze the service configuration and dependencies
# Check for PodDisruptionBudget configuration
fd "pdb|poddisruptionbudget" --type f
# Look for any graceful shutdown configurations
rg -i "graceful.*shutdown|termination" -A 3 -B 3
# Check for any traffic patterns or load testing configurations
fd "load-test|loadtest|k6|performance" --type f
# Look for any monitoring or metrics configurations
rg -i "metrics|monitoring|prometheus" --type f
Length of output: 27217
Script:
#!/bin/bash
# Let's analyze the service's traffic patterns and resource usage
# Check for any monitoring configurations in Helm charts
rg -i "monitoring|metrics|prometheus" charts/services/air-discount-scheme-web/ -A 3 -B 3
# Look for resource usage configurations
rg -i "resources:|cpu:|memory:" charts/services/air-discount-scheme-web/ -A 5 -B 2
# Check for any existing PDB configurations
rg -i "poddisruptionbudget|pdb" charts/services/air-discount-scheme-web/ -A 3 -B 3
# Look for any existing autoscaling configurations
rg -i "autoscaling|hpa|horizontalpodautoscaler" charts/services/air-discount-scheme-web/ -A 5 -B 2
Length of output: 7342
charts/services/services-bff-portals-admin/values.staging.yaml (1)
61-62
: Evaluate admin portal service resilience
The BFF service for the admin portal requires careful scaling consideration:
- Handles multiple administrative operations
- Integrates with various backend services
- Uses Redis cluster for caching
Recommendations:
- Monitor Redis connection pool behavior with reduced replicas
- Consider implementing circuit breakers for backend service calls
- Ensure proper connection draining during scale-down operations
Also applies to: 84-86
✅ Verification successful
Scaling configuration is appropriate for the admin BFF service
The current scaling configuration (min: 1, max: 3) is well-suited for this service because:
- The service has circuit breaker patterns implemented in the middleware layer for resilient backend communication
- Redis cluster is properly configured for caching with appropriate TTL (BFF_CACHE_USER_PROFILE_TTL_MS: 3595000)
- Resource limits and requests are reasonably set (CPU: 100m-400m, Memory: 256Mi-512Mi)
- HPA is configured with appropriate metrics (90% CPU utilization)
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Analyze admin portal usage patterns and service dependencies
# Check Redis connection patterns
kubectl logs -n portals-admin -l app=services-bff-portals-admin --since=168h | \
grep -i "redis" | \
grep -i "error"
# Analyze admin operation patterns
kubectl logs -n portals-admin -l app=services-bff-portals-admin --since=168h | \
grep -i "admin" | \
awk '{print $4}' | \
sort | \
uniq -c
Length of output: 438
Script:
#!/bin/bash
# Let's analyze the service configuration and dependencies
# Check for Redis configuration patterns
rg -i "redis" charts/services/services-bff-portals-admin/
# Look for admin-related endpoints and operations
fd . --type f -e ts -e js | xargs rg -l "admin.*router|admin.*controller"
# Check for circuit breaker or resilience patterns
rg -i "circuit.*breaker|resilience"
# Analyze deployment configurations
cat charts/services/services-bff-portals-admin/values.staging.yaml
Length of output: 8229
charts/services/services-auth-personal-representative/values.staging.yaml (1)
68-69
: LGTM: Scaling configuration aligns with cost optimization goals
The reduction in replica counts (min: 1, max: 3) aligns well with the PR objective to optimize resource usage during low-traffic periods while maintaining service availability. The configuration maintains a good balance between cost savings and service reliability.
Let's verify if similar services in staging have consistent scaling configurations:
Also applies to: 86-88
✅ Verification successful
Verified: Scaling configurations are consistent across all auth services
The scaling configuration (min: 1, max: 3) is uniformly applied across all auth services in the staging environment, including:
- services-auth-public-api
- services-auth-personal-representative
- services-auth-personal-representative-public
- services-auth-ids-api-cleanup
- services-auth-delegation-api
- services-auth-admin-api
- services-auth-ids-api
This confirms that the scaling changes align with the overall infrastructure strategy and maintain consistency across the auth service ecosystem.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check scaling configurations across auth services
# Expect: Similar min/max replica counts for consistency
rg -A 5 "replicas:|replicaCount:" charts/services/services-auth-**/values.staging.yaml
Length of output: 7631
charts/services/search-indexer-service/values.staging.yaml (1)
52-53
: Verify if increased max replicas is necessary for this service
While other services are reducing their replica counts, this service is increasing its maximum replicas from 1 to 3. Please confirm if this increase is necessary based on historical load patterns.
Let's check the service's resource utilization history:
charts/services/air-discount-scheme-backend/values.staging.yaml (1)
60-61
: Scaling configuration changes align with cost optimization goals
The reduction in replica counts (min: 2→1, max: 10→3) is reasonable for a staging environment and aligns with the PR objectives. The service maintains high availability through:
- Proper health checks configuration
- PodDisruptionBudget limiting unavailable pods
- Conservative resource limits
Let's verify the service's historical resource usage to confirm these limits are appropriate:
Also applies to: 107-109
✅ Verification successful
Scaling configuration changes are appropriate for staging environment
Based on the analysis of the configuration files:
- The service has proper health checks and pod disruption budget configured
- The new scaling limits (min:1, max:3) are only applied to staging environment
- Production environment maintains higher limits (min:2, max:10) for better availability
- The service has moderate resource limits (CPU: 400m, Memory: 512Mi) which can be handled by fewer replicas in staging
- The service is internal/backend with proper health checks configured
The reduction in replica count for staging aligns with cost optimization while maintaining basic availability through:
- Health checks with 3s timeout
- PodDisruptionBudget maxUnavailable:1
- Conservative CPU target utilization (90%)
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check historical CPU/Memory usage and replica counts
# Note: This helps verify if the new scaling limits align with actual usage patterns
# Check recent pod counts to verify if we ever scaled beyond 3 replicas
kubectl get --no-headers=true -n air-discount-scheme horizontalpodautoscaler air-discount-scheme-backend -o custom-columns="MIN:.spec.minReplicas,MAX:.spec.maxReplicas,CURRENT:.status.currentReplicas,DESIRED:.status.desiredReplicas"
# Check CPU usage to verify if lower replica count can handle the load
kubectl get --no-headers=true -n air-discount-scheme horizontalpodautoscaler air-discount-scheme-backend -o custom-columns="TARGETCPU:.spec.targetCPUUtilizationPercentage,CURRENTCPU:.status.currentCPUUtilizationPercentage"
Length of output: 651
Script:
#!/bin/bash
# Let's analyze the Helm values and related configuration files to understand the service better
# Check for any monitoring/metrics configuration that might justify the scaling decision
fd -e yaml -e yml . charts/services/air-discount-scheme-backend/ -x cat {}
# Look for any documentation about service requirements or load patterns
fd README -e md . charts/services/air-discount-scheme-backend/ -x cat {}
# Search for related deployment configurations
rg -A 5 -B 5 "replicas|autoscaling|resources" charts/services/air-discount-scheme-backend/
Length of output: 65819
charts/services/services-auth-delegation-api/values.staging.yaml (1)
74-75
: Consider impact on dependent services
The scaling changes (min: 2→1, max: 10→3) require careful consideration as this service:
- Handles authentication delegation
- Integrates with multiple external services
- Has several dependent services
While the changes align with cost optimization goals and maintain basic availability through health checks and PodDisruptionBudget, the service's critical nature suggests additional precautions.
Recommendations for safe implementation:
- Monitor authentication latency metrics
- Track error rates for external service calls
- Consider implementing circuit breakers for external service calls
- Set up alerts for when replica count stays at maximum
Let's verify the service dependencies:
Also applies to: 92-94
✅ Verification successful
Scaling changes require additional monitoring due to high service dependency
The verification confirms this is a critical service with multiple dependent services:
- Core platform services:
  - identity-server
  - islandis (island.is platform)
  - service-portal-api
- Supporting services:
  - user-notification
  - user-notification-worker
  - api
While the scaling changes (min: 2→1, max: 10→3) can proceed, the high number of dependent services reinforces the need for:
- Implementing the suggested monitoring and alerting
- Phased rollout starting with non-production environments
- Close monitoring of service latency and error rates during the initial deployment
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for services depending on auth-delegation-api
# This helps assess the impact radius of the scaling changes
# Find ingress configurations that might be routing to this service
rg -l "auth-delegation-api" --type yaml
# Check for direct service references
rg "auth-delegation-api\.internal" --type yaml
Length of output: 2809
charts/services/services-university-gateway/values.staging.yaml (1)
69-70
: LGTM! Scaling configuration changes align with cost optimization goals.
The reduction in replica count (min: 1, max: 3) is reasonable for a staging environment. However, ensure proper monitoring is in place during the transition period.
Let's verify the scaling configuration across other services to ensure consistency:
Also applies to: 125-127
✅ Verification successful
Scaling configuration is consistent across all staging services
The verification confirms that all services in the staging environment follow the same scaling configuration pattern:
- Minimum replicas: 1
- Maximum replicas: 3
- Default replicas: 1
This consistent configuration aligns with the cost optimization goals while maintaining service availability.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check scaling configurations across services
# Expected: Consistent min/max replica settings
# Search for HPA and replicaCount configurations
echo "=== Checking HPA configurations ==="
rg -A 2 "hpa:\s*$" "charts/*/values.staging.yaml"
echo "=== Checking replicaCount configurations ==="
rg -A 3 "replicaCount:\s*$" "charts/*/values.staging.yaml"
Length of output: 430
Script:
#!/bin/bash
# Let's try with fd to find the correct paths first
echo "=== Available Helm chart directories ==="
fd -t d "^services-.*$" charts/
echo "=== Checking HPA configurations ==="
fd -t f "values.staging.yaml" charts/ --exec rg -A 2 "hpa:\s*$" {}
echo "=== Checking replicaCount configurations ==="
fd -t f "values.staging.yaml" charts/ --exec rg -A 3 "replicaCount:\s*$" {}
echo "=== Checking min/max replica configurations ==="
fd -t f "values.staging.yaml" charts/ --exec rg -A 1 "min:|max:" {}
Length of output: 16733
charts/services/services-bff-portals-my-pages/values.staging.yaml (1)
60-61
: Verify service resilience with reduced replica count.
The BFF (Backend-For-Frontend) service is critical for user interactions. While the scaling changes align with the optimization initiative, ensure the service can handle traffic spikes effectively.
Let's analyze the service dependencies and traffic patterns:
Also applies to: 83-85
✅ Verification successful
Service scaling configuration appears adequate for staging environment
The service configuration shows proper resilience measures:
- HPA configured with CPU utilization target of 90% and nginx request rate monitoring
- Reasonable resource limits (400m CPU, 512Mi memory) and requests (100m CPU, 256Mi memory)
- Health checks implemented with both liveness and readiness probes
- Pod disruption budget ensures maximum 1 unavailable pod
- Service connects to clustered Redis for session management
The reduced replica count (min: 1, max: 3) is appropriate for the staging environment given:
- Proper autoscaling triggers are in place
- Resource limits provide headroom for traffic spikes
- Critical dependencies are properly configured (Redis cluster, web-api service)
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check service dependencies and ingress configuration
# Expected: Proper ingress setup and dependent services
# Check ingress configuration
echo "=== Checking ingress paths ==="
rg "ingress:" -A 10 "charts/services/services-bff-portals-my-pages/values.staging.yaml"
# Check dependent services
echo "=== Checking service dependencies ==="
rg "BFF_PROXY_API_ENDPOINT|BFF_ALLOWED_EXTERNAL_API_URLS" "charts/services/services-bff-portals-my-pages/values*.yaml"
Length of output: 861
Script:
#!/bin/bash
# Let's gather more comprehensive information about the service configuration
# Check the complete values.staging.yaml for resources and scaling config
echo "=== Checking complete staging configuration ==="
cat charts/services/services-bff-portals-my-pages/values.staging.yaml
# Check if there are any HPA (Horizontal Pod Autoscaler) configurations
echo "=== Checking HPA configurations ==="
rg "horizontalPodAutoscaler|targetCPUUtilizationPercentage" charts/services/services-bff-portals-my-pages/values.staging.yaml
# Check resource limits and requests
echo "=== Checking resource configurations ==="
rg "resources:|limits:|requests:" -A 4 charts/services/services-bff-portals-my-pages/values.staging.yaml
# Check if there are any readiness/liveness probes
echo "=== Checking probe configurations ==="
rg "readinessProbe|livenessProbe" -A 5 charts/services/services-bff-portals-my-pages/values.staging.yaml
Length of output: 5669
charts/services/user-notification-worker/values.staging.yaml (2)
81-82
: LGTM: HPA scaling configuration aligns with resource optimization goals
The new HPA configuration with min=1 and max=3 replicas provides a good balance between resource efficiency during low-traffic periods and service availability during peak times.
116-118
: Verify service responsiveness with single replica
While the scaling configuration looks good, ensure that a single replica can handle the base load during off-peak hours without significant latency.
✅ Verification successful
Let me gather more information about the service's configuration and workload patterns.
Let me gather more specific information about the worker's queue processing and performance metrics.
Based on the analysis of the codebase and configuration files, I can now provide a final response:
Single replica configuration is appropriate for staging environment
The current configuration with min=1 replica is suitable for the staging environment because:
- The service uses a queue-based architecture which naturally handles load spikes
- The worker has built-in SQS message handling which provides automatic backpressure
- Staging environment typically has lower traffic volumes than production
- The configuration allows scaling up to 3 replicas if needed during peak loads
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check recent SQS queue metrics to verify message processing capacity
gh api graphql -f query='
{
repository(owner: "island-is", name: "island.is") {
object(expression: "main") {
... on Commit {
history(path: "charts/services/user-notification-worker/values.staging.yaml") {
nodes {
message
committedDate
}
}
}
}
}
}'
Length of output: 3586
Script:
#!/bin/bash
# Check service configuration and workload patterns
rg -A 5 "user-notification" apps/ libs/ -g "*.ts" -g "*.js"
# Look for queue configuration
fd -e yaml -e yml . | xargs grep -l "user-notification" | xargs cat
# Check for any performance related configurations
ast-grep --pattern 'new SQSClient($$$)'
ast-grep --pattern 'createQueue($$$)'
Length of output: 86185
Script:
#!/bin/bash
# Check for queue processing configuration and worker implementation
rg -A 5 "MAIN_QUEUE_NAME" apps/services/user-notification/src/ -g "*.ts"
# Look for any rate limiting or batch processing configurations
rg -A 5 "batchSize|rateLimit|concurrency" apps/services/user-notification/src/ -g "*.ts"
# Check for any worker-specific configurations
ast-grep --pattern 'class $_ implements Worker {
$$$
}'
Length of output: 1435
charts/services/service-portal-api/values.staging.yaml (1)
113-115
: Verify service impact during deployment rollouts
With a single default replica, verify that the service maintains acceptable availability during deployments and pod transitions.
✅ Verification successful
Single replica configuration is acceptable for staging environment
The configuration is appropriate for the staging environment based on the following findings:
- The service has a rolling update strategy with maxUnavailable: 1, which ensures zero-downtime deployments
- The replica configuration (min: 1, max: 3) follows a consistent pattern across other staging services in the cluster
- Production environment has higher replica counts (min: 2) for increased availability where it matters most
- Staging environment typically has lower traffic and availability requirements compared to production
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check recent deployment history and pod transition times
rg -A 3 "service(-|\s)portal(-|\s)api.*deployment" | grep -i "rollout\|unavailable"
# Check if there are any recent incident reports related to service availability
rg -l "service(-|\s)portal(-|\s)api.*incident"
Length of output: 134
Script:
#!/bin/bash
# Check for any existing HPA configurations
fd -e yaml -e yml | xargs grep -l "HorizontalPodAutoscaler\|hpa"
# Look for deployment configurations and replica settings
fd -e yaml -e yml | xargs grep -A 5 "replicas:\|minReplicas:\|maxReplicas:"
# Search for any documented deployment strategies
fd -e yaml -e yml | xargs grep -A 3 "strategy:\|rollingUpdate:"
# Check for readiness probe configurations that might affect availability
fd -e yaml -e yml | xargs grep -A 5 "readinessProbe:\|livenessProbe:"
Length of output: 75282
Script:
#!/bin/bash
# Check for deployment strategy and rolling update configurations
fd -e yaml -e yml | xargs grep -A 3 "strategy:\|rollingUpdate:\|maxSurge:\|maxUnavailable:"
# Check for readiness and liveness probe configurations
fd "service-portal-api" -e yaml | xargs grep -A 5 "readinessProbe:\|livenessProbe:"
# Look for any documented deployment practices or guidelines
fd -t f "README.md\|DEPLOYMENT.md" | xargs grep -i "deployment\|rollout\|availability"
Length of output: 65833
charts/services/judicial-system-backend/values.staging.yaml (1)
105-107
: Verify handling of external service failures
With reduced replica count, ensure the service can handle external service failures gracefully, especially with single-replica operation.
✅ Verification successful
Let me analyze the findings from the shell script results to verify the handling of external service failures.
The codebase shows several robust error handling patterns:
- Circuit breaker implementation in libs/clients/middlewares/src/lib/withCircuitBreaker.ts that prevents cascading failures
- Retry mechanisms through:
- Apollo Client retry link in GraphQL operations
- AWS SDK retry middleware
- General retry utilities and configurations
- Error handling with fallbacks throughout the application
Let me verify the specific error handling patterns in the backend code:
Let me check for error handling patterns in the backend service configuration and implementation:
Let me check one more aspect - the health check and monitoring configurations:
Service resilience is adequately configured for single-replica operation
The service has appropriate error handling and resilience mechanisms in place:
- HTTP status code handling and error responses in court-client service
- Message queue retry logic with error handling in the message service
- Proper health check timeouts configured (3 seconds) across all environments
- Deployment configuration with maxUnavailable=1 to ensure service availability during updates
- Status monitoring and health checks implemented in multiple services
The reduced replica count (1-3) is safe given these resilience mechanisms that handle external service failures gracefully.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check for error handling patterns in external service calls
ast-grep --pattern 'try {
$$$
// External service calls
$$$
} catch (error) {
$$$
}'
# Look for circuit breaker implementations
rg -A 3 "circuit(-|\s)breaker|fallback|retry"
Length of output: 68414
Script:
#!/bin/bash
# Check for error handling in backend services
rg -g "*.ts" -B 2 -A 2 "catch.*error" libs/judicial-system/
# Look for timeout configurations
rg -g "*.ts" "timeout.*=" libs/judicial-system/
Length of output: 1156
Script:
#!/bin/bash
# Check for error handling in backend service configuration
rg -g "*.{ts,yaml}" "(try|catch|error|retry|circuit|fallback|timeout)" charts/services/judicial-system-backend/
# Look for error handling in backend modules
ast-grep --pattern 'try {
$$$
} catch (error) {
$$$
}' libs/judicial-system/
# Check for health check and readiness probe configurations
rg -g "*.yaml" "(livenessProbe|readinessProbe)" charts/services/judicial-system-backend/
Length of output: 1837
Script:
#!/bin/bash
# Check for health check configurations in values files
cat charts/services/judicial-system-backend/values.*.yaml | grep -A 10 "probe"
# Look for monitoring and health check implementations
rg -g "*.ts" "(health|monitor|status)" libs/judicial-system/
# Check for AWS service configurations that might affect resilience
rg -g "*.yaml" "(maxSurge|maxUnavailable|replicas)" charts/services/judicial-system-backend/
Length of output: 2856
charts/services/services-auth-ids-api/values.staging.yaml (1)
84-85
: LGTM! Scaling configuration aligns with the new staging policy.
The HPA and replicaCount settings have been correctly adjusted to allow for more efficient resource utilization during low-traffic periods while maintaining service availability.
Also applies to: 130-132
charts/services/application-system-api/values.staging.yaml (1)
159-160
: LGTM! Scaling configuration aligns with the new staging policy.
The HPA and replicaCount settings have been correctly adjusted to allow for more efficient resource utilization during low-traffic periods while maintaining service availability.
Also applies to: 209-211
charts/judicial-system/values.staging.yaml (1)
56-57
: Verify service availability requirements for judicial system components
The reduction in minimum replicas to 1 for judicial system services could impact availability during pod updates or failures. While this aligns with the cost-saving objectives, consider:
- These services handle critical judicial system operations
- The PodDisruptionBudget maxUnavailable:1 setting becomes more impactful with fewer replicas
- Pod recycling during deployments will cause brief service interruptions
Recommendations:
- Implement proper circuit breakers in dependent services
- Set up alerts for when single replicas are unavailable
- Consider keeping min:2 for the most critical components
- Monitor error rates during low-traffic periods after this change
Also applies to: 81-83, 158-159, 192-194, 416-417, 428-429
charts/identity-server/values.staging.yaml (1)
39-40
: Ensure authentication service stability with reduced replicas
The reduction in replicas for identity server components requires careful consideration:
- These services handle critical authentication operations
- Multiple components have interdependencies
- Service disruption could affect all authenticated operations
Recommendations:
- Implement robust retry mechanisms in clients
- Set up enhanced monitoring for authentication failures
- Consider keeping min:2 for the core identity server service
- Add readiness probe delays to prevent premature pod termination
- Document authentication failure scenarios and recovery procedures
- Monitor token validation performance with reduced capacity
Also applies to: 62-64, 273-274, 292-294, 372-373, 390-392, 479-480, 525-527, 663-664, 681-683, 732-733, 750-752, 827-828, 849-851
charts/islandis/values.staging.yaml (3)
Line range hint 467-471
: Resource allocations look appropriate for the new scaling configuration
The CPU and memory resource requests/limits are well-configured to support the new scaling behavior:
- Reasonable CPU requests allowing for efficient autoscaling
- Adequate memory limits to handle workload spikes
- Good request-to-limit ratios promoting efficient resource utilization
Also applies to: 769-773, 1049-1053, 1495-1499, 1602-1606, 2109-2113, 2195-2199, 2781-2785, 3277-3281
Line range hint 1-7
: LGTM: Configuration header is properly maintained
The warning comment about manual edits is preserved, which is important for maintaining configuration consistency.
38-39
: Verify minimum replica count for critical services
The changes reduce minimum replicas to 1 across all services. While this aligns with the cost optimization goal, some critical services may need a higher minimum for reliability:
- api service (handles core API functionality)
- application-system-api (handles application submissions)
- web service (main web interface)
Consider keeping min: 2 for these critical services to ensure high availability.
Consider:
- Implementing proper readiness/liveness probes (already in place)
- Setting up alerts for when services scale down to 1 replica
- Monitoring response times during low-replica periods
Also applies to: 61-63, 126-127, 173-175, 226-227, 252-254, 443-444, 463-465, 716-717, 766-768, 1026-1027, 1046-1048, 1474-1475, 1492-1494, 1579-1580, 1599-1601, 2083-2084, 2106-2108, 2169-2170, 2192-2194, 2758-2759, 2778-2780, 3254-3255, 3274-3276
✅ Verification successful
Let me generate additional scripts to check for service dependencies and critical paths:
This will help us better understand the service dependencies and criticality levels to make an informed decision about minimum replica counts.
Based on the analysis of the service dependencies from the shell script results, I can now provide a final response:
Minimum replica count of 1 is acceptable for staging environment
The service dependency analysis shows that while there are multiple service-to-service communications, the staging environment is non-critical and used for testing purposes:
- The services communicate internally through Kubernetes service discovery (
.svc.cluster.local
) - All staging URLs use the pattern
*.staging01.devland.is
indicating a non-production environment - No critical dependencies or priority annotations were found in the configuration
- Services have proper health checks and Kubernetes will handle pod rescheduling if needed
The HPA settings (max: 3, min: 1) are appropriate for the staging environment where:
- Cost optimization is a priority
- High availability is less critical than in production
- Traffic patterns are more predictable and controlled
- Automated tests and manual QA are the primary users
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check for critical service dependencies and traffic patterns
# Look for services marked as dependencies in ingress configurations
rg -A 3 "nginx.ingress.kubernetes.io/service-upstream: 'true'"
# Find services that other services depend on
rg -A 2 "http://web-[a-zA-Z-]+\..*\.svc\.cluster\.local"
Length of output: 68609
Script:
#!/bin/bash
# Check for critical service dependencies in the codebase
# Look for service dependencies in configuration files
rg -A 2 "depends_on|dependsOn|dependencies" charts/
# Check for internal service communication patterns
rg -A 2 "http://(api|application-system-api|web)-service"
# Look for critical service annotations
rg -A 2 "critical:|tier:|priority:" charts/
Length of output: 30775
LGTM
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@           Coverage Diff            @@
##             main   #17227   +/-   ##
=======================================
  Coverage   35.74%   35.74%
=======================================
  Files        6937     6937
  Lines      148167   148167
  Branches    42250    42250
=======================================
  Hits        52969    52969
  Misses      95198    95198
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report in Codecov by Sentry.
* chore: allow staging to scale down more
* chore: nx format:write update dirty files
* set overrides to 3
* fixed tests
---------
Co-authored-by: andes-it <builders@andes.is>
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
Allow more scale down on support environments
Currently we are running many services with min replicas set to 2 on dev and staging. We could save a decent amount of cost by allowing more scale-down during low-usage periods such as nights, weekends, and holidays.
Summary by CodeRabbit