
Scale down operation fails and is never requeued if getReplicasForPod fails transiently #707

Open
kos-team opened this issue Jun 11, 2024 · 2 comments

@kos-team

Bug Report

What did you do? / Minimal Reproducible Example
We first deployed a five-replica Solr cluster, then modified the CR to scale down to three replicas.
The scale down triggers the solr-operator to perform the ManagedScaleDown operation, invoking the handleManagedCloudScaleDown function, which in turn calls the evictSinglePod function. If the call to Solr fails transiently (due to a transient network issue or transient unavailability of Solr), the operator writes an error message and never retries. Even if the transient fault is resolved later, the operator stays stuck and never performs the scale down operation.
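For context, controller-runtime only retries a reconcile when the Reconcile call returns a non-nil error or a Result that requests a requeue. Below is a minimal, hedged sketch of that contract; exampleReconciler and doScaleDown are hypothetical stand-ins for the operator's reconciler and scale-down path, not its actual code.

```go
package main

import (
	"context"
	"errors"
	"fmt"

	ctrl "sigs.k8s.io/controller-runtime"
)

// exampleReconciler is a hypothetical stand-in for the operator's
// SolrCloudReconciler, used only to illustrate the requeue contract.
type exampleReconciler struct{}

// doScaleDown is a hypothetical stand-in for the managed scale-down path.
func doScaleDown(ctx context.Context) error {
	return errors.New("dial tcp: i/o timeout") // simulated transient failure
}

// Reconcile illustrates the contract: returning a non-nil error (or a
// Result with Requeue set) gets the request requeued with backoff, while
// returning (ctrl.Result{}, nil) marks it done. A swallowed error therefore
// means the scale down is never retried.
func (r *exampleReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	if err := doScaleDown(ctx); err != nil {
		return ctrl.Result{}, err // requeued and retried later
	}
	return ctrl.Result{}, nil // success: no retry
}

func main() {
	res, err := (&exampleReconciler{}).Reconcile(context.Background(), ctrl.Request{})
	fmt.Println(res, err)
}
```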

To reproduce this bug:

  1. Apply the CR to deploy a 5-replica Solr cluster by changing replicas to 5 in the example: https://github.com/apache/solr-operator/blob/main/example/test_solrcloud.yaml
  2. Expose the operator to faults (this can be reliably reproduced with the ChaosMesh controller by injecting network loss between the Solr pods and the Solr operator).
  3. Change replicas: 5 to replicas: 3 in the CR to initiate the scale down.
  4. Observe that the operator reports an error log because it is unable to contact the Solr pods for status:
ERROR    Error retrieving cluster status, cannot determine if pod has replicas    {"controller": "solrcloud", "controllerGroup": "solr.apache.org", "controllerKind": "SolrCloud", "SolrCloud": {"name":"example","namespace":"default"}, "namespace": "default", "name": "example", "reconcileID": "66d08e1c-a6cd-4176-a792-c706177cc39f", "error": "Get \"http://example-solrcloud-common.default/solr/admin/collections?action=CLUSTERSTATUS&wt=json\": dial tcp 10.96.177.119:80: i/o timeout"}
  5. Remove the fault and observe that the operator never requeues and reconciles the request.

What did you expect to see?
The operator should be able to scale down the cluster once the transient faults go away.

What did you see instead?
The operator never requeues the request

Root Cause
The root cause of this bug is at https://github.com/apache/solr-operator/blob/b2c951052828f88e9fc415f54aec189b805117e9/controllers/solr_cluster_ops_util.go#L251C57-L251C71
where the operator does not handle the error returned by evictSinglePod and instead returns as if the operation had succeeded.

The fix should be simple: add an error-handling branch that requeues the request.
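A minimal sketch of such a fix, assuming a simplified, parameterless signature for handleManagedCloudScaleDown and a stubbed evictSinglePod (the real functions take more arguments): propagate the eviction error so the reconciler returns it and controller-runtime requeues the request.

```go
package main

import (
	"errors"
	"fmt"
)

// evictSinglePod stands in for the operator's helper of the same name; the
// real function calls the Solr collections API to move replicas off the pod.
func evictSinglePod() error {
	return errors.New("i/o timeout") // simulated transient Solr failure
}

// handleManagedCloudScaleDown sketches the proposed error-handling branch
// with a simplified signature: instead of discarding the eviction error,
// wrap and return it so the reconcile loop retries with backoff.
func handleManagedCloudScaleDown() (requeue bool, err error) {
	if evictErr := evictSinglePod(); evictErr != nil {
		return false, fmt.Errorf("failed to evict pod during scale down: %w", evictErr)
	}
	return true, nil
}

func main() {
	requeue, err := handleManagedCloudScaleDown()
	fmt.Println(requeue, err)
}
```

With an error returned here, the reconcile loop would retry with backoff and complete the scale down once the transient fault clears.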

@HoustonPutman
Contributor

Thanks for finding and reporting this! Yeah, for some reason in #689 I set err = nil, which might be the cause for this. I'll need to investigate why I did that, but we will certainly get this fixed for v0.9.0!

@HoustonPutman HoustonPutman added bug Something isn't working autoscaling Autoscaling of Solr Nodes using the HPA and Solr APIs labels Jun 14, 2024
@HoustonPutman HoustonPutman added this to the main (v0.9.0) milestone Jun 14, 2024
@kos-team
Author

@HoustonPutman Thanks for the confirmation. Looking forward!
