Improve the user experience of Failover #5150

Open
XiShanYongYe-Chang opened this issue Jul 6, 2024 · 14 comments

@XiShanYongYe-Chang
Member

What would you like to be added:

Improve the user experience of Failover

Why is this needed:

The Failover and GracefulEviction features are currently in the Beta phase, which means they are enabled by default.

There is a scenario where users propagate configuration resources by directly specifying the cluster names. When a cluster is disconnected from the Karmada control plane for several hours, it is identified as NotReady. Once the cluster recovers, the configuration resources on that cluster are deleted unexpectedly. If this occurs in a production environment, it could lead to serious consequences.

Therefore, we need to optimize the Failover feature for this scenario to provide users with a more stable and reliable experience.
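
As context: operators who cannot tolerate this risk today can opt out by turning the feature gates off on karmada-controller-manager. A minimal sketch follows, assuming the standard --feature-gates flag and the Failover/GracefulEviction gate names; the Deployment excerpt and binary path are illustrative assumptions, not a complete manifest.

# Sketch: excerpt of the karmada-controller-manager Deployment showing where the
# gates could be disabled (--feature-gates flag and gate names assumed; other
# flags and fields omitted for brevity).
spec:
  template:
    spec:
      containers:
        - name: karmada-controller-manager
          command:
            - /bin/karmada-controller-manager   # binary path is an assumption
            - --feature-gates=Failover=false,GracefulEviction=false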

@XiShanYongYe-Chang added the kind/feature label on Jul 6, 2024
@whitewindmills
Member

Are you saying that resources on unhealthy member clusters will be deleted because they're migrated to other member clusters? If so, how can we improve it?

@XiShanYongYe-Chang
Member Author

Are you saying that resources on unhealthy member clusters will be deleted because they're migrated to other member clusters?

That's right. One thing to note is that the resources are not migrated to another cluster; they are just removed from the failed cluster.

If so, how can we improve it?

I hope to hear everyone's opinion.

@whitewindmills
Member

One thing to note is that the resources are not migrated to another cluster; they are just removed from the failed cluster.

Let me guess how it happened: there was no suitable cluster to migrate to?

@XiShanYongYe-Chang
Member Author

In other words, the target clusters have been explicitly listed, and the configuration resources are propagated only to those clusters.

@whitewindmills
Member

Does it look like this? If the cluster foo becomes unhealthy, the configuration will be stuck on that cluster until the cluster becomes healthy again, and then it is deleted. Am I right?

apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: test-pp
  namespace: default
spec:
  resourceSelectors:
    - apiVersion: v1
      kind: ConfigMap
      name: conf
  placement:
    clusterAffinity:
      clusterNames:
      - foo
      - bar
  ...

@XiShanYongYe-Chang
Member Author

Yes, you are right.

@whitewindmills
Member

Yes, this is a noteworthy case: we would prefer not to delete resources when there is no new cluster to migrate to.

@NickYadance

Does it look like this? If the cluster foo becomes unhealthy, the configuration will be stuck on that cluster until the cluster becomes healthy again, and then it is deleted. Am I right?

We just had an issue where a cluster became NotReady for about 2 minutes. Karmada deleted all the resources and then recreated them in that member cluster, so it seems that Karmada failed to handle the cluster recovery properly.

@whitewindmills
Member

@NickYadance Thanks for your feedback; that's where improvement is needed.
I will work on solving it.
The overall solution is to ensure that there are new member clusters available for migration before evicting resources from the failed cluster; otherwise, failover will not occur.
cc @XiShanYongYe-Chang @RainbowMango

@whitewindmills
Member

/assign

@XiShanYongYe-Chang
Member Author

We just had an issue where a cluster became NotReady for about 2 minutes. Karmada deleted all the resources and then recreated them in that member cluster, so it seems that Karmada failed to handle the cluster recovery properly.

Thanks for your feedback.

Did the cluster come back up two minutes later?

In addition, the Failover feature gate is enabled by default. Is this what you expect?

@XiShanYongYe-Chang
Member Author

The overall solution is to ensure that there are new member clusters available for migration before evicting resources from the failed cluster; otherwise, failover will not occur.

Thanks @whitewindmills, this is a feasible solution.

@NickYadance

We just had an issue where a cluster became NotReady for about 2 minutes. Karmada deleted all the resources and then recreated them in that member cluster, so it seems that Karmada failed to handle the cluster recovery properly.

Thanks for your feedback.

Did the cluster come back up two minutes later?

In addition, the Failover feature gate is enabled by default. Is this what you expect?

Yes, the cluster was back up two minutes later. I would prefer to control the failover process manually, something like: "Hey Karmada, fail over the resources in member A to member B when member A is down; otherwise, don't do anything unexpected."
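
For reference, one way to approximate that behavior with the current API might be cluster taint tolerations in the placement, so that resources are not evicted from a cluster that is merely NotReady or unreachable. The sketch below reuses the policy from earlier in this thread; the clusterTolerations field and the cluster.karmada.io/not-ready / cluster.karmada.io/unreachable NoExecute taint keys are assumptions to verify against the installed Karmada version.

apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: test-pp
  namespace: default
spec:
  resourceSelectors:
    - apiVersion: v1
      kind: ConfigMap
      name: conf
  placement:
    clusterAffinity:
      clusterNames:
      - foo
      - bar
    # Tolerate the NoExecute taints applied to NotReady/unreachable clusters so
    # the ConfigMap is not evicted while the cluster is down (taint keys assumed).
    clusterTolerations:
    - key: cluster.karmada.io/not-ready
      operator: Exists
      effect: NoExecute
    - key: cluster.karmada.io/unreachable
      operator: Exists
      effect: NoExecute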

@NickYadance

NickYadance commented Aug 2, 2024

@NickYadance Thanks for your feedback; that's where improvement is needed. I will work on solving it. The overall solution is to ensure that there are new member clusters available for migration before evicting resources from the failed cluster; otherwise, failover will not occur. cc @XiShanYongYe-Chang @RainbowMango

May I know where in the code this issue is caused? I tried to reproduce the issue offline but failed. From what I found, the failover eviction timeout defaults to 5 minutes, so the resources in the old cluster shouldn't be evicted if it is down for only 2 minutes. @whitewindmills

case metav1.ConditionFalse:
    if features.FeatureGate.Enabled(features.Failover) && decisionTimestamp.After(clusterHealth.readyTransitionTimestamp.Add(c.FailoverEvictionTimeout)) {
        // We want to update the taint straight away if Cluster is already tainted with the UnreachableTaint
        taintToAdd := *NotReadyTaintTemplate
        if err := c.updateClusterTaints(ctx, []*corev1.Taint{&taintToAdd}, []*corev1.Taint{UnreachableTaintTemplate}, cluster); err != nil {
            klog.ErrorS(err, "Failed to instantly update UnreachableTaint to NotReadyTaint, will try again in the next cycle.", "cluster", cluster.Name)
        }
    }
