Improve the user experience of Failover #5150

Open
XiShanYongYe-Chang opened this issue Jul 6, 2024 · 14 comments

@XiShanYongYe-Chang
Member

What would you like to be added:

Improve the user experience of Failover

Why is this needed:

The Failover and GracefulEviction features are currently in the Beta phase, which means they are enabled by default.

There is a scenario where users propagate configuration resources by directly specifying the cluster names. When a cluster is disconnected from the Karmada control plane for several hours, it is identified as NotReady. Once the cluster recovers, the configuration resources on that cluster are deleted unexpectedly. If this occurs in a production environment, it could lead to serious consequences.

Therefore, we need to optimize the Failover feature for this scenario to provide users with a more stable and reliable experience.
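
As context: operators who cannot tolerate this risk today can opt out by turning the feature gates off on karmada-controller-manager. A minimal sketch follows, assuming the standard --feature-gates flag and the Failover/GracefulEviction gate names; the Deployment excerpt and binary path are illustrative assumptions, not a complete manifest.

# Sketch: excerpt of the karmada-controller-manager Deployment showing where the
# gates could be disabled (--feature-gates flag and gate names assumed; other
# flags and fields omitted for brevity).
spec:
  template:
    spec:
      containers:
        - name: karmada-controller-manager
          command:
            - /bin/karmada-controller-manager   # binary path is an assumption
            - --feature-gates=Failover=false,GracefulEviction=false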

@XiShanYongYe-Chang added the kind/feature label on Jul 6, 2024
@whitewindmills
Member

Are you saying that resources on unhealthy member clusters will be deleted because they're migrated to other member clusters? If so, how can we improve it?

@XiShanYongYe-Chang
Member Author

Are you saying that resources on unhealthy member clusters will be deleted because they're migrated to other member clusters?

That's right. One thing to note is that the resources are not migrated to another cluster; they are just removed from the failed cluster.

If so, how can we improve it?

I hope to hear everyone's opinion.

@whitewindmills
Member

One thing to note is that the resources are not migrated to another cluster; they are just removed from the failed cluster.

Let me guess how it happened: there was no suitable cluster to migrate to?

@XiShanYongYe-Chang
Member Author

In other words, the target clusters have been explicitly listed, and the configuration resources are propagated only to those clusters.

@whitewindmills
Member

Does it look like this? If the cluster foo becomes unhealthy, the configuration will be stuck on that cluster until the cluster becomes healthy again, and then it is deleted. Am I right?

apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: test-pp
  namespace: default
spec:
  resourceSelectors:
    - apiVersion: v1
      kind: ConfigMap
      name: conf
  placement:
    clusterAffinity:
      clusterNames:
      - foo
      - bar
  ...

@XiShanYongYe-Chang
Member Author

Yes, you are right.

@whitewindmills
Member

Yes, this is a noteworthy case: we would prefer not to delete resources when there is no new cluster to migrate to.

@NickYadance

Does it look like this? If the cluster foo becomes unhealthy, the configuration will be stuck on that cluster until the cluster becomes healthy again, and then it is deleted. Am I right?

We just had an issue where a cluster became NotReady for about 2 minutes. Karmada deleted all the resources and then recreated them in that member cluster, so it seems that Karmada failed to handle the cluster recovery properly.

@whitewindmills
Member

@NickYadance Thanks for your feedback; that's where improvement is needed.
I will work on solving it.
The overall solution is to ensure that there are new member clusters available for migration before evicting resources from the failed cluster; otherwise, failover will not occur.
cc @XiShanYongYe-Chang @RainbowMango

@whitewindmills
Member

/assign

@XiShanYongYe-Chang
Member Author

We just had an issue where a cluster became NotReady for about 2 minutes. Karmada deleted all the resources and then recreated them in that member cluster, so it seems that Karmada failed to handle the cluster recovery properly.

Thanks for your feedback.

Did the cluster come back up two minutes later?

In addition, the Failover feature gate is enabled by default. Is this what you expect?

@XiShanYongYe-Chang
Member Author

The overall solution is to ensure that there are new member clusters available for migration before evicting resources from the failed cluster; otherwise, failover will not occur.

Thanks @whitewindmills, this is a feasible solution.

@NickYadance

We just had an issue where a cluster became NotReady for about 2 minutes. Karmada deleted all the resources and then recreated them in that member cluster, so it seems that Karmada failed to handle the cluster recovery properly.

Thanks for your feedback.

Did the cluster come back up two minutes later?

In addition, the Failover feature gate is enabled by default. Is this what you expect?

Yes, the cluster was back up two minutes later. I would prefer to control the failover process manually, something like: "Hey Karmada, fail over the resources in member A to member B when member A is down; otherwise, don't do anything unexpected."
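
For reference, one way to approximate that behavior with the current API might be cluster taint tolerations in the placement, so that resources are not evicted from a cluster that is merely NotReady or unreachable. The sketch below reuses the policy from earlier in this thread; the clusterTolerations field and the cluster.karmada.io/not-ready / cluster.karmada.io/unreachable NoExecute taint keys are assumptions to verify against the installed Karmada version.

apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: test-pp
  namespace: default
spec:
  resourceSelectors:
    - apiVersion: v1
      kind: ConfigMap
      name: conf
  placement:
    clusterAffinity:
      clusterNames:
      - foo
      - bar
    # Tolerate the NoExecute taints applied to NotReady/unreachable clusters so
    # the ConfigMap is not evicted while the cluster is down (taint keys assumed).
    clusterTolerations:
    - key: cluster.karmada.io/not-ready
      operator: Exists
      effect: NoExecute
    - key: cluster.karmada.io/unreachable
      operator: Exists
      effect: NoExecute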

@NickYadance

NickYadance commented Aug 2, 2024

@NickYadance Thanks for your feedback; that's where improvement is needed. I will work on solving it. The overall solution is to ensure that there are new member clusters available for migration before evicting resources from the failed cluster; otherwise, failover will not occur. cc @XiShanYongYe-Chang @RainbowMango

May I know where in the code this issue is caused? I tried to reproduce the issue offline but failed. From what I found, the failover eviction timeout defaults to 5 minutes, so the resources in the old cluster shouldn't be evicted if it is down for only 2 minutes. @whitewindmills

case metav1.ConditionFalse:
    if features.FeatureGate.Enabled(features.Failover) && decisionTimestamp.After(clusterHealth.readyTransitionTimestamp.Add(c.FailoverEvictionTimeout)) {
        // We want to update the taint straight away if Cluster is already tainted with the UnreachableTaint
        taintToAdd := *NotReadyTaintTemplate
        if err := c.updateClusterTaints(ctx, []*corev1.Taint{&taintToAdd}, []*corev1.Taint{UnreachableTaintTemplate}, cluster); err != nil {
            klog.ErrorS(err, "Failed to instantly update UnreachableTaint to NotReadyTaint, will try again in the next cycle.", "cluster", cluster.Name)
        }
    }
