
Add failover history information #5251

Open · Dyex719 wants to merge 8 commits into master from failover-flag-impl

Conversation


@Dyex719 Dyex719 commented Jul 25, 2024

What type of PR is this?
/kind feature

What this PR does / why we need it:
Adds failover history information so that applications can keep a record of the failovers that happened in the past. Stateful applications can use this information to detect failures and resume processing from a particular state.

Which issue(s) this PR fixes:
Fixes #5116 #4969

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

@karmada-bot karmada-bot added the kind/feature Categorizes issue or PR as related to a new feature. label Jul 25, 2024
@karmada-bot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign rainbowmango for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karmada-bot karmada-bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jul 25, 2024
@Dyex719
Author

Dyex719 commented Jul 26, 2024

Hi All,
I have implemented and tested the failover information feature, and wanted to request a review from the community on what changes might be required.
I noticed some minor linter issues, which I will fix.
I have also addressed the failing test case in the comment above.

Thank you!

@karmada-bot karmada-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Aug 6, 2024
@mszacillo
Contributor

Hi @RainbowMango! Could we get a review of this PR when you get a chance? The failures are due to the test case described above in the comments. We can decide how to tackle this moving forward. Greatly appreciate it!

@RainbowMango
Member

Sure. And sorry for letting this sit again!

@mszacillo mszacillo force-pushed the failover-flag-impl branch 2 times, most recently from e6529b4 to 7da903c Compare October 16, 2024 22:16
@mszacillo
Contributor

Hi @RainbowMango,

I've taken a closer look at your demo branch (master...XiShanYongYe-Chang:karmada:api_draft_application_failover). I don't have any major complaints from the API side. It seems that logically this effort can be divided into two parts:

  1. Adding FailoverHistoryItem to the resource binding API (which is handled in this PR).
  2. Adding StatePreservation to the propagationpolicy API (which has been done in your branch).

Would you be open to dividing up the work as such? The StatePreservation work can be merged after the FailoverHistoryItem is added. I've already updated the FailoverHistoryItem API to align better with the proposed changes in your branch, but I've left the StatePreservation out as Chang seems to have implemented that already and I wouldn't consider it fair to copy his work under our PR. :)

Note: I haven't updated the codegen for this PR yet so there are some failing checks. Will get that sorted.
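
For context, here is a rough sketch of the shape of the history entry this PR adds to the ResourceBinding API, based on the fields visible in the test output further down in this thread. The exact field names, types, package, and FailoverReason values below are assumptions for illustration, not the final API:

```go
package v1alpha2

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// FailoverReason names the trigger of a failover.
type FailoverReason string

const (
	// ApplicationFailure indicates the failover was triggered by an unhealthy application.
	ApplicationFailure FailoverReason = "ApplicationFailure"
)

// FailoverHistoryItem records a single failover event on a ResourceBinding.
type FailoverHistoryItem struct {
	// ClustersBeforeFailover lists the clusters the workload was placed on
	// before the failover was triggered.
	ClustersBeforeFailover []string `json:"clustersBeforeFailover,omitempty"`
	// ClustersAfterFailover lists the clusters the workload landed on after rescheduling.
	ClustersAfterFailover []string `json:"clustersAfterFailover,omitempty"`
	// Reason records why the failover happened.
	Reason FailoverReason `json:"reason,omitempty"`
	// StartTime is when the failover was triggered.
	StartTime metav1.Time `json:"startTime,omitempty"`
}
```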

@mszacillo
Contributor

Follow-up question related to the proposed changes for state preservation: are we planning on supporting this only for applications that fail over gracefully?

In our use case we use Immediately, since some of our users have strict only-once processing requirements. So when a failover occurs, we want to remove the application as fast as possible to reduce the likelihood of duplicate applications. Perhaps the state preservation could be kept as part of the status of the resourcebinding and referenced during the scheduling step, similar to how we handle the FailoverHistoryItem in our PR.

@karmada-bot karmada-bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 17, 2024
@mszacillo mszacillo force-pushed the failover-flag-impl branch 5 times, most recently from 6883e35 to a6627bf Compare October 17, 2024 22:03
@karmada-bot karmada-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 29, 2024
@mszacillo
Contributor

> Both application failover and cluster failover end up in the resource-binding-graceful-eviction-controller, so this might be the best place to generate the history. What do you think?

If we're interested in generating the historyItem in one place, then I think that's a good idea. Additionally, we rely on a failover label being appended to the workload in our use case, so if we can guarantee that the eviction task is not cleaned up before scheduling, we can check the eviction reason and append the label if necessary. A rough sketch of that idea is below.
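
To make the idea concrete, here is a very rough sketch of recording the history entry where the eviction tasks are handled, before they are cleaned up. The package, helper name, status field, and wiring are all assumptions for illustration, not the controller's actual code; it builds on the FailoverHistoryItem sketch above and on the gracefulEvictionTasks already present in the ResourceBinding spec:

```go
package gracefuleviction

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	workv1alpha2 "github.com/karmada-io/karmada/pkg/apis/work/v1alpha2"
)

// recordFailoverHistory is a hypothetical helper: while the eviction tasks are
// still present on the binding, collect the clusters being evicted and append
// a history entry to the binding status.
func recordFailoverHistory(binding *workv1alpha2.ResourceBinding, reason workv1alpha2.FailoverReason) {
	var clustersBefore []string
	for _, task := range binding.Spec.GracefulEvictionTasks {
		clustersBefore = append(clustersBefore, task.FromCluster)
	}
	if len(clustersBefore) == 0 {
		return
	}

	binding.Status.FailoverHistory = append(binding.Status.FailoverHistory, workv1alpha2.FailoverHistoryItem{
		ClustersBeforeFailover: clustersBefore,
		// ClustersAfterFailover would be filled in once the workload has been
		// rescheduled, before the eviction task is cleared.
		Reason:    reason,
		StartTime: metav1.Now(),
	})
}
```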

@mszacillo
Contributor

@RainbowMango The e2e test failure is due to interference between the existing cluster filtration step, which checks clustersBeforeFailover, and the workload rebalancer. Since the failoverHistory is not ephemeral, it will keep filtering out the clusters in clustersBeforeFailover until a new failover occurs, which doesn't work alongside the workload rebalancer.

Seems we should keep the existing filter logic that only checks eviction tasks (as long as we make sure these are cleaned up post-scheduling). A rough sketch of what that filter amounts to is below.
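
For reference, a minimal sketch of what that eviction-task-based filter amounts to; the function name and its place in the scheduler are assumptions for illustration, not the actual scheduler code:

```go
package core

import (
	workv1alpha2 "github.com/karmada-io/karmada/pkg/apis/work/v1alpha2"
)

// filterOutEvictedClusters drops candidate clusters that still have a pending
// graceful eviction task on the binding, so the workload is not rescheduled
// back onto a cluster it is being evicted from.
func filterOutEvictedClusters(candidates []string, binding *workv1alpha2.ResourceBinding) []string {
	evicted := make(map[string]struct{}, len(binding.Spec.GracefulEvictionTasks))
	for _, task := range binding.Spec.GracefulEvictionTasks {
		evicted[task.FromCluster] = struct{}{}
	}

	filtered := make([]string, 0, len(candidates))
	for _, cluster := range candidates {
		if _, evictedFrom := evicted[cluster]; !evictedFrom {
			filtered = append(filtered, cluster)
		}
	}
	return filtered
}
```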

Signed-off-by: mszacillo <mszacillo@bloomberg.net>
@RainbowMango
Member

> Seems we should keep the existing filter logic that only checks eviction tasks (as long as we make sure these are cleaned up post-scheduling).

If that's the case, we can narrow down the scope of this PR and focus on generating the history.

Note that, currently, for the purge modes Graciously and Never, we can ensure the eviction task exists before scheduling, as we will build an eviction task for them. However, we still need to take care of the Immediately mode. We can iterate on it in a separate PR.

Signed-off-by: mszacillo <mszacillo@bloomberg.net>
@mszacillo
Contributor

/retest

Seems agents were interrupted.

@karmada-bot
Collaborator

@mszacillo: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

> /retest
>
> Seems agents were interrupted.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@RainbowMango
Member

/retest

```go
// UpdateFailoverStatus adds a failoverHistoryItem to the failoverHistory field in the ResourceBinding.
func UpdateFailoverStatus(client client.Client, binding *workv1alpha2.ResourceBinding, clusters []string, failoverType workv1alpha2.FailoverReason) (err error) {
	// If the resource is Duplicated, then it does not have a concept of failover. We skip attaching that status here.
	if binding.Spec.Placement.ReplicaScheduling.ReplicaSchedulingType == policyv1alpha1.ReplicaSchedulingTypeDuplicated {
```
Contributor

I've added this filter because I noticed failoverHistory being added to duplicated resources in response to cluster failures. I really don't believe duplicated resources should have a failover history, since there is no concept of failover in those cases.

Let me know if you have concerns.

Member

Nice finding!
But for resources with scheduling type duplicated, why not disable failover in the first place? That way, there won't be any eviction tasks for them.
@XiShanYongYe-Chang What do you think?

Member

> But for resources with scheduling type duplicated, why not disable failover in the first place?

Does that mean for duplicated resources, we don't put them in the eviction queue?

Member

Yes, I mean we should revisit which scheduling types are applicable for failover.

But, wait, I think for scheduling type duplicated, there might be a case that also needs failover, especially if spread constraints are used. Anyway, this deserves double confirmation.

Member

This may be a separate point to consider for failover, not part of the current task.

@mszacillo
Contributor

@RainbowMango Thanks for putting together the stateful application support checklist, it's very comprehensive!

I've just updated this PR to remove the last bit of the failover flag implementation, so that this PR solely focuses on the API change. Running some tests for both divided and grouped replicas, the results looked as we expected:

Single replica

```yaml
failoverHistory:
  - clustersAfterFailover:
    - cluster-A
    clustersBeforeFailover:
    - cluster-B
    reason: ApplicationFailure
    startTime: "2024-11-05T19:04:14Z"
  - clustersAfterFailover:
    - cluster-B
    clustersBeforeFailover:
    - cluster-A
    reason: ApplicationFailure
    startTime: "2024-11-05T19:04:45Z"
```

Divided replicas

```yaml
failoverHistory:
  - clustersAfterFailover:
    - cluster-A
    - cluster-B
    clustersBeforeFailover:
    - cluster-A
    - cluster-B
    reason: ApplicationFailure
    startTime: "2024-11-05T19:00:07Z"
  - clustersAfterFailover:
    - cluster-A
    - cluster-B
    clustersBeforeFailover:
    - cluster-B
    - cluster-A
    reason: ApplicationFailure
    startTime: "2024-11-05T19:00:38Z"
```

Note: The purgeMode for this resource was set to Immediately, and since we've decided to go with the evictionTask strategy for filtering out clusters, the Karmada scheduler rescheduled the replicas to the same clusters. This should be resolved in the future. Please let me know if anything else needs to be addressed in this PR.

Cheers!

@mszacillo
Contributor

Additional comment before I forget - we still append the clustersAfterFailover within the scheduler for now. As discussed, this logic should be moved into the evictionTask controller before the eviction task gets cleared. We can tackle that in an additional PR.

If you'd like to remove the setting of the failoverHistory and keep this as simply an API change, I could do that as well. Let me know.

…urces

Signed-off-by: mszacillo <mszacillo@bloomberg.net>
@XiShanYongYe-Chang
Member

Hi @Dyex719 @mszacillo @RainbowMango

Regarding the specific implementation details of the plan, I have combined the previous discussions and preliminary conclusions into some analysis and organization, which should aid in implementing the feature. Could you help take a look?

https://docs.google.com/document/d/1Np67nVXokSCAfzrI8P8JwQgvqJF64068Yu6leDn2Kjs/edit?tab=t.0#heading=h.eyvom3e0lj8w

@RainbowMango
Member

> Additional comment before I forget - we still append the clustersAfterFailover within the scheduler for now. As discussed, this logic should be moved into the evictionTask controller before the eviction task gets cleared. We can tackle that in an additional PR.
>
> If you'd like to remove the setting of the failoverHistory and keep this as simply an API change, I could do that as well. Let me know.

I'm still concerned about letting the scheduler get involved in the failover process. But haven't gotten a chance to explore further.

In addition, according to #5788, it seems we don't have a strong dependency on the history to figure out which cluster the application would be migrated to. So the history information is kind of optional for this feature. That's why I put this task in Part 3. (But it still makes sense from an instrumentation point of view.)

So I'm thinking: should we focus on Part 1 and Part 2 first, and come back to the history later?

@XiShanYongYe-Chang
Member

> Regarding the specific implementation details of the plan, I have combined the previous discussions and preliminary conclusions into some analysis and organization, which should aid in implementing the feature. Could you help take a look?
>
> https://docs.google.com/document/d/1Np67nVXokSCAfzrI8P8JwQgvqJF64068Yu6leDn2Kjs/edit?tab=t.0#heading=h.eyvom3e0lj8w

Hi @mszacillo, do you have time to come to the community meeting tomorrow and exchange ideas about the plan?

@mszacillo
Contributor

> Hi @mszacillo, do you have time to come to the community meeting tomorrow and exchange ideas about the plan?

Hi! Yes, happy to join and discuss more in person! I'll take a look at your document in detail today and leave comments there; apologies for the delay.

@RainbowMango
Member

Hi @mszacillo, you are not going to KubeCon this time, right?

@mszacillo
Contributor

Hi @RainbowMango, sadly I won’t be there this time. :(

I’m thinking about attending KubeCon EU next year, but that’s still far away.

@mszacillo
Contributor

Hi @RainbowMango, we can go ahead and close this!

@RainbowMango
Member

Hi @mszacillo Yeah, this one could be a candidate for release 1.13.

We just released v1.12 last week; have you tested it on your side? I want to know if it works as we expected.

In addition, v1.12 covered application failover; shall we go ahead with cluster failover?

@mszacillo
Contributor

mszacillo commented Dec 3, 2024

> We just released v1.12 last week; have you tested it on your side? I want to know if it works as we expected.

We're working on rebasing our local branch onto v1.12. I'll be sure to give an update once we are finished and can thoroughly test.

> In addition, v1.12 covered application failover; shall we go ahead with cluster failover?

And yes that would be great. Perhaps we can discuss design ideas in the next community meeting?

@RainbowMango
Member

> And yes that would be great. Perhaps we can discuss design ideas in the next community meeting?

That would be great!

Labels
kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Status: Planning
6 participants