
Add failover history information #5251

Open · Dyex719 wants to merge 8 commits into master from failover-flag-impl

Conversation


@Dyex719 Dyex719 commented Jul 25, 2024

What type of PR is this?
/kind feature

What this PR does / why we need it:
Adds failover history information so that applications can keep a record of the failovers that happened in the past. Stateful applications can use this information to detect failures and resume processing from a particular state.

Which issue(s) this PR fixes:
Fixes #5116 #4969

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

@karmada-bot karmada-bot added the kind/feature Categorizes issue or PR as related to a new feature. label Jul 25, 2024
@karmada-bot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign rainbowmango for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karmada-bot karmada-bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jul 25, 2024
@Dyex719
Author

Dyex719 commented Jul 26, 2024

Hi All,
I have implemented and tested the failover information feature, and wanted to request a review from the community on what changes might be required.
I noticed some minor linter issues, which I will fix.
I have also addressed the failing test case in the comment above.

Thank you!

@karmada-bot karmada-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Aug 6, 2024
@mszacillo
Contributor

Hi @RainbowMango! Could we get a review of this PR when you get a chance? The failures are due to the test case described above in the comments. We can decide how to tackle this moving forward. Greatly appreciate it!

@RainbowMango
Member

Sure. And sorry for letting this sit again!

@mszacillo mszacillo force-pushed the failover-flag-impl branch 2 times, most recently from e6529b4 to 7da903c Compare October 16, 2024 22:16
@mszacillo
Contributor

Hi @RainbowMango,

I've taken a closer look at your demo branch (master...XiShanYongYe-Chang:karmada:api_draft_application_failover). I don't have any major complaints from the API side. It seems that logically this effort can be divided into two parts:

  1. Adding FailoverHistoryItem to the resource binding API (which is handled in this PR).
  2. Adding StatePreservation to the propagationpolicy API (which has been done in your branch).

Would you be open to dividing up the work as such? The StatePreservation work can be merged after the FailoverHistoryItem is added. I've already updated the FailoverHistoryItem API to align better with the proposed changes in your branch, but I've left the StatePreservation out as Chang seems to have implemented that already and I wouldn't consider it fair to copy his work under our PR. :)

Note: I haven't updated the codegen for this PR yet so there are some failing checks. Will get that sorted.
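
For context, here is a rough sketch of the shape of the history entry this PR adds to the ResourceBinding API, based on the fields visible in the test output further down in this thread. The exact field names, types, package, and FailoverReason values below are assumptions for illustration, not the final API:

```go
package v1alpha2

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// FailoverReason names the trigger of a failover.
type FailoverReason string

const (
	// ApplicationFailure indicates the failover was triggered by an unhealthy application.
	ApplicationFailure FailoverReason = "ApplicationFailure"
)

// FailoverHistoryItem records a single failover event on a ResourceBinding.
type FailoverHistoryItem struct {
	// ClustersBeforeFailover lists the clusters the workload was placed on
	// before the failover was triggered.
	ClustersBeforeFailover []string `json:"clustersBeforeFailover,omitempty"`
	// ClustersAfterFailover lists the clusters the workload landed on after rescheduling.
	ClustersAfterFailover []string `json:"clustersAfterFailover,omitempty"`
	// Reason records why the failover happened.
	Reason FailoverReason `json:"reason,omitempty"`
	// StartTime is when the failover was triggered.
	StartTime metav1.Time `json:"startTime,omitempty"`
}
```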

@mszacillo
Contributor

Follow-up question related to the proposed changes for state preservation: are we planning on supporting this only for applications that fail over gracefully?

In our use case we use Immediately, since some of our users have strict only-once processing requirements. So when a failover occurs, we want to remove the application as fast as possible to reduce the likelihood of duplicate applications. Perhaps the state preservation could be kept as part of the status of the resourcebinding and referenced during the scheduling step, similar to how we handle the FailoverHistoryItem in our PR.

@karmada-bot karmada-bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 17, 2024
@mszacillo mszacillo force-pushed the failover-flag-impl branch 5 times, most recently from 6883e35 to a6627bf Compare October 17, 2024 22:03
@karmada-bot karmada-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 29, 2024
@mszacillo
Contributor

> Both application failover and cluster failover end up in the resource-binding-graceful-eviction-controller, so this might be the best place to generate the history. What do you think?

If we're interested in generating the historyItem in one place, then I think that's a good idea. Additionally, we rely on a failover label being appended to the workload in our use case, so if we can guarantee that the eviction task is not cleaned up before scheduling, we can check the eviction reason and append the label if necessary. A rough sketch of that idea is below.
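
To make the idea concrete, here is a very rough sketch of recording the history entry where the eviction tasks are handled, before they are cleaned up. The package, helper name, status field, and wiring are all assumptions for illustration, not the controller's actual code; it builds on the FailoverHistoryItem sketch above and on the gracefulEvictionTasks already present in the ResourceBinding spec:

```go
package gracefuleviction

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	workv1alpha2 "github.com/karmada-io/karmada/pkg/apis/work/v1alpha2"
)

// recordFailoverHistory is a hypothetical helper: while the eviction tasks are
// still present on the binding, collect the clusters being evicted and append
// a history entry to the binding status.
func recordFailoverHistory(binding *workv1alpha2.ResourceBinding, reason workv1alpha2.FailoverReason) {
	var clustersBefore []string
	for _, task := range binding.Spec.GracefulEvictionTasks {
		clustersBefore = append(clustersBefore, task.FromCluster)
	}
	if len(clustersBefore) == 0 {
		return
	}

	binding.Status.FailoverHistory = append(binding.Status.FailoverHistory, workv1alpha2.FailoverHistoryItem{
		ClustersBeforeFailover: clustersBefore,
		// ClustersAfterFailover would be filled in once the workload has been
		// rescheduled, before the eviction task is cleared.
		Reason:    reason,
		StartTime: metav1.Now(),
	})
}
```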

@mszacillo
Contributor

@RainbowMango The e2e test failure is due to interference between the existing cluster filtration step, which checks clustersBeforeFailover, and the workload rebalancer. Since the failoverHistory is not ephemeral, it will keep filtering out the clusters in clustersBeforeFailover until a new failover occurs, which doesn't work alongside the workload rebalancer.

Seems we should keep the existing filter logic that only checks eviction tasks (as long as we make sure these are cleaned up post-scheduling). A rough sketch of what that filter amounts to is below.
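
For reference, a minimal sketch of what that eviction-task-based filter amounts to; the function name and its place in the scheduler are assumptions for illustration, not the actual scheduler code:

```go
package core

import (
	workv1alpha2 "github.com/karmada-io/karmada/pkg/apis/work/v1alpha2"
)

// filterOutEvictedClusters drops candidate clusters that still have a pending
// graceful eviction task on the binding, so the workload is not rescheduled
// back onto a cluster it is being evicted from.
func filterOutEvictedClusters(candidates []string, binding *workv1alpha2.ResourceBinding) []string {
	evicted := make(map[string]struct{}, len(binding.Spec.GracefulEvictionTasks))
	for _, task := range binding.Spec.GracefulEvictionTasks {
		evicted[task.FromCluster] = struct{}{}
	}

	filtered := make([]string, 0, len(candidates))
	for _, cluster := range candidates {
		if _, evictedFrom := evicted[cluster]; !evictedFrom {
			filtered = append(filtered, cluster)
		}
	}
	return filtered
}
```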

Signed-off-by: mszacillo <mszacillo@bloomberg.net>
@RainbowMango
Member

> Seems we should keep the existing filter logic that only checks eviction tasks (as long as we make sure these are cleaned up post-scheduling).

If that's the case, we can narrow down the scope of this PR and focus on generating the history.

Note that, currently, for the purge modes Graciously and Never, we can ensure the eviction task exists before scheduling, as we will build an eviction task for them. However, we still need to take care of the Immediately mode. We can iterate on it in a separate PR.

Signed-off-by: mszacillo <mszacillo@bloomberg.net>
@mszacillo
Contributor

/retest

Seems agents were interrupted.

@karmada-bot
Collaborator

@mszacillo: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

> /retest
>
> Seems agents were interrupted.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@RainbowMango
Member

/retest

```go
// UpdateFailoverStatus adds a failoverHistoryItem to the failoverHistory field in the ResourceBinding.
func UpdateFailoverStatus(client client.Client, binding *workv1alpha2.ResourceBinding, clusters []string, failoverType workv1alpha2.FailoverReason) (err error) {
	// If the resource is Duplicated, then it does not have a concept of failover. We skip attaching that status here.
	if binding.Spec.Placement.ReplicaScheduling.ReplicaSchedulingType == policyv1alpha1.ReplicaSchedulingTypeDuplicated {
```
Contributor

I've added this filter because I noticed failoverHistory being added to duplicated resources in response to cluster failures. I really don't believe duplicated resources should have a failover history, since there is no concept of failover in those cases.

Let me know if you have concerns.

Member

Nice finding!
But for resources with scheduling type duplicated, why not disable failover in the first place? That way, there won't be any eviction tasks for them.
@XiShanYongYe-Chang What do you think?

Member

> But for resources with scheduling type duplicated, why not disable failover in the first place?

Does that mean for duplicated resources, we don't put them in the eviction queue?

Member

Yes, I mean we should revisit which scheduling types are applicable for failover.

But, wait, I think for scheduling type duplicated, there might be a case that also needs failover, especially if spread constraints are used. Anyway, this deserves double confirmation.

Member

This may be a separate point to consider for failover, not part of the current task.

@mszacillo
Contributor

@RainbowMango Thanks for putting together the stateful application support checklist, it's very comprehensive!

I've just updated this PR to remove the last bit of the failover flag implementation, so that this PR solely focuses on the API change. Running some tests for both divided and grouped replicas, the results looked as we expected:

Single replica

```yaml
failoverHistory:
  - clustersAfterFailover:
    - cluster-A
    clustersBeforeFailover:
    - cluster-B
    reason: ApplicationFailure
    startTime: "2024-11-05T19:04:14Z"
  - clustersAfterFailover:
    - cluster-B
    clustersBeforeFailover:
    - cluster-A
    reason: ApplicationFailure
    startTime: "2024-11-05T19:04:45Z"
```

Divided replicas

```yaml
failoverHistory:
  - clustersAfterFailover:
    - cluster-A
    - cluster-B
    clustersBeforeFailover:
    - cluster-A
    - cluster-B
    reason: ApplicationFailure
    startTime: "2024-11-05T19:00:07Z"
  - clustersAfterFailover:
    - cluster-A
    - cluster-B
    clustersBeforeFailover:
    - cluster-B
    - cluster-A
    reason: ApplicationFailure
    startTime: "2024-11-05T19:00:38Z"
```

Note: The purgeMode for this resource was set to Immediately, and since we've decided to go with the evictionTask strategy for filtering out clusters, the Karmada scheduler rescheduled the replicas to the same clusters. This should be resolved in the future. Please let me know if anything else needs to be addressed in this PR.

Cheers!

@mszacillo
Contributor

Additional comment before I forget - we still append the clustersAfterFailover within the scheduler for now. As discussed, this logic should be moved into the evictionTask controller before the eviction task gets cleared. We can tackle that in an additional PR.

If you'd like to remove the setting of the failoverHistory and keep this as simply an API change, I could do that as well. Let me know.

…urces

Signed-off-by: mszacillo <mszacillo@bloomberg.net>
@XiShanYongYe-Chang
Member

Hi @Dyex719 @mszacillo @RainbowMango

Regarding the specific implementation details of the plan, I have combined the previous discussions and preliminary conclusions into some analysis and organization, which should aid in implementing the feature. Could you help take a look?

https://docs.google.com/document/d/1Np67nVXokSCAfzrI8P8JwQgvqJF64068Yu6leDn2Kjs/edit?tab=t.0#heading=h.eyvom3e0lj8w

@RainbowMango
Member

> Additional comment before I forget - we still append the clustersAfterFailover within the scheduler for now. As discussed, this logic should be moved into the evictionTask controller before the eviction task gets cleared. We can tackle that in an additional PR.
>
> If you'd like to remove the setting of the failoverHistory and keep this as simply an API change, I could do that as well. Let me know.

I'm still concerned about letting the scheduler get involved in the failover process. But haven't gotten a chance to explore further.

In addition, according to #5788, it seems we don't have a strong dependency on the history to figure out which cluster the application would be migrated to. So the history information is kind of optional for this feature. That's why I put this task in Part 3. (But it still makes sense from an instrumentation point of view.)

So I'm thinking: should we focus on Part 1 and Part 2 first, and come back to the history later?

@XiShanYongYe-Chang
Member

> Regarding the specific implementation details of the plan, I have combined the previous discussions and preliminary conclusions into some analysis and organization, which should aid in implementing the feature. Could you help take a look?
>
> https://docs.google.com/document/d/1Np67nVXokSCAfzrI8P8JwQgvqJF64068Yu6leDn2Kjs/edit?tab=t.0#heading=h.eyvom3e0lj8w

Hi @mszacillo, do you have time to come to the community meeting tomorrow and exchange ideas about the plan?

@mszacillo
Contributor

> Hi @mszacillo, do you have time to come to the community meeting tomorrow and exchange ideas about the plan?

Hi! Yes, happy to join and discuss more in person! I'll take a look at your document in detail today and leave comments there; apologies for the delay.

@RainbowMango
Member

Hi @mszacillo, you are not going to KubeCon this time, right?

@mszacillo
Contributor

Hi @RainbowMango, sadly I won’t be there this time. :(

I’m thinking about attending KubeCon EU next year, but that’s still far away.

@mszacillo
Contributor

Hi @RainbowMango, we can go ahead and close this!

@RainbowMango
Member

Hi @mszacillo Yeah, this one could be a candidate for release 1.13.

We just released v1.12 last week; have you tested it on your side? I want to know if it works as we expected.

In addition, v1.12 covered application failover; shall we go ahead with cluster failover?

@mszacillo
Contributor

mszacillo commented Dec 3, 2024

> We just released v1.12 last week; have you tested it on your side? I want to know if it works as we expected.

We're working on rebasing our local branch onto v1.12. I'll be sure to give an update once we are finished and can thoroughly test.

> In addition, v1.12 covered application failover; shall we go ahead with cluster failover?

And yes that would be great. Perhaps we can discuss design ideas in the next community meeting?

@RainbowMango
Member

> And yes that would be great. Perhaps we can discuss design ideas in the next community meeting?

That would be great!

Labels
kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Status: Planning
6 participants