Add failover history information #5251
base: master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Hi All, Thank you!
Hi @RainbowMango! Could we get a review of this PR when you get a chance? The failures are due to the test case described above in the comments. We can decide how to tackle this moving forward. Greatly appreciate it!
Sure. And sorry for letting this sit again!
Hi @RainbowMango, I've taken a closer look at your demo branch (master...XiShanYongYe-Chang:karmada:api_draft_application_failover). I don't have any major complaints from the API side. It seems that logically this effort can be divided into two parts:
Would you be open to dividing up the work as such? The StatePreservation work can be merged after the FailoverHistoryItem is added. I've already updated the FailoverHistoryItem API to align better with the proposed changes in your branch, but I've left the StatePreservation out as Chang seems to have implemented that already and I wouldn't consider it fair to copy his work under our PR. :)
Follow-up question related to the proposed changes for state preservation: are we planning on supporting this only for applications that fail over gracefully? In our use-case we use
If we're interested in generating the historyItem in one place, then I think that's a good idea. Additionally, our use-case relies on a failover label being appended to the workload, so if we can guarantee that the eviction task is not cleaned up before scheduling, we can check the eviction reason and append the label if necessary.
@RainbowMango The e2e test failure is due to interference with the existing cluster filtration step that checks
Seems we should keep the existing filter logic that only checks eviction tasks (as long as we make sure these are cleaned up post scheduling).
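The filter logic being discussed here can be sketched roughly as follows. `GracefulEvictionTask` and `filterEvictedClusters` are simplified stand-ins assumed for illustration, not Karmada's actual types or function names:

```go
package main

import "fmt"

// GracefulEvictionTask is a minimal stand-in for an eviction-queue entry;
// only the cluster being evicted from matters for this sketch.
type GracefulEvictionTask struct {
	FromCluster string
}

// filterEvictedClusters drops candidate clusters that still have a pending
// eviction task, so the scheduler won't place replicas back on a cluster
// the workload is currently being evicted from (illustrative only).
func filterEvictedClusters(candidates []string, tasks []GracefulEvictionTask) []string {
	evicted := make(map[string]bool, len(tasks))
	for _, t := range tasks {
		evicted[t.FromCluster] = true
	}
	var result []string
	for _, c := range candidates {
		if !evicted[c] {
			result = append(result, c)
		}
	}
	return result
}

func main() {
	tasks := []GracefulEvictionTask{{FromCluster: "member1"}}
	fmt.Println(filterEvictedClusters([]string{"member1", "member2"}, tasks))
	// prints: [member2]
}
```

This only works if eviction tasks remain queued until rescheduling completes, which is why cleaning them up post scheduling matters.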
If that's the case, we can narrow down the scope of this PR and focus on generating the history. Note that, currently, for purge modes
/retest
Seems agents were interrupted.
@mszacillo: Cannot trigger testing until a trusted user reviews the PR and leaves an
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/retest
pkg/controllers/utils/common.go

```go
// UpdateFailoverStatus adds a failoverHistoryItem to the failoverHistory field in the ResourceBinding.
func UpdateFailoverStatus(client client.Client, binding *workv1alpha2.ResourceBinding, clusters []string, failoverType workv1alpha2.FailoverReason) (err error) {
	// If the resource is Duplicated, then it does not have a concept of failover. We skip attaching that status here.
	if binding.Spec.Placement.ReplicaScheduling.ReplicaSchedulingType == policyv1alpha1.ReplicaSchedulingTypeDuplicated {
```
I've added this filter as I was noticing failoverHistory being added to duplicated resources in response to cluster failures. I really don't believe duplicated resources should have a failover history, since there is no concept of failover in those cases.
Let me know if you have concerns.
Nice finding!
But for resources with scheduling type Duplicated, why not disable failover in the first place? That way, there won't be any eviction tasks for them.
@XiShanYongYe-Chang What do you think?
But for resources with scheduling type duplicated, why not disable failover in the first place?
Does that mean for duplicated resources, we don't put them in the eviction queue?
Yes, I mean we should revisit which scheduling types are applicable for failover.
But, wait, I think for scheduling type Duplicated, there might be a case that also needs failover, especially if spread constraints are used. Anyway, this deserves double confirmation.
This may be a separate point to consider for failover, not subordinate to the current task.
@RainbowMango Thanks for putting together the stateful application support checklist, it's very comprehensive! I've just updated this PR to remove the last bit of the failover flag implementation, so this PR solely focuses on the API change. Running some tests for both divided and grouped replicas, the results looked as we expected:
Single replica
Divided replicas
Note: The purgeMode for this resource was set to Immediately, and since we've decided to go with the evictionTask strategy for filtering out clusters, the Karmada scheduler rescheduled the replicas to the same clusters. This should be resolved in the future. Please let me know if anything else needs to be addressed in this PR. Cheers!
Additional comment before I forget: we still append the
If you'd like to remove the setting of the failoverHistory and keep this as simply an API change, I could do that as well. Let me know.
Hi @Dyex719 @mszacillo @RainbowMango Regarding the specific implementation details of the plan, I have consolidated previous discussions and preliminary conclusions into some analysis and organization, which should aid implementation of the feature. Can you help take a look?
I'm still concerned about letting the scheduler get involved in the failover process, but I haven't gotten a chance to explore further. In addition, according to #5788, it seems we don't have a strong dependency on the history to figure out which cluster the application would be migrated to. So the history information is kind of optional for this feature; that's why I put this task in Part 3. (But it still makes sense from an instrumentation point of view.) So I'm thinking: should we focus on Part 1 and Part 2 first, and come back to the history later?
Hi @mszacillo do you have time to come to the community meeting tomorrow and exchange ideas about the proposal?
Hi! Yes, happy to join and discuss more in person! I'll take a look at your document in detail today and leave comments there; apologies for the delay.
Hi @mszacillo, you are not going to KubeCon this time, right?
Hi @RainbowMango, sadly I won't be there this time. :( I'm thinking about attending KubeCon EU next year, but that's still far away.
Hi @RainbowMango, we can go ahead and close this!
Hi @mszacillo Yeah, this one could be a candidate for release 1.13. We just released v1.12 last week; have you tested it on your side? I want to know if it works as we expected. In addition, v1.12 covered application failover; shall we go ahead with cluster failover?
We're working on rebasing our local branch onto v1.12. I'll be sure to give an update once we are finished and can thoroughly test.
And yes, that would be great. Perhaps we can discuss design ideas in the next community meeting?
That would be great!
What type of PR is this?
/kind feature
What this PR does / why we need it:
Adds failover history information so that applications can keep a record of which failovers happened in the past.
Stateful applications can use this information to detect failures and resume processing from a particular state.
Which issue(s) this PR fixes:
Fixes #5116 #4969
Special notes for your reviewer:
Does this PR introduce a user-facing change?: