-
Notifications
You must be signed in to change notification settings - Fork 909
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stateful Failover Proposal #5116
base: master
Are you sure you want to change the base?
Conversation
Welcome @Dyex719! It looks like this is your first PR to karmada-io/karmada 🎉 |
Codecov ReportAll modified and coverable lines are covered by tests ✅
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## master #5116 +/- ##
===========================================
+ Coverage 28.21% 40.91% +12.69%
===========================================
Files 632 650 +18
Lines 43556 55182 +11626
===========================================
+ Hits 12291 22575 +10284
- Misses 30368 31170 +802
- Partials 897 1437 +540
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
Signed-off-by: Aditya Addepalli <dyex719@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/assign
Thanks @Dyex719 Just quickly went through the propoasl, I believe this is meaningful enhancement and very inspiring.
Thanks @RainbowMango! One assumption that we wanted to discuss about (I wrote this in the proposal too):
Do you think this is a valid assumption? This narrows our scope a little, but defines the problem more clearly. |
I 100% agree.
|
|
||
We propose to add two fields to a propogation policy to enable stateful failover. | ||
1. persistedFields.maxHistory: This sets the max limit on the amount of stateful failover history that is persisted before the older entries are overwritten. If this is set to 5, the resourcebinding will store a maximum of 5 failover entries in FailoverHistory before it overwrites the older history. | ||
2. persistedFields.fields: This is a list of the fields that are required by the stateful application to be persisted during failover to resume processing from a particular case. This takes in a list of field names as well as how to access them from the spec. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any restriction for the field type here? Can we or should we support types other than String
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
persistedFields.fields contains the name of the field we want to persist and the path to obtain the value of that label. Since both the name and value is tied to a label which supports only strings, I think having a string is sufficient.
persistedFields.fields:
- LabelName: jobID
PersistedStatusItem: obj.status.jobStatus.jobID
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It works I believe. But what I'm thinking here is maybe we can make room for extensibility. The info we can take can be more generic, like a string, integer, and so on. I'm still thinking about it. will get back to you once I have an idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tried to draft the API, please take a look:
// ApplicationFailoverBehavior indicates application failover behaviors.
type ApplicationFailoverBehavior struct {
// other fields.
// StatePreservation defines the policy for preserving and restoring state data
// during failover events for stateful applications.
//
// When an application fails over from one cluster to another, this policy enables
// the extraction of critical data from the original resource configuration.
// Upon successful migration, the extracted data is then re-injected into the new
// resource, ensuring that the application can resume operation with its previous
// state intact.
// This is particularly useful for stateful applications where maintaining data
// consistency across failover events is crucial.
// If not specified, means no state data will be preserved.
// +optional
StatePreservation *StatePreservation `json:"statePreservation,omitempty"`
}
// StatePreservation defines the policy for preserving state during failover events.
type StatePreservation struct {
// Rules contains a list of StatePreservationRule configurations.
// Each rule specifies a JSONPath expression targeting specific pieces of
// state data to be preserved during failover events. An AliasName is associated
// with each rule, serving as a label key when the preserved data is passed
// to the new cluster.
// +required
Rules []StatePreservationRule `json:"rules"`
// Note: We probably need more policies to control how to feed the new cluster with the
// preserved state data in the future. Such as:
// - Is it always acceptable to feed data as the label? Is there a need for annotation?
// - If each label name should be started with a prefix, like `karmada.io/failover-preserving-<fieldname>: <state>`
// Sure, we can default with the label, this structure just makes room for future extensions.
//
// For instance, we can introduce a policy if someone wants to control how the preserving state
// feed to new clusters. This is probably not included this time.
// RestorePolicy determines when and how the preserved state should be restored.
// RestorePolicy RestorePolicy `json:"restorePolicy"`
}
// StatePreservationRule defines a single rule for state preservation.
// It includes a JSONPath expression and an alias name that will be used
// as a label key when passing state information to the new cluster.
type StatePreservationRule struct {
// AliasName is the name that will be used as a label key when the preserved
// data is passed to the new cluster. This facilitates the injection of the
// preserved state back into the application resources during recovery.
// +required
AliasName string `json:"aliasName"`
// JSONPath is the JSONPath query used to identify the state data
// to be preserved from the original resource configuration.
// Example: ".spec.template.spec.containers[?(@.name=='my-container')].image"
// This example selects the image of a container named 'my-container'.
// +required
JSONPath string `json:"jsonPath"`
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Dyex719 @mszacillo What do you think of this API?
I haven't put history thing here, but seems it should not in StatePreservation or persistedFields
as referenced by #5116.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, let me think about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes 100% sense to move the StatePreservation
to FailoverBehavior
, so that both Cluster failover and application failover can share the same configurations.
But, given the cluster failover configuration is still not provided, also not sure when it will come, I'm afraid it might slow down the process if we consider cluster failover in this proposal. So, I tend to focus on application failover right now.
In addition, if cluster failover also needs this configuration later, we can deprecate the one in the application in a compatible way, it's not a concern for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, now I'm going to look at the second part(as I mentioned on #5116 (comment)), which is how to grab and store the state data in ResourceBining.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In our implementation we use the taint condition to specify that a cluster failover is taking place.
Are there plans to include a separate cluster failover implementation? We would prefer to have cluster failover be in scope of this proposal as it is important for flink applications to recover from these failures as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there plans to include a separate cluster failover implementation? We would prefer to have cluster failover be in scope of this proposal as it is important for flink applications to recover from these failures as well.
I don't mind having the cluster failover in this proposal. You might need to refresh the proposal.
I'm sure this gonna be a large proposal.
|
||
// PersistedItem is a pointer to the status item that should be persisted to the rescheduled | ||
// replica during a failover. This should be input in the form: obj.status.<path-to-item> | ||
PersistedStatusItem string `json:"persistedStatusItem,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given this persistedStatusItem
will be applied as a label value, which brings a limitation that the length can not exceed 63 characters. Does Flink's checkpoint path exceed this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In our use-case, we only want to persist the jobID which is a 32 hexadecimal so this limitation is not a problem for us. Both labels and annotations have a 63 character limit for the value in the key, value pair so I don't think there is any way around this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should make a change here:
// persistedStatusItem is
a pointer to the status itemthe value of the status item that should be persisted to the rescheduled replica during a failover. This should be input in the form:obj.status.path-to-itemvalue at the location obj.status.path-to-item
In the case of flink, the flink process will create a jobID at obj.status.jobStatus.jobID which would look like fd72014d4c864993a2e5a9287b4a9c5d
so the persistedStatusItem would store this value in the PersistedStatusItem field.
Flink Deployment spec would have:
status:
jobStatus:
jobID: fd72014d4c864993a2e5a9287b4a9c5d
PersistedDuringFailover would store jobID
as the labelName and fd72014d4c864993a2e5a9287b4a9c5d
as the PersistedStatusItem (value)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we are storing the metadata related to the state here, I think the limitation of the value being a string and less than 63 characters is okay. The actual state of the flink job will be stored in some durable storage and may be very big. In other stateful applications as well I think we can make this assumption that metadata related to the state should be persisted with the resourceBindingStatus and the actual data can be stored elsewhere.
I will include this in the proposal. Let me know what you think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think we can go with the label as the default data provision.
By the way, I talked to Keyverno's maintainer, I was told that Keyverno can perform filters against annotation. We can support that if there is a need.
Both labels and annotations have a 63 character limit for the value in the key, value pair so I don't think there is any way around this.
By the way, the value length of the annotation can support up to 255 chars. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like as you said annotations values can support 253 characters whereas label values can only be 64 characters long. Annotation keys can only be 64 characters whereas label keys can be 253 characters.
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Hi @Dyex719, @mszacillo, I've been thinking this feature recently and came up with some ideas, this feature consists of 3 parts: The first part is how to declare which state data(fields) should be preserved during failover, and I post my idea at #5116 (comment). Please take a look. The second part is how to store the preserved state data, one approach is to store them in the history items(this is our first idea), the challenge things is that it's hard to maintain the history item, especially figuring out what is the destination cluster. I think it is worth thinking about storing them in the GracefulEvictionTask, during the failover process, just before creating the eviction task, it makes sense to grab some state(fields) or snapshot of scheduled cluster list. With this grabbed data, the controller would get to know which cluster is the destination by comparing the snapshot and the newly scheduled cluster. [edit] The third part is how to feed(or inject) the preserved state data to the destination cluster. This is going to be not that complex as long as the controller can figure out the destination cluster and only feed the state data when creating the application. |
Hi @RainbowMango, To address your comment:
One thing we were thinking about it is to only store the cluster from which the workload failed over from. This would help keep the logic simple without involving multiple components. The current cluster the workload is scheduled on can always be inferred from the resourcebinding. With this we would achieve:
The entire trace of the workload could still be inferred from the current state + failoverHistory, the workload was originally on member1, then it migrated to member2, then it moved to member3 which can be inferred from spec.clusters in the resource binding where it is currently scheduled. Work in progress implementation is available here: master...Dyex719:karmada:stateful-failover-flag. Let me know what you think about this! I will get back to your idea at #5116 (comment) in a bit. Thanks! |
Our idea was to preserve only the metadata of the state rather than the actual state which may be large and be of different formats in the resourcebinding. As you know, we use Kyverno for fetching the actual state using the persisted metadata in the resourcebinding. My understanding is that Kyverno is well supported by Karmada so we could use Kyverno to do this injection. It also allows a lot of customization that the user can perform to suit their needs. Is your idea to have another component (maybe the scheduler) do this injection? My only concern is that customization may become difficult. |
@mszacillo @Dyex719 API Design: master...RainbowMango:karmada:api_draft_application_failover |
@XiShanYongYe-Chang I see @mszacillo and @Dyex719 added an agenda to this week's community meeting, I wonder if you can show a demo during the meeting, or share the test report here. |
Okay, let me show you the demo during the meeting tonight. |
Let me describe the demo in words. Create each resource according to the following configuration file: unfold me to see the yaml# deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx
labels:
app: nginx
spec:
replicas: 2
selector:
matchLabels:
app: nginx
template:
metadata:
labels:
app: nginx
spec:
containers:
- image: nginx
name: nginx
===================================
# pp.yaml
# propagate the target Deployment to member1, member2 member clusters, only to one cluster by setting spreadConstraints
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
name: nginx-propagation
spec:
resourceSelectors:
- apiVersion: apps/v1
kind: Deployment
name: nginx
placement:
clusterAffinity:
clusterNames:
- member1
- member2
spreadConstraints:
- spreadByField: cluster
maxGroups: 1
minGroups: 1
propagateDeps: true
failover:
application:
decisionConditions:
tolerationSeconds: 60
purgeMode: Graciously
statePreservation:
rules:
- aliasLabelName: cztest-replicas
jsonPath: ".updatedReplicas"
===================================
# op.yaml
# Modify the Deployment image distributed to the member1 cluster via OverridePolicy to simulate an application failure on member1
apiVersion: policy.karmada.io/v1alpha1
kind: OverridePolicy
metadata:
name: nginx-op
spec:
resourceSelectors:
- apiVersion: apps/v1
kind: Deployment
name: nginx
overrideRules:
- targetCluster:
clusterNames:
- member1
overriders:
imageOverrider:
- component: "Registry"
operator: replace
value: "fake" Execute the following commands to deploy the above resources to the karmada control plane:
The deployment will be propagated to the member1 cluster first, but due to a image error, after waiting for 1 minute, it will be propagated to the member2 cluster after application failover. Then execute the following command to watch the deployment resource on the member2 cluster:
You will observe that the target |
Hi Hongcai and Chang,
Thanks for the heads up! I'm a little surprised that there has been a
different branch created for the change - as this is something we had
proposed and put up for review (for reference we currently use the failover
feature in our own DEV setup), but thank you for putting so much thought into this feature.
Perhaps we can discuss the implementation process more during the meeting?
Cheers.
…On Tue, Oct 15, 2024 at 9:08 AM Chang ***@***.***> wrote:
Let me describe the demo in words.
Create each resource according to the following configuration file:
unfold me to see the yaml
# deploy.yamlapiVersion: apps/v1kind: Deploymentmetadata:
name: nginx
labels:
app: nginxspec:
replicas: 2
selector:
matchLabels:
app: nginx
template:
metadata:
labels:
app: nginx
spec:
containers:
- image: nginx
name: nginx===================================# pp.yaml # propagate the target Deployment to member1, member2 member clusters, only to one cluster by setting spreadConstraintsapiVersion: policy.karmada.io/v1alpha1kind: PropagationPolicymetadata:
name: nginx-propagationspec:
resourceSelectors:
- apiVersion: apps/v1
kind: Deployment
name: nginx
placement:
clusterAffinity:
clusterNames:
- member1
- member2
spreadConstraints:
- spreadByField: cluster
maxGroups: 1
minGroups: 1
propagateDeps: true
failover:
application:
decisionConditions:
tolerationSeconds: 60
purgeMode: Graciously
statePreservation:
rules:
- aliasLabelName: cztest-replicas
jsonPath: ".updatedReplicas"===================================# op.yaml # Modify the Deployment image distributed to the member1 cluster via OverridePolicy to simulate an application failure on member1apiVersion: policy.karmada.io/v1alpha1kind: OverridePolicymetadata:
name: nginx-opspec:
resourceSelectors:
- apiVersion: apps/v1
kind: Deployment
name: nginx
overrideRules:
- targetCluster:
clusterNames:
- member1
overriders:
imageOverrider:
- component: "Registry"
operator: replace
value: "fake"
Execute the following commands to deploy the above resources to the
karmada control plane:
kubectl --context karmada-apiserver apply -f op.yaml
kubectl --context karmada-apiserver apply -f pp.yaml
kubectl --context karmada-apiserver apply -f deployment.yaml
The deployment will be propagated to the member1 cluster first, but due to
a image error, after waiting for 1 minute, it will be propagated to the
member2 cluster after application failover.
Then execute the following command to watch the deployment resource on the
member2 cluster:
kubectl --kubeconfig /root/.kube/members.config --context member2 get deployments.apps -oyaml -w
You will observe that the target cztest-replicas: "2" appears in the
label of the deployment nginx resource on the member2 cluster.
image.png (view on web)
<https://github.com/user-attachments/assets/acafcbe7-cfdb-4a6d-b677-7b266004dbdc>
—
Reply to this email directly, view it on GitHub
<#5116 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACTHQ2PDV52RS6MWFDTZ7DDZ3UHV5AVCNFSM6AAAAABKGBCDCGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDIMJTHA3TAOJUGA>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Thanks for letting me know! |
e4768bc
to
f91381f
Compare
Signed-off-by: Aditya Addepalli <dyex719@gmail.com>
f91381f
to
fd35fb4
Compare
Hi @Dyex719, @mszacillo I'm reviewing this feature to identify any remaining tasks that need to be completed, and then determine where to start from. My two cents:
By the way, I and @XiShanYongYe-Chang currently working on the cluster failover feature enhancement, will let you updated once we work out some ideas. |
Hi @RainbowMango, I'll go ahead and make edits to the proposal to match up with the API changes that have been made as part of #5788. I'm happy to keep this proposal application failover specific - in the meantime we'll maintain a small internal commit that allows us to re-use the graceful eviction task logic for cluster failover. The commit is primarily changes to the related test files, the actual code change is gated behind the failover feature flag and is ~20 lines. And sounds good! Please keep me updated about cluster failover design, as we are planning on using the feature. |
What type of PR is this?
Proposal for stateful failover
What this PR does / why we need it:
Explained in the doc
Which issue(s) this PR fixes:
Fixes #5006, #4969
Special notes for your reviewer:
N/A
Does this PR introduce a user-facing change?:
NONE