Add topologySpreadConstraints configuration to pod spec. #2530

Open
wants to merge 4 commits into
base: master

Conversation

laiminhtrung1997
Contributor

@laiminhtrung1997 laiminhtrung1997 commented Feb 4, 2024

Dear all,

I think we should add topologySpreadConstraints to the pod spec so that the pods can be spread across zones for high availability.

Could someone review it, please? Thank you very much.

Best regards.

@monotek

monotek commented May 16, 2024

We need that feature too.

@FxKu FxKu modified the milestones: 2.0, 1.13.0 May 24, 2024
@@ -465,6 +465,11 @@ func (c *Cluster) compareStatefulSetWith(statefulSet *appsv1.StatefulSet) *compa
		needsRollUpdate = true
		reasons = append(reasons, "new statefulset's pod affinity does not match the current one")
	}
	if !reflect.DeepEqual(c.Statefulset.Spec.Template.Spec.TopologySpreadConstraints, statefulSet.Spec.Template.Spec.TopologySpreadConstraints) {
		needsReplace = true
		needsRollUpdate = true
Member

Does this really need to trigger an operator-driven rolling update of the pods? Won't K8s take care of it once the statefulset is replaced?

Contributor Author

Member

Hm, good point. Maybe we can leave it as is for now. With a rolling update we make sure pods immediately adhere to the new constraints.

pkg/cluster/k8sres.go Outdated
pkg/cluster/k8sres.go Outdated
@FxKu
Member

FxKu commented Jun 26, 2024

Can you also write an e2e test that tests that the constraints work as expected, please?

@FxKu FxKu modified the milestones: 1.13.0, 1.14.0 Jun 26, 2024
@laiminhtrung1997
Contributor Author

> Can you also write an e2e test that tests that the constraints work as expected, please?

Dear @FxKu
Thanks so much for your comments. I haven't written many unit or e2e tests, but I'll try my best. I'll mark the PR as ready and let you know once I have addressed the comments.
Best regards.

@laiminhtrung1997 laiminhtrung1997 force-pushed the add-topologySpreadConstraints branch 8 times, most recently from 530f847 to 18023cb Compare July 7, 2024 06:56
@laiminhtrung1997
Contributor Author

Dear @FxKu
I have completed the unit and e2e tests and addressed your comments. Could you please review it again? Thanks.

@laiminhtrung1997
Contributor Author

Dear @FxKu
Have you been able to take a look yet? I'm looking forward to hearing from you soon about this.

@FxKu FxKu added the minor label Aug 27, 2024
@FxKu
Member

FxKu commented Aug 27, 2024

@laiminhtrung1997 thanks a lot for the update. I think in this state you can be sure we will merge it for the next release. We have to focus on the new status feature first, but I will get back to you in September.

@laiminhtrung1997 laiminhtrung1997 force-pushed the add-topologySpreadConstraints branch 3 times, most recently from 256fd9f to fbac974 Compare October 20, 2024 05:45
@@ -254,6 +254,7 @@ type Config struct {
EnableSecretsDeletion *bool `name:"enable_secrets_deletion" default:"true"`
EnablePersistentVolumeClaimDeletion *bool `name:"enable_persistent_volume_claim_deletion" default:"true"`
PersistentVolumeClaimRetentionPolicy map[string]string `name:"persistent_volume_claim_retention_policy" default:"when_deleted:retain,when_scaled:retain"`
EnablePostgresTopologySpreadConstraints bool `json:"enable_postgres_topology_spread_constraints,omitempty"`
Member

Can be removed. See comment above.

@@ -599,6 +599,22 @@ func generatePodAntiAffinity(podAffinityTerm v1.PodAffinityTerm, preferredDuring
return podAntiAffinity
}

func generateTopologySpreadConstraints(labels labels.Set, additionalTopologySpreadConstraints []v1.TopologySpreadConstraint) []v1.TopologySpreadConstraint {
Member

Help me to understand what this function is doing:

  • Do we have to define this first hard-coded TopologySpreadConstraint when somebody specifies constraints in the manifest?
  • What would happen if it is missing?
  • Should the operator always create this spread constraint, similar to the node affinities?

Contributor Author

The purpose of this function is:

  • If topologySpreadConstraints is set to true and additionalTopologySpreadConstraints is either empty or undefined, the operator will apply default constraints as hardcoded.
  • If additionalTopologySpreadConstraints is defined, the specified list of constraints will be appended.
  • The topologySpreadConstraints setting is configured to make the constraints customizable.
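The behaviour described in the list above can be sketched roughly as follows. This is not the code from the PR: the types are simplified stand-ins for the Kubernetes ones, the label selector is reduced to a plain map, and the values of the hard-coded default (maxSkew 1, zone topology key, DoNotSchedule) are assumptions chosen for illustration only.

```go
package main

import "fmt"

// TopologySpreadConstraint is a simplified stand-in for the Kubernetes
// core/v1 type; MatchLabels replaces the real LabelSelector.
type TopologySpreadConstraint struct {
	MaxSkew           int32
	TopologyKey       string
	WhenUnsatisfiable string
	MatchLabels       map[string]string
}

// generateTopologySpreadConstraints always emits one default constraint and
// appends whatever additional constraints were supplied. The default values
// below are assumed for this sketch, not taken from the PR.
func generateTopologySpreadConstraints(labels map[string]string, additional []TopologySpreadConstraint) []TopologySpreadConstraint {
	constraints := []TopologySpreadConstraint{{
		MaxSkew:           1,
		TopologyKey:       "topology.kubernetes.io/zone",
		WhenUnsatisfiable: "DoNotSchedule",
		MatchLabels:       labels,
	}}
	return append(constraints, additional...)
}

func main() {
	labels := map[string]string{"cluster-name": "demo"} // hypothetical label

	// No additional constraints: only the default is applied.
	fmt.Println(len(generateTopologySpreadConstraints(labels, nil)))

	// One additional constraint: it is appended after the default.
	extra := []TopologySpreadConstraint{{
		MaxSkew:           1,
		TopologyKey:       "kubernetes.io/hostname",
		WhenUnsatisfiable: "ScheduleAnyway",
		MatchLabels:       labels,
	}}
	for _, c := range generateTopologySpreadConstraints(labels, extra) {
		fmt.Println(c.TopologyKey, c.WhenUnsatisfiable)
	}
}
```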

Contributor Author

@FxKu
Our logic might be different. Please let me know how you’d like the operator to apply the constraint, and I’ll implement it according to your suggestion. I'd love to do it.

Member

@laiminhtrung1997 you have only described what the code does. I can read and understand code myself 😃

Please try to answer my questions. I'm wondering if the function is needed at all? Why not go with what people specify in the manifest? I should have made this thought more clear. Hope you will understand my questions better now.

Contributor Author

@laiminhtrung1997 laiminhtrung1997 Nov 12, 2024

@FxKu

Okay, I got it. So the only user input is topologySpreadConstraints in the manifest, and the operator applies it as specified.

I will refactor it right away. Thank you.

nullable: true
items:
  type: object
  x-kubernetes-preserve-unknown-fields: true
Member

Can we not add all the fields of a topologySpreadConstraint, like we do for nodeAffinity? I feel it's too lazy and unsafe to allow arbitrary fields with x-kubernetes-preserve-unknown-fields: true.

XPreserveUnknownFields: util.True(),
},
},
},
Member

Same here with XPreserveUnknownFields. Yes, there are other fields where we are doing it like this, but let's get it right for new additions. I know it's tedious to reflect the full schema because we don't use a framework like Kubebuilder. But that should be the trade-off for contributors when they go the "easy way" of allowing full specs in our manifest over custom stripped-down designs better suited to end users.
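To illustrate what an explicit schema catches that x-kubernetes-preserve-unknown-fields: true lets through, the sketch below writes the same kind of rules as plain Go checks. The three rules follow the Kubernetes API for a topology spread constraint (maxSkew greater than zero, topologyKey required, whenUnsatisfiable one of DoNotSchedule or ScheduleAnyway). The stand-in type and the validate helper are hypothetical and not operator code; in the CRD these rules would be expressed declaratively in the schema instead.

```go
package main

import (
	"errors"
	"fmt"
)

// TopologySpreadConstraint is a simplified stand-in for the Kubernetes
// core/v1 type, limited to the three required fields.
type TopologySpreadConstraint struct {
	MaxSkew           int32
	TopologyKey       string
	WhenUnsatisfiable string
}

// validate (hypothetical helper) rejects values that a typed schema would
// refuse at admission time but a preserve-unknown-fields schema accepts.
func validate(c TopologySpreadConstraint) error {
	if c.MaxSkew < 1 {
		return errors.New("maxSkew must be greater than zero")
	}
	if c.TopologyKey == "" {
		return errors.New("topologyKey is required")
	}
	switch c.WhenUnsatisfiable {
	case "DoNotSchedule", "ScheduleAnyway":
		return nil
	default:
		return fmt.Errorf("whenUnsatisfiable %q must be DoNotSchedule or ScheduleAnyway", c.WhenUnsatisfiable)
	}
}

func main() {
	ok := TopologySpreadConstraint{MaxSkew: 1, TopologyKey: "topology.kubernetes.io/zone", WhenUnsatisfiable: "DoNotSchedule"}
	typo := TopologySpreadConstraint{MaxSkew: 1, TopologyKey: "topology.kubernetes.io/zone", WhenUnsatisfiable: "DoNotSchedul"}

	fmt.Println(validate(ok))   // <nil>
	fmt.Println(validate(typo)) // error: misspelled enum value
}
```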

@FxKu
Member

FxKu commented Nov 11, 2024

@laiminhtrung1997 have you tested how topology spread constraints behave together with a nodeAffinity specified in the manifest and globally configured pod anti-affinity rules? How easy is it to create a scenario where they contradict each other and lead to scheduling problems? Should one be used over the other? Maybe @monotek can answer this, too?

@FxKu FxKu modified the milestones: 1.14.0, 1.15.0 Nov 11, 2024
@laiminhtrung1997 laiminhtrung1997 force-pushed the add-topologySpreadConstraints branch 2 times, most recently from 5f916be to 3e99f92 Compare November 14, 2024 10:27
@FxKu
Member

FxKu commented Nov 14, 2024

Advice for the future: Don't force push and squash your commits in the middle of a review. Now it's super hard for me to see what feedback you've reflected and I have to review everything again 😞

@laiminhtrung1997
Contributor Author

@FxKu

Noted. I am truly sorry for this. It will not happen again.

@laiminhtrung1997
Contributor Author

@FxKu

Currently, I have configured the operator to use topologySpreadConstraints and Affinity.PodAntiAffinity together. My expectation is that the pods are always scheduled on different nodes and in different availability zones. This is my manifest for them.

# OperatorConfiguration
configuration:
  kubernetes:
    enable_pod_antiaffinity: true
    pod_antiaffinity_preferred_during_scheduling: false
    pod_antiaffinity_topology_key: kubernetes.io/hostname
    topology_spread_constraints:
    - max_skew: 1
      topology_key: topology.kubernetes.io/zone
      when_unsatisfiable: DoNotSchedule

Do you want me to put that scenario in the e2e test?

Status: Open Questions

3 participants