
Add pvc as part of equivalence hash #56577

Merged
merged 3 commits into kubernetes:master from resouer:fix-eclass-pvc
Dec 5, 2017

Conversation

resouer
Contributor

@resouer resouer commented Nov 29, 2017

What this PR does / why we need it:

We should add the PVC as part of the equivalence hash, so that StatefulSets and Operators always run the volume predicates while ReplicaSets can still re-use cached results.
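
For illustration, the hashing idea looks roughly like the sketch below (a simplified sketch, not the actual diff; the helper name equivalenceHashInputs is hypothetical, and the merged change ultimately resolves each claim to its UID via the informer cache rather than using the raw name):

package core

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/sets"
)

// equivalenceHashInputs collects the attributes a pod is hashed on.
func equivalenceHashInputs(pod *v1.Pod) sets.String {
	result := sets.NewString()
	// Pods that share a controlling owner are candidates for the same class.
	for _, ref := range pod.OwnerReferences {
		if ref.Controller != nil && *ref.Controller {
			result.Insert(string(ref.UID))
		}
	}
	// Fold in the PVCs the pod mounts, so StatefulSet pods (each with its
	// own claim) no longer collapse into one equivalence class.
	for _, volume := range pod.Spec.Volumes {
		if volume.PersistentVolumeClaim == nil {
			continue
		}
		pvcName := volume.PersistentVolumeClaim.ClaimName
		result.Insert(pod.GetNamespace() + "-" + pvcName)
	}
	return result
}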

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #56265

Special notes for your reviewer:

Release note:

Add pvc as part of equivalence hash

@resouer resouer added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Nov 29, 2017
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 29, 2017
continue
}
pvcName := volume.PersistentVolumeClaim.ClaimName
result.Insert(pod.GetNamespace() + "-" + pvcName)
Contributor

Can we use the PVC UID? I understand this is probably equivalent, but using the UID might make it more obvious. Is there a reason to avoid using the UID? If @bsalamat is ok with this, then maybe we can add a comment explaining it.

Member

A user could delete a PVC and create a new one with the same name, but different UID. The PVC contents could be different, so we would not want to use the cached results for the old PVC.

But actually, how do entries in the equivalence cache get removed?

Member

I agree that UID will be more reliable to ensure uniqueness.

Contributor Author

@resouer resouer Nov 30, 2017

@msau42 Yes, when a PVC is updated, the volume predicate cache is invalidated, so it still works.

I was just trying to avoid fetching the UID from the scheduler cache, since that would require changing the ecache getter function's interface a bit. Not sure if that's allowed during code freeze ...

Anyway, let me update the patch to use the more reliable approach.

Member

Oh, I see what you mean: the Pod object only contains the PVC name, so you have to look up the PVC object from the informer cache to get the UID.

Contributor Author

Right, so I've refactored it into a factory like NewEquivalencePodGenerator(args.PVCInfo); I'll update the PR soon after the tests pass.
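
A minimal sketch of what such a factory can look like (the pvcLister interface and equivalencePod struct here are illustrative stand-ins, not the exact types in the PR):

package core

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/sets"
)

// pvcLister is a stand-in for the scheduler's PVC informer accessor
// (args.PVCInfo in the comment above).
type pvcLister interface {
	GetPersistentVolumeClaimInfo(namespace, name string) (*v1.PersistentVolumeClaim, error)
}

// equivalencePod is the value actually hashed: the controlling owner plus
// the UIDs of the PVCs the pod mounts.
type equivalencePod struct {
	ControllerRef metav1.OwnerReference
	PVCSet        sets.String
}

// EquivalencePodGenerator binds the PVC lister so getEquivalencePod can
// resolve claim names to UIDs.
type EquivalencePodGenerator struct {
	pvcInfo pvcLister
}

// NewEquivalencePodGenerator is the factory mentioned above: the scheduler
// config factory constructs it once with its PVC informer and plugs the
// returned method in as the "get equivalence pod" hook.
func NewEquivalencePodGenerator(pvcInfo pvcLister) func(pod *v1.Pod) interface{} {
	return (&EquivalencePodGenerator{pvcInfo: pvcInfo}).getEquivalencePod
}

func (g *EquivalencePodGenerator) getEquivalencePod(pod *v1.Pod) interface{} {
	for _, ref := range pod.OwnerReferences {
		if ref.Controller == nil || !*ref.Controller {
			continue
		}
		pvcSet := sets.NewString()
		for _, vol := range pod.Spec.Volumes {
			if vol.PersistentVolumeClaim == nil {
				continue
			}
			pvc, err := g.pvcInfo.GetPersistentVolumeClaimInfo(pod.Namespace, vol.PersistentVolumeClaim.ClaimName)
			if err != nil {
				// Cannot resolve the claim; don't reuse cached results.
				return nil
			}
			pvcSet.Insert(string(pvc.UID))
		}
		return &equivalencePod{ControllerRef: ref, PVCSet: pvcSet}
	}
	return nil
}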

@resouer resouer force-pushed the fix-eclass-pvc branch 2 times, most recently from b746d6d to 2932f4c on November 30, 2017 10:28
Member

@msau42 msau42 left a comment

Also, the NoVolumeZoneConflict predicate needs to be invalidated on PVC delete

@resouer
Contributor Author

resouer commented Dec 1, 2017

@msau42 Thanks, just updated. Also cleaned up and fixed the related MaxPDVolumeCountPredicate

if pvc.Spec.VolumeName != "" {
c.equivalencePodCache.InvalidateCachedPredicateItemOfAllNodes(maxPDVolumeCountPredicateSet)
}
// TODO(harry): make all predicate key to be global const, e.g. predicates.NoVolumeZoneConflict
Member

Can you add a comment that we need to do this here because the ecache uses PVC uid to hash the pod

Member

I guess for that same reason, VolumeBind predicate also needs to be invalidated

Contributor Author

ack, just updated.
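
For reference, a sketch of how the PVC invalidation discussed above could be wired together (the ecacheInvalidator interface is a stand-in for the real equivalence cache, and the predicate key strings follow the names visible elsewhere in this thread; this is not the PR's exact code):

package core

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/sets"
)

// ecacheInvalidator is a stand-in for the equivalence cache's
// "invalidate these predicate keys on every node" operation.
type ecacheInvalidator interface {
	InvalidateCachedPredicateItemOfAllNodes(predicateKeys sets.String)
}

// invalidatePredicatesForPvc shows roughly which cached predicate results
// have to be dropped when a PVC changes: the zone and volume-binding
// predicates always (the ecache hashes pods by PVC UID), and the per-cloud
// volume-count predicates only once the claim is actually bound to a volume.
func invalidatePredicatesForPvc(ecache ecacheInvalidator, pvc *v1.PersistentVolumeClaim) {
	invalid := sets.NewString("NoVolumeZoneConflict", "CheckVolumeBinding")
	if pvc.Spec.VolumeName != "" {
		// A bound claim contributes to the attached-volume counts.
		invalid.Insert("MaxGCEPDVolumeCount", "MaxAzureDiskVolumeCount", "MaxEBSVolumeCount")
	}
	ecache.InvalidateCachedPredicateItemOfAllNodes(invalid)
}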

@@ -384,16 +384,7 @@ func (c *configFactory) onPvDelete(obj interface{}) {
}

func (c *configFactory) invalidatePredicatesForPv(pv *v1.PersistentVolume) {
invalidPredicates := sets.NewString("MaxPDVolumeCountPredicate")
Member

I think this needs to stay. You could have a PVC that points to a PV, but the PV object doesn't exist. So when the PV object gets added, we can recount.

Contributor Author

@resouer resouer Dec 1, 2017

Fixed, and also updated the other comments to clarify the reasons for each invalidation.

@@ -188,12 +188,29 @@ func (ec *EquivalenceCache) InvalidateCachedPredicateItemForPodAdd(pod *v1.Pod,

// GeneralPredicates: will always be affected by adding a new pod
invalidPredicates := sets.NewString("GeneralPredicates")

// MaxPDVolumeCountPredicate: we check the volumes of pod to make decision.
for _, vol := range pod.Spec.Volumes {
Member

Why does this need to be added here?

Contributor Author

@resouer resouer Dec 1, 2017

I noticed MaxPDVolumeCountChecker actually relies on the volume count of the pods on a node to make its decision. So when a pod with a PV is added to or deleted from a node, the MaxPDVolumeCountChecker e-cache for that node should be re-calculated for subsequent pods.

Hope I am understanding this right :)

Member

The volume count checker counts the volumes of pods that are already assigned to nodes. I guess for the case where you create the Pod with NodeName already set, then yes, the predicate should be invalidated.
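
As a sketch of the counting described here (hard-coded to GCE PDs for brevity; the real checker is parameterized by a volume filter and a maximum count):

package core

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/sets"
)

// uniqueGCEPDsOnNode counts distinct GCE PD volumes across the pods already
// assigned to a node, which is why adding or deleting a pod with such a
// volume on the node invalidates the cached predicate result.
func uniqueGCEPDsOnNode(podsOnNode []*v1.Pod) int {
	pds := sets.NewString()
	for _, pod := range podsOnNode {
		for _, vol := range pod.Spec.Volumes {
			if vol.GCEPersistentDisk != nil {
				pds.Insert(vol.GCEPersistentDisk.PDName)
			}
		}
	}
	return pds.Len()
}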

@msau42
Member

msau42 commented Dec 1, 2017

lgtm from volume perspective.

Have you tried testing your change with a StatefulSet with PVCs to make sure it is fixed?

@@ -1492,3 +1492,14 @@ func (c *VolumeBindingChecker) predicate(pod *v1.Pod, meta algorithm.PredicateMe
glog.V(5).Info("All PVCs found matches for pod %v/%v, node %q", pod.Namespace, pod.Name, node.Name)
return true, nil, nil
}

type FakePersistentVolumeClaimInfo []v1.PersistentVolumeClaim
Member

Why is this block moved to this file? It does not belong here.

Contributor Author

This will be used by equivalence_cache_test.go, maybe create a test helper pkg?

// For now we only consider pods:
// 1. OwnerReferences is Controller
// 2. with same OwnerReferences
// 3. with same PVC claim
// to be equivalent
if len(pod.OwnerReferences) != 0 {
Member

Now that you are making changes, please drop this if statement. The for loop takes care of this.

// For now we only consider pods:
// 1. OwnerReferences is Controller
// 2. with same OwnerReferences
// 3. with same PVC claim
// to be equivalent
if len(pod.OwnerReferences) != 0 {
for _, ref := range pod.OwnerReferences {
if *ref.Controller {
Member

Can ref.Controller ever be nil? We should probably check.

Contributor Author

Good catch ... just fixed

// PVC volume binding has changed
invalidPredicates.Insert(predicates.CheckVolumeBinding)
}
// The binded volume type may change
Member

s/binded/bound/ here and in other places in your PR

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 2, 2017
@resouer
Contributor Author

resouer commented Dec 2, 2017

@msau42 Yes, I've tested the patch with a StatefulSet with ReadWriteOnce PVC:

  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi

And with export FEATURE_GATES=EnableEquivalenceClassCache=true,VolumeScheduling=true set, the second Pod is expected to fail to schedule.

cluster/kubectl.sh get pods
NAME      READY     STATUS    RESTARTS   AGE
web-0     1/1       Running   0          4s
web-1     0/1       Pending   0          1s

Events:
  Type     Reason            Age               From               Message
  ----     ------            ----              ----               -------
  Warning  FailedScheduling  0s (x6 over 15s)  default-scheduler  pod has unbound PersistentVolumeClaims

@resouer
Contributor Author

resouer commented Dec 2, 2017

The eclass feature does have e2e tests, but they are normally skipped because they are marked Serial. That said, I strongly agree #56104 would be a great way to guard this.

@msau42
Member

msau42 commented Dec 2, 2017

The second pod fails to schedule with or without your patch? The original reported 1.8 bug was that the second pod did get scheduled, but in the wrong zone. You would need a multizone cluster to test this.

Alternatively, you could enable my 1.9 alpha features for local volumes and run my local e2e test, or I can run it with your patch.

Also, in general, for alpha e2e tests you have to explicitly add them to the regex for the alpha suite. See kubernetes/test-infra#5679

if *ref.Controller {
// a pod can only belong to one controller
for _, ref := range pod.OwnerReferences {
if ref.Controller != nil && *ref.Controller {
Member

Why do you also need to check *ref.Controller?

Member

@msau42 ref.Controller is a *bool. Its value can be false.

Member

Ah thanks, makes sense
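
The nil-safe pattern that came out of this thread, as a self-contained sketch (the helper name controllerRefOf is hypothetical):

package core

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// controllerRefOf returns the pod's controlling OwnerReference, if any.
// Controller is a *bool and may be unset, so it must be guarded before
// dereferencing.
func controllerRefOf(pod *v1.Pod) *metav1.OwnerReference {
	for i := range pod.OwnerReferences {
		ref := &pod.OwnerReferences[i]
		if ref.Controller != nil && *ref.Controller {
			return ref // a pod has at most one controlling owner
		}
	}
	return nil
}

For what it's worth, apimachinery also exposes metav1.GetControllerOf, which performs the same guarded lookup.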

@msau42
Member

msau42 commented Dec 4, 2017

lgtm from volume side

matchInterPodAffinitySet = sets.NewString("MatchInterPodAffinity")
generalPredicatesSets = sets.NewString("GeneralPredicates")
noDiskConflictSet = sets.NewString("NoDiskConflict")
maxPDVolumeCountPredicateKeys = []string{"MaxGCEPDVolumeCount", "MaxAzureDiskVolumeCount", "MaxEBSVolumeCount"}
Member

I thought I had asked this question before, but it looks like I haven't:
why do we need to break MaxPDVolumeCountPredicate into three keys? Is it only for optimization purposes? If so, I doubt it has a major effect on performance. It will probably be rare to have more than one of these volume types on a single node.

Contributor Author

Actually, the registered predicates are three separate names rather than a single "MaxPDVolumeCount".

Otherwise we would need to check the volume type before invalidating, and, as you pointed out, it's rare to have more than one of these types as a PV, so I chose to invalidate all of them rather than check the PV type first.

Member

I see. Thanks for pointing it out.

@bsalamat
Member

bsalamat commented Dec 5, 2017

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 5, 2017
@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 5, 2017
@k8s-github-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, resouer

Associated issue: 56265


@msau42
Member

msau42 commented Dec 5, 2017

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Dec 5, 2017
@msau42
Member

msau42 commented Dec 5, 2017

/priority critical-urgent

@k8s-ci-robot k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Dec 5, 2017
@k8s-github-robot

[MILESTONENOTIFIER] Milestone Pull Request Current

@bsalamat @msau42 @resouer

Note: This pull request is marked as priority/critical-urgent, and must be updated every 1 day during code freeze.


@k8s-github-robot

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 56688, 56577). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 305d644 into kubernetes:master Dec 5, 2017
@k8s-ci-robot
Contributor

k8s-ci-robot commented Dec 5, 2017

@resouer: The following test failed, say /retest to rerun them all:

Test: pull-kubernetes-e2e-gce
Commit: b3bb74e
Rerun command: /test pull-kubernetes-e2e-gce



@resouer resouer deleted the fix-eclass-pvc branch December 12, 2017 17:35
k8s-github-robot pushed a commit that referenced this pull request Dec 17, 2017
…7-upstream-release-1.8

Automatic merge from submit-queue.

Automated cherry pick of #56577

Cherry pick of #56577 on release-1.8.

#56577: Fix PV counter predicate in eclass

fixes: #54370

Successfully merging this pull request may close these issues.

Equivalence cache pod hashing doesn't work for StatefulSets with PVCs
7 participants