
[k8s.io] Rescheduler [Serial] should ensure that critical pod is scheduled in case there is no resources available {Kubernetes e2e suite} #32531

Closed
k8s-github-robot opened this issue Sep 12, 2016 · 24 comments · Fixed by #43106
Labels
kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. kind/flake Categorizes issue or PR as related to a flaky test. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

@k8s-github-robot

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gke-serial/2248/

Failed: [k8s.io] Rescheduler [Serial] should ensure that critical pod is scheduled in case there is no resources available {Kubernetes e2e suite}

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/rescheduler.go:67
Expected error:
    <*errors.errorString | 0xc8211e20b0>: {
        s: "Error while waiting for replication controller kube-dns-v19 pods to be running: Timeout while waiting for pods with labels \"k8s-app=kube-dns,version=v19\" to be running",
    }
    Error while waiting for replication controller kube-dns-v19 pods to be running: Timeout while waiting for pods with labels "k8s-app=kube-dns,version=v19" to be running
not to have occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/rescheduler.go:66

Previous issues for this test: #31277 #31347 #31710 #32260

@k8s-github-robot k8s-github-robot added kind/flake Categorizes issue or PR as related to a flaky test. priority/backlog Higher priority than priority/awaiting-more-evidence. labels Sep 12, 2016
@k8s-github-robot
Author

[FLAKE-PING] @mtaufen

This flaky-test issue would love to have more attention.

@k8s-github-robot
Author

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gke-staging/31/

Failed: [k8s.io] Rescheduler [Serial] should ensure that critical pod is scheduled in case there is no resources available {Kubernetes e2e suite}

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/rescheduler.go:67
Expected error:
    <*errors.errorString | 0xc821c980e0>: {
        s: "Error while waiting for replication controller kube-dns-v20 pods to be running: Timeout while waiting for pods with labels \"k8s-app=kube-dns,version=v20\" to be running",
    }
    Error while waiting for replication controller kube-dns-v20 pods to be running: Timeout while waiting for pods with labels "k8s-app=kube-dns,version=v20" to be running
not to have occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/rescheduler.go:66

@k8s-github-robot
Author

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gke-serial-release-1.4/74/

Failed: [k8s.io] Rescheduler [Serial] should ensure that critical pod is scheduled in case there is no resources available {Kubernetes e2e suite}

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/rescheduler.go:67
Expected error:
    <*errors.errorString | 0xc8208c63e0>: {
        s: "Pod name reserve-all-cpu: Gave up waiting 5m0s for 60 pods to come up",
    }
    Pod name reserve-all-cpu: Gave up waiting 5m0s for 60 pods to come up
not to have occurred
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/rescheduler.go:52

@davidopp
Member

davidopp commented Mar 9, 2017

cc/ @piosz

k8s-github-robot pushed a commit that referenced this issue Mar 9, 2017
Automatic merge from submit-queue (batch tested with PRs 42762, 42739, 42425, 42778)

Fixed potential OutOfSync of nodeInfo.

The cloned NodeInfo still shares the same resource objects with the cache, which can make `requestedResource` and Pods go out of sync: for example, if a pod is deleted, `requestedResource` is updated but the Pods in the cloned info are not. Found this while investigating #32531, but it does not seem to be the root cause, as nodeInfo is read-only in predicates & priorities.

Sample code for `&(*)`:

```
package main

import (
	"fmt"
)

type Resource struct {
	A int
}

type Node struct {
	Res *Resource
}

func main() {
	r1 := &Resource{A: 10}
	n1 := &Node{Res: r1}
	// &(*p) is just p: dereferencing and re-taking the address does not
	// copy the struct, so r2 aliases r1.
	r2 := &(*n1.Res)
	r2.A = 11

	// r1 == r2 is true and both pointers print the mutated value.
	fmt.Printf("%t, %d %d\n", r1 == r2, r1, r2)
}
```

Output:

```
true, &{11} &{11}
```
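For contrast, a value copy breaks the aliasing. An illustrative sketch of the pattern (swap it in for the `r2 := &(*n1.Res)` line above); this is not necessarily the exact change in the merged fix:

```
	// Copy the Resource by value and take the address of the copy,
	// so the clone stops sharing memory with r1.
	copied := *n1.Res
	r2 := &copied
	r2.A = 11

	fmt.Printf("%t, %d %d\n", r1 == r2, r1, r2)
	// prints: false, &{10} &{11}
```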
@piosz piosz assigned piosz and unassigned mtaufen and k82cn Mar 10, 2017
@piosz
Member

piosz commented Mar 10, 2017

@davidopp please correct me if I'm wrong, but tolerations/taints are already migrated to fields in HEAD. This means that to fix this issue we need to migrate the rescheduler to use fields. I'll do it very soon.
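For reference, a minimal sketch of a field-based toleration, assuming the current `k8s.io/api/core/v1` types; the `CriticalAddonsOnly` key is shown for illustration and stands in for what was previously carried in the alpha tolerations annotation:

```
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

func main() {
	// Toleration expressed as a typed field on the pod spec rather than
	// an alpha annotation. The key is illustrative.
	pod := v1.Pod{
		Spec: v1.PodSpec{
			Tolerations: []v1.Toleration{
				{
					Key:      "CriticalAddonsOnly",
					Operator: v1.TolerationOpExists,
				},
			},
		},
	}
	fmt.Printf("tolerations: %+v\n", pod.Spec.Tolerations)
}
```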

@marun
Contributor

marun commented Mar 10, 2017

@piosz Do you think this issue indicates a regression that should block 1.6, which would require that a fix be available asap? Or is it a test-only issue that can be moved to the 1.6.1 or 1.7 milestone?

@piosz
Member

piosz commented Mar 10, 2017

@marun the former one. See kubernetes-retired/contrib#2382

@ethernetdan ethernetdan added kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. and removed priority/backlog Higher priority than priority/awaiting-more-evidence. labels Mar 10, 2017
@aveshagarwal
Member

@piosz yes, taints and tolerations have already been moved to API fields, and the related PRs are already merged for 1.6.

Also, this looks like a duplicate of #42686; I think we should close one.

@davidopp
Member

Also, this looks like a duplicate of #42686; I think we should close one.

+1

@spiffxp
Member

spiffxp commented Mar 14, 2017

#42686 was closed in favor of this issue, since we expect the bot to re-open this

@ethernetdan
Contributor

Status: @piosz is working on a fix
