Schedule Pods in resource-constrained environments #734

imjasonh · 2019-04-05T14:15:46Z

Expected Behavior

In a resource-constrained environment like a namespace with resource limits imposed (or just an insufficiently provisioned cluster), creating a TaskRun (Pod) that exceeds those limits should not fail the TaskRun, but should instead continually try to create the Pod until it either succeeds or times out.

Actual Behavior

Pods fail to start and the TaskRun is failed ~immediately.

Steps to Reproduce the Problem

Define a namespace with resource constraints (e.g., 10 CPU, 10 GB RAM)
Create 15 TaskRuns each requesting 1 CPU and 1 GB RAM, running hello world or something simple
~10 of those will be scheduled and will succeed, the rest will fail due to insufficient resources.

Additional Info

This is similar to how Jobs can handle Pod scheduling failures by retrying until they are successful.

It's unclear whether users would expect TaskRuns waiting for sufficient resources to queue in order of the time they were created, or whether they'd expect the Kubernetes scheduler to do whatever it needs to do to schedule the Pods. As an initial implementation it's probably fine to have Kubernetes schedule Pods, and not have to worry about enforcing FIFO.

The text was updated successfully, but these errors were encountered:

ghost · 2019-04-17T17:17:18Z

/assign @sbwsg

Our TaskRun timeout end to end test has been intermittently failing against PRs. After adding a lot of (terrible) log messages, I found out that the reason TaskRuns weren't timing out was b/c the go routine checking for a timeout was considering them timed out several milliseconds before the reconcile loop would. So what would happen was: 1. The timeout handler decided the run timed out, and triggered a reconcile 2. The reconcile checked if the run timed out, decided the run had a few more milliseconds of execution time and let it go 3. And the TaskRun would just keep running It looks like the root cause is that when the go routine starts, it uses a `StartTime` that has been set by `TaskRun.InitializeConditions`, but after that, the pod is started and the `TaskRun.Status` is updated to use _the pod's_ `StartTime` instead - which is what the Reconcile loop will use from that point forward, causing the slight drift. This is fixed by no longer tying the start time of the TaskRun to the pod: the TaskRun will be considered started as soon as the reconciler starts to act on it, which is probably the functionality the user would expect anyway (e.g. if the pod was delayed in being started, this delay should be subtracted from the timeout, and in tektoncd#734 we are looking to be more tolerant of pods not immediately scheduling anyway). Fixes tektoncd#731 Co-authored-by: Nader Ziada <nziada@pivotal.io>

ghost · 2019-04-23T14:13:34Z

Been working through some of the implementation details in a POC but want to drop current working notes here since I likely won't be able to work on it more until tomorrow.

Catching a pod failure is relatively straightforward; checking the error message produced by the createPod() func in pkg/reconciler/v1alpha1/taskrun/taskrun.go reveals the reason. From here it's quick to parse out the error message and look for e.g. "exceeded quota" in the string. This relies on a somewhat brittle contract though. I'll also need to check for the different messages generated both by LimitRanges as well as ResourceQuotas since it looks like they both enforce resource limits on a pod. I'm currently looking around to see if there's a less brittle approach to this error checking.
Once the resource constraint error is detected the pod then needs to be restarted. In my POC implementation this works by simply Enqueue()ing the TR to be re-assessed on the next reconcile loop. This results in many rapid reruns however when really it would be nicer to see an exponential backoff strategy similar to that used by k8s' job controller. The job controller uses a particular kind of workqueue to implement this ("ExponentialFailureRateLimiter") but TaskRun's controller Impl uses the "RateLimitedQueue", which is set up via knative's controller.NewImpl() func. So I'm looking at other alternatives to implement this.

Our TaskRun timeout end to end test has been intermittently failing against PRs. After adding a lot of (terrible) log messages, I found out that the reason TaskRuns weren't timing out was b/c the go routine checking for a timeout was considering them timed out several milliseconds before the reconcile loop would. So what would happen was: 1. The timeout handler decided the run timed out, and triggered a reconcile 2. The reconcile checked if the run timed out, decided the run had a few more milliseconds of execution time and let it go 3. And the TaskRun would just keep running It looks like the root cause is that when the go routine starts, it uses a `StartTime` that has been set by `TaskRun.InitializeConditions`, but after that, the pod is started and the `TaskRun.Status` is updated to use _the pod's_ `StartTime` instead - which is what the Reconcile loop will use from that point forward, causing the slight drift. This is fixed by no longer tying the start time of the TaskRun to the pod: the TaskRun will be considered started as soon as the reconciler starts to act on it, which is probably the functionality the user would expect anyway (e.g. if the pod was delayed in being started, this delay should be subtracted from the timeout, and in #734 we are looking to be more tolerant of pods not immediately scheduling anyway). Fixes #731 Co-authored-by: Nader Ziada <nziada@pivotal.io>

ghost · 2019-04-25T17:13:18Z

I'm going to move this into a design doc. There are enough variables here to seed some discussion and it'd be good to get broader input before committing to one approach.

ghost · 2019-04-29T12:58:36Z

I've started the design doc here including use cases, a draft implementation, some open questions and possible alternative implementations that I'm still working through.

Our TaskRun timeout end to end test has been intermittently failing against PRs. After adding a lot of (terrible) log messages, I found out that the reason TaskRuns weren't timing out was b/c the go routine checking for a timeout was considering them timed out several milliseconds before the reconcile loop would. So what would happen was: 1. The timeout handler decided the run timed out, and triggered a reconcile 2. The reconcile checked if the run timed out, decided the run had a few more milliseconds of execution time and let it go 3. And the TaskRun would just keep running It looks like the root cause is that when the go routine starts, it uses a `StartTime` that has been set by `TaskRun.InitializeConditions`, but after that, the pod is started and the `TaskRun.Status` is updated to use _the pod's_ `StartTime` instead - which is what the Reconcile loop will use from that point forward, causing the slight drift. This is fixed by no longer tying the start time of the TaskRun to the pod: the TaskRun will be considered started as soon as the reconciler starts to act on it, which is probably the functionality the user would expect anyway (e.g. if the pod was delayed in being started, this delay should be subtracted from the timeout, and in tektoncd#734 we are looking to be more tolerant of pods not immediately scheduling anyway). Fixes tektoncd#731 Co-authored-by: Nader Ziada <nziada@pivotal.io>

…pace-duplicate-test Skip TestWorkspacePipelineRunDuplicateWorkspaceEntriesInvalid Test

tekton-robot assigned ghost Apr 17, 2019

bobcatfish mentioned this issue Apr 22, 2019

Decouple TaskRun startTime from pod start time ⌚ #780

Merged

2 tasks

This was referenced May 13, 2019

Proposal: graceful sidecar support #727

Closed

Surface resource constraint problems in TaskRun Status #876

Merged

This was referenced May 24, 2019

Configure number of pipelines running in parallel #868

Closed

Reattempt pod creation in the face of ResourceQuota errors #905

Merged

vdemeester added this to the Pipelines 0.5 🐱 : <cat name tbd @dlorenc> milestone May 28, 2019

tekton-robot closed this as completed in #905 May 29, 2019

bobcatfish mentioned this issue Jul 15, 2021

Pipeline controller shouldn't retry creating pod when the error cannot be mitigated by retry #4092

Closed

nikhil-thomas pushed a commit to nikhil-thomas/pipeline that referenced this issue Aug 25, 2021

Merge pull request tektoncd#734 from nikhil-thomas/fix/ci/skip__works…

8074e05

…pace-duplicate-test Skip TestWorkspacePipelineRunDuplicateWorkspaceEntriesInvalid Test

dibyom mentioned this issue Aug 3, 2022

Fix for ResourceQuotaConflictError #5252

Merged

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Schedule Pods in resource-constrained environments #734

Schedule Pods in resource-constrained environments #734

imjasonh commented Apr 5, 2019

ghost commented Apr 17, 2019

ghost commented Apr 23, 2019

ghost commented Apr 25, 2019

ghost commented Apr 29, 2019

Schedule Pods in resource-constrained environments #734

Schedule Pods in resource-constrained environments #734

Comments

imjasonh commented Apr 5, 2019

Expected Behavior

Actual Behavior

Steps to Reproduce the Problem

Additional Info

ghost commented Apr 17, 2019

ghost commented Apr 23, 2019

ghost commented Apr 25, 2019

ghost commented Apr 29, 2019