Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The timeoutHandler is only instructed to wait when it creates pods #2905

Closed
mattmoor opened this issue Jul 6, 2020 · 20 comments · Fixed by #3500
Closed

The timeoutHandler is only instructed to wait when it creates pods #2905

mattmoor opened this issue Jul 6, 2020 · 20 comments · Fixed by #3500
Assignees
Labels
kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt.

Comments

@mattmoor
Copy link
Member

mattmoor commented Jul 6, 2020

/kind bug

Consider this logic in taskrun.go:

		pod, err = c.createPod(ctx, tr, rtr)
		if err != nil {
			newErr := c.handlePodCreationError(ctx, tr, err)
			logger.Error("Failed to create task run pod for task %q: %v", tr.Name, newErr)
			return newErr
		}
		go c.timeoutHandler.WaitTaskRun(tr, tr.Status.StartTime)

If after creating the Pod the controller process is restarted (e.g. to rollout a new controller binary) then the timeout handler won't properly timeout task runs straddling this event because the controller's resync won't trigger the go routine.

The pattern we have used successfully in Knative is to have the library support idempotent invocation, so that WaitTaskRun can be called on every resync (as appropriate) and simply have it elide redundant invocations. This is something that would shake out pretty quickly with chaos testing (knative/pkg#1333 👀 ).

@mattmoor mattmoor added the kind/bug Categorizes issue or PR as related to a bug. label Jul 6, 2020
@vdemeester vdemeester added this to the Pipelines v0.15 milestone Jul 7, 2020
@bobcatfish bobcatfish self-assigned this Jul 13, 2020
@bobcatfish
Copy link
Collaborator

I was very excited to fix this bug but I'm actually no longer sure it's a bug:

When the controllers start, they call CheckTimeouts (pipelinerun, taskrun (they probably dont BOTH need to call it but oh well XD) which looks for taskruns and pipelineruns and starts the waiting go routine or times them out as needed

Looks like this was originally addressed in #456

That being said, having WaitTaskRun be idempotent makes a lot of sense and it's definitely not obvious that this is happening, so I think there's some improvement to make here, but I don't think it's a bug.

@bobcatfish bobcatfish added kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. and removed kind/bug Categorizes issue or PR as related to a bug. labels Jul 24, 2020
@bobcatfish bobcatfish removed this from the Pipelines v0.15 milestone Jul 24, 2020
@bobcatfish
Copy link
Collaborator

Thinking about this a bit more, even making the call idempotent won't handle the issue being described:

If after creating the Pod the controller process is restarted (e.g. to rollout a new controller binary) then the timeout handler won't properly timeout task runs straddling this event because the controller's resync won't trigger the go routine.

The call to WaitTaskRun wouldn't happen unless the reconcile loop occurs: will a newly started controller attempt to reconcile all instances of the types it is interested in, or will it only be invoked when a change occurs - do you know @mattmoor ?

If it's only when the instance changes, we'll still need to query for them all and check them on start up. If the reconciler tries to reconcile all types it is interested in on start up, then the solution you describe would work.

bobcatfish added a commit to bobcatfish/pipeline that referenced this issue Jul 24, 2020
While investigating tektoncd#2905, I struggled to understand how the timeout
handling works, especially with TimeoutSet having very little comments,
so I've added some. I didn't add anything for backoffs yet because I'm
hoping we can separate that into a separate structure since it has a
very specific purpose that doesn't generalize to all timeouts.

Also changed the name "finished" to consistently use "done" so the
reader doesn't have to wonder about the difference between "finished"
and "done" (there isn't one)
@mattmoor
Copy link
Member Author

will a newly started controller attempt to reconcile all instances of the types it is interested in, or will it only be invoked when a change occurs

Yes, we also do this when a controller becomes the leader of a particular bucket of the key space.

@bobcatfish
Copy link
Collaborator

Yes, we also do this when a controller becomes the leader of a particular bucket of the key space.

oh great!! @mattmoor are you able to point me toward the code where that happens?

@mattmoor
Copy link
Member Author

When informers are populating their cache they call OnAdd for each key as it is being read the first time, so your informer event hooks will queue stuff in the workqueue and the controller will process it.

tekton-robot pushed a commit that referenced this issue Jul 25, 2020
While investigating #2905, I struggled to understand how the timeout
handling works, especially with TimeoutSet having very little comments,
so I've added some. I didn't add anything for backoffs yet because I'm
hoping we can separate that into a separate structure since it has a
very specific purpose that doesn't generalize to all timeouts.

Also changed the name "finished" to consistently use "done" so the
reader doesn't have to wonder about the difference between "finished"
and "done" (there isn't one)
@bobcatfish
Copy link
Collaborator

When informers are populating their cache they call OnAdd for each key as it is being read the first time, so your informer event hooks will queue stuff in the workqueue and the controller will process it.

do you have a link to the code where this happens? (or some docs?) i can go looking for it if not - basically if we're going to remove our attempt to manually reconcile all the timeouts, which seems totally unnecessary now, i want to be able to point at the justification for why we don't need it anymore

bobcatfish added a commit to bobcatfish/pipeline that referenced this issue Jul 29, 2020
The timeout handler was trying to do two different jobs:
- Track timeouts for executing TaskRuns and PipelineRuns; when they
  timeout, they should be re-reconciled, at which point they would
  fail b/c they took to long
- Track backoffs when TaskRun pod creation fails due to resource quota
  issues. In this case we want to retry pod creation (by re-reconciling)
  after a backoff in case the resource issue has been resolved.

The logic for these 2 jobs was combined because they both share at their
heart very similar logic: the logic for creating go routines that wait
until a period of time has passed.

Now these two jobs are done by 2 different structs, which both use the
new Timer struct and logic, so they can share this logic, without having
to mix the code on top.

Future things we can do to make this even better:
- Add more test coverage for the new structs
- The "Handler" struct (which may have too generic a name at this point)
  probably doesn't need to have separate logic for pipelineruns +
  taskruns since each of these controllers will instanitate it
  separately
- As discussed in tektoncd#2905, knative/pkg's controller logic will attempt
  to re-reconcile all the Runs on start up, so we probably don't need
  to be explicitly doing this as well
bobcatfish added a commit to bobcatfish/pipeline that referenced this issue Jul 30, 2020
The timeout handler was trying to do two different jobs:
- Track timeouts for executing TaskRuns and PipelineRuns; when they
  timeout, they should be re-reconciled, at which point they would
  fail b/c they took to long
- Track backoffs when TaskRun pod creation fails due to resource quota
  issues. In this case we want to retry pod creation (by re-reconciling)
  after a backoff in case the resource issue has been resolved.

The logic for these 2 jobs was combined because they both share at their
heart very similar logic: the logic for creating go routines that wait
until a period of time has passed.

Now these two jobs are done by 2 different structs, which both use the
new Timer struct and logic, so they can share this logic, without having
to mix the code on top.

Future things we can do to make this even better:
- Add more test coverage for the new structs
- The "Handler" struct (which may have too generic a name at this point)
  probably doesn't need to have separate logic for pipelineruns +
  taskruns since each of these controllers will instanitate it
  separately
- As discussed in tektoncd#2905, knative/pkg's controller logic will attempt
  to re-reconcile all the Runs on start up, so we probably don't need
  to be explicitly doing this as well
@afrittoli afrittoli added this to the Pipelines v0.16 milestone Aug 10, 2020
@bobcatfish
Copy link
Collaborator

I think @imjasonh 's suggestion is ultimately the best way to go here: #3078 (review)

I think there's a chance we can simplify a lot of this by using impl.EnqueueAfter to enqueue a reconciliation in the future when the timeout should be elapsed

I wish I could remember why we didn't do it originally. I'm guessing that either we didn't realize we could (most likely?) or we were avoiding an extra reconcile (doesnt seem worth the complexity)

We added the timeout logic as it currently stands in #621 to explicitly NOT rely on reconciling - because our first version relied on re-reconciling every 30s and catching a timeout that way. I'm not sure if we didn't think of timing a future re-reconcile or if the functionality didn't exist at the time?

Anyway this seems like the way to go!

@mattmoor
Copy link
Member Author

mattmoor commented Sep 9, 2020

do you have a link to the code where this happens?

Sorry, this fell off my radar. I'm not sure what code ptr you are looking for, but it's probably deep in the informer/cache/lister code in k8s.io/client-go.

our first version relied on re-reconciling every 30s and catching a timeout that way

Woof, this won't scale. Generally here I'd recommend designing an idempotent library that moves this work off the main Reconciler thread. The Reconciler thread can call into the idempotent non-blocking library to access the result. The library should take a callback to enqueue a key when it thinks the state has changed. We use this pattern fairly successfully for our network prober in Knative (also workqueue based): https://github.com/knative/networking/blob/1b3a7517e759e94d813eb5440194413e8e2c603f/pkg/status/status.go#L131

@bobcatfish
Copy link
Collaborator

@mattmoor what do you think about @imjasonh 's suggestion to call impl.EnqueueAfter? When a TaskRun or PipelineRuns starts we can do something like:

  1. Reconcile, start pods, etc.
  2. Call EnqueueAfter to trigger a re-reconcile after the timeout expires
  3. At this point either:
    a. the TaskRun or PipelineRun will complete successfully: when the EnqueueAfter expires, it will re-reconcile but see there is nothing to do
    b. the TaskRun or PipelineRun will be still be running when EnqueueAfter expires, it will re-reconciler, see that the timeout has occurred, and stop it

@mattmoor
Copy link
Member Author

mattmoor commented Sep 9, 2020

tl;dr 👍

I'd probably still encapsulate that in a library to avoid having the Reconciler need to know about enqueue functions directly. It's generally better for testing and whatnot too.

You can probably rely on the workqueue itself to make it idempotent, but I think the key thing is that every reconcile (not just "created") should call into that library to avoid the workqueue becoming SPOF state.

@bobcatfish
Copy link
Collaborator

I think the key thing is that every reconcile (not just "created") should call into that library to avoid the workqueue becoming SPOF state.

hmm it seems like something is still going to have to know if items have been enqueued or not 🤔

@mattmoor
Copy link
Member Author

mattmoor commented Sep 9, 2020

Not if it's idempotent, then it can blindly add it and rely on the workqueue to dedup 😉

@bobcatfish
Copy link
Collaborator

Not if it's idempotent, then it can blindly add it and rely on the workqueue to dedup 😉

It would only be idempotent if we know the workqueue will dedupe, right?

There are 2 things ill want to track down and verify before relying on this:

  1. That we can guarantee everything will be re-reconciled on startup
  2. That the workqueue will dedupe (im not sure it would?? how would it know you didnt mean to enqueue twice - esp if there was a mistake and the After was off by milliseconds or something 🤔 )

@mattmoor
Copy link
Member Author

Syncing the world on startup (or when elected leader) is pretty fundamental, so if you don't see this it is a bug and we should drill in.

The workqueue will deduplicate any key that isn't in-flight; which is key to reacting properly to informer events mid-reconcile. I'm not sure how the workqueue handles different delays, but for simple enqueues it should be fine and I'd sort of expect it to deduplicate with the smallest delay.

@bobcatfish
Copy link
Collaborator

hmmm so it sounds like we'd be in a situation something like this:

  1. task run created by user
  2. task run added to workqueue
  3. pipeline controller starts reconciling task run
  4. pipeline controller calls EnqueueAfter for timeout
  5. pipeline controller finishes reconcile, creates pod, etc....
    ... later....
  6. something changes, e.g. pod status updated
  7. task run added to workqueue (at this point, the existing enqueue casued by EnqueueAfter is deduped, and removed)
  8. pipeline controller starts reconciling task run
  9. pipeline controller calls EnqueueAfter for timeout
  10. pipeline controller finishes reconcile
    ... etc ...

i.e. every time the controller reconciles the TaskRun/PipelineRun, it needs to call impl.EnqueueAfter because any previous calls are moot

is that right? that seems less reliable to me than our current method i.e. starting a goroutine and relying on that - if we ever failed to call impl.EnqueueAfter correctly on a re-reconcile we'd lose the timeouts completely

@mattmoor
Copy link
Member Author

Right, because imagine if 6. is <controller crashes and a new controller picks it up>. If during that Reconcile it identifies that it should be processed at time=X to handle timing things out, then it should blindly call EnqueueAfter.

If the controller is in a good steady state, then the idempotency will make it a nop. If it isn't, then it just saved your bacon.

@bobcatfish bobcatfish removed this from the Pipelines v0.17 milestone Sep 21, 2020
@bobcatfish bobcatfish assigned imjasonh and unassigned bobcatfish Sep 21, 2020
@bobcatfish
Copy link
Collaborator

@imjasonh kindly volunteered to pick this up with the goal to update it with enough information that we could put a "good first issue" label on it

@imjasonh
Copy link
Member

I'll start with a summary of the discussion so far, please anybody hop in with more questions/clarifications/links to source as you see fit.

The current thinking (correct me if I'm wrong!) is that we can use EnqueueAfter to schedule a reconciliation at a point in the future when the run should be complete, and have reconciliation logic simply 1.) check whether the timeout is elapsed, and if so trigger existing timeout behavior (update status, delete pod, etc.), and 2.) exit early if a run IsDone since there's nothing to reconcile in that case. (2) should always be done anyway, since reconciliations for done runs can happen already today (e.g., on startup).

Relying on EnqueueAfter does introduce a new reconciliation for every run, but since the vast majority of those will be no-ops to reconcile already-completed runs, this shouldn't be too painful.

This also makes us reliant on the specific behavior of the workqueue, which we'd like to understand better. Rather than spawning a goroutine in the reconciler/controller code, we'd be offloading that "remind me later" to the external workqueue. This means that if we're behind on processing the workqueue, timeouts might be enforced later than they otherwise should be, and later than they might be if we retain our goroutine-based timeout enforcement.

However, offloading to the workqueue should make us more reliable than we are today in cases when the controller crashes or gets a normal update, since the state of "I am waiting to check on a run's timeout" is encoded in the deferred workqueue item, and not in a spawned-but-waiting goroutine. It also means we can delete our own code and streamline this area of the reconciler.

@mattmoor @bobcatfish Is there any more context I'm missing? Next I'll outline the code changes we'd make to move to EnqueueAfter, and mark this as a [good first issue].

@mattmoor
Copy link
Member Author

mattmoor commented Nov 5, 2020

Alrighty, so this is the root cause of at least some of the timeout_test flakes I have been chasing.

From a recent failure's logstream:

    kubelogs.go:197: I 03:38:39.163 controlplane-2 [timeout/handler.go:278] [arendelle-652kd/task-run-timeout-xovuhquv] timer for "arendelle-652kd/task-run-timeout-xovuhquv" has activated after 26.852349086s
    kubelogs.go:197: I 03:38:39.357 controlplane-2 [taskrun/taskrun.go:351] [arendelle-652kd/task-run-timeout-xovuhquv] Cloud Events: []
    kubelogs.go:197: I 03:38:39.564 controlplane-2 [taskrun/taskrun.go:441] [arendelle-652kd/task-run-timeout-xovuhquv] Successfully reconciled taskrun task-run-timeout-xovuhquv/arendelle-652kd with status: &apis.Condition{Type:"Succeeded", Status:"Unknown", Severity:"", LastTransitionTime:apis.VolatileTime{Inner:v1.Time{Time:time.Time{wall:0x0, ext:63740144300, loc:(*time.Location)(0x3722a00)}}}, Reason:"Running", Message:"Not all Steps in the Task have finished executing"}

Looking at the log line from the timeout handler a bit closer... This log statement is prior to calling the callback to enqueue a taskrun that we expect to fail with a timeout, and this test sets a 30s timeout, so the start time should be 03:38:09, but it is actually 03:38:15.

So the callback is executed ~6 second prematurely, and because the non-timeout reconcile doesn't do an idempotent requeue, handling of timeouts for this key is permanently dropped on the floor.

@mattmoor
Copy link
Member Author

mattmoor commented Nov 5, 2020

I really think that the bulk of this logic could simply be handled by something along the lines of this defer (in taskrun.go):

	// Check if the TaskRun has timed out; if it is, this will set its status
	// accordingly.
	if tr.HasTimedOut() {
		message := fmt.Sprintf("TaskRun %q failed to finish within %q", tr.Name, tr.GetTimeout())
		err := c.failTaskRun(ctx, tr, v1beta1.TaskRunReasonTimedOut, message)
		return c.finishReconcileUpdateEmitEvents(ctx, tr, before, err)
	}
	defer func() {
		if tr.Status.StartTime == nil || tr.Spec.Timeout == nil {
			return
		}
		// Compute the time since the task started.
		elapsed := time.Since(tr.Status.StartTime.Time)
		// enqueue this taskrun's key after that long.
		enqueueAfter(tr, tr.Spec.Timeout.Duration.Sub(elapsed))
	}()

There is some logic around backoff that is probably better rationalized in terms of changes to how "StartTime" is set (since if there are pod creation errors, it hasn't really started to effect timeout handling IMO).

PipelineRun could probably get away with something similar, though I haven't cracked that code today to look.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants