Possible bugs in task reaper #2672

Open · talex5 opened this issue Jun 22, 2018 · 7 comments
talex5 (Contributor) commented Jun 22, 2018

We've been seeing some suspicious behaviour of the task reaper (using lots of CPU) and I've been looking at the code to see what might cause it. This issue documents what I've found so far.

Current design

The design documentation for the reaper (https://github.com/docker/swarmkit/blob/master/design/orchestrators.md#task-reaper) says:

As discussed above, restarting a task involves shutting down the old task and starting a new one. If restarts happen frequently, a lot of old tasks that aren't actually running might accumulate.

The task reaper implements configurable garbage collection of these no-longer-running tasks. The number of old tasks to keep per slot or node is controlled by Orchestration.TaskHistoryRetentionLimit in the cluster's ClusterSpec.

The task reaper watches for task creation events, and adds the slots or nodes from these events to a watchlist. It periodically iterates over the watchlist and deletes tasks from referenced slots or nodes which exceed the retention limit. It prefers to delete tasks with the oldest Status timestamps.

A comment in the code adds that:

the TaskReaper watch loop is also responsible for cleaning up tasks
associated with slots that were removed as part of service scale down or
service removal.

What the code does

At startup, we remove:

  • all service-less orphaned tasks, and
  • all reapable tasks with desired-state=remove
    ("reapable" means terminated or not yet assigned to a node; a sketch of this predicate follows below)

After the startup code has run, we start a 250ms timer running and enter the main loop. In the loop, we process events and wait for the timer to fire. When the timer fires, we stop the timer (although presumably it's already stopped) and call tick.

XXX: A comment says Clean up when we hit TaskHistoryRetentionLimit or when the timer expires. I believe it means Clean up when we hit maxDirty or ....

When we get an event in the loop:

  • On CreateTask we mark the task's vslot as dirty (I assume the task we get is the newly created one, not some later version of it).
  • On UpdateTask we mark the task for cleanup using the same logic as in the startup phase (orphaned and no service, or desired-remove and reapable).
  • On UpdateCluster we update our copy of the taskHistory setting.

For any event, we arrange for tick to run, either by resetting the timer to 250ms, or by calling it immediately if len(tr.dirty)+len(tr.cleanup) > maxDirty.

XXX: The code for resetting the timer is wrong, according to the Go docs. You have to stop it first (looks like there's already a PR for this: #2669).
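
For reference, here is a minimal sketch of the stop-then-reset idiom the Go docs recommend. This is generic Go rather than the swarmkit code, and the non-blocking drain assumes the loop may already have received from the channel:

timer := time.NewTimer(250 * time.Millisecond)

// Before reusing the timer, stop it and drain a pending expiry (if any).
// Calling Reset on a timer that may still fire is the misuse #2669 fixes.
if !timer.Stop() {
	select {
	case <-timer.C:
	default:
	}
}
timer.Reset(250 * time.Millisecond)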

The name dirty and the fact that the code calls tick on every event when dirty is large suggest that the code expects tick to remove all the dirty items from the list. In fact, it mostly just leaves them there.

tick deletes all tasks in cleanup and resets it to empty. It also processes all the vslots in dirty:

  • If the vslot's service no longer exists, the vslot is removed from the dirty list (recently fixed by the just-merged #2666, "[orchestrator/task reaper] Clean up tasks in dirty list for which the service has been deleted").

  • If taskHistory < 0 (no cleanup) then we leave it on the dirty list.

  • If the total number of tasks in this vslot is within the taskHistory limit, we leave it on the dirty list.

  • Otherwise, we start going through all the vslot's tasks from oldest to newest, marking tasks for deletion until we're within the limit. We skip tasks that are runnable and desired-running.

XXX: What about desired-ready tasks? We probably shouldn't be deleting them.

XXX: In fact, we probably shouldn't be deleting any running task (even if it's desired to be terminated). The comment says "This check is important to ignore tasks which are running or need to be running" but the code says t.DesiredState == api.TaskStateRunning && t.Status.State <= api.TaskStateRunning (note that this is an and: both conditions must hold for the task to be kept).

  • Finally, if we skipped over fewer than two tasks during this scan, we remove the vslot from the dirty list. XXX: I have no idea why we do this. Is it trying to count the total number of runnable tasks in the slot? But we exit early once we've killed enough tasks, so that won't work if so.

XXX: A comment says "If MaxAttempts is set, keep at least one more than that number of tasks". I would expect this logic to only increase taskHistory, but in fact it might lower it too. That is, if I set TaskHistoryRetentionLimit=100 and create a service with Restart.MaxAttempts=1 then I will get at most 2 retained tasks, I think.
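
Putting the bullets above together, the per-vslot part of tick behaves roughly like this. This is a paraphrase for readability rather than the actual code; sortedOldestFirst, slotTasks, retain, marked and runnable are illustrative names:

runnable, marked := 0, 0
for _, t := range sortedOldestFirst(slotTasks) {
	if len(slotTasks)-marked <= int(retain) {
		break // back within the retention limit for this vslot
	}
	if t.DesiredState == api.TaskStateRunning && t.Status.State <= api.TaskStateRunning {
		runnable++ // skip tasks that are running or expected to run
		continue
	}
	deleteTasks[t.ID] = struct{}{}
	marked++
}
if runnable < 2 {
	delete(tr.dirty, slot) // the puzzling exit criterion noted above
}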

Conclusions

I find this algorithm very strange. It's O(n^2) in the number of vslots. For example, if you create 400 services each with 4 replicas (1600 tasks), this is what happens:

  1. We keep resetting the timer and not scanning anything as long as we create new tasks faster than one per 250ms.
  2. Once we hit 1000 tasks, we run a complete scan of all tasks for each new task added. The dirty set keeps growing, since none of these tasks should be reaped.

Here's a graph of that happening:

[graph: dirty-set size and CPU usage during two runs, each creating and then deleting 400 services]

This shows two runs, each creating 400 services with 4 replicas and then deleting them. It's from before #2666 was merged, which is why the count doesn't reset to zero when the first batch of services goes away. I added a Prometheus metric for the dirty set size. You can see that CPU usage is low until we hit 1000 vslots (at the dotted line), after which the CPU needed to add each task increases steadily. Although the vslots weren't removed after the first run, they don't seem to affect the CPU time much. I guess ignoring vslots with no service is quite fast.

There seems to be no need for this, since a task event can only ever lead us to delete another task in the same vslot. There's no point scanning every task in the system. Possibly tick was supposed to empty the dirty list after processing it, but then the UpdateTask handler would need to re-add the vslot, and that doesn't explain the complicated logic about when to remove it.

XXX: I'm a bit confused about TaskHistoryRetentionLimit. The docs say "The number of old tasks to keep per slot or node is controlled by Orchestration.TaskHistoryRetentionLimit", which makes sense to me, but the code seems to treat it as the total number of tasks (not the number of old tasks). This makes a difference during restarts, for example.

If TaskHistoryRetentionLimit is the number of old tasks, then we can ignore CreateTask events completely and just listen for tasks becoming reapable. When a task becomes reapable, check just that vslot to see if it should push an older task out of the system.

If it's the total number of tasks, then we must also monitor CreateTask, since adding a new runnable task may require deleting an old one.
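
A rough sketch of the first alternative (react only when a task becomes reapable, and touch only that task's own vslot). Everything here is hypothetical, including the handler and helper names:

// Hypothetical handler, called when an UpdateTask event makes a task reapable.
func (tr *TaskReaper) onTaskReapable(t *api.Task) {
	// Old (no-longer-running) tasks of the same vslot, oldest first.
	old := oldTasksInSlot(t) // hypothetical helper
	excess := len(old) - int(tr.taskHistory)
	for i := 0; i < excess; i++ {
		tr.markForDeletion(old[i].ID) // hypothetical helper
	}
}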

Issues noted above for further investigation:

  • Clean up when we hit TaskHistoryRetentionLimit comment.
  • Desired-ready tasks vs desired-running.
  • Reaping running tasks just because they're desired-remove.
  • Mysterious criterion for getting out of the dirty set.
  • Restart.MaxAttempts reducing TaskHistoryRetentionLimit.
  • O(n^2) scanning algorithm, rather than just looking at the slot that changed.
  • Clarify meaning of TaskHistoryRetentionLimit.
dperny (Collaborator) commented Jun 22, 2018

thanks for the analysis @talex5. i'll dig into this in a bit.

anshulpundir (Contributor) commented
FYI I'm already working on the fixes for some of these @dperny

dperny (Collaborator) commented Jun 22, 2018

Solid, I hadn't started.

anshulpundir (Contributor) commented Jun 22, 2018

Comments/questions for the following:

Clean up when we hit TaskHistoryRetentionLimit comment.

Any suggestions? @talex5

Desired-ready tasks vs desired-running.

Can you please point to the code for this? @talex5

Reaping running tasks just because they're desired-remove.

This is correct behavior. Also note that this is not the only condition. It's desired-remove && (t.Status.State < api.TaskStateAssigned || t.Status.State >= api.TaskStateCompleted).
The use case is that the service has already been deleted (that's when tasks get marked desired-remove). @talex5

Mysterious criterion for getting out of the dirty set.

Can you please point to the code for this? @talex5

Restart.MaxAttempts reducing TaskHistoryRetentionLimit.

I couldn't make sense of this either. Any ideas? @dperny

O(n^2) scanning algorithm, rather than just looking at the slot that changed.

Should be fixed by #2675 @talex5

Clarify meaning of TaskHistoryRetentionLimit.

I think this is the total number of tasks to be kept for each slot. @talex5 cc @dperny

talex5 (Contributor, Author) commented Jun 25, 2018

(note: each bullet list item at the end of the issue refers to an XXX point in the main body)

"Clean up when we hit TaskHistoryRetentionLimit" comment.

Any suggestions?

I suggest "Clean up when we hit maxDirty ..."

Desired-ready tasks vs desired-running.

Can you please point to the code for this?

https://github.com/docker/swarmkit/blob/58764805d83cc0f5290bfa6ba748898ee5ff87d5/manager/orchestrator/taskreaper/task_reaper.go#L262

The code is:

for _, t := range historicTasks {
	// Skip tasks which are desired to be running but the current state
	// is less than or equal to running.
	// This check is important to ignore tasks which are running or need to be running,
	// but to delete tasks which are either past running,
	// or have not reached running but need to be shutdown (because of a service update, for example).
	if t.DesiredState == api.TaskStateRunning && t.Status.State <= api.TaskStateRunning {
		// Don't delete running tasks
		runningTasks++
		continue
	}
	deleteTasks[t.ID] = struct{}{}
        ...

e.g. if a task is (desired=ready,state=ready) then we probably don't want to reap it.

Reaping running tasks just because they're desired-remove.

This is correct behavior. Also note that this is not the only condition. It's desired-remove && (t.Status.State < api.TaskStateAssigned || t.Status.State >= api.TaskStateCompleted).
The use case is that the service has already been deleted (that's when tasks get marked desired-remove).

I think you're looking at a different bit of code. The code I'm concerned about is the bit I quoted above. The logic there is:

if desired=running AND state <= running then don't delete else delete

For example, if a task is (desired=complete, state=running) then this code will delete it from SwarmKit, causing SwarmKit to believe that its resources are no longer in use, which is not correct.
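
A sketch of the stricter skip condition being argued for here (not existing swarmkit code): keep a task if it is still expected to run (desired-ready or desired-running and not yet past running), or if it is still actually running regardless of its desired state:

stillWanted := t.DesiredState <= api.TaskStateRunning && t.Status.State <= api.TaskStateRunning
stillRunning := t.Status.State == api.TaskStateRunning
if stillWanted || stillRunning {
	// Keep the task; count it like the existing runningTasks check does.
	runningTasks++
	continue
}
deleteTasks[t.ID] = struct{}{}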

Mysterious criterion for getting out of the dirty set.

Can you please point to the code this ?

Well, it's really the whole file. The code seems very confused about what is supposed to be in the dirty list. It needs to be documented.

cyli (Contributor) commented Jun 27, 2018

XXX: A comment says "If MaxAttempts is set, keep at least one more than that number of tasks". I would expect this logic to only increase taskHistory, but in fact it might lower it too. That is, if I set TaskHistoryRetentionLimit=100 and create a service with Restart.MaxAttempts=1 then I will get at most 2 retained tasks, I think.

Agreed, I think it should only increase the task history, not decrease it.

If taskHistory < 0 (no cleanup) then we leave it on the dirty list.

It seems like if taskHistory < 0 then the dirty list just grows to the total number of slots in the system, which seems unnecessary. I'm wondering if we should just not bother adding to the dirty list unless TaskHistoryRetentionLimit is set to > 0. This unfortunately means that once that value changes, we may have to scan the entire store again to rebuild the list of dirty slots we keep track of.

But we don't do that when starting up the task reaper from scratch - if there's a change in leadership, it might be possible that we lose track of tasks to reap until another task is created for that particular slot. If that's OK, then maybe it's fine that we don't bother adding things to dirty until TaskHistoryRetentionLimit > 0.
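
A sketch of that guard (hypothetical; tr.taskHistory stands for the reaper's cached copy of TaskHistoryRetentionLimit):

// Only track vslots when retention is actually enabled; otherwise the dirty
// set just grows to the number of slots in the cluster with nothing to reap.
if tr.taskHistory > 0 {
	tr.dirty[vslot] = struct{}{}
}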

olljanat (Contributor) commented
Are there plans to continue with these?
Note that there is also the tasks.db growth issue #2367, which is most probably related.
