Possible bugs in task reaper #2672

Open · talex5 opened this issue Jun 22, 2018 · 7 comments
talex5 (Contributor) commented Jun 22, 2018

We've been seeing some suspicious behaviour of the task reaper (using lots of CPU) and I've been looking at the code to see what might cause it. This issue documents what I've found so far.

Current design

The design documentation for the reaper (https://github.com/docker/swarmkit/blob/master/design/orchestrators.md#task-reaper) says:

As discussed above, restarting a task involves shutting down the old task and starting a new one. If restarts happen frequently, a lot of old tasks that aren't actually running might accumulate.

The task reaper implements configurable garbage collection of these no-longer-running tasks. The number of old tasks to keep per slot or node is controlled by Orchestration.TaskHistoryRetentionLimit in the cluster's ClusterSpec.

The task reaper watches for task creation events, and adds the slots or nodes from these events to a watchlist. It periodically iterates over the watchlist and deletes tasks from referenced slots or nodes which exceed the retention limit. It prefers to delete tasks with the oldest Status timestamps.

A comment in the code adds that:

the TaskReaper watch loop is also responsible for cleaning up tasks
associated with slots that were removed as part of service scale down or
service removal.

What the code does

At startup, we remove:

  • all service-less orphaned tasks, and
  • all reapable tasks with desired-state=remove
    ("reapable" means terminated or not yet assigned to a node; a sketch of this predicate follows below)

After the startup code has run, we start a 250ms timer running and enter the main loop. In the loop, we process events and wait for the timer to fire. When the timer fires, we stop the timer (although presumably it's already stopped) and call tick.

XXX: A comment says Clean up when we hit TaskHistoryRetentionLimit or when the timer expires. I believe it means Clean up when we hit maxDirty or ....

When we get an event in the loop:

  • On CreateTask we mark the task's vslot as dirty (I assume the task we get is the newly created one, not some later version of it).
  • On UpdateTask we mark the task for cleanup using the same logic as in the startup phase (orphaned and no service, or desired-remove and reapable).
  • On UpdateCluster we update our copy of the taskHistory setting.

For any event, we arrange for tick to run, either by resetting the timer to 250ms, or by calling it immediately if len(tr.dirty)+len(tr.cleanup) > maxDirty.

XXX: The code for resetting the timer is wrong, according to the Go docs. You have to stop it first (looks like there's already a PR for this: #2669).
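
For reference, here is a minimal sketch of the stop-then-reset idiom the Go docs recommend. This is generic Go rather than the swarmkit code, and the non-blocking drain assumes the loop may already have received from the channel:

timer := time.NewTimer(250 * time.Millisecond)

// Before reusing the timer, stop it and drain a pending expiry (if any).
// Calling Reset on a timer that may still fire is the misuse #2669 fixes.
if !timer.Stop() {
	select {
	case <-timer.C:
	default:
	}
}
timer.Reset(250 * time.Millisecond)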

The name dirty and the fact that the code calls tick on every event when dirty is large suggest that the code expects tick to remove all the dirty items from the list. In fact, it mostly just leaves them there.

tick deletes all tasks in cleanup and resets it to empty. It also processes all the vslots in dirty:

  • If the vslot's service no longer exists, the vslot is removed from the dirty list (recently fixed by the just-merged #2666, "[orchestrator/task reaper] Clean up tasks in dirty list for which the service has been deleted").

  • If taskHistory < 0 (no cleanup) then we leave it on the dirty list.

  • If the total number of tasks in this vslot is within the taskHistory limit, we leave it on the dirty list.

  • Otherwise, we start going through all the vslot's tasks from oldest to newest, marking tasks for deletion until we're within the limit. We skip tasks that are runnable and desired-running.

XXX: What about desired-ready tasks? We probably shouldn't be deleting them.

XXX: In fact, we probably shouldn't be deleting any running task (even if it's desired to be terminated). The comment says "This check is important to ignore tasks which are running or need to be running" but the code says t.DesiredState == api.TaskStateRunning && t.Status.State <= api.TaskStateRunning (note that this is an and: both conditions must hold for the task to be kept).

  • Finally, if we skipped over fewer than two tasks during this scan, we remove the vslot from the dirty list. XXX: I have no idea why we do this. Is it trying to count the total number of runnable tasks in the slot? But we exit early once we've killed enough tasks, so that won't work if so.

XXX: A comment says "If MaxAttempts is set, keep at least one more than that number of tasks". I would expect this logic to only increase taskHistory, but in fact it might lower it too. That is, if I set TaskHistoryRetentionLimit=100 and create a service with Restart.MaxAttempts=1 then I will get at most 2 retained tasks, I think.
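
Putting the bullets above together, the per-vslot part of tick behaves roughly like this. This is a paraphrase for readability rather than the actual code; sortedOldestFirst, slotTasks, retain, marked and runnable are illustrative names:

runnable, marked := 0, 0
for _, t := range sortedOldestFirst(slotTasks) {
	if len(slotTasks)-marked <= int(retain) {
		break // back within the retention limit for this vslot
	}
	if t.DesiredState == api.TaskStateRunning && t.Status.State <= api.TaskStateRunning {
		runnable++ // skip tasks that are running or expected to run
		continue
	}
	deleteTasks[t.ID] = struct{}{}
	marked++
}
if runnable < 2 {
	delete(tr.dirty, slot) // the puzzling exit criterion noted above
}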

Conclusions

I find this algorithm very strange. It's O(n^2) in the number of vslots. For example, if you create 400 services each with 4 replicas (1600 tasks), this is what happens:

  1. We keep resetting the timer and not scanning anything as long as we create new tasks faster than one per 250ms.
  2. Once we hit 1000 tasks, we run a complete scan of all tasks for each new task added. The dirty set keeps growing, since none of these tasks should be reaped.

Here's a graph of that happening:

[graph: dirty-set size and CPU usage during two runs, each creating and then deleting 400 services]

This shows two runs, each creating 400 services with 4 replicas and then deleting them. It's from before #2666 was merged, which is why the count doesn't reset to zero when the first batch of services goes away. I added a Prometheus metric for the dirty set size. You can see that CPU usage is low until we hit 1000 vslots (at the dotted line), after which the CPU needed to add each task increases steadily. Although the vslots weren't removed after the first run, they don't seem to affect the CPU time much. I guess ignoring vslots with no service is quite fast.

There seems to be no need for this, since a task event can only ever lead us to delete another task in the same vslot. There's no point scanning every task in the system. Possibly tick was supposed to empty the dirty list after processing it, but then the UpdateTask handler would need to re-add the vslot, and that doesn't explain the complicated logic about when to remove it.

XXX: I'm a bit confused about TaskHistoryRetentionLimit. The docs say "The number of old tasks to keep per slot or node is controlled by Orchestration.TaskHistoryRetentionLimit", which makes sense to me, but the code seems to treat it as the total number of tasks (not the number of old tasks). This makes a difference during restarts, for example.

If TaskHistoryRetentionLimit is the number of old tasks, then we can ignore CreateTask events completely and just listen for tasks becoming reapable. When a task becomes reapable, check just that vslot to see if it should push an older task out of the system.

If it's the total number of tasks, then we must also monitor CreateTask, since adding a new runnable task may require deleting an old one.
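
A rough sketch of the first alternative (react only when a task becomes reapable, and touch only that task's own vslot). Everything here is hypothetical, including the handler and helper names:

// Hypothetical handler, called when an UpdateTask event makes a task reapable.
func (tr *TaskReaper) onTaskReapable(t *api.Task) {
	// Old (no-longer-running) tasks of the same vslot, oldest first.
	old := oldTasksInSlot(t) // hypothetical helper
	excess := len(old) - int(tr.taskHistory)
	for i := 0; i < excess; i++ {
		tr.markForDeletion(old[i].ID) // hypothetical helper
	}
}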

Issues noted above for further investigation:

  • Clean up when we hit TaskHistoryRetentionLimit comment.
  • Desired-ready tasks vs desired-running.
  • Reaping running tasks just because they're desired-remove.
  • Mysterious criterion for getting out of the dirty set.
  • Restart.MaxAttempts reducing TaskHistoryRetentionLimit.
  • O(n^2) scanning algorithm, rather than just looking at the slot that changed.
  • Clarify meaning of TaskHistoryRetentionLimit.
dperny (Collaborator) commented Jun 22, 2018

thanks for the analysis @talex5. i'll dig into this in a bit.

anshulpundir (Contributor) commented
FYI I'm already working on the fixes for some of these @dperny

dperny (Collaborator) commented Jun 22, 2018

Solid, I hadn't started.

anshulpundir (Contributor) commented Jun 22, 2018

Comments/questions for the following:

Clean up when we hit TaskHistoryRetentionLimit comment.

Any suggestions? @talex5

Desired-ready tasks vs desired-running.

Can you please point to the code for this? @talex5

Reaping running tasks just because they're desired-remove.

This is correct behavior. Also note that this is not the only condition. It's desired-remove && (t.Status.State < api.TaskStateAssigned || t.Status.State >= api.TaskStateCompleted).
The use case is that the service has already been deleted (that's when tasks get marked desired-remove). @talex5

Mysterious criterion for getting out of the dirty set.

Can you please point to the code for this? @talex5

Restart.MaxAttempts reducing TaskHistoryRetentionLimit.

I couldn't make sense of this either. Any ideas? @dperny

O(n^2) scanning algorithm, rather than just looking at the slot that changed.

Should be fixed by #2675 @talex5

Clarify meaning of TaskHistoryRetentionLimit.

I think this is the total number of tasks to be kept for each slot. @talex5 cc @dperny

talex5 (Contributor, Author) commented Jun 25, 2018

(note: each bullet list item at the end of the issue refers to an XXX point in the main body)

"Clean up when we hit TaskHistoryRetentionLimit" comment.

Any suggestions?

I suggest "Clean up when we hit maxDirty ..."

Desired-ready tasks vs desired-running.

Can you please point to the code for this?

https://github.com/docker/swarmkit/blob/58764805d83cc0f5290bfa6ba748898ee5ff87d5/manager/orchestrator/taskreaper/task_reaper.go#L262

The code is:

for _, t := range historicTasks {
	// Skip tasks which are desired to be running but the current state
	// is less than or equal to running.
	// This check is important to ignore tasks which are running or need to be running,
	// but to delete tasks which are either past running,
	// or have not reached running but need to be shutdown (because of a service update, for example).
	if t.DesiredState == api.TaskStateRunning && t.Status.State <= api.TaskStateRunning {
		// Don't delete running tasks
		runningTasks++
		continue
	}
	deleteTasks[t.ID] = struct{}{}
        ...

e.g. if a task is (desired=ready,state=ready) then we probably don't want to reap it.

Reaping running tasks just because they're desired-remove.

This is correct behavior. Also note that this is not the only condition. It's desired-remove && (t.Status.State < api.TaskStateAssigned || t.Status.State >= api.TaskStateCompleted).
The use case is that the service has already been deleted (that's when tasks get marked desired-remove).

I think you're looking at a different bit of code. The code I'm concerned about is the bit I quoted above. The logic there is:

if desired=running AND state <= running then don't delete else delete

For example, if a task is (desired=complete, state=running) then this code will delete it from SwarmKit, causing SwarmKit to believe that its resources are no longer in use, which is not correct.
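
A sketch of the stricter skip condition being argued for here (not existing swarmkit code): keep a task if it is still expected to run (desired-ready or desired-running and not yet past running), or if it is still actually running regardless of its desired state:

stillWanted := t.DesiredState <= api.TaskStateRunning && t.Status.State <= api.TaskStateRunning
stillRunning := t.Status.State == api.TaskStateRunning
if stillWanted || stillRunning {
	// Keep the task; count it like the existing runningTasks check does.
	runningTasks++
	continue
}
deleteTasks[t.ID] = struct{}{}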

Mysterious criterion for getting out of the dirty set.

Can you please point to the code this ?

Well, it's really the whole file. The code seems very confused about what is supposed to be in the dirty list. It needs to be documented.

cyli (Contributor) commented Jun 27, 2018

XXX: A comment says "If MaxAttempts is set, keep at least one more than that number of tasks". I would expect this logic to only increase taskHistory, but in fact it might lower it too. That is, if I set TaskHistoryRetentionLimit=100 and create a service with Restart.MaxAttempts=1 then I will get at most 2 retained tasks, I think.

Agreed, I think it should only increase the task history, not decrease it.

If taskHistory < 0 (no cleanup) then we leave it on the dirty list.

It seems like if taskHistory < 0 then the dirty list just grows to the total number of slots in the system, which seems unnecessary. I'm wondering if we should just not bother adding to the dirty list unless TaskHistoryRetentionLimit is set to > 0. This unfortunately means that once that value changes, we may have to scan the entire store again to rebuild the list of dirty slots we keep track of.

But we don't do that when starting up the task reaper from scratch - if there's a change in leadership, it might be possible that we lose track of tasks to reap until another task is created for that particular slot. If that's OK, then maybe it's fine that we don't bother adding things to dirty until TaskHistoryRetentionLimit > 0.
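
A sketch of that guard (hypothetical; tr.taskHistory stands for the reaper's cached copy of TaskHistoryRetentionLimit):

// Only track vslots when retention is actually enabled; otherwise the dirty
// set just grows to the number of slots in the cluster with nothing to reap.
if tr.taskHistory > 0 {
	tr.dirty[vslot] = struct{}{}
}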

olljanat (Contributor) commented
Are there plans to continue with these?
Note that there is also the tasks.db growth issue #2367, which is most probably related.
