
Dealing with stray processes remaining after a task ends #260

Open

SimonSapin opened this issue Nov 14, 2019 · 3 comments

@SimonSapin

https://community-tc.services.mozilla.com/provisioners/proj-servo/worker-types/macos uses generic-worker in "simple" mode (not "multiuser") on long-lived macOS workers. Semi-regularly, some tasks start taking much longer than usual because there’s a servo process consuming CPU time, remaining from a task that has long finished.

Could there be a way for generic-worker, when a task finishes, to terminate all processes that were recursively started from the task’s command even if they are detached?

Alternatively, maybe rebooting occasionally would help. These processes seem rare enough that a reboot every 24 hours would mean there are none most of the time, while adding less delay than rebooting after every task. But rebooting at a fixed time in a cron job would make the currently-running task fail, so it’d be better to somehow coordinate with generic-worker to make it finish the current task, not pick up another one, and only then reboot.

CC @jdm, @petemoore

@SimonSapin
Author

I have an idea for a workaround, but when attempting to test it I couldn’t even reproduce the condition of a process still running after a task ends. When I use command &, generic-worker waits for that command to end before resolving the task. When I use nohup command &, I get a nohup: can't detach from console: Inappropriate ioctl for device error:

https://community-tc.services.mozilla.com/tasks/bV5mczUBTqukz-MCvujWQg/runs/0/logs/https%3A%2F%2Fcommunity-tc.services.mozilla.com%2Fapi%2Fqueue%2Fv1%2Ftask%2FbV5mczUBTqukz-MCvujWQg%2Fruns%2F0%2Fartifacts%2Fpublic%2Flogs%2Flive.log


The idea is that ps -eo pid,comm | grep $tasksDir shows the processes we care about. If we can determine that such processes are associated with tasks that are already resolved, we can safely kill them.
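
Something like this is what I have in mind, as an untested sketch (the tasks directory path and the kill/escalate policy are placeholders of mine, and it is only safe to run while no task is in flight):

    #!/bin/bash
    # Sketch: kill any process whose executable path is under the tasks
    # directory. Assumes it runs between tasks, when nothing legitimate
    # should still reference that directory.
    TASKS_DIR="${TASKS_DIR:-/Users/worker/tasks}"   # placeholder path

    # First pass: ask nicely with SIGTERM.
    ps -eo pid,comm | grep -F "$TASKS_DIR" | while read -r pid comm; do
        echo "killing stray process $pid ($comm)"
        kill "$pid" 2>/dev/null
    done

    sleep 5

    # Second pass: force-kill anything that survived.
    ps -eo pid,comm | grep -F "$TASKS_DIR" | awk '{print $1}' | xargs kill -9 2>/dev/null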

Time-based heuristics don’t sound great. An executable that has been removed from the filesystem could be another hint. But ideally, we could run a script after generic-worker finishes each task, before it starts the next one. I didn’t find a configuration key doing that, but I did find numberOfTasksToRun. We could set it to 1 and wrap generic-worker in a script that runs a loop. Then at each iteration we have an opportunity to run some other code while no task is running.
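
Roughly this, untested (the binary and config paths are placeholders, and cleanup-stray-processes.sh would be the cleanup sketch above):

    #!/bin/bash
    # Sketch: with numberOfTasksToRun set to 1 in the config, generic-worker
    # exits after each task, so each loop iteration is a gap with no task
    # running where we can do cleanup.
    while true; do
        /usr/local/bin/generic-worker run --config /etc/generic-worker/config.json
        # No task is running at this point.
        ./cleanup-stray-processes.sh
    done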

@petemoore
Member

Hi Simon,

Indeed, setting numberOfTasksToRun to 1 and running in a loop is a good way to schedule cleanup activities between tasks. NSS actually runs 10 workers on a single mac, and used to have a cron job that ran once per day and then waited for all workers to become idle before rebooting, although I'm not sure if they still do that. For the case of a single worker on a machine, it is of course simpler to organise rebooting on a cadence by setting numberOfTasksToRun in combination with idleTimeoutSecs set to something reasonable (for workers that only execute tasks sporadically).
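
Something along these lines, for illustration only (the 24-hour threshold, paths and reboot command are placeholders, and the wrapper itself would need to be started again at boot, e.g. via launchd):

    #!/bin/bash
    # Sketch: reboot roughly once a day, but only between tasks.
    # Assumes numberOfTasksToRun=1 and a sensible idleTimeoutSecs in the config.
    BOOTED_AT=$(date +%s)
    MAX_UPTIME=$((24 * 60 * 60))   # placeholder 24-hour cadence

    while true; do
        /usr/local/bin/generic-worker run --config /etc/generic-worker/config.json
        if (( $(date +%s) - BOOTED_AT > MAX_UPTIME )); then
            sudo reboot   # no task is running here, so nothing is interrupted
        fi
    done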

In theory you should be able to gracefully stop a worker with kill -2 PID, although I think that may only currently work if the worker is running a task when the kill is issued.

Another option would be to run the multiuser engine. That is inherently a little safer than the simple engine since each task is run as a different operating system user, and there are reboots between all tasks in any case. The worker's config file containing taskcluster credentials is also unreadable by task users, which would enable you to run untrusted tasks (e.g. in PRs from forked repos) without risk of exposing taskcluster credentials. There would be a minor efficiency penalty due to worker downtime during OS reboots, but this should hopefully be quite negligible. I think if there aren't any technical blockers to running the multiuser engine, that would be my recommended approach.

However, if you are happy to write a custom calling script that can do the process cleanup, that sounds like a good way to immediately unblock you, and if you find a good generic approach to doing this that would be suitable in the worker, I'm happy to incorporate it so that custom wrapper scripts are not always required. Let me know how you get on!

@SimonSapin
Author

Yes, the isolation of multiuser would be good, but it means we’d need to run Homebrew in a separate deploy step. Maybe it’s worth it. For now we have /usr/local owned by the (single) user that runs tasks.

Why does multiuser mode require reboots?

The overhead of rebooting after each task might not be negligible: we just recently split our largest test suite into many more tasks (servo/servo#24768) in order to more evenly distribute the work between our fixed number of workers. Rebooting every N tasks (in single-user mode, as a way to clean up remaining processes) could be a compromise. In any case, numberOfTasksToRun + a wrapper script gives us enough control.

I’d appreciate some help with:

  • When a sub-command is detached with command & inside a bash command, do you know why generic-worker waits until it exits to resolve the task?
  • Do you know why nohup fails with that error message?
  • Do you know what conditions could lead to a process still running after a task is resolved?
