Dealing with stray processes remaining after a task ends #260
Comments
I have an idea for a workaround but when attempting to test it I couldn’t even reproduce the condition of a process still running after a task ends. When I use The idea is that Time-based heuristics don’t sound great. An executable that has been removed from the filesystem could be another hint. But ideally, we could run a script after generic-worker finishes each task, before it starts the next one. I didn’t find a configuration key doing that, but I did find
Hi Simon, Indeed setting In theory you should be able to gracefully stop a worker with Another option would be to run the multiuser engine. That is inherently a little safer than the simple engine, since each task runs as a different operating system user and there is a reboot between tasks in any case. Also, the worker's config file, which contains the taskcluster credentials, is unreadable by task users. That would let you run untrusted tasks, e.g. in PRs from forked repos, without risk of exposing taskcluster credentials. There would be a minor efficiency penalty due to worker downtime during OS reboots, but this should hopefully be negligible. If there aren't any technical blockers to running the multiuser engine, that would be my recommended approach. However, if you are happy to write a custom calling script that does the process cleanup, that sounds like a good way to unblock you immediately. If you find a good generic approach that would be suitable in the worker, I'm happy to incorporate it so that custom wrapper scripts are not always required. Let me know how you get on!
Yes, the isolation of multi-user would be good, but it means we’d need to run Homebrew in a separate deploy step. Maybe it’s worth it. For now we have Why does multiuser mode require reboots? The overhead of rebooting after each task might not be negligible: we just recently split our largest test suite into many more tasks in servo/servo#24768, in order to distribute the work more evenly between our fixed number of workers. Rebooting every N tasks (in single-user mode, as a way to clean up remaining processes) could be a compromise. In any case, I’d appreciate some help with:
https://community-tc.services.mozilla.com/provisioners/proj-servo/worker-types/macos uses generic-worker in "simple" mode (not "multiuser") on long-lived macOS workers. Semi-regularly, some tasks start taking much longer than usual because there’s a servo process consuming CPU time, remaining from a task that has long finished.

Could there be a way for generic-worker, when a task finishes, to terminate all processes that were recursively started from the task’s command, even if they are detached?

Alternatively, maybe rebooting occasionally would help. These processes seem rare enough that a reboot every 24 hours would mean there are none most of the time, while adding less delay than rebooting after every task. But rebooting at a fixed time in a cron job would make the currently-running task fail, so it’d be better to somehow coordinate with generic-worker so that it finishes the current task, does not pick up another one, and only then reboots.
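
To make the first idea concrete, here is a minimal sketch of what such a cleanup could look like in a wrapper script. It is not how generic-worker works today; `run_and_reap` and `grace_seconds` are hypothetical names chosen for illustration. It starts the command in its own session so that every descendant inherits the same process group, then signals that whole group once the command exits. A descendant that calls `setsid()` itself leaves the group and would not be caught, so this covers backgrounded processes but not fully daemonized ones.

```python
import os
import signal
import subprocess
import time


def run_and_reap(argv, grace_seconds=2.0):
    """Run argv, then terminate anything left in its process group.

    Hypothetical helper for illustration; not part of generic-worker.
    """
    # start_new_session=True makes the child call setsid(), so it becomes
    # the leader of a new process group whose id equals its pid.
    proc = subprocess.Popen(argv, start_new_session=True)
    returncode = proc.wait()
    pgid = proc.pid

    try:
        # Ask politely first, then force-kill whatever ignored SIGTERM.
        os.killpg(pgid, signal.SIGTERM)
        time.sleep(grace_seconds)
        os.killpg(pgid, signal.SIGKILL)
    except (ProcessLookupError, PermissionError):
        # Nothing left in the group that we can signal.
        pass

    return returncode
```

A task command would then be launched as `run_and_reap(["./mach", "test-unit"])` or similar (the command shown is only an example), and the wrapper would propagate the returned exit code.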
CC @jdm, @petemoore