do not lock up addprocs on worker setup errors #32290
Conversation
This is a corollary to the previous commit in JuliaLang#32290 and implements suggestions made there. It restricts the master to waiting for a worker to respond within `Distributed.worker_timeout()` seconds; beyond that it releases the lock on `rr_ntfy_join` with a special flag `:TIMEDOUT`. The flag is set to `:ERROR` in case of any error during worker setup, and to `:OK` when the master receives a `JoinCompleteMsg` indicating setup completion from the worker. `addprocs` includes a worker id in the list of workers it added only if it has received a `JoinCompleteMsg`, that is, only when `rr_ntfy_join` contains `:OK`. Note that the worker process may not be dead yet, and it may still be listed in `workers()` until it actually goes down.
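The wait-and-flag mechanism described above can be sketched roughly as follows. This is a simplified illustration, not the actual Distributed.jl internals: `rr_ntfy_join` is modeled here as a plain `Channel`, and the function name `wait_for_join` is hypothetical; only `Distributed.worker_timeout()` and the flag values come from the PR.

```julia
using Distributed

# Hypothetical sketch of the master-side wait. The real implementation
# uses Distributed's internal notification object, not a Channel.
function wait_for_join(rr_ntfy_join::Channel{Symbol})
    timeout = Distributed.worker_timeout()  # seconds, from JULIA_WORKER_TIMEOUT
    timer = Timer(timeout) do _
        # release the waiter with a special flag if the worker never responds
        isready(rr_ntfy_join) || put!(rr_ntfy_join, :TIMEDOUT)
    end
    flag = take!(rr_ntfy_join)   # :OK, :ERROR, or :TIMEDOUT
    close(timer)
    return flag
end
```

Under this scheme, a worker id would be appended to the result of `addprocs` only when the flag is `:OK`.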
…c on worker setup errors
I just realized that with this PR, now there is a difference in the way that we handle errors during initial master-worker connection setup and errors during the handshake or later (while Should we just
In any case the fact that
Needs docs and further discussions before merging.
Agree. I would wait a while for further opinions/discussions to conclude before updating docs.
Not sure why CI is failing, though tests pass on my local machine. Will investigate this.
Putting this up for discussion: shouldn't worker connect or host/port read errors be converted to warnings and ignored too, i.e., when we cannot connect to some newly launched workers? For example, the errors thrown here and here and tested here.
Looks like connect to a non-routable IP. I do not see any way to simulate that condition reliably. So I'll
And yes, it seems to me that we can make some/all of the worker connect errors into warnings now.
CI has passed except one failure in
CI is passing now. We can take up and discuss #32290 (comment) as a separate PR. Does this look okay to merge?
Discovered that
Also, the master should not exit with an error if it failed to issue a remote kill to a deregistered worker. Will make changes for that too.
```julia
end
@test length(npids) == 0
@test nprocs() == 1
@test Distributed.worker_timeout() < t < 360.0
```
Adding up to 6 minutes to the test is probably unacceptable. We should use the existing environment variable to set this timeout to something very small.
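The environment variable in question is `JULIA_WORKER_TIMEOUT`, which `Distributed.worker_timeout()` reads. A test could shrink the timeout along these lines (a sketch, assuming a Julia build with this PR applied; the exact default value is version-dependent):

```julia
using Distributed

# Shrink the worker timeout so a hung worker fails fast in tests.
# JULIA_WORKER_TIMEOUT is consulted by Distributed.worker_timeout().
withenv("JULIA_WORKER_TIMEOUT" => "5.0") do
    @assert Distributed.worker_timeout() == 5.0
    # addprocs(...) here would give up on an unresponsive worker
    # after ~5 seconds instead of the much larger default.
end
```

Using `withenv` keeps the change scoped to the test, so the rest of the test suite still sees the default timeout.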
Agree. Pushing a fix in a bit.
done in c399338
Bump. We fixed most CI issues, so could you push a rebase and we can try to review this soon?
`create_worker` (invoked during `addprocs`) waits on a message from the worker to indicate success. If the worker process terminates before sending this response, `create_worker` (and therefore `addprocs`) remains locked up. Usually the master process does become aware of a terminated worker when the communication channel between them breaks due to the worker exiting, and the message-processing loop exits as a result. This commit introduces an additional task (timer) that monitors the message-processing loop while the master is waiting for a `JoinCompleteMsg` response from a worker. It makes `create_worker` return both when setup is successful (the master receives a `JoinCompleteMsg`) and when the worker is terminated. `create_worker` returns `0` when worker setup fails, instead of the worker id on success. The return value of `addprocs` contains only the workers that were successfully launched and connected to. Added some tests for that too.
do not throw a fatal error if we could not issue a remote exit to kill the worker while deregistering.
It is possible for a `connect` call from master to worker during worker setup to hang indefinitely. This adds a timeout to handle that, so that the master does not lock up as a result. It simply deregisters and terminates the worker and carries on.
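A connect-with-timeout guard of the kind described could be sketched like this (illustrative only; the helper name `connect_with_timeout` and its structure are hypothetical, not the PR's actual code):

```julia
using Sockets

# Hypothetical sketch: attempt a TCP connect but give up after `timeout`
# seconds, so a connect to a non-routable address cannot hang the caller.
function connect_with_timeout(host, port; timeout = 60.0)
    result = Channel{Any}(1)
    @async try
        put!(result, Sockets.connect(host, port))
    catch e
        put!(result, e)
    end
    Timer(timeout) do _
        isready(result) || put!(result, :TIMEDOUT)
    end
    r = take!(result)
    r === :TIMEDOUT && error("connect to $host:$port timed out after $timeout s")
    r isa Exception && throw(r)
    return r
end
```

On timeout the caller can then deregister and terminate the worker and carry on, as the commit message describes. Note that the abandoned `@async` connect attempt keeps running in the background; a fuller implementation would also close it.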
- show exception along with message when kill fails
- but do not warn if error is `ProcessExitedException` for the same process
Also, the additional async task for timeout introduced in JuliaLang#34502 will not be required, because this PR handles that already and also differentiates between timeout and error.
@vtjnash It is now rebased. Couldn't figure out why the Windows and macOS tests failed though.
It looks like you changed the setup code for that test to remove all workers instead of adding them?
Co-authored-by: Jameson Nash <vtjnash@gmail.com>
@tanmaykm Any thoughts here? It's an old PR, but if it is still good to go, should we put some effort into getting it merged?
Moved to JuliaLang/Distributed.jl#61