flaky test_nanny_worker_port_range #6045
Comments
I'm inclined to skip Ubuntu 3.9 and move on. I think this feature is fringe enough that I don't care to prioritize spending time on it. Thoughts?
Blindly spawning services that race against each other and relying on the OS/libraries to handle it gracefully is very bad practice. I'm actually surprised that it doesn't fail elsewhere.
I don't have a strong opinion here. Mostly this feature isn't important enough for me to invest much time into it.
Turns out I was wrong in my diagnosis; I'm still getting the failure after the port split. I've labelled the test as flaky on Ubuntu 3.9 for the time being but I think it should be looked at. What I think I'm seeing is a race condition when multiple nannies invoke
I don't see any reason why the issue should be test-specific. However, as this is not something that's vexing everybody in production right now, I suspect it's triggered by starting a Nanny on a (temporarily) very slow box, such as those from CI. We don't have enough runs yet to understand whether this issue is specific to Python 3.9 or also affects 3.10.
I'm stopping work on this. Anybody should feel free to pick it up.
I recently overhauled test_nanny_worker_port_range in #5956, as it was hanging on Windows because the stdout/stderr pipes were too small for the output, so I changed it to have a thread read from them asynchronously.
The change made it stable on Windows but it made it fairly flaky on Ubuntu 3.9, and Ubuntu 3.9 only. It seems stable on Ubuntu 3.8 (3.10 is too young to tell).
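For context, here's a minimal sketch of the asynchronous pipe-reading pattern described above. This is not the code from #5956, and the dask-worker --help command is just a stand-in for whatever the test actually launches:

```python
import subprocess
import threading

def drain(pipe, sink):
    # Read the child's output line by line so the OS pipe buffer
    # never fills up and blocks the child process.
    for line in iter(pipe.readline, b""):
        sink.append(line.decode(errors="replace").rstrip())
    pipe.close()

# Stand-in command; the real test launches dask-worker with a port range.
proc = subprocess.Popen(
    ["dask-worker", "--help"],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
)
captured: list[str] = []
reader = threading.Thread(target=drain, args=(proc.stdout, captured), daemon=True)
reader.start()

proc.wait()    # the child can no longer block on a full pipe
reader.join()
print("\n".join(captured))
```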
As much as the graph suggests the contrary, I don't think my change is causing the flakiness.
Rather, I'm persuaded that the test is highlighting a genuine design flaw.
The workers connect successfully to the scheduler, successfully receive the Client.run request, and then hang. I have no way to figure out whether they all hang or only one of them (the cluster dump tool could tell me, but by the time I reach it the nanny subprocess has already been killed).
nanny stdout/stderr is captured effectively, and is shown at the end of this post. I can spot a bunch of worker restarts that should not be there. On the last lines, I can see FOUR notices for Client.run, whereas I was expecting three. In the cluster dump, scheduler.workers shows 3 workers.
What I think is happening is that port="10000:11000" is passed verbatim to Worker.__init__ (the type annotation in worker.py, which states port: int | None = None, is wrong).
I don't know why Ubuntu 3.8 is unaffected. Both the 3.8 and 3.9 environments install tornado 6.1.
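To illustrate why that annotation can't be right, here's a hedged sketch of what accepting both an int and a "low:high" string implies. The helper name is hypothetical and this is not the actual parsing code in distributed:

```python
# Illustrative only: a hypothetical parser, not the code in distributed/worker.py,
# showing why port cannot be typed as int | None.

def parse_port_range(port):
    """Turn 8786, "10000:11000", or None into a list of candidate ports."""
    if port is None or isinstance(port, int):
        return [port]
    low, high = map(int, port.split(":"))
    return list(range(low, high + 1))  # assuming an inclusive range

assert parse_port_range(None) == [None]
assert parse_port_range(8786) == [8786]
assert parse_port_range("10000:10002") == [10000, 10001, 10002]
```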
Proposed design
dask-worker should parse the port range and split it into equal parts. It should also spot and prevent overlaps between the nanny port range and the worker port range.
In other words, the test's dask-worker invocation must spawn each nanny and worker with its own non-overlapping sub-range of the requested ports (a sketch of the splitting follows below).
Note that the last worker_port ends at 10999, because 11000 is dedicated to the nannies.
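As a rough illustration of the proposed splitting (the helper name and the example numbers are made up, not part of dask-worker):

```python
# Made-up helper name and numbers; a sketch of the proposed behaviour,
# not a patch against dask-worker.

def split_port_range(low, high, n):
    """Split the inclusive range [low, high] into n contiguous,
    non-overlapping sub-ranges of (almost) equal size."""
    size = high - low + 1
    step, extra = divmod(size, n)
    ranges = []
    start = low
    for i in range(n):
        end = start + step - 1 + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end + 1
    return ranges

# Workers would keep 10000-10999 while 11000 and up stays reserved for the
# nannies, so the two ranges can never overlap.
print(split_port_range(10000, 10999, 3))
# [(10000, 10333), (10334, 10666), (10667, 10999)]
```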
nanny/worker stdout/stderr