flaky test_nanny_worker_port_range #6045

Open
crusaderky opened this issue Apr 1, 2022 · 5 comments · Fixed by #6054
Labels: flaky test (Intermittent failures on CI)

Comments

crusaderky (Collaborator) commented Apr 1, 2022

I recently overhauled test_nanny_worker_port_range in #5956, as it was hanging on Windows due to the stdout/stderr pipes being too small for the output, so I changed the test to have a thread read from them asynchronously.

The change made the test stable on Windows, but it made it fairly flaky on Ubuntu with Python 3.9, and 3.9 only. It seems stable on Ubuntu with Python 3.8 (3.10 is too young to tell).

[Screenshot from 2022-04-01 10-26-47: graph of recent CI failures for this test]

Although the graph suggests the contrary, I don't think my change is causing the flakiness.
Rather, I'm persuaded that the test is highlighting a genuine design flaw.

The workers connect successfully to the scheduler, successfully receive the Client.run request, and then hang. I have no way to figure out whether they all hang or only one of them (the cluster dump tool could tell me, but by the time I reach it the nanny subprocess has already been killed).

The nanny stdout/stderr is captured correctly and is shown at the end of this post. I can spot a bunch of worker restarts that should not be there. In the last lines, I can see FOUR notices for Client.run, whereas I was expecting three. In the cluster dump, scheduler.workers shows 3 workers.

What I think is happening is:

  1. The three Nanny objects all pass the raw string port="10000:11000" to Worker.__init__ (the type annotation in worker.py, which states port: int | None = None, is wrong).
  2. The three worker processes individually try to bind on the same port range. They try ports sequentially from left to right (I can see from Scheduler.workers that they use ports 10000, 10001, and 10002) and unavoidably end up fighting for the same ports. Unlike binding on port=0, which is thoroughly supported at OS level, this triggers a race condition (see the sketch after this list).
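
For illustration, a minimal sketch of the left-to-right probing described in point 2 (this is not distributed's actual listener code; bind_first_free_port is a hypothetical helper):

import socket

def bind_first_free_port(host: str, lo: int, hi: int) -> socket.socket:
    """Return a socket bound to the first free port in the inclusive range lo:hi."""
    for port in range(lo, hi + 1):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
            return sock
        except OSError:
            sock.close()  # taken, presumably by a sibling worker; try the next port
    raise RuntimeError(f"no free port in {lo}:{hi}")

# Three workers calling bind_first_free_port("127.0.0.1", 10000, 11000) concurrently
# all start probing at 10000 and contend for 10000, 10001, 10002, whereas port=0
# lets the OS hand each caller a distinct ephemeral port atomically.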

I don't know why Ubuntu 3.8 is unaffected. Both the 3.8 and 3.9 environments install tornado 6.1.

Proposed design

dask-worker should parse the port range and split it into equal parts. It should also detect and prevent overlaps between the nanny port range and the worker port range.
In other words,

dask-worker --nworkers 3 --worker-port 10000:11000 --nanny-port 11000:12000

must spawn

Nanny(port="11000:11333", worker_port="10000:10333")
Nanny(port="11334:11666", worker_port="10334:10666")
Nanny(port="11667:12000", worker_port="10667:10999")

Note that the last worker_port ends at 10999, because 11000 is dedicated to the nannies.
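
A minimal sketch of how such a split could work, assuming hypothetical parse_range / split_range / plan_ports helpers (this is not dask-worker's actual code, and the chunk boundaries come out a port or two different from the example above):

def parse_range(spec: str) -> tuple[int, int]:
    """Parse a "lo:hi" port range string into an inclusive (lo, hi) pair."""
    lo, _, hi = spec.partition(":")
    return int(lo), int(hi)

def split_range(lo: int, hi: int, n: int) -> list[tuple[int, int]]:
    """Split the inclusive range lo..hi into n disjoint chunks."""
    width = (hi - lo + 1) // n
    return [
        (lo + i * width, hi if i == n - 1 else lo + (i + 1) * width - 1)
        for i in range(n)
    ]

def plan_ports(worker_spec: str, nanny_spec: str, nworkers: int):
    """Assign each nanny/worker pair its own disjoint slice of both port ranges."""
    w_lo, w_hi = parse_range(worker_spec)
    n_lo, n_hi = parse_range(nanny_spec)
    # Give any overlapping ports to the nannies, as in the example above where
    # 11000 is dedicated to the nannies and the worker range stops at 10999.
    if w_lo <= n_hi and n_lo <= w_hi:
        w_hi = min(w_hi, n_lo - 1)
    return list(zip(split_range(n_lo, n_hi, nworkers),
                    split_range(w_lo, w_hi, nworkers)))

# plan_ports("10000:11000", "11000:12000", 3) returns
# [((11000, 11332), (10000, 10332)),
#  ((11333, 11665), (10333, 10665)),
#  ((11666, 12000), (10666, 10999))]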

nanny/worker stdout/stderr

2022-03-28 18:06:11,115 - distributed.nanny - INFO -         Start Nanny at: 'tcp://127.0.0.1:11000'
2022-03-28 18:06:11,125 - distributed.nanny - INFO -         Start Nanny at: 'tcp://127.0.0.1:11001'
2022-03-28 18:06:11,135 - distributed.nanny - INFO -         Start Nanny at: 'tcp://127.0.0.1:11002'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/synchronize.py", line 110, in __setstate__
    self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/synchronize.py", line 110, in __setstate__
    self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory
2022-03-28 18:06:13,398 - distributed.nanny - INFO - Worker process 3229 exited with status 1
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
Traceback (most recent call last):
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
2022-03-28 18:06:16,551 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:43139
2022-03-28 18:06:16,551 - distributed.worker - INFO - -------------------------------------------------
2022-03-28 18:06:16,554 - distributed.core - INFO - Starting established connection
2022-03-28 18:06:16,582 - distributed.worker - INFO - Run out-of-band function 'lambda'
2022-03-28 18:06:16,584 - distributed.worker - INFO - Run out-of-band function 'lambda'
2022-03-28 18:06:16,587 - distributed.worker - INFO - Run out-of-band function 'lambda'
2022-03-28 18:06:16,597 - distributed.worker - INFO - Run out-of-band function 'lambda'
crusaderky added the "flaky test" (Intermittent failures on CI) label on Apr 1, 2022
mrocklin (Member) commented Apr 1, 2022

I'm inclined to skip the test on Ubuntu 3.9 and move on. I think this feature is fringe enough that I don't want to prioritize spending time on it. Thoughts?

crusaderky self-assigned this on Apr 1, 2022
crusaderky (Collaborator, Author) commented:

Blindly spawning services that race with each other and relying on the OS/libraries to handle it gracefully is very bad practice. I'm actually surprised that it doesn't fail elsewhere.
I'm almost done working on a fix.

mrocklin (Member) commented Apr 1, 2022

I don't have a strong opinion here. Mostly this feature isn't important enough for me to invest much time into it.

crusaderky (Collaborator, Author) commented:

It turns out I was wrong in my diagnosis; I'm still getting the failure after the port split.
I think the change is a healthy thing to have regardless, and it should be merged.

I've labelled the test as flaky on Ubuntu 3.9 for the time being, but I think it should be looked at. What I think I'm seeing is a race condition when multiple nannies invoke Nanny.process.start() at the same time:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/synchronize.py", line 110, in __setstate__
    self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory
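
For context, a minimal, self-contained sketch of the suspected pattern (this is not Nanny's actual code): several "spawn" processes started at roughly the same time, each of which has to unpickle its inherited arguments, including multiprocessing synchronization primitives, in the child before it can run. The SemLock._rebuild call in the traceback above happens during that child-side unpickling.

import multiprocessing as mp

def target(lock):
    with lock:
        pass

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    lock = ctx.Lock()
    procs = [ctx.Process(target=target, args=(lock,)) for _ in range(3)]
    for p in procs:
        # Each child re-imports the module and unpickles (lock,); if the named
        # semaphore behind the lock has already been unlinked by then, the child
        # fails with FileNotFoundError, as in the traceback above.
        p.start()
    for p in procs:
        p.join()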

I don't see any reason why the issue should be test-specific. However, as this is not something that's vexing everybody in production right now, I suspect it's triggered by starting a Nanny on a (temporarily) very slow box, such as those used by CI.

We don't have enough runs yet to understand whether this issue is specific to Python 3.9 or whether it also affects 3.10.

crusaderky removed their assignment on Apr 4, 2022
crusaderky (Collaborator, Author) commented:

I'm stopping work on this. Anybody should feel free to pick it up.
