flaky test_nanny_worker_port_range #6045

Open
crusaderky opened this issue Apr 1, 2022 · 5 comments · Fixed by #6054
Labels: flaky test (Intermittent failures on CI)

Comments

crusaderky (Collaborator) commented Apr 1, 2022

I recently overhauled test_nanny_worker_port_range in #5956, as it was hanging on Windows due to the stdout/stderr pipes being too small for the output, so I changed the test to have a thread read from them asynchronously.

The change made the test stable on Windows, but it made it fairly flaky on Ubuntu with Python 3.9, and 3.9 only. It seems stable on Ubuntu with Python 3.8 (3.10 is too young to tell).

[Screenshot from 2022-04-01 10-26-47: graph of recent CI failures for this test]

Although the graph suggests the contrary, I don't think my change is causing the flakiness.
Rather, I'm persuaded that the test is highlighting a genuine design flaw.

The workers connect successfully to the scheduler, successfully receive the Client.run request, and then hang. I have no way to figure out whether they all hang or only one of them (the cluster dump tool could tell me, but by the time I reach it the nanny subprocess has already been killed).

The nanny stdout/stderr is captured correctly and is shown at the end of this post. I can spot a bunch of worker restarts that should not be there. In the last lines, I can see FOUR notices for Client.run, whereas I was expecting three. In the cluster dump, scheduler.workers shows 3 workers.

What I think is happening is:

  1. The three Nanny objects all pass the raw string port="10000:11000" to Worker.__init__ (the type annotation in worker.py, which states port: int | None = None, is wrong).
  2. The three worker processes individually try to bind on the same port range. They try ports sequentially from left to right (I can see from Scheduler.workers that they use ports 10000, 10001, and 10002) and unavoidably end up fighting for the same ports. Unlike binding on port=0, which is thoroughly supported at OS level, this triggers a race condition (see the sketch after this list).
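
For illustration, a minimal sketch of the left-to-right probing described in point 2 (this is not distributed's actual listener code; bind_first_free_port is a hypothetical helper):

import socket

def bind_first_free_port(host: str, lo: int, hi: int) -> socket.socket:
    """Return a socket bound to the first free port in the inclusive range lo:hi."""
    for port in range(lo, hi + 1):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
            return sock
        except OSError:
            sock.close()  # taken, presumably by a sibling worker; try the next port
    raise RuntimeError(f"no free port in {lo}:{hi}")

# Three workers calling bind_first_free_port("127.0.0.1", 10000, 11000) concurrently
# all start probing at 10000 and contend for 10000, 10001, 10002, whereas port=0
# lets the OS hand each caller a distinct ephemeral port atomically.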

I don't know why Ubuntu 3.8 is unaffected. Both the 3.8 and 3.9 environments install tornado 6.1.

Proposed design

dask-worker should parse the port range and split it into equal parts. It should also detect and prevent overlaps between the nanny port range and the worker port range.
In other words,

dask-worker --nworkers 3 --worker-port 10000:11000 --nanny-port 11000:12000

must spawn

Nanny(port="11000:11333", worker_port="10000:10333")
Nanny(port="11334:11666", worker_port="10334:10666")
Nanny(port="11667:12000", worker_port="10667:10999")

Note that the last worker_port ends at 10999, because 11000 is dedicated to the nannies.
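
A minimal sketch of how such a split could work, assuming hypothetical parse_range / split_range / plan_ports helpers (this is not dask-worker's actual code, and the chunk boundaries come out a port or two different from the example above):

def parse_range(spec: str) -> tuple[int, int]:
    """Parse a "lo:hi" port range string into an inclusive (lo, hi) pair."""
    lo, _, hi = spec.partition(":")
    return int(lo), int(hi)

def split_range(lo: int, hi: int, n: int) -> list[tuple[int, int]]:
    """Split the inclusive range lo..hi into n disjoint chunks."""
    width = (hi - lo + 1) // n
    return [
        (lo + i * width, hi if i == n - 1 else lo + (i + 1) * width - 1)
        for i in range(n)
    ]

def plan_ports(worker_spec: str, nanny_spec: str, nworkers: int):
    """Assign each nanny/worker pair its own disjoint slice of both port ranges."""
    w_lo, w_hi = parse_range(worker_spec)
    n_lo, n_hi = parse_range(nanny_spec)
    # Give any overlapping ports to the nannies, as in the example above where
    # 11000 is dedicated to the nannies and the worker range stops at 10999.
    if w_lo <= n_hi and n_lo <= w_hi:
        w_hi = min(w_hi, n_lo - 1)
    return list(zip(split_range(n_lo, n_hi, nworkers),
                    split_range(w_lo, w_hi, nworkers)))

# plan_ports("10000:11000", "11000:12000", 3) returns
# [((11000, 11332), (10000, 10332)),
#  ((11333, 11665), (10333, 10665)),
#  ((11666, 12000), (10666, 10999))]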

nanny/worker stdout/stderr

2022-03-28 18:06:11,115 - distributed.nanny - INFO -         Start Nanny at: 'tcp://127.0.0.1:11000'
2022-03-28 18:06:11,125 - distributed.nanny - INFO -         Start Nanny at: 'tcp://127.0.0.1:11001'
2022-03-28 18:06:11,135 - distributed.nanny - INFO -         Start Nanny at: 'tcp://127.0.0.1:11002'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/synchronize.py", line 110, in __setstate__
    self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/synchronize.py", line 110, in __setstate__
    self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory
2022-03-28 18:06:13,398 - distributed.nanny - INFO - Worker process 3229 exited with status 1
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
Traceback (most recent call last):
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
Exception ignored in: <Finalize object, dead>
Traceback (most recent call last):
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
2022-03-28 18:06:16,551 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:43139
2022-03-28 18:06:16,551 - distributed.worker - INFO - -------------------------------------------------
2022-03-28 18:06:16,554 - distributed.core - INFO - Starting established connection
2022-03-28 18:06:16,582 - distributed.worker - INFO - Run out-of-band function 'lambda'
2022-03-28 18:06:16,584 - distributed.worker - INFO - Run out-of-band function 'lambda'
2022-03-28 18:06:16,587 - distributed.worker - INFO - Run out-of-band function 'lambda'
2022-03-28 18:06:16,597 - distributed.worker - INFO - Run out-of-band function 'lambda'
crusaderky added the "flaky test" (Intermittent failures on CI) label on Apr 1, 2022
mrocklin (Member) commented Apr 1, 2022

I'm inclined to skip the test on Ubuntu 3.9 and move on. I think this feature is fringe enough that I don't want to prioritize spending time on it. Thoughts?

crusaderky self-assigned this on Apr 1, 2022
crusaderky (Collaborator, Author) commented:

Blindly spawning services that race with each other and relying on the OS/libraries to handle it gracefully is very bad practice. I'm actually surprised that it doesn't fail elsewhere.
I'm almost done working on a fix.

mrocklin (Member) commented Apr 1, 2022

I don't have a strong opinion here. Mostly this feature isn't important enough for me to invest much time into it.

crusaderky (Collaborator, Author) commented:

It turns out I was wrong in my diagnosis; I'm still getting the failure after the port split.
I think the change is a healthy thing to have regardless, and it should be merged.

I've labelled the test as flaky on Ubuntu 3.9 for the time being, but I think it should be looked at. What I think I'm seeing is a race condition when multiple nannies invoke Nanny.process.start() at the same time:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/multiprocessing/synchronize.py", line 110, in __setstate__
    self._semlock = _multiprocessing.SemLock._rebuild(*state)
FileNotFoundError: [Errno 2] No such file or directory
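
For context, a minimal, self-contained sketch of the suspected pattern (this is not Nanny's actual code): several "spawn" processes started at roughly the same time, each of which has to unpickle its inherited arguments, including multiprocessing synchronization primitives, in the child before it can run. The SemLock._rebuild call in the traceback above happens during that child-side unpickling.

import multiprocessing as mp

def target(lock):
    with lock:
        pass

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    lock = ctx.Lock()
    procs = [ctx.Process(target=target, args=(lock,)) for _ in range(3)]
    for p in procs:
        # Each child re-imports the module and unpickles (lock,); if the named
        # semaphore behind the lock has already been unlinked by then, the child
        # fails with FileNotFoundError, as in the traceback above.
        p.start()
    for p in procs:
        p.join()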

I don't see any reason why the issue should be test-specific. However, as this is not something that's vexing everybody in production right now, I suspect it's triggered by starting a Nanny on a (temporarily) very slow box, such as those used by CI.

We don't have enough runs yet to understand whether this issue is specific to Python 3.9 or whether it also affects 3.10.

crusaderky removed their assignment on Apr 4, 2022
crusaderky (Collaborator, Author) commented:

I'm stopping work on this. Anybody should feel free to pick it up.
