Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add zeroconf advertising to the scheduler #5012

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

jacobtomlinson
Copy link
Member

@jacobtomlinson jacobtomlinson commented Jul 1, 2021

  • Tests added / passed
  • Passes black distributed / flake8 distributed / isort distributed

Adds support for zeroconf/bonjour/avahi where the scheduler will advertise itself on the local network via mdns service discovery. This is the same protocol devices like TVs, printers, etc advertise with for auto-setup.

I've made it optional here and not included zeroconf as a dependency, this will only happen if that package is installed, otherwise it will skip over this. I've also added a config option to explicitly disable.

My primary motivation here is to have something more robust in dask-ctl for discovering LocalCluster services running on the same machine. Annoyingly if you run multiple schedulers with random high ports it gets a little tricky to find them in a cross-platform way. This change allows me to resolve that.

This could also be expanded in the future to enable workers to discover a scheduler on the same network without having to explicitly know the IP address. This is predicated on mdns being enabled on the network which is very common on home and office LANs but probably not in datacentres/HPCs. More investigation is required.

@jacobtomlinson
Copy link
Member Author

Rerunning flaky test

@jacobtomlinson
Copy link
Member Author

I've also opened dask-contrib/dask-ctl#10 which makes use of this PR.

Copy link
Member

@fjetter fjetter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you want to test this on CI; if not why? we typically dump all our optional requirements in py3.9 environment to have at least one set with everything running. I'm not a huge fan of this but it beats not testing at all. thoughts?

distributed/scheduler.py Outdated Show resolved Hide resolved
@jacobtomlinson
Copy link
Member Author

Thanks @fjetter. I've added the dependency to the 3.9 environment and updated the config constant.

Copy link
Member

@fjetter fjetter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM once green-ish

@jacobtomlinson
Copy link
Member Author

Test failures here seem flaky, but this is happening a lot at the moment, noting down failures and rerunning to see if they happen consistently.

  • ubuntu 3.9 - TIMEOUT distributed/cli/tests/test_dask_scheduler.py::test_idle_timeout
  • windows 3.7 - FAILED distributed/tests/test_scheduler.py::test_memory
  • windows 3.8 - STILL RUNNING
  • windows 3.9 - TIMEOUT distributed/cli/tests/test_dask_scheduler.py::test_idle_timeout
  • macos 3.7 - FAILED distributed/tests/test_scheduler.py::test_rebalance_least_recently_inserted_sender_min
  • macos 3.8 - FAILED distributed/tests/test_client_executor.py::test_map
  • macos 3.8 - TIMEOUT distributed/cli/tests/test_dask_scheduler.py::test_idle_timeout

@jacobtomlinson
Copy link
Member Author

I've pulled in the latest changes from main and am running CI again. If everything goes ok I intend on merging this tomorrow.

@jacobtomlinson
Copy link
Member Author

Most of these failures seem flaky except for one which is happening consistently in Python 3.9 only.

FAILED distributed/tests/test_scheduler.py::test_no_danglng_asyncio_tasks - A...

The dangling task is definitely related to the zeroconf dependency.

Task was destroyed but it is pending!
task: <Task pending name='Task-112031' coro=<Zeroconf._async_broadcast_service() running at /Users/runner/miniconda3/envs/dask-distributed/lib/python3.9/site-packages/zeroconf/_core.py:578> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x144b3f490>()]>>

@jakirkham
Copy link
Member

@jacobtomlinson, curious what the status is here? This seems like a useful addition. There are a few conflicts that have crept in. Though perhaps the flaky test was fixed in the interim

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants