
Add synced dict between cluster and scheduler to store cluster info #5033

Merged: 14 commits merged into dask:main on Sep 9, 2021

Conversation

jacobtomlinson (Member)

Closes #5031
Closes #4607
Related to #4263

This PR adds a cluster_info attribute to all Cluster objects: a dictionary that is periodically synced to the scheduler. Any info already on the scheduler during _start is merged into the dict on the Cluster, and that dict is then synced back to the scheduler every second.

By default this dict just contains the cluster name and the class type. This will be useful in #5012 for advertising cluster manager provenance for use in dask-ctl.
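As a minimal usage sketch, assuming the cluster_info attribute described above (the exact default keys shown are illustrative):

```python
# Minimal sketch, assuming the cluster_info attribute described in this PR.
from distributed import LocalCluster

with LocalCluster(n_workers=1) as cluster:
    info = cluster.cluster_info
    # Expected to contain at least the cluster name and the manager class,
    # e.g. {"name": "...", "type": "distributed.deploy.local.LocalCluster"}
    print(info)
```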

This is a good place to persist state at various points in the cluster lifecycle, particularly when wanting to disconnect from and reconnect to a running cluster. In cluster managers like KubeCluster this would be a good place to store all the config for the cluster, such as the worker CPU/memory settings needed to create new worker pods. That config would then be loaded back in when using KubeCluster.from_name(), instead of having to serialise everything into some platform-specific metadata store as I attempted in dask/dask-kubernetes#318.
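As a hedged illustration of that idea (the MyCluster class and the worker_cpu/worker_memory keys below are hypothetical, not part of this PR), a cluster manager could stash its worker configuration in the synced dict:

```python
# Hypothetical sketch: a cluster manager persisting its own config in the
# synced dict so a later reconnect (e.g. a from_name() classmethod) could
# rebuild workers with the same settings. Names here are illustrative only.
from distributed.deploy.cluster import Cluster


class MyCluster(Cluster):
    def __init__(self, worker_cpu=1, worker_memory="4GiB", asynchronous=False, **kwargs):
        super().__init__(asynchronous=asynchronous, **kwargs)
        # Extra keys are synced to the scheduler alongside the default
        # name/type entries, so they outlive this local Cluster object.
        self.cluster_info.update(
            {"worker_cpu": worker_cpu, "worker_memory": worker_memory}
        )
```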

I've also updated the cluster.name attribute to be a property with a setter, ensuring the name also lives within the cluster.cluster_info dict and is synced back and forth.
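A self-contained sketch of that pattern (not the exact implementation), where name is just a view onto the synced dict:

```python
# Standalone sketch of backing `name` with the synced dict, so a rename is
# picked up by the next periodic sync to the scheduler.
class Cluster:
    def __init__(self, name=None):
        self.cluster_info = {"name": name or "unnamed", "type": type(self).__name__}

    @property
    def name(self):
        return self.cluster_info["name"]

    @name.setter
    def name(self, value):
        self.cluster_info["name"] = value


cluster = Cluster(name="my-cluster")
cluster.name = "renamed"
assert cluster.cluster_info["name"] == "renamed"
```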

Three review comments on distributed/deploy/cluster.py (outdated, resolved).
jacobtomlinson requested a review from fjetter on Jul 7, 2021
fjetter (Member) commented on Jul 8, 2021

We might want to add the scheduler_sync_interval to the dask options, but for my tests I just passed it through the inits. I don't have a strong preference and don't know if this is something people would actually want to adjust. Making it somehow configurable is quite useful for tests, though.
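For illustration, something like the following could do that; the distributed.deploy.cluster-sync-interval config key is hypothetical and not defined by this PR:

```python
# Sketch of resolving the sync interval: an explicit __init__ argument wins,
# then a (hypothetical) dask config key, then the hard-coded default.
import dask


def resolve_sync_interval(explicit=None, default=1.0):
    if explicit is not None:
        return explicit
    return dask.config.get("distributed.deploy.cluster-sync-interval", default=default)
```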

jacobtomlinson (Member, Author)

Linux and macOS test failures appear to be flaky and unrelated. However, the Windows failures are valid. Looking into it 👀.

jacobtomlinson (Member, Author)

Hrm, all Windows tests are still failing; the sync doesn't seem to be working. I'll investigate this next week.

jacobtomlinson (Member, Author)

After pulling in the latest changes, CI seems to be failing inconsistently again, but it's not clear whether that is related to this change. Running again.

jrbourbeau (Member) left a review


Thanks @jacobtomlinson!

Review comments on distributed/deploy/cluster.py and distributed/deploy/tests/test_local.py (outdated, resolved).
jacobtomlinson (Member, Author)

Thanks for the review, @jrbourbeau. I've addressed all of your feedback.

I note that CI is still failing, but only on Windows, with timeouts in some other test. I'm going to get a working Windows environment set up locally today so I can try and reproduce and debug this.

jrbourbeau (Member)

> I'm going to get a working Windows environment set up locally today so I can try and reproduce and debug this.

Thank you for your service 🙇‍♂️

jacobtomlinson (Member, Author)

I didn't manage this yesterday; I got pulled onto something else. But it looks like pulling in the latest changes from main has resolved this anyway.

jacobtomlinson (Member, Author)

Given that things seem OK here now, I intend to merge this tomorrow unless there are further comments.

jacobtomlinson merged commit 5bf60e3 into dask:main on Sep 9, 2021
Successfully merging this pull request may close these issues:

Schedulers and workers do not know their cluster provenance
Add key/value state manager to the scheduler