-
Notifications
You must be signed in to change notification settings - Fork 197
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[REVIEW] Add scheduler_file argument to support MNMG setup #1593
[REVIEW] Add scheduler_file argument to support MNMG setup #1593
Conversation
initialize.initialize( | ||
create_cuda_context=True, | ||
enable_tcp_over_ucx=enable_tcp_over_ucx, | ||
enable_nvlink=enable_nvlink, | ||
enable_infiniband=enable_infiniband, | ||
) | ||
cluster = LocalCUDACluster( | ||
protocol="ucx", | ||
enable_tcp_over_ucx=enable_tcp_over_ucx, | ||
enable_nvlink=enable_nvlink, | ||
enable_infiniband=enable_infiniband, | ||
) | ||
yield cluster | ||
cluster.close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is no longer needed: https://docs.rapids.ai/api/dask-cuda/nightly/examples/ucx/#localcudacluster-with-automatic-configuration
yield client | ||
client.close() | ||
|
||
|
||
@pytest.fixture() | ||
def ucx_client(ucx_cluster): | ||
client = Client(cluster) | ||
client = create_client(ucx_cluster) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this was a bug previously where we passed cluster
instead of ucx_cluster
.
Will revert if this was intentional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needed to add this as we run tests on workers and scheduler and this module needs to be importable for that to happen in dask when running on a non local cluster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks @VibhuJawa!
/merge |
Add scheduler_file argument to support MNMG setup
Overview:
The primary goal is to provide more flexibility and adaptability in how the Dask cluster for testing is configured.
Changes:
LocalCUDACluster
instances is now contingent on the presence of aSCHEDULER_FILE
environment variable. If this variable exists, the path to the Dask scheduler file is returned instead of creating a new cluster. This change allows the use of pre-existing clusters specified via theSCHEDULER_FILE
environment variable.enable_tcp_over_ucx
,enable_nvlink
,enable_infiniband
) previously used to initialize theLocalCUDACluster
. This is because sinceDask-CUDA 22.02
andUCX >= 1.11.1
we dont need those.See docs: https://docs.rapids.ai/api/dask-cuda/nightly/examples/ucx/#localcudacluster-with-automatic-configuration
This could help in situations where test scenarios need to be conducted on a specific pre-existing cluster (especially for MNMG setups) .
Testing:
I tested using the following setup:
Start Cluster:
Run Tests: