Add distributed.comm.ucx.create-cuda-context config #5526
Merged
jakirkham
approved these changes
Nov 18, 2021
rerun tests
The failing Windows/macOS tests shouldn't be relevant to this, since UCX is only supported on Linux anyway. The failing Linux tests should be fixed, as reported in #5527. Can we rerun tests here, or are we good as they are now?
Thanks Peter! 😄
Thanks @jakirkham for the review and merging! 😄
pentschev
added a commit
to pentschev/distributed
that referenced
this pull request
Nov 22, 2021
Some of the UCX configurations use `_` whereas others use `-`. This is confusing so we now standardize everything to `-`. This also fixes an inconsistency from dask#5526, where configuration files used `create-cuda-context`, but the configuration read was `create_cuda_context`.
jrbourbeau
pushed a commit
that referenced
this pull request
Nov 23, 2021
Some of the UCX configurations use `_` whereas others use `-`. This is confusing so we now standardize everything to `-`. This also fixes an inconsistency from #5526, where configuration files used `create-cuda-context`, but the configuration read was `create_cuda_context`.
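The relationship between the two spellings can be sketched in a few lines of Python. This is a minimal illustration, not Dask's own code: `env_to_config_key` and `canonical` are hypothetical helpers, and the mapping assumes the convention documented for Dask configuration, where the `DASK_` prefix is stripped, the name is lowercased, and a double underscore marks a nesting level.

```python
def env_to_config_key(env_name: str, prefix: str = "DASK_") -> str:
    """Map an environment variable name to a dotted config key.

    Assumed convention: strip the prefix, lowercase the rest, and treat
    a double underscore as the nesting separator.
    """
    if not env_name.startswith(prefix):
        raise ValueError(f"{env_name!r} does not start with {prefix!r}")
    return env_name[len(prefix):].lower().replace("__", ".")


def canonical(key: str) -> str:
    """Standardize a dotted key on hyphens, the spelling the commit settles on."""
    return key.replace("_", "-")


env_key = env_to_config_key("DASK_DISTRIBUTED__COMM__UCX__CREATE_CUDA_CONTEXT")
print(env_key)             # distributed.comm.ucx.create_cuda_context
print(canonical(env_key))  # distributed.comm.ucx.create-cuda-context
```

Environment variable names cannot contain `-`, so the underscore spelling is what the environment route naturally yields, while configuration files can use the hyphenated one.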
rapids-bot bot
pushed a commit
to rapidsai/dask-cuda
that referenced
this pull request
Nov 29, 2021
Up until now, we required users to specify which transports UCX should use, which pushed the configuration burden onto the user and was error-prone. We can now reduce this burden with a single configuration added in dask/distributed#5526: `DASK_DISTRIBUTED__COMM__UCX__CREATE_CUDA_CONTEXT`/`distributed.comm.ucx.create_cuda_context`, which creates the CUDA context _before_ UCX is initialized.

This is an example of how to set up a cluster with `dask-cuda-worker` after this change:

```
# Scheduler
UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES=cuda DASK_DISTRIBUTED__COMM__UCX__CREATE_CUDA_CONTEXT=True dask-scheduler --protocol ucx --interface ib0

# Workers
UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES=cuda dask-cuda-worker ucx://${SCHEDULER_IB0_IP}:8786 --interface ib0 --rmm-pool-size 29GiB

# Client
UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES=cuda DASK_DISTRIBUTED__COMM__UCX__CREATE_CUDA_CONTEXT=True python client.py
```

Similarly, one can set up `LocalCUDACluster(protocol="ucx", interface="ib0")`.

Note above how `ib0` is intentionally specified. That is mandatory to use RDMACM, since listeners must bind to an InfiniBand interface. It can be left unspecified on systems without InfiniBand or if RDMACM isn't required (discouraged on systems that have InfiniBand connectivity).

The `UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES=cuda` option is specified for optimal InfiniBand performance with CUDA. It will be the default in UCX 1.12, at which point specifying it won't be necessary anymore.

Changes introduced here are backwards-compatible, meaning the old options such as `--enable-nvlink`/`enable_nvlink=True` are still valid. However, if any of those options is specified, the user is responsible for enabling/disabling all desired transports, which can also be useful for benchmarking specific transports.

Finally, UCX may not require a CUDA context to be created in the future, at which point it will be possible to remove `DASK_DISTRIBUTED__COMM__UCX__CREATE_CUDA_CONTEXT=True` from scheduler/client processes entirely.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Mads R. B. Kristensen (https://github.com/madsbk)

URL: #792
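For anyone launching these processes from Python rather than a shell, the per-role environment from the commit message above could be assembled roughly as follows. This is only a sketch: `launch_spec` and `as_shell` are hypothetical helpers, the scheduler address `10.0.0.1` is a placeholder, and nothing is executed here; the code only builds and prints the command lines.

```python
import shlex

# Environment variables taken from the commit message above.
UCX_ENV = {"UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES": "cuda"}
CREATE_CONTEXT_ENV = {"DASK_DISTRIBUTED__COMM__UCX__CREATE_CUDA_CONTEXT": "True"}


def launch_spec(role, scheduler_ip="10.0.0.1", interface="ib0"):
    """Return (extra_env, argv) for one process role (hypothetical helper)."""
    if role == "scheduler":
        env = {**UCX_ENV, **CREATE_CONTEXT_ENV}
        argv = ["dask-scheduler", "--protocol", "ucx", "--interface", interface]
    elif role == "worker":
        # The worker line in the commit message does not set the context variable.
        env = dict(UCX_ENV)
        argv = ["dask-cuda-worker", f"ucx://{scheduler_ip}:8786",
                "--interface", interface, "--rmm-pool-size", "29GiB"]
    elif role == "client":
        env = {**UCX_ENV, **CREATE_CONTEXT_ENV}
        argv = ["python", "client.py"]
    else:
        raise ValueError(f"unknown role: {role!r}")
    return env, argv


def as_shell(env, argv):
    """Render a spec as a single shell line for inspection."""
    assignments = [f"{k}={shlex.quote(v)}" for k, v in env.items()]
    return " ".join(assignments + [shlex.quote(a) for a in argv])


for role in ("scheduler", "worker", "client"):
    print(as_shell(*launch_spec(role)))
```

To actually start a process, the returned `env` would be merged into a copy of `os.environ` and passed to `subprocess.Popen(argv, env=...)`. Keeping the context variable in its own dictionary makes it easy to drop from scheduler/client once UCX no longer needs it.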
By allowing a CUDA context to be created explicitly, GPU-based workflows can rely on UCX to choose the optimal transports, without requiring the user to specify which ones to enable.
The changes needed to make use of this feature are being applied in rapidsai/dask-cuda#792.