Add distributed.comm.ucx.create-cuda-context config #5526

pentschev · 2021-11-18T22:35:57Z

Allowing a CUDA context to be created explicitly may let GPU-based workflows rely on UCX to choose the optimal transports, without requiring the user to specify which ones to enable.

Relevant changes to make use of this feature are being applied in rapidsai/dask-cuda#792.

pentschev · 2021-11-18T22:43:41Z

rerun tests

pentschev · 2021-11-18T22:48:26Z

rerun tests

pentschev · 2021-11-19T17:05:29Z

The failing Windows/macOS tests shouldn't be relevant here, since UCX is only supported on Linux anyway. The failing Linux tests should be fixed, as reported in #5527.

Can we rerun tests here, or are we good as they are now?

jakirkham · 2021-11-19T20:35:58Z

Thanks Peter! 😄

pentschev · 2021-11-19T20:42:12Z

Thanks @jakirkham for the review and merging! 😄

Some of the UCX configurations use `_` whereas others use `-`. This is confusing so we now standardize everything to `-`. This also fixes an inconsistency from dask#5526, where configuration files used `create-cuda-context`, but the configuration read was `create_cuda_context`.

Up until now, we require users to specify what transports should be used by UCX, pushing the configuration burden onto the user, being also error-prone. We can now reduce this configuration burden with just one configuration being added in dask/distributed#5526: `DASK_DISTRIBUTED__COMM__UCX__CREATE_CUDA_CONTEXT`/`distributed.comm.ucx.create_cuda_context`, which creates the CUDA context _before_ UCX is initialized.

This is an example of how to setup a cluster with `dask-cuda-worker` after this change:

```
# Scheduler
UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES=cuda DASK_DISTRIBUTED__COMM__UCX__CREATE_CUDA_CONTEXT=True dask-scheduler --protocol ucx --interface ib0

# Workers
UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES=cuda dask-cuda-worker ucx://${SCHEDULER_IB0_IP}:8786 --interface ib0 --rmm-pool-size 29GiB

# Client
UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES=cuda DASK_DISTRIBUTED__COMM__UCX__CREATE_CUDA_CONTEXT=True python client.py
```

Similarly, one can setup: `LocalCUDACluster(protocol="ucx", interface="ib0")`.

Note above how `ib0` is intentionally specified. That is mandatory to use RDMACM, as it is necessary to have listeners bind to an InfiniBand interface, but can be left unspecified when using systems without InfiniBand or if RDMACM isn't required (discouraged on systems that have InfiniBand connectivity).

The `UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES=cuda` option is specified for optimal InfiniBand performance with CUDA and will be default in UCX 1.12, when specifying it won't be necessary anymore.

Changes introduced here are backwards-compatible, meaning the old options such as `--enable-nvlink`/`enable_nvlink=True` are still valid. However, if any of those options is specified, the user is responsible to enable/disable all desired transports, which can also be useful for benchmarking specific transports.

Finally, creating a CUDA context may not be necessary by UCX in the future, at a point where it will be possible to remove `DASK_DISTRIBUTED__COMM__UCX__CREATE_CUDA_CONTEXT=True` from scheduler/client processes entirely.

Authors:
- Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
- Mads R. B. Kristensen (https://github.com/madsbk)

URL: #792
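For readers unfamiliar with how a variable such as `DASK_DISTRIBUTED__COMM__UCX__CREATE_CUDA_CONTEXT` relates to a dotted config key, the following is a minimal standard-library sketch of the naming convention: strip the `DASK_` prefix, lowercase the rest, and treat `__` as a nesting separator. The function `env_to_config` is a hypothetical helper written for illustration only; it is not Dask's own implementation, and the real loader may differ in details.

```python
import ast


def env_to_config(environ):
    """Map DASK_* environment variables to a nested dict (illustrative only)."""
    config = {}
    for name, value in environ.items():
        if not name.startswith("DASK_"):
            continue  # e.g. UCX_* variables are read by UCX, not by this mapping
        # Drop the prefix, lowercase, and treat "__" as a nesting separator.
        key = name[len("DASK_"):].lower().replace("__", ".")
        try:
            # "True" becomes True; values that are not Python literals stay strings.
            parsed = ast.literal_eval(value)
        except (SyntaxError, ValueError):
            parsed = value
        # Create one nested dict per dotted component, then set the leaf.
        node = config
        *parents, leaf = key.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = parsed
    return config


example_env = {
    "DASK_DISTRIBUTED__COMM__UCX__CREATE_CUDA_CONTEXT": "True",
    "UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES": "cuda",
}
config = env_to_config(example_env)
print(config)
```

In this sketch the leaf key comes out as `create_cuda_context`, with underscores, because environment variable names cannot contain hyphens. That is one reason the choice between `_` and `-` in configuration files and in the code that reads them needs to be consistent.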

pentschev added 2 commits November 18, 2021 13:54

Add distributed.comm.ucx.create-cuda-context config

68721c6

Add distributed.comm.ucx.create-cuda-context tests

5477bc2

pentschev mentioned this pull request Nov 18, 2021

Simplify UCX configs, permitting UCX_TLS=all rapidsai/dask-cuda#792

Merged

jakirkham approved these changes Nov 18, 2021

View reviewed changes

jakirkham mentioned this pull request Nov 19, 2021

Why don't we use UCX_TLS=all rapidsai/ucx-py#245

Closed

jakirkham merged commit 7d1401a into dask:main Nov 19, 2021

FabioRosado mentioned this pull request Nov 21, 2021

Fix test_schema tests #5534

Merged

3 tasks

pentschev mentioned this pull request Nov 22, 2021

Standardize UCX config separator to - #5539

Merged

pentschev deleted the ucx-create-cuda-context branch December 3, 2021 11:24
