Add distributed.comm.ucx.create-cuda-context config #5526

Merged on Nov 19, 2021 (2 commits)


@pentschev pentschev commented Nov 18, 2021

Explicitly creating a CUDA context may allow GPU-based workflows to rely on UCX to choose the best transports itself, without requiring the user to specify which ones to enable.

Relevant changes to make use of this feature are being applied in rapidsai/dask-cuda#792.

@pentschev
Member Author

The failing Windows/macOS tests shouldn't be relevant here, since UCX is only supported on Linux anyway. The failing Linux tests should be fixed, as reported in #5527.

Can we rerun tests here, or are we good as they are now?

@jakirkham
Member

Thanks Peter! 😄

@pentschev
Member Author

Thanks @jakirkham for the review and merging! 😄

@FabioRosado FabioRosado mentioned this pull request Nov 21, 2021
3 tasks
pentschev added a commit to pentschev/distributed that referenced this pull request Nov 22, 2021
Some of the UCX configurations use `_` whereas others use `-`. This is
confusing so we now standardize everything to `-`. This also fixes an
inconsistency from dask#5526, where
configuration files used `create-cuda-context`, but the configuration
read was `create_cuda_context`.
jrbourbeau pushed a commit that referenced this pull request Nov 23, 2021
Some of the UCX configurations use `_` whereas others use `-`. This is
confusing so we now standardize everything to `-`. This also fixes an
inconsistency from #5526, where
configuration files used `create-cuda-context`, but the configuration
read was `create_cuda_context`.
rapids-bot bot pushed a commit to rapidsai/dask-cuda that referenced this pull request Nov 29, 2021
Until now, users had to specify which transports UCX should use, which pushed the configuration burden onto them and was error-prone. That burden is now reduced to a single configuration option added in dask/distributed#5526: `DASK_DISTRIBUTED__COMM__UCX__CREATE_CUDA_CONTEXT`/`distributed.comm.ucx.create_cuda_context`, which creates the CUDA context _before_ UCX is initialized.
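
As a rough illustration of how the environment variable relates to the dotted config key, the sketch below mimics Dask's documented convention for `DASK_`-prefixed variables: the prefix is dropped, the name is lowercased, and double underscores mark nesting. The helper `env_to_config_key` is hypothetical and not part of Dask or distributed; Dask's own config loader does the equivalent internally.

```python
import ast


def env_to_config_key(name: str, value: str):
    """Hypothetical helper: map a DASK_* environment variable to a dotted config key."""
    if not name.startswith("DASK_"):
        raise ValueError(f"not a Dask environment variable: {name}")
    # Drop the prefix, lowercase, and turn "__" into the nesting separator.
    key = name[len("DASK_"):].lower().replace("__", ".")
    # Interpret the value as a Python literal when possible ("True" -> True),
    # otherwise keep it as a plain string.
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        parsed = value
    return key, parsed


key, value = env_to_config_key(
    "DASK_DISTRIBUTED__COMM__UCX__CREATE_CUDA_CONTEXT", "True"
)
print(key, value)
```

Under this mapping the variable lands on `distributed.comm.ucx.create_cuda_context` with the boolean value `True`.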

This is an example of how to set up a cluster with `dask-cuda-worker` after this change:

```
# Scheduler
UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES=cuda DASK_DISTRIBUTED__COMM__UCX__CREATE_CUDA_CONTEXT=True dask-scheduler --protocol ucx --interface ib0

# Workers
UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES=cuda dask-cuda-worker ucx://${SCHEDULER_IB0_IP}:8786 --interface ib0 --rmm-pool-size 29GiB

# Client
UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES=cuda DASK_DISTRIBUTED__COMM__UCX__CREATE_CUDA_CONTEXT=True python client.py
```

Similarly, one can set up: `LocalCUDACluster(protocol="ucx", interface="ib0")`.
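
When the client is launched from Python rather than from a shell, the two variables from the client line above can be passed through the child process environment. This is a minimal sketch under that assumption; `client_env` is a hypothetical helper, and `client.py` is the script name used in the shell example.

```python
import os


def client_env(base=None):
    """Hypothetical helper: environment for a UCX client or scheduler process."""
    env = dict(os.environ if base is None else base)
    # Ask for the CUDA context to be created before UCX is initialized.
    env["DASK_DISTRIBUTED__COMM__UCX__CREATE_CUDA_CONTEXT"] = "True"
    # Specified for InfiniBand performance with CUDA, as in the shell example.
    env["UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES"] = "cuda"
    return env


env = client_env({"PATH": "/usr/bin"})
print(env["DASK_DISTRIBUTED__COMM__UCX__CREATE_CUDA_CONTEXT"])

# To launch the client with this environment (not run here):
# import subprocess, sys
# subprocess.run([sys.executable, "client.py"], env=client_env(), check=True)
```

Passing a copy of the environment keeps the parent process untouched, so only the launched client sees the UCX-related settings.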

Note that `ib0` is specified intentionally above. This is mandatory for RDMACM, which needs listeners bound to an InfiniBand interface. It can be left unspecified on systems without InfiniBand or when RDMACM isn't required (which is discouraged on systems that have InfiniBand connectivity). The `UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES=cuda` option is specified for optimal InfiniBand performance with CUDA; it will be the default in UCX 1.12, at which point specifying it will no longer be necessary.

Changes introduced here are backwards-compatible, meaning the old options such as `--enable-nvlink`/`enable_nvlink=True` are still valid. However, if any of those options is specified, the user is responsible for enabling/disabling all desired transports, which can also be useful for benchmarking specific transports.
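
The rule above can be pictured with a small sketch: if no transport option is given, selection is left to UCX; as soon as one is given, only what the user specified counts. The function `resolve_transports` is hypothetical and is not dask-cuda's actual implementation; `enable_nvlink` is the only option name taken from the text above, and any other keyword passed in is treated the same way.

```python
def resolve_transports(**explicit):
    """Hypothetical helper: decide whether UCX picks transports automatically."""
    # Keep only the options the user actually set (None means "not specified").
    chosen = {name: flag for name, flag in explicit.items() if flag is not None}
    if not chosen:
        # Nothing specified: leave transport selection to UCX.
        return {"mode": "automatic", "transports": {}}
    # At least one option specified: the user now controls all transports.
    return {"mode": "manual", "transports": chosen}


print(resolve_transports())
print(resolve_transports(enable_nvlink=True))
```

The manual branch is what makes the old options useful for benchmarking a specific transport in isolation.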

Finally, UCX may not need a CUDA context to be created in the future, at which point it will be possible to remove `DASK_DISTRIBUTED__COMM__UCX__CREATE_CUDA_CONTEXT=True` from scheduler/client processes entirely.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Mads R. B. Kristensen (https://github.com/madsbk)

URL: #792
@pentschev pentschev deleted the ucx-create-cuda-context branch December 3, 2021 11:24