
Make IPC handle export optional in cuda_async_memory_resource #1030

Merged
merged 27 commits into from
May 26, 2022

Conversation

harrism
Member

@harrism harrism commented May 4, 2022

POSIX handle export is not currently supported by cudaMemPoolCreate on WSL2. This change makes not exporting IPC handles (cudaMemHandleTypeNone) the default, and allows setting a different handle type.

In Python, cuda_async_memory_resource takes a new enable_ipc parameter that currently defaults to False, which is a breaking change. Defaulting to False was necessary because supported handle types can only be queried on CUDA 11.3 and above, not on 11.2. In addition, the True path is currently written to support only POSIX handles, so WSL2 is not supported, and IPC adds some overhead to cudaMallocAsync pools that we may want to avoid.
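As a sketch of the mapping the new parameter drives (names here are illustrative, not the actual RMM Cython code; the constants mirror CUDA's cudaMemAllocationHandleType values):

```python
# Hypothetical sketch: how the Python-level enable_ipc flag could map to the
# CUDA handle type passed to cudaMemPoolCreate. Constant values mirror
# CUDA's cudaMemAllocationHandleType enum.
CUDA_MEM_HANDLE_TYPE_NONE = 0x0      # cudaMemHandleTypeNone
CUDA_MEM_HANDLE_TYPE_POSIX_FD = 0x1  # cudaMemHandleTypePosixFileDescriptor

def choose_export_handle_type(enable_ipc: bool) -> int:
    """With enable_ipc=False (the new default), no IPC handle export is
    requested, which also works on WSL2. With enable_ipc=True, only the
    POSIX file-descriptor path is supported, so WSL2 is excluded."""
    if enable_ipc:
        return CUDA_MEM_HANDLE_TYPE_POSIX_FD
    return CUDA_MEM_HANDLE_TYPE_NONE
```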

Fixes #1029

@harrism harrism added breaking Breaking change bug Something isn't working labels May 4, 2022
@github-actions github-actions bot added cpp Pertains to C++ code Python Related to RMM Python API labels May 4, 2022
@harrism harrism marked this pull request as ready for review May 11, 2022 23:48
@harrism harrism requested review from a team as code owners May 11, 2022 23:48
@harrism harrism requested review from vyasr and bdice May 11, 2022 23:48
@harrism
Member Author

harrism commented May 12, 2022

I ran into a problem with the Pytests on CUDA 11.2 because of this C++ code:

static bool is_export_handle_type_supported(cudaMemAllocationHandleType handle_type)
  {
    int supported_handle_types_bitmask{};
#if CUDART_VERSION >= 11030  // 11.3 introduced cudaDevAttrMemoryPoolSupportedHandleTypes
    cudaDeviceGetAttribute(&supported_handle_types_bitmask,
                           cudaDevAttrMemoryPoolSupportedHandleTypes,
                           rmm::detail::current_device().value());
#endif
    return (supported_handle_types_bitmask & handle_type) == handle_type;
  }

On CUDA 11.2 there is no way to check the supported handle types. So we don't support anything other than cudaMemHandleTypeNone... But in the Python code we default enable_ipc to True which causes it to pass cudaMemHandleTypePosixFileDescriptor. While this handle type is supported on non-WSL2 Linux, this function returns false in this situation, which causes an exception to be thrown.

There are two options: change the default to enable_ipc = False, or make this function return true for all handle types on CUDA 11.2. I think the latter is undesirable. Will the former cause problems?
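The supported-handle-type test above is a bitmask check, and a pure-Python model of it (a sketch, no CUDA calls) makes the CUDA 11.2 failure mode concrete: the attribute query is compiled out, so the mask stays zero and every nonzero handle type is rejected.

```python
# Pure-Python model of the C++ bitmask check; no CUDA calls involved.
CUDA_MEM_HANDLE_TYPE_NONE = 0x0      # cudaMemHandleTypeNone
CUDA_MEM_HANDLE_TYPE_POSIX_FD = 0x1  # cudaMemHandleTypePosixFileDescriptor

def is_export_handle_type_supported(handle_type: int,
                                    supported_mask: int) -> bool:
    """A handle type counts as supported only if all of its bits appear in
    the device's supported-types bitmask. On CUDA 11.2 the attribute query
    is compiled out, so the mask is always 0 there and every nonzero
    handle type is reported unsupported (NONE, being 0, always passes)."""
    return (supported_mask & handle_type) == handle_type
```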

@harrism
Member Author

harrism commented May 17, 2022

OK @shwina to avoid having to do the hackery on the Cython side, I added an enum class allocation_handle_type as @robertmaynard suggested, inside of cuda_async_memory_resource. This way we only expose the RMM enum on the public interface and we can have the same public interface on all CUDA versions. But I couldn't figure out how to use this enum in Cython (didn't spend long because I figure you will know instantly how to do it). I checked in my attempt -- can you show me how to fix it? Thanks!

@harrism
Member Author

harrism commented May 23, 2022

@vyasr I think I've addressed all your comments. Thanks everyone for the reviews.

Before we merge, I would like someone to comment on whether changing the default to disable IPC export is going to cause problems (e.g. for UCX or Dask).

Thanks!

Contributor

@vyasr vyasr left a comment


LGTM! There's a minor typo fix that I'll just commit.

@vyasr
Contributor

vyasr commented May 23, 2022

@vyasr I think I've addressed all your comments. Thanks everyone for the reviews.

Before we merge, I would like someone to comment on whether changing the default to disable IPC export is going to cause problems (e.g. for UCX or Dask).

Thanks!

I'm guessing that this would be a good question for @quasiben or @jakirkham.

@jakirkham jakirkham requested a review from pentschev May 24, 2022 06:10
@jakirkham
Member

Think @pentschev would know better. So will defer to him

{
  int supported_handle_types_bitmask{};
#if CUDART_VERSION >= 11030  // 11.3 introduced cudaDevAttrMemoryPoolSupportedHandleTypes
  cudaDeviceGetAttribute(&supported_handle_types_bitmask,
Contributor

Need to error-check the return value.

Member Author

Done, using RMM_CUDA_TRY. At least this way we will throw a rmm::cuda_error exception.
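For illustration only (RMM_CUDA_TRY is a C++ macro; this is a hedged Python analogue, with hypothetical names), the pattern is simply check-the-status-and-throw:

```python
# Hypothetical Python analogue of the RMM_CUDA_TRY pattern: wrap a CUDA
# runtime status code and raise a dedicated error type on failure.
class CudaError(RuntimeError):
    """Stand-in for rmm::cuda_error."""

def cuda_try(status: int, msg: str = "CUDA call failed") -> None:
    """Raise CudaError if a CUDA runtime call did not return
    cudaSuccess (0); otherwise do nothing."""
    CUDA_SUCCESS = 0
    if status != CUDA_SUCCESS:
        raise CudaError(f"{msg} (error code {status})")
```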

Member

@jakirkham jakirkham left a comment

After discussing offline, sounds like we are good to go on the Python side

@harrism
Member Author

harrism commented May 25, 2022

I benchmarked with and without IPC enabled (None vs. PosixFileDescriptor) and see no appreciable difference. So I will try setting the default in Python back to True.

Comparing async_mr to async_mr_ipc (from gbenchmarks/RANDOM_ALLOCATIONS_BENCH)
Benchmark                                                                               Time             CPU      Time Old      Time New       CPU Old       CPU New
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
BM_RandomAllocations/cuda_[async_mr vs. async_mr_ipc]/1000/1                         +0.0059         +0.0059             2             2             2             2
BM_RandomAllocations/cuda_[async_mr vs. async_mr_ipc]/1000/4                         -0.0096         -0.0096             2             2             2             2
BM_RandomAllocations/cuda_[async_mr vs. async_mr_ipc]/1000/64                        -0.0167         -0.0167             2             2             2             2
BM_RandomAllocations/cuda_[async_mr vs. async_mr_ipc]/1000/256                       +0.0124         +0.0149             2             2             2             2
BM_RandomAllocations/cuda_[async_mr vs. async_mr_ipc]/1000/1024                      -0.0045         -0.0115             2             2             2             2
BM_RandomAllocations/cuda_[async_mr vs. async_mr_ipc]/1000/4096                      -0.0313         -0.0008             2             2             2             2
BM_RandomAllocations/cuda_[async_mr vs. async_mr_ipc]/10000/1                        +0.0062         +0.0062            22            22            22            22
BM_RandomAllocations/cuda_[async_mr vs. async_mr_ipc]/10000/4                        -0.0024         -0.0024            23            23            23            23
BM_RandomAllocations/cuda_[async_mr vs. async_mr_ipc]/10000/64                       -0.0034         -0.0034            24            24            24            24
BM_RandomAllocations/cuda_[async_mr vs. async_mr_ipc]/10000/256                      +0.0032         +0.0054            24            24            23            23
BM_RandomAllocations/cuda_[async_mr vs. async_mr_ipc]/10000/1024                     -0.0179         -0.0079            23            23            20            20
BM_RandomAllocations/cuda_[async_mr vs. async_mr_ipc]/10000/4096                     +0.0133         +0.0160            24            24            18            18
BM_RandomAllocations/cuda_[async_mr vs. async_mr_ipc]/100000/1                       -0.0005         -0.0005           270           270           270           270
BM_RandomAllocations/cuda_[async_mr vs. async_mr_ipc]/100000/4                       +0.0119         +0.0116           256           259           256           259
BM_RandomAllocations/cuda_[async_mr vs. async_mr_ipc]/100000/64                      +0.0093         +0.0093           242           245           242           245
BM_RandomAllocations/cuda_[async_mr vs. async_mr_ipc]/100000/256                     -0.0096         -0.0084           244           242           236           234
BM_RandomAllocations/cuda_[async_mr vs. async_mr_ipc]/100000/1024                    +0.0082         -0.0163           236           237           212           208
BM_RandomAllocations/cuda_[async_mr vs. async_mr_ipc]/100000/4096                    +0.0143         +0.0075           274           277           193           194

@harrism
Member Author

harrism commented May 25, 2022

I changed the Python enable_ipc parameter default back to True. If this passes CI, then we can leave it that way so this is not a breaking change for Python.

@harrism
Member Author

harrism commented May 25, 2022

OK, I ran into the same problem with enable_ipc=True: on CUDA 11.2, this causes it to report false for whether the handle type is supported, causing the constructor to fail. The only fix that I can see is either to default to False, or to hack the Cython to use a different value of the handle type for CUDA 11.2 than all other versions. My Cython-foo is not strong enough to do that myself.

@jakirkham
Member

We could default to None in Python and then use False if it's an option and True if not

@harrism
Member Author

harrism commented May 25, 2022

We could default to None in Python and then use False if it's an option and True if not

I don't understand. True and False are both options. But how do we detect the CUDA runtime version at runtime in Python?

That would still be a breaking change. If we are going to break the API (which I am increasingly thinking is fine to do) then why not just default to False.

@pentschev
Member

I don't understand. True and False are both options. But how do we detect the CUDA runtime version at runtime in Python?

That would still be a breaking change. If we are going to break the API (which I am increasingly thinking is fine to do) then why not just default to False.

Defaulting to None would mean RMM internally figures out what to do based on the CUDA runtime version when the user hasn't specified that option. True and False would mean an option explicitly specified by the user, and RMM should respect that, regardless of any internal checks that are in place.

To find the driver/runtime versions you can use RMM! For example:

https://github.com/rapidsai/dask-cuda/blob/63529e891b52aee6c1bfeeecd0c8ff272628d4d0/dask_cuda/tests/test_dask_cuda_worker.py#L112-L115
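A minimal sketch of that tri-state resolution (names and the 11030 cutoff are assumptions based on this thread, not the actual RMM code; 11030 is the runtime version at which the support query was introduced):

```python
# Hedged sketch of the tri-state enable_ipc default discussed above.
CUDA_MEM_HANDLE_TYPE_NONE = 0x0      # cudaMemHandleTypeNone
CUDA_MEM_HANDLE_TYPE_POSIX_FD = 0x1  # cudaMemHandleTypePosixFileDescriptor

def resolve_handle_type(enable_ipc, runtime_version: int) -> int:
    """enable_ipc=None lets the library decide from the CUDA runtime
    version (the support query exists only on >= 11.3, i.e. 11030, so
    IPC is left off below that); True/False are explicit user choices
    and are respected as-is."""
    if enable_ipc is None:
        enable_ipc = runtime_version >= 11030
    return (CUDA_MEM_HANDLE_TYPE_POSIX_FD if enable_ipc
            else CUDA_MEM_HANDLE_TYPE_NONE)
```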

@shwina
Contributor

shwina commented May 25, 2022

That would still be a breaking change. If we are going to break the API (which I am increasingly thinking is fine to do) then why not just default to False.

Sync'd offline with Mark and we agreed this is best to do. We don't believe this is widely used enough that it will impact many users.

@harrism harrism added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for review Ready for review by team labels May 26, 2022
Contributor

@msadang msadang left a comment

LGTM

Successfully merging this pull request may close these issues.

[BUG] cuda_async_memory_resource.hpp:77: cudaErrorInvalidValue on WSL2 in Windows10