hmem/cuda: Add dmabuf fd ops functions #9498
Build check failed as
I think this bit is only available in newer CUDA libraries. I need to add a macro defined in configure.ac.
src/hmem_cuda.c
Outdated
	if (!cuda_is_dmabuf_supported())
		return -FI_EOPNOTSUPP;

	aligned_ptr = (uintptr_t) ofi_get_page_start(addr, host_page_size);
What happens when the DMA buf region spans several pages? Is offset always relative to the closest page?
The offset is always relative to the head of the first page
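To make the answer above concrete, here is a minimal sketch of the arithmetic for a buffer that spans several pages. The struct and function names are hypothetical and are not taken from the PR; the sketch only assumes a power-of-two host page size.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type (not from the PR): the page-aligned window around a buffer. */
struct dmabuf_window {
	uint64_t aligned_ptr;	/* start of the first page the buffer touches */
	uint64_t aligned_size;	/* whole pages, up to the end of the last page */
	uint64_t offset;	/* distance from aligned_ptr to the buffer start */
};

/* Hypothetical helper: compute the window for [addr, addr + size). */
static struct dmabuf_window
page_window(uint64_t addr, uint64_t size, uint64_t page_size)
{
	struct dmabuf_window w;
	uint64_t mask = page_size - 1;
	uint64_t end;

	/* The mask arithmetic below requires a power-of-two page size. */
	assert(page_size != 0 && (page_size & mask) == 0);

	w.aligned_ptr = addr & ~mask;		/* round down to a page start */
	end = (addr + size + mask) & ~mask;	/* round up to a page end */
	w.aligned_size = end - w.aligned_ptr;
	w.offset = addr - w.aligned_ptr;	/* relative to the first page only */
	return w;
}
```

For example, with 4 KiB pages, a buffer at 0x10010 of length 0x2000 touches three pages. The window starts at 0x10000 and the offset is 0x10, measured from the head of the first page regardless of how many pages follow.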
src/hmem_cuda.c
Outdated
	CUdevice dev;
	int is_supported = 0;

	if (cuda_attr.device_count <= 1)
Is it fair to skip the dmabuf support check entirely on single-GPU instances? That makes sense for GDR, but we may use dmabuf with an EFA NIC for HPC applications.
I think this line needs to be removed. I was copying something from detect_p2p_support.
Removed
666a575 to 4ad9cde
Intel CI failure seems to be random (rxm/verbs with mpichtestsuite):
Minor comments
include/ofi_hmem.h
Outdated
@@ -191,6 +191,9 @@ int cuda_dev_reg_copy_from_hmem(uint64_t handle, void *dest, const void *src,
bool cuda_is_ipc_enabled(void);
int cuda_get_ipc_handle_size(size_t *size);
bool cuda_is_gdrcopy_enabled(void);
bool cuda_is_dmabuf_supported(void);
int cuda_get_dmabuf_fd(void *addr, uint64_t size, int *fd,
	uint64_t *offset);
alignment
fixed
configure.ac
Outdated
	[[#include <cuda.h>]])

AC_CHECK_DECL([CU_DEVICE_ATTRIBUTE_DMA_BUF_SUPPORTED],
	[have_cuda_dmabuf_parameters_support=1],
nit: have_cuda_dmabuf_parameters_support -> have_cuda_device_dmabuf_support
No, they are not the same. It's some parameters for dmabuf support.
This parameter is removed in the latest revision
configure.ac
Outdated
@@ -666,6 +679,10 @@ AS_IF([test x"$with_cuda" != x"no" && test -n "$with_cuda" && test "$have_cuda"

AC_DEFINE_UNQUOTED([HAVE_CUDA], [$have_cuda], [CUDA support])

AS_IF([ test x"$have_cuda" = x"1" && test x"$have_cuda_mem_get_handle_for_address_range" = x"1" && test x"$have_cuda_dmabuf_parameters_support" = x"1" ],
Line length
I did it a different way. This line has been removed.
@@ -191,6 +191,9 @@ int cuda_dev_reg_copy_from_hmem(uint64_t handle, void *dest, const void *src,
bool cuda_is_ipc_enabled(void);
int cuda_get_ipc_handle_size(size_t *size);
bool cuda_is_gdrcopy_enabled(void);
bool cuda_is_dmabuf_supported(void);
int cuda_get_dmabuf_fd(void *addr, uint64_t size, int *fd,
Curious why not uint64_t size -> size_t size?
Because it's the type of size in ibv_reg_dmabuf_mr. We just keep it consistent.
	cuda_ret = cuda_ops.cuMemGetHandleForAddressRange(
			(void *)fd,
			aligned_ptr, aligned_size,
			CU_MEM_RANGE_HANDLE_TYPE_DMA_BUF_FD,
Is it safe to assume that this symbol exists at this point?
This symbol is guarded by HAVE_CUDA_DMABUF. If the check in configure.ac does not find the symbol, this code is not compiled.
But I think I need to check 2 symbols, CU_MEM_RANGE_HANDLE_TYPE_DMA_BUF_FD and CU_DEVICE_ATTRIBUTE_DMA_BUF_SUPPORTED. Currently I only check 1 of them. I will update the PR.
Done
src/hmem_cuda.c
Outdated
}

int cuda_get_dmabuf_fd(void *addr, uint64_t size, int *fd,
	uint64_t *offset)
alignment
fixed
src/hmem_cuda.c
Outdated
 * -FI_EIO upon CUDA API error
 */
int cuda_get_dmabuf_fd(void *addr, uint64_t size, int *fd,
	uint64_t *offset)
alignment
fixed
4ad9cde to 8bf56d5
The new version does look cleaner.
8bf56d5 to 1aa068d
@sunkuamzn made a good catch: I should calculate the page-aligned address and length from the base address of the CUDA allocation. I updated it in the latest revision.
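A rough sketch of what computing the window from the allocation base could look like, as opposed to starting from the buffer's own first page. This is an illustration under stated assumptions, not the code in the PR: the struct and function are hypothetical, and alloc_base/alloc_size are assumed to come from a query such as cuMemGetAddressRange.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type; not from the PR. */
struct alloc_window {
	uint64_t aligned_ptr;	/* page-aligned start of the whole allocation */
	uint64_t aligned_size;	/* page-aligned length covering the allocation */
	uint64_t offset;	/* where addr sits inside that window */
};

/*
 * Hypothetical helper. alloc_base/alloc_size describe the entire device
 * allocation that contains addr (assumed to be obtained separately).
 */
static struct alloc_window
window_from_alloc(uint64_t alloc_base, uint64_t alloc_size,
		  uint64_t addr, uint64_t page_size)
{
	struct alloc_window w;
	uint64_t mask = page_size - 1;
	uint64_t end;

	assert(page_size != 0 && (page_size & mask) == 0);
	assert(addr >= alloc_base && addr < alloc_base + alloc_size);

	/* Align the allocation, not the user buffer. */
	w.aligned_ptr = alloc_base & ~mask;
	end = (alloc_base + alloc_size + mask) & ~mask;
	w.aligned_size = end - w.aligned_ptr;

	/* The offset is still measured from the start of the aligned window. */
	w.offset = addr - w.aligned_ptr;
	return w;
}
```

With this shape, two buffers inside the same allocation map to the same aligned window and differ only in their offset.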
c1b2f4a to de1b7e3
src/hmem_cuda.c
Outdated
#if HAVE_CUDA_DMABUF
#define CUDA_DRIVER_FUNCS_DEF(_)		\
	_(cuGetErrorName)			\
	_(cuGetErrorString)			\
	_(cuPointerGetAttribute)		\
	_(cuPointerSetAttribute)		\
	_(cuDeviceCanAccessPeer)		\
	_(cuMemGetAddressRange)			\
	_(cuMemGetHandleForAddressRange)	\
	_(cuDeviceGetAttribute)			\
	_(cuDeviceGet)
#else
#define CUDA_DRIVER_FUNCS_DEF(_)		\
	_(cuGetErrorName)			\
	_(cuGetErrorString)			\
	_(cuPointerGetAttribute)		\
	_(cuPointerSetAttribute)		\
	_(cuDeviceCanAccessPeer)		\
	_(cuMemGetAddressRange)
#endif /* HAVE_CUDA_DMABUF */
It's simpler to conditionally define a dedicated macro for the dmabuf-specific functions and have CUDA_DRIVER_FUNCS_DEF(_) unconditionally inherit them. Otherwise, you need to maintain the function list for both conditions, which defeats the purpose of the macros.
#if HAVE_CUDA_DMABUF
#define CUDA_DRIVER_DMABUF_FUNCS_DEF(_)		\
	_(cuMemGetHandleForAddressRange)	\
	_(cuDeviceGetAttribute)			\
	_(cuDeviceGet)
#else
#define CUDA_DRIVER_DMABUF_FUNCS_DEF(_)
#endif
#define CUDA_DRIVER_FUNCS_DEF(_)		\
	_(cuGetErrorName)			\
	/* ... */				\
	CUDA_DRIVER_DMABUF_FUNCS_DEF(_)
Thanks. Updated
I moved cuDeviceGetAttribute and cuDeviceGet out of CUDA_DRIVER_DMABUF_FUNCS_DEF because they are not necessarily used for dmabuf and should be generally available in older CUDA versions.
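The nested X-macro pattern suggested in this thread can be tried in isolation. The sketch below uses made-up names throughout (HAVE_DEMO_DMABUF, the DEMO_* macros, and the demo* symbols are all hypothetical stand-ins, not libfabric or CUDA identifiers) and simply expands the list into a table of symbol names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical feature switch standing in for HAVE_CUDA_DMABUF. */
#ifndef HAVE_DEMO_DMABUF
#define HAVE_DEMO_DMABUF 1
#endif

/* Feature-specific entries live in their own list, which may be empty. */
#if HAVE_DEMO_DMABUF
#define DEMO_DMABUF_FUNCS_DEF(_)		\
	_(demoGetHandleForAddressRange)
#else
#define DEMO_DMABUF_FUNCS_DEF(_)
#endif

/* The main list is written once and always pulls in the sub-list. */
#define DEMO_FUNCS_DEF(_)			\
	_(demoGetErrorName)			\
	_(demoGetErrorString)			\
	DEMO_DMABUF_FUNCS_DEF(_)

/* One possible expansion: turn every entry into a string literal. */
#define DEMO_NAME(sym) #sym,
static const char *const demo_func_names[] = {
	DEMO_FUNCS_DEF(DEMO_NAME)
};

static size_t demo_func_count(void)
{
	return sizeof(demo_func_names) / sizeof(demo_func_names[0]);
}

static const char *demo_func_name(size_t i)
{
	assert(i < demo_func_count());
	return demo_func_names[i];
}
```

Building with -DHAVE_DEMO_DMABUF=0 drops the last entry without touching the main list, which is the maintenance benefit the reviewer describes.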
src/hmem_cuda.c
Outdated
 	.use_ipc = false,
 	.driver_handle = NULL,
 	.runtime_handle = NULL,
-	.nvml_handle = NULL
+	.nvml_handle = NULL,
+	.dmabuf_supported = false
Style nit: initialize .dmabuf_supported after .use_ipc
Done
ccebc4a to 875de77
bot:aws:retest
Implement the get_dmabuf_fd API for cuda interface. Signed-off-by: Shi Jin <sjina@amazon.com>
875de77 to f7d813a
@darrylabbate does the new revision look good to you?
Sure