Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add rmm::prefetch() and DeviceBuffer.prefetch() #1573

Merged
merged 32 commits into from
Jun 12, 2024

Conversation

harrism
Copy link
Member

@harrism harrism commented May 30, 2024

Description

This adds two rmm::prefetch() functions in C++.

  1. rmm::prefetch(void *ptr, size_t bytes, device, stream)
  2. rmm::prefetch<T>(cuda::std::span<T> data, device, stream)

Item 2 enables prefetching the containers that RMM provides (device_uvector, device_scalar) that support conversion to cuda::std::span. In order to enable that, device_scalar::size() is added.

Note that device_buffers must be prefetched using item 1 because you can't create a span<void>.

In Python, this adds DeviceBuffer.prefetch() because that's really the only RMM Python data type to prefetch. There is one Cython use of device_uvector in cuDF join that we might need to add prefetch support for later.

prefetch is a no-op on non-managed memory. Rather than querying the type of memory, it just catches cudaErrorInvalidValue from cudaMemPrefetchAsync.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added Python Related to RMM Python API cpp Pertains to C++ code labels May 30, 2024
@harrism harrism added feature request New feature or request non-breaking Non-breaking change labels May 30, 2024
@jrhemstad
Copy link
Contributor

I don't think this should be a member function. The fact that the member function is only useful for specific upstream resource types is a big code smell to me.

Instead, I think this should be a freestanding function like:

template <typename T>
auto prefetch(T const* ptr, size_t n, device, stream);

@harrism
Copy link
Member Author

harrism commented Jun 4, 2024

I don't think this should be a member function. The fact that the member function is only useful for specific upstream resource types is a big code smell to me. Instead, I think this should be a freestanding function

I actually had the same thought over the weekend. But there are a few other factors to consider.

  1. In RMM Python, the only thing one has to prefetch is a DeviceBuffer. And pointers are non-Pythonic. So I think it makes sense to associate prefetch with DeviceBuffer in the python API, either as a method or a free function that expects a DeviceBuffer.
  2. I do think we want to put unconditional calls to prefetch in algorithm code (e.g. in cuDF). These should be functional no-ops when the memory is not managed. (A migrateable or similar property for cuda::mr would help with that...)
  3. What do you think about versions for containers and iterators? I think only providing a pointer interface is error-prone, and makes it harder than necessary to prefetch a range from a container, or a whole container.
template <typename Iterator>
auto prefetch(Iterator begin, Iterator end, device, stream)
{
  constexpr auto elt_size = sizeof(std::iterator_traits<Iterator>::value_type);
  return prefetch(begin, std::distance(begin, end) * elt_size, device, stream);
}

template <typename Container>
auto prefetch(Container const& c, device, stream) 
{ 
  return prefetch(c.begin(), c.end(), device, stream);
}

@github-actions github-actions bot added the CMake label Jun 4, 2024
@harrism harrism changed the title Add prefetch() method to device_buffer Add rmm::prefetch() and DeviceBuffer.prefetch() Jun 4, 2024
@harrism harrism marked this pull request as ready for review June 5, 2024 02:20
@harrism harrism requested review from a team as code owners June 5, 2024 02:20
@harrism harrism requested review from rongou and bdice June 5, 2024 02:20
* @param stream The stream to use for the prefetch
*/
template <typename T>
void prefetch(cuda::std::span<T> data, rmm::cuda_device_id device, rmm::cuda_stream_view stream)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function shouldn't be mutating any of the data.

Suggested change
void prefetch(cuda::std::span<T> data, rmm::cuda_device_id device, rmm::cuda_stream_view stream)
void prefetch(cuda::std::span<T const> data, rmm::cuda_device_id device, rmm::cuda_stream_view stream)

Comment on lines 41 to 47
* @param ptr The pointer to the memory to prefetch
* @param size The number of bytes to prefetch
* @param device The device to prefetch to
* @param stream The stream to use for the prefetch
*/
template <typename T>
void prefetch(T* ptr, std::size_t size, rmm::cuda_device_id device, rmm::cuda_stream_view stream)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If ptr is typed, then size shouldn't be bytes, it should be elements.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. If this takes T* ptr, it should use size * sizeof(T) to compute the bytes.

Or, if this is really designed for the case of device buffers, it could just use void* ptr and accept size in bytes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm switching it back to void const* because then we can use span::size_bytes() in the span function. Someone suggested the T* version during discussions but I can't remember who or why. If there is a good reason, I'm all ears.

Comment on lines 41 to 47
* @param ptr The pointer to the memory to prefetch
* @param size The number of bytes to prefetch
* @param device The device to prefetch to
* @param stream The stream to use for the prefetch
*/
template <typename T>
void prefetch(T* ptr, std::size_t size, rmm::cuda_device_id device, rmm::cuda_stream_view stream)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. If this takes T* ptr, it should use size * sizeof(T) to compute the bytes.

Or, if this is really designed for the case of device buffers, it could just use void* ptr and accept size in bytes.

python/rmm/rmm/_lib/device_buffer.pyx Outdated Show resolved Hide resolved
def test_rmm_device_buffer_prefetch(pool, managed):
rmm.reinitialize(pool_allocator=pool, managed_memory=managed)
db = rmm.DeviceBuffer.to_device(np.zeros(256, dtype="u1"))
db.prefetch() # just test that it doesn't throw
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might be able to test that a prefetch call was issued with cudaMemRangeGetAttribute (void* data, size_t dataSize, cudaMemRangeAttribute attribute, const void* devPtr, size_t count), with attribute cudaMemRangeAttributeLastPrefetchLocation. It ought to be possible to call this via cuda-python (API docs).

Also:

Note that this simply returns the last location that the applicaton requested to prefetch the memory range to. It gives no indication as to whether the prefetch operation to that location has completed or even begun.

You wouldn't know if the prefetch completed or not, but you could verify that the prefetch request was issued.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...of course, as soon as I scrolled down, I see you did this exact thing in the C++ tests, at Vyas's request. It would be nice to have a corresponding Python API test, since it should be quick to write with cuda-python.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CUDA Python results in very ugly code due to its non-pythonic error handling. But I've done what you asked...

@harrism harrism requested review from bdice, jrhemstad and vyasr June 7, 2024 02:36
@wence-
Copy link
Contributor

wence- commented Jun 11, 2024

Devcontainer build fails are due to (I think) not having NVIDIA/cccl#1836

@harrism
Copy link
Member Author

harrism commented Jun 12, 2024

Devcontainer build fails are due to (I think) not having NVIDIA/cccl#1836

I don't understand. This was working previously. I think this broke with the update to CCCL 2.5, is this a regression in 2.5?

@miscco any idea why this was working fine before? I tried to go back to your godbolt example but godbolt only supports up to CCCL 2.2

Note that this only affects device_scalar, device_uvector converts to a span no problem. I confirmed that adding begin/end to device_scalar fixes this, but I'm not sure we want to do that...

@leofang
Copy link
Member

leofang commented Jun 12, 2024

It seems the doc-build pipeline failed but there's no log. Is it possible to retrigger it?

@harrism
Copy link
Member Author

harrism commented Jun 12, 2024

/merge

@rapids-bot rapids-bot bot merged commit a709394 into rapidsai:branch-24.08 Jun 12, 2024
58 checks passed
device : optional
The CUDA device to which to prefetch the memory for this buffer.
Defaults to the current CUDA device. To prefetch to the CPU, pass
`~cuda.cudart.cudaCpuDeviceId` as the device.
Copy link
Member

@leofang leofang Jun 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems the doc build did not render it right, from here:
https://downloads.rapids.ai/ci/rmm/pull-request/1573/5165889/docs/rmm/html/
Possibly because the default role in RMM is "cpp" instead of "py". If so the fix would be

:py:`~cuda.cudart.cudaCpuDeviceId`

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am certain that the correct reference is there. It can be checked as follows:

$ python -m sphinx.ext.intersphinx https://nvidia.github.io/cuda-python/objects.inv | grep "cuda.cudart.cudaCpuDeviceId"
    cuda.cudart.cudaCpuDeviceId                                                      : module/cudart.html#cuda.cudart.cudaCpuDeviceId

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did this get fixed or should this be raised in a new issue?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t think this was seen after merging. Issue filed: #1635 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Bradley! 🙏

copy-pr-bot bot pushed a commit that referenced this pull request Jun 17, 2024
This adds two `rmm::prefetch()` functions in C++.
1. `rmm::prefetch(void *ptr, size_t bytes, device, stream)`
2. `rmm::prefetch<T>(cuda::std::span<T> data, device, stream)`

Item 2 enables prefetching the containers that RMM provides (`device_uvector`, `device_scalar`) that support conversion to `cuda::std::span`. In order to enable that, `device_scalar::size()` is added.

Note that `device_buffer`s must be prefetched using item 1 because you can't create a `span<void>`.

In Python, this adds `DeviceBuffer.prefetch()` because that's really the only RMM Python data type to prefetch. There is *one* Cython use of `device_uvector` in cuDF `join` that we might need to add prefetch support for later.

`prefetch` is a no-op on non-managed memory. Rather than querying the type of memory, it just catches `cudaErrorInvalidValue` from `cudaMemPrefetchAsync`.

Authors:
  - Mark Harris (https://github.com/harrism)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)
  - Rong Ou (https://github.com/rongou)
  - Jake Hemstad (https://github.com/jrhemstad)
  - Michael Schellenberger Costa (https://github.com/miscco)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #1573
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake cpp Pertains to C++ code feature request New feature or request non-breaking Non-breaking change Python Related to RMM Python API
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

10 participants