Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Improve Cython Lifetime Management by Adding References in DeviceBuffer #661

Merged

Conversation

mdemoret-nv
Copy link
Contributor

As discussed with @shwina @harrism and @kkraus14, this PR adds 2 properties to DeviceBuffer to allow for automatic reference counting of MemoryResource and Stream objects. This will prevent any MemoryResource from being destructed while any DeviceBuffer that needs the MR for deallocation is still alive.

There are a few outstanding issues I could use input on:

  1. The test test_rmm_device_buffer is failing due to the line: sys.getsizeof(b) == b.size. Need input on the best way forward.
    1. This test is failing since DeviceBuffer is now involved in GC. Python automatically adds the GC memory overhead to __size__ (see here) which makes it difficult to continue working the same way it has before.
    2. Only options I can think of are:
      1. Remove this check from the test or alter the "correct" value
      2. Add @cython.no_gc which is very risky.
  2. The current PR implementation includes cuda stream object reference counting but treats all Stream objects the same. @harrism mentioned only streams owned by RMM should be tracked this way but I am not sure if thats necessary or how to distinguish them at this point.

Other than the above items, all test are passing and I ran this through the cuML test suite without any issues. Thanks for your help.

@mdemoret-nv mdemoret-nv requested a review from a team as a code owner December 15, 2020 04:02
@shwina
Copy link
Contributor

shwina commented Dec 15, 2020

Only options I can think of are:
Remove this check from the test or alter the "correct" value
Add @cython.no_gc which is very risky.

We should remove this test entirely. The overhead added to the result of __sizeof__() is very much an implementation detail. Previous discussion here: #654 (comment)

@shwina
Copy link
Contributor

shwina commented Dec 15, 2020

The addition of Python attributes to DeviceBuffer means that it can now technically participate in reference cycles. It also has an attribute (Stream) that can participate in reference cycles. Objects in reference cycles are cleaned up by Python's GC.

My question is: does this open the possibility of Python's GC destroying the Stream before the underlying device buffers are deallocated? (see: cython/cython#3946)

If we're sure that typical usage patterns won't result in reference cycles, then maybe this is not worth worrying about :)

@mdemoret-nv
Copy link
Contributor Author

We should remove this test entirely. The overhead added to the result of __sizeof__() is very much an implementation detail. Previous discussion here: #654 (comment)

Sounds good. Using __sizeof__ this way seems odd to me anyways and I can't think of a situation where a user would query the size of a buffer with sys.getsizeof(buf) instead of buf.size. Leaving it up to the implementation seems like the right move.

The addition of Python attributes to DeviceBuffer means that it can now technically participate in reference cycles. It also has an attribute (Stream) that can participate in reference cycles. Objects in reference cycles are cleaned up by Python's GC.

My question is: does this open the possibility of Python's GC destroying the Stream before the underlying device buffers are deallocated? (see: cython/cython#3946)

That's a great question. The Cython docs have this second on how to handle that scenario. I couldn't completely rule out the possibility so I had considered adding @cython.no_gc_clear to DeviceBuffer. I ultimately didn't include it in the first iteration because I couldn't think of a situation where there would be a reference cycle. Glad you brought this up. Can you post an example reference cycle I could add to the tests? I'm not too familiar with the new stream objects just yet but it would be great to test this experimentally.

If we're sure that typical usage patterns won't result in reference cycles, then maybe this is not worth worrying about :)

I am going to defer to you on this one. I couldn't think of a situation but I hardly am an advanced user. Additionally, it's worth considering if this may come up in the future given the rmm team's current plans.

@harrism
Copy link
Member

harrism commented Dec 15, 2020

2\. The current PR implementation includes cuda stream object reference counting but treats all `Stream` objects the same. @harrism mentioned only streams owned by RMM should be tracked this way but I am not sure if thats necessary or how to distinguish them at this point.

Not sure when I mentioned this too you. However, it's not that you shouldn't hold a reference to a Python stream object for a stream you don't own. It's just that if RMM doesn't own the underlying cudaStream_t, there's a chance that the actual owner might cudaStreamDestroy it out from under you regardless of whether you hold a reference to the Python object or not. It is what it is, and documentation is our only way to avoid this case.

Does this PR overlap #650 ?

Copy link
Contributor

@kkraus14 kkraus14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @mdemoret-nv

@@ -81,6 +84,10 @@ cdef class DeviceBuffer:
if stream.c_is_default():
stream.c_synchronize()

# Save a reference to the MR and stream used for allocation
self.mr = get_current_device_resource()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to allow passing a MR to the constructor?

Copy link
Contributor

@shwina shwina Jan 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, I'd say let's not do that. I'd much rather we support multiple MRs with a context manager. That is, I prefer the first approach below to the second:

mr = MemoryResource(...)
with using_memory_resource(mr)
    dbuf_1 = DeviceBuffer(...)
    dbuf_2 = DeviceBuffer(...)

v/s

mr = MemoryResource(...)
dbuf_1 = DeviceBuffer(..., mr=mr)
dbuf_2 = DeviceBuffer(..., mr=mr)

There are arguments in favour of both though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The context manager approach also follows the "One -- and preferably only one -- obvious way to do it" Zen.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, I have left out passing the MR in the constructor. If the situation has changed, @kkraus14 or @shwina please let me know.

@kkraus14
Copy link
Contributor

kkraus14 commented Jan 7, 2021

Sounds good. Using __sizeof__ this way seems odd to me anyways and I can't think of a situation where a user would query the size of a buffer with sys.getsizeof(buf) instead of buf.size.

If I remember correctly, Dask will call __sizeof__() for tracking memory usage and is why we test the method. Would suggest calling the function directly in the test instead of via sys.getsizeof(...).

@harrism
Copy link
Member

harrism commented Feb 2, 2021

@mdemoret-nv your current PRs seem to have stalled. Today is code freeze for 0.18 so I'm going to push this to 0.19

@harrism harrism changed the base branch from branch-0.18 to branch-0.19 February 2, 2021 22:59
@mdemoret-nv
Copy link
Contributor Author

@mdemoret-nv your current PRs seem to have stalled. Today is code freeze for 0.18 so I'm going to push this to 0.19

Apologies, my Github notifications significantly increased in 0.18 and these PRs slipped through the cracks while I worked on cuml PRs. I think it would be better to get these PRs in a release cycle early anyways, just to ensure there are no unintended side effects.

I will wrap these up once 0.18 has shipped.

@mdemoret-nv
Copy link
Contributor Author

PR is ready for re-review/merging.

@kkraus14
Copy link
Contributor

Would like to let @shwina review again before merging

@kkraus14
Copy link
Contributor

rerun tests

@kkraus14
Copy link
Contributor

Going to merge this as the changes were minimal. Thanks @mdemoret-nv!

@kkraus14
Copy link
Contributor

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 230369d into rapidsai:branch-0.19 Feb 25, 2021
rapids-bot bot pushed a commit that referenced this pull request Mar 1, 2021
#662)

Depends on #661 

While working on PR #661, it looked like it was possible to remove the "owning" C++ memory resource adaptors in `memory_resource_adaptors.hpp`. This PR is a quick implementation of what that would look like to run through CI and get feedback.

The main driving factor of this PR is to eliminate the need for 2 layers of wrappers around every memory resource in the library. When adding new memory resources, C++ wrappers must be created in `memory_resource_adaptors.hpp` and Cython wrappers must be created in `memory_resource.pyx`, for any property/function that needs to be exposed at the python level. This removes the C++ wrappers in favor of using pythons reference counting for lifetime management.

A few notes:

1. `MemoryResource` was renamed `DeviceMemoryResource` to more closely match the C++ class names. Easily can be changed back
1. Upstream MR are kept alive by a base class `UpstreamResourceAdaptor` that stores a single property `upstream_mr`. Any MR that has an upstream, needs to inherit from this class.
1. Once the `UpstreamResourceAdaptor` was created, most of the work/changes were updating the Cython imports to use the C++ classes instead of the C++ wrappers.
1. This should make it easier to expose more methods/properties at the python layer in the future.

Would appreciate any feedback.

Authors:
  - Michael Demoret (@mdemoret-nv)

Approvers:
  - Christopher Harris (@cwharris)
  - Keith Kraus (@kkraus14)
  - Ashwin Srinath (@shwina)

URL: #662
rapids-bot bot pushed a commit that referenced this pull request Mar 4, 2021
This PR adds support for CuPy streams in `rmm_cupy_allocator`. It works by getting CuPy's current stream and passing that to the `DeviceBuffer` constructor.

There's also a fix for the casting of CuPy/Numba streams to `cudaStream_t`, it needs to be cast to `uintptr_t` first, without that the resulting pointer would be wrong and result in a segfault.

Depends on #661

Authors:
  - Peter Andreas Entschev (@pentschev)
  - Keith Kraus (@kkraus14)

Approvers:
  - Keith Kraus (@kkraus14)
  - @jakirkham
  - Mark Harris (@harrism)

URL: #654
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for review Ready for review by team improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Related to RMM Python API
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants