
[FEA] Relax restrictions for device_uvector destructor in multi-gpu code #1342

Closed
fkallen opened this issue Sep 15, 2023 · 5 comments · Fixed by #1370

@fkallen
Contributor

fkallen commented Sep 15, 2023

Is your feature request related to a problem? Please describe.

According to the RMM README, the following code contains undefined behavior, because the device that is current when the device_uvector destructor runs may differ from the device that was active when the device_uvector was constructed.

{
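    // streams[g] is assumed to be a CUDA stream that was created while device g was current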
    int numGpus = 8;
    std::vector<rmm::device_uvector<int>> uvectors;
    for(int g = 0; g < numGpus; g++){
        cudaSetDevice(g);
        uvectors.emplace_back(numGpus, streams[g]);
    }

    // ... use uvectors

    for(int g = 0; g < numGpus; g++){
        cudaSetDevice(g);
        cudaStreamSynchronize(streams[g]);
    }
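
    // leaving this scope runs every device_uvector destructor while device numGpus-1
    // is still current, so all but the last uvector are destroyed with the wrong device active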
}

The workaround is to release each uvector explicitly while its owning device is current.

{
    int numGpus = 8;
    std::vector<rmm::device_uvector<int>> uvectors;
    for(int g = 0; g < numGpus; g++){
        cudaSetDevice(g);
        uvectors.emplace_back(numGpus, streams[g]);
    }

    // ... use uvectors

    for(int g = 0; g < numGpus; g++){
        cudaSetDevice(g);
        cudaStreamSynchronize(streams[g]);
    }
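
    // release each uvector's memory while its owning device is current, so the
    // destructors that run at scope exit have nothing left to free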
    for(int g = 0; g < numGpus; g++){
        cudaSetDevice(g);
        uvectors[g].release();
    }
}

However, this may not be sufficient, because construction may throw (out-of-memory, for example), in which case the uvectors constructed so far are destroyed with the wrong device active. So one would need something like:

{
    int numGpus = 8;
    std::vector<rmm::device_uvector<int>> uvectors;
    for(int g = 0; g < numGpus; g++){
        cudaSetDevice(g);
        try{
            uvectors.emplace_back(numGpus, streams[g]);
        }catch(...){
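            // construction threw: release what was already allocated, each on its
            // own device, before rethrowing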
            for(int x = 0; x < g; x++){
                cudaSetDevice(x);
                uvectors[x].release();
            }
            throw;
        }
    }

    // ... use uvectors

    for(int g = 0; g < numGpus; g++){
        cudaSetDevice(g);
        cudaStreamSynchronize(streams[g]);
    }
    for(int g = 0; g < numGpus; g++){
        cudaSetDevice(g);
        uvectors[g].release();
    }
}

Describe the solution you'd like
rmm::device_uvector should automatically set the correct device for deallocation.
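
A minimal sketch of what this could look like, assuming an RAII guard that temporarily switches the active device; the names scoped_device and owning_buffer below are hypothetical and only illustrate the idea, they are not RMM's actual API:

#include <cstddef>

#include <cuda_runtime_api.h>

#include <rmm/cuda_stream_view.hpp>
#include <rmm/mr/device/device_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>

// Hypothetical RAII helper that switches to a target device and restores the
// caller's device on destruction.
class scoped_device {
public:
    explicit scoped_device(int target) {
        cudaGetDevice(&previous_);                                // remember the caller's device
        if (previous_ != target) { cudaSetDevice(target); }
    }
    ~scoped_device() {
        int current{};
        cudaGetDevice(&current);
        if (current != previous_) { cudaSetDevice(previous_); }  // restore it
    }
    scoped_device(scoped_device const&)            = delete;
    scoped_device& operator=(scoped_device const&) = delete;

private:
    int previous_{};
};

// Hypothetical owning buffer that records the device active at construction and
// switches back to it before deallocating, no matter which device is current
// when the destructor runs.
class owning_buffer {
public:
    owning_buffer(std::size_t bytes, rmm::cuda_stream_view stream)
        : bytes_{bytes}, stream_{stream}, mr_{rmm::mr::get_current_device_resource()}
    {
        cudaGetDevice(&device_);                                  // device active at construction
        data_ = mr_->allocate(bytes_, stream_);
    }

    ~owning_buffer()
    {
        scoped_device guard{device_};                             // deallocate on the owning device
        mr_->deallocate(data_, bytes_, stream_);
    }

private:
    void*                            data_{nullptr};
    std::size_t                      bytes_{};
    int                              device_{};
    rmm::cuda_stream_view            stream_{};
    rmm::mr::device_memory_resource* mr_{nullptr};
};

With such a guard in the container's destructor, the examples above would be well defined without the explicit release loops.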


fkallen added the "? - Needs Triage" (Need team to review and classify) and "feature request" (New feature or request) labels Sep 15, 2023
@wence-
Contributor

wence- commented Sep 18, 2023

For context, this clarification in the README was introduced as part of #1333. See in particular the discussion in #1333 (comment).

The example of a constructor throwing such that the destructor runs with a different device active is one we didn't explicitly discuss there.

@harrism
Member

harrism commented Oct 25, 2023

I've been looking into adding this feature (to device_buffer), but I haven't been able to come up with a way to test it. This is effectively turning undefined behavior into defined behavior, so I can't create a test that fails without the fix but passes with the fix.

I would love suggestions @fkallen @jrhemstad @wence-

@fkallen
Contributor Author

fkallen commented Oct 25, 2023

How about a resource adapter which checks the device id on allocation and deallocation?

#include <cassert>
#include <iostream>
#include <stdexcept>
#include <utility>
#include <vector>

#include <rmm/device_buffer.hpp>
#include <rmm/device_uvector.hpp>
#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/device_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>

class DeviceCheckResourceAdapter final : public rmm::mr::device_memory_resource {
public:
    DeviceCheckResourceAdapter(rmm::mr::device_memory_resource* upstream) : upstream_(upstream){
        cudaGetDevice(&deviceId);
    }

    DeviceCheckResourceAdapter()                            = default;
    ~DeviceCheckResourceAdapter() override                  = default;
    DeviceCheckResourceAdapter(DeviceCheckResourceAdapter const&) = default;
    DeviceCheckResourceAdapter(DeviceCheckResourceAdapter&&)      = default;
    DeviceCheckResourceAdapter& operator=(DeviceCheckResourceAdapter const&) = default;
    DeviceCheckResourceAdapter& operator=(DeviceCheckResourceAdapter&&) = default;

    bool supports_streams() const noexcept override
    {
        return upstream_->supports_streams();
    }

    bool supports_get_mem_info() const noexcept override
    {
        return upstream_->supports_get_mem_info();
    }

    device_memory_resource* get_upstream() const noexcept{
        return upstream_;
    }

private:
    void checkDeviceId() const{
        int currentDevice;
        cudaGetDevice(&currentDevice);
        if(deviceId != currentDevice){
            throw std::runtime_error("Unexpected device id");
        }
    }

    void* do_allocate(std::size_t bytes, rmm::cuda_stream_view stream) override{
        checkDeviceId();

        return upstream_->allocate(bytes, stream);
    }

    void do_deallocate(void* ptr, std::size_t bytes, rmm::cuda_stream_view stream) override{
        checkDeviceId();

        upstream_->deallocate(ptr, bytes, stream);
    }

    bool do_is_equal(rmm::mr::device_memory_resource const& other) const noexcept override{
        if (this == &other) { return true; }
        auto const* cast = dynamic_cast<DeviceCheckResourceAdapter const*>(&other);
        if (cast != nullptr) { return upstream_->is_equal(*cast->get_upstream()); }
        return upstream_->is_equal(other);
    }

    std::pair<std::size_t, std::size_t> do_get_mem_info(rmm::cuda_stream_view stream) const override {
        return upstream_->get_mem_info(stream);
    }

    int deviceId{};  // initialized so the defaulted constructors don't leave it indeterminate
    rmm::mr::device_memory_resource* upstream_{nullptr};
};




int main(){

    std::vector<rmm::mr::cuda_memory_resource> normalRes(2);
    std::vector<DeviceCheckResourceAdapter> deviceCheckRes;

    for(int i = 0; i < 2; i++){
        cudaSetDevice(i);
        deviceCheckRes.emplace_back(&normalRes[i]);
    }
    for(int i = 0; i < 2; i++){
        cudaSetDevice(i);
        rmm::mr::set_current_device_resource(&deviceCheckRes[i]);
    }

    {
        std::vector<rmm::device_buffer> vec;
        for(int i = 0; i < 2; i++){
            cudaSetDevice(i);
            vec.emplace_back(1024, cudaStreamPerThread);
        }

        // for(int i = 0; i < 2; i++){
        //     cudaSetDevice(i);
        //     auto toRelease = std::move(vec[i]);
        // }
    }

}

@wence-
Contributor

wence- commented Oct 25, 2023

I think @fkallen's suggestion is a good one. The UB occurs when we try to use a memory resource while the wrong device is active. So we can avoid actually carrying out UB by first checking whether what we are about to do would be undefined and raising an error if so. As long as the check itself doesn't do anything undefined, this is fine. And since the behaviour is undefined at the library level rather than the language level, the compiler cannot optimise the check away on the grounds that the following line would invoke UB if the check did not pass.

@harrism
Member

harrism commented Oct 25, 2023

@fkallen great idea!

rapids-bot bot pushed a commit that referenced this issue Nov 15, 2023
This changes `device_buffer` to store the active CUDA device ID on creation, and (possibly temporarily) set the active device to that ID before allocating or freeing memory. It also adds tests for `device_buffer` and the containers built on it (`device_uvector` and `device_scalar`) that ensure correct operation when the device is changed before doing things that alloc/dealloc memory for those containers.

This fixes #1342. HOWEVER, there is an important question yet to answer:

`rmm::device_vector` is just an alias for `thrust::device_vector`, which does not use `rmm::device_buffer` for storage. Users may therefore be surprised after this PR, because the multi-device semantics of RMM containers will differ from those of `thrust::device_vector` (and therefore `rmm::device_vector`).

Update: opinion is that it's probably OK to diverge from `device_vector`, and some think we should remove `rmm::device_vector`.

~While we discuss this I have set the DO NOT MERGE label.~

Authors:
  - Mark Harris (https://github.com/harrism)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)
  - Jake Hemstad (https://github.com/jrhemstad)

URL: #1370
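
A minimal sketch, assuming a machine with at least two GPUs, of the scenario the new tests cover (construct a container on one device, make another device current, then let the destructor run); this is an illustration only, not the actual test added in #1370:

#include <cuda_runtime_api.h>

#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>

int main()
{
    int deviceCount{};
    cudaGetDeviceCount(&deviceCount);
    if (deviceCount < 2) { return 0; }   // the scenario needs at least two devices

    cudaSetDevice(0);
    {
        // construct the buffer while device 0 is current
        rmm::device_buffer buf{1024, rmm::cuda_stream_view{cudaStreamPerThread}};

        // make a different device current before the destructor runs
        cudaSetDevice(1);
    }   // with this change, the buffer is freed on device 0 even though device 1 is current

    cudaDeviceSynchronize();
    return 0;
}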
github-project-automation bot moved this from Todo to Done in RMM Project Board Nov 15, 2023