
[FEA] Relax restrictions for device_uvector destructor in multi-gpu code #1342

Closed
fkallen opened this issue Sep 15, 2023 · 5 comments · Fixed by #1370

@fkallen
Contributor

fkallen commented Sep 15, 2023

Is your feature request related to a problem? Please describe.

According to the RMM README, the following code contains undefined behavior, because the device that is current when the device_uvector destructor runs may differ from the device that was active when the device_uvector was constructed.

{
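    // streams[g] is assumed to be a CUDA stream that was created while device g was current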
    int numGpus = 8;
    std::vector<rmm::device_uvector<int>> uvectors;
    for(int g = 0; g < numGpus; g++){
        cudaSetDevice(g);
        uvectors.emplace_back(numGpus, streams[g]);
    }

    // ... use uvectors

    for(int g = 0; g < numGpus; g++){
        cudaSetDevice(g);
        cudaStreamSynchronize(streams[g]);
    }
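
    // leaving this scope runs every device_uvector destructor while device numGpus-1
    // is still current, so all but the last uvector are destroyed with the wrong device active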
}

The workaround is to release each uvector explicitly while its owning device is current.

{
    int numGpus = 8;
    std::vector<rmm::device_uvector<int>> uvectors;
    for(int g = 0; g < numGpus; g++){
        cudaSetDevice(g);
        uvectors.emplace_back(numGpus, streams[g]);
    }

    // ... use uvectors

    for(int g = 0; g < numGpus; g++){
        cudaSetDevice(g);
        cudaStreamSynchronize(streams[g]);
    }
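
    // release each uvector's memory while its owning device is current, so the
    // destructors that run at scope exit have nothing left to free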
    for(int g = 0; g < numGpus; g++){
        cudaSetDevice(g);
        uvectors[g].release();
    }
}

However, this may not be sufficient, because construction may throw (out-of-memory, for example), in which case the uvectors constructed so far are destroyed with the wrong device active. So one would need something like:

{
    int numGpus = 8;
    std::vector<rmm::device_uvector<int>> uvectors;
    for(int g = 0; g < numGpus; g++){
        cudaSetDevice(g);
        try{
            uvectors.emplace_back(numGpus, streams[g]);
        }catch(...){
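            // construction threw: release what was already allocated, each on its
            // own device, before rethrowing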
            for(int x = 0; x < g; x++){
                cudaSetDevice(x);
                uvectors[x].release();
            }
            throw;
        }
    }

    // ... use uvectors

    for(int g = 0; g < numGpus; g++){
        cudaSetDevice(g);
        cudaStreamSynchronize(streams[g]);
    }
    for(int g = 0; g < numGpus; g++){
        cudaSetDevice(g);
        uvectors[g].release();
    }
}

Describe the solution you'd like
rmm::device_uvector should automatically set the correct device for deallocation.
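
A minimal sketch of what this could look like, assuming an RAII guard that temporarily switches the active device; the names scoped_device and owning_buffer below are hypothetical and only illustrate the idea, they are not RMM's actual API:

#include <cstddef>

#include <cuda_runtime_api.h>

#include <rmm/cuda_stream_view.hpp>
#include <rmm/mr/device/device_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>

// Hypothetical RAII helper that switches to a target device and restores the
// caller's device on destruction.
class scoped_device {
public:
    explicit scoped_device(int target) {
        cudaGetDevice(&previous_);                                // remember the caller's device
        if (previous_ != target) { cudaSetDevice(target); }
    }
    ~scoped_device() {
        int current{};
        cudaGetDevice(&current);
        if (current != previous_) { cudaSetDevice(previous_); }  // restore it
    }
    scoped_device(scoped_device const&)            = delete;
    scoped_device& operator=(scoped_device const&) = delete;

private:
    int previous_{};
};

// Hypothetical owning buffer that records the device active at construction and
// switches back to it before deallocating, no matter which device is current
// when the destructor runs.
class owning_buffer {
public:
    owning_buffer(std::size_t bytes, rmm::cuda_stream_view stream)
        : bytes_{bytes}, stream_{stream}, mr_{rmm::mr::get_current_device_resource()}
    {
        cudaGetDevice(&device_);                                  // device active at construction
        data_ = mr_->allocate(bytes_, stream_);
    }

    ~owning_buffer()
    {
        scoped_device guard{device_};                             // deallocate on the owning device
        mr_->deallocate(data_, bytes_, stream_);
    }

private:
    void*                            data_{nullptr};
    std::size_t                      bytes_{};
    int                              device_{};
    rmm::cuda_stream_view            stream_{};
    rmm::mr::device_memory_resource* mr_{nullptr};
};

With such a guard in the container's destructor, the examples above would be well defined without the explicit release loops.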


fkallen added the "? - Needs Triage" (Need team to review and classify) and "feature request" (New feature or request) labels Sep 15, 2023
@wence-
Contributor

wence- commented Sep 18, 2023

For context, this clarification in the README was introduced as part of #1333. See in particular the discussion in #1333 (comment).

The example of a constructor throwing such that the destructor runs with a different device active is one we didn't explicitly discuss there.

@harrism
Member

harrism commented Oct 25, 2023

I've been looking into adding this feature (to device_buffer), but I haven't been able to come up with a way to test it. This is effectively turning undefined behavior into defined behavior, so I can't create a test that fails without the fix but passes with the fix.

I would love suggestions @fkallen @jrhemstad @wence-

@fkallen
Contributor Author

fkallen commented Oct 25, 2023

How about a resource adapter which checks the device id on allocation and deallocation?

#include <cassert>
#include <iostream>
#include <stdexcept>
#include <utility>
#include <vector>

#include <rmm/device_buffer.hpp>
#include <rmm/device_uvector.hpp>
#include <rmm/mr/device/cuda_memory_resource.hpp>
#include <rmm/mr/device/device_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>

class DeviceCheckResourceAdapter final : public rmm::mr::device_memory_resource {
public:
    DeviceCheckResourceAdapter(rmm::mr::device_memory_resource* upstream) : upstream_(upstream){
        cudaGetDevice(&deviceId);
    }

    DeviceCheckResourceAdapter()                            = default;
    ~DeviceCheckResourceAdapter() override                  = default;
    DeviceCheckResourceAdapter(DeviceCheckResourceAdapter const&) = default;
    DeviceCheckResourceAdapter(DeviceCheckResourceAdapter&&)      = default;
    DeviceCheckResourceAdapter& operator=(DeviceCheckResourceAdapter const&) = default;
    DeviceCheckResourceAdapter& operator=(DeviceCheckResourceAdapter&&) = default;

    bool supports_streams() const noexcept override
    {
        return upstream_->supports_streams();
    }

    bool supports_get_mem_info() const noexcept override
    {
        return upstream_->supports_get_mem_info();
    }

    device_memory_resource* get_upstream() const noexcept{
        return upstream_;
    }

private:
    void checkDeviceId() const{
        int currentDevice;
        cudaGetDevice(&currentDevice);
        if(deviceId != currentDevice){
            throw std::runtime_error("Unexpected device id");
        }
    }

    void* do_allocate(std::size_t bytes, rmm::cuda_stream_view stream) override{
        checkDeviceId();

        return upstream_->allocate(bytes, stream);
    }

    void do_deallocate(void* ptr, std::size_t bytes, rmm::cuda_stream_view stream) override{
        checkDeviceId();

        upstream_->deallocate(ptr, bytes, stream);
    }

    bool do_is_equal(rmm::mr::device_memory_resource const& other) const noexcept override{
        if (this == &other) { return true; }
        auto const* cast = dynamic_cast<DeviceCheckResourceAdapter const*>(&other);
        if (cast != nullptr) { return upstream_->is_equal(*cast->get_upstream()); }
        return upstream_->is_equal(other);
    }

    std::pair<std::size_t, std::size_t> do_get_mem_info(rmm::cuda_stream_view stream) const override {
        return upstream_->get_mem_info(stream);
    }

    int deviceId{};  // initialized so the defaulted constructors don't leave it indeterminate
    rmm::mr::device_memory_resource* upstream_{nullptr};
};




int main(){

    std::vector<rmm::mr::cuda_memory_resource> normalRes(2);
    std::vector<DeviceCheckResourceAdapter> deviceCheckRes;

    for(int i = 0; i < 2; i++){
        cudaSetDevice(i);
        deviceCheckRes.emplace_back(&normalRes[i]);
    }
    for(int i = 0; i < 2; i++){
        cudaSetDevice(i);
        rmm::mr::set_current_device_resource(&deviceCheckRes[i]);
    }

    {
        std::vector<rmm::device_buffer> vec;
        for(int i = 0; i < 2; i++){
            cudaSetDevice(i);
            vec.emplace_back(1024, cudaStreamPerThread);
        }

        // for(int i = 0; i < 2; i++){
        //     cudaSetDevice(i);
        //     auto toRelease = std::move(vec[i]);
        // }
    }

}

@wence-
Contributor

wence- commented Oct 25, 2023

I think @fkallen's suggestion is a good one. The UB occurs when we try to use a memory resource while the wrong device is active. So we can avoid actually carrying out UB by first checking whether what we are about to do would be undefined and raising an error if so. As long as the check itself doesn't do anything undefined, this is fine. And since the behaviour is undefined at the library level rather than the language level, the compiler cannot optimise the check away on the grounds that the following line would invoke UB if the check did not pass.

@harrism
Member

harrism commented Oct 25, 2023

@fkallen great idea!

rapids-bot bot pushed a commit that referenced this issue Nov 15, 2023
This changes `device_buffer` to store the active CUDA device ID on creation, and (possibly temporarily) set the active device to that ID before allocating or freeing memory. It also adds tests for `device_buffer` and the containers built on it (`device_uvector` and `device_scalar`) that ensure correct operation when the device is changed before doing things that alloc/dealloc memory for those containers.

This fixes #1342. HOWEVER, there is an important question yet to answer:

`rmm::device_vector` is just an alias for `thrust::device_vector`, which does not use `rmm::device_buffer` for storage. Users may therefore be surprised after this PR, because the multi-device semantics of RMM containers will differ from those of `thrust::device_vector` (and therefore `rmm::device_vector`).

Update: opinion is that it's probably OK to diverge from `device_vector`, and some think we should remove `rmm::device_vector`.

~While we discuss this I have set the DO NOT MERGE label.~

Authors:
  - Mark Harris (https://github.com/harrism)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)
  - Jake Hemstad (https://github.com/jrhemstad)

URL: #1370
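
A minimal sketch, assuming a machine with at least two GPUs, of the scenario the new tests cover (construct a container on one device, make another device current, then let the destructor run); this is an illustration only, not the actual test added in #1370:

#include <cuda_runtime_api.h>

#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>

int main()
{
    int deviceCount{};
    cudaGetDeviceCount(&deviceCount);
    if (deviceCount < 2) { return 0; }   // the scenario needs at least two devices

    cudaSetDevice(0);
    {
        // construct the buffer while device 0 is current
        rmm::device_buffer buf{1024, rmm::cuda_stream_view{cudaStreamPerThread}};

        // make a different device current before the destructor runs
        cudaSetDevice(1);
    }   // with this change, the buffer is freed on device 0 even though device 1 is current

    cudaDeviceSynchronize();
    return 0;
}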
github-project-automation bot moved this from Todo to Done in RMM Project Board Nov 15, 2023