Skip to content

Conversation

@ldematte
Copy link
Contributor

cuvs-java already contains a public GPUInfo class, but methods to retrieve the information and fill it are internal.
This PR exposes them through and interface, GPUInfoProvider. It also separates immutable data related to a GPU (which is kept in GPUInfo) from transient resources-related data and counters (at the moment, only the amount of free memory, which is kept in the new CuVSResourcesInfo).

The change let you query transient data at a later moment; to do this, we need to find the device ID associated with a CuVSResource object. The change to the C API exposes the raft function that does it.

@copy-pr-bot
Copy link

copy-pr-bot bot commented Aug 20, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ldematte
Copy link
Contributor Author

@mythrocks let me know if it's OK to keep changes to C and Java together, or if you want me to raise 2 separate PRs

Copy link
Contributor

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of nitpicks. But this is a good change, otherwise.

@mythrocks mythrocks added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Aug 21, 2025
@mythrocks
Copy link
Contributor

/ok to test d6ac665

@ldematte ldematte requested a review from mythrocks August 22, 2025 12:17
@mythrocks mythrocks changed the title [REVIEW][Java][C] Expose GPUInfo [Java][C] Expose GPUInfo Aug 25, 2025
@mythrocks
Copy link
Contributor

/ok to test ad2dc03

@mythrocks
Copy link
Contributor

Sorry for the late suggestion of the following, to address @cjnolet's concerns regarding frequent calls to cudaGetDeviceProperties().

cudaGetDeviceProperties() seems to be used only to query GPU compute capability and VRAM specs. I think it should be safe to treat these as invariant. Neither the device ID mappings nor the GPU compute/VRAM specs should change at runtime.

@ldematte: Would you be averse to changing Util.getAvailableGPUs() to return a cached result? On the current head of main, it might look like:

  // Lazy initialization for list of available GPUs.
  private static class AvailableGpuInitializer {

    // Available GPUs are initialized only once when first accessed.
    // This is assumed to be invariant for the lifetime of the program.
    static final List<GPUInfo> AVAILABLE_GPUS = availableGPUs();

    private static List<GPUInfo> availableGPUs() {
      try (var localArena = Arena.ofConfined()) {

        MemorySegment numGpus = localArena.allocate(C_INT);
        int returnValue = cudaGetDeviceCount(numGpus);
        checkCudaError(returnValue, "cudaGetDeviceCount");

        int numGpuCount = numGpus.get(C_INT, 0);
        List<GPUInfo> gpuInfoArr = new ArrayList<GPUInfo>();
        // Fill up with GPUInfos.
        // ...
        return gpuInfoArr;
      }
    }
  }

  /**
   * Gets all the available GPUs
   *
   * @return a list of {@link GPUInfo} objects with GPU details
   */
  private static List<GPUInfo> availableGPUs() {
    return AvailableGpuInitializer.AVAILABLE_GPUS;
  }

@mythrocks
Copy link
Contributor

Note: The caching makes the assumption that the application only has access to the GPUs that were available at application start.

I can think of cases where GPUs are made available at runtime. For instance, a GPU could be attached to the box via PCIe-over-Thunderbolt or something. (My home dev setup is this way.)

@cjnolet, @benfred: Permission to treat that sort of thing as unlikely/unsupported?

@cjnolet
Copy link
Member

cjnolet commented Aug 25, 2025

Note: The caching makes the assumption that the application only has access to the GPUs that were available at application start.

Ideally the caching would not be done by default at application start, but would be done lazily on the first call to get a property.

@cjnolet, @benfred: Permission to treat that sort of thing as unlikely/unsupported?

Very much unlikely / unsupported. This is not something we consider in RAPIDS at all, and not something we need to consider downstream.

@mythrocks
Copy link
Contributor

but would be done lazily on the first call to get a property.

Agreed. I'm wary of doing this in a static block, for fear of races between CUDA context init and the application's first CUDA call. The suggestion above will initialize lazily.

@mythrocks
Copy link
Contributor

As an aside, if @cjnolet or @benfred could review/approve the tiny C-side change in this PR, that'd be appreciated.

@achirkin
Copy link
Contributor

achirkin commented Aug 26, 2025

A small note on device properties and caching

Raft does provide a helper function to get the device properties: raft::resource::get_device_properties(const resources&).

Device properties struct is cached within raft::resources. A raft::resources object is device-bound: and remembers the current device the first time you call any CUDA-related function and assumes the current device is never changed. So the exotic use cases like adding a GPU while the program is running are covered as long as you create a new raft::resources for it (more info about the resources is in this spreadsheet).

I see you query the device properties in a context where raft::resources is not yet created, so it may be tricky to refactor the code to use it. But if it's reasonably doable, I'd recommend to try.

@ldematte
Copy link
Contributor Author

ldematte commented Aug 26, 2025

Would you be averse to changing Util.getAvailableGPUs() to return a cached result?

Sounds like a good idea to me!

but would be done lazily on the first call to get a property.

Agreed.

I'm wary of doing this in a static block, for fear of races between CUDA context init and the application's first CUDA call.

We can make this deterministic by carefully laying these out, but it would be fragile (e.g. moving a class or sorting fields in a class would influence the result). Not sure if I want to go down that route, even if it would be express better that these are immutable. But better be lazy.

Raft does provide a helper function to get the device properties: raft::resource::get_device_properties(const resources&). Device properties struct is cached within raft::resources. A raft::resources object is device-bound: and remembers the current device

That's even better. I think the best would be to expose this via a C API, and use the laziness/caching of raft. I'll go in that direction. @mythrocks wdyt?
Edit: the current code tries to get GPU info for all GPUs in the system; if we want to keep it that way (and I think we should) I'll go with @mythrocks suggestion for lazy initialization and caching.

@achirkin
Copy link
Contributor

achirkin commented Aug 26, 2025

I think you'd need to initialize the resources object for each GPU at some point anyway and so you could in theory create a list of resources in advance and iterate over them (and only ever access the GPUs via the corresponding resources objects). Some cuVS algorithms use these helper functions occasionally, so you'd spare some latency if you always use the same approach (to not get the same struct cached in two different places).
But I'm not familiar with the bigger picture and thus not sure if such refactoring is feasible here.

@ldematte
Copy link
Contributor Author

ldematte commented Aug 26, 2025

I think you'd need to initialize the resources object for each GPU at some point anyway and so you could in theory create a list of resources in advance and iterate over them (and only ever access the GPUs via the corresponding resources objects).
But I'm not familiar with the bigger picture and thus not sure if such refactoring is feasible here.

That's an interesting idea and I think it's worth keeping this in mind.
You are right, I think that when we will want to support multi-GPUs we will need to do that (and also bring in device_resources_manager possibly?)

I have a small change to the C API that exposes raft::resource::get_device_properties, but if you are OK with that I'm going to stash it for now and keep it for a follow-up, when it's time to tackle the multi-GPU support.

@mythrocks
Copy link
Contributor

/ok to test 10d38a1

@mythrocks
Copy link
Contributor

/merge

@rapids-bot rapids-bot bot merged commit 6e0f859 into rapidsai:branch-25.10 Aug 27, 2025
55 checks passed
@ldematte ldematte deleted the java/expose-gpu-info branch September 9, 2025 07:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

improvement Improves an existing functionality non-breaking Introduces a non-breaking change

Development

Successfully merging this pull request may close these issues.

4 participants