[Java][C] Expose GPUInfo #1267

ldematte · 2025-08-20T13:52:15Z

cuvs-java already contains a public GPUInfo class, but the methods that retrieve the information and populate it are internal.
This PR exposes them through an interface, GPUInfoProvider. It also separates the immutable data related to a GPU (kept in GPUInfo) from transient, resource-related data and counters (at the moment only the amount of free memory), which are kept in the new CuVSResourcesInfo.

The change lets you query transient data at a later time; to do this, we need to find the device ID associated with a CuVSResources object. The change to the C API exposes the raft function that does it.
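
To illustrate the split between immutable and transient data, here is a minimal, GPU-free sketch. All type and method names in it (StaticGpuInfo, ResourceSnapshot, InfoProvider, FakeInfoProvider, simulateAllocation) are hypothetical stand-ins and not the actual cuvs-java signatures; the numbers are made up so the example can run anywhere.

```java
import java.util.List;

// Hypothetical names throughout: this is NOT the cuvs-java API, only the shape of the idea.

/** Immutable facts about a device: safe to read once and keep. */
record StaticGpuInfo(int deviceId, String name, long totalDeviceMemoryInBytes) {}

/** Transient data: only valid at the moment it was queried. */
record ResourceSnapshot(int deviceId, long freeDeviceMemoryInBytes) {}

/** Provider interface that keeps the two kinds of query apart. */
interface InfoProvider {
  List<StaticGpuInfo> availableGpus();

  ResourceSnapshot currentInfo(int deviceId);
}

/** Fake provider backed by fixed numbers, so the sketch runs without a GPU. */
final class FakeInfoProvider implements InfoProvider {
  private static final long TOTAL_BYTES = 16L << 30; // pretend 16 GiB device
  private long usedBytes = 0;

  /** Pretends that some device memory was allocated elsewhere. */
  void simulateAllocation(long bytes) {
    usedBytes += bytes;
  }

  @Override
  public List<StaticGpuInfo> availableGpus() {
    return List.of(new StaticGpuInfo(0, "fake-gpu", TOTAL_BYTES));
  }

  @Override
  public ResourceSnapshot currentInfo(int deviceId) {
    // Recomputed on every call: free memory changes over time.
    return new ResourceSnapshot(deviceId, TOTAL_BYTES - usedBytes);
  }
}

final class GpuInfoSketch {
  public static void main(String[] args) {
    FakeInfoProvider provider = new FakeInfoProvider();
    StaticGpuInfo gpu = provider.availableGpus().get(0); // read once
    provider.simulateAllocation(1L << 30);
    ResourceSnapshot now = provider.currentInfo(gpu.deviceId()); // read when needed
    System.out.println(gpu.name() + " free bytes: " + now.freeDeviceMemoryInBytes());
  }
}
```

The point of the shape is that callers can hold on to the immutable record indefinitely, while anything that changes over time has to be requested again through the provider.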

copy-pr-bot · 2025-08-20T13:52:19Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

ldematte · 2025-08-20T13:52:54Z

@mythrocks let me know if it's OK to keep changes to C and Java together, or if you want me to raise 2 separate PRs

java/cuvs-java/src/main/java/com/nvidia/cuvs/GPUInfo.java

java/cuvs-java/src/main/java/com/nvidia/cuvs/GPUInfoProvider.java

java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/common/Util.java

java/cuvs-java/src/test/java/com/nvidia/cuvs/GPUInfoIT.java

mythrocks

A couple of nitpicks. But this is a good change, otherwise.

java/cuvs-java/src/main/java/com/nvidia/cuvs/SynchronizedCuVSResources.java

mythrocks · 2025-08-21T18:25:08Z

/ok to test d6ac665

mythrocks · 2025-08-25T17:42:58Z

/ok to test ad2dc03

mythrocks · 2025-08-25T22:26:02Z

Sorry for the late suggestion of the following, to address @cjnolet's concerns regarding frequent calls to cudaGetDeviceProperties().

cudaGetDeviceProperties() seems to be used only to query GPU compute capability and VRAM specs. I think it should be safe to treat these as invariant. Neither the device ID mappings nor the GPU compute/VRAM specs should change at runtime.

@ldematte: Would you be averse to changing Util.getAvailableGPUs() to return a cached result? On the current head of main, it might look like:

  // Lazy initialization for list of available GPUs.
  private static class AvailableGpuInitializer {

    // Available GPUs are initialized only once when first accessed.
    // This is assumed to be invariant for the lifetime of the program.
    static final List<GPUInfo> AVAILABLE_GPUS = availableGPUs();

    private static List<GPUInfo> availableGPUs() {
      try (var localArena = Arena.ofConfined()) {

        MemorySegment numGpus = localArena.allocate(C_INT);
        int returnValue = cudaGetDeviceCount(numGpus);
        checkCudaError(returnValue, "cudaGetDeviceCount");

        int numGpuCount = numGpus.get(C_INT, 0);
        List<GPUInfo> gpuInfoArr = new ArrayList<>();
        // Fill up with GPUInfos.
        // ...
        return gpuInfoArr;
      }
    }
  }

  /**
   * Gets all the available GPUs
   *
   * @return a list of {@link GPUInfo} objects with GPU details
   */
  private static List<GPUInfo> availableGPUs() {
    return AvailableGpuInitializer.AVAILABLE_GPUS;
  }

mythrocks · 2025-08-25T22:32:16Z

Note: The caching makes the assumption that the application only has access to the GPUs that were available at application start.

I can think of cases where GPUs are made available at runtime. For instance, a GPU could be attached to the box via PCIe-over-Thunderbolt or something. (My home dev setup is this way.)

@cjnolet, @benfred: Permission to treat that sort of thing as unlikely/unsupported?

cjnolet · 2025-08-25T22:38:11Z

> Note: The caching makes the assumption that the application only has access to the GPUs that were available at application start.

Ideally the caching would not be done by default at application start, but would be done lazily on the first call to get a property.

> @cjnolet, @benfred: Permission to treat that sort of thing as unlikely/unsupported?

Very much unlikely / unsupported. This is not something we consider in RAPIDS at all, and not something we need to consider downstream.

mythrocks · 2025-08-25T22:56:52Z

> but would be done lazily on the first call to get a property.

Agreed. I'm wary of doing this in a static block, for fear of races between CUDA context init and the application's first CUDA call. The suggestion above will initialize lazily.
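
For reference, the laziness of the nested-holder pattern suggested above can be checked without CUDA. This is a minimal sketch with hypothetical names (LazyGpuList, enumerate, ENUMERATIONS); the enumeration is a stand-in that only counts how often it runs.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

final class LazyGpuList {
  // Counts how many times the stand-in enumeration ran.
  static final AtomicInteger ENUMERATIONS = new AtomicInteger();

  private LazyGpuList() {}

  // The JVM initializes Holder on the first access to Holder.GPUS,
  // not when LazyGpuList itself is loaded. Class initialization is
  // thread-safe, so no explicit locking is needed.
  private static final class Holder {
    static final List<String> GPUS = enumerate();
  }

  // Stand-in for the real device query (hypothetical; no CUDA involved).
  private static List<String> enumerate() {
    ENUMERATIONS.incrementAndGet();
    return List.of("gpu-0", "gpu-1");
  }

  /** Returns the cached list, triggering the enumeration on first use only. */
  static List<String> gpus() {
    return Holder.GPUS;
  }

  public static void main(String[] args) {
    System.out.println("enumerations before first call: " + ENUMERATIONS.get());
    gpus();
    gpus();
    System.out.println("enumerations after two calls: " + ENUMERATIONS.get());
  }
}
```

Run on its own, main reports 0 enumerations before the first call and 1 after two calls: loading the outer class does not touch the holder, and the second call reuses the cached list.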

mythrocks · 2025-08-25T22:59:10Z

As an aside, if @cjnolet or @benfred could review/approve the tiny C-side change in this PR, that'd be appreciated.

achirkin · 2025-08-26T06:53:50Z

A small note on device properties and caching

Raft does provide a helper function to get the device properties: raft::resource::get_device_properties(const resources&).

The device properties struct is cached within raft::resources. A raft::resources object is device-bound: it remembers the current device the first time you call any CUDA-related function and assumes the current device never changes. So exotic use cases like adding a GPU while the program is running are covered, as long as you create a new raft::resources for it (more info about the resources is in this spreadsheet).

I see you query the device properties in a context where raft::resources is not yet created, so it may be tricky to refactor the code to use it. But if it's reasonably doable, I'd recommend trying.
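
To make the caching behaviour described above concrete, here is a rough Java analogue. It is not raft code and not a binding to it: DeviceBoundResources, DeviceProps and queryDriver are hypothetical names, and the property values are invented. It only shows the idea of an object bound to one device that caches its properties on first use.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a device properties struct. */
record DeviceProps(int deviceId, int major, int minor, long totalMemoryBytes) {}

/** Hypothetical analogue of a device-bound resources object. */
final class DeviceBoundResources {
  // Counts stand-in driver queries, to make the caching observable.
  static final AtomicInteger QUERIES = new AtomicInteger();

  private final int deviceId; // fixed for the lifetime of the object
  private DeviceProps cached; // null until first requested

  DeviceBoundResources(int deviceId) {
    this.deviceId = deviceId;
  }

  /** Queries once per object, then always returns the cached value. */
  synchronized DeviceProps properties() {
    if (cached == null) {
      cached = queryDriver(deviceId);
    }
    return cached;
  }

  // Stand-in for the expensive native call; the values are invented.
  private static DeviceProps queryDriver(int deviceId) {
    QUERIES.incrementAndGet();
    return new DeviceProps(deviceId, 8, 0, 16L << 30);
  }

  public static void main(String[] args) {
    DeviceBoundResources gpu0 = new DeviceBoundResources(0);
    gpu0.properties();
    gpu0.properties(); // served from the cache
    // A device that shows up later simply gets its own resources object.
    DeviceBoundResources gpu1 = new DeviceBoundResources(1);
    gpu1.properties();
    System.out.println("driver queries: " + QUERIES.get());
  }
}
```

Because the cache lives in the per-device object rather than in a global list, a device that appears later is handled by creating a new object for it, which is the property highlighted in the comment above.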

ldematte · 2025-08-26T07:08:04Z

> Would you be averse to changing Util.getAvailableGPUs() to return a cached result?

Sounds like a good idea to me!

> but would be done lazily on the first call to get a property.

Agreed.

> I'm wary of doing this in a static block, for fear of races between CUDA context init and the application's first CUDA call.

We could make this deterministic by carefully laying these out, but it would be fragile (e.g. moving a class or sorting fields in a class would influence the result). I'm not sure I want to go down that route, even if it would better express that these are immutable. Better to be lazy.

> Raft does provide a helper function to get the device properties: raft::resource::get_device_properties(const resources&). Device properties struct is cached within raft::resources. A raft::resources object is device-bound: and remembers the current device

~~That's even better. I think the best would be to expose this via a C API, and use the laziness/caching of raft. I'll go in that direction. @mythrocks wdyt?~~
Edit: the current code tries to get GPU info for all GPUs in the system; if we want to keep it that way (and I think we should), I'll go with @mythrocks's suggestion for lazy initialization and caching.

achirkin · 2025-08-26T08:22:40Z

I think you'd need to initialize the resources object for each GPU at some point anyway, so you could in theory create a list of resources in advance and iterate over them (and only ever access the GPUs via the corresponding resources objects). Some cuVS algorithms use these helper functions occasionally, so you'd save some latency by always using the same approach (and avoid caching the same struct in two different places).
But I'm not familiar with the bigger picture and thus not sure if such refactoring is feasible here.

ldematte · 2025-08-26T08:30:04Z

> I think you'd need to initialize the resources object for each GPU at some point anyway and so you could in theory create a list of resources in advance and iterate over them (and only ever access the GPUs via the corresponding resources objects).
> But I'm not familiar with the bigger picture and thus not sure if such refactoring is feasible here.

That's an interesting idea and I think it's worth keeping this in mind.
You are right: when we want to support multiple GPUs, we will need to do that (and possibly bring in device_resources_manager as well?).

I have a small change to the C API that exposes raft::resource::get_device_properties, but if you are OK with that I'm going to stash it for now and keep it for a follow-up, when it's time to tackle the multi-GPU support.

mythrocks · 2025-08-27T01:28:19Z

/ok to test 10d38a1

mythrocks · 2025-08-27T06:56:46Z

/merge

ldematte added 4 commits August 19, 2025 11:55

C API changes: adding method to retrieve the ID of the device associa…

5769464

…ted with a resource

Introducing GPUInfoProvider with and a first (incomplete) implementat…

fea987e

…ion in GPUInfoProviderImpl.

GPUInfoProviderImpl implementation + IT tests

2bd9217

Renaming

1c784a1

ldematte requested review from a team as code owners August 20, 2025 13:52

github-project-automation bot added this to Vector Search, ML, & Data Mining Release Board Aug 20, 2025

github-project-automation bot moved this to Todo in Vector Search, ML, & Data Mining Release Board Aug 20, 2025

More renaming, exposing totalDeviceMemoryInBytes to getCurrentInfo fo…

e9f32b8

…r convenience

mythrocks reviewed Aug 20, 2025

java/cuvs-java/src/main/java/com/nvidia/cuvs/GPUInfo.java (outdated, resolved)

mythrocks reviewed Aug 20, 2025

java/cuvs-java/src/main/java/com/nvidia/cuvs/GPUInfoProvider.java (outdated, resolved)

mythrocks reviewed Aug 20, 2025

java/cuvs-java/src/main/java22/com/nvidia/cuvs/internal/common/Util.java (outdated, resolved)

mythrocks reviewed Aug 20, 2025

java/cuvs-java/src/test/java/com/nvidia/cuvs/GPUInfoIT.java (resolved)

mythrocks requested changes Aug 20, 2025

mythrocks assigned ldematte Aug 21, 2025

mythrocks added the improvement and non-breaking labels Aug 21, 2025

ldematte added 2 commits August 21, 2025 10:23

Merge remote-tracking branch 'upstream/branch-25.10' into java/expose…

bcfb9d7

…-gpu-info

Separate major/minor, add more GPUInfo stats, adjust IT test logging

02e5788

ldematte mentioned this pull request Aug 21, 2025

PoolingCuVSResourceManager with memory availability elastic/elasticsearch#133242

Merged

Moved deviceId to CuVSResources

d6ac665

mythrocks reviewed Aug 21, 2025

java/cuvs-java/src/main/java/com/nvidia/cuvs/SynchronizedCuVSResources.java (resolved)

ldematte requested a review from mythrocks August 22, 2025 12:17

Merge remote-tracking branch 'upstream/branch-25.10' into java/expose…

6913f1b

…-gpu-info

mythrocks changed the title ~~[REVIEW][Java][C] Expose GPUInfo~~ [Java][C] Expose GPUInfo Aug 25, 2025

Merge branch 'branch-25.10' into java/expose-gpu-info

ad2dc03

mythrocks mentioned this pull request Aug 25, 2025

Build and test with CUDA 13.0.0 #1273

Merged

mythrocks mentioned this pull request Aug 25, 2025

[WIP][JAVA] Fix cudaGetDeviceProperties symbol name to bind correctly for CUDA > 12 #1280

Closed

cjnolet approved these changes Aug 25, 2025

ldematte added 2 commits August 26, 2025 10:21

Review: cache GPUInfo

da0b79b

Merge branch 'java/expose-gpu-info' of github.com:ldematte/cuvs into …

49ab6d2

…java/expose-gpu-info

Merge branch 'branch-25.10' into java/expose-gpu-info

10d38a1

mythrocks approved these changes Aug 27, 2025

rapids-bot bot merged commit 6e0f859 into rapidsai:branch-25.10 Aug 27, 2025
55 checks passed

github-project-automation bot moved this from Todo to Done in Vector Search, ML, & Data Mining Release Board Aug 27, 2025

ldematte deleted the java/expose-gpu-info branch September 9, 2025 07:13
