Skip to content

Conversation

@ldematte
Copy link
Contributor

@copy-pr-bot
Copy link

copy-pr-bot bot commented Aug 25, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cjnolet cjnolet added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Aug 25, 2025
@cjnolet
Copy link
Member

cjnolet commented Aug 25, 2025

@mythrocks @ldematte @ChrisHegarty @chatman One thing I'd like to make sure we're considering here is that the cudaDeviceGetProperties is static and we've found that this blocks the entire device every time it's called, and can also lead to contention across devices when we end up enabling multi-GPU. Please refer to this early discussion on cuML for more info.

The solution we've used on RAFT is to call this only once and cache off the results. This is one of the reasons we provide this as an explicit resource inside the raft::resources (and cuvsResources). I would strongly caution that if we are going to be explicitly calling this inside the Java layer (and outside the raft / cuvs resource APIs), we should also be caching off the response per GPU id.

Please note that cumlHandle is the predecessor to the raft::device_resources API, which was later refactored to abstract it into a generalized container for different types of resource implementations and construciton factories.

@mythrocks
Copy link
Contributor

@cjnolet: Thanks for the heads-up. This should be easy enough for us to address: We can cache the results per device.

@mythrocks
Copy link
Contributor

@cjnolet: We can add caching as part of #1267, under the assumption that we don't support GPUs that might appear later on. (E.g. a Thunderbolt device was plugged in later, etc.)

Question: Similar to cudaGetDeviceProperties, is there a concern regarding cudaMemGetInfo? We use this to get free-memory info.

@mythrocks
Copy link
Contributor

For the record, the suggestion from this PR is being rolled into #1273.

@ldematte
Copy link
Contributor Author

Question: Similar to cudaGetDeviceProperties, is there a concern regarding cudaMemGetInfo? We use this to get free-memory info.

++ on this question

One thing I'd like to make sure we're considering here is that the cudaDeviceGetProperties is static

Yep; I think that the functions using cudaGetDeviceProperties (e.g. availableGpus) are meant to be called once anyway, so why not make sure they are called once. #1267 already goes in that direction, separating static information (in GPUInfo) from volatile information (in a new record).

I'll make sure to fully address this as part of #1267; btw, if we go down the route of using raft get_device_properties, that will solve the current issue too.

@ldematte
Copy link
Contributor Author

ldematte commented Sep 9, 2025

This served to showcase a workaround for #1273, and has been used in that PR, so I can close this (which was just meant to be a demo).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

improvement Improves an existing functionality non-breaking Introduces a non-breaking change

Development

Successfully merging this pull request may close these issues.

3 participants