@elezar elezar commented Jan 9, 2024

As reported in NVIDIA/go-nvlib#16, the construction of the link information for devices fails because the Device passed in is not an nvmlDevice.

These changes fix the implementation by building on the changes from NVIDIA/go-nvlib#17 and by keeping the list of NVML devices separate from the linked devices.
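
To illustrate the idea of keeping the raw device list separate from the linked devices, here is a minimal sketch. It does not use the real go-nvlib or NVML APIs: `handle`, `linkTo`, `Device`, and `buildDevices` are hypothetical names, and the link type string is made up for the example. The point is only that link queries run against the raw handles, while the wrapper type is built from the results.

```go
package main

import "fmt"

// handle is a hypothetical stand-in for a raw NVML device handle.
// peers simulates what a driver-level link query would report.
type handle struct {
	index int
	peers map[int]string // peer index -> link type (illustrative values)
}

// linkTo simulates a link query that is only valid on a raw handle.
func (h handle) linkTo(other handle) (string, bool) {
	t, ok := h.peers[other.index]
	return t, ok
}

// Device is a hypothetical wrapper that carries the derived link information.
type Device struct {
	Index int
	Links map[int]string
}

// buildDevices keeps the raw handles in their own slice and derives the
// linked devices from it, so a wrapper is never passed to a link query.
func buildDevices(handles []handle) []*Device {
	devices := make([]*Device, 0, len(handles))
	for _, h := range handles {
		d := &Device{Index: h.index, Links: map[int]string{}}
		for _, other := range handles {
			if other.index == h.index {
				continue // a device is not linked to itself
			}
			if t, ok := h.linkTo(other); ok {
				d.Links[other.index] = t
			}
		}
		devices = append(devices, d)
	}
	return devices
}

func main() {
	handles := []handle{
		{index: 0, peers: map[int]string{1: "NVLINK"}},
		{index: 1, peers: map[int]string{0: "NVLINK"}},
		{index: 2, peers: map[int]string{}},
	}
	for _, d := range buildDevices(handles) {
		fmt.Printf("GPU %d: %d link(s)\n", d.Index, len(d.Links))
	}
}
```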

A simple example is also added. It was validated on two systems with multiple GPUs after verifying the reported behaviour.

elezar added 2 commits January 9, 2024 15:02
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
elezar added 2 commits January 9, 2024 15:09
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change fixes the construction of the linked devices.

First, the nvmlDevices are used directly when determining link
properties instead of relying on the local Device type, so the
underlying NVML device handle is available for the link queries.

In addition, an ERROR_INVALID_ARGUMENT returned when querying link
state is treated as a non-fatal error to handle cases where older
drivers are used. This also aligns with the former implementation in
gpu-monitoring-tools.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
@elezar elezar merged commit 02af3d8 into NVIDIA:main Jan 11, 2024
@elezar elezar deleted the fix-devices branch January 11, 2024 09:57