Member

@elezar elezar commented Nov 2, 2023

I have created NVIDIA/go-nvlib#7 to include the constants and NVML functions that we require here.

I have also created https://gitlab.com/nvidia/kubernetes/device-plugin/-/merge_requests/328 to update the device plugin to use the new API for managing Linked devices.

@elezar elezar force-pushed the migrate-to-go-nvml branch from 8298aa1 to bcec1e9 Compare November 2, 2023 14:28
Member Author

@elezar elezar left a comment

This is a WIP. We definitely need some more work.

"github.com/NVIDIA/gpu-monitoring-tools/bindings/go/nvml"
// TODO: We rename this import to reduce the changes required below.
// This can be removed once the link-specifics have been migrated into go-nvlib.
nvml "github.com/NVIDIA/go-gpuallocator/internal/links"
Member Author

This only provides constants as it stands.

device.Device
Index int
Links map[int][]P2PLink
// We cache certain values for the device.
Member Author

I would like to avoid embedding all these public symbols, but wanted to limit changes to the other code.

@elezar elezar force-pushed the migrate-to-go-nvml branch 2 times, most recently from 6d06cbe to 55c2315 Compare November 15, 2023 10:25
@elezar elezar force-pushed the migrate-to-go-nvml branch 4 times, most recently from 7e3db16 to 6450bbf Compare November 16, 2023 15:55
@elezar elezar force-pushed the migrate-to-go-nvml branch from 6450bbf to e77fc49 Compare November 16, 2023 17:15
Comment on lines 17 to 33
type Device struct {
	*nvml.Device
	device.Device
	Index int
	Links map[int][]P2PLink
	// The previous binding implementation used to cache specific device properties.
	// These should be considered deprecated and the functions associated with device.Device
	// should be used instead.
	cachedDeviceInfo
}

// The previous binding implementation used to cache specific device properties.
// We collect these into a separate type to allow for embedding and simpler migration from them.
type cachedDeviceInfo struct {
	UUID string
	PCI  struct {
		BusID string
	}
	CPUAffinity *uint
}
Contributor

I would have expected it to be like:

type Device struct {
	nvlibDevice
	Index int
	Links map[int][]P2PLink
}

type nvlibDevice struct {
	device.Device
	// The previous binding implementation used to cache specific device properties.
	// These should be considered deprecated and the functions associated with device.Device
	// should be used instead.
	UUID string
	PCI  struct {
		BusID string
	}
	CPUAffinity *uint
}

Member Author

@elezar elezar Nov 17, 2023

OK. Will update. One thing I was just wondering was whether Index should also be in the nvlibDevice struct, but I'm going to leave it where it was previously for now.
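
For readers following the struct discussion, here is a minimal sketch of how the suggested layout behaves under Go's field promotion: callers that previously read `d.UUID` or `d.PCI.BusID` keep compiling even though those fields now live one embedding level deeper. Everything below is an assumption for illustration. `nvmlDevice` and `fakeDevice` are hypothetical stand-ins for go-nvlib's `device.Device` (whose real methods return an NVML return code rather than an `error`), `P2PLink` is reduced to a placeholder, and this `newDevice` signature is not the one in the PR.

```go
package main

import "fmt"

// nvmlDevice is a hypothetical stand-in for go-nvlib's device.Device.
// Only one method is modelled, and it returns an error for simplicity.
type nvmlDevice interface {
	GetUUID() (string, error)
}

// fakeDevice is a stub implementation used only for this sketch.
type fakeDevice struct{ uuid string }

func (f fakeDevice) GetUUID() (string, error) { return f.uuid, nil }

// nvlibDevice groups the wrapped device with the deprecated cached properties,
// following the layout suggested in the review comment.
type nvlibDevice struct {
	nvmlDevice
	UUID string
	PCI  struct {
		BusID string
	}
	CPUAffinity *uint
}

// P2PLink is a placeholder; the real type carries the peer device and link type.
type P2PLink struct{ Type int }

// Device keeps Index and Links at the top level and embeds everything else.
type Device struct {
	nvlibDevice
	Index int
	Links map[int][]P2PLink
}

// newDevice (hypothetical signature) fills the cached fields once at construction.
func newDevice(index int, d nvmlDevice, busID string) (*Device, error) {
	uuid, err := d.GetUUID()
	if err != nil {
		return nil, fmt.Errorf("failed to get device uuid: %w", err)
	}
	dev := Device{
		nvlibDevice: nvlibDevice{nvmlDevice: d, UUID: uuid},
		Index:       index,
		Links:       make(map[int][]P2PLink),
	}
	dev.PCI.BusID = busID
	return &dev, nil
}

func main() {
	d, err := newDevice(0, fakeDevice{uuid: "GPU-0000"}, "0000:01:00.0")
	if err != nil {
		panic(err)
	}
	// UUID and PCI are promoted from nvlibDevice; GetUUID is promoted through
	// two levels of embedding, so existing call sites need no changes.
	fmt.Println(d.Index, d.UUID, d.PCI.BusID)
	live, _ := d.GetUUID()
	fmt.Println(live == d.UUID)
}
```

Because promotion is resolved by depth, moving `Index` into `nvlibDevice` later would also be transparent to callers, as long as no shallower field with the same name exists.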

return nil, fmt.Errorf("failed to get device pci info: %v", ret)
}

linkedDevice := Device{
Contributor

s/linkedDevice/device

device, err := nvml.NewDevice(uint(i))
var devices DeviceList
err := o.devicelib.VisitDevices(func(i int, d device.Device) error {
linkedDevice, err := newDevice(i, d)
Contributor

s/linkedDevice/device
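
The visitor-based enumeration in the hunk above can be sketched as follows, using the `device` name the review asks for. This is an illustrative sketch only: `deviceLib` and `fakeLib` are hypothetical stand-ins for the go-nvlib device interface, a plain UUID string stands in for `device.Device`, and the `Device` type and `newDevice` helper are simplified placeholders rather than the PR's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// Device is a simplified placeholder for the PR's Device type.
type Device struct {
	Index int
	UUID  string
}

// DeviceList mirrors the slice type named in the diff.
type DeviceList []*Device

// deviceLib is a hypothetical stand-in for the go-nvlib device interface.
// Only the VisitDevices shape from the diff is modelled here.
type deviceLib interface {
	VisitDevices(func(i int, uuid string) error) error
}

// fakeLib is a stub that "enumerates" a fixed set of UUIDs.
type fakeLib struct{ uuids []string }

func (l fakeLib) VisitDevices(visit func(i int, uuid string) error) error {
	for i, u := range l.uuids {
		// The first error returned by the callback stops the enumeration.
		if err := visit(i, u); err != nil {
			return err
		}
	}
	return nil
}

// newDevice is a placeholder constructor that can fail.
func newDevice(i int, uuid string) (*Device, error) {
	if uuid == "" {
		return nil, errors.New("empty uuid")
	}
	return &Device{Index: i, UUID: uuid}, nil
}

// buildDeviceList collects one Device per visited entry.
func buildDeviceList(lib deviceLib) (DeviceList, error) {
	var devices DeviceList
	err := lib.VisitDevices(func(i int, uuid string) error {
		device, err := newDevice(i, uuid)
		if err != nil {
			return fmt.Errorf("failed to construct device %d: %w", i, err)
		}
		devices = append(devices, device)
		return nil
	})
	if err != nil {
		return nil, err
	}
	return devices, nil
}

func main() {
	devices, err := buildDeviceList(fakeLib{uuids: []string{"GPU-a", "GPU-b"}})
	if err != nil {
		panic(err)
	}
	for _, d := range devices {
		fmt.Println(d.Index, d.UUID)
	}
}
```

Compared with the index-based `nvml.NewDevice(uint(i))` loop it replaces, the callback form leaves iteration and device lookup to the library, so the caller only handles construction errors.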

This change migrates from the gpu-monitoring-tools bindings for
NVML to those abstracted by go-nvlib. Functions that go-nvlib does not
provide, such as those for peer-to-peer interconnectivity, are implemented
in a new internal links package.

The functions and types here can be considered for future migration to go-nvlib.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
@elezar elezar force-pushed the migrate-to-go-nvml branch from e77fc49 to a73b9c4 Compare November 17, 2023 13:15
@elezar elezar merged commit 68b0fdd into NVIDIA:main Nov 17, 2023
@elezar elezar mentioned this pull request Feb 27, 2024
4 participants