-
Notifications
You must be signed in to change notification settings - Fork 26
Migrate to go nvml #13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
8298aa1 to
bcec1e9
Compare
elezar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a WIP. We denfinitely need some more work.
| "github.com/NVIDIA/gpu-monitoring-tools/bindings/go/nvml" | ||
| // TODO: We rename this import to reduce the changes required below. | ||
| // This can be removed once the link-specifics have been migrated into go-nvlib. | ||
| nvml "github.com/NVIDIA/go-gpuallocator/internal/links" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This only provides constants as it stands.
gpuallocator/device.go
Outdated
| device.Device | ||
| Index int | ||
| Links map[int][]P2PLink | ||
| // We cache certain values for the device. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to avoid embedding all these public symbols, but wanted to limit changes to the other code.
bcec1e9 to
35ca8b6
Compare
35ca8b6 to
48f9f9f
Compare
48f9f9f to
9c4b7d5
Compare
6d06cbe to
55c2315
Compare
55c2315 to
c950cb4
Compare
7e3db16 to
6450bbf
Compare
6450bbf to
e77fc49
Compare
| type Device struct { | ||
| *nvml.Device | ||
| device.Device | ||
| Index int | ||
| Links map[int][]P2PLink | ||
| // The previous binding implementation used to cache specific device properties. | ||
| // These should be considered deprecated and the functions associated with device.Device | ||
| // should be used instead. | ||
| cachedDeviceInfo | ||
| } | ||
|
|
||
| // The previous binding implementation used to cache specific device properties. | ||
| // We collect these into a separate type to allow for embedding and simpler migration from them. | ||
| type cachedDeviceInfo struct { | ||
| UUID string | ||
| PCI struct { | ||
| BusID string | ||
| } | ||
| CPUAffinity *uint | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have expected it to be like:
type Device struct {
nvlibDevice
Index int
Links map[int][]P2PLink
}
type nvlibDevice struct {
device.Device
// The previous binding implementation used to cache specific device properties.
// These should be considered deprecated and the functions associated with device.Device
// should be used instead.
UUID string
PCI struct {
BusID string
}
CPUAffinity *uint
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Will update. One thing I was just wondering was whether Index should also be in the nvlibDevice struct, but I'm going to leave it where it was previously for now.
gpuallocator/device.go
Outdated
| return nil, fmt.Errorf("failed to get device pci info: %v", ret) | ||
| } | ||
|
|
||
| linkedDevice := Device{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/linkedDevice/device
gpuallocator/device.go
Outdated
| device, err := nvml.NewDevice(uint(i)) | ||
| var devices DeviceList | ||
| err := o.devicelib.VisitDevices(func(i int, d device.Device) error { | ||
| linkedDevice, err := newDevice(i, d) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/linkedDevice/device
This change migrates from the gpu-monitoring-tools bindings for NVML to those abstracted by go-nvlib. The functions such as peer-to-peer interconnectivity not provided by go-nvlib are implemented in a new interal links package. The functions and types here can be considered for future migration to go-nvlib. Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
e77fc49 to
a73b9c4
Compare
I have created NVIDIA/go-nvlib#7 to include the constants and NVML functions that we require here.
I have also created https://gitlab.com/nvidia/kubernetes/device-plugin/-/merge_requests/328 to update the device plugin to use the new API for managing Linked devices.