-
Notifications
You must be signed in to change notification settings - Fork 662
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for node labels to report GPU mode #768
Changes from 7 commits
b37dd9c
03f72f9
ec8a1ba
3991e16
e868871
2f2541e
4e2818b
ac2cb43
a2e6a7c
82c22b6
8f10948
7d0e252
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -22,6 +22,8 @@ import ( | |
"strconv" | ||
"strings" | ||
|
||
"github.com/NVIDIA/go-nvlib/pkg/nvpci" | ||
|
||
spec "github.com/NVIDIA/k8s-device-plugin/api/config/v1" | ||
"github.com/NVIDIA/k8s-device-plugin/internal/resource" | ||
) | ||
|
@@ -71,12 +73,18 @@ func NewDeviceLabeler(manager resource.Manager, config *spec.Config) (Labeler, e | |
return nil, fmt.Errorf("error creating resource labeler: %v", err) | ||
} | ||
|
||
gpuModeLabeler, err := newGPUModeLabeler(manager) | ||
if err != nil { | ||
return nil, fmt.Errorf("error creating resource labeler: %v", err) | ||
} | ||
|
||
l := Merge( | ||
machineTypeLabeler, | ||
versionLabeler, | ||
migCapabilityLabeler, | ||
sharingLabeler, | ||
resourceLabeler, | ||
gpuModeLabeler, | ||
) | ||
|
||
return l, nil | ||
|
@@ -193,3 +201,59 @@ func isMPSCapable(manager resource.Manager) (bool, error) { | |
} | ||
return true, nil | ||
} | ||
|
||
func newGPUModeLabeler(manager resource.Manager) (Labeler, error) { | ||
visheshtanksale marked this conversation as resolved.
Show resolved
Hide resolved
visheshtanksale marked this conversation as resolved.
Show resolved
Hide resolved
|
||
devices, err := manager.GetDevices() | ||
if err != nil { | ||
return nil, err | ||
} | ||
if len(devices) == 0 { | ||
// no devices, return empty labels | ||
return empty{}, nil | ||
visheshtanksale marked this conversation as resolved.
Show resolved
Hide resolved
|
||
} | ||
classes, err := getDeviceClasses(devices) | ||
if err != nil { | ||
return nil, err | ||
} | ||
gpuMode := getModeForClasses(classes) | ||
labels := Labels{ | ||
"nvidia.com/gpu.mode": gpuMode, | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Question: Since we also extract this label for a MIG device do we expect different labels per MIG profile? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Devices that support MIG do not support changing the GPU mode. They are always going to return compute PCI class. We can actually hard code the class of MIG devices to be compute. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, let's rather do that. It would simplify the implementation quite a bit. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Updated the implementation |
||
} | ||
return labels, nil | ||
} | ||
|
||
func getModeForClasses(classes []uint32) string { | ||
if len(classes) == 0 { | ||
return "unknown" | ||
} | ||
for _, class := range classes { | ||
if class != classes[0] { | ||
return "unknown" | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not a blocker for this PR, but we may want to log the content of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Added. |
||
} | ||
} | ||
switch classes[0] { | ||
case nvpci.PCIVgaControllerClass: | ||
return "graphics" | ||
case nvpci.PCI3dControllerClass: | ||
return "compute" | ||
default: | ||
return "unknown" | ||
} | ||
} | ||
|
||
func getDeviceClasses(devices []resource.Device) ([]uint32, error) { | ||
seenClasses := make(map[uint32]bool) | ||
for _, d := range devices { | ||
class, err := d.GetPIEClass() | ||
visheshtanksale marked this conversation as resolved.
Show resolved
Hide resolved
|
||
if err != nil { | ||
return nil, err | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Question: Do we want to treat errors in getting the class as fatal? This will crash GFD and cause NO labels to be generated. Should we rather return There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I wanted to keep this behavior consistent with other labeler, that is the reason I am returning the error instead instead of labeling it |
||
} | ||
seenClasses[class] = true | ||
} | ||
|
||
var classes []uint32 | ||
for class := range seenClasses { | ||
classes = append(classes, class) | ||
} | ||
return classes, nil | ||
} |
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -21,6 +21,7 @@ import ( | |
"strings" | ||
|
||
"github.com/NVIDIA/go-nvlib/pkg/nvlib/device" | ||
"github.com/NVIDIA/go-nvlib/pkg/nvpci" | ||
"github.com/NVIDIA/go-nvml/pkg/nvml" | ||
) | ||
|
||
|
@@ -132,3 +133,23 @@ func totalMemory(attr map[string]interface{}) (uint64, error) { | |
return 0, fmt.Errorf("unsupported attribute type %v", t) | ||
} | ||
} | ||
|
||
func (d nvmlMigDevice) GetPIEClass() (uint32, error) { | ||
info, retVal := d.MigDevice.GetPciInfo() | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Question: Is PCI info valid for a MIG device? How does the busid differ from that of the parent? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It is the busID of the parent. |
||
if retVal != nvml.SUCCESS { | ||
return 0, retVal | ||
} | ||
var bytes []byte | ||
for _, char := range info.BusId { | ||
if char == 0 { | ||
break | ||
} | ||
bytes = append(bytes, byte(char)) | ||
} | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Instead of implemention this here this should be pulled into There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Happy to do this in a follow-up though. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We can decide where to put this behavior once we decide what behavior is expected for MIG devices. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Because of the hardcoding of the PCI class for MIG devices this code is removed |
||
pciID := strings.ToLower(strings.TrimPrefix(string(bytes), "0000")) | ||
nvDevice, err := nvpci.New().GetGPUByPciBusID(pciID) | ||
if err != nil { | ||
return 0, err | ||
} | ||
return nvDevice.Class, nil | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -30,4 +30,5 @@ nvidia\.com\/gpu\.engines\.jpeg=[0-9]+ | |
nvidia\.com\/gpu\.engines\.ofa=[0-9]+ | ||
nvidia\.com\/gpu\.slices\.gi=[0-9]+ | ||
nvidia\.com\/gpu\.slices\.ci=[0-9]+ | ||
nvidia\.com\/gpu\.mode=[unknown|compute|graphics] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit: this can only be There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Fixed |
||
nvidia\.com\/mps\.capable=[true|false] |
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we provide more information on what this value means?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Working on it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We still need to add a reference as to what this means. From the documentation it is also only applicable to specific device types. Do we want to link to the relevant documentation here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FIxed