-
Notifications
You must be signed in to change notification settings - Fork 639
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for node labels to report GPU mode #768
Conversation
Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a general comment, could we update the commit message to indicate that we are reporting the mode and not specifying it.
Then, should MIG devices have unknown
reported? Should these not always be compute
? What is the intended use of this label?
internal/resource/nvml-device.go
Outdated
return resolvePCIAddressToMode(pciID) | ||
} | ||
|
||
func resolvePCIAddressToMode(addr string) (string, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a function that we should have in go-nvpci
for instead of reimplementing it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the implementation here to just read the class
based on the pci address. Maybe we should still push getting the class function to go-nvml.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The place for this to live would always be go-nvlib
. The package at go-nvml
is a wrapper for NVML specifically and we don't really add additional functionality there.
docs/gpu-feature-discovery/README.md
Outdated
@@ -209,6 +209,7 @@ their meaning: | |||
| nvidia.com/gpu.machine | String | Machine type | DGX-1 | | |||
| nvidia.com/gpu.memory | Integer | Memory of the GPU in Mb | 2048 | | |||
| nvidia.com/gpu.product | String | Model of the GPU | GeForce-GT-710 | | |||
| nvidia.com/gpu.mode | String | Display or Compute Mode of the GPU | compute | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we provide more information on what this value means?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Working on it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We still need to add a reference as to what this means. From the documentation it is also only applicable to specific device types. Do we want to link to the relevant documentation here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FIxed
|
Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
Fixed this
Based on current list of GPUs that support mode switch it will be always compute. But fixed this to just pull it from the actual device. |
Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @visheshtanksale.
There are some typos in the function name and I'm also not clear on what the behaviour should be for a MIG device.
internal/resource/nvml-mig-device.go
Outdated
@@ -132,3 +133,23 @@ func totalMemory(attr map[string]interface{}) (uint64, error) { | |||
return 0, fmt.Errorf("unsupported attribute type %v", t) | |||
} | |||
} | |||
|
|||
func (d nvmlMigDevice) GetPIEClass() (uint32, error) { | |||
info, retVal := d.MigDevice.GetPciInfo() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: Is PCI info valid for a MIG device? How does the busid differ from that of the parent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is the busID of the parent.
internal/resource/nvml-mig-device.go
Outdated
if retVal != nvml.SUCCESS { | ||
return 0, retVal | ||
} | ||
var bytes []byte | ||
for _, char := range info.BusId { | ||
if char == 0 { | ||
break | ||
} | ||
bytes = append(bytes, byte(char)) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of implemention this here this should be pulled into go-nvlib
if it is valid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to do this in a follow-up though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can decide where to put this behavior once we decide what behavior is expected for MIG devices.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because of the hardcoding of the PCI class for MIG devices this code is removed
docs/gpu-feature-discovery/README.md
Outdated
@@ -209,6 +209,7 @@ their meaning: | |||
| nvidia.com/gpu.machine | String | Machine type | DGX-1 | | |||
| nvidia.com/gpu.memory | Integer | Memory of the GPU in Mb | 2048 | | |||
| nvidia.com/gpu.product | String | Model of the GPU | GeForce-GT-710 | | |||
| nvidia.com/gpu.mode | String | Display or Compute Mode of the GPU | compute | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We still need to add a reference as to what this means. From the documentation it is also only applicable to specific device types. Do we want to link to the relevant documentation here?
} | ||
gpuMode := getModeForClasses(classes) | ||
labels := Labels{ | ||
"nvidia.com/gpu.mode": gpuMode, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: Since we also extract this label for a MIG device do we expect different labels per MIG profile?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Devices that support MIG do not support changing the GPU mode. They are always going to return compute PCI class. We can actually hard code the class of MIG devices to be compute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, let's rather do that. It would simplify the implementation quite a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the implementation
for _, d := range devices { | ||
class, err := d.GetPIEClass() | ||
if err != nil { | ||
return nil, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: Do we want to treat errors in getting the class as fatal? This will crash GFD and cause NO labels to be generated. Should we rather return unknown
as the label in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to keep this behavior consistent with other labeler, that is the reason I am returning the error instead instead of labeling it unknown
.
Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @visheshtanksale.
I have some minor comments, but these can be addressed in a follow-up.
} | ||
for _, class := range classes { | ||
if class != classes[0] { | ||
return "unknown" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a blocker for this PR, but we may want to log the content of classes
here as a warning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
@@ -204,3 +204,89 @@ func TestSharingLabeler(t *testing.T) { | |||
}) | |||
} | |||
} | |||
|
|||
func TestGPUModeLabeler(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding the tests!
@@ -51,6 +51,14 @@ func NewDeviceMock(migEnabled bool) *DeviceMock { | |||
IsMigEnabledFunc: func() (bool, error) { return migEnabled, nil }, | |||
IsMigCapableFunc: func() (bool, error) { return migEnabled, nil }, | |||
GetMigDevicesFunc: func() ([]resource.Device, error) { return nil, nil }, | |||
GetPCIClassFunc: func() (uint32, error) { return 0x030000, nil }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Should this return 0
by default since it's "unknown" / "undefined"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
tests/expected-output-mig-single.txt
Outdated
@@ -30,4 +30,5 @@ nvidia\.com\/gpu\.engines\.jpeg=[0-9]+ | |||
nvidia\.com\/gpu\.engines\.ofa=[0-9]+ | |||
nvidia\.com\/gpu\.slices\.gi=[0-9]+ | |||
nvidia\.com\/gpu\.slices\.ci=[0-9]+ | |||
nvidia\.com\/gpu\.mode=[unknown|compute|graphics] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this can only be compute
now, correct? Not a blocker.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
Signed-off-by: Vishesh Tanksale <vtanksale@nvidia.com>
Some GPUs support switching between graphics mode and compute mode.
The mode is switched by a utility called displaymodeselector
This change identifies the mode on the GPU and labels the node to help end user schedule workloads
The assumption is that all the GPUs on the node have the same mode, if that is not the case then the label
nvidia.com/gpu.mode
value isunknown