-
Notifications
You must be signed in to change notification settings - Fork 101
Support for allocating GPUs in Passthrough-Mode #183
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
/cc @klueska |
Signed-off-by: Varun Ramachandra Sekar <vsekar@nvidia.com>
Signed-off-by: Varun Ramachandra Sekar <vsekar@nvidia.com>
Signed-off-by: Varun Ramachandra Sekar <vsekar@nvidia.com>
0fc7174 to
0978db9
Compare
| type GpuDriver string | ||
|
|
||
| const ( | ||
| NvidiaDriver GpuDriver = "nvidia" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there another name for this driver that isn't nvidia?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope. The driver is simply called nvidia
| } | ||
|
|
||
| // Validate ensures that GpuDriverConfig has a valid set of values. | ||
| func (c *GpuDriverConfig) Validate() error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: Is c == nil valid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope. gpuConfig.Normalize() would always ensure GpuDriverConfig is set
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My question should probably have been: "Should we add an explict nil check and retunr an error if this is the case?"
| case VfioPciDriver: | ||
| break | ||
| default: | ||
| return fmt.Errorf("invalid driver specified in gpu driver configuration") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we include the invalid value here?
| if c.DriverConfig.Driver != NvidiaDriver { | ||
| return nil | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does adding a SupportsSharing() function to the GpuDriver type make this clearer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, is the expectation that we always call Validate() after Normalize() so that we check whether there is no sharing actually configured? Does it make sense to remove this check here entirely so that we only need to update logic in one place with regards to the driver type and sharing interaction?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to remove this check here entirely so that we only need to update logic in one place with regards to the driver type and sharing interaction?
Removing the check here won't work. There's 2 possibilities here:
- On NVIDIA driver, normalize sharing by initializing it if its nil, and then set config based on strategy.
- On VFIO-PCI driver, don't initialize it.
If we allow normalization of c.Sharing in case (2) by relaxing the check, then we're going to fail subsequent validation.
| } | ||
|
|
||
| // Initialize the vfio-pci driver manager. | ||
| vfioPciManager.Init() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we init this here? Where are other managers initialised?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tsManager/mpsManager don't have any "initializations". But checkpointManager is initialized underneath: L117-L131.
What's the concern here?
| case *configapi.GpuConfig: | ||
| return s.applySharingConfig(ctx, castConfig.Sharing, claim, results) | ||
| configState.GpuConfig = castConfig | ||
| err = s.applyGpuConfig(ctx, castConfig, claim, results, &configState) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not still return if err != nil in each of the branches? alternatively, does it make sense to have s.applyGpuConfig return (*DeviceConfigState, error) like the other apply functions and keep the other function unchanged?
| if config.Sharing != nil { | ||
| err := s.applySharingConfig(ctx, config.Sharing, claim, results, configState) | ||
| if err != nil { | ||
| return err | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: We unconditionally call applySharingConfig for mig devices. Since we now have the case where config.Sharing can be nil, could we either update applySharingConfig to ignore nil configs, or introduce a Disabled config so that we can set that for cases where we don't expect sharing to be allowed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For MIG devices, config.Sharing can never be nil right - MIGs would always support Sharing so Normalize would initialize that config.
| } | ||
|
|
||
| func (s *DeviceState) applySharingConfig(ctx context.Context, config configapi.Sharing, claim *resourceapi.ResourceClaim, results []*resourceapi.DeviceRequestAllocationResult) (*DeviceConfigState, error) { | ||
| func (s *DeviceState) applySharingConfig(ctx context.Context, config configapi.Sharing, claim *resourceapi.ResourceClaim, results []*resourceapi.DeviceRequestAllocationResult, configState *DeviceConfigState) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned, I think we should keep this signature the same as before. I would rather avoid passing an argument by reference for modification.
| return nil | ||
| } | ||
|
|
||
| func (s *DeviceState) applyGpuDriverConfig(ctx context.Context, config *configapi.GpuDriverConfig, results []*resourceapi.DeviceRequestAllocationResult, configState *DeviceConfigState) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| func (s *DeviceState) applyGpuDriverConfig(ctx context.Context, config *configapi.GpuDriverConfig, results []*resourceapi.DeviceRequestAllocationResult, configState *DeviceConfigState) error { | |
| func (s *DeviceState) applyGpuDriverConfig(ctx context.Context, config *configapi.GpuDriverConfig, results []*resourceapi.DeviceRequestAllocationResult) (*DeviceConfigState, error) { |
cmd/nvidia-dra-plugin/nvlib.go
Outdated
| } | ||
|
|
||
| func getPciAddressFromNvmlPciInfo(info nvml.PciInfo) string { | ||
| return fmt.Sprintf("%04x:%02x:%02x.0", info.Domain, info.Bus, info.Device) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be the same if its returning the address in the format 0000:0a:00.0. Wasn't aware of that implementation. I'll switch to that
cmd/nvidia-dra-plugin/nvlib.go
Outdated
| if ret != nvml.SUCCESS { | ||
| return nil, fmt.Errorf("error getting PCI info for device %d: %w", index, err) | ||
| } | ||
| pciAddress := getPciAddressFromNvmlPciInfo(pciInfo) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we replace this with device.GetBusID()?
cmd/nvidia-dra-plugin/vfio-device.go
Outdated
| } | ||
|
|
||
| func (vm *VfioPciManager) loadVfioPciModule() error { | ||
| cmd := exec.Command("modprobe", vm.vfioPciModule) //nolint:gosec |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we chroot to the host root mounted into the container instead of expecting /sys/ to be writable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@elezar Do you mean something like: exec.Command("nsenter", "--mount=/proc/1/ns/mnt", "--", "modprobe", vm.vfioPciModule)?
| } | ||
|
|
||
| func (vm *VfioPciManager) getIommuGroupForVfioPciDevice(pciAddress string) string { | ||
| iommuGroup, err := os.Readlink(filepath.Join(vm.pciDevicesRoot, pciAddress, "iommu_group")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have similar functionality in go-nvlib?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we have it tracked in NvidiaPciDevice here: https://github.com/NVIDIA/go-nvlib/blob/main/pkg/nvpci/nvpci.go. I'm guessing nvpci is the sysfs way of discovering GPUs. But nvlib doesn't have it.
Btw one of the implicit assumptions I havent captured here is that iommu needs to be enabled in the kernel for us to be able to do gpu-passthrough. In that case we would hit the err in L171 and return empty iommu group.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we reuse the NvidiaPciDevice from go-nvlib/pkg/nvpci? It seems as if it is set here https://github.com/NVIDIA/go-nvlib/blob/1482a942fb6d52a023cff85c2d76ed4127af661a/pkg/nvpci/nvpci.go#L292-L304.
| - name: sysfs | ||
| mountPath: /sys | ||
| readOnly: false | ||
| - name: dev-vfio | ||
| mountPath: /dev/vfio | ||
| readOnly: false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should these be optionally mounted instead?
| # Usage: ./bind_to_driver.sh <ssss:bb:dd.f> <driver> | ||
| # Bind the GPU specified by the PCI_ID=ssss:bb:dd.f to the given driver. | ||
|
|
||
| bind_to_driver() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not implement this in go?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Atleast for unbind, in case unbindLock is in use, we'd need the same process that acquires the unbindLock to be the one doing the driver unbind. I want to scope that unbindLock to just the task that's doing the unbind rather than the whole kubeletplugin binary.
(FYI unbindLock is in play when the NVIDIA Grid-vGPU driver is used on the node)
bind is a script just to be consistent with unbind.
Signed-off-by: Varun Ramachandra Sekar <vsekar@nvidia.com>
0978db9 to
49bda9f
Compare
Signed-off-by: Varun Ramachandra Sekar <vsekar@nvidia.com>
Signed-off-by: Varun Ramachandra Sekar <vsekar@nvidia.com>
Signed-off-by: Varun Ramachandra Sekar <vsekar@nvidia.com>
49bda9f to
aa3c2bf
Compare
| - hostPath: /usr/bin/nvidia-ctk | ||
| containerPath: /usr/bin/nvidia-ctk | ||
| - hostPath: /sys | ||
| containerPath: /sys |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: newline
| - name: dev-vfio | ||
| mountPath: /dev/vfio | ||
| readOnly: false | ||
| {{- end -}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: newline
| fi | ||
| } | ||
|
|
||
| bind_to_driver "$1" "$2" || exit 1 No newline at end of file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: newline.
| return 0 | ||
| } | ||
|
|
||
| unbind_from_driver "$1" || exit 1 No newline at end of file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: newline
|
Closing stale PR in favor of #668 |
This PR introduces a new DeviceClass
vfiopci.nvidia.comthat will allocate a full GPU in PassThrough-mode (PT) by binding the GPU to vfio-pci driver.The primary usecase for this new DeviceClass are Kata containers and KubeVirt VMs that require the gpu to be in PT-mode and made available to a pod which then would spin up a guest with the gpu.
Note: Regular pod workloads will not benefit from this DeviceClass and shouldn't try to use this.
As part of this change, I've introduced some (but not all) modifications to the kind cluster config that are needed for this DeviceClass to work. Host-level modifications needed:
Validated on a kind cluster with a Quattro P2000 GPU:
Open items: