-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rework device sharing in volcano #2643
rework device sharing in volcano #2643
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please make CI happy first.
And squash your code.
@@ -166,47 +166,48 @@ func (gs *GPUDevices) MonitorStatus() string { | |||
} | |||
|
|||
func (gs *GPUDevices) AllocateToPod(kubeClient kubernetes.Interface, pod *v1.Pod) error { | |||
klog.V(4).Infoln("DeviceSharing:Into AllocateToPod", pod) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure whether the parameter pod
will be used in the log info.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, you are right, now i'm using pod.Name instead
Rework device-sharing mechanism to volcano
This rework so good, and can you make it a plugin to add a device? |
pkg/scheduler/api/node_info.go
Outdated
Others map[string]interface{} | ||
GPUDevices map[int]*GPUDevice | ||
Others map[string]interface{} | ||
SharedDevices map[string]SharedDevicePool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Others is used for store custom resource. We can reuse it.
pkg/scheduler/api/node_info.go
Outdated
@@ -153,7 +153,7 @@ func NewNodeInfo(node *v1.Node) *NodeInfo { | |||
nodeInfo.Allocatable = NewResource(node.Status.Allocatable).Add(nodeInfo.OversubscriptionResource) | |||
nodeInfo.Capability = NewResource(node.Status.Capacity).Add(nodeInfo.OversubscriptionResource) | |||
} | |||
nodeInfo.setNodeGPUInfo(node) | |||
nodeInfo.SharedDevices[GPUSharingDevice] = gpushare.NewGPUDevices(nodeInfo.Name, node) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest to use setNodeOthersResource
to encapsulation the SharedDevices[GPUSharingDevice] = gpushare.NewGPUDevices(nodeInfo.Name, node)
@@ -376,7 +344,7 @@ func (ni *NodeInfo) SetNode(node *v1.Node) { | |||
// setNode sets kubernetes node object to nodeInfo object without assertion | |||
func (ni *NodeInfo) setNode(node *v1.Node) { | |||
ni.setOversubscription(node) | |||
ni.setNodeGPUInfo(node) | |||
ni.SharedDevices[GPUSharingDevice] = gpushare.NewGPUDevices(ni.Name, node) | |||
ni.setRevocableZone(node) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
pkg/scheduler/api/node_info.go
Outdated
@@ -395,13 +363,13 @@ func (ni *NodeInfo) setNode(node *v1.Node) { | |||
ni.Idle.sub(ti.Resreq) // sub without assertion | |||
ni.Releasing.Add(ti.Resreq) | |||
ni.Used.Add(ti.Resreq) | |||
ni.AddGPUResource(ti.Pod) | |||
ni.SharedDevices[GPUSharingDevice].AddResource(ti.Pod) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggest to retain AddResource
and put the SharedDevices[GPUSharingDevice].AddResource(ti.Pod)
inside AddResource
.
@@ -0,0 +1,25 @@ | |||
package api |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add the copyright and change the name of this file to device_interface.go
GPUSharingDevice = "GpuShare" | ||
) | ||
|
||
type SharedDevicePool interface { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The interface covers the gpu number as well and npu in the future. It's better to change the name to from SharedDevicePool
to Devices
.
|
||
type SharedDevicePool interface { | ||
//following two functions used in node_info | ||
AddResource(pod *v1.Pod) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add function description for each interface.
SubResource(pod *v1.Pod) | ||
|
||
//following four functions used in predicate | ||
RequestInPod(pod *v1.Pod) bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggest to change RequestInPod
to HasDeviceRequest
or some other name that is easy to understand.
|
||
//following four functions used in predicate | ||
RequestInPod(pod *v1.Pod) bool | ||
FitInPod(pod *v1.Pod) (bool, error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggest to change interface from FitInPod
to Filter
or FilterNode
//following four functions used in predicate | ||
RequestInPod(pod *v1.Pod) bool | ||
FitInPod(pod *v1.Pod) (bool, error) | ||
AllocateToPod(kubeClient kubernetes.Interface, pod *v1.Pod) error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change AllocateToPod
and ReleaseFromPod
to Allocate
and Release
ReleaseFromPod(kubeClient kubernetes.Interface, pod *v1.Pod) error | ||
|
||
//used for debug and monitor | ||
MonitorStatus() string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change MonitorStatus
to GetStatus
or CheckStatus
I'd like to, but you still need to hack into scheduler.api even with this PR merged (ie. you must put your device-initialization logic in node_info.go in order for the scheduler to recognize your device). so it's hard for a plugin to do that |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: william-wang The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
We implement a common interface for shareable devices(GPU,NPU,FPGA,...) called sharedDevicePool, and use it to reimplement current gpu-share mechanism. The goal is to let device-sharing easy to implement, and better organised. If you wish to grant vc-scheduler the ability to share another device, all you need to do is to implement these methods in sharedDevicePool, and place your logic under pkg/scheduler/api/devices. No other hackings are needed.