Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(reporter-plugin): support report rdma topology #314

Merged
merged 3 commits into from
Nov 20, 2023

Conversation

fjding
Copy link
Contributor

@fjding fjding commented Oct 17, 2023

What type of PR is this?

Features

What this PR does / why we need it:

Support reporting devices (such as RDMA, GPU, etc.) as topology zones under NUMA, thereby supporting inter-RDMA affinity at switch level.

Which issue(s) this PR fixes:

Special notes for your reviewer:

@fjding fjding force-pushed the support_rdma_topology branch from f60bd2d to d42ebe2 Compare October 17, 2023 08:21
@CLAassistant
Copy link

CLAassistant commented Oct 17, 2023

CLA assistant check
All committers have signed the CLA.

@fjding fjding force-pushed the support_rdma_topology branch from d42ebe2 to 997b7ad Compare October 18, 2023 06:31
@caohe caohe added the enhancement New feature or request label Nov 1, 2023
@caohe caohe added this to the v0.4 milestone Nov 1, 2023
@caohe caohe added the workflow/need-review review: test succeeded, need to review label Nov 1, 2023
@@ -50,12 +52,15 @@ func (o *KubeletPluginOptions) AddFlags(fss *cliflag.NamedFlagSets) {
"the path of kubelet resource plugin")
fs.BoolVar(&o.EnableReportTopologyPolicy, "enable-report-topology-policy", o.EnableReportTopologyPolicy,
"whether to report topology policy")
fs.BoolVar(&o.EnableReportRDMATopology, "enable-report-rdma-topology", false, "enable report rdma topology, default false")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can use o.EnableReportRDMATopology as the default value instead of false?

pkg/util/cnr.go Outdated
@@ -42,6 +42,8 @@ const (
CNRFieldNameTopologyZone = "TopologyZone"
CNRFieldNameResources = "Resources"
CNRFieldNameTopologyPolicy = "TopologyPolicy"

ResourceRDMA = "vke.volcengine.com/rdma"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you make the resource name configurable? We can assign the default value to resource.katalyst.kubewharf.io, and then pass in a custom value.

@@ -315,6 +342,42 @@ func (p *topologyAdapterImpl) addNumaSocketChildrenZoneNodes(generator *util.Top
return nil
}

// addNumaSocketChildrenZoneNodes add the child nodes of socket or numa zone nodes to the generator, the child nodes are
// generated by generateZoneNode according to TopologyLevel, Type and Name in TopologyAwareAllocatableQuantityList
func (p *topologyAdapterImpl) addNICNumaChildrenZoneNodes(generator *util.TopologyZoneGenerator,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can make this function more generic by using a map configuration (key: device.ResourceName, value: ZoneType). If the user has set this map, the function can construct NUMA's child zones by iterating through allocatableResources.Devices. If device.ResourceName is present in the map configuration, the zone name will be the device ID.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have refactored this pr. Could you PTAL?

@fjding fjding force-pushed the support_rdma_topology branch from 997b7ad to ea4c7c5 Compare November 7, 2023 09:59
fjding and others added 2 commits November 16, 2023 00:57
Signed-off-by: fjding <dingfangjie@bytedance.com>
Signed-off-by: caohe <caohe9603@gmail.com>
@caohe caohe force-pushed the support_rdma_topology branch from ea4c7c5 to 9153bef Compare November 15, 2023 17:14
Copy link

codecov bot commented Nov 15, 2023

Codecov Report

Attention: 15 lines in your changes are missing coverage. Please review.

Comparison is base (8d61a49) 53.43% compared to head (7362121) 53.29%.
Report is 3 commits behind head on main.

Files Patch % Lines
...nager/fetcher/kubelet/topology/topology_adapter.go 59.37% 9 Missing and 4 partials ⚠️
...alyst-agent/app/options/reporter/kubelet_plugin.go 50.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #314      +/-   ##
==========================================
- Coverage   53.43%   53.29%   -0.14%     
==========================================
  Files         437      437              
  Lines       48155    48181      +26     
==========================================
- Hits        25732    25680      -52     
- Misses      19505    19586      +81     
+ Partials     2918     2915       -3     
Flag Coverage Δ
unittest 53.29% <66.66%> (-0.14%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@caohe caohe self-assigned this Nov 16, 2023
@@ -38,6 +39,7 @@ func NewKubeletPluginOptions() *KubeletPluginOptions {
pluginapi.ResourcePluginPath,
},
EnableReportTopologyPolicy: false,
ResourceNameToZoneNameMap: make(map[string]string),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DeviceZoneResourceNameToZoneTypeMap ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

pkg/util/cnr.go Outdated
@@ -373,3 +373,13 @@ func GenerateSocketZoneNode(socketID int) ZoneNode {
},
}
}

// GenerateDeviceZoneNode generates device zone node through device id, which must be unique
func GenerateDeviceZoneNode(deviceId, zoneName string) ZoneNode {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

zoneType is more properly?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

for _, deviceId := range device.DeviceIds {
deviceNode := util.GenerateDeviceZoneNode(deviceId, targetZoneName)
if _, ok := zoneAllocationsMap[deviceNode]; !ok {
zoneAllocationsMap[deviceNode] = []*nodev1alpha1.Allocation{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all this logic can implement in addContainerDevices, because the zoneNode should also need have allocatable and capacity when allocation has request

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@caohe caohe force-pushed the support_rdma_topology branch 2 times, most recently from af0d880 to 5c64218 Compare November 20, 2023 07:56
…ices

Signed-off-by: caohe <caohe9603@gmail.com>
@caohe caohe force-pushed the support_rdma_topology branch from 5c64218 to 7362121 Compare November 20, 2023 08:19
@waynepeking348 waynepeking348 merged commit 08dcb22 into kubewharf:main Nov 20, 2023
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request workflow/need-review review: test succeeded, need to review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants