feat(orm): support outofband-resource-manager #406

Merged

Collaborator

@WangZzzhe WangZzzhe commented Dec 14, 2023

What type of PR is this?

Features

What this PR does / why we need it:

ORM (outofband-resource-manager) asynchronously allocates resources to pods of different QoS levels through a bypass path, without requiring intrusive changes to kubelet.
The current version supports the shared_cores and reclaimed_cores pools.

Which issue(s) this PR fixes:

#430

@waynepeking348 waynepeking348 added the enhancement New feature or request label Dec 14, 2023
@WangZzzhe WangZzzhe force-pushed the dev/outofband-resource-manager branch from e371bb1 to e79e43a Compare December 14, 2023 03:19

codecov bot commented Dec 14, 2023

Codecov Report

Attention: 512 lines in your changes are missing coverage. Please review.

Comparison is base (f0f2218) 53.41% compared to head (2e93922) 54.41%.
Report is 14 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| pkg/agent/resourcemanager/outofband/manager.go | 56.62% | 134 Missing and 33 partials ⚠️ |
| ...g/agent/resourcemanager/outofband/pluginhandler.go | 7.40% | 99 Missing and 1 partial ⚠️ |
| ...manager/outofband/endpoint/resource_plugin_stub.go | 36.61% | 84 Missing and 6 partials ⚠️ |
| pkg/util/cgroup/manager/fake_manager.go | 11.11% | 32 Missing ⚠️ |
| ...kg/agent/resourcemanager/outofband/pod_resource.go | 77.44% | 25 Missing and 5 partials ⚠️ |
| ...ent/resourcemanager/outofband/endpoint/endpoint.go | 64.86% | 21 Missing and 5 partials ⚠️ |
| ...ent/resourcemanager/outofband/executor/executor.go | 67.14% | 21 Missing and 2 partials ⚠️ |
| ...t/resourcemanager/outofband/metamanager/manager.go | 84.84% | 14 Missing and 1 partial ⚠️ |
| cmd/katalyst-agent/app/options/orm/orm_base.go | 50.00% | 12 Missing ⚠️ |
| cmd/katalyst-agent/app/agent/orm.go | 0.00% | 10 Missing ⚠️ |

... and 2 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #406      +/-   ##
==========================================
+ Coverage   53.41%   54.41%   +0.99%     
==========================================
  Files         448      484      +36     
  Lines       50138    53159    +3021     
==========================================
+ Hits        26782    28924    +2142     
- Misses      20298    21053     +755     
- Partials     3058     3182     +124     
Flag Coverage Δ
unittest 54.41% <54.88%> (+0.99%) ⬆️


@WangZzzhe WangZzzhe force-pushed the dev/outofband-resource-manager branch from e79e43a to c983a3b Compare December 14, 2023 03:56
"github.com/kubewharf/katalyst-core/pkg/agent/resourcemanager/outofband/endpoint"
)

func TestCheckpoint(t *testing.T) {
Collaborator

t.Parallel()

Collaborator Author

done.

)

func InitORM(agentCtx *GenericContext, conf *config.Configuration, _ interface{}, _ string) (bool, Component, error) {
m, err := outofband.NewManager(conf.PluginRegistrationDir+"/kubelet.sock", agentCtx.EmitterPool.GetDefaultMetricsEmitter(), agentCtx.MetaServer, conf)
Collaborator

Will this socket conflict with existing ones, such as those used by QRM, the agent, or sysadvisor?

Collaborator Author

No. "/var/lib/katalyst/plugin-socks/kubelet.sock" is currently used only by ORM. The other sockets are:
cpuPlugin: qrm_cpu_plugin_dynamic.sock
memoryPlugin: qrm_memory_plugin_dynamic.sock
headroom reporter: headroom-reporter-plugin.sock
kubelet QRM: /var/lib/kubelet
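
The conflict question above can be checked mechanically. The following is a minimal sketch, not code from this PR: `socketPath` and `findDuplicate` are hypothetical helpers, and placing the three plugin sockets in the same directory as `kubelet.sock` is an assumption made only for illustration. It also uses `filepath.Join` instead of the `+"/kubelet.sock"` concatenation seen in `InitORM`, which avoids doubled separators when the directory has a trailing slash.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// socketPath joins a registration directory and a socket file name.
// filepath.Join removes duplicate or trailing separators.
func socketPath(registrationDir, name string) string {
	return filepath.Join(registrationDir, name)
}

// findDuplicate returns the first path that appears more than once, if any.
func findDuplicate(paths []string) (string, bool) {
	seen := make(map[string]struct{}, len(paths))
	for _, p := range paths {
		if _, ok := seen[p]; ok {
			return p, true
		}
		seen[p] = struct{}{}
	}
	return "", false
}

func main() {
	// Directory quoted in the reply above; the trailing slash is deliberate.
	dir := "/var/lib/katalyst/plugin-socks/"

	// Assumption: the plugin sockets live in the same directory as the ORM socket.
	paths := []string{
		socketPath(dir, "kubelet.sock"),
		socketPath(dir, "qrm_cpu_plugin_dynamic.sock"),
		socketPath(dir, "qrm_memory_plugin_dynamic.sock"),
		socketPath(dir, "headroom-reporter-plugin.sock"),
	}

	if dup, ok := findDuplicate(paths); ok {
		fmt.Println("conflict:", dup)
		return
	}
	fmt.Println("no conflicts among", len(paths), "sockets")
}
```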


fs.DurationVar(&o.ORMRconcilePeriod, "orm-reconcile-period",
o.ORMRconcilePeriod, "orm resource reconcile period")
fs.Var(cliflag.NewMapStringString(&o.ORMResourceNamesMap), "orm-resource-names-map", "A set of ResourceName=ResourceQuantity (e.g. best-effort-cpu=cpu,best-effort-memory=memory,...) pairs that map resource name \"best-effort-cpu\" to resource name \"cpu\" during QoS Resource Manager allocation period.")
Collaborator

Please add more comments for this flag explaining why we need it, since it looks rather unusual from an upstream point of view.

Collaborator Author

done.

@waynepeking348 waynepeking348 added the workflow/need-review review: test succeeded, need to review label Dec 18, 2023
@waynepeking348
Collaborator

LGTM from me for now; you may still need to resolve the conflict and then get approval from the code owners.

@waynepeking348 waynepeking348 added the workflow/merge-ready merge-ready: code is ready and can be merged label Dec 21, 2023
@WangZzzhe WangZzzhe force-pushed the dev/outofband-resource-manager branch from a26b792 to ca29648 Compare December 21, 2023 12:17
@waynepeking348
Collaborator

@luomingmeng @csfldf

csfldf
csfldf previously approved these changes Jan 3, 2024
waynepeking348
waynepeking348 previously approved these changes Jan 3, 2024

fs.DurationVar(&o.ORMRconcilePeriod, "orm-reconcile-period",
o.ORMRconcilePeriod, "orm resource reconcile period")
fs.Var(cliflag.NewMapStringString(&o.ORMResourceNamesMap), "orm-resource-names-map",
Collaborator

why not use StringToStringVar directly?
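
For readers unfamiliar with map-valued flags: a later revision in this thread registers the flag with `fs.StringToStringVar`. The sketch below is not that code. It is a standard-library-only illustration of the same `key=value,key=value` parsing, using a hypothetical `mapFlag` type and a hypothetical `parseResourceNames` helper that registers just this one flag.

```go
package main

import (
	"flag"
	"fmt"
	"sort"
	"strings"
)

// mapFlag is a hypothetical flag.Value that parses "k1=v1,k2=v2" into a map.
type mapFlag map[string]string

// String renders the map as sorted key=value pairs.
func (m mapFlag) String() string {
	pairs := make([]string, 0, len(m))
	for k, v := range m {
		pairs = append(pairs, k+"="+v)
	}
	sort.Strings(pairs)
	return strings.Join(pairs, ",")
}

// Set parses one flag value; it may be called several times if the flag repeats.
func (m mapFlag) Set(s string) error {
	for _, pair := range strings.Split(s, ",") {
		k, v, ok := strings.Cut(pair, "=")
		k, v = strings.TrimSpace(k), strings.TrimSpace(v)
		if !ok || k == "" {
			return fmt.Errorf("invalid pair %q, want key=value", pair)
		}
		m[k] = v
	}
	return nil
}

// parseResourceNames parses args with a flag set that knows only the
// orm-resource-names-map flag (flag name taken from the snippet under review).
func parseResourceNames(args []string) (map[string]string, error) {
	names := mapFlag{}
	fs := flag.NewFlagSet("orm-sketch", flag.ContinueOnError)
	fs.Var(names, "orm-resource-names-map", "comma-separated key=value resource name pairs")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	return names, nil
}

func main() {
	names, err := parseResourceNames([]string{
		"--orm-resource-names-map=best-effort-cpu=cpu,best-effort-memory=memory",
	})
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(mapFlag(names).String())
}
```

With pflag, the custom type becomes unnecessary because the library parses the pairs itself, which is the point of the reviewer's question.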

return containerType == pluginapi.ContainerType_INIT
}

func isPodKatalystQoSLevelSystemCores(pod *v1.Pod) bool {
Collaborator

why not use qos conf instead of annotation directly?

@@ -306,3 +306,7 @@ func MultiplyQuantity(quantity resource.Quantity, y float64) resource.Quantity {
value = int64(float64(value) * y)
return *resource.NewQuantity(value, quantity.Format)
}

func ParseQuantityToFloat64(quantity resource.Quantity) float64 {
Collaborator

why not use AsApproximateFloat64 directly?

Collaborator Author

> why not use AsApproximateFloat64 directly?

All comments have been resolved.


// errUnsupportedVersion is the error raised when the resource plugin uses an API version not
// supported by the Kubelet registry
errUnsupportedVersion = "requested API version %q is not supported by kubelet. Supported version is %q"
Member

by ORM?

MetricAddPodTimeout = "ORM_add_pod_timeout"
MetricDeletePodTImeout = "ORM_delete_pod_timeout"

MainContainerNameAnnotationKey = "kubernetes.io/main-container-name"
Member

@caohe caohe Jan 3, 2024

Do we need to support the main container annotation in ORM?

Collaborator Author

> Do we need to support the main container annotation for ORM?

Not necessarily. The first container in the container list is treated as the main container by default, but we can keep the annotation as a feature. What do you think?

Member

In vanilla K8s, all containers are treated as main containers. If we keep that behavior, will there be any problems in the colocation scenario?

Member

@caohe caohe Jan 3, 2024

To remain consistent with QRM, we can retain this usage. In the long term, when the K8s version is higher, we can use the native sidecar container feature provided by K8s. https://kubernetes.io/blog/2023/08/25/native-sidecar-containers/
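
The selection rule discussed in this thread (use the annotation if present, otherwise default to the first container) can be sketched as follows. This is an illustration only: the `pod` struct is a hypothetical stand-in for `v1.Pod`, `mainContainerName` is not a function from this PR, and falling back to the first container when the annotation names an unknown container is an assumption.

```go
package main

import "fmt"

// Annotation key taken from the constant under review.
const mainContainerNameAnnotationKey = "kubernetes.io/main-container-name"

// pod is a hypothetical, simplified stand-in for v1.Pod.
type pod struct {
	Annotations map[string]string
	Containers  []string // container names in spec order
}

// mainContainerName returns the annotated container name when it matches a
// container in the pod, and otherwise the first container in the list.
// The second return value is false only when the pod has no containers.
func mainContainerName(p pod) (string, bool) {
	if len(p.Containers) == 0 {
		return "", false
	}
	if name, ok := p.Annotations[mainContainerNameAnnotationKey]; ok {
		for _, c := range p.Containers {
			if c == name {
				return name, true
			}
		}
		// Assumption: an annotation naming an unknown container is ignored.
	}
	return p.Containers[0], true
}

func main() {
	annotated := pod{
		Annotations: map[string]string{mainContainerNameAnnotationKey: "app"},
		Containers:  []string{"sidecar", "app"},
	}
	plain := pod{Containers: []string{"sidecar", "app"}}

	a, _ := mainContainerName(annotated)
	b, _ := mainContainerName(plain)
	fmt.Println(a, b)
}
```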

@waynepeking348 waynepeking348 self-requested a review January 3, 2024 09:44
ContainerName: container.Name,
ContainerType: containerType,
ContainerIndex: containerIndex,
// customize for tce, PodRole and PodType should be identified by more general annotations
Member

Maybe we can remove "customize for tce"?

errBadSocket = "bad socketPath, must be an absolute path:"

// errUnsupportedVersion is the error raised when the resource plugin uses an API version not
// supported by the Kubelet registry
Member

Could you also modify this comment?

@WangZzzhe WangZzzhe force-pushed the dev/outofband-resource-manager branch from 5194e72 to bdebb96 Compare January 3, 2024 12:49
fs.DurationVar(&o.ORMRconcilePeriod, "orm-reconcile-period",
o.ORMRconcilePeriod, "orm resource reconcile period")
fs.StringToStringVar(&o.ORMResourceNamesMap, "orm-resource-names-map", o.ORMResourceNamesMap,
"A set of ResourceName=ResourceQuantity pairs that map resource name during QoS Resource Manager allocation period. "+
Member

Would it be more accurate to use Out-of-band Resource Manager instead of QoS Resource Manager in the message?

@WangZzzhe WangZzzhe force-pushed the dev/outofband-resource-manager branch from bdebb96 to 2e93922 Compare January 3, 2024 13:02
@caohe caohe requested review from luomingmeng and csfldf January 3, 2024 13:04
@waynepeking348 waynepeking348 merged commit e86c2b7 into kubewharf:main Jan 4, 2024
10 checks passed
Labels
enhancement New feature or request workflow/merge-ready merge-ready: code is ready and can be merged workflow/need-review review: test succeeded, need to review

5 participants