KEP-4205: KEP to add PSI metrics in Summary API, and for PSI based node actions #4230

ndixita · 2023-09-21T21:14:09Z

One-line PR description: This KEP lays the initial foundations to use PSI metrics for setting Node Conditions.

Issue link: Support PSI based on cgroupv2 #4205

ndixita · 2023-09-21T21:14:47Z

/assign @rphillips
/assign @haircommander
/assign @bobbypage

keps/sig-node/4205-psi-metric/README.md

tzneal

+1, this may help provide more observability to users for nodes that are overloaded without requiring logging into the node to troubleshoot.

keps/sig-node/4205-psi-metric/README.md

rphillips · 2023-09-29T17:28:58Z

I think we will want to propagate the PSI metrics from the main Kubelet slices separately: /, system.slice, and kubepods.slice. I do not see that reflected in this enhancement. Thoughts?
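As a rough illustration of the per-slice split suggested above, the sketch below lists where the pressure files for `/`, `system.slice`, and `kubepods.slice` would typically live. It assumes cgroup v2 mounted at `/sys/fs/cgroup` and the systemd cgroup driver; `PressurePaths` and `cgroupRoot` are hypothetical names for this example, not part of the KEP or the kubelet.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// cgroupRoot is an assumed cgroup v2 mount point; real nodes may differ.
const cgroupRoot = "/sys/fs/cgroup"

// PressurePaths returns, for one resource ("cpu", "memory", or "io"), the
// pressure file for each of the three levels named in the review comment.
// The node level is read from /proc/pressure, as in the output quoted later
// in this thread.
func PressurePaths(resource string) map[string]string {
	return map[string]string{
		"/":              filepath.Join("/proc/pressure", resource),
		"system.slice":   filepath.Join(cgroupRoot, "system.slice", resource+".pressure"),
		"kubepods.slice": filepath.Join(cgroupRoot, "kubepods.slice", resource+".pressure"),
	}
}

func main() {
	paths := PressurePaths("cpu")
	// Iterate in a fixed order so the output is stable.
	for _, level := range []string{"/", "system.slice", "kubepods.slice"} {
		fmt.Printf("%-15s %s\n", level, paths[level])
	}
}
```

Keeping the three levels separate would let a consumer tell pressure caused by system daemons apart from pressure caused by pods.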

bobbypage

Thank you for proposing this KEP, I am super excited to see us build on top of cgroupv2 using features like PSI in the kubelet! Left a few comments.

keps/sig-node/4205-psi-metric/README.md

haircommander

A couple more nits. We also still need some resolution on the multi-tiered node condition (separate conditions for the kubepods cgroup vs. the system cgroup), or at least a mention of it, since we have a couple of releases between now and then.

haircommander · 2023-10-02T17:37:03Z

keps/sig-node/4205-psi-metric/README.md

+Avg300 *float64 `json:"avg300"`
+Total *float64 `json:"total"`


indentation is weird here

This comment still holds
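For reference, a minimal sketch of how the fields quoted in this thread could serialize. Only `Avg300` and `Total` appear in the quoted diff; the type name `PSIData` and the `Avg10` / `Avg60` fields are assumptions added for illustration. Note that Go only recognizes struct tags written with ASCII double quotes; with typographic quotes, `encoding/json` silently falls back to the Go field name.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PSIData is a hypothetical container for one "some" or "full" PSI line.
// Avg300 and Total mirror the quoted diff; Avg10 and Avg60 are assumed.
type PSIData struct {
	Avg10  *float64 `json:"avg10"`
	Avg60  *float64 `json:"avg60"`
	Avg300 *float64 `json:"avg300"`
	Total  *float64 `json:"total"`
}

// ptr returns a pointer to v, which keeps the literal below compact.
func ptr(v float64) *float64 { return &v }

// Encode marshals a PSIData value to its JSON representation.
func Encode(d PSIData) (string, error) {
	b, err := json.Marshal(d)
	return string(b), err
}

func main() {
	out, err := Encode(PSIData{
		Avg10:  ptr(58.43),
		Avg60:  ptr(57.45),
		Avg300: ptr(57.33),
		Total:  ptr(27907960587),
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Pointer fields let a consumer distinguish "not reported" (`null`) from a measured value of zero.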

keps/sig-node/4205-psi-metric/README.md

haircommander · 2023-10-02T17:39:12Z

keps/sig-node/4205-psi-metric/README.md

+#### Phase 1: Alpha
+
+- PSI integrated in kubelet behind a feature flag.
+- Initial e2e tests completed and enabled.


may be tricky to have e2e tests if we need CRI extensions to collect the metrics

Yeah we can decide on the coverage later?

TBH I'd drop this and replace it with unit testing

What kind of CRI extensions are we talking about here? Are those optional, i.e. will the functionality only be available for some container runtimes? (I'm not a CRI expert 😉)

keps/sig-node/4205-psi-metric/README.md

haircommander · 2023-10-04T14:06:08Z

a couple more nits. I propose we mostly focus reviews on phase 1 now--which to me mostly looks good--as the following phases will likely change with time

soltysh

#prr-shadow

keps/sig-node/4205-psi-metric/README.md

keps/sig-node/4205-psi-metric/kep.yaml

rphillips · 2023-10-05T22:11:31Z

I took a pass on a review. @soltysh's comments should be addressed before we merge this; otherwise, lgtm.

tzneal · 2023-10-05T22:15:19Z

/lgtm

ndixita · 2023-10-05T22:22:00Z

/assign deads2k

ndixita · 2023-10-05T22:22:46Z

/assign @soltysh

keps/sig-node/4205-psi-metric/README.md

Signed-off-by: Dixita Narang <ndixita@google.com>

johnbelamaric · 2023-10-05T23:15:56Z

/approve
/lgtm

k8s-ci-robot · 2023-10-05T23:16:05Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnbelamaric, mrunalp, ndixita

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~keps/prod-readiness/OWNERS~~ [johnbelamaric]
~~keps/sig-node/OWNERS~~ [johnbelamaric,mrunalp]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

tzneal · 2023-10-05T23:24:46Z

/lgtm

BojanZelic · 2024-02-13T23:29:10Z

I've been experimenting with a similar idea, using node-problem-detector to set node conditions based on PSI pressure indicators, and ran into an issue that might affect this KEP.

While CPU PSI pressure is a good indicator of how overloaded a node is, it also rises whenever CFS CPU throttling occurs.

For example, if a user schedules a pod with CPU limits set too low, so that throttling happens frequently, the PSI metrics for that cgroup are elevated as well:

root@ip-10-4-172-240:/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podd75d892c_9bb2_4b9c_8ce2_7868eb91e665.slice/cri-containerd-fc6e9647769050baf7ea79855ca22b8496ed9b6dd2b7f06c2e583e18a25dc920.scope# cat cpu.pressure

some avg10=58.43 avg60=57.45 avg300=57.33 total=27907960587
full avg10=58.12 avg60=57.28 avg300=57.17 total=27494547488

which ends up being propagated to the system level metric:

cat /proc/pressure/cpu
some avg10=58.51 avg60=58.82 avg300=58.91 total=1062145857327
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
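The pressure files shown above share one line format, so a single parser covers both the per-cgroup `cpu.pressure` file and `/proc/pressure/cpu`. The following is a minimal sketch of such a parser; `PSILine` and `ParsePSI` are hypothetical names for this example and are not taken from the KEP, the kubelet, or cAdvisor.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// PSILine holds one line ("some" or "full") of a pressure file.
type PSILine struct {
	Avg10, Avg60, Avg300 float64 // percentage of time stalled over 10s/60s/300s
	Total                uint64  // cumulative stall time in microseconds
}

// ParsePSI parses the contents of a pressure file into a map keyed by the
// line prefix ("some" or "full"). Unknown keys are ignored.
func ParsePSI(content string) (map[string]PSILine, error) {
	out := make(map[string]PSILine)
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 0 {
			continue // skip blank lines
		}
		var line PSILine
		for _, kv := range fields[1:] {
			key, val, ok := strings.Cut(kv, "=")
			if !ok {
				return nil, fmt.Errorf("malformed field %q", kv)
			}
			var err error
			switch key {
			case "avg10":
				line.Avg10, err = strconv.ParseFloat(val, 64)
			case "avg60":
				line.Avg60, err = strconv.ParseFloat(val, 64)
			case "avg300":
				line.Avg300, err = strconv.ParseFloat(val, 64)
			case "total":
				line.Total, err = strconv.ParseUint(val, 10, 64)
			}
			if err != nil {
				return nil, fmt.Errorf("parsing %q: %w", kv, err)
			}
		}
		out[fields[0]] = line
	}
	return out, sc.Err()
}

func main() {
	// Sample taken from the /proc/pressure/cpu output quoted above.
	sample := "some avg10=58.51 avg60=58.82 avg300=58.91 total=1062145857327\n" +
		"full avg10=0.00 avg60=0.00 avg300=0.00 total=0\n"
	psi, err := ParsePSI(sample)
	if err != nil {
		panic(err)
	}
	fmt.Printf("some.avg10=%.2f full.avg10=%.2f\n", psi["some"].Avg10, psi["full"].Avg10)
}
```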

CPU throttling:

root@ip-10-4-172-240:/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podd75d892c_9bb2_4b9c_8ce2_7868eb91e665.slice/cri-containerd-fc6e9647769050baf7ea79855ca22b8496ed9b6dd2b7f06c2e583e18a25dc920.scope# cat cpu.stat
usage_usec 139474622887
user_usec 133159602378
system_usec 6315020509
core_sched.force_idle_usec 0
nr_periods 478231
nr_throttled 459361
throttled_usec 437837995163
nr_bursts 0
burst_usec 0

So elevated CPU PSI pressure could also mean heavy throttling of a container, which doesn't necessarily mean a node-level issue is occurring.
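One way to tell the two cases apart would be to read `cpu.stat` alongside `cpu.pressure` and check how often the cgroup was throttled. The sketch below is only an illustration of that idea: `ParseCPUStat`, `ThrottledRatio`, and `LikelyLimitInduced` are hypothetical helpers, and the thresholds are arbitrary placeholders, not values proposed in the KEP.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// ParseCPUStat parses the flat "key value" lines of a cgroup v2 cpu.stat file.
func ParseCPUStat(content string) (map[string]uint64, error) {
	stats := make(map[string]uint64)
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue // skip blank or unexpected lines
		}
		v, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			return nil, fmt.Errorf("parsing %q: %w", sc.Text(), err)
		}
		stats[fields[0]] = v
	}
	return stats, sc.Err()
}

// ThrottledRatio returns the fraction of CFS enforcement periods in which the
// cgroup was throttled. The counters are cumulative, so a real consumer would
// likely compare deltas between two samples instead of lifetime totals.
func ThrottledRatio(stats map[string]uint64) float64 {
	periods := stats["nr_periods"]
	if periods == 0 {
		return 0
	}
	return float64(stats["nr_throttled"]) / float64(periods)
}

// LikelyLimitInduced is a hypothetical heuristic: high CPU pressure combined
// with a high throttled ratio suggests the pressure comes from the
// container's own CPU limit rather than from node-level contention.
func LikelyLimitInduced(someAvg10, throttledRatio float64) bool {
	const pressureThreshold = 10.0 // percent, arbitrary placeholder
	const throttleThreshold = 0.5  // fraction of periods, arbitrary placeholder
	return someAvg10 >= pressureThreshold && throttledRatio >= throttleThreshold
}

func main() {
	// Counters taken from the cpu.stat output quoted above.
	stats, err := ParseCPUStat("nr_periods 478231\nnr_throttled 459361\nthrottled_usec 437837995163\n")
	if err != nil {
		panic(err)
	}
	ratio := ThrottledRatio(stats)
	fmt.Printf("throttled ratio: %.2f, limit-induced: %v\n", ratio, LikelyLimitInduced(58.43, ratio))
}
```

With the counters quoted above, the container was throttled in roughly 96% of its enforcement periods, which is consistent with the pressure being caused by its CPU limit.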

k8s-ci-robot added the cncf-cla: yes label Sep 21, 2023

k8s-ci-robot requested review from derekwaynecarr and mrunalp September 21, 2023 21:14

k8s-ci-robot added the kind/kep, sig/node, and size/XL labels Sep 21, 2023

k8s-ci-robot assigned bobbypage, haircommander and rphillips Sep 21, 2023

ndixita force-pushed the psi-kep branch from 4b4b1be to 5cf286a Compare September 21, 2023 21:21

kannon92 reviewed Sep 22, 2023

kannon92 reviewed Sep 22, 2023

tzneal reviewed Sep 22, 2023

tzneal reviewed Sep 22, 2023

ndixita force-pushed the psi-kep branch from 5cf286a to 8960c2c Compare September 28, 2023 21:14

bobbypage reviewed Sep 30, 2023

haircommander reviewed Oct 2, 2023

ndixita force-pushed the psi-kep branch 3 times, most recently from b757350 to e39ea7d Compare October 4, 2023 00:11

haircommander reviewed Oct 4, 2023

rayandas mentioned this pull request Oct 4, 2023

Support PSI based on cgroupv2 #4205

Open

10 tasks

soltysh reviewed Oct 4, 2023

ndixita force-pushed the psi-kep branch 2 times, most recently from 2ce2c93 to 44189d1 Compare October 4, 2023 22:37

ndixita force-pushed the psi-kep branch 4 times, most recently from ddeb0eb to c280955 Compare October 5, 2023 21:48

ndixita force-pushed the psi-kep branch from c280955 to a395c40 Compare October 5, 2023 22:13

k8s-ci-robot assigned tzneal Oct 5, 2023

k8s-ci-robot added the lgtm label Oct 5, 2023

mrunalp approved these changes Oct 5, 2023

k8s-ci-robot assigned deads2k Oct 5, 2023

k8s-ci-robot assigned soltysh Oct 5, 2023

johnbelamaric reviewed Oct 5, 2023

KEP to add PSI metrics in Summary API, and for PSI based node actions

024d4d5

Signed-off-by: Dixita Narang <ndixita@google.com>

ndixita force-pushed the psi-kep branch from a395c40 to 024d4d5 Compare October 5, 2023 23:10

k8s-ci-robot removed the lgtm label Oct 5, 2023

k8s-ci-robot assigned johnbelamaric Oct 5, 2023

k8s-ci-robot added the lgtm label Oct 5, 2023

k8s-ci-robot added the approved label Oct 5, 2023

k8s-ci-robot merged commit 6765cba into kubernetes:master Oct 5, 2023

k8s-ci-robot added this to the v1.29 milestone Oct 5, 2023

akhilerm mentioned this pull request Nov 27, 2023

Expose Pressure Stall Information (PSI) metrics containerd/containerd#9411

Closed

soltysh mentioned this pull request Apr 3, 2024

Add soltysh to prod-readiness-approvers #4566

Merged
