Conversation

@pohly
Contributor

@pohly pohly commented May 7, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

The new metric informs admins whether DRA in general (the special empty `driver_name` label value) and/or specific DRA drivers (other label values) are in use on nodes. This is useful to know because removing a driver is only safe if it is not in use: if a driver gets removed while it has prepared a ResourceClaim, unpreparing that ResourceClaim and stopping its pods is blocked.

The implementation of the metric takes a read lock on the claim info cache, retrieves the "claims in use", and turns them into the metric.

The same code is also used to log changes in the claim info cache as a diff. This hooks into write updates of the claim info cache and uses contextual logging.
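
For illustration, here is a minimal, self-contained sketch of how "claims in use" can be derived under a read lock; all names (claimInfoCache, claimInfo, claimsInUse) and fields are hypothetical stand-ins and only approximate the actual kubelet code:

package main

import (
	"fmt"
	"sync"
)

// claimInfo is a stand-in for the kubelet's per-claim bookkeeping.
type claimInfo struct {
	prepared    bool
	driverNames []string // drivers that prepared devices for this claim
}

// claimInfoCache is a stand-in for the kubelet's claim info cache.
type claimInfoCache struct {
	sync.RWMutex
	claims map[string]*claimInfo // keyed by claim UID
}

// claimsInUse counts prepared claims overall (empty driver name) and per driver.
// It only takes the read lock, so collecting the metric does not block writers for long.
func (c *claimInfoCache) claimsInUse() map[string]int {
	c.RLock()
	defer c.RUnlock()

	counts := map[string]int{"": 0} // "" = any driver ("DRA in general")
	for _, info := range c.claims {
		if !info.prepared {
			continue
		}
		counts[""]++
		for _, driver := range info.driverNames {
			counts[driver]++
		}
	}
	return counts
}

func main() {
	cache := &claimInfoCache{claims: map[string]*claimInfo{
		"uid-1": {prepared: true, driverNames: []string{"driver-a"}},
		"uid-2": {prepared: true, driverNames: []string{"driver-a", "driver-b"}},
		"uid-3": {prepared: false, driverNames: []string{"driver-c"}},
	}}
	fmt.Println(cache.claimsInUse()) // map[:2 driver-a:2 driver-b:1]
}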

Which issue(s) this PR fixes:

Slack discussion: https://kubernetes.slack.com/archives/C0409NGC1TK/p1746168044475379?thread_ts=1746001550.655339&cid=C0409NGC1TK
Related-to: kubernetes/enhancements#4381 (GA?)

Special notes for your reviewer:

The unit tests check that metrics get calculated. The e2e_node test checks that kubelet really exports the metrics data.

While at it, some bugs in claiminfo_test.go get fixed: the way the cache was populated in the test no longer matched the actual code.

Let's review this proposal, then document it as part of the 1.34 KEP update before merging the implementation.

/hold
/assign @bart0sh

Does this PR introduce a user-facing change?

The new `dra_resource_claims_in_use` kubelet metric informs about active ResourceClaims, overall and by driver.

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/kubelet labels May 7, 2025
@k8s-ci-robot k8s-ci-robot requested review from dchen1107 and sjenning May 7, 2025 11:52
@k8s-ci-robot k8s-ci-robot added area/test sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels May 7, 2025
@k8s-ci-robot k8s-ci-robot added wg/device-management Categorizes an issue or PR as relevant to WG Device Management. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 7, 2025
// cdiDevicesAsList returns a list of CDIDevices from the provided claim info.
// When the request name is non-empty, only devices relevant for that request
// are returned.
func (info *ClaimInfo) cdiDevicesAsList(requestName string) []kubecontainer.CDIDevice {
Contributor Author

I moved this method unchanged because it was odd that most of the ClaimInfo methods were above, except for this one, which came after the claimInfoCache methods.

gomega.Eventually(kubeletPlugin2.GetGRPCCalls).WithTimeout(retryTestTimeout).Should(testdriver.NodeUnprepareResourcesSucceeded)
})

ginkgo.It("must provide metrics", func(ctx context.Context) {
Contributor Author

This is an e2e_node test because it is easier to get the kubelet metrics there.

If that could also be done in an E2E test, then putting the test there would be more appropriate: the test doesn't really depend on the kubelet configuration, and E2E tests are easier to run.

@pohly pohly moved this from 🆕 New to 👀 In review in Dynamic Resource Allocation May 7, 2025
@SergeyKanzhelev SergeyKanzhelev moved this from Triage to Archive-it in SIG Node CI/Test Board May 7, 2025
Contributor

@ArangoGutierrez ArangoGutierrez left a comment

Non-blocking comments

@pohly
Contributor Author

pohly commented May 13, 2025

e2e-kind failed because of #131748.

gomega.Expect(kubeletPlugin1.GetGRPCCalls()).Should(testdriver.NodePrepareResourcesSucceeded, "Plugin 1 should have prepared resources.")
gomega.Expect(kubeletPlugin2.GetGRPCCalls()).Should(testdriver.NodePrepareResourcesSucceeded, "Plugin 2 should have prepared resources.")
driverName := func(element any) string {
el := element.(*model.Sample)
Contributor Author

This triggers

ERROR: Some files are importing packages under github.com/prometheus/* but are not allow-listed to do so.

See: https://github.com/kubernetes/kubernetes/issues/89267

Failing files:
  ./test/e2e_node/dra_test.go

This is how some other code checks metrics.

@dgrisonnet @richabanker: is this one of those cases where it's okay to extend the allow list? Or is there a different way of checking for the expected outcome?

Contributor

Seems like we have allowed extending the allow list in the past (ref), so I guess we can just do that now? Regarding the usage, I see similar usage of the model package in the codebase to verify metrics data, so this should be fine? cc @serathius, the author of the linked issue, in case he has any ideas on how to avoid importing the package here.

Contributor

Alternatively, maybe you could try creating a map representation (map[string]float64) of the vector, where the key is the driver_name and the value is the metric value, and then use something like this for verifying the values?

claimsInUse := convertVectorToMap(metrics, "dra_resource_claims_in_use")
gomega.Expect(claimsInUse).Should(gstruct.MatchKeys(gstruct.IgnoreExtras, gstruct.Keys{
	"":                   gomega.BeEquivalentTo(1),
	kubeletPlugin1Name: gomega.BeEquivalentTo(1),
	kubeletPlugin2Name: gomega.BeEquivalentTo(1),
}), "metrics while pod is running")

Contributor Author

That feels like a workaround. I prefer adding a type alias in k8s.io/component-base/metrics/testutil: that package is geared towards use in tests, and already defines a type which directly exposes model.Sample, so letting consumers of that package also use that type directly seems fair. The code already depends on it anyway.

In other words, this was possible before:

var metrics testutil.Metrics
metrics = ...
samples := metrics["dra_resource_claims_in_use"]
sample := samples[0]

It should also be possible to write:

var sample *testutil.Sample
sample = samples[0]

I suppose some of the importers under

./test/e2e/apimachinery/flowcontrol.go
./test/e2e_node/mirror_pod_grace_period_test.go
./test/e2e/node/pods.go
./test/e2e_node/resource_metrics_test.go
./test/instrumentation/main_test.go
./test/integration/apiserver/flowcontrol/concurrency_test.go
./test/integration/apiserver/flowcontrol/concurrency_util_test.go
./test/integration/metrics/metrics_test.go
could use the same approach, but I haven't checked.

For now I have added the type aliases to this PR and use them in dra_test.go.
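
For context, the aliases amount to roughly the following in k8s.io/component-base/metrics/testutil (a sketch; the exact set of aliases added by the PR may differ):

package testutil

import "github.com/prometheus/common/model"

// Sample is a type alias so that consumers of this package can work with
// the individual samples already exposed via Metrics (map[string]model.Samples)
// without importing the prometheus model package themselves.
type Sample = model.Sample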

@pohly pohly force-pushed the dra-kubelet-in-use-metric branch from 13012f0 to 935ff9f Compare May 14, 2025 11:36
pohly added a commit to pohly/kubernetes that referenced this pull request Jun 26, 2025
@k8s-ci-robot
Contributor

@pohly: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
pull-kubernetes-unit-windows-master | 6d6a749 | link | false | /test pull-kubernetes-unit-windows-master

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

pohly added a commit to pohly/kubernetes that referenced this pull request Jun 27, 2025
@pohly
Contributor Author

pohly commented Jun 30, 2025

@bart0sh
Contributor

bart0sh commented Jun 30, 2025

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 30, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 59fb3d89a93e3404fa347cd8d5315c2c0cc07aaf

@pohly
Contributor Author

pohly commented Jul 2, 2025

/hold cancel

The metric was discussed in the KEP and implemented as proposed.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 2, 2025
pohly added a commit to pohly/kubernetes that referenced this pull request Jul 2, 2025
pohly added a commit to pohly/kubernetes that referenced this pull request Jul 2, 2025
pohly added a commit to pohly/kubernetes that referenced this pull request Jul 6, 2025
@klueska
Contributor

klueska commented Jul 7, 2025

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: klueska, pohly, richabanker

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 7, 2025
@k8s-ci-robot k8s-ci-robot merged commit ee012e8 into kubernetes:master Jul 7, 2025
18 of 19 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.34 milestone Jul 7, 2025
@github-project-automation github-project-automation bot moved this from Archive-it to Done in SIG Node CI/Test Board Jul 7, 2025
pohly added a commit to pohly/kubernetes that referenced this pull request Jul 7, 2025
@jgehrcke

jgehrcke commented Jul 7, 2025

Thanks! As discussed in a sync call, I wanted to think about this a bit.

So, we want to detect more or less correctly when a driver is "not in use". The proposed method based on Prometheus gauges will provide value towards that.

I think that one fundamental challenge is to reliably distinguish an explicit zero from missing data.

IIUC, we will recommend that operators use sum(dra_resource_claims_in_use{driver_name="some-driver"}). Effectively, that's a sum over many time series (one per contributing kubelet instance, running on different nodes).

It's important to know that when one of the contributing series goes away (say, because the node goes down or because part of the observability pipeline breaks), its contribution to the sum drops to zero. This is because of the "staleness" concept in Prometheus (which for gauges typically kicks in when there hasn't been an update for just a few minutes).

From Prometheus docs:

If a query is evaluated at a sampling timestamp after a time series is marked as stale, then no value is returned for that time series

If I remember correctly, the "no value" in this sentence effectively translates to zero in a summing operation.

This is

  • convenient behavior if the node is gone for good, with all its state (if the old value kept living forever, that could indeed be an inconvenience)
  • non-ideal behavior when the node comes back online later, with its previous state

Pragmatic conclusion: maybe we can make sure in the future that we always emit an explicit zero. In the documentation, we can point out that operators should pay attention to two queries before concluding that a DRA driver is "not in use":

  • count(dra_resource_claims_in_use{driver_name="some-driver"}) matches the expected value (the node count?)
  • sum(dra_resource_claims_in_use{driver_name="some-driver"}) dropped to zero

@pohly
Contributor Author

pohly commented Jul 8, 2025

It's important to know that when one of the contributing series goes away (say, because the node goes down or because a part in the observability pipeline breaks) then in the sum this contribution will drop to zero.

Is the same true for taking the maximum? What we want to know before driver removal is "in use anywhere"; we don't need the more precise "in use how often", so the maximum would be sufficient.

@jgehrcke

jgehrcke commented Jul 8, 2025

Is the same true for taking the maximum?

Yes, I am rather certain that a stale time series is treated the same in all of the aggregate operators.

(Ultimately, we should test whether this is true.)

According to this presentation, the "no value" resulting from staleness is actually a NaN. Looking further... yes: https://github.com/prometheus/prometheus/blob/release-2.29/pkg/value/value.go#L28 -- that's an implementation detail, however.

Let's conclude: if there is no fresh data for a time series at a certain point in time (e.g., "now"), then it is a design choice in Prometheus to use no value/NaN instead of an old value.

Interesting: a scrape target going away (a scrape failure) is one reason for a time series to go stale even before the five-minute interval I mentioned above (the --query.lookback-delta config).

Linking to Brian's talk from a couple of years ago: https://youtu.be/GcTzd2CLH7I?si=qX1sOG7DoxHAw-1o&t=704

In a larger-scale system, it's not unlikely for individual scrapes to fail every now and then (a scrape is just an HTTP request that, AFAIR, is not even retried aggressively).

That is: maybe we should require operators to check for stale time series when looking at dra_resource_claims_in_use.

How does one do that? I just found this recommendation:

Generally, instead of looking for stale data, you should alert when the target isn’t being scraped. Prometheus has built in metrics like up for this.

The up metric documented here is what people typically use for detecting scrape failures.

So, maybe checking that the up metric does not indicate any scrape problems at the time of evaluating dra_resource_claims_in_use is a decent solution.

@pohly
Contributor Author

pohly commented Jul 8, 2025

It makes sense that aggregation results in NaN. But then why is that not shown to the user instead of reporting 0, which can be confused with a legitimate value and is (in this case) potentially incorrect?

Either way, it's obviously a bit more complicated. I'm not sure where to document these best practices. On a page about DRA we shouldn't have to explain how to configure scraping or how to write queries. We cannot even assume that people use Prometheus?!

Do you perhaps want to work on an example dashboard and publish a tutorial or blog post? That then can target Prometheus.
