
KEP-3695: extend pod resource API for DRA #3738

Merged
3 commits merged on Feb 9, 2023

Conversation

moshe010
Contributor

Signed-off-by: Moshe Levi moshele@nvidia.com

  • One-line PR description:
    Extend pod resource API for Dynamic Resource Allocation
  • Other comments:

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 14, 2023
@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Jan 14, 2023
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 14, 2023
@k8s-ci-robot
Contributor

Hi @moshe010. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 14, 2023
@bart0sh
Contributor

bart0sh commented Jan 14, 2023

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 14, 2023
@bart0sh
Contributor

bart0sh commented Jan 14, 2023

/assign
/assign @klueska @derekwaynecarr @kad

@ffromani
Contributor

/cc

@k8s-ci-robot k8s-ci-robot requested a review from ffromani January 23, 2023 11:41
@moshe010 moshe010 force-pushed the dra-pod-resource-api branch from c859f61 to 65cfcd2 Compare January 24, 2023 07:19
@klueska
Contributor

klueska commented Jan 24, 2023

This PR seems to be missing the required file in keps/prod-readiness/sig-node following the template: https://github.com/kubernetes/enhancements/blob/master/keps/prod-readiness/template/nnnn.yaml

Just to make sure, did you also follow the latest template for the KEP itself?:
https://github.com/kubernetes/enhancements/tree/master/keps/NNNN-kep-template

@ffromani ffromani left a comment

I'm generally fine with the content of the KEP here; I would just like it documented better why DRA can't provide data useful to GetAllocatableResources, and whether this status could change in the future. A few lines are sufficient.

Comment on lines 77 to 78
To enhance the pod resource API GetAllocatableResources for resources that are managed by DRA. With DRA there is no standard
way to get resource capacity.
Contributor

Non-blocking: I understand we can't provide something that DRA itself is not providing, but making the API more irregular makes me pause. Can we try to list options here to sketch some plans for the future? It would also be nice to spend a few lines explaining why DRA can't provide this information.

Contributor Author

Maybe this issue of a common way to get resource capacity with DRA should be tracked as a separate issue?

@klueska maybe we can standardize the nodeallocationstates CRD https://gitlab.com/nvidia/cloud-native/k8s-dra-driver/-/blob/main/deployments/static/crds/gpu.resource.nvidia.com_nodeallocationstates.yaml as the CRD for the DRA kubelet plugin?

Contributor

Not sure how we would (or even could) provide Capacity or Allocatable for DRA managed resources.

To answer your question directly, I don't think it's reasonable to assume there will be a standard representation for each resource managed by a DRA resource driver. It's not even required that a CRD be used for this purpose, let alone adhere to a specific format.

Besides that, the notion of a "count" for a DRA managed resource is itself not well defined, since each resource that gets allocated can take a drastically different form depending on what ClaimParameter objects are used to request it.

@klueska klueska left a comment

Made it about a third of the way through and need to take a break. Will come back to it soon.

@moshe010
Contributor Author

This PR seems to be missing the required file in keps/prod-readiness/sig-node following the template: https://github.com/kubernetes/enhancements/blob/master/keps/prod-readiness/template/nnnn.yaml

I wasn't aware of it. Who should I put as the approver? Should I just pick one from the list
https://github.com/kubernetes/enhancements/blob/master/OWNERS_ALIASES#L191-L194?

Just to make sure, did you also follow the latest template for the KEP itself?: https://github.com/kubernetes/enhancements/tree/master/keps/NNNN-kep-template
Yes, I used the latest template.

@moshe010 moshe010 force-pushed the dra-pod-resource-api branch 2 times, most recently from 6c00f11 to 284623e Compare January 26, 2023 08:05
@ffromani
Contributor

This PR seems to be missing the required file in keps/prod-readiness/sig-node following the template: https://github.com/kubernetes/enhancements/blob/master/keps/prod-readiness/template/nnnn.yaml

I wasn't aware of it. Who should I put as the approver? Should I just pick one from the list https://github.com/kubernetes/enhancements/blob/master/OWNERS_ALIASES#L191-L194?

Yes, please pick one. The PRR reviewers will rebalance the load internally and suggest a replacement should the one you picked turn out to be overloaded.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 7, 2023
Signed-off-by: Moshe Levi <moshele@nvidia.com>
@moshe010 moshe010 force-pushed the dra-pod-resource-api branch from 24f9123 to ccacf27 Compare February 7, 2023 18:30
Our proposal is to extend the existing PodResources gRPC service of the Kubelet with a repeated DynamicResource field in the ContainerResources message. This new field will contain information about the DRA resource class, the DRA resource claim and a list of CDI Devices allocated by a DRA driver.
Additionally, we propose adding a Get method to the existing gRPC service to allow querying specific pods for their allocated resources.

The extended interface is shown in proto below:
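The proto definition itself is not reproduced in this excerpt. A minimal sketch of the extension described above, with field names and numbering as illustrative assumptions rather than the authoritative definition, could look like this:

```proto
service PodResourcesLister {
    rpc List(ListPodResourcesRequest) returns (ListPodResourcesResponse) {}
    rpc GetAllocatableResources(AllocatableResourcesRequest) returns (AllocatableResourcesResponse) {}
    // New: query a single pod for its allocated resources.
    rpc Get(GetPodResourcesRequest) returns (GetPodResourcesResponse) {}
}

message GetPodResourcesRequest {
    string pod_name = 1;
    string pod_namespace = 2;
}

message GetPodResourcesResponse {
    PodResources pod_resources = 1;
}

message ContainerResources {
    string name = 1;
    repeated ContainerDevices devices = 2;
    repeated int64 cpu_ids = 3;
    repeated ContainerMemory memory = 4;
    // New: resources allocated to the container through DRA.
    repeated DynamicResource dynamic_resources = 5;
}

message DynamicResource {
    string class_name = 1;
    string claim_name = 2;
    string claim_namespace = 3;
    repeated ClaimResource claim_resources = 4;
}

message ClaimResource {
    repeated CDIDevice cdi_devices = 1;
}

message CDIDevice {
    // Fully qualified CDI device name, e.g. "vendor.com/class=device-id".
    string name = 1;
}
```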
Member

I understand the proposed API is read-only and mainly for monitoring, but I'm not sure about the entire flow. May I have a couple of questions answered before approving this KEP:

  1. We are using polling here. But if the DRA driver is dead, does the pod using the resource continue running? How do we make sure the stats are up to date?
  2. DRA resources might be deallocated on the node; how often does the kubelet issue Get() calls to reconcile the status?
  3. Do we need registration logic with the kubelet?

I guess some of the questions might be discussed in a separate KEP, but I hope we have answers to those high-level questions. Thanks!

@klueska klueska Feb 7, 2023

For (1):
I guess you could think of it as polling, but the PodResources API is really more of an on-demand API that goes and gets the information it needs at the time it is requested. Nothing is cached as part of the API itself. The various components that hold the information (CPUManager, DeviceManager, etc) may cache this information, but the API doesn't "store" this information anywhere by itself.

In the case of the DRA manager, the information required to populate the new fields in the PodResources API is all cached in an in-memory store for each container that gets launched. However, unlike the CPUManager and DeviceManager, this information is not persisted / checkpointed by the DRA manager itself (meaning the cache cannot be directly repopulated across a kubelet restart). It is idempotently queryable from each DRA driver, though. Meaning that the in-memory cache can be used to populate the API if the information is available; otherwise the cache can be filled by querying the DRA driver and then pushed up to the API as needed.

In response to this specific question: "But what if DRA driver is dead, is the pod using the resource continue running?"

The lifetime of a pod is not (necessarily) tied to the "readiness" of the DRA drivers that manage its resource claims. The only time a pod is blocked from deletion by a claim not being deleted (possibly because the DRA driver managing it is down) is if it is the last pod referencing that claim.

So I think what you are asking is "If a pod is running, but I can't communicate with one of the DRA drivers to query its list of DynamicResources (and this information is not available in the DRA managers in-memory cache), what do I do? Do I omit this information? Do I return an error?, etc."

Does this accurately reflect what you are trying to ask?

For (2):
What do you mean by "DRA might be deallocated on the node"? Do you mean the resources associated with a specific DRA driver might be deallocated on a node? And the question is how often the kubelet will need to consult with the DRA driver to refresh whatever information it may have about these resources? I think I answered this in my response to (1) when talking about the in-memory cache of the DRA manager and its ability to idempotently retrieve any missing information from the DRA driver.

The Get() call you refer to is something exposed by the kubelet (not something the kubelet calls itself), so I assume you actually meant NodePrepareResources() (which is the idempotent call a kubelet can make against a DRA driver to get the list of CDI devices associated with a claim).
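For readers unfamiliar with that interface, a rough sketch of the kubelet-facing DRA plugin API as it looked around the v1alpha1 version follows; the RPC was the singular NodePrepareResource at the time, and the message fields shown here are approximations rather than the exact definition:

```proto
service Node {
    // Idempotent: the kubelet may call this repeatedly for the same claim
    // and expects the same set of CDI devices back each time.
    rpc NodePrepareResource(NodePrepareResourceRequest) returns (NodePrepareResourceResponse) {}
    rpc NodeUnprepareResource(NodeUnprepareResourceRequest) returns (NodeUnprepareResourceResponse) {}
}

message NodePrepareResourceRequest {
    string namespace = 1;
    string claim_uid = 2;
    string claim_name = 3;
    string resource_handle = 4;
}

message NodePrepareResourceResponse {
    // Fully qualified CDI device names to inject into the container.
    repeated string cdi_devices = 1;
}
```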

For (3):
What are you proposing should have registration logic with the kubelet? DRA resource drivers already do register with the kubelet (so it can make calls against them to Prepare/Unprepare resources). I think I need more context here to answer the question properly.

Contributor

@dchen1107
I have updated the Proposed API section with more details that hopefully address your comments / concerns. PTAL

https://github.com/kubernetes/enhancements/pull/3738/files#diff-190ad7c104ab2cace0e7696b41212380fa0825a6b5203ca027af4137c342e487R95

Member

@klueska Thanks for the detailed response.

You answered my questions. I was trying to figure out how idempotent the behavior is, since DRA is dynamic, and how persistent and consistent the status is across kubelet and/or DRA driver restarts. Even if we still have some inconsistency across restarts, it is expected to some extent. I am accepting it for now.

Also, the KEP heavily depends on DRA, but it doesn't include a high-level component diagram capturing the dependencies and interfaces; it assumes the reviewers have all the information. My original wish was that, beyond answering my questions, we could address the issue for future readers. I guess I didn't state it clearly. I am ok with it for now.

@klueska klueska Feb 8, 2023

No problem. They were important questions to ask.

Regarding:

Even if we still have some inconsistency across restarts, it is expected to some extent.

I don’t think we will have any inconsistencies. The PodResources API is provided by the kubelet itself, so if it is down the API is inaccessible anyway. And once it comes back up it will reconcile all of the info it needs before it starts serving the API again.

Member

Will there be any race if the API is called too early, before everything has registered?

Member

As for plugins not responding - there will be no partial results, correct? Either all known information is returned or an error. So one plugin can be a "noisy" neighbor for the API. Should this be added as a risk?

Contributor

Will there be any race if the API is called too early, before everything has registered?

There shouldn't be. The set of CDI devices to allocate to a container is pulled from the various DRA plugins before a container is ever created. Either all plugins successfully allocate devices (thus populating the cache) and the container eventually gets created / starts running, or some plugin fails and the container start fails as well. Meaning only containers whose cached CDI device info is complete will ever be created and thus accessible through the API.

As for plugins not responding - there will be no partial results, correct? Either all known information is returned or an error. So one plugin can be a "noisy" neighbor for the API. Should this be added as a risk?

That shouldn't matter. If a plugin does not respond, then the container will never be started and thus is not something queryable through the PodResources API.

@klueska klueska force-pushed the dra-pod-resource-api branch from fd9da2f to ccacf27 Compare February 8, 2023 11:26
@klueska klueska force-pushed the dra-pod-resource-api branch from 404fbd9 to 1191ad8 Compare February 8, 2023 13:24
@cici37 cici37 left a comment

I have a couple of PRR questions. Since this proposal depends heavily on DRA, I guess we should carefully write out the differences/dependencies here, which is what the questions are all about.

/cc @johnbelamaric


###### Does this feature depend on any specific services running in the cluster?

No.
Contributor

Does it require DRA to be working?

Contributor

It requires DRA to be available, but DRA is not a service (it is something built into the kubelet / default scheduler). Simply ensuring that the DRA feature gate is enabled is enough to ensure that DRA is available, so I tend to agree that this does not depend on any services running in the cluster (as stated currently).

@moshe010 moshe010 Feb 8, 2023

@dchen1107
Member

/lgtm
/approve from SIG Node perspective

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 8, 2023
@dchen1107 dchen1107 added this to the v1.27 milestone Feb 8, 2023
@johnbelamaric johnbelamaric left a comment

I'd like to see a few minor updates to the PRR answers, but overall PRR looks good.

Signed-off-by: Moshe Levi <moshele@nvidia.com>
@moshe010 moshe010 force-pushed the dra-pod-resource-api branch from 091b230 to eaa6675 Compare February 9, 2023 09:00
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 9, 2023
@johnbelamaric
Member

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 9, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dchen1107, johnbelamaric, klueska, moshe010

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

owning-sig: sig-node
participating-sigs: []
status: provisional
creation-date: implementable
Member

This field should be the date created. Right now it gives the misleading impression that the KEP is approved as "implementable" when the status is "provisional", and it is missing the date.

see e.g.

@klueska klueska Jan 16, 2024

It looks like something got screwed up here.

The date should definitely be fixed, but the status should, in fact, be implementable, not provisional (the code that implements this KEP was merged in kubernetes/kubernetes#115847).

### Goals

- To allow node monitoring agents to know the allocated DRA resources for Pods on a node.
- To allow the DRA feature to work with CNIs that require complex network devices such as RDMA. DRA resource drivers will allocate the resources, and the meta-plugin will read the allocated [CDI Devices](https://github.com/container-orchestrated-devices/container-device-interface) using the PodResources API. The meta-plugin will then inject the device-id of these CDI Devices as CNI arguments and invoke other CNIs (just as it does for devices allocated by the device plugin today).
Member

This approach was not discussed in SIG Network and WG Multinetwork forums AFAIK; it would be better to involve all the stakeholders in these discussions to arrive at a uniform solution.

The recent developments and consensus for the Networking extension with DRA are described in #4861


#### Beta

- [ ] Gather feedback from consumers of the DRA feature and k8snetworkplumbingwg working group
Member

and SIG Network

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files.
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory
lgtm "Looks good to me", indicates that a PR is ready to be merged.
ok-to-test Indicates a non-member PR verified by an org member that is safe to test.
sig/node Categorizes an issue or PR as relevant to SIG Node.
size/L Denotes a PR that changes 100-499 lines, ignoring generated files.