From 02d0399e11c9a56fee9fe01fe7a6d2d77b15b276 Mon Sep 17 00:00:00 2001 From: Lionel Jouin Date: Thu, 19 Sep 2024 15:43:26 +0200 Subject: [PATCH 01/11] KEP-4817 - Resource Claim Device Status Signed-off-by: Lionel Jouin --- .../README.md | 516 ++++++++++++++++++ .../kep.yaml | 40 ++ 2 files changed, 556 insertions(+) create mode 100644 keps/sig-node/4817-resource-claim-device-status/README.md create mode 100644 keps/sig-node/4817-resource-claim-device-status/kep.yaml diff --git a/keps/sig-node/4817-resource-claim-device-status/README.md b/keps/sig-node/4817-resource-claim-device-status/README.md new file mode 100644 index 00000000000..2ed5c5aa28e --- /dev/null +++ b/keps/sig-node/4817-resource-claim-device-status/README.md @@ -0,0 +1,516 @@ +# KEP-4817: Resource Claim Device Status + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [API - ResourceClaim.Status](#api---resourceclaimstatus) + - [User Stories (Optional)](#user-stories-optional) + - [Story 1 - Network Device Status for Network Services](#story-1---network-device-status-for-network-services) + - [Story 2 - Network Device Status for Troubleshooting](#story-2---network-device-status-for-troubleshooting) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [API](#api) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Alpha](#alpha) + - [Beta](#beta) + - [GA](#ga) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review 
Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) + - [Annotations](#annotations) + - [Pod.Status.PodIPs Enhancement](#podstatuspodips-enhancement) + - [New Pod.Status Field](#new-podstatus-field) + - [KEP-4680 extension](#kep-4680-extension) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) + + +## Release Signoff Checklist + +Items marked with (R) are required *prior to targeting to a milestone / release*. + +- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [ ] (R) Design details are appropriately documented +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [ ] e2e Tests for all Beta API Operations (endpoints) + - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [ ] (R) Graduation criteria is in place + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [ ] (R) Production readiness review completed +- [ ] (R) Production readiness review approved +- [ ] "Implementation History" section is up-to-date for 
milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [ ] Supporting documentation, e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + + + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[kubernetes/website]: https://git.k8s.io/website + +## Summary + +This proposal enhances the `ResourceClaim.Status` by adding a new field `DeviceStatuses`. The new field allows drivers to report driver-specific device status data for each allocated device in a resource claim. Allowing the drivers to report the device statuses will improve both observability and troubleshooting as well as enabling new functionalities such as, for example, if the IPs of a network device are reported, network services. + +This extension also lays the foundation for a potential future standardization of specific type device data, such as, for example, networking devices information. + +## Motivation + +As of now, when a device is configured in a pod/container, the state and characteristics of the device set during the configuration stage are invisible. The lack of this information is then a challenge for the users and developers to diagnose issues, verify configurations, and integrate the allocated resources into higher-level services. Reporting this information is then crucial in environments where device specific configurations are necessary. + +For certain types of devices, such as network interfaces, knowing the detailed characteristics of what has been allocated is particularly useful and crucial. For example, reporting the interface Name, MAC address and IP addresses of network interfaces in the status of a ResourceClaim can significantly help in configuring and managing network services, as well as in debugging network related issues.
 By including such device specific information, this proposal addresses existing gaps in visibility and facilitates better integration and management of resources. + +### Goals + +* Allow arbitrary, driver-specific information to be reported from the DRA drivers for each allocated device in a ResourceClaim. +* Establish a foundation for potential future standardization (e.g. Network Devices status). +* Enable 3rd party implementation of new functionalities based on the Device Status (e.g. Secondary Network Services if the IPs of a network device are reported). + +### Non-Goals + +* Implement new functionalities based on the Device Status (e.g. Kubernetes EndpointSlice Controller supporting IPs reported by networking DRA drivers). +* Modifying the Resource Claim workflow to include the device status report. + +## Proposal + +### API - ResourceClaim.Status + +The API changes define a new `DeviceStatuses` field in the existing `ResourceClaimStatus` struct. `DeviceStatuses` is a slice of a new struct `AllocatedDeviceStatus` which holds device specific information. + +A device, identified by `<driver name>/<pool name>/<device name>` can be represented only once in the `DeviceStatuses` slice and will also mention which request caused the device to be allocated. The state and characteristics of the device are reported in the `Conditions`, representing the operational state of the device and in the `DeviceInfo`, an arbitrary data slice representing device specific characteristics. Additionally, for networking devices, a field `NetworkDeviceInfo` can be used to report the IPs, the MAC address and the interface name. + +`DeviceInfo` being a slice of arbitrary data allows the DRA Driver to store device specific data in different formats. For example, a Network Device being configured via a CNI plugin could get its `DeviceInfo` filled with the CNI result for troubleshooting purposes and with a Network result in a modern standard format (closer to Pod.Status.PodIPs for example) used by 3rd party controllers.
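
The uniqueness rule and the use of `DeviceInfo` described above can be illustrated with a minimal Go sketch. It is not the validation code proposed for `pkg/apis/resource/validation/validation.go`; it assumes simplified stand-ins for the types in this KEP (`json.RawMessage` in place of `runtime.RawExtension`, `Conditions` omitted so that only the standard library is needed). The `deviceID` type, the `validateUnique` helper, the driver name `cni.example.com` and all sample values are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// NetworkDeviceInfo is a simplified stand-in for the struct proposed in this KEP.
type NetworkDeviceInfo struct {
	Interface string   `json:"interface,omitempty"`
	IPs       []string `json:"ips,omitempty"`
	Mac       string   `json:"mac,omitempty"`
}

// AllocatedDeviceStatus is a simplified stand-in: Conditions is omitted and
// json.RawMessage replaces runtime.RawExtension (assumption of this sketch).
type AllocatedDeviceStatus struct {
	Request           string            `json:"request"`
	Driver            string            `json:"driver"`
	Pool              string            `json:"pool"`
	Device            string            `json:"device"`
	DeviceInfo        []json.RawMessage `json:"deviceInfo,omitempty"`
	NetworkDeviceInfo NetworkDeviceInfo `json:"networkDeviceInfo"`
}

// deviceID (hypothetical) is the driver/pool/device triple identifying a device.
type deviceID struct {
	Driver, Pool, Device string
}

// validateUnique (hypothetical helper) rejects a slice in which the same
// driver/pool/device triple is reported more than once.
func validateUnique(statuses []AllocatedDeviceStatus) error {
	seen := make(map[deviceID]struct{}, len(statuses))
	for i, s := range statuses {
		id := deviceID{Driver: s.Driver, Pool: s.Pool, Device: s.Device}
		if _, dup := seen[id]; dup {
			return fmt.Errorf("deviceStatuses[%d]: device %s/%s/%s reported more than once",
				i, s.Driver, s.Pool, s.Device)
		}
		seen[id] = struct{}{}
	}
	return nil
}

func main() {
	// Opaque, driver-specific payload, for example a raw CNI result.
	cniResult := json.RawMessage(`{"cniVersion":"1.0.0","interfaces":[{"name":"net1"}]}`)

	statuses := []AllocatedDeviceStatus{{
		Request:    "macvlan",
		Driver:     "cni.example.com",
		Pool:       "worker-1",
		Device:     "net1",
		DeviceInfo: []json.RawMessage{cniResult},
		NetworkDeviceInfo: NetworkDeviceInfo{
			Interface: "net1",
			IPs:       []string{"10.0.0.5", "fd00::5"},
			Mac:       "02:42:ac:11:00:02",
		},
	}}

	if err := validateUnique(statuses); err != nil {
		fmt.Println("invalid status:", err)
		return
	}
	out, err := json.MarshalIndent(statuses, "", "  ")
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(string(out))
}
```

Keying the lookup on a struct rather than on a joined string avoids any ambiguity, since pool names may themselves contain slashes.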
 + +For each device, if required, the DRA Driver processing the device allocation can report the status of it in the `Status.DeviceStatuses` of the ResourceClaim by using the Kubernetes API. + +```golang +// ResourceClaimStatus tracks whether the resource has been allocated and what +// the result of that was. +type ResourceClaimStatus struct { + ... + + // DeviceStatuses contains the status of each device allocated for this + // claim, as reported by the driver. This can include driver-specific + // information. Entries are owned by their respective drivers. + // + // +optional + // +listType=map + // +listMapKey=driver + // +listMapKey=pool + // +listMapKey=device + DeviceStatuses []AllocatedDeviceStatus `json:"deviceStatuses,omitempty" protobuf:"bytes,4,opt,name=deviceStatuses"` +} + + +// AllocatedDeviceStatus contains the status of an allocated device, if the +// driver chooses to report it. This may include driver-specific information. +type AllocatedDeviceStatus struct { + // Request is the name of the request in the claim which caused this + // device to be allocated. Multiple devices may have been allocated + // per request. + // + // +required + Request string `json:"request" protobuf:"bytes,1,rep,name=request"` + + // Driver specifies the name of the DRA driver whose kubelet + // plugin should be invoked to process the allocation once the claim is + // needed on a node. + // + // Must be a DNS subdomain and should end with a DNS domain owned by the + // vendor of the driver. + // + // +required + Driver string `json:"driver" protobuf:"bytes,2,rep,name=driver"` + + // This name together with the driver name and the device name field + // identify which device was allocated (`<driver name>/<pool name>/<device name>`). + // + // Must not be longer than 253 characters and may contain one or more + // DNS sub-domains separated by slashes.
+ // + // +required + Pool string `json:"pool" protobuf:"bytes,3,rep,name=pool"` + + // Device references one device instance via its name in the driver's + // resource pool. It must be a DNS label. + // + // +required + Device string `json:"device" protobuf:"bytes,4,rep,name=device"` + + // Conditions contains the latest observation of the device's state. + // If the device has been configured according to the class and claim + // config references, the `Ready` condition should be True. + // + // +optional + // +listType=atomic + Conditions []metav1.Condition `json:"conditions" protobuf:"bytes,5,rep,name=conditions"` + + // DeviceInfo contains Arbitrary driver-specific data. + // + // +optional + // +listType=atomic + DeviceInfo []runtime.RawExtension `json:"deviceInfo,omitempty" protobuf:"bytes,6,rep,name=deviceInfo"` + + // NetworkDeviceInfo contains network-related information specific to the device. + // + // +optional + NetworkDeviceInfo NetworkDeviceInfo `json:"networkDeviceInfo,omitempty" protobuf:"bytes,7,rep,name=networkDeviceInfo"` +} + +// NetworkDeviceInfo provides network-related details for the allocated device. +// This information may be filled by drivers or other components to configure +// or identify the device within a network context. +type NetworkDeviceInfo struct { + // Interface specifies the name of the network interface associated with + // the allocated device. This might be the name of a physical or virtual + // network interface. + // + // +optional + Interface string `json:"interface,omitempty" protobuf:"bytes,1,rep,name=interface"` + + // IPs lists the IP addresses assigned to the device's network interface. + // This can include both IPv4 and IPv6 addresses. + // + // +optional + IPs []string `json:"ips,omitempty" protobuf:"bytes,2,rep,name=ips"` + + // Mac represents the MAC address of the device's network interface. 
+ // + // +optional + Mac string `json:"mac,omitempty" protobuf:"bytes,3,rep,name=mac"` +} +``` + +### User Stories (Optional) + +#### Story 1 - Network Device Status for Network Services + +As a Cloud Native Network Function (CNF) vendor, my network services must be integrated with the network devices configured in Pods. The configuration properties of these network devices are therefore essential to configure the network services. For example, the network services must be able to route traffic to pods over networks attached via the network devices, the IP addresses of the network device(s) must then be reflected in the `ResourceClaim.Status` allowing the network service controller(s) to access them. + +#### Story 2 - Network Device Status for Troubleshooting + +As a Network Administrator, troubleshooting networking issues can be complex and time consuming especially when the device characteristics and operational status are not readily accessible. The `DeviceStatuses` field in the `ResourceClaim.Status` provides access to comprehensive details regarding network interfaces helping to quickly and efficiently identify the issues such as error messages on failed network interface configuration, incorrect IP assignments or misconfigured network interfaces. + +### Notes/Constraints/Caveats (Optional) + +The content of `DeviceInfo` is driver specific and not standardized as part of DRA, the interpretation of this field may then vary between controllers and users reading it. + +The accuracy of the information depends on the implementation of the DRA Drivers, the reported status of the device may not always reflect the real time changes of the state of the device. + +### Risks and Mitigations + +As stated, 3rd party DRA drivers will set and update the `DeviceStatuses` for the device they manage. 
 An access control must be set in place to restrict the write access to the appropriate driver (A device status can only be updated by the driver which allocated and configured this device). + +Adding `DeviceInfo` as an arbitrary data slice may introduce extra processing and storage overhead which might impact performance in a cluster with many devices and frequent status updates. In large-scale clusters where many devices are allocated, this impact must be considered. + +## Design Details + +### API + +The `ResourceClaimStatus` struct in `pkg/apis/resource/types.go` will be extended to include the slice of `DeviceStatuses`. + +`ResourceClaim` validation of the status in `pkg/apis/resource/validation/validation.go` will be extended so that a device can be reported only once in the slice; a device is identified by `<driver name>/<pool name>/<device name>`. + +### Test Plan + +[x] I/we understand the owners of the involved components may require updates to +existing tests to make this code solid enough prior to committing the changes necessary +to implement this enhancement. + +##### Prerequisite testing updates + +##### Unit tests + +- `ResourceClaim` validation: + - A device can only be reported once in the `ResourceClaim`. + - The reported device is allocated in the `ResourceClaim`. + - Properties set in `AllocatedDeviceStatus` are in the correct format. + +##### Integration tests + +- Usage of the `DeviceStatuses` field in the `ResourceClaimStatus`: + * With the feature gate enabled, the field exists in the `ResourceClaim`. + * With the feature gate disabled, the field does not exist in the `ResourceClaim`. + +##### e2e tests + +TBD + +### Graduation Criteria + +#### Alpha + +- Feature implemented behind feature gates (`ResourceClaimDeviceStatus`). Feature Gates are disabled by default. +- Documentation provided. +- Initial unit, integration and e2e tests completed and enabled. + +#### Beta + +- Feature Gates are enabled by default.
+- Authorization implemented to allow only the driver managing the device to write the status. +- No major outstanding bugs. +- Feedback collected from the community (developers and users) with adjustments provided, implemented and tested. + +#### GA + +- 2 examples of real-world usage. +- Allowing time for feedback from developers and users. + +### Upgrade / Downgrade Strategy + +This feature only exposes a new field in the `ResourceClaim.Status`, the field will either be present or not. + +DRA implementation requires DRA interfaces change. DRA is in alpha and in active development. The feature will follow the DRA upgrade/downgrade strategy. + +### Version Skew Strategy + +This feature affects only the kube-apiserver, so there is no issue with version skew with other Kubernetes components. + +## Production Readiness Review Questionnaire + +### Feature Enablement and Rollback + +###### How can this feature be enabled / disabled in a live cluster? + +- [x] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: ResourceClaimDeviceStatus + - Components depending on the feature gate: kube-apiserver +- [ ] Other + - Describe the mechanism: + - Will enabling / disabling the feature require downtime of the control + plane? + - Will enabling / disabling the feature require downtime or reprovisioning + of a node? + +###### Does enabling the feature change any default behavior? + +No + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + +Yes, with no side effects except for missing the new field in `ResourceClaim.Status`. +Re-enabling this feature will not guarantee to keep the values written before the feature has been disabled. + +###### What happens if we reenable the feature if it was previously rolled back? + +###### Are there any tests for feature enablement/disablement? + +Enablement/disablement of this feature is tested as part of the integration tests. 
+ +### Rollout, Upgrade and Rollback Planning + +No + +###### How can a rollout or rollback fail? Can it impact already running workloads? + +No + +###### What specific metrics should inform a rollback? + +N/A + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? + +N/A + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + +No + +### Monitoring Requirements + +###### How can an operator determine if the feature is in use by workloads? + +Check the `ResourceClaim.Status.DeviceStatuses`. + +###### How can someone using this feature know that it is working for their instance? + +- [ ] Events + - Event Reason: +- [x] API .status + - Condition name: + - Other field: `ResourceClaim.Status.DeviceStatuses` +- [ ] Other (treat as last resort) + - Details: + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + +N/A + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + +- [ ] Metrics + - Metric name: + - [Optional] Aggregation method: + - Components exposing the metric: +- [ ] Other (treat as last resort) + - Details: + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? + + + +### Dependencies + +[KEP-4381 - DRA Structured Parameters](https://github.com/kubernetes/enhancements/issues/4381) + +###### Does this feature depend on any specific services running in the cluster? + +No + +### Scalability + +###### Will enabling / using this feature result in any new API calls? + +Yes, DRA Drivers will update the `ResourceClaim.Status` to report the allocated device status. Depending on the driver the size and frequency of these updates can vary. + +###### Will enabling / using this feature result in introducing new API types? + +New field on `ResourceClaim.Status`. 
 + +###### Will enabling / using this feature result in any new calls to the cloud provider? + +No + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? + +`ResourceClaim.Status` size will increase. + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + +No + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? + +Depending on the content of the `DeviceInfo` set by the DRA drivers, the disk usage could increase. + +###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? + +No + +### Troubleshooting + +###### How does this feature react if the API server and/or etcd is unavailable? + +N/A + +###### What are other known failure modes? + +N/A + +###### What steps should be taken if SLOs are not being met to determine the problem? + +N/A + +## Implementation History + +- Initial proposal: 2024-08-30 + +## Drawbacks + +If the characteristics (e.g. IP) and status of a network device (network interface) are reported as part of the `ResourceClaim.Status`, it should be ensured that the `ResourceClaim` is not used by several Pods at a time. Additionally, if a controller needs to gather IPs for a specific network to which Pods are attached via networking devices, it will need to query each `Pod` and then access the corresponding `ResourceClaim` for every Pod. + +## Alternatives + +### Annotations + +An option the DRA drivers can currently use to report the status of the device allocated in the `ResourceClaim` is the annotation of the `ResourceClaim` or of the `Pod` itself. As a reference, the [k8snetworkplumbingwg/Multus-CNI](https://github.com/k8snetworkplumbingwg/multus-cni) project uses annotations to describe the network attachments/interfaces and report the status.
+ +Here is the API below representing a network attachment. This is stored as a list in a json format in the annotation of the `Pod`. + +[k8snetworkplumbingwg/network-attachment-definition-client/pkg/apis/k8s.cni.cncf.io/v1/types.go](https://github.com/k8snetworkplumbingwg/network-attachment-definition-client/blob/v1.7.3/pkg/apis/k8s.cni.cncf.io/v1/types.go#L103): +```golang +// NetworkStatus is for network status annotation for pod +// +k8s:deepcopy-gen=false +type NetworkStatus struct { + Name string `json:"name"` + Interface string `json:"interface,omitempty"` + IPs []string `json:"ips,omitempty"` + Mac string `json:"mac,omitempty"` + Mtu int `json:"mtu,omitempty"` + Default bool `json:"default,omitempty"` + DNS DNS `json:"dns,omitempty"` + DeviceInfo *DeviceInfo `json:"device-info,omitempty"` + Gateway []string `json:"gateway,omitempty"` +} +``` + +### Pod.Status.PodIPs Enhancement + +As part of the [Multi-Network (KEP-3698)](https://github.com/kubernetes/enhancements/issues/3698), the idea was to use the existing `Pod.Status.PodIPs` and save the data about the different network interfaces/devices attached to the `Pod`. As part of the review of the KEP, it has been indicated ([here](https://github.com/kubernetes/enhancements/pull/3700#discussion_r1501690793) and [here](https://github.com/kubernetes/kubernetes/pull/123112#issuecomment-1925957930)) that it would be an API breaking change if the `Pod.Status.PodIPs` contains more than 1 value per IP family. + +### New Pod.Status Field + +Still as part of the [KEP-3698 - Multi-Network](https://github.com/kubernetes/enhancements/issues/3698), and in the continuation of the previous alternative, the idea was to add a new field `Networks` in the `Pod.Status` so each networking DRA driver could report the status for each network interface/device directly in the `Pod.Status`. + +Here is below the proposed API: +```golang +// PodStatus represents information about the status of a pod. 
Status may trail the actual +// state of a system, especially if the node that hosts the pod cannot contact the control +// plane. +type PodStatus struct { +[...] + // Networks is a list of PodNetworks that are attached to the Pod. + // + // +optional + Networks []NetworkStatus `json:"networks,omitempty"` +} + +// NetworkStatus provides the status of specific PodNetwork in a Pod. +type NetworkStatus struct { + // Name is name of PodNetwork + Name string `json:"name"` + + // InterfaceName is the network interface name inside the Pod for this attachment. + // Examples: eth1 or net1 + // + // +optional + InterfaceName string `json:"interfaceName"` + + // ip is an IP address (IPv4 or IPv6) assigned to the pod + IP string `json:"ip,omitempty"` + + // IsDefaultGW is a flag indicating that the interface with this IP + // inside the Pod holds the Default Gateway. + // + // +optional + IsDefaultGW bool `json:"isDefaultGW,omitempty"` +} +``` + +### KEP-4680 extension + +During the WG Device Management meeting on 17th of September 2024 ([Slack summary](https://kubernetes.slack.com/archives/C0409NGC1TK/p1726679433650409)), the idea was to extend the [KEP-4680 about resource health status in the `Pod.Status` ](https://github.com/kubernetes/enhancements/issues/4680) in order to expose device information and not just the health. 
+ +## Infrastructure Needed (Optional) + + diff --git a/keps/sig-node/4817-resource-claim-device-status/kep.yaml b/keps/sig-node/4817-resource-claim-device-status/kep.yaml new file mode 100644 index 00000000000..9a45f4e1aba --- /dev/null +++ b/keps/sig-node/4817-resource-claim-device-status/kep.yaml @@ -0,0 +1,40 @@ +title: Resource Claim Device Status +kep-number: 4817 +authors: + - "@jane.doe" +owning-sig: sig-node +participating-sigs: + - sig-node + - sig-network +status: provisional +creation-date: 2024-08-30 +reviewers: + - "@johnbelamaric" + - "@aojea" + - "@dougbtv" + - "@MikeZappa87" + - "@s1061123" +approvers: + - TBD + +see-also: + - "/keps/sig-node/3063-dynamic-resource-allocation" + - "/keps/sig-node/4381-dra-structured-parameters" + - "/keps/sig-node/4680-add-resource-health-to-pod-status" + - "https://github.com/kubernetes/enhancements/issues/3698" + - "https://github.com/k8snetworkplumbingwg/network-attachment-definition-client" + +# The target maturity stage in the current dev cycle for this KEP. +stage: alpha #beta|stable + +# The milestone at which this feature was, or is targeted to be, at each stage. 
 +milestone: + alpha: "v1.32" + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: ResourceClaimDeviceStatus + components: + - kube-apiserver +disable-supported: true From 56fc9ee59f599192ee6ffe6f074d7172c9731afa Mon Sep 17 00:00:00 2001 From: Lionel Jouin Date: Mon, 23 Sep 2024 00:20:00 +0200 Subject: [PATCH 02/11] Format (line-wrap) Signed-off-by: Lionel Jouin --- .../README.md | 188 ++++++++++++++---- 1 file changed, 144 insertions(+), 44 deletions(-) diff --git a/keps/sig-node/4817-resource-claim-device-status/README.md b/keps/sig-node/4817-resource-claim-device-status/README.md index 2ed5c5aa28e..22fba470bac 100644 --- a/keps/sig-node/4817-resource-claim-device-status/README.md +++ b/keps/sig-node/4817-resource-claim-device-status/README.md @@ -73,38 +73,78 @@ Items marked with (R) are required *prior to targeting to a milestone / release* ## Summary -This proposal enhances the `ResourceClaim.Status` by adding a new field `DeviceStatuses`. The new field allows drivers to report driver-specific device status data for each allocated device in a resource claim. Allowing the drivers to report the device statuses will improve both observability and troubleshooting as well as enabling new functionalities such as, for example, if the IPs of a network device are reported, network services. +This proposal enhances the `ResourceClaim.Status` by adding a new field +`DeviceStatuses`. The new field allows drivers to report driver-specific device +status data for each allocated device in a resource claim. Allowing the +drivers to report the device statuses will improve both observability and +troubleshooting as well as enabling new functionalities such as, for example, +if the IPs of a network device are reported, network services.
-This extension also lays the foundation for a potential future standardization of specific type device data, such as, for example, networking devices information. +This extension also lays the foundation for a potential future standardization +of specific type device data, such as, for example, networking devices +information. ## Motivation -As of now, when a device is configured in a pod/container, the state and characteristics of the device set during the configuration stage are invisible. The lack of this information is then a challenge for the users and developers to diagnose issues, verify configurations, and integrate the allocated resources into higher-level services. Reporting this information is then crucial in environments where device specific configurations are necessary. - -For certain types of devices, such as network interfaces, knowing the detailed characteristics of what has been allocated is particularly useful and crucial. For example, reporting the interface Name, MAC address and IP addresses of network interfaces in the status of a ResourceClaim can significantly help in configuring and managing network services, as well as in debugging network related issues. By including such device specific information, this proposal addresses existing gaps in visibility and facilitates better integration and management of resources. +As of now, when a device is configured in a pod/container, the state and +characteristics of the device set during the configuration stage are invisible. +The lack of this information is then a challenge for the users and developers +to diagnose issues, verify configurations, and integrate the allocated +resources into higher-level services. Reporting this information is then +crucial in environments where device specific configurations are necessary. + +For certain types of devices, such as network interfaces, knowing the detailed +characteristics of what has been allocated is particularly useful and crucial. 
+For example, reporting the interface Name, MAC address and IP addresses of +network interfaces in the status of a ResourceClaim can significantly help in +configuring and managing network services, as well as in debugging network +related issues. By including such device specific information, this proposal +addresses existing gaps in visibility and facilitates better integration and +management of resources. ### Goals -* Allow arbitrary, driver-specific information to be reported from the DRA drivers for each allocated device in a ResourceClaim. -* Establish a foundation for potential future standardization (e.g. Network Devices status). -* Enable 3rd party implementation of new functionalities based on the Device Status (e.g. Secondary Network Services if the IPs of a network device are reported). +* Allow arbitrary, driver-specific information to be reported from the DRA + drivers for each allocated device in a ResourceClaim. +* Establish a foundation for potential future standardization (e.g. Network + Devices status). +* Enable 3rd party implementation of new functionalities based on the Device + Status (e.g. Secondary Network Services if the IPs of a network device are +reported). ### Non-Goals -* Implement new functionalities based on the Device Status (e.g. Kubernetes EndpointSlice Controller supporting IPs reported by networking DRA drivers). +* Implement new functionalities based on the Device Status (e.g. Kubernetes + EndpointSlice Controller supporting IPs reported by networking DRA drivers). * Modifying the Resource Claim workflow to include the device status report. ## Proposal ### API - ResourceClaim.Status -The API changes define a new `DeviceStatuses` field in the existing `ResourceClaimStatus` struct. `DeviceStatuses` is a slice of a new struct `AllocatedDeviceStatus` which holds device specific information. 
 - -A device, identified by `<driver name>/<pool name>/<device name>` can be represented only once in the `DeviceStatuses` slice and will also mention which request caused the device to be allocated. The state and characteristics of the device are reported in the `Conditions`, representing the operational state of the device and in the `DeviceInfo`, an arbitrary data slice representing device specific characteristics. Additionally, for networking devices, a field `NetworkDeviceInfo` can be used to report the IPs, the MAC address and the interface name. - -`DeviceInfo` being a slice of arbitrary data allows the DRA Driver to store device specific data in different formats. For example, a Network Device being configured via a CNI plugin could get its `DeviceInfo` filled with the CNI result for troubleshooting purposes and with a Network result in a modern standard format (closer to Pod.Status.PodIPs for example) used by 3rd party controllers. - -For each device, if required, the DRA Driver processing the device allocation can report the status of it in the `Status.DeviceStatuses` of the ResourceClaim by using the Kubernetes API. +The API changes define a new `DeviceStatuses` field in the existing +`ResourceClaimStatus` struct. `DeviceStatuses` is a slice of a new struct +`AllocatedDeviceStatus` which holds device specific information. + +A device, identified by `<driver name>/<pool name>/<device name>` can be +represented only once in the `DeviceStatuses` slice and will also mention which +request caused the device to be allocated. The state and characteristics of the +device are reported in the `Conditions`, representing the operational state of +the device and in the `DeviceInfo`, an arbitrary data slice representing device +specific characteristics. Additionally, for networking devices, a field +`NetworkDeviceInfo` can be used to report the IPs, the MAC address and the +interface name. + +`DeviceInfo` being a slice of arbitrary data allows the DRA Driver to store +device specific data in different formats.
For example, a Network Device being +configured via a CNI plugin could get its `DeviceInfo` filled with the CNI +result for troubleshooting purposes and with a Network result in a modern +standard format (closer to Pod.Status.PodIPs for example) used by 3rd party +controllers. + +For each device, if required, the DRA Driver processing the device allocation +can report the status of it in the `Status.DeviceStatuses` of the ResourceClaim +by using the Kubernetes API. ```golang // ResourceClaimStatus tracks whether the resource has been allocated and what @@ -207,31 +247,58 @@ type NetworkDeviceInfo struct { #### Story 1 - Network Device Status for Network Services -As a Cloud Native Network Function (CNF) vendor, my network services must be integrated with the network devices configured in Pods. The configuration properties of these network devices are therefore essential to configure the network services. For example, the network services must be able to route traffic to pods over networks attached via the network devices, the IP addresses of the network device(s) must then be reflected in the `ResourceClaim.Status` allowing the network service controller(s) to access them. +As a Cloud Native Network Function (CNF) vendor, my network services must be +integrated with the network devices configured in Pods. The configuration +properties of these network devices are therefore essential to configure the +network services. For example, the network services must be able to route +traffic to pods over networks attached via the network devices, so the IP +addresses of the network device(s) must be reflected in the +`ResourceClaim.Status`, allowing the network service controller(s) to access +them. #### Story 2 - Network Device Status for Troubleshooting -As a Network Administrator, troubleshooting networking issues can be complex and time consuming especially when the device characteristics and operational status are not readily accessible.
The `DeviceStatuses` field in the `ResourceClaim.Status` provides access to comprehensive details regarding network interfaces helping to quickly and efficiently identify the issues such as error messages on failed network interface configuration, incorrect IP assignments or misconfigured network interfaces. +As a Network Administrator, troubleshooting networking issues can be complex +and time consuming especially when the device characteristics and operational +status are not readily accessible. The `DeviceStatuses` field in the +`ResourceClaim.Status` provides access to comprehensive details regarding +network interfaces helping to quickly and efficiently identify the issues such +as error messages on failed network interface configuration, incorrect IP +assignments or misconfigured network interfaces. ### Notes/Constraints/Caveats (Optional) -The content of `DeviceInfo` is driver specific and not standardized as part of DRA, the interpretation of this field may then vary between controllers and users reading it. +The content of `DeviceInfo` is driver specific and not standardized as part of +DRA, the interpretation of this field may then vary between controllers and +users reading it. -The accuracy of the information depends on the implementation of the DRA Drivers, the reported status of the device may not always reflect the real time changes of the state of the device. +The accuracy of the information depends on the implementation of the DRA +Drivers, the reported status of the device may not always reflect the real time +changes of the state of the device. ### Risks and Mitigations -As stated, 3rd party DRA drivers will set and update the `DeviceStatuses` for the device they manage. An access control must be set in place to restrict the write access to the appropriate driver (A device status can only be updated by the driver which allocated and configured this device). 
+As stated, 3rd party DRA drivers will set and update the `DeviceStatuses` for +the device they manage. An access control must be set in place to restrict the +write access to the appropriate driver (A device status can only be updated by +the driver which allocated and configured this device). -Adding `DeviceInfo` as an arbitrary data slice may introduce extra processing and storage overhead which might impact performance in a cluster with many devices and frequent status updates. In large-scale clusters where many devices are allocated, this impact must be considered. +Adding `DeviceInfo` as an arbitrary data slice may introduce extra processing +and storage overhead which might impact performance in a cluster with many +devices and frequent status updates. In large-scale clusters where many devices +are allocated, this impact must be considered. ## Design Details ### API -The `ResourceClaimStatus` struct in `pkg/apis/resource/types.go` will be extended to include the slice of `DeviceStatuses`. +The `ResourceClaimStatus` struct in `pkg/apis/resource/types.go` will be +extended to include the slice of `DeviceStatuses`. -`ResourceClaim` validation of the status in `pkg/apis/resource/validation/validation.go` will be covered to allow a device to be reported only once in the slice, a device is being identified by `<driver name>/<pool name>/<device name>`. +`ResourceClaim` validation of the status in +`pkg/apis/resource/validation/validation.go` will be covered to allow a device +to be reported only once in the slice; a device is identified by +`<driver name>/<pool name>/<device name>`. ### Test Plan @@ -252,7 +319,8 @@ to implement this enhancement. - Usage of the `DeviceStatuses` field in the `ResourceClaimStatus`: * With the feature gate enabled, the field exists in the `ResourceClaim`. - * With the feature gate disabled, the field does not exist in the `ResourceClaim`. + * With the feature gate disabled, the field does not exist in the + `ResourceClaim`.
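
The two behaviors this part of the patch relies on, that a device may be reported only once and that the field is absent when the feature gate is off, could be sketched roughly as below. This is only an illustration under stated assumptions: the types are trimmed-down stand-ins for the proposed structs, and `duplicateDevices` and `dropDisabledDeviceStatuses` are hypothetical helper names, not functions from the Kubernetes code base.

```go
package main

import "fmt"

// AllocatedDeviceStatus is a hypothetical, trimmed-down stand-in for the
// struct proposed in this KEP; only the identifying fields are kept.
type AllocatedDeviceStatus struct {
	Driver string
	Pool   string
	Device string
}

// ResourceClaimStatus is a hypothetical stand-in that holds only the new field.
type ResourceClaimStatus struct {
	DeviceStatuses []AllocatedDeviceStatus
}

// deviceID is the comparable key for one device. A struct key is used
// instead of a joined string because pool names may contain slashes.
type deviceID struct {
	driver, pool, device string
}

// duplicateDevices returns every device that is reported more than once,
// in the order in which the first repetition is encountered.
func duplicateDevices(statuses []AllocatedDeviceStatus) []string {
	seen := make(map[deviceID]int)
	var dups []string
	for _, s := range statuses {
		id := deviceID{s.Driver, s.Pool, s.Device}
		seen[id]++
		if seen[id] == 2 {
			dups = append(dups, s.Driver+"/"+s.Pool+"/"+s.Device)
		}
	}
	return dups
}

// dropDisabledDeviceStatuses clears the field when the (assumed) feature
// gate is off, so that the field does not appear in the stored object.
func dropDisabledDeviceStatuses(status *ResourceClaimStatus, gateEnabled bool) {
	if !gateEnabled {
		status.DeviceStatuses = nil
	}
}

func main() {
	status := ResourceClaimStatus{DeviceStatuses: []AllocatedDeviceStatus{
		{Driver: "net.example.com", Pool: "node-a", Device: "eth1"},
		{Driver: "net.example.com", Pool: "node-a", Device: "eth2"},
		{Driver: "net.example.com", Pool: "node-a", Device: "eth1"},
	}}
	fmt.Println("duplicates:", duplicateDevices(status.DeviceStatuses))

	dropDisabledDeviceStatuses(&status, false)
	fmt.Println("entries with gate disabled:", len(status.DeviceStatuses))
}
```

A real validation function would report the duplicates as field errors rather than strings; the sketch only shows the keying on driver, pool and device.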
##### e2e tests @@ -262,16 +330,19 @@ TBD #### Alpha -- Feature implemented behind feature gates (`ResourceClaimDeviceStatus`). Feature Gates are disabled by default. +- Feature implemented behind feature gates (`ResourceClaimDeviceStatus`). + Feature Gates are disabled by default. - Documentation provided. - Initial unit, integration and e2e tests completed and enabled. #### Beta - Feature Gates are enabled by default. -- Authorization implemented to allow only the driver managing the device to write the status. +- Authorization implemented to allow only the driver managing the device to + write the status. - No major outstanding bugs. -- Feedback collected from the community (developers and users) with adjustments provided, implemented and tested. +- Feedback collected from the community (developers and users) with adjustments + provided, implemented and tested. #### GA @@ -280,13 +351,16 @@ TBD ### Upgrade / Downgrade Strategy -This feature only exposes a new field in the `ResourceClaim.Status`, the field will either be present or not. +This feature only exposes a new field in the `ResourceClaim.Status`, the field +will either be present or not. -DRA implementation requires DRA interfaces change. DRA is in alpha and in active development. The feature will follow the DRA upgrade/downgrade strategy. +DRA implementation requires DRA interfaces change. DRA is in alpha and in +active development. The feature will follow the DRA upgrade/downgrade strategy. ### Version Skew Strategy -This feature affects only the kube-apiserver, so there is no issue with version skew with other Kubernetes components. +This feature affects only the kube-apiserver, so there is no issue with version +skew with other Kubernetes components. ## Production Readiness Review Questionnaire @@ -301,8 +375,8 @@ This feature affects only the kube-apiserver, so there is no issue with version - Describe the mechanism: - Will enabling / disabling the feature require downtime of the control plane? 
- - Will enabling / disabling the feature require downtime or reprovisioning - of a node? + - Will enabling / disabling the feature require downtime or reprovisioning of + a node? ###### Does enabling the feature change any default behavior? @@ -310,14 +384,16 @@ No ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? -Yes, with no side effects except for missing the new field in `ResourceClaim.Status`. -Re-enabling this feature will not guarantee to keep the values written before the feature has been disabled. +Yes, with no side effects except for missing the new field in +`ResourceClaim.Status`. Re-enabling this feature will not guarantee to keep +the values written before the feature has been disabled. ###### What happens if we reenable the feature if it was previously rolled back? ###### Are there any tests for feature enablement/disablement? -Enablement/disablement of this feature is tested as part of the integration tests. +Enablement/disablement of this feature is tested as part of the integration +tests. ### Rollout, Upgrade and Rollback Planning @@ -387,7 +463,9 @@ No ###### Will enabling / using this feature result in any new API calls? -Yes, DRA Drivers will update the `ResourceClaim.Status` to report the allocated device status. Depending on the driver the size and frequency of these updates can vary. +Yes, DRA Drivers will update the `ResourceClaim.Status` to report the allocated +device status. Depending on the driver the size and frequency of these updates +can vary. ###### Will enabling / using this feature result in introducing new API types? @@ -407,7 +485,8 @@ No ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? -Depending on the content of the `DeviceInfo` set by the DRA drivers, the disk usage could increase. +Depending on the content of the `DeviceInfo` set by the DRA drivers, the disk +usage could increase. 
###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? @@ -433,15 +512,24 @@ N/A ## Drawbacks -If the Network device (network interface) characteristics (e.g. IP) and status is reported as part of the `Resource.Claim.Status`, it should be ensured the `ResourceClaim` is not used by several `Pod` at a time. Additionally, if a controller needs to gather IPs for a specific network to which Pods are attached via networking devices, it will need to query each `Pod` and then access the corresponding `ResourceClaim` for every Pod. +If the Network device (network interface) characteristics (e.g. IP) and status +are reported as part of the `ResourceClaim.Status`, it should be ensured that the +`ResourceClaim` is not used by more than one `Pod` at a time. Additionally, if a +controller needs to gather IPs for a specific network to which Pods are +attached via networking devices, it will need to query each `Pod` and then +access the corresponding `ResourceClaim` for every Pod. ## Alternatives ### Annotations -An option the DRA drivers can currently use to report the status of the device allocated in the `ResourceClaim` is the annotation of the `ResourceClaim` or of the `Pod` itself. As a reference, the [k8snetworkplumbingwg/Multus-CNI](https://github.com/k8snetworkplumbingwg/multus-cni) project is utilizing annotation to describe the network attachments/interfaces and report the status. +An option the DRA drivers can currently use to report the status of the device +allocated in the `ResourceClaim` is the annotation of the `ResourceClaim` or of +the `Pod` itself. As a reference, the [k8snetworkplumbingwg/Multus-CNI](https://github.com/k8snetworkplumbingwg/multus-cni) project uses annotations to describe the network attachments/interfaces +and report the status. -Here is the API below representing a network attachment. This is stored as a list in a json format in the annotation of the `Pod`.
+Here is the API below representing a network attachment. This is stored as a +list in a json format in the annotation of the `Pod`. [k8snetworkplumbingwg/network-attachment-definition-client/pkg/apis/k8s.cni.cncf.io/v1/types.go](https://github.com/k8snetworkplumbingwg/network-attachment-definition-client/blob/v1.7.3/pkg/apis/k8s.cni.cncf.io/v1/types.go#L103): ```golang @@ -462,11 +550,19 @@ type NetworkStatus struct { ### Pod.Status.PodIPs Enhancement -As part of the [Multi-Network (KEP-3698)](https://github.com/kubernetes/enhancements/issues/3698), the idea was to use the existing `Pod.Status.PodIPs` and save the data about the different network interfaces/devices attached to the `Pod`. As part of the review of the KEP, it has been indicated ([here](https://github.com/kubernetes/enhancements/pull/3700#discussion_r1501690793) and [here](https://github.com/kubernetes/kubernetes/pull/123112#issuecomment-1925957930)) that it would be an API breaking change if the `Pod.Status.PodIPs` contains more than 1 value per IP family. +As part of the [Multi-Network (KEP-3698)](https://github.com/kubernetes/enhancements/issues/3698), +the idea was to use the existing `Pod.Status.PodIPs` and save the data about the +different network interfaces/devices attached to the `Pod`. As part of the +review of the KEP, it has been indicated ([here](https://github.com/kubernetes/enhancements/pull/3700#discussion_r1501690793) and [here](https://github.com/kubernetes/kubernetes/pull/123112#issuecomment-1925957930)) +that it would be an API breaking change if the `Pod.Status.PodIPs` contains +more than 1 value per IP family. ### New Pod.Status Field -Still as part of the [KEP-3698 - Multi-Network](https://github.com/kubernetes/enhancements/issues/3698), and in the continuation of the previous alternative, the idea was to add a new field `Networks` in the `Pod.Status` so each networking DRA driver could report the status for each network interface/device directly in the `Pod.Status`. 
+Still as part of the [KEP-3698 - Multi-Network](https://github.com/kubernetes/enhancements/issues/3698), and in +the continuation of the previous alternative, the idea was to add a new field +`Networks` in the `Pod.Status` so each networking DRA driver could report the +status for each network interface/device directly in the `Pod.Status`. Here is below the proposed API: ```golang @@ -505,7 +601,11 @@ type NetworkStatus struct { ### KEP-4680 extension -During the WG Device Management meeting on 17th of September 2024 ([Slack summary](https://kubernetes.slack.com/archives/C0409NGC1TK/p1726679433650409)), the idea was to extend the [KEP-4680 about resource health status in the `Pod.Status` ](https://github.com/kubernetes/enhancements/issues/4680) in order to expose device information and not just the health. +During the WG Device Management meeting on 17th of September 2024 ([Slack +summary](https://kubernetes.slack.com/archives/C0409NGC1TK/p1726679433650409)), +the idea was to extend the [KEP-4680 about resource health status in the +`Pod.Status` ](https://github.com/kubernetes/enhancements/issues/4680) in order +to expose device information and not just the health. 
## Infrastructure Needed (Optional) From f3ae24219b8137768088f945cfa0ed7c566f1975 Mon Sep 17 00:00:00 2001 From: Lionel Jouin Date: Mon, 23 Sep 2024 00:20:34 +0200 Subject: [PATCH 03/11] Fixes based on review * Typos * Establish a standard instead of laying the foundation for creating a standard * KEP Name * DeviceStatuses keys * AllocatedDeviceStatus request field removed * IPs to a new struct Signed-off-by: Lionel Jouin --- .../README.md | 220 +++++++++--------- .../kep.yaml | 2 +- 2 files changed, 112 insertions(+), 110 deletions(-) diff --git a/keps/sig-node/4817-resource-claim-device-status/README.md b/keps/sig-node/4817-resource-claim-device-status/README.md index 22fba470bac..f02361bfe45 100644 --- a/keps/sig-node/4817-resource-claim-device-status/README.md +++ b/keps/sig-node/4817-resource-claim-device-status/README.md @@ -1,4 +1,4 @@ -# KEP-4817: Resource Claim Device Status +# KEP-4817: Resource Claim Status With Possible Standardized Network Interface Data - [Release Signoff Checklist](#release-signoff-checklist) @@ -75,14 +75,13 @@ Items marked with (R) are required *prior to targeting to a milestone / release* This proposal enhances the `ResourceClaim.Status` by adding a new field `DeviceStatuses`. The new field allows drivers to report driver-specific device -status data for each allocated devices in a resource claim. Allowing the +status data for each allocated device in a resource claim. Allowing the drivers to report the device statuses will improve both observability and troubleshooting as well as enabling new functionalities such as, for example, if the IPs of a network device are reported, network services. -This extension also lays the foundation for a potential future standardization -of specific type device data, such as, for example, networking devices -information. +This extension also establishes a standardization for specific type device data, +such as, for example, networking devices information. 
## Motivation @@ -106,8 +105,7 @@ management of resources. * Allow arbitrary, driver-specific information to be reported from the DRA drivers for each allocated device in a ResourceClaim. -* Establish a foundation for potential future standardization (e.g. Network - Devices status). +* Establish a standardization for device status (e.g. Network Devices status). * Enable 3rd party implementation of new functionalities based on the Device Status (e.g. Secondary Network Services if the IPs of a network device are reported). @@ -150,96 +148,100 @@ by using the Kubernetes API. // ResourceClaimStatus tracks whether the resource has been allocated and what // the result of that was. type ResourceClaimStatus struct { - ... - - // DeviceStatuses contains the status of each device allocated for this - // claim, as reported by the driver. This can include driver-specific - // information. Entries are owned by their respective drivers. - // - // +optional - // +listType=map - // +listMapKey=devicePoolName - // +listMapKey=deviceName - DeviceStatuses []AllocatedDeviceStatus `json:"deviceStatuses,omitempty" protobuf:"bytes,4,opt,name=deviceStatuses"` + ... + + // DeviceStatuses contains the status of each device allocated for this + // claim, as reported by the driver. This can include driver-specific + // information. Entries are owned by their respective drivers. + // + // +optional + // +listType=map + // +listMapKey=driver + // +listMapKey=device + // +listMapKey=pool + DeviceStatuses []AllocatedDeviceStatus `json:"deviceStatuses,omitempty" protobuf:"bytes,4,opt,name=deviceStatuses"` } // AllocatedDeviceStatus contains the status of an allocated device, if the // driver chooses to report it. This may include driver-specific information. type AllocatedDeviceStatus struct { - // Request is the name of the request in the claim which caused this - // device to be allocated. Multiple devices may have been allocated - // per request. 
- // +required - Request string `json:"request" protobuf:"bytes,1,rep,name=request"` - - // Driver specifies the name of the DRA driver whose kubelet - // plugin should be invoked to process the allocation once the claim is - // needed on a node. - // - // Must be a DNS subdomain and should end with a DNS domain owned by the - // vendor of the driver. - // - // +required - Driver string `json:"driver" protobuf:"bytes,2,rep,name=driver"` - - // This name together with the driver name and the device name field - // identify which device was allocated (`<driver name>/<pool name>/<device name>`). - // - // Must not be longer than 253 characters and may contain one or more - // DNS sub-domains separated by slashes. - // - // +required - Pool string `json:"pool" protobuf:"bytes,3,rep,name=pool"` - - // Device references one device instance via its name in the driver's - // resource pool. It must be a DNS label. - // - // +required - Device string `json:"device" protobuf:"bytes,4,rep,name=device"` - - // Conditions contains the latest observation of the device's state. - // If the device has been configured according to the class and claim - // config references, the `Ready` condition should be True. - // - // +optional - // +listType=atomic - Conditions []metav1.Condition `json:"conditions" protobuf:"bytes,5,rep,name=conditions"` - - // DeviceInfo contains Arbitrary driver-specific data. - // - // +optional - // +listType=atomic - DeviceInfo []runtime.RawExtension `json:"deviceInfo,omitempty" protobuf:"bytes,6,rep,name=deviceInfo"` + // Driver specifies the name of the DRA driver whose kubelet + // plugin should be invoked to process the allocation once the claim is + // needed on a node. + // + // Must be a DNS subdomain and should end with a DNS domain owned by the + // vendor of the driver. + // + // +required + Driver string `json:"driver" protobuf:"bytes,2,rep,name=driver"` + + // This name together with the driver name and the device name field + // identify which device was allocated (`<driver name>/<pool name>/<device name>`).
+ // + // Must not be longer than 253 characters and may contain one or more + // DNS sub-domains separated by slashes. + // + // +required + Pool string `json:"pool" protobuf:"bytes,3,rep,name=pool"` + + // Device references one device instance via its name in the driver's + // resource pool. It must be a DNS label. + // + // +required + Device string `json:"device" protobuf:"bytes,4,rep,name=device"` + + // Conditions contains the latest observation of the device's state. + // If the device has been configured according to the class and claim + // config references, the `Ready` condition should be True. + // + // +optional + // +listType=atomic + Conditions []metav1.Condition `json:"conditions" protobuf:"bytes,5,rep,name=conditions"` + + // DeviceInfo contains arbitrary driver-specific data. + // + // +optional + // +listType=atomic + DeviceInfo []runtime.RawExtension `json:"deviceInfo,omitempty" protobuf:"bytes,6,rep,name=deviceInfo"` // NetworkDeviceInfo contains network-related information specific to the device. - // - // +optional - NetworkDeviceInfo NetworkDeviceInfo `json:"networkDeviceInfo,omitempty" protobuf:"bytes,7,rep,name=networkDeviceInfo"` + // + // +optional + NetworkDeviceInfo NetworkDeviceInfo `json:"networkDeviceInfo,omitempty" protobuf:"bytes,7,rep,name=networkDeviceInfo"` } // NetworkDeviceInfo provides network-related details for the allocated device. // This information may be filled by drivers or other components to configure // or identify the device within a network context. type NetworkDeviceInfo struct { - // Interface specifies the name of the network interface associated with - // the allocated device. This might be the name of a physical or virtual - // network interface. - // - // +optional - Interface string `json:"interface,omitempty" protobuf:"bytes,1,rep,name=interface"` - - // IPs lists the IP addresses assigned to the device's network interface. - // This can include both IPv4 and IPv6 addresses. 
- // - // +optional - IPs []string `json:"ips,omitempty" protobuf:"bytes,2,rep,name=ips"` - - // Mac represents the MAC address of the device's network interface. - // - // +optional - Mac string `json:"mac,omitempty" protobuf:"bytes,3,rep,name=mac"` + // Interface specifies the name of the network interface associated with + // the allocated device. This might be the name of a physical or virtual + // network interface. + // + // +optional + Interface string `json:"interface,omitempty" protobuf:"bytes,1,rep,name=interface"` + + // NetworkAddresses lists the network addresses assigned to the device's network interface. + // This can include both IPv4 and IPv6 addresses. + // + // +optional + NetworkAddresses []NetworkAddress `json:"networkAddresses,omitempty" protobuf:"bytes,2,rep,name=networkAddresses"` + + // Mac represents the MAC address of the device's network interface. + // + // +optional + Mac string `json:"mac,omitempty" protobuf:"bytes,3,rep,name=mac"` +} + +// NetworkAddress provides a network address related details such as IP and Mask. +type NetworkAddress struct { + // CIDR contains the network address in CIDR notation, which includes + // both the address and the associated subnet mask. + // e.g.: "192.0.2.0/24" for IPv4 and "2001:db8::/64" for IPv6. + // + // +required + CIDR string `json:"cidr,omitempty" protobuf:"bytes,1,rep,name=cidr"` } ``` @@ -536,15 +538,15 @@ list in a json format in the annotation of the `Pod`. 
// NetworkStatus is for network status annotation for pod // +k8s:deepcopy-gen=false type NetworkStatus struct { - Name string `json:"name"` - Interface string `json:"interface,omitempty"` - IPs []string `json:"ips,omitempty"` - Mac string `json:"mac,omitempty"` - Mtu int `json:"mtu,omitempty"` - Default bool `json:"default,omitempty"` - DNS DNS `json:"dns,omitempty"` - DeviceInfo *DeviceInfo `json:"device-info,omitempty"` - Gateway []string `json:"gateway,omitempty"` + Name string `json:"name"` + Interface string `json:"interface,omitempty"` + IPs []string `json:"ips,omitempty"` + Mac string `json:"mac,omitempty"` + Mtu int `json:"mtu,omitempty"` + Default bool `json:"default,omitempty"` + DNS DNS `json:"dns,omitempty"` + DeviceInfo *DeviceInfo `json:"device-info,omitempty"` + Gateway []string `json:"gateway,omitempty"` } ``` @@ -579,23 +581,23 @@ type PodStatus struct { // NetworkStatus provides the status of specific PodNetwork in a Pod. type NetworkStatus struct { - // Name is name of PodNetwork - Name string `json:"name"` - - // InterfaceName is the network interface name inside the Pod for this attachment. - // Examples: eth1 or net1 - // - // +optional - InterfaceName string `json:"interfaceName"` - - // ip is an IP address (IPv4 or IPv6) assigned to the pod - IP string `json:"ip,omitempty"` - - // IsDefaultGW is a flag indicating that the interface with this IP - // inside the Pod holds the Default Gateway. - // - // +optional - IsDefaultGW bool `json:"isDefaultGW,omitempty"` + // Name is name of PodNetwork + Name string `json:"name"` + + // InterfaceName is the network interface name inside the Pod for this attachment. + // Examples: eth1 or net1 + // + // +optional + InterfaceName string `json:"interfaceName"` + + // ip is an IP address (IPv4 or IPv6) assigned to the pod + IP string `json:"ip,omitempty"` + + // IsDefaultGW is a flag indicating that the interface with this IP + // inside the Pod holds the Default Gateway. 
+ // + // +optional + IsDefaultGW bool `json:"isDefaultGW,omitempty"` } ``` diff --git a/keps/sig-node/4817-resource-claim-device-status/kep.yaml b/keps/sig-node/4817-resource-claim-device-status/kep.yaml index 9a45f4e1aba..a8b3afe3847 100644 --- a/keps/sig-node/4817-resource-claim-device-status/kep.yaml +++ b/keps/sig-node/4817-resource-claim-device-status/kep.yaml @@ -1,4 +1,4 @@ -title: Resource Claim Device Status +title: Resource Claim Status With Possible Standardized Network Interface Data kep-number: 4817 authors: - "@jane.doe" From 75b0c94fa66cec62d904f7593007ae9bf218c2f4 Mon Sep 17 00:00:00 2001 From: Lionel Jouin Date: Wed, 25 Sep 2024 14:09:58 +0200 Subject: [PATCH 04/11] Fixes based on review Signed-off-by: Lionel Jouin --- .../README.md | 66 +++++++++---------- .../kep.yaml | 6 +- 2 files changed, 37 insertions(+), 35 deletions(-) diff --git a/keps/sig-node/4817-resource-claim-device-status/README.md b/keps/sig-node/4817-resource-claim-device-status/README.md index f02361bfe45..a6697271f41 100644 --- a/keps/sig-node/4817-resource-claim-device-status/README.md +++ b/keps/sig-node/4817-resource-claim-device-status/README.md @@ -74,7 +74,7 @@ Items marked with (R) are required *prior to targeting to a milestone / release* ## Summary This proposal enhances the `ResourceClaim.Status` by adding a new field -`DeviceStatuses`. The new field allows drivers to report driver-specific device +`Devices`. The new field allows drivers to report driver-specific device status data for each allocated device in a resource claim. Allowing the drivers to report the device statuses will improve both observability and troubleshooting as well as enabling new functionalities such as, for example, @@ -120,28 +120,28 @@ reported). ### API - ResourceClaim.Status -The API changes define a new `DeviceStatuses` field in the existing -`ResourceClaimStatus` struct. 
`DeviceStatuses` is a slice of a new struct +The API changes define a new `Devices` field in the existing +`ResourceClaimStatus` struct. `Devices` is a slice of a new struct `AllocatedDeviceStatus` which holds device specific information. A device, identified by `<driver name>/<pool name>/<device name>` can be -represented only once in the `DeviceStatuses` slice and will also mention which +represented only once in the `Devices` slice and will also mention which request caused the device to be allocated. The state and characteristics of the device are reported in the `Conditions`, representing the operational state of -the device and in the `DeviceInfo`, an arbitrary data slice representing device +the device and in the `Info`, an arbitrary data slice representing device specific characteristics. Additionally, for networking devices, a field -`NetworkDeviceInfo` can be used to report the IPs, the MAC address and the +`NetworkInfo` can be used to report the IPs, the MAC address and the interface name. -`DeviceInfo` being a slice of arbitrary data allows the DRA Driver to store +`Info` being a slice of arbitrary data allows the DRA Driver to store device specific data in different formats. For example, a Network Device being -configured by via a CNI plugin could get its `DeviceInfo` filled with the CNI +configured via a CNI plugin could get its `Info` filled with the CNI result for troubleshooting purpose and with a Network result in a modern standard format (closer to Pod.Status.PodIPs for example) used by 3rd party controllers. For each device, if required, the DRA Driver processing the device allocation -can report the status of it in the `Status.DeviceStatuses` of the ResourceClaim +can report the status of it in the `Status.Devices` of the ResourceClaim by using the Kubernetes API. ```golang @@ -150,7 +150,7 @@ by using the Kubernetes API. type ResourceClaimStatus struct { ...
- // DeviceStatuses contains the status of each device allocated for this + // Devices contains the status of each device allocated for this // claim, as reported by the driver. This can include driver-specific // information. Entries are owned by their respective drivers. // @@ -159,10 +159,9 @@ type ResourceClaimStatus struct { // +listMapKey=driver // +listMapKey=device // +listMapKey=pool - DeviceStatuses []AllocatedDeviceStatus `json:"deviceStatuses,omitempty" protobuf:"bytes,4,opt,name=deviceStatuses"` + Devices []AllocatedDeviceStatus `json:"devices,omitempty" protobuf:"bytes,4,opt,name=devices"` } - // AllocatedDeviceStatus contains the status of an allocated device, if the // driver chooses to report it. This may include driver-specific information. type AllocatedDeviceStatus struct { @@ -199,39 +198,40 @@ type AllocatedDeviceStatus struct { // +listType=atomic Conditions []metav1.Condition `json:"conditions" protobuf:"bytes,5,rep,name=conditions"` - // DeviceInfo contains arbitrary driver-specific data. + // Info contains arbitrary driver-specific data. // // +optional // +listType=atomic - DeviceInfo []runtime.RawExtension `json:"deviceInfo,omitempty" protobuf:"bytes,6,rep,name=deviceInfo"` + Info []runtime.RawExtension `json:"info,omitempty" protobuf:"bytes,6,rep,name=info"` - // NetworkDeviceInfo contains network-related information specific to the device. + // NetworkInfo contains network-related information specific to the device. // // +optional - NetworkDeviceInfo NetworkDeviceInfo `json:"networkDeviceInfo,omitempty" protobuf:"bytes,7,rep,name=networkDeviceInfo"` + // +oneOf=DeviceInfoType + NetworkInfo NetworkDeviceInfo `json:"networkInfo,omitempty" protobuf:"bytes,7,rep,name=networkInfo"` } // NetworkDeviceInfo provides network-related details for the allocated device. // This information may be filled by drivers or other components to configure // or identify the device within a network context. 
type NetworkDeviceInfo struct { - // Interface specifies the name of the network interface associated with + // InterfaceName specifies the name of the network interface associated with // the allocated device. This might be the name of a physical or virtual // network interface. // // +optional - Interface string `json:"interface,omitempty" protobuf:"bytes,1,rep,name=interface"` + InterfaceName string `json:"interfaceName,omitempty" protobuf:"bytes,1,rep,name=interfaceName"` - // NetworkAddresses lists the network addresses assigned to the device's network interface. + // Addresses lists the network addresses assigned to the device's network interface. // This can include both IPv4 and IPv6 addresses. // // +optional - NetworkAddresses []NetworkAddress `json:"networkAddresses,omitempty" protobuf:"bytes,2,rep,name=networkAddresses"` + Addresses []NetworkAddress `json:"addresses,omitempty" protobuf:"bytes,2,rep,name=addresses"` - // Mac represents the MAC address of the device's network interface. + // HWAddress represents the hardware address (e.g. MAC Address) of the device's network interface. // // +optional - Mac string `json:"mac,omitempty" protobuf:"bytes,3,rep,name=mac"` + HWAddress string `json:"hwAddress,omitempty" protobuf:"bytes,3,rep,name=hwAddress"` } // NetworkAddress provides a network address related details such as IP and Mask. @@ -262,7 +262,7 @@ them. As a Network Administrator, troubleshooting networking issues can be complex and time consuming especially when the device characteristics and operational -status are not readily accessible. The `DeviceStatuses` field in the +status are not readily accessible. The `Devices` field in the `ResourceClaim.Status` provides access to comprehensive details regarding network interfaces helping to quickly and efficiently identify the issues such as error messages on failed network interface configuration, incorrect IP @@ -270,7 +270,7 @@ assignments or misconfigured network interfaces. 
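
As a rough illustration of the troubleshooting story, a tool reading the reported entries might summarize each device as sketched below. The field names follow this revision of the proposal (`NetworkInfo`, `InterfaceName`, `Addresses`, `HWAddress`), but the types are reduced local stand-ins, `Condition` only approximates `metav1.Condition`, and `describe` is a hypothetical helper rather than part of any Kubernetes API.

```go
package main

import "fmt"

// Condition is a hypothetical, reduced stand-in for metav1.Condition.
type Condition struct {
	Type    string
	Status  string // "True", "False" or "Unknown"
	Message string
}

// NetworkAddress is a stand-in holding only the CIDR field.
type NetworkAddress struct {
	CIDR string
}

// NetworkDeviceInfo is a stand-in using the field names of this revision.
type NetworkDeviceInfo struct {
	InterfaceName string
	Addresses     []NetworkAddress
	HWAddress     string
}

// AllocatedDeviceStatus is a stand-in for one entry of Status.Devices.
type AllocatedDeviceStatus struct {
	Driver, Pool, Device string
	Conditions           []Condition
	NetworkInfo          NetworkDeviceInfo
}

// describe returns one human-readable line per device, combining its
// identifier, its network details and the state of its Ready condition.
func describe(devices []AllocatedDeviceStatus) []string {
	var lines []string
	for _, d := range devices {
		id := d.Driver + "/" + d.Pool + "/" + d.Device
		state := "Ready condition not reported"
		for _, c := range d.Conditions {
			if c.Type != "Ready" {
				continue
			}
			if c.Status == "True" {
				state = "ready"
			} else {
				state = "not ready: " + c.Message
			}
		}
		var cidrs []string
		for _, a := range d.NetworkInfo.Addresses {
			cidrs = append(cidrs, a.CIDR)
		}
		lines = append(lines, fmt.Sprintf("%s (%s, %s) %v: %s",
			id, d.NetworkInfo.InterfaceName, d.NetworkInfo.HWAddress, cidrs, state))
	}
	return lines
}

func main() {
	devices := []AllocatedDeviceStatus{{
		Driver: "net.example.com", Pool: "node-a", Device: "eth1",
		Conditions: []Condition{{Type: "Ready", Status: "False", Message: "IP assignment failed"}},
		NetworkInfo: NetworkDeviceInfo{InterfaceName: "net1", HWAddress: "02:00:00:00:00:01"},
	}}
	for _, line := range describe(devices) {
		fmt.Println(line)
	}
}
```

How the entries are fetched (for example with a client for the `ResourceClaim` API) is left out on purpose, since it depends on the API version that ends up carrying the field.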
### Notes/Constraints/Caveats (Optional) -The content of `DeviceInfo` is driver specific and not standardized as part of +The content of `Info` is driver specific and not standardized as part of DRA, the interpretation of this field may then vary between controllers and users reading it. @@ -280,12 +280,12 @@ changes of the state of the device. ### Risks and Mitigations -As stated, 3rd party DRA drivers will set and update the `DeviceStatuses` for +As stated, 3rd party DRA drivers will set and update the `Devices` for the device they manage. An access control must be set in place to restrict the write access to the appropriate driver (A device status can only be updated by the driver which allocated and configured this device). -Adding `DeviceInfo` as an arbitrary data slice may introduce extra processing +Adding `Info` as an arbitrary data slice may introduce extra processing and storage overhead which might impact performance in a cluster with many devices and frequent status updates. In large-scale clusters where many devices are allocated, this impact must be considered. @@ -295,7 +295,7 @@ are allocated, this impact must be considered. ### API The `ResourceClaimStatus` struct in `pkg/apis/resource/types.go` will be -extended to include the slice of `DeviceStatuses`. +extended to include the slice of `Devices`. `ResourceClaim` validation of the status in `pkg/apis/resource/validation/validation.go` will be covered to allow a device @@ -319,7 +319,7 @@ to implement this enhancement. ##### Integration tests -- Usage of the `DeviceStatuses` field in the `ResourceClaimStatus`: +- Usage of the `Devices` field in the `ResourceClaimStatus`: * With the feature gate enabled, the field exists in the `ResourceClaim`. * With the feature gate disabled, the field does not exist in the `ResourceClaim`. @@ -332,7 +332,7 @@ TBD #### Alpha -- Feature implemented behind feature gates (`ResourceClaimDeviceStatus`). 
+- Feature implemented behind feature gates (`DRAResourceClaimDeviceStatus`). Feature Gates are disabled by default. - Documentation provided. - Initial unit, integration and e2e tests completed and enabled. @@ -371,7 +371,7 @@ skew with other Kubernetes components. ###### How can this feature be enabled / disabled in a live cluster? - [x] Feature gate (also fill in values in `kep.yaml`) - - Feature gate name: ResourceClaimDeviceStatus + - Feature gate name: DRAResourceClaimDeviceStatus - Components depending on the feature gate: kube-apiserver - [ ] Other - Describe the mechanism: @@ -421,7 +421,7 @@ No ###### How can an operator determine if the feature is in use by workloads? -Check the `ResourceClaim.Status.DeviceStatuses`. +Check the `ResourceClaim.Status.Devices`. ###### How can someone using this feature know that it is working for their instance? @@ -429,7 +429,7 @@ Check the `ResourceClaim.Status.DeviceStatuses`. - Event Reason: - [x] API .status - Condition name: - - Other field: `ResourceClaim.Status.DeviceStatuses` + - Other field: `ResourceClaim.Status.Devices` - [ ] Other (treat as last resort) - Details: @@ -487,7 +487,7 @@ No ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? -Depending on the content of the `DeviceInfo` set by the DRA drivers, the disk +Depending on the content of the `Info` set by the DRA drivers, the disk usage could increase. ###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? 
diff --git a/keps/sig-node/4817-resource-claim-device-status/kep.yaml b/keps/sig-node/4817-resource-claim-device-status/kep.yaml index a8b3afe3847..d18efddede1 100644 --- a/keps/sig-node/4817-resource-claim-device-status/kep.yaml +++ b/keps/sig-node/4817-resource-claim-device-status/kep.yaml @@ -15,7 +15,9 @@ reviewers: - "@MikeZappa87" - "@s1061123" approvers: - - TBD + - "@aojea" + - "@thockin" + - "@johnbelamaric" see-also: - "/keps/sig-node/3063-dynamic-resource-allocation" @@ -34,7 +36,7 @@ milestone: # The following PRR answers are required at alpha release # List the feature gate name and the components for which it must be enabled feature-gates: - - name: ResourceClaimDeviceStatus + - name: DRAResourceClaimDeviceStatus components: - kube-apiserver disable-supported: true From ab1940cdd7596cfdaf0eddc61c255faf0ec63f24 Mon Sep 17 00:00:00 2001 From: Lionel Jouin Date: Thu, 26 Sep 2024 17:03:06 +0200 Subject: [PATCH 05/11] Write access and CR alternative Signed-off-by: Lionel Jouin --- .../README.md | 106 ++++++++++++++++-- 1 file changed, 99 insertions(+), 7 deletions(-) diff --git a/keps/sig-node/4817-resource-claim-device-status/README.md b/keps/sig-node/4817-resource-claim-device-status/README.md index a6697271f41..3f195405be0 100644 --- a/keps/sig-node/4817-resource-claim-device-status/README.md +++ b/keps/sig-node/4817-resource-claim-device-status/README.md @@ -15,6 +15,7 @@ - [Risks and Mitigations](#risks-and-mitigations) - [Design Details](#design-details) - [API](#api) + - [Write Permission](#write-permission) - [Test Plan](#test-plan) - [Prerequisite testing updates](#prerequisite-testing-updates) - [Unit tests](#unit-tests) @@ -40,6 +41,7 @@ - [Pod.Status.PodIPs Enhancement](#podstatuspodips-enhancement) - [New Pod.Status Field](#new-podstatus-field) - [KEP-4680 extension](#kep-4680-extension) + - [Custom Resources](#custom-resources) - [Infrastructure Needed (Optional)](#infrastructure-needed-optional) @@ -283,7 +285,7 @@ changes of the state of 
the device. As stated, 3rd party DRA drivers will set and update the `Devices` for the device they manage. An access control must be set in place to restrict the write access to the appropriate driver (A device status can only be updated by -the driver which allocated and configured this device). +the entities that have a direct control over the device(s) being reported). Adding `Info` as an arbitrary data slice may introduce extra processing and storage overhead which might impact performance in a cluster with many @@ -302,6 +304,85 @@ extended to include the slice of `Devices`. to be reported only once in the slice, a device is being identified by `<driver name>/<pool name>/<device name>`. +### Write Permission + +To prevent unauthorized or accidental modifications by entities that do not +have access to a particular resource, a `ValidatingAdmissionPolicy` will be +created to validate the entities attempting to update the devices in the +`ResourceClaim.Status`. + +The `ValidatingAdmissionPolicy` will restrict `ResourceClaim.Status.Devices` +to be set only during updates, as the object will first have to be created and +allocated, then configured inside the pods. It will also restrict the +`ResourceClaim.Status.Devices` to be set only when the `ResourceClaim` is +allocated to a node. Additionally, the allocated node where the `ResourceClaim` +is assigned will be used to check if the user/entity updating the +`ResourceClaim.Status.Devices` is running on the same node. + +Here is a `ResourceClaim` allocated on a node. For now, this only works if +exactly one node is set: +```yaml +apiVersion: resource.k8s.io/v1alpha3 +kind: ResourceClaim +metadata: + ... +spec: + ... +status: + allocation: + ... + nodeSelector: + nodeSelectorTerms: + - matchFields: + - key: metadata.name + operator: In + values: + - my-node + ...
+``` + +Here is an example of what the `ValidatingAdmissionPolicy` could look like: +```yaml +--- +apiVersion: admissionregistration.k8s.io/v1 +kind: ValidatingAdmissionPolicy +metadata: + name: "resourceclaim-device-status-update" +spec: + failurePolicy: Fail + matchConstraints: + resourceRules: + - apiGroups: ["resource.k8s.io"] + apiVersions: ["*"] + operations: ["UPDATE"] + resources: ["resourceclaims"] + matchConditions: + - name: 'device-status-update' + expression: >- # Validation only for objects with their .status.devices updated. + object.status.devices != oldObject.status.devices + validations: + - expression: >- # User node must be the same node as the one where the ResourceClaim is allocated. + variables.userNodeName == variables.objectNodeName + messageExpression: >- + "User '" + request.userInfo.username + "' on node '" + variables.userNodeName + "' is not allowed to update the .status.devices of a ResourceClaim allocated on node '" + variables.objectNodeName + "'." + reason: Forbidden + variables: + - name: userNodeName + expression: >- + request.userInfo.extra[?'authentication.kubernetes.io/node-name'][0].orValue('') + - name: objectNodeName + expression: >- + object.status.allocation.nodeSelector.nodeSelectorTerms[0].matchFields[0].values[0].orValue('') +--- +apiVersion: admissionregistration.k8s.io/v1 +kind: ValidatingAdmissionPolicyBinding +metadata: + name: "resourceclaim-device-status-update-binding" +spec: + policyName: "resourceclaim-device-status-update" + validationActions: [Deny] +``` + ### Test Plan [x] I/we understand the owners of the involved components may require updates to @@ -323,6 +404,8 @@ to implement this enhancement. * With the feature gate enabled, the field exists in the `ResourceClaim`. * With the feature gate disabled, the field does not exist in the `ResourceClaim`. + * With the feature gate enabled, the `ValidatingAdmissionPolicy` exists and + restricts the write access of the `ResourceClaim.Status.Devices`.
##### e2e tests @@ -336,12 +419,12 @@ TBD Feature Gates are disabled by default. - Documentation provided. - Initial unit, integration and e2e tests completed and enabled. +- Authorization implemented to allow only the user on the same node as the + allocated `ResourceClaim` to write the status of the devices. #### Beta - Feature Gates are enabled by default. -- Authorization implemented to allow only the driver managing the device to - write the status. - No major outstanding bugs. - Feedback collected from the community (developers and users) with adjustments provided, implemented and tested. @@ -527,7 +610,8 @@ access the corresponding `ResourceClaim` for every Pod. An option the DRA drivers can currently use to report the status of the device allocated in the `ResourceClaim` is the annotation of the `ResourceClaim` or of -the `Pod` itself. As a reference, the [k8snetworkplumbingwg/Multus-CNI](https://github.com/k8snetworkplumbingwg/multus-cni) project is utilizing annotation to describe the network attachments/interfaces +the `Pod` itself. As a reference, the [k8snetworkplumbingwg/Multus-CNI](https://github.com/k8snetworkplumbingwg/multus-cni) +project is utilizing annotation to describe the network attachments/interfaces and report the status. Here is the API below representing a network attachment. This is stored as a @@ -555,14 +639,15 @@ type NetworkStatus struct { As part of the [Multi-Network (KEP-3698)](https://github.com/kubernetes/enhancements/issues/3698), the idea was to use the existing `Pod.Status.PodIPs` and save the data about the different network interfaces/devices attached to the `Pod`. 
As part of the -review of the KEP, it has been indicated ([here](https://github.com/kubernetes/enhancements/pull/3700#discussion_r1501690793) and [here](https://github.com/kubernetes/kubernetes/pull/123112#issuecomment-1925957930)) +review of the KEP, it has been indicated ([here](https://github.com/kubernetes/enhancements/pull/3700#discussion_r1501690793) +and [here](https://github.com/kubernetes/kubernetes/pull/123112#issuecomment-1925957930)) that it would be an API breaking change if the `Pod.Status.PodIPs` contains more than 1 value per IP family. ### New Pod.Status Field -Still as part of the [KEP-3698 - Multi-Network](https://github.com/kubernetes/enhancements/issues/3698), and in -the continuation of the previous alternative, the idea was to add a new field +Still as part of the [KEP-3698 - Multi-Network](https://github.com/kubernetes/enhancements/issues/3698), +and in the continuation of the previous alternative, the idea was to add a new field `Networks` in the `Pod.Status` so each networking DRA driver could report the status for each network interface/device directly in the `Pod.Status`. @@ -609,6 +694,13 @@ the idea was to extend the [KEP-4680 about resource health status in the `Pod.Status` ](https://github.com/kubernetes/enhancements/issues/4680) in order to expose device information and not just the health. +### Custom Resources + +In the `ResourceClaim.Status.Devices`, instead of having opaque field (`Info`) and +specific type fields, an object reference could be used for each device. The custom +object would be created and maintained by the driver to report the status of the +devices. + ## Infrastructure Needed (Optional) +No ### Dependencies @@ -552,7 +555,7 @@ implementation difficulties, etc.). ###### Does this feature depend on any specific services running in the cluster? -No +No, the field won't be populated unless a DRA driver utilizes it. 
### Scalability @@ -674,11 +677,12 @@ Here is below the proposed API: // state of a system, especially if the node that hosts the pod cannot contact the control // plane. type PodStatus struct { -[...] - // Networks is a list of PodNetworks that are attached to the Pod. - // - // +optional - Networks []NetworkStatus `json:"networks,omitempty"` + ... + + // Networks is a list of PodNetworks that are attached to the Pod. + // + // +optional + Networks []NetworkStatus `json:"networks,omitempty"` } // NetworkStatus provides the status of specific PodNetwork in a Pod. diff --git a/keps/sig-node/4817-resource-claim-device-status/kep.yaml b/keps/sig-node/4817-resource-claim-device-status/kep.yaml index d18efddede1..e7cb2767d40 100644 --- a/keps/sig-node/4817-resource-claim-device-status/kep.yaml +++ b/keps/sig-node/4817-resource-claim-device-status/kep.yaml @@ -1,12 +1,12 @@ title: Resource Claim Status With Possible Standardized Network Interface Data kep-number: 4817 authors: - - "@jane.doe" + - "@LionelJouin" owning-sig: sig-node participating-sigs: - sig-node - sig-network -status: provisional +status: implementable creation-date: 2024-08-30 reviewers: - "@johnbelamaric" @@ -29,6 +29,8 @@ see-also: # The target maturity stage in the current dev cycle for this KEP. stage: alpha #beta|stable +latest-milestone: "v1.32" + # The milestone at which this feature was, or is targeted to be, at each stage. 
milestone: alpha: "v1.32" @@ -40,3 +42,7 @@ feature-gates: components: - kube-apiserver disable-supported: true + +metrics: + - resourceclaim_status_devices_update_attempts_total + - resourceclaim_status_devices_update_failures_total \ No newline at end of file From 7cb9482a57283f140a0229f58bd4da59676f8a13 Mon Sep 17 00:00:00 2001 From: Lionel Jouin Date: Wed, 9 Oct 2024 15:57:59 +0200 Subject: [PATCH 11/11] Fixes based on review --- .../README.md | 28 +++++++++++-------- 1 file changed, 16 insertions(+), 12 deletions(-) diff --git a/keps/sig-node/4817-resource-claim-device-status/README.md b/keps/sig-node/4817-resource-claim-device-status/README.md index b9232d65207..962588a829b 100644 --- a/keps/sig-node/4817-resource-claim-device-status/README.md +++ b/keps/sig-node/4817-resource-claim-device-status/README.md @@ -8,10 +8,10 @@ - [Non-Goals](#non-goals) - [Proposal](#proposal) - [API - ResourceClaim.Status](#api---resourceclaimstatus) - - [User Stories (Optional)](#user-stories-optional) + - [User Stories](#user-stories) - [Story 1 - Network Device Status for Network Services](#story-1---network-device-status-for-network-services) - [Story 2 - Network Device Status for Troubleshooting](#story-2---network-device-status-for-troubleshooting) - - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Notes/Constraints/Caveats](#notesconstraintscaveats) - [Risks and Mitigations](#risks-and-mitigations) - [Design Details](#design-details) - [API](#api) @@ -127,8 +127,7 @@ The API changes define a new `Devices` field in the existing `AllocatedDeviceStatus` which holds device specific information. A device, identified by `<driver name>/<pool name>/<device name>` can be -represented only once in the `Devices` slice and will also mention which -request caused the device to be allocated. The state and characteristics of the +represented only once in the `Devices` slice.
The state and characteristics of the device are reported in the `Conditions`, representing the operational state of the device and in the `Data`, an arbitrary data field representing device specific characteristics. Additionally, for networking devices, a field @@ -204,12 +203,12 @@ type AllocatedDeviceStatus struct { // Data contains arbitrary driver-specific data. // // +optional - Data runtime.RawExtension `json:"data,omitempty" protobuf:"bytes,5,opt,name=data"` + Data *runtime.RawExtension `json:"data,omitempty" protobuf:"bytes,5,opt,name=data"` // NetworkData contains network-related information specific to the device. // // +optional - NetworkData NetworkDeviceData `json:"networkData,omitempty" protobuf:"bytes,6,opt,name=networkData"` + NetworkData *NetworkDeviceData `json:"networkData,omitempty" protobuf:"bytes,6,opt,name=networkData"` } // NetworkDeviceData provides network-related details for the allocated device. @@ -221,13 +220,13 @@ type NetworkDeviceData struct { // network interface. // // +optional - InterfaceName string `json:"interfaceName,omitempty" protobuf:"bytes,1,opt,name=interfaceName"` + InterfaceName *string `json:"interfaceName,omitempty" protobuf:"bytes,1,opt,name=interfaceName"` // Addresses lists the network addresses assigned to the device's network interface. // This can include both IPv4 and IPv6 addresses. // The addresses are in the CIDR notation, which includes both the address and the // associated subnet mask. - // e.g.: "192.0.2.0/24" for IPv4 and "2001:db8::/64" for IPv6. + // e.g.: "192.0.2.5/24" for IPv4 and "2001:db8::5/64" for IPv6. // // +optional // +listType=atomic @@ -236,11 +235,11 @@ type NetworkDeviceData struct { // HWAddress represents the hardware address (e.g. MAC Address) of the device's network interface. 
// // +optional - HWAddress string `json:"hwAddress,omitempty" protobuf:"bytes,3,opt,name=hwAddress"` + HWAddress *string `json:"hwAddress,omitempty" protobuf:"bytes,3,opt,name=hwAddress"` } ``` -### User Stories (Optional) +### User Stories #### Story 1 - Network Device Status for Network Services @@ -263,7 +262,7 @@ network interfaces helping to quickly and efficiently identify the issues such as error messages on failed network interface configuration, incorrect IP assignments or misconfigured network interfaces. -### Notes/Constraints/Caveats (Optional) +### Notes/Constraints/Caveats The content of `Data` is driver specific and not standardized as part of DRA, the interpretation of this field may then vary between controllers and @@ -438,6 +437,7 @@ network status. - Feature Gates are enabled by default. - No major outstanding bugs. +- 1 example of real-world usage. - Feedback collected from the community (developers and users) with adjustments provided, implemented and tested. @@ -522,6 +522,8 @@ No Check the `ResourceClaim.Status.Devices`. +The metric `resourceclaim_status_devices_update_attempts_total` will increase. + ###### How can someone using this feature know that it is working for their instance? - [ ] Events @@ -539,7 +541,9 @@ N/A ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? - [x] Metrics - - Metric name: `resourceclaim_status_devices_update_attempts_total` ; `resourceclaim_status_devices_update_failures_total` + - Metric name: + - "resourceclaim_status_devices_update_attempts_total" + - "resourceclaim_status_devices_update_failures_total" - [Optional] Aggregation method: - Components exposing the metric: kube-apiserver - [ ] Other (treat as last resort)
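
Reviewer note on the series: the final revision documents two invariants for `ResourceClaim.Status.Devices`, namely that a device identified by its driver, pool and device names is reported at most once, and that `Addresses` entries use CIDR notation. A minimal Go sketch of those checks follows. It is an illustration under assumptions, not the validation code planned for `pkg/apis/resource/validation/validation.go`: the types are simplified stand-ins, the Go field names `Driver`, `Pool` and `Device` are inferred from the `+listMapKey` markers, and `ValidateDevices` and `deviceKey` are hypothetical names.

```go
package main

import (
	"fmt"
	"net"
)

// NetworkDeviceData is a simplified stand-in for the type in the patch.
// Optional scalar fields are pointers, as in the final revision.
type NetworkDeviceData struct {
	InterfaceName *string
	Addresses     []string // CIDR notation, e.g. "192.0.2.5/24"
	HWAddress     *string
}

// AllocatedDeviceStatus is a simplified stand-in; the field names
// Driver, Pool and Device are assumptions based on the list map keys.
type AllocatedDeviceStatus struct {
	Driver      string
	Pool        string
	Device      string
	NetworkData *NetworkDeviceData
}

// deviceKey (hypothetical) identifies a device. A struct key avoids
// ambiguity if any of the names contains a "/".
type deviceKey struct {
	driver, pool, device string
}

// ValidateDevices (hypothetical helper) returns an error for the first
// duplicate device or the first address that is not valid CIDR.
func ValidateDevices(devices []AllocatedDeviceStatus) error {
	seen := make(map[deviceKey]struct{}, len(devices))
	for _, d := range devices {
		key := deviceKey{d.Driver, d.Pool, d.Device}
		if _, dup := seen[key]; dup {
			return fmt.Errorf("device %s/%s/%s is reported more than once",
				d.Driver, d.Pool, d.Device)
		}
		seen[key] = struct{}{}

		// NetworkData is optional, so a nil pointer is valid.
		if d.NetworkData == nil {
			continue
		}
		for _, addr := range d.NetworkData.Addresses {
			if _, _, err := net.ParseCIDR(addr); err != nil {
				return fmt.Errorf("device %s/%s/%s: address %q is not in CIDR notation: %w",
					d.Driver, d.Pool, d.Device, addr, err)
			}
		}
	}
	return nil
}
```

Calling `ValidateDevices` on a slice that lists the same driver, pool and device twice returns an error, as does an address such as `192.0.2.5` that lacks a prefix length. A map keyed by a struct keeps the duplicate check linear in the number of devices.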