enhancement_template: add API extension sections

openshift · Oct 26, 2021 · 8d07520 · 8d07520
1 parent e2f4d4b
commit 8d07520
Showing 1 changed file with 108 additions and 2 deletions.
diff --git a/guidelines/enhancement_template.md b/guidelines/enhancement_template.md
@@ -8,6 +8,8 @@ reviewers:
 approvers:
   - TBD
   - "@oscardoe"
+api-approvers: # in case of new or modified APIs or API extensions (CRDs, aggregated apiservers, webhooks, finalizers)
+  - TBD
 creation-date: yyyy-mm-dd
 last-updated: yyyy-mm-dd
 status: provisional|implementable|implemented|deferred|rejected|withdrawn|replaced|informational
@@ -115,6 +117,24 @@ bogged down.
 
 Include a story on how this proposal will be operationalized:  lifecycled, monitored and remediated at scale.
 
+### API Extensions
+
+API Extensions are CRDs, admission and conversion webhooks, aggregated API servers,
+finalizers, i.e. those mechanisms that change the OCP API surface and behaviour.
+
+- Name the API extensions this enhancement adds or modifies.
+- Does this enhancement modify the behaviour of existing resources, especially those owned
+  by other parties than the authoring team (including upstream resources), and, if yes, how?
+  Please add those other parties as reviewers to the enhancement.
+
+  Examples:
+  - Adds a finalizer to namespaces. Namespace cannot be deleted without our controller running.
+  - Restricts the label format for objects to X.
+  - Defaults field Y on object kind Z.
+
+Fill in the operational impact of these API Extensions in the "Operational Aspects
+of API Extensions" section.
+
 ### Implementation Details/Notes/Constraints [optional]
 
 What are the caveats to the implementation? What are some important details that
@@ -225,8 +245,9 @@ enhancement:
 
 Upgrade expectations:
 - Each component should remain available for user requests and
-  workloads during upgrades. Ensure the components leverage best practices in handling [voluntary disruption](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/). Any exception to this should be
-  identified and discussed here.
+  workloads during upgrades. Ensure the components leverage best practices in handling [voluntary
+  disruption](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/). Any exception to
+  this should be identified and discussed here.
 - Micro version upgrades - users should be able to skip forward versions within a
   minor release stream without being required to pass through intermediate
   versions - i.e. `x.y.N->x.y.N+2` should work without requiring `x.y.N->x.y.N+1`
@@ -265,6 +286,91 @@ enhancement:
 - Will any other components on the node change? For example, changes to CSI, CRI
   or CNI may require updating that component before the kubelet.
 
+### Operational Aspects of API Extensions
+
+Describe the impact of API extensions (mentioned in the proposal section, i.e. CRDs,
+admission and conversion webhooks, aggregated API servers, finalizers) here in detail,
+especially how they impact the OCP system architecture and operational aspects.
+
+- For conversion/admission webhooks and aggregated apiservers: what are the SLIs (Service Level
+  Indicators) an administrator or support can use to determine the health of the API extensions
+
+  Examples (metrics, alerts, operator conditions)
+  - authentication-operator condition `APIServerDegraded=False`
+  - authentication-operator condition `APIServerAvailable=True`
+  - openshift-authentication/oauth-apiserver deployment and pods health
+
+- What impact do these API extensions have on existing SLIs (e.g. scalability, API throughput,
+  API availability)
+
+  Examples:
+  - Adds 1s to every pod update in the system, slowing down pod scheduling by 5s on average.
+  - Fails creation of ConfigMap in the system when the webhook is not available.
+  - Adds a dependency on the SDN service network for all resources, risking API availability in case
+    of SDN issues.
+  - Expected use-cases require less than 1000 instances of the CRD, not impacting
+    general API throughput.
+
+- How is the impact on existing SLIs to be measured and when (e.g. every release by QE, or
+  automatically in CI) and by whom (e.g. perf team; name the responsible person and let them review
+  this enhancement)
+
+#### Failure Modes
+
+- Describe the possible failure modes of the API extensions.
+- Describe how a failure or behaviour of the extension will impact the overall cluster health
+  (e.g. which kube-controller-manager functionality will stop working), especially regarding
+  stability, availability, performance and security.
+- Describe which OCP teams are likely to be called upon in case of escalation with one of the failure modes
+  and add them as reviewers to this enhancement.
+
+#### Support Procedures
+
+Describe how to
+- detect the failure modes in a support situation, describe possible symptoms (events, metrics,
+  alerts, which log output in which component)
+
+  Examples:
+  - If the webhook is not running, kube-apiserver logs will show errors like "failed to call admission webhook xyz".
+  - Operator X will degrade with message "Failed to launch webhook server" and reason "WehhookServerFailed".
+  - The metric `webhook_admission_duration_seconds("openpolicyagent-admission", "mutating", "put", "false")`
+    will show >1s latency and alert `WebhookAdmissionLatencyHigh` will fire.
+
+- disable the API extension (e.g. remove MutatingWebhookConfiguration `xyz`, remove APIService `foo`)
+
+  - What consequences does it have on the cluster health?
+
+    Examples:
+    - Garbage collection in kube-controller-manager will stop working.
+    - Quota will be wrongly computed.
+    - Disabling/removing the CRD is not possible without removing the CR instances. Customer will lose data.
+      Disabling the conversion webhook will break garbage collection.
+
+  - What consequences does it have on existing, running workloads?
+
+    Examples:
+    - New namespaces won't get the finalizer "xyz" and hence might leak resource X
+      when deleted.
+    - SDN pod-to-pod routing will stop updating, potentially breaking pod-to-pod
+      communication after some minutes.
+
+  - What consequences does it have for newly created workloads?
+
+    Examples:
+    - New pods in namespace with Istio support will not get sidecars injected, breaking
+      their networking.
+
+- Does functionality fail gracefully and will work resume when re-enabled without risking
+  consistency?
+
+  Examples:
+  - The mutating admission webhook "xyz" has FailPolicy=Ignore and hence
+    will not block the creation or updates on objects when it fails. When the
+    webhook comes back online, there is a controller reconciling all objects, applying
+    labels that were not applied during admission webhook downtime.
+  - Namespaces deletion will not delete all objects in etcd, leading to zombie
+    objects when another namespace with the same name is created.
+
 ## Implementation History
 
 Major milestones in the life cycle of a proposal should be tracked in `Implementation