Skip to content

Commit

Permalink
enhancement_template: add API extension sections
Browse files Browse the repository at this point in the history
  • Loading branch information
sttts committed Oct 26, 2021
1 parent e2f4d4b commit 8d07520
Showing 1 changed file with 108 additions and 2 deletions.
110 changes: 108 additions & 2 deletions guidelines/enhancement_template.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,8 @@ reviewers:
approvers:
- TBD
- "@oscardoe"
api-approvers: # in case of new or modified APIs or API extensions (CRDs, aggregated apiservers, webhooks, finalizers)
- TBD
creation-date: yyyy-mm-dd
last-updated: yyyy-mm-dd
status: provisional|implementable|implemented|deferred|rejected|withdrawn|replaced|informational
Expand Down Expand Up @@ -115,6 +117,24 @@ bogged down.

Include a story on how this proposal will be operationalized: lifecycled, monitored and remediated at scale.

### API Extensions

API Extensions are CRDs, admission and conversion webhooks, aggregated API servers,
finalizers, i.e. those mechanisms that change the OCP API surface and behaviour.

- Name the API extensions this enhancement adds or modifies.
- Does this enhancement modify the behaviour of existing resources, especially those owned
by other parties than the authoring team (including upstream resources), and, if yes, how?
Please add those other parties as reviewers to the enhancement.

Examples:
- Adds a finalizer to namespaces. Namespace cannot be deleted without our controller running.
- Restricts the label format for objects to X.
- Defaults field Y on object kind Z.

Fill in the operational impact of these API Extensions in the "Operational Aspects
of API Extensions" section.

### Implementation Details/Notes/Constraints [optional]

What are the caveats to the implementation? What are some important details that
Expand Down Expand Up @@ -225,8 +245,9 @@ enhancement:

Upgrade expectations:
- Each component should remain available for user requests and
workloads during upgrades. Ensure the components leverage best practices in handling [voluntary disruption](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/). Any exception to this should be
identified and discussed here.
workloads during upgrades. Ensure the components leverage best practices in handling [voluntary
disruption](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/). Any exception to
this should be identified and discussed here.
- Micro version upgrades - users should be able to skip forward versions within a
minor release stream without being required to pass through intermediate
versions - i.e. `x.y.N->x.y.N+2` should work without requiring `x.y.N->x.y.N+1`
Expand Down Expand Up @@ -265,6 +286,91 @@ enhancement:
- Will any other components on the node change? For example, changes to CSI, CRI
or CNI may require updating that component before the kubelet.

### Operational Aspects of API Extensions

Describe the impact of API extensions (mentioned in the proposal section, i.e. CRDs,
admission and conversion webhooks, aggregated API servers, finalizers) here in detail,
especially how they impact the OCP system architecture and operational aspects.

- For conversion/admission webhooks and aggregated apiservers: what are the SLIs (Service Level
Indicators) an administrator or support can use to determine the health of the API extensions

Examples (metrics, alerts, operator conditions)
- authentication-operator condition `APIServerDegraded=False`
- authentication-operator condition `APIServerAvailable=True`
- openshift-authentication/oauth-apiserver deployment and pods health

- What impact do these API extensions have on existing SLIs (e.g. scalability, API throughput,
API availability)

Examples:
- Adds 1s to every pod update in the system, slowing down pod scheduling by 5s on average.
- Fails creation of ConfigMap in the system when the webhook is not available.
- Adds a dependency on the SDN service network for all resources, risking API availability in case
of SDN issues.
- Expected use-cases require less than 1000 instances of the CRD, not impacting
general API throughput.

- How is the impact on existing SLIs to be measured and when (e.g. every release by QE, or
automatically in CI) and by whom (e.g. perf team; name the responsible person and let them review
this enhancement)

#### Failure Modes

- Describe the possible failure modes of the API extensions.
- Describe how a failure or behaviour of the extension will impact the overall cluster health
(e.g. which kube-controller-manager functionality will stop working), especially regarding
stability, availability, performance and security.
- Describe which OCP teams are likely to be called upon in case of escalation with one of the failure modes
and add them as reviewers to this enhancement.

#### Support Procedures

Describe how to
- detect the failure modes in a support situation, describe possible symptoms (events, metrics,
alerts, which log output in which component)

Examples:
- If the webhook is not running, kube-apiserver logs will show errors like "failed to call admission webhook xyz".
- Operator X will degrade with message "Failed to launch webhook server" and reason "WehhookServerFailed".
- The metric `webhook_admission_duration_seconds("openpolicyagent-admission", "mutating", "put", "false")`
will show >1s latency and alert `WebhookAdmissionLatencyHigh` will fire.

- disable the API extension (e.g. remove MutatingWebhookConfiguration `xyz`, remove APIService `foo`)

- What consequences does it have on the cluster health?

Examples:
- Garbage collection in kube-controller-manager will stop working.
- Quota will be wrongly computed.
- Disabling/removing the CRD is not possible without removing the CR instances. Customer will lose data.
Disabling the conversion webhook will break garbage collection.

- What consequences does it have on existing, running workloads?

Examples:
- New namespaces won't get the finalizer "xyz" and hence might leak resource X
when deleted.
- SDN pod-to-pod routing will stop updating, potentially breaking pod-to-pod
communication after some minutes.

- What consequences does it have for newly created workloads?

Examples:
- New pods in namespace with Istio support will not get sidecars injected, breaking
their networking.

- Does functionality fail gracefully and will work resume when re-enabled without risking
consistency?

Examples:
- The mutating admission webhook "xyz" has FailPolicy=Ignore and hence
will not block the creation or updates on objects when it fails. When the
webhook comes back online, there is a controller reconciling all objects, applying
labels that were not applied during admission webhook downtime.
- Namespaces deletion will not delete all objects in etcd, leading to zombie
objects when another namespace with the same name is created.

## Implementation History

Major milestones in the life cycle of a proposal should be tracked in `Implementation
Expand Down

0 comments on commit 8d07520

Please sign in to comment.