From dcb669526f8cbdd01634190bbedb4a945fd5b1db Mon Sep 17 00:00:00 2001
From: Lee Verberne <verb@google.com>
Date: Wed, 15 Aug 2018 14:29:55 +0200
Subject: [PATCH] Move Ephemeral Containers into pod.Spec

After discussing with API reviewers and relevant SIG leads, we've agreed that
the configuration for Ephemeral Containers should live in the pod spec.
---
 .../node/troubleshoot-running-pods.md         | 538 +++++++++---------
 1 file changed, 274 insertions(+), 264 deletions(-)

diff --git a/contributors/design-proposals/node/troubleshoot-running-pods.md b/contributors/design-proposals/node/troubleshoot-running-pods.md
index cb86c35b8a0..89db12a53a4 100644
--- a/contributors/design-proposals/node/troubleshoot-running-pods.md
+++ b/contributors/design-proposals/node/troubleshoot-running-pods.md
@@ -16,9 +16,9 @@ Many developers of native Kubernetes applications wish to treat Kubernetes as an
 execution platform for custom binaries produced by a build system. These users
 can forgo the scripted OS install of traditional Dockerfiles and instead `COPY`
 the output of their build system into a container image built `FROM scratch` or
-a [distroless container
-image](https://github.com/GoogleCloudPlatform/distroless). This confers several
-advantages:
+a
+[distroless container image](https://github.com/GoogleCloudPlatform/distroless).
+This confers several advantages:
 
 1.  **Minimal images** lower operational burden and reduce attack vectors.
 1.  **Immutable images** improve correctness and reliability.
@@ -61,10 +61,9 @@ command, `kubectl debug`, which parallels an existing command, `kubectl exec`.
 Whereas `kubectl exec` runs a _process_ in a _container_, `kubectl debug` will
 be similar but run a _container_ in a _pod_.
 
-A container created by `kubectl debug` is a _Debug Container_. Just like a
-process run by `kubectl exec`, a Debug Container is not part of the pod spec.
-Unlike `kubectl exec`, a Debug Container _does_ have status that is reported in
-`v1.PodStatus` and displayed by `kubectl describe pod`.
+A container created by `kubectl debug` is a _Debug Container_. Unlike `kubectl
+exec`, Debug Containers have status that is reported in `PodStatus` and
+displayed by `kubectl describe pod`.
 
 For example, the following command would attach to a newly created container in
 a pod:
@@ -100,70 +99,94 @@ subsequently be used to reattach and is reported by `kubectl describe`.
 
 ### Kubernetes API Changes
 
-There has been much discussion about how this fits best into the Kubernetes API.
-The consensus is for an imperative "debug this pod" action whereby the kubelet
-creates a new, temporary container in a pod on command. SIG Node would like to
-avoid new dependencies in the kubelet, so this will be implemented in the Core
-API. Three possible implementations follow, and additional implementations that
-were evaluated and dismissed are at the end of this document.
+This will be implemented in the Core API to avoid new dependencies in the
+kubelet. The user-level concept of a _Debug Container_ is implemented with the
+API-level concept of an _Ephemeral Container_. The API doesn't require an
+Ephemeral Container to be used as a Debug Container. It's intended as a
+general-purpose construct for running a short-lived process in a pod.
 
-All of the proposed solutions implement the user-level concept of a _Debug
-Container_ using the API-level concept of an _Ephemeral Container_. The API
-doesn't prescribe how an Ephemeral Container is used. It could conceivably see
-use other than Debug Containers, but we don't currently have other use cases.
+#### Pod Changes
 
-#### Chosen Solution: Subresource to Update PodStatus
-
-An Ephemeral Container is not part of the pod specification as it's not part of
-the declared state of the pod, but we describe it using the same primitives as
-in `PodSpec`. An `EphemeralContainer` contains a Spec, a Status and a Target:
+Ephemeral Containers are represented in `PodSpec` and `PodStatus`:
 
 ```
-// EphemeralContainer describes a container to attach to a running pod for troubleshooting.
-type EphemeralContainer struct {
-        metav1.TypeMeta `json:",inline"`
-
-        // Spec describes the Ephemeral Container to be created.
-        Spec *Container `json:"spec,omitempty" protobuf:"bytes,2,opt,name=spec"`
-
-        // Most recently observed status of the container.
-        // This data may not be up to date.
-        // Populated by the system.
-        // Read-only.
-        // +optional
-        Status *ContainerStatus `json:"status,omitempty" protobuf:"bytes,3,opt,name=status"`
+type PodSpec struct {
+  ...
+  // List of user-initiated ephemeral containers to run in this pod.
+  // This field is alpha-level and is only honored by servers that enable the EphemeralContainers feature.
+  // +optional
+  EphemeralContainers []EphemeralContainer `json:"ephemeralContainers,omitempty" protobuf:"bytes,29,rep,name=ephemeralContainers"`
+}
 
-        // If set, the name of the container from PodSpec that this ephemeral container targets.
-        // If not set then the ephemeral container is run in whatever namespaces are shared
-        // for the pod.
-        TargetContainerName string `json:"targetContainerName,omitempty" protobuf:"bytes,4,opt,name=targetContainerName"`
+type PodStatus struct {
+  ...
+  // Status for any Ephemeral Containers that are running in this pod.
+  // This field is alpha-level and is only honored by servers that enable the EphemeralContainers feature.
+  // +optional
+  EphemeralContainerStatuses []ContainerStatus `json:"ephemeralContainerStatuses,omitempty" protobuf:"bytes,12,rep,name=ephemeralContainerStatuses"`
 }
 ```
 
-Ephemeral Containers for a pod are listed in the pod's status:
+`EphemeralContainerStatuses` resembles the existing `ContainerStatuses` and
+`InitContainerStatuses`, but `EphemeralContainers` introduces a new type:
 
 ```
-type PodStatus struct {
-        ...
-        // List of user-initiated ephemeral containers that have been run in this pod.
-        // +optional
-        EphemeralContainers []EphemeralContainer `json:"commands,omitempty" protobuf:"bytes,11,rep,name=ephemeralContainers"`
-
+// An EphemeralContainer is a container which runs temporarily in a pod for human-initiated actions
+// such as troubleshooting. This is an alpha feature enabled by the EphemeralContainers feature flag.
+type EphemeralContainer struct {
+  // Spec describes the Ephemeral Container to be created.
+  Spec Container `json:"spec,omitempty" protobuf:"bytes,1,opt,name=spec"`
+
+  // If set, the name of the container from PodSpec that this ephemeral container targets.
+  // The ephemeral container will be run in the namespaces (IPC, PID, etc) of this container.
+  // If not set then the ephemeral container is run in whatever namespaces are shared
+  // for the pod.
+  // +optional
+  TargetContainerName string `json:"targetContainerName,omitempty" protobuf:"bytes,2,opt,name=targetContainerName"`
 }
 ```
 
-To create a new Ephemeral Container, one appends a new `EphemeralContainer` with
-the desired `v1.Container` as `Spec` in `Pod.Status` and updates the `Pod` in
-the API. Users cannot normally modify the pod status, so we'll create a new
-subresource `/ephemeralcontainers` that allows an update of solely
-`EphemeralContainers` and enforces append-only semantics.
+Much of the utility of Ephemeral Containers comes from the ability to run a
+container within the PID namespace of another container. `TargetContainerName`
+allows targeting a container that doesn't share its PID namespace with the rest
+of the pod. We must modify the CRI to enable this functionality (see below).
+
+##### Alternative Considered: Omitting TargetContainerName
+
+It would be simpler for the API, kubelet and kubectl if `EphemeralContainers`
+were a `[]Container`, but as isolated PID namespaces will be the default for some
+time, being able to target a container will provide a better user experience.
+
+#### Updates
+
+Most fields of `Pod.Spec` are immutable once created. There is a short whitelist
+of fields which may be updated, and we could extend this to include
+`EphemeralContainers`. The ability to add new containers is a large change for
+Pod, however, and we'd like to begin conservatively by enshrining the following
+best practices:
+
+1.  Ephemeral Containers lack guarantees for resources or execution, and they
+    will never be automatically restarted. To avoid pods that depend on
+    Ephemeral Containers, we allow their addition only in updates and disallow
+    them during create.
+1.  Some fields of `v1.Container` imply they are a fundamental part of a pod. We
+    will disallow the following fields in Ephemeral Containers: `resources`,
+    `ports`, `livenessProbe`, `readinessProbe`, and `lifecycle`.
+1.  Cluster administrators may want to restrict access to Ephemeral Containers
+    independent of other pod updates.
+1.  The kubelet may remove terminated Ephemeral Containers from the pod spec
+    when they are garbage collected to avoid restarting the ephemeral container
+    when the pod is restarted.
+
+To enforce these restrictions and new permissions, we will introduce a new Pod
+subresource, `/ephemeralcontainers`. `EphemeralContainers` can only be modified
+via this subresource. `EphemeralContainerStatuses` is updated with everything
+else in `Pod.Status` via `/status`.
 
-**Note that Ephemeral Containers are not regular containers and should not be
-used to build services.** They lack guarantees for resources or execution, they
-will never be automatically restarted, and many of the fields of `v1.Container`
-will not be allowed for Debug Containers. In particular, the following fields
-are explicitly disallowed by API validation: `resources`, `ports`,
-`livenessProbe`, `readinessProbe`, and `lifecycle`.
+To create a new Ephemeral Container, one appends a new `EphemeralContainer` with
+the desired `v1.Container` as `Spec` in `Pod.Spec` and `PUT`s the pod to
+`/ephemeralcontainers`. An Ephemeral Container could be removed in a similar
+fashion, but this is not planned in the initial version.
 
 The subresources `attach`, `exec`, `log`, and `portforward` are available for
 Ephemeral Containers and will be forwarded by the apiserver. This means `kubectl
@@ -178,107 +201,30 @@ container using the existing attach endpoint,
 container occurring between its creation and attach will not be replayed, but it
 can be viewed using `kubectl log`.
 
-#### Alternative 1: "exec++"
+##### Alternative Considered: Standard Pod Updates
 
-A simpler change is to extend `v1.Pod`'s `/exec` subresource to support
-"executing" container images. The current `/exec` endpoint must implement `GET`
-to support streaming for all clients. We don't want to encode a (potentially
-large) `v1.Container` into a query string, so we must extend `v1.PodExecOptions`
-with the specific fields required for creating a Debug Container:
+It would simplify initial implementation if we updated the pod spec via the
+normal means, and switched to a new update subresource if required at a future
+date. It's easier to begin with a too-restrictive policy than a too-permissive
+one on which users come to rely, and we expect to be able to remove the
+`/ephemeralcontainers` subresource prior to exiting Alpha should it become
+unnecessary.
 
-```
-// PodExecOptions is the query options to a Pod's remote exec call
-type PodExecOptions struct {
-        ...
-        // EphemeralContainerName is the name of an ephemeral container in which the
-        // command ought to be run. Either both EphemeralContainerName and
-        // EphemeralContainerImage fields must be set, or neither.
-        EphemeralContainerName *string `json:"ephemeralContainerName,omitempty" ...`
+### Container Runtime Interface (CRI) changes
 
-        // EphemeralContainerImage is the image of an ephemeral container in which the command
-        // ought to be run. Either both EphemeralContainerName and EphemeralContainerImage
-        // fields must be set, or neither.
-        EphemeralContainerImage *string `json:"ephemeralContainerImage,omitempty" ...`
-}
-```
-
-After creating the Ephemeral Container, the kubelet would upgrade the connection
-to streaming and perform an attach to the container's console. If disconnected,
-the Ephemeral Container could be reattached using the pod's `/attach` endpoint
-with `EphemeralContainerName`.
-
-Ephemeral Containers could not be removed via the API and instead the process
-must terminate. While not ideal, this parallels existing behavior of `kubectl
-exec`. To kill an Ephemeral Container one would `attach` and exit the process
-interactively or create a new Ephemeral Container to send a signal with
-`kill(1)` to the original process.
-
-#### Alternative 2: Ephemeral Container Controller
-
-Using subresources is an imperative style API where the client instructs the
-kubelet to perform an action, but in general Kubernetes prefers declarative APIs
-where the client declares a state for Kubernetes to enact.
-
-We could implement this in a declarative manner by creating a new
-`EphemeralContainer` type:
-
-```
-type EphemeralContainer struct {
-        metav1.TypeMeta
-        metav1.ObjectMeta
-
-        Spec v1.Container
-        Status v1.ContainerStatus
-}
-```
-
-A new controller in the kubelet would watch for EphemeralContainers and
-create/delete debug containers. `EphemeralContainer.Status` would be updated by
-the kubelet at the same time it updates `ContainerStatus` for regular and init
-containers. Clients would create a new `EphemeralContainer` object, wait for it
-to be started and then attach using the pod's attach subresource and the name of
-the `EphemeralContainer`.
-
-Debugging is inherently imperative, however, and not the a desired state to
-describe. Once a Debug Container is started it should not be automatically
-restarted, for example. A declarative API adds new states for the kubelet to
-enforce, and SIG Node strongly prefers to minimize kubelet complexity.
-
-### Ephemeral Container Status
-
-The kubelet should be able to construct `PodStatus` without relying on prior
-state, so we will store the Ephemeral Container's `Spec` and
-`TargetContainerName` as runtime metadata. The kubelet persists container
-metadata as CRI
-[labels](https://github.com/kubernetes/kubernetes/blob/v1.10.0-alpha.0/pkg/kubelet/apis/cri/v1alpha1/runtime/api.proto#L606)
-and
-[annotations](https://github.com/kubernetes/kubernetes/blob/v1.10.0-alpha.0/pkg/kubelet/apis/cri/v1alpha1/runtime/api.proto#L613).
-The entire `v1.Container` used in the request will be serialized and stored as a
-runtime annotation. The value of `TargetContainerName` will be stored as a
-runtime label. Persisting this data in the runtime means it survives kubelet
-restarts.
-
-At least for the Docker runtime, this is [an intended use of docker
-labels](https://docs.docker.com/engine/userguide/labels-custom-metadata/#value-guidelines).
-Docker does not document the maximum length of labels in its API. Empirically,
-it supports up to the 64K constraint of the docker client's `bufio.Scanner`
-size. We will conservatively limit the size of the spec to 32K and add a 32K
-minimum label length test to runtime qualification.
-
-`EphemeralContainer.Status` is populated by the kubelet in the same way as
-regular container statuses. The kubelet then updates the pod's status in the API
-server using the pod's `/status` endpoint, which imposes no restrictions on
-updates to `ephemeralContainers`.
+The CRI requires no changes for basic functionality, but it will need to be
+updated to support container namespace targeting, as described in the
+[Shared PID Namespace Proposal](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/pod-pid-namespace.md#targeting-a-specific-containers-namespace).
 
 ### Creating Debug Containers
 
-1.  `kubectl` constructs and `EphemeralContainer` based on command line
-    arguments and appends it to `Pod.Status.EphemeralContainers`. It `PUT`s the
-    modified pod to the pod's `/ephemeralcontainers`.
+Creating a debug container involves the following steps:
+
+1.  `kubectl` constructs an `EphemeralContainer` based on command line arguments
+    and appends it to `Pod.Spec.EphemeralContainers`. It `PUT`s the modified pod
+    to the pod's `/ephemeralcontainers`.
 1.  The apiserver discards changes other than additions to
-    `Pod.Status.EphemeralContainers` and validates the pod update.
-    1.  Update discards `EphemeralContainer.Status` for new Ephemeral
-        Containers.
+    `Pod.Spec.EphemeralContainers` and validates the pod update.
     1.  Pod validation fails if container spec contains fields disallowed for
         Ephemeral Containers or the same name as a container in the spec or
         `EphemeralContainers`.
@@ -286,8 +232,8 @@ updates to `ephemeralContainers`.
 1.  The kubelet's pod watcher notices the update and triggers a `syncPod()`.
     During the sync, the kubelet calls `kuberuntime.StartEphemeralContainer()`
     for any new Ephemeral Container.
-    1.  `StartEphemeralContainer()` uses the existing `startContainer()` method,
-        which gains support for targeting the namespaces of a container by name.
+    1.  `StartEphemeralContainer()` uses the existing `startContainer()` to
+        start the Ephemeral Container.
     1.  After initial creation, future invocations of `syncPod()` will publish
         its ContainerStatus but otherwise ignore the Ephemeral Container. It
         will exist for the life of the pod sandbox or it exits. In no event will
@@ -317,26 +263,8 @@ the pod sandbox) is destroyed. Debug Containers will stop when their command
 exits, such as exiting a shell. Unlike `kubectl exec`, processes in Debug
 Containers will not receive an EOF if their connection is interrupted.
 
-### Container Lifecycle Changes
-
-Implementing debug requires no changes to the Container Runtime Interface as
-it's the same operation as creating a regular container. The following changes
-are necessary in the kubelet:
-
-1.  `SyncPod()` must not kill any Debug Container even though it is not part of
-    the pod spec.
-1.  As an exception to the above, `SyncPod()` will kill Debug Containers when
-    the pod sandbox changes since a lone Debug Container in an abandoned sandbox
-    is not useful. Debug Containers are not started automatically in the new
-    sandbox.
-1.  `convertStatusToAPIStatus()` must sort Debug Containers status into
-    `EphemeralContainer.Status` similar to as it does for
-    `InitContainerStatuses`
-1.  Debug Containers must be excluded from calculation of pod phase and
-    condition
-
-`KillPod()` already operates on all running containers returned by the runtime
-and requires no changes
+A future improvement to Ephemeral Containers could allow killing Debug
+Containers when they're removed from `EphemeralContainers`.
 
 ### Security Considerations
 
@@ -344,9 +272,8 @@ Debug Containers have no additional privileges above what is available to any
 `v1.Container`. It's the equivalent of configuring an shell container in a pod
 spec except that it is created on demand.
 
-Admission plugins must be updated to guard `/ephemeralcontainers`. In
-particular, they should enforce the same container image policy on the
-`EphemeralContainer.Spec` parameter as is enforced for regular containers.
+Admission plugins must be updated to guard `/ephemeralcontainers`. They should
+apply the same container image and security policy as for regular containers.
 
 ### Additional Consideration
 
@@ -356,70 +283,34 @@ particular, they should enforce the same container image policy on the
     troubleshooting causes a pod to exceed its resource limit it may be evicted.
 1.  There's an output stream race inherent to creating then attaching a
     container which causes output generated between the start and attach to go
-    to the log rather than the client. This is not specific to Debug Containers
-    and exists because Kubernetes has no mechanism to attach a container prior
-    to starting it. This larger issue will not be addressed by Debug Containers,
-    but Debug Containers would benefit from future improvements or work arounds.
-1.  Debug Containers should not be used to build services, which we've attempted
-    to reflect in the API.
-1.  If a pod is configured with isolated PID namespaces, the Debug Container
-    will join the PID namespace of the target container. Debug Containers will
-    not be available with runtimes that do not implement PID namespace sharing.
+    to the log rather than the client. This is not specific to Ephemeral
+    Containers and exists because Kubernetes has no mechanism to attach a
+    container prior to starting it. This larger issue will not be addressed by
+    Ephemeral Containers, but Ephemeral Containers would benefit from future
+    improvements or workarounds.
+1.  Ephemeral Containers should not be used to build services, which we've
+    attempted to reflect in the API.
 
 ## Implementation Plan
 
-### Alpha Release
-
-#### Goals and Non-Goals for Alpha Release
+### 1.12: Initial Alpha Release
 
-We're targeting an alpha release in Kubernetes 1.11 that includes the following
+We're targeting an alpha release in Kubernetes 1.12 that includes the following
 basic functionality:
 
-*   Support in the kubelet for creating debug containers in a running pod
-*   A `kubectl alpha debug` command to initiate a debug container
-*   `kubectl describe pod` will list status of debug containers running in a pod
-
-Functionality will be hidden behind an alpha feature flag and disabled by
-default.
-
-#### Kubernetes API Changes
-
-The following changes must be implemented in the API:
-
-1.  `v1.EphemeralContainer` will be added and `v1.PodStatus` will be extended as
-    described above.
-1.  The new subresource will be added to the pods API.
-1.  The API server must check for Ephemeral Containers when validating `attach`.
-
-#### kubelet Implementation
-
-Debug Containers are implemented in the kubelet's generic runtime manager.
-Performing this operation with a legacy (non-CRI) runtime will result in a not
-implemented error. Implementation in the kubelet will be split into the
-following steps:
-
-1.  New container metadata `ContainerType`, `ContainerSpec` &
-    `TargetContainerName` is stored using CRI labels and annotations.
-    `kubecontainer.ContainerStatus` will be extended with a `ContainerType`
-    field (possible values: `REGULAR`, `INIT` & `EPHEMERAL`) so a container can
-    be identified as a debug container.
-1.  `kuberuntimemanager` gains a new `StartEphemeralContainer()` which calls the
-    existing `startContainer()`.
-1.  `syncPod()` will call `StartEphemeralContainer()` to start the Debug
-    Container. The existing `generateAPIPodStatus()` will be updated to also
-    populate `EphemeralContainers.Status`.
+1.  Approval for basic core API changes to Pod
+1.  Basic support in the kubelet for creating Ephemeral Containers
+1.  A `kubectl alpha debug` command to initiate a debug container
+1.  `kubectl describe pod` will list status of debug containers running in a pod
 
-#### kubectl changes
+Functionality out of scope for 1.12:
 
-In anticipation of this change, [#46151](https://pr.k8s.io/46151) added a
-`kubectl alpha` command to contain alpha features. We will add `kubectl alpha
-debug` to invoke Debug Containers. `kubectl` does not use feature gates, so
-`kubectl alpha debug` will be visible by default in `kubectl` 1.11 and return an
-error when used on a cluster with the feature disabled.
+*   Killing running Ephemeral Containers by removing them from the Pod Spec.
+*   Updating `pod.Spec.EphemeralContainers` when containers are garbage
+    collected.
 
-`kubectl describe pod` will report the contents of `EphemeralContainers` when
-not empty as it means the feature is enabled. The field will be hidden when
-empty.
+Functionality will be hidden behind an alpha feature flag and disabled by
+default.
 
 ## Appendices
 
@@ -550,10 +441,10 @@ container image distribution mechanisms to fetch images when the debug command
 is run.
 
 **Respect admission restrictions.** Requests from kubectl are proxied through
-the apiserver and so are available to existing [admission
-controllers](https://kubernetes.io/docs/admin/admission-controllers/). Plugins
-already exist to intercept `exec` and `attach` calls, but extending this to
-support `debug` has not yet been scoped.
+the apiserver and so are available to existing
+[admission controllers](https://kubernetes.io/docs/admin/admission-controllers/).
+Plugins already exist to intercept `exec` and `attach` calls, but extending this
+to support `debug` has not yet been scoped.
 
 **Allow introspection of pod state using existing tools**. The list of
 `EphemeralContainerStatuses` is never truncated. If a debug container has run in
@@ -587,26 +478,146 @@ active debug container.
 
 ### Appendix 3: Alternatives Considered
 
-#### Mutable Pod Spec
+#### Container Spec in PodStatus
+
+Originally there was a desire to keep the pod spec immutable, so we explored
+modifying only the pod status. An `EphemeralContainer` would contain a Spec, a
+Status and a Target:
+
+```
+// EphemeralContainer describes a container to attach to a running pod for troubleshooting.
+type EphemeralContainer struct {
+        metav1.TypeMeta `json:",inline"`
+
+        // Spec describes the Ephemeral Container to be created.
+        Spec *Container `json:"spec,omitempty" protobuf:"bytes,2,opt,name=spec"`
+
+        // Most recently observed status of the container.
+        // This data may not be up to date.
+        // Populated by the system.
+        // Read-only.
+        // +optional
+        Status *ContainerStatus `json:"status,omitempty" protobuf:"bytes,3,opt,name=status"`
+
+        // If set, the name of the container from PodSpec that this ephemeral container targets.
+        // If not set then the ephemeral container is run in whatever namespaces are shared
+        // for the pod.
+        TargetContainerName string `json:"targetContainerName,omitempty" protobuf:"bytes,4,opt,name=targetContainerName"`
+}
+```
+
+Ephemeral Containers for a pod would be listed in the pod's status:
+
+```
+type PodStatus struct {
+        ...
+        // List of user-initiated ephemeral containers that have been run in this pod.
+        // +optional
+        EphemeralContainers []EphemeralContainer `json:"ephemeralContainers,omitempty" protobuf:"bytes,11,rep,name=ephemeralContainers"`
+
+}
+```
+
+To create a new Ephemeral Container, one would append a new `EphemeralContainer`
+with the desired `v1.Container` as `Spec` in `Pod.Status` and update the `Pod`
+in the API. Users cannot normally modify the pod status, so we'd create a new
+subresource `/ephemeralcontainers` that allows an update of solely
+`EphemeralContainers` and enforces append-only semantics.
+
+Since we have a requirement to describe the Ephemeral Container with a
+`v1.Container`, this led to a "spec in status" that seemed to violate API best
+practices. It was confusing, and it required added complexity in the kubelet to
+persist and publish user intent, which is rightfully the job of the apiserver.
+
+#### Extend the Existing Exec API ("exec++")
+
+A simpler change is to extend `v1.Pod`'s `/exec` subresource to support
+"executing" container images. The current `/exec` endpoint must implement `GET`
+to support streaming for all clients. We don't want to encode a (potentially
+large) `v1.Container` into a query string, so we must extend `v1.PodExecOptions`
+with the specific fields required for creating a Debug Container:
+
+```
+// PodExecOptions is the query options to a Pod's remote exec call
+type PodExecOptions struct {
+        ...
+        // EphemeralContainerName is the name of an ephemeral container in which the
+        // command ought to be run. Either both EphemeralContainerName and
+        // EphemeralContainerImage fields must be set, or neither.
+        EphemeralContainerName *string `json:"ephemeralContainerName,omitempty" ...`
+
+        // EphemeralContainerImage is the image of an ephemeral container in which the command
+        // ought to be run. Either both EphemeralContainerName and EphemeralContainerImage
+        // fields must be set, or neither.
+        EphemeralContainerImage *string `json:"ephemeralContainerImage,omitempty" ...`
+}
+```
+
+After creating the Ephemeral Container, the kubelet would upgrade the connection
+to streaming and perform an attach to the container's console. If disconnected,
+the Ephemeral Container could be reattached using the pod's `/attach` endpoint
+with `EphemeralContainerName`.
+
+Ephemeral Containers could not be removed via the API and instead the process
+must terminate. While not ideal, this parallels existing behavior of `kubectl
+exec`. To kill an Ephemeral Container one would `attach` and exit the process
+interactively or create a new Ephemeral Container to send a signal with
+`kill(1)` to the original process.
+
+Since the user cannot specify the `v1.Container`, this approach sacrifices a
+great deal of flexibility. This solution still requires the kubelet to publish a
+`Container` spec in the `PodStatus` that can be examined for future admission
+decisions and so retains many of the downsides of the Container Spec in
+PodStatus approach.
+
+#### Ephemeral Container Controller
+
+Kubernetes prefers declarative APIs where the client declares a state for
+Kubernetes to enact. We could implement this in a declarative manner by creating
+a new `EphemeralContainer` type:
+
+```
+type EphemeralContainer struct {
+        metav1.TypeMeta
+        metav1.ObjectMeta
+
+        Spec v1.Container
+        Status v1.ContainerStatus
+}
+```
+
+A new controller in the kubelet would watch for EphemeralContainers and
+create/delete debug containers. `EphemeralContainer.Status` would be updated by
+the kubelet at the same time it updates `ContainerStatus` for regular and init
+containers. Clients would create a new `EphemeralContainer` object, wait for it
+to be started and then attach using the pod's attach subresource and the name of
+the `EphemeralContainer`.
+
+A new controller is a significant amount of complexity to add to the kubelet,
+especially considering that the kubelet is already watching for changes to pods.
+The kubelet would have to be modified to create containers in a pod from
+multiple config sources. SIG Node strongly prefers to minimize kubelet
+complexity.
+
+#### Mutable Pod Spec Containers
 
-Rather than adding an operation to have Kubernetes attach a pod we could instead
-make the pod spec mutable so the client can generate an update adding a
-container. `SyncPod()` has no issues adding the container to the pod at that
-point, but an immutable pod spec has been a basic assumption in Kubernetes thus
-far and changing it carries risk. It's preferable to keep the pod spec immutable
-as a best practice.
+Rather than adding to the pod API, we could instead make the pod spec mutable so
+the client can generate an update adding a container. `SyncPod()` has no issues
+adding the container to the pod at that point, but an immutable pod spec has
+been a basic assumption and best practice in Kubernetes. Changing this
+assumption complicates the requirements of the kubelet state machine. Since the
+kubelet was not written with this in mind, we should expect such a change would
+create bugs we cannot predict.
 
-#### Ephemeral container
+#### Image Exec
 
-An earlier version of this proposal suggested running an ephemeral container in
-the pod namespaces. The container would not be added to the pod spec and would
-exist only as long as the process it ran. This has the advantage of behaving
-similarly to the current kubectl exec, but it is opaque and likely violates
-design assumptions. We could add constructs to track and report on both
-traditional exec process and exec containers, but this would probably be more
-work than adding to the pod spec. Both are generally useful, and neither
-precludes the other in the future, so we chose mutating the pod spec for
-expedience.
+An earlier version of this proposal suggested simply adding an `Image` parameter
+to the exec API. This would run an ephemeral container in the pod namespaces
+without adding it to the pod spec or status. This container would exist only as
+long as the process it ran. This parallels the current `kubectl exec`, including
+its lack of transparency. We could add constructs to track and report on both
+traditional exec processes and exec containers, but in the end this approach
+failed to meet our transparency requirements.
 
 #### Attaching Container Type Volume
 
@@ -627,9 +638,8 @@ this simplifies the solution by working within the existing constraints of
 If Kubernetes supported the concept of an "inactive" container, we could
 configure it as part of a pod and activate it at debug time. In order to avoid
 coupling the debug tool versions with those of the running containers, we would
-need to ensure the debug image was pulled at debug time. The container could
-then be run with a TTY and attached using kubectl. We would need to figure out a
-solution that allows access the filesystem of other containers.
+want to ensure the debug image was pulled at debug time. The container could
+then be run with a TTY and attached using kubectl.
 
 The downside of this approach is that it requires prior configuration. In
 addition to requiring prior consideration, it would increase boilerplate config.
@@ -639,14 +649,14 @@ than a feature of the platform.
 #### Implicit Empty Volume
 
 Kubernetes could implicitly create an EmptyDir volume for every pod which would
-then be available as target for either the kubelet or a sidecar to extract a
+then be available as a target for either the kubelet or a sidecar to extract a
 package of binaries.
 
 Users would have to be responsible for hosting a package build and distribution
 infrastructure or rely on a public one. The complexity of this solution makes it
 undesirable.
 
-#### Standalone Pod in Shared Namespace
+#### Standalone Pod in Shared Namespace ("Debug Pod")
 
 Rather than inserting a new container into a pod namespace, Kubernetes could
 instead support creating a new pod with container namespaces shared with
@@ -656,21 +666,21 @@ useful, the containers in this "Debug Pod" should be run inside the namespaces
 (network, pid, etc) of the target pod but remain in a separate resource group
 (e.g. cgroup for container-based runtimes).
 
-This would be a rather fundamental change to pod, which is currently treated as
-an atomic unit. The Container Runtime Interface has no provisions for sharing
+This would be a rather large change for the pod, which is currently treated as
+an atomic unit. The Container Runtime Interface has no provisions for sharing
 outside of a pod sandbox and would need a refactor. This could be a complicated
 change for non-container runtimes (e.g. hypervisor runtimes) which have more
 rigid boundaries between pods.
 
-Effectively, Debug Pod must be implemented by the runtimes while Debug
-Containers are implemented by the kubelet. Minimizing change to the Kubernetes
-API is not worth the increased complexity for the kubelet and runtimes.
+This pushes the complexity of the solution from the kubelet to the runtimes.
+Minimizing change to the Kubernetes API is not worth the increased complexity
+for the kubelet and runtimes.
 
 It could also be possible to implement a Debug Pod as a privileged pod that runs
 in the host namespace and interacts with the runtime directly to run a new
 container in the appropriate namespace. This solution would be runtime-specific
-and effectively pushes the complexity of debugging to the user. Additionally,
-requiring node-level access to debug a pod does not meet our requirements.
+and pushes the complexity of debugging to the user. Additionally, requiring
+node-level access to debug a pod does not meet our requirements.
 
 #### Exec from Node