-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ExecutionHook API Design #705
Conversation
|
||
``` | ||
// ExecutionHook defines a specific action that should be taken with retries and timeouts | ||
type ExecutionHook struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the relationship with the existing container lifecycle hook and why need to introduce a new one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The lifecycle hook is called immediately after a container is created or immediately before a container is terminated. The proposed execution hook is not tied to the start or termination time of the container. Instead it is called before/after taking a snapshot while the container is still running. So we can't use the lifecycle hook. I'll add some clarification to the spec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for explanation!
While as we are introducing a new hook in vanilla container struct, it will be better if there're more general use cases/user stories instead of vol snapshot only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than Lifecycle hook, there is also Liveness Probe and Readiness Probe hooks in Container to check the health of a container. So it seems that the existing hooks all have a specific purpose? It is possible that a user may want to do some preparation work before an action and then do some post-processing after an action. I am just not sure what those actions are, other than snapshotting. Alternatively we can make this a more general-purpose pre-action and post-action hook.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I am thinking there're probably more general purposes, and it deserves some thinking/writing and would be great to be included in the user stories part as well. I am not against adding hooks for specific purpose of course, but then I would suggest to name it as sth more specific like VolSnapshotHook instead of ExecutionHook.
What we are trying to avoid is, we have a ExecutionHook
which sounds like a general purpose hook, but oops, we are telling users that it can only be used for prep snapshot from design doc, code and all the way to user doc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Let me discuss with Jing about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See updated spec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove any references to NEXT_KEP_NUMBER
and rename the KEP to just be the draft date and KEP title.
KEP numbers will be obsolete once #703 merges.
@justaugustus @resouer Addressed your comments. PTAL. Thanks. |
Thanks @xing-yang! |
// Defines the hook to quiesce and unquiesce an application | ||
// +optional | ||
QuiesceUnquiesceHook *QuiesceUnquiesceHook `json:"quiesceUnquiesceHook,omitempty" protobuf:"bytes,22,opt,name=quiesceUnquiesceHook"` | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add more examples of what these hooks can do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clarified in the updated spec.
|
||
## Implementation History | ||
|
||
* Feature description: https://github.com/kubernetes/kubernetes/issues/177 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this correct url kubernetes/kubernetes#177 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, this is wrong. It should be 177 under enhancements: #177
So we want to introduce an `ExecutionHook` to facilitate the quiesce and unquiesce | ||
actions when taking a snapshot. There is an existing lifecycle hook in the `Container` | ||
struct. The lifecycle hook is called immediately after a container is created or | ||
immediately before a container is terminated. The proposed execution hook is not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xing-yang What happens if the container get a terminate event while executionhook
is in progress ? whats the recovery path for execution hook ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If container is terminated and quiesce is in the middle of running it, we should get an error from stdout. We will fail snapshotting in this case.
If container is terminated and unquiesce is in the middle of running it, then unquiesce should automatically happen in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if container terminate means container is no long running, both quiesce/unquiesce commands will not be executed and it will fail, executionHook should report the errors in the status
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be crystal clear, this is along the lines of a feature I want, but it is a tricky feature, and I think this just scratches the surface at the hard parts.
kubernetes/kubernetes#24957 for history
struct. The lifecycle hook is called immediately after a container is created or | ||
immediately before a container is terminated. The proposed execution hook is not | ||
tied to the start or termination time of the container. Instead it is called before | ||
or after taking a snapshot while the container is still running. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I presume that this is not SPECIFIC to snapshotting, but a more general mechanism? That's what the name implies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just left same review comment here #705 (comment)
@xing-yang I think we can actually "borrow" some narratives from #705 (comment), which is really great thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comments addressed in the updated spec.
Handler `json:",inline" protobuf:"bytes,1,opt,name=handler"` | ||
// How long the controller should try/retry to execute the hook before giving up | ||
RetryTimeOutSeconds int64 `json:"retryTimeOutSeconds,omitempty" protobuf:"varint,2,opt,name=retryTimeOutSeconds"` | ||
// Number of retries |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Under what conditions will it retry? Just timeout? Or will it retry if it fails?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed "NumRetries". Now "TimeoutSeconds" will be the total duration of trying to run the hook, whether retries happen or not.
// Command to execute for a particular trigger | ||
Handler `json:",inline" protobuf:"bytes,1,opt,name=handler"` | ||
// How long the controller should try/retry to execute the hook before giving up | ||
RetryTimeOutSeconds int64 `json:"retryTimeOutSeconds,omitempty" protobuf:"varint,2,opt,name=retryTimeOutSeconds"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this aggregate (across all retries) or per attempt? Naming is weird to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to "TimeoutSeconds". This will be the total duration of trying to run the hook, regardless of retries.
|
||
``` | ||
// QuiesceUnquiesceHook includes a Quiesce action and an Unquiesce action | ||
type QuiesceUnquiesceHook struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First the name - this is very hard to spell. In fact this doc gets it wrong as often as not. Can we adopt simpler naming? Pause/Resume or something that is meaningful but less difficult?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a point to discuss - why do we want this as a specific hook? Assuming we do this at all, would we rather a general "notification" or "imperative" API or N specific hooks?
An advantage of being specific is that we can make specific sub-resources that we can ACL specifically.
An advantage of being generic is that we don't have to keep solving this in almost-identical ways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see the updated spec. Introduced a generic ExecutionHook but also added a type to indicate it is for "Freeze"/"Thaw". Support for new types can be added later.
// A single application container that you want to run within a pod. | ||
type Container struct { | ||
...... | ||
// Defines the hook to quiesce and unquiesce an application |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to talk about delivery and define the guarantees.
Who delivers this? The controller or kubelet?
If the storage controller: How is it authenticated (e.g. it can't be arbitrarily open to anyone)? How do they actually trigger an exec?
If the kubelet: How do we tell kubelet to go do it? What is the trigger from the storage controller?
What guarantees do we make about delivery? Do we guarantee it gets delivered at all? Do we guarantee it gets delivered exactly once? Do we guarantee it gets delivered at least once?
How does the controller know it is done? Does the app indicate somehow that it is complete, or is this expected to be synchronous?
How do we access-control this?
We need to have answers to all these (and more, I am sure) and they need to be documented. See kubernetes/kubernetes#24957 for more exploration of this idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. We need to provide a lot more details in the spec. Will take a look of kubernetes/kubernetes#24957 for ideas.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
queiscing an application in order to achieve application consistent snapshots using block level snapshots is tricky. The general outline of what you probably want to do looks like this
- Call the queisce hook to flush any memory and/or WALs to the filesystem, sync the filesystem storage media, freeze the file system, and block and application level writes.
- Take the block level snapshot.
- Call unqueisce to unfreeze the filesystem and allow application level writes.
I think the lifecycle of all the operations is tied.
The hook in (1) needs to return success or failure. If it fails the system should not do (2). If (1) succeeds, without respect to (2), the system must do (3). If (1) succeeds, again without respect to (2), and (3) fails, you've probably broken the application, and you need a way to deal with it. If (2) fails at any point in time, you still need to do (3).
I think it follows that (1) needs to be implemented as idempotent with at least once delivery guarantees. If you do (2) without (1) you are at best crash consistent (and probably not even that).
(3) needs to be idempotent as well.
I think you need to describe how the system will function in more detail in order to motivate the API addition and for people to consider the API in the context of its proposed implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed comments in the updated spec.
cc @kow3ns since I requested thinking about lifecycle hooks more holistically |
|
||
``` | ||
// ExecutionHook defines a specific action that should be taken with timeout | ||
type ExecutionHook struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you consider adding new entries to the already existing "lifecycle" section of the container spec?
https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.10/#lifecycle-v1-core
They do extremely similar things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point. We have not considered this. Will look into it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed with Jing about this. We added Alternatives which would include the new ExecutionHook directly inside "Lifecycle". Please take a look. Thanks.
Name string | ||
// Type of an execution hook | ||
// +optional | ||
Type string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Type doesn't seem necessary. Will a given item need multiple freeze hooks and multiple thaw hooks?
PostStart *Handler | ||
// Defines the hooks to trigger an action in a container | ||
// +optional | ||
ExecutionHooks []ExecutionHook |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was expecting
Freeze *Handler
Thaw *Handler
TCPSocket *TCPSocketAction | ||
// Name of an execution hook | ||
// +optional | ||
Name string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you don't have a list of these, this isn't necessary.
Cons for option 1 and 2: | ||
|
||
* `Type` is optional here for backward compatibility, however, `Type` should be required because a `Freeze` type tells the controller it needs to be run before cutting the snapshot while a `Thaw` type tells the controller to run it after cutting the snapshot. Making it optional will make it hard for the snapshot controller to determine when to run those hooks. | ||
* User may get confused. User may expect everything inside `Lifecycle` is handled by kubelet, however, `ExecutionHooks` will be handled by an external controller/operator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can potentially be solved with documentation.
// will set a default timeout. | ||
// +optional | ||
TimeoutSeconds int64 | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can also embed, e.g.
type NamedHandler struct {
Name string
Handler
}
I don't have time to get on this right now, but I do want to have a look, so please give me a ping when it's baked and I will try to make time |
@lavalamp Yes, I think so :) |
/lgtm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this looks good. Much cleaner then original proposal and sufficient (to me) for an alpha implementation. Left a few comments for cleanup.
For beta we will need kubelet executing these instead of a standalone controller, some mechanism (e.g. PodSecurityContext or pod fields) to white-list "allowed" actions, etc. but that is understood, this is to get an alpha kick the tires implementation out.
// If execution fails, the execution hook controller will keep retrying until reaching | ||
// ActionTimeoutSeconds. If execution still fails or hangs, execution hook controller | ||
// stops retrying and updates executionhook status to failed. | ||
// +optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the comments we should clarify the guarantees around this (e.g. if controller loses its state, counter restarts so maybe say that controller will retry for at least this long, before stopping).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comments to clarify.
|
||
// ActionTimeoutSeconds defines when the execution hook controller should stop retrying. | ||
// If execution fails, the execution hook controller will keep retrying until reaching | ||
// ActionTimeoutSeconds. If execution still fails or hangs, execution hook controller |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also explain that once an action is started, controller has no way to stop it even if ActionTimeoutSeconds is exceeded. This simply controls if retry happens or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added explanation.
// This is required. | ||
Action core_v1.Handler | ||
|
||
// ActionTimeoutSeconds defines when the execution hook controller should stop retrying. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should also define the retry policy somewhere: what happens if ActionTimeoutSeconds
is not specified (i.e. it retries until object is deleted). And is there any exponential backoff policy or is it just back to back retries on failures?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comments about retry policy.
// +optional | ||
ContainerExecutionHookStatuses []ContainerExecutionHookStatus | ||
|
||
// Action Summary status |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's remove this since it is not strictly necessary (and a little confusing). We can reintroduce in the future if necessary).
|
||
``` | ||
// ExecutionHook is in the tenant namespace | ||
type ExecutionHook struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From a non-code comment:
Let's talk about names. Looking forward to it being in Pod, I would propose "actions" or "notifications". You seem to like action, I am ok with that for now.
Since we want to keep the declarative request for action, I propose to rename ExecutionHook -> PodAction
Since we want to move the definition of the action itself into Pod, the name matters less. Rename HookAction -> PodActionTemplate or PodActionDefinition or something like that?
// asynchronously. If execution ordering is required, caller has to implement the logic and create | ||
// different hooks in order. | ||
// This field is required. | ||
PodSelection PodSelection |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
naming: I would call this field pods
or target
// If both PodContainerNamesList and PodContainerSelector are not | ||
// specified, the ExecutionHook cannot be executed and it will fail. | ||
// +optional | ||
PodContainerNamesList []PodContainerNames |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
naming: rename to names
or pods
. e.g. with the above it would be:
target:
pods:
- myapp-1234
- myapp-5678
// PodSelection contains two fields, PodContainerNamesList and PodContainerSelector, | ||
// where one of them must be defined so that the hook controller knows where to | ||
// run the hook. | ||
type PodSelection struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This qualifies as a union under emerging conventions and should have a discriminator field, e.g. type
to indicate either "Pods" or "Selector".
@apelisse this is the second KEP I reviewed today that I have said this to. :)
// based on the pod label selector and container names | ||
// If PodContainerNamesList is specified, this field will not be used. | ||
// +optional | ||
PodContainerSelector *PodContainerSelector |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can just be called Selector
and be a simple selector. Container Name should be in the action. If the action were defined in Pod, it would be under a container, so it belongs on the other side of this relationship.
// HookAction does not contain information on pods/containers because those are | ||
// runtime info. | ||
// HookAction is namespaced | ||
type HookAction struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm proposing to rename ExecutionHook to PodAction or Action and that (long-term) this be embedded into Pod.
|
||
// This contains the command to run on a container. | ||
// The command should be idempotent because the system does not guarantee exactly-once semantics. | ||
// Any action may be triggered more than once but only the latest results will be logged in status. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"latest" is somewhat meaningless. How about saying "only one" result will be logged
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what we mean here is that if controller retries, it will update the result which will overwrite the old result updated before.
// stops retrying and updates executionhook status to failed. | ||
// +optional | ||
ActionTimeoutSeconds *int64 | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above, container name should go here.
} | ||
|
||
// ContainerExecutionHookStatus represents the current state of a hook for a specific container in a pod | ||
type ContainerExecutionHookStatus struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to revisit a comment, this will replicate PodName if there are multiple containers in the pod. Is it worth structuring as
{
podName: pod-1234
containers:
- name: app
timestamp: 12345678
success: true
- name: sidecar
success: false
?
PodSelection PodSelection | ||
|
||
// Name of the HookAction. This is required. | ||
HookActionName string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I expect it to come back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of my comments are naming or API related, and this is not an API review. As such, I think the idea is OK and will approve the KEP. Please consider this feedback as you get the API pinned down, and let's do an thorough API review before you get too far into impl.
This maybe a good topic for sig-arch once we can show that it works well.
Thanks!
/lgtm
/approve
This PR adds ExecutionHook in the Container struct.
address comments
/approve |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: j-griffith, liyinan926, saad-ali, thockin, xing-yang The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
// the Snapshot Controller. | ||
type ExecutionHookSpec struct { | ||
// PodSelection defines how to select pods and containers to run | ||
// the executionhook. If multiple pod/containers are selected, the action will executed on them |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this a webhook or an executable hook ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is execution hook
|
||
## Summary | ||
|
||
This proposal is to introduce an API (ExecutionHook) for dynamically executing user’s commands in a pod/container or a group of pods/containers and a controller (ExecutionHookController) to manage the hook lifecycle. ExecutionHook provides a general mechanism for users to trigger hook commands in their containers for their different use cases. Different options have been evaluated to decide how this ExecutionHook should be managed and executed. The preferred option is described in the Proposal section. The other options are discussed in the Alternatives section. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a problem with the utilization of the term "hook" in this context. In general, a hook is used "to alter or augment the behaviour of a software system by intercepting function calls or messages or events passed between components" (adapted from Wikipedia entry). Hooks are related to specific events or actions, not are intended to be arbitrarily called as I understand from the proposal.
I understand the proposed mechanism facilitates implementing hooks for other components, as described in the example, but said mechanism is not, IMHO, a hook.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But do you have issues of using ExecutionHook as API name? What we want to say here is to propose a set of APIs to allow users trigger commands inside their containers
-podName: myPod2 | ||
-containerName: myContainer3 | ||
-containerName: myContainer4 | ||
hookActionName: action-demo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is the hookAction a different object ? What do we loose if we combined the hookaction inside ExecutionHook ?
This PR proposes ExecutionHook API design.