
add implementation details to sidecar kep #841

Merged
merged 1 commit on Mar 22, 2019

Conversation

Joseph-Irving
Member

This adds some clarification around a few points and adds some initial implementation details.

I didn't want to get too specific with code examples, as I'm not sure how valuable that would be at this stage; instead I just call out the places where I think we should implement the logic, so we can discuss whether this seems like a sensible approach. Once some PoC work has been done, we can link to the PoC branch in the implementation details.

/assign @enisoc @kow3ns

@k8s-ci-robot
Contributor

Hi @Joseph-Irving. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-ok-to-test, kind/kep, size/M, and sig/apps labels on Feb 18, 2019
@k8s-ci-robot requested a review from kow3ns on Feb 18, 2019
@k8s-ci-robot added the sig/architecture label on Feb 18, 2019
@k8s-ci-robot added the sig/pm and cncf-cla: yes labels on Feb 18, 2019
@Joseph-Irving
Member Author

/sig node

cc @dchen1107 @derekwaynecarr, let me know if you have any further questions, parts you would like expanded, etc.

@k8s-ci-robot added the sig/node label on Feb 18, 2019
@@ -67,7 +67,7 @@ Solve issues so that they don't require application modification:

## Non-Goals

-Allowing multiple containers to run at once during the init phase. //TODO See if we can solve this problem with this proposal
+Allowing multiple containers to run at once during the init phase - this could be solved using the same principal but can be implemented separately. //TODO should this be a non-goal or an extension? is there enough support for this?
Member

Maybe not quite what you were asking: Do we have any use-cases or even hypotheticals that speak to the intersection of this design and init containers? E.g. Do init containers need service-mesh functionality? I would assume they do.

In other words, should sidecars start before init containers? And then, do we have sequencing within init containers? Concretely, consider something like Istio, which has an init container with privileges to write iptables rules, then a sidecar which processes network traffic. It seems to me that the sidecar would be needed for any user-provided init containers. I don't know if this has come up in real life yet.

How do we think that holds up in general?

@louiscryan @geeknoid @costinm for some perspective

Member Author

@thockin So this was brought up by Istio specifically, see #753 and https://github.com/istio/cni/issues/77. I think init containers needing service mesh access seems like a fairly valid use case. I think more generally the idea of an init container needing a sidecar isn't unheard of, e.g. I have an init container that performs a DB schema migration but it should go through the MySQL proxy.

I think you can also assume that some sidecars may need the init containers to run before they start, such as the Istio example of an init container setting up iptables and the sidecar then starting to handle traffic.

You could potentially achieve this using what we've already suggested but adapting it for init, e.g.:

initContainers:
- name: iptables
  command: ['set up some iptables']
- name: proxy
  command: ['proxy-stuff']
  sidecar: true
- name: myinitapp
  command: ['do some init stuff that requires the proxy']

The difference being that it would follow ordering like init already does, but when it gets to a sidecar it would wait for it to be ready instead of waiting for it to complete; the init sidecars would then be cleaned up at the end of the init phase.
This has the advantage that it keeps the UX relatively simple and the init phase logically separated, while hopefully enabling the use case.
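
To make that ordering concrete, here is a minimal Go sketch of the init-phase behaviour described above; every type and helper name is a hypothetical stand-in, not an existing kubelet API:

```go
// Hypothetical sketch only: initContainer and the function parameters are
// illustrative stand-ins for the kubelet's real types and machinery.
package sketch

type initContainer struct {
	Name    string
	Sidecar bool
}

// runInitPhase walks the init containers in order. Regular init containers
// must run to completion before the next one starts; an init sidecar only
// needs to become ready. All init sidecars are cleaned up when the phase ends.
func runInitPhase(
	containers []initContainer,
	start, waitReady, waitCompleted, stop func(initContainer),
) {
	var sidecars []initContainer
	for _, c := range containers {
		start(c)
		if c.Sidecar {
			waitReady(c) // wait for readiness, not completion
			sidecars = append(sidecars, c)
			continue
		}
		waitCompleted(c) // existing init-container semantics
	}
	for _, s := range sidecars {
		stop(s) // tear down init sidecars at the end of the init phase
	}
}
```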

Contributor

I agree with the need for the functionality, but something feels wonky UX-wise about boot order jumping between the containers and initContainers sections.

Would it make sense to have a more explicit DAG of boot operations?

Member Author

@vllry, what do you mean by boot order jumping? Perhaps I wasn't that clear but I was suggesting we still have all the init containers finishing before the containers start. You could just have init sidecars that run alongside the init containers for that phase only.

Contributor

What I meant was that having a sidecar toggle for initContainers seems clunky in the current config (the overall idea makes sense).

E.g. I might have a boot order like this:

initContainers:
- name: 1
  sidecar: true
  command: ...
- name: 3
  command: ...
containers:
- name: 2
  sidecar: true
- name: 4

It definitely works, but is awkward to read.

Member Author

I was suggesting that init container sidecars would start and finish in the init phase, and container sidecars would start after the init phase is over, so there would be no way to have a boot order like the one you described. I agree it would be confusing if that were the case.

Member

My vote would be to explore this just enough to be able to write a short section in the doc explaining how this could be extended to init containers in the future (at a similar level of implementation detail as the current section for implementing non-init sidecars).

We have clear, concrete use cases for non-init sidecars (Jobs, log/result shippers, sql proxy, Istio) and I don't think we should block addressing those. Non-init sidecar is already a pattern that's in use in the real world, and we just want to replace ugly workarounds (wrapper scripts, etc) with true support. Init sidecars are not in use because it's not possible even as a workaround.

Member

I think init containers should continue to be executed to completion strictly before sidecars are launched. The purpose of init containers is to perform actions that pertain to the Pod and need to happen prior to the main containers starting (e.g. setting up configuration files, preparing the file system, pulling data down). Launching run-to-completion containers subsequent to "main" containers might be problematic. Perhaps in the future we could add init-sidecars?

Member

@enisoc enisoc left a comment

This looks like a good first pass to me. I added some suggestions for things to consider in the second pass.

/lgtm

###### Shutdown triggering:
Package `kuberuntime`

Modify `kuberuntime_manager.go`, function `computePodActions`. Add a check in this function that sees whether all the non-sidecars have permanently exited; if they have, return all the running sidecars in `ContainersToKill`.
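
As a rough illustration of that check (not a real patch; the types are simplified stand-ins for the kubelet's container status structs), it could look something like:

```go
// Illustrative sketch only: once every non-sidecar has permanently exited,
// return the running sidecars so they end up in ContainersToKill.
package sketch

type containerStatus struct {
	Name              string
	Sidecar           bool
	Running           bool
	PermanentlyExited bool // exited and will not be restarted under the pod's restart policy
}

// sidecarsToKill returns the running sidecars if, and only if, all
// non-sidecar containers have permanently exited.
func sidecarsToKill(statuses []containerStatus) []containerStatus {
	for _, s := range statuses {
		if !s.Sidecar && !s.PermanentlyExited {
			return nil // a non-sidecar is still running or may restart
		}
	}
	var kill []containerStatus
	for _, s := range statuses {
		if s.Sidecar && s.Running {
			kill = append(kill, s)
		}
	}
	return kill
}
```
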
Member

When we trigger shutdown of sidecars through ContainersToKill, what exit status do they get? Do we send them SIGTERM and give them a chance to terminate gracefully? We need to verify that the overall Pod phase (Succeeded/Failed) will reflect the right semantics in the presence of sidecars, so that Job will work as intended.

Member Author

So when you return `ContainersToKill`, it iterates over them and calls `KillContainer` on each container. That is the function that sends the PreStop hook and SIGTERM/SIGKILL, etc., and the grace period is obeyed. So as long as the sidecar exits 0, it should count towards Succeeded.

###### Sidecars terminated last
Package `kuberuntime`

Modify `kuberuntime_container.go`, function `killContainersWithSyncResult`. Break up the loop over containers so that it kills all the non-sidecars before terminating the sidecars.
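
A minimal sketch of what splitting that loop might look like, assuming simplified types in place of `kubecontainer.Container` (the real function also handles sync results, which this sketch omits):

```go
// Illustrative sketch only: terminate non-sidecars first, then sidecars last.
package sketch

type runningContainer struct {
	Name    string
	Sidecar bool
}

// killNonSidecarsFirst terminates all non-sidecars, then the sidecars.
func killNonSidecarsFirst(containers []runningContainer, kill func(runningContainer)) {
	var sidecars []runningContainer
	for _, c := range containers {
		if c.Sidecar {
			sidecars = append(sidecars, c)
			continue
		}
		kill(c) // non-sidecars go first
	}
	for _, c := range sidecars {
		kill(c) // sidecars are terminated last
	}
}
```
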
Member

The inner loop iterates over kubecontainer.Container (as opposed to v1.Container), which doesn't seem to have a way to tell whether the container is a sidecar. You might be able to cross-reference from the v1.Pod that also gets passed in, but we need to be careful to validate that that argument contains what we think it does. For example, I saw some comments that seemed to imply pod might be nil sometimes; we need to know what those cases are, and whether we can handle them properly.

It might end up being necessary to add data to kubecontainer.Container. I'm not sure if that represents a change to CRI? I'm also new to this code.

Member Author

Oh yeah, I missed that. Looking around, it seems to indicate that `pod` may be nil but the running pod must not be. If the pod is dead we're not as bothered about termination ordering, so cross-referencing may work with an `if pod != nil` check.
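
A minimal sketch of that cross-referencing idea, with simplified stand-ins for `v1.Pod`/`v1.Container` and with the `Sidecar` field standing in for the new API field this KEP proposes:

```go
// Illustrative sketch only: look the runtime container up by name in the pod
// spec; if the pod object is nil (e.g. it was force-deleted), we cannot tell,
// so treat it as a non-sidecar and give up on termination ordering.
package sketch

type specContainer struct {
	Name    string
	Sidecar bool
}

type podSpec struct {
	Containers []specContainer
}

func isSidecar(pod *podSpec, runtimeContainerName string) bool {
	if pod == nil {
		return false
	}
	for _, c := range pod.Containers {
		if c.Name == runtimeContainerName {
			return c.Sidecar
		}
	}
	return false
}
```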

Member

If we can verify that pod is only ever nil when the Pod object on the API server has been force-deleted, then I agree it's reasonable to say, "We don't respect sidecar termination order if you force-delete the Pod." Have you found any evidence to say exactly when pod could be nil? For example, is it possible as a transient condition during a Node restart, before it fetches Pods from the API server?

Member Author

I'm not confident from looking at the code that I could tell you all the scenarios where pod could be nil. If @feiskyer could weigh in it would be appreciated. If there are scenarios where it can be nil and we still care about ordering then we'll need to look at a different way to enable this behaviour. Changing kubecontainer.Container is a possibility but it means changing the container runtime.

###### PreStop hooks sent to Sidecars first
Package `kuberuntime`

Modify `kuberuntime_container.go`, function `killContainersWithSyncResult`. Loop over sidecars and execute `executePreStopHook` on them before moving on to terminating the containers.
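
A minimal sketch of that ordering, with `executePreStop` and the kill step as stand-ins for the kubelet's `executePreStopHook` and `killContainer` logic:

```go
// Illustrative sketch only: sidecars receive their PreStop hooks before any
// container is terminated, so e.g. a proxy can start draining while the main
// containers shut down.
package sketch

type podContainer struct {
	Name       string
	Sidecar    bool
	HasPreStop bool
}

func preStopSidecarsFirst(
	containers []podContainer,
	executePreStop func(podContainer),
	killNonSidecarsThenSidecars func([]podContainer),
) {
	for _, c := range containers {
		if c.Sidecar && c.HasPreStop {
			executePreStop(c) // sidecar PreStop hooks go out first
		}
	}
	// Then terminate non-sidecars followed by sidecars, as sketched earlier.
	killNonSidecarsThenSidecars(containers)
}
```
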
Member

The current code groups pre-stop and stop in one function (killContainer). Do you plan to break that up and fix all call sites?

Member Author

I'm quite unsure about this part. `killContainer` does not appear in too many places, so we could potentially change every call site to split the pre-stop from the termination; this definitely complicates the code a little, as going forward everyone would have to be aware that they need to call both pre-stop and kill. Alternatively, you could execute the pre-stop hook on the sidecars twice, once before the containers are terminated and again when the sidecars themselves are terminated; this would of course rely on the assumption that pre-stop hooks on sidecars are idempotent.

###### Sidecars started first
Package `kuberuntime`

Modify `kuberuntime_manager.go`, function `computePodActions`. If the pod has sidecars, return these first in `ContainersToStart`; until they are all ready, do not return the non-sidecars.
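
A minimal sketch of that start-ordering logic, with simplified stand-in types rather than the real `podActions`/`ContainersToStart` structures:

```go
// Illustrative sketch only: start sidecars first and hold back the
// non-sidecars until every sidecar reports ready.
package sketch

type desiredContainer struct {
	Name    string
	Sidecar bool
	Running bool
	Ready   bool
}

func containersToStart(containers []desiredContainer) []string {
	var start []string
	allSidecarsReady := true
	for _, c := range containers {
		if !c.Sidecar {
			continue
		}
		if !c.Running {
			start = append(start, c.Name)
		}
		if !c.Ready {
			allSidecarsReady = false
		}
	}
	if !allSidecarsReady {
		return start // non-sidecars wait for a later sync
	}
	for _, c := range containers {
		if !c.Sidecar && !c.Running {
			start = append(start, c.Name)
		}
	}
	return start
}
```

The key design point is that the non-sidecars stay out of the start list until every sidecar is ready, which is why the question below about when the next sync happens matters.
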
Member

After we return the sidecars, is it guaranteed that we'll sync again at the right time to get a chance to check readiness and start the non-sidecars? For example, do we sync every time any container status (as in, status on the machine, not k8s API object .status) changes? It might be obvious to people familiar with this code, but I just want to make sure.

Member Author

From my understanding, pod syncs are run frequently, with the kubelet option `SyncFrequency` indicating the maximum gap between syncs; this defaults to 1 minute.

@k8s-ci-robot added and removed the lgtm label on Feb 26, 2019
###### Sidecars started first
Package `kuberuntime`

Modify `kuberuntime_manager.go`, function `computePodActions`. If the pod has sidecars, return these first in `ContainersToStart`; until they are all ready, do not return the non-sidecars. `computePodActions` is run frequently, with the maximum time between runs determined by the kubelet's `SyncFrequency` option (default 1 minute).
Member

A minute is actually a large enough delay to be annoying. For example, if we always waited 1 minute after starting sidecars before realizing we can now start regular containers, we'd be adding that minute to the startup of any Pod using sidecars.

I would hope that there is also an event-based sync trigger in addition to the periodic one. Did you see any evidence of that?

Member Author

My understanding is that any change to the Pod status should trigger a sync via an update channel: `config/config.go` should send a `kubetypes.RECONCILE` when `pod.Status` has changed. This is given to the kubelet's `syncLoop` update channel, which should trigger `handler.HandlePodReconcile(u.Pods)` and cause the pod to be synced. So when the sidecar becomes ready, that should trigger a sync.

Member Author

So this was not quite correct: it does indeed trigger `handler.HandlePodReconcile(u.Pods)`, but that only triggers a pod sync if `NeedToReconcilePodReadiness` returns true, which currently only does anything if the pod is using readiness gates. Potentially we could put another check here that looks to see if all sidecars are ready while the non-sidecars haven't been started, and if so triggers a sync.
The kubelet doesn't currently have an update channel for readiness changes, only liveness, so that isn't an option without making a fair amount of changes.
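
A minimal sketch of the extra condition being suggested here, with simplified stand-in types (this is not the real kubelet reconcile code, just the shape of the check that could sit alongside `NeedToReconcilePodReadiness`):

```go
// Illustrative sketch only: force an immediate pod sync once all sidecars are
// ready but the non-sidecars have not been started yet.
package sketch

type observedContainer struct {
	Name    string
	Sidecar bool
	Ready   bool
	Started bool
}

// needSidecarStartSync reports whether the pod should be synced now rather
// than waiting for the periodic sync.
func needSidecarStartSync(containers []observedContainer) bool {
	hasSidecar := false
	for _, c := range containers {
		if c.Sidecar {
			hasSidecar = true
			if !c.Ready {
				return false // still waiting on a sidecar
			}
		}
	}
	if !hasSidecar {
		return false // nothing special to do for pods without sidecars
	}
	for _, c := range containers {
		if !c.Sidecar && !c.Started {
			return true // sidecars ready, main containers still pending
		}
	}
	return false
}
```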

@enisoc
Member

enisoc commented Mar 1, 2019

/ok-to-test
/lgtm

@k8s-ci-robot added the lgtm and ok-to-test labels and removed the needs-ok-to-test label on Mar 1, 2019
@Joseph-Irving mentioned this pull request on Mar 7, 2019
@Joseph-Irving
Member Author

I have created a PoC for the sidecars and raised a WIP PR for people to view, kubernetes/kubernetes#75099. It's very much a first draft, but it should hopefully help illustrate what these implementation details might look like.

@Joseph-Irving changed the title from "add draft implementation details to sidecar kep" to "add implementation details to sidecar kep" on Mar 21, 2019
@k8s-ci-robot removed the lgtm label on Mar 21, 2019
@Joseph-Irving
Member Author

Unless anyone has big objections, it would be good to get this merged soon and revisit some of the unanswered questions in another PR.
This is also because I would like to update the KEP to match the new format in https://github.com/kubernetes/enhancements/blob/master/keps/YYYYMMDD-kep-template.md, but I'd rather do that separately as I think it will be a very large diff.

cc @enisoc @kow3ns

Member

@enisoc enisoc left a comment

/lgtm

@k8s-ci-robot added the lgtm label on Mar 21, 2019
@kow3ns
Member

kow3ns commented Mar 22, 2019

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enisoc, Joseph-Irving, kow3ns

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Mar 22, 2019