Bug 1453190 - Fix pod update operation #14892
Conversation
@openshift/networking @dcbw PTAL
[test]
pkg/sdn/plugin/pod_linux.go (Outdated)
// Kubelet starts asynchronously and when we get an Update op, kubelet may not have created the runtime endpoint.
// So try a couple of times before bailing out (~2 second timeout).
2 seconds seems potentially like not enough...
What bug exactly does this fix? (Code basically LGTM but this is @dcbw's area of expertise.)
duh, just noticed the bug id in the summary
OK
So this only affects the re-VNID-ing case, not the resync-pods-at-startup case? In that case we could just change the way multitenant.go:updatePodNetwork() works, so that rather than going pod-by-pod, it just operates on the DumpFlows output as a whole, changing the VNID of every affected pod. That might end up being simpler?
@pravisankar I like the solution of talking to the runtime directly; I should have done that for 14665 too. For this bug though, could we put the runtime-related stuff into update.go once 14665 gets merged, rather than in pod_linux.go? I think pod_linux.go and pod.go should just consume the given sandbox ID, which would get passed from node.go (or update.go) instead. Since Update is the only operation that requires talking to the runtime service, it might make sense to put all of that into update.go. Other than that, looks generally good to me!
@danwinship we cannot directly work on the dump-flows output in multitenant.go:updatePodNetwork(). We have to go pod by pod (CNI_UPDATE per pod) so that the CNI pod manager updates its bookkeeping (the runningPods field, used for hostport/multicast handling). I think we can get the dumpFlows() output once and reuse it for all the pods; this needs some fiddling with the CNI request/response. I'll work on this change.
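The single-dump idea could look roughly like this; `dumpFlows` and `updatePod` are hypothetical stand-ins for the OVS flow dump and the per-pod CNI UPDATE, not the actual plugin functions.

```go
package main

// updatePodNetworks sketches the idea above: fetch the OVS flow dump
// once, then issue a per-pod update for every affected pod so the CNI
// pod manager's bookkeeping (runningPods, hostports, multicast) stays
// correct, while DumpFlows runs only one time.
func updatePodNetworks(pods []string, dumpFlows func() (string, error), updatePod func(pod, flows string) error) error {
	flows, err := dumpFlows()
	if err != nil {
		return err
	}
	for _, pod := range pods {
		if err := updatePod(pod, flows); err != nil {
			return err
		}
	}
	return nil
}
```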
@pravisankar well #14893 depends on the changes here to use the runtime anyway, right? So in that case it makes sense to keep this like it is for now, although I'd really like to simplify the VNID-changing code at some point in the future.
Force-pushed from ce51a17 to 924fd69
@dcbw @danwinship Updated, ptal
[test]
[test] last flaked on #14043
LGTM
Force-pushed from 924fd69 to 333d356
Hold on - so now our networking code depends on calling CRI in order to work correctly?
Removing merge tag temporarily. Why is it good for SDN to make calls to the CRI directly?
@smarterclayton because Kubernetes doesn't support changes to pods while they are running, but we do for things like merge/split tenant networks. And that requires that we get sandbox IDs from the runtime, because we need to manipulate the sandbox to do those updates. We used to get this information from kubelet directly via the network plugin interface, but we can no longer do that with the move to remote runtimes. That's actually a good thing; it puts us on the same footing as all other CNI plugins. The alternative is to cache the sandbox IDs, net namespaces, and IPAM information on disk somewhere, validate all of that if kubelet crashes and restarts, and use that. But the runtime already has most of this information. Furthermore, I'm going to port #14665 over to use these runtime interfaces, since that's better than banging on Docker directly and will actually work with CRI-O too.
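The sandbox-ID lookup described here can be sketched as below. The `sandbox` struct is a hypothetical stand-in for the CRI PodSandbox message; the real plugin would obtain these records from ListPodSandbox over the runtime's gRPC endpoint.

```go
package main

// sandbox is a simplified stand-in for the CRI PodSandbox record.
type sandbox struct {
	ID        string
	Namespace string
	Name      string
	Ready     bool
}

// sandboxIDForPod sketches the lookup described above: find the ready
// sandbox for a given pod so its network namespace can be manipulated
// during an update.
func sandboxIDForPod(sandboxes []sandbox, namespace, name string) (string, bool) {
	for _, sb := range sandboxes {
		if sb.Ready && sb.Namespace == namespace && sb.Name == name {
			return sb.ID, true
		}
	}
	return "", false
}
```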
Does the kubelet expect other daemons to be making calls to the CRI? @derekwaynecarr @sjenning
@smarterclayton I'm not sure of any hard requirement that the kubelet have exclusive access to the CRI. We have PLEG, which should notify the kubelet if anything it cares about changes. Am I correct in understanding that the only CRI request being made in this change is querying the sandbox ID?
@smarterclayton I have to imagine CRI runtimes are safe here, since kubelet calls into the runtimes from a couple of different goroutines. As to whether kubelet handles changes made underneath it, we're essentially doing "docker stop". I realize it's more complicated than that, but looking at the dockershim code, it looks like ListPodSandbox() doesn't affect any global state, and while it does read checkpoints, the checkpoint store is required to be thread-safe.
[test]
Force-pushed from 333d356 to 20ef6fe
Jenkins failed to highlight the actual error:
@sjenning @smarterclayton @derekwaynecarr are your concerns addressed? Can we merge this? Thanks
@knobunc @dcbw I guess I'm still not completely clear on what a StopPodSandbox() outside the kubelet will do. As soon as the kubelet becomes aware, it will try to restart the pod, assuming the restartPolicy isn't
@sjenning kubelet assumes everything is fine with networking on the pod at startup if kubelet/openshift-node is restarted for any reason (manual, or perhaps it segfaulted/panicked). It simply has no facility to check network status after a pod has started. The pathological case is a restart that blows away the OVS flows that openshift-sdn has added. On restart, pod networking is thus broken, but kubelet does not call into the network plugin to detect that, nor does it give the plugin a chance at startup to fix things. So we are left trying to detect that situation and somehow fix networking ourselves. But we cannot get the necessary details out of kubelet (like the network namespace) because we aren't called from kubelet or the runtime. So our only solution is to somehow kill the sandbox, which causes kubelet to restart that sandbox, and thus networking will be reconfigured for it.
We don't actually do anything with the dead sandbox. We want kubelet to create a new one, which it does. AFAIK kubelet never re-uses a dead sandbox, or if it does, it will call network setup on it again, which is exactly what we want. I am attempting to fix this upstream and in CNI. There is a forward path, but it's going to take a bit. Before CRI we hooked into kubelet directly and could call GetRuntime(), which returned a dockertools runtime from which we could make calls directly. But with CRI it's now a kubeGenericRuntimeManager, which is simply a stub and doesn't implement anything useful at all.
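The startup recovery being described could be sketched as below; `hasFlows` and `stopSandbox` are hypothetical stand-ins for the OVS flow check and the CRI StopPodSandbox call, not the actual openshift-sdn code.

```go
package main

// recoverPodNetworks sketches the recovery path described above: any
// running sandbox whose OVS flows are gone gets stopped, so that the
// kubelet recreates the sandbox and network setup runs again for it.
func recoverPodNetworks(sandboxIDs []string, hasFlows func(id string) bool, stopSandbox func(id string) error) ([]string, error) {
	var stopped []string
	for _, id := range sandboxIDs {
		if hasFlows(id) {
			continue // networking intact, nothing to do
		}
		if err := stopSandbox(id); err != nil {
			return stopped, err
		}
		stopped = append(stopped, id)
	}
	return stopped, nil
}
```

Note that, as discussed in the thread, the stopped sandbox is never reused directly; the plugin counts on the kubelet noticing and recreating it.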
@dcbw ah, I see. So you are counting on the kubelet restarting the pod. That's actually what you want. In that case, I don't see a problem. lgtm. Thanks for the explanation :)
[merge] |
pkg/sdn/plugin/runtime.go (Outdated)
// Kubelet starts asynchronously and when we get an Update op, kubelet may not have created the runtime endpoint.
// So try a couple of times before bailing out (~5 second timeout).
You should wait longer. Heavily contended systems could flake because of this.
I would expect 30s
Use the pod sandbox ID to update the pod, as opposed to the pod container ID. Identifying the OVS flow note by sandbox ID is desired because the network namespace is held by the pod sandbox; a pod can have many containers, and a single container ID may not represent all of the pod's OVS flows. Since we can't use the kubelet 'Host' (for an explanation, refer to commit f111845), we use the runtime shim endpoint to connect to the runtime service over gRPC. This is the same mechanism the kubelet (GenericKubeletRuntimeManager) uses to talk to the runtime service (docker/rkt).
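The first step of connecting to the runtime shim is splitting the endpoint into a protocol and an address before dialing gRPC. The sketch below illustrates that step; `parseEndpoint` is a hypothetical helper for illustration, not the actual kubelet or openshift-sdn function.

```go
package main

import (
	"fmt"
	"strings"
)

// parseEndpoint splits a runtime shim endpoint such as
// "unix:///var/run/dockershim.sock" into its protocol ("unix") and
// address ("/var/run/dockershim.sock"), the two pieces a gRPC dialer
// for a local socket needs.
func parseEndpoint(endpoint string) (protocol, addr string, err error) {
	const sep = "://"
	i := strings.Index(endpoint, sep)
	if i < 0 {
		return "", "", fmt.Errorf("invalid runtime endpoint %q", endpoint)
	}
	return endpoint[:i], endpoint[i+len(sep):], nil
}
```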
Force-pushed from 20ef6fe to 3ef342d
Evaluated for origin test up to 3ef342d
Updated getRuntimeService() timeout from 2 secs to 30 secs.
continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/2892/) (Base Commit: 57a4d50) (PR Branch Commit: 3ef342d)
Evaluated for origin merge up to 3ef342d
continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_origin/1204/) (Base Commit: 8f51071) (PR Branch Commit: 3ef342d) (Image: devenv-rhel7_6419)
Automatic merge from submit-queue. sdn: move sandbox kill-on-failed-update to runtime socket from direct docker. Follow-on to #14892 that reworks #14665 to use the new runtime socket stuff. @openshift/networking @danwinship @pravisankar
What happens to this for CRI-O? |
Is this now fixed in the kube upstream? |