Bug 1453190 - Fix pod update operation #14892
Conversation
@openshift/networking @dcbw PTAL
[test]
pkg/sdn/plugin/pod_linux.go (Outdated)
// Kubelet starts asynchronously and when we get an Update op, kubelet may not have created the runtime endpoint.
// So try a couple of times before bailing out (~2 second timeout).
2 seconds seems potentially like not enough...
What bug exactly does this fix? (Code basically LGTM but this is @dcbw's area of expertise.)
duh, just noticed the bug id in the summary
OK
So this only affects the re-VNID-ing case, not the resync-pods-at-startup case? In that case we could just change the way multitenant.go:updatePodNetwork() works, so that rather than going pod-by-pod, it just operates on the DumpFlows output as a whole, changing the VNID of every affected pod. That might end up being simpler?
@pravisankar I like the solution of talking to the runtime directly; I should have done that for 14665 too. For this bug though, could we put the runtime-related stuff into update.go once 14665 gets merged, rather than in pod_linux.go? I think pod_linux.go and pod.go should just consume the given sandbox ID, which would get passed from node.go (or update.go) instead. Since Update is the only operation that requires talking to the runtime service, it might make sense to put all of that into update.go. Other than that, looks generally good to me!
@danwinship we cannot directly work on the dump-flows output in multitenant.go:updatePodNetwork(). We have to go pod by pod (CNI_UPDATE per pod) so that the CNI pod manager updates its bookkeeping (the runningPods field, used for hostport/multicast handling). I think we can get the dumpFlows() output once and reuse it for all the pods; this needs some fiddling with the CNI request/response. I'll work on this change.
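The single-dump idea could look roughly like this; `dumpFlows` and `updatePod` are hypothetical stand-ins for the OVS flow dump and the per-pod CNI UPDATE, not the actual plugin functions.

```go
package main

// updatePodNetworks sketches the idea above: fetch the OVS flow dump
// once, then issue a per-pod update for every affected pod so the CNI
// pod manager's bookkeeping (runningPods, hostports, multicast) stays
// correct, while DumpFlows runs only one time.
func updatePodNetworks(pods []string, dumpFlows func() (string, error), updatePod func(pod, flows string) error) error {
	flows, err := dumpFlows()
	if err != nil {
		return err
	}
	for _, pod := range pods {
		if err := updatePod(pod, flows); err != nil {
			return err
		}
	}
	return nil
}
```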
@pravisankar well #14893 depends on the changes here to use the runtime anyway, right? So in that case it makes sense to keep this like it is for now, although I'd really like to simplify the VNID-changing code at some point in the future.
Force-pushed from ce51a17 to 924fd69
@dcbw @danwinship Updated, ptal
[test]
[test] last flaked on #14043
LGTM
Force-pushed from 924fd69 to 333d356
Hold on - so now our networking code depends on calling CRI in order to work correctly?
Removing merge tag temporarily. Why is it good for SDN to make calls to the CRI directly?
@smarterclayton because Kubernetes doesn't support changes to pods while they are running, but we do for things like merge/split tenant networks. And that requires that we get sandbox IDs from the runtime, because we need to manipulate the sandbox to do those updates. We used to get this information from kubelet directly via the network plugin interface, but we can no longer do that with the move to remote runtimes. That's actually a good thing; it puts us on the same footing as all other CNI plugins. The alternative is to cache the sandbox IDs, net namespaces, and IPAM information on disk somewhere, validate all of that if kubelet crashes and restarts, and use that. But the runtime already has most of this information. Furthermore, I'm going to port #14665 over to use these runtime interfaces, since that's better than banging on Docker directly and will actually work with CRI-O too.
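The sandbox-ID lookup described here can be sketched as below. The `sandbox` struct is a hypothetical stand-in for the CRI PodSandbox message; the real plugin would obtain these records from ListPodSandbox over the runtime's gRPC endpoint.

```go
package main

// sandbox is a simplified stand-in for the CRI PodSandbox record.
type sandbox struct {
	ID        string
	Namespace string
	Name      string
	Ready     bool
}

// sandboxIDForPod sketches the lookup described above: find the ready
// sandbox for a given pod so its network namespace can be manipulated
// during an update.
func sandboxIDForPod(sandboxes []sandbox, namespace, name string) (string, bool) {
	for _, sb := range sandboxes {
		if sb.Ready && sb.Namespace == namespace && sb.Name == name {
			return sb.ID, true
		}
	}
	return "", false
}
```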
Does the kubelet expect other daemons to be making calls to the CRI? @derekwaynecarr @sjenning
@smarterclayton I'm not sure of any hard requirement that the kubelet have exclusive access to the CRI. We have PLEG, which should notify the kubelet if anything it cares about changes. Am I correct in understanding that the only CRI request being made in this change is querying the sandbox ID?
@smarterclayton I have to imagine CRI runtimes are safe here, since kubelet calls into the runtimes from a couple of different goroutines. As to whether kubelet handles changes made underneath it, we're essentially doing "docker stop". I realize it's more complicated than that, but looking at the dockershim code, it looks like ListPodSandbox() doesn't affect any global state, and while it does read checkpoints, the checkpoint store is required to be thread-safe.
[test]
Force-pushed from 333d356 to 20ef6fe
Jenkins failed to highlight the actual error:
@sjenning @smarterclayton @derekwaynecarr are your concerns addressed? Can we merge this? Thanks
@knobunc @dcbw I guess I'm still not completely clear on what a StopPodSandbox() outside the kubelet will do. As soon as the kubelet becomes aware, it will try to restart the pod, assuming the restartPolicy isn't
@sjenning kubelet assumes everything is fine with networking on the pod at startup if kubelet/openshift-node is restarted for any reason (manual, or perhaps it segfaulted/panicked). It simply has no facility to check network status after a pod has started. The pathological case is a restart that blows away the OVS flows that openshift-sdn has added. On restart, pod networking is thus broken, but kubelet does not call into the network plugin to detect that, nor does it give the plugin a chance at startup to fix things. So we are left trying to detect that situation and somehow fix networking ourselves. But we cannot get the necessary details out of kubelet (like the network namespace) because we aren't called from kubelet or the runtime. So our only solution is to somehow kill the sandbox, which causes kubelet to restart that sandbox, and thus networking will be reconfigured for it.
We don't actually do anything with the dead sandbox. We want kubelet to create a new one, which it does. AFAIK kubelet never re-uses a dead sandbox, or if it does, it will call network setup on it again, which is exactly what we want. I am attempting to fix this upstream and in CNI. There is a forward path, but it's going to take a bit. Before CRI we hooked into kubelet directly and could call GetRuntime(), which returned a dockertools runtime from which we could make calls directly. But with CRI it's now a kubeGenericRuntimeManager, which is simply a stub and doesn't implement anything useful at all.
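The startup recovery being described could be sketched as below; `hasFlows` and `stopSandbox` are hypothetical stand-ins for the OVS flow check and the CRI StopPodSandbox call, not the actual openshift-sdn code.

```go
package main

// recoverPodNetworks sketches the recovery path described above: any
// running sandbox whose OVS flows are gone gets stopped, so that the
// kubelet recreates the sandbox and network setup runs again for it.
func recoverPodNetworks(sandboxIDs []string, hasFlows func(id string) bool, stopSandbox func(id string) error) ([]string, error) {
	var stopped []string
	for _, id := range sandboxIDs {
		if hasFlows(id) {
			continue // networking intact, nothing to do
		}
		if err := stopSandbox(id); err != nil {
			return stopped, err
		}
		stopped = append(stopped, id)
	}
	return stopped, nil
}
```

Note that, as discussed in the thread, the stopped sandbox is never reused directly; the plugin counts on the kubelet noticing and recreating it.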
@dcbw ah, I see. So you are counting on the kubelet restarting the pod. That's actually what you want. In that case, I don't see a problem. lgtm. Thanks for the explanation :)
[merge] |
pkg/sdn/plugin/runtime.go (Outdated)
// Kubelet starts asynchronously and when we get an Update op, kubelet may not have created the runtime endpoint.
// So try a couple of times before bailing out (~5 second timeout).
You should wait longer. Heavily contended systems could flake because of this.
I would expect 30s
Use the pod sandbox ID to update the pod, as opposed to the pod container ID. Identifying the OVS flow note by sandbox ID is desired because the network namespace is held by the pod sandbox; a pod can have many containers, and a single container ID may not represent all of the pod's OVS flows. Since we can't use the kubelet 'Host' (for an explanation, refer to commit f111845), we use the runtime shim endpoint to connect to the runtime service over gRPC. This is the same mechanism the kubelet (GenericKubeletRuntimeManager) uses to talk to the runtime service (docker/rkt).
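The first step of connecting to the runtime shim is splitting the endpoint into a protocol and an address before dialing gRPC. The sketch below illustrates that step; `parseEndpoint` is a hypothetical helper for illustration, not the actual kubelet or openshift-sdn function.

```go
package main

import (
	"fmt"
	"strings"
)

// parseEndpoint splits a runtime shim endpoint such as
// "unix:///var/run/dockershim.sock" into its protocol ("unix") and
// address ("/var/run/dockershim.sock"), the two pieces a gRPC dialer
// for a local socket needs.
func parseEndpoint(endpoint string) (protocol, addr string, err error) {
	const sep = "://"
	i := strings.Index(endpoint, sep)
	if i < 0 {
		return "", "", fmt.Errorf("invalid runtime endpoint %q", endpoint)
	}
	return endpoint[:i], endpoint[i+len(sep):], nil
}
```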
Force-pushed from 20ef6fe to 3ef342d
Evaluated for origin test up to 3ef342d
Updated getRuntimeService() timeout from 2 secs to 30 secs.
continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/2892/) (Base Commit: 57a4d50) (PR Branch Commit: 3ef342d)
Evaluated for origin merge up to 3ef342d
continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_origin/1204/) (Base Commit: 8f51071) (PR Branch Commit: 3ef342d) (Image: devenv-rhel7_6419)
Automatic merge from submit-queue. sdn: move sandbox kill-on-failed-update to runtime socket from direct docker. Follow-on to #14892 that reworks #14665 to use the new runtime socket stuff. @openshift/networking @danwinship @pravisankar
What happens to this for CRI-O? |
Is this now fixed in the kube upstream? |