
Bug 1453190 - Fix pod update operation #14892

Merged
1 commit merged into openshift:master on Jul 2, 2017

Conversation

@pravisankar commented Jun 26, 2017

Use the pod sandbox ID to update the pod, as opposed to the pod container ID.
Identifying the OVS flow note by sandbox ID is preferable because the network
namespace is held by the pod sandbox; a pod can have many containers, and a
single container ID may not represent all of the pod's OVS flows.

Since we can't use the kubelet 'Host' (see commit f111845 for an explanation),
we use the runtime shim endpoint to connect to the runtime service over gRPC.
This is the same mechanism kubelet (GenericKubeletRuntimeManager) uses to talk
to the runtime service (docker/rkt).

bug 1453190
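For illustration, here is a minimal sketch of the mechanism the commit message describes: connect to the CRI runtime shim over gRPC and look up the pod's sandbox ID. This is not the PR's code; the helper name, import path, socket endpoint, and label keys are assumptions that vary by Kubernetes version.

package example

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

// findSandboxID connects to the runtime shim endpoint and asks the CRI
// runtime service for the sandbox belonging to the given pod. Hypothetical
// helper, shown only to illustrate the mechanism described above.
func findSandboxID(endpoint, namespace, name string) (string, error) {
	conn, err := grpc.Dial(endpoint, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return "", err
	}
	defer conn.Close()

	client := runtimeapi.NewRuntimeServiceClient(conn)
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Filter sandboxes by the pod labels the kubelet sets on them.
	resp, err := client.ListPodSandbox(ctx, &runtimeapi.ListPodSandboxRequest{
		Filter: &runtimeapi.PodSandboxFilter{
			LabelSelector: map[string]string{
				"io.kubernetes.pod.namespace": namespace,
				"io.kubernetes.pod.name":      name,
			},
		},
	})
	if err != nil {
		return "", err
	}
	if len(resp.Items) == 0 {
		return "", fmt.Errorf("no sandbox found for pod %s/%s", namespace, name)
	}
	return resp.Items[0].Id, nil
}

A caller would pass the runtime shim endpoint (for example a unix:// socket path), which is the same endpoint kubelet itself uses to reach the runtime service.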

@pravisankar (Author)

@openshift/networking @dcbw PTAL

@pravisankar (Author)

[test]

Review comment on the diff:

}

// Kubelet starts asynchronously and when we get an Update op, kubelet may not have created the runtime endpoint.
// So try a couple of times before bailing out (~2 second timeout).
Contributor:

2 seconds seems potentially like not enough...

@danwinship (Contributor)

What bug exactly does this fix?

(Code basically LGTM but this is @dcbw's area of expertise.)

@danwinship (Contributor)

duh, just noticed the bug id in the summary

When we add table=20 OVS flows with a note generated from the ContainerID, we are actually using the pod SandboxID (the origin-pod or pause container ID), which is good because it holds the network namespace. But when a pod Update op is performed, we search for the flows that match the pod ContainerID, not the SandboxID, so no OVS flows match and the pod update op fails.

OK
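To make the mismatch above concrete, here is a hypothetical sketch (not openshift-sdn's actual code) of the kind of lookup involved: OVS renders a flow's note action as dot-separated hex bytes, so searching the dump-flows output with a hex-encoded container ID finds nothing when the notes were built from the sandbox ID.

package example

import (
	"fmt"
	"strings"
)

// hexNote encodes an ID the way ovs-ofctl dump-flows renders a note action
// (dot-separated hex bytes), e.g. "abc" becomes "61.62.63".
func hexNote(id string) string {
	parts := make([]string, len(id))
	for i := 0; i < len(id); i++ {
		parts[i] = fmt.Sprintf("%02x", id[i])
	}
	return strings.Join(parts, ".")
}

// flowsForID returns the table=20 flows whose note encodes the given ID.
// If the notes were generated from the sandbox ID, calling this with a
// container ID returns an empty slice, which is the failure described above.
func flowsForID(dumpFlows []string, id string) []string {
	needle := hexNote(id)
	var matched []string
	for _, flow := range dumpFlows {
		if strings.Contains(flow, "table=20") && strings.Contains(flow, needle) {
			matched = append(matched, flow)
		}
	}
	return matched
}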

@danwinship (Contributor)

So this only affects the re-VNID-ing case, not the resync-pods-at-startup case? In that case we could just change the way multitenant.go:updatePodNetwork() works, so that rather than going pod-by-pod, it just operates on the DumpFlows output as a whole, changing the VNID of every affected pod. That might end up being simpler?

@dcbw (Contributor) commented Jun 27, 2017

@pravisankar I like the solution of talking to the runtime directly; I should have done that for #14665 too. For this bug, though, could we put the runtime-related stuff into update.go once #14665 gets merged, rather than in pod_linux.go?

I think pod_linux.go and pod.go should just consume the given sandbox ID, which would get passed from node.go (or update.go) instead. Since Update is the only operation that requires talking to the runtime service, it might make sense to put all of that into update.go. Other than that, looks generally good to me!

@pravisankar (Author)

@dcbw pod.go/pod_linux.go consuming the sandbox ID seems like good segregation. Let me rebase my changes on top of #14665 and move the relevant changes to node/update.go.

@pravisankar (Author)

@danwinship we cannot directly work on the dump-flows output in multitenant.go:updatePodNetwork(). We have to go pod by pod (CNI_UPDATE per pod) so that the CNI pod manager updates its bookkeeping (the runningPods field, used for hostport/multicast handling). I think we can get the dumpFlows() output once and reuse it for all the pods; this needs some fiddling with the CNI request/response. I'll work on this change.

@danwinship (Contributor)

@pravisankar well, #14893 depends on the changes here to use the runtime anyway, right? So in that case it makes sense to keep this as it is for now, although I'd really like to simplify the VNID-changing code at some point in the future.

@pravisankar (Author)

@dcbw @danwinship Updated, ptal

@knobunc (Contributor) commented Jun 28, 2017

[test]

@knobunc (Contributor) commented Jun 28, 2017

[test] last flaked on #14043

@dcbw (Contributor) left a review comment:

LGTM

@smarterclayton (Contributor)

Hold on - so now our networking code depends on calling CRI in order to work correctly?

openshift deleted a comment from knobunc Jun 29, 2017
@smarterclayton (Contributor)

Removing merge tag temporarily.

Why is it good for SDN to make calls to the CRI directly?

@dcbw (Contributor) commented Jun 29, 2017

Why is it good for SDN to make calls to the CRI directly?

@smarterclayton because Kubernetes doesn't support changes to pods while they are running, but we do for things like merge/split tenant networks. And that requires that we get sandbox IDs from the runtime, because we need to manipulate the sandbox to do those updates. We used to get this information from kubelet directly via the network plugin interface, but we no longer can do that with the move to remote runtimes. That's actually a good thing; it puts us on the same footing as all other CNI plugins.

The alternative is to cache the sandbox IDs, net namespaces, and IPAM information on disk somewhere, validate it if kubelet crashes and restarts, and use that. But the runtime already has most of this information.

Furthermore, I'm going to port #14665 over to use these runtime interfaces, since that's better than banging on Docker directly and will actually work with CRI-O too.

@smarterclayton (Contributor)

Does the kubelet expect other daemons to be making calls to the CRI? @derekwaynecarr @sjenning

@sjenning (Contributor) commented Jun 29, 2017

@smarterclayton I'm not sure of any hard requirement that the kubelet have exclusive access to the CRI. We have PLEG, which should notify the kubelet if anything it cares about changes.

Am I correct in understanding that the only CRI request that is being made in this change is querying the sandbox ID?

@dcbw (Contributor) commented Jun 29, 2017

@smarterclayton I have to imagine CRI runtimes are safe here, since kubelet calls into the runtimes from a couple of different goroutines. As to whether kubelet handles changes made underneath it, we're essentially doing "docker stop". I realize it's more complicated than that, but looking at the dockershim code, it looks like ListPodSandbox() doesn't affect any global state, and while it does read checkpoints, the checkpoint store is required to be thread-safe.

@dcbw (Contributor) commented Jun 29, 2017

Am I correct in understanding that the only CRI request that is being made in this change is querying the sandbox ID?

@sjenning yes, but I want to do a follow-up that will call StopPodSandbox() to replace the slightly horrible docker equivalent that #14665 currently does.
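A minimal sketch of what such a follow-up call could look like against the CRI runtime service; the helper name is an assumption, and the client would be constructed the same way as in the earlier sandbox-lookup sketch. This is an illustration, not the code from this PR or its follow-up.

package example

import (
	"context"
	"time"

	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

// stopSandbox asks the CRI runtime to stop a pod sandbox; kubelet then
// notices and recreates the sandbox, re-running network setup for the pod.
func stopSandbox(client runtimeapi.RuntimeServiceClient, sandboxID string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	_, err := client.StopPodSandbox(ctx, &runtimeapi.StopPodSandboxRequest{
		PodSandboxId: sandboxID,
	})
	return err
}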

@eparis (Member) commented Jun 30, 2017

[test]

@pravisankar (Author)

Jenkins failed to highlight the actual error:
pkg/sdn/plugin/pod_test.go:332: unknown cniserver.PodRequest field 'ContainerId' in struct literal
pkg/sdn/plugin/pod_test.go:427: unknown cniserver.PodRequest field 'ContainerId' in struct literal
Fixed and updated; please [test] again.

@knobunc (Contributor) commented Jun 30, 2017

@sjenning @smarterclayton @derekwaynecarr are your concerns addressed? Can we merge this?

Thanks

@sjenning (Contributor) commented Jun 30, 2017

@knobunc @dcbw I guess I'm still not completely clear on what a StopPodSandbox() outside the kubelet will do. As soon as the kubelet becomes aware, it will try to restart the pod, assuming the restartPolicy isn't Never. So any attempt to do something while the sandbox was down would race with the kubelet bringing it back up, right?

@dcbw (Contributor) commented Jun 30, 2017

@knobunc @dcbw I guess I'm still not completely clear on what a StopPodSandbox() outside the kubelet will do. As soon as the kubelet becomes aware, it will try to restart the pod, assuming the restartPolicy isn't Never.

@sjenning kubelet assumes everything is fine with networking on the pod at startup if kubelet/openshift-node is restarted for any reason (manually, or perhaps because it segfaulted/panicked). It simply has no facility to check network status after a pod has started. The pathological case is:

systemctl restart openvswitch
systemctl restart atomic-openshift-node
(or even 'yum upgrade openvswitch', which restarts OVS)

which blows away the OVS flows that openshift-sdn has added. On restart, pod networking is thus broken, but kubelet does not call into the network plugin to detect that, nor does it give the plugin a chance at startup to fix things.

So we are left trying to detect that situation and somehow fix networking ourselves. But we cannot get the necessary details out of kubelet (like the network namespace) because we aren't called from kubelet or the runtime. So our only solution is to somehow kill the sandbox, which causes kubelet to restart that sandbox and thus networking will be reconfigured for it.

So any attempt to do something while the sandbox was down would race with the kubelet bringing it back up, right?

We don't actually do anything with the dead sandbox. We want kubelet to create a new one, which it does. AFAIK kubelet never re-uses a dead sandbox, or if it does, it will call network setup on it again, which is exactly what we want.

I am attempting to fix this upstream and in CNI. There is a forward path, but it's going to take a bit. Before CRI we hooked into kubelet directly and could call GetRuntime(), which returned a dockertools runtime from which we could make calls directly. But with CRI it's now a kubeGenericRuntimeManager, which is simply a stub and doesn't implement anything useful at all.

@sjenning (Contributor)

@dcbw ah, I see. So you are counting on the kubelet restarting the pod. That's actually what you want. In that case, I don't see a problem. lgtm. Thanks for the explanation :)

@knobunc (Contributor) commented Jun 30, 2017

[merge]

Review comment on the updated diff:

}

// Kubelet starts asynchronously and when we get an Update op, kubelet may not have created the runtime endpoint.
// So try a couple of times before bailing out (~5 second timeout).
Contributor:

You should wait longer. Heavily contended systems could flake because of this.

Contributor:

I would expect 30s

(Commit pushed; commit message unchanged from the PR description above.)
@openshift-bot (Contributor)

Evaluated for origin test up to 3ef342d

@pravisankar (Author)

Updated getRuntimeService() timeout from 2 secs to 30 secs.
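A hedged sketch of that retry, assuming a dial helper like the one in the earlier sandbox-lookup sketch; the function name getRuntimeService, the retry, and the 30-second bound come from this PR discussion, everything else is illustrative.

package example

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

// dialRuntime is assumed to open a gRPC connection to the runtime shim
// endpoint and return a CRI runtime service client (see the earlier sketch).
var dialRuntime func(endpoint string) (runtimeapi.RuntimeServiceClient, error)

// getRuntimeService polls for the kubelet's runtime shim socket, because
// kubelet starts asynchronously and may not have created the endpoint yet
// when an Update op arrives (~30 second timeout, per the review above).
func getRuntimeService(endpoint string) (runtimeapi.RuntimeServiceClient, error) {
	var client runtimeapi.RuntimeServiceClient
	err := wait.PollImmediate(500*time.Millisecond, 30*time.Second, func() (bool, error) {
		c, dialErr := dialRuntime(endpoint)
		if dialErr != nil {
			return false, nil // endpoint not ready yet; keep retrying
		}
		client = c
		return true, nil
	})
	if err != nil {
		return nil, fmt.Errorf("timed out waiting for runtime service at %s: %v", endpoint, err)
	}
	return client, nil
}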

@openshift-bot (Contributor)

continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/2892/) (Base Commit: 57a4d50) (PR Branch Commit: 3ef342d)

openshift deleted a comment from openshift-bot Jul 1, 2017
@openshift-bot (Contributor)

Evaluated for origin merge up to 3ef342d

@openshift-bot (Contributor) commented Jul 2, 2017

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_origin/1204/) (Base Commit: 8f51071) (PR Branch Commit: 3ef342d) (Image: devenv-rhel7_6419)

openshift-bot merged commit 79b4411 into openshift:master Jul 2, 2017
openshift-merge-robot added a commit that referenced this pull request Aug 7, 2017
Automatic merge from submit-queue

sdn: move sandbox kill-on-failed-update to runtime socket from direct docker

Follow-on to #14892 that reworks #14665 to use the new runtime socket stuff.

@openshift/networking @danwinship @pravisankar
@smarterclayton (Contributor)

What happens to this for CRI-O?

@smarterclayton (Contributor)

Is this now fixed in the kube upstream?
