
Use true leader elector for controller counting #680

Merged · ncopa merged 1 commit into k0sproject:main from controller-counter on Feb 4, 2021

Conversation

jnummelin (Member)

Signed-off-by: Jussi Nummelin <jnummelin@mirantis.com>

Issue
The previous iteration (#672) did reach the correct --server-count, but in doing so it made the process flap quite a bit. The flapping left the agents confused about which servers, and how many of them, they should connect to. That resulted in various errors, e.g. during conformance testing, where much of the verification relies on getting logs and exec sessions into pods. One of the most notable side effects of the connection flapping was that the metrics API endpoint caused lots of errors in many kube API calls. (The way it's wired into the k8s API is a PITA, but that's a different battle.)

What this PR Includes
This PR changes the controller lease to use real leader election. Why? It's more accurate, since the leases are renewed more often, and it's a bit more battle-tested. The downside is that it uses a bit more CPU time.
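For reference, a per-controller lease driven by client-go's standard leader election machinery looks roughly like the sketch below. This is a minimal illustration, not the k0s code: the lease name, namespace, and timings are made up, and the idea is simply that each controller runs an election on its own lease (which it always wins), so the lease renewals reflect controller liveness.

```go
package main

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runLease keeps a Lease object renewed for this controller. Counting the
// actively renewed leases on the other side gives the controller count.
func runLease(ctx context.Context, client kubernetes.Interface, identity string) {
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      "k0s-ctrl-" + identity, // illustrative name
			Namespace: "kube-node-lease",      // illustrative namespace
		},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: identity},
	}

	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true,
		LeaseDuration:   60 * time.Second, // illustrative timings
		RenewDeadline:   15 * time.Second,
		RetryPeriod:     5 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				log.Printf("%s acquired its lease", identity)
			},
			OnStoppedLeading: func() {
				log.Printf("%s lost its lease", identity)
			},
		},
	})
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	host, _ := os.Hostname()
	runLease(context.Background(), client, host)
}
```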

It also changes a bit how konnectivity-server is restarted. There are now separate routines, one counting the controllers and one restarting the konnectivity-server, with a channel in between to pass the controller counts (see the sketch below). This proved to be a pretty stable way to converge on the correct server count.
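A minimal sketch of that counter/restarter split, with hypothetical names (the real code lives in pkg/component/server/konnectivity.go and drives the supervisor rather than just logging):

```go
package main

import (
	"context"
	"log"
	"time"
)

// countControllers would inspect the per-controller leases and return how
// many controllers currently hold a live lease. Stubbed out in this sketch.
func countControllers(ctx context.Context) int {
	return 1 // placeholder
}

// runCounter polls the controller count and pushes changes into the channel.
func runCounter(ctx context.Context, counts chan<- int) {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	last := 0
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if n := countControllers(ctx); n != last {
				last = n
				counts <- n
			}
		}
	}
}

// runServerSupervisor restarts konnectivity-server whenever the count changes,
// passing the new count via --server-count.
func runServerSupervisor(ctx context.Context, counts <-chan int) {
	for {
		select {
		case <-ctx.Done():
			return
		case n := <-counts:
			log.Printf("restarting konnectivity-server with --server-count=%d", n)
			// Here the real implementation restarts the supervised
			// konnectivity-server process; omitted in this sketch.
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	counts := make(chan int, 1)
	go runCounter(ctx, counts)
	runServerSupervisor(ctx, counts)
}
```

The logs below show the result: ServerCount steps from 1 to 3 as the other controllers come up, and after that the process is left alone.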

Feb 03 21:18:52 k0s-server-0 k0s[1123]: time="2021-02-03 21:18:52" level=info msg="Starting to supervise" component=konnectivity
Feb 03 21:18:52 k0s-server-0 k0s[1123]: time="2021-02-03 21:18:52" level=info msg="I0203 21:18:52.745634    1721 main.go:165] ServerID set to c83caf41b543140e1ede176a23fb94803f1acc6b4294f62264c67dc6c1dd4c1b." component=konnectivity
Feb 03 21:18:52 k0s-server-0 k0s[1123]: time="2021-02-03 21:18:52" level=info msg="I0203 21:18:52.745642    1721 main.go:166] ServerCount set to 1." component=konnectivity
Feb 03 21:21:22 k0s-server-0 k0s[1123]: time="2021-02-03 21:21:22" level=info msg="Shutting down pid 1721" component=konnectivity
Feb 03 21:21:22 k0s-server-0 k0s[1123]: time="2021-02-03 21:21:22" level=info msg="I0203 21:21:22.748506    1721 main.go:387] Shutting down server." component=konnectivity
Feb 03 21:21:22 k0s-server-0 k0s[1123]: time="2021-02-03 21:21:22" level=info msg="Starting to supervise" component=konnectivity
Feb 03 21:21:22 k0s-server-0 k0s[1123]: time="2021-02-03 21:21:22" level=info msg="Started successfully, go nuts" component=konnectivity
Feb 03 21:21:22 k0s-server-0 k0s[1123]: time="2021-02-03 21:21:22" level=info msg="I0203 21:21:22.784575    2638 main.go:165] ServerID set to c83caf41b543140e1ede176a23fb94803f1acc6b4294f62264c67dc6c1dd4c1b." component=konnectivity
Feb 03 21:21:22 k0s-server-0 k0s[1123]: time="2021-02-03 21:21:22" level=info msg="I0203 21:21:22.784592    2638 main.go:166] ServerCount set to 2." component=konnectivity
Feb 03 21:21:32 k0s-server-0 k0s[1123]: time="2021-02-03 21:21:32" level=info msg="Shutting down pid 2638" component=konnectivity
Feb 03 21:21:32 k0s-server-0 k0s[1123]: time="2021-02-03 21:21:32" level=info msg="I0203 21:21:32.712984    2638 main.go:387] Shutting down server." component=konnectivity
Feb 03 21:21:32 k0s-server-0 k0s[1123]: time="2021-02-03 21:21:32" level=info msg="Starting to supervise" component=konnectivity
Feb 03 21:21:32 k0s-server-0 k0s[1123]: time="2021-02-03 21:21:32" level=info msg="Started successfully, go nuts" component=konnectivity
Feb 03 21:21:32 k0s-server-0 k0s[1123]: time="2021-02-03 21:21:32" level=info msg="I0203 21:21:32.748681    2654 main.go:165] ServerID set to c83caf41b543140e1ede176a23fb94803f1acc6b4294f62264c67dc6c1dd4c1b." component=konnectivity
Feb 03 21:21:32 k0s-server-0 k0s[1123]: time="2021-02-03 21:21:32" level=info msg="I0203 21:21:32.748688    2654 main.go:166] ServerCount set to 3." component=konnectivity

After which the process has been fully stable:

root      1123  2.3  0.7 751728 62276 ?        Ssl  Feb03   5:35 /usr/local/bin/k0s server --config=/etc/k0s/k0s.yaml
etcd      1620  6.6  4.6 10886168 368944 ?     Sl   Feb03  15:52  \_ /var/lib/k0s/bin/etcd --data-dir=/var/lib/k0s/etcd --listen-client-urls=https://127.0.0.1:2379 --advertise-client-urls=https://127.0.0.1:2379 --client-
kube-ap+  1647 12.5  8.5 1393552 685644 ?      Sl   Feb03  29:55  \_ /var/lib/k0s/bin/kube-apiserver --proxy-client-cert-file=/var/lib/k0s/pki/front-proxy-client.crt --requestheader-client-ca-file=/var/lib/k0s/pki/front-
kube-sc+  1652  0.4  0.6 747880 53820 ?        Sl   Feb03   1:08  \_ /var/lib/k0s/bin/kube-scheduler --authorization-kubeconfig=/var/lib/k0s/pki/scheduler.conf --kubeconfig=/var/lib/k0s/pki/scheduler.conf --v=1 --bind-ad
kube-ap+  1657  3.2  1.2 764900 101800 ?       Sl   Feb03   7:49  \_ /var/lib/k0s/bin/kube-controller-manager --client-ca-file=/var/lib/k0s/pki/ca.crt --cluster-signing-cert-file=/var/lib/k0s/pki/ca.crt --root-ca-file=/v
root      1664  0.0  0.5 751472 44020 ?        Sl   Feb03   0:04  \_ /usr/local/bin/k0s api --config=/etc/k0s/k0s.yaml --data-dir=/var/lib/k0s
konnect+  2654  0.0  0.4 734544 35292 ?        Sl   Feb03   0:13  \_ /var/lib/k0s/bin/konnectivity-server --uds-name=/run/k0s/konnectivity-server/konnectivity-server.sock --cluster-cert=/var/lib/k0s/pki/server.crt --clus

The added stability also makes the conformance pass:

$ sonobuoy status
         PLUGIN     STATUS   RESULT   COUNT
            e2e   complete   passed       1
   systemd-logs   complete   passed       3
Sonobuoy has completed. Use `sonobuoy retrieve` to get results.

@jnummelin jnummelin requested a review from a team as a code owner February 4, 2021 00:18
@jnummelin jnummelin force-pushed the controller-counter branch 2 times, most recently from 7caa163 to b0adbf5 Compare February 4, 2021 00:28
@jnummelin jnummelin requested review from ncopa and removed request for mviitane February 4, 2021 08:29
jasmingacic previously approved these changes Feb 4, 2021
@jnummelin jnummelin force-pushed the controller-counter branch 2 times, most recently from f346eda to 2b5c731 Compare February 4, 2021 10:33
Review comment threads on pkg/component/server/konnectivity.go (resolved)
ncopa previously approved these changes Feb 4, 2021
The previous iteration, while successful at proving the approach in general, had the drawback of the konnectivity-server process flapping a bit too much.
This has the unwelcome side effect of the agents getting slightly confused about which servers, and how many of them, they should connect to. That causes connectivity issues between the API and the workers, as all of that communication goes through konnectivity tunnels.

This commit changes a couple of things:
- Standard leader election is now used for the per-controller leases. The leases are more accurate but use a bit more resources.
- Instead of restarting the whole konnectivity component, we now only restart the supervisor part (i.e. the konnectivity-server process itself).
- The per-controller lease is broken out into a separate component, to keep concerns a bit apart.

Signed-off-by: Jussi Nummelin <jnummelin@mirantis.com>
@ncopa ncopa merged commit b3cb921 into k0sproject:main Feb 4, 2021
@jnummelin jnummelin deleted the controller-counter branch February 4, 2021 13:01