apiserver fails to start because livenessprobe is too aggressive #413

anguslees · 2017-08-25T02:32:10Z

[Lubomir] NOTE: possible fix was submitted here:
kubernetes/kubernetes#66264

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version (use kubeadm version):
kubeadm version: &version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.3+2c2fe6e8278a5", GitCommit:"2c2fe6e8278a5db2d15a013987b53968c743f2a1", GitTreeState:"not a git tree", BuildDate:"1970-01-01T00:00:00Z", GoVersion:"go1.8", Compiler:"gc", Platform:"linux/arm"}

Environment:

Kubernetes version (use kubectl version):
Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.4", GitCommit:"793658f2d7ca7f064d2bdf606519f9fe1229c381", GitTreeState:"clean", BuildDate:"2017-08-17T08:30:51Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/arm"}
Cloud provider or hardware configuration:
arm32 (bananapi - basically a raspberrypi2)
OS (e.g. from /etc/os-release):
(my own OS image)
ID="containos"
NAME="containos"
VERSION="v2017.07"
VERSION_ID="v2017.07"
PRETTY_NAME="containos v2017.07"
Kernel (e.g. uname -a):
Linux master2 4.9.20 kubeadm init starts paused containers on ubuntu 16.04 #2 SMP Wed Aug 16 15:36:20 AEST 2017 armv7l GNU/Linux
Others:

What happened?

kubeadm init sits ~forever at the "waiting for control plane" stage. docker ps/logs investigation shows apiserver is being killed (SIGTERM) and restarted continuously.

What you expected to happen?

Everything to work :) In particular, apiserver to come up and the rest of the process to proceed.

How to reproduce it (as minimally and precisely as possible)?

Run kubeadm init on a slow machine.

Anything else we need to know?

For me, during the churn of all those containers starting at once, it takes apiserver about 90s(!) from its first log line to responding to HTTP queries. I haven't looked in detail at what it's doing at that point, but the logs mention what looks like etcd bootstrapping things.

My suggested fix is to set apiserver initialDelaySeconds to 180s. And probably similar elsewhere in general - I think there's very little reason to have aggressive initial delays.

(Unless you're a unittest that expects to frequently encounter failures, my experience with production software suggests the correct solution to timeouts is almost always to have waited longer).

The text was updated successfully, but these errors were encountered:

pipejakob · 2017-08-25T21:00:02Z

It seems that we set both InitialDelaySeconds and TimeoutSeconds to 15 for the control plane pods currently, which matches what kube-up.sh does as well. I understand the initial launch being slow, with all of the images are being pulled at once, but after all of the images have been pulled, how long does it take your apiserver to respond to /healthz checks after being launched?

No doubt we should probably tweak both of these values to accommodate lower-power machines.

anguslees · 2017-08-28T11:04:46Z

Once started it can respond to health checks in << 15s - it's really just all the extra stuff that the apiserver does between exec() and actually being ready to serve afaics.

anguslees · 2017-08-28T11:09:29Z

oh, and the docker pull time doesn't count against InitialDelaySeconds afaics (good). In other examples with larger (generic ubuntu) images over my slow network link, the pull can take many minutes but the initialDelaySeconds timer doesn't seem to start ticking until the pull has completed and the docker run has started. (I haven't looked at the relevant code - just frequent anecdotal experience)

koalalorenzo · 2017-08-30T20:23:25Z

I am running the same problem. With slow machines kubeadm sits forever. Using v1.7.4

pipejakob · 2017-08-30T20:35:52Z

@anguslees and @koalalorenzo, can you confirm that if you manually change the liveness probe settings (by editing the manifest files in /etc/kubernetes/manifests/) that this fixes your problem? I also recently saw a case on Slack where the user had the same symptoms but was likely running into memory constraints, because the problem went away when they moved to a host type with more memory.

I just want to make sure that this approach will actually fix the problem before we invest time into coding it up. Thanks!

praseodym · 2017-08-31T19:37:57Z

I'm also experiencing this issue when attempting to use kubeadm in QEMU without hardware-assisted virtualisation (which is a bad idea because it is terribly slow). Increasing the InitialDelaySeconds and TimeoutSeconds helps; the cluster will then eventually come up.

anguslees · 2017-08-31T23:47:12Z

@pipejakob I can confirm that (on my bananapi) running this in another terminal at the right point in the kubeadm run makes everything come up successfully:

sed -i 's/initialDelaySeconds: [0-9]\+/initialDelaySeconds: 180/' /etc/kubernetes/manifests/kube-apiserver.yaml

(I usually also manually docker kill the old/restart-looping apiserver container, I'm not sure if that gets cleaned up automatically with static pods)

pipejakob · 2017-09-01T01:19:38Z

@anguslees Great! Thanks for confirmation.

alexjmoore · 2017-09-01T09:58:31Z

I can confirm I've also just had this issue, in my case on an raspberry pi 3. Changing to 180s fixed it, however I think I also encountered issue #106 as in my case it simply died with:

Sep 1 10:47:30 raspberrypi kubelet[6053]: W0901 10:47:30.020409 6053 kubelet.go:1596] Deleting mirror pod "kube-apiserver-raspberrypi_kube-system(7c03df63-8efa-1
1e7-ae86-b827ebdd4b52)" because it is outdated

I had to manually HUP the kubelet process to get it to kick back to life.

Tha-Fox · 2017-09-06T22:01:29Z

I can also confirm I had this and wanted to say thank you for saving my sanity. I have Raspberry Pi 2B and I was stuck at init phase for the last month. After running that one-liner once it started waiting for the control plane, I got it going forward.

anguslees · 2017-10-09T00:08:10Z

This issue still exists in kubeadm v1.8.0, and it's worse because kubeadm itself now has a 1min timeout on most actions. The 1min timeout seems to have been chosen arbitrarily, and unfortunately a) doesn't give you enough time to jump in and debug/workaround the issue (eg: sed hack above), b) enough time for the apiserver to start up (~90s for me), even when initialDelaySeconds has been extended, and c) can't be increased without hacking/rebuilding kubeadm afaics.

Timeouts break otherwise correct logic, particularly in complex eventually-consistent systems - we should never add them "just because" :( My understanding is that kubeadm is meant to be a building block on which larger deployment systems can depend. Consequently I boldly propose removing all timeouts from kubeadm itself (the various phases should continue retrying forever), and rely on the higher level process to add an overall timeout if/when appropriate in that higher level context. In the simple/direct use case this would mean "retry until user gives up and presses ^c". Would such a PR be acceptable?

luxas · 2017-10-27T20:29:19Z

@anguslees We had the "wait forever" behavior earlier; but that was very sub-optimal from an UX PoV, so now we do have timeouts. We might want to increase some of those timeouts if you want.

The problem is that usage of kubeadm is two-fold. We both have users typing kubeadm interactively that want to know if something is happening or not and higher-level consumers.

anguslees · 2017-11-13T05:30:15Z

.. So what direction are we going to go here? Currently I use a fork of kubeadm that has numerous timeouts cranked up 10x, and I'd like to believe that I can go back to using standard kubeadm binaries at some point.

0xmichalis · 2018-01-08T19:32:22Z

@anguslees We had the "wait forever" behavior earlier; but that was very sub-optimal from an UX PoV, so now we do have timeouts. We might want to increase some of those timeouts if you want.

How about making them configurable? Does it make sense to have a single option that owns all of them?

0xmichalis · 2018-01-08T19:32:38Z

/priority important-soon

luxas · 2018-01-08T22:27:42Z

Does it make sense to have a single option that owns all of them?

Probably, or some kind of "weight" for all the timeouts to be multiplied with... Otherwise we'll get into config hell with 20 different types of timeout flags :)

rsterrenburg · 2018-02-17T09:48:04Z

Running into the same issue using kubeadm upgrade on raspberry pi 2 cluster. Upgrade fails due to agressive timeouts. Changing the liveness probe settings in the manifests doesn't help. Any ideas?

anguslees · 2018-02-17T23:46:46Z

I still propose a pattern where any kubeadm timeout is inherited from the calling context (or part of a more sophisticated error recovery strategy), rather than sprinkling arbitrary timeouts throughout the lower levels of the kubeadm codebase.

In its simplest form, this would behave almost exactly like removing all the timeouts from kubeadm and replacing them with one overall "run for xx mins, then abort if not finished" global timer (since kubeadm can't do much in the way of error recovery other than just waiting longer).

For the original manifest livenessProbe timeouts, it's literally a one-liner patch. Unfortunately, fixing the livenessProbe alone is no longer sufficient since the "deviation from normal == error" fallacy has spread further throughout the kubeadm codebase. Changing cultural awareness is hard, so in the meantime I have a forked version of kubeadm here, if anyone wants to just install onto a raspberry pi. (Build with make cross WHAT=cmd/kubeadm KUBE_BUILD_PLATFORMS=linux/arm)

dvdmuckle · 2018-03-18T18:21:11Z

@anguslees Do you have a compiled 1.9.4 version of your patched kubeadm? I'm having trouble compiling your patched version.

I'm surprised kubeadm doesn't have this behavior behind a flag. Perhaps a PR is in order?

timothysc · 2018-04-07T14:16:30Z

/assign @liztio

neolit123 · 2018-07-17T10:47:48Z

@joejulian thanks for investigating!
a potential fix can be found in kubernetes/kubernetes#66264

closing until further notice.

/close

ulm0 · 2018-10-17T19:52:42Z

is there a way to pass such settings in kubeadm init file? maybe in apiServerExtraArgs? it's a pain to be watching files for patching them, kinda hard to automate.

joejulian · 2018-10-17T22:31:49Z

There is not. Perhaps that would be a good feature to add.

Later updates have added even more things to check and the extended timeout my PR provided was no longer sufficient. I've given up on managing the timeout. The solution for me was to use ecdsa certificates.

Additionally the controller services including etcd take up more ram, now, than a Raspberry Pi has so rather than double the number of nodes to host the control plane, I've upgraded to Rock64s.

Excuse the pun, but my control plane has been rock solid ever since.

codesqueak · 2018-11-13T20:40:21Z

I've just been trying to do an install on a Raspberry Pi 3+ and can confirm that the install does indeed fail. Using the 'watch' trick on the kube-apiserver.yaml listed above does seem to work consistently ... but only if I change the initialDelaySeconds to 360. The suggested value of 180 seems marginal on my machines.

Just when things where getting too easy, kubeadm is now complaining that the version of Docker (18.09) is unsupported. A quick revert to 18.06 fixed the issue.

neolit123 · 2018-11-13T20:44:04Z

in kubeadm 1.13 we are adding configuration flag under ClusterConfig->ApiServer that can control the api server timeout.
https://godoc.org/k8s.io/kubernetes/cmd/kubeadm/app/apis/kubeadm/v1beta1
timeoutForControlPlane

anguslees · 2018-11-14T00:28:31Z

configuration flag under ClusterConfig->ApiServer that can control the api server timeout.

Searching through the codebase for TimeoutForControlPlane, I think this defaults to 4min, and is only used for the delay used by kubeadm itself to wait for the apiserver to become healthy. In particular, it does not alter the apiserver livenessprobe used by kubelet itself. Is that correct?

I don't think I've seen a counter-argument raised anywhere in the discussion around this issue. Is there a reason we don't just increase livenessProbe initialDelaySeconds and move on to some other problem?

Aside: As far as I can see from a quick read, TimeoutForControlPlane also doesn't take into account other non-failure causes for increased apiserver startup delay, like congestion while pulling multiple images, or additional timeout+retry loop iterations while everything is converging at initial install-time (timeout+retry repeatedly is the k8s design pattern ... and this happens sometimes on a loaded system, which is expected and just fine). I personally feel like 4minutes is both too long for impatient interactive users expecting a fast failure, and too short for an install process on a loaded/slow/automated system that is prepared to wait longer for expected success. How was this arrived at, can we default to 5mins? Can we keep retrying until SIGINT? Why are we imposing an artificial wall-clock deadline internally rather than inheriting it from the calling environment?

Afaics TimeoutForControlPlane is just exposing an arbitrary fatal internal deadline as a parameter where the only intended UX is just to increase the parameter until the deadline is never reached. Alternatively, we could just not impose that arbitrary fatal internal deadline in the first place...

joejulian · 2018-11-14T00:31:44Z

That's an excellent point and I wholeheartedly agree.

neolit123 · 2018-11-14T01:48:41Z

In particular, it does not alter the apiserver livenessprobe used by kubelet itself. Is that correct?

yes, there are no plans to modify the liveness probes from kubeadm, yet.

this rpi issue was qualified at a sig-cluster-lifecyle meeting as "something that should not happen", "seems almost like a race condition in the kubelet", "why does it only happen on rpi and not on other slower devices". i have to admit i haven't tested slow devices myself.

i.e. there was an agreement that increasing this value:
https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/util/staticpod/utils.go#L97
is not a good fix and it seems like a workaround for another bug.

How was this arrived at, can we default to 5mins?

the timeout was 30 minutes before being 4 minutes, because it took pulling images into consideration before 1.11.
if you have an opinion on 4 vs 5 minutes, a well outlined PR with strong points on the subject might make it in 1.14.

anguslees · 2018-11-14T02:21:47Z

this rpi issue was qualified at a sig-cluster-lifecyle meeting as "something that should not happen", "seems almost like a race condition in the kubelet", "why does it only happen on rpi and not on other slower devices".

Those are all reasons to continue looking for another bug somewhere else as well, but none of those are reasons not to increase initialDelaySeconds. Is there actually a downside to increasing initialDelaySeconds?

To approach it from another direction, if we know we have an issue elsewhere in kubernetes with a workaround that can be used in kubeadm - is it kubeadm's role to "hold the line" on purity and produce a known-broken result? That seems to conflict with the goal to be a tool that we expect people to actually use for actual deployments. So far I've been unable to use kubeadm on my cluster without patching it to increase timeouts (despite reporting it, with patches, over a year ago), which makes me sad.

(Apologies for letting some of my frustration around this issue leak into my tone)

if you have an opinion on 4 vs 5 minutes

Sigh. I was trying to make the case for no timeout in kubeadm, but I have yet to find a way to phrase that proposal convincingly (see this and other failed attempts in this issue, for example) :(

neolit123 · 2018-11-14T02:49:59Z

Those are all reasons to continue looking for another bug somewhere else as well, but none of those are reasons not to increase initialDelaySeconds. Is there actually a downside to increasing initialDelaySeconds?

it is a small change, but it was agreed to not add this increase because it will also apply to systems that don't exercise the problem.

To approach it from another direction, if we know we have an issue elsewhere in kubernetes with a workaround that can be used in kubeadm - is it kubeadm's role to "hold the line" on purity and produce a known-broken result? That seems to conflict with the goal to be a tool that we expect people to actually use for actual deployments.

facing the end user is a goal for kubeadm, that's true.
but again it's only a report for rpis and the actual bug should be escalated to sig-api-machinery (api server) and sig-node (kubelet) and possibly reproduced outside of kubeadm.

So far I've been unable to use kubeadm on my cluster without patching it to increase timeouts (despite reporting it, with patches, over a year ago), which makes me sad.

you don't have to patch kubeadm, you can patch manifest files instead.
kubeadm 1.13 graduates the init phases to top level commands - e.g. kubeadm init phase [phase-name]
phases are there mostly because users want custom control of how the control plane node is created..

if you do kubeadm init --help it will show you in what order the phases are executed.

so you can break down your kubeadm init command into the appropriate phases, you use custom manifests for the control plane components and skip the control-plane phase. there is also --skip-phases now.

you can already do that in 1.11 and 1.12, but it's less straight forward.

anguslees · 2018-11-14T06:12:46Z

because it will also apply to systems that don't exercise the problem.

So .. what's wrong with that? We fix bugs all the time that only trigger on some systems. Anywhere we have timeouts, we're going to need to tune them for our slowest-ever system, not just some subset of our environments, no?

Another angle on this is that as an ops guy, I'm terrified of cascading failures in overload situations, particularly with the apiserver itself. Afaics, the livenessprobe timeout should only ever trigger when things have clearly failed, not just when it deviates from someone's idea of "normal". Afaics we should have a very relaxed livenessprobe configured, even on our fastest hardware. My little rpi is just demonstrating this overload failure more easily - but it also applies to bigger servers under bigger overload/DoS scenarios. There is no upside to having a small initialDelaySeconds. kubeadm's default livenessprobe is unnecessarily aggressive on all platforms.

I'm sorry I keep repeating the same points, but afaics there are strong practical and theoretical reasons to extend initialDelaySeconds, and I'm just not understanding the opposing argument for keeping it small :(

neolit123 · 2018-11-14T15:11:44Z

adding a kubeadm configuration option for this is unlikely at this point.

i'm trying to explain that this is already doable with 3 commands in 1.13:

sudo kubeadm reset -f
sudo kubeadm init phase control-plane all --config=testkubeadm.yaml
sudo sed -i 's/initialDelaySeconds: 15/initialDelaySeconds: 20/g' /etc/kubernetes/manifests/kube-apiserver.yaml
sudo kubeadm init --skip-phases=control-plane --ignore-preflight-errors=all --config=testkubeadm.yaml

anguslees · 2018-11-15T01:29:08Z

I don't want an option, I'm trying to say the current fixed value (15) should be changed to a different fixed value (360 was suggested above).

.. But I don't want to drag this out any further. It's clear that the decision is to stick with the current value, so I will withdraw defeated. Thanks for your patience :)

alexellis · 2018-12-28T10:43:29Z

@neolit123 that combination looks great - far easier than what I'd documented - having to wait for the control plane set then quickly running sed in another terminal. https://github.com/alexellis/k8s-on-raspbian/blob/master/GUIDE.md

I'll test the instructions and look to update the guide.

alexellis · 2018-12-28T10:48:31Z

@neolit123 this is what I got using the config above on an RPi3 B+

[certs] apiserver serving cert is signed for DNS names [rnode-1 kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.96.0.1 192.168.0.110 192.168.0.26 192.168.0.26]
[certs] Generating "sa" key and public key
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "kubelet.conf" kubeconfig file
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0xaa7204]

goroutine 1 [running]:
k8s.io/kubernetes/cmd/kubeadm/app/phases/kubeconfig.validateKubeConfig(0xfa93f2, 0xf, 0xfb3d32, 0x17, 0x4032210, 0x68f, 0x7bc)
	/workspace/anago-v1.13.1-beta.0.57+eec55b9ba98609/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/phases/kubeconfig/kubeconfig.go:236 +0x120
k8s.io/kubernetes/cmd/kubeadm/app/phases/kubeconfig.createKubeConfigFileIfNotExists(0xfa93f2, 0xf, 0xfb3d32, 0x17, 0x4032210, 0x0, 0xf7978)
	/workspace/anago-v1.13.1-beta.0.57+eec55b9ba98609/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/phases/kubeconfig/kubeconfig.go:257 +0x90
k8s.io/kubernetes/cmd/kubeadm/app/phases/kubeconfig.createKubeConfigFiles(0xfa93f2, 0xf, 0x3ec65a0, 0x3f71c60, 0x1, 0x1, 0x0, 0x0)
	/workspace/anago-v1.13.1-beta.0.57+eec55b9ba98609/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/phases/kubeconfig/kubeconfig.go:120 +0xf4
k8s.io/kubernetes/cmd/kubeadm/app/phases/kubeconfig.CreateKubeConfigFile(0xfb3d32, 0x17, 0xfa93f2, 0xf, 0x3ec65a0, 0x1f7a701, 0xb9772c)
	/workspace/anago-v1.13.1-beta.0.57+eec55b9ba98609/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/phases/kubeconfig/kubeconfig.go:93 +0xe8
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases.runKubeConfigFile.func1(0xf66a80, 0x4208280, 0x0, 0x0)
	/workspace/anago-v1.13.1-beta.0.57+eec55b9ba98609/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/kubeconfig.go:155 +0x168
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1(0x3cc2d80, 0x0, 0x0)
	/workspace/anago-v1.13.1-beta.0.57+eec55b9ba98609/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:235 +0x160
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll(0x3ec9270, 0x3f71d68, 0x4208280, 0x0)
	/workspace/anago-v1.13.1-beta.0.57+eec55b9ba98609/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:416 +0x5c
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run(0x3ec9270, 0x24, 0x416bdb4)
	/workspace/anago-v1.13.1-beta.0.57+eec55b9ba98609/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:208 +0xc8
k8s.io/kubernetes/cmd/kubeadm/app/cmd.NewCmdInit.func1(0x3e97b80, 0x3e900e0, 0x0, 0x3)
	/workspace/anago-v1.13.1-beta.0.57+eec55b9ba98609/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/init.go:141 +0xfc
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).execute(0x3e97b80, 0x3e3ff80, 0x3, 0x4, 0x3e97b80, 0x3e3ff80)
	/workspace/anago-v1.13.1-beta.0.57+eec55b9ba98609/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:760 +0x20c
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0x3e96140, 0x3e97b80, 0x3e96780, 0x3d82100)
	/workspace/anago-v1.13.1-beta.0.57+eec55b9ba98609/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:846 +0x210
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).Execute(0x3e96140, 0x3c8c0c8, 0x116d958)
	/workspace/anago-v1.13.1-beta.0.57+eec55b9ba98609/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:794 +0x1c
k8s.io/kubernetes/cmd/kubeadm/app.Run(0x3c9c030, 0x0)
	/workspace/anago-v1.13.1-beta.0.57+eec55b9ba98609/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/kubeadm.go:48 +0x1b0
main.main()
	_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/kubeadm.go:29 +0x20

kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.1", GitCommit:"eec55b9ba98609a46fee712359c7b5b365bdd920", GitTreeState:"clean", BuildDate:"2018-12-13T10:36:44Z", GoVersion:"go1.11.2", Compiler:"gc", Platform:"linux/arm"}

In kubeadm-config.yaml - 192.168.0.26 is an LB pointing to 192.168.0.110

apiVersion: kubeadm.k8s.io/v1beta1
kind: ClusterConfiguration
kubernetesVersion: stable
apiServer:
  certSANs:
  - "192.168.0.26"
controlPlaneEndpoint: "192.168.0.26:6443"

alexellis · 2018-12-28T12:52:40Z

I get the same even without the external config/lb IP.

Alex

kfox1111 · 2018-12-31T17:15:34Z

I've been pushing folks to use kubeadm for a while, even schools wanting to run it on their pi clusters. While I understand not wanting to complicate the code base, I think its probably a good thing for your user base to support running on these little devices. it allows young folks to kick the Kubernetes tires on cheep hardware that otherwise might not. The workaround above, while valid, is much harder for this use case.

What about a compromise? Instead of making it configurable, add a simple heuristic that says, if not x86_64, set the default timeout higher?

neolit123 · 2019-01-01T01:26:52Z

[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "kubelet.conf" kubeconfig file
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0xaa7204]

odd, the admin.conf is machine generated and should contain a valid Config type with a current-context filed pointing to a context.

the panic comes from this line:
https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/phases/kubeconfig/kubeconfig.go#L236

stephenmoloney · 2019-01-06T11:36:37Z

I'm seeing the exact same error as ☝️ above with the following:

modify_kube_apiserver_config(){
  sed -i 's/failureThreshold: [0-9]/failureThreshold: 15/g' /etc/kubernetes/manifests/kube-apiserver.yaml && \
  sed -i 's/timeoutSeconds: [0-9][0-9]/timeoutSeconds: 20/g' /etc/kubernetes/manifests/kube-apiserver.yaml && \
  sed -i 's/initialDelaySeconds: [0-9][0-9]/initialDelaySeconds: 120/g' /etc/kubernetes/manifests/kube-apiserver.yaml
}
kubeadm init phase control-plane all --config=$${KUBEADM_CONFIG_FILE} && \
modify_kube_apiserver_config && \
kubeadm init \
--skip-phases=control-plane \
--ignore-preflight-errors=all \
--config=$${KUBEADM_CONFIG_FILE} \
--v 4

The following script does solve the issue for me using kubeadm versions 1.12, 1.13 (most of the time)

modify_kube_apiserver_config(){
  while [[ ! -e /etc/kubernetes/manifests/kube-apiserver.yaml ]]; do
    sleep 0.5s;
  done && \
  sed -i 's/failureThreshold: [0-9]/failureThreshold: 18/g' /etc/kubernetes/manifests/kube-apiserver.yaml && \
  sed -i 's/timeoutSeconds: [0-9][0-9]/timeoutSeconds: 20/g' /etc/kubernetes/manifests/kube-apiserver.yaml && \
  sed -i 's/initialDelaySeconds: [0-9][0-9]/initialDelaySeconds: 240/g' /etc/kubernetes/manifests/kube-apiserver.yaml
}

# ref https://github.com/kubernetes/kubeadm/issues/413 (initialDelaySeconds is too eager)
if [[ ${var.arch} == "arm" ]]; then modify_kube_apiserver_config & fi

kubeadm init \
  --config=$${KUBEADM_CONFIG_FILE} \
  --v ${var.kubeadm_verbosity}

arnaubennassar · 2019-01-17T18:31:46Z

I was in the same situation, getting the same error with the approach suggested by @neolit123 .
I wasn't able to run the script by @stephenmoloney , I'm not really familiar with bash scripting, probably my fault.

So I ported the script to python (which is installed by default on Raspbian, so no need for extra dependencies), in case anyone is interested:

import os
import time
import threading

filepath = '/etc/kubernetes/manifests/kube-apiserver.yaml'

def replace_defaults():
    print('Thread start looking for the file')
    while not os.path.isfile(filepath):
        time.sleep(1) #wait one second
    print('\033[94m -----------> FILE FOUND: replacing defaults \033[0m')
    os.system("""sed -i 's/failureThreshold: [0-9]/failureThreshold: 18/g' /etc/kubernetes/manifests/kube-apiserver.yaml""")
    os.system("""sed -i 's/timeoutSeconds: [0-9][0-9]/timeoutSeconds: 20/g' /etc/kubernetes/manifests/kube-apiserver.yaml""")
    os.system("""sed -i 's/initialDelaySeconds: [0-9][0-9]/initialDelaySeconds: 240/g' /etc/kubernetes/manifests/kube-apiserver.yaml""")

t = threading.Thread(target=replace_defaults)
t.start()
os.system("kubeadm init")

To run it: sudo python however_you_name_the_file.py
Thank you for pointing me to the solution, @stephenmoloney and @neolit123 !

nelsonyaccuzzi · 2020-06-21T20:00:17Z

Hi! this issue was of much help

I found a fancy way to resolve this using kustomize

mkdir /tmp/kustom

cat > /tmp/kustom/kustomization.yaml <<EOF
patchesJson6902:
- target:
    version: v1
    kind: Pod
    name: kube-apiserver
    namespace: kube-system
  path: patch.yaml
EOF

cat > /tmp/kustom/patch.yaml <<EOF
- op: replace
  path: /spec/containers/0/livenessProbe/initialDelaySeconds
  value: 30
- op: replace
  path: /spec/containers/0/livenessProbe/timeoutSeconds
  value: 30
EOF

sudo kubeadm init --config config.yaml -k /tmp/kustom

shoaibjdev · 2021-08-05T21:07:36Z

@pipejakob I can confirm that (on my bananapi) running this in another terminal at the right point in the kubeadm run makes everything come up successfully:
sed -i 's/initialDelaySeconds: [0-9]\+/initialDelaySeconds: 180/' /etc/kubernetes/manifests/kube-apiserver.yaml
(I usually also manually docker kill the old/restart-looping apiserver container, I'm not sure if that gets cleaned up automatically with static pods)

This actually did worked on low spec hardwares. Increasing Initial Delay for all the major control plane components (kube-apiserver, kuber-controller, kube-scheduler-manager) to 180 seconds brought stability, no more frequent re-starts of control plane components.

neolit123 · 2021-08-05T21:14:39Z

note kubeadm has the --experimental-patches (--patches in 1.22) feature that can be used to apply kubectl-like patches to the manifests before they are deployed. so you don't have to use sed.

these patches can also be applied with kubeadm upgrade ....

0xmichalis mentioned this issue Jan 7, 2018

ARM APIServer timeout is too short for upgrade. #644

Closed

k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jan 8, 2018

timothysc added help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. triaged labels Jan 31, 2018

k8s-ci-robot assigned liztio Apr 7, 2018

timothysc added this to the v1.11 milestone Apr 7, 2018

timothysc added the area/UX label Apr 7, 2018

k8s-ci-robot closed this as completed Jul 17, 2018

neolit123 mentioned this issue Dec 2, 2018

kubeadm init fail 90% of the time on aarch64 (rock64) due to TLS timeouts kubernetes/kubernetes#71505

Closed

neolit123 mentioned this issue Feb 1, 2019

kubeadm init fails #1380

Closed

tcurdt mentioned this issue Feb 2, 2019

kubeadm panic during phase based init #1382

Closed

anguslees mentioned this issue Mar 8, 2019

Controller-manager set to use cluster+host DNS results in non-functional cluster kubernetes-retired/bootkube#1039

Closed

neolit123 mentioned this issue Mar 14, 2019

Add readiness & liveness probes to kube-proxy kubernetes/kubernetes#75323

Closed

dobesv mentioned this issue Mar 29, 2019

kube-apiserver not given enough time to start kubernetes/kops#6702

Closed

apiserver fails to start because livenessprobe is too aggressive #413

apiserver fails to start because livenessprobe is too aggressive #413

Comments

anguslees commented Aug 25, 2017 • edited by neolit123 Loading

Is this a BUG REPORT or FEATURE REQUEST?

Versions

What happened?

What you expected to happen?

How to reproduce it (as minimally and precisely as possible)?

Anything else we need to know?

pipejakob commented Aug 25, 2017

anguslees commented Aug 28, 2017 • edited Loading

anguslees commented Aug 28, 2017

koalalorenzo commented Aug 30, 2017

pipejakob commented Aug 30, 2017

praseodym commented Aug 31, 2017

anguslees commented Aug 31, 2017

pipejakob commented Sep 1, 2017

alexjmoore commented Sep 1, 2017 • edited Loading

Tha-Fox commented Sep 6, 2017

anguslees commented Oct 9, 2017 • edited Loading

luxas commented Oct 27, 2017

anguslees commented Nov 13, 2017 • edited Loading

0xmichalis commented Jan 8, 2018

0xmichalis commented Jan 8, 2018

luxas commented Jan 8, 2018

rsterrenburg commented Feb 17, 2018

anguslees commented Feb 17, 2018 • edited Loading

dvdmuckle commented Mar 18, 2018

timothysc commented Apr 7, 2018

neolit123 commented Jul 17, 2018

ulm0 commented Oct 17, 2018

joejulian commented Oct 17, 2018

codesqueak commented Nov 13, 2018

neolit123 commented Nov 13, 2018

anguslees commented Nov 14, 2018 • edited Loading

joejulian commented Nov 14, 2018

neolit123 commented Nov 14, 2018 • edited Loading

anguslees commented Nov 14, 2018

neolit123 commented Nov 14, 2018

anguslees commented Nov 14, 2018 • edited Loading

neolit123 commented Nov 14, 2018 • edited Loading

anguslees commented Nov 15, 2018 • edited Loading

alexellis commented Dec 28, 2018

alexellis commented Dec 28, 2018 • edited Loading

alexellis commented Dec 28, 2018

kfox1111 commented Dec 31, 2018

neolit123 commented Jan 1, 2019

stephenmoloney commented Jan 6, 2019 • edited Loading

arnaubennassar commented Jan 17, 2019

nelsonyaccuzzi commented Jun 21, 2020

shoaibjdev commented Aug 5, 2021

neolit123 commented Aug 5, 2021

anguslees commented Aug 25, 2017 •

edited by neolit123

Loading

anguslees commented Aug 28, 2017 •

edited

Loading

alexjmoore commented Sep 1, 2017 •

edited

Loading

anguslees commented Oct 9, 2017 •

edited

Loading

anguslees commented Nov 13, 2017 •

edited

Loading

anguslees commented Feb 17, 2018 •

edited

Loading

anguslees commented Nov 14, 2018 •

edited

Loading

neolit123 commented Nov 14, 2018 •

edited

Loading

anguslees commented Nov 14, 2018 •

edited

Loading

neolit123 commented Nov 14, 2018 •

edited

Loading

anguslees commented Nov 15, 2018 •

edited

Loading

alexellis commented Dec 28, 2018 •

edited

Loading

stephenmoloney commented Jan 6, 2019 •

edited

Loading