
kubeadm should not re-use bind-address for api-server while using experimental-control-plane #1348

Closed
aleks-mariusz opened this issue Jan 11, 2019 · 24 comments
Labels: area/HA, kind/api-change, priority/awaiting-more-evidence
Milestone: Next

@aleks-mariusz

Is this a BUG REPORT or FEATURE REQUEST?

BUG


I am trying to set up HA, and after the apiVersion of the YAML passed to kubeadm moved to v1beta1 (from v1alpha2), I had to adjust and use the --experimental-control-plane flag. It is certainly experimental enough: the first node comes up fine, but when the 2nd node joins using the new flag, kubeadm creates a kube-apiserver manifest with the same bind-address parameter as the first api-server, which results in seemingly unrelated errors in the log file:

F0111 13:57:18.372186       1 controller.go:147] Unable to perform initial IP allocation check: unable to refresh the service IP block: Get https://10.12.100.131:6443/api/v1/services: x509: certificate signed by unknown authority

Manually editing /etc/kubernetes/manifests/kube-apiserver.yaml with an in-place sed on the relevant bind-address line fixes it and allows the control plane to come up successfully.
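
For illustration, the manual fix amounts to pointing the copied flag at the joining node's own address in the generated static Pod manifest; the second node's IP below (10.12.100.132) is hypothetical:

# excerpt of /etc/kubernetes/manifests/kube-apiserver.yaml on the 2nd node, after the edit
spec:
  containers:
  - command:
    - kube-apiserver
    - --bind-address=10.12.100.132   # was 10.12.100.131, copied from the first node
    # ...remaining generated flags unchanged...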

So this seems like a bug in the way kubeadm sets up the kube-apiserver manifest on the non-primary nodes, in that it re-uses the bind-address from the first node.

P.S. I'm specifying a bind-address in my YAML because I don't want the API server to bind on 0.0.0.0 (all interfaces): I'm running nginx on a VIP (set up by keepalived) on each node, so I want the API server to bind only to a specific IP.

Versions

kubeadm version (use kubeadm version): v1.13.2
kubeadm version: &version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.2", GitCommit:"cff46ab41ff0bb44d8584413b598ad8360ec1def", GitTreeState:"clean", BuildDate:"2019-01-10T23:33:30Z", GoVersion:"go1.11.4", Compiler:"gc", Platform:"linux/amd64"}
Environment:

  • Kubernetes version: v1.13.2
  • Cloud provider or hardware configuration: centos 7 kvm on centos 7 baremetal hypervisor
  • OS (e.g. from /etc/os-release): centos 7
  • Kernel (e.g. uname -a): 3.10.0-957.1.3.el7.x86_64 (latest centos 7.6 updates applied)
  • Others: software load balancer self-hosted on master nodes

What happened?

The API server on the 2nd and 3rd nodes goes into a crash-backoff loop after repeated certificate errors.

What you expected to happen?

The 2nd (and 3rd) node joins the cluster with a functioning API server.

How to reproduce it (as minimally and precisely as possible)?

Use the following config on node 1 and run kubeadm init as per the official docs, then join the 2nd node with kubeadm join --experimental-control-plane:

apiVersion: kubeadm.k8s.io/v1beta1
kind: ClusterConfiguration
kubernetesVersion: stable
apiServer:
  certSANs:
  - api-k8s-lab # hostname of load-balancer entry
  extraArgs:
    bind-address: 10.12.100.131 # ip of first node only
controllerManager:
  extraArgs:
    address: 0.0.0.0
controlPlaneEndpoint: api-k8s-lab:6443
networking:
  serviceSubnet: 10.12.12.0/23
scheduler:
  extraArgs:
    address: 0.0.0.0
---
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: ipvs

Anything else we need to know?

I am using keepalived on each master node to keep a VIP on one of the master nodes, and nginx on each master node listening on the VIP. Both run in Docker containers.

neolit123 added the priority/awaiting-more-evidence label Jan 11, 2019
@neolit123 (Member)

hello and thank you for the report.

i am trying to set up HA and after the apiVersion of the yaml that gets passed to kubeadm went to v1beta1 (from v1alpha2)

FYI, the config had a v1alpha3 before v1beta1.

it creates a kube-apiserver yaml manifest that has the same bind-address parameter as the first api-server
...
P.S. I'm specifying a bind-address in my YAML because I don't want the API server to bind on 0.0.0.0 (all interfaces): I'm running nginx on a VIP (set up by keepalived) on each node, so I want the API server to bind only to a specific IP
...
extraArgs:
bind-address: 10.12.100.131 # ip of first node only

i don't think there is a way for kubeadm (and join --control-plane) to know these custom bind addresses that you are specifying. if each api-server instance has to have a unique one you'd have to specify them manually.

@fabriziopandini @detiber
for further comments.

@neolit123 (Member)

since I'm running nginx on a VIP (set up by keepalived) on each node, I want the API server to bind only to a specific IP

something else to note here. the purpose of the field controlPlaneEndpoint is to serve as the load balancer in front of the control plane nodes. what are you running on api-k8s-lab:6443?

@aleks-mariusz (Author) commented Jan 11, 2019

@neolit123 that's a VIP (virtual IP) that is moved around by keepalived, and behind that VIP is an nginx proxy acting as a load balancer between the 3 api-server IPs. (I am running these on bare metal, so I don't have a cloud-provided load balancer resource to use; I rolled my own, self-hosted on the master nodes.)

This is the main reason I'm specifying a bind-address in the first place: otherwise the api-server defaults to binding on 0.0.0.0 and then dies with an "address already in use" error (on the VIP).

Maybe not the most elegant approach, but without a separate load balancer this is the best work-around I could come up with, so I needed to adjust which address kube-apiserver binds to by specifying it explicitly.

@aleks-mariusz (Author) commented Jan 11, 2019

Basically, my thought is that kubeadm should not carry the bind-address setting over to subsequent master nodes when joining the control plane (at least when one is specified that isn't 0.0.0.0). This is a fundamental assumption that breaks down the moment a bind-address is specified.

When I used separate kubeadm-config.yaml files (back in the v1alpha3 days), I was able to set individual bind-address values in each master node's MasterConfiguration. Using --experimental-control-plane, I lost that ability. Is there a better recommended way to specify this, other than manually sed-editing the kube-apiserver manifest in place on the non-primary master nodes?

@neolit123 (Member)

like i've mentioned above, you should be using controlPlaneEndpoint for load balancing and not the bind-address of individual api-servers.

this is outlined in our HA guide:
https://kubernetes.io/docs/setup/independent/high-availability/

When I used separate kubeadm-config.yaml files (back in the v1alpha3 days), I was able to set individual bind-address values in each master node's MasterConfiguration

You can always modify the api-server config and restart the pod.
But we also have plans to support phases for the join command, which will make this easier for control-plane nodes joining a cluster.

@aleks-mariusz (Author) commented Jan 11, 2019

can you clarify what you mean by

you should be using controlPlaneEndpoint for load balancing and not the bind-address for individual api-servers

I'm confused, because to me it seems that I am using controlPlaneEndpoint with the value of the load-balanced VIP hostname (while bind-address refers only to the individual api-server). But since I have a VIP with nginx listening on port 6443 (only on the VIP), if I don't specify a bind-address then the api-server pod will fail to start with an "address already in use" error.

The docs you're pointing me to assume there is an external load balancer device, but that assumption breaks down in this set-up.

you can always modify the api-server config...

i have to modify it because it's being set incorrectly :-)

Is there a better way to achieve this without manually modifying what kubeadm generates (in essence, fixing kubeadm's incorrect behavior in this edge case by hand)?

@neolit123 (Member) commented Jan 11, 2019

but since i have a VIP that has an nginx listening on port 6443 (only on the VIP), and if i don't specify a bind-address then the api-server pod will fail to start due to "address already in use" error..

If I understand this correctly, this is not an HA setup, because if the first machine goes away the rest will not have a load balancer, correct?

the docs you're pointing me to assumes there is an external load balancer device.. but that assumption breaks apart in this set-up

LBs like keepalived allow you to run an LB service on each control-plane node, which isn't exactly an external LB; that is how controlPlaneEndpoint can be used too.
Truly external LBs are out of scope for our guide.

@aleks-mariusz (Author)

It isn't ideal HA, but it is HA: if the first machine does go away, keepalived should pick the VIP back up and the nginx on the new VIP host will load-balance across the remaining API servers.

It's kind of a distributed LB approach that should still provide HA (I tested it a while back).

None of this changes the original issue, though: in this edge case kubeadm incorrectly propagates bind-address from the ClusterConfiguration to the other hosts, breaking their API servers.

@neolit123 (Member) commented Jan 11, 2019

It isn't ideal HA, but it is HA: if the first machine does go away, keepalived should pick the VIP back up and the nginx on the new VIP host will load-balance across the remaining API servers.

It's kind of a distributed LB approach that should still provide HA (I tested it a while back).

it feels to me that in this case nginx adds a layer of complication that is not needed.
this setup is neither recommended nor supported.

None of this changes the original issue, though: in this edge case kubeadm incorrectly propagates bind-address from the ClusterConfiguration to the other hosts, breaking their API servers.

@fabriziopandini can comment if we want to omit copying the field.

@aleks-mariusz (Author)

My understanding is that the API server is active/active, so I thought that, for scalability's sake, having nginx load-balance across all (available) API servers made sense.

Can you propose an alternative that doesn't require an external load balancer and still spreads requests across more than one api-server?

@fabriziopandini (Member) commented Jan 12, 2019

@aleks-mariusz
The ClusterConfiguration object by design applies to the whole cluster, so if you set the bind address there, it is applied to all the nodes. It should be removed unless you plan to change the API server manifest manually.

As of today, the only setting that can be customized for each individual API server instance is the LocalAPIEndpoint (the advertise address), and this can be done using the InitConfiguration object or the JoinConfiguration object.
See https://godoc.org/k8s.io/kubernetes/cmd/kubeadm/app/apis/kubeadm/v1beta1 for more details
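
For illustration, a minimal sketch of where that per-instance setting (the advertise address of each control-plane node) lives in v1beta1; the addresses and discovery values below are hypothetical placeholders:

apiVersion: kubeadm.k8s.io/v1beta1
kind: InitConfiguration              # first control-plane node
localAPIEndpoint:
  advertiseAddress: 10.12.100.131    # this node's own IP (placeholder)
  bindPort: 6443
---
apiVersion: kubeadm.k8s.io/v1beta1
kind: JoinConfiguration              # each joining control-plane node
controlPlane:
  localAPIEndpoint:
    advertiseAddress: 10.12.100.132  # this node's own IP (placeholder)
    bindPort: 6443
discovery:
  bootstrapToken:
    apiServerEndpoint: api-k8s-lab:6443
    token: abcdef.0123456789abcdef   # placeholder
    unsafeSkipCAVerification: true   # placeholder; prefer caCertHashes in practice

Note that this sets the advertised endpoint of each instance, not the bind-address flag itself.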

With regards to your specific setup, if you want to keep the extra layer of the nginx load balancing, a possible workaround is to use a different port for the load balancing (or for the API servers)

One important thing to note is that this extra layer of load balancing applies only to external traffic directed at the API server (which usually is not significant in terms of throughput), while internal traffic is balanced by other mechanisms.
I hope this helps.

@aleks-mariusz (Author) commented Jan 18, 2019

I had overlooked internal traffic and the other mechanisms used to service it. Regarding external traffic: if we have to work around my 0.0.0.0:6443 already-in-use condition by changing port numbers, it seems preferable to change the bind port of the API servers themselves. That way at least the well-known port 6443 is maintained (at least externally).

So would something like this:

localAPIEndpoint:
  bindPort: 16443

accomplish this? That is, versus advertising the API service externally on a port other than the well-known 6443.

For discussion's sake, is customizing the bind-address perhaps undesirable? With the default of 0.0.0.0, the API server also listens on 127.0.0.1; however, it would technically also listen on every other interface's address (including container-owned transient interfaces) if the API server were to restart after other containers had started. That seems like it could lead to problems down the line, though the impact is debatable, I suppose. Or does it somehow help that the API server listens on the loopback interface? Internally the kernel knows (from the routing table) to use loopback to reach its own address anyway, but is anything expecting or benefiting from the API server listening on the loopback interface as well?


In any case, while we're discussing the API server, can any of you offer general best-practice ideas for typical bare-metal deployments of 3 API servers in an HA config? Specifically when not cloud-based (and with no hardware LB available on-prem). To recap, I've turned to nginx to load-balance between the 3 API servers, with nginx listening only on a VIP floating between them (managed by keepalived), and the API server asked to bind only to the host's main interface IP instead of 0.0.0.0.

This work-around worked fine with k8s <= 1.11, back when kubeadm used a v1alpha3 MasterConfig object, since when bootstrapping the cluster I created a separate kubeadm.yaml on each master host (thus letting me customize the bind-address on a per-API-server basis). But now, as of (I think) v1.12, the YAML apiVersion went to v1beta1 (with support for earlier versions seemingly withdrawn for bootstrapping newer versions of k8s), so there is only one ClusterConfig for the whole cluster. And thus I uncovered that when I specified a bind-address with a specific IP, it was set identically on each API server, causing the 2nd and 3rd to not start (they were trying to listen on an IP that wasn't available on those hosts).

This is the reason this issue was opened in the first place, as kubeadm doesn't currently handle this edge case properly. I thought this was a bug in the way kubeadm behaves, but you explained this might be how it's designed.

So I'm now re-evaluating whether my initial approach (nginx/keepalived) really follows best practices. Personally, I don't like the idea of having to alter port numbers to work around the already-in-use issue, so I'm wondering what other ideas people might have.

You've now got me considering eliminating nginx entirely (but keeping keepalived to move the VIP as needed); the impact, however, is that all requests would be directed at a single API server. From what I've read, API servers can act active/active, so this side-effect of removing nginx (i.e. no load balancing in front of them at all) means one API server would take all the load, which doesn't seem ideal either.

What are other people with on-prem clusters doing?

@fabriziopandini (Member)

So would something like this ... accomplish this ?

I think yes!
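
For context, a sketch of how that snippet could sit alongside the cluster-wide endpoint, assuming the VIP/nginx keep the well-known port 6443 while each API server binds locally on 16443 (addresses and ports hypothetical):

apiVersion: kubeadm.k8s.io/v1beta1
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: 10.12.100.131        # this node's own IP (placeholder)
  bindPort: 16443                        # local API server port, freeing 6443 for the VIP
---
apiVersion: kubeadm.k8s.io/v1beta1
kind: ClusterConfiguration
controlPlaneEndpoint: api-k8s-lab:6443   # VIP + nginx keep the well-known port

On joining control-plane nodes, the same localAPIEndpoint block would presumably go under controlPlane in the JoinConfiguration, as in the earlier sketch.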

is customizing the bind-address something that is perhaps undesirable?

AFAIK, customising the bind-address is fine.
The point, from my PoV, is simply that kubeadm in some cases has to define trade-offs between keeping the UX/config surface simple vs enabling full customization.
HA/multi-node support is one of them.

During this journey, if there is demand from the community (and hopefully also some help), I'm personally open to reconsidering the current assumptions (e.g. the list of node-specific settings).
So thank you for this issue, and if you want to show up at the kubeadm office hours to campaign for allowing per-node configurability of the bind address, you are welcome!

This work-around worked fine ....

We, as a SIG, are working to make this even simpler and to graduate all the different pieces as fast as possible (hopefully in v1.14), in order to increase stability across releases as well, but we are not there yet, as described in the docs / the feature graduation process.

What are other people with on-prem clusters doing?

I hope other users will answer here ....

@tuco86 commented Mar 16, 2019

We have the same problem and worked around it by setting apiServer.extraArgs.bind-address: <node_ip> so the first server comes up as expected. Then, after each join, we sed-replace bind-address with the local node IP and restart the kubelet. When the apiserver binds on the node IP, we are free to set up a load-balancing haproxy bound to a floating IP. HA is achieved by moving the floating IP to a healthy node via keepalived.

As for the need: Most smaller cloud providers don't ship load balancers. This is the only way to get any HA on these platforms besides dedicated load balancer nodes. I hope there will be extraArgs support in join config in the future.

fabriziopandini mentioned this issue Mar 18, 2019
fabriziopandini added this to the Next milestone Apr 20, 2019
@fabriziopandini (Member)

Config related
/assign @rosti

fabriziopandini added the kind/api-change label Apr 20, 2019
@stgleb commented Apr 28, 2019

I use a custom port (443) for kube-apiserver; the first node starts with the correct port, but the rest of the masters come up with port 6443.

@dparalen commented Jul 2, 2019

...I've just hit the same...
IMO the issue here is at least three-fold:

  • the inability of kubeadm to configure a custom API bind-address for each master node (service) instance, which also has security implications; 0.0.0.0 is a reasonable default bind address
  • poor UX in having to post-process the generated API service configuration due to the previous shortcoming, especially considering that HA deployment is one of kubeadm's declared features
  • the (undocumented) assumption that controlPlaneEndpoint is always an external facility

From the discussion so far, the ClusterConfiguration object might not be the ideal spot to set up the API bind addresses; maybe we need a join-phase NodeRole object to be able to customize node-specific service/role configuration?

@detiber (Member) commented Jul 2, 2019

@dparalen if ControlPlaneEndpoint is not an external facility (DNS, virtual IP, load balancer), then how are you configuring the kubelets in the cluster to talk to the API server?

Unless things have changed recently, all Kubernetes clients, including the kubelet, require a single endpoint to contact and do not accept a list of endpoints.

@dparalen commented Jul 2, 2019

@detiber in the context of this bug, I'd like to use the floating VIP, the address that Nginx (or in my case HAProxy) binds to, on any of the master nodes. That's why folks are working around the 0.0.0.0 being the default K8s API bind address.

@detiber (Member) commented Jul 2, 2019

@dparalen If you could specify a separate bind address per control plane instance, is there anything that would prevent you from using the floating VIP as the ControlPlaneEndpoint?

@dparalen commented Jul 2, 2019

@detiber I hope not, I'm still in the process of working around this in my env.

@dparalen commented Jul 3, 2019

@detiber actually, there's one more thing I'm facing in the context of this bug: the port_6443 preflight check has been failing for me. Having ignored the preflight check, the netstat output is:

root@mmon1:~# netstat -tlpn | grep :6443
tcp        0      0 192.168.3.30:6443       0.0.0.0:*               LISTEN      97018/kube-apiserve
tcp        0      0 192.168.3.235:6443      0.0.0.0:*               LISTEN      27668/haproxy
root@mmon1:~#

@neolit123 (Member)

#1348 (comment)

the inability of kubeadm to configure custom API bind-address per each master node (service) instance which also has security implications; 0.0.0.0 is a reasonable default bind address

kubeadm treats control-plane nodes as replicas, by design.
you can still hack around that by customizing per node.

the work to enable such "kustomizations" has begun and will be available hopefully in 1.16:
#1379
(see the design docs / KEP).

however it comes with the caveat that once you enable such customizations you enter unsupported territory.

poor UX having to resort to post-process the generated API service configuration due to the previous shortcoming esp. considering HA deployment being one of the kubeadm declared features

see notes about kustomization above.

the (undocumented) assumption the controlPlaneEndpoint always being an external facility

controlPlaneEndpoint can be any FQDN or address that leads traffic to an API server, so it's not technically required to be external or a VIP.

please continue discussions on #1379

@neolit123 (Member)

@rosti, @fabriziopandini
While I closed this issue, if you think there are separate API-related items, let's log them with a concise outline and API examples in a new ticket.
Thanks.
