
api: loadbalancer: Internal does not create correct Route53 entries #4252

Closed
dcowden opened this issue Jan 10, 2018 · 19 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@dcowden

dcowden commented Jan 10, 2018

Operating Environment
kops 1.8.0
aws
k8s 1.8.4

What we expect to happen
Our kops spec looks like this:

apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: 2017-12-21T14:37:50Z
  name: ecom-2.dev.domain.com
spec:
  api:
    loadBalancer:
      type: Internal
...
  masterInternalName: api.internal.ecom-2.dev.domain.com

Per the documentation here, we expect kops to create and maintain a Route53 record for api.internal.ecom-2.dev.domain.com that points to an AWS load balancer created for the masters.

What actually happens
Instead, we get a Route53 record that is a list of A records pointing directly at the master instances.
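
For clarity, here's a rough sketch of the record we expected versus what we actually get (this is just an illustration, not real Route53 syntax; the ELB name and IPs are placeholders):

  expected:
    name: api.internal.ecom-2.dev.domain.com
    type: A (alias)
    aliasTarget: internal-api-ecom-2-dev-xxxx.us-east-1.elb.amazonaws.com  # placeholder ELB DNS name
  actual:
    name: api.internal.ecom-2.dev.domain.com
    type: A
    values: [172.20.x.x, 172.20.y.y, 172.20.z.z]  # placeholder master instance IPs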

Why this matters
Our workloads experience failures because the nodes can no longer talk to a master -- probably because they are using the IP address directly. When a master node dies, workloads get failures doing lookups through kube-dns; research has shown that this is because kube-dns on the node is timing out trying to reach the masters via the old IP.

We can temporarily fix the issue by manually updating the Route53 record to point to the load balancer. We've verified that if we do this, and THEN kill a master, the workloads are not affected.

The problem is that when the new master comes back online, the cluster rewrites the Route53 name back to A records -- so the workaround will not survive rolling updates.

I believe this is the root cause of #4247 and #3702, which I am going to close in favor of this issue.

I think this may be the actual root cause of #2634. The solution mentioned there is exactly as I describe, but I suspect it would not survive a rolling update.

What we need

We need to know what configuration we have wrong that causes us to get an internal name pointing to the masters directly, and how to fix it.

Alternatively, a workaround that simply stops kops from updating the internal DNS record would be super helpful.

@dcowden
Author

dcowden commented Jan 11, 2018

As a workaround, we updated the api.internal DNS record manually and then scaled dns-controller to 0 to prevent the record from getting changed back.
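
For anyone else trying this: the effective change is just scaling the dns-controller Deployment in kube-system to zero replicas. A minimal sketch, assuming the standard kops deployment name and namespace and omitting most fields (the apiVersion may differ depending on your kops/k8s version):

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: dns-controller
    namespace: kube-system
  spec:
    replicas: 0   # while this is 0, dns-controller stops rewriting api.internal (and all other managed records)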

This workaround resulted in an improvement, but it means k8s can no longer create DNS names when we launch services, so it's not acceptable long term.

We also noted that while our workloads kept running this time, the nodes in the same AZ as the master still went NotReady, making them unschedulable. This is still not the desired behavior.

@rlees85
Contributor

rlees85 commented Feb 16, 2018

I've been battling with something similar. The api. DNS record is created correctly as an alias to the load balancer, but as soon as any of the masters comes up it is overwritten with the 3 master IP addresses as A records.

In my case, masterInternalName and masterPublicName were set to the same value in the YAML. Changing them to be different fixed the problem, at least for masterPublicName.
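
In other words, something like this (the hostnames below are just placeholders):

  spec:
    masterPublicName: api.example.com               # public API endpoint name
    masterInternalName: api.internal.example.com    # kept distinct from the public name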

I doubt it will solve your problem, though, as your issue seems to be the k8s nodes talking internally to now-dead masters.

Maybe

  api:
    loadBalancer:

only supports the public endpoint, which perhaps should be documented more clearly.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 17, 2018
@zachaller
Contributor

Now that kubernetes/kubernetes#63492 has been accepted upstream, I feel it should be a valid option to put an ELB in front of the internal API. We are now running into what look like timing issues with DNS updates and propagation, which an ELB would make much easier to manage: with DNS propagation taking so long, we get nodes going NotReady if we, say, terminate all 3 masters.

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 16, 2018
@mellowplace
Contributor

Looking at the code, the problem appears to be that the internal annotation is always added to the apiserver pod: https://github.com/kubernetes/kops/blob/master/nodeup/pkg/model/kube_apiserver.go#L444

This means the record will always be updated by dns-controller. Should this annotation not be conditional in the same way as the external one?
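
For context, here's a rough sketch of what the annotations on the static apiserver pod end up looking like, if I'm reading the code right (abridged; the annotation keys are the ones dns-controller watches, and the hostname is a placeholder):

  metadata:
    annotations:
      # always added by the linked code -> dns-controller keeps writing A records for the masters
      dns.alpha.kubernetes.io/internal: api.internal.example.com
      # dns.alpha.kubernetes.io/external, by contrast, appears to be added only when DNS (not an ELB) fronts the public API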

@mellowplace
Contributor

Just to chime in with a bit more detail on this. We’ve been affected by a memory leak in kube-apiserver that causes a master node to become unresponsive until the docker health check is able to restart docker (which took 15 mins in one case due to high memory pressure).

As far as I can tell, dns-controller does not modify the internal API record based on the availability of the apiserver pods or nodes, so the unresponsive pod's IP remains in the record set.

This in turn means that, on occasion, the kubelet on some worker nodes tries to connect to the struggling master and fails, resulting in that node being marked NotReady.

This seems naive on the part of Kubernetes, and I suspect some improvements need to be made there, but in the meantime we'd also like to use the ELB created by kops for kubelet, kube-proxy, etc. on the worker nodes. At least that way a health check could take unresponsive apiservers out of the pool.

@mellowplace
Contributor

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jun 20, 2018
mellowplace pushed a commit to mellowplace/kops that referenced this issue Jun 21, 2018
mellowplace added a commit to mellowplace/kops that referenced this issue Jun 22, 2018
@mellowplace
Contributor

PR with proposed fix here: #5375

mellowplace added a commit to mellowplace/kops that referenced this issue Jun 25, 2018
@justinsb
Member

Ah - thank you all for all the detective work to figure this out.

I think we're close here. To date kops has always used DNS with A records of master nodes when talking to the master inside the cluster (e.g. kubelet -> master), even if there is an ELB present in front of the masters. The ELB is assumed to be primarily for traffic from outside the cluster to reach the master.

We do support internal ELBs, but again this is intended for traffic from outside the cluster (for clients that are on the VPC, either directly or via a VPN/DirectConnect).

So I don't think the fix in #5375 is quite right, because users that have configured an ELB don't necessarily want their internal traffic going through it.

We could expose another option to allow for internal kubelet -> master traffic to go via an ELB. I'm not sure whether this should be a separate ELB or the same one. (Or whether it should be an NLB).

Another gotcha is that (until we have etcd-manager) we still use DNS for etcd cluster maintenance, so even if we solve this problem we might still have trouble there -- except that hopefully this is actually specific to kubelet -> master communication (if it's all related to kubernetes/kubernetes#63492).

There's also hope that this gets fixed in 1.11, or if/when 63492 is backported.

@mellowplace
Contributor

OK, thanks for the info - I see I've gotten the wrong end of the stick on the proposed fix: because api.internal is created with the placeholder IP, you're right that this wouldn't work; the record would never be pointed at the ELB.

I think we'd be totally fine with using the same ELB. I can't think of a reason why you'd want a separate one, unless maybe for auditing purposes, or if you've split your VPCs in such a way that the k8s nodes couldn't see the ELB - is that possible?

Our config relating to this, BTW, is:

spec:
  api:
    loadBalancer:
      type: Internal

We don't specify the DNS setting, so we'd kind of just assumed the ELB was used for everything. Do you think some people are expecting that not to be the case? If so, I guess we would need an explicit setting. Perhaps something like the following...

spec:
  api:
    loadBalancer:
      type: Internal
      useApiInternal: true

Then, on the basis of useApiInternal being present and true, the record could be created to point at the ELB and the dns-controller annotation excluded.

@dcowden
Author

dcowden commented Jun 29, 2018

I agree with @mellowplace. Our workaround uses the same ELB for internal traffic and we have no problems, so we could use the mod he proposed. Using a separate ELB seems overcomplicated to me.

We maintain our clusters with automated jobs that periodically delete the nodes one at a time to patch them. In development, we do it very quickly (the whole cluster each night) to try to trigger problem scenarios. With the stock kops config, every master deletion resulted in some application problems due to stale master IPs.

Ever since we changed to using the ELB for node-to-master communication, our node maintenance jobs have worked great.

@lkysow

lkysow commented Jun 29, 2018

You need an internal ELB for node -> master communication, but some people might have an external ELB for API communication because they don't have a VPN into their cluster - so that's one reason to keep them separate.

@mellowplace
Contributor

I've updated #5375 with the proposal from #4252 (comment)

@lkysow So potentially the k8s nodes would not be able to see an external ELB? My reading of the code is that we'd be dependent on the CIDRs in spec.KubernetesApiAccess.

If we want to avoid creating another ELB, then we might have to add a source security group to give the k8s nodes guaranteed access (we could make this conditional on useApiInternal).

@justinsb thoughts?

@lkysow

lkysow commented Jul 3, 2018

@mellowplace sorry but I can't give you a good answer right now because I don't have the time to dig into this to understand the full context. My comment was just to say that in some cases it makes sense to have an internal ELB for cluster communication and an external ELB for external API communication. I'm not sure if that matters in this context though!

mellowplace added a commit to mellowplace/kops that referenced this issue Jul 23, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 1, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 31, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
