ELB IP changes can bring the cluster down #598
@danielfm Thanks for the report. I think I have never encountered the issue myself in my production cluster, but I believe this is a BIG issue. For me, the issue seems to be composed of two parts: one is a long TTL other than the ELB's (for example, the TTL of the CNAME record kube-aws creates, which points to the ELB's DNS name), and the other is an issue in ELB and/or kubelet which prevents kubelet from detecting broken connections to apiservers. Is my assumption correct? Anyway, I'm happy to work around the issue in kube-aws.

Possible work-arounds

Perhaps:

* monitor the ELB's backend IPs and restart kubelet whenever they change, so that kubelet can reconnect to one of the active ELB IPs before the k8s node gets marked NotReady. If we go that way, the monitor should be executed periodically with a short interval.
* or implement a DNS round-robin with a health-checking mechanism for serving k8s API endpoints, like suggested and described in #281 #373
This hash changes only when the backend IPs of the ELB have changed.
@danielfm I guess setting
#!/usr/bin/env bash
set -vxe

current_elb_backends_version() {
  dig "${API_ENDPOINT_DNS_NAME:?Missing required env var API_ENDPOINT_DNS_NAME}" +noall +answer +short |
    # Take into account only IPs, even if dig returned a CNAME answer
    # (i.e. when API_ENDPOINT_DNS_NAME is a CNAME rather than an A (or Route 53 "Alias") record).
    grep -o '[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}' |
    # Sort the IPs so that DNS round-robin doesn't unexpectedly trigger a kubelet restart.
    sort |
    sha256sum |
    # sha256sum outputs "<sha256 hash value>  -"; keep only the hash value, dropping the trailing hyphen.
    awk '{print $1}'
}

run_once() {
  local file=$ELB_BACKENDS_VERSION_FILE
  prev_ver=$(cat "$file" || echo)
  current_ver=$(current_elb_backends_version)
  echo comparing the previous version "$prev_ver" and the current version "$current_ver"
  if [ "$prev_ver" == "" -o "$prev_ver" == "$current_ver" ]; then
    echo "the version has not changed. no need to restart kubelet."
    if [ "$KUBELET_RESTART_STRATEGY" == "watchdog" ]; then
      echo "notifying kubelet's watchdog not to trigger a restart of kubelet..."
      local kubelet_pid
      kubelet_pid=$(systemctl show "$KUBELET_SYSTEMD_UNIT_NAME" -p MainPID | cut -d '=' -f 2)
      systemd-notify --pid="$kubelet_pid" WATCHDOG=1
    fi
  else
    echo "the version has changed. need to restart kubelet."
    if [ "$KUBELET_RESTART_STRATEGY" == "systemctl" ]; then
      systemctl restart "$KUBELET_SYSTEMD_UNIT_NAME"
    fi
  fi
  echo "writing $current_ver to $file"
  echo "$current_ver" > "$file"
}

ELB_BACKENDS_VERSION_FILE=${ELB_BACKENDS_VERSION_FILE:-/var/run/coreos/elb-backends-version}
KUBELET_SYSTEMD_UNIT_NAME=${KUBELET_SYSTEMD_UNIT_NAME:-kubelet.service}
KUBELET_RESTART_STRATEGY=${KUBELET_RESTART_STRATEGY:-systemctl}
WATCH_INTERVAL_SEC=${WATCH_INTERVAL_SEC:-3}

systemd-notify --ready
while true; do
  systemd-notify --status "determining if there're changes in elb ips"
  run_once
  systemd-notify --status "sleeping for $WATCH_INTERVAL_SEC seconds"
  sleep $WATCH_INTERVAL_SEC
done
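For context, here is a rough sketch of how a watcher like the one above could be wired up as a systemd service. The unit name, script path, and endpoint value are assumptions for illustration, not what kube-aws actually ships:

```bash
# All names and paths below are hypothetical.
sudo tee /etc/systemd/system/elb-ip-watcher.service >/dev/null <<'EOF'
[Unit]
Description=Restart kubelet when the API endpoint ELB changes its backend IPs
After=network-online.target kubelet.service

[Service]
# Type=notify because the script calls `systemd-notify --ready`;
# NotifyAccess=all because systemd-notify runs as a child of the main shell process.
Type=notify
NotifyAccess=all
Environment=API_ENDPOINT_DNS_NAME=kube-api.example.com
Environment=KUBELET_RESTART_STRATEGY=systemctl
ExecStart=/opt/bin/elb-ip-watcher
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now elb-ip-watcher.service
```

If the `watchdog` strategy were used instead of `systemctl`, kubelet.service itself would also need `WatchdogSec=` and `NotifyAccess=all` so the forwarded `WATCHDOG=1` messages are accepted.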
Also - I had never realized it, but we're creating CNAME records for controller ELBs, and those CNAMEs have their own TTL (300 seconds by default in kube-aws), separate from the TTL of the ELB's own A records.

Updated my assumption on the issue in my first comment #598 (comment)

Hmm, the TTL of a CNAME record associated with an ELB's DNS name seems to be capped at 60s, even though kube-aws' default is longer (300 seconds).
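To check what TTL is actually being served for the endpoint, one quick way (assuming `API_ENDPOINT_DNS_NAME` is set as in the script above) is:

```bash
# The second column of each answer line is the remaining TTL in seconds.
dig +noall +answer "${API_ENDPOINT_DNS_NAME:?set me first}"
# Illustrative output only:
#   kube-api.example.com.                       60  IN  CNAME  my-elb-123456.us-east-1.elb.amazonaws.com.
#   my-elb-123456.us-east-1.elb.amazonaws.com.  60  IN  A      52.0.0.10
```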
Anyway, forcibly restarting kubelet so that it reconnects to apiserver(s) when necessary (i.e. when ELB IPs have changed) would still be important.

@mumoshu does it mean that going ELB-less mode, with just a CNAME record in Route53 pointing to controller nodes, is considered dangerous now?

@redbaron Do you mean a single CNAME record for your k8s API endpoint which points to just one controller node's DNS name? I've been assuming that, if you'd go without an ELB, you would have a DNS name associated with one or more A records (rather than a CNAME), each associated with one controller node's public/private IP.

Doesn't change the fact that the final set of A records returned from a DNS request changes when you trigger a controllers ASG update, right? From what I read here, that matches the case when ELBs change their IPs.

@redbaron Ah, so you have a CNAME record which points to a DNS record containing multiple A records, each associated with one of the controller nodes, right? If so, no, as long as you have low TTLs for your DNS records. AFAIK, someone in the upstream issue said that it becomes an issue only when an ELB is involved. Perhaps the ELB doesn't immediately shut down an unnecessary instance and doesn't send FIN? If that's the case, kubelet would be unable to detect broken connections immediately. When your controllers ASG is updated, old, unnecessary nodes would be terminated before they become non-functional, unlike ELB's instances. So your ELB-less mode with CNAME+A records would be safe as long as you have health checks to update the Route 53 record set so that it eventually returns A records only for healthy controller nodes, and you have a low enough TTL.
The ELB-less mode I refer to is a recent feature in kube-aws. I didn't check which records it creates exactly; I just wanted to verify that it is still a safe option considering this bug report.

@redbaron If you're referring to the DNS round-robin for API endpoints, it isn't implemented yet.

I wonder if using an ALB can help here. An L7 load balancer precisely knows all incoming/outgoing requests and can forcibly yet safely close HTTP connections when the ELB scales/changes IP addresses.

@redbaron Thanks, I never realized that ALBs may help. I wish they can help, too.

Today this happened to me for the first time, with a 128-day-old cluster.

Seems like Amazon is doing some serious maintenance work in the ELB infrastructure these days...
@danielfm @camilb @redbaron For me it seems like there are still no obvious fixes other than:
Would you like to proceed with any of them, or any other idea(s)?
I think sudden termination, an unresponsive apiserver, etc. shouldn't cause problems as long as these events are not happening at the same time, but in this case even the ELB setup fails. Keeping the route53 record up-to-date is a bit more complicated without using Lambda and SNS, though. Also, I think it shouldn't cause problems if the route53 record is not fully up-to-date when restarting or replacing controller nodes; as long as they are not replaced at the same time, there should be at least one working controller.
Thanks @tarvip!
Sorry if I wasn't clear enough, but the above concerns are not specific to this problem but more general. I just didn't want to introduce a new issue due to missing health checks / constant updates to route 53 record sets.

Probably that's true - then, my question is whether everyone is ok with e.g. a 50% k8s API error rate persisting for several minutes while one of two controller nodes is being replaced?

If we go without CloudWatch Events + Lambda, we probably need a systemd timer which periodically triggers a script to update the route53 record set, so that it eventually reflects controller nodes being terminated either expectedly or unexpectedly, right?
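For illustration, a rough sketch of what such a periodically-triggered script might look like. The controller tag, hosted zone ID, record name, and TTL are all assumptions, not anything kube-aws currently does:

```bash
#!/usr/bin/env bash
# Hypothetical sketch: refresh a Route 53 round-robin A record with the private IPs
# of currently-running controller instances.
set -euo pipefail

HOSTED_ZONE_ID=${HOSTED_ZONE_ID:?}
RECORD_NAME=${RECORD_NAME:?e.g. kube-api.internal.example.com}

# Discover controller instances via an assumed EC2 tag.
ips=$(aws ec2 describe-instances \
  --filters "Name=tag:kube-aws:role,Values=controller" "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].PrivateIpAddress' --output text)
[ -n "$ips" ] || { echo "no running controllers found; leaving the record untouched"; exit 0; }

# Build an UPSERT change batch containing one A value per controller.
records=$(for ip in $ips; do printf '{"Value":"%s"},' "$ip"; done)
change_batch=$(cat <<EOF
{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{
  "Name":"${RECORD_NAME}","Type":"A","TTL":30,
  "ResourceRecords":[${records%,}]}}]}
EOF
)

aws route53 change-resource-record-sets \
  --hosted-zone-id "$HOSTED_ZONE_ID" --change-batch "$change_batch"
```

Run from a systemd timer (or cron) every minute or so, this would converge the record towards the current set of controllers without Lambda/SNS, at the cost of the record being stale for up to one interval plus the TTL.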
@mumoshu, is ALB known not to help here?

I think the missing health check is ok - that health check is just a TCP check anyway, and kubelet and kube-proxy can also detect connection failures and recreate connections to another host.

Thanks @redbaron - No, but it isn't known to help either.

Yes, that is one way to solve it.
@tarvip Thanks for the reply!
Ah, makes sense to me! At least it seems worth trying now - thanks for the clarification.
AWS's NLB replacement for ELBs uses one IP per zone/subnet, and those IPs can be your EIPs. So using this new product you can get an LB with a set of fixed IPs that won't change. http://docs.aws.amazon.com/elasticloadbalancing/latest/network/introduction.html The new ALB (for HTTPS) and NLB (for TCP) seem to be AWS's next-gen replacements for the old ELB, which AWS now calls 'Classic Load Balancers'. k8s and kube-aws should probably look to transition to the new products, which also appear to have some advantages, such as fixed IPs - as I see #937 and #945 are doing! 🎉
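For reference, a minimal sketch of creating such an NLB with pre-allocated EIPs via the AWS CLI; the name, subnet IDs, and allocation IDs are placeholders:

```bash
# One EIP per AZ/subnet; the NLB's addresses then stay fixed even if the LB scales.
aws elbv2 create-load-balancer \
  --name kube-apiserver-nlb \
  --type network \
  --scheme internet-facing \
  --subnet-mappings \
    SubnetId=subnet-0aaaaaaa,AllocationId=eipalloc-01111111 \
    SubnetId=subnet-0bbbbbbb,AllocationId=eipalloc-02222222
```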
@whereisaaron Thanks for the suggestion! I agree with your point. Anyway, please let me also add that an ALB was experimented with in #608 and deemed not appropriate as a K8S API load balancer.

Unfortunately NLBs don't support VPC peering on AWS, so some users (including me) will need to use Classic ELBs in conjunction with NLBs to support peered VPCs.
Yes, we see this today in production, and experienced player impact yesterday from this exact issue. We confirmed that the DNS was updated almost immediately. We're going with the kubelet restart tied to DNS change for the time being, but IMHO this is not a good long-term fix.

Seen this today. Our setup uses Consul DNS for kubelet to discover the apiserver, which means the apiserver DNS name resolves to multiple A records pointing to the exact IP addresses of the apiservers, which change every time an apiserver node is replaced. In our case the workers come back eventually, but it took a long while. My feeling is that kubelet is not really respecting DNS TTLs, as all Consul DNS names have their TTL set to 0. Can anyone confirm?

Thanks everyone.

I was under the impression that since some k8s version kubelet has implemented a client-side timeout to mitigate this issue, but I can't remember the exact GitHub issue right now.
I noticed that after the master DNS record changed the underlying IP, all kubelet instances failed for exactly 15 minutes (our master DNS TTL is 0). When it fails, we get the following error.

It recovered on its own, without restarting, after 15 minutes (sharp). It feels more like kubelet (or the apiserver client it uses) is caching the DNS. I'm trying to pin-point the exact line of code which causes this behaviour, but anyone who knows the code-base better might be able to confirm this?

Seeing the following messages right before the worker came back. The last failure was at 13:15:18, then it reported some watch error (10.106.102.105 was the previous master, which got destroyed) and re-resolved the DNS name before the cluster reported the worker as "Ready" again! Maybe this is related to a kubelet watch on the apiserver not being dropped quickly enough when the apiserver endpoint becomes unavailable?
Found a possible line of code which explains the 15min behaviour
It seems like there is a problem. If the controller DNS entry has a 30 second TTL, the kubelet should be able to recover from an IP change within 30s + update period, so about 40s. @javefang you think the kubelet is using this long, up to 15 minute back-off when the old IP goes stale? And so it's not a DNS caching problem, but rather it just stops trying to update the controller for several minutes? For AWS at least, an NLB using fixed EIP addresses would mostly obviate the IP address ever changing, I think? Even if you recreate or move the LB, you can reapply the EIP so nothing changes. However, an extra wrinkle is that we would want worker nodes in multi-AZ clusters to use the EIP of the NLB endpoint in the same AZ. NLBs have one EIP per AZ as I understand it? We saw a similar issue a couple of times where the workers couldn't contact the controllers for ~2 minutes (no IP address change involved). Even though that's well less than the 5-minute eviction time, everything got evicted anyway. Maybe the same back-off issue?
@whereisaaron yep this is indeed taking 15min for kubelet to recover. I have reproduced it with the following setup:
To reproduce:
I'm just curious about the mechanism in kubelet that can cause kubelet to be broken for 15 minutes after any apiserver IP change. We are deploying this on-premise. Tomorrow I'll try to put the 3 apiservers behind a load balancer with a fixed IP to see if that fixes the issue.

UPDATE: putting all apiservers behind a load balancer (round-robin) with a static IP fixed it. Now all workers work fine even if I replace one of the master nodes. So using a fixed-IP load balancer will be my workaround for now. But do you think it's still worth investigating why kubelet doesn't respect the apiserver's DNS TTL?
I believe the 15-minute outage window many of us are experiencing is described in kubernetes/kubernetes#41916 (comment). Reading through issues and pull requests, I don't see where a TCP timeout was implemented on the underlying connection. The timeout on the HTTP request definitely was implemented.
…-connections Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

track/close kubelet->API connections on heartbeat failure

xref kubernetes#48638
xref kubernetes-retired/kube-aws#598

We're already typically tracking kubelet -> API connections and have the ability to force close them as part of client cert rotation. If we do that tracking unconditionally, we gain the ability to also force close connections on heartbeat failure as well. It's a big hammer (means reestablishing pod watches, etc), but so is having all your pods evicted because you didn't heartbeat.

This intentionally does minimal refactoring/extraction of the cert connection tracking transport in case we want to backport this.

* The first commit unconditionally sets up the connection-tracking dialer, and moves all the cert management logic inside an if-block that gets skipped if no certificate manager is provided (view with whitespace ignored to see what actually changed).
* The second commit plumbs the connection-closing function to the heartbeat loop and calls it on repeated failures.

Follow-ups:
* consider backporting this to 1.10, 1.9, 1.8
* refactor the connection managing dialer to not be so tightly bound to the client certificate management

/sig node
/sig api-machinery

```release-note
kubelet: fix hangs in updating Node status after network interruptions/changes between the kubelet and API server
```
All these work around the real problem: that the connections are kept forever.

They don't live forever, they live for the operating system TCP timeout limit (typically 15 minutes by default).
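For reference, on Linux that OS-level limit is governed by `net.ipv4.tcp_retries2` (default 15 retransmissions, which works out to roughly 13-15 minutes before a connection with unacknowledged data is aborted). Lowering it node-wide is a blunt but commonly used mitigation, independent of any kubelet-side fix; a sketch, with the persistence path being an assumption about your distro:

```bash
# Inspect the current value (the Linux default is 15).
sysctl net.ipv4.tcp_retries2

# Lower it so dead kubelet->apiserver connections are torn down within tens of
# seconds instead of ~15 minutes. This applies system-wide, so weigh it against
# tolerance for flaky networks on all other TCP traffic from the node.
echo 'net.ipv4.tcp_retries2 = 5' | sudo tee /etc/sysctl.d/90-tcp-retries2.conf
sudo sysctl --system
```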
I haven't seen this happening anymore in some of the latest versions of Kubernetes 1.8.x (and I suspect the same is true for newer versions as well), so maybe we can close this?

Yes, and these 15 minutes are too long for many cases, like here.

The fix merged into the last several releases of Kubernetes was to drop/re-establish the apiserver connections from the kubelet if the heartbeat times out twice in a row. Reconnecting every 10 minutes or every hour would still let nodes go unavailable.
From what I've seen in other Go projects: if you use connection pooling and send requests frequently enough that the keepalive idle timeout is never reached, you run into this issue. If you disable pooling and make only one request per connection, you don't have this issue, but you get higher latency and overhead, which is why keepalive makes sense. By the way, the old Apache httpd had not only a keepalive idle timeout but also a keepalive max request count, which helped a lot with many of these problems.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.

Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I ran into kubernetes/kubernetes#41916 twice in the last 3 days in my production cluster, with almost 50% of worker nodes transitioning to NotReady state almost simultaneously on both days, causing a brief downtime in critical services due to Kubernetes' default (and aggressive) eviction policy for failing nodes.

I just contacted AWS support to validate the hypothesis of the ELB changing IPs at the time of both incidents, and the answer was yes.
My configuration (multi-node control plane with ELB) matches exactly the one in that issue, and probably most kube-aws users are subject to this.
Has anyone else run into this at some point?