Rework loadbalancer server selection logic #11329

Open
brandond wants to merge 11 commits into master from loadbalancer-rewrite

Conversation

@brandond (Member) commented Nov 15, 2024

Proposed Changes

This PR does the following:

  • Groups related functions into separate files.
  • Simplifies the LoadBalancer public functions so their naming is more consistent.
  • Moves the load-balancer server list into a new type, does away with a number of redundant shared-state variables (localServerURL, defaultServerAddress, randomServers, currentServerAddress, nextServerIndex), and reworks the dialer and health-checking logic to use the sorted list maintained by that type.
  • Adds a fallback check to retrieve apiserver endpoints from the server address, in case of a total apiserver outage.
  • Fixes issues with tests that made it difficult to diagnose problems while testing the above changes.

This should be easier to test and maintain, and provide more consistent behavior:

  1. All connections are sent to the same server (the active server), as long as it passes health checks and can be dialed.
  2. If a server fails health checks or cannot be dialed, we want to pick a new active server, and close other connections to the failed node so that they reconnect to the new server.
  3. The new active server should be picked from the list of servers in order of preference (a Go sketch of this ordering follows the diagrams below):
    1. Servers that recently recovered from a failed state, as they were likely just taken down for patching and are less likely to go down again soon
    2. Servers that are passing health checks
    3. The default server (if it would not otherwise be present in the server list)
    4. Servers that were failed, but have since passed a single health check or been dialed successfully
    5. Servers that failed their most recent health check or dial
Server state preference order:

```mermaid
flowchart LR
Active --> Preferred --> Healthy --> Unchecked --> Standby --> Recovering --> Failed
```
Possible state changes:

```mermaid
stateDiagram-v2
[*] --> Unchecked
[*] --> Standby
Unchecked --> Recovering
Unchecked --> Failed
Unchecked --> Invalid
Failed --> Recovering
Failed --> Invalid
Recovering --> Preferred
Recovering --> Active
Recovering --> Failed
Recovering --> Invalid
Standby --> Unchecked
Standby --> Invalid
Healthy --> Failed
Healthy --> Active
Healthy --> Invalid
Preferred --> Failed
Preferred --> Healthy
Preferred --> Active
Preferred --> Invalid
Active --> Failed
Active --> Invalid
Invalid --> [*]
```
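To make the preference ordering concrete, here is a minimal, self-contained Go sketch of picking a new active server from a state-ranked list. The `state` values and the `nextActive` helper are hypothetical names invented for illustration; they do not correspond to the actual types or functions in pkg/agent/loadbalancer, which keeps the sorted server list inside the new server-list type described above.

```go
package main

import (
	"fmt"
	"sort"
)

// state mirrors the states in the diagrams above; the real constants in
// pkg/agent/loadbalancer may be named and ordered differently.
type state int

const (
	stateActive     state = iota // currently receiving all connections
	statePreferred               // recently recovered from failure; best replacement candidate
	stateHealthy                 // passing health checks
	stateUnchecked               // newly added, not yet health checked
	stateStandby                 // default server kept out of the normal rotation
	stateRecovering              // was failed, has since passed a single check or dial
	stateFailed                  // failed its most recent health check or dial
	stateInvalid                 // removed from the configured server list
)

type server struct {
	address string
	state   state
}

// nextActive returns the most preferred usable server: invalid servers are
// excluded, and the rest are ranked by the preference order above
// (lower state value = more preferred).
func nextActive(servers []server) (server, bool) {
	candidates := make([]server, 0, len(servers))
	for _, s := range servers {
		if s.state != stateInvalid {
			candidates = append(candidates, s)
		}
	}
	if len(candidates) == 0 {
		return server{}, false
	}
	sort.SliceStable(candidates, func(i, j int) bool {
		return candidates[i].state < candidates[j].state
	})
	return candidates[0], true
}

func main() {
	servers := []server{
		{address: "172.17.0.7:6443", state: stateFailed},
		{address: "172.17.0.8:6443", state: stateHealthy},
		{address: "172.17.0.4:6443", state: stateStandby},
	}
	if s, ok := nextActive(servers); ok {
		fmt.Printf("new active server: %s\n", s.address) // prints 172.17.0.8:6443
	}
}
```

When the chosen server changes, the behavior described above also closes the remaining connections to the failed server so that they reconnect to the new active server.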
Logging

Health-check state transitions are also logged at INFO level, so they are visible even when debug logging is not enabled:

Nov 17 20:45:32 systemd-node-2 systemd[1]: Starting Lightweight Kubernetes...

Nov 17 20:45:33 systemd-node-2 k3s[362]: time="2024-11-17T20:45:33Z" level=info msg="Updated load balancer k3s-agent-load-balancer default server: 172.17.0.4:6443"
Nov 17 20:45:33 systemd-node-2 k3s[362]: time="2024-11-17T20:45:33Z" level=info msg="Adding server to load balancer k3s-agent-load-balancer: 172.17.0.7:6443"
Nov 17 20:45:33 systemd-node-2 k3s[362]: time="2024-11-17T20:45:33Z" level=info msg="Updated load balancer k3s-agent-load-balancer server addresses -> [172.17.0.7:6443] [default: 172.17.0.4:6443]"
Nov 17 20:45:33 systemd-node-2 k3s[362]: time="2024-11-17T20:45:33Z" level=info msg="Running load balancer k3s-agent-load-balancer 127.0.0.1:6444 -> [172.17.0.7:6443] [default: 172.17.0.4:6443]"
Nov 17 20:45:33 systemd-node-2 k3s[362]: time="2024-11-17T20:45:33Z" level=info msg="Server 172.17.0.7:6443@PREFERRED->ACTIVE from successful dial"

Nov 17 20:45:47 systemd-node-2 k3s[362]: time="2024-11-17T20:45:47Z" level=info msg="Adding server to load balancer k3s-agent-load-balancer: 172.17.0.8:6443"
Nov 17 20:45:47 systemd-node-2 k3s[362]: time="2024-11-17T20:45:47Z" level=info msg="Updated load balancer k3s-agent-load-balancer server addresses -> [172.17.0.7:6443 172.17.0.8:6443] [default: 172.17.0.4:6443]"

Nov 17 20:45:49 systemd-node-2 k3s[362]: time="2024-11-17T20:45:49Z" level=info msg="Connecting to proxy" url="wss://172.17.0.7:6443/v1-k3s/connect"
Nov 17 20:45:49 systemd-node-2 k3s[362]: time="2024-11-17T20:45:49Z" level=info msg="Connecting to proxy" url="wss://172.17.0.8:6443/v1-k3s/connect"
Nov 17 20:45:49 systemd-node-2 k3s[362]: time="2024-11-17T20:45:49Z" level=info msg="Remotedialer connected to proxy" url="wss://172.17.0.7:6443/v1-k3s/connect"
Nov 17 20:45:49 systemd-node-2 k3s[362]: time="2024-11-17T20:45:49Z" level=info msg="Remotedialer connected to proxy" url="wss://172.17.0.8:6443/v1-k3s/connect"
Nov 17 20:46:48 systemd-node-2 k3s[362]: time="2024-11-17T20:46:48Z" level=info msg="Server 172.17.0.8:6443@PREFERRED->HEALTHY from successful health check"

Nov 17 20:46:53 systemd-node-2 systemd[1]: Starting Lightweight Kubernetes...

Nov 17 20:47:02 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:02Z" level=info msg="Updated load balancer k3s-agent-load-balancer default server: 172.17.0.4:6443"
Nov 17 20:47:02 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:02Z" level=info msg="Adding server to load balancer k3s-agent-load-balancer: 172.17.0.8:6443"
Nov 17 20:47:02 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:02Z" level=info msg="Adding server to load balancer k3s-agent-load-balancer: 172.17.0.7:6443"
Nov 17 20:47:02 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:02Z" level=info msg="Updated load balancer k3s-agent-load-balancer server addresses -> [172.17.0.8:6443 172.17.0.7:6443] [default: 172.17.0.4:6443]"
Nov 17 20:47:02 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:02Z" level=info msg="Running load balancer k3s-agent-load-balancer 127.0.0.1:6444 -> [172.17.0.8:6443 172.17.0.7:6443] [default: 172.17.0.4:6443]"
Nov 17 20:47:02 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:02Z" level=info msg="Server 172.17.0.8:6443@PREFERRED->ACTIVE from successful dial"

Nov 17 20:47:09 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:09Z" level=info msg="Connecting to proxy" url="wss://172.17.0.7:6443/v1-k3s/connect"
Nov 17 20:47:09 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:09Z" level=info msg="Connecting to proxy" url="wss://172.17.0.8:6443/v1-k3s/connect"
Nov 17 20:47:09 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:09Z" level=error msg="Remotedialer proxy error; reconnecting..." error="dial tcp 172.17.0.7:6443: connect: connection refused" url="wss://172.17.0.7:6443/v1-k3s/connect"
Nov 17 20:47:09 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:09Z" level=info msg="Remotedialer connected to proxy" url="wss://172.17.0.8:6443/v1-k3s/connect"
Nov 17 20:47:09 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:09Z" level=info msg="Server 172.17.0.7:6443@PREFERRED->FAILED from failed health check"

Nov 17 20:47:10 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:10Z" level=info msg="Connecting to proxy" url="wss://172.17.0.7:6443/v1-k3s/connect"
Nov 17 20:47:10 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:10Z" level=error msg="Remotedialer proxy error; reconnecting..." error="dial tcp 172.17.0.7:6443: connect: connection refused" url="wss://172.17.0.7:6443/v1-k3s/connect"

Nov 17 20:47:11 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:11Z" level=info msg="Connecting to proxy" url="wss://172.17.0.7:6443/v1-k3s/connect"
Nov 17 20:47:11 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:11Z" level=error msg="Remotedialer proxy error; reconnecting..." error="dial tcp 172.17.0.7:6443: connect: connection refused" url="wss://172.17.0.7:6443/v1-k3s/connect"

Nov 17 20:47:12 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:12Z" level=info msg="Connecting to proxy" url="wss://172.17.0.7:6443/v1-k3s/connect"
Nov 17 20:47:12 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:12Z" level=error msg="Remotedialer proxy error; reconnecting..." error="dial tcp 172.17.0.7:6443: connect: connection refused" url="wss://172.17.0.7:6443/v1-k3s/connect"
Nov 17 20:47:12 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:12Z" level=info msg="Removing server from load balancer k3s-agent-load-balancer: 172.17.0.7:6443"
Nov 17 20:47:12 systemd-node-2 k3s[1190]: time="2024-11-17T20:47:12Z" level=info msg="Updated load balancer k3s-agent-load-balancer server addresses -> [172.17.0.8:6443] [default: 172.17.0.4:6443]"

Nov 17 20:49:50 systemd-node-2 k3s[1190]: time="2024-11-17T20:49:50Z" level=info msg="Adding server to load balancer k3s-agent-load-balancer: 172.17.0.7:6443"
Nov 17 20:49:50 systemd-node-2 k3s[1190]: time="2024-11-17T20:49:50Z" level=info msg="Updated load balancer k3s-agent-load-balancer server addresses -> [172.17.0.8:6443 172.17.0.7:6443] [default: 172.17.0.4:6443]"
Nov 17 20:49:50 systemd-node-2 k3s[1190]: time="2024-11-17T20:49:50Z" level=info msg="Connecting to proxy" url="wss://172.17.0.7:6443/v1-k3s/connect"
Nov 17 20:49:50 systemd-node-2 k3s[1190]: time="2024-11-17T20:49:50Z" level=info msg="Remotedialer connected to proxy" url="wss://172.17.0.7:6443/v1-k3s/connect"

Nov 17 20:50:50 systemd-node-2 k3s[1190]: time="2024-11-17T20:50:50Z" level=info msg="Server 172.17.0.7:6443@PREFERRED->HEALTHY from successful health check"

Types of Changes

tech debt

Verification

Testing

Yes

Linked Issues

User-Facing Change

Further Comments

@brandond brandond force-pushed the loadbalancer-rewrite branch 3 times, most recently from 416f641 to f38a6cf on November 16, 2024 04:21
@brandond brandond marked this pull request as ready for review on November 16, 2024 04:29
@brandond brandond requested a review from a team as a code owner on November 16, 2024 04:29

codecov bot commented Nov 16, 2024

Codecov Report

Attention: Patch coverage is 78.70036% with 118 lines in your changes missing coverage. Please review.

Project coverage is 42.59%. Comparing base (cd4dded) to head (fbdbcc5).
Report is 3 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/agent/loadbalancer/servers.go | 83.02% | 41 Missing and 5 partials ⚠️ |
| pkg/etcd/etcdproxy.go | 0.00% | 19 Missing ⚠️ |
| pkg/agent/loadbalancer/loadbalancer.go | 69.04% | 8 Missing and 5 partials ⚠️ |
| pkg/agent/tunnel/tunnel.go | 83.33% | 10 Missing and 3 partials ⚠️ |
| pkg/server/router.go | 85.18% | 5 Missing and 3 partials ⚠️ |
| pkg/agent/loadbalancer/httpproxy.go | 80.00% | 5 Missing and 2 partials ⚠️ |
| pkg/agent/config/config.go | 63.63% | 3 Missing and 1 partial ⚠️ |
| pkg/util/client.go | 66.66% | 2 Missing and 1 partial ⚠️ |
| pkg/agent/proxy/apiproxy.go | 66.66% | 1 Missing ⚠️ |
| pkg/cli/token/token.go | 0.00% | 1 Missing ⚠️ |

... and 3 more
Additional details and impacted files
```
@@            Coverage Diff             @@
##           master   #11329      +/-   ##
==========================================
- Coverage   47.18%   42.59%   -4.59%
==========================================
  Files         179      180       +1
  Lines       18600    18768     +168
==========================================
- Hits         8776     7995     -781
- Misses       8469     9573    +1104
+ Partials     1355     1200     -155
```
| Flag | Coverage Δ |
|---|---|
| e2etests | 34.81% <62.81%> (-7.40%) ⬇️ |
| inttests | 18.69% <18.95%> (-16.00%) ⬇️ |
| unittests | 14.18% <50.36%> (+0.35%) ⬆️ |

Flags with carried forward coverage won't be shown.


@brandond brandond force-pushed the loadbalancer-rewrite branch 3 times, most recently from a149b59 to 5443d8e on November 16, 2024 10:21
@brandond brandond changed the title from "[WIP] Rework loadbalancer server selection logic" to "Rework loadbalancer server selection logic" on Nov 16, 2024
@brandond brandond force-pushed the loadbalancer-rewrite branch 13 times, most recently from 168f212 to a50f5d9 on November 19, 2024 11:24
Review thread on pkg/agent/loadbalancer/servers.go (outdated, resolved)
Review thread on pkg/agent/loadbalancer/loadbalancer.go (resolved)
@liyimeng (Contributor)

@brandond Thanks! With the old implementation, we recently observed a panic when the apiserver was offline. I hope this sorts out all the issues.

@liyimeng (Contributor)

I even suspect that this is the cause of #11346.

@brandond (Member, Author)

@liyimeng, what you're describing sounds like #10317, which was fixed a while ago.

@brandond (Member, Author) commented Nov 22, 2024

@manuelbuil:
> Looks good! One extra theoretical question: is it possible for the health check to succeed but the dial to fail?

Yes, on RKE2, where the etcd and apiserver pods may not be up yet when the service has just started.

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
…rivate

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
@brandond brandond force-pushed the loadbalancer-rewrite branch 2 times, most recently from 034ac01 to bca06c4 on November 22, 2024 10:14
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
@liyimeng (Contributor)

> @liyimeng what you're describing sounds like #10317, which was fixed a while ago.

Ha, I did not notice that. Yes, that is the fix I need.

@brandond brandond force-pushed the loadbalancer-rewrite branch 3 times, most recently from 9e7c1bf to 3f388a5 on November 22, 2024 21:43
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
…watch fails

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
The error message should be printf style, not just concatenated. The
current message is garbled if the command or result contains things that
look like formatting directives:

`Internal error occurred: error sending request: Post "https://10.10.10.102:10250/exec/default/volume-test/volume-test?command=sh&command=-c&command=echo+local-path-test+%!!(MISSING)E(MISSING)+%!!(MISSING)F(MISSING)data%!!(MISSING)F(MISSING)test&error=1&output=1": proxy error from 127.0.0.1:6443 while dialing 10.10.10.102:10250, code 502: 502 Bad Gateway`

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
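As an aside, the garbling described in this commit message is the usual symptom of passing a dynamically built string through a printf-style function as its format string. A minimal standard-library sketch, with an invented URL and message text rather than the actual k3s code:

```go
package main

import (
	"errors"
	"fmt"
)

func main() {
	// Hypothetical values; note the percent-encoded URL.
	dest := "https://10.10.10.102:10250/exec/volume-test?command=echo+%2Fdata%2Ftest"
	inner := errors.New("502 Bad Gateway")

	// Bad: the concatenated message is used as a format string, so each "%2F"
	// is treated as a formatting directive and renders as "%!F(MISSING)".
	msg := "error sending request: Post \"" + dest + "\": " + inner.Error()
	fmt.Println(fmt.Errorf(msg))

	// Good: keep the format string constant and pass the dynamic parts as
	// arguments (or use %w to wrap), so "%" characters stay literal.
	fmt.Println(fmt.Errorf("error sending request: Post %q: %w", dest, inner))
}
```

The first Println shows %!F(MISSING)-style noise wherever the URL contains percent-encoding; the second keeps the % characters intact.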
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
@manuelbuil (Contributor) left a comment

LGTM! Two ideas for future work:

1 - Add a command to the k3s agent that outputs the current status of all servers. We can get this information by querying the logs, but such a command would make things easier.

2 - Extra testing: some kind of chaos-monkey-style tests covering a few combinations (etcd only, kube-apiserver only, both at once, etc.).

@brandond (Member, Author) commented Nov 27, 2024

@manuelbuil, that first item sounds like a good thing to add to metrics. I was planning on working on metrics over hack week before this came up; maybe I'll try to get back to that in December.

For the second item, I would love it if we had something like that to run in our nightlies, not just for this feature but to test cluster resilience in general.

@liyimeng (comment marked as off-topic)

@brandond (Member, Author) commented Nov 30, 2024

@liyimeng:
> In my setup, I tell the agent that the server address is https://kubernetes.svc, and I resolve kubernetes.svc to the service IP (10.43.0.1) by modifying my /etc/hosts file.

Don't do that. On a fresh startup, that address will be unreachable until kube-proxy starts and adds the iptables rules that handle traffic for the Kubernetes service. But kube-proxy won't be able to start, because the agent can't contact the apiserver at that address until kube-proxy is running. So you'll be stuck.

Also, this is all off topic. Please open a new issue for whatever you have going on here.

@liyimeng (comment marked as off-topic)
