Rework loadbalancer server selection logic #11329
base: master
Conversation
Codecov Report

@@ Coverage Diff @@
## master #11329 +/- ##
==========================================
- Coverage 47.18% 42.59% -4.59%
==========================================
Files 179 180 +1
Lines 18600 18768 +168
==========================================
- Hits 8776 7995 -781
- Misses 8469 9573 +1104
+ Partials 1355 1200 -155
@brandond Thanks! With the old implementation, we recently observed a panic when the apiserver is offline. I hope this sorts out all the issues.
I even suspect that this is the cause of #11346
Yes, on RKE2, where the etcd and apiserver pods may not be up yet when the service has just started.
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
The error message should be built printf-style, not just concatenated. The current message gets garbled if the command or result contains anything that looks like a formatting directive: `Internal error occurred: error sending request: Post "https://10.10.10.102:10250/exec/default/volume-test/volume-test?command=sh&command=-c&command=echo+local-path-test+%!!(MISSING)E(MISSING)+%!!(MISSING)F(MISSING)data%!!(MISSING)F(MISSING)test&error=1&output=1": proxy error from 127.0.0.1:6443 while dialing 10.10.10.102:10250, code 502: 502 Bad Gateway` Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
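To illustrate the concatenation pitfall above: when a message containing URL-encoded characters like `%2F` is used as a `fmt` format string, the `%` sequences are interpreted as formatting verbs and mangled into `%!X(MISSING)` noise. A minimal sketch (the function names and URL here are made up for illustration, not k3s code):

```go
package main

import "fmt"

// concatError builds the message the fragile way: msg itself becomes the
// format string, so any '%' sequences in it are interpreted as verbs.
func concatError(msg string) string {
	return fmt.Sprintf("exec failed: " + msg)
}

// printfError passes msg as an operand to an explicit %s verb, so '%'
// sequences in it are printed verbatim.
func printfError(msg string) string {
	return fmt.Sprintf("exec failed: %s", msg)
}

func main() {
	// URL-encoded exec command containing %3E and %2F escapes.
	msg := `Post "https://10.0.0.1:10250/exec?command=echo+test+%3E+%2Fdata%2Ftest"`
	fmt.Println(concatError(msg)) // '%' escapes mangled into %!X(MISSING) noise
	fmt.Println(printfError(msg)) // message printed verbatim
}
```

`go vet` flags the first form (`printf: non-constant format string`), which is an easy way to catch these call sites across the codebase.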
LGTM! Two ideas for future work:
1 - Add a command to the k3s agent that outputs the status of all servers at that exact moment. We can get this info by querying the logs, but such a command would make things easier.
2 - Extra testing: some kind of chaos-monkey-style tests with a few combinations (etcd-only, kube-apiserver-only, a combination of both, etc.).
@manuelbuil That first item sounds like a good thing to add to metrics. I was planning on working on metrics over hack week before this came up; maybe I'll try to get back to that in December. For the second item, I would love it if we had something like that to run in our nightlies, not just for this feature but to test cluster resilience in general.
Don't do that. On a fresh startup, that address will be unreachable until after kube-proxy starts adding iptables rules to handle traffic for the Kubernetes service. But kube-proxy won't be able to start, because the agent can't contact the apiserver at that address until kube-proxy is running. So you'll be stuck. Also, this is all off-topic; please open a new issue for whatever you have going on here.
Proposed Changes
This PR does the following:
This should be easier to test and maintain, and should provide more consistent behavior:
Server state preference order:
Possible state changes:
Logging
Health check state transitions are also logged at INFO level, for better visibility when not running with debug logging enabled:
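Logging transitions at INFO level without flooding the log generally means logging only when the state actually changes. A minimal sketch of that pattern, assuming nothing about the PR's real implementation (the `State` names, `server` type, and log format below are illustrative):

```go
package main

import "fmt"

// State models a loadbalancer server's health; the names are illustrative.
type State int

const (
	Unchecked State = iota
	Failed
	Standby
	Healthy
)

func (s State) String() string {
	return [...]string{"UNCHECKED", "FAILED", "STANDBY", "HEALTHY"}[s]
}

type server struct {
	address string
	state   State
}

// setState records a transition, logging once and only when the state changes,
// so repeated health-check results for a stable server stay quiet.
func (s *server) setState(next State) {
	if s.state == next {
		return // no transition, nothing to log
	}
	fmt.Printf("INFO: server %s health changed: %s -> %s\n", s.address, s.state, next)
	s.state = next
}

func main() {
	srv := &server{address: "10.0.0.1:6443", state: Unchecked}
	srv.setState(Healthy)
	srv.setState(Healthy) // duplicate result: suppressed
	srv.setState(Failed)
}
```

The change-only check is what makes INFO-level transition logs viable in steady state: a healthy cluster produces no output, while every real transition leaves exactly one line.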
Types of Changes
tech debt
Verification
Testing
Yes
Linked Issues
User-Facing Change
Further Comments