[BUG] CHProxy removes the CH node from the pool #322
Comments
It seems weird because, on paper, there is a heartbeat that is supposed to check every node and put it back in the pool once it's healthy again.
We are currently using chproxy ver. 1.22.0, rev. 5c1e8e7, built at 2023-03-01T09:04:29Z
There is clearly an issue. I don't remember the exact frequency of the healthcheck, but it's below once per second.
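For context, the heartbeat frequency is configurable per cluster. Here is a minimal sketch of the relevant block, assuming the heartbeat keys documented for recent chproxy versions; the host names and values are placeholders, not taken from this setup:

clusters:
  - name: "cluster"
    nodes: ["host1:8123", "host2:8123"]
    # Each node is probed on this interval and should be returned to the pool
    # once it responds as healthy again.
    heartbeat:
      interval: 5s        # probe frequency
      timeout: 3s         # per-probe timeout
      request: "/ping"    # ClickHouse liveness endpoint
      response: "Ok.\n"   # expected response body for a healthy node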
@Tarkerro could you share the configuration of your setup?
We switched to v1.19.0 and now it looks like it works for us. The configuration is unchanged:
Thanks
@Tarkerro thanks for your report. We are working on a fix.
Hello @Tarkerro, we managed to reproduce the issue. Currently, we are focusing on the host priority mechanism to fix it since we assume the bug comes from there. We will release the fix asap.
FYI, the next release of chproxy (which should be released before the end of next week) will contain a fix, so that you won't have to stick with v1.19.0.
@Tarkerro the fix has been released. We found a bug where nodes were indeed removed from the node pool with the new retry mechanism (they received a very high penalty due to an integer overflow). Feel free to try the latest version and let us know if it resolves the issue for you.
I'm closing the issue. Feel free to reopen it if the problem appears in versions >= 1.24.0.
We have a chproxy instance with 6 nodes listed under nodes: in the config. For some reason, when chproxy gets a single 502 from a host, it removes that host from the pool and it never comes back, at least until chproxy is restarted.
What could be the problem? Can we somehow avoid this?
Maybe we can use some configuration setting to force the node back into the list, or to not drop it at all when it receives a 502 code?
Or maybe this is not the problem and something else needs to be checked?
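On the configuration side, two things seem relevant: the heartbeat block sketched above, which is what should bring a healthy host back into the pool, and the per-user retry_number setting from the newer retry mechanism, which should let a request that failed on one node (e.g. with a 502) be retried on another. A minimal sketch, assuming those key names from the chproxy docs; the user and host names are placeholders:

users:
  - name: "user"
    to_cluster: "cluster"
    to_user: "default"
    # Assumed key from the chproxy docs: number of times a failed request is
    # retried on another node before the error is returned to the client.
    retry_number: 2

clusters:
  - name: "cluster"
    nodes: ["host1:8123", "host2:8123", "host3:8123"]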
Here are the metrics for the removed node:
# HELP host_health Health state of hosts by clusters
# TYPE host_health gauge
host_health{cluster="cluster",cluster_node="host:8123",replica="default"} 1
# HELP host_penalties_total Total number of given penalties by host
# TYPE host_penalties_total counter
host_penalties_total{cluster="cluster",cluster_node="host:8123",replica="default"} 1
# HELP status_codes_total Distribution by status codes
# TYPE status_codes_total counter
status_codes_total{cluster="cluster",cluster_node="host:8123",cluster_user="user",code="502",replica="default",user="user"} 1