[BUG] CHProxy removes the CH node from the pool #322

Closed
Tarkerro opened this issue Mar 22, 2023 · 12 comments

@Tarkerro

Tarkerro commented Mar 22, 2023

We run chproxy with 6 nodes listed under nodes: in the config. It looks like chproxy receives a single 502 from a host, removes it from the host pool, and never puts it back, at least until CHProxy is restarted.
What could be the problem? Can we somehow avoid this?
Is there a configuration setting that forces the node back into the pool, or that keeps it from being removed at all when a 502 is received?
Or is this not the actual problem, and something else needs to be checked?

Here are the metrics for the removed node:
#HELP host_health Health state of hosts by clusters
#TYPE host_health gauge
host_health{cluster="cluster",cluster_node="host:8123",replica="default"} 1
#HELP host_penalties_total Total number of given penalties by host
#TYPE host_penalties_total counter
host_penalties_total{cluster="cluster",cluster_node="host:8123",replica="default"} 1
#HELP status_codes_total Distribution by status codes
#TYPE status_codes_total counter
status_codes_total{cluster="cluster",cluster_node="host:8123",cluster_user="user",code="502",replica="default",user="user"} 1
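
For readers who want to check these values from a script, here is a minimal sketch that parses the three samples quoted above and groups them by node. The helper names (`parse_metrics`, `node_summary`) are made up for illustration, and the parser only handles the simple `name{labels} value` shape shown in this report, not the full Prometheus text format.

```python
import re

# Samples copied from the metrics quoted above (comment lines omitted).
METRICS = '''\
host_health{cluster="cluster",cluster_node="host:8123",replica="default"} 1
host_penalties_total{cluster="cluster",cluster_node="host:8123",replica="default"} 1
status_codes_total{cluster="cluster",cluster_node="host:8123",cluster_user="user",code="502",replica="default",user="user"} 1
'''

LINE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return (name, labels, value) tuples; blank and '#' lines are skipped."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # not in the simple shape this sketch understands
        name, raw_labels, value = match.groups()
        samples.append((name, dict(LABEL_RE.findall(raw_labels)), float(value)))
    return samples


def node_summary(samples, node):
    """Collect the values reported for one cluster_node, keyed by metric name."""
    return {
        name: value
        for name, labels, value in samples
        if labels.get("cluster_node") == node
    }


summary = node_summary(parse_metrics(METRICS), "host:8123")
print(summary)
```

If `host_health` equal to 1 means "healthy" (an assumption based on the metric's help text), the combination above is notable: the node is reported as healthy and has a single penalty, yet it stops receiving traffic.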

@mga-chka
Collaborator

It seems weird, because on paper there is a heartbeat that is supposed to check every node and put it back in the pool once it is healthy again.
Could you test your issue with v1.19.0 and then v1.20.0 and tell me if you still have it? Your issue might come from the retry feature that was introduced in v1.20.0 (then fixed in v1.21.0).
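
The expected heartbeat behaviour described in this comment can be sketched roughly as follows. This is an illustrative model only, not chproxy's code: `Node`, `heartbeat_once`, and `pick_active` are hypothetical names, and the health check is a stand-in callback rather than a real HTTP probe.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Node:
    addr: str
    active: bool = True


def heartbeat_once(nodes: List[Node], check: Callable[[str], bool]) -> None:
    """Probe every node, including inactive ones, and update its flag."""
    for node in nodes:
        node.active = check(node.addr)


def pick_active(nodes: List[Node]) -> List[str]:
    """Addresses that are currently eligible to receive requests."""
    return [node.addr for node in nodes if node.active]


nodes = [Node("host1:8123"), Node("host2:8123")]
down = {"host2:8123"}  # pretend host2 is failing its check

heartbeat_once(nodes, lambda addr: addr not in down)
first = pick_active(nodes)   # host2 is out of the pool

down.clear()                 # host2 recovers
heartbeat_once(nodes, lambda addr: addr not in down)
second = pick_active(nodes)  # host2 is back after the next heartbeat

print(first, second)
```

Under this model a node that recovers returns to the pool on the next heartbeat, which is why a node that stays out until a restart points to something other than the health check itself.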

@Tarkerro
Author

Tarkerro commented Mar 28, 2023

We are currently using chproxy ver. 1.22.0, rev. 5c1e8e7, built at 2023-03-01T09:04:29Z
and still facing this issue.

@Tarkerro
Author

Here are some metrics from Grafana.

Everything was going well until we got a 502 error at 9:56
[Grafana screenshots 1-3]

After that, the node was disabled and did not receive requests for more than an hour, until CHProxy was restarted.
[Grafana screenshots 4-6]

How often do healthchecks take place? Can we increase their frequency with the settings, for example, up to 5-10 minutes? Or is that not the point?

@mga-chka
Collaborator

There is clearly an issue. I don't remember the exact frequency of the healthcheck, but it's below once per second.
Can you test your issue with v1.19.0 and then v1.20.0 and tell me if you still have it? Knowing whether we introduced a regression in those versions will help with troubleshooting.

@mga-chka mga-chka changed the title [QUESTION] CHProxy removes the CH node from the pool [BUG] CHProxy removes the CH node from the pool Mar 29, 2023
@gontarzpawel
Contributor

@Tarkerro could you share the configuration of your setup?

@sigua-cs sigua-cs self-assigned this Apr 2, 2023
@Tarkerro
Author

Tarkerro commented Apr 5, 2023

We switched to v1.19.0 and it now looks like it works for us.
We still get 502 errors, but nodes are no longer disabled permanently after them.

The configuration has not changed:
config.yml.txt

@mga-chka
Collaborator

mga-chka commented Apr 5, 2023

Thanks.
@sigua-cs, I'm just pinging you to be sure you see the message. You should look at all the modifications in v1.20.0 (it might not be the retry mechanism but another PR that introduced the bug).

@sigua-cs
Contributor

sigua-cs commented Apr 7, 2023

@Tarkerro thanks for your report. We are working on a fix.

@mga-chka mga-chka added the bug label Apr 19, 2023
@sigua-cs
Contributor

Hello @Tarkerro, we managed to reproduce the issue. We are currently focusing on the host priority mechanism, since we assume the bug comes from there. We will release the fix as soon as possible.

@mga-chka
Collaborator

FYI, the next release of chproxy (which should be released before the end of next week) will contain a fix, so you won't have to stick with v1.19.0.

@Blokje5
Collaborator

Blokje5 commented May 3, 2023

@Tarkerro the fix has been released. We found a bug where nodes were indeed removed from the node pool by the new retry mechanism (they received a very high penalty due to an integer overflow). Feel free to try the latest version and let us know if it resolves the issue for you.
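
To illustrate how an integer overflow can produce a "very high penalty", here is a hedged sketch. It is not chproxy's code: the thread does not say which operation overflowed, so the unguarded decrement below is only one assumed way this class of bug can arise. `Host`, `PENALTY_SIZE`, and `least_loaded` are hypothetical, and the 32-bit wraparound is emulated with a mask because Python integers do not overflow.

```python
UINT32_MASK = 0xFFFFFFFF  # emulate an unsigned 32-bit counter
PENALTY_SIZE = 5          # hypothetical penalty step


class Host:
    def __init__(self, name):
        self.name = name
        self.penalty = 0

    def penalize(self):
        self.penalty = (self.penalty + PENALTY_SIZE) & UINT32_MASK

    def unpenalize_unchecked(self):
        # Assumed bug: subtracting from 0 wraps around to a huge value.
        self.penalty = (self.penalty - PENALTY_SIZE) & UINT32_MASK

    def unpenalize_checked(self):
        # Guarded variant: never go below zero.
        self.penalty = max(self.penalty - PENALTY_SIZE, 0)


def least_loaded(hosts):
    """Pick the host with the smallest penalty."""
    return min(hosts, key=lambda h: h.penalty)


a, b = Host("a"), Host("b")
a.penalize()
a.unpenalize_unchecked()
a.unpenalize_unchecked()  # one decrement too many: the counter wraps

for _ in range(1000):     # b stays preferred however often it is penalized
    b.penalize()

print(a.penalty, least_loaded([a, b]).name)

c = Host("c")
c.penalize()
c.unpenalize_checked()
c.unpenalize_checked()    # the guarded variant stays at zero
```

With a selection rule that prefers the lowest penalty, a host whose counter has wrapped is effectively never chosen again, which matches the symptom reported here: a host that looks healthy but receives no requests until the process restarts and the counter is reset.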

@mga-chka
Collaborator

mga-chka commented May 6, 2023

I'm closing the issue. Feel free to reopen it if the problem appears in versions >=1.24.0.
