[BUG] CHProxy removes the CH node from the pool #322

Tarkerro · 2023-03-22T14:12:41Z

We run chproxy with 6 nodes listed under nodes: in the config. It looks like chproxy gets a single 502 from a host, removes it from the host pool, and never puts it back, at least until CHProxy is restarted.
What could be the problem? Can we somehow avoid this?
Is there a configuration setting that forces the node back into the pool, or that keeps it in the pool when a 502 is received?
Or is this not the problem, and something else needs to be checked?

Here are the metrics for the removed node:
# HELP host_health Health state of hosts by clusters
# TYPE host_health gauge
host_health{cluster="cluster",cluster_node="host:8123",replica="default"} 1
# HELP host_penalties_total Total number of given penalties by host
# TYPE host_penalties_total counter
host_penalties_total{cluster="cluster",cluster_node="host:8123",replica="default"} 1
# HELP status_codes_total Distribution by status codes
# TYPE status_codes_total counter
status_codes_total{cluster="cluster",cluster_node="host:8123",cluster_user="user",code="502",replica="default",user="user"} 1

mga-chka · 2023-03-24T10:42:56Z

It seems weird, because on paper there is a heartbeat that is supposed to check every node and put it back in the pool once it is healthy again.
Could you test with v1.19.0 and then v1.20.0 and tell me if you still have the issue? It might come from the retry feature that was introduced in v1.20.0 (then fixed in v1.21.0).
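For readers unfamiliar with the idea, here is a minimal sketch of a heartbeat that puts a node back in the pool once it answers again. It is not chproxy's implementation: the `node` and `stubBackend` types, the `check` and `heartbeat` functions, and the intervals are all hypothetical, and the backend is a local test server that imitates ClickHouse's `/ping` endpoint.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// node is a hypothetical pool member; active says whether it may receive traffic.
type node struct {
	addr   string
	active atomic.Bool
}

// check probes the node once and records the result, so a node that was
// marked inactive becomes active again as soon as a probe succeeds.
func (n *node) check(c *http.Client) {
	resp, err := c.Get(n.addr + "/ping")
	if err != nil {
		n.active.Store(false)
		return
	}
	defer resp.Body.Close()
	n.active.Store(resp.StatusCode == http.StatusOK)
}

// heartbeat runs check on every tick until stop is closed.
func heartbeat(n *node, c *http.Client, interval time.Duration, stop <-chan struct{}) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-stop:
			return
		case <-t.C:
			n.check(c)
		}
	}
}

// stubBackend is a local test server that returns 502 until healthy is set.
type stubBackend struct {
	healthy atomic.Bool
	srv     *httptest.Server
}

func newStubBackend() *stubBackend {
	s := &stubBackend{}
	s.srv = httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		if !s.healthy.Load() {
			w.WriteHeader(http.StatusBadGateway)
			return
		}
		fmt.Fprint(w, "Ok.\n")
	}))
	return s
}

func main() {
	s := newStubBackend()
	defer s.srv.Close()

	n := &node{addr: s.srv.URL}
	stop := make(chan struct{})
	go heartbeat(n, s.srv.Client(), 10*time.Millisecond, stop)

	time.Sleep(50 * time.Millisecond)
	fmt.Println("active while backend returns 502:", n.active.Load())

	s.healthy.Store(true) // backend recovers
	time.Sleep(50 * time.Millisecond)
	fmt.Println("active after backend recovers:", n.active.Load())
	close(stop)
}
```

With a loop like this, a single 502 should only sideline a node until the next successful probe, which is why a node that stays out of the pool for more than an hour points to something other than the health check itself.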

Tarkerro · 2023-03-28T11:11:12Z

We are currently using chproxy ver. 1.22.0, rev. 5c1e8e7, built at 2023-03-01T09:04:29Z
and still facing this issue.

Tarkerro · 2023-03-29T07:14:42Z

Here are some metrics from Grafana

Everything was going well until we got a 502 error at 9:56

After that, the node was disabled and did not receive requests for more than an hour, until CHProxy was restarted

How often do healthchecks take place? Can we change their frequency in the settings, for example to once every 5-10 minutes? Or is that not the point?

mga-chka · 2023-03-29T07:35:41Z

There is clearly an issue. I don't remember the exact frequency of the healthcheck, but it's below once per sec.
Can you test with v1.19.0 and then v1.20.0 and tell me if you still have the issue? Knowing whether we introduced a regression in those versions would help the troubleshooting.

gontarzpawel · 2023-03-30T19:59:45Z

@Tarkerro could you share the configuration of your setup?

Tarkerro · 2023-04-05T07:39:04Z

We switched to v1.19.0 and it now looks like it works for us.
We still get 502 errors, but nodes are no longer disabled permanently after them.

The configuration has not changed:
config.yml.txt

mga-chka · 2023-04-05T12:21:37Z

Thanks.
@sigua-cs, I'm just pinging you to be sure you see the message. You should look at all the modifications in v1.20.0 (it might not be the retry mechanism but another PR that introduced the bug).

sigua-cs · 2023-04-07T08:58:52Z

@Tarkerro thanks for your report. We are working on a fix.

sigua-cs · 2023-04-19T20:56:01Z

Hello @Tarkerro, we managed to reproduce the issue. We are currently focusing on the host priority mechanism, since we assume the bug comes from there. We will release the fix asap.

mga-chka · 2023-04-25T10:07:02Z

FYI, the next release of chproxy (which should be released before the end of next week) will contain a fix, so you won't have to stick with v1.19.0.

Blokje5 · 2023-05-03T13:46:17Z

@Tarkerro the fix has been released. We found a bug where nodes were indeed removed from the node pool by the new retry mechanism (they received a very high penalty due to an integer overflow). Feel free to try the latest version and let us know if it resolves the issue for you.
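To illustrate how an integer overflow can produce a very high penalty, here is a minimal sketch. It is not the chproxy code and makes no claim about where exactly the bug was: the `host` type, `penaltySize`, and the selection function are hypothetical. It assumes the penalty is stored as an unsigned counter and that hosts with the lowest penalty are preferred. It is also not safe for concurrent use.

```go
package main

import "fmt"

// penaltySize is an illustrative constant, not chproxy's actual value.
const penaltySize = 5

// host is a hypothetical stand-in for a backend node.
type host struct {
	name    string
	penalty uint32 // unsigned: it cannot go below zero and wraps around instead
}

// penalize raises the penalty after a failed request.
func (h *host) penalize() { h.penalty += penaltySize }

// unpenalize subtracts without a guard. If it runs more often than
// penalize, the counter wraps to a value close to the uint32 maximum.
func (h *host) unpenalize() { h.penalty -= penaltySize }

// unpenalizeSafe clamps the counter at zero instead of wrapping.
func (h *host) unpenalizeSafe() {
	if h.penalty < penaltySize {
		h.penalty = 0
		return
	}
	h.penalty -= penaltySize
}

// leastPenalized returns the host with the lowest penalty (first wins on ties).
func leastPenalized(hosts []*host) *host {
	best := hosts[0]
	for _, h := range hosts[1:] {
		if h.penalty < best.penalty {
			best = h
		}
	}
	return best
}

func main() {
	a, b := &host{name: "a"}, &host{name: "b"}

	a.penalize()
	a.unpenalize()
	a.unpenalize() // one decrement too many

	fmt.Println(a.penalty)                          // 4294967291
	fmt.Println(leastPenalized([]*host{a, b}).name) // b
}
```

Once the counter has wrapped, host `a` loses every comparison against a host with a normal penalty, so from the outside it looks as if it had been removed from the pool until the process restarts and the counter is reset.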

mga-chka · 2023-05-06T08:06:46Z

I'm closing the issue. Feel free to reopen it if the problem appears in versions >=1.24.0.

mga-chka changed the title ~~[QUESTION] CHProxy removes the CH node from the pool~~ [BUG] CHProxy removes the CH node from the pool Mar 29, 2023

mga-chka mentioned this issue Mar 29, 2023

[BUG] uneven load among clickhouse shards #325

Closed

sigua-cs self-assigned this Apr 2, 2023

mga-chka added the bug label Apr 19, 2023

This was referenced Apr 29, 2023

fix: short-term fix for issue #322 #333

Closed

fix: short-term fix for issue #322 #334

Merged

mga-chka closed this as completed May 6, 2023

nir3c mentioned this issue Aug 3, 2023

fix: uneven load among clickhouse shards caused by retry error mechanism #357

Merged
