
discuss: endpoint choose issue #101

Closed
nic-chen opened this issue Dec 3, 2020 · 14 comments · Fixed by #109
Labels: enhancement (New feature or request)

@nic-chen
Collaborator

nic-chen commented Dec 3, 2020

Background

  1. Currently, lua-resty-etcd supports cluster mode, but the implementation is too simple: each new connection simply moves to the next endpoint.

  2. This mechanism works well under normal circumstances, but once an API or an instance has a problem, the consequences are unpredictable.

Issues with the current solution

  1. In cluster mode, when an instance is down there is no way to skip it; the down instance is still polled on every rotation.

  2. When a certain API fails on all instances (such as the auth API), it triggers a retry storm that may eventually overwhelm the etcd cluster.

Suggested changes

  1. Implement a health check mechanism. No active check is required, only a passive one: a failure is recorded whenever a connection fails.

  2. There is no need to poll all instances; only switch instances when a connection fails.

  3. If an instance fails n consecutive times within a certain period, it is considered unhealthy and will not be connected for a certain period of time afterwards (both the duration and the failure count should be configurable).
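The passive check proposed above can be sketched as a small model. This is not lua-resty-etcd code; it is an illustrative Python sketch, and all names (`PassiveChecker`, `max_fails`, `fail_timeout`, `report_failure`) are hypothetical, chosen only to mirror the three points above.

```python
import time

class PassiveChecker:
    """Illustrative model of the proposed passive health check.
    max_fails = n consecutive failures to mark an endpoint unhealthy;
    fail_timeout = how long (seconds) an unhealthy endpoint is skipped.
    Both names are hypothetical, not lua-resty-etcd API."""

    def __init__(self, max_fails=3, fail_timeout=10):
        self.max_fails = max_fails
        self.fail_timeout = fail_timeout
        self.fails = {}            # endpoint -> (count, window_start_time)
        self.unhealthy_until = {}  # endpoint -> timestamp until which it is skipped

    def report_failure(self, endpoint, now=None):
        # Passive check: only record failures as they happen; no probing.
        now = time.time() if now is None else now
        count, start = self.fails.get(endpoint, (0, now))
        if now - start > self.fail_timeout:
            count, start = 0, now  # failure window expired, restart counting
        count += 1
        self.fails[endpoint] = (count, start)
        if count >= self.max_fails:
            # n failures within the window: skip this endpoint for a while
            self.unhealthy_until[endpoint] = now + self.fail_timeout
            self.fails[endpoint] = (0, now)

    def report_success(self, endpoint):
        self.fails.pop(endpoint, None)  # success resets the failure count

    def is_healthy(self, endpoint, now=None):
        now = time.time() if now is None else now
        return self.unhealthy_until.get(endpoint, 0) <= now
```

An unhealthy endpoint automatically becomes eligible again once its cooldown expires, so no active re-probing is needed.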

related issue: apache/apisix#2899

What do you think?

Thanks.

@nic-chen
Collaborator Author

nic-chen commented Dec 3, 2020

@Yiyiyimu
Contributor

Yiyiyimu commented Dec 3, 2020

Looks great to me. I just wonder how much we could benefit from it.

@membphis
Contributor

membphis commented Dec 3, 2020

related PR: #96

@membphis membphis added the enhancement New feature or request label Dec 3, 2020
@tzssangglass
Contributor

Got it, assigned to me.

@tzssangglass
Contributor

tzssangglass commented Dec 16, 2020

Hi folks,
After two attempts, and a discussion with @membphis, I found that the previous designs were flawed. So I'm now going to share some design ideas and ask for your opinions before trying again.

(flowchart image)

  1. The health check instance is global and manages an endpoints pool; each etcd instance registers its own endpoints to the pool.
  2. The health check instance tracks and updates the status of each endpoint in the endpoints pool.
  3. Each etcd instance reports endpoint failures to the health check instance; when choosing an endpoint, it checks the status of the corresponding endpoint in the endpoints pool and skips unhealthy ones.
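The three steps above can be modeled with a short sketch. This is a hypothetical Python illustration, not the library's actual design; the names `EndpointPool`, `register`, `report_failure`, and `choose_endpoint` are assumptions mirroring the description.

```python
class EndpointPool:
    """Hypothetical model of the global health check instance
    managing a shared endpoints pool."""

    def __init__(self):
        self.status = {}  # endpoint -> "healthy" | "unhealthy"

    def register(self, endpoints):
        for ep in endpoints:
            self.status.setdefault(ep, "healthy")

    def report_failure(self, endpoint):
        self.status[endpoint] = "unhealthy"

    def is_healthy(self, endpoint):
        return self.status.get(endpoint) == "healthy"

pool = EndpointPool()  # one global checker shared by all etcd clients

class EtcdClient:
    def __init__(self, endpoints):
        self.endpoints = endpoints
        self.index = 0
        pool.register(endpoints)  # step 1: register endpoints to the pool

    def choose_endpoint(self):
        # step 3: keep the current endpoint; advance only past unhealthy ones
        for _ in range(len(self.endpoints)):
            ep = self.endpoints[self.index]
            if pool.is_healthy(ep):
                return ep
            self.index = (self.index + 1) % len(self.endpoints)
        return None  # every endpoint is currently unhealthy
```

Note the client does not rotate on every call; it only moves on when the pool marks the current endpoint unhealthy, matching the "only switch on failure" suggestion.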

@tzssangglass
Contributor

@nic-chen
Collaborator Author

@tzssangglass
How do we implement the health check instance? It's not mentioned above.

@membphis
Contributor

@tzssangglass I think you can take a look at this plugin [api-breaker]: https://github.com/apache/apisix/blob/master/apisix/plugins/api-breaker.lua#L168

it should be useful for you

@tzssangglass
Contributor

> @tzssangglass I think you can take a look at this plugin [api-breaker]: https://github.com/apache/apisix/blob/master/apisix/plugins/api-breaker.lua#L168
>
> it should be useful for you

Got it, let me study it.

@tzssangglass
Contributor

health check instance

The health check instance is independent of the etcd instances; I think it can be created in the init_worker_by_lua phase.

@tzssangglass
Contributor

I am busy during this period, so this work will go slowly.

@tzssangglass
Contributor

I've updated the flowchart a bit to hopefully convey my thoughts more clearly.
(updated flowchart image)

  • Checker parameter setting: the checker parameters (fail_timeout, max_fails) are global, not per etcd client.
  • Choose endpoint: the etcd client's choose_endpoint function checks whether the selected endpoint is healthy by calling the checker's check_endpoint_status function, and skips it if it is not.
  • Endpoint status: the status of each endpoint is stored in the shared dict; the status is global, shared across workers and across etcd clients.
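The shared-dict design above can be sketched as follows. In OpenResty the counter would live in an ngx.shared.DICT with a key TTL of fail_timeout; the toy `SharedDict` class here is a stand-in so the logic can be shown outside nginx, and `report_failure`/`check_endpoint_status` are names taken from the description, not confirmed library API.

```python
import time

class SharedDict:
    """Toy stand-in for OpenResty's ngx.shared.DICT: set with a TTL, get
    returns None once the key has expired (so old failures age out)."""

    def __init__(self):
        self._data = {}  # key -> (value, expire_at or None)

    def set(self, key, value, ttl=None, now=None):
        now = time.time() if now is None else now
        self._data[key] = (value, now + ttl if ttl else None)

    def get(self, key, now=None):
        now = time.time() if now is None else now
        item = self._data.get(key)
        if item is None:
            return None
        value, expire_at = item
        if expire_at is not None and expire_at <= now:
            del self._data[key]  # key expired, failures forgotten
            return None
        return value

shared = SharedDict()  # global status, shared by workers and etcd clients
FAIL_TIMEOUT = 10      # global checker parameters, not per client
MAX_FAILS = 3

def report_failure(endpoint, now=None):
    fails = (shared.get(endpoint, now=now) or 0) + 1
    # refreshing the TTL on each failure gives a sliding failure window
    shared.set(endpoint, fails, ttl=FAIL_TIMEOUT, now=now)

def check_endpoint_status(endpoint, now=None):
    return (shared.get(endpoint, now=now) or 0) < MAX_FAILS
```

Because the counter key simply expires after fail_timeout, an unhealthy endpoint recovers automatically without any cleanup pass.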

@membphis
Contributor

@tzssangglass I think we can avoid using the init_worker_by_lua phase. A Lua top-level (module-level) variable should be enough.


The rest LGTM. ^_^

@tzssangglass
Contributor

Got it.
