
discuss: endpoint choose issue #101

Closed
nic-chen opened this issue Dec 3, 2020 · 14 comments · Fixed by #109
Labels: enhancement (New feature or request)

@nic-chen
Collaborator

nic-chen commented Dec 3, 2020

Background

  1. Currently, lua-resty-etcd supports cluster mode, but the implementation is too simple: each new connection simply moves to the next endpoint.

  2. This mechanism works well under normal circumstances, but once an API or an instance has a problem, the consequences are unpredictable.

Issues with the current solution

  1. In cluster mode, when an instance is down there is no way to skip it; the down instance is still polled on every rotation.

  2. When a certain API fails on all instances (such as the auth API), it triggers a retry storm that may eventually overwhelm the etcd cluster.

Suggested changes

  1. Implement a health check mechanism. No active check is required, only a passive one: a failure is recorded whenever a connection fails.

  2. There is no need to poll all instances; only switch instances when a connection fails.

  3. If an instance fails n consecutive times within a certain period, it is considered unhealthy and will not be connected for a certain period of time afterwards (both the duration and the failure count should be configurable).
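The passive check proposed above can be sketched as a small model. This is not lua-resty-etcd code; it is an illustrative Python sketch, and all names (`PassiveChecker`, `max_fails`, `fail_timeout`, `report_failure`) are hypothetical, chosen only to mirror the three points above.

```python
import time

class PassiveChecker:
    """Illustrative model of the proposed passive health check.
    max_fails = n consecutive failures to mark an endpoint unhealthy;
    fail_timeout = how long (seconds) an unhealthy endpoint is skipped.
    Both names are hypothetical, not lua-resty-etcd API."""

    def __init__(self, max_fails=3, fail_timeout=10):
        self.max_fails = max_fails
        self.fail_timeout = fail_timeout
        self.fails = {}            # endpoint -> (count, window_start_time)
        self.unhealthy_until = {}  # endpoint -> timestamp until which it is skipped

    def report_failure(self, endpoint, now=None):
        # Passive check: only record failures as they happen; no probing.
        now = time.time() if now is None else now
        count, start = self.fails.get(endpoint, (0, now))
        if now - start > self.fail_timeout:
            count, start = 0, now  # failure window expired, restart counting
        count += 1
        self.fails[endpoint] = (count, start)
        if count >= self.max_fails:
            # n failures within the window: skip this endpoint for a while
            self.unhealthy_until[endpoint] = now + self.fail_timeout
            self.fails[endpoint] = (0, now)

    def report_success(self, endpoint):
        self.fails.pop(endpoint, None)  # success resets the failure count

    def is_healthy(self, endpoint, now=None):
        now = time.time() if now is None else now
        return self.unhealthy_until.get(endpoint, 0) <= now
```

An unhealthy endpoint automatically becomes eligible again once its cooldown expires, so no active re-probing is needed.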

related issue: apache/apisix#2899

What do you think?

Thanks.

@nic-chen
Collaborator Author

nic-chen commented Dec 3, 2020

@Yiyiyimu
Contributor

Yiyiyimu commented Dec 3, 2020

Looks great to me. I just wonder how much we could benefit from it.

@membphis
Contributor

membphis commented Dec 3, 2020

related PR: #96

@membphis membphis added the enhancement New feature or request label Dec 3, 2020
@tzssangglass
Contributor

Got it, assigned to me.

@tzssangglass
Contributor

tzssangglass commented Dec 16, 2020

Hi folks,
After two attempts, and a discussion with @membphis, I found that the previous designs were flawed. So I'm now going to share some design ideas and ask for your opinions before trying again.

(flowchart image)

  1. The health check instance is global and manages an endpoints pool; each etcd instance registers its own endpoints to the pool.
  2. The health check instance tracks and updates the status of each endpoint in the endpoints pool.
  3. Each etcd instance reports endpoint failures to the health check instance; when choosing an endpoint, it checks the status of the corresponding endpoint in the endpoints pool and skips unhealthy ones.
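The three steps above can be modeled with a short sketch. This is a hypothetical Python illustration, not the library's actual design; the names `EndpointPool`, `register`, `report_failure`, and `choose_endpoint` are assumptions mirroring the description.

```python
class EndpointPool:
    """Hypothetical model of the global health check instance
    managing a shared endpoints pool."""

    def __init__(self):
        self.status = {}  # endpoint -> "healthy" | "unhealthy"

    def register(self, endpoints):
        for ep in endpoints:
            self.status.setdefault(ep, "healthy")

    def report_failure(self, endpoint):
        self.status[endpoint] = "unhealthy"

    def is_healthy(self, endpoint):
        return self.status.get(endpoint) == "healthy"

pool = EndpointPool()  # one global checker shared by all etcd clients

class EtcdClient:
    def __init__(self, endpoints):
        self.endpoints = endpoints
        self.index = 0
        pool.register(endpoints)  # step 1: register endpoints to the pool

    def choose_endpoint(self):
        # step 3: keep the current endpoint; advance only past unhealthy ones
        for _ in range(len(self.endpoints)):
            ep = self.endpoints[self.index]
            if pool.is_healthy(ep):
                return ep
            self.index = (self.index + 1) % len(self.endpoints)
        return None  # every endpoint is currently unhealthy
```

Note the client does not rotate on every call; it only moves on when the pool marks the current endpoint unhealthy, matching the "only switch on failure" suggestion.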

@tzssangglass
Contributor

@nic-chen
Collaborator Author

@tzssangglass
How do we implement the health check instance? It's not mentioned above.

@membphis
Contributor

@tzssangglass I think you can take a look at this plugin [api-breaker]: https://github.com/apache/apisix/blob/master/apisix/plugins/api-breaker.lua#L168

it should be useful for you

@tzssangglass
Contributor

> @tzssangglass I think you can take a look at this plugin [api-breaker]: https://github.com/apache/apisix/blob/master/apisix/plugins/api-breaker.lua#L168
>
> it should be useful for you

Got it, let me study it.

@tzssangglass
Contributor

health check instance

The health check instance is independent of the etcd instances; I think it can be created in the init_worker_by_lua phase.

@tzssangglass
Contributor

I am busy during this period, so this work will go slowly.

@tzssangglass
Contributor

I've updated the flowchart a bit to hopefully convey my thoughts more clearly.
(updated flowchart image)

  • Checker parameter setting: the checker parameters (fail_timeout, max_fails) are global, not per etcd client.
  • Choose endpoint: the etcd client's choose_endpoint function checks whether the selected endpoint is healthy by calling the checker's check_endpoint_status function, and skips it if it is not.
  • Endpoint status: the status of each endpoint is stored in the shared dict; the status is global, shared across workers and across etcd clients.
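The shared-dict design above can be sketched as follows. In OpenResty the counter would live in an ngx.shared.DICT with a key TTL of fail_timeout; the toy `SharedDict` class here is a stand-in so the logic can be shown outside nginx, and `report_failure`/`check_endpoint_status` are names taken from the description, not confirmed library API.

```python
import time

class SharedDict:
    """Toy stand-in for OpenResty's ngx.shared.DICT: set with a TTL, get
    returns None once the key has expired (so old failures age out)."""

    def __init__(self):
        self._data = {}  # key -> (value, expire_at or None)

    def set(self, key, value, ttl=None, now=None):
        now = time.time() if now is None else now
        self._data[key] = (value, now + ttl if ttl else None)

    def get(self, key, now=None):
        now = time.time() if now is None else now
        item = self._data.get(key)
        if item is None:
            return None
        value, expire_at = item
        if expire_at is not None and expire_at <= now:
            del self._data[key]  # key expired, failures forgotten
            return None
        return value

shared = SharedDict()  # global status, shared by workers and etcd clients
FAIL_TIMEOUT = 10      # global checker parameters, not per client
MAX_FAILS = 3

def report_failure(endpoint, now=None):
    fails = (shared.get(endpoint, now=now) or 0) + 1
    # refreshing the TTL on each failure gives a sliding failure window
    shared.set(endpoint, fails, ttl=FAIL_TIMEOUT, now=now)

def check_endpoint_status(endpoint, now=None):
    return (shared.get(endpoint, now=now) or 0) < MAX_FAILS
```

Because the counter key simply expires after fail_timeout, an unhealthy endpoint recovers automatically without any cleanup pass.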

@membphis
Contributor

@tzssangglass I think we can avoid using the init_worker_by_lua phase. A Lua top-level (module-level) variable should be enough.


The rest LGTM. ^_^

@tzssangglass
Contributor

Got it.
