feat: endpoint choose by health check #102

Closed
wants to merge 18 commits into from


tzssangglass
Contributor

@tzssangglass tzssangglass commented Dec 6, 2020

fix #101
fix #55

@tzssangglass
Contributor Author

@membphis @nic-chen please take a look when you have time.

@moonming
Contributor

moonming commented Dec 8, 2020

Ping @nic-chen @membphis

@nic-chen
Collaborator

nic-chen commented Dec 8, 2020

@tzssangglass CI failed, and you need to resolve the conflicts first. Thanks.

@tzssangglass
Contributor Author

> @tzssangglass CI failed. and you need to resolve conflicts first. thanks.

This PR is still a work in progress, so its title cannot pass the semantic check yet.

I'd like to involve you early to check whether this implementation is on the right track.

@tzssangglass
Contributor Author

I see the conflicting files and will resolve the conflicts.

Contributor

@membphis membphis left a comment

Nice PR, the current approach is simpler than before ^_^

We need more test cases to confirm that it works correctly.

@tzssangglass
Contributor Author

Note: the chosen endpoint is now saved into `self`. The changes are significant, so please review them carefully.

@tzssangglass
Contributor Author

> note: save choosed endpoint into self, the changes are significant, please pay attention

I have reverted this because the implementations conflicted. Please ignore the note above.

unhealthy endpoint trigger by different etcd client configurations
@tzssangglass tzssangglass changed the title [WIP] feature: endpoint choose by health check feature: endpoint choose by health check Dec 13, 2020
@tzssangglass
Contributor Author

Do we need to support the v2 protocol?

@tzssangglass tzssangglass changed the title feature: endpoint choose by health check feat: endpoint choose by health check Dec 13, 2020
@membphis
Contributor

> need to support the V2 protocol?

We can create a new issue for this feature.

@tzssangglass
Contributor Author

note:

  • The failure count is shared within a worker and stored in a lua_shared_dict under a key tagged with the worker id. Marking and restoring an endpoint depends on each client's own "max_fails" and "fail_timeout" (see test case no. 8).

  • Why use lua_shared_dict for storage? Because the init and init_ttl parameters of its incr function are well suited to counting the errors that occur within a given time window on a continuous timeline.
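
The windowed counting described in the note can be sketched roughly as follows. This is only an illustration, not the code in this PR: `new_fake_dict` is a hypothetical pure-Lua stand-in for a `lua_shared_dict` so the sketch runs outside OpenResty, the `report_fault` signature and the key name are assumptions, and only the `incr(key, value, init, init_ttl)` behaviour relevant here is modelled.

```lua
-- Hypothetical stand-in for ngx.shared.DICT, with an injectable clock.
-- Only incr(key, value, init, init_ttl) and get(key) are modelled.
local function new_fake_dict(now)
    local store = {}
    local dict = {}

    -- return the entry for key, dropping it first if its TTL has passed
    local function live(key)
        local item = store[key]
        if item and item.expires and now() >= item.expires then
            store[key] = nil
            return nil
        end
        return item
    end

    function dict:incr(key, value, init, init_ttl)
        local item = live(key)
        if not item then
            if init == nil then
                return nil, "not found"
            end
            -- the TTL is set only when the key is first initialised,
            -- so the window starts at the first failure
            item = { value = init }
            if init_ttl and init_ttl > 0 then
                item.expires = now() + init_ttl
            end
            store[key] = item
        end
        item.value = item.value + value
        return item.value
    end

    function dict:get(key)
        local item = live(key)
        return item and item.value
    end

    return dict
end

-- Assumed helper: count one failure and report whether the endpoint has
-- reached max_fails inside the current fail_timeout window.
local function report_fault(dict, key, max_fails, fail_timeout)
    local fails, err = dict:incr(key, 1, 0, fail_timeout)
    if not fails then
        return nil, err
    end
    return fails >= max_fails
end

local clock = 0
local dict = new_fake_dict(function() return clock end)
local key = "etcd-fails-127.0.0.1:2379"   -- illustrative key only

local first = report_fault(dict, key, 2, 10)    -- 1 failure at t=0
clock = 5
local second = report_fault(dict, key, 2, 10)   -- 2 failures within 10s
clock = 11                                      -- window from t=0 expired
local third = report_fault(dict, key, 2, 10)    -- counter starts over
```

With `max_fails = 2` and `fail_timeout = 10`, the second failure inside the window crosses the threshold, while a failure after the window has expired starts a fresh count.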

@@ -29,8 +32,57 @@ local mt = { __index = _M }

-- define local refresh function variable
local refresh_jwt_token
local fails
Contributor

Why use a global variable here?

Contributor Author

Oh, I made a mistake: I thought this was a module variable. I want to define a worker-level variable. Is lua-resty-lrucache the only way to do that?

Contributor

Yes, it is a module variable. But `fails` is used like a local variable: each time report_fault runs, a new value is assigned to it.

end
end
utils.log_error("has no health etcd endpoint")
return nil
Contributor

Since we don't check whether choose_endpoint returns nil, this change will cause an error to be thrown. Using thrown errors as control flow is bad practice. Although APISIX catches the error, other users may not.

We should return nil, err here and check the result outside of choose_endpoint.
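
A minimal sketch of the suggested `nil, err` convention, under stated assumptions: this `choose_endpoint` is a simplified hypothetical version that takes a plain list of endpoint tables, `health_status == 1` means healthy as in the snippet under review, and `build_url` is an invented caller used only to show where the check goes.

```lua
-- Simplified, hypothetical choose_endpoint: returns the first healthy
-- endpoint, or nil plus an error message when none is healthy.
local function choose_endpoint(endpoints)
    for _, endpoint in ipairs(endpoints) do
        if endpoint.health_status == 1 then
            return endpoint
        end
    end
    return nil, "has no healthy etcd endpoint"
end

-- Invented caller: it checks the result instead of indexing a nil
-- endpoint, and passes the error up to its own caller.
local function build_url(endpoints, path)
    local endpoint, err = choose_endpoint(endpoints)
    if not endpoint then
        return nil, err
    end
    return endpoint.http_host .. path
end

local down = { { http_host = "http://127.0.0.1:2379", health_status = 0 } }
local up   = { { http_host = "http://127.0.0.1:2379", health_status = 1 } }

local bad_url, bad_err = build_url(down, "/v3/kv/range")
local good_url = build_url(up, "/v3/kv/range")
```

Callers that do not wrap requests in `pcall` then receive an ordinary error value rather than a runtime error from indexing `nil`.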


- `shm_name`: the name of the declared `lua_shared_dict` used to store the health status of endpoints.
- `fail_timeout`: sets the time during which a number of failed attempts must happen for the endpoint to be marked unavailable, and also the time for which the endpoint is marked unavailable (default is 10 seconds).
- `max_fails`: sets the number of failed attempts that must occur during the `fail_timeout` period for the endpoint to be marked unavailable (default is 1 attempt).
Contributor

It would be better to document that failures are counted per worker and per endpoint. The counter is independent between different etcd clients.


local function report_fault(self, endpoint)
utils.log_info("report an endpoint failure: ", endpoint.http_host)
local key = worker_id() .. "-" .. endpoint.http_host
Contributor

Better to add an obvious prefix to the key.
BTW, when the code is running in the privileged agent, worker_id() will be nil.
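
One possible shape for such a key, as a sketch only: the helper name `fault_key`, the prefix string, and the `"privileged"` placeholder are all assumptions made for illustration, not names from this PR. The worker id is passed in as an argument so the function can be exercised outside OpenResty.

```lua
-- Hypothetical key builder. worker_id may be nil (for example in the
-- privileged agent), and concatenating nil raises an error in Lua, so
-- it is replaced with a placeholder before building the key.
local function fault_key(worker_id, http_host)
    local id = worker_id ~= nil and tostring(worker_id) or "privileged"
    return "etcd_endpoint_fails:" .. id .. ":" .. http_host
end

local key_worker = fault_key(0, "http://127.0.0.1:2379")
local key_agent  = fault_key(nil, "http://127.0.0.1:2379")
```

The explicit prefix keeps these entries distinguishable from anything else stored in the same shared dict.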

@membphis
Contributor

> stored in a lua_shared_dict tagged with the worker id

@tzssangglass I think we can store the status without the worker id.
Then the status can be shared between the different worker processes.

end

utils.log_info("restore an endpoint to health: ", endpoint.http_host)
endpoint.health_status = 1
Contributor

Better to log at warn level when we change health_status. It could be helpful when we need to do accounting.

@spacewander
Contributor

> stored in a lua_shared_dict tagged with the worker id @tzssangglass
>
> I think we can store the status without worker id.
> Then the status can be shared between the different worker processes.

There is a side effect when we share the counter: the actual number of tries each worker gets is divided by the number of workers. As the number of workers increases, the retry chance decreases.

}
```

when use `require "resty.etcd" .new` to create a connection, you can override the default configuration like
Contributor

when using require("resty.etcd").new

@tzssangglass
Contributor Author

tzssangglass commented Dec 30, 2020

I will continue this work on a new branch.

@tzssangglass tzssangglass deleted the IssueNo101 branch January 30, 2021 09:33
Successfully merging this pull request may close these issues.

discuss: endpoint choose issue
feat: support healthcheck when connect to etcd cluster
6 participants