Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: enable etcd health-check #4191

Merged
merged 27 commits into from
Jun 30, 2021
Merged

Conversation

Yiyiyimu
Copy link
Member

@Yiyiyimu Yiyiyimu commented May 6, 2021

What this PR does / why we need it:

fix #3673
fix #3937.

Import health check of lua-resty-etcd, so

  1. when one etcd node disconnected, apisix would try to connect another one and perform as normal, while health-check would continue print warning log about disconnected etcd node
  2. when all etcd nodes disconnected, apisix would try to re-connect with etcd, with a cyclic backoff interval for 2s, 2s, 4s, ..., 1024s, 2s, 4s ... so the error log would not been flushed when etcd are all down.

TODO:

  • for etcd node deployed in kubernetes, since nodes are behind the same domain, we could not know if one of them is taint, as talked in [discuss]: enable etcd health check #3692
  • add test, but since the current health check does not work for kubernetes, need to implement a new CI pipeline for multiple etcd nodes

Pre-submission checklist:

  • Did you explain what problem does this PR solve? Or what new features have been added?
  • Have you added corresponding test cases?
  • Have you modified the corresponding document?
  • Is this PR backward compatible? If it is not backward compatible, please discuss on the mailing list first

…e apisix fail

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
…rks in stream mode

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
@Yiyiyimu Yiyiyimu changed the title fix: enable etcd health-check, so one etcd node failure would not make apisix fail fix: enable etcd health-check May 10, 2021
@Yiyiyimu Yiyiyimu marked this pull request as ready for review May 10, 2021 02:04
@juzhiyuan
Copy link
Member

image

We need another fix hh

…not been blocked

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
@Yiyiyimu
Copy link
Member Author

Yiyiyimu commented May 10, 2021

Things get a bit weird right now 😕

Before I work on this fix, I test that if killing one node out of an etcd cluster, apisix would fail, as those related issues talk about. However, right now when I test this scenario again on the master branch, apisix would not be affected by the closed etcd node and runs normally. I checked the recent commits and it seems no changes have applied to this problem.

Could someone help me to reproduce the program again


Test running apisix and etcd in Kubernetes, delete one etcd node, and apisix runs normally with master branch. So don't know if anything went wrong yet.

@Yiyiyimu
Copy link
Member Author

Waiting for lua-resty-etcd new release: api7/lua-resty-etcd#129

@Yiyiyimu Yiyiyimu requested a review from spacewander May 10, 2021 21:34
apisix/core/config_etcd.lua Show resolved Hide resolved
apisix/core/config_etcd.lua Outdated Show resolved Hide resolved
apisix/core/config_etcd.lua Outdated Show resolved Hide resolved
apisix/core/config_etcd.lua Outdated Show resolved Hide resolved
Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
…st etcd endpoint

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
apisix/core/config_etcd.lua Outdated Show resolved Hide resolved
apisix/core/config_etcd.lua Outdated Show resolved Hide resolved
apisix/core/config_etcd.lua Show resolved Hide resolved
Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
@Yiyiyimu
Copy link
Member Author

Yiyiyimu commented Jun 21, 2021

Waiting for api7/lua-resty-etcd#131


Done

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
@spacewander spacewander changed the title fix: enable etcd health-check feat: enable etcd health-check Jun 29, 2021
apisix/core/config_etcd.lua Outdated Show resolved Hide resolved
Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
if string.find(err, err_etcd_unhealthy_all) then
local reconnected = false
while err and not reconnected do
local backoff_duration, backoff_factor, backoff_step = 1, 2, 10
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will step 10 too large? As 1024 seconds is too long.

Copy link
Member Author

@Yiyiyimu Yiyiyimu Jun 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not so familiar with the production environment. Do you have some recommendations, like 8, around 4 mins at most? @tokers

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe 6 is enough, also, since we are in a timer, keep a Nginx timer living for a long while is not a good practice as it might cause the memory leaky.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for suggestion!
Changed to use outside counter till 32 to avoid keeping nginx timer too long. PTAL @tokers

Signed-off-by: yiyiyimu <wosoyoung@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants