[discuss]: when a node in the etcd cluster fails, no error log is output #3937

Closed
Firstsawyou opened this issue Mar 29, 2021 · 15 comments · Fixed by #4191
Labels: checking (check first if this issue occurred), discuss

Comments

@Firstsawyou
Contributor

Issue description

In an etcd cluster (3 nodes), when one of the nodes fails, the following error message is printed in error.log:

[screenshot: error.log entries reporting that the connection to the failed etcd node was refused]

But this does not affect the normal operation of APISIX. Such error logs can mislead people into thinking that etcd is unavailable. Can we avoid outputting error logs when a single node in the etcd cluster fails?

@spacewander
Member

The error log is correct. If it reports that the connection is refused, that means APISIX failed to get any data from etcd at that moment.

@moonming
Member

@Yiyiyimu @spacewander if we can get data from another etcd node, do we need to print ERR-level logs?

@spacewander
Member

spacewander commented Mar 30, 2021

If it reports an error log at 18:26, it didn't get data from another etcd node at that time.
@Firstsawyou
Does the error log happen forever or just for a moment?

@aiyiyi121

If it reports an error log at 18:26, it didn't get data from another etcd node at that time.
@Firstsawyou
Does the error log happen forever or just for a moment?

The error log happens forever when a node in the etcd cluster (3 nodes) fails. At that time, APISIX can get data from the other etcd nodes and works correctly.

spacewander added the checking (check first if this issue occurred) label on Mar 30, 2021
@spacewander
Member

If it reports an error log at 18:26, it didn't get data from another etcd node at that time.
@Firstsawyou
Does the error log happen forever or just for a moment?

The error log happens forever when a node in the etcd cluster (3 nodes) fails. At that time, APISIX can get data from the other etcd nodes and works correctly.

Interesting.

@Firstsawyou
Could you investigate this issue?

@Firstsawyou
Contributor Author

@Firstsawyou
Could you investigate this issue?

Ok, let me investigate.

@aiyiyi121

@spacewander @Firstsawyou
When I use ./apisix start to start APISIX, I found it will init etcd. I read the "etcd init" code and found it checks the status of all etcd nodes. So when a node in the etcd cluster (3 nodes) fails, the etcd init fails too, and APISIX can't start.
I think that when a node in the etcd cluster (3 nodes) fails, the etcd init should succeed, because it does not affect the normal operation of APISIX, and APISIX should start normally.
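
For illustration, a minimal sketch (in Python, not APISIX's actual Lua init code) of the "check every node" behavior described above, assuming etcd's HTTP /health endpoint and hypothetical endpoint addresses:

```python
# Hypothetical sketch of an init-time check that requires every configured
# etcd endpoint to be healthy, aborting startup if any single node is down.
import requests

ENDPOINTS = [  # placeholder addresses for a 3-node cluster
    "http://127.0.0.1:2379",
    "http://127.0.0.1:2381",
    "http://127.0.0.1:2383",
]

def node_healthy(endpoint):
    try:
        resp = requests.get(f"{endpoint}/health", timeout=2)
        return resp.json().get("health") == "true"
    except requests.RequestException:
        return False

if not all(node_healthy(ep) for ep in ENDPOINTS):
    raise SystemExit("etcd init failed: at least one configured endpoint is unreachable")
```

With this all-or-nothing rule, losing one node out of three blocks startup even though the cluster itself still has quorum.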

@spacewander
Member

@spacewander @Firstsawyou
When I use ./apisix start to start APISIX, I found it will init etcd. I read the "etcd init" code and found it checks the status of all etcd nodes. So when a node in the etcd cluster (3 nodes) fails, the etcd init fails too, and APISIX can't start.
I think that when a node in the etcd cluster (3 nodes) fails, the etcd init should succeed, because it does not affect the normal operation of APISIX, and APISIX should start normally.

It does. Why do you want to configure a bad node inside APISIX? Starting APISIX in an unhealthy situation is not a good idea. Consider that one of your failed nodes has a wrong auth configuration, which can't be detected if we just skip it.

@aiyiyi121

@spacewander @Firstsawyou
When I use ./apisix start to start APISIX, I found it will init etcd. I read the "etcd init" code and found it checks the status of all etcd nodes. So when a node in the etcd cluster (3 nodes) fails, the etcd init fails too, and APISIX can't start.
I think that when a node in the etcd cluster (3 nodes) fails, the etcd init should succeed, because it does not affect the normal operation of APISIX, and APISIX should start normally.

It does. Why do you want to configure a bad node inside APISIX? Starting APISIX in an unhealthy situation is not a good idea. Consider that one of your failed nodes has a wrong auth configuration, which can't be detected if we just skip it.

Because we want to be able to start and use APISIX normally in a production environment, even in extreme cases such as when a node in the etcd cluster (3 nodes) fails.

@tokers
Contributor

tokers commented Mar 31, 2021

Maybe the etcd init operation can be changed to: as long as the majority of instances in the etcd cluster are healthy, we can start APISIX. Or just check one instance, since if the cluster is unavailable, a "no leader" error will be thrown.
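
For illustration, a hedged sketch (hypothetical, not APISIX code) of the quorum-based variant suggested here:

```python
# Hypothetical "majority is enough" rule: start as long as more than half
# of the configured etcd endpoints answer /health successfully.
import requests

def node_healthy(endpoint):
    try:
        resp = requests.get(f"{endpoint}/health", timeout=2)
        return resp.json().get("health") == "true"
    except requests.RequestException:
        return False

def majority_healthy(endpoints):
    healthy = sum(1 for ep in endpoints if node_healthy(ep))
    return healthy * 2 > len(endpoints)
```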

@spacewander
Member

spacewander commented Mar 31, 2021

The etcd init doesn't just do the node check.

As its name indicates, it does the init job. We need this operation to ensure the data in etcd is initialized correctly to avoid unexpected responses.

If we skip this for some nodes, there is no way to ensure they are correctly initialized.

As for "starting and using APISIX normally in a production environment even in extreme cases such as when a node in the etcd cluster (3 nodes) fails", this problem is mostly an etcd HA problem.
IMHO, we should solve it in the etcd cluster rather than in every client.

You can use 3 virtual hosts for etcd and ensure they map to healthy nodes. If that is not enough, you can introduce a retry when starting APISIX.
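
For illustration, a minimal sketch of the "retry when starting APISIX" idea; the retry count and interval are arbitrary assumptions, and `apisix start` is the CLI command mentioned earlier in this thread:

```python
# Hypothetical wrapper around `apisix start`: retry a few times so a
# transient etcd hiccup during init does not abort startup for good.
import subprocess
import time

for attempt in range(3):
    if subprocess.run(["apisix", "start"]).returncode == 0:
        break
    time.sleep(5)
else:
    raise SystemExit("apisix start failed after 3 attempts")
```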

@tokers
Contributor

tokers commented Mar 31, 2021

If we skip this for some nodes, there is no way to ensure they are correctly initialized.

Why? etcd is self-replicating.

@spacewander
Member

If we skip this for some nodes, there is no way to ensure they are correctly initialized.

Why? etcd is self-replicating.

People might configure a wrong node. Don't be surprised, it has happened before.

@tokers
Contributor

tokers commented Mar 31, 2021

If we skip this for some nodes, there is no way to ensure they are correctly initialized.

Why? etcd is self-replicating.

People might configure a wrong node. Don't be surprised, it has happened before.

OK, got it ...

@aiyiyi121

If we skip this for some nodes, there is no way to ensure they are correctly initialized.

Why? etcd is self-replicating.

People might configure a wrong node. Don't be surprised, it has happened before.

Thanks, got it. We should pay more attention to the HA of etcd.
