[discuss]: when a node in the etcd cluster fails, no error log is output #3937

Closed
Firstsawyou opened this issue Mar 29, 2021 · 15 comments · Fixed by #4191
Labels: checking (check first if this issue occurred), discuss

Comments

@Firstsawyou
Contributor

Issue description

In an etcd cluster (3 nodes), when one of the nodes fails, the following error message is printed in error.log:

[screenshot: error.log entries reporting that the connection to the failed etcd node was refused]

But this does not affect the normal operation of APISIX. Such error logs can mislead people into thinking that etcd is unavailable. Can we avoid outputting error logs when a single node in the etcd cluster fails?

@spacewander
Member

The error log is correct. If it reports that the connection is refused, that means APISIX failed to get any data from etcd at that moment.

@moonming
Member

@Yiyiyimu @spacewander if we can get data from another etcd node, do we need to print ERR-level logs?

@spacewander
Member

spacewander commented Mar 30, 2021

If it reports an error log at 18:26, it didn't get data from another etcd node at that time.
@Firstsawyou
Does the error log happen forever or just for a moment?

@aiyiyi121

If it reports an error log at 18:26, it didn't get data from another etcd node at that time.
@Firstsawyou
Does the error log happen forever or just for a moment?

The error log happens forever when a node in the etcd cluster (3 nodes) fails. At that time, APISIX can get data from the other etcd nodes and works correctly.

spacewander added the checking (check first if this issue occurred) label on Mar 30, 2021
@spacewander
Member

If it reports an error log at 18:26, it didn't get data from another etcd node at that time.
@Firstsawyou
Does the error log happen forever or just for a moment?

The error log happens forever when a node in the etcd cluster (3 nodes) fails. At that time, APISIX can get data from the other etcd nodes and works correctly.

Interesting.

@Firstsawyou
Could you investigate this issue?

@Firstsawyou
Contributor Author

@Firstsawyou
Could you investigate this issue?

Ok, let me investigate.

@aiyiyi121

@spacewander @Firstsawyou
When I use ./apisix start to start APISIX, I found it will init etcd. I read the "etcd init" code and found it checks the status of all etcd nodes. So when a node in the etcd cluster (3 nodes) fails, the etcd init fails too, and APISIX can't start.
I think that when a node in the etcd cluster (3 nodes) fails, the etcd init should succeed, because it does not affect the normal operation of APISIX, and APISIX should start normally.
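
For illustration, a minimal sketch (in Python, not APISIX's actual Lua init code) of the "check every node" behavior described above, assuming etcd's HTTP /health endpoint and hypothetical endpoint addresses:

```python
# Hypothetical sketch of an init-time check that requires every configured
# etcd endpoint to be healthy, aborting startup if any single node is down.
import requests

ENDPOINTS = [  # placeholder addresses for a 3-node cluster
    "http://127.0.0.1:2379",
    "http://127.0.0.1:2381",
    "http://127.0.0.1:2383",
]

def node_healthy(endpoint):
    try:
        resp = requests.get(f"{endpoint}/health", timeout=2)
        return resp.json().get("health") == "true"
    except requests.RequestException:
        return False

if not all(node_healthy(ep) for ep in ENDPOINTS):
    raise SystemExit("etcd init failed: at least one configured endpoint is unreachable")
```

With this all-or-nothing rule, losing one node out of three blocks startup even though the cluster itself still has quorum.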

@spacewander
Member

@spacewander @Firstsawyou
When I use ./apisix start to start APISIX, I found it will init etcd. I read the "etcd init" code and found it checks the status of all etcd nodes. So when a node in the etcd cluster (3 nodes) fails, the etcd init fails too, and APISIX can't start.
I think that when a node in the etcd cluster (3 nodes) fails, the etcd init should succeed, because it does not affect the normal operation of APISIX, and APISIX should start normally.

It does. Why do you want to configure a bad node inside APISIX? Starting APISIX in an unhealthy situation is not a good idea. Consider that one of your failed nodes has a wrong auth configuration, which can't be detected if we just skip it.

@aiyiyi121

@spacewander @Firstsawyou
When I use ./apisix start to start APISIX, I found it will init etcd. I read the "etcd init" code and found it checks the status of all etcd nodes. So when a node in the etcd cluster (3 nodes) fails, the etcd init fails too, and APISIX can't start.
I think that when a node in the etcd cluster (3 nodes) fails, the etcd init should succeed, because it does not affect the normal operation of APISIX, and APISIX should start normally.

It does. Why do you want to configure a bad node inside APISIX? Starting APISIX in an unhealthy situation is not a good idea. Consider that one of your failed nodes has a wrong auth configuration, which can't be detected if we just skip it.

Because we want to be able to start and use APISIX normally in a production environment, even in extreme cases such as when a node in the etcd cluster (3 nodes) fails.

@tokers
Contributor

tokers commented Mar 31, 2021

Maybe the etcd init operation can be changed to: as long as the majority of instances in the etcd cluster are healthy, we can start APISIX. Or just check one instance, since if the cluster is unavailable, a "no leader" error will be thrown.
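
For illustration, a hedged sketch (hypothetical, not APISIX code) of the quorum-based variant suggested here:

```python
# Hypothetical "majority is enough" rule: start as long as more than half
# of the configured etcd endpoints answer /health successfully.
import requests

def node_healthy(endpoint):
    try:
        resp = requests.get(f"{endpoint}/health", timeout=2)
        return resp.json().get("health") == "true"
    except requests.RequestException:
        return False

def majority_healthy(endpoints):
    healthy = sum(1 for ep in endpoints if node_healthy(ep))
    return healthy * 2 > len(endpoints)
```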

@spacewander
Member

spacewander commented Mar 31, 2021

The etcd init doesn't just do the node check.

As its name indicates, it does the init job. We need this operation to ensure the data in etcd is initialized correctly to avoid unexpected responses.

If we skip this for some nodes, there is no way to ensure they are correctly initialized.

As for "starting and using APISIX normally in a production environment even in extreme cases such as when a node in the etcd cluster (3 nodes) fails", this problem is mostly an etcd HA problem.
IMHO, we should solve it in the etcd cluster rather than in every client.

You can use 3 virtual hosts for etcd and ensure they map to healthy nodes. If that is not enough, you can introduce a retry when starting APISIX.
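
For illustration, a minimal sketch of the "retry when starting APISIX" idea; the retry count and interval are arbitrary assumptions, and `apisix start` is the CLI command mentioned earlier in this thread:

```python
# Hypothetical wrapper around `apisix start`: retry a few times so a
# transient etcd hiccup during init does not abort startup for good.
import subprocess
import time

for attempt in range(3):
    if subprocess.run(["apisix", "start"]).returncode == 0:
        break
    time.sleep(5)
else:
    raise SystemExit("apisix start failed after 3 attempts")
```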

@tokers
Contributor

tokers commented Mar 31, 2021

If we skip this for some nodes, there is no way to ensure they are correctly initialized.

Why? etcd is self-replicating.

@spacewander
Member

If we skip this for some nodes, there is no way to ensure they are correctly initialized.

Why? etcd is self-replicating.

People might configure a wrong node. Don't be surprised, it has happened before.

@tokers
Contributor

tokers commented Mar 31, 2021

If we skip this for some nodes, there is no way to ensure they are correctly initialized.

Why? etcd is self-replicating.

People might configure a wrong node. Don't be surprised, it has happened before.

OK, got it ...

@aiyiyi121

If we skip this for some nodes, there is no way to ensure they are correctly initialized.

Why? etcd is self-replicating.

People might configure a wrong node. Don't be surprised, it has happened before.

Thanks, got it. We should pay more attention to the HA of etcd.
