Some antrea-agent Pods always log errors about "join cluster failed" #5966
Comments
I'll take this issue to work on, @tnqn. I'll get back to you after taking a look at the agent code. |
The error is logged at this line, antrea/pkg/agent/memberlist/cluster.go Line 224 in 18bf0d8
and the err object is from the line, antrea/pkg/agent/memberlist/cluster.go Line 222 in 18bf0d8
Where |
We join Nodes in two cases: 1) when a new Node is created; 2) when the periodical job For 1), it could fail "expectedly" in two cases: when Antrea is deployed for the first time, agents start at almost the same time, so a faster agent will fail to join a slower agent; when a new Node is actually created, its agent only becomes ready after a few seconds, so trying to join it immediately will likely fail. I feel we should change it to INFO to indicate the attempt, its result, and that it will be retried. Besides, it would be better to consolidate the multiple errors returned by |
This looks to be a fun issue. I am applying to Antrea for the current LFx internship. I am in the process of familiarizing myself with the Antrea codebase, so I'll take a stab at it once I have set up Antrea in a cluster to reproduce the problem. |
Hey there, I would like to give it a try. Are you still working on it @roopeshsn ? |
Yes, I am working on it. |
Hi! Let me know if this message makes sense:
|
Yes, it would be better to append "will retry later" in the message. |
Describe the bug
When Antrea is newly deployed to a multi-node cluster, there would always be the following errors in some antrea-agent Pods:
It's actually harmless: it happens because some agents have not started yet when an agent tries to establish a cluster with the others. It recovers once all agents are up.
However, such errors could confuse users when they run into other issues and look into the logs. I have received a few issue reports asking what the error means.
We should avoid logging such errors when nothing is wrong, to prevent unnecessary attention and confusion.
To Reproduce
Deploy Antrea v1.15.0 to a multi-node cluster, then check the logs of all antrea-agent Pods. There is a good chance that at least one of the Pods has the errors.