Some antrea-agent Pods always log errors about "join cluster failed" #5966

Closed
tnqn opened this issue Feb 6, 2024 · 8 comments · Fixed by #6048

@tnqn
Member

tnqn commented Feb 6, 2024

Describe the bug

When Antrea is newly deployed to a multi-node cluster, some antrea-agent Pods always log the following error:

E0206 04:40:59.164477       1 cluster.go:224] "Processing Node CREATE event error, join cluster failed" err=<
        1 error occurred:
                * Failed to join 172.18.0.4:10351: dial tcp 172.18.0.4:10351: connect: connection refused

 > member="172.18.0.4"

The error is actually harmless. It occurs because some agents have not started yet when an agent tries to form a cluster with the others, and it goes away once all agents are up.
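The cause described above can be reproduced without Antrea. The following is a minimal sketch using only the Go standard library, not Antrea's code: `closedLoopbackAddr` and `dialPeer` are hypothetical helper names, and a loopback port with no listener stands in for a peer whose agent has not started yet.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// closedLoopbackAddr returns a loopback address on which nothing listens.
// It binds an ephemeral port and releases it right away, standing in for
// a peer Node whose agent has not started yet (an assumption for this demo).
func closedLoopbackAddr() (string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	addr := ln.Addr().String()
	if err := ln.Close(); err != nil {
		return "", err
	}
	return addr, nil
}

// dialPeer tries to open a TCP connection to addr, roughly what a join
// attempt has to do first. It returns the dial error, if any.
func dialPeer(addr string) error {
	conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
	if err != nil {
		return err
	}
	return conn.Close()
}

func main() {
	addr, err := closedLoopbackAddr()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	if err := dialPeer(addr); err != nil {
		// On Linux this is typically a "connection refused" error.
		fmt.Println("join attempt failed:", err)
		return
	}
	fmt.Println("peer is reachable")
}
```

Once something listens on the address, the same `dialPeer` call succeeds, which matches the observation that the error disappears after all agents are up.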

However, such errors can confuse users who run into other issues and look into the logs. I have received a few issue reports asking what the error means.

We should not log such errors when nothing is actually wrong, to avoid unnecessary attention and confusion.

To Reproduce

Deploy Antrea v1.15.0 to a multi-node cluster, then check the logs of all antrea-agent Pods. There is a good chance that at least one of the Pods has the errors.

@tnqn added the "good first issue" and "kind/bug" labels on Feb 6, 2024
@roopeshsn
Contributor

roopeshsn commented Feb 6, 2024

I'll take this issue, @tnqn. I'll get back to you after taking a look at the agent code.

@roopeshsn
Contributor

The error is logged at this line,

klog.ErrorS(err, "Processing Node CREATE event error, join cluster failed", "member", member)

and the err object is from the line,

_, err := c.mList.Join([]string{member})

Here the Join() method is from the hashicorp memberlist package. How can we tell that nothing has actually gone wrong, so that we can avoid logging? Or should I remove that log statement? @tnqn

@tnqn
Member Author

tnqn commented Feb 7, 2024

> Here the Join() method is from the hashicorp memberlist package. How can we tell that nothing has actually gone wrong, so that we can avoid logging? Or should I remove that log statement? @tnqn

We join Nodes in two cases: 1) when a new Node is created; 2) when the periodic job RejoinNodes runs.

For 1), the join can fail "expectedly" in two situations. When Antrea is deployed for the first time, agents start at almost the same time, so a faster agent will fail to join a slower one. When a new Node is actually created, its agent only becomes ready after a few seconds, so trying to join it immediately will likely fail. I feel we should change the log to INFO, indicating the attempt, its result, and that it will be retried.
For 2), we should keep the failure as an Error.

Besides, it would be better to consolidate the multiple errors returned by Join into a single-line string to keep log lines uniform.
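The two suggestions above can be sketched with the standard library alone. This is an illustration under assumptions, not the change that was merged: `joinSource`, `failureLevel`, `singleLine`, and `sampleJoinError` are hypothetical names, the sample error text is copied from the log in the issue description, and collapsing whitespace is only one simple way to get a single-line string.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// joinSource is a hypothetical label for what triggered a join attempt.
type joinSource int

const (
	nodeCreate     joinSource = iota // a Node CREATE event
	periodicRejoin                   // the periodic RejoinNodes job
)

// failureLevel returns the log level proposed in the discussion:
// INFO for failures right after a Node is created (expected, retried later),
// ERROR for failures seen by the periodic rejoin job.
func failureLevel(src joinSource) string {
	if src == nodeCreate {
		return "INFO"
	}
	return "ERROR"
}

// singleLine collapses every run of whitespace in the error message,
// including newlines and tabs, into one space so the log stays on one line.
func singleLine(err error) string {
	if err == nil {
		return ""
	}
	return strings.Join(strings.Fields(err.Error()), " ")
}

// sampleJoinError mimics the multi-line error text shown in the issue.
func sampleJoinError() error {
	return errors.New("1 error occurred:\n\t* Failed to join 172.18.0.4:10351: " +
		"dial tcp 172.18.0.4:10351: connect: connection refused\n\n")
}

func main() {
	err := sampleJoinError()
	fmt.Printf("%s join cluster failed: %s\n", failureLevel(nodeCreate), singleLine(err))
	fmt.Printf("%s join cluster failed: %s\n", failureLevel(periodicRejoin), singleLine(err))
}
```

Collapsing whitespace keeps the helper independent of the concrete error type that Join returns. Extracting the individual errors and joining them explicitly would give a cleaner message, but it depends on the error type and is left out of this sketch.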

@devbird007

devbird007 commented Feb 9, 2024

This looks like a fun issue. I am applying to Antrea for the current LFx internship and am in the process of familiarizing myself with the Antrea codebase, so I'll take a stab at it once I have set up Antrea in a cluster to reproduce the problem.

@prakrit55
Contributor

Hey there, I would like to give it a try. Are you still working on it @roopeshsn ?

@roopeshsn
Contributor

> Hey there, I would like to give it a try. Are you still working on it @roopeshsn ?

Yes, I am working on it.

@roopeshsn
Contributor

> > Here the Join() method is from the hashicorp memberlist package. How can we tell that nothing has actually gone wrong, so that we can avoid logging? Or should I remove that log statement? @tnqn
>
> We join Nodes in two cases: 1) when a new Node is created; 2) when the periodic job RejoinNodes runs.
>
> For 1), the join can fail "expectedly" in two situations. When Antrea is deployed for the first time, agents start at almost the same time, so a faster agent will fail to join a slower one. When a new Node is actually created, its agent only becomes ready after a few seconds, so trying to join it immediately will likely fail. I feel we should change the log to INFO, indicating the attempt, its result, and that it will be retried.
> For 2), we should keep the failure as an Error.
>
> Besides, it would be better to consolidate the multiple errors returned by Join into a single-line string to keep log lines uniform.

Hi! Let me know if this message makes sense:

I0224 15:05:39.957935       1 cluster.go:230] "Processing Node CREATE event error, join cluster failed" error="Failed to join 172.24.0.4:10351: dial tcp 172.24.0.4:10351: connect: connection refused" member="172.24.0.4"
I0224 15:05:39.960605       1 cluster.go:230] "Processing Node CREATE event error, join cluster failed" error="Failed to join 172.24.0.2:10351: dial tcp 172.24.0.2:10351: connect: connection refused" member="172.24.0.2"

@tnqn
Member Author

tnqn commented Feb 27, 2024

> Let me know if this message makes sense:

Yes. It would be better to append "will retry later" to the message.
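As an illustration of that suggestion, here is a minimal sketch of how the resulting log line could look. `joinFailedLine` is a hypothetical helper, the message wording is an assumption rather than the text of the merged fix, and `fmt` is used here only to imitate the key/value layout of the log lines quoted above.

```go
package main

import "fmt"

// joinFailedLine builds a one-line, key/value style message for a failed
// join attempt. The wording, including "will retry later", is assumed for
// illustration and is not taken from the Antrea source.
func joinFailedLine(member, reason string) string {
	const msg = "Processing Node CREATE event error, join cluster failed, will retry later"
	return fmt.Sprintf("%q error=%q member=%q", msg, reason, member)
}

func main() {
	fmt.Println(joinFailedLine(
		"172.24.0.4",
		"Failed to join 172.24.0.4:10351: dial tcp 172.24.0.4:10351: connect: connection refused",
	))
}
```

Putting the retry hint in the message itself tells a reader of the log that no action is needed, which is the confusion the issue set out to remove.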
