Some antrea-agent Pods always log errors about "join cluster failed" #5966

tnqn · 2024-02-06T04:51:13Z

Describe the bug

When Antrea is newly deployed to a multi-node cluster, some antrea-agent Pods always log the following errors:

E0206 04:40:59.164477       1 cluster.go:224] "Processing Node CREATE event error, join cluster failed" err=<
        1 error occurred:
                * Failed to join 172.18.0.4:10351: dial tcp 172.18.0.4:10351: connect: connection refused

 > member="172.18.0.4"

The error is actually harmless: it occurs because some agents have not started yet when an agent tries to establish a cluster with the others. It recovers on its own once all agents are up.

However, such errors can confuse users when they hit other issues and look into the logs. I have received a few issue reports asking what the error means.

We should avoid logging such errors when nothing is actually wrong, to avoid unnecessary attention and confusion.

To Reproduce

Deploy Antrea v1.15.0 to a multi-node cluster, then check the logs of all antrea-agent Pods. There is a good chance that at least one of the Pods has the errors.

roopeshsn · 2024-02-06T05:05:20Z

I'll take this issue, @tnqn. I'll get back to you after taking a look at the agent code.

roopeshsn · 2024-02-07T03:42:28Z

The error is logged at this line:

antrea/pkg/agent/memberlist/cluster.go

Line 224 in 18bf0d8

klog.ErrorS(err, "Processing Node CREATE event error, join cluster failed", "member", member)

and the err object comes from this line:

antrea/pkg/agent/memberlist/cluster.go

Line 222 in 18bf0d8

_, err := c.mList.Join([]string{member})

The Join() method is from the hashicorp memberlist package. How can we tell that nothing went wrong, so that we can avoid logging? Or should I remove that log statement? @tnqn

tnqn · 2024-02-07T04:21:08Z

We join Nodes in two cases: 1) when a new Node is created; 2) when the periodic job RejoinNodes runs.

For 1), it can fail "expectedly" in two cases: when Antrea is deployed for the first time, agents start at almost the same time, so a faster agent will fail to join a slower one; when a new Node is actually created, its agent only becomes ready after a few seconds, so trying to join it immediately will likely fail. I think we should change this log to INFO, indicating the attempt, its result, and that it will be retried.
For 2), we should keep logging the failure as an error.

Besides, it would be better to consolidate the multiple errors returned by Join into a single-line string to keep log lines uniform.

devbird007 · 2024-02-09T17:23:38Z

This looks like a fun issue. I am applying to Antrea for the current LFx internship. I am in the process of familiarizing myself with the Antrea codebase, so I'll take a stab at it once I set up Antrea in a cluster to reproduce the problem.

prakrit55 · 2024-02-13T14:54:24Z

Hey there, I would like to give it a try. Are you still working on it @roopeshsn ?

roopeshsn · 2024-02-14T17:07:15Z

Hey there, I would like to give it a try. Are you still working on it @roopeshsn ?

Yes, I am working on it.

roopeshsn · 2024-02-24T15:09:11Z

Hi! Let me know if this message makes sense:

I0224 15:05:39.957935       1 cluster.go:230] "Processing Node CREATE event error, join cluster failed" error="Failed to join 172.24.0.4:10351: dial tcp 172.24.0.4:10351: connect: connection refused" member="172.24.0.4"
I0224 15:05:39.960605       1 cluster.go:230] "Processing Node CREATE event error, join cluster failed" error="Failed to join 172.24.0.2:10351: dial tcp 172.24.0.2:10351: connect: connection refused" member="172.24.0.2"

tnqn · 2024-02-27T16:14:10Z

Let me know if this message makes sense:

Yes. It would be better to append "will retry later" to the message.

tnqn added the good first issue and kind/bug labels Feb 6, 2024

tnqn mentioned this issue Feb 6, 2024

Broken install-vm.sh was not detected by any test job #5965

Closed

tnqn assigned roopeshsn Feb 6, 2024

roopeshsn mentioned this issue Mar 3, 2024

Fix Improve log when agent fails to join a new Node #6048

Merged

tnqn closed this as completed in #6048 Mar 11, 2024
