[202405][T2] Excessive syslogs generated after rebooting multiple LCs #20112

Open
arista-nwolfe opened this issue Sep 3, 2024 · 2 comments
Labels: Chassis 🤖 (Modular chassis support), Triaged (this issue has been triaged)

Comments

@arista-nwolfe
Contributor

As part of acl/test_acl.py::TestAclWithReboot on a T2 or T2-min topology, we reboot the individual line cards (LCs) sequentially.
This results in the first rebooted LC emitting these error logs continuously:

2024 Aug 26 20:00:35.241646 cmp214-5 ERR swss0#orchagent: :- removeLag: Failed to remove ref count 1 LAG cmp214-6|asic0|PortChannel101
2024 Aug 26 20:00:35.241646 cmp214-5 ERR swss0#orchagent: :- removeLag: Failed to remove ref count 1 LAG cmp214-6|asic1|PortChannel105
2024 Aug 26 20:00:35.241673 cmp214-5 NOTICE swss0#orchagent: :- removeRouterIntfs: Router interface cmp214-6|asic0|Ethernet-IB0 is still referenced with ref count 4
2024 Aug 26 20:00:35.241673 cmp214-5 NOTICE swss0#orchagent: :- removeRouterIntfs: Router interface cmp214-6|asic0|Ethernet16 is still referenced with ref count 4
2024 Aug 26 20:00:35.241686 cmp214-5 INFO swss0#rsyslogd: imuxsock[pid: 55, name: /usr/bin/orchagent]: 2698 messages lost due to rate-limiting (20000 allowed within 300 seconds)
2024 Aug 26 20:00:35.241701 cmp214-5 NOTICE swss0#orchagent: :- removeRouterIntfs: Router interface cmp214-6|asic0|PortChannel101 is still referenced with ref count 4
2024 Aug 26 20:00:35.241701 cmp214-5 NOTICE swss0#orchagent: :- removeRouterIntfs: Router interface cmp214-6|asic1|Ethernet-IB1 is still referenced with ref count 4
2024 Aug 26 20:00:35.241731 cmp214-5 NOTICE swss0#orchagent: :- removeRouterIntfs: Router interface cmp214-6|asic1|Ethernet160 is still referenced with ref count 4
2024 Aug 26 20:00:35.241731 cmp214-5 NOTICE swss0#orchagent: :- removeRouterIntfs: Router interface cmp214-6|asic1|PortChannel105 is still referenced with ref count 4
2024 Aug 26 20:00:35.303070 cmp214-5 ERR swss1#orchagent: :- removeLag: Failed to remove ref count 1 LAG cmp214-6|asic0|PortChannel101
2024 Aug 26 20:00:35.303070 cmp214-5 ERR swss1#orchagent: :- removeLag: Failed to remove ref count 1 LAG cmp214-6|asic1|PortChannel105
2024 Aug 26 20:00:35.303104 cmp214-5 NOTICE swss1#orchagent: :- removeRouterIntfs: Router interface cmp214-6|asic0|Ethernet-IB0 is still referenced with ref count 4
2024 Aug 26 20:00:35.303104 cmp214-5 INFO swss1#rsyslogd: imuxsock[pid: 55, name: /usr/bin/orchagent]: 3646 messages lost due to rate-limiting (20000 allowed within 300 seconds)
2024 Aug 26 20:00:35.303118 cmp214-5 NOTICE swss1#orchagent: :- removeRouterIntfs: Router interface cmp214-6|asic0|Ethernet16 is still referenced with ref count 4
2024 Aug 26 20:00:35.303141 cmp214-5 NOTICE swss1#orchagent: :- removeRouterIntfs: Router interface cmp214-6|asic0|PortChannel101 is still referenced with ref count 4
2024 Aug 26 20:00:35.303193 cmp214-5 NOTICE swss1#orchagent: :- removeRouterIntfs: Router interface cmp214-6|asic1|Ethernet-IB1 is still referenced with ref count 4
2024 Aug 26 20:00:35.303228 cmp214-5 NOTICE swss1#orchagent: :- removeRouterIntfs: Router interface cmp214-6|asic1|Ethernet160 is still referenced with ref count 4
2024 Aug 26 20:00:35.303265 cmp214-5 NOTICE swss1#orchagent: :- removeRouterIntfs: Router interface cmp214-6|asic1|PortChannel105 is still referenced with ref count 4
2024 Aug 26 20:00:35.806972 cmp214-5 ERR swss0#orchagent: :- removeLag: Failed to remove ref count 1 LAG cmp214-6|asic0|PortChannel101
2024 Aug 26 20:00:35.807066 cmp214-5 ERR swss0#orchagent: :- removeLag: Failed to remove ref count 1 LAG cmp214-6|asic1|PortChannel105
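
As a triage aid, here is a minimal sketch for tallying which orchagent functions dominate a capture like the one above. The regular expression is an assumption derived only from the lines quoted in this issue, and `summarize` is an illustrative helper, not part of SONiC or sonic-mgmt.

```python
import re
from collections import Counter

# Assumed layout, taken from the lines quoted above:
# <year> <mon> <day> <time> <host> <SEVERITY> <container>#<process>: :- <function>: <message>
LINE_RE = re.compile(
    r"^\d{4} \w{3} +\d+ [\d:.]+ (?P<host>\S+) (?P<sev>[A-Z]+) "
    r"(?P<proc>\S+?): :- (?P<func>\w+): (?P<msg>.*)$"
)

def summarize(lines):
    """Count messages per (severity, process, function); skip lines that do not match."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            counts[(m["sev"], m["proc"], m["func"])] += 1
    return counts

# Three lines copied from the capture above; the rsyslogd line is intentionally not counted.
sample = [
    "2024 Aug 26 20:00:35.241646 cmp214-5 ERR swss0#orchagent: :- removeLag: Failed to remove ref count 1 LAG cmp214-6|asic0|PortChannel101",
    "2024 Aug 26 20:00:35.241673 cmp214-5 NOTICE swss0#orchagent: :- removeRouterIntfs: Router interface cmp214-6|asic0|Ethernet-IB0 is still referenced with ref count 4",
    "2024 Aug 26 20:00:35.241686 cmp214-5 INFO swss0#rsyslogd: imuxsock[pid: 55, name: /usr/bin/orchagent]: 2698 messages lost due to rate-limiting (20000 allowed within 300 seconds)",
]

for key, n in summarize(sample).most_common():
    print(key, n)
```

Running this over a full syslog file (one line per list entry) should show whether `removeLag` and `removeRouterIntfs` account for most of the volume.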

I can manually reproduce this with the following steps:

<Reboot LC5>
<wait ~6 minutes>
<Reboot LC6>
<wait ~3 minutes>
<LC5 starts emitting logs nonstop>
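
The manual steps above can be written down as a small timeline, sketched here under the assumption that the caller supplies its own reboot hook. `reproduce` and `reboot_lc` are hypothetical names; how a line card is actually rebooted depends on the testbed and is not shown.

```python
import time

def reproduce(reboot_lc, sleep=time.sleep):
    """Replay the manual steps; reboot_lc is a caller-supplied (hypothetical) hook."""
    reboot_lc("LC5")
    sleep(6 * 60)  # wait ~6 minutes
    reboot_lc("LC6")
    sleep(3 * 60)  # wait ~3 minutes; LC5 is then expected to start logging nonstop

# Dry run with stand-ins, so nothing is rebooted and no time is spent waiting.
events = []
reproduce(
    reboot_lc=lambda lc: events.append(("reboot", lc)),
    sleep=lambda seconds: events.append(("wait", seconds)),
)
print(events)
```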

A side effect of these continuous logs is that syslog rate-limiting kicks in:

2024 Aug 26 19:57:21.303029 cmp214-5 INFO swss1#rsyslogd: imuxsock[pid: 55, name: /usr/bin/orchagent] from <cmp214-5:orchagent>: begin to drop messages due to rate-limiting
2024 Aug 26 19:58:10.807179 cmp214-5 INFO swss0#rsyslogd: imuxsock[pid: 55, name: /usr/bin/orchagent] from <cmp214-5:orchagent>: begin to drop messages due to rate-limiting
2024 Aug 26 20:00:35.241686 cmp214-5 INFO swss0#rsyslogd: imuxsock[pid: 55, name: /usr/bin/orchagent]: 2698 messages lost due to rate-limiting (20000 allowed within 300 seconds)
2024 Aug 26 20:00:35.303104 cmp214-5 INFO swss1#rsyslogd: imuxsock[pid: 55, name: /usr/bin/orchagent]: 3646 messages lost due to rate-limiting (20000 allowed within 300 seconds)
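
To illustrate why unrelated orchagent messages also disappear, here is a simplified fixed-window model of the limit reported in the logs (20000 messages per 300 seconds). This is a sketch of the general technique, not rsyslog's actual implementation; the class name and the flood rate of 100 messages per second are assumptions chosen for illustration.

```python
class FixedWindowLimiter:
    """Simplified model: allow `burst` messages per `interval` seconds, drop the rest."""

    def __init__(self, interval=300, burst=20000):
        self.interval = interval
        self.burst = burst
        self.window_start = None
        self.count = 0
        self.dropped = 0

    def allow(self, now):
        # Start a new window on the first message or once the interval has elapsed.
        if self.window_start is None or now - self.window_start >= self.interval:
            self.window_start = now
            self.count = 0
        self.count += 1
        if self.count > self.burst:
            self.dropped += 1
            return False
        return True

limiter = FixedWindowLimiter()

# Assumed flood: 100 messages per second for just under 300 seconds.
results = [limiter.allow(now=i / 100) for i in range(30000)]
print(sum(results), limiter.dropped)  # allowed vs. dropped within one window

# Any further message from the same sender in this window is dropped as well,
# regardless of its content; a new window starts once the interval has elapsed.
print(limiter.allow(now=299.995), limiter.allow(now=300.0))
```

In this model the budget is spent entirely on the repeated `removeLag` and `removeRouterIntfs` messages, so a later message from the same process within the window is dropped no matter how important it is.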

This causes the ACL tests to fail, because the syslog messages they wait for are dropped by the rate limiter:

E               Expected Messages that are missing:
E               .*Successfully created ACL rule.*
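
The failure mode can be sketched as a plain regex check over whatever lines survive rate-limiting. This is an illustration only and does not reproduce the test framework's log analysis code; `missing_expected` is a hypothetical helper, and the surviving line is shortened from the capture earlier in this issue.

```python
import re

# Expected pattern copied from the test failure output above.
EXPECTED = [r".*Successfully created ACL rule.*"]

def missing_expected(patterns, log_lines):
    """Return the patterns that match none of the given log lines."""
    return [p for p in patterns if not any(re.search(p, line) for line in log_lines)]

# Lines that made it into syslog; the ACL message itself was rate-limited away.
surviving = [
    "ERR swss0#orchagent: :- removeLag: Failed to remove ref count 1 LAG cmp214-6|asic0|PortChannel101",
]

print(missing_expected(EXPECTED, surviving))
```

A non-empty result corresponds to the "Expected Messages that are missing" section of the test report.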
@rlhui
Contributor

rlhui commented Sep 4, 2024

To check whether this might be due to cleanup when another LC is rebooted.

@rlhui added the "Triaged" label on Sep 4, 2024
@zhangyanzhao added the "Chassis 🤖" label on Sep 11, 2024
@arlakshm
Contributor

@saksarav-nokia to check on Nokia testbeds as well.
