Since the official build does not include iccpd, we built an image of the 202305 branch with iccpd enabled. (202205 has the same problem.)
We are using PortChannel01 as the peer link, with two Ethernet interfaces and one MCLAG instance on that peer link. After that we added some PortChannels on both sides and tested this configuration for some weeks without a problem. At some point, however, iccpd crashed and the MCLAG pair broke. We had to reboot the switches, but after running as expected for a few seconds, iccpd crashes again and leaves the MCLAG pair in a running but broken state. While debugging, we saw that if we run only one MCLAG-enabled switch (leafb, for example), the MCLAG is in an error state but we can still see the known MAC addresses with mclagdctl -i 1 dump macs. We then wanted to re-add leafa. To rule out any configuration differences in the PortChannels, we removed all MCLAG PortChannels from leafa (only the management interface and the peer link are configured) and applied the MCLAG-related config:
Right after the last command on leafa, iccpd crashes on leafb. After a reboot, both switches behave as before.
In the logs we found the following line on both switches:
Jun 21 18:04:08.128694 leafa INFO iccpd#supervisord: iccpd *** stack smashing detected ***: terminated
Jun 22 18:03:26.217470 leafb INFO iccpd#supervisord: iccpd *** stack smashing detected ***: terminated
We also found these lines close to the ones above:
Jun 21 18:04:08.128694 leafa ERR swss#orchagent: :- setMembers: Port Ethe not supported
Jun 22 18:03:26.253784 leafb ERR swss#orchagent: :- setMembers: Port Eth not supported
As you can see, the port name string Eth(e) seems to be cut off. By the way, we currently have only a single Ethernet uplink on leafa, which is shared across the peer link. We also tried removing it on leafa and starting the MCLAG pair again, without any luck: iccpd crashes with the same error and behavior.
To be clear, we first hit this problem when both switches had the full MCLAG PortChannel setup. We created tech-support files on both switches right after the crash and before rebooting them.
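To check how often such cut-off names appear in a larger syslog, a small filter can help. This is only a sketch: the regular expressions and the function name `truncated_ports` are illustrative assumptions, and the set of valid name shapes (`EthernetN`, `PortChannelN`) is assumed from the interface names used in this report.

```python
import re

# Name shapes we expect on these switches (an assumption for this sketch).
VALID_PORT = re.compile(r"^(Ethernet|PortChannel)\d+$")
# Matches the orchagent error message quoted in the logs above.
NOT_SUPPORTED = re.compile(r"setMembers: Port (\S+) not supported")


def truncated_ports(log_lines):
    """Return port names from 'not supported' errors that look cut off."""
    suspects = []
    for line in log_lines:
        match = NOT_SUPPORTED.search(line)
        if match and not VALID_PORT.match(match.group(1)):
            suspects.append(match.group(1))
    return suspects


sample = [
    "Jun 21 18:04:08.128694 leafa ERR swss#orchagent: :- setMembers: Port Ethe not supported",
    "Jun 22 18:03:26.253784 leafb ERR swss#orchagent: :- setMembers: Port Eth not supported",
]
print(truncated_ports(sample))
```

Running it over the full syslog of both switches would show whether the truncation always happens at the same length or varies between crashes.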
Steps to reproduce the issue:
Create the described scenario
Reboot both switches
Wait a couple of seconds
Describe the results you received:
It seems to be okay for a few seconds after that (core dumps for iccpd and orchagent are available):
root@leafb:~# mclagdctl -i 1 dump state
The MCLAG's keepalive is: ERROR
MCLAG info sync is: incomplete
Domain id: 1
Local Ip: 192.168.10.2
Peer Ip: 192.168.10.1
Peer Link Interface: PortChannel01
Keepalive time: 1
sesssion Timeout : 15
Peer Link Mac: 64:9d:99:3a:d8:cc
Role: Standby
MCLAG Interface: PortChannel05,PortChannel03,PortChannel02,PortChannel18,PortChannel21,PortChannel17,PortChannel16,PortChannel13,PortChannel11,PortChannel19,PortChannel14,PortChannel06,PortChannel10,PortChannel20,PortChannel24,PortChannel09,PortChannel12,PortChannel15,PortChannel23,PortChannel07,PortChannel26,PortChannel08,PortChannel25,PortChannel04,PortChannel22
Loglevel: NOTICE
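For comparing both switches after each reboot, the state dump above can be turned into a dictionary and checked automatically. The following is a minimal sketch, not part of SONiC: the helper names `parse_mclag_state` and `mclag_problems` are hypothetical, and it only flags the two values seen in this report (`ERROR` and `incomplete`) rather than assuming what a healthy dump looks like.

```python
def parse_mclag_state(text):
    """Split 'key: value' lines of the state dump into a dict."""
    state = {}
    for line in text.splitlines():
        # Split on the first colon only, so MAC addresses stay intact.
        key, sep, value = line.partition(":")
        if sep:
            state[key.strip()] = value.strip()
    return state


def mclag_problems(state):
    """List the fields that show the broken values seen in this report."""
    problems = []
    if state.get("The MCLAG's keepalive is") == "ERROR":
        problems.append("keepalive")
    if state.get("MCLAG info sync is") == "incomplete":
        problems.append("info sync")
    return problems


dump = """The MCLAG's keepalive is: ERROR
MCLAG info sync is: incomplete
Domain id: 1
Peer Link Interface: PortChannel01
Peer Link Mac: 64:9d:99:3a:d8:cc
Role: Standby
MCLAG Interface: PortChannel05,PortChannel03,PortChannel02"""

state = parse_mclag_state(dump)
interfaces = state["MCLAG Interface"].split(",")
print(mclag_problems(state), len(interfaces))
```

The sample dump is shortened to three MCLAG interfaces for readability; the real output above lists 25.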
Additional information you deem important (e.g. issue happens only occasionally):
I built the images with symbols and generated the requested stack traces:
ICCPD:
docker run -it -v $PWD:/work --entrypoint bash docker-iccpd-dbg
root@ceb8db9ff0c9:/# gdb /usr/bin/iccpd /work/iccpd.1687374847.23.core
[...snip]
(gdb) bt
#0 0x00007f5ef0a09ce1 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f5ef09f3537 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f5ef0a4b3a8 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007f5ef0adc542 in __fortify_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x00007f5ef0adc520 in __stack_chk_fail () from /lib/x86_64-linux-gnu/libc.so.6
#5 0x00005566de51a845 in update_peerlink_isolate_from_all_csm_lif (csm=0x5566df6e49b0) at mlacp_link_handler.c:1209
#6 0x00005566de51a9e3 in set_peerlink_mlag_port_isolate (csm=0x5566df6e49b0, lif=0x7ffdb6b6e030, lif@entry=0x5566df6ee480, enable=1, is_unbind_pending=225) at mlacp_link_handler.c:1233
#7 0x00005566de51accc in update_peerlink_isolate_from_lif (csm=csm@entry=0x5566df6e49b0, lif=lif@entry=0x5566df6ee480, lif_po_state=lif_po_state@entry=1) at mlacp_link_handler.c:1368
#8 0x00005566de51dc67 in update_peerlink_isolate_from_lif (lif_po_state=1, lif=0x5566df6ee480, csm=0x5566df6e49b0) at mlacp_link_handler.c:1776
#9 mlacp_portchannel_state_handler (csm=0x5566df6e49b0, local_if=0x5566df6ee480, po_state=1) at mlacp_link_handler.c:2104
#10 0x00005566de521346 in mlacp_portchannel_state_handler (po_state=<optimized out>, local_if=0x5566df6ee480, csm=0x5566df6e49b0) at mlacp_link_handler.c:2094
#11 mlacp_peer_conn_handler (csm=csm@entry=0x5566df6e49b0) at mlacp_link_handler.c:2281
#12 0x00005566de5271c8 in mlacp_fsm_transit (csm=csm@entry=0x5566df6e49b0) at mlacp_fsm.c:916
#13 0x00005566de517bc8 in scheduler_transit_fsm () at scheduler.c:116
#14 scheduler_loop () at scheduler.c:479
#15 0x00005566de517c97 in scheduler_start () at scheduler.c:534
#16 0x00005566de50cc5d in main (argc=<optimized out>, argv=0x7ffdb6b6e990) at iccp_main.c:266
(gdb)
Orchagent:
docker run -it -v $PWD:/work --entrypoint bash docker-orchagent-dbg
root@4f2fd291d05c:/# gdb /usr/bin/orchagent /work/orchagent.1687370636.52.core
[...snip]
(gdb) bt
#0 0x00007f6a7d8fdce1 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f6a7d8e7537 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x000055d8918c237c in handleSaiFailure (abort_on_failure=<optimized out>) at saihelper.cpp:771
#3 0x000055d891b0c6d7 in handleSaiRemoveStatus (api=api@entry=SAI_API_FDB, status=-2021033792, status@entry=-7, context=context@entry=0x0) at saihelper.cpp:700
#4 0x000055d891ab44e7 in FdbOrch::removeFdbEntry (this=0x55d8924ed2b0, entry=..., origin=<optimized out>) at fdborch.cpp:1621
#5 0x000055d891ab4dd5 in FdbOrch::doTask (this=0x55d8924ed2b0, consumer=...) at fdborch.cpp:853
#6 0x000055d8919898bd in Consumer::drain (this=0x55d8924e8000) at orch.cpp:264
#7 Consumer::drain (this=0x55d8924e8000) at orch.cpp:261
#8 Consumer::execute (this=0x55d8924e8000) at orch.cpp:258
#9 0x000055d8919795f8 in OrchDaemon::start (this=this@entry=0x55d8924a1100) at orchdaemon.cpp:769
#10 0x000055d8918f6da6 in main (argc=<optimized out>, argv=<optimized out>) at main.cpp:766
(gdb)
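When several core dumps have to be compared, it helps to pull out the first frame that carries source information, since the frames above it are libc internals. The sketch below does this for gdb `bt` output like the traces above; the regular expression and helper names are assumptions made for illustration and cover only the frame shapes that appear in this report.

```python
import re

# Frame shapes seen above: optional address, function, arguments,
# then either "at file:line" or "from library".
FRAME = re.compile(
    r"^#(?P<num>\d+)\s+(?:0x[0-9a-f]+ in )?(?P<func>\S+) \(.*\)"
    r"(?: at (?P<file>\S+):(?P<line>\d+)| from (?P<lib>\S+))?$"
)


def parse_frames(backtrace):
    """Parse gdb backtrace text into a list of dicts, one per frame."""
    frames = []
    for raw in backtrace.splitlines():
        match = FRAME.match(raw.strip())
        if match:
            frames.append(match.groupdict())
    return frames


def first_source_frame(frames):
    """Return the innermost frame that has a source file, or None."""
    for frame in frames:
        if frame["file"]:
            return frame
    return None


sample_bt = """#3 0x00007f5ef0adc542 in __fortify_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x00007f5ef0adc520 in __stack_chk_fail () from /lib/x86_64-linux-gnu/libc.so.6
#5 0x00005566de51a845 in update_peerlink_isolate_from_all_csm_lif (csm=0x5566df6e49b0) at mlacp_link_handler.c:1209
#9 mlacp_portchannel_state_handler (csm=0x5566df6e49b0, local_if=0x5566df6ee480, po_state=1) at mlacp_link_handler.c:2104"""

frame = first_source_frame(parse_frames(sample_bt))
print(frame["func"], frame["file"], frame["line"])
```

For the iccpd trace this points at `update_peerlink_isolate_from_all_csm_lif`, the function whose stack canary check failed.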
May I know if there is any fix available for "Issue #1"?
Issue #1: For ICCPd, below are the logs during the crash. It looks like some of the ebtables updates are not supported.
Hi @selvatechtalk, unfortunately, we were not able to fix it and stopped testing SONiC. It's been a while though; maybe something has changed on the ICCPd front.
Praveen Elagala already provided some analysis in Google Groups: https://groups.google.com/g/sonicproject/c/00rnM19XgDs
Description
We encountered a problem regarding iccpd and mclag. We use two switches, leafa and leafb (Model), in an L2 scenario. We followed the configuration example at: https://support.edge-core.com/hc/en-us/articles/900002380706--Enterprise-SONiC-MC-LAG
Describe the results you expected:
A working mclag state.
Output of show version:
Output of show techsupport: https://crossmediasolutions-my.sharepoint.com/:f:/g/personal/m_stroecker_4allportal_com/EtcT8kAQtZxDpGLIv1bSCJkB2_VPhsv-3yOoy2li3XOxug?e=gi8kbF