You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The issue is caused by recent changes aimed to improve fast boot performance: the dependency on interfaces-config has been removed from swss/syncd.
This end up in a race between networking service restarting and accessing redis db from within swss/syncd.
sonic-buildimage/files/scripts/syncd_common.sh
function check_warm_boot()
{
SYSTEM_WARM_START=`$SONIC_DB_CLI STATE_DB hget "WARM_RESTART_ENABLE_TABLE|system" enable`
SERVICE_WARM_START=`$SONIC_DB_CLI STATE_DB hget "WARM_RESTART_ENABLE_TABLE|${SERVICE}" enable`
# SYSTEM_WARM_START could be empty, always make WARM_BOOT meaningful.
if [[ x"$SYSTEM_WARM_START" == x"true" ]] || [[ x"$SERVICE_WARM_START" == x"true" ]]; then
WARM_BOOT="true"
else
WARM_BOOT="false"
fi
}
#
# system boot: reboot type - fastfast
#
May 6 11:59:25.207554 r-leopard-01 INFO kernel: [ 0.000000] Command line: BOOT_IMAGE=/image-202211_RC12.1-4ee027200_Internal/boot/vmlinuz-5.10.0-18-2-amd64 root=UU
ID=f25e4512-e4b1-4534-90fb-fc3ac1dba0e7 rw console=tty0 console=ttyS0,9600n8 quiet intel_idle.max_cstate=0 net.ifnames=0 biosdevname=0 loop=image-202211_RC12.1-4ee0272
00_Internal/fs.squashfs loopfstype=squashfs systemd.unified_cgroup_hierarchy=0 apparmor=1 security=apparmor varlog_size=4096 usbcore.autosuspend=-1 SONIC_BOOT_TYPE=fas
tfast
#
# swss/syncd service start: failed to get warm-boot flag - Unable to connect to redis: Cannot assign requested address
#
May 6 11:59:36.699943 r-leopard-01 INFO interfaces-config.sh[5085]: net.ipv6.conf.eth0.accept_ra_defrtr = 0
May 6 11:59:36.700028 r-leopard-01 INFO interfaces-config.sh[5085]: net.ipv6.conf.eth0.accept_ra = 0
May 6 11:59:36.710982 r-leopard-01 INFO systemd[1]: Stopping Network initialization...
May 6 11:59:36.734004 r-leopard-01 INFO networking[5090]: networking: Deconfiguring network interfaces
May 6 11:59:36.828247 r-leopard-01 INFO caclmgrd[4657]: Issuing the following iptables commands:
May 6 11:59:36.849028 r-leopard-01 INFO pcie-check.sh[5099]: 1
May 6 11:59:36.854216 r-leopard-01 NOTICE root: Started swss service...
May 6 11:59:36.858349 r-leopard-01 NOTICE root: /usr/bin/pcie-check.sh : PCIe check passed
May 6 11:59:36.858756 r-leopard-01 INFO systemd[1]: pcie-check.service: Succeeded.
May 6 11:59:36.861054 r-leopard-01 NOTICE root: Unlocking /tmp/swss-syncd-lock (10) from swss service
May 6 11:59:36.868151 r-leopard-01 INFO systemd[1]: Started switch state service.
May 6 11:59:36.871365 r-leopard-01 INFO systemd[1]: Starting Enable backend acl on storage backend ToRs...
May 6 11:59:36.875347 r-leopard-01 INFO systemd[1]: Starting syncd service...
May 6 11:59:36.880671 r-leopard-01 INFO systemd[1]: Starting TEAMD container...
May 6 11:59:36.880853 r-leopard-01 INFO systemd[1]: Started watchdog control service.
May 6 11:59:36.887150 r-leopard-01 NOTICE root: Starting syncd service...
May 6 11:59:36.890145 r-leopard-01 NOTICE root: /usr/local/bin/watchdog-control.sh : Disabling Watchdog during bootup after fastfast
May 6 11:59:36.891808 r-leopard-01 NOTICE admin: Starting teamd service...
May 6 11:59:36.898266 r-leopard-01 NOTICE root: Locking /tmp/swss-syncd-lock from syncd service
May 6 11:59:36.910646 r-leopard-01 NOTICE root: Locked /tmp/swss-syncd-lock (10) from syncd service
May 6 11:59:36.952308 r-leopard-01 INFO backend_acl.py: executing cmd = ['/usr/local/bin/sonic-cfggen', '-m', '-v', 'DEVICE_METADATA.localhost.type']
May 6 11:59:36.957352 r-leopard-01 NOTICE admin: Warm boot flag: teamd false.
May 6 11:59:36.965040 r-leopard-01 NOTICE admin: Fast boot flag: teamd .
May 6 11:59:36.966048 r-leopard-01 INFO swss.sh[5238]: Invalid database name input : 'STATE_DB'
May 6 11:59:36.966142 r-leopard-01 INFO swss.sh[5238]: Unable to connect to redis: Cannot assign requested address
May 6 11:59:36.972625 r-leopard-01 INFO syncd.sh[5241]: Invalid database name input : 'STATE_DB'
May 6 11:59:36.972722 r-leopard-01 INFO syncd.sh[5241]: Unable to connect to redis: Cannot assign requested address
May 6 11:59:36.979634 r-leopard-01 INFO swss.sh[5244]: Invalid database name input : 'STATE_DB'
May 6 11:59:36.979993 r-leopard-01 INFO swss.sh[5244]: Unable to connect to redis: Cannot assign requested address
May 6 11:59:36.985177 r-leopard-01 INFO syncd.sh[5248]: Invalid database name input : 'STATE_DB'
May 6 11:59:36.985266 r-leopard-01 INFO syncd.sh[5248]: Unable to connect to redis: Cannot assign requested address
May 6 11:59:36.989408 r-leopard-01 NOTICE root: Warm boot flag: syncd false.
May 6 11:59:37.010108 r-leopard-01 NOTICE root: Starting Firmware update procedure
#
# syncd is started in cold mode
#
May 6 12:00:25.801003 r-leopard-01 NOTICE syncd#syncd: :- Syncd: command line: EnableDiagShell=NO EnableTempView=YES DisableExitSleep=NO EnableUnittests=NO EnableConsistencyCheck=NO EnableSyncMode=YES RedisCommunicationMode=redis_async EnableSaiBulkSuport=YES StartType=cold ProfileMapFile=/tmp/sai.profile GlobalContext=0 ContextConfig= BreakConfig=
May 6 12:00:25.801047 r-leopard-01 NOTICE syncd#syncd: :- loadFromFile: no context config specified, will load default context config
May 6 12:00:25.801047 r-leopard-01 NOTICE syncd#syncd: :- insert: added switch: idx 0, hwinfo ''
May 6 12:00:25.801047 r-leopard-01 WARNING syncd#syncd: :- Syncd: enable sync mode is deprecated, please use communication mode, FORCING redis sync mode
May 6 12:00:25.802340 r-leopard-01 NOTICE syncd#syncd: :- RedisSelectableChannel: opened redis channel
May 6 12:00:25.804310 r-leopard-01 NOTICE syncd#syncd: :- isVeryFirstRun: First Run: True
May 6 12:00:25.804310 r-leopard-01 WARNING syncd#syncd: :- performStartupLogic: override command line startType=cold via SAI_START_TYPE_WARM_BOOT
May 6 12:00:25.804334 r-leopard-01 NOTICE syncd#syncd: :- profileGetValue: SAI_WARM_BOOT_READ_FILE: /var/warmboot/sai-warmboot.bin
May 6 12:00:25.804352 r-leopard-01 NOTICE syncd#syncd: :- performStartupLogic: using warmBootReadFile: '/var/warmboot/sai-warmboot.bin'
May 6 12:00:25.804376 r-leopard-01 WARNING syncd#syncd: :- performStartupLogic: user requested warmStart but warmBootReadFile is not specified or not accessible, forcing cold start
May 6 12:00:25.804400 r-leopard-01 NOTICE syncd#syncd: :- initialize: SAI API version: 11100
Since, this is a timing issue - it is very hard to reproduce.
The possible solutions are:
Add a dependency on interfaces-config for all sonic services which will guarantee redis availability over TCP/IP for all use cases
Switch to UNIX socket IPC communication for all the processes
In case TCP/IP availability is a must, the entire SONiC architecture must be revisited in order to guarantee loopback (localhost) reachability for all the situations.
Steps to reproduce the issue:
Run warm-reboot sad test in a loop
Describe the results you received:
syncd is started in a cold mode after 'warm-reboot'
Describe the results you expected:
syncd is started in a warm mode after 'warm-reboot'
Output of show version:
N/A
Output of show techsupport:
N/A
Additional information you deem important (e.g. issue happens only occasionally):
More on this:
SONiC DB CLI doesn't work when UNIX socket option is provided: verified by stopping networking service
Verified fast-reboot on MSN4700 with dependency on interfaces-config for swss/syncd: degradation is 2 sec
No dependency "After":
Summary:
2023-05-12 15:24:21 : --------------------------------------------------
2023-05-12 15:24:21 : Longest downtime period was 0:00:10.806707
2023-05-12 15:24:21 : Reboot time was 0:00:38.100113
2023-05-12 15:24:21 : Expected downtime is less then 0:00:30
2023-05-12 15:24:21 : --------------------------------------------------
2023-05-12 15:24:21 : Additional info:
2023-05-12 15:24:21 : --------------------------------------------------
2023-05-12 15:24:21 : INFO:10.213.103.9:PortChannel interface state changed 1 times
2023-05-12 15:24:21 : INFO:10.213.103.9:BGP v6 routes must be down just for once
2023-05-12 15:24:21 : INFO:10.213.103.8:PortChannel interface state changed 1 times
2023-05-12 15:24:21 : INFO:10.213.103.8:BGP v6 routes must be down just for once
2023-05-12 15:24:21 : INFO:10.213.103.7:PortChannel interface state changed 1 times
2023-05-12 15:24:21 : INFO:10.213.103.6:PortChannel interface state changed 1 times
2023-05-12 15:24:21 : --------------------------------------------------
2023-05-12 15:24:21 : =========
With dependency "After":
Summary:
2023-05-12 15:55:51 : --------------------------------------------------
2023-05-12 15:55:51 : Longest downtime period was 0:00:10.492857
2023-05-12 15:55:51 : Reboot time was 0:00:40.035968
2023-05-12 15:55:51 : Expected downtime is less then 0:00:30
2023-05-12 15:55:51 : --------------------------------------------------
2023-05-12 15:55:51 : Additional info:
2023-05-12 15:55:51 : --------------------------------------------------
2023-05-12 15:55:51 : INFO:10.213.103.9:PortChannel interface state changed 1 times
2023-05-12 15:55:51 : INFO:10.213.103.9:BGP v6 routes must be down just for once
2023-05-12 15:55:51 : INFO:10.213.103.8:PortChannel interface state changed 1 times
2023-05-12 15:55:51 : INFO:10.213.103.8:BGP v6 routes must be down just for once
2023-05-12 15:55:51 : INFO:10.213.103.7:PortChannel interface state changed 1 times
2023-05-12 15:55:51 : INFO:10.213.103.7:BGP v6 routes must be down just for once
2023-05-12 15:55:51 : INFO:10.213.103.6:PortChannel interface state changed 1 times
2023-05-12 15:55:51 : INFO:10.213.103.6:BGP v6 routes must be down just for once
2023-05-12 15:55:51 : --------------------------------------------------
2023-05-12 15:55:51 : =========
Description
The issue is caused by recent changes aimed to improve fast boot performance: the dependency on
interfaces-config
has been removed fromswss
/syncd
.This end up in a race between networking service restarting and accessing
redis
db from withinswss
/syncd
.Since, this is a timing issue - it is very hard to reproduce.
The possible solutions are:
In case TCP/IP availability is a must, the entire SONiC architecture must be revisited in order to guarantee loopback (localhost) reachability for all the situations.
Steps to reproduce the issue:
warm-reboot
sad test in a loopDescribe the results you received:
syncd
is started in a cold mode after 'warm-reboot'Describe the results you expected:
syncd
is started in a warm mode after 'warm-reboot'Output of
show version
:Output of
show techsupport
:Additional information you deem important (e.g. issue happens only occasionally):
More on this:
fast-reboot
onMSN4700
with dependency oninterfaces-config
forswss
/syncd
: degradation is 2 secNo dependency "After":
With dependency "After":
The culprit:
Attempt to fix the bug:
Related to:
The text was updated successfully, but these errors were encountered: