Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Warm/Fast reboot stability degradation #15076

Open
nazariig opened this issue May 15, 2023 · 1 comment
Open

Warm/Fast reboot stability degradation #15076

nazariig opened this issue May 15, 2023 · 1 comment

Comments

@nazariig
Copy link
Collaborator

nazariig commented May 15, 2023

Description

The issue is caused by recent changes aimed to improve fast boot performance: the dependency on interfaces-config has been removed from swss/syncd.
This end up in a race between networking service restarting and accessing redis db from within swss/syncd.

sonic-buildimage/files/scripts/syncd_common.sh

function check_warm_boot()
{
    SYSTEM_WARM_START=`$SONIC_DB_CLI STATE_DB hget "WARM_RESTART_ENABLE_TABLE|system" enable`
    SERVICE_WARM_START=`$SONIC_DB_CLI STATE_DB hget "WARM_RESTART_ENABLE_TABLE|${SERVICE}" enable`
    # SYSTEM_WARM_START could be empty, always make WARM_BOOT meaningful.
    if [[ x"$SYSTEM_WARM_START" == x"true" ]] || [[ x"$SERVICE_WARM_START" == x"true" ]]; then
        WARM_BOOT="true"
    else
        WARM_BOOT="false"
    fi
}

#
# system boot: reboot type - fastfast
#

May  6 11:59:25.207554 r-leopard-01 INFO kernel: [    0.000000] Command line: BOOT_IMAGE=/image-202211_RC12.1-4ee027200_Internal/boot/vmlinuz-5.10.0-18-2-amd64 root=UU
ID=f25e4512-e4b1-4534-90fb-fc3ac1dba0e7 rw console=tty0 console=ttyS0,9600n8 quiet intel_idle.max_cstate=0 net.ifnames=0 biosdevname=0 loop=image-202211_RC12.1-4ee0272
00_Internal/fs.squashfs loopfstype=squashfs systemd.unified_cgroup_hierarchy=0 apparmor=1 security=apparmor varlog_size=4096 usbcore.autosuspend=-1 SONIC_BOOT_TYPE=fas
tfast

#
# swss/syncd service start: failed to get warm-boot flag - Unable to connect to redis: Cannot assign requested address
#

May  6 11:59:36.699943 r-leopard-01 INFO interfaces-config.sh[5085]: net.ipv6.conf.eth0.accept_ra_defrtr = 0
May  6 11:59:36.700028 r-leopard-01 INFO interfaces-config.sh[5085]: net.ipv6.conf.eth0.accept_ra = 0
May  6 11:59:36.710982 r-leopard-01 INFO systemd[1]: Stopping Network initialization...
May  6 11:59:36.734004 r-leopard-01 INFO networking[5090]: networking: Deconfiguring network interfaces
May  6 11:59:36.828247 r-leopard-01 INFO caclmgrd[4657]: Issuing the following iptables commands:
May  6 11:59:36.849028 r-leopard-01 INFO pcie-check.sh[5099]: 1
May  6 11:59:36.854216 r-leopard-01 NOTICE root: Started swss service...
May  6 11:59:36.858349 r-leopard-01 NOTICE root: /usr/bin/pcie-check.sh : PCIe check passed
May  6 11:59:36.858756 r-leopard-01 INFO systemd[1]: pcie-check.service: Succeeded.
May  6 11:59:36.861054 r-leopard-01 NOTICE root: Unlocking /tmp/swss-syncd-lock (10) from swss service
May  6 11:59:36.868151 r-leopard-01 INFO systemd[1]: Started switch state service.
May  6 11:59:36.871365 r-leopard-01 INFO systemd[1]: Starting Enable backend acl on storage backend ToRs...
May  6 11:59:36.875347 r-leopard-01 INFO systemd[1]: Starting syncd service...
May  6 11:59:36.880671 r-leopard-01 INFO systemd[1]: Starting TEAMD container...
May  6 11:59:36.880853 r-leopard-01 INFO systemd[1]: Started watchdog control service.
May  6 11:59:36.887150 r-leopard-01 NOTICE root: Starting syncd service...
May  6 11:59:36.890145 r-leopard-01 NOTICE root: /usr/local/bin/watchdog-control.sh : Disabling Watchdog during bootup after fastfast
May  6 11:59:36.891808 r-leopard-01 NOTICE admin: Starting teamd service...
May  6 11:59:36.898266 r-leopard-01 NOTICE root: Locking /tmp/swss-syncd-lock from syncd service
May  6 11:59:36.910646 r-leopard-01 NOTICE root: Locked /tmp/swss-syncd-lock (10) from syncd service
May  6 11:59:36.952308 r-leopard-01 INFO backend_acl.py: executing cmd =  ['/usr/local/bin/sonic-cfggen', '-m', '-v', 'DEVICE_METADATA.localhost.type']
May  6 11:59:36.957352 r-leopard-01 NOTICE admin: Warm boot flag: teamd false.
May  6 11:59:36.965040 r-leopard-01 NOTICE admin: Fast boot flag: teamd .
May  6 11:59:36.966048 r-leopard-01 INFO swss.sh[5238]: Invalid database name input : 'STATE_DB'
May  6 11:59:36.966142 r-leopard-01 INFO swss.sh[5238]: Unable to connect to redis: Cannot assign requested address
May  6 11:59:36.972625 r-leopard-01 INFO syncd.sh[5241]: Invalid database name input : 'STATE_DB'
May  6 11:59:36.972722 r-leopard-01 INFO syncd.sh[5241]: Unable to connect to redis: Cannot assign requested address
May  6 11:59:36.979634 r-leopard-01 INFO swss.sh[5244]: Invalid database name input : 'STATE_DB'
May  6 11:59:36.979993 r-leopard-01 INFO swss.sh[5244]: Unable to connect to redis: Cannot assign requested address
May  6 11:59:36.985177 r-leopard-01 INFO syncd.sh[5248]: Invalid database name input : 'STATE_DB'
May  6 11:59:36.985266 r-leopard-01 INFO syncd.sh[5248]: Unable to connect to redis: Cannot assign requested address
May  6 11:59:36.989408 r-leopard-01 NOTICE root: Warm boot flag: syncd false.
May  6 11:59:37.010108 r-leopard-01 NOTICE root: Starting Firmware update procedure

#
# syncd is started in cold mode
#

May  6 12:00:25.801003 r-leopard-01 NOTICE syncd#syncd: :- Syncd: command line:  EnableDiagShell=NO EnableTempView=YES DisableExitSleep=NO EnableUnittests=NO EnableConsistencyCheck=NO EnableSyncMode=YES RedisCommunicationMode=redis_async EnableSaiBulkSuport=YES StartType=cold ProfileMapFile=/tmp/sai.profile GlobalContext=0 ContextConfig= BreakConfig=
May  6 12:00:25.801047 r-leopard-01 NOTICE syncd#syncd: :- loadFromFile: no context config specified, will load default context config
May  6 12:00:25.801047 r-leopard-01 NOTICE syncd#syncd: :- insert: added switch: idx 0, hwinfo ''
May  6 12:00:25.801047 r-leopard-01 WARNING syncd#syncd: :- Syncd: enable sync mode is deprecated, please use communication mode, FORCING redis sync mode
May  6 12:00:25.802340 r-leopard-01 NOTICE syncd#syncd: :- RedisSelectableChannel: opened redis channel
May  6 12:00:25.804310 r-leopard-01 NOTICE syncd#syncd: :- isVeryFirstRun: First Run: True
May  6 12:00:25.804310 r-leopard-01 WARNING syncd#syncd: :- performStartupLogic: override command line startType=cold via SAI_START_TYPE_WARM_BOOT
May  6 12:00:25.804334 r-leopard-01 NOTICE syncd#syncd: :- profileGetValue: SAI_WARM_BOOT_READ_FILE: /var/warmboot/sai-warmboot.bin
May  6 12:00:25.804352 r-leopard-01 NOTICE syncd#syncd: :- performStartupLogic: using warmBootReadFile: '/var/warmboot/sai-warmboot.bin'
May  6 12:00:25.804376 r-leopard-01 WARNING syncd#syncd: :- performStartupLogic: user requested warmStart but warmBootReadFile is not specified or not accessible, forcing cold start
May  6 12:00:25.804400 r-leopard-01 NOTICE syncd#syncd: :- initialize: SAI API version: 11100

Since, this is a timing issue - it is very hard to reproduce.

The possible solutions are:

  1. Add a dependency on interfaces-config for all sonic services which will guarantee redis availability over TCP/IP for all use cases
  2. Switch to UNIX socket IPC communication for all the processes

In case TCP/IP availability is a must, the entire SONiC architecture must be revisited in order to guarantee loopback (localhost) reachability for all the situations.

Steps to reproduce the issue:

  1. Run warm-reboot sad test in a loop

Describe the results you received:

syncd is started in a cold mode after 'warm-reboot'

Describe the results you expected:

syncd is started in a warm mode after 'warm-reboot'

Output of show version:

  • N/A

Output of show techsupport:

  • N/A

Additional information you deem important (e.g. issue happens only occasionally):

More on this:

  1. SONiC DB CLI doesn't work when UNIX socket option is provided: verified by stopping networking service
  2. Verified fast-reboot on MSN4700 with dependency on interfaces-config for swss/syncd: degradation is 2 sec

No dependency "After":

Summary:
2023-05-12 15:24:21 : --------------------------------------------------
2023-05-12 15:24:21 : Longest downtime period was 0:00:10.806707
2023-05-12 15:24:21 : Reboot time was 0:00:38.100113
2023-05-12 15:24:21 : Expected downtime is less then 0:00:30
2023-05-12 15:24:21 : --------------------------------------------------
2023-05-12 15:24:21 : Additional info:
2023-05-12 15:24:21 : --------------------------------------------------
2023-05-12 15:24:21 : INFO:10.213.103.9:PortChannel interface state changed 1 times
2023-05-12 15:24:21 : INFO:10.213.103.9:BGP v6 routes must be down just for once
2023-05-12 15:24:21 : INFO:10.213.103.8:PortChannel interface state changed 1 times
2023-05-12 15:24:21 : INFO:10.213.103.8:BGP v6 routes must be down just for once
2023-05-12 15:24:21 : INFO:10.213.103.7:PortChannel interface state changed 1 times
2023-05-12 15:24:21 : INFO:10.213.103.6:PortChannel interface state changed 1 times
2023-05-12 15:24:21 : --------------------------------------------------
2023-05-12 15:24:21 : =========

With dependency "After":

Summary:
2023-05-12 15:55:51 : --------------------------------------------------
2023-05-12 15:55:51 : Longest downtime period was 0:00:10.492857
2023-05-12 15:55:51 : Reboot time was 0:00:40.035968
2023-05-12 15:55:51 : Expected downtime is less then 0:00:30
2023-05-12 15:55:51 : --------------------------------------------------
2023-05-12 15:55:51 : Additional info:
2023-05-12 15:55:51 : --------------------------------------------------
2023-05-12 15:55:51 : INFO:10.213.103.9:PortChannel interface state changed 1 times
2023-05-12 15:55:51 : INFO:10.213.103.9:BGP v6 routes must be down just for once
2023-05-12 15:55:51 : INFO:10.213.103.8:PortChannel interface state changed 1 times
2023-05-12 15:55:51 : INFO:10.213.103.8:BGP v6 routes must be down just for once
2023-05-12 15:55:51 : INFO:10.213.103.7:PortChannel interface state changed 1 times
2023-05-12 15:55:51 : INFO:10.213.103.7:BGP v6 routes must be down just for once
2023-05-12 15:55:51 : INFO:10.213.103.6:PortChannel interface state changed 1 times
2023-05-12 15:55:51 : INFO:10.213.103.6:BGP v6 routes must be down just for once
2023-05-12 15:55:51 : --------------------------------------------------
2023-05-12 15:55:51 : =========

The culprit:

  1. [swss/syncd] remove dependency on interfaces-config.service #13084

Attempt to fix the bug:

  1. Delay mux/sflow/snmp timer after interface-config service #14506

Related to:

  1. [boot performance] long swss/syncd/teamd start due to interfaces-config.service #12948
  2. Log messages missed at boot for containers not depending on interfaces-config.service #14444
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants