Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[KVM][warm reboot] syncd crash when getting virtual router ID #6509

Closed
shi-su opened this issue Jan 20, 2021 · 10 comments · Fixed by sonic-net/sonic-utilities#1460
Closed

[KVM][warm reboot] syncd crash when getting virtual router ID #6509

shi-su opened this issue Jan 20, 2021 · 10 comments · Fixed by sonic-net/sonic-utilities#1460

Comments

@shi-su
Copy link
Contributor

shi-su commented Jan 20, 2021

Description

Steps to reproduce the issue:

  1. Run warm reboot test in KVM (This issue happens by chance at a probability of about one occurrence every 20 attempts)

Describe the results you received:
Syncd fails when getting the default virtual router ID in warm boot.

Recording

2021-01-13.18:58:55.784148|#|recording on: /var/log/swss/sairedis.rec
2021-01-13.18:58:55.784236|#|logrotate on: /var/log/swss/sairedis.rec
2021-01-13.18:58:55.784386|a|INIT_VIEW
2021-01-13.18:58:55.793739|A|SAI_STATUS_SUCCESS
2021-01-13.18:58:55.810958|c|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_INIT_SWITCH=true|SAI_SWITCH_ATTR_FDB_EVENT_NOTIFY=0x55ceed8d25b0|SAI_SWITCH_ATTR_PORT_STATE_CHANGE_NOTIFY=0x55ceed8d25c0|SAI_SWITCH_ATTR_SWITCH_SHUTDOWN_REQUEST_NOTIFY=0x55ceed8d25d0|SAI_SWITCH_ATTR_SRC_MAC_ADDRESS=52:54:00:C6:F2:50
2021-01-13.18:58:55.817467|g|SAI_OBJECT_TYPE_SWITCH:oid:0x21000000000000|SAI_SWITCH_ATTR_DEFAULT_VIRTUAL_ROUTER_ID=oid:0x0
2021-01-13.18:58:55.822074|n|switch_shutdown_request|{"switch_id":"oid:0x21000000000000"}|
2021-01-13.18:59:55.841119|G|SAI_STATUS_FAILURE|

Syslog

Jan 13 18:58:55.777371 vlab-01 NOTICE swss#orchagent: :- main: Enabling synchronous mode
Jan 13 18:58:55.777442 vlab-01 NOTICE swss#orchagent: :- main: --- Starting Orchestration Agent ---
Jan 13 18:58:55.777467 vlab-01 NOTICE swss#orchagent: :- loadFromFile: no context config specified, will load default context config
Jan 13 18:58:55.777516 vlab-01 NOTICE swss#orchagent: :- insert: added switch: 0:
Jan 13 18:58:55.783204 vlab-01 NOTICE swss#orchagent: :- RedisChannel: creating notification thread
Jan 13 18:58:55.783768 vlab-01 NOTICE swss#orchagent: :- clear_local_state: clearing local state
Jan 13 18:58:55.784199 vlab-01 NOTICE swss#orchagent: :- stopRecording: stopped recording
Jan 13 18:58:55.784384 vlab-01 NOTICE swss#orchagent: :- startRecording: started recording: sairedis.rec
Jan 13 18:58:55.784501 vlab-01 NOTICE swss#orchagent: :- initSaiRedis: Enable redis pipeline
Jan 13 18:58:55.784549 vlab-01 NOTICE swss#orchagent: :- notifySyncd: sending syncd: INIT_VIEW
Jan 13 18:58:55.785391 vlab-01 NOTICE syncd#syncd: :- clearTempView: clearing current TEMP VIEW
Jan 13 18:58:55.790450 vlab-01 NOTICE syncd#syncd: :- clearTempView: clear temp view took 0.001164 sec
Jan 13 18:58:55.790591 vlab-01 WARNING syncd#syncd: :- processNotifySyncd: syncd switched to INIT VIEW mode, all op will be saved to TEMP view
Jan 13 18:58:55.794226 vlab-01 NOTICE swss#orchagent: :- sai_redis_notify_syncd: switched ASIC to INIT VIEW
Jan 13 18:58:55.794540 vlab-01 NOTICE swss#orchagent: :- sai_redis_notify_syncd: clearing current local state since init view is called on initialized switch
Jan 13 18:58:55.794573 vlab-01 NOTICE swss#orchagent: :- clear_local_state: clearing local state
Jan 13 18:58:55.794600 vlab-01 NOTICE swss#orchagent: :- meta_init_db: begin
Jan 13 18:58:55.795347 vlab-01 NOTICE swss#orchagent: :- meta_init_db: end
Jan 13 18:58:55.795421 vlab-01 NOTICE swss#orchagent: :- initSaiRedis: Notify syncd INIT_VIEW
Jan 13 18:58:55.806499 vlab-01 WARNING swss#orchagent: :- main: sync mode is depreacated, use -z param
Jan 13 18:58:55.807044 vlab-01 NOTICE swss#orchagent: :- ~RedisChannel: join ntf thread begin
Jan 13 18:58:55.807081 vlab-01 NOTICE swss#orchagent: :- ~RedisChannel: join ntf thread end
Jan 13 18:58:55.807892 vlab-01 NOTICE swss#orchagent: :- setRedisExtensionAttribute: enabling redis sync mode
Jan 13 18:58:55.809508 vlab-01 NOTICE swss#orchagent: :- RedisChannel: creating notification thread
Jan 13 18:58:55.809957 vlab-01 NOTICE swss#orchagent: :- create: request switch create with context 0
Jan 13 18:58:55.810615 vlab-01 NOTICE swss#orchagent: :- allocateNewSwitchObjectId: created SWITCH VID oid:0x21000000000000 for hwinfo: ''
Jan 13 18:58:55.812741 vlab-01 NOTICE syncd#syncd: :- onSwitchCreateInInitViewMode: new switch oid:0x21000000000000 contains hardware info: ''
Jan 13 18:58:55.813541 vlab-01 NOTICE syncd#syncd: :- onSwitchCreateInInitViewMode: current oid:0x21000000000000 switch hardware info: ''
Jan 13 18:58:55.816554 vlab-01 NOTICE swss#orchagent: :- Switch: created switch with hwinfo = ''
Jan 13 18:58:55.816830 vlab-01 NOTICE swss#orchagent: :- main: Create a switch, id:9288674231451648
Jan 13 18:58:55.821338 vlab-01 ERR syncd#syncd: :- translateVidToRid: unable to get RID for VID oid:0x21000000000000
Jan 13 18:58:55.821590 vlab-01 ERR syncd#syncd: :- run: Runtime error: :- translateVidToRid: unable to get RID for VID oid:0x21000000000000
Jan 13 18:58:55.821626 vlab-01 NOTICE syncd#syncd: :- sendShutdownRequest: sending switch_shutdown_request notification to OA for switch: oid:0x21000000000000
Jan 13 18:58:55.821657 vlab-01 NOTICE syncd#syncd: :- sendShutdownRequestAfterException: notification send successfull

Describe the results you expected:
Crash free warm reboot.

Additional information you deem important (e.g. issue happens only occasionally):

**Output of `show version`:**

```
(paste your output here)
```

**Attach debug file `sudo generate_dump`:**

```
(paste your output here)
```
@kcudnik
Copy link
Contributor

kcudnik commented Jan 21, 2021

need redis monitor logs from this crash to confirm that db was cleared or not during that time

@kcudnik
Copy link
Contributor

kcudnik commented Feb 8, 2021

try using "redis-cli monitor"

@shi-su
Copy link
Contributor Author

shi-su commented Feb 8, 2021

try using "redis-cli monitor"

It seems that redis-cli monitor opens a monitor that starts to record new incoming redis activities. I doubt if we could get the history of the redis activities since running monitor seems to significantly reduce the efficiency of redis. Is it possible to get some useful information from the history without running redis-cli monitor? Otherwise, I will probably have to try reproducing this crash with redis-cli monitor running, but this could be pretty tricky.

@kcudnik
Copy link
Contributor

kcudnik commented Feb 8, 2021

if you would run this with system start in background, we could at least confirm what commands is clearing db
my proposal would be to do a loop with warm boot over night, all the time until crash

@kcudnik
Copy link
Contributor

kcudnik commented Feb 16, 2021

do you happen to have full syslog form that event ? lines before beginning of that what you pasted? there could be some hint information from syncd during startup in syslog

@shi-su
Copy link
Contributor Author

shi-su commented Feb 16, 2021

do you happen to have full syslog form that event ? lines before beginning of that what you pasted? there could be some hint information from syncd during startup in syslog
The syslog is attached below.

syslog.txt
syslog.1.txt

@kcudnik
Copy link
Contributor

kcudnik commented Feb 16, 2021

ok, i found something like this in syslog (take a look at timestamps):

Jan 13 18:58:42.716474 vlab-01 DEBUG syncd#syncd: :> syncd_main: enter // syncd starts here
Jan 13 18:58:42.739852 vlab-01 NOTICE syncd#syncd: :- isVeryFirstRun: First Run: False // db not empty
...
Jan 13 18:58:42.740905 vlab-01 NOTICE syncd#syncd: :- onSyncdStart: performing hard reinit since COLD start was performed // COLD BOOT ! not WARM
...
Jan 13 18:58:44.213496 vlab-01 NOTICE syncd#syncd: :- processSwitches: Cold boot: create switch took 0.029908 sec
Jan 13 18:58:44.213496 vlab-01 NOTICE syncd#syncd: :- processSwitches: created switch RID: oid:0x2100000000
...
Jan 13 18:58:46.567308 vlab-01 NOTICE root: Warm boot flag: swss false.
Jan 13 18:58:46.571341 vlab-01 NOTICE root: Flushing APP, ASIC, COUNTER, CONFIG, and partial STATE databases ...
Jan 13 18:58:46.644865 vlab-01 INFO swss.sh[9537]: True
...
Jan 13 18:58:55.821590 vlab-01 ERR syncd#syncd: :- run: Runtime error: :- translateVidToRid: unable to get RID for VID oid:0x21000000000000
Jan 13 18:58:55.821626 vlab-01 NOTICE syncd#syncd: :- sendShutdownRequest: sending switch_shutdown_request notification to OA for switch: oid:0x21000000000000
Jan 13 18:58:55.821657 vlab-01 NOTICE syncd#syncd: :- sendShutdownRequestAfterException: notification send successfull

root: Flushing APP, ASIC, COUNTER, CONFIG, and partial STATE databases

this will be the issue, ASIC db is cleared, but now to figure out why this happening :/
take look here: https://github.com/Azure/sonic-buildimage/blob/master/files/scripts/swss.sh#L144

so actually this operation here is COLD BOOT not warm as you suggested, since both swss.sh and syncd are started in cold boot
there is some race condition in scripts:

  • syncd started in cold boot
  • syncd is performing hardReinit, since ASIC DB is still populated from previous boot (ecreate switch, and all objects on VS)
  • syncd populates oids to local memory copy of oids
  • swss.sh flushes all databases
  • OA starts
  • OA calls init_view
  • syncd on init_view clear local memory of oids
  • syncd switches to TEMP_VIEW mode
  • OA request create_switch
  • syncd in tries to get existing switch OID from VID provided by OA, local mem is empty so goes to ASIC_DB for VIDTORID
  • ASIC_DB is empty, syncd crashes when retrieving RID for switch

If ASIC DB was empty at cold boot start, and there was no switch present, then syncd would create new switch in TEMP_VIEW

depends what operation we want in COLD boot, probably DB should be cleared before start syncd

@shi-su
Copy link
Contributor Author

shi-su commented Feb 16, 2021

It seems that either WARMBOOT_FINALIZER clears the warmboot flag in state_db.

Jan 13 18:57:50.892859 vlab-01 NOTICE admin: Saving counters folder before warmboot...
Jan 13 18:57:51.315720 vlab-01 INFO admin: Checking that ASIC configuration has not changed
Jan 13 18:57:51.643967 vlab-01 INFO admin: ASIC config unchanged, current and destination SONiC version are the same
Jan 13 18:57:52.299256 vlab-01 NOTICE admin: Pausing orchagent ...
Jan 13 18:57:52.445594 vlab-01 NOTICE swss#orchagent_restart_check: :- main: Wait time for response from orchagent set to 2000 milliseconds
Jan 13 18:57:52.445594 vlab-01 NOTICE swss#orchagent_restart_check: :- main: Number of retries for the request to orchagent is set to 5
Jan 13 18:57:52.445891 vlab-01 INFO swss#orchagent_restart_check: :- subscribe: subscribed to RESTARTCHECKREPLY
Jan 13 18:57:52.445961 vlab-01 NOTICE swss#orchagent_restart_check: :- main: requested orchagent to do warm restart state check, retry count: 0
Jan 13 18:57:52.449771 vlab-01 NOTICE swss#orchagent: :- doTask: RESTARTCHECK notification for orchagent 
Jan 13 18:57:52.449771 vlab-01 NOTICE swss#orchagent: :- doTask: orchagent|NoFreeze:false|SkipPendingTaskCheck:false
Jan 13 18:57:52.449771 vlab-01 NOTICE swss#orchagent: :- warmRestartCheck: Restart check result: READY
Jan 13 18:57:52.451145 vlab-01 NOTICE swss#orchagent_restart_check: :- main: RESTARTCHECK success, orchagent is frozen and ready for warm restart
...
Jan 13 18:57:53.205584 vlab-01 NOTICE admin: Warm boot flag: radv true.
...
Jan 13 18:57:53.883548 vlab-01 NOTICE admin: Warm boot flag: bgp true.
...
Jan 13 18:57:58.309133 vlab-01 NOTICE root: WARMBOOT_FINALIZER : Tearing down control plane assistant ...
Jan 13 18:58:00.537181 vlab-01 INFO snmp#supervisord 2021-01-13 18:58:00,533 INFO waiting for supervisor-proc-exit-listener, rsyslogd, snmpd, snmp-subagent to die
Jan 13 18:58:01.662079 vlab-01 NOTICE root: WARMBOOT_FINALIZER : Save in-memory database after warm reboot ...
Jan 13 18:58:02.049779 vlab-01 INFO finalize-warmboot.sh[826]: Running command: /usr/local/bin/sonic-cfggen -d --print-data > /etc/sonic/config_db.json
Jan 13 18:58:02.469173 vlab-01 NOTICE root: WARMBOOT_FINALIZER : Some components didn't finish reconcile: natsyncd ...
Jan 13 18:58:02.474471 vlab-01 NOTICE root: WARMBOOT_FINALIZER : Finalizing warmboot... // this clears the warmboot flag state db
...
Jan 13 18:58:06.340889 vlab-01 INFO systemd[1]: Stopping switch state service...
Jan 13 18:58:06.345660 vlab-01 NOTICE root: Stopping swss service...
Jan 13 18:58:06.351663 vlab-01 NOTICE root: Locking /tmp/swss-syncd-lock from swss service
Jan 13 18:58:06.362103 vlab-01 NOTICE root: Locked /tmp/swss-syncd-lock (10) from swss service
Jan 13 18:58:06.560909 vlab-01 NOTICE root: Warm boot flag: swss false. // should be true here
Jan 13 18:58:06.673867 vlab-01 NOTICE root: Fast boot flag: swss false.

@shi-su
Copy link
Contributor Author

shi-su commented Feb 17, 2021

I think I understood the cause of this issue. There was another issue #6383 that warm reboot finalizer waits for not enabled component for 5 minutes. This causes warm reboot finalizer to clear warmboot flag in state_db after 5 minutes after the switch boots up. And the second warm reboot request happens right before clearing the flag in state_db, the flag might change when things go down, and some services might get a wrong shutdown type.

To address this, I think we need to check if warm reboot finalizer is still in progress when requesting a warm reboot.

@kcudnik
Copy link
Contributor

kcudnik commented Feb 17, 2021

we can sync up with Ying on that too

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants