Closed
Description
After warm-reboot:
- Processes start doing warm start.
- The common warm-start states are `initialized` -> `replayed` -> `reconciled`.
- `vxlanmgrd` is one such process, but it keeps waiting (for > 10 mins) for orchagent to reconcile, which never happens.
- `WARMBOOT_FINALIZER` reports that orchagent is not reconciled, but goes ahead with `Finalizing warmboot`.
- After some time, a new warm reboot is issued, but the `RESTARTCHECK` times out on all 5 (the maximum allowed) retries because orchagent was never reconciled after the previous warmboot.
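The sequence above can be modeled roughly as follows. This is only an illustrative sketch, not SONiC code: `WarmStartState`, `unreconciled`, and `restart_check` are hypothetical names, the state assigned to orchagent is an assumption (the report only says it is not reconciled), and the responder simply refuses to answer while orchagent is pending.

```python
from enum import Enum


class WarmStartState(Enum):
    """Warm-start states named in this report (hypothetical model)."""
    INITIALIZED = "initialized"
    REPLAYED = "replayed"
    RECONCILED = "reconciled"


def unreconciled(states):
    """Return the names of components that have not reached 'reconciled'."""
    return sorted(
        name for name, state in states.items()
        if state is not WarmStartState.RECONCILED
    )


def restart_check(ask, max_retries=5):
    """Call ask(attempt) up to max_retries times.

    Returns (True, attempt) on the first positive reply,
    or (False, max_retries) if every attempt goes unanswered.
    """
    for attempt in range(1, max_retries + 1):
        if ask(attempt):
            return True, attempt
    return False, max_retries


states = {
    "orchagent": WarmStartState.INITIALIZED,  # assumed; the report only says "not reconciled"
    "vxlanmgrd": WarmStartState.REPLAYED,     # stuck waiting for orchagent
}
pending = unreconciled(states)

# Hypothetical responder: no positive reply while orchagent is still pending.
ok, attempts = restart_check(lambda attempt: "orchagent" not in pending)
print(pending, ok, attempts)
```

In this model the finalizer would list both components as pending, and the next restart check exhausts all 5 retries, which matches the shape of the failure described here.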
Steps to reproduce the issue:
- Run `test_cont_warm_reboot` (the error was seen on KVM).
- The issue is seen after 40 successful iterations.
- Check syslog after the failure: warmboot failed because the orchagent (OA) RESTARTCHECK failed.
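For the syslog check in the last step, a small scanner along these lines may help. It is a sketch only: the two message patterns are copied from the log excerpt in this report, while `SIGNATURES` and `scan` are hypothetical helper names, and the sample lines stand in for a real syslog file.

```python
import re

# Message patterns taken from the syslog excerpt in this report.
SIGNATURES = {
    "finalizer_unreconciled": re.compile(
        r"WARMBOOT_FINALIZER : Some components didn't finish reconcile: (.+)"
    ),
    "restartcheck_timeout": re.compile(
        r"orchagent_restart_check: :- main: RESTARTCHECK for\s+timed out"
    ),
}


def scan(lines):
    """Group syslog lines by the failure signature they match."""
    hits = {name: [] for name in SIGNATURES}
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                hits[name].append(line.rstrip("\n"))
    return hits


sample = [
    "Feb 11 13:38:06.515441 vlab-01 NOTICE root: WARMBOOT_FINALIZER : Some components didn't finish reconcile: orchagent ...",
    "Feb 11 13:38:06.523556 vlab-01 NOTICE root: WARMBOOT_FINALIZER : Finalizing warmboot...",
    "Feb 11 13:42:06.910388 vlab-01 NOTICE swss#orchagent_restart_check: :- main: RESTARTCHECK for timed out",
    "Feb 11 13:42:08.921712 vlab-01 NOTICE swss#orchagent_restart_check: :- main: RESTARTCHECK for timed out",
    "Feb 11 13:42:10.924137 vlab-01 NOTICE swss#orchagent_restart_check: :- main: RESTARTCHECK for timed out",
]
hits = scan(sample)
for name, matched in hits.items():
    print(name, len(matched))
```

To run it against a real log, pass an open file object instead of `sample`, for example `scan(open(path))`, where `path` is wherever the syslog was saved.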
Describe the results you received:
The error was caught by `test_cont_warm_reboot` in a KVM test. Artifacts are here: https://dev.azure.com/mssonic/build/_build/results?buildId=3698&view=artifacts&pathAsName=false&type=publishedArtifacts
The failure was in the 42nd iteration.
Feb 11 13:32:53.221127 vlab-01 NOTICE swss#orchagent: :- checkWarmStart: orchagent doing warm start, restore count 40
Feb 11 13:32:56.663604 vlab-01 INFO swss#supervisord 2021-02-11 13:32:56,647 INFO spawned: 'vxlanmgrd' with pid 147
Feb 11 13:32:56.786068 vlab-01 NOTICE swss#vxlanmgrd: :- main: --- Starting vxlanmgrd ---
Feb 11 13:32:56.786622 vlab-01 NOTICE swss#vxlanmgrd: :- checkWarmStart: vxlanmgrd doing warm start, restore count 40
Feb 11 13:32:56.793019 vlab-01 NOTICE swss#vxlanmgrd: :- setWarmStartState: vxlanmgrd warm start state changed to initialized
Feb 11 13:32:57.649618 vlab-01 INFO swss#supervisord 2021-02-11 13:32:57,648 INFO success: vxlanmgrd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
Feb 11 13:33:02.033737 vlab-01 NOTICE swss#vxlanmgrd: :- setWarmStartState: vxlanmgrd warm start state changed to replayed
Feb 11 13:33:02.034131 vlab-01 NOTICE swss#vxlanmgrd: :- main: Waiting Until Orchagent is reconciled. Current 40. Waited 0 secs
Feb 11 13:33:03.037357 vlab-01 NOTICE swss#vxlanmgrd: :- main: Waiting Until Orchagent is reconciled. Current 40. Waited 1 secs
Feb 11 13:33:04.042161 vlab-01 NOTICE swss#vxlanmgrd: :- main: Waiting Until Orchagent is reconciled. Current 40. Waited 2 secs
..
Feb 11 13:38:06.515441 vlab-01 NOTICE root: WARMBOOT_FINALIZER : Some components didn't finish reconcile: orchagent ...
Feb 11 13:38:06.523556 vlab-01 NOTICE root: WARMBOOT_FINALIZER : Finalizing warmboot...
Feb 11 13:38:07.124812 vlab-01 INFO systemd[1]: warmboot-finalizer.service: Succeeded.
..
Feb 11 13:41:54.584331 vlab-01 NOTICE admin: Saving counters folder before warmboot...
Feb 11 13:41:58.865214 vlab-01 NOTICE swss#orchagent_restart_check: :- main: Wait time for response from orchagent set to 2000 milliseconds
Feb 11 13:41:58.865214 vlab-01 NOTICE swss#orchagent_restart_check: :- main: Number of retries for the request to orchagent is set to 5
Feb 11 13:41:58.868188 vlab-01 INFO swss#orchagent_restart_check: :- subscribe: subscribed to RESTARTCHECKREPLY
Feb 11 13:42:06.910388 vlab-01 NOTICE swss#orchagent_restart_check: :- main: RESTARTCHECK for timed out
Feb 11 13:42:06.919690 vlab-01 NOTICE swss#orchagent_restart_check: :- main: requested orchagent to do warm restart state check, retry count: 4
Feb 11 13:42:07.266429 vlab-01 NOTICE swss#vxlanmgrd: :- main: Waiting Until Orchagent is reconciled. Current 40. Waited 543 secs
Feb 11 13:42:08.267817 vlab-01 NOTICE swss#vxlanmgrd: :- main: Waiting Until Orchagent is reconciled. Current 40. Waited 544 secs
Feb 11 13:42:08.921712 vlab-01 NOTICE swss#orchagent_restart_check: :- main: RESTARTCHECK for timed out
Feb 11 13:42:08.925296 vlab-01 NOTICE swss#orchagent_restart_check: :- main: requested orchagent to do warm restart state check, retry count: 5
Feb 11 13:42:09.269244 vlab-01 NOTICE swss#vxlanmgrd: :- main: Waiting Until Orchagent is reconciled. Current 40. Waited 545 secs
Feb 11 13:42:10.270319 vlab-01 NOTICE swss#vxlanmgrd: :- main: Waiting Until Orchagent is reconciled. Current 40. Waited 546 secs
Feb 11 13:42:10.924137 vlab-01 NOTICE swss#orchagent_restart_check: :- main: RESTARTCHECK for timed out
Feb 11 13:42:11.137017 vlab-01 NOTICE admin: warm-reboot failure (0) cleanup ...
..
..
Feb 11 13:46:44.864286 vlab-01 NOTICE swss#vxlanmgrd: :- main: Waiting Until Orchagent is reconciled. Current 40. Waited 819 secs
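To quantify how long a process has been stuck, the `Waited N secs` messages above can be reduced to a per-process maximum. The following is a minimal sketch that assumes the message format shown in this excerpt; `WAIT_RE` and `longest_wait` are hypothetical names.

```python
import re

# Format assumed from the vxlanmgrd messages in this report.
WAIT_RE = re.compile(
    r"swss#(\w+): :- main: Waiting Until Orchagent is reconciled\. "
    r"Current (\d+)\. Waited (\d+) secs"
)


def longest_wait(lines):
    """Return {process: longest reported wait in seconds}."""
    waits = {}
    for line in lines:
        match = WAIT_RE.search(line)
        if match:
            process, seconds = match.group(1), int(match.group(3))
            waits[process] = max(waits.get(process, 0), seconds)
    return waits


sample = [
    "Feb 11 13:33:02.034131 vlab-01 NOTICE swss#vxlanmgrd: :- main: Waiting Until Orchagent is reconciled. Current 40. Waited 0 secs",
    "Feb 11 13:42:07.266429 vlab-01 NOTICE swss#vxlanmgrd: :- main: Waiting Until Orchagent is reconciled. Current 40. Waited 543 secs",
    "Feb 11 13:46:44.864286 vlab-01 NOTICE swss#vxlanmgrd: :- main: Waiting Until Orchagent is reconciled. Current 40. Waited 819 secs",
]
waits = longest_wait(sample)
print(waits)
```

For the lines shown here the longest wait is 819 seconds, a little over 13 minutes, which is consistent with the "> 10 mins" figure in the description.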
Describe the results you expected:
Services should reconcile after warm start, and the orchagent RESTARTCHECK should not fail when a warmboot is issued.
Output of `show version`:
SONiC-OS-HEAD.0-11937d37