Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNMP service started and triggered SWSS start in the middle of warm-reboot #2750

Closed
stepanblyschak opened this issue Apr 5, 2019 · 2 comments
Assignees
Labels

Comments

@stepanblyschak
Copy link
Collaborator

Description
SNMP service start is delayed for 3 min and depends on SWSS service.
When issuing sudo warm-reboot after daemons reconciled state was reached (~2 min after system start) SNMP service may wake up in the middle of warm-reboot script, after SWSS was killed by warm-reboot script, and start SWSS container again. This causes warm reboot failure and system becomes in inconsistent state.

See syslog messages from "Apr 5 13:24:46.321628" to "Apr 5 13:25:50.512102"

Apr  5 13:25:03.783931 arc-switch1025 NOTICE syncd#syncd: :- syncd_main: drained queue
Apr  5 13:25:03.783931 arc-switch1025 NOTICE syncd#syncd: :- handleRestartQuery: received WARM switch shutdown event
Apr  5 13:25:03.783931 arc-switch1025 NOTICE syncd#syncd: :- profile_get_value: SAI_WARM_BOOT_WRITE_FILE: /var/warmboot/
Apr  5 13:25:03.783931 arc-switch1025 NOTICE syncd#syncd: :- syncd_main: using warmBootWriteFile: '/var/warmboot/'
Apr  5 13:25:03.783931 arc-switch1025 NOTICE syncd#syncd: :- syncd_main: Warm Reboot requested, keeping data plane running
Apr  5 13:25:03.783931 arc-switch1025 NOTICE syncd#syncd: :- syncd_main: Removing the switch gSwitchId=0x100000021
Apr  5 13:25:03.783931 arc-switch1025 NOTICE syncd#syncd: :- syncd_main: Fast/warm reboot requested, keeping data plane running
Apr  5 13:25:03.784348 arc-switch1025 INFO syncd#supervisord: syncd Apr 05 13:25:03 NOTICE  SAI_UTILS: mlnx_sai_utils.c[2307]- set_dispatch_attrib_handler: Set RESTART_WARM, key:Switch ID 1, val:true
Apr  5 13:25:03.784542 arc-switch1025 INFO syncd#supervisord: syncd Apr 05 13:25:03 NOTICE  SAI_UTILS: mlnx_sai_utils.c[2307]- set_dispatch_attrib_handler: Set UNINIT_DATA_PLANE_ON_REMOVAL, key:Switch ID 1, val:false
Apr  5 13:25:03.784542 arc-switch1025 INFO syncd#supervisord: syncd Apr 05 13:25:03 NOTICE  SAI_SWITCH: mlnx_sai_switch.c[5041]- mlnx_disconnect_switch: Disconnect switch
Apr  5 13:25:03.784896 arc-switch1025 INFO syncd.sh[12173]: requested WARM shutdown
Apr  5 13:25:03.785978 arc-switch1025 NOTICE syncd#syncd: :- syncd_main: remove switch took 0.001976 sec
Apr  5 13:25:03.786457 arc-switch1025 NOTICE syncd#syncd: :- syncd_main: calling api uninitialize
Apr  5 13:25:03.786457 arc-switch1025 NOTICE syncd#syncd: :- syncd_main: uninitialize finished
Apr  5 13:25:04.011900 arc-switch1025 INFO teamd#supervisord 2019-04-05 13:25:02,752 INFO reaped unknown pid 39
Apr  5 13:25:04.011900 arc-switch1025 INFO teamd#supervisord 2019-04-05 13:25:02,753 INFO reaped unknown pid 31
Apr  5 13:25:04.011900 arc-switch1025 INFO teamd#supervisord 2019-04-05 13:25:02,754 INFO reaped unknown pid 23
Apr  5 13:25:04.011900 arc-switch1025 INFO teamd#supervisord 2019-04-05 13:25:02,756 INFO reaped unknown pid 47
Apr  5 13:25:04.325406 arc-switch1025 NOTICE root: Finished warm shutdown syncd process ...
Apr  5 13:25:13.427753 arc-switch1025 INFO syncd#supervisord 2019-04-05 13:25:03,801 INFO exited: syncd (exit status 0; expected)
Apr  5 13:25:13.427753 arc-switch1025 INFO syncd#supervisord 2019-04-05 13:25:04,479 WARN received SIGTERM indicating exit request
Apr  5 13:25:13.427753 arc-switch1025 INFO syncd#supervisord 2019-04-05 13:25:04,479 INFO waiting for rsyslogd, mlnx-sfpd to die
Apr  5 13:25:13.427753 arc-switch1025 INFO syncd#supervisord 2019-04-05 13:25:07,484 INFO waiting for rsyslogd, mlnx-sfpd to die
Apr  5 13:25:13.427753 arc-switch1025 INFO syncd#supervisord 2019-04-05 13:25:10,488 INFO waiting for rsyslogd, mlnx-sfpd to die
Apr  5 13:25:14.482720 arc-switch1025 INFO dockerd[1680]: time="2019-04-05T13:25:14.482018945Z" level=info msg="Container f214ce00b9045dd7edba5c441a05fa747bb2f12936aaf7baaa29665f093ec8bf failed to exit within 10 seconds of signal 15 - us
ing the force"
Apr  5 13:25:14.746779 arc-switch1025 INFO containerd[1678]: time="2019-04-05T13:25:14.746534087Z" level=info msg="shim reaped" id=f214ce00b9045dd7edba5c441a05fa747bb2f12936aaf7baaa29665f093ec8bf
Apr  5 13:25:14.756912 arc-switch1025 INFO dockerd[1680]: time="2019-04-05T13:25:14.756756411Z" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Apr  5 13:25:14.812920 arc-switch1025 INFO syncd.sh[12173]: syncd
Apr  5 13:25:14.813636 arc-switch1025 INFO syncd.sh[6050]: 137
Apr  5 13:25:14.822741 arc-switch1025 NOTICE root: Stopped syncd service...
Apr  5 13:25:14.829094 arc-switch1025 NOTICE root: Unlocking /tmp/swss-syncd-lock (10) from syncd service
Apr  5 13:25:14.842953 arc-switch1025 INFO systemd[1]: Stopped syncd service.
Apr  5 13:25:43.154294 arc-switch1025 INFO systemd[1]: Starting switch state service...
Apr  5 13:25:43.164059 arc-switch1025 NOTICE root: Starting swss service...
Apr  5 13:25:43.171856 arc-switch1025 NOTICE root: Locking /tmp/swss-syncd-lock from swss service
Apr  5 13:25:43.181013 arc-switch1025 NOTICE root: Locked /tmp/swss-syncd-lock (10) from swss service
Apr  5 13:25:44.918430 arc-switch1025 NOTICE root: Warm boot flag: swss true.
Apr  5 13:25:46.137482 arc-switch1025 INFO swss.sh[12753]: Starting existing swss container with HWSKU ACS-MSN2700
Apr  5 13:25:46.244446 arc-switch1025 INFO containerd[1678]: time="2019-04-05T13:25:46.242046842Z" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/d8fdfb3a13a185551f59650deea154bd6144adeb2c030377374133b58b7b2
54c/shim.sock" debug=false pid=13130
Apr  5 13:25:46.502350 arc-switch1025 INFO swss.sh[12753]: swss
Apr  5 13:25:47.386488 arc-switch1025 NOTICE root: Started swss service...
Apr  5 13:25:47.399593 arc-switch1025 NOTICE root: Unlocking /tmp/swss-syncd-lock (10) from swss service
Apr  5 13:25:47.412764 arc-switch1025 INFO systemd[1]: Started switch state service.
Apr  5 13:25:47.413771 arc-switch1025 INFO systemd[1]: Starting SNMP container...
Apr  5 13:25:49.406666 arc-switch1025 INFO snmp.sh[13325]: Starting existing snmp container with HWSKU ACS-MSN2700
Apr  5 13:25:49.575907 arc-switch1025 INFO containerd[1678]: time="2019-04-05T13:25:49.575188155Z" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/13a2ea7a84dfe96b70d702893c561f4cd8732f50be9552937f1f169060951c25/shim.sock" debug=false pid=13547
Apr  5 13:25:49.949186 arc-switch1025 INFO snmp.sh[13325]: snmp

Steps to reproduce the issue:

  1. execute sudo warm-reboot few times

Describe the results you received:
See description

Describe the results you expected:
SNMP start does not mess with warm-reboot

Additional information you deem important (e.g. issue happens only occasionally):

**Output of `show version`:**
SONiC Software Version: SONiC.HEAD.42-5c663ca
Distribution: Debian 9.8
Kernel: 4.9.0-8-amd64
Build commit: 5c663ca
Build date: Fri Apr  5 04:51:41 UTC 2019
Built by: johnar@jenkins-worker-3

Docker images:
REPOSITORY                 TAG                 IMAGE ID            SIZE
docker-orchagent-mlnx      HEAD.42-5c663ca     92227cd6d9e5        286MB
docker-orchagent-mlnx      latest              92227cd6d9e5        286MB
docker-syncd-mlnx          HEAD.42-5c663ca     213b01ae948a        331MB
docker-syncd-mlnx          latest              213b01ae948a        331MB
docker-lldp-sv2            HEAD.42-5c663ca     27eb551cdb9e        274MB
docker-lldp-sv2            latest              27eb551cdb9e        274MB
docker-dhcp-relay          HEAD.42-5c663ca     c918d4fb6c35        256MB
docker-dhcp-relay          latest              c918d4fb6c35        256MB
docker-database            HEAD.42-5c663ca     20c13f3c77c4        255MB
docker-database            latest              20c13f3c77c4        255MB
docker-teamd               HEAD.42-5c663ca     a29189c45fe7        274MB
docker-teamd               latest              a29189c45fe7        274MB
docker-snmp-sv2            HEAD.42-5c663ca     c102aa38ddf5        294MB
docker-snmp-sv2            latest              c102aa38ddf5        294MB
docker-router-advertiser   HEAD.42-5c663ca     c46d7f87c715        254MB
docker-router-advertiser   latest              c46d7f87c715        254MB
docker-platform-monitor    HEAD.42-5c663ca     6784680cc095        286MB
docker-platform-monitor    latest              6784680cc095        286MB
docker-fpm-quagga          HEAD.42-5c663ca     d26abe5b1234        281MB
docker-fpm-quagga          latest              d26abe5b1234        281MB
**Attach debug file `sudo generate_dump`:**

```

syslog.txt

(paste your output here)
```
@yxieca
Copy link
Contributor

yxieca commented Apr 9, 2019

@stepanblyschak great finding! Should we proactively stop snmp before warm reboot?

@yxieca
Copy link
Contributor

yxieca commented Apr 10, 2019

It is hard to solve this issue unless we have an indication of system boot up is done.

This timer is a one time timer since boot up. snmp services wait for 3.5 minutes before starting service.

Can we fix snmp service to not depending on the delay? This delay is a result of delaying creating counters. why did we delay creating counters? @pavel-shirshov can you fill in the information?

Per discussion in warm reboot meeting, we would like to see if we could remove this 3.5 minutes delay.

  • do we have counters db ready flag?
  • do not log error/warming messages until counter is ready.

@stepanblyschak will drive the action items.

dgsudharsan added a commit to dgsudharsan/sonic-buildimage that referenced this issue Jul 11, 2023
Update sonic-utilities submodule pointer to include the following:
* ff380e04 [hash]: Implement GH frontend ([sonic-net#2580](sonic-net/sonic-utilities#2580))
* 61bad064 [db_migrator] Set correct CURRENT_VERSION, extend UT ([sonic-net#2895](sonic-net/sonic-utilities#2895))
* 6b8ee47c [CLI][Show][BGP] Show BGP Change for no neighbor scenario ([sonic-net#2885](sonic-net/sonic-utilities#2885))
* 73d8d633 [doc] Update Command-Reference.md, change show bgp peer command to show bfd peer ([sonic-net#2750](sonic-net/sonic-utilities#2750))
* 7bc08c28 [db_migrator] Remove hardcoded config and migrate config from minigraph ([sonic-net#2887](sonic-net/sonic-utilities#2887))
* b1aa9426 [generate_dump]: Enhance show techsupport for Marvell platform ([sonic-net#2676](sonic-net/sonic-utilities#2676))
* 316b14c0 Add support for secure upgrade ([sonic-net#2698](sonic-net/sonic-utilities#2698))
* dc2945bc [dns] Implement config and show commands for static DNS. ([sonic-net#2737](sonic-net/sonic-utilities#2737))
* 8414a709 [chassis][multi asic] change acl_loader to use tcp socket for db communication ([sonic-net#2525](sonic-net/sonic-utilities#2525))
* 0b629ba1 Revert [chassis][voq] Clear fabric counters queue/port (2789) ([sonic-net#2882](sonic-net/sonic-utilities#2882))
* 3ba8241a [db_migtrator] Add migration of FLEX_COUNTER_DELAY_STATUS during 1911->master upgrade + fast-reboot. Add UT. ([sonic-net#2839](sonic-net/sonic-utilities#2839))
* fceef2ed [chassis][voq] Clear fabric counters queue/port ([sonic-net#2789](sonic-net/sonic-utilities#2789))

Signed-off-by: dgsudharsan <sudharsand@nvidia.com>
liat-grozovik pushed a commit that referenced this issue Jul 11, 2023
Update sonic-utilities submodule pointer to include the following:
* ff380e04 [hash]: Implement GH frontend ([#2580](sonic-net/sonic-utilities#2580))
* 61bad064 [db_migrator] Set correct CURRENT_VERSION, extend UT ([#2895](sonic-net/sonic-utilities#2895))
* 6b8ee47c [CLI][Show][BGP] Show BGP Change for no neighbor scenario ([#2885](sonic-net/sonic-utilities#2885))
* 73d8d633 [doc] Update Command-Reference.md, change show bgp peer command to show bfd peer ([#2750](sonic-net/sonic-utilities#2750))
* 7bc08c28 [db_migrator] Remove hardcoded config and migrate config from minigraph ([#2887](sonic-net/sonic-utilities#2887))
* b1aa9426 [generate_dump]: Enhance show techsupport for Marvell platform ([#2676](sonic-net/sonic-utilities#2676))
* 316b14c0 Add support for secure upgrade ([#2698](sonic-net/sonic-utilities#2698))
* dc2945bc [dns] Implement config and show commands for static DNS. ([#2737](sonic-net/sonic-utilities#2737))
* 8414a709 [chassis][multi asic] change acl_loader to use tcp socket for db communication ([#2525](sonic-net/sonic-utilities#2525))
* 0b629ba1 Revert [chassis][voq] Clear fabric counters queue/port (2789) ([#2882](sonic-net/sonic-utilities#2882))
* 3ba8241a [db_migtrator] Add migration of FLEX_COUNTER_DELAY_STATUS during 1911->master upgrade + fast-reboot. Add UT. ([#2839](sonic-net/sonic-utilities#2839))
* fceef2ed [chassis][voq] Clear fabric counters queue/port ([#2789](sonic-net/sonic-utilities#2789))

Signed-off-by: dgsudharsan <sudharsand@nvidia.com>
mssonicbld added a commit that referenced this issue Jul 11, 2023
…atically (#15456)

#### Why I did it
src/sonic-utilities
```
* ff380e04 - (HEAD -> master, origin/master, origin/HEAD) [hash]: Implement GH frontend (#2580) (13 hours ago) [Nazarii Hnydyn]
* 61bad064 - [db_migrator] Set correct CURRENT_VERSION, extend UT (#2895) (4 days ago) [Vadym Hlushko]
* 6b8ee47c - [CLI][Show][BGP] Show BGP Change for no neighbor scenario (#2885) (6 days ago) [Dev Ojha]
* 73d8d633 - [doc] Update Command-Reference.md, change "show bgp peer" command to "show bfd peer" (#2750) (11 days ago) [PinghaoQu]
* 7bc08c28 - [db_migrator] Remove hardcoded config and migrate config from minigraph (#2887) (11 days ago) [Vaibhav Hemant Dixit]
* b1aa9426 - [generate_dump]: Enhance show techsupport for Marvell platform (#2676) (11 days ago) [pavannaregundi]
* 316b14c0 - Add support for secure upgrade (#2698) (2 weeks ago) [ycoheNvidia]
* dc2945bc - [dns] Implement config and show commands for static DNS. (#2737) (2 weeks ago) [Oleksandr Ivantsiv]
* 8414a709 - [chassis][multi asic] change acl_loader to use tcp socket for db communication (#2525) (2 weeks ago) [Arvindsrinivasan Lakshmi Narasimhan]
* 0b629ba1 - Revert "[chassis][voq] Clear fabric counters queue/port (#2789)" (#2882) (3 weeks ago) [RoRonoa]
* 3ba8241a - [db_migtrator] Add migration of FLEX_COUNTER_DELAY_STATUS during 1911->master upgrade + fast-reboot. Add UT. (#2839) (4 weeks ago) [Vadym Hlushko]
* fceef2ed - [chassis][voq] Clear fabric counters queue/port (#2789) (4 weeks ago) [jfeng-arista]
```
#### How I did it
#### How to verify it
#### Description for the changelog
sonic-otn pushed a commit to sonic-otn/sonic-buildimage that referenced this issue Sep 20, 2023
Update sonic-utilities submodule pointer to include the following:
* ff380e04 [hash]: Implement GH frontend ([sonic-net#2580](sonic-net/sonic-utilities#2580))
* 61bad064 [db_migrator] Set correct CURRENT_VERSION, extend UT ([sonic-net#2895](sonic-net/sonic-utilities#2895))
* 6b8ee47c [CLI][Show][BGP] Show BGP Change for no neighbor scenario ([sonic-net#2885](sonic-net/sonic-utilities#2885))
* 73d8d633 [doc] Update Command-Reference.md, change show bgp peer command to show bfd peer ([sonic-net#2750](sonic-net/sonic-utilities#2750))
* 7bc08c28 [db_migrator] Remove hardcoded config and migrate config from minigraph ([sonic-net#2887](sonic-net/sonic-utilities#2887))
* b1aa9426 [generate_dump]: Enhance show techsupport for Marvell platform ([sonic-net#2676](sonic-net/sonic-utilities#2676))
* 316b14c0 Add support for secure upgrade ([sonic-net#2698](sonic-net/sonic-utilities#2698))
* dc2945bc [dns] Implement config and show commands for static DNS. ([sonic-net#2737](sonic-net/sonic-utilities#2737))
* 8414a709 [chassis][multi asic] change acl_loader to use tcp socket for db communication ([sonic-net#2525](sonic-net/sonic-utilities#2525))
* 0b629ba1 Revert [chassis][voq] Clear fabric counters queue/port (2789) ([sonic-net#2882](sonic-net/sonic-utilities#2882))
* 3ba8241a [db_migtrator] Add migration of FLEX_COUNTER_DELAY_STATUS during 1911->master upgrade + fast-reboot. Add UT. ([sonic-net#2839](sonic-net/sonic-utilities#2839))
* fceef2ed [chassis][voq] Clear fabric counters queue/port ([sonic-net#2789](sonic-net/sonic-utilities#2789))

Signed-off-by: dgsudharsan <sudharsand@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants