[Warmboot] Error occurring intermittently while executing warmboot command #5439

Closed
prsunny opened this issue Sep 23, 2020 · 12 comments

@prsunny
Contributor

prsunny commented Sep 23, 2020

Description
The following error is observed intermittently while executing the warm-reboot command:

admin@str-sn3800-01:~$ sudo warm-reboot 
Error response from daemon: Cannot kill container: nat: No such container: nat
Failed to stop nat.service: Unit nat.service not loaded.
Warning: Stopping telemetry.service, but it can still be activated by:
  telemetry.timer
Traceback (most recent call last):
  File "/usr/local/bin/sonic-cfggen", line 376, in <module>
    main()
  File "/usr/local/bin/sonic-cfggen", line 293, in main
    configdb.connect()
  File "/usr/local/lib/python2.7/dist-packages/swsssdk/configdb.py", line 74, in connect
    self.db_connect('CONFIG_DB', wait_for_init, retry_on)
  File "/usr/local/lib/python2.7/dist-packages/swsssdk/configdb.py", line 69, in db_connect
    SonicV2Connector.connect(self, self.db_name, retry_on)
  File "/usr/local/lib/python2.7/dist-packages/swsssdk/dbconnector.py", line 250, in connect
    self.dbintf.connect(db_id, retry_on)
  File "/usr/local/lib/python2.7/dist-packages/swsssdk/interface.py", line 171, in connect
    self._onetime_connect(db_id)
  File "/usr/local/lib/python2.7/dist-packages/swsssdk/interface.py", line 183, in _onetime_connect
    client.config_set('notify-keyspace-events', self.KEYSPACE_EVENTS)
  File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 719, in config_set
    return self.execute_command('CONFIG SET', name, value)
  File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 673, in execute_command
    connection.send_command(*args)
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 610, in send_command
    self.send_packed_command(self.pack_command(*args))
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 585, in send_packed_command
    self.connect()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 489, in connect
    raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 111 connecting to 127.0.0.1:6379. Connection refused.
Traceback (most recent call last):
  File "/usr/local/bin/sonic-cfggen", line 376, in <module>
    main()
  File "/usr/local/bin/sonic-cfggen", line 293, in main
    configdb.connect()
  File "/usr/local/lib/python2.7/dist-packages/swsssdk/configdb.py", line 74, in connect
    self.db_connect('CONFIG_DB', wait_for_init, retry_on)
  File "/usr/local/lib/python2.7/dist-packages/swsssdk/configdb.py", line 69, in db_connect
    SonicV2Connector.connect(self, self.db_name, retry_on)
  File "/usr/local/lib/python2.7/dist-packages/swsssdk/dbconnector.py", line 250, in connect
    self.dbintf.connect(db_id, retry_on)
  File "/usr/local/lib/python2.7/dist-packages/swsssdk/interface.py", line 171, in connect
    self._onetime_connect(db_id)
  File "/usr/local/lib/python2.7/dist-packages/swsssdk/interface.py", line 183, in _onetime_connect
    client.config_set('notify-keyspace-events', self.KEYSPACE_EVENTS)
  File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 719, in config_set
    return self.execute_command('CONFIG SET', name, value)
  File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 673, in execute_command
    connection.send_command(*args)
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 610, in send_command
    self.send_packed_command(self.pack_command(*args))
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 585, in send_packed_command
    self.connect()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 489, in connect
    raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 111 connecting to 127.0.0.1:6379. Connection refused.
Watchdog armed for 180 seconds

Steps to reproduce the issue:

  1. sudo warm-reboot

Describe the results you received:

Above error message

Describe the results you expected:
No error message

Additional information you deem important (e.g. issue happens only occasionally):

**Output of `show version`:**

```

SONiC Software Version: SONiC.20191130.49
Distribution: Debian 9.13
Kernel: 4.9.0-11-2-amd64
Build commit: fb0b411b5
Build date: Fri Sep 11 12:17:51 UTC 2020
Built by: sonicbld@jenkins-slave-phx-2
```

**Attach debug file `sudo generate_dump`:**

```
(paste your output here)
```
@prsunny
Contributor Author

prsunny commented Sep 24, 2020

Based on the logs, it could be due to the following change:
sonic-net/sonic-utilities#1037

@sujinmkang, it looks like the watchdog is being enabled after the database container has been stopped, so there is no longer any access to the DB:

Fri Sep 18 16:26:29 UTC 2020 Stopped all remaining containers ...
Fri Sep 18 16:26:31 UTC 2020 Enabling Watchdog before fastfast-reboot
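
The ordering problem described above can be checked with a small probe. This is only a sketch, not code from the warm-reboot script: it assumes the database is the Redis instance at 127.0.0.1:6379 seen in the traceback, the helper name `db_reachable` is hypothetical, and it is written for Python 3 although the traceback comes from Python 2.7.

```python
import socket


def db_reachable(host="127.0.0.1", port=6379, timeout=0.5):
    """Return True if a TCP connection to host:port succeeds, else False."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused" (errno 111 on Linux) and timeouts.
        return False


if __name__ == "__main__":
    # After the database container is stopped, this is expected to print False.
    print(db_reachable())
```

A tool that must run after the database container is stopped could use a check like this to skip DB lookups instead of failing with a traceback.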

@prsunny
Contributor Author

prsunny commented Sep 24, 2020

output with warmboot -x

+ docker cp database:/var/lib/redis/dump.rdb /host/warmboot
+ docker exec -i database rm /var/lib/redis/dump.rdb
+ [[ fastfast-reboot = \w\a\r\m\-\r\e\b\o\o\t ]]
+ [[ fastfast-reboot = \f\a\s\t\f\a\s\t\-\r\e\b\o\o\t ]]
+ debug 'Stopping teamd ...'
+ [[ xno == x\y\e\s ]]
+ logger 'Stopping teamd ...'
+ docker exec -i teamd pkill -USR1 teamd
+ debug 'Stopped  teamd ...'
+ [[ xno == x\y\e\s ]]
+ logger 'Stopped  teamd ...'
+ debug 'Stopping syncd ...'
+ [[ xno == x\y\e\s ]]
+ logger 'Stopping syncd ...'
+ systemctl stop syncd
+ debug 'Stopped  syncd ...'
+ [[ xno == x\y\e\s ]]
+ logger 'Stopped  syncd ...'
+ debug 'Stopping all remaining containers ...'
+ [[ xno == x\y\e\s ]]
+ logger 'Stopping all remaining containers ...'
++ docker ps --format '{{.Names}}'
+ for CONTAINER_NAME in $(docker ps --format '{{.Names}}')
+ CONTAINER_STOP_RC=0
+ docker kill telemetry
+ systemctl stop telemetry
Warning: Stopping telemetry.service, but it can still be activated by:
  telemetry.timer
+ [[ CONTAINER_STOP_RC -ne 0 ]]
+ for CONTAINER_NAME in $(docker ps --format '{{.Names}}')
+ CONTAINER_STOP_RC=0
+ docker kill pmon
+ systemctl stop pmon
+ [[ CONTAINER_STOP_RC -ne 0 ]]
+ for CONTAINER_NAME in $(docker ps --format '{{.Names}}')
+ CONTAINER_STOP_RC=0
+ docker kill dhcp_relay
+ systemctl stop dhcp_relay
+ [[ CONTAINER_STOP_RC -ne 0 ]]
+ for CONTAINER_NAME in $(docker ps --format '{{.Names}}')
+ CONTAINER_STOP_RC=0
+ docker kill teamd
+ systemctl stop teamd
+ [[ CONTAINER_STOP_RC -ne 0 ]]
+ for CONTAINER_NAME in $(docker ps --format '{{.Names}}')
+ CONTAINER_STOP_RC=0
+ docker kill acms
+ systemctl stop acms
+ [[ CONTAINER_STOP_RC -ne 0 ]]
+ for CONTAINER_NAME in $(docker ps --format '{{.Names}}')
+ CONTAINER_STOP_RC=0
+ docker kill restapi
+ systemctl stop restapi
+ [[ CONTAINER_STOP_RC -ne 0 ]]
+ for CONTAINER_NAME in $(docker ps --format '{{.Names}}')
+ CONTAINER_STOP_RC=0
+ docker kill database
+ systemctl stop database
+ [[ CONTAINER_STOP_RC -ne 0 ]]
+ debug 'Stopped all remaining containers ...'
+ [[ xno == x\y\e\s ]]
+ logger 'Stopped all remaining containers ...'
+ systemctl stop docker.service
+ [[ mellanox = \n\e\p\h\o\s ]]
+ echo 'User issued '\''warm-reboot'\'' command [User: admin, Time: Thu Sep 24 01:05:59 UTC 2020]'
+ sync
+ sleep 1
+ sync
+ '[' -x /sbin/hwclock ']'
+ /sbin/hwclock -w
+ '[' -x /usr/bin/watchdogutil ']'
+ debug 'Enabling Watchdog before fastfast-reboot'
+ [[ xno == x\y\e\s ]]
+ logger 'Enabling Watchdog before fastfast-reboot'
+ /usr/bin/watchdogutil arm
Traceback (most recent call last):
  File "/usr/local/bin/sonic-cfggen", line 376, in <module>
    main()
  File "/usr/local/bin/sonic-cfggen", line 293, in main
    configdb.connect()
  File "/usr/local/lib/python2.7/dist-packages/swsssdk/configdb.py", line 74, in connect
    self.db_connect('CONFIG_DB', wait_for_init, retry_on)
  File "/usr/local/lib/python2.7/dist-packages/swsssdk/configdb.py", line 69, in db_connect
    SonicV2Connector.connect(self, self.db_name, retry_on)
  File "/usr/local/lib/python2.7/dist-packages/swsssdk/dbconnector.py", line 250, in connect
    self.dbintf.connect(db_id, retry_on)
  File "/usr/local/lib/python2.7/dist-packages/swsssdk/interface.py", line 171, in connect
    self._onetime_connect(db_id)
  File "/usr/local/lib/python2.7/dist-packages/swsssdk/interface.py", line 183, in _onetime_connect
    client.config_set('notify-keyspace-events', self.KEYSPACE_EVENTS)
  File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 719, in config_set
    return self.execute_command('CONFIG SET', name, value)
  File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 673, in execute_command
    connection.send_command(*args)
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 610, in send_command
    self.send_packed_command(self.pack_command(*args))
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 585, in send_packed_command
    self.connect()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 489, in connect
    raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 111 connecting to 127.0.0.1:6379. Connection refused.
Traceback (most recent call last):
  File "/usr/local/bin/sonic-cfggen", line 376, in <module>
    main()
  File "/usr/local/bin/sonic-cfggen", line 293, in main
    configdb.connect()
  File "/usr/local/lib/python2.7/dist-packages/swsssdk/configdb.py", line 74, in connect
    self.db_connect('CONFIG_DB', wait_for_init, retry_on)
  File "/usr/local/lib/python2.7/dist-packages/swsssdk/configdb.py", line 69, in db_connect
    SonicV2Connector.connect(self, self.db_name, retry_on)
  File "/usr/local/lib/python2.7/dist-packages/swsssdk/dbconnector.py", line 250, in connect
    self.dbintf.connect(db_id, retry_on)
  File "/usr/local/lib/python2.7/dist-packages/swsssdk/interface.py", line 171, in connect
    self._onetime_connect(db_id)
  File "/usr/local/lib/python2.7/dist-packages/swsssdk/interface.py", line 183, in _onetime_connect
    client.config_set('notify-keyspace-events', self.KEYSPACE_EVENTS)
  File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 719, in config_set
    return self.execute_command('CONFIG SET', name, value)
  File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 673, in execute_command
    connection.send_command(*args)
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 610, in send_command
    self.send_packed_command(self.pack_command(*args))
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 585, in send_packed_command
    self.connect()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 489, in connect
    raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 111 connecting to 127.0.0.1:6379. Connection refused.
Watchdog armed for 180 seconds
+ '[' -x /usr/share/sonic/device/x86_64-mlnx_msn3800-r0/warm-reboot_plugin ']'
+ debug 'Rebooting with /sbin/kexec -e to SONiC-OS-20191130.49 ...'
+ [[ xno == x\y\e\s ]]
+ logger 'Rebooting with /sbin/kexec -e to SONiC-OS-20191130.49 ...'
+ exec /sbin/kexec -e

@sujinmkang
Collaborator

The problem was introduced by my change that enables the hardware watchdog during fast/warm-reboot.
Unlike other platforms, the mlnx platform API calls sonic-cfggen to retrieve the platform and hwsku information, as shown here:
https://github.com/Azure/sonic-buildimage/blob/master/platform/mellanox/mlnx-platform-api/sonic_platform/chassis.py#L30-L31
I recommend reading /host/machine.conf to get that information instead, so that the platform API no longer depends on the config DB.
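
A minimal sketch of that suggestion, assuming /host/machine.conf is a plain KEY=VALUE file and that the platform name is stored under `onie_platform` (true for ONIE-based installs; other platforms may use a different key). The function names are hypothetical and this is not the actual platform API code. It only covers the platform name.

```python
def read_machine_conf(path="/host/machine.conf"):
    """Parse KEY=VALUE lines from a machine.conf-style file into a dict."""
    values = {}
    with open(path) as conf:
        for line in conf:
            line = line.strip()
            # Skip blanks, comments, and anything that is not KEY=VALUE.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    return values


def platform_from_machine_conf(path="/host/machine.conf"):
    """Return the onie_platform value, or None if the key is absent."""
    return read_machine_conf(path).get("onie_platform")
```

Because this reads a file on the host filesystem, it keeps working after the database container has been stopped.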

@keboliu
Collaborator

keboliu commented Sep 24, 2020

For the platform name, we can get it from /host/machine.conf, but for the MSFT-specific SKU name it may not be possible. For example:

For this MSFT SKU:

admin@arc-switch1004:~$ show platform summary
Platform: x86_64-mlnx_msn2700-r0
HwSKU: Mellanox-SN2700-D48C8
ASIC: Mellanox

admin@arc-switch1004:~$ sudo sonic-cfggen -d -v DEVICE_METADATA.localhost.hwsku
Mellanox-SN2700-D48C8

But we don't have the SKU name in machine.conf. Or do you see different content in your setup?

admin@arc-switch1004:~$ sudo cat /host/machine.conf
onie_arch=x86_64
onie_base_mac=50:6b:4b:8f:d2:40
onie_bin=
onie_boot_reason=rescue
onie_build_date=2020-02-10T08:13+00:00
onie_cli_static_parms=
onie_cli_static_url=201911-last-rc-sonic-mellanox.bin
onie_config_version=1
onie_dev=/dev/sda2
onie_exec_url=201911-last-rc-sonic-mellanox.bin
onie_firmware=auto
onie_grub_image_name=grubx64.efi
onie_initrd_tmp=/
onie_installer=/var/tmp/installer
onie_kernel_version=4.9.95
onie_machine=mlnx_msn2700
onie_machine_rev=0
onie_partition_type=gpt
onie_platform=x86_64-mlnx_msn2700-r0
onie_root_dir=/mnt/onie-boot/onie
onie_skip_ethmgmt_macs=yes
onie_switch_asic=mlnx
onie_uefi_arch=x64
onie_uefi_boot_loader=grubx64.efi
onie_vendor_id=33049
onie_version=2019.11-5.2.0020-9600

@lguohan
Collaborator

lguohan commented Sep 24, 2020

I think Tamer has put hwsku and platform into environment variables, so sonic-cfggen is not needed to get them. @keboliu, can you check that?
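
As a rough illustration of the environment-variable approach, here is a sketch. The variable names `PLATFORM` and `HWSKU` are assumptions and are not confirmed in this thread. `get_device_info` and its `fallback` argument are hypothetical names used only for this example.

```python
import os


def get_device_info(environ=None, fallback=None):
    """Return (platform, hwsku), preferring environment variables.

    PLATFORM and HWSKU are assumed variable names. `fallback` is a
    hypothetical callable returning (platform, hwsku) from another source.
    """
    env = os.environ if environ is None else environ
    platform = env.get("PLATFORM")
    hwsku = env.get("HWSKU")
    if (platform and hwsku) or fallback is None:
        return platform, hwsku
    # Only consult the slower source for values the environment lacks.
    fb_platform, fb_hwsku = fallback()
    return platform or fb_platform, hwsku or fb_hwsku
```

With both variables set, no database or subprocess call is needed, which is the property that matters once the database container is down.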

@lguohan
Collaborator

lguohan commented Sep 24, 2020

@tahmed-dev , where is the example now?

@lguohan
Collaborator

lguohan commented Sep 24, 2020

Meanwhile, @sujinmkang, you also call "PLATFORM=$(sonic-cfggen -H -v DEVICE_METADATA.localhost.platform)". Do you know whether sonic-cfggen is executed early in the script, or later when the variable is used? If it is the latter, then the generic code also has this issue.
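
For context, a shell assignment of the form `VAR=$(cmd)` runs `cmd` at the point of assignment, not when `VAR` is later expanded. The difference between early and late evaluation can be sketched in Python as follows. `FakeDb` and `run` are hypothetical stand-ins for illustration and do not reflect the real script or database API.

```python
class FakeDb:
    """Stand-in for the config DB; raises once it has been stopped."""

    def __init__(self):
        self.running = True

    def platform(self):
        if not self.running:
            raise ConnectionError("Connection refused")
        return "x86_64-mlnx_msn3800-r0"


def run(db):
    eager = db.platform()         # evaluated now, like VAR=$(cmd) in shell
    lazy = lambda: db.platform()  # evaluated only when called later
    db.running = False            # simulate stopping the database container
    results = {"eager": eager}
    try:
        results["lazy"] = lazy()
    except ConnectionError as exc:
        results["lazy"] = "failed: %s" % exc
    return results


if __name__ == "__main__":
    print(run(FakeDb()))
```

The eager value survives the shutdown, while the lazy lookup fails in the same way as the traceback in this issue.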

@tahmed-dev
Contributor

tahmed-dev commented Sep 24, 2020

> @tahmed-dev , where is the example now?

@lguohan, @sujinmkang here is an example I pushed into the driver framework before it was refactored.

@prsunny
Contributor Author

prsunny commented Sep 24, 2020

@tahmed-dev, can you check the "example" link? It looks broken.

@tahmed-dev
Contributor

> @tahmed-dev , can you check the "example" link. looks like broken?

@prsunny just updated it. example

@keboliu
Collaborator

keboliu commented Sep 27, 2020

#5468
sonic-net/sonic-platform-common#121
The two PRs above fix this issue.

@prsunny
Contributor Author

prsunny commented Sep 30, 2020

Closing this issue as the PRs are merged

@prsunny prsunny closed this as completed Sep 30, 2020