Skip wait_for_tunnel on mocked dualtor #15741

Open. Wants to merge 2 commits into base: master.

Conversation

@yucgu commented Jul 7, 2023

Why I did it

Hit a problem when running some dualtor test cases on a T0 testbed.
In that scenario the mock config is applied, but the tunnel interface is not created normally on the standby side.
This causes wait_for_tunnel() in write_standby.py to hit its 90s timeout, which in turn makes stopping the bgp service time out; as a result, stopping swss does not complete normally and some test cases may fail.

Dumped the ASIC_DB to check the tunnel data.
The tunnel entries are present in ASIC_DB before swss is restarted, and the MuxTunnel0 creation log is visible.
But it looks like, once the swss restart begins, tunnelmgrd (which restarts earlier) deletes the tunnel and does not add it back normally.
Here are the main logs; the swss restart happened around 11:29:01.

From the mgmt test log:
21/06/2023 11:29:01 dual_tor_mock.apply_peer_switch_table_to L0345 INFO | Dump asic_db before restart swss:
21/06/2023 11:29:01 base._run L0063 DEBUG | /data/tests/common/devices/multi_asic.py::_run_on_asics#112: [mth-t0-64] AnsibleModule::shell, args=[], kwargs={"cmd": "sonic-db-cli ASIC_DB keys 'ASIC_STATE:SAI_OBJECT_TYPE_TUNNEL:'"}
21/06/2023 11:29:01 base._run L0082 DEBUG | /data/tests/common/devices/multi_asic.py::_run_on_asics#112: [mth-t0-64] AnsibleModule::shell Result => {"stderr_lines": [], "cmd": "sonic-db-cli ASIC_DB keys 'ASIC_STATE:SAI_OBJECT_TYPE_TUNNEL:
'", "end": "2023-06-21 11:28:52.539676", "_ansible_no_log": false, "stdout": "ASIC_STATE:SAI_OBJECT_TYPE_TUNNEL:oid:0x2a0000000009ad\nASIC_STATE:SAI_OBJECT_TYPE_TUNNEL:oid:0x2a0000000009b5\nASIC_STATE:SAI_OBJECT_TYPE_TUNNEL:oid:0x2a000000000aae", "changed": true, "rc": 0, "start": "2023-06-21 11:28:52.531009", "stderr": "", "delta": "0:00:00.008667", "invocation": {"module_args": {"creates": null, "executable": null, "_uses_shell": true, "strip_empty_ends": true, "_raw_params": "sonic-db-cli ASIC_DB keys 'ASIC_STATE:SAI_OBJECT_TYPE_TUNNEL:*'", "removes": null, "argv": null, "warn": true, "chdir": null, "stdin_add_newline": true, "stdin": null}}, "stdout_lines": ["ASIC_STATE:SAI_OBJECT_TYPE_TUNNEL:oid:0x2a0000000009ad", "ASIC_STATE:SAI_OBJECT_TYPE_TUNNEL:oid:0x2a0000000009b5", "ASIC_STATE:SAI_OBJECT_TYPE_TUNNEL:oid:0x2a000000000aae"], "failed": false}
21/06/2023 11:29:01 dual_tor_mock.apply_peer_switch_table_to L0347 INFO | Restarting swss service to regenerate config.bcm

From the syslog:
Jun 21 11:28:50.543496 mth-t0-64 NOTICE swss#tunnelmgrd: :- doTunnelTask: Peer/Remote IP not configured
Jun 21 11:28:50.544121 mth-t0-64 NOTICE swss#tunnelmgrd: :- doTunnelTask: Tunnel MuxTunnel0 task, op SET
Jun 21 11:28:50.546930 mth-t0-64 NOTICE swss#orchagent: :- addDecapTunnel: Create overlay loopback router interface oid:6000000000aad
Jun 21 11:28:50.549307 mth-t0-64 NOTICE swss#orchagent: :- addDecapTunnelTermEntries: 10.1.0.32 already exists. Did not create entry.
Jun 21 11:28:50.549307 mth-t0-64 NOTICE swss#orchagent: :- doTask: Tunnel(s) added to ASIC_DB.

Jun 21 11:29:02.903535 mth-t0-64 INFO swss#supervisord 2023-06-21 11:29:02,903 INFO stopped: tunnelmgrd (terminated by SIGTERM)
Jun 21 11:30:03.191076 mth-t0-64 NOTICE swss#tunnelmgrd: :- main: --- Starting Tunnelmgrd ---

Jun 21 11:30:04.347362 mth-t0-64 NOTICE swss#tunnelmgrd: :- main: starting main loop
Jun 21 11:30:04.347400 mth-t0-64 NOTICE swss#tunnelmgrd: :- doTunnelTask: Peer/Remote IP not configured
Jun 21 11:30:04.347422 mth-t0-64 INFO swss#supervisord: tunnelmgrd delete tunnel "tun0" failed: No such device
Jun 21 11:30:04.347631 mth-t0-64 INFO swss#supervisord 2023-06-21 11:30:04,347 INFO success: arp_update entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
Jun 21 11:30:04.347658 mth-t0-64 NOTICE swss#tunnelmgrd: :- doTunnelTask: Tunnel MuxTunnel0 task, op SET

Jun 21 11:34:01.745938 mth-t0-64 INFO container: docker cmd: wait for bgp
Jun 21 11:34:01.746309 mth-t0-64 INFO container: docker cmd: stop for bgp
Jun 21 11:34:01.746385 mth-t0-64 DEBUG container: container_stop: END
Jun 21 11:34:01.788168 mth-t0-64 NOTICE cisco: Stopped bgp service...
Jun 21 11:34:01.883200 mth-t0-64 INFO swss#supervisord 2023-06-21 11:34:01,882 WARN received SIGTERM indicating exit request
Jun 21 11:34:01.883333 mth-t0-64 INFO swss#supervisord 2023-06-21 11:34:01,883 INFO waiting for supervisor-proc-exit-listener, rsyslogd, portsyncd, coppmgrd, arp_update, ndppd, neighsyncd, vlanmgrd, intfmgrd, portmgrd, buffermgrd, tunnel_packet_handler, vrfmgrd, nbrmgrd, vxlanmgrd, fdbsyncd, tunnelmgrd to die
Jun 21 11:34:01.883836 mth-t0-64 INFO swss#supervisord 2023-06-21 11:34:01,883 INFO stopped: tunnelmgrd (terminated by SIGTERM)

Jun 21 11:35:31.009842 mth-t0-64 WARNING systemd[1]: swss.service: Stopping timed out. Terminating.
Jun 21 11:35:31.011433 mth-t0-64 INFO swss.sh[578580]: Terminated
Jun 21 11:35:31.012724 mth-t0-64 WARNING systemd[1]: swss.service: Control process exited, code=killed, status=15/TERM
Jun 21 11:35:31.012911 mth-t0-64 WARNING systemd[1]: swss.service: Failed with result 'timeout'.
Jun 21 11:35:31.014462 mth-t0-64 INFO systemd[1]: Stopped switch state service.
Jun 21 11:35:31.957011 mth-t0-64 ERR write_standby: Timed out waiting for tunnel MuxTunnel0, mux state will not be written

Work item tracking
  • Microsoft ADO (number only):

How I did it

It's related to the swss restart flow: during the restart, tunnelmgrd deletes the tunnel interfaces and does not get a chance to recreate them before the check runs. So, to keep it simple, for the mocked dualtor case on T0, just skip the wait_for_tunnel() step in write_standby.py.
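As a sketch of the idea (not the exact PR diff; the function name, docstring, and config path here are illustrative, following the discussion further down), the check could read the saved startup config and treat a missing dualtor subtype as the mock signature:

```python
import json

def is_mocked_dualtor(config_path='/etc/sonic/config_db.json'):
    """Illustrative check: the dualtor mock only changes the running
    config, so on a mocked-dualtor T0 the saved startup config still
    lacks the DualToR subtype."""
    with open(config_path) as f:
        file_data = json.load(f)
    localhost = file_data.get('DEVICE_METADATA', {}).get('localhost', {})
    subtype = localhost.get('subtype', '')
    return 'dualtor' not in subtype.lower()
```

In write_standby.py this check would only matter on devices whose running config already reports dualtor, so wait_for_tunnel() would be skipped only for that mocked combination, not on real dualtor testbeds.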

How to verify it

Ran the test script a few times; did not see the bgp stopping timeout issue.

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211
  • 202305

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@yucgu yucgu requested a review from lguohan as a code owner July 7, 2023 09:31
@yucgu (Author) commented Jul 7, 2023

Hi @theasianpianist @lguohan, could you help review this change? Thanks.

@lguohan (Collaborator) commented Jul 7, 2023

@prsunny @theasianpianist , can you folks review this?

    subtype = ''
    if 'subtype' in file_data['DEVICE_METADATA']['localhost']:
        subtype = file_data['DEVICE_METADATA']['localhost']['subtype']

    return 'dualtor' not in subtype.lower()
Contributor:

I'm not sure what the difference is between is_dualtor and is_mocked_dualtor; both check subtype. Also, why open the config_db.json file? It may or may not match the actual running config.

Author:

is_dualtor checks the config database, which is the running config, while is_mocked_dualtor checks the saved startup config, config_db.json.
For the mocked case, applying the mock only changes the running config; it does not save it.

Here are the dumped data logs from the two functions. is_dualtor sees dualtor data such as 'subtype': 'DualToR' and 'peer_switch': 'switch_hostname', while is_mocked_dualtor does not.

In is_dualtor check:
meta {'bgp_asn': '64601', 'buffer_model': 'traditional', 'cloudtype': 'None', 'default_bgp_status': 'up', 'default_pfcwd_status': 'disable', 'deployment_id': '1', 'docker_routing_config_mode': 'separated', 'hostname': 'mth-t0-64', 'hwsku': 'Cisco-8102-C64', 'mac': 'BC:E7:12:82:03:90', 'peer_switch': 'switch_hostname', 'platform': 'x86_64-8102_64h_o-r0', 'region': 'None', 'subtype': 'DualToR', 'synchronous_mode': 'enable', 'type': 'ToRRouter'}

in is_mocked_dualtor check:
local {'bgp_asn': '64601', 'buffer_model': 'traditional', 'cloudtype': 'None', 'default_bgp_status': 'up', 'default_pfcwd_status': 'disable', 'deployment_id': '1', 'docker_routing_config_mode': 'separated', 'hostname': 'mth-t0-64', 'hwsku': 'Cisco-8102-C64', 'mac': 'BC:E7:12:82:03:90', 'platform': 'x86_64-8102_64h_o-r0', 'region': 'None', 'synchronous_mode': 'enable', 'type': 'ToRRouter'}
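To make the distinction concrete, here is a minimal check over abbreviated versions of the two dumps above (the helper name is illustrative, not from the PR):

```python
# Abbreviated DEVICE_METADATA|localhost entries from the two dumps above.
running_config_meta = {'hostname': 'mth-t0-64', 'type': 'ToRRouter',
                       'subtype': 'DualToR', 'peer_switch': 'switch_hostname'}
saved_config_meta = {'hostname': 'mth-t0-64', 'type': 'ToRRouter'}

def has_dualtor_subtype(meta):
    """True when the metadata entry carries a dualtor subtype."""
    return 'dualtor' in meta.get('subtype', '').lower()

# Mocked-dualtor signature: running config says dualtor, saved config does not.
is_mock = has_dualtor_subtype(running_config_meta) and \
          not has_dualtor_subtype(saved_config_meta)
```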

Contributor:

Thanks for explaining. @theasianpianist, do you think there is a different approach here? I'm concerned about running this in the real production flow where, for some reason, the saved file doesn't have this attribute while the running config does.

Author:

Hi @theasianpianist, do you have a chance to help review it? Thanks.

Contributor:

I agree with Prince here, plus I don't think it makes sense to change the image for a mock dualtor test issue. Do you see this problem on true dualtor testbeds as well?

Contributor:

How is it working currently? Mocked dualtor has been running for quite a while now.

Author:

It's an issue exposed recently by another change; there was no problem before it.
A change to wait_for_tunnel() a couple of months ago increased its timeout from 60s to 90s. The change itself is fine, but it exposed this issue: when the timeout was 60s, even if no tunnel was ready and wait_for_tunnel() timed out, stopping the bgp service would not time out, so swss could restart all services normally and the related cases could pass.
Now that the timeout is 90s, the issue is exposed.

Author:

Hi @prsunny @theasianpianist , any suggestion for the solution? thanks

Contributor:

@prsunny @theasianpianist - could you please check the latest comments from @yucgu and provide us the next steps?

Contributor:

@prsunny @theasianpianist - could you please help in approving this?

@theasianpianist (Contributor):

I think it's best to avoid image changes if possible for this issue. Can you see if the problem is still reproducible after changing this line from systemctl restart swss to config reload -y?

@maunnikr-cisco:

> I think it's best to avoid image changes if possible for this issue. Can you see if the problem is still reproducible after changing this line from systemctl restart swss to config reload -y?

@theasianpianist The problem still persists after making this change. The TC is only passing when applying @yucgu and my (sonic-net/sonic-mgmt#9597) PR

@theasianpianist (Contributor) commented Mar 14, 2024

> I think it's best to avoid image changes if possible for this issue. Can you see if the problem is still reproducible after changing this line from systemctl restart swss to config reload -y?

> @theasianpianist The problem still persists after making this change. The TC is only passing when applying @yucgu and my (sonic-net/sonic-mgmt#9597) PR

I am still hesitant to make this change. If the systemd service timeout is causing the issue, during the test we could temporarily increase the timeout value by adding a config file in /etc/systemd/system.conf.d, then removing it once the test is over. Please see https://unix.stackexchange.com/questions/551463/how-to-override-etc-systemd-system-conf and https://www.freedesktop.org/software/systemd/man/latest/systemd-system.conf.html#DefaultTimeoutStartSec=
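For reference, the drop-in described above might look like the following (the path follows the comment, but the file name and value are illustrative; the stop timeout is the one that expired in the syslog above, and the file should be removed once the test finishes):

```ini
# /etc/systemd/system.conf.d/99-test-timeout.conf  (illustrative name)
[Manager]
# Give services more than the default 90s to stop, so
# "swss.service: Stopping timed out" does not fire while
# wait_for_tunnel() is still waiting. Remove after the test.
DefaultTimeoutStopSec=300s
```

After writing or removing the file, run systemctl daemon-reexec (or daemon-reload on recent systemd) so the manager rereads its configuration.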
