[teamd]: wait for swss db flush done before starting teamd container #2626
files/scripts/swss.sh
@@ -97,6 +97,7 @@ start() {
/usr/bin/docker exec database redis-cli -n 2 FLUSHDB
/usr/bin/docker exec database redis-cli -n 5 FLUSHDB
clean_up_tables 6 "'PORT_TABLE*', 'MGMT_PORT_TABLE*', 'VLAN_TABLE*', 'VLAN_MEMBER_TABLE*', 'INTERFACE_TABLE*', 'MIRROR_SESSION*', 'VRF_TABLE*'"
+/usr/bin/docker exec database redis-cli -n 0 HSET "SWSS_DB_FLUSH_DONE" "1"
One problem I foresee here is that SWSS_DB_FLUSH_DONE never gets set to 0 or deleted. Take, for instance, the new logic I implemented in the 201803 branch (and will soon implement in the master/201811 branches as well). If a critical process crashes in the swss container, the container will exit, causing the swss service to restart itself and its dependent services. At this point, swss restarts and subsequently restarts teamd. It's possible that teamd would check the value before the databases get flushed, yet SWSS_DB_FLUSH_DONE would still be set to 1, causing teamd to start before the databases are flushed. How can we reliably set this value to 0 or delete the key when applicable? Maybe deleting the key when the swss service stops is enough?
Yes, at the least SWSS_DB_FLUSH_DONE should be set to 0 when the swss service stops.
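A minimal sketch of the flag lifecycle discussed above, assuming the flag is kept as a plain string key in DB 0. The names `REDIS_CLI`, `FLUSH_FLAG_KEY`, `flag_db`, `mark_flush_done`, and `clear_flush_done` are hypothetical and do not come from swss.sh. The sketch uses `SET` and `DEL` rather than the `HSET` call in the diff, since `HSET` expects a field name in addition to a key and a value.

```shell
# Sketch only: every name defined here is hypothetical, not taken from swss.sh.
# REDIS_CLI can be overridden to point at a different client.
REDIS_CLI=${REDIS_CLI:-"/usr/bin/docker exec database redis-cli"}
FLUSH_FLAG_KEY="SWSS_DB_FLUSH_DONE"

# Run a redis command against DB 0, where the diff stores the flag.
flag_db() {
    $REDIS_CLI -n 0 "$@"
}

# Intended to be called at the end of start(), after the FLUSHDB and
# clean_up_tables steps have finished.
mark_flush_done() {
    flag_db SET "$FLUSH_FLAG_KEY" 1 > /dev/null
}

# Intended to be called when the swss service stops (and, defensively, at the
# top of start()), so a stale "1" from the previous run is not visible while
# a new flush is still pending.
clear_flush_done() {
    flag_db DEL "$FLUSH_FLAG_KEY" > /dev/null
}
```

With this shape, a crash-triggered restart of swss would pass through `clear_flush_done` before teamd has a chance to read the flag, provided the stop path actually runs; whether it does in every failure mode is not established by this thread.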
Did you test this change with system warm reboot?
@yxieca I tested with cold reboot and teamd docker warm restart, but not system warm reboot. Did you see any issue?
Not really.
In my test, your change appears to be equivalent to ordering the teamd service after syncd. I don't see any behavior change between the two. Can you test that change?
diff --git a/files/build_templates/teamd.service.j2 b/files/build_templates/teamd.service.j2
index 792b824..bde55c6 100644
--- a/files/build_templates/teamd.service.j2
+++ b/files/build_templates/teamd.service.j2
@@ -2,6 +2,7 @@
Description=TEAMD container
Requires=updategraph.service
After=updategraph.service
+After=syncd.service
Before=ntp-config.service
[Service]
Yes, with the recent systemd service change, this should work too.
The difference might be boot-up speed (for both cold boot and warm boot), but I don't have any concrete data.
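For comparison with the `After=syncd.service` ordering, the flag-based wait on the teamd side might be sketched as a bounded polling loop. This is an illustration under assumptions, not the PR's actual code: `REDIS_CLI` and `wait_for_flush_done` are hypothetical names, the flag is assumed to be a plain string key in DB 0 read with `GET`, and the poll count and interval are arbitrary defaults.

```shell
# Sketch only: REDIS_CLI and wait_for_flush_done are hypothetical names,
# not part of the teamd start script.
REDIS_CLI=${REDIS_CLI:-"/usr/bin/docker exec database redis-cli"}

# Poll DB 0 until SWSS_DB_FLUSH_DONE reads "1".
# $1: maximum number of polls (default 60)
# $2: seconds to sleep between polls (default 1)
# Returns 0 once the flag is seen, 1 if the polls run out.
wait_for_flush_done() {
    max_polls=${1:-60}
    interval=${2:-1}
    polls=0
    while [ "$polls" -lt "$max_polls" ]; do
        if [ "$($REDIS_CLI -n 0 GET SWSS_DB_FLUSH_DONE)" = "1" ]; then
            return 0
        fi
        polls=$((polls + 1))
        sleep "$interval"
    done
    return 1
}
```

A caller could run `wait_for_flush_done 60 1 || exit 1` before starting the container. The bound matters: without it, a flag that is never set would block teamd indefinitely, whereas the systemd ordering has no such polling cost.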
Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com>
Wait for swss to finish db flushing:
SWSS_DB_FLUSH_DONE flag and correct mtu for lag in appDB.
root@vlab-01:/home/admin# systemctl stop swss
mtu gone after swss restart.
root@vlab-01:/home/admin# systemctl restart teamd
mtu pushed back.
Retest this please
Close it in favor of #2724
Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com>
- What I did
This is an attempt to fix #2606
- How I did it
- How to verify it
- Description for the changelog