-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Kernel 4.9: race condition seen with port channel creation. #1981
Comments
I also see such issue, not sure if it is caused by kernel or not. maybe it is related the /etc/network/interface refactor. @stcheng , can you take a look? |
@hzheng5 has some change in teamd.sh to bring the member ports down before creating portchannel, then bring member up again. The change seemed to work and fixed the problem in my test setup. |
here is the new start_app function we are using
Please let us now if you want to merge the change into master branch. Thanks, |
thanks i'll test this |
Hello @jipanyang and @hzheng5, in my previous pull request #1908, I have added the |
Hi @stcheng , We can double check the config internally. But we still need to solve this issue when the member ports are set to be "admin up" ? Thanks, |
Hi @hzheng5, when ports are set as admin up, they cannot be enslaved due to the restriction of teamd. Do we want to have this behavior or not? If we want to have the behavior of enslaving member ports into a port channel regardless of the admin status, we could force to bring down the ports. If we want to have teamd's restrictions, then we could clean up the unnecessary configurations in the database. |
Hi Shuotian,
Yes. The root cause here is due the restriction of teamd. More specificlly, teamd is using rtnl_link_enslave_ifindex to add the member port to LAG, which will fail the portchannel creation if any member port is admin up.
However, I don't think this restriction should apply to SONiC. It's difficult to make sure that users always have configured the member ports to be down. So they may ends up with portchannel not created randomly even though it's configured in config DB.
So two possible solutions:
1. fix the restriction from rtnl_link_enslave_ifindex
2. force the member port to be down before creating the lag, and then put the member port in configured admin status afterwards.
Personally, I prefer 2.
Thanks,
Haiyang
------------------------------------------------------------------
Sender:Shuotian Cheng <notifications@github.com>
Sent at:2018 Aug 28 (Tue) 14:35
To:Azure/sonic-buildimage <sonic-buildimage@noreply.github.com>
Cc:Sonic项目合作 <sonic-external@list.alibaba-inc.com>; Comment <comment@noreply.github.com>
Subject:Re: [Azure/sonic-buildimage] Kernel 4.9: race condition seen with port channel creation. (#1981)
Hi @hzheng5, when ports are set as admin up, they cannot be enslaved due to the restriction of teamd. Do we want to have this behavior or not? If we want to have the behavior of enslaving member ports into a port channel regardless of the admin status, we could force to bring down the ports. If we want to have teamd's restrictions, then we could clean up the unnecessary configurations in the database.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub, or mute the thread.
|
Just a thought that, if the interface is already up and if you bring-down the interface, the peer may see link down. However, if we have to fix this, I would go for # 2. |
Yes. I think the second approach is preferred. But one more question here, if we configure the member port as admin down before enslaving the port, will we need to manually bring down the port after the port is enslaved? |
No I think. Say if we bring down, then at what point we want to bring it up? But if we want to support lag member up/down, then after enslaving the port, we want to check what is the user configured |
For #2, here are the suggested steps inside /usr/bin/teamd.sh to start teamd
Thanks, |
this issue has already been resolved. |
Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com> 30f5dd6 Update the example for pfcwd start command (sonic-net#1984) 9e30871 [Auto Techsupport] Event driven Techsupport Bug Fixes (sonic-net#1986) fbd565d Fix wrong help message for cable length setting (sonic-net#1978) b3a5052 [GCU] Using simulated config instead of target config when validating replace operation in NoDependencyMoveValidator (sonic-net#1987) 35cb524 [GCU] Copying config_db before callding sonic_yang.loadData (sonic-net#1983) a98858d [GCU] Different apply-patch runs should produce same sorted steps (sonic-net#1988) 8c81ae3 [breakout] Fix the check when port is not present in BREAKOUT_CFG table (sonic-net#1765) bc8fe7c [doc][DPB] Update DPB related interface breakout command Info (sonic-net#1438) 1a2a9a3 [config] Fix 'config reload -l' command to get filename by default (sonic-net#1611) ed2fa69 [debug dump util] FDB debug dump util changes (sonic-net#1968) 3b642c9 [GCU] Loading yang-models only once (sonic-net#1981) bb56fc2 Update swss_ready check to check per namespace swss service (sonic-net#1974) 4f39f9f [GCU] Moving PatchSorter unit-test to json file to make it easier to read/maintain (sonic-net#1977) 1a75870 [CLI][Help string] Changed the show command help text to be more consistent with each other. 818dcbf [GCU] Implementing DryRun by printing patch-sorter steps/imitating config_db (sonic-net#1973)
4236bc4 [config reload] Fixing config reload when timer based delayed services are disabled (#1967) d2514e4 [GCU] Different apply-patch runs should produce same sorted steps (#1988) 2878adb [GCU] Using simulated config instead of target config when validating replace operation in NoDependencyMoveValidator (#1987) fb8ca98 [GCU] Loading yang-models only once (#1981) f88ee92 [GCU] Copying config_db before callding sonic_yang.loadData (#1983) 9ed0e91 [GCU] Implementing DryRun by printing patch-sorter steps/imitating config_db (#1973) b36b5e3 [GCU] Moving PatchSorter unit-test to json file to make it easier to read/maintain (#1977) c0fa28b [generic-config-updater] Improving CreateOnly validator and marking /LOOPBACK_INTERFACE/LOOPBACK#/vrf_name as create-only (#1969) 0559d04 [generic-config-updater] Adding non-strict mode (#1929) b07f477 [debug dump util] FDB debug dump util changes (#1968) 6d8757a [warm/fast-reboot] Fix kexec portion to support platforms based on Device Tree (#1966) cc1409e [Auto Techsupport] Event driven Techsupport Bug Fixes (#1986) 6c48bd5 Fix wrong help message for cable length setting (#1978) c0bbbe3 [breakout] Fix the check when port is not present in BREAKOUT_CFG table (#1765) 5bb8cad [doc][DPB] Update DPB related interface breakout command Info (#1438) e6fd990 [config] Fix 'config reload -l' command to get filename by default (#1611) bd8f7bb Update swss_ready check to check per namespace swss service (#1974) 5439f94 [soft-reboot] Add support for platforms based on Device Tree (#1963) 7c5810a [config] Add portchannel support for static route (#1857) 7cb6a1b preserve old order for config reload (#1964) 20bddbd [Auto-Techsupport] Issues related to Multiple Cores crashing handled (#1948)
#### What I did Loading sonic-yang models only once, and re-using them. This makes the sorting a lot faster. How to verify `loadYangModel` took a lot of time? Add the following snippet to `TestPatchSorter` ```python from pstats import Stats import cProfile class TestPatchSorter(...): def setUp(self): """init each test""" self.pr = cProfile.Profile() self.pr.enable() print("\n<<<---") def tearDown(self): """finish any test""" p = Stats (self.pr) p.strip_dirs() p.sort_stats ('cumtime') p.print_stats () print("\n--->>>") . . . # Also update verify(self, cc_ops=[], tc_ops=[]) by commenting out changes validation to avoid extra calls to loadYangModels def verify(self, cc_ops=[], tc_ops=[]): # Arrange config_wrapper=ConfigWrapper() target_config=jsonpatch.JsonPatch(tc_ops).apply(Files.CROPPED_CONFIG_DB_AS_JSON) current_config=jsonpatch.JsonPatch(cc_ops).apply(Files.CROPPED_CONFIG_DB_AS_JSON) patch=jsonpatch.make_patch(current_config, target_config) # Act actual = self.create_patch_sorter(current_config).sort(patch) # Assert # simulated_config = current_config # for move in actual: # simulated_config = move.apply(simulated_config) # self.assertTrue(config_wrapper.validate_config_db_config(simulated_config)) # self.assertEqual(target_config, simulated_config) ``` Run ``` > python3 -m unittest patch_sorter_test.TestPatchSorter.test_sort__dpb_1_to_4__success . . . 48986582 function calls (48933431 primitive calls) in 104.530 seconds Ordered by: cumulative time ncalls tottime percall cumtime percall filename:lineno(function) 1 0.000 0.000 104.530 104.530 case.py:549(_callTestMethod) 1 0.000 0.000 104.530 104.530 patch_sorter_test.py:1889(test_sort__dpb_1_to_4__success) 1 0.000 0.000 104.529 104.529 patch_sorter_test.py:1933(verify) 1 0.005 0.005 104.527 104.527 patch_sorter.py:1332(sort) 32/1 0.006 0.000 104.077 104.077 patch_sorter.py:955(sort) 334 0.012 0.000 99.498 0.298 patch_sorter.py:310(validate) 492 2.140 0.004 95.810 0.195 sonic_yang_ext.py:30(loadYangModel) <========= ``` From the above we can see profiling data about test_sort__dpb_1_to_4__success: - Took 104.53s to complete - loadYangModel was called 492 times - loadYangModel took 95.810s. loadYangModel is the method that loads the yang models from memory into SonicYang object. The loading of the YANG models should only happen once. #### How I did it Moved all calls to create sonic_yang object to ConfigWrapper, and each call to create a new instance just fills in the data for the yang models. #### How to verify it unit-test Running profiling after the update: ``` > python3 -m unittest patch_sorter_test.TestPatchSorter.test_sort__dpb_1_to_4__success . . . 702096 function calls (648951 primitive calls) in 2.882 seconds Ordered by: cumulative time ncalls tottime percall cumtime percall filename:lineno(function) 1 0.000 0.000 2.882 2.882 case.py:549(_callTestMethod) 1 0.000 0.000 2.882 2.882 patch_sorter_test.py:1890(test_sort__dpb_1_to_4__success) 1 0.002 0.002 2.881 2.881 patch_sorter_test.py:1934(verify) 1 0.000 0.000 2.874 2.874 patch_sorter.py:1332(sort) 32/1 0.004 0.000 2.705 2.705 patch_sorter.py:955(sort) 334 0.008 0.000 2.242 0.007 patch_sorter.py:310(validate) 490 0.080 0.000 1.791 0.004 sonic_yang_ext.py:1043(loadData) 332 0.043 0.000 1.655 0.005 patch_sorter.py:345(validate) 332 0.018 0.000 1.509 0.005 gu_common.py:112(validate_config_db_config) . . . 1 0.002 0.002 0.164 0.164 sonic_yang_ext.py:30(loadYangModel) ``` From the above we can see profiling data about test_sort__dpb_1_to_4__success: - Took 2.882s to complete - loadYangModel was called 1time - loadYangModel took 0.164s. [profiling-after.txt](https://github.com/Azure/sonic-utilities/files/7757252/profiling-after.txt) #### Previous command output (if the output of a command-line utility has changed) #### New command output (if the output of a command-line utility has changed)
Description
Portchannel creation failed partially with config reload or system reboot.
Restart teamd service, portchannel got created succefully.
It looks the portchannel config in /etc/network/interface will bring up the member Ethernet interface.
While if the admin status of member Ethernet interface is up, portchannel create will fail.
The text was updated successfully, but these errors were encountered: