-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[teamd] prevent re-entrance of port priv change handler #2723
Conversation
When adding a lag member dynamically after system boots up, teamd port priv change handler could re-entrant itself and causing adding operation to fail. While handling PORT_CHANGE event, teamd_per_port.c port priv change handler was called, it will then call runner_lacp to add port to lag, the later causes IFINFO_CHANGE to be notified and calls the priv change handler again, this re-entrance would cause runner_lacp port_added to be called again and messes up with the previous adding sequence. Then fails the lag member adding operation. Prevent per port priv change handler re-entrance solves the problem. Signed-off-by: Ying Xie <ying.xie@microsoft.com>
I feel that we need to drill down the root cause of that device name missing issue. Patching with all the workaround doesn't seem to be a long term solution. |
@jipanyang I agree with you. That might take longer time. If we identify the root cause and proper fix. I don't mind revert this change and #2699. Frankly speaking, I thought about this. I think the other way to fix the issue is to delay creating LAG devices after warm reboot. But I don't know a good indication of 'readiness' to do so. |
When adding a lag member dynamically after system boots up, teamd port priv change handler could re-entrant itself and causing adding operation to fail. While handling PORT_CHANGE event, teamd_per_port.c port priv change handler was called, it will then call runner_lacp to add port to lag, the later causes IFINFO_CHANGE to be notified and calls the priv change handler again, this re-entrance would cause runner_lacp port_added to be called again and messes up with the previous adding sequence. Then fails the lag member adding operation. Prevent per port priv change handler re-entrance solves the problem. Signed-off-by: Ying Xie <ying.xie@microsoft.com>
When adding a lag member dynamically after system boots up, teamd port priv change handler could re-entrant itself and causing adding operation to fail. While handling PORT_CHANGE event, teamd_per_port.c port priv change handler was called, it will then call runner_lacp to add port to lag, the later causes IFINFO_CHANGE to be notified and calls the priv change handler again, this re-entrance would cause runner_lacp port_added to be called again and messes up with the previous adding sequence. Then fails the lag member adding operation. Prevent per port priv change handler re-entrance solves the problem. Signed-off-by: Ying Xie <ying.xie@microsoft.com>
When adding a lag member dynamically after system boots up, teamd port priv change handler could re-entrant itself and causing adding operation to fail. While handling PORT_CHANGE event, teamd_per_port.c port priv change handler was called, it will then call runner_lacp to add port to lag, the later causes IFINFO_CHANGE to be notified and calls the priv change handler again, this re-entrance would cause runner_lacp port_added to be called again and messes up with the previous adding sequence. Then fails the lag member adding operation. Prevent per port priv change handler re-entrance solves the problem. Signed-off-by: Ying Xie <ying.xie@microsoft.com>
When adding a lag member dynamically after system boots up, teamd port priv change handler could re-entrant itself and causing adding operation to fail. While handling PORT_CHANGE event, teamd_per_port.c port priv change handler was called, it will then call runner_lacp to add port to lag, the later causes IFINFO_CHANGE to be notified and calls the priv change handler again, this re-entrance would cause runner_lacp port_added to be called again and messes up with the previous adding sequence. Then fails the lag member adding operation. Prevent per port priv change handler re-entrance solves the problem. Signed-off-by: Ying Xie <ying.xie@microsoft.com>
…lly (#14750) src/sonic-swss * 96407be - (HEAD -> 202205, origin/202205) sonic-swss: Sync and program remote lag member status. (#2730) (11 hours ago) [Sambath Kumar Balasubramanian] * 6051d33 - Fix race condition in sub-interface handling (#2723) (11 hours ago) [Stephen Sun] * 7b95ec5 - [muxorch] handling multiple mux nexthops for route (#2656) (11 hours ago) [Nikola Dancejic]
* Fix race condition in sub-interface handling A sub-interface's becomes admin up in the following scenarios: 1. A port becomes admin up in STATE_DB.LAG_TABLE when there are sub-interfaces created based on it 2. A sub-interfaces is created when the parent port's admin_status is up in APPL_DB.LAG_TABLE However, it's possible that - the sub-interfaces have not been created while STATE_DB.LAG_TABLE is updated - the LAGs, as the parent ports, are still admin-down in APPL_DB.LAG_TABLE when sub-interfaces are created. In this scenario, sub-interfaces will keep admin-down for ever. Fix: In (2), check LAG's admin status in STATE_DB instead of APPL_DB
- What I did
This change addresses an issue introduced by #2699
When adding a lag member dynamically after system boots up, teamd
port priv change handler could re-entrant itself and causing adding
operation to fail.
While handling PORT_CHANGE event, teamd_per_port.c port priv change
handler was called, it will then call runner_lacp to add port to lag,
the later causes IFINFO_CHANGE to be notified and calls the priv change
handler again, this re-entrance would cause runner_lacp port_added to
be called again and messes up with the previous adding sequence. Then
fails the lag member adding operation.
Prevent per port priv change handler re-entrance solves the problem.
Signed-off-by: Ying Xie ying.xie@microsoft.com
- How to verify it
Passed incremental config nightly test over the weekend.
Manually tested add a member port to a lag.
897 iterations of continuous warm reboot: 338 with [teamd] retry creating team_port after interface info changed #2699 only. And 559 with [teamd] retry creating team_port after interface info changed #2699 plus this change (test still running):
yinxi@acs-trusty8:/warmboot/cont_wb$ grep Rebooting wb-test-20190328-0631.log | grep 20190329 | wc -l
559
yinxi@acs-trusty8:/warmboot/cont_wb$ grep Rebooting wb-test-20190328-0631.log | grep -v 20190329 | wc -l
338