-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
teamd: Add support for custom retry counts for LACP sessions #13453
Conversation
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
@saiarcot895 in the patch file, there seems to be a mix use of tabs and whitespaces, which caused indentation issues even just see from this PR. Can you unify the spaces? |
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
I converted some lines that appeared to be using spaces to tabs, hopefully I got all of them. |
+ lacp_port->tdport->ifname, | ||
+ lacp_port->partner_retry_count, | ||
+ lacpdu->v2.actor_retry_count); | ||
+ lacp_port->partner_retry_count = lacpdu->v2.actor_retry_count; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean we allways expect the actor and partner to have same retry_count ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. The actor fields in the lacpdu
packet that we received refer to the partner fields in lacp_port
. In other words, in the LACP PDU that we received, the actor information refers to our peer's information (aka our partner).
Also fix a bug with `next_retry_count_change_time` getting calculated repeatedly when a non-standard retry count is sent. Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
/azpw run Azure.sonic-buildimage |
/AzurePipelines run Azure.sonic-buildimage |
Azure Pipelines successfully started running 1 pipeline(s). |
Also fix a logic error when filling in the retry count fields in the LACP PDU packet. Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
First, reply with the most recently received LACP PDU version. This is so that if one side is sending both the old and new versions, then the peer side will send both versions to keep both listeners happy. This also allows for using the new version packets even when the retry count is 3 on both sides. This can only be triggered remotely; there's no local command to force a particular version to be used. Second, reset the 60-second timer preventing the retry count from being reset back to 3 every time a new version PDU is received. This keeps that setting persistent for longer with the downside of a delayed response to the retry count being reset to 3. Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
After getting a LACPDU that uses a diferent LACPDU version, immediately send a LACPDU back. Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
+ lacp->cfg.retry_count = LACP_CFG_DFLT_RETRY_COUNT; | ||
+ } else if (tmp < 3) { | ||
+ teamd_log_err("\"retry_count\" value is out of its limits."); | ||
+ return -EINVAL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could probably generate the error message here and continue with retry count of LACP_CFG_DFLT_RETRY_COUNT, and proceed with LACP_CFG_DFLT_RETRY_COUNT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
…nfig Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Date: 2023-01-18 14:26:36 -0800 | ||
|
||
After setting the retry count to some custom value, if a normal LACP | ||
packet comes in without a custom retry count, don't reset it back to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this description correct? Should this be don't reset the retry count back to old/default retry count
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
if (err) | ||
return err; | ||
lacp_port->lacp->cfg.retry_count = LACP_CFG_DFLT_RETRY_COUNT; | ||
+ lacp_port->last_received_lacpdu_version = 0x01; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the last received version set to 0x01?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is in the case where the LAG has gone down (changed to either expired or defaulted state); at that point, it's better to assume that the peer doesn't understand 0xf1, in case the peer changed, or the peer OS version changed to some older version; otherwise, the LAG may never come up.
+ if (clock_gettime(CLOCK_MONOTONIC, &monotonic_time)) { | ||
+ err = errno; | ||
+ teamd_log_err("%s: unable to get current time: %s", lacp_port->tdport->ifname, strerror(err)); | ||
+ return -err; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this if block doing anything? I don't see monotonic_time
that is collected here being used anywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clock_gettime
returns -1 for failure and 0 for success; that means if it gets into the if-block, then clock_gettime
failed for some reason.
+ if (clock_gettime(CLOCK_MONOTONIC, &monotonic_time)) { | ||
+ err = errno; | ||
+ teamd_log_err("%s: unable to get current time: %s", lacp_port->tdport->ifname, strerror(err)); | ||
+ return -err; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens to retry count if getting time fails? Will/should this be fatal for teamd?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, with the current code, this would mean the LACPDU packet wouldn't even get processed, which would mean that the 90-second (or however many seconds based on the retry count) timer won't get refreshed, and the LAG would go down.
The timers that teamd uses also use CLOCK_MONOTONIC
, so I'm inclined to think that if that call fails 3 times, something is messed up with the system.
} | ||
|
||
+ if (lacp_port->last_received_lacpdu_version != lacpdu->version_number) { | ||
+ teamd_log_dbg(lacp_port->ctx, "%s: LACPDU version changed from %u to %u", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indentation here looks not consistent w/ rest of the indentation in this file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
+ lacpdu->version_number); | ||
+ lacp_port->last_received_lacpdu_version = lacpdu->version_number; | ||
+ // Force-send a LACPDU packet acknowledging change in version | ||
+ err = lacpdu_send(lacp_port); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same packet w/ changed version is sent here. Don't you also want to change retry count in the sent packet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new retry count is in this packet (it would've been updated in the previous if-else block, the lacp_port->partner_retry_count
field).
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
* Add CLI configuration options for teamd retry count feature Add a SONiC CLI to more easily configure the retry count for port channels. This effectively acts like a wrapper around the underlying teamdctl command. Also add a python script that'll be installed into /usr/local/bin/teamd_increase_retry_count.py that will detect if the peer device likely supports this teamd feature (based on LLDP neighbor info) and increases the teamd retry count to 5 in preparation for warm upgrade. This script requires sudo to run. This is tied to sonic-net/sonic-buildimage#13453. Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Add test for error case from teamd when it's not running Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Fix up test cases Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Add some error handling if teamdctl doesn't exist Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Add probe functionality and sending current LACPDU packet functionality Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Check to see if the retry count feature is enabled before doing a get or set Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Add option to only send probe packets or only change retry count Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Call the teamd retry count script if doing a warm-reboot Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Fix pycheck errors, and disable scapy's IPv6 and verbose mode Scapy's IPv6 support appears to have caused some issues with older versions of scapy, which may be present on older SONiC images. Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Make teamd retry count support optional Don't fail warm reboot if teamd retry count support doesn't happen to be present. Also use fastfast-reboot for Mellanox devices. Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Address review comments, and restructure code to increase code coverage Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Address some review comments Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Replace tabs with spaces Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Verify that expected keys are present in the data returned from teamdctl Also update a failure message in the warm-reboot script if the retry count script fails. Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Fix TimeoutExpired undefined error Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Add ability to mock subprocess calls (at a limited level) Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Return an actual subprocess object, and add a test for checking timeout Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Change variable syntax Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Fix set being accessed with an index Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Add option to warm-reboot script to control if teamd retry count is required or not Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Move the teamd retry count check to before orchagent This is so that in the case of the teamd retry count check failing, there's fewer changes that happen on the system (it'll fail faster). Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Move retry count script start to be prior to point-of-no-return This doesn't need to be after the point-of-no-return, since this will detach and be sending LACPDUs on its own. Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Set executable bit Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Address PR comments Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Change to case-insensitive string contains check Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Make sure the global abort variable is used Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> --------- Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, I see that the cutom retry is sent based on the RX of LACP PDU. Is there a way user can set this explicitly to override the default retry of 3 ?
The user can specify any custom retry count from the CLI (there's a sonic-utilities PR that was merged in with support for this). However, after some period of time, or some state changes (such as the link going down or the LACP session going down), it will reset back to 3 to comply with the current LACP standard. |
…t#2642) * Add CLI configuration options for teamd retry count feature Add a SONiC CLI to more easily configure the retry count for port channels. This effectively acts like a wrapper around the underlying teamdctl command. Also add a python script that'll be installed into /usr/local/bin/teamd_increase_retry_count.py that will detect if the peer device likely supports this teamd feature (based on LLDP neighbor info) and increases the teamd retry count to 5 in preparation for warm upgrade. This script requires sudo to run. This is tied to sonic-net/sonic-buildimage#13453. Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Add test for error case from teamd when it's not running Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Fix up test cases Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Add some error handling if teamdctl doesn't exist Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Add probe functionality and sending current LACPDU packet functionality Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Check to see if the retry count feature is enabled before doing a get or set Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Add option to only send probe packets or only change retry count Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Call the teamd retry count script if doing a warm-reboot Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Fix pycheck errors, and disable scapy's IPv6 and verbose mode Scapy's IPv6 support appears to have caused some issues with older versions of scapy, which may be present on older SONiC images. Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Make teamd retry count support optional Don't fail warm reboot if teamd retry count support doesn't happen to be present. Also use fastfast-reboot for Mellanox devices. Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Address review comments, and restructure code to increase code coverage Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Address some review comments Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Replace tabs with spaces Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Verify that expected keys are present in the data returned from teamdctl Also update a failure message in the warm-reboot script if the retry count script fails. Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Fix TimeoutExpired undefined error Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Add ability to mock subprocess calls (at a limited level) Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Return an actual subprocess object, and add a test for checking timeout Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Change variable syntax Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Fix set being accessed with an index Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Add option to warm-reboot script to control if teamd retry count is required or not Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Move the teamd retry count check to before orchagent This is so that in the case of the teamd retry count check failing, there's fewer changes that happen on the system (it'll fail faster). Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Move retry count script start to be prior to point-of-no-return This doesn't need to be after the point-of-no-return, since this will detach and be sending LACPDUs on its own. Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Set executable bit Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Address PR comments Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Change to case-insensitive string contains check Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> * Make sure the global abort variable is used Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com> --------- Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
…et#13453) Why I did it This is to add support for specifying custom retry counts for LACP sessions. This is to make warmboot easier on low-storage and low-memory platforms, by allowing more than 90 seconds of downtime. How I did it How to verify it Tested manually with these cases: Verify that changing the retry count using teamdctl PortChannel101 state item set runner.retry_count 5 takes effect Verify that the retry count change actually affects when the LAG goes down by forcefully killing teamd on one side (i.e. setting the retry count to 5 causes the LAG to go down after 150 seconds) Verify that the retry count gets reset to 3 after the LAG goes down for whatever reason Verify that the retry count gets reset to 3 after some period of time (30 seconds * retry count) Test cases are in sonic-net/sonic-mgmt#7961 and sonic-net/sonic-mgmt#8152. Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Signed-off-by: Saikrishna Arcot sarcot@microsoft.com
ADO (Microsoft only): 22127474
Why I did it
This is to add support for specifying custom retry counts for LACP sessions. This is to make warmboot easier on low-storage and low-memory platforms, by allowing more than 90 seconds of downtime.
How I did it
How to verify it
Tested manually with these cases:
teamdctl PortChannel101 state item set runner.retry_count 5
takes effectTest cases are in sonic-net/sonic-mgmt#7961 and sonic-net/sonic-mgmt#8152.
Which release branch to backport (provide reason below if selected)
Tested and verified to be working on 202205 image.
Description for the changelog
Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU.
Link to config_db schema for YANG module changes
A picture of a cute animal (not mandatory but encouraged)