-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[swss]: Restore cached FDB and ARP entries after config reload for dual ToR #6912
[swss]: Restore cached FDB and ARP entries after config reload for dual ToR #6912
Conversation
* Only performed after config reload on dual ToR systems Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
* Only runs on dual ToR systems after config reload Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
INTF_NAME=$1 | ||
|
||
until [[ `ip link show $INTF_NAME | grep 'state UP'` ]]; do | ||
sleep 0.1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is too minute. I would suggest wait 1 sec. Also, if its never up, we need a way to exit the script after say 5mts or so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack, will look into a way to timeout
|
||
if [[ -f /arp.json ]]; | ||
then | ||
swssconfig /arp.json |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, may be we should not restore the arp.json
entries to APP_DB since we are directly adding to kernel. If we add to APP_DB, arp_refresh script kick in and can result in failed entries temporarily.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack, will fix
IP=`echo $KEY | jq -r ". / \":\" | .[2]"` | ||
|
||
wait_for_intf_up $DEVICE | ||
ip neigh replace "$IP" dev "$DEVICE" lladdr "$MAC" nud stale |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we are restoring IPv6 entries, this should be "ip -6 neigh ..."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought ip neigh
by default could accept both IPv4 and IPv6 entries?
* Timeout kernel ARP restore process after 5 minutes * Reduce sleep interval when checking interface status * Don't restoret APP_DB neighbor table when restoring kernel neighbor table Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
@@ -71,6 +71,16 @@ stderr_logfile=syslog | |||
dependent_startup=true | |||
dependent_startup_wait_for=orchagent:running | |||
|
|||
[program:kernel_arp_restore] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the different between kernel_arp_restore and restore_neighbors below? it seems the similiar things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
restore_neighbors
is executed only during warmboot and restoration is by programming and sending arp requests. kernel_arp_restore
is only programming to kernel (as dynamic entries) and done for config_reload scenario. Do you prefer we should combine the script?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i recall for fast reboot we also restore the arp entries, which script is that? do we restart kernel arp entries for fast reboot?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK, for fastboot, we only restore to APP_DB so that orchangent programs those ARP entry to ASIC and the arp_update script will send out requests to resolve arp.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like restore_neighbors
is used primarily during warmboot. It looks like it does accomplish the same thing, but I am hesitant to change anything warmboot related in case it has other side effects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why fast_reboot cannot use the this kernel_arp_restore script? why only use this script in config reload scenario?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is fine not to combine with restore_neighbors, but why this arp_restore cannot be combined into swssconfig.sh and also use it for fast reboot?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue with placing this in swssconfig.sh
is that these kernel entries can only be programmed after the VLAN interface comes up, which happens after swssconfig.sh
has exited
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is ok not be combine. but can this be used for fast-reboot as well?
@@ -29,6 +29,25 @@ function fast_reboot { | |||
esac | |||
} | |||
|
|||
function config_reload { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you rename this function as restator_neighbors, and then rename the fast_reboot script as restore default_routes. basically, i think we should try to reuse the code here and not to create duplication.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can refactor this to re-use more existing code, but not by much since we don't want a lot of overlapping behavior between fast-reboot and config-reload here. E.g. for fast-reboot, we want to restore the ARP table to APP_DB as well as the default routes, but for config reload we do not want to restore the default routes, and we only need the ARP table to be restored to the kernel (if we restored to APP_DB here, there was some concern about having these entries be FAILED
if the ToR attempted an ARP refresh).
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
*) | ||
;; | ||
esac | ||
function signal_kernel_restore { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this only for config reload? can this be used for fast reboot as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fast reboot already restores the neighbor table to APP_DB, and to my knowledge does not need it restored to the kernel.
Closing since we are reverting the prerequisite PR sonic-net/sonic-utilities#2460 |
In conjuction with: sonic-net/sonic-utilities#1465
Signed-off-by: Lawrence Lee lawlee@microsoft.com
Why I did it
On dual ToR systems, we need to quickly restore the ARP table after config reload so the standby ToR is able to continue normal operation.
How I did it
Check for the
/host/config-reload
directory, which is created by the linked PR. If found, copy the ARP and FDB caches to SWSS. Restore FDB entries to APP_DB. Then wait for the VLAN interface to come up, and restore ARP table to the kernel.How to verify it
Ran
config reload -y
on a dual ToR testbed including changes from the linked PR, confirmed that ARP table on standby ToR was restored after.Which release branch to backport (provide reason below if selected)
Description for the changelog
A picture of a cute animal (not mandatory but encouraged)