Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[swss]: Restore cached FDB and ARP entries after config reload for dual ToR #6912

Conversation

theasianpianist
Copy link
Contributor

@theasianpianist theasianpianist commented Feb 26, 2021

  • Only performed after config reload on dual ToR systems

In conjuction with: sonic-net/sonic-utilities#1465
Signed-off-by: Lawrence Lee lawlee@microsoft.com

Why I did it

On dual ToR systems, we need to quickly restore the ARP table after config reload so the standby ToR is able to continue normal operation.

How I did it

Check for the /host/config-reload directory, which is created by the linked PR. If found, copy the ARP and FDB caches to SWSS. Restore FDB entries to APP_DB. Then wait for the VLAN interface to come up, and restore ARP table to the kernel.

How to verify it

Ran config reload -y on a dual ToR testbed including changes from the linked PR, confirmed that ARP table on standby ToR was restored after.

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012

Description for the changelog

A picture of a cute animal (not mandatory but encouraged)

image

* Only performed after config reload on dual ToR systems

Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
* Only runs on dual ToR systems after config reload

Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
INTF_NAME=$1

until [[ `ip link show $INTF_NAME | grep 'state UP'` ]]; do
sleep 0.1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is too minute. I would suggest wait 1 sec. Also, if its never up, we need a way to exit the script after say 5mts or so.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack, will look into a way to timeout


if [[ -f /arp.json ]];
then
swssconfig /arp.json
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, may be we should not restore the arp.json entries to APP_DB since we are directly adding to kernel. If we add to APP_DB, arp_refresh script kick in and can result in failed entries temporarily.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack, will fix

IP=`echo $KEY | jq -r ". / \":\" | .[2]"`

wait_for_intf_up $DEVICE
ip neigh replace "$IP" dev "$DEVICE" lladdr "$MAC" nud stale
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we are restoring IPv6 entries, this should be "ip -6 neigh ..."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought ip neigh by default could accept both IPv4 and IPv6 entries?

* Timeout kernel ARP restore process after 5 minutes
* Reduce sleep interval when checking interface status
* Don't restoret APP_DB neighbor table when restoring kernel neighbor
table

Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
@theasianpianist theasianpianist changed the title [docker_image_ctl]: Copy cached ARP tables to SWSS docker [swss]: Restore cached FDB and ARP entries after config reload for dual ToR Mar 1, 2021
tahmed-dev
tahmed-dev previously approved these changes Mar 1, 2021
Copy link
Contributor

@tahmed-dev tahmed-dev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

dockers/docker-orchagent/kernel_arp_restore.sh Outdated Show resolved Hide resolved
dockers/docker-orchagent/kernel_arp_restore.sh Outdated Show resolved Hide resolved
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
@@ -71,6 +71,16 @@ stderr_logfile=syslog
dependent_startup=true
dependent_startup_wait_for=orchagent:running

[program:kernel_arp_restore]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the different between kernel_arp_restore and restore_neighbors below? it seems the similiar things.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

restore_neighbors is executed only during warmboot and restoration is by programming and sending arp requests. kernel_arp_restore is only programming to kernel (as dynamic entries) and done for config_reload scenario. Do you prefer we should combine the script?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i recall for fast reboot we also restore the arp entries, which script is that? do we restart kernel arp entries for fast reboot?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK, for fastboot, we only restore to APP_DB so that orchangent programs those ARP entry to ASIC and the arp_update script will send out requests to resolve arp.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like restore_neighbors is used primarily during warmboot. It looks like it does accomplish the same thing, but I am hesitant to change anything warmboot related in case it has other side effects.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why fast_reboot cannot use the this kernel_arp_restore script? why only use this script in config reload scenario?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is fine not to combine with restore_neighbors, but why this arp_restore cannot be combined into swssconfig.sh and also use it for fast reboot?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue with placing this in swssconfig.sh is that these kernel entries can only be programmed after the VLAN interface comes up, which happens after swssconfig.sh has exited

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is ok not be combine. but can this be used for fast-reboot as well?

tahmed-dev
tahmed-dev previously approved these changes Mar 5, 2021
@@ -29,6 +29,25 @@ function fast_reboot {
esac
}

function config_reload {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you rename this function as restator_neighbors, and then rename the fast_reboot script as restore default_routes. basically, i think we should try to reuse the code here and not to create duplication.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can refactor this to re-use more existing code, but not by much since we don't want a lot of overlapping behavior between fast-reboot and config-reload here. E.g. for fast-reboot, we want to restore the ARP table to APP_DB as well as the default routes, but for config reload we do not want to restore the default routes, and we only need the ARP table to be restored to the kernel (if we restored to APP_DB here, there was some concern about having these entries be FAILED if the ToR attempted an ARP refresh).

Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
*)
;;
esac
function signal_kernel_restore {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this only for config reload? can this be used for fast reboot as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fast reboot already restores the neighbor table to APP_DB, and to my knowledge does not need it restored to the kernel.

@theasianpianist
Copy link
Contributor Author

theasianpianist commented Oct 28, 2022

Closing since we are reverting the prerequisite PR sonic-net/sonic-utilities#2460

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants