[fast-reboot] FDB entries are not restored on fast-reboot #5216

Closed
stepanblyschak opened this issue Aug 19, 2020 · 8 comments · Fixed by sonic-net/sonic-utilities#1059

stepanblyschak commented Aug 19, 2020

Description

FDB entries are not restored on fast-reboot. This issue causes traffic to flood for a long period of time, until all FDB entries are re-learned. The original fast-reboot design is to save FDB entries to a file and add them to the HW as soon as possible during fast-boot, instead of waiting for learning.

Steps to reproduce the issue:

  1. ansible-playbook test_sonic.yml -i inventory --limit $sw-t0 -e testbed_name=$sw-t0 -e testbed_type=t0 -e testcase_name=fast-reboot
  2. Open /tmp/fast-reboot.log in PTF container
  3. Observe flooding after data plane comes back

Describe the results you received:

Flooding for a long period of time after fast-reboot when data plane comes back.

Describe the results you expected:

There should be no flooding when the switch is fast-rebooted. FDB entries known prior to the reboot have to be restored in the HW ASAP instead of waiting for learning to happen.

Additional information you deem important (e.g. issue happens only occasionally):

The issue is most probably caused by filter_fdb_entries.py, added in sonic-net/sonic-utilities#890. A simple reproduction without running the full fast-reboot flow:

admin@arc-switch1025:~$ show arp
Address       MacAddress         Iface            Vlan
------------  -----------------  ---------------  ------
10.0.0.57     52:54:00:5e:30:81  PortChannel0001  -
10.0.0.59     52:54:00:a0:3d:57  PortChannel0002  -
10.0.0.61     52:54:00:ac:8a:35  PortChannel0003  -
10.0.0.63     52:54:00:69:75:b3  PortChannel0004  -
10.210.24.1   00:00:5e:00:01:01  eth0             -
10.210.25.32  14:18:77:33:f0:50  eth0             -
192.168.0.2   e4:1d:2d:ca:c7:05  Ethernet20       1000
Total number of entries 7
admin@arc-switch1025:~$ show mac
  No.    Vlan  MacAddress         Port        Type
-----  ------  -----------------  ----------  -------
    1    1000  E4:1D:2D:CA:C7:05  Ethernet20  Dynamic
Total number of entries 1
admin@arc-switch1025:~$ /usr/bin/fast-reboot-dump.py -t .
admin@arc-switch1025:~$ cat fdb.json
[
  {
    "FDB_TABLE:Vlan1000:E4-1D-2D-CA-C7-05": {
      "type": "dynamic",
      "port": "Ethernet20"
    },
    "OP": "SET"
  }
]
admin@arc-switch1025:~$ /usr/bin/filter_fdb_entries.py -f ./fdb.json -a ./arp.json -c /etc/sonic/config_db.json
admin@arc-switch1025:~$ cat fdb.json
**Output of `show version`:**
SONiC Software Version: SONiC.201911.172-1abf7be7
Distribution: Debian 9.13
Kernel: 4.9.0-11-2-amd64
Build commit: 1abf7be7
Build date: Tue Aug 18 04:33:33 UTC 2020
Built by: johnar@jenkins-worker-8

Platform: x86_64-mlnx_msn2700-r0
HwSKU: ACS-MSN2700
ASIC: mellanox
Serial Number: MT1822K07815
Uptime: 09:18:51 up 17:21,  1 user,  load average: 0.13, 0.12, 0.10

Docker images:
REPOSITORY                    TAG                   IMAGE ID            SIZE
docker-syncd-mlnx             201911.172-1abf7be7   f1e750afd5fc        392MB
docker-syncd-mlnx             latest                f1e750afd5fc        392MB
docker-sonic-telemetry        201911.172-1abf7be7   7ba22719d1f6        353MB
docker-sonic-telemetry        latest                7ba22719d1f6        353MB
docker-router-advertiser      201911.172-1abf7be7   d6af74e64a7d        289MB
docker-router-advertiser      latest                d6af74e64a7d        289MB
docker-sonic-mgmt-framework   201911.172-1abf7be7   70dd81d12201        429MB
docker-sonic-mgmt-framework   latest                70dd81d12201        429MB
docker-platform-monitor       201911.172-1abf7be7   25bb9a6362ca        659MB
docker-platform-monitor       latest                25bb9a6362ca        659MB
docker-fpm-frr                201911.172-1abf7be7   127bc955a4f9        334MB
docker-fpm-frr                latest                127bc955a4f9        334MB
docker-sflow                  201911.172-1abf7be7   9c8697cd760a        314MB
docker-sflow                  latest                9c8697cd760a        314MB
docker-lldp-sv2               201911.172-1abf7be7   ea53736695ce        311MB
docker-lldp-sv2               latest                ea53736695ce        311MB
docker-dhcp-relay             201911.172-1abf7be7   04d8fc8cc983        299MB
docker-dhcp-relay             latest                04d8fc8cc983        299MB
docker-database               201911.172-1abf7be7   ae9a447b86a6        289MB
docker-database               latest                ae9a447b86a6        289MB
docker-teamd                  201911.172-1abf7be7   fd29716b8b6d        313MB
docker-teamd                  latest                fd29716b8b6d        313MB
docker-snmp-sv2               201911.172-1abf7be7   06e569191f99        347MB
docker-snmp-sv2               latest                06e569191f99        347MB
docker-orchagent              201911.172-1abf7be7   6ffa04ee3ed4        332MB
docker-orchagent              latest                6ffa04ee3ed4        332MB
docker-nat                    201911.172-1abf7be7   4bbbc474ec7d        315MB
docker-nat                    latest                4bbbc474ec7d        315MB

**Attach debug file `sudo generate_dump`:**

```
(paste your output here)
```

arp.json.txt
fdb.json.txt
config_db.json.txt

@tahmed-dev
Contributor

@stepanblyschak Thanks for reporting this issue. Can you please post the content of arp.json?

@tahmed-dev
Contributor

The script filters FDB entries based on the contents of the arp.json file, not on the output of the show arp command. If arp.json is empty due to issue #5217, it is expected that fdb.json is cleared.
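
The filtering behavior described in this comment can be sketched as follows. This is a minimal illustration, not the actual code of filter_fdb_entries.py: the names `normalize_mac`, `arp_macs` and `filter_fdb_by_arp` are hypothetical, the MAC normalization is an assumption added so that the key format used in fdb.json (`24-8A-07-9C-86-05`) can be compared with the `neigh` format used in arp.json (`24:8a:07:9c:86:05`), and the config_db.json input passed with `-c` is ignored here.

```python
import json


def normalize_mac(mac):
    """Lowercase a MAC and use ':' separators so both file formats compare equal."""
    return mac.strip().lower().replace("-", ":")


def arp_macs(arp_entries):
    """Collect the MACs found in the 'neigh' field of NEIGH_TABLE entries."""
    macs = set()
    for entry in arp_entries:
        for key, value in entry.items():
            # Skip the "OP" key; only NEIGH_TABLE keys carry a neighbor MAC.
            if key.startswith("NEIGH_TABLE:") and "neigh" in value:
                macs.add(normalize_mac(value["neigh"]))
    return macs


def filter_fdb_by_arp(fdb_entries, arp_entries):
    """Keep only FDB entries whose MAC also appears in the ARP entries."""
    known = arp_macs(arp_entries)
    kept = []
    for entry in fdb_entries:
        for key in entry:
            # FDB keys look like "FDB_TABLE:Vlan1000:24-8A-07-9C-86-05".
            parts = key.split(":", 2)
            if parts[0] == "FDB_TABLE" and len(parts) == 3:
                if normalize_mac(parts[2]) in known:
                    kept.append(entry)
                break
    return kept


# Sample data in the same shape as the fdb.json and arp.json dumps in this thread.
fdb = [{"OP": "SET",
        "FDB_TABLE:Vlan1000:24-8A-07-9C-86-05": {"type": "dynamic", "port": "Ethernet10"}}]
arp = [{"NEIGH_TABLE:Vlan1000:192.168.0.5": {"neigh": "24:8a:07:9c:86:05", "family": "IPv4"},
        "OP": "SET"}]

print(json.dumps(filter_fdb_by_arp(fdb, arp), indent=2))  # entry is kept
print(json.dumps(filter_fdb_by_arp(fdb, [])))             # empty ARP list clears the result
```

With an empty ARP list the result is an empty list, which matches the "fdb.json cleared" outcome this comment describes for an empty arp.json.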

@stepanblyschak
Collaborator Author

@tahmed-dev Sending the repro also with arp.json content:

admin@arc-switch1004:~$ show arp
Address       MacAddress         Iface            Vlan
------------  -----------------  ---------------  ------
10.0.0.59     52:54:00:c8:fa:34  PortChannel0002  -
10.0.0.61     52:54:00:35:67:e6  PortChannel0003  -
10.0.0.63     52:54:00:8e:db:0d  PortChannel0004  -
10.210.24.1   00:00:5e:00:01:01  eth0             -
10.210.25.32  14:18:77:33:f0:50  eth0             -
192.168.0.5   24:8a:07:9c:86:05  Ethernet10       1000
Total number of entries 6
admin@arc-switch1004:~$ /usr/bin/fast-reboot-dump.py -t .
admin@arc-switch1004:~$ cat fdb.json
[
  {
    "OP": "SET",
    "FDB_TABLE:Vlan1000:24-8A-07-9C-86-05": {
      "type": "dynamic",
      "port": "Ethernet10"
    }
  }
]
admin@arc-switch1004:~$ cat arp.json
[
  {
    "NEIGH_TABLE:Vlan1000:192.168.0.5": {
      "neigh": "24:8a:07:9c:86:05",
      "family": "IPv4"
    },
    "OP": "SET"
  }
]
admin@arc-switch1004:~$ /usr/bin/filter_fdb_entries.py -f ./fdb.json -a ./arp.json -c /etc/sonic/config_db.json
admin@arc-switch1004:~$ cat fdb.json
[]
admin@arc-switch1004:~$

Issue #5217 is different. As you can see, arp.json has the necessary entry, but after the reboot in #5217 those entries are not configured on the HW before ports become operationally up.
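
A quick way to confirm that the two dumps in the repro above describe the same host is to compare the MAC strings directly. The sketch below uses the values copied from that output; the helper `canonical` is a hypothetical name. Whether the difference in separator and letter case is what makes filter_fdb_entries.py drop the entry is not established in this thread, so treat that as an assumption to verify rather than a diagnosis.

```python
# MAC strings copied from the fdb.json and arp.json output in the repro.
fdb_key = "FDB_TABLE:Vlan1000:24-8A-07-9C-86-05"
arp_neigh = "24:8a:07:9c:86:05"

# The MAC is the last ':'-separated field of the FDB key.
fdb_mac = fdb_key.rsplit(":", 1)[-1]


def canonical(mac):
    """Hypothetical helper: lowercase the MAC and use ':' as the separator."""
    return mac.lower().replace("-", ":")


raw_match = fdb_mac == arp_neigh
canonical_match = canonical(fdb_mac) == canonical(arp_neigh)

print("raw strings equal:      ", raw_match)
print("canonical strings equal:", canonical_match)
```

The raw strings differ while the canonical forms are equal, so both files do refer to the same MAC address.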

@tahmed-dev
Contributor

Thanks @stepanblyschak. Can you please attach these files to the issue:

./arp.json
./fdb.json
/etc/sonic/config_db.json

@stepanblyschak
Collaborator Author

@tahmed-dev Done

@stepanblyschak
Collaborator Author

sonic_dump_arc-switch1004_20201013_124814.tar.gz

Reopening as the issue reproduced. Attached new system dump.

@tahmed-dev
Contributor

I see in the log file the following signature:

syslog.13.gz:Oct 13 05:14:53.259842 arc-switch1004 NOTICE bgp#fpmsyncd: :- setWarmStartState: bgp warm start state changed to restored
syslog.14.gz:Oct 13 04:10:03.682696 arc-switch1004 INFO swss#restore_neighbor: restore_neighbors service is started
syslog.14.gz:Oct 13 04:10:03.696355 arc-switch1004 INFO swss#restore_neighbor: restore_neighbors service is skipped as warm restart not enabled
syslog.14.gz:Oct 13 04:10:13.056784 arc-switch1004 INFO swss#supervisord 2020-10-13 04:10:03,947 INFO exited: restore_neighbors (exit status 0; expected)
syslog.15.gz:Oct 13 03:52:23.998567 arc-switch1004 INFO swss#supervisord 2020-10-13 03:52:22,037 INFO spawned: 'restore_neighbors' with pid 75
syslog.15.gz:Oct 13 03:52:23.998649 arc-switch1004 INFO swss#supervisord 2020-10-13 03:52:22,084 INFO success: restore_neighbors entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
syslog.15.gz:Oct 13 03:52:34.491868 arc-switch1004 INFO swss#restore_neighbor: restore_neighbors service is started
syslog.15.gz:Oct 13 03:52:34.518601 arc-switch1004 INFO swss#restore_neighbor: restore_neighbors service is skipped as warm restart not enabled

which seems related to issue #5580.
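
For reference, a search like the one that produced the log excerpt above could be scripted roughly as below. This is a sketch under assumptions: the function name `find_signature` is hypothetical, and the log directory is whatever location the rotated `syslog.N.gz` files were extracted to from the system dump.

```python
import glob
import gzip
import os

# Signature text taken from the syslog lines quoted in this comment.
SIGNATURE = "restore_neighbors service is skipped as warm restart not enabled"


def find_signature(log_dir, signature=SIGNATURE):
    """Return (path, line) pairs for every rotated syslog line containing the signature."""
    hits = []
    for path in sorted(glob.glob(os.path.join(log_dir, "syslog*.gz"))):
        # Open in text mode; replace undecodable bytes instead of failing.
        with gzip.open(path, "rt", errors="replace") as log:
            for line in log:
                if signature in line:
                    hits.append((path, line.rstrip("\n")))
    return hits
```

Calling `find_signature("<extracted dump>/log")` returns one tuple per matching line, which can then be printed in the `file:line` form used above.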

@stepanblyschak
Collaborator Author

I opened the wrong issue by mistake. The comments above are related to #5217.
