FDB entry not removed when a port is removed from VLAN #7538

Open · ppikh opened this issue May 6, 2021 · 13 comments

@ppikh (Contributor) commented May 6, 2021:

Description

FDB entries are not removed when a port is removed from a VLAN.

Steps to reproduce the issue:

  1. Add a VLAN on the DUT (sudo config vlan add 40)
  2. Add two ports as VLAN members (sudo config vlan member add 40 Ethernet28, sudo config vlan member add 40 Ethernet32)
  3. Send traffic in the VLAN created above from the host connected to port Ethernet28 to the host connected to port Ethernet32
  4. Check the MAC addresses in Redis (redis-cli -n 6 keys 'FDB*')
  5. Remove one port from the VLAN (sudo config vlan member del 40 Ethernet28)
  6. Check the MAC addresses in Redis (redis-cli -n 6 keys 'FDB*')

Describe the results you received:

The Redis output still contains the MAC address of the host connected to the port that was removed from the VLAN.

Describe the results you expected:

The Redis output should no longer contain the MAC address of the host connected to the port that was removed from the VLAN. When a port is removed from a VLAN, the MAC addresses learned on that port should be removed as well.

Output of show version:

SONiC Software Version: SONiC.master.118-b2286a24_Internal
Distribution: Debian 10.9
Kernel: 4.19.0-12-2-amd64
Build commit: b2286a24
Build date: Tue May  4 12:01:41 UTC 2021
Built by: sw-r2d2-bot@r-build-sonic-ci03

Platform: x86_64-mlnx_msn2100-r0
HwSKU: ACS-MSN2100
ASIC: mellanox
ASIC Count: 1
Serial Number: MT1752X06330
Uptime: 07:07:08 up 46 min,  1 user,  load average: 3.26, 3.24, 3.04

Docker images:
REPOSITORY                    TAG                                     IMAGE ID            SIZE
docker-dhcp-relay             latest                                  96fbaf982515        416MB
docker-dhcp-relay             master.118-b2286a24_Internal            96fbaf982515        416MB
docker-syncd-mlnx             latest                                  63cd7901a3ba        674MB
docker-syncd-mlnx             master.118-b2286a24_Internal            63cd7901a3ba        674MB
docker-snmp                   latest                                  86945622fea9        450MB
docker-snmp                   master.118-b2286a24_Internal            86945622fea9        450MB
docker-teamd                  latest                                  618c55e69a57        419MB
docker-teamd                  master.118-b2286a24_Internal            618c55e69a57        419MB
docker-nat                    latest                                  34c7b8cd7586        422MB
docker-nat                    master.118-b2286a24_Internal            34c7b8cd7586        422MB
docker-sonic-mgmt-framework   latest                                  4220d8154e18        628MB
docker-sonic-mgmt-framework   master.118-b2286a24_Internal            4220d8154e18        628MB
docker-router-advertiser      latest                                  bc1c33ecbbe1        408MB
docker-router-advertiser      master.118-b2286a24_Internal            bc1c33ecbbe1        408MB
docker-platform-monitor       latest                                  3d89ecf8b644        705MB
docker-platform-monitor       master.118-b2286a24_Internal            3d89ecf8b644        705MB
docker-lldp                   latest                                  95b3bedb00c0        448MB
docker-lldp                   master.118-b2286a24_Internal            95b3bedb00c0        448MB
docker-database               latest                                  f420da16dc6d        408MB
docker-database               master.118-b2286a24_Internal            f420da16dc6d        408MB
docker-orchagent              latest                                  4aa95eeaa1c9        437MB
docker-orchagent              master.118-b2286a24_Internal            4aa95eeaa1c9        437MB
docker-sonic-telemetry        latest                                  c9c85a1bc4db        498MB
docker-sonic-telemetry        master.118-b2286a24_Internal            c9c85a1bc4db        498MB
docker-fpm-frr                latest                                  e636ec17203a        437MB
docker-fpm-frr                master.118-b2286a24_Internal            e636ec17203a        437MB
docker-sflow                  latest                                  96750d48bf94        420MB
docker-sflow                  master.118-b2286a24_Internal            96750d48bf94        420MB
docker-macsec                 latest                                  dd87895db802        422MB
docker-macsec                 master.118-b2286a24_Internal            dd87895db802        422MB
docker-wjh                    latest                                  26469477c68f        505MB
docker-wjh                    master.master.0-dirty-20210502.150257   26469477c68f        505MB

Output of show techsupport:

[sonic_dump_r-bulldog-03_20210506_065908.tar.gz](https://github.com/Azure/sonic-buildimage/files/6432484/sonic_dump_r-bulldog-03_20210506_065908.tar.gz)

@prsunny (Contributor) commented May 7, 2021:

Could you please wait for the FDB ageout time (10 minutes) and check the DB?

@raphaelt-nvidia (Contributor) commented:

Waited 40 minutes after removing one interface from the VLAN. Both entries are still present in the FDB table.

redis-cli -n 6 keys "FDB*"

1) "FDB_TABLE|Vlan247:0c:42:a1:17:e7:1c"
2) "FDB_TABLE|Vlan247:0c:42:a1:17:e7:ac"

@raphaelt-nvidia (Contributor) commented:

I believe this is one of the issues that were intended to be addressed by Broadcom's work on Layer 2 forwarding enhancements, with the following components:

Design doc: https://github.com/Azure/SONiC/blob/master/doc/layer2-forwarding-enhancements/SONiC%20Layer%202%20Forwarding%20Enhancements%20HLD.md

Related PRs:

sonic-net/sonic-utilities#529
This is closed and not merged, apparently because other changes were done in multiple PRs in parallel to the same code, perhaps with the same intention, so this PR is obsolete.

sonic-net/sonic-swss#1716
Merged.

sonic-net/sonic-swss#885
This one is stuck on checkers. It is over 2 years old, has 37 commits, 21 files changed, a long and complex discussion, and looks as if it might address the bug.

@anilkpandey commented:

sonic-net/sonic-swss#885 is closed, as all required changes have been split and upstreamed separately.
The FDB flush changes were not needed anymore, as similar changes were merged before my changes could be reviewed and merged:
sonic-net/sonic-swss#1242

I see that the code to flush the FDB when a port is removed from a VLAN is present, so I am not sure why you are still seeing the issue.

void FdbOrch::updateVlanMember(const VlanMemberUpdate& update)
{
    SWSS_LOG_ENTER();

    if (!update.add)
    {
        swss::Port vlan = update.vlan;
        swss::Port port = update.member;
        flushFDBEntries(port.m_bridge_port_id, vlan.m_vlan_info.vlan_oid);
        notifyObserversFDBFlush(port, vlan.m_vlan_info.vlan_oid);
        return;
    }
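
For reference, flushFDBEntries ends up as a SAI FDB flush scoped to the removed bridge port and the VLAN's BV_ID (the syncd log later in this thread shows exactly those attributes, SAI_FDB_FLUSH_ATTR_BRIDGE_PORT_ID and SAI_FDB_FLUSH_ATTR_BV_ID). Below is a minimal sketch of such a scoped flush, assuming only the standard SAI flush_fdb_entries() API; flushFdbByPortAndVlan is an illustrative name, not the actual sonic-swss helper.

// Minimal sketch of a flush scoped to one bridge port and one VLAN (BV_ID).
// Illustrative only; assumes the standard SAI flush API and the orchagent
// globals sai_fdb_api / gSwitchId.
#include <vector>
extern "C" {
#include "sai.h"
}

extern sai_fdb_api_t *sai_fdb_api;   // bound via sai_api_query() elsewhere
extern sai_object_id_t gSwitchId;    // global switch OID, as in orchagent

sai_status_t flushFdbByPortAndVlan(sai_object_id_t bridge_port_oid,
                                   sai_object_id_t vlan_oid)
{
    std::vector<sai_attribute_t> attrs;
    sai_attribute_t attr;

    // Restrict the flush to entries learned on this bridge port ...
    attr.id = SAI_FDB_FLUSH_ATTR_BRIDGE_PORT_ID;
    attr.value.oid = bridge_port_oid;
    attrs.push_back(attr);

    // ... and belonging to this VLAN (BV_ID).
    attr.id = SAI_FDB_FLUSH_ATTR_BV_ID;
    attr.value.oid = vlan_oid;
    attrs.push_back(attr);

    return sai_fdb_api->flush_fdb_entries(gSwitchId,
                                          (uint32_t)attrs.size(),
                                          attrs.data());
}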

@raphaelt-nvidia (Contributor) commented:

I don't see that code being called when the port is removed from the vlan. But I do see lines like this:

Jan 25 07:50:47.548674 r-panther-13 DEBUG swss#orchagent: :> update: enter
Jan 25 07:50:47.548674 r-panther-13 INFO swss#orchagent: :- update: FDB event:3, MAC: 00:00:00:00:00:00 , BVID: 0x26000000000792 , bridge port ID: 0x3a00000000079d.
Jan 25 07:50:47.548674 r-panther-13 INFO swss#orchagent: :- update: Flush event: Failed to get port by bridge port ID 0x3a00000000079d.
Jan 25 07:50:47.548753 r-panther-13 DEBUG swss#orchagent: :< update: exit
Jan 25 07:50:47.569525 r-panther-13 DEBUG swss#orchagent: :> update: enter
Jan 25 07:50:47.569525 r-panther-13 INFO swss#orchagent: :- update: FDB event:3, MAC: 00:00:00:00:00:00 , BVID: 0x39000000000010 , bridge port ID: 0x3a00000000079d.
Jan 25 07:50:47.569677 r-panther-13 INFO swss#orchagent: :- update: Flush event: Failed to get port by bridge port ID 0x3a00000000079d.
Jan 25 07:50:47.569677 r-panther-13 DEBUG swss#orchagent: :< update: exit

What call stack and sequence do you expect to reach FdbOrch::updateVlanMember?

@anilkpandey commented:

From PortsOrch::removeVlanMember:
notify(SUBJECT_TYPE_VLAN_MEMBER_CHANGE, static_cast<void *>(&update));

As part of handling this notification, fdborch will call FdbOrch::updateVlanMember.
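
To make that path concrete, here is a rough, self-contained sketch of the observer mechanism being described; the names mirror the sonic-swss classes, but the bodies are simplified and illustrative rather than the actual implementation.

#include <list>

enum SubjectType { SUBJECT_TYPE_VLAN_MEMBER_CHANGE /* , ... */ };

struct Observer
{
    virtual ~Observer() = default;
    virtual void update(SubjectType type, void *cntx) = 0;
};

struct Subject
{
    std::list<Observer *> m_observers;

    void attach(Observer *o) { m_observers.push_back(o); }

    void notify(SubjectType type, void *cntx)
    {
        // Every attached orch receives the event and decides for itself
        // whether the subject type is relevant.
        for (auto *o : m_observers)
            o->update(type, cntx);
    }
};

struct Port {};

struct VlanMemberUpdate
{
    Port vlan;
    Port member;
    bool add;
};

struct FdbOrch : public Observer
{
    void update(SubjectType type, void *cntx) override
    {
        if (type == SUBJECT_TYPE_VLAN_MEMBER_CHANGE)
            updateVlanMember(*static_cast<VlanMemberUpdate *>(cntx));
    }

    void updateVlanMember(const VlanMemberUpdate &update)
    {
        // ... flushFDBEntries() / notifyObserversFDBFlush() as quoted above
        (void)update;
    }
};

// PortsOrch (a Subject) would then do, inside removeVlanMember():
//     VlanMemberUpdate update = { vlan, port, false };
//     notify(SUBJECT_TYPE_VLAN_MEMBER_CHANGE, static_cast<void *>(&update));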

@raphaelt-nvidia (Contributor) commented:

I see that FdbOrch::updateVlanMember is called, but the FDB entry in STATE_DB is not cleared. Drilling down with gdb, I see that updateVlanMember calls flushFDBEntries. While stepping, I dumped the vlan member tables before and after the call to flushFDBEntries.

Before:

dump state vlan_member "Vlan247|Ethernet0"
{
    "Vlan247|Ethernet0": {
        "CONFIG_DB": {
            "keys": [
                {
                    "VLAN_MEMBER|Vlan247|Ethernet0": {
                        "tagging_mode": "tagged"
                    }
                }
            ],
            "tables_not_found": []
        },
        "APPL_DB": {
            "keys": [
                {
                    "VLAN_MEMBER_TABLE:Vlan247:Ethernet0": {
                        "tagging_mode": "tagged"
                    }
                }
            ],
            "tables_not_found": []
        },
        "ASIC_DB": {
            "keys": [
                {
                    "ASIC_STATE:SAI_OBJECT_TYPE_VLAN_MEMBER:oid:0x27000000000d1a": {
                        "SAI_VLAN_MEMBER_ATTR_BRIDGE_PORT_ID": "oid:0x3a000000000d19",
                        "SAI_VLAN_MEMBER_ATTR_VLAN_ID": "oid:0x26000000000d14",
                        "SAI_VLAN_MEMBER_ATTR_VLAN_TAGGING_MODE": "SAI_VLAN_TAGGING_MODE_TAGGED"
                    }
                },
                {
                    "ASIC_STATE:SAI_OBJECT_TYPE_BRIDGE_PORT:oid:0x3a000000000d19": {
                        "SAI_BRIDGE_PORT_ATTR_ADMIN_STATE": "true",
                        "SAI_BRIDGE_PORT_ATTR_FDB_LEARNING_MODE": "SAI_BRIDGE_PORT_FDB_LEARNING_MODE_HW",
                        "SAI_BRIDGE_PORT_ATTR_PORT_ID": "oid:0x100000000093e",
                        "SAI_BRIDGE_PORT_ATTR_TYPE": "SAI_BRIDGE_PORT_TYPE_PORT"
                    }
                }
            ],
            "tables_not_found": [],
            "vidtorid": {
                "oid:0x27000000000d1a": "oid:0xf70027",
                "oid:0x3a000000000d19": "oid:0x3a"
            }
        },
        "STATE_DB": {
            "keys": [
                {
                    "VLAN_MEMBER_TABLE|Vlan247|Ethernet0": {
                        "state": "ok"
                    }
                }
            ],
            "tables_not_found": []
        }
    }
}

After:

dump state vlan_member "Vlan247|Ethernet0"
{
    "Vlan247|Ethernet0": {
        "CONFIG_DB": {
            "keys": [],
            "tables_not_found": [
                "VLAN_MEMBER"
            ]
        },
        "APPL_DB": {
            "keys": [],
            "tables_not_found": [
                "VLAN_MEMBER_TABLE"
            ]
        },
        "ASIC_DB": {
            "keys": [],
            "tables_not_found": [
                "ASIC_STATE:SAI_OBJECT_TYPE_VLAN_MEMBER",
                "ASIC_STATE:SAI_OBJECT_TYPE_BRIDGE_PORT"
            ]
        },
        "STATE_DB": {
            "keys": [],
            "tables_not_found": [
                "VLAN_MEMBER_TABLE"
            ]
        }
    }
}

Both before and after:

redis-cli -n 6 keys "FDB*"

1) "FDB_TABLE|Vlan247:02:0a:d4:58:0e:01"
2) "FDB_TABLE|Vlan247:02:0a:d4:58:0f:01"

Something needs to trigger the removal of one of the above FDB entries. Is it the next line of updateVlanMember, which calls notifyObserversFDBFlush, which in turn calls notify in observer.h? That makes three calls to iter->update; the three functions are NeighOrch::update, MirrorOrch::update and MuxOrch::update. The last two don't seem relevant to this case, and they do nothing.

NeighOrch::update calls processFDBFlushUpdate, which does find the port being removed from the VLAN via m_portsOrch->getPort(entry.bv_id, vlan). However, if resolveNeighborEntry is expected to do the removal, that does not happen, because m_syncdNeighbors has 0 elements. From what I see, NeighOrch::addNeighbor and NeighOrch::removeNeighbor are the places where elements would be added to or removed from m_syncdNeighbors, but I do not hit breakpoints in those functions when adding a port to a VLAN or removing it from one.

The above is relevant if my guess is correct that notifyObserversFDBFlush is responsible for removing the FDB entry from STATE_DB. If not, please let me know where to look instead.

@raphaelt-nvidia (Contributor) commented:

@anilkpandey could you please reply to my latest comment so we can get to the bottom of this?

@zhangyanzhao (Collaborator) commented:

Adam Yeung will follow up with the BRCM team to provide more guidance on investigating this issue.

@anilkpan commented Mar 1, 2022:

An FDB flush call from orchagent sends a flush request to SAI. When the MAC is deleted in hardware, an AGED notification is generated, which triggers the deletion of the FDB entry in ASIC_DB (by syncd) as well as in STATE_DB (by fdborch).
Can you check whether the MAC was deleted in hardware and whether the AGED notification was received by SONiC (by syncd and then by fdborch)?
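
For reference, the STATE_DB side of that deletion amounts to removing the FDB_TABLE|<vlan>:<mac> key. A minimal sketch of that last step, using the sonic-swss-common Table API; removeStateDbFdbEntry is a hypothetical helper for illustration, not the actual fdborch code:

// Illustrative only: drop the STATE_DB FDB entry for one MAC once the
// AGED/FLUSHED notification has been processed. The key layout matches the
// keys seen in this issue, e.g. "FDB_TABLE|Vlan247:0c:42:a1:17:e7:1c".
#include <string>
#include "dbconnector.h"
#include "table.h"

void removeStateDbFdbEntry(const std::string &vlanAlias,  // e.g. "Vlan247"
                           const std::string &mac)        // e.g. "0c:42:a1:17:e7:1c"
{
    swss::DBConnector stateDb("STATE_DB", 0);
    swss::Table fdbStateTable(&stateDb, "FDB_TABLE");

    // The Table abstraction prepends "FDB_TABLE|"; the key itself is
    // "<vlan>:<mac>".
    fdbStateTable.del(vlanAlias + ":" + mac);
}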

@raphaelt-nvidia (Contributor) commented:

Thanks. I confirm that the MAC was deleted in HW and a notification reached SAI. I am continuing to investigate. Here are some lines of interest from syslog.

INFO syncd#SDK: :- processSingleEvent: key: SAI_OBJECT_TYPE_FDB_FLUSH:oid:0x21000000000000 op: flush
DEBUG syncd#SDK: :- processFdbFlush: attr: SAI_FDB_FLUSH_ATTR_BRIDGE_PORT_ID: oid:0x3a000000000d5a
DEBUG syncd#SDK: :- processFdbFlush: attr: SAI_FDB_FLUSH_ATTR_BV_ID: oid:0x26000000000d51
DEBUG syncd#SDK: [SAI_FDB.DEBUG] mlnx_sai_fdb.c[1202]- mlnx_flush_fdb_entries: mlnx_flush_fdb_entries - entered
DEBUG syncd#SDK: [SAI_FDB.DEBUG] mlnx_sai_fdb.c[1274]- mlnx_flush_fdb_entries: mlnx_flush_fdb_entries - left
INFO syncd#SDK: [SAI_SWITCH.INFO] mlnx_sai_switch.c[5307]- event_thread_func: Received trap FDB EVENT sdk 1030
INFO syncd#SDK: [SAI_SWITCH.INFO] mlnx_sai_switch.c[4929]- mlnx_switch_parse_fdb_event: FDB event received [1/1, 1] vlan: 62 ; mac: 00:00:00:00:00:00 ; log_port: (0x00010009) ; type: Flush Port FID(10); Roaming: No
NOTICE syncd#SDK: :- processFdbFlush: fdb flush succeeded, updating redis database
NOTICE syncd#SDK: :- processFlushEvent: received a flush port fdb event, portVid = oid:0x3a000000000d5a, bvId = oid:0x26000000000d51
NOTICE syncd#SDK: :- processFlushEvent: pattern ASIC_STATE:SAI_OBJECT_TYPE_FDB_ENTRY:oid:0x26000000000d51, portStr oid:0x3a000000000d5a
INFO syncd#SDK: :- enqueueNotification: fdb_event [{"fdb_entry":"{"bvid":"oid:0x3e00000026","mac":"00:00:00:00:00:00","switch_id":"oid:0x100000021"}","fdb_event":"SAI_FDB_EVENT_FLUSHED","list":[{"id":"SAI_FDB_ENTRY_ATTR_BRIDGE_PORT_ID","value":"oid:0x10000003a"},{"id":"SAI_FDB_ENTRY_ATTR_TYPE","value":"SAI_FDB_ENTRY_TYPE_DYNAMIC"},{"id":"SAI_FDB_ENTRY_ATTR_PACKET_ACTION","value":"SAI_PACKET_ACTION_FORWARD"}]}]
NOTICE syncd#SDK: :- handle_fdb_event: got fdb flush event: [{"fdb_entry":"{"bvid":"oid:0x3e00000026","mac":"00:00:00:00:00:00","switch_id":"oid:0x100000021"}","fdb_event":"SAI_FDB_EVENT_FLUSHED","list":[{"id":"SAI_FDB_ENTRY_ATTR_BRIDGE_PORT_ID","value":"oid:0x10000003a"},{"id":"SAI_FDB_ENTRY_ATTR_TYPE","value":"SAI_FDB_ENTRY_TYPE_DYNAMIC"},{"id":"SAI_FDB_ENTRY_ATTR_PACKET_ACTION","value":"SAI_PACKET_ACTION_FORWARD"}]}]
INFO syncd#SDK: :- sendNotification: fdb_event [{"fdb_entry":"{"bvid":"oid:0x26000000000d51","mac":"00:00:00:00:00:00","switch_id":"oid:0x21000000000000"}","fdb_event":"SAI_FDB_EVENT_FLUSHED","list":[{"id":"SAI_FDB_ENTRY_ATTR_BRIDGE_PORT_ID","value":"oid:0x3a000000000d5a"},{"id":"SAI_FDB_ENTRY_ATTR_TYPE","value":"SAI_FDB_ENTRY_TYPE_DYNAMIC"},{"id":"SAI_FDB_ENTRY_ATTR_PACKET_ACTION","value":"SAI_PACKET_ACTION_FORWARD"}]}]
DEBUG syncd#SDK: :- send: channel NOTIFICATIONS, publish: ["fdb_event","[{"fdb_entry":"{\"bvid\":\"oid:0x26000000000d51\",\"mac\":\"00:00:00:00:00:00\",\"switch_id\":\"oid:0x21000000000000\"}","fdb_event":"SAI_FDB_EVENT_FLUSHED","list":[{"id":"SAI_FDB_ENTRY_ATTR_BRIDGE_PORT_ID","value":"oid:0x3a000000000d5a"},{"id":"SAI_FDB_ENTRY_ATTR_TYPE","value":"SAI_FDB_ENTRY_TYPE_DYNAMIC"},{"id":"SAI_FDB_ENTRY_ATTR_PACKET_ACTION","value":"SAI_PACKET_ACTION_FORWARD"}]}]"]

@raphaelt-nvidia (Contributor) commented:

Hi @anilkpan,

SAI is able to generate both SAI_FDB_EVENT_FLUSHED and SAI_FDB_EVENT_AGED, based on the event arriving from the SDK. Our SDK sends FLUSHED in a scenario like the present one, where a member port is removed from a VLAN, and it sends AGED after a configurable amount of time with no traffic. Thus, where ports A and B are members of a VLAN passing traffic, FLUSHED is generated when A is removed from the VLAN, and AGED is only generated on B after some time with no traffic. I am surprised that you expect the AGED event to be generated when no aging occurs in this case. Is there a SONiC document that prescribes the use of AGED for this case?

@anilkpan commented:

@raphaelt-nvidia,
It depends on the SAI/SDK implementation. Both AGED and FLUSHED events will cause the FDB entry to be removed from SONiC. Some SAI implementations generate AGED events for all MACs that are flushed.
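
In other words, a consumer should treat the two event types the same way. A tiny illustrative sketch (onFdbEvent and handleLearnedEntryRemoval are hypothetical names, not the actual fdborch handler):

// Illustrative only: both AGED and FLUSHED funnel into the same cleanup path,
// so it must not matter which of the two a given SAI/SDK reports for a
// flushed MAC.
extern "C" {
#include "sai.h"
}

static void handleLearnedEntryRemoval(const sai_fdb_entry_t & /*entry*/)
{
    // In the real code this would clear the entry from ASIC_DB/STATE_DB and
    // notify observers; left empty here.
}

void onFdbEvent(sai_fdb_event_t type, const sai_fdb_entry_t &entry)
{
    switch (type)
    {
    case SAI_FDB_EVENT_AGED:     // aged out after the configured timeout
    case SAI_FDB_EVENT_FLUSHED:  // explicitly flushed, e.g. on VLAN member removal
        handleLearnedEntryRemoval(entry);
        break;
    default:                     // LEARNED / MOVE handled elsewhere
        break;
    }
}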
