Interface state is 'down' and not going up when adding and removing it from vlan group #5347

Closed
shlomibitton opened this issue Sep 9, 2020 · 14 comments

@shlomibitton
Contributor

shlomibitton commented Sep 9, 2020

Description
The interface state changes to 'down' and does not come back up after the interface is added to a VLAN and then removed from it.
The issue is not easy to reproduce; it occurs randomly, sometimes only after repeating the steps below several times or more.
Restoring the interface requires a config reload or a reboot.

Steps to reproduce the issue:

  1. config vlan add 2
  2. config vlan member add 2 Ethernet104

Repeat the following until the issue appears:

  1. config vlan member del 2 Ethernet104
  2. config vlan member add -u 2 Ethernet104

Describe the results you received:
On system log:
NOTICE swss#orchagent: :- removeVlanMember: Remove member Ethernet104 from VLAN Vlan3 lid:3 vmid:27000000000663
NOTICE swss#orchagent: :- setHostIntfsStripTag: Set SAI_HOSTIF_VLAN_TAG_STRIP to host interface: Ethernet104
ERR swss#orchagent: :- meta_generic_validation_remove: object 0x3a000000000662 reference count is 1, can't remove
ERR swss#orchagent: :- removeBridgePort: Failed to remove bridge port Ethernet104 from default 1Q bridge, rv:-5

Describe the results you expected:
This flow should produce no errors, and the interface should come back up.

Output of show version:

```
SONiC Software Version: SONiC.201911.188-ffbf0ed3
Distribution: Debian 9.13
Kernel: 4.9.0-11-2-amd64
Build commit: ffbf0ed3
Build date: Mon Sep  7 03:28:55 UTC 2020
Built by: johnar@jenkins-worker-8

Platform: x86_64-mlnx_msn3700-r0
HwSKU: ACS-MSN3700
ASIC: mellanox
Serial Number: MT1851X02961
Uptime: 07:07:49 up  3:31,  1 user,  load average: 0.77, 0.79, 0.82
```

syslog.txt
sai_sdk_dump_09_08_2020_03_23_PM.gz
saidump.txt

full techsupport output:
mstdump.zip
proc.zip
sai_sdk_dump.zip
hw-mgmt.zip
log.zip
log2.zip
log3.zip
dump.zip
etc.zip

@prsunny
Contributor

prsunny commented Sep 9, 2020

can you provide the full output of 'show version' including the platform?

@shlomibitton
Contributor Author

> can you provide the full output of 'show version' including the platform?

@prsunny I have added the platform from 'show version'. The issue reproduces on several platforms, not only this one.
In addition, after further investigation I attached 'saidump.txt' to this issue. It looks like a MAC entry was learned on this interface while it was a member of the VLAN, and the remaining reference from that entry to the interface caused the removal from the VLAN to fail. I think the best solution is to flush the FDB entries of this interface before removing it from the VLAN, which would avoid this scenario.

@anshuv-mfst

  • Is Ethernet104 part of another VLAN?
  • Please provide show techsupport output for the bug.

@prsunny
Contributor

prsunny commented Sep 16, 2020

@itaibaz, could you please take a look at this issue?

@itaibaz

itaibaz commented Sep 16, 2020

The error seems to be a SAI redis issue rather than a SAI issue, so I don't understand the connection to SAI:
ERR swss#orchagent: :- meta_generic_validation_remove: object 0x3a000000000662 reference count is 1, can't remove
ERR swss#orchagent: :- removeBridgePort: Failed to remove bridge port Ethernet104 from default 1Q bridge, rv:-5

@shlomibitton
Contributor Author

@anshuv-mfst Ethernet104 does not belong to any other VLAN.
I added a full techsupport output here.

@keboliu
Collaborator

keboliu commented Sep 18, 2020

In the function PortsOrch::removeBridgePort there is a comment, shown below:

/* Flush FDB entries pointing to this bridge port */
// TODO: Remove all FDB entries associated with this bridge port before
//       removing the bridge port itself

/* Remove bridge port */
status = sai_bridge_api->remove_bridge_port(port.m_bridge_port_id);
if (status != SAI_STATUS_SUCCESS)
{
    SWSS_LOG_ERROR("Failed to remove bridge port %s from default 1Q bridge, rv:%d",
        port.m_alias.c_str(), status);
    return false;
}

It seems the logic to remove the FDB entries before removing the bridge port is missing. I think this is exactly what we have observed: there is still a reference to the port in the DB, so the removal fails. This is unlikely to be specific to a particular platform. @prsunny what do you think?
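
To make the ordering problem concrete, here is a minimal sketch of the reference-count behaviour described above. It is a toy model only: `ToyBridge`, `ObjectId`, and every method name are hypothetical and are not the SAI, sairedis, or orchagent API. It assumes that each learned FDB entry holds one reference on its bridge port and that removal is refused while any reference remains.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Toy model only: hypothetical types, not the SAI or sairedis API.
using ObjectId = std::uint64_t;

class ToyBridge
{
public:
    void createBridgePort(ObjectId bridgePort)
    {
        m_refCount.emplace(bridgePort, 0);
    }

    // A learned MAC entry takes one reference on its bridge port.
    void learnFdbEntry(const std::string &mac, ObjectId bvid, ObjectId bridgePort)
    {
        assert(m_refCount.count(bridgePort) == 1); // port must exist
        auto inserted = m_fdb.emplace(std::make_pair(mac, bvid), bridgePort);
        if (inserted.second)
        {
            ++m_refCount[bridgePort];
        }
    }

    // Drop every FDB entry pointing at the bridge port and release its references.
    std::size_t flushFdbByBridgePort(ObjectId bridgePort)
    {
        std::size_t flushed = 0;
        for (auto it = m_fdb.begin(); it != m_fdb.end();)
        {
            if (it->second == bridgePort)
            {
                it = m_fdb.erase(it);
                --m_refCount[bridgePort];
                ++flushed;
            }
            else
            {
                ++it;
            }
        }
        return flushed;
    }

    // Refuses removal while the port is still referenced, like the
    // "reference count is 1, can't remove" error in the log.
    bool removeBridgePort(ObjectId bridgePort)
    {
        auto it = m_refCount.find(bridgePort);
        if (it == m_refCount.end() || it->second != 0)
        {
            return false;
        }
        m_refCount.erase(it);
        return true;
    }

    // Ordering proposed in this thread: flush first, then remove.
    bool flushAndRemoveBridgePort(ObjectId bridgePort)
    {
        flushFdbByBridgePort(bridgePort);
        return removeBridgePort(bridgePort);
    }

private:
    std::map<std::pair<std::string, ObjectId>, ObjectId> m_fdb; // (mac, bvid) -> bridge port
    std::map<ObjectId, int> m_refCount;                         // bridge port -> references
};
```

In this model, calling `removeBridgePort` right after `learnFdbEntry` returns false, while `flushAndRemoveBridgePort` succeeds. Whether the real fix should flush through SAI, through orchagent's own FDB bookkeeping, or both is not something this sketch decides.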

@itaibaz

itaibaz commented Sep 18, 2020

I checked which FDB entry was learnt:

2020-09-08.15:12:57.786288|n|fdb_event|[{"fdb_entry":"{\"bvid\":\"oid:0x26000000000013\",\"mac\":\"24:8A:07:3E:0C:86\",\"switch_id\":\"oid:0x21000000000000\"}","fdb_event":"SAI_FDB_EVENT_LEARNED","list":[{"id":"SAI_FDB_ENTRY_ATTR_BRIDGE_PORT_ID","value":"oid:0x3a000000000662"},{"id":"SAI_FDB_ENTRY_ATTR_TYPE","value":"SAI_FDB_ENTRY_TYPE_DYNAMIC"},{"id":"SAI_FDB_ENTRY_ATTR_PACKET_ACTION","value":"SAI_PACKET_ACTION_FORWARD"}]}]|
It was learnt on VLAN 1 (the default VLAN, SAI_SWITCH_ATTR_DEFAULT_VLAN_ID=oid:0x26000000000013).

There is an error:
Sep 8 15:12:57.786697 arc-switch1038 NOTICE swss#orchagent: :- storeFdbEntryState: FdbOrch notification: Failed to locate vlan port from bv_id 0x26000000000013
The error happens because the port isn't a member of that VLAN (default VLAN 1 membership is removed right after switch creation); the port is a member of VLAN 3 only.
In general, though, a port can learn and receive ingress packets on VLANs it is not a member of.

The port was added to VLAN 3, but the PVID was not set, so it remains 1:
Sep 8 15:11:48.514340 arc-switch1038 INFO syncd#supervisord: syncd Sep 08 15:11:48 NOTICE SAI_BRIDGE: mlnx_sai_bridge.c[2618]- mlnx_create_bridge_port: Create bridge port, #0 TYPE=PORT #1 PORT_ID=PORT,(0:0),15100,0000,0 #2 ADMIN_STATE=true #3 FDB_LEARNING_MODE=HW
Sep 8 15:11:48.514746 arc-switch1038 INFO syncd#supervisord: syncd Sep 08 15:11:48 NOTICE SAI_BRIDGE: mlnx_sai_bridge.c[2917]- mlnx_create_bridge_port: Created bridge port idx 19
Sep 8 15:11:48.514967 arc-switch1038 INFO syncd#supervisord: syncd Sep 08 15:11:48 NOTICE SAI_UTILS: mlnx_sai_utils.c[2397]- set_dispatch_attrib_handler: Set VLAN_TAG, key:host interface 26, val:KEEP
Sep 8 15:11:48.514967 arc-switch1038 INFO syncd#supervisord: syncd Sep 08 15:11:48 NOTICE SAI_VLAN: mlnx_sai_vlan.c[1267]- mlnx_create_vlan_member: Create vlan member, #0 VLAN_ID=VLAN,(0:0),3,0000,0 #1 BRIDGE_PORT_ID=BRIDGE_PORT,(0:0),19,0000,0 #2 VLAN_TAGGING_MODE=TAGGED
Sep 8 15:11:48.516572 arc-switch1038 INFO syncd#supervisord: syncd Sep 08 15:11:48 NOTICE SAI_VLAN: mlnx_sai_vlan.c[1326]- mlnx_create_vlan_member: Created vlan member Vlan member port 15100 vlan 3

We work with ingress VLAN filtering on.
This means a tagged packet with VLAN 1 is filtered and dropped on VLAN membership (because the port isn't a member of VLAN 1).
However, an untagged packet gets the default PVID = 1 and is not filtered.
This is explained in the PRM: PVID is assigned after the ingress VLAN membership filtering, thus even if the port is configured as non-member of the VLAN-ID which is assigned a PVID, the packet will NOT be discarded. To discard un-tagged packets, one should use the acceptable frame types as described in 6.3.2 Acceptable Frame Types.
So, bottom line, it seems an untagged packet arrived on the port.
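
The filtering order described above can be sketched as follows. This is a toy model under stated assumptions, not the SAI or SDK API: `ToyPort`, `ToyFrame`, `Classification`, and `classifyIngress` are hypothetical names. It assumes that membership filtering is applied only to the tag a frame carries, and that the PVID is assigned to untagged frames afterwards without a membership check.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Toy model only: hypothetical types, not the SAI or SDK API.
struct ToyPort
{
    std::uint16_t pvid = 1;              // default PVID stays 1 unless set explicitly
    std::set<std::uint16_t> memberVlans; // VLANs the port is a member of
};

struct ToyFrame
{
    bool tagged;        // true if the frame carries a VLAN tag
    std::uint16_t vlan; // VLAN ID from the tag; ignored when untagged
};

struct Classification
{
    bool accepted;      // false if the frame is dropped by ingress filtering
    std::uint16_t vlan; // VLAN the frame is classified into when accepted
};

Classification classifyIngress(const ToyPort &port, const ToyFrame &frame)
{
    assert(port.pvid >= 1 && port.pvid <= 4094); // valid VLAN ID range

    if (frame.tagged)
    {
        // Ingress VLAN filtering: drop tagged frames for VLANs the port is not in.
        if (port.memberVlans.count(frame.vlan) == 0)
        {
            return Classification{false, 0};
        }
        return Classification{true, frame.vlan};
    }

    // Untagged: the PVID is assigned after the membership filter,
    // so the frame is accepted even if the port is not a member of that VLAN.
    return Classification{true, port.pvid};
}
```

With a port that is a member of VLAN 3 only and still has PVID 1, a frame tagged with VLAN 1 is dropped in this model, but an untagged frame is accepted into VLAN 1, which matches the MAC being learnt on the default VLAN.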

It seems odd to me that the PVID wasn't set to 3 when the port was added to VLAN 3.
In addition, as I stated, in any case SONiC should remove all FDB entries on a bridge port before removing the bridge port.

@itaibaz

itaibaz commented Sep 18, 2020

Bottom line, I think there are 2 issues:

  1. Why wasn't PVID 3 set on the port?
  2. SONiC should add a flow that flushes the FDB entries on a bridge port (and removes them from SAI redis) before removing it.

@madhanmellanox
Contributor

@prsunny When we flush the FDB entries corresponding to the bridge port before calling SAI API remove_bridge_port(), the issue is fixed. Doesn't this fix look good?

@prsunny
Contributor

prsunny commented Sep 23, 2020

sure, could you help provide the fix?

@madhanmellanox
Contributor

Yes, I have the fix. I have tested it. I will raise a PR shortly.

@madhanmellanox
Contributor

@prsunny the PR for the fix is:
sonic-net/sonic-swss#1451

@liat-grozovik
Collaborator

The issue is fixed and the fix is merged, so I am closing it.
The fix is not yet in 201911; a request to pick it was sent to the release manager.
