[Workaround] EvpnRemoteVnip2pOrch warmboot check failure #2626
@srj102, please review this; it is a blocker and fixes a warmboot issue.
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
EvpnNvoOrch* evpn_orch = gDirectory.get<EvpnNvoOrch*>();
auto vtep_ptr = evpn_orch->getEVPNVtep();
if (!vtep_ptr) {
    SWSS_LOG_WARN("Remote VNI add: Source VTEP not found. remote=%s vid=%d",
What does this log say during warm boot? Does it point to anything valid, i.e. what are the remote IP and the vid?
If it is spurious data, we need to investigate it, because the same thing could show up on any of the other tables in other OAs.
Quite clearly, the null check cannot be the fix, especially in a non-EVPN scenario.
We need to see why this is being called when nothing explains the presence of an entry in the VXLAN_REMOTE_VNI table.
@jcaiMR @prsunny Do we know why EvpnRemoteVnip2pOrch::addOperation is called in a non-EVPN scenario? It is generally called only when the VXLAN_REMOTE_VNI table is present. Is the table populated in a non-EVPN scenario?
Yes, the issue happened in a non-EVPN scenario. It seems that in OrchDaemon::init() we unconditionally create evpn_remote_vni_orch and push it into m_orchList:
if (vxlan_tunnel_orch->isDipTunnelsSupported())
{
    EvpnRemoteVnip2pOrch* evpn_remote_vni_orch = new EvpnRemoteVnip2pOrch(m_applDb, APP_VXLAN_REMOTE_VNI_TABLE_NAME);
    gDirectory.set(evpn_remote_vni_orch);
    m_orchList.push_back(evpn_remote_vni_orch);
}
else
{
    EvpnRemoteVnip2mpOrch* evpn_remote_vni_orch = new EvpnRemoteVnip2mpOrch(m_applDb, APP_VXLAN_REMOTE_VNI_TABLE_NAME);
    gDirectory.set(evpn_remote_vni_orch);
    m_orchList.push_back(evpn_remote_vni_orch);
}
The warmboot check function OrchDaemon::warmRestartCheck() then checks every orch agent in m_orchList and, as far as I can tell, ends up calling addOperation() for each one. I did not find any check on the VXLAN_REMOTE_VNI table before the orch agent list is walked. Did I miss something?
Attached is a trace taken during warmboot:
Thread 1 "orchagent" hit Breakpoint 1, EvpnRemoteVnip2mpOrch::addOperation (this=0x55ecbd1e7af0, request=...) at vxlanorch.cpp:2476
2476 vxlanorch.cpp: No such file or directory.
(gdb) bt
#0 EvpnRemoteVnip2mpOrch::addOperation (this=0x55ecbd1e7af0, request=...) at vxlanorch.cpp:2476
#1 0x000055ecbb8d1c60 in Orch2::doTask (this=0x55ecbd1e7af0, consumer=...) at orch.cpp:1063
#2 0x000055ecbb8d5a9e in Consumer::drain (this=0x55ecbd1e85b0) at orch.cpp:241
#3 Consumer::drain (this=0x55ecbd1e85b0) at orch.cpp:238
#4 Consumer::execute (this=0x55ecbd1e85b0) at orch.cpp:235
#5 0x000055ecbb8c54c9 in OrchDaemon::start (this=this@entry=0x55ecbd0bfd80) at orchdaemon.cpp:755
#6 0x000055ecbb843fe0 in main (argc=<optimized out>, argv=<optimized out>) at main.cpp:735
(gdb) c
Having the p2premotevniorch in the orchList does not explain why there was an event. Could you dump the contents of the VXLAN_REMOTE_VNI table before the WB command is issued? Does it have any entries?
keys VXLAN_REMOTE_VNI
Thanks for the comments. We don't have EVPN configured, and the output of sonic-db-cli APPL_DB keys "VXLAN_REMOTE_VNI*" is empty.
Do you mean that if the VXLAN_REMOTE_VNI table does not exist, no event can trigger addOperation in EvpnRemoteVnip2pOrch/EvpnRemoteVnip2mpOrch? Then it seems we need to dig further to see what kind of events are coming from the Consumer table, right?
Yes, addOperation cannot be called unless the VXLAN_REMOTE_VNI table is populated.
My understanding is that warmRestartCheck does not call addOperation.
It just checks the size of the m_toSync map and returns false if it is non-zero.
So one possible explanation is that an earlier operation populated VXLAN_REMOTE_VNI and produced the failure that you saw.
Under a valid EVPN VXLAN configuration one is not expected to see this failure.
The WB operation later on then results in the readiness check failing.
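The behaviour described in this comment can be illustrated with a minimal sketch. It is not the sonic-swss code: SketchConsumer, addToSync and sketchWarmRestartCheck are hypothetical names, and the pending queue is reduced to a plain std::map that is only loosely modelled on m_toSync. It assumes that an entry whose handler returns false stays queued, and that the readiness check only looks at queue sizes.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Simplified stand-in for an orchagent consumer: pending work lives in a
// map keyed by table key, loosely modelled on the m_toSync map named above.
struct SketchConsumer
{
    std::map<std::string, std::string> m_toSync;  // key -> pending operation

    // Queue an event, as would happen when the backing table is written to.
    void addToSync(const std::string& key, const std::string& op)
    {
        assert(!key.empty());
        m_toSync[key] = op;
    }

    // Run the handler on every pending entry. Entries whose handler returns
    // true are erased; entries whose handler returns false stay queued.
    void drain(const std::function<bool(const std::string&)>& handler)
    {
        for (auto it = m_toSync.begin(); it != m_toSync.end();)
        {
            if (handler(it->first))
                it = m_toSync.erase(it);
            else
                ++it;
        }
    }
};

// Readiness check in the spirit of the description above: it never calls a
// handler, it only reports whether any consumer still has pending entries.
bool sketchWarmRestartCheck(const std::vector<const SketchConsumer*>& consumers)
{
    for (const SketchConsumer* c : consumers)
    {
        if (!c->m_toSync.empty())
            return false;
    }
    return true;
}
```

In this model, a handler that keeps returning false for an entry leaves that entry in the queue, so a later readiness check fails even though the check itself never invoked the handler.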
Thanks for the suggestion. I will debug further to find what triggers the event.
Got it. I have modified the title and removed "Fix #". I will use another PR for the root-cause solution.
@srj102 are you ok to remove the blocker on this PR?
orchagent/vxlanorch.cpp
Outdated
@@ -2347,6 +2347,14 @@ bool EvpnRemoteVnip2pOrch::addOperation(const Request& request)
        return true;
    }

    EvpnNvoOrch* evpn_orch = gDirectory.get<EvpnNvoOrch*>();
    auto vtep_ptr = evpn_orch->getEVPNVtep();
    if (!vtep_ptr) {
Please follow convention. { -> new line
Sure, will update in latest diff.
updated.
* fix p2p vxlan warmboot check failure
Update sonic-swss submodule pointer to include the following:
* f66abed Support for tc-dot1p and tc-dscp qosmap ([sonic-net#2559](sonic-net/sonic-swss#2559))
* 35385ad [RouteOrch] Record ROUTE_TABLE entry programming status to APPL_STATE_DB ([sonic-net#2512](sonic-net/sonic-swss#2512))
* 0704f78 [Workaround] EvpnRemoteVnip2pOrch warmboot check failure ([sonic-net#2626](sonic-net/sonic-swss#2626))
* 4df5cab [ResponsePublisher] add pipeline support ([sonic-net#2511](sonic-net/sonic-swss#2511))
Signed-off-by: AntonHryshchuk <antonh@nvidia.com>
Update sonic-swss submodule pointer to include the following:
* baa302e Do not allow to add port to .1Q bridge while router port deletion is not completed ([sonic-net#2669](sonic-net/sonic-swss#2669))
* f66abed Support for tc-dot1p and tc-dscp qosmap ([sonic-net#2559](sonic-net/sonic-swss#2559))
* 35385ad [RouteOrch] Record ROUTE_TABLE entry programming status to APPL_STATE_DB ([sonic-net#2512](sonic-net/sonic-swss#2512))
* 0704f78 [Workaround] EvpnRemoteVnip2pOrch warmboot check failure ([sonic-net#2626](sonic-net/sonic-swss#2626))
* 4df5cab [ResponsePublisher] add pipeline support ([sonic-net#2511](sonic-net/sonic-swss#2511))
Signed-off-by: dprital <drorp@nvidia.com>
Update sonic-swss submodule pointer to include the following:
* baa302e Do not allow to add port to .1Q bridge while router port deletion is not completed ([#2669](sonic-net/sonic-swss#2669))
* f66abed Support for tc-dot1p and tc-dscp qosmap ([#2559](sonic-net/sonic-swss#2559))
* 35385ad [RouteOrch] Record ROUTE_TABLE entry programming status to APPL_STATE_DB ([#2512](sonic-net/sonic-swss#2512))
* 0704f78 [Workaround] EvpnRemoteVnip2pOrch warmboot check failure ([#2626](sonic-net/sonic-swss#2626))
* 4df5cab [ResponsePublisher] add pipeline support ([#2511](sonic-net/sonic-swss#2511))
…)" (#2773) This reverts commit 750e064. Reverts PR #2756. The fix added there breaks the previously added workaround #2626, hence the request to revert it. Once we find a proper solution for sonic-net/sonic-buildimage#12361, we need to reintegrate this PR.
What I did
Workaround for issue #12361: let the wr_arp test pass first; the root-cause fix will come in another PR.
The issue is triggered by a warm-restart check failure caused by EvpnRemoteVnip2pOrch::addOperation().
It always returns false because evpn_orch->getEVPNVtep() never returns a VTEP in this setup, which makes the warmboot check fail.
The issue only happens on the Broadcom platform, which supports P2P VXLAN tunnels and therefore runs EvpnRemoteVnip2pOrch.
Nvidia SAI currently supports only P2MP tunnels, so the issue does not occur on that platform.
It seems the VTEP instance is created only when the tunnel creation source is EVPN. In the non-EVPN case the source VTEP does not exist, and addOperation() should return true so that the warmboot check passes.
So the fix here is to let EvpnRemoteVnip2pOrch::addOperation() return true when the local VTEP does not exist, the same logic as in EvpnRemoteVnip2mpOrch::addOperation().
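The change in return value can be sketched as follows. This is an illustration under stated assumptions, not the actual patch: SketchVtep, SketchEvpnOrch and sketchAddOperation are hypothetical stand-ins, the real tunnel programming is replaced by a placeholder, and the treat_missing_vtep_as_done flag exists only to contrast the behaviour before and after the workaround.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-ins; none of these are the real sonic-swss classes.
struct SketchVtep
{
    std::string name;
};

struct SketchEvpnOrch
{
    std::shared_ptr<SketchVtep> vtep;  // stays null when EVPN is not configured

    std::shared_ptr<SketchVtep> getEVPNVtep() const { return vtep; }
};

// Returning true means "request handled, drop it from the pending queue".
// Returning false means "keep it queued", which is what leaves work pending
// when the warmboot readiness check runs.
bool sketchAddOperation(const SketchEvpnOrch& evpn_orch,
                        const std::string& remote_vtep,
                        bool treat_missing_vtep_as_done)
{
    assert(!remote_vtep.empty());

    auto vtep_ptr = evpn_orch.getEVPNVtep();
    if (!vtep_ptr)
    {
        // false models the behaviour before the workaround,
        // true models the behaviour after it.
        return treat_missing_vtep_as_done;
    }

    // Placeholder for the real tunnel programming, which is not shown here.
    return true;
}
```

With no source VTEP, the sketch returns false when the flag is off and true when it is on; with a VTEP present the flag has no effect.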
Why I did it
How I verified it
Ran arp/test_wr_arp.py on the Broadcom platform.
Details if related