[warm-reboot]: add bgp eoiu support to speed up route reconciliation … #856
syslog:
More detailed syslog:
Explicit EoR case:
@jipanyang I have two questions:
retest this please
fpmsyncd/bgp_eoiu_marker.py
    except Exception:
        syslog.syslog(syslog.LOG_ERR, "*ERROR* get_all_peers Exception: %s" % (traceback.format_exc()))
        time.sleep(5)
        self.get_all_peers()
What if the exception is persistent?
We would loop here forever.
Right, the exception handling here covers the case where bgp startup is slow and 'show bgp summary json' fails.
A retry limit of 120/5 = 24 attempts could be enforced here.
I think we should enforce it here.
Can you please implement it?
yes, will add more commit to this PR.
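A minimal sketch of the retry limit discussed in this thread, assuming the 120-second budget and 5-second sleep mentioned above. The helper name `get_all_peers_with_retry` and the `fetch_peers` callable are hypothetical stand-ins for the script's own code that runs 'show bgp summary json'; this is not the code that was eventually committed.

```python
import time


def get_all_peers_with_retry(fetch_peers, timeout_s=120, interval_s=5, sleep=time.sleep):
    """Call fetch_peers() until it succeeds or the retry budget is spent.

    fetch_peers is a hypothetical callable standing in for the code that
    runs 'show bgp summary json' and parses its output.
    """
    max_attempts = timeout_s // interval_s  # 120 / 5 = 24 attempts
    for _ in range(max_attempts):
        try:
            return fetch_peers()
        except Exception:
            # bgp may still be starting up; wait and try again
            sleep(interval_s)
    # A loop instead of recursion: a persistent failure ends here
    # rather than retrying forever.
    raise RuntimeError(
        "Failed to get bgp neighbor info in %d seconds, exiting" % timeout_s
    )
```

Using a bounded loop rather than calling `self.get_all_peers()` from inside the `except` block also avoids growing the call stack on every failed attempt.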
Ran "pkill -9 bgpd", then tried the test again: bgp_eoiu_marker.py: Failed to get bgp neighbor info in 120 seconds, exiting
Jun 24 08:00:32.389598 vlab-01 INFO bgp#bgp_eoiu_marker.py: Cleaned ipv4 and ipv6 eoiu marker flags
Jun 24 08:00:32.390907 vlab-01 NOTICE bgp#bgp_eoiu_marker.py: :- checkWarmStart: bgp doing warm start, restore count 0
Jun 24 08:00:32.567670 vlab-01 ERR bgp#bgp_eoiu_marker.py: *ERROR* get_all_peers Exception: Traceback (most recent call last):#012 File "/usr/bin/bgp_eoiu_marker.py", line 53, in get_all_peers#012 peer_info = json.loads(output)#012 File "/usr/lib/python2.7/json/__init__.py", line 339, in loads#012 return _default_decoder.decode(s)#012 File "/usr/lib/python2.7/json/decoder.py", line 364, in decode#012 obj, end = self.raw_decode(s, idx=_w(s, 0).end())#012 File "/usr/lib/python2.7/json/decoder.py", line 382, in raw_decode#012 raise ValueError("No JSON object could be decoded")#012ValueError: No JSON object could be decoded
Jun 24 08:00:36.795726 vlab-01 NOTICE swss#orchagent: :- removeNextHopGroup: Delete next hop group 10.0.0.57,10.0.0.59,10.0.0.61,10.0.0.63
Jun 24 08:00:43.357393 vlab-01 NOTICE swss#orchagent: :- removeNextHopGroup: Delete next hop group fc00::72,fc00::76,fc00::7a,fc00::7e
Jun 24 08:02:22.918385 vlab-01 WARNING syncd#syncd: :- refresh_read_only_BCM56850: need to recalculate RO: SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY
Jun 24 08:02:22.920196 vlab-01 ERR syncd#syncd: :- internal_vs_generic_get: SAI_SWITCH_ATTR_AVAILABLE_IPV4_ROUTE_ENTRY read only not implemented on oid:0x2100000000
Jun 24 08:02:22.920752 vlab-01 ERR syncd#syncd: :- meta_sai_get_oid: get status: SAI_STATUS_NOT_IMPLEMENTED
Jun 24 08:02:22.925638 vlab-01 ERR swss#orchagent: :- meta_sai_get_oid: get status: SAI_STATUS_NOT_IMPLEMENTED
Jun 24 08:02:22.925638 vlab-01 ERR swss#orchagent: :- getResAvailableCounters: Failed to get switch attribute 49 , rv:-15
Jun 24 08:02:22.928650 vlab-01 WARNING syncd#syncd: :- refresh_read_only_BCM56850: need to recalculate RO: SAI_SWITCH_ATTR_AVAILABLE_IPV6_ROUTE_ENTRY
Jun 24 08:02:22.929727 vlab-01 ERR syncd#syncd: :- internal_vs_generic_get: SAI_SWITCH_ATTR_AVAILABLE_IPV6_ROUTE_ENTRY read only not implemented on oid:0x2100000000
Jun 24 08:02:22.931380 vlab-01 ERR syncd#syncd: :- meta_sai_get_oid: get status: SAI_STATUS_NOT_IMPLEMENTED
Jun 24 08:02:22.936154 vlab-01 ERR swss#orchagent: :- meta_sai_get_oid: get status: SAI_STATUS_NOT_IMPLEMENTED
Jun 24 08:02:22.936342 vlab-01 ERR swss#orchagent: :- getResAvailableCounters: Failed to get switch attribute 50 , rv:-15
Jun 24 08:02:22.941056 vlab-01 WARNING syncd#syncd: :- refresh_read_only_BCM56850: need to recalculate RO: SAI_SWITCH_ATTR_AVAILABLE_IPV4_NEXTHOP_ENTRY
Jun 24 08:02:22.941965 vlab-01 ERR syncd#syncd: :- internal_vs_generic_get: SAI_SWITCH_ATTR_AVAILABLE_IPV4_NEXTHOP_ENTRY read only not implemented on oid:0x2100000000
Jun 24 08:02:22.942365 vlab-01 ERR syncd#syncd: :- meta_sai_get_oid: get status: SAI_STATUS_NOT_IMPLEMENTED
Jun 24 08:02:22.945633 vlab-01 ERR swss#orchagent: :- meta_sai_get_oid: get status: SAI_STATUS_NOT_IMPLEMENTED
Jun 24 08:02:22.945633 vlab-01 ERR swss#orchagent: :- getResAvailableCounters: Failed to get switch attribute 51 , rv:-15
Jun 24 08:02:22.953105 vlab-01 WARNING syncd#syncd: :- refresh_read_only_BCM56850: need to recalculate RO: SAI_SWITCH_ATTR_AVAILABLE_IPV6_NEXTHOP_ENTRY
Jun 24 08:02:22.953949 vlab-01 ERR syncd#syncd: :- internal_vs_generic_get: SAI_SWITCH_ATTR_AVAILABLE_IPV6_NEXTHOP_ENTRY read only not implemented on oid:0x2100000000
Jun 24 08:02:22.954337 vlab-01 ERR syncd#syncd: :- meta_sai_get_oid: get status: SAI_STATUS_NOT_IMPLEMENTED
Jun 24 08:02:22.955411 vlab-01 ERR swss#orchagent: :- meta_sai_get_oid: get status: SAI_STATUS_NOT_IMPLEMENTED
Jun 24 08:02:22.956184 vlab-01 ERR swss#orchagent: :- getResAvailableCounters: Failed to get switch attribute 52 , rv:-15
Jun 24 08:02:22.959512 vlab-01 WARNING syncd#syncd: :- refresh_read_only_BCM56850: need to recalculate RO: SAI_SWITCH_ATTR_AVAILABLE_IPV4_NEIGHBOR_ENTRY
Jun 24 08:02:22.960532 vlab-01 ERR syncd#syncd: :- internal_vs_generic_get: SAI_SWITCH_ATTR_AVAILABLE_IPV4_NEIGHBOR_ENTRY read only not implemented on oid:0x2100000000
Jun 24 08:02:22.960924 vlab-01 ERR syncd#syncd: :- meta_sai_get_oid: get status: SAI_STATUS_NOT_IMPLEMENTED
Jun 24 08:02:22.964854 vlab-01 ERR swss#orchagent: :- meta_sai_get_oid: get status: SAI_STATUS_NOT_IMPLEMENTED
Jun 24 08:02:22.965683 vlab-01 ERR swss#orchagent: :- getResAvailableCounters: Failed to get switch attribute 53 , rv:-15
Jun 24 08:02:22.968983 vlab-01 WARNING syncd#syncd: :- refresh_read_only_BCM56850: need to recalculate RO: SAI_SWITCH_ATTR_AVAILABLE_IPV6_NEIGHBOR_ENTRY
Jun 24 08:02:22.968983 vlab-01 ERR syncd#syncd: :- internal_vs_generic_get: SAI_SWITCH_ATTR_AVAILABLE_IPV6_NEIGHBOR_ENTRY read only not implemented on oid:0x2100000000
Jun 24 08:02:22.971467 vlab-01 ERR swss#orchagent: :- meta_sai_get_oid: get status: SAI_STATUS_NOT_IMPLEMENTED
Jun 24 08:02:22.972047 vlab-01 ERR swss#orchagent: :- getResAvailableCounters: Failed to get switch attribute 54 , rv:-15
Jun 24 08:02:22.973522 vlab-01 ERR swss#orchagent: :- meta_sai_get_oid: get status: SAI_STATUS_NOT_IMPLEMENTED
Jun 24 08:02:22.974244 vlab-01 ERR swss#orchagent: :- getResAvailableCounters: Failed to get switch attribute 56 , rv:-15
Jun 24 08:02:22.975887 vlab-01 ERR swss#orchagent: :- meta_sai_get_oid: get status: SAI_STATUS_NOT_IMPLEMENTED
Jun 24 08:02:22.976324 vlab-01 ERR swss#orchagent: :- getResAvailableCounters: Failed to get switch attribute 55 , rv:-15
Jun 24 08:02:22.981321 vlab-01 ERR swss#orchagent: :- meta_sai_get_oid: get status: SAI_STATUS_NOT_IMPLEMENTED
Jun 24 08:02:22.981938 vlab-01 ERR swss#orchagent: :- getResAvailableCounters: Failed to get switch attribute 60 , rv:-15
Jun 24 08:02:22.986395 vlab-01 ERR swss#orchagent: :- meta_sai_get_oid: get status: SAI_STATUS_NOT_IMPLEMENTED
Jun 24 08:02:22.988600 vlab-01 ERR swss#orchagent: :- getResAvailableCounters: Failed to get switch attribute 61 , rv:-15
Jun 24 08:02:22.994613 vlab-01 ERR swss#orchagent: :- meta_sai_get_oid: get status: SAI_STATUS_NOT_IMPLEMENTED
Jun 24 08:02:22.994613 vlab-01 ERR swss#orchagent: :- getResAvailableCounters: Failed to get ACL table attribute 4412 , rv:-15
Jun 24 08:02:22.995269 vlab-01 ERR syncd#syncd: :- meta_sai_get_oid: get status: SAI_STATUS_NOT_IMPLEMENTED
Jun 24 08:02:22.995269 vlab-01 WARNING syncd#syncd: :- refresh_read_only_BCM56850: need to recalculate RO: SAI_SWITCH_ATTR_AVAILABLE_NEXT_HOP_GROUP_MEMBER_ENTRY
Jun 24 08:02:22.995269 vlab-01 ERR syncd#syncd: :- internal_vs_generic_get: SAI_SWITCH_ATTR_AVAILABLE_NEXT_HOP_GROUP_MEMBER_ENTRY read only not implemented on oid:0x2100000000
Jun 24 08:02:22.995269 vlab-01 ERR syncd#syncd: :- meta_sai_get_oid: get status: SAI_STATUS_NOT_IMPLEMENTED
Jun 24 08:02:22.995269 vlab-01 WARNING syncd#syncd: :- refresh_read_only_BCM56850: need to recalculate RO: SAI_SWITCH_ATTR_AVAILABLE_NEXT_HOP_GROUP_ENTRY
Jun 24 08:02:22.995269 vlab-01 ERR syncd#syncd: :- internal_vs_generic_get: SAI_SWITCH_ATTR_AVAILABLE_NEXT_HOP_GROUP_ENTRY read only not implemented on oid:0x2100000000
Jun 24 08:02:22.995269 vlab-01 ERR syncd#syncd: :- meta_sai_get_oid: get status: SAI_STATUS_NOT_IMPLEMENTED
Jun 24 08:02:22.995269 vlab-01 WARNING syncd#syncd: :- refresh_read_only_BCM56850: need to recalculate RO: SAI_SWITCH_ATTR_AVAILABLE_ACL_TABLE
Jun 24 08:02:22.995269 vlab-01 ERR syncd#syncd: :- internal_vs_generic_get: SAI_SWITCH_ATTR_AVAILABLE_ACL_TABLE read only not implemented on oid:0x2100000000
Jun 24 08:02:22.995269 vlab-01 ERR syncd#syncd: :- meta_sai_get_oid: get status: SAI_STATUS_NOT_IMPLEMENTED
Jun 24 08:02:22.995269 vlab-01 WARNING syncd#syncd: :- refresh_read_only_BCM56850: need to recalculate RO: SAI_SWITCH_ATTR_AVAILABLE_ACL_TABLE_GROUP
Jun 24 08:02:22.995269 vlab-01 ERR syncd#syncd: :- internal_vs_generic_get: SAI_SWITCH_ATTR_AVAILABLE_ACL_TABLE_GROUP read only not implemented on oid:0x2100000000
Jun 24 08:02:22.995269 vlab-01 ERR syncd#syncd: :- meta_sai_get_oid: get status: SAI_STATUS_NOT_IMPLEMENTED
Jun 24 08:02:22.995269 vlab-01 WARNING syncd#syncd: :- refresh_read_only_BCM56850: need to recalculate RO: SAI_ACL_TABLE_ATTR_AVAILABLE_ACL_ENTRY
Jun 24 08:02:22.995269 vlab-01 ERR syncd#syncd: :- internal_vs_generic_get: SAI_ACL_TABLE_ATTR_AVAILABLE_ACL_ENTRY read only not implemented on oid:0x700000000
Jun 24 08:02:23.000113 vlab-01 ERR swss#orchagent: :- meta_sai_get_oid: get status: SAI_STATUS_NOT_IMPLEMENTED
Jun 24 08:02:23.000113 vlab-01 ERR swss#orchagent: :- getResAvailableCounters: Failed to get switch attribute 57 , rv:-15
Jun 24 08:02:23.000113 vlab-01 ERR syncd#syncd: :- meta_sai_get_oid: get status: SAI_STATUS_NOT_IMPLEMENTED
Jun 24 08:02:23.000113 vlab-01 WARNING syncd#syncd: :- refresh_read_only_BCM56850: need to recalculate RO: SAI_SWITCH_ATTR_AVAILABLE_FDB_ENTRY
Jun 24 08:02:23.000113 vlab-01 ERR syncd#syncd: :- internal_vs_generic_get: SAI_SWITCH_ATTR_AVAILABLE_FDB_ENTRY read only not implemented on oid:0x2100000000
Jun 24 08:02:23.000113 vlab-01 ERR syncd#syncd: :- meta_sai_get_oid: get status: SAI_STATUS_NOT_IMPLEMENTED
Jun 24 08:02:23.001651 vlab-01 WARNING swss#orchagent: :- checkCrmThresholds: NEXTHOP_GROUP_MEMBER THRESHOLD_CLEAR for TH_PERCENTAGE 0% Used count 0 free count 0
Jun 24 08:02:23.004233 vlab-01 WARNING swss#orchagent: :- checkCrmThresholds: NEXTHOP_GROUP THRESHOLD_CLEAR for TH_PERCENTAGE 0% Used count 0 free count 0
Jun 24 08:02:40.955039 vlab-01 ERR bgp#bgp_eoiu_marker.py: message repeated 24 times: [ *ERROR* get_all_peers Exception: Traceback (most recent call last):#012 File "/usr/bin/bgp_eoiu_marker.py", line 53, in get_all_peers#012 peer_info = json.loads(output)#012 File "/usr/lib/python2.7/json/__init__.py", line 339, in loads#012 return _default_decoder.decode(s)#012 File "/usr/lib/python2.7/json/decoder.py", line 364, in decode#012 obj, end = self.raw_decode(s, idx=_w(s, 0).end())#012 File "/usr/lib/python2.7/json/decoder.py", line 382, in raw_decode#012 raise ValueError("No JSON object could be decoded")#012ValueError: No JSON object could be decoded]
Jun 24 08:02:40.955039 vlab-01 ERR bgp#bgp_eoiu_marker.py: Failed to get bgp neighbor info in 120 seconds, exiting
Let's avoid ZMQ usage for now. GRPC/Protobuf will not help us in this case.
Looks OK to me.
Please check my comments.
Also, please wait for the opinion of another team member.
The vs test on swss is missing.
c90eeeb to a1e6a9d
retest this please
@jipanyang, I cannot resolve the conflict. Can you resolve it?
…in fpmsyncd Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com>
…olean Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com>
Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com>
Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com>
Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com>
Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com>
Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com>
Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com>
…olving merge conflict Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com>
… platforms (sonic-net#856)
* Update stop, reset failed status and restart of systemd services to support multi-asic platforms.
* Create function to avoid code duplication.
* Fixed errors due to previous commit and review comments.
* Minor update to fix spacing.
* Minor update to fix spacing.
* Minor update to fix spacing.
* For multi asic platform updated logic of stopping/restarting of services to ensure that the right instances are stopped and restarted if a service is both global and multi-instance.
* Fixed log error message with incorrect number of parameters.
Signed-off-by: Venkat Garigipati <venkatg@cisco.com>
…in fpmsyncd
Signed-off-by: Jipan Yang jipan.yang@alibaba-inc.com
What I did
Three PRs for adding BGP eoiu support to speed up route reconciliation in fpmsyncd
sonic-buildimage: sonic-net/sonic-buildimage#2823
sonic-swss-common: sonic-net/sonic-swss-common#273
sonic-swss: #856
Why I did it
Similar to restore_neighbors.py for neighsyncd, start a bgp_eoiu_marker.py script for the bgp docker.
The script checks bgp neighbor state via the cli interface periodically (every 1 second).
It looks for an explicit EOR and an implicit EOR (keepalive after established) in the json output of show ip bgp neighbors A.B.C.D json.
Once the script has collected all needed EORs, it sets an EOIU flag in stateDB.
fpmsyncd can hold for a few seconds (3 seconds) after getting the flag before starting route reconciliation.
If for any reason the script fails to set the EOIU flag in stateDB, the existing warm_restart bgp_timer will kick in later.
This approach may add a few more seconds of delay compared with the FRR embedded EOIU solution, but it is simpler and less risky.
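The decision logic described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the script's actual code: the dictionary keys `eor_received`, `state`, and `keepalives_received` are hypothetical placeholders rather than the field names in FRR's json output, and all three function names are invented for this example.

```python
def peer_eor_seen(peer):
    """Return True if a peer has signalled the end of its initial update.

    peer is a dict with hypothetical keys (not FRR's real json schema):
      eor_received        - True once an explicit EOR marker arrived
      state               - session state, e.g. "Established"
      keepalives_received - keepalives counted for this session
    """
    if peer.get("eor_received"):
        return True  # explicit EOR
    # implicit EOR: a keepalive seen after the session is established
    return peer.get("state") == "Established" and peer.get("keepalives_received", 0) > 0


def all_eor_collected(peers):
    """True when every known peer has sent an explicit or implicit EOR.

    An empty peer table is treated as "not ready" (an assumption made here).
    """
    return bool(peers) and all(peer_eor_seen(p) for p in peers.values())


def reconciliation_trigger(eoiu_flag_set, bgp_timer_expired):
    """Name what lets fpmsyncd start reconciliation, or None to keep waiting."""
    if eoiu_flag_set:
        return "eoiu"  # fpmsyncd holds a few seconds, then reconciles
    if bgp_timer_expired:
        return "bgp_timer"  # fallback when the flag was never set
    return None
```

For example, a table in which one peer reported an explicit EOR and another only reached Established with a keepalive would satisfy `all_eor_collected`, while a peer still in Active would hold the flag back until the bgp_timer fallback applies.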
How I verified it
Before warm upgrade bgp docker:
Warm upgrade bgp docker:
Check bgp sum immediately:
Details if related