-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bgpd: allow batch handling of peer shutdown/failure #17505
base: master
Are you sure you want to change the base?
Conversation
Replace the per-peer connection error with a per-bgp event and a list. The io pthread enqueues peers per-bgp-instance, and the error-handing code can process multiple peers if there have been multiple failures. Signed-off-by: Mark Stapp <mjs@cisco.com>
Remove a couple of apis that don't exist. Signed-off-by: Mark Stapp <mjs@cisco.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice improvement ahead!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we switch to frr.conf (unified config)?
912adc5
to
0d2605a
Compare
Pushed to try to clean up the build problem |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good ... waiting on @ton31337 's one comment
|
the above patch will fix the infinite loops we get stuck in sometimes with this code. Effectively when you call bgp_process the pi->next pointer can be reset. |
there is also a crash that I am chasing down w/ Mark that I am seeing locally |
0d2605a
to
54d1834
Compare
rebased to apply a couple of fixes - let's see how CI looks |
54d1834
to
b45520c
Compare
When peer connections encounter errors, attempt to batch some of the clearing processing that occurs. Add a new batch object, add multiple peers to it, if possible. Do one rib walk for the batch, rather than one walk per peer. Use a handler callback per batch to check and remove peers' path-infos, rather than a work-queue and callback per peer. The original clearing code remains; it's used for single peers. Signed-off-by: Mark Stapp <mjs@cisco.com>
Move the peer connection error list to the peer_connection struct; that seems to line up better with the way that struct works. Signed-off-by: Mark Stapp <mjs@cisco.com>
Add a simple topotest using multiple bgp peers; based on the ecmp_topo1 test. Signed-off-by: Mark Stapp <mjs@cisco.com>
b45520c
to
39c83b6
Compare
CI:rerun |
Spoke w/ Mark he's going to make a change so that all peer clearing events go through his batching |
This pull request has conflicts, please resolve those before we can evaluate the pull request. |
When a peer connection fails or is closed, bgp does cleanup processing on a per-peer basis. At scale, this can become a problem - bgp can be forced to make a complete rib walk to clean up for each peer involved. This PR makes peer error-handling more visible at the bgp object level, and then adds a batching path if there are multiple peers who need cleanup/clearing processing at the same time.