bgpd: allow batch handling of peer shutdown/failure #17505

mjstapp · 2024-11-25T21:20:29Z

When a peer connection fails or is closed, bgp does cleanup processing on a per-peer basis. At scale, this can become a problem: bgp can be forced to make a complete rib walk to clean up for each peer involved. This PR makes peer error-handling more visible at the bgp object level, and then adds a batching path for cases where multiple peers need cleanup/clearing processing at the same time.

Replace the per-peer connection error with a per-bgp event and a list. The io pthread enqueues peers per-bgp-instance, and the error-handling code can process multiple peers if there have been multiple failures.
When peer connections encounter errors, attempt to batch some of the clearing processing that occurs. Add a new batch object, add multiple peers to it, if possible. Do one rib walk for the batch, rather than one walk per peer. Use a handler callback per batch to check and remove peers' path-infos, rather than a work-queue and callback per peer. The original clearing code remains; it's used for single peers.
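
To illustrate why one walk per batch is cheaper than one walk per peer, here is a minimal sketch under simplifying assumptions. It is not the code from this PR: the flat array standing in for the rib and the names `clear_batch`, `path_entry`, `batch_add_peer`, `batch_check_peer`, `clear_batch_walk`, and `clear_per_peer` are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PEERS 64 /* arbitrary bound chosen for this sketch */

/* Hypothetical batch object: one membership flag per peer id. */
struct clear_batch {
    bool member[MAX_PEERS];
};

/* Hypothetical flat "rib": one entry per (prefix, peer) path. */
struct path_entry {
    int prefix;
    int peer_id;
    bool removed;
};

static void batch_add_peer(struct clear_batch *b, int peer_id)
{
    if (peer_id >= 0 && peer_id < MAX_PEERS)
        b->member[peer_id] = true;
}

static bool batch_check_peer(const struct clear_batch *b, int peer_id)
{
    return peer_id >= 0 && peer_id < MAX_PEERS && b->member[peer_id];
}

/* One walk for the whole batch; returns the number of entries visited. */
static size_t clear_batch_walk(struct path_entry *rib, size_t n,
                               const struct clear_batch *b, size_t *cleared)
{
    size_t visited = 0;

    assert(rib && b && cleared);
    for (size_t i = 0; i < n; i++) {
        visited++;
        if (!rib[i].removed && batch_check_peer(b, rib[i].peer_id)) {
            rib[i].removed = true;
            (*cleared)++;
        }
    }
    return visited;
}

/* Baseline for comparison: one full walk per failed peer. */
static size_t clear_per_peer(struct path_entry *rib, size_t n,
                             const int *peers, size_t npeers, size_t *cleared)
{
    size_t visited = 0;

    assert(rib && peers && cleared);
    for (size_t p = 0; p < npeers; p++) {
        for (size_t i = 0; i < n; i++) {
            visited++;
            if (!rib[i].removed && rib[i].peer_id == peers[p]) {
                rib[i].removed = true;
                (*cleared)++;
            }
        }
    }
    return visited;
}

struct demo_result {
    size_t batch_visits, batch_cleared;
    size_t per_peer_visits, per_peer_cleared;
};

/* Clears the same three failed peers both ways and reports the work done. */
struct demo_result demo_batch_clear(void)
{
    const int failed[] = {1, 2, 3};
    struct path_entry a[] = {{10, 1, false}, {10, 2, false}, {20, 3, false},
                             {20, 4, false}, {30, 1, false}, {30, 5, false}};
    struct path_entry b[] = {{10, 1, false}, {10, 2, false}, {20, 3, false},
                             {20, 4, false}, {30, 1, false}, {30, 5, false}};
    struct clear_batch batch = {{false}};
    struct demo_result r = {0, 0, 0, 0};

    for (size_t i = 0; i < 3; i++)
        batch_add_peer(&batch, failed[i]);

    r.batch_visits = clear_batch_walk(a, 6, &batch, &r.batch_cleared);
    r.per_peer_visits = clear_per_peer(b, 6, failed, 3, &r.per_peer_cleared);
    return r;
}
```

Calling `demo_batch_clear()` from a `main` shows both approaches clearing the same paths, with the batched walk visiting each entry once instead of once per failed peer. The real table walk is event-driven and considerably more involved; the sketch only captures the counting argument.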

Replace the per-peer connection error with a per-bgp event and a list. The io pthread enqueues peers per-bgp-instance, and the error-handling code can process multiple peers if there have been multiple failures. Signed-off-by: Mark Stapp <mjs@cisco.com>
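
The per-instance list described in this commit message can be pictured roughly as follows. This is a sketch under assumptions, not the bgpd implementation: the `instance` and `peer` structs, `enqueue_peer_error`, and `drain_peer_errors` are hypothetical names, and a plain POSIX mutex stands in for whatever locking the real code uses.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical peer: just an id plus intrusive list linkage. */
struct peer {
    int id;
    bool on_error_list;
    struct peer *next_error;
};

/* Hypothetical per-instance state shared by the I/O and main threads. */
struct instance {
    pthread_mutex_t lock;
    struct peer *err_head, *err_tail;
    bool event_pending;
};

static void instance_init(struct instance *inst)
{
    pthread_mutex_init(&inst->lock, NULL);
    inst->err_head = inst->err_tail = NULL;
    inst->event_pending = false;
}

/* I/O-thread side: queue the peer once. Returns true when the caller
 * should schedule the single per-instance handler event.
 */
static bool enqueue_peer_error(struct instance *inst, struct peer *p)
{
    bool schedule = false;

    pthread_mutex_lock(&inst->lock);
    if (!p->on_error_list) {
        p->on_error_list = true;
        p->next_error = NULL;
        if (inst->err_tail)
            inst->err_tail->next_error = p;
        else
            inst->err_head = p;
        inst->err_tail = p;
    }
    if (!inst->event_pending) {
        inst->event_pending = true;
        schedule = true;
    }
    pthread_mutex_unlock(&inst->lock);
    return schedule;
}

/* Main-thread side: detach the whole list under the lock, then handle
 * every queued peer in one pass. Returns the number of peers handled.
 */
static size_t drain_peer_errors(struct instance *inst,
                                void (*handle)(struct peer *))
{
    struct peer *p, *next;
    size_t n = 0;

    pthread_mutex_lock(&inst->lock);
    p = inst->err_head;
    inst->err_head = inst->err_tail = NULL;
    inst->event_pending = false;
    pthread_mutex_unlock(&inst->lock);

    for (; p; p = next) {
        next = p->next_error; /* still flagged, so enqueue leaves it alone */
        if (handle)
            handle(p);
        pthread_mutex_lock(&inst->lock);
        p->next_error = NULL;
        p->on_error_list = false; /* peer may be queued again from here on */
        pthread_mutex_unlock(&inst->lock);
        n++;
    }
    return n;
}

/* Single-threaded walk-through: three peers fail, one of them twice. */
size_t demo_error_list(void)
{
    struct instance inst;
    struct peer p1 = {1, false, NULL};
    struct peer p2 = {2, false, NULL};
    struct peer p3 = {3, false, NULL};
    size_t handled;

    instance_init(&inst);
    assert(enqueue_peer_error(&inst, &p1));  /* first error: schedule */
    assert(!enqueue_peer_error(&inst, &p2)); /* event already pending */
    assert(!enqueue_peer_error(&inst, &p1)); /* duplicate is ignored */
    assert(!enqueue_peer_error(&inst, &p3));

    handled = drain_peer_errors(&inst, NULL);
    assert(drain_peer_errors(&inst, NULL) == 0); /* list is now empty */
    pthread_mutex_destroy(&inst.lock);
    return handled;
}
```

The design choice the sketch highlights is that several failures arriving close together produce one handler invocation that sees all of the affected peers, which is what makes batching possible downstream.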

Remove a couple of apis that don't exist. Signed-off-by: Mark Stapp <mjs@cisco.com>

ton31337

Very nice improvement ahead!

ton31337 · 2024-11-26T07:19:43Z

tests/topotests/bgp_peer_shut/r1/bgpd.conf

Can we switch to frr.conf (unified config)?

mjstapp · 2024-11-26T13:20:27Z

Pushed to try to clean up the build problem

riw777

looks good ... waiting on @ton31337 's one comment

donaldsharp · 2024-12-02T18:12:55Z

diff --git a/bgpd/bgp_route.c b/bgpd/bgp_route.c
index 2f21cfd76d..14c280b7ca 100644
--- a/bgpd/bgp_route.c
+++ b/bgpd/bgp_route.c
@@ -6203,7 +6203,7 @@ static void bgp_clear_batch_dests_task(struct event *event)
 {
        struct bgp_clearing_info *cinfo = EVENT_ARG(event);
        struct bgp_dest *dest;
-       struct bgp_path_info *pi;
+       struct bgp_path_info *pi, *next;
        struct bgp_table *table;
        struct bgp *bgp;
        afi_t afi;
@@ -6225,7 +6225,8 @@ next_dest:
        /* Have to check every path: it is possible that we have multiple paths
         * for a prefix from a peer if that peer is using AddPath.
         */
-       for (pi = bgp_dest_get_bgp_path_info(dest); pi; pi = pi->next) {
+       for (pi = bgp_dest_get_bgp_path_info(dest); pi; pi = next) {
+               next = pi ? pi->next : NULL;
                if (!bgp_clearing_batch_check_peer(cinfo, pi->peer))
                        continue;

donaldsharp · 2024-12-02T18:13:42Z

the above patch will fix the infinite loops we get stuck in sometimes with this code. Effectively when you call bgp_process the pi->next pointer can be reset.
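
The pattern in the patch, capturing the successor before the loop body can modify the current node, can be shown in isolation. The sketch below is a generic illustration and does not model `bgp_process`; `struct path`, `push`, and `remove_peer_paths` are hypothetical names on a plain singly linked list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a path-info node; not the FRR struct. */
struct path {
    int peer_id;
    struct path *next;
};

/* Prepend a node for peer_id and return the new list head. */
static struct path *push(struct path *head, int peer_id)
{
    struct path *n = malloc(sizeof(*n));

    assert(n != NULL);
    n->peer_id = peer_id;
    n->next = head;
    return n;
}

/* Unlink and free every node owned by peer_id; returns the count removed.
 * 'next' is read before the node is touched, so freeing or relinking the
 * current node cannot derail the walk.
 */
static int remove_peer_paths(struct path **head, int peer_id)
{
    struct path **link = head;
    struct path *pi, *next;
    int removed = 0;

    for (pi = *head; pi; pi = next) {
        next = pi->next; /* save before any mutation */
        if (pi->peer_id != peer_id) {
            link = &pi->next;
            continue;
        }
        *link = next; /* unlink the current node */
        free(pi);
        removed++;
    }
    return removed;
}

/* Builds a list with three paths from peer 1, removes them, and reports
 * how many nodes were removed; *left receives the number still linked.
 */
int demo_safe_walk(int *left)
{
    const int ids[] = {1, 2, 1, 3, 1};
    struct path *head = NULL;
    int removed;

    for (size_t i = 0; i < sizeof(ids) / sizeof(ids[0]); i++)
        head = push(head, ids[i]);

    removed = remove_peer_paths(&head, 1);

    *left = 0;
    while (head) {
        struct path *n = head->next;

        free(head);
        head = n;
        (*left)++;
    }
    return removed;
}
```

Had the loop advanced with `pi = pi->next` after the node was freed or relinked, it would read a stale pointer, which is the same class of problem the patch addresses.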

donaldsharp · 2024-12-02T18:14:39Z

there is also a crash that I am chasing down w/ Mark that I am seeing locally

mjstapp · 2024-12-02T19:19:27Z

rebased to apply a couple of fixes - let's see how CI looks

When peer connections encounter errors, attempt to batch some of the clearing processing that occurs. Add a new batch object, add multiple peers to it, if possible. Do one rib walk for the batch, rather than one walk per peer. Use a handler callback per batch to check and remove peers' path-infos, rather than a work-queue and callback per peer. The original clearing code remains; it's used for single peers. Signed-off-by: Mark Stapp <mjs@cisco.com>

Move the peer connection error list to the peer_connection struct; that seems to line up better with the way that struct works. Signed-off-by: Mark Stapp <mjs@cisco.com>

Add a simple topotest using multiple bgp peers; based on the ecmp_topo1 test. Signed-off-by: Mark Stapp <mjs@cisco.com>

mjstapp · 2024-12-04T12:58:20Z

CI:rerun

donaldsharp · 2024-12-10T17:38:56Z

Spoke w/ Mark; he's going to make a change so that all peer clearing events go through his batching

github-actions · 2024-12-17T16:18:55Z

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Mark Stapp added 2 commits November 25, 2024 14:13

bgpd: remove apis from bgp_route.h

affc54a

Remove a couple of apis that don't exist. Signed-off-by: Mark Stapp <mjs@cisco.com>

frrbot bot added bgp tests Topotests, make check, etc zebra labels Nov 25, 2024

github-actions bot added size/XXL master labels Nov 25, 2024

ton31337 reviewed Nov 26, 2024

mjstapp force-pushed the bgp_peer_shut branch from 912adc5 to 0d2605a Compare November 26, 2024 13:19

riw777 approved these changes Nov 26, 2024

mjstapp force-pushed the bgp_peer_shut branch from 0d2605a to 54d1834 Compare December 2, 2024 19:16

github-actions bot added the rebase PR needs rebase label Dec 2, 2024

mjstapp force-pushed the bgp_peer_shut branch from 54d1834 to b45520c Compare December 3, 2024 14:28

Mark Stapp added 3 commits December 3, 2024 14:23

zebra: move peer conn error list to connection struct

a4d9cd0

Move the peer connection error list to the peer_connection struct; that seems to line up better with the way that struct works. Signed-off-by: Mark Stapp <mjs@cisco.com>

tests: add bgp peer-shutdown topotest

39c83b6

Add a simple topotest using multiple bgp peers; based on the ecmp_topo1 test. Signed-off-by: Mark Stapp <mjs@cisco.com>

mjstapp force-pushed the bgp_peer_shut branch from b45520c to 39c83b6 Compare December 3, 2024 19:24

github-actions bot added the conflicts label Dec 17, 2024
