Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GR fails due to bgp route-map delay-timer behavior #10756

Closed
Stephenxf opened this issue Mar 8, 2022 · 1 comment · Fixed by #10838
Closed

GR fails due to bgp route-map delay-timer behavior #10756

Stephenxf opened this issue Mar 8, 2022 · 1 comment · Fixed by #10838
Assignees

Comments

@Stephenxf
Copy link

Describe the bug
The behavior of bgp route-map delay-timer at the very beginning of bgp startup seems incorrect and can cause GR to fail.
This timer has a default value of 5 secs (btw, the comment “disabled by default” is not true). We don’t have this command in our use case, so the default takes effect.

#define RMAP_DEFAULT_UPDATE_TIMER 5 /* disabled by default */

What is weird to me is that, the timer kicks in at the very beginning of bgp startup when the configured route-maps are read. More specifically, bgp_route_map_add() is called and then bgp_route_map_mark_update() starts the timer.
As a result, within 5 secs, bgp rejects all incoming routes and denies all outgoing updates, because inbound/outbound route-maps are not ready.

This is causing GR to fail in restart bgp case. Basically the restarting bgp speaker drops all incoming routes, accepts all EORs, runs best-path selection and then sends out EORs (with no routes) to all GR helpers, who end up purging all stale routes…

To Reproduce

  1. At least one bgp session, both bgp speakers originate at least one route and exchange.
  2. Enable GR on DUT and peering devices, configure inbound/outbound neighbor route-maps.
  3. Restart bgp on DUT and make sure session comes up and the neighbor (i.e., GR helpers) sends routes within 5 secs (if for some reason it takes longer, increase route-map delay-timer accordingly such that the routes are exchanged before delay-timer fires).
  4. Because of this reported issue, GR helpers will NOT receive routes from DUT but only EORs, and hence purge the stale routes previously received from DUT.

Expected behavior
Ideally the delay-timer shouldn’t apply at initial config load, but only be used on subsequent route-map add/delete/change.

Versions

  • OS Version: Debian GNU/Linux 10 (buster)
  • Kernel: Linux 4.19.0-12-2-amd64
  • FRR Version: 7.5.1

Additional context
I was testing with small scale, so everything happened within 5 secs (sessions established, updates exchanged). After 5 secs, soft-in/out would take place once route-maps are ready, but there certainly would be traffic loss. Some bgp/zebra logs attached as follows:

// read route-map configs at startup, delay-timer starts.
Mar  8 17:27:16.969896 dut01 DEBUG #zebra[30]: Add route-map permit-all
Mar  8 17:27:16.969896 dut01 DEBUG #zebra[30]: Route-map permit-all add sequence 10, type: permit
Mar  8 17:27:16.970549 dut01 DEBUG #bgpd[45]: Add route-map permit-all
Mar  8 17:27:16.970711 dut01 DEBUG #bgpd[45]: Route-map permit-all add sequence 10, type: permit

// session up, because route-maps are not ready, routes going out and coming in all denied, but EOR is received and later sent.
Mar  8 17:27:17.985906 dut01 INFO #bgpd[45]: %ADJCHANGE: neighbor 10.0.64.1(lnos-x1-a-fab01) in vrf default Up
Mar  8 17:27:18.977747 dut01 DEBUG #bgpd[45]: Route-map: null, prefix: 99.0.0.0/24, result: deny
Mar  8 17:27:18.977802 dut01 DEBUG #bgpd[45]: 10.0.64.1 [Update:SEND] 99.0.0.0/24 is filtered by route-map
Mar  8 17:27:18.980497 dut01 DEBUG #bgpd[45]: 10.0.64.1 rcvd UPDATE about 77.0.0.0/24 IPv4 unicast -- DENIED due to: route-map;
Mar  8 17:27:18.980497 dut01 INFO #bgpd[45]: bgp_update_receive: rcvd End-of-RIB for IPv4 Unicast from 10.0.64.1 in vrf default
Mar  8 17:27:18.981708 dut01 DEBUG #bgpd[45]: send End-of-RIB for IPv4 Unicast to 10.0.64.1

// after 5 secs, route-map notification comes, then soft-in/out will follow.
Mar  8 17:27:21.971764 dut01 DEBUG #zebra[30]: Event handler for route-map: permit-all
Mar  8 17:27:22.005981 dut01 DEBUG #bgpd[45]: u1:s1 announcing routes upon policy permit-all (type 0) change
@Stephenxf Stephenxf added the triage Needs further investigation label Mar 8, 2022
@ton31337 ton31337 added bgp bug and removed triage Needs further investigation labels Mar 8, 2022
@Stephenxf
Copy link
Author

Additional notes:

  1. We don’t have bgp route-map delay-timer command in our config, but the timer has a default value 5 secs, so the delay kicks in every time a route-map is added/deleted/modified, e.g., bgp_route_map_add() -> bgp_route_map_mark_update().
  2. When I say “route-maps are not ready”, I mean bgp knows the name of the route-map but haven’t got the pointer to the map. In FRR language, for inbound/outbound maps, ROUTE_MAP_IN_NAME() and ROUTE_MAP_OUT_NAME() are there, but ROUTE_MAP_IN() and ROUTE_MAP_OUT() are not. The pointers will be populated when the delay-timer fires.
  3. The concern is, when we read in the initial configs and kick off the route-map delay-timer, it will cause unnecessary churn and, in particular, lead to GR failure. “unnecessary churn” happens because the incoming/outgoing routes are rejected before the timer fires, and we have to do soft-in/out (or rely on soft-reconfig) once route-maps are in place.

@ton31337 ton31337 self-assigned this Mar 21, 2022
karampok added a commit to karampok/frr-k8s that referenced this issue Oct 28, 2024
When the default value of 5 seconds is being used,
Graceful-Restart can break momentarily on reconnect of
BGP peering.

The restarter FRR when that timeout expires and the route-maps
have not been processed, denies the routes, and therefore are removed.

```
DEBUG #bgpd[45]: Route-map: null, prefix: 99.0.0.0/24, result: deny
DEBUG #bgpd[45]: 10.0.64.1 [Update:SEND] 99.0.0.0/24 is filtered by route-map
```

FRRouting/frr#10757
FRRouting/frr#10756

Signed-off-by: karampok <karampok@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment