BGP route recursive resolution not happening #13682

Closed
abdosi opened this issue Jun 4, 2023 · 4 comments

abdosi commented Jun 4, 2023

Issue Description:
FRR Version: 8.2.2 being used in SONiC: https://github.com/sonic-net/sonic-buildimage/tree/202205/src/sonic-frr
https://github.com/sonic-net/sonic-frr/tree/79188bf710e92acf42fb5b9b0a2e9593a5e

admin@svcstr2-xxxx-lc4-1:~$ uname -a
Linux svcstr2-8800-lc4-1 5.10.0-18-2-amd64 #1 SMP Debian 5.10.140-1 (2022-09-02) x86_64 GNU/Linux
admin@svcstr2-xxxx-lc4-1:~$ cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

Topology Instance:

  • The BGP0 (3.3.3.12) and BGP2 (3.3.3.13) instances run on the same line card in different Linux namespaces. Both have iBGP sessions with the BGP1 (3.3.3.6) instance, which runs on a different line card.

  • The BGP1 instance has an e-BGP peer, 10.0.0.77. The route learned via this e-BGP peer is 100.1.0.39.

  • On the BGP0 instance, 100.1.0.39 is learned correctly: it resolves recursively over 10.0.0.77, which in turn resolves recursively over 3.3.3.6.

svcstr2-xxxx-lc4-1# show ip route 100.1.0.39
Routing entry for 100.1.0.39/32
  Known via "bgp", distance 200, metric 0, best
  Last update 01w4d19h ago
    10.0.0.77 (recursive), weight 1
  *   20.0.1.4, via PortChannel01, weight 1
  *   20.0.2.4, via PortChannel02, weight 1
  *   20.0.5.4, via PortChannel05, weight 1
  *   20.0.6.4, via PortChannel06, weight 1
  *   20.0.9.4, via PortChannel09, weight 1
  *   20.0.10.4, via PortChannel10, weight 1
  *   20.0.11.4, via PortChannel11, weight 1
  *   20.0.12.4, via PortChannel12, weight 1
  *   20.0.13.4, via PortChannel13, weight 1
  *   20.0.14.4, via PortChannel14, weight 1

  • On the BGP2 instance, 100.1.0.39 is learned, but 10.0.0.77 is marked inactive and is not resolved recursively.

svcstr2-xxxx-lc4-1# show ip route 100.1.0.39
Routing entry for 100.1.0.39/32
  Known via "bgp", distance 200, metric 0
  Last update 01w4d19h ago
    10.0.0.77 inactive, weight 1
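
To make the expected behaviour concrete, here is a minimal Python sketch of two-level recursive nexthop resolution. It is a toy model only, not how zebra implements resolution: the `RIB` dictionary, the tuple layout, and the function names `longest_match` and `resolve` are assumptions made for illustration, and only two of the ten port-channel nexthops are included. It shows what a working instance should produce (the flattened gateway list) and that a missing intermediate route yields an empty result, which corresponds to the "inactive" state. It does not reproduce the race itself, since on BGP2 the intermediate route 10.0.0.76/31 is present.

```python
import ipaddress

# Toy RIB for one instance, loosely modelled on the outputs in this issue.
# A nexthop is ("gateway", ip, interface) when directly reachable,
# or ("recursive", ip) when it must be resolved through another route.
RIB = {
    "100.1.0.39/32": [("recursive", "10.0.0.77")],
    "10.0.0.76/31": [("recursive", "3.3.3.6")],
    "3.3.3.6/32": [
        ("gateway", "20.0.1.4", "PortChannel01"),
        ("gateway", "20.0.2.4", "PortChannel02"),
    ],
}


def longest_match(rib, addr):
    """Return the most specific prefix in rib covering addr, or None."""
    ip = ipaddress.ip_address(addr)
    best = None
    for prefix in rib:
        net = ipaddress.ip_network(prefix)
        if ip in net and (best is None or net.prefixlen > best.prefixlen):
            best = net
    return str(best) if best is not None else None


def resolve(rib, prefix, max_depth=8):
    """Flatten the nexthops of prefix into (gateway, interface) pairs.

    An empty list means nothing could be resolved, which is the
    situation shown as "inactive" in the BGP2 output.
    """
    resolved = []
    for hop in rib.get(prefix, []):
        if hop[0] == "gateway":
            resolved.append((hop[1], hop[2]))
        elif max_depth > 0:
            covering = longest_match(rib, hop[1])
            # Never resolve a route through itself.
            if covering is not None and covering != prefix:
                resolved.extend(resolve(rib, covering, max_depth - 1))
    return resolved


if __name__ == "__main__":
    print(resolve(RIB, "100.1.0.39/32"))
```

In this model, 100.1.0.39/32 resolves through 10.0.0.76/31 and then 3.3.3.6/32, matching the chain seen on BGP0.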

Working Case: BGP0 instance

svcstr2-xxxx-lc4-1# show ip route 100.1.0.39
Routing entry for 100.1.0.39/32
  Known via "bgp", distance 200, metric 0, best
  Last update 01w4d19h ago
    10.0.0.77 (recursive), weight 1
  *   20.0.1.4, via PortChannel01, weight 1
  *   20.0.2.4, via PortChannel02, weight 1
  *   20.0.5.4, via PortChannel05, weight 1
  *   20.0.6.4, via PortChannel06, weight 1
  *   20.0.9.4, via PortChannel09, weight 1
  *   20.0.10.4, via PortChannel10, weight 1
  *   20.0.11.4, via PortChannel11, weight 1
  *   20.0.12.4, via PortChannel12, weight 1
  *   20.0.13.4, via PortChannel13, weight 1
  *   20.0.14.4, via PortChannel14, weight 1

svcstr2-xxxx-lc4-1# show ip route 10.0.0.77
Routing entry for 10.0.0.76/31
  Known via "bgp", distance 200, metric 0, best
  Last update 01w4d19h ago
    3.3.3.6 (recursive), weight 1
  *   20.0.1.4, via PortChannel01, weight 1
  *   20.0.2.4, via PortChannel02, weight 1
  *   20.0.5.4, via PortChannel05, weight 1
  *   20.0.6.4, via PortChannel06, weight 1
  *   20.0.9.4, via PortChannel09, weight 1
  *   20.0.10.4, via PortChannel10, weight 1
  *   20.0.11.4, via PortChannel11, weight 1
  *   20.0.12.4, via PortChannel12, weight 1
  *   20.0.13.4, via PortChannel13, weight 1
  *   20.0.14.4, via PortChannel14, weight 1

svcstr2-xxxx-lc4-1# show ip route 3.3.3.6
Routing entry for 3.3.3.6/32
  Known via "static", distance 1, metric 0, tag 2, best
  Last update 01w4d19h ago
  * 20.0.1.4, via PortChannel01, weight 1
  * 20.0.2.4, via PortChannel02, weight 1
  * 20.0.5.4, via PortChannel05, weight 1
  * 20.0.6.4, via PortChannel06, weight 1
  * 20.0.9.4, via PortChannel09, weight 1
  * 20.0.10.4, via PortChannel10, weight 1
  * 20.0.11.4, via PortChannel11, weight 1
  * 20.0.12.4, via PortChannel12, weight 1
  * 20.0.13.4, via PortChannel13, weight 1
  * 20.0.14.4, via PortChannel14, weight 1
 
svcstr2-xxxx-lc4-1# show ip bgp 100.1.0.39
BGP routing table entry for 100.1.0.39/32, version 68335
Paths: (1 available, best #1, table default)
  Advertised to non peer-group peers:
  10.0.0.13 10.0.0.1 10.0.0.5 10.0.0.9 10.0.0.17 10.0.0.21
  65004
    10.0.0.77 from 3.3.3.6 (3.3.3.6)
      Origin IGP, localpref 100, valid, internal, best (First path received)
      Last update: Wed May 24 00:36:42 2023

svcstr2-xxxx-lc4-1# show ip bgp 10.0.0.77
BGP routing table entry for 10.0.0.76/31, version 51006
Paths: (1 available, best #1, table default, not advertised to EBGP peer)
  Not advertised to any peer
  Local
    3.3.3.6 from 3.3.3.6 (3.3.3.6)
      Origin incomplete, metric 0, localpref 100, valid, internal, best (First path received)
      Community: no-export
      Last update: Wed May 24 00:36:42 2023

svcstr2-xxxx-lc4-1# show ip nht 10.0.0.77
10.0.0.77
 resolved via bgp
 via 3.3.3.6
 Client list: bgp(fd 34)

svcstr2-xxxx-lc4-1# show ip nht 3.3.3.6
3.3.3.6
 resolved via static
 via 20.0.1.4, PortChannel01
 via 20.0.2.4, PortChannel02
 via 20.0.5.4, PortChannel05
 via 20.0.6.4, PortChannel06
 via 20.0.9.4, PortChannel09
 via 20.0.10.4, PortChannel10
 via 20.0.11.4, PortChannel11
 via 20.0.12.4, PortChannel12
 via 20.0.13.4, PortChannel13
 via 20.0.14.4, PortChannel14
 Client list: bgp(fd 34)

svcstr2-xxxx-lc4-1# show ip bgp nexthop
Current BGP nexthop cache:
 3.3.3.6 valid [IGP metric 0], #paths 7, peer 3.3.3.6
  gate 20.0.1.4, if PortChannel01
  gate 20.0.2.4, if PortChannel02
  gate 20.0.5.4, if PortChannel05
  gate 20.0.6.4, if PortChannel06
  gate 20.0.9.4, if PortChannel09
  gate 20.0.10.4, if PortChannel10
  gate 20.0.11.4, if PortChannel11
  gate 20.0.12.4, if PortChannel12
  gate 20.0.13.4, if PortChannel13
  gate 20.0.14.4, if PortChannel14
  Last update: Wed May 24 00:36:47 2023

 10.0.0.77 valid [IGP metric 0], #paths 1
  gate 20.0.1.4, if PortChannel01
  gate 20.0.2.4, if PortChannel02
  gate 20.0.5.4, if PortChannel05
  gate 20.0.6.4, if PortChannel06
  gate 20.0.9.4, if PortChannel09
  gate 20.0.10.4, if PortChannel10
  gate 20.0.11.4, if PortChannel11
  gate 20.0.12.4, if PortChannel12
  gate 20.0.13.4, if PortChannel13
  gate 20.0.14.4, if PortChannel14
  Last update: Wed May 24 00:36:48 2023

Non-Working Case: BGP2 instance

svcstr2-xxxx-lc4-1# show ip route 100.1.0.39
Routing entry for 100.1.0.39/32
  Known via "bgp", distance 200, metric 0
  Last update 01w4d19h ago
    **10.0.0.77 inactive, weight 1**

svcstr2-xxxx-lc4-1# show ip route 10.0.0.77
Routing entry for 10.0.0.76/31
  Known via "bgp", distance 200, metric 0, best
  Last update 01w4d19h ago
    3.3.3.6 (recursive), weight 1
  *   20.0.1.4, via PortChannel17, weight 1
  *   20.0.2.4, via PortChannel18, weight 1
  *   20.0.5.4, via PortChannel21, weight 1
  *   20.0.6.4, via PortChannel22, weight 1
  *   20.0.9.4, via PortChannel25, weight 1
  *   20.0.10.4, via PortChannel26, weight 1
  *   20.0.11.4, via PortChannel27, weight 1
  *   20.0.12.4, via PortChannel28, weight 1
  *   20.0.13.4, via PortChannel29, weight 1
  *   20.0.14.4, via PortChannel30, weight 1

svcstr2-xxxx-lc4-1# show ip route 3.3.3.6
Routing entry for 3.3.3.6/32
  Known via "static", distance 1, metric 0, tag 2, best
  Last update 01w4d19h ago
  * 20.0.1.4, via PortChannel17, weight 1
  * 20.0.2.4, via PortChannel18, weight 1
  * 20.0.5.4, via PortChannel21, weight 1
  * 20.0.6.4, via PortChannel22, weight 1
  * 20.0.9.4, via PortChannel25, weight 1
  * 20.0.10.4, via PortChannel26, weight 1
  * 20.0.11.4, via PortChannel27, weight 1
  * 20.0.12.4, via PortChannel28, weight 1
  * 20.0.13.4, via PortChannel29, weight 1
  * 20.0.14.4, via PortChannel30, weight 1
  
svcstr2-xxxx-lc4-1# show ip bgp 100.1.0.39
BGP routing table entry for 100.1.0.39/32, version 84464
Paths: (1 available, best #1, table default)
  Advertised to non peer-group peers:
  10.0.0.25 10.0.0.37 10.0.0.39 10.0.0.41 10.0.0.43 10.0.0.45 10.0.0.47 10.0.0.29 10.0.0.33 10.0.0.35
  65004
    10.0.0.77 from 3.3.3.6 (3.3.3.6)
      Origin IGP, localpref 100, valid, internal, best (First path received)
      Last update: Wed May 24 00:36:44 2023

svcstr2-xxxx-lc4-1# show ip bgp 10.0.0.77
BGP routing table entry for 10.0.0.76/31, version 83647
Paths: (1 available, best #1, table default, not advertised to EBGP peer)
  Not advertised to any peer
  Local
    3.3.3.6 from 3.3.3.6 (3.3.3.6)
      Origin incomplete, metric 0, localpref 100, valid, internal, best (First path received)
      Community: no-export
      Last update: Wed May 24 00:36:43 2023

svcstr2-xxxx-lc4-1# show ip nht 10.0.0.77
10.0.0.77
 resolved via bgp
 via 3.3.3.6
 Client list: bgp(fd 34)

svcstr2-xxxx-lc4-1# show ip bgp nexthop
Current BGP nexthop cache:
 3.3.3.6 valid [IGP metric 0], #paths 7, peer 3.3.3.6
  gate 20.0.1.4, if PortChannel17
  gate 20.0.2.4, if PortChannel18
  gate 20.0.5.4, if PortChannel21
  gate 20.0.6.4, if PortChannel22
  gate 20.0.9.4, if PortChannel25
  gate 20.0.10.4, if PortChannel26
  gate 20.0.11.4, if PortChannel27
  gate 20.0.12.4, if PortChannel28
  gate 20.0.13.4, if PortChannel29
  gate 20.0.14.4, if PortChannel30
  Last update: Wed May 24 00:36:47 2023

 10.0.0.77 valid [IGP metric 0], #paths 1
  gate 20.0.1.4, if PortChannel17
  gate 20.0.2.4, if PortChannel18
  gate 20.0.5.4, if PortChannel21
  gate 20.0.6.4, if PortChannel22
  gate 20.0.9.4, if PortChannel25
  gate 20.0.10.4, if PortChannel26
  gate 20.0.11.4, if PortChannel27
  gate 20.0.12.4, if PortChannel28
  gate 20.0.13.4, if PortChannel29
  gate 20.0.14.4, if PortChannel30
  Last update: Wed May 24 00:36:48 2023



@abdosi abdosi added the triage Needs further investigation label Jun 4, 2023

abdosi commented Jun 4, 2023

This looks like a timing or race-condition issue; it does not always happen. Is there any dump we can collect in this state that would help debug it?
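
Until someone suggests a specific dump, one option is to snapshot the same show commands used in the description whenever the bad state is seen. The sketch below is a hedged example, not an official FRR tool: `build_commands`, `run_vtysh`, and `collect` are hypothetical helper names, and it assumes `vtysh` is on the PATH and already targets the right instance (selecting the per-namespace instance is left out).

```python
import subprocess


def build_commands(prefix, nexthops):
    """List the show commands for a prefix and its chain of nexthops."""
    cmds = [f"show ip route {prefix}", f"show ip bgp {prefix}"]
    for nh in nexthops:
        cmds += [
            f"show ip route {nh}",
            f"show ip bgp {nh}",
            f"show ip nht {nh}",
        ]
    cmds.append("show ip bgp nexthop")
    return cmds


def run_vtysh(command):
    """Run one command through vtysh and return its text output.

    Assumption: vtysh is installed and reaches the instance of interest.
    """
    proc = subprocess.run(
        ["vtysh", "-c", command],
        capture_output=True,
        text=True,
        check=False,
    )
    return proc.stdout


def collect(prefix, nexthops, runner=run_vtysh):
    """Return a {command: output} snapshot for later comparison."""
    return {cmd: runner(cmd) for cmd in build_commands(prefix, nexthops)}


if __name__ == "__main__":
    snapshot = collect("100.1.0.39", ["10.0.0.77", "3.3.3.6"])
    for cmd, out in snapshot.items():
        print(f"### {cmd}\n{out}")
```

Taking the same snapshot on a working and a non-working instance makes it easy to diff the two states.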


abdosi commented Jun 4, 2023

cc @anamehra @arlakshm @rlhui for viz

yxieca pushed a commit to sonic-net/sonic-buildimage that referenced this issue Jun 7, 2023
What I did:

Workaround for the issue seen here: FRRouting/frr#13682
There appears to be a timing issue: when multiple recursive lookups are needed to resolve a route's nexthop, resolution may not complete correctly, leaving the route in an inactive state.

The issue is seen on chassis-packet because two levels of recursive lookup are needed for a given e-BGP learnt route:
- Level 1: resolve the e-BGP peer (connected route via bgp) over Loopback4096 (i-BGP peering)
- Level 2: resolve Loopback4096 over the backend port-channel next-hops

For VOQ chassis there is no e-BGP peer (connected route via bgp) resolution, as the route is added as a static route by orchagent over Ethernet-IB.

Also as part of this, remove the route-map policy from instance.conf.j2, as the same policy is defined in peer-group.j2.

Microsoft ADO: https://msazure.visualstudio.com/One/_workitems/edit/24198507

How I verify:
Functional verification, done manually.
Updated UT.
We will be adding a sanity check in sonic-mgmt to make sure no routes are in an inactive state.

Signed-off-by: Abhishek Dosi <abdosi@microsoft.com>
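
A sanity check of the kind mentioned in the commit message could look roughly like the sketch below. This is not the actual sonic-mgmt test: `find_inactive_nexthops` is a hypothetical helper that only scans the plain-text `show ip route` output in the format shown in this issue and reports nexthop lines marked inactive.

```python
def find_inactive_nexthops(show_route_output):
    """Return (prefix, nexthop_line) pairs for nexthops marked inactive.

    The input is the plain-text output of "show ip route", in the
    format shown in this issue.
    """
    findings = []
    prefix = None
    for raw in show_route_output.splitlines():
        # Drop indentation and any leading or trailing "*" markers.
        line = raw.strip().strip("*").strip()
        if line.startswith("Routing entry for "):
            prefix = line[len("Routing entry for "):]
        elif " inactive" in line:
            findings.append((prefix, line))
    return findings


if __name__ == "__main__":
    sample = (
        "Routing entry for 100.1.0.39/32\n"
        '  Known via "bgp", distance 200, metric 0\n'
        "  Last update 01w4d19h ago\n"
        "    10.0.0.77 inactive, weight 1\n"
    )
    for prefix, hop in find_inactive_nexthops(sample):
        print(f"{prefix}: {hop}")
```

An empty result for every instance would indicate that no route is stuck in the inactive state.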
mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this issue Jun 7, 2023
mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this issue Jun 7, 2023
mssonicbld pushed a commit to sonic-net/sonic-buildimage that referenced this issue Jun 10, 2023
sonic-otn pushed a commit to sonic-otn/sonic-buildimage that referenced this issue Sep 20, 2023

github-actions bot commented Dec 2, 2023

This issue is stale because it has been open 180 days with no activity. Comment or remove the autoclose label in order to avoid having this issue closed.


frrbot bot commented Dec 2, 2023

This issue will be automatically closed in the specified period unless there is further activity.

@frrbot frrbot bot closed this as completed Dec 9, 2023
@frrbot frrbot bot removed the autoclose label Dec 9, 2023