ip route in the kernel does not match routes in bgp #5026

Closed
lguohan opened this issue Jul 18, 2020 · 8 comments · Fixed by #5038 or #6478

Comments

@lguohan
Collaborator

lguohan commented Jul 18, 2020

Description

Steps to reproduce the issue (a consolidated shell sketch follows the list):

  1. download vlab-01.img, vlab-01.memdmp, and vlab-01.xml from https://sonic-jenkins.westus2.cloudapp.azure.com/job/vs/job/buildimage-vs-image-pr/4972/artifact/kvmdump/
  2. modify the disk location in vlab-01.xml to point to vlab-01.img
  3. in virsh, restore the memory dump: virsh restore vlab-01.memdmp --xml vlab-01.xml
  4. reconnect the T0 testbed to this DUT: ./testbed-cli.sh -m veos.vtb -t vtestbed.csv connect-topo vms-kvm-t0 lab password.txt
  5. do the route announcement: py.test $PYTEST_COMMON_OPTS test_announce_routes.py
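
A rough consolidation of these steps as shell commands. The per-file artifact URLs and the use of wget are assumptions; the testbed and test commands are taken verbatim from the list above:

```
# Fetch the prebuilt VM artifacts (assumed per-file URLs under the Jenkins job above)
BASE=https://sonic-jenkins.westus2.cloudapp.azure.com/job/vs/job/buildimage-vs-image-pr/4972/artifact/kvmdump
wget "$BASE/vlab-01.img" "$BASE/vlab-01.memdmp" "$BASE/vlab-01.xml"

# Edit vlab-01.xml so its <disk> source points at the downloaded vlab-01.img,
# then restore the saved VM state using the updated domain definition.
virsh restore vlab-01.memdmp --xml vlab-01.xml

# Reconnect the T0 testbed to this DUT and announce the routes.
./testbed-cli.sh -m veos.vtb -t vtestbed.csv connect-topo vms-kvm-t0 lab password.txt
py.test $PYTEST_COMMON_OPTS test_announce_routes.py
```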

Describe the results you received:
check ip route.

```
admin@vlab-01:~$ ip route
10.0.0.56/31 dev PortChannel0001 proto kernel scope link src 10.0.0.56
10.0.0.58/31 dev PortChannel0002 proto kernel scope link src 10.0.0.58
10.0.0.60/31 dev PortChannel0003 proto kernel scope link src 10.0.0.60
10.0.0.62/31 dev PortChannel0004 proto kernel scope link src 10.0.0.62
10.250.0.0/24 dev eth0 proto kernel scope link src 10.250.0.101
192.168.0.0/21 dev Vlan1000 proto kernel scope link src 192.168.0.1
240.127.1.0/24 dev docker0 proto kernel scope link src 240.127.1.1 linkdown
```

check ip bgp sum

```
admin@vlab-01:~$ show ip bgp sum

IPv4 Unicast Summary:
BGP router identifier 10.1.0.32, local AS number 65100 vrf-id 0
BGP table version 2
RIB entries 12807, using 2301 KiB of memory
Peers 4, using 82 KiB of memory
Peer groups 6, using 384 bytes of memory

Neighbor        V         AS MsgRcvd MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   NeighborName
10.0.0.57       4      64600   14414    1282        0    0    0 00:52:03         6399   ARISTA01T1
10.0.0.59       4      64600    7687    1289        0    0    0 00:52:02         6400   ARISTA02T1
10.0.0.61       4      64600    7687    1289        0    0    0 00:52:02         6400   ARISTA03T1
10.0.0.63       4      64600    7687    1289        0    0    0 00:52:02         6400   ARISTA04T1

Total number of neighbors 4
```

show ip bgp will give you lots of routes.

```
vlab-01# show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
       F - PBR, f - OpenFabric,
       > - selected route, * - FIB route, q - queued route, r - rejected route

C>* 10.0.0.56/31 is directly connected, PortChannel0001, 00:53:40
C>* 10.0.0.58/31 is directly connected, PortChannel0002, 00:53:39
C>* 10.0.0.60/31 is directly connected, PortChannel0003, 00:53:39
C>* 10.0.0.62/31 is directly connected, PortChannel0004, 00:53:39
C>* 10.1.0.32/32 is directly connected, Loopback0, 02:38:39
C>* 10.250.0.0/24 is directly connected, eth0, 02:38:50
C>* 192.168.0.0/21 is directly connected, Vlan1000, 02:38:40
```

Something is wrong with FRR.

Describe the results you expected:
The number of routes in the kernel (ip route) should be around 32000.
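
As a quick sanity check (not part of the original report), the kernel FIB can be compared against FRR's own view of its routes; `show ip route summary` is a standard vtysh command and the exact counts will vary:

```
# How many routes does the kernel actually have?
ip route | wc -l

# How many routes does FRR believe are in its RIB, broken down per protocol?
vtysh -c 'show ip route summary'
```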

@lguohan lguohan transferred this issue from sonic-net/sonic-mgmt Jul 23, 2020
@lguohan lguohan pinned this issue Jul 23, 2020
lguohan pushed a commit that referenced this issue Jul 25, 2020
fixes #5026
Explanation:
In the log from the issue I found:
```
Jul 22 21:13:06.574831 vlab-01 WARNING bgp#bgpd[49]: [EC 33554499] sendmsg_nexthop: zclient_send_message() failed
```
Analyzing the source code I found that this error message can be issued only when `zclient_send_rnh()` returns less than 0.
```
	ret = zclient_send_rnh(zclient, command, p, exact_match,
			       bnc->bgp->vrf_id);
	/* TBD: handle the failure */
	if (ret < 0)
		flog_warn(EC_BGP_ZEBRA_SEND,
			  "sendmsg_nexthop: zclient_send_message() failed");
```
I checked [zclient_send_rnh()](https://github.com/Azure/sonic-frr/blob/88351c8f6df5450e9098f773813738f62abb2f5e/lib/zclient.c#L654) and found that this function returns the code it gets from [zclient_send_message()](https://github.com/Azure/sonic-frr/blob/88351c8f6df5450e9098f773813738f62abb2f5e/lib/zclient.c#L266). The latter function can return a non-zero value in two cases:
1. bgpd has not connected to the zclient socket yet [code](https://github.com/Azure/sonic-frr/blob/88351c8f6df5450e9098f773813738f62abb2f5e/lib/zclient.c#L269)
2. The socket was closed. But in that case we would see an error message in the log. (And I can find that message in the log when we reboot SONiC.) [code](https://github.com/Azure/sonic-frr/blob/88351c8f6df5450e9098f773813738f62abb2f5e/lib/zclient.c#L277)

Also, I can see from the logs that the zebra client connection was established only after bgpd had already hit the issue.

Bgpd.log
```
Jul 22 21:13:06.574831 vlab-01 WARNING bgp#bgpd[49]: [EC 33554499] sendmsg_nexthop: zclient_send_message() failed
```
Vs
Zebra.log
```
Jul 22 21:13:12.713249 vlab-01 NOTICE bgp#zebra[48]: client 25 says hello and bids fair to announce only static routes vrf=0
Jul 22 21:13:12.820352 vlab-01 NOTICE bgp#zebra[48]: client 30 says hello and bids fair to announce only bgp routes vrf=0
Jul 22 21:13:12.820352 vlab-01 NOTICE bgp#zebra[48]: client 33 says hello and bids fair to announce only vnc routes vrf=0
```
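
To see this ordering on a running device, one can compare the timestamps of the two messages directly. This is a hedged sketch: the log locations are assumptions (based on /var/log/quagga mentioned later in this thread; newer images may log to syslog instead), and `show zebra client summary` depends on the FRR build supporting it:

```
# When did bgpd first fail to reach zebra, and when did zebra start accepting clients?
grep -h 'zclient_send_message() failed' /var/log/quagga/*.log /var/log/syslog 2>/dev/null
grep -h 'says hello and bids fair'      /var/log/quagga/*.log /var/log/syslog 2>/dev/null

# List the clients currently connected to zebra (if supported by the FRR build)
vtysh -c 'show zebra client summary'
```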
So in our case we should start zebra first, wait until it is up, and only then start bgpd and the other daemons.

**- How I did it**

I changed the dependency graph so the daemons start in the following order (a minimal startup sketch follows the list):
1. First start zebra.
2. Then start staticd and bgpd.
3. Then start vtysh -b and bgpeoi after bgpd is started.
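
A minimal sketch of that ordering, assuming FRR's default ZAPI socket path (/var/run/frr/zserv.api) and daemons under /usr/lib/frr. The actual fix adjusts the daemon start-up graph inside the bgp container rather than using a script like this:

```
#!/bin/sh
# Illustrative only: start zebra, wait for its ZAPI socket, then start the rest.
ZSERV_SOCK=/var/run/frr/zserv.api   # assumed FRR default; adjust to the actual build

/usr/lib/frr/zebra -d -A 127.0.0.1

# Poll until zebra has created its ZAPI socket (give up after ~10 seconds).
i=0
while [ ! -S "$ZSERV_SOCK" ] && [ "$i" -lt 20 ]; do
    sleep 0.5
    i=$((i + 1))
done

/usr/lib/frr/staticd -d -A 127.0.0.1
/usr/lib/frr/bgpd    -d -A 127.0.0.1

# Replay the saved configuration once the daemons are up.
vtysh -b
```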
@lguohan lguohan unpinned this issue Jul 25, 2020
@lguohan lguohan reopened this Aug 6, 2020
@stephengzh

@lguohan
Hi guohan, I met the same issue when running test_default_route.py
I added a route and ran it again; the issue still exists and I can see the "route raw info", but the "route parsed info" gives nothing.

```
27/08/2020 05:37:21 DEBUG devices.py:_run:85: /var/johnar/sonic-mgmt/tests/common/devices.py::get_ip_route_info#617: [str-msn2700-01] AnsibleModule::command Result => {"stderr_lines": [], "cmd": ["ip", "route", "list", "exact", "0.0.0.0/0"], "end": "2020-08-27 05:37:21.527655", "_ansible_no_log": false, "stdout": "default dev Loopback0 scope link src 10.1.0.32 ", "changed": true, "rc": 0, "start": "2020-08-27 05:37:21.522416", "stderr": "", "delta": "0:00:00.005239", "invocation": {"module_args": {"warn": true, "executable": null, "_uses_shell": false, "strip_empty_ends": true, "_raw_params": "ip route list exact 0.0.0.0/0", "removes": null, "argv": null, "creates": null, "chdir": null, "stdin_add_newline": true, "stdin": null}}, "stdout_lines": ["default dev Loopback0 scope link src 10.1.0.32 "], "failed": false}
27/08/2020 05:37:21 INFO devices.py:get_ip_route_info:621: route raw info for 0.0.0.0/0: [u'default dev Loopback0 scope link src 10.1.0.32 ']
27/08/2020 05:37:21 INFO devices.py:get_ip_route_info:656: route parsed info for 0.0.0.0/0: {'set_src': None, 'nexthops': []}
```

@pavel-shirshov
Contributor

@stephengzh Thank you for your update. Do you have the bgp and zebra logs from the moment when the default route test stopped working?
You can find them in /var/log/quagga. It would be good to have all the files from there.

@pavel-shirshov
Contributor

@stephengzh You don't see parsed info because your raw info doesn't have the src or nexthops information that the parser looks for.
Check
https://github.com/Azure/sonic-mgmt/blob/master/tests/common/devices.py#L549
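
For comparison (not from the original thread): on a healthy T0, the BGP-learned default route would normally appear in the kernel as an ECMP route with explicit nexthops, which is the shape the parser keys on. A hedged example using the T1 peer addresses from the issue; the exact proto/metric fields may differ:

```
admin@vlab-01:~$ ip route list exact 0.0.0.0/0
default proto bgp src 10.1.0.32 metric 20
        nexthop via 10.0.0.57 dev PortChannel0001 weight 1
        nexthop via 10.0.0.59 dev PortChannel0002 weight 1
        nexthop via 10.0.0.61 dev PortChannel0003 weight 1
        nexthop via 10.0.0.63 dev PortChannel0004 weight 1
```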

@stephengzh

@pavel-shirshov These two files are empty.

@pavel-shirshov
Contributor

But in that case there were no BGP sessions established. I'd say FRR didn't even start working.

@shi-su
Contributor

shi-su commented Jan 16, 2021

It seems that there is a race condition between the zebra server accepting connections and bgpd trying to connect. Bgpd has a chance to try to connect before zebra is ready. In this scenario, bgpd will try again after 10 seconds and operate as normal during those 10 seconds. As a consequence, whatever bgpd tries to send to zebra in that window will be missing.

lguohan pushed a commit that referenced this issue Jan 19, 2021
Fix #5026

There is a race condition between the zebra server accepting connections and bgpd trying to connect. Bgpd has a chance to try to connect before zebra is ready. In this scenario, bgpd will try again after 10 seconds and operate as normal during those 10 seconds. As a consequence, whatever bgpd tries to send to zebra in that window will be missing. To avoid such a scenario, bgpd should start after zebra is ready to accept connections.
lguohan pushed a commit that referenced this issue Jan 21, 2021
Fix #5026

There is a race condition between the zebra server accepting connections and bgpd trying to connect. Bgpd has a chance to try to connect before zebra is ready. In this scenario, bgpd will try again after 10 seconds and operate as normal during those 10 seconds. As a consequence, whatever bgpd tries to send to zebra in that window will be missing. To avoid such a scenario, bgpd should start after zebra is ready to accept connections.