[HLD] bmp for monitoring SONiC BGP info
1 parent ba7028b, commit 7edbcda
Showing 4 changed files with 353 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,353 @@ | ||
# BMP for monitoring SONiC BGP info | ||
## High Level Design Document | ||
### Rev 0.1 | ||
|
||
# Table of Contents | ||
|
||
* [Revision](#revision) | ||
|
||
* [About this Manual](#about-this-manual) | ||
|
||
* [Definitions/Abbreviation](#definitionsabbreviation) | ||
|
||
* [1 Requirements Overview](#1-requirements-overview) | ||
* [1.1 Functional requirements](#11-functional-requirements) | ||
* [1.2 CLI requirements](#12-cli-requirements) | ||
* [1.3 Scalability and Default Values](#13-scalability-and-default-values) | ||
* [1.4 Warm Restart requirements ](#14-warm-restart-requirements) | ||
* [2 Architecture Design](#2-architecture-design) | ||
* [2.1 High-Level Architecture](#21-high-level-architecture) | ||
* [2.2 OpenBMP](#22-openbmp) | ||
* [2.3 BGP/FRR update](#23-bgpfrr-update) | ||
* [2.4 Database Design](#24-database-design) | ||
* [2.5 BMP Agent](#25-bmp-agent) | ||
* [3 CLI](#3-cli) | ||
* [4 Performance evaluation and Test plan](#4-performance-evaluation-and-test-plan) | ||
|
||
# Revision | ||
|
||
| Rev | Date | Author | Change Description | | ||
|:---:|:-----------:|:------------------:|-----------------------------------| | ||
| 0.1| 02/20/2024 | Feng Pan | Initial version | | ||
|
||
|
||
# About this Manual | ||
This document proposes bringing up a BMP container on SONiC. The container is forked from the OpenBMP project with some code changes, and it is intended to improve SONiC debuggability and BGP service monitoring efficiency. | ||
|
||
# Definitions/Abbreviation | ||
|
||
| **Term** | **Meaning** | | ||
|--------------------------|-------------------------------------| | ||
| BMP | BGP Monitoring Protocol | | ||
| OpenBMP | Open Source BGP Monitoring Protocol (OpenBMP) Collection Framework | | ||
| BGP/FRR | SONiC BGP container and bgpd daemon, which will collaborate with BMP here | | ||
|
||
|
||
# 1 Requirements Overview | ||
|
||
## 1.1 Functional requirements | ||
|
||
SONiC currently supports standard BGP sessions with BGP neighbors, but only limited BGP information is readable from the database, which makes some cases difficult to debug. We need to dump more BGP data, and we also want a live debug option that can dump BGP-related information on demand to improve debuggability. | ||
|
||
At a high level the following data should be supported: | ||
|
||
- BGP neighbor capabilities, such as MP-BGP support, graceful restart support, etc. | ||
- Full BGP route table support. Currently only ifname and nexthop are available; more attributes such as as_path, asn, and origin should be added. | ||
|
||
Additionally, monitoring the BGP session data of a SONiC device currently requires deploying BGPL (BGPListener) and creating a standard BGP session with the device. The limitation is that the monitored data is not identical to the data that SONiC exchanges with its neighbors. With BMP data available on SONiC, we should be able to monitor more live BGP data. Moreover, listeners should be able to use the StreamingTelemetry "on change" subscription channel, which is state-based monitoring and more efficient than operation-based monitoring. | ||
|
||
|
||
## 1.2 CLI requirements | ||
- Use config CLI commands to enable BMP table population at per-table granularity | ||
- Use show CLI commands to query data from a specific BMP table | ||
|
||
## 1.3 Scalability and Default Values | ||
|
||
As a phase 1 scalability requirement, the proposal is to have the BGP neighbor table enabled by default. | ||
|
||
Approximate memory usage per table, based on test evaluation: | ||
|
||
| Table | Memory | | ||
|--------------------------|--------------------------------| | ||
| Neighbor table | 400B per neighbor | | ||
| Route table | 4M per neighbor | | ||
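As a rough sizing aid, the per-neighbor figures above can be turned into a quick estimate. This is a minimal sketch rather than a measured model: it assumes that "4M" means 4 MiB and that the route-table figure covers a neighbor's route data as a whole. The function name `estimate_bmp_memory` is hypothetical.

```python
# Per-neighbor figures taken from the table above.
NEIGHBOR_TABLE_BYTES = 400
# Assumption: "4M" in the table means 4 MiB per neighbor.
ROUTE_TABLE_BYTES = 4 * 1024 * 1024


def estimate_bmp_memory(neighbors, route_tables_enabled=False):
    """Return an approximate Redis memory footprint in bytes."""
    if neighbors < 0:
        raise ValueError("neighbors must be non-negative")
    total = neighbors * NEIGHBOR_TABLE_BYTES
    if route_tables_enabled:
        total += neighbors * ROUTE_TABLE_BYTES
    return total


if __name__ == "__main__":
    for enabled in (False, True):
        size = estimate_bmp_memory(64, route_tables_enabled=enabled)
        print(f"64 neighbors, route tables enabled={enabled}: {size} bytes")
```

Under these assumptions, 64 neighbors need about 25 KiB with only the neighbor table enabled, while enabling route tables for the same neighbors adds roughly 256 MiB. This is consistent with enabling only the neighbor table by default.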
|
||
## 1.4 Warm Restart requirements | ||
No special handling is required for warm restart support. | ||
|
||
# 2 Architecture Design | ||
|
||
The following are the architecture changes. | ||
|
||
## 2.1 High-Level Architecture | ||
As mentioned above, we need to bring up a new bmp container on SONiC. After the change, the relevant component architecture is as follows: | ||
<img src="images/architecture_diagram.png" alt="Architecture Diagram" width="800"> | ||
|
||
- Add a new bmp container, which runs under limited resource control. | ||
|
||
- Update the existing frr/bgpd daemon and enable the bmp feature to collect BGP data. | ||
|
||
- BMP will support both config and CLI, which can enable database population per table type, both by default and at runtime. | ||
|
||
- Users can use a StreamingTelemetry path to monitor BGP tables via [the gNMI protobuf](https://github.com/openconfig/gnmi/blob/5473f2ef722ee45c3f26eee3f4a44a7d827e3575/proto/gnmi/gnmi.proto#L309) | ||
|
||
## 2.2 OpenBMP | ||
[OpenBMP](https://www.openbmp.org/) is an open source BGP monitoring framework that supports collection, aggregation, data persistence, dashboards, etc. For this project, we will fork OpenBMP and make some internal code changes to populate the required data into Redis for our use. | ||
|
||
## 2.3 BGP/FRR update | ||
The current [BGP](https://github.com/sonic-net/sonic-frr) container is forked from [FRR](https://github.com/FRRouting/frr), which has supported the BMP protocol since version frr/7.2. | ||
|
||
### Config file update | ||
The section below needs to be added to /etc/frr/bgpd.conf so that FRR can find the assigned collector endpoint and report BGP data. | ||
|
||
``` | ||
! | ||
bmp mirror buffer-limit 4294967214 | ||
! | ||
bmp targets test | ||
bmp stats interval 1000 | ||
bmp monitor ipv4 unicast pre-policy | ||
bmp monitor ipv6 unicast pre-policy | ||
bmp connect 127.0.0.1 port 5000 min-retry 1000 max-retry 2000 | ||
! | ||
``` | ||
|
||
### Daemon parameter | ||
The bgpd daemon also supports a parameter to enable BMP functionality, so we need to adjust the relevant BGP docker init script as below. | ||
|
||
``` | ||
/usr/lib/frr/bgpd -A 127.0.0.1 -M bmp | ||
``` | ||
|
||
## 2.4 Database Design | ||
|
||
BMP will continually populate the existing Redis database APPL_DB, using separate keys for the new data so that the functional requirements can be supported. | ||
|
||
As shown below, multi-ASIC platforms are also supported by this feature. Since multi-ASIC platforms use multiple database instances indexed by ASIC, the same logic applies in this project. | ||
|
||
|DB name | DB No. | Description| | ||
| ---- |:----:| ----| | ||
|APPL_DB | specific ASIC indexed DB per BGP neighbor | Application running data | | ||
|
||
### Neighbor capability table schema | ||
|
||
This table captures the BGP capabilities supported by each BGP neighbor. Each value is either 1 or 0. | ||
``` | ||
admin@str2-7050cx3-acs-13:~$ redis-cli -n 0 HGETALL NEIGHBOR_BGP_CAP:10.0.0.57 | ||
1) "BGP_CAP_MPBGP" | ||
2) "1" | ||
3) "BGP_CAP_ROUTE_REFRESH" | ||
4) "1" | ||
5) "BGP_CAP_OUTBOUND_FILTER" | ||
6) "1" | ||
7) "BGP_CAP_MULTI_ROUTES_DEST" | ||
8) "1" | ||
9) "BGP_CAP_EXT_NEXTHOP" | ||
10) "1" | ||
11) "BGP_CAP_GRACEFUL_RESTART" | ||
12) "1" | ||
13) "BGP_CAP_4OCTET_ASN" | ||
14) "1" | ||
15) "BGP_CAP_DYN_CAP" | ||
16) "1" | ||
17) "BGP_CAP_MULTI_SESSION" | ||
18) "1" | ||
19) "BGP_CAP_ADD_PATH" | ||
20) "1" | ||
21) "BGP_CAP_ROUTE_REFRESH_ENHANCED" | ||
22) "1" | ||
23) "BGP_CAP_ROUTE_REFRESH_OLD" | ||
24) "1" | ||
``` | ||
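A consumer of this table mostly needs two things: the key for a given neighbor and the set of capabilities whose flag is 1. The sketch below shows one way to do that on the flat field/value reply that `HGETALL` returns. The helper names are hypothetical, and no Redis client is used so that the example stays self-contained.

```python
def capability_key(neighbor):
    """Build the APPL_DB key for a neighbor, following the schema above."""
    return f"NEIGHBOR_BGP_CAP:{neighbor}"


def pairs_to_dict(flat_reply):
    """Turn a flat [field, value, field, value, ...] reply into a dict."""
    if len(flat_reply) % 2 != 0:
        raise ValueError("expected an even number of items in the reply")
    return dict(zip(flat_reply[0::2], flat_reply[1::2]))


def supported_capabilities(cap_hash):
    """Return the sorted names of capabilities whose flag is "1"."""
    return sorted(name for name, flag in cap_hash.items() if flag == "1")


if __name__ == "__main__":
    # Shortened sample in the same shape as the redis-cli output above.
    reply = ["BGP_CAP_MPBGP", "1", "BGP_CAP_GRACEFUL_RESTART", "1"]
    print(capability_key("10.0.0.57"))
    print(supported_capabilities(pairs_to_dict(reply)))
```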
|
||
### Route table schema | ||
The route table captures all BGP session data per neighbor, including both advertised and received data. | ||
|
||
Advertised BGP data for neighbor 10.0.0.57 | ||
``` | ||
admin@str2-7050cx3-acs-13:~$ redis-cli -n 0 HGETALL ADV_ROUTE:20c0:ed40:0:80::/64:BGP_NEIGHBOR:10.0.0.57 | ||
1) "nlri" | ||
2) "20c0:ed40:0:80::/64" | ||
3) "local_asn" | ||
4) "64915" | ||
5) "peer_asn" | ||
6) "64915" | ||
7) "local_addr" | ||
8) "10.0.0.62" | ||
9) "peer_addr" | ||
10) "10.0.0.63" | ||
11) "next_hop" | ||
12) "10.0.0.63" | ||
13) "origin" | ||
14) "igp" | ||
15) "local_pref" | ||
16) "0" | ||
17) "as_path" | ||
18) " 65100 64600 65534 64673 65511" | ||
19) "community_list" | ||
20) "" | ||
``` | ||
|
||
Received BGP data for neighbor fc00::82 | ||
``` | ||
admin@str2-7050cx3-acs-13:~$ redis-cli -n 0 HGETALL RECV_ROUTE:20c0:ed40:0:80::/64:BGP_NEIGHBOR:fc00::82 | ||
1) "nlri" | ||
2) "20c0:ed40:0:80::/64" | ||
3) "local_asn" | ||
4) "64915" | ||
5) "peer_asn" | ||
6) "64915" | ||
7) "local_addr" | ||
8) "fc00:1::32" | ||
9) "peer_addr" | ||
10) "fc00::82" | ||
11) "next_hop" | ||
12) "fc00::41" | ||
13) "origin" | ||
14) "igp" | ||
15) "local_pref" | ||
16) "0" | ||
17) "as_path" | ||
18) " 65100 64600 65534 64673 65511" | ||
19) "community_list" | ||
20) "" | ||
``` | ||
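Because both the prefix and the neighbor address may be IPv6, route keys contain many colons and cannot be split on `:` alone. The sketch below builds and parses keys by splitting on the literal `:BGP_NEIGHBOR:` marker instead. The function names are hypothetical, and the key layout is inferred from the two examples above.

```python
import ipaddress

DIRECTIONS = ("ADV_ROUTE", "RECV_ROUTE")
MARKER = ":BGP_NEIGHBOR:"


def route_key(direction, prefix, neighbor):
    """Build a key such as ADV_ROUTE:<prefix>:BGP_NEIGHBOR:<neighbor>."""
    if direction not in DIRECTIONS:
        raise ValueError(f"unknown direction: {direction}")
    return f"{direction}:{prefix}{MARKER}{neighbor}"


def parse_route_key(key):
    """Split a route key into (direction, prefix, neighbor)."""
    head, marker, neighbor = key.partition(MARKER)
    direction, _, prefix = head.partition(":")
    if not marker or direction not in DIRECTIONS:
        raise ValueError(f"not a route key: {key}")
    # Validate both parts; these calls raise ValueError on bad input.
    ipaddress.ip_network(prefix)
    ipaddress.ip_address(neighbor)
    return direction, prefix, neighbor


if __name__ == "__main__":
    key = route_key("RECV_ROUTE", "20c0:ed40:0:80::/64", "fc00::82")
    print(key)
    print(parse_route_key(key))
```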
|
||
|
||
## 2.5 BMP Agent | ||
As described in [2.2 OpenBMP](#22-openbmp), we need to fork and update the code in [OpenBMPd](https://github.com/SNAS/openbmp/tree/master/Server/). OpenBMP collects BMP protocol data through the openbmpd agent. In this project we therefore only need the openbmpd agent role, and we add Redis population when monitoring BGP data from the BGP container. | ||
|
||
The picture below is taken from [OpenBMPFlow](https://www.openbmp.org/#openbmp-flow/). The part in the <span style="color: red;">red</span> circle is the daemon we need to update in this project. | ||
|
||
<img src="images/openbmp.png" alt="OPENBMP ARCHITECTURE" width="400"> | ||
|
||
|
||
### Detail workflow | ||
In the openbmpd source code, we need to update the message parser to populate Redis with the required tables. This is controlled at runtime by the CLI (see the CLI section below for details). We follow the straightforward option here: whenever FRR sends TYPE_INIT_MSG to BMP, BMP clears the relevant Redis tables to keep the data consistent, since all BGP-related data will be resynced from FRR after the TYPE_INIT_MSG. | ||
|
||
<img src="images/bmp_seq.png" alt="BMP brief sequence diagram" width="800"> | ||
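The clear-on-init behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the openbmpd implementation: a plain dict stands in for APPL_DB, the `ROUTE_UPDATE` message name and both function names are hypothetical, and only `TYPE_INIT_MSG` and the table prefixes come from this document.

```python
# Key prefixes of the tables populated by BMP (see 2.4 Database Design).
BMP_TABLE_PREFIXES = ("NEIGHBOR_BGP_CAP:", "ADV_ROUTE:", "RECV_ROUTE:")


def clear_bmp_tables(store):
    """Delete every BMP-owned key and return how many were removed."""
    stale_keys = [key for key in store if key.startswith(BMP_TABLE_PREFIXES)]
    for key in stale_keys:
        del store[key]
    return len(stale_keys)


def handle_message(store, msg_type, key=None, fields=None):
    """Apply one parsed BMP message to the store."""
    if msg_type == "TYPE_INIT_MSG":
        # FRR resends everything after init, so start from a clean slate.
        return clear_bmp_tables(store)
    if msg_type == "ROUTE_UPDATE":  # hypothetical name for a parsed update
        store[key] = dict(fields or {})
        return 0
    raise ValueError(f"unhandled message type: {msg_type}")


if __name__ == "__main__":
    db = {
        "ADV_ROUTE:10.1.0.0/24:BGP_NEIGHBOR:10.0.0.57": {"origin": "igp"},
        "PORT_TABLE:Ethernet0": {"admin_status": "up"},
    }
    print(handle_message(db, "TYPE_INIT_MSG"), sorted(db))
```

Keys that belong to other APPL_DB tables are left untouched because only the BMP prefixes are matched.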
|
||
### Delay-removal | ||
Data will flap if the BGP/FRR connection to BMP is unstable, or if flapping occurs on the BGP side. To handle this scenario better, the delay-removal algorithm below can be used internally. | ||
|
||
1. When FRR connects to the BMP server with TYPE_INIT_MSG and sends BGP session updates, record the timestamp instead of immediately removing the entire table from Redis. | ||
2. Maintain a timer or counter to keep track of the elapsed time since the last update received from FRR. | ||
3. Set a threshold (a predefined delay period) after which a BGP session update is considered stale or removed. For example, a BGP session update can be considered stale if it has not been updated within a specific time interval (e.g., 3 minutes). | ||
4. Periodically check the timestamps of the BGP session updates stored in the table. If the elapsed time since the last update exceeds the threshold defined in step 3, mark the corresponding entry as removed or delete it from Redis. | ||
5. Once new data sent by FRR arrives and the table is re-added, recompute the dataset and reset the timer for the refreshed data, and let the stale data be removed by the periodic check. | ||
|
||
This ensures data consistency and minimizes unnecessary table flapping. | ||
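The steps above can be sketched with per-entry timestamps. This is one possible reading of the algorithm, not the final design: it assumes that an entry is stale when it has not been refreshed since the last TYPE_INIT_MSG and the threshold has elapsed since that message. The class and method names are hypothetical, and a dict stands in for Redis.

```python
import time


class DelayRemovalTable:
    """In-memory model of delay-removal for one BMP table."""

    def __init__(self, stale_after=180.0, clock=time.monotonic):
        self._stale_after = stale_after  # step 3: threshold in seconds
        self._clock = clock
        self._entries = {}               # key -> (fields, last_refresh)
        self._init_seen_at = None        # time of the pending init, if any

    def on_init(self):
        """Step 1: record the timestamp instead of flushing the table."""
        self._init_seen_at = self._clock()

    def upsert(self, key, fields):
        """Steps 2 and 5: store the data and refresh its timestamp."""
        self._entries[key] = (dict(fields), self._clock())

    def sweep(self):
        """Step 4: drop entries that were not refreshed after the init."""
        if self._init_seen_at is None:
            return []
        if self._clock() - self._init_seen_at < self._stale_after:
            return []
        removed = sorted(key for key, (_, refreshed) in self._entries.items()
                         if refreshed < self._init_seen_at)
        for key in removed:
            del self._entries[key]
        self._init_seen_at = None
        return removed

    def keys(self):
        return sorted(self._entries)


if __name__ == "__main__":
    table = DelayRemovalTable(stale_after=0.0)
    table.upsert("old", {})
    table.on_init()
    table.upsert("fresh", {})
    print(table.sweep(), table.keys())
```

With this approach a short flap only rewrites the entries that actually changed, and subscribers never see the whole table disappear and reappear.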
|
||
### Config DB schema | ||
|
||
Create a new config DB table as below to hold the configuration for bmp. | ||
|
||
``` | ||
127.0.0.1:6379[4]> keys FEATURE|bmp* | ||
1) "FEATURE|bmp" | ||
``` | ||
|
||
Create the config items below for enabling and disabling the individual tables. | ||
|
||
``` | ||
127.0.0.1:6379[4]> HGETALL FEATURE|bmp | ||
1) "bgp_neighbor_capability" | ||
2) "true" | ||
3) "bgp_adv_route" | ||
4) "false" | ||
5) "bgp_recv_route" | ||
6) "false" | ||
``` | ||
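The agent needs to translate these flags into the set of tables it should populate. The sketch below shows one way to do that. The field-to-table mapping is inferred from the field names and the table names in 2.4 Database Design, the function name is hypothetical, and treating a missing field as disabled is an assumption.

```python
# Assumed mapping from FEATURE|bmp fields to APPL_DB table prefixes.
FIELD_TO_TABLE = {
    "bgp_neighbor_capability": "NEIGHBOR_BGP_CAP",
    "bgp_adv_route": "ADV_ROUTE",
    "bgp_recv_route": "RECV_ROUTE",
}


def enabled_tables(feature_bmp):
    """Return the table prefixes whose config flag is "true"."""
    enabled = []
    for field, table in FIELD_TO_TABLE.items():
        # Assumption: a missing field means the table is disabled.
        if feature_bmp.get(field, "false").strip().lower() == "true":
            enabled.append(table)
    return enabled


if __name__ == "__main__":
    config = {
        "bgp_neighbor_capability": "true",
        "bgp_adv_route": "false",
        "bgp_recv_route": "false",
    }
    print(enabled_tables(config))
```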
|
||
### Full Dataset supported | ||
|
||
The [OpenBMP dataset](https://github.com/SNAS/openbmp/blob/master/docs/MESSAGE_BUS_API.md#message-api-parsed-data) documents the full dataset for reference, as Kafka-based TSV messages. However, we will not follow its format when populating the Redis database; the data format we use is declared in [2.4 Database Design](#24-database-design). | ||
|
||
|
||
|
||
# 3 CLI | ||
|
||
bmp will support the config CLIs below to enable/disable specific table population at runtime: | ||
|
||
``` | ||
1. Command: `config bmp enable neighbor-bgp-cap-table` | ||
- Description: Enable the BGP neighbor Capability table. | ||
- Result: reset FRR connection and populate NEIGHBOR_BGP_CAP table | ||
2. Command: `config bmp disable neighbor-bgp-cap-table` | ||
- Description: Disable the BGP neighbor Capability table. | ||
- Result: erase NEIGHBOR_BGP_CAP table and stop table population | ||
3. Command: `config bmp enable route-advertised-table` | ||
- Description: Enable the BGP route advertised table for all neighbors. | ||
- Result: reset FRR connection and populate ADV_ROUTE table | ||
4. Command: `config bmp disable route-advertised-table` | ||
- Description: Disable the BGP route advertised table for all neighbors. | ||
- Syntax: `disable route-advertised-table` | ||
- Result: erase ADV_ROUTE table and stop table population | ||
5. Command: `config bmp enable route-received-table` | ||
- Description: Enable the BGP route received table for all neighbors. | ||
- Result: reset FRR connection and populate RECV_ROUTE table | ||
6. Command: `config bmp disable route-received-table` | ||
- Description: Disable the BGP route received table for all neighbors. | ||
- Result: erase RECV_ROUTE table and stop table population | ||
``` | ||
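The six commands follow one pattern: a table name selects a config field and a Redis table, and the action selects the side effect. The sketch below captures that pattern as a lookup. It is only an illustration: the pairing of CLI table names with config DB fields is an assumption based on their names, and `plan_config_command` is a hypothetical helper, not part of the SONiC CLI.

```python
# CLI table name -> (assumed FEATURE|bmp field, APPL_DB table).
CLI_TABLES = {
    "neighbor-bgp-cap-table": ("bgp_neighbor_capability", "NEIGHBOR_BGP_CAP"),
    "route-advertised-table": ("bgp_adv_route", "ADV_ROUTE"),
    "route-received-table": ("bgp_recv_route", "RECV_ROUTE"),
}


def plan_config_command(action, table_name):
    """Describe what `config bmp <action> <table_name>` should do."""
    if action not in ("enable", "disable"):
        raise ValueError(f"unknown action: {action}")
    if table_name not in CLI_TABLES:
        raise ValueError(f"unknown table: {table_name}")
    field, redis_table = CLI_TABLES[table_name]
    if action == "enable":
        effect = f"reset FRR connection and populate {redis_table} table"
    else:
        effect = f"erase {redis_table} table and stop table population"
    return {
        "config_field": field,
        "value": "true" if action == "enable" else "false",
        "effect": effect,
    }


if __name__ == "__main__":
    print(plan_config_command("enable", "route-advertised-table"))
```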
|
||
bmp will support the show CLIs below to query a specific table at runtime: | ||
|
||
``` | ||
1. Command: `show bmp neighbor-bgp-cap-table` | ||
- Description: Show BGP neighbor Capability table for all neighbors. | ||
- Result: Query NEIGHBOR_BGP_CAP table and show result in human readable format. | ||
2. Command: `show bmp route-advertised-table` | ||
- Description: Show BGP route advertised table for all neighbors. | ||
- Result: Query ADV_ROUTE table and show all session data. | ||
3. Command: `show bmp route-received-table` | ||
- Description: Show BGP route received table for all neighbors. | ||
- Result: Query RECV_ROUTE table and show all session data | ||
4. Command: `show bmp table status` | ||
- Description: Show whether each table is enabled or disabled, so that users can operate correctly. | ||
- Result: Query and show config db status for all table enablement | ||
``` | ||
|
||
|
||
# 4 Performance evaluation and Test plan | ||
|
||
### CPU usage | ||
Since BMP is a sidecar daemon for improving SONiC debuggability and monitoring efficiency, we don't expect it to use much CPU, and we prefer that it run at low priority without interruption. FRR currently pushes data to openbmp (push model). If the CPU of bmp is limited to 5%, it operates slowly but does not miss updates: when we flap BGP several times, all the flap updates are eventually pushed to bmp. Thus we should be able to restrict the CPU of BMP to a lower priority without missing updates. | ||
|
||
### Test plan | ||
|
||
#### Auto test | ||
Mock up an FRR agent that sends BGP data to BMP, and test the hooking mechanism by verifying the cases below: | ||
1. Verify BMP successfully captures the BGP data from the data sources. | ||
2. Verify that the captured BGP data is correctly transformed and formatted before being populated into Redis. | ||
3. Validate data to ensure that the transformation and population processes function as expected. | ||
4. Simulate various BGP updates, such as route advertisements, route withdrawals, and neighbor state changes, and verify that BMP handles the BGP data as expected. | ||
5. Verify the CLIs that enable/disable table population, and that the data in Redis is consistent. | ||
6. Simulate error conditions, such as network failures, Redis connection issues, or malformed BGP data, and verify that the system handles these errors gracefully. Also test the resilience and error recovery mechanisms of the BMP and Redis integration. | ||
|
||
#### Performance test | ||
Use the case below to test performance under BGP flapping, which is an extreme scenario for performance evaluation. | ||
|
||
``` | ||
./run_tests.sh -c route/test_default_route.py::test_default_route_with_bgp_flap | ||
``` | ||
|
||
1. The Redis tables contain the full dataset after BGP flapping. | ||
2. Low-frequency routing changes, such as BGP neighbor up/down and route advertise/withdraw, are reflected in Redis in real time. | ||
3. High-frequency routing changes such as flapping are not propagated to Redis within the delay-removal interval. | ||
4. The BGP flapping case works with gNMI ON_CHANGE mode subscription. | ||
``` | ||
gnmi_cli -client_types=gnmi -a 127.0.0.1:50051 -t APPL_DB -logtostderr -insecure -v 2 -qt s -q "ADV_ROUTE" -streaming_type ON_CHANGE | ||
``` |