- [Similar works](#similar-works)
- [Definitions \& Abbreviations](#definitions--abbreviations)
- [Overview](#overview)
- [Orchagent consumer execute workflow is single-threaded](#orchagent-consumer-execute-workflow-is-single-threaded)
- [Syncd is too strictly locked](#syncd-is-too-strictly-locked)
- [Redundant APPL\_DB I/O traffic](#redundant-appl_db-io-traffic)
- [producerstatetable publishes every command and fpmsyncd flushes on every select event](#producerstatetable-publishes-every-command-and-fpmsyncd-flushes-on-every-select-event)
- [APPL\_DB does redundant housekeeping](#appl_db-does-redundant-housekeeping)
- [Slow Routes decode and kernel thread overhead in zebra](#slow-routes-decode-and-kernel-thread-overhead-in-zebra)
- [Synchronous sairedis API usage](#synchronous-sairedis-api-usage)
- [Syncd \[similar optimization to orchagent\]](#syncd-similar-optimization-to-orchagent)
- [Asynchronous sairedis API usage and new ResponseThread in orchagent](#asynchronous-sairedis-api-usage-and-new--responsethread-in-orchagent)
- [WarmRestart scenario](#warmrestart-scenario)
- [Testing and measurements](#testing-and-measurements)
- [Requirements](#requirements-1)
- [PerformanceTimer](#performancetimer)
- [Performance measurements with 1M routes](#performance-measurements-with-1m-routes)

## Goal & Scope

This project aims to accelerate the end-to-end BGP route loading/withdrawing workflow.

We analyzed the performance bottlenecks of the related submodules and optimized each accordingly.

This is an excellent achievement, and we give kudos to the JNPR team for raising this racing game in the SONiC community.

## Overview

The SONiC BGP loading/withdrawing workflow is shown in the figure below:
<figure align="center">
<img src="images/sonic-workflow.png" width="60%" height=auto>
<figcaption>SONiC BGP loading workflow</figcaption>
<img src="images/sonic-workflow.png" width="45%" height=auto>
<figcaption>SONiC BGP loading/withdrawing workflow</figcaption>
</figure>

1. `bgpd` parses the packets received on the socket and notifies `zebra`
2. `zebra` delivers the routes to `fpmsyncd`
3. `fpmsyncd` uses the redis pipeline to flush routes to `APPL_DB`
4. `orchagent` consumes `APPL_DB`
5. `orchagent` calls `sairedis` APIs to write into `ASIC_DB`
6. `syncd` consumes `ASIC_DB`
7. `syncd` invokes `SAI` SDK APIs to inject the routing data into the hardware ASIC.

**NOTE**: [Linux kernel](https://github.com/SONiC-net/SONiC/wiki/Architecture#routing-state-interactions) part is ignored here.

### Orchagent consumer execute workflow is single-threaded

Take the consumer for `ROUTE_TABLE` as an example. In orchagent's event-triggered main loop, the consumer is selected to run its `execute()` method, which consists of three steps.

1. `pops()`
    - pop keys from the redis set `ROUTE_TABLE_KEY_SET`, which stores all modified keys
    - traverse these modified keys and move their corresponding values from the temporary table `_ROUTE_TABLE` to `ROUTE_TABLE`
    - delete the temporary table `_ROUTE_TABLE`
    - save the modified entries read from redis into the local variable `std::deque<KeyOpFieldsValuesTuple> entries`
2. `addToSync()`
    - transfer the data from the local variable `entries` into the consumer's internal data structure `m_toSync`
3. `drain()`
    - consume `m_toSync` and invoke the sairedis API to write the modified data to `ASIC_DB`

We observe that these three tasks do not share the same redis context, so they have potential for parallelism. While the order of the three steps within a single `execute()` job must be preserved, different `execute()` calls can overlap: when the first `execute()` call enters step 2, the second `execute()` can begin its step 1 instead of waiting for step 3 of the first call to finish. To enable this overlap among `execute()` calls, we add a thread to orchagent.
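To make this concrete, here is a minimal sketch of the idea using only the standard library and hypothetical names (the real orchagent code uses swss classes and its own ring buffer implementation): one thread performs steps 1–2 and pushes each batch into a bounded ring buffer, while a second thread pops batches and performs step 3.

```c++
// Hypothetical sketch: overlap pops()/addToSync() of the next batch with
// drain() of the previous batch via a bounded ring buffer between two threads.
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <iostream>
#include <mutex>
#include <string>
#include <thread>

using Batch = std::deque<std::string>;  // stand-in for std::deque<KeyOpFieldsValuesTuple>

class RingBuffer {
public:
    explicit RingBuffer(size_t cap) : cap_(cap) {}
    void push(Batch b) {
        std::unique_lock<std::mutex> lk(m_);
        notFull_.wait(lk, [&] { return q_.size() < cap_; });
        q_.push_back(std::move(b));
        notEmpty_.notify_one();
    }
    Batch pop() {
        std::unique_lock<std::mutex> lk(m_);
        notEmpty_.wait(lk, [&] { return !q_.empty(); });
        Batch b = std::move(q_.front());
        q_.pop_front();
        notFull_.notify_one();
        return b;
    }
private:
    size_t cap_;
    std::deque<Batch> q_;
    std::mutex m_;
    std::condition_variable notEmpty_, notFull_;
};

int main() {
    RingBuffer ring(16);
    // Thread A: steps 1-2 -- pops() from redis and addToSync() into a batch.
    std::thread producer([&] {
        for (int i = 0; i < 3; ++i) {
            Batch entries = {"route-" + std::to_string(i)};  // pretend pops() result
            ring.push(std::move(entries));                    // hand over to drain()
        }
        ring.push({});                                        // empty batch = stop signal
    });
    // Thread B: step 3 -- drain() the batch and call sairedis towards ASIC_DB.
    std::thread consumer([&] {
        for (;;) {
            Batch b = ring.pop();
            if (b.empty()) break;
            for (auto& e : b) std::cout << "drain " << e << "\n";
        }
    });
    producer.join();
    consumer.join();
}
```

The key point of the sketch is that `push()` blocks only when the ring is full, so the pops/addToSync side naturally backs off whenever drain falls behind.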

<figure align=center>
<img src="images/orchagent-workflow.png" width="60%" height=auto>
<img src="images/orchagent-workflow.png" width="40%" height=auto>
<figcaption>Orchagent workflow</figcaption>
</figure>

### Syncd is too strictly locked

`syncd` shares a similar issue with `orchagent`. It also has a single-threaded workflow that pops data from the upstream redis tables and then invokes ASIC SDK APIs to inject the data into the downstream hardware. We likewise want to exploit its potential for parallelism and separate the communication with the upstream redis from the communication with the downstream hardware into two threads. However, this workflow needs careful locking, because both directions use the same redis context. In the original codebase, the whole `processEvent()` is locked; since SDK API calls tend to be time-consuming, we should release the lock to utilize the idle time while syncd waits for the hardware's responses.
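The following is a simplified sketch of the intended locking change, with placeholder functions standing in for syncd internals; it is not the actual syncd code, only an illustration of keeping the critical section around the redis context and releasing the lock during the SDK call.

```c++
// Sketch of narrowing the critical section (placeholder names, not syncd's code):
// hold the lock only while touching the shared redis context, and release it
// for the time-consuming SDK call so other work can proceed in parallel.
#include <chrono>
#include <mutex>
#include <thread>

std::mutex g_redisMutex;                      // guards the shared redis context

struct Event { int id = 0; };                 // stand-in for a popped ASIC_DB operation

Event popFromAsicDb() { return Event{1}; }    // touches the redis context -> locked
void writeResponseToRedis(int /*status*/) {}  // touches the redis context -> locked
int applyToAsic(const Event&) {               // SAI/SDK call: slow, no redis access
    std::this_thread::sleep_for(std::chrono::milliseconds(5));
    return 0;
}

void processEvent() {
    Event ev;
    {
        std::lock_guard<std::mutex> lk(g_redisMutex);
        ev = popFromAsicDb();            // short critical section
    }
    int status = applyToAsic(ev);        // runs unlocked: the redis-facing thread
                                         // can keep popping events meanwhile
    {
        std::lock_guard<std::mutex> lk(g_redisMutex);
        writeResponseToRedis(status);    // short critical section again
    }
}

int main() { processEvent(); }
```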

<br>


There is a large amount of Redis I/O traffic during the BGP loading process, and we identified two sources of unnecessary traffic.

#### producerstatetable publishes every command and fpmsyncd flushes on every select event

In the original design, SONiC producers use Lua scripts to implement their APIs such as set and delete. We observe that each of these Lua scripts ends with a redis `PUBLISH`. However, since the commands are already batched in a pipeline, a single `PUBLISH` carried by the last command of the pipeline is enough to notify the subscribed consumers of everything in that pipeline. Hence, we decouple the redis `PUBLISH` from the producers' Lua scripts and issue a single `PUBLISH` per pipeline flush.
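The sketch below illustrates the intent with a made-up pipeline class and an illustrative channel name (the real logic lives in `ProducerStateTable`/`RedisPipeline` in sonic-swss-common): individual commands carry no `PUBLISH`, and one `PUBLISH` is appended per flush.

```c++
// Illustrative sketch only (hypothetical class and channel name): queued
// commands carry no PUBLISH of their own; a single PUBLISH is appended
// when the pipeline is flushed.
#include <iostream>
#include <string>
#include <vector>

class PipelineSketch {
public:
    void set(const std::string& key, const std::string& value) {
        cmds_.push_back("HSET " + key + " " + value);   // no per-command PUBLISH
    }
    void del(const std::string& key) {
        cmds_.push_back("DEL " + key);                  // no per-command PUBLISH
    }
    void flush() {
        if (cmds_.empty()) return;
        for (const auto& c : cmds_) send(c);
        send("PUBLISH ROUTE_TABLE_CHANNEL G");          // one notification per flush
        cmds_.clear();
    }
private:
    void send(const std::string& cmd) { std::cout << cmd << "\n"; }  // stands in for redis I/O
    std::vector<std::string> cmds_;
};

int main() {
    PipelineSketch p;
    p.set("ROUTE_TABLE:10.0.0.0/24", "nexthop ...");
    p.set("ROUTE_TABLE:10.0.1.0/24", "nexthop ...");
    p.flush();   // two HSETs, one PUBLISH
}
```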

In the original design, apart from the redis pipeline flushing itself when it is full, `fpmsyncd` also invokes the pipeline's `flush()` method on every select event. Since the downstream modules cannot consume the flushed data that fast anyway, we can flush less often and batch more data into each flush. Every flush transfers data between modules over the network, which involves syscalls, context switches, and a round-trip time between the two modules; reducing the flush frequency therefore amortizes this per-flush overhead over more routes.
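A minimal sketch of this batching, with assumed names and an assumed 100 ms interval (the actual fpmsyncd change is built on the swss-common select/timer machinery):

```c++
// Sketch with assumed names and interval (not the actual fpmsyncd code):
// flush the pipeline when it is full or when the flush interval expires,
// instead of flushing on every select event.
#include <chrono>

struct PipelineStub {
    int pending = 0;
    int capacity = 128;
    void flush() { pending = 0; /* write the batched commands to APPL_DB */ }
};

int main() {
    using Clock = std::chrono::steady_clock;
    PipelineStub pipeline;
    auto lastFlush = Clock::now();
    const auto interval = std::chrono::milliseconds(100);   // assumed knob

    for (int event = 0; event < 1000; ++event) {   // stand-in for the select loop
        ++pipeline.pending;                        // one route handed to the pipeline
        bool full = pipeline.pending >= pipeline.capacity;
        bool due  = Clock::now() - lastFlush >= interval;
        if (full || due) {                         // batch many routes per flush
            pipeline.flush();
            lastFlush = Clock::now();
        }
    }
    pipeline.flush();                              // final flush: nothing left behind
}
```

The trade-off is a bounded delay (at most one interval) for each route in exchange for far fewer flushes, which matters once the downstream consumers are no longer the bottleneck.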

#### APPL_DB does redundant housekeeping

When `orchagent` consumes `APPL_DB` with `pops()`, `pops()` has to transfer data from the temporary table `_ROUTE_TABLE` to the stable table `ROUTE_TABLE`. We propose to let the upstream producers write directly into `ROUTE_TABLE`, which saves `pops()` from performing these redis write and delete operations.
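A small illustration of the key layout involved, with simplified names (actual swss key handling differs in detail):

```c++
// Simplified illustration of the key change: the producer writes the stable
// key directly, so pops() no longer has to copy to ROUTE_TABLE and delete
// the temporary _ROUTE_TABLE entry.
#include <iostream>
#include <string>

std::string tempKey(const std::string& prefix)   { return "_ROUTE_TABLE:" + prefix; }
std::string stableKey(const std::string& prefix) { return "ROUTE_TABLE:" + prefix; }

int main() {
    std::string prefix = "10.0.0.0/24";
    // Before: producer HSETs tempKey(prefix); pops() copies it to stableKey(prefix)
    //         and then DELs tempKey(prefix).
    // After:  producer HSETs stableKey(prefix); pops() only reads the modified keys.
    std::cout << "before: " << tempKey(prefix) << "\n"
              << "after:  " << stableKey(prefix) << "\n";
}
```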

### Slow Routes decode and kernel thread overhead in zebra

The main thread of `zebra` not only needs to send routes to `fpmsyncd`, but also has to decode incoming routes and interact with the kernel, which adds overhead in the main thread.

<figure align=center>
<img src="images/zebra.jpg" width="60%" height=auto>
<figcaption>Zebra flame graph</figcaption>
</figure>

### Synchronous sairedis API usage
The interaction between `orchagent` and `syncd` uses synchronous `sairedis` APIs.
Once `orchagent`'s `doTask` writes data to `ASIC_DB`, it waits for the response from `syncd`. Since `orchagent` has only a single thread, it cannot process other routing messages until the response is received and processed.
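For contrast, here is a hedged sketch of the asynchronous pattern this proposal moves towards; the structure and names are hypothetical and do not reflect the actual orchagent/sairedis API.

```c++
// Hypothetical sketch of the async pattern: the main thread never blocks on
// the ASIC reply; a dedicated response thread consumes replies and would
// update error/CRM accounting.
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

std::queue<int> g_pendingResponses;  // stand-in for the reply channel from syncd
std::mutex g_m;
std::condition_variable g_cv;
bool g_done = false;

void sendRouteAsync(int id) {        // issue the operation without waiting
    std::lock_guard<std::mutex> lk(g_m);
    g_pendingResponses.push(id);     // pretend the reply arrives immediately
    g_cv.notify_one();
}

void responseThread() {              // role of the proposed ResponseThread
    std::unique_lock<std::mutex> lk(g_m);
    while (!g_done || !g_pendingResponses.empty()) {
        g_cv.wait_for(lk, std::chrono::milliseconds(10));
        while (!g_pendingResponses.empty()) {
            int id = g_pendingResponses.front();
            g_pendingResponses.pop();
            (void)id;                // check status, update counters here
        }
    }
}

int main() {
    std::thread rt(responseThread);
    for (int i = 0; i < 1000; ++i) sendRouteAsync(i);  // main loop keeps going
    {
        std::lock_guard<std::mutex> lk(g_m);
        g_done = true;
    }
    g_cv.notify_one();
    rt.join();
}
```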

<figure align=center>
<img src="images/sync-sairedis1.png" width="40%" height=20%>
<figcaption>Figure 5. Sync sairedis workflow<figcaption>
<img src="images/sync-sairedis1.png" width="20%" height=20%>
<figcaption>Sync sairedis workflow<figcaption>
</figure>

## Requirements
Expand Down Expand Up @@ -449,8 +446,8 @@ New pthread in orchagent
- CRM resources are calculated by subtracting the ERR count from the Used count in CRM

<figure align=center>
<img src="images/async-sairedis3.png" width="auto" height=auto>
<figcaption>Figure 10. Async sairedis workflow<figcaption>
<img src="images/async-sairedis3.png" width="50%" height=auto>
<figcaption>Async sairedis workflow<figcaption>
</figure>


Expand All @@ -460,19 +457,33 @@ This proposal considers the compatibility with SONiC `WarmRestart` feature. For

Take orchagent for example: we need to make sure the ring buffer is empty and the ring buffer thread is idle before we call `dumpPendingTasks()`.
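A sketch of this pre-freeze check, assuming a hypothetical ring buffer interface with `empty`/`idle` flags (the real implementation may expose this differently):

```c++
// Sketch of the pre-freeze check (hypothetical ring buffer interface): wait
// until the ring buffer is drained and its worker thread is idle before
// dumping pending tasks for WarmRestart.
#include <atomic>
#include <chrono>
#include <thread>

struct RingBufferState {
    std::atomic<bool> empty{true};   // no batches queued
    std::atomic<bool> idle{true};    // worker thread not processing a batch
};

void dumpPendingTasks() { /* serialize m_toSync etc. for warm restart */ }

void freezeForWarmRestart(const RingBufferState& rb) {
    while (!(rb.empty.load() && rb.idle.load())) {
        std::this_thread::sleep_for(std::chrono::milliseconds(1));   // let it drain
    }
    dumpPendingTasks();   // safe: no in-flight data can be lost now
}

int main() {
    RingBufferState rb;
    freezeForWarmRestart(rb);
}
```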

## Testing and measurements

### Requirements

- All modules should maintain the time sequence of route loading.
- All modules should support WarmRestart.
- No routes should remain in the redis pipeline longer than the configured interval.
- No data should remain in the ring buffer when the system finishes route loading.
- The system should be able to install/remove/set routes faster than before.

### PerformanceTimer
We implemented a C++ class `PerformanceTimer` in the swsscommon library (`sonic-swss-common/common`). This timer measures the performance of a specific function or module: it outputs the interval (in milliseconds) between calls, the execution time of a single call, and how many entries that call handled, in the following format:

`[interval]<num_of_handled_entries>execution_time`.
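As an illustration of how a timer of this kind could emit that format, here is a hypothetical sketch; the actual swsscommon `PerformanceTimer` may differ in interface and details.

```c++
// Hypothetical sketch of a timer emitting "[interval]<entries>execution_time";
// the real swsscommon PerformanceTimer may differ in interface and details.
#include <chrono>
#include <iostream>

class TimerSketch {
    using Clock = std::chrono::steady_clock;
public:
    void start() {
        auto now = Clock::now();
        intervalMs_ = started_ ? ms(now - lastEnd_) : 0;   // idle time since last call
        begin_ = now;
        started_ = true;
    }
    void stop(long long entries) {
        lastEnd_ = Clock::now();
        std::cout << "[" << intervalMs_ << "ms]<" << entries << ">"
                  << ms(lastEnd_ - begin_) << "ms\n";      // e.g. "[13ms]<4315>102ms"
    }
private:
    static long long ms(Clock::duration d) {
        return std::chrono::duration_cast<std::chrono::milliseconds>(d).count();
    }
    bool started_ = false;
    long long intervalMs_ = 0;
    Clock::time_point begin_, lastEnd_;
};

int main() {
    TimerSketch t;
    t.start(); /* handle a batch of entries */ t.stop(4315);
    t.start(); /* handle the next batch */     t.stop(2635);
}
```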

Here is an example extracted from syslog:
```c++
NOTICE syncd#syncd: inc:88: 10000 (calls 5 : [13ms]<4315>102ms [10ms]<2635>64ms [7ms]<1577>52ms [3ms]<933>20ms [1ms]<540>22ms) Syncd::processBulkCreateEntry(route_entry) CREATE op took: 262 ms
```
Here the timer measures the `processBulkCreateEntry(route_entry)` method, which took 5 calls to create 10000 entries.

The 1st call created 4315 entries in 102 ms and started 13 ms after the previous call ended\
The 2nd call created 2635 entries in 64 ms and started 10 ms after the previous call ended\
The 3rd call created 1577 entries in 52 ms and started 7 ms after the previous call ended\
The 4th call created 933 entries in 20 ms and started 3 ms after the previous call ended\
The 5th call created 540 entries in 22 ms and started 1 ms after the previous call ended

Our optimization aims to reduce the interval (idle time) and improve the overall throughput. With this timer, we measure:

- traffic speed via `zebra` from `bgpd` to `fpmsyncd`
- traffic speed via `fpmsyncd` from `zebra` to `APPL_DB`
- traffic speed via `orchagent` from `APPL_DB` to `ASIC_DB`
- traffic speed via `syncd` from `ASIC_DB` to the hardware
### Performance measurements with 1M routes
