Term | Meaning |
---|---|
Statically headroom look-up | The current solution for headroom calculation. The headroom is retrieved by looking up a pre-defined table with the port's cable length and speed as the key. Only a limited set of cable lengths is supported. Referred to as "statically look-up" for short. |
Dynamically headroom calculation | The solution for headroom calculation introduced in this design. The headroom is calculated by a well-known formula with the cable length and speed as input. Arbitrary cable lengths are supported. Referred to as "dynamically calculation" for short. |
RoCE is an important feature in the datacenter network. As is well known, headroom size is the key to lossless traffic, which in turn is the key to RoCE. Currently, the headroom size is obtained by looking up the port's cable length and speed in the pre-defined pg_profile_lookup.ini, which has some drawbacks.
Currently the headroom buffer calculation is done by looking up the pg_profile_lookup.ini table with the port's cable length and speed as the key.
- When the system starts, it reads pg_profile_lookup.ini and generates an internal lookup table indexed by speed and cable length, containing size, xon, xoff and threshold.
- When a port's cable length is updated, it records the cable length of the port but doesn't update the relevant tables accordingly.
- When a port's speed is updated:
  - It looks up the (speed, cable length) tuple in the BUFFER_PROFILE table, or generates a new profile according to the internal lookup table.
  - It then updates the port's BUFFER_PG table for the lossless priority group.
There are some limitations:
- The pg_profile_lookup.ini is predefined for each SKU. When a new system supports SONiC, the file has to be provided accordingly.
- Only a fixed set of cable lengths is supported.
- Static headroom isn't supported.
In general, we would like to:
- have the headroom calculated in the code so that the users won't need to be familiar with that.
- support headroom override, which means we will have fixed headroom size on some ports regardless of the ports' speed and cable length.
- have more shared buffer and less headroom.
The headroom size calculation discussed in this design will be implemented in the BufferManager
which is a daemon running in the swss docker.
We will have the following groups of parameters
- List of SONiC configuration, such as speed and cable length.
- List of ASIC related configuration, such as cell size, MAC/PHY delay, peer response time, IPG.
- List of peripheral related configuration, such as gearbox delay.
- List of RoCE related configuration, such as MTU, small packet size percentage.
Based on these parameters and a well-known formula, the code in buffer manager will calculate the headroom instead of taking it from pre-defined values as it does today. On top of that, we need to support overriding the headroom so that it is not calculated in the code.
Meanwhile, the backward compatibility for the vendors who haven't provided the tables required for dynamically headroom calculation is also provided.
- When a port's cable length or speed is updated, the headroom of all lossless priority groups will be updated according to the well-known formula and then programmed to the ASIC.
- When a port is shut down/started up or its headroom size is updated, the size of shared buffer pool will be adjusted accordingly. The less the headroom, the more the shared buffer and vice versa. By doing so, we are able to have as much shared buffer as possible.
- When SONiC switch is upgraded from statically look-up to dynamically calculation, a port's headroom size of all the lossless priority groups will be calculated. The shared buffer pool will be adjusted according to the headroom size as well.
- The pre-defined pg_profile_lookup.ini isn't required any more. When a new platform supports SONiC, only a few parameters are required.
- Support arbitrary cable length.
- Support headroom override, which means user can configure static headroom on certain ports.
- The priority groups on which lossless traffic runs are configurable. By default they're 3 and 4.
- Ports' speed and cable length need to be statically configured.
- All the statically configured data will be stored in CONFIG_DB and all dynamic data in APPL_DB. CLI or another management plane entity is responsible for updating CONFIG_DB while Buffer Manager is responsible for updating APPL_DB.
Backward compatibility is supported for vendors who haven't provided the related tables yet. In this section we will introduce the way it is achieved.
Currently, the SONiC system starts buffer manager from the swss docker by the supervisor according to the following settings in /etc/supervisor/conf.d/supervisord.conf in the swss docker.
[program:buffermgrd]
command=/usr/bin/buffermgrd -l /usr/share/sonic/hwsku/pg_profile_lookup.ini
priority=11
autostart=false
autorestart=false
stdout_logfile=syslog
stderr_logfile=syslog
For the vendors who implement dynamically buffer calculating, a new command line option -c is provided. As a result, the supervisor setting will be:
[program:buffermgrd]
command=/usr/bin/buffermgrd -c
priority=11
autostart=false
autorestart=false
stdout_logfile=syslog
stderr_logfile=syslog
A new class is introduced to implement the dynamically buffer calculation while the class for the statically look-up solution remains. When buffer manager starts, it checks the command line options and loads the corresponding class.
The database schema for the dynamically buffer calculation is added on top of that of the current solution without any field renamed or removed, which means it won't break the current solution.
In the rest of this document, we will focus on the dynamically headroom calculation and the SONiC-to-SONiC upgrade process from the current solution to the new one.
The following tables are related to dynamically headroom calculation:
- The following tables are newly introduced and stored in CONFIG_DB:
  - Tables which store the parameters determined by the switch hardware, including:
    - ASIC_TABLE, where the ASIC related parameters are stored.
    - PERIPHERAL_TABLE, where the peripheral parameters are stored, like gearbox.
  - Tables which store the user configurations, including:
    - ROCE_TABLE, where the RoCE parameters are stored.
- The static buffer configuration is stored in tables BUFFER_POOL, BUFFER_PROFILE and BUFFER_PG in CONFIG_DB.
- The dynamically generated headroom information is stored in tables in APPL_DB, including BUFFER_POOL, BUFFER_PROFILE and BUFFER_PG. They are the equivalent of the tables with the same names in the configuration database in the current solution.
Buffer Manager will consume the tables in CONFIG_DB and generate corresponding tables in APPL_DB. Buffer Orchagent will consume the tables in APPL_DB and propagate the data to ASIC_DB.
This table is introduced to store the switch ASIC related parameters required for calculating the headroom buffer size.
This table is not supposed to be updated on-the-fly.
The key can be the chip/vendor name in capital letters.
; The following fields are introduced to calculate the headroom sizes
key = ASIC_TABLE|<vendor name> ; Vendor name should be in capital letters.
; For Mellanox, "MELLANOX"
cell_size = 1*4DIGIT ; Mandatory. The cell size of the switch chip.
ipg = 1*2DIGIT ; Optional. Inter-packet gap.
pipeline_latency = 1*6DIGIT ; Mandatory. Pipeline latency, in unit of kBytes.
mac_phy_delay = 1*6DIGIT ; Mandatory. MAC/PHY delay, in unit of Bytes.
peer_response_time = 1*6DIGIT ; Mandatory. The maximum of peer switch response time
; in unit of kBytes.
; The following fields are introduced to calculate the buffer pools
max_headroom_size = 1*6DIGIT ; Optional. The maximum value of headroom size a physical port can have.
; For split ports, the accumulated headroom of the lossless PGs of all
; split ports belonging to the same physical port should be less than this field.
; Not providing this field means no such limitation for the ASIC.
reserved_lossy_pg = 1*6DIGIT ; Optional. The reserved headroom size for each lossy priority group.
; Not providing this field means the size is 0.
default_dynamic_th = 1*2DIGIT ; Default dynamic_th for dynamically generated buffer profiles
Every vendor should provide the ASIC_TABLE for all switch chips it supports in SONiC. It should be stored in files/build_templates in the sonic-buildimage repo and /usr/shared/sonic/template/asic_config.json.j2 on the switch.
There should be a map from SKU to switch chip in the template. When the template is rendered, the SKU is mapped to a switch chip, which is then used to choose the group of parameters in the ASIC_TABLE that will be adopted on the switch. As a result, the ASIC_TABLE with a single group of parameters will be loaded into the config database.
The rendering takes place on the switch when the command config qos reload is executed, when the switch starts for the first time, or when config load_minigraph is executed.
After that, the table will be loaded from the config database each time the system starts.
Example
The below is an example for Mellanox switches based on Spectrum-1 switch chip.
Example of pre-defined json.j2 file:
{% if sonic_asic_platform == 'mellanox' %}
{% set platform2asic = {
'x86_64-mlnx_lssn2700-r0':'MELLANOX-SPECTRUM',
'x86_64-mlnx_msn2010-r0':'MELLANOX-SPECTRUM',
'x86_64-mlnx_msn2100-r0':'MELLANOX-SPECTRUM',
'x86_64-mlnx_msn2410-r0':'MELLANOX-SPECTRUM',
'x86_64-mlnx_msn2700-r0':'MELLANOX-SPECTRUM',
'x86_64-mlnx_msn2700_simx-r0':'MELLANOX-SPECTRUM',
'x86_64-mlnx_msn2740-r0':'MELLANOX-SPECTRUM',
'x86_64-mlnx_msn3700c-r0':'MELLANOX-SPECTRUM-2',
'x86_64-mlnx_msn3700-r0':'MELLANOX-SPECTRUM-2',
'x86_64-mlnx_msn3700_simx-r0':'MELLANOX-SPECTRUM-2',
'x86_64-mlnx_msn3800-r0':'MELLANOX-SPECTRUM-2',
'x86_64-mlnx_msn4700_simx-r0':'MELLANOX-SPECTRUM-3',
'x86_64-mlnx_msn4700-r0':'MELLANOX-SPECTRUM-3'
}
%}
{% set asic_type = platform2asic[platform] %}
"ASIC_TABLE": {
{% if asic_type == 'MELLANOX-SPECTRUM' %}
"MELLANOX-SPECTRUM": {
"cell_size": "96",
"pipeline_latency": "18",
"mac_phy_delay": "0.8",
"peer_response_time": "3.8"
}
{% endif %}
{% if asic_type == 'MELLANOX-SPECTRUM-2' %}
"MELLANOX-SPECTRUM-2": {
"cell_size": "144",
"pipeline_latency": "18",
"mac_phy_delay": "0.8",
"peer_response_time": "3.8"
}
{% endif %}
{% if asic_type == 'MELLANOX-SPECTRUM-3' %}
"MELLANOX-SPECTRUM-3": {
"cell_size": "144",
"pipeline_latency": "18",
"mac_phy_delay": "0.8",
"peer_response_time": "3.8"
}
{% endif %}
}
{% endif %}
Example of a rendered json snippet (which will be loaded into config database) on a Mellanox Spectrum switch:
"ASIC_TABLE": {
"MELLANOX-SPECTRUM": {
"cell_size": "96",
"pipeline_latency": "18",
"mac_phy_delay": "0.8",
"peer_response_time": "3.8"
}
}
This table contains the peripheral parameters, like gearbox. The key can be gearbox model name.
This table is not supposed to be updated on-the-fly.
key = PERIPHERAL_TABLE|<gearbox model name> ; Model name should be in capital letters.
gearbox_delay = 1*4DIGIT ; Optional. Latency introduced by gearbox, in unit of kBytes.
Every vendor should provide the PERIPHERAL_TABLE for all peripheral devices it supports in SONiC, like gearbox models. It should be stored in files/build_templates/peripheral_config.json.j2 in the sonic-buildimage repo and /usr/shared/sonic/template/peripheral_config.json.j2 on the switch.
When the template is rendered, all entries in PERIPHERAL_TABLE will be loaded into the configuration database.
There should be a gearbox configuration file in which the gearbox model installed in the system is defined.
For non-chassis systems, all ports share a single gearbox model. As a result, the initialization of PERIPHERAL_TABLE is the same as that of ASIC_TABLE.
For chassis systems, the gearbox can differ between line-cards, which means a mapping from port/line-card to gearbox model is required to get the correct gearbox model for a port. This requires an additional field in the PORT table or a newly introduced table. As this part is still under discussion in the community, we will not discuss this case for now.
The below is an example for Mellanox switches.
Example
{% if sonic_asic_platform == 'mellanox' %}
{% set platform_with_gearbox = ['x86_64-mlnx_msn3800-r0'] %}
{% set platform2gearbox = {
'x86_64-mlnx_msn3800-r0':'MELLANOX-PERIPHERAL-1'
}
%}
{% if platform in platform_with_gearbox %}
{% set gearbox_type = platform2gearbox[platform] %}
"PERIPHERAL_TABLE": {
"MELLANOX-PERIPHERAL-1": {
"gearbox_delay": "9.765"
}
}
{% endif %}
{% endif %}
This table contains the parameters related to RoCE configuration.
key = ROCE_TABLE|<name> ; Name should be in capital letters. For example, "AZURE"
mtu = 1*4DIGIT ; Mandatory. Max transmit unit of RDMA packet, in unit of kBytes.
small_packet_percentage = 1*3DIGIT ; Mandatory. The percentage of small packets against all packets.
Typically all vendors share identical RoCE parameters. They should be stored in /usr/share/sonic/templates/buffers_config.j2, which will be used to render the buffer configuration by config qos reload.
Example
"ROCE_TABLE": {
"AZURE": {
"mtu": "1500",
"small_packet_percentage": "100"
}
}
Table BUFFER_POOL contains the information of a buffer pool.
Currently, there are already some fields in the BUFFER_POOL table. In this design, the field dynamically_update is newly introduced, indicating whether the pool will be updated accordingly when a port's headroom size is updated.
key = BUFFER_POOL|<name>
type = "ingress" / "egress" ; for ingress or egress traffic
mode = "static" / "dynamic" ; indicating the pool's threshold mode
size = 9*DIGIT ; the size of the pool
; for the pools with dynamically_update as true, the size is the original size of the pool
dynamically_update = "true" / "false" ; whether the pool's size will be updated dynamically
Typically the following entries are defined in /usr/shared/sonic/device/<platform>/<SKU>/buffers.json.j2 by all vendors.
- ingress_lossless_pool
- ingress_lossy_pool
- egress_lossless_pool
- egress_lossy_pool
When the system starts for the first time or config qos reload is executed, the buffer pools will be initialized by rendering the template.
In other cases, the buffer pools will be loaded from the configuration database.
Table BUFFER_PROFILE contains the profiles of headroom parameters and the proportion of free shared buffer that can be utilized by a (port, PG) tuple on the ingress side or a (port, queue) tuple on the egress side.
Currently, there are already some fields in the BUFFER_PROFILE table. In this design, the field headroom_type is newly introduced, indicating whether the headroom information, including xon, xoff and size, is dynamically calculated based on the speed and cable length of the port. Accordingly, the fields xon, xoff and size only exist when the headroom_type is static.
key = BUFFER_PROFILE|<name>
pool = reference to BUFFER_POOL object
xon = 6*DIGIT ; xon threshold
xoff = 6*DIGIT ; xoff threshold
size = 6*DIGIT ; size of headroom for ingress lossless
dynamic_th = 2*DIGIT ; for dynamic pools, proportion of free pool the port, PG tuple referencing this profile can occupy
static_th = 10*DIGIT ; similar to dynamic_th but for static pools and in unit of bytes
headroom_type = "static" / "dynamic" ; Optional. Whether the profile is dynamically calculated or user configured.
; Default value is "static"
The profile is configured by CLI.
Typically the following entries are defined in /usr/shared/sonic/device/<platform>/<SKU>/buffers.json.j2 by all vendors.
- ingress_lossless_profile
- ingress_lossy_profile
- egress_lossless_profile
- egress_lossy_profile
- q_lossy_profile
The initialization of the above entries is the same as that of the BUFFER_POOL table.
Besides the above entries, the following ones will be generated on-the-fly:
- Headroom override entries for lossless traffic, which will be configured by user.
- Entries for ingress lossless traffic with specific cable length and speed. They will be referenced by the BUFFER_PG table and created if there is no entry corresponding to a newly occurring (speed, cable length) tuple.
Example
An example of mandatory entries on Mellanox platform:
"BUFFER_PROFILE": {
"ingress_lossless_profile": {
"pool":"[BUFFER_POOL|ingress_lossless_pool]",
"size":"0",
"dynamic_th":"0"
},
"ingress_lossy_profile": {
"pool":"[BUFFER_POOL|ingress_lossy_pool]",
"size":"0",
"dynamic_th":"3"
},
"egress_lossless_profile": {
"pool":"[BUFFER_POOL|egress_lossless_pool]",
"size":"0",
"dynamic_th":"7"
},
"egress_lossy_profile": {
"pool":"[BUFFER_POOL|egress_lossy_pool]",
"size":"4096",
"dynamic_th":"3"
},
"q_lossy_profile": {
"pool":"[BUFFER_POOL|egress_lossy_pool]",
"size":"0",
"dynamic_th":"3"
}
}
Table BUFFER_PG contains the map from the (port, priority group) tuple to the buffer profile object.
Currently, there are already some fields in the BUFFER_PG table. In this design, the field headroom_type is newly introduced, indicating whether the profile is dynamically calculated.
key = BUFFER_PG|<name>
headroom_type = "static" / "dynamic" ; Optional. Whether the profile is dynamically calculated or user configured.
; Default value is "static"
profile = reference to BUFFER_PROFILE object ; Exists only when headroom_type is "static"
; For "dynamic" headroom_type the profile isn't required in CONFIG_DB.
The entry BUFFER_PG|<port>|0 is for ingress lossy traffic and will be generated when the system starts for the first time or the minigraph is loaded.
The headroom override entries are configured via CLI.
Other entries are for ingress lossless traffic and will be generated when the port's speed or cable length is updated.
The port speed needs to be fetched from the PORT table.
The cable length needs to be fetched from the CABLE_LENGTH table.
Tables BUFFER_POOL, BUFFER_PROFILE and BUFFER_PG are introduced in APPL_DB. They are the equivalent of the tables with the same names in CONFIG_DB. The APPL_DB tables share similar fields with the CONFIG_DB tables except for some minor differences which will be elaborated below.
The ways in which the APPL_DB tables are initialized are similar:
- When the system starts, Buffer Manager consumes entries of the equivalent tables in CONFIG_DB and creates corresponding entries in the APPL_DB tables for static entries.
- When a new (speed, cable length) tuple occurs in the system, the Buffer Manager will create new entries in the APPL_DB tables.
The field dynamically_update of CONFIG_DB.BUFFER_POOL doesn't exist in APPL_DB.BUFFER_POOL. Other fields are the same.
key = BUFFER_POOL|<name>
type = "ingress" / "egress" ; for ingress or egress traffic
mode = "static" / "dynamic" ; indicating the pool's threshold mode
size = 9*DIGIT ; the size of the pool
The differences between APPL_DB.BUFFER_PROFILE and CONFIG_DB.BUFFER_PROFILE are:
- headroom_type exists only in CONFIG_DB.
- In APPL_DB, xon, xoff and size always exist, while in CONFIG_DB these fields exist only if headroom_type is static.
key = BUFFER_PROFILE|<name>
pool = reference to BUFFER_POOL object
xon = 6*DIGIT ; xon threshold
xoff = 6*DIGIT ; xoff threshold
size = 6*DIGIT ; size of headroom for ingress lossless
dynamic_th = 2*DIGIT ; for dynamic pools, proportion of free pool the port, PG tuple referencing this profile can occupy
static_th = 10*DIGIT ; similar to dynamic_th but in unit of bytes
The field headroom_type of CONFIG_DB.BUFFER_PG doesn't exist in APPL_DB.BUFFER_PG. Other fields are the same.
key = BUFFER_PG|<name>
profile = reference to BUFFER_PROFILE object
The following flows will be described in this section.
- When a port's speed or cable length is updated, BUFFER_PG and BUFFER_PROFILE will be updated to reflect the headroom size for the new speed and cable length. As the headroom size is updated, BUFFER_POOL will also be updated accordingly.
- When a port's admin status is updated, BUFFER_PG and BUFFER_PROFILE won't be updated. However, as only administratively up ports consume headroom, BUFFER_POOL should be updated by adding the buffer released by the admin down port.
- When a static profile is configured on or removed from a port, the BUFFER_PROFILE and/or BUFFER_PG table will be updated accordingly.
- How the tables are loaded when the system starts.
- Warm reboot flow.
This section is split into two parts. In meta flows we will describe the flows which are building blocks of other flows. In main flows we will describe the flows listed above.
Meta flows are the flows that will be called in other flows.
Headroom is calculated as follows:
- headroom = Xoff + Xon
- Xon = pipeline latency
- Xoff = mtu + propagation delay * small packet multiply
- worst case factor = 2 * cell / (1 + cell)
- small packet multiply = (100 - small packet percentage + small packet percentage * worst case factor) / 100
- propagation delay = mtu + 2 * (kb on cable + kb on gearbox) + mac/phy delay + peer response
- kb on cable = cable length / speed of light in media * port speed
The values used in the above procedure are fetched from the following tables:
- cable length: CABLE_LENGTH|<name>|<port>
- port speed: PORT|<port name>|speed
- kb on gearbox: PERIPHERAL_TABLE|<gearbox name>|gearbox_delay
- mac/phy delay: ASIC_TABLE|<asic name>|mac_phy_delay
- peer response: ASIC_TABLE|<asic name>|peer_response_time
- cell: ASIC_TABLE|<asic name>|cell_size
- small packet percentage: ROCE_TABLE|<name>|small_packet_percentage
- mtu: ROCE_TABLE|<name>|mtu
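The formula above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name and parameter names are hypothetical, every size-like input is assumed to be already converted to bytes, the division by 8 that turns bits on the cable into bytes is an assumption about units, and the default speed of light in the cable medium is a placeholder value rather than a number taken from this design.

```python
def calculate_headroom(cable_length_m, port_speed_bps, mtu, small_packet_percentage,
                       cell_size, pipeline_latency, mac_phy_delay, peer_response,
                       gearbox_delay=0.0, speed_of_light_in_media=2.0e8):
    """Return (xon, xoff, headroom) following the formula in this section.

    Assumptions (not from the design): all sizes are in bytes, the port speed
    is in bits per second, and speed_of_light_in_media (m/s) is a placeholder.
    """
    # Data in flight on the cable: propagation time multiplied by the line rate.
    bytes_on_cable = cable_length_m / speed_of_light_in_media * port_speed_bps / 8
    propagation_delay = (mtu + 2 * (bytes_on_cable + gearbox_delay)
                         + mac_phy_delay + peer_response)
    worst_case_factor = 2 * cell_size / (1 + cell_size)
    small_packet_multiply = (100 - small_packet_percentage
                             + small_packet_percentage * worst_case_factor) / 100
    xon = pipeline_latency
    xoff = mtu + propagation_delay * small_packet_multiply
    return xon, xoff, xon + xoff
```

A real implementation would also have to round the result to a multiple of the cell size and convert the table values to a common unit; both are left out of this sketch.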
When a port's cable length or speed is updated, a profile corresponding to the new (cable length, speed) tuple should be looked up in the database. If there isn't one, a new one should be created.
The flow is as follows:
- Look up APPL_DB to check whether there is already a profile corresponding to the new (cable length, speed) tuple. If yes, return the entry.
- Otherwise, create a profile based on the well-known formula and insert it into the APPL_DB.BUFFER_PROFILE table.
- BufferOrch will consume the update in the APPL_DB.BUFFER_PROFILE table and call SAI to create a new profile.
Figure 1: Allocate a New Profile
This is for dynamic profiles only. A static profile won't be removed even if it isn't referenced any more.
When a port's cable length or speed is updated, the profile related to the old (cable length, speed) tuple probably won't be referenced any longer. In this case, the profile should be removed.
Figure 2: Release a No-Longer-Referenced Profile
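The two meta flows for allocating and releasing a dynamic profile can be sketched with a small in-memory cache. The class below is hypothetical: it stands in for APPL_DB.BUFFER_PROFILE with a plain dictionary, and using a reference count to decide when a profile is no longer referenced is an assumption made for illustration, not a mechanism stated in this design. The profile name follows the pg_lossless_<speed>_<length>_profile convention used elsewhere in this document.

```python
class ProfileCache:
    """Hypothetical in-memory stand-in for the dynamic entries of APPL_DB.BUFFER_PROFILE."""

    def __init__(self, calculate):
        self._calculate = calculate  # callable(speed, cable_length) -> dict of profile fields
        self.profiles = {}           # profile name -> profile fields
        self._refs = {}              # profile name -> number of referencing ports

    @staticmethod
    def profile_name(speed, cable_length):
        return f"pg_lossless_{speed}_{cable_length}_profile"

    def allocate(self, speed, cable_length):
        """Return the profile name for the tuple, creating the profile if needed."""
        name = self.profile_name(speed, cable_length)
        if name not in self.profiles:
            self.profiles[name] = self._calculate(speed, cable_length)
        self._refs[name] = self._refs.get(name, 0) + 1
        return name

    def release(self, name):
        """Drop one reference and remove the profile once nothing references it."""
        self._refs[name] -= 1
        if self._refs[name] == 0:
            del self._refs[name]
            del self.profiles[name]
```

Static profiles are deliberately not handled here, since they are never removed automatically.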
When any port's cable length or speed is updated, or its admin state is changed, the buffer pool size should be recalculated.
An exception is warm reboot. During warm reboot the headroom is updated for each port, which would cause the buffer pool to be updated many times. However, the buffer pool data that it will eventually converge to is already in the switch chip. In this sense, updating the buffer pool frequently is unnecessary and should be avoided.
To achieve that, the buffer pool isn't updated during warm reboot and is updated once warm reboot has finished.
The available buffer pool size equals the maximum available buffer pool size minus the size of the buffer reserved for ports and (port, PG) tuples in ingress. The algorithm is as below:
- Accumulate all headroom by iterating over all (port, priority group) tuples and adding up their size.
- Some vendors may reserve memory for lossy PGs regardless of the BUFFER_PROFILE configuration. In this case, the reserved memory size for each lossy PG should be fetched from ASIC_TABLE.reserved_lossy_pg.
- Accumulate all reserved buffer for egress traffic for all ports.
Administratively down ports don't consume buffer, hence they should be excluded.
Figure 3: Calculate the Pool Size
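The pool size algorithm can be illustrated with the following sketch. The function name and the shape of the per-port dictionaries (keys admin_up, lossless_pg_sizes, lossy_pg_count and egress_reserved) are assumptions made for this example; the real Buffer Manager keeps this information in its own internal data structures.

```python
def calculate_pool_size(max_pool_size, ports, reserved_lossy_pg=0):
    """Return the shared pool size left after subtracting all reserved buffer.

    ports is an iterable of dicts with the hypothetical keys:
      admin_up          - bool, administrative state of the port
      lossless_pg_sizes - list of headroom sizes, one per lossless PG object
      lossy_pg_count    - number of lossy PGs on the port
      egress_reserved   - buffer reserved for egress traffic on the port
    """
    reserved = 0
    for port in ports:
        if not port["admin_up"]:
            continue  # administratively down ports consume no buffer
        reserved += sum(port["lossless_pg_sizes"])
        reserved += port["lossy_pg_count"] * reserved_lossy_pg
        reserved += port["egress_reserved"]
    return max_pool_size - reserved
```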
For any port, when:
- its cable length or speed is updated, or
- a PG is added to or removed from its lossless PG set,
its headroom buffer should be recalculated and programmed to the ASIC.
The flow is:
- Find or create a buffer profile according to the new (cable length, speed) tuple.
- If the lossless PG set is updated, remove the old APPL_DB.BUFFER_PG object related to the old lossless PG set and create the new one. For example, if the lossless PG set for Ethernet0 is updated from 3-4 to 3-5, the entry Ethernet0|3-4 will be removed and the entry Ethernet0|3-5 will be created.
- Update the port's buffer PG and update the APPL_DB.BUFFER_PG table.
- Once BufferOrch is notified of the APPL_DB.BUFFER_PG update, it will update the related SAI object.
- Release the APPL_DB.BUFFER_PROFILE referenced by the old (cable length, speed) tuple.
Figure 4: Calculate and deploy the Headroom For a Port, PG
There are admin speed and operational speed in the system, which stand for the speed configured by user and negotiated with peer device respectively. In the buffer design, we are talking about the admin speed.
- Read the speed and cable length of the port.
- Check the following conditions and exit if any of them fails:
  - Check whether headroom_type in CONFIG_DB.BUFFER_PG|<port>|<lossless PG> is dynamic, which means dynamically calculating headroom is required for the port.
  - Check whether there is a cable length configured for the port.
  - Check whether the headroom size calculated based on the (speed, cable length) pair is legal, which means it doesn't exceed the maximum value.
  If any of the above conditions fails, none of CONFIG_DB, APPL_DB or Buffer Manager's internal data will be changed. This will result in inconsistency among the entities. An error message will be logged to prompt the user to revert the configuration.
- Allocate a buffer profile related to the cable length and speed.
- Calculate and deploy the headroom for the (port, PG) tuple.
Figure 5: Cable length or speed updated
When a port's administrative status is changed, BUFFER_PG and BUFFER_PROFILE won't be touched. However, as an admin down port doesn't consume headroom buffer while an admin up one does, the sum of effective headroom size will be updated accordingly, hence BUFFER_POOL should be updated.
When a static headroom is configured on a port
- Release the buffer profile currently used on the port.
- Update the port's buffer pg according to the configuration.
- Recalculate the buffer pool size.
Figure 6: Apply Static Headroom Configure
When a static headroom is removed on a port:
- Allocate the buffer profile according to the port's speed and cable length.
- Recalculate the buffer pool size.
Figure 7: Remove Static Headroom Configure
When a static buffer profile is updated, it will be propagated to Buffer Orch and then SAI. The buffer PGs that reference this buffer profile don't need to be updated. However, as the total amount of headroom buffer is updated, the buffer pool size should be recalculated.
Figure 8: Static Buffer Profile Updated
When the lossless priorities are configured, the system should recalculate the port's headroom, just as it does when speed or cable length is updated.
In this section we will discuss the start flow and the flows of SONiC-to-SONiC upgrade from statically headroom look-up to dynamically calculation by:
- Cold reboot
- Warm reboot
When the system cold reboots from the current implementation to the new one, db_migrator will take care of generating the new tables and converting the old data.
- Initialize ASIC_TABLE, PERIPHERAL_TABLE and ROCE_TABLE from predefined templates into the config database, just like what is done when the system starts for the first time or config load_minigraph is executed.
- Convert the current data in BUFFER_PROFILE and BUFFER_PG into the new format with headroom_type inserted via the following logic:
  - If a BUFFER_PROFILE entry follows the naming convention pg_lossless_<speed>_<length>_profile and has the same dynamic_th value as ASIC_TABLE.default_dynamic_th, it will be treated as a dynamically generated profile based on the port's speed and cable length. In this case it will be removed from CONFIG_DB.
  - If a BUFFER_PG references a profile whose headroom_type is dynamic, it will also be treated as a dynamic buffer PG object and its headroom_type will be initialized as dynamic.
  - If a BUFFER_PROFILE or BUFFER_PG item doesn't meet any of the above conditions, it will be treated as a static profile.
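The conversion logic can be sketched as below. This is not the db_migrator code: the function names are hypothetical, the tables are modelled as plain dictionaries, and the regular expression assumes that the <length> part of a profile name is a number followed by m (as in pg_lossless_100000_5m_profile).

```python
import re

# Assumed shape of the naming convention pg_lossless_<speed>_<length>_profile.
DYNAMIC_PROFILE_NAME = re.compile(r"^pg_lossless_\d+_\d+m_profile$")


def is_dynamic_profile(name, profile, default_dynamic_th):
    """A profile is dynamic if its name and dynamic_th both match the convention."""
    return (DYNAMIC_PROFILE_NAME.match(name) is not None
            and profile.get("dynamic_th") == default_dynamic_th)


def migrate(buffer_profiles, buffer_pgs, default_dynamic_th):
    """Return (new BUFFER_PROFILE, new BUFFER_PG) with headroom_type inserted."""
    dynamic = {name for name, profile in buffer_profiles.items()
               if is_dynamic_profile(name, profile, default_dynamic_th)}
    # Dynamically generated profiles are dropped; the rest become static.
    new_profiles = {name: dict(profile, headroom_type="static")
                    for name, profile in buffer_profiles.items()
                    if name not in dynamic}
    new_pgs = {}
    for key, pg in buffer_pgs.items():
        # References look like "[BUFFER_PROFILE|<profile name>]".
        referenced = pg["profile"].strip("[]").split("|", 1)[1]
        if referenced in dynamic:
            new_pgs[key] = {"headroom_type": "dynamic"}
        else:
            new_pgs[key] = dict(pg, headroom_type="static")
    return new_profiles, new_pgs
```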
After that, Buffer Manager will start with the normal flow, which is described in the next section.
When the daemon starts, it will:
- Test the command line options. If the -c option is provided, the class for dynamically buffer calculation will be instantiated. Otherwise the current solution for calculating headroom buffers will be used, which is out of the scope of this design.
- Load tables ASIC_TABLE, PERIPHERAL_TABLE and ROCE_TABLE from CONFIG_DB into internal data structures.
- Load tables CABLE_LENGTH and PORT from CONFIG_DB into internal data structures.
- After that it will:
  - handle items in the CABLE_LENGTH and PORT tables
  - handle items in BUFFER_POOL, BUFFER_PROFILE and BUFFER_PG in CONFIG_DB by calculating the headroom size for the ports and then pushing the result into the BUFFER_PROFILE and BUFFER_PG tables in APPL_DB, except for the (port, priority group) tuples configured with headroom override.
When the system starts, the port's headroom will always be recalculated according to its speed and cable length. As a result, when the system warm restarts between images which calculate headroom size in different ways, the Buffer Manager will eventually regenerate the items for each (port, priority group) tuple according to the port's speed and cable length and the items in CONFIG_DB, and then push them into APPL_DB.
In this sense, no specific steps are required regarding configuration migration for the above tables.
For the table BUFFER_POOL, each time the headroom size is updated the buffer pool size should be updated accordingly. When the system warm reboots, the headroom size of each port will be updated as well, which means the buffer pool size would be updated many times even though the final value is already in the switch chip. This should be avoided.
This can be achieved by checking whether the warm reboot has finished before calling the SAI API.
The command config interface lossless_pg <set|clear> is designed to configure the priorities used for lossless traffic.
sonic#config interface lossless_pg set <port> <pg-map>
sonic#config interface lossless_pg clear <port>
All the parameters are mandatory.
The pg-map stands for the map of priorities for lossless traffic. It should be a string in the form of a bit map like 3-4 or 3-4,6. The - connects the lower bound and upper bound of a range of priorities and the , separates multiple ranges.
Every time the lossless priority is set, the old value will be overwritten.
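As an illustration of the pg-map syntax, the following sketch expands a string such as 3-4,6 into the list of priorities it covers. The function name parse_pg_map is hypothetical and the error handling is an assumption; the actual CLI may validate its input differently.

```python
def parse_pg_map(pg_map):
    """Expand a pg-map string such as "3-4,6" into a sorted list of priorities."""
    priorities = set()
    for part in pg_map.split(","):
        low, sep, high = part.partition("-")
        start = int(low)
        # A part without "-" is a single priority, i.e. a range of length one.
        end = int(high) if sep else start
        if start > end:
            raise ValueError(f"invalid range in pg-map: {part!r}")
        priorities.update(range(start, end + 1))
    return sorted(priorities)
```

For example, parse_pg_map("3-4,6") yields the priorities 3, 4 and 6.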
A static profile can be used to override the headroom size and/or dynamic_th of a (port, PG) tuple.
The command config buffer_profile is designed to create or destroy a static buffer profile which will be used for headroom override.
sonic#config buffer_profile <name> add --xon <xon> --xoff <xoff> --headroom <headroom> --dynamic_th <dynamic_th>
sonic#config buffer_profile <name> del
All the parameters are divided into two groups, one for headroom and one for dynamic_th. For any command at least one group of parameters should be provided.
For headroom parameters:
- At least one of xoff and headroom should be provided; the other is optional and is derived via the formula xon + xoff = headroom. All other parameters are mandatory.
- xon is mandatory.
- xon + xoff <= headroom; for the Mellanox platform, xon + xoff == headroom.
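The rules for the headroom parameters can be sketched as a small validation helper. The function name resolve_headroom is hypothetical and the checks shown are one possible reading of the rules above, not the CLI implementation.

```python
def resolve_headroom(xon, xoff=None, headroom=None):
    """Fill in whichever of xoff/headroom is missing and validate the result."""
    if xoff is None and headroom is None:
        raise ValueError("at least one of xoff and headroom must be provided")
    if xoff is None:
        xoff = headroom - xon      # derived from xon + xoff = headroom
    elif headroom is None:
        headroom = xon + xoff      # derived from xon + xoff = headroom
    if xoff < 0 or xon + xoff > headroom:
        raise ValueError("xon + xoff must not exceed headroom")
    return xon, xoff, headroom
```

A platform that requires xon + xoff == headroom would need an extra equality check on top of this.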
If only headroom parameters are provided, the dynamic_th will be taken from CONFIG_DB.ASIC_TABLE.default_dynamic_th.
If only the dynamic_th parameter is provided, the headroom_type will be set as dynamic and xon, xoff and size won't be set.
When a profile is deleted, it must not be referenced by any entry in CONFIG_DB.BUFFER_PG.
The command config interface headroom_override is designed to enable or disable the headroom override for a certain port.
sonic#config interface headroom_override enable <port> --profile <profile>
sonic#config interface headroom_override disable <port>
Headroom override will be enabled on all lossless PGs configured by `config interface lossless_pg` on the `<port>`. The `<profile>` must be defined in advance.
The command `config interface cable_length` is designed to configure the cable length of a port.
sonic#config interface cable_length <port> <length>
All the parameters are mandatory.
The `length` stands for the length of the cable connected to the port. It should be an integer, in meters.
The following command is used to configure the cable length of Ethernet0 as 10 meters.
sonic#config interface cable_length Ethernet0 10
The command `mmuconfig` is extended to display the current configuration.
sonic#mmuconfig -l
The command `config qos clear` is provided to clear all the QoS configurations from the database, including all the above-mentioned configurations.
Configure commands:
config interface lossless_pg set Ethernet0 3-5
In configure database there should be:
{
"BUFFER_PG" : {
"Ethernet0|3-5" : {
"headroom_type" : "dynamic"
}
}
}
In APPL_DB there should be:
{
"BUFFER_PROFILE" : {
"pg_lossless_100000_5m_profile" : {
"pool" : "[BUFFER_POOL|ingress_lossless_pool]",
"xon" : "18432",
"xoff" : "20480",
"size" : "38912",
"dynamic_th" : "0"
}
},
"BUFFER_PG" : {
"Ethernet0|3-5" : {
"profile" : "[BUFFER_PROFILE|pg_lossless_100000_5m_profile]"
}
}
}
Configure commands:
config buffer_profile add pg_lossless_100000_5m_customize_profile --dynamic_th 3
config interface headroom_override enable Ethernet0 --profile pg_lossless_100000_5m_customize_profile
In configure database there should be:
{
"BUFFER_PROFILE" : {
"pg_lossless_100000_5m_customize_profile" : {
"pool" : "[BUFFER_POOL|ingress_lossless_pool]",
"dynamic_th" : "3",
"headroom_type" : "dynamic"
}
},
"BUFFER_PG" : {
"Ethernet0|3-4" : {
"profile" : "[BUFFER_PROFILE|pg_lossless_100000_5m_customize_profile]",
"headroom_type" : "dynamic"
}
}
}
In APPL_DB there should be:
{
"BUFFER_PROFILE" : {
"pg_lossless_100000_5m_customize_profile" : {
"pool" : "[BUFFER_POOL|ingress_lossless_pool]",
"xon" : "18432",
"xoff" : "20480",
"size" : "38912",
"dynamic_th" : "3"
}
},
"BUFFER_PG" : {
"Ethernet0|3-4" : {
"profile" : "[BUFFER_PROFILE|pg_lossless_100000_5m_customize_profile]"
}
}
}
Configure commands:
config interface lossless_pg set Ethernet0 3-4,6
config buffer_profile add pg_lossless_custom_profile --dynamic_th 3 --xon 18432 --headroom 36864
config interface headroom_override enable Ethernet0 --profile pg_lossless_custom_profile --pg 3-4
In configure database there should be:
{
"BUFFER_PROFILE" : {
"pg_lossless_custom_profile" : {
"pool" : "[BUFFER_POOL|ingress_lossless_pool]",
"dynamic_th" : "3",
"headroom_type" : "static",
"xon" : "18432",
"xoff" : "18432",
"size" : "36864"
}
},
"BUFFER_PG" : {
"Ethernet0|3-4" : {
"profile" : "[BUFFER_PROFILE|pg_lossless_custom_profile]",
"headroom_type" : "static"
},
"Ethernet0|6" : {
"headroom_type" : "dynamic"
}
}
}
In APPL_DB there should be:
{
"BUFFER_PROFILE" : {
"pg_lossless_custom_profile" : {
"pool" : "[BUFFER_POOL|ingress_lossless_pool]",
"xon" : "18432",
"xoff" : "18432",
"size" : "36864",
"dynamic_th" : "3"
},
"pg_lossless_100000_5m_profile" : {
"pool" : "[BUFFER_POOL|ingress_lossless_pool]",
"xon" : "18432",
"xoff" : "20480",
"size" : "38912",
"dynamic_th" : "0"
}
},
"BUFFER_PG" : {
"Ethernet0|3-4" : {
"profile" : "[BUFFER_PROFILE|pg_lossless_custom_profile]"
},
"Ethernet0|6" : {
"profile" : "[BUFFER_PROFILE|pg_lossless_100000_5m_profile]"
}
}
}
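The relationship between the `BUFFER_PG` entries in the configure database and those in APPL_DB in the examples above can be sketched as follows. This is a minimal illustration, not the Buffer Manager's code: the function name `resolve_buffer_pg` and the `ports` dictionary are hypothetical, and the profile naming pattern `pg_lossless_<speed>_<cable length>_profile` is inferred from the example names.

```python
def resolve_buffer_pg(config_pg: dict, ports: dict) -> dict:
    """Map CONFIG_DB BUFFER_PG entries to the entries expected in APPL_DB."""
    appl_pg = {}
    for key, entry in config_pg.items():
        port = key.partition("|")[0]          # "Ethernet0|3-4" -> "Ethernet0"
        if "profile" in entry:
            # Headroom override: keep the user-supplied profile reference.
            profile_ref = entry["profile"]
        else:
            # Dynamic calculation: the profile is keyed by speed and cable length.
            info = ports[port]
            name = f"pg_lossless_{info['speed']}_{info['cable_length']}_profile"
            profile_ref = f"[BUFFER_PROFILE|{name}]"
        appl_pg[key] = {"profile": profile_ref}
    return appl_pg


config_pg = {
    "Ethernet0|3-4": {
        "profile": "[BUFFER_PROFILE|pg_lossless_custom_profile]",
        "headroom_type": "static",
    },
    "Ethernet0|6": {"headroom_type": "dynamic"},
}
ports = {"Ethernet0": {"speed": "100000", "cable_length": "5m"}}  # assumed port data
print(resolve_buffer_pg(config_pg, ports))
```

For the last example this yields the override profile for `Ethernet0|3-4` and `pg_lossless_100000_5m_profile` for `Ethernet0|6`, as in the APPL_DB listing.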
- Increase a port's speed.
- Increase a port's cable length.
- Decrease a port's speed.
- Decrease a port's cable length.
- Update a port's speed and cable length at the same time.
- A speed, cable length tuple occurs for the first time.
- A speed, cable length tuple is referenced by only one port, PG tuple and then is no longer referenced.
- Shutdown a port.
- Startup a port.
- Configure static buffer profile with headroom dynamically calculated and non-default alpha value. The buffer is referenced by a port, PG and then is no longer referenced.
- Configure non-default lossless PG on a port.
- Configure headroom override on a port.
- Configure lossless PGs with both dynamically calculation and headroom override on the same port (e.g. 3-4 dynamic and 6 headroom override).
- Remove headroom override on a port.
- Remove headroom override on a port, PG.
- System start for the first time (with an empty configure database).
- System start with a legal configure database.
- System start from old image (SONiC-to-SONiC upgrade).
- System warm reboot from old image (SONiC-to-SONiC upgrade).
- Configure a headroom override with a nonexistent profile.
- Try to remove a referenced profile.
- Configure a large cable length which makes the calculated headroom exceed the maximum legal value.
- Should we still use the fixed set of cable lengths? Or should we support arbitrary cable lengths?
  - If arbitrary cable length is legal, should a no-longer-used buffer profile be removed from the database?
    - If no, the database will probably be stuffed with those items.
    - If yes, we need to maintain a reference count for the profiles, which introduces unnecessary complexity.

  Status: Closed.

  Decision: We're going to support arbitrary cable length.
- After a port's cable length is updated, should the BUFFER_PG table be updated as well?
  - The current implementation doesn't do that. Why?

  Status: Closed.

  Decision: Yes. After a port's cable length is updated, the BUFFER_PG table should be updated accordingly.
- With headroom size dynamically configured, is it necessary to recalculate the buffer pool size?
  - For the egress_lossy_pool, ingress_lossless_pool and ingress_lossy_pool, the size is the total size minus the total headroom of all ports.
- Lossless is supported on priorities 3-4 only. Is this by design, by standard, or for historical reasons?
- Can the shared buffer pool be updated on the fly? Can a buffer profile be updated on the fly? Only the dynamic_th.

  Status: Open.

  Decision: It should be. But there is an issue in SONiC: the "dynamic_th" parameter of a lossless buffer profile can't be changed on the fly.
- There are default headrooms for lossy traffic which are determined by the SDK and which SONiC isn't aware of. Do they affect the shared buffer calculation?

  Status: Closed.

  Decision: Yes. They should be taken into consideration.
- There is a limitation from the SDK/FW: the total headroom size of all priority groups belonging to a port is capped. For a 2700 split port, this cap prevents the headroom size from being programmed if the speed is 50G and the cable length is 300m.

  Status: Closed.

  Decision: There should be a maximum value for the accumulated PG headroom of a port. This can be fetched from ASIC_DB.
- Originally the buffer configuration was stored in APPL_DB but was moved to CONFIG_DB later. Why? doc for reference.
- In theory, when the system starts, as the `BUFFER_PROFILE` and `BUFFER_PG` tables are stored in the config database, which survives system reboot, the `Buffer Orch` can receive items of the tables before they are recalculated by the `Buffer Manager` based on the current algorithm, `cable length` and `speed`. If the calculated headroom size differs before and after reboot, the items in the tables can be deployed twice, with the first deployment quickly overwritten by the second one.
- For chassis systems the gearbox can differ between line-card variants, which means a mapping from port/line-card to gearbox model is required to get the correct gearbox model for a port. This requires an additional field defined in the `PORT` table or some newly introduced table. As this part hasn't been defined in the community, we will not discuss this case for now.
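Two of the rules raised in the open questions above, the pool size being the total size minus the total headroom of all ports (including the default lossy headrooms), and the per-port cap on accumulated PG headroom, can be sketched as below. The function names and all numbers in the usage lines are hypothetical; real values would come from the platform and ASIC_DB.

```python
from typing import Iterable


def shared_pool_size(total_size: int,
                     lossless_headrooms: Iterable[int],
                     lossy_headrooms: Iterable[int] = ()) -> int:
    """Pool size left after reserving headroom for all ports."""
    reserved = sum(lossless_headrooms) + sum(lossy_headrooms)
    if reserved > total_size:
        raise ValueError("reserved headroom exceeds the total buffer size")
    return total_size - reserved


def port_headroom_fits(pg_headrooms: Iterable[int], max_port_headroom: int) -> bool:
    """Check the accumulated headroom of a port's PGs against the per-port cap."""
    return sum(pg_headrooms) <= max_port_headroom


# Illustrative numbers only.
print(shared_pool_size(1_000_000, [38912, 38912], [1000]))  # 921176
print(port_headroom_fits([38912, 38912], 100_000))          # True
```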
- No need to recalculate the buffer pool size every time a port's `speed`, `cable length` or `lossless PG` is updated. A counterexample: when the system starts, the buffer pool needs to be calculated only once.
- Need to define when the parameters should be saved into the internal data structure. I suggest doing that in the meta flows.
- Need to make sure the `BUFFER_PG` linked to the old `lossless PG` is removed.
- Need to verify the SONiC-to-SONiC upgrade flow and db_migrate.
- Review all flow charts.
- Headroom data will be updated more frequently than it currently is.
- More dynamic entries may occur due to supporting arbitrary cable length.
- As a result, headroom data is no longer suitable to be stored as configuration.
By `Buffer Manager`:
- Keeps the logic in `Buffer Orch` simple
- Aligns with the current logic between manager daemons and orchagent
- Easy to control when the buffer pool is updated, which is important during warm reboot
- When something bad happens, having a separate table in APPL_DB helps investigate issues

By `Buffer Orchagent`:
- Doesn't need APPL_DB tables, which simplifies the procedure significantly
- Needs to read cable length and port speed in orchagent, which will change the structure of buffer manager.
In the statically look-up solution all buffer-relevant tables are stored in CONFIG_DB, which is supposed to contain the configuration supplied by the user. However, some buffer data, including some entries in the `BUFFER_PROFILE` table and the `BUFFER_PG` table, are dynamically generated when a port's speed or cable length is updated, which means they are not real configuration.
Having dynamic entries in CONFIG_DB is confusing. However, in the statically look-up solution a user can easily distinguish dynamic entries from static ones: since there is only a limited number of (`speed`, `cable length`) combinations, the number of dynamically generated entries in the `BUFFER_PROFILE` table is small. In this sense, having dynamic and static entries mixed together isn't a big problem for now.
However, in this design it will no longer be true because:
- Variant cable lengths will be supported, which means the number of dynamically generated entries in the `BUFFER_PROFILE` table can be much larger.
- The headroom data will be calculated dynamically, which means it will be updated more frequently and is no longer suitable as static configuration.
- There is going to be headroom override, which means the `BUFFER_PG` and `BUFFER_PROFILE` tables will contain both dynamic and static entries. Meanwhile, the `dynamic_th` or `static_th` of the dynamic entries is configured by the user.
These will confuse users, making it difficult to distinguish static entries from dynamic ones and to understand the configuration.
To resolve the issue, we have to add tables to APPL_DB, representing the current values which are programmed to the ASIC.