Quality of Service
This document serves as a brief introduction to Quality of Service (QoS) in HPC networks and its place in the CODES ecosystem.
In the past, HPC systems have mostly been based around torus or grid interconnects. These offered good performance and connectivity, and when multiple jobs were scheduled on the system they could be partitioned relatively easily, with little overlap or contention for resources such as link bandwidth and router queues. As a result, the level of communication interference between the jobs running on the system was low.
With hierarchical networks like Dragonfly and its derivatives becoming more popular, obtaining an optimally performing partition of the network for each individual job is no longer a simple task. Jobs, and the terminals allocated to them, are commonly required to share links and routers. The resulting contention for resources creates interference between jobs and decreases overall performance.
Quality of Service (QoS) is a broad class of mechanisms designed to give the system the power to control how congestion is handled and mitigated.
One mechanism for QoS is the introduction of traffic classes. Classifying packets with different classes or priorities optimizes traffic flow by preventing "important" packets, such as those associated with MPI collectives, from waiting too long in congested routers. It can also stop bandwidth-hungry jobs from "bullying" other jobs by preventing them from consuming more than a set amount of bandwidth, which allows less communication-intensive applications to receive their fair share.
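Roughly speaking, if two traffic classes share a 25.0 GiB/s link with a 90/10 bandwidth split (the split used in the example configuration later in this document), the favored class may consume up to 22.5 GiB/s while 2.5 GiB/s remains reserved for the other class.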
In CODES, QoS is implemented at the interconnect model level and must be ported to each specific network model. Currently, the Dragonfly-Dally (1D Dragonfly) and Dragonfly-Plus (Megafly) network models support QoS.
To turn on QoS, one must specify parameters in two places:
- In the execution line when starting a simulation
- In the network configuration file
On the command line, one must specify the priority type, that is, whether the QoS mechanism allocates bandwidth by job or by the type of traffic (MPI Collective priority).
- To allocate bandwidth based on jobs, specify "--priority_type=0"
- To allocate bandwidth giving priority to MPI Collectives, specify "--priority_type=1"
Below is an example CODES Dragonfly configuration file. It configures a network with 8320 terminals (compute nodes) and 1040 routers. Each terminal is configured to have its own model-net LP for the workload interface. The network is divided into 65 groups, each with all-to-all intra-group connections and 128 global inter-group connections distributed across its 16 routers.
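These counts follow directly from the parameters below: 65 groups × 16 routers per group = 1040 routers; 1040 routers × 8 terminals per router = 8320 terminals; and 16 routers × 8 global channels per router = 128 global connections per group.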
Parameters specific to QoS are at the bottom of the file.
LPGROUPS
{
   MODELNET_GRP
   {
      repetitions="1040";
      # name of this lp changes according to the model
      nw-lp="8";
      # these lp names will be the same for dragonfly-custom model
      modelnet_dragonfly_custom="8";
      modelnet_dragonfly_custom_router="1";
   }
}
PARAMS
{
   # parameter to configure dragonfly custom model as 1D Dragonfly
   df-dally-vc="1";
   # packet size in the network
   packet_size="4096";
   # order of LPs, mapping for modelnet grp
   modelnet_order=("dragonfly_custom","dragonfly_custom_router");
   # scheduler options
   modelnet_scheduler="fcfs";
   # chunk size in the network (when chunk size = packet size, packets will not be divided into chunks)
   chunk_size="4096";
   # number of rows in local group chassis
   num_router_rows="1";
   # number of cols in local group chassis
   num_router_cols="16";
   # number of row channels
   num_row_chans="1";
   # number of column channels
   num_col_chans="1";
   # number of groups in the network
   num_groups="65";
   # buffer size in bytes for local virtual channels
   local_vc_size="16384";
   # buffer size in bytes for global virtual channels
   global_vc_size="16384";
   # buffer size in bytes for compute node virtual channels
   cn_vc_size="32768";
   # bandwidth in GiB/s for local channels
   local_bandwidth="25.0";
   # bandwidth in GiB/s for global channels
   global_bandwidth="25.0";
   # bandwidth in GiB/s for compute node-router channels
   cn_bandwidth="25.0";
   # ROSS message size
   message_size="656";
   # number of compute nodes connected to router
   num_cns_per_router="8";
   # number of global channels per router
   num_global_channels="8";
   # network config file for intra-group connections
   intra-group-connections="PATH/TO/df8k-intra-file";
   # network config file for inter-group connections
   inter-group-connections="PATH/TO/df8k-inter-file";
   # routing protocol to be used
   routing="prog-adaptive";
   # minimal route threshold before considering non-minimal paths
   adaptive_threshold="131072";
   # -vvv- QoS Specific Parameters -vvv-
   # number of traffic classes
   num_qos_levels="2";
   # percent of total bandwidth allowed for each traffic class - should add up to 100
   qos_bandwidth="90,10";
   # simulation time where qos bandwidth monitoring is active - default is 5000000000
   max_qos_monitor="5000000000";
}
If the priority type is 0, meaning that QoS traffic classes are assigned to packets according to their respective jobs, then the number of QoS levels should match the number of jobs in the simulation. The number of percentages specified in the qos_bandwidth configuration should likewise match the number of QoS levels.
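For example, a hypothetical three-job run launched with --priority_type=0 could give half the bandwidth to the first job and split the remainder between the other two:

num_qos_levels="3";
qos_bandwidth="50,30,20";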
If the priority type is 1, meaning that QoS traffic classes are used to give priority to MPI Collectives, then the number of QoS levels should be 2, and the bandwidth allocated to the MPI Collective class should be the first percentage listed.
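For example, to reserve 30 percent of the bandwidth for MPI Collectives (an illustrative split, not a recommendation), list the collective class's percentage first:

num_qos_levels="2";
qos_bandwidth="30,70";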
The max_qos_monitor value is the amount of simulation time, in nanoseconds, during which QoS bandwidths are monitored.