Quality of Service

This document serves as a brief introduction to Quality of Service (QoS) in HPC networks and its place in the CODES ecosystem.

Introduction

HPC systems have historically been built around torus or grid interconnects, which offered good performance and connectivity. When multiple jobs were scheduled on the system, they could be partitioned relatively easily, with little overlap or contention for resources such as link bandwidth and router queues. As a result, communication interference between the jobs running on the system was low.

With hierarchical networks like Dragonfly and its derivatives becoming more popular, obtaining an optimally performing partition of the network for each individual job is no longer a simple task. Jobs, and the terminals allocated to them, are commonly required to share links and routers. This leads to contention for resources and interference between jobs, resulting in decreased overall performance.

Quality of Service (QoS) is a broad class of mechanisms designed to give the system control over how congestion is handled and mitigated.

One QoS mechanism is the introduction of traffic classes. Classifying packets into different classes or priorities optimizes traffic flow by preventing "important" packets, such as those associated with MPI collectives, from waiting too long in congested routers. It can also stop bandwidth-hungry jobs from "bullying" other jobs by preventing them from consuming more than a set amount of bandwidth, allowing less communication-intensive applications to receive their fair share.

In CODES, QoS is implemented at the interconnect model level and must be ported into each specific network model. Currently, the Dragonfly-Dally (1D Dragonfly) and Dragonfly-Plus (Megafly) network models support QoS.

Quick Start

To turn on QoS, one must specify parameters in two places:

  1. In the execution line when starting a simulation
  2. In the network configuration file

Execution Line

On the command line, one must specify the priority type: whether the QoS mechanism allocates bandwidth per job or by the type of traffic (MPI collective priority).

To allocate bandwidth based on jobs, specify --priority_type=0

To allocate bandwidth by giving priority to MPI collectives, specify --priority_type=1
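
The exact launch command depends on your build and experiment; the following is an illustrative sketch assuming the model-net-mpi-replay workload replayer that ships with CODES, with placeholder process counts, paths, and workload flags. Only --priority_type is specific to QoS:

   mpirun -np 4 ./model-net-mpi-replay --sync=3 \
       --priority_type=0 \
       --workload_type=dumpi \
       --workload_conf_file=PATH/TO/workloads.conf \
       --alloc_file=PATH/TO/allocation.conf \
       -- PATH/TO/dragonfly-config.conf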

Configuration File

Below is an example CODES Dragonfly configuration file. It configures a network with 8320 terminals (compute nodes) and 1040 routers. Each terminal has its own model-net LP for the workload interface. The network is divided into 65 groups, each with all-to-all intra-group connections and 128 global inter-group connections distributed across its 16 routers.

Parameters specific to QoS are at the bottom of the file.

LPGROUPS
{
   MODELNET_GRP
   {
      repetitions="1040";
# name of this lp changes according to the model
      nw-lp="8";
# these lp names will be the same for dragonfly-custom model
      modelnet_dragonfly_custom="8";
      modelnet_dragonfly_custom_router="1";
   }
}
PARAMS
{
# parameter to configure dragonfly custom model as 1D Dragonfly
   df-dally-vc="1";
# packet size in the network
   packet_size="4096";
# order of LPs, mapping for modelnet grp
   modelnet_order=("dragonfly_custom","dragonfly_custom_router");
# scheduler options
   modelnet_scheduler="fcfs";
# chunk size in the network (when chunk size = packet size, packets will not be divided into chunks)
   chunk_size="4096";
# number of rows in local group chassis
   num_router_rows="1";
# number of cols in local group chassis
   num_router_cols="16";
# Number of row channels
   num_row_chans="1";
# Number of column channels
   num_col_chans="1";
# number of groups in the network
   num_groups="65";
# buffer size in bytes for local virtual channels
   local_vc_size="16384";
# buffer size in bytes for global virtual channels
   global_vc_size="16384";
# buffer size in bytes for compute node virtual channels
   cn_vc_size="32768";
#bandwidth in GiB/s for local channels
   local_bandwidth="25.0";
# bandwidth in GiB/s for global channels
   global_bandwidth="25.0";
# bandwidth in GiB/s for compute node-router channels
   cn_bandwidth="25.0";
# ROSS message size
   message_size="656";
# number of compute nodes connected to router
   num_cns_per_router="8";
# number of global channels per router
   num_global_channels="8";
# network config file for intra-group connections
   intra-group-connections="PATH/TO/df8k-intra-file";
# network config file for inter-group connections
   inter-group-connections="PATH/TO/df8k-inter-file";
# routing protocol to be used
   routing="prog-adaptive"; 
# minimal route threshold before considering non-minimal paths
   adaptive_threshold="131072";

# -vvv- QoS Specific Parameters -vvv-

# number of traffic classes
   num_qos_levels="2";
# percent of total bandwidth allowed for each traffic class - should add up to 100
   qos_bandwidth="90,10";
# simulation time where qos bandwidth monitoring is active - default is 5000000000
   max_qos_monitor="5000000000";
}

num_qos_levels and qos_bandwidth

Priority Type 0

If the priority type is 0, meaning that QoS traffic classes are assigned to packets according to their respective job, then the number of QoS levels should match the number of jobs in the simulation. Additionally, the number of percentages specified in the qos_bandwidth configuration should match the number of QoS levels. A configuration sketch follows below.
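
For example, a three-job simulation could be configured as follows. The percentage split shown is illustrative; choose values appropriate to each job's bandwidth needs (they should add up to 100):

   num_qos_levels="3";
   qos_bandwidth="50,30,20";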

Priority Type 1

If the priority type is 1, meaning that QoS traffic classes are used to give priority to MPI collectives, then the number of QoS levels should be 2, and the percentage of bandwidth allocated to the MPI collective class should be listed first. A configuration sketch follows below.
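
For example, to reserve 30% of bandwidth for the MPI collective class (the split is illustrative; the source only requires that the collective class percentage come first):

   num_qos_levels="2";
   qos_bandwidth="30,70";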

max_qos_monitor

This value is the simulation time, in nanoseconds, during which QoS bandwidths are monitored.
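
For example, to monitor QoS bandwidths for the first 2 seconds (2000000000 ns) of simulated time, an illustrative value shorter than the 5000000000 ns default:

   max_qos_monitor="2000000000";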
