
HPL Tuning


1) Introduction


This wiki page is a brief tutorial on how to tune HPL-GPU for the respective system on which it shall run and for the respective purpose. HPL-GPU works in combination with CALDGEMM. HPL-GPU and CALDGEMM are free software licensed under a combination of the GPL, LGPL, and BSD licenses. You can obtain the most recent versions from https://github.com/davidrohr/hpl-gpu and https://github.com/davidrohr/caldgemm. For a reference to the CALDGEMM options, please refer to CALDGEMM Command Line Options, CALDGEMM-Performance-Optimization-Guide-(CAL---OpenCL-without-GPU_C), CALDGEMM-Performance-Optimization-Guide-(OpenCL---CUDA), and CALDGEMM dgemm_bench examples. HPL-GPU adds various additional parameters on top of the standard HPL parameters. For a reference to the standard HPL parameters and how to tune them, see http://www.netlib.org/benchmark/hpl/tuning.html. In certain cases HPL-GPU behaves differently from standard HPL tuning and also from the standalone CALDGEMM case; for these cases, the best settings are explained in this document. Therefore, this document supersedes the CALDGEMM tuning guides in this wiki and the standard HPL tuning guide. All new HPL-GPU compile-time config options are described in HPL Compile Time Options.

  • HPL-GPU comes with a new "Generic" HPL configuration. The old "standard" HPL configuration files for several systems are still provided, but mostly for reference. The configuration of HPL-GPU can be performed at four different layers:

    1. Traditional Compile Time Defines as in the standard HPL.
    2. Static Generic Configuration via the Compile-Time Generic Configuration File.
    3. Runtime Generic Configuration via the Run-Time Generic Configuration File.
    4. Runtime Generic Configuration via Environment Variables.
  • All layers can be mixed. In principle, the lowermost compile-time define layer allows you to perform all possible settings, but it is quite inconvenient. Most options are now accessible via the Generic Configuration. The higher layers build on top of each other, with each layer overriding the settings of the layers below it.

  • For all the CALDGEMM related options there are again two configuration styles:

    • Traditional configuration via compile time defines.
    • Generic CALDGEMM configuration via command line passing based on the same options provided by DGEMM bench.
  • For greater compatibility, the second option is favored.

  • All compile time defines for the traditional configuration are listed in HPL Compile Time Options. It is strongly suggested to use a combination of compile-time and runtime generic configuration. The generic configuration consists of three configuration files:

    • Make.Generic: Backend for the Generic Configuration options (it uses the traditional configuration interface). Usually, this file should not be touched. It mostly contains paths to headers and libraries and default settings which do not need to be changed. All the paths are controlled via environment variables as explained in the installation Howto. All options should be set via the other two files. Modifications to this file are only required if you want to apply compile time defines not accessible via the generic configuration, or if you need to change some paths or compiler / linker flags.
    • Make.Generic.Options: Static Generic Configuration file (this file provides all generic configuration options that must be set at compile time, e.g. whether to link to an MPI library. More generic configuration options exist, as listed in Make.Generic.Options_OldInterface for the traditional interface and in Make.Generic.Options_StaticConfig for the new CALDGEMM Generic configuration, but it is suggested to set these at runtime instead of compile time).
    • HPL-GPU.conf: The runtime generic configuration file (allows most parameters to be set at runtime).
  • The generic configuration via environment variables allows you to override settings in HPL-GPU.conf. The parameter names are identical to those in HPL-GPU.conf. If an environment variable with such a name is found, it takes precedence over the setting in HPL-GPU.conf.

In order to use the generic configuration, please copy the two generic files from the setup directory to HPL-GPU's top directory, and alter the Make.Generic.Options file according to this document. The build process will automatically put the runtime configuration file HPL-GPU.conf in place next to the program binary (in bin/generic/HPL-GPU.conf).
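
For example (a sketch; the file names follow the description above and the repository's setup directory, but the exact paths may differ on your checkout):

cd hpl-gpu                                          # top directory of the HPL-GPU checkout
cp setup/Make.Generic setup/Make.Generic.Options .  # copy the two generic configuration files
# edit Make.Generic.Options, then build; the build places HPL-GPU.conf in bin/generic/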

The syntax in the generic configuration file HPL-GPU.conf is always:

PARAMETERNAME: PARAMETERSETTING

For boolean settings, you can just use PARAMETERNAME without a colon and without a value to set the setting to true, and use PARAMETERNAME: 0 to disable it. Later settings overwrite previous settings. HPL_PARAMDEFS is a bit special in that multiple HPL_PARAMDEFS entries do not override each other; instead, all of them are parsed.

Valid options in the runtime configuration file are: HPL_PARAMDEFS, HPL_WARMUP, HPL_FASTRAND, HPL_NUM_LASWP_CORES, HPL_DISABLE_LOOKAHEAD, HPL_LOOKAHEAD2_TURNOFF, HPL_LOOKAHEAD3_TURNOFF, HPL_DURATION_FIND_HELPER, HPL_CALDGEMM_ASYNC_FACT_DGEMM, HPL_CALDGEMM_ASYNC_FACT_FIRST, HPL_CALDGEMM_ASYNC_DTRSM, HPL_CALDGEMM_ASYNC_DTRSM_MIN_NB, HPL_CALDGEMM_ASYNC_FACT_DTRSM, HPL_NB_MULTIPLIER, HPL_NB_MULTIPLIER_THRESHOLD, HPL_MPI_AFFINITY, HPL_INTERLEAVE_MEMORY. These options are explained in Important HPL GPU and CALDGEMM options.

Lines starting with "#" are treated as comments, and everything within a line after a "//" is ignored.
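
As an illustration, here is a short hypothetical HPL-GPU.conf fragment using this syntax (the values are placeholders, and treating HPL_WARMUP as a boolean option is an assumption):

# pass CALDGEMM options; all HPL_PARAMDEFS lines are combined
HPL_PARAMDEFS: -Oa
HPL_PARAMDEFS: -Od
HPL_WARMUP                    // boolean style: option name without colon sets it to true
HPL_NUM_LASWP_CORES: 8        // placeholder value; everything after "//" is ignored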

For multi-node runs, you might want to restrict some settings to certain nodes or ranks, which is facilitated by the "!" operator (see the example after the list below). Lines can start with "!#", "!N", and "!%":

  • !Nmynode HPL_PARAMDEFS: will apply the setting only on nodes with hostname "mynode".
  • !#n HPL_PARAMDEFS: will apply the setting only on MPI rank n.
  • !%n,k HPL_PARAMDEFS: will apply the setting on all MPI ranks x such that x % n = k.
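
For example, a hypothetical fragment for a multi-node run (the hostname, rank, and values are made up):

!Nnode01 HPL_PARAMDEFS: -Oa        // only on the node with hostname "node01"
!#0 HPL_NUM_LASWP_CORES: 8         // only on MPI rank 0
!%2,1 HPL_PARAMDEFS: -Od           // on all MPI ranks x with x % 2 = 1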

2) Tuning


Preliminary: Check the DMA and memory bandwidth!

2a) DGEMM Tuning


Before you start with HPL-GPU tuning, please ensure that your system achieves a "good" matrix multiplication performance in DGEMM. DGEMM is the computational hotspot of HPL, hence it dominates HPL performance. Please have a look at the CALDGEMM tuning guides (CALDGEMM Command Line Options, CALDGEMM-Performance-Optimization-Guide-(CAL---OpenCL-without-GPU_C), CALDGEMM-Performance-Optimization-Guide-(OpenCL---CUDA), and CALDGEMM dgemm_bench examples) in order to tune DGEMM performance. In general, it is difficult to say what "good" performance is, but usually you should achieve 75% to 80% of the system's theoretical peak performance.

With the following tuning, we will try to make HPL-GPU achieve almost the same performance as the standalone DGEMM. By construction, HPL-GPU cannot exceed the CALDGEMM performance. Section "Reference Information" gives some empirical data on how much performance should be achievable in which situation. In order to understand this tuning guide, a very rough understanding of the working principle of HPL-GPU is needed; it is summarized briefly in section 2b) below.

Today's GPUs often have a turbo clock which they maintain as long as they stay within the TDP limit. Unfortunately, DGEMM draws a lot of power and GPUs will usually not remain within their TDP limit. There are tools to raise the TDP limits of GPUs, which you can use to boost your DGEMM performance. Be aware that you might run the GPUs out of specification with this, which could damage your hardware. A tool for AMD GPUs is for instance atitweak: https://github.com/mjmvisser/adl3

2b) Working principle of HPL-GPU


HPL performs an iterative matrix factorization. Each iteration consists of the following steps: Panel-Factorization, Update-DGEMM, Update-DTRSM, Panel-Broadcast, U-Broadcast, LASWP. The Update-DGEMM is one large matrix-matrix multiplication, and it is the computational hotspot. The Panel-Factorization is a complex task which contains many subtasks, among them several medium-sized DGEMMs and also some DTRSMs.

DGEMM is ideally suited for execution on the GPU; the larger, the better. Therefore, it is ideal to execute the large Update-DGEMM on the GPU. The GPU can also perform the smaller DGEMMs during the factorization, but less efficiently than the Update-DGEMM. In addition, the GPU can perform the DTRSMs, but less efficiently than DGEMM. Again, the large Update-DTRSM runs more efficiently than the smaller DTRSMs inside the Panel-Factorization. All the other steps must remain on the processor.

The idea behind HPL-GPU is to run, if possible, only the Update-DGEMM on the GPU. All other tasks remain on the processor. A pipeline is used to hide these tasks behind the Update-DGEMM execution on the GPU. In the ideal case, the GPU runs at 100% utilization all the time computing exclusively the Update-DGEMM, while the processor performs the other tasks in parallel. This approach works perfectly as long as the CPU can finish its tasks faster than the GPU; in that case, HPL-GPU can reach 100% GPU utilization. Because the GPU usually provides more compute power than the processor, this is very important.

During the HPL run, the remaining matrix that is processed becomes smaller and smaller. The GPU workload per iteration scales with the square of the matrix size, the CPU workload only linearly with the matrix size. Hence, the GPU workload diminishes faster than the CPU workload, and at a certain point in time the CPU part will take longer than the Update-DGEMM on the GPU. Then, the CPU becomes the dominant part, and the GPU can no longer be used at 100% utilization. At that point, the ideal working configuration changes: now, the GPU should take over additional workload. In this way, we can reduce the CPU workload and make the CPU finish its part faster. HPL-GPU adds options to dynamically offload other steps to the GPU. All these options come with a parameter that defines a matrix size at which the offload starts. If the matrix size is larger than this parameter, the offload is not performed and the GPU runs the Update-DGEMM only. If the matrix size is smaller, the GPU performs additional tasks. These parameters must be tuned for the individual system, and we explain below how this can be done.

There is another aspect with respect to workload distribution. In many situations, the CPU provides a non-negligible amount of compute power. At the beginning of an HPL run the matrices are usually very large. In this case, the GPU is the dominant part: the GPU takes longer to process the Update-DGEMM than the CPU takes for its parts. This means that the CPU idles for some time. We can use this available CPU power to speed up the processing of the Update-DGEMM by distributing the Update-DGEMM workload between GPU and CPU.

The section "Tuning for different scenarios" below explains how to set up a good workload distribution between CPU and GPU and how to tune the offload parameters. But first, this section goes through some general tuning steps.

HPL-GPU comes with a set of plot scripts to visualize the HPL run and identify bottlenecks (See Analysis Plots of HPL GPU Runs).

2c) HPL blocking size Nb


The most important HPL parameter with respect to GPUs is the blocking size Nb. CPU-only systems usually prefer quite small blocking factors in the order of 64. Such settings are infeasible for GPU-accelerated systems. In general, a smaller blocking leads to more efficient execution by increasing the DGEMM workload and reducing the other workloads. Taking into account the above-described working principle of HPL-GPU: a smaller blocking increases the GPU workload, a larger blocking increases the CPU workload. From this perspective, the blocking should be rather small. However, the smaller the blocking, the higher the required host memory and PCI Express bandwidth. A blocking size below 1024 will usually lead to a bandwidth bottleneck. In general, the more GPUs you have in a system, the more severe the bandwidth issue is, i.e. a multi-GPU setup will usually require a larger blocking. In addition, Intel CPUs seem to provide more memory bandwidth than AMD CPUs, hence an AMD CPU based system may sometimes require a larger blocking factor Nb than a comparable Intel CPU based system. Another aspect is the performance of the CPU: the faster the CPU is, the larger Nb can be without creating a CPU bottleneck.

As a rough guideline, start with Nb = 1024 and then increase Nb step by step until you find the optimum. Usually, for an Intel CPU system with 4 GPUs, Nb = 1920 is a good compromise. There is usually absolutely no sense in going beyond Nb = 2048. In a multi-GPU system with old AMD GPUs this might be necessary, but in general, if you need such a high Nb setting, it indicates a bandwidth problem on your system.
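
For reference, Nb is set via the standard HPL input file; a sketch of the relevant lines of HPL.dat, assuming the stock layout, could look like:

1            # of NBs
1920         NBs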

There is one exception to this rule: on multi-GPU systems with more than 4 GPUs per node, the system memory bandwidth becomes a bottleneck. The section "Nb Revisited" at the end of this howto gives some advice on this.

2d) Processor affinities


You can set the CPU core affinity for the CALDGEMM main thread via cal_info.PinMainThread = [n];. It can make sense to try all NUMA nodes for this. It also makes sense to place the MPI process on the NUMA node with the InfiniBand adapter via -DHPL_MPI_AFFINITY={n}. In order to use NUMA nodes efficiently, please look at the parameters -DHPL_GPU_DEVICE_IDS and -DHPL_GPU_ALLOC_MAPPING in the Make.Generic.Options file.
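
A minimal hypothetical sketch, assuming a two-socket node with the InfiniBand adapter on NUMA node 0 (the core/node numbers and the exact value format of HPL_MPI_AFFINITY are assumptions for your own topology):

cal_info.PinMainThread = 0;     // pin the CALDGEMM main thread to a core on NUMA node 0
HPL_MPI_AFFINITY: 0             // runtime counterpart of -DHPL_MPI_AFFINITY for the MPI process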

2e) HyperThreading

On older systems, HyperThreading usually has a negative effect and should be disabled by all means. If you cannot disable it, you can use the -DHPL_GPU_EXCLUDE_CORES parameter to exclude CPU cores from the run, such that only one virtual core per physical core participates. You can try the same for the modules of AMD's Bulldozer architecture. At least make sure that the CALDGEMM threads / MPI threads do not run on the same module as the BLAS threads.

On newer systems, this situation changes. As of the Intel Haswell CPU generation with the Intel MKL 2015 BLAS library, there is practically no performance difference with and without HyperThreading (see the examples for the ESC8000 server with and without HyperThreading). In particular with HyperThreading, thread pinning is important, as described in Thread to core pinning in HPL and CALDGEMM.


3) Tuning for different scenarios


There are principally three tuning scenarios. The first is the simplest one and the basis for the other two. Therefore, section 3a) first explains how to achieve the best performance using mostly the GPU. The following sections 3b) and 3c) provide additional tuning methods, which can be applied on top of 3a) in the other scenarios.

3a) Tuning for best performance running matrix multiplication on GPU only


In order to tune for the best performance on the GPU, we use the simplest approach: we execute the Update-DGEMM on the GPU only, and we dynamically offload some of the CPU tasks to the GPU as soon as the CPU becomes a bottleneck. There are two to three tasks worth offloading, prioritized in the following order:

  • I: Medium DGEMMs inside the factorization.
  • II: The large Update-DTRSM.
  • III: Medium DTRSMs inside the factorization. (Only for systems with very slow CPUs.)

We need to determine the optimal tradeoff parameters, i.e. the matrix size where the offload starts. In order to determine these sizes, we run HPL-GPU twice: one time without offloading, and one time where we always offload. Using the debug output, we can determine the tradeoff point where we should start the offload. For the first run, we set the respective parameter to 1, which is the smallest setting; the remaining matrix size will never be below one. For the second run we set the parameter to 1000000 or even larger, so HPL-GPU will always offload. Because the different offloads can influence each other, we have to go step by step following the above priority list, i.e. first we must find the optimal setting for the DGEMMs inside the factorization (I) and then for the Update-DTRSM (II). On systems with very slow CPUs you can afterwards try to offload the DTRSMs inside the factorization (III), but usually this has a negative effect. In order to enable general offload support, you have to enable cal_info.AsyncSideQueue = true; (the -Oa setting) and cal_info.AsyncDTRSM = true; (the -Od setting). In general, it is a good idea to always enable -DHPL_CALDGEMM_ASYNC_FACT_FIRST.

The three parameters you need to adapt for offloads I, II, and III are: -DHPL_CALDGEMM_ASYNC_FACT_DGEMM=[n] (I), -DHPL_CALDGEMM_ASYNC_DTRSM=[n] (II), and -DHPL_CALDGEMM_ASYNC_FACT_DTRSM=[n] (III). In the new generic runtime configuration file HPL-GPU.conf, the settings for offloading DGEMM during the factorization and the settings for offloading DTRSM are grouped, and you have to uncomment the HPL_PARAMDEFS: -Oa and HPL_PARAMDEFS: -Od settings (-Oa if you want to use an async side queue in general, and -Od if you want to run the async DTRSM).
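
As a hypothetical illustration of the two-run procedure for offload I, written in HPL-GPU.conf syntax (1 and 1000000 are the extreme settings described above):

# run 1: never offload the DGEMMs inside the factorization
HPL_PARAMDEFS: -Oa
HPL_CALDGEMM_ASYNC_FACT_DGEMM: 1

# run 2: always offload the DGEMMs inside the factorization
HPL_PARAMDEFS: -Oa
HPL_CALDGEMM_ASYNC_FACT_DGEMM: 1000000

Comparing the debug output of the two runs iteration by iteration (as described below) yields the matrix size at which to set the final threshold.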

Please set HPL_CONFIG_VERBOSE to at least 3 to get sufficient debug output to find the optimal settings. The relevant lines which you have to investigate in order to find the best settings are the following:

#(0  ,   4) Program: caldgemm Sizes - A: 40321x1920 B: 1920x40320 C:40321x40320 (Host: lcsc-r01n01) System Time 4.030 System Gflops 1549.815
#(0  ,   4) GPU Time 4.0302 (1332.0910 Gflops)   CPU Time 2.1691 (404.5750 Gflops)   Linpack Time: 0.9765 (2, 0.0000, 0.5343)  Total CPU Time: 3.6799 --- GPU Ratio - Real: 0.767 Corrected: 0.866 Guessed: 0.859 , m*n: 1.6E+09, CPU Wait Time: -0.350

The first two numbers show the MPI rank and the iteration number; the iteration number increases by 1 every time. The first line then shows the matrix sizes. We will find the tradeoff point by finding between which two iterations the setting should change. Then we have to insert the respective C matrix size as [n] for the parameters above. There are three equivalent ways to find the best setting. For reference, we describe all of them:

  • (i) The "System Gflops" at the end of the last line shows the performance achieved in one iteration. Comparing the achieved "System Gflops" iteration by iteration for the two runs, you need to find the iteration number when the run with the offload always enabled becomes faster than the run with the offload disabled. (Be aware that it is possible that one of the runs is always faster.)
  • (ii) The second line shows "GPU Time" and "Total CPU Time". You do not need the offload as long as "Total CPU Time" is shorter than "GPU Time". You should enable the offload as soon as "Total CPU Time" becomes longer.
  • (iii) The CPU wait time at the end of the last line describes how long CALDGEMM had to wait for the CPU thread. As soon as this becomes positive, you should enable the offload.

3b) Tuning for best performance using both CPU and GPU for matrix multiplication


This is the most complicated scenario because it requires much more scheduling than the other two. CALDGEMM offers a variety of parameters to tune the GPURatio. The GPU ratio is the fraction of the workload performed by the GPU, i.e. a ratio of 0.7 (70%) means the GPU does 70% of the Update-DGEMM and the CPU does 30%. A simple setting for the ratio would be (GPU-Performance) / (CPU-Performance + GPU-Performance), but this does not work since the CPU has to do other parts as well. Fortunately, CALDGEMM can perform this calculation for you.

But first: when you are using the CPU for the Update-DGEMM, there is also the "preparatory lookahead DGEMM", which you can run on the CPU. Tuning this has very little effect, but it can gain on the order of 0.5% performance. You should tweak this setting before finding the optimal GPU ratio below. Proceed as in section 3a) and use the parameter cal_info.AlternateLookahead = [n];. By default, this preparatory DGEMM is always offloaded.

Now, as the next step, we would like to find the optimal GPU ratio: in the same debug output line used in section 3a), the entry "Corrected: [n]" shows the expected optimal setting. CALDGEMM can automatically determine the optimal setting during runtime if you set a negative GPURatio, but this does not work well for the first few iterations. Hence, CALDGEMM offers the possibility to provide a first "guess", which is used as the basis for its auto-computation. To provide such a guess, pass its negative value as the ratio. I.e., run CALDGEMM once with full autodetection (GPURatio = -1), then take the value printed as "Guessed" and plug its negative into the ratio; in the above case: cal_info.GPURatio = -0.866;. To get the optimal value, you might want to iterate this two or three times, i.e. run with -0.866 and then use the next debug value "Corrected: [n]". Be aware that CALDGEMM will use your setting as the initial GPU ratio and then dynamically adapt the ratio to the optimal setting during runtime. This is needed because the matrix sizes become smaller and thus the optimal ratio changes.

In order to fine-tune the dynamic adaptation, you can use the following CALDGEMM settings (a combined sketch follows the list):

    • MinimizeCPUPart=[n]: Minimizes the CPU part of the Update-DGEMM. In multi-GPU systems, it is important that the CPU does not slow down the GPU. Towards the end of the run the performance may fluctuate and the autodetermination might not be able to find good values. If the GPU ratio is chosen too small, the GPU will idle. This setting can prevent this by forcing the ratio to 1.0 as soon as the matrix size becomes smaller than [n]. Good values are: disabled, one of the tradeoff points determined in 3a) to start one of the offloads, or you can use the same procedure as in 3a) to find the absolute best setting.
    • GPURatioDuringFact=[f]: This is only relevant for multi-node MPI runs. In that case, there are phases with and phases without factorization on a node. Both phases need different GPU ratios and CALDGEMM usually handles this automatically. However, performance during the factorization phases may fluctuate strongly and CALDGEMM might be incapable of finding good ratios. With GPURatioDuringFact you can enforce a ratio in this case. Usual values are: disabled, 1.0, or hand-tuned.
    • GPURatioMax=[f]: Sets a cap for the GPU ratio. A problem is that CALDGEMM cannot determine the ratio properly if the CPU is not used. If, in a network run, the ratio becomes 1.0 for one iteration due to e.g. a network latency, posterior automatic adaption may fail. The GPURatioMax setting can cap the ratio and avoid this. This setting should be used in combination with the MinimizeCPUPart setting above, so the ratio can actually go to 1.0 when the CPU time is dominant. Usual values are either disabled or 0.99.
    • GPURatioMarginTime=[t]: If the autodetermination overestimates the CPU performance, the CPU will take too long and the GPU will idle. This setting allows you to define a margin time [t]: CALDGEMM will try to set the ratio such that the CPU finishes [t] seconds before the GPU. The default is 0.3 s.
    • GPURatioMarginTimeDuringFact=[t]: As above, but during the factorization phase of a multi-node MPI run.
    • GPURatioLookaheadSizeMod=[f]: This is only relevant if the cal_info.AlternateLookahead setting is used. In that case, for a large matrix, the CPU will process the preparatory lookahead DGEMM. This usually cannot run as efficiently as the Update-DGEMM, hence CALDGEMM virtually increases the preparatory DGEMM size in the ratio calculation to find better ratios. The default additional factor is 0.2.
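
A hypothetical combined sketch in the cal_info style used elsewhere in this guide (all values are placeholders that must be tuned on the individual system):

cal_info.GPURatio = -0.866;            // initial guess; the negative value enables runtime auto-adaptation
cal_info.MinimizeCPUPart = 10000;      // force the ratio to 1.0 once the matrix size drops below 10000
cal_info.GPURatioMax = 0.99;           // cap the ratio so the autodetection always keeps a CPU sample
cal_info.GPURatioMarginTime = 0.3;     // let the CPU finish 0.3 s before the GPU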

3c) Tuning for best power efficiency


Tuning for best efficiency usually requires other settings than tuning for best performance. In general, the GPU is more efficient than the CPU. Hence, for best efficiency it is often good to offload as many of the tasks as possible to the GPU. It turns out, however, that offloading the DTRSM is not always good, because the GPU DTRSM requires a lot of memory bandwidth.

As a rough guideline, in order to tune for best efficiency, please do the following (a combined sketch follows the list):

  • Set the GPU ratio to 1.0 and/or set MinimizeCPUPart arbitrarily high, to use only the GPU for the Update-DGEMM.
  • Always offload the preparatory DGEMM: cal_info.AlternateLookahead = 10000000; (very high).
  • Always offload the DGEMM in the factorization: -DHPL_CALDGEMM_ASYNC_FACT_DGEMM=10000000 (very high).
  • Tune the offload of the DTRSM as in 3a).
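
A hypothetical combined sketch, mixing the cal_info style and the HPL-GPU.conf syntax used above (the DTRSM threshold is a placeholder to be determined as in 3a):

cal_info.GPURatio = 1.0;                    // run the Update-DGEMM on the GPU only
cal_info.AlternateLookahead = 10000000;     // always offload the preparatory DGEMM
HPL_CALDGEMM_ASYNC_FACT_DGEMM: 10000000     // always offload the DGEMMs inside the factorization
HPL_CALDGEMM_ASYNC_DTRSM: 20000             // placeholder; tune as in section 3a)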

Other very important aspects are voltage and frequency. The highest frequencies usually do not deliver the best efficiency. In addition, by using a lower frequency, you might be able to reduce the supply voltage as well. Please refer to the GPU driver and the toolkit that comes with it, or to your vendor, on how to change voltage / frequency.

For the CPU, HPL-GPU can link to libcpufrequtils; HPL-GPU can then alter the CPU frequency directly. Refer to Make.Generic.Options for an example of how this works.

Another possibility is to gradually reduce the number of GPUs used in a multi-GPU system, i.e. use only a single GPU at the end of the run, allowing the other GPUs to go into power-save mode. Make.Generic.Options contains an example for this case as well.

For energy measurements, the -DHPL_DURATION_FINDER setting can be a great support.


4a) Nb Revisited


For multi-GPU systems with a lot of memory, it might make sense to switch the Nb parameter during the run. You can facilitate this with the HPL_NB_MULTIPLIER_THRESHOLD and HPL_NB_MULTIPLIER generic configuration options in HPL-GPU.conf. Both parameters take a semicolon-separated list. As long as the global remaining matrix size is above the n-th threshold, the Nb for the next iteration is multiplied by the n-th multiplier. For example:

HPL_NB_MULTIPLIER_THRESHOLD: 20000;10000
HPL_NB_MULTIPLIER: 3;2

This would multiply Nb by three while the remaining matrix size is above 20000, and by two while it is between 10000 and 20000.

The optimal points in time to switch NB must be determined experimentally, in the same way as for the asynchronous DGEMM and DTRSM.


4b) Affinity optimizations


Refer to Thread to core pinning in HPL and CALDGEMM to optimally pin the GPU runtime, MPI, and LASWP threads to CPU cores. Important HPL GPU and CALDGEMM options gives an overview of "all" important settings required for HPL-GPU and CALDGEMM with the generic configuration.


5) Reference Information


As a rough guideline, you can reach:

  • 75% - 85% of the theoretical peak performance in the GPU DGEMM kernel.
  • CALDGEMM should be able to maintain about 98% of this kernel performance as DGEMM system performance on a single GPU.
  • The loss in multi-GPU runs is usually 2% for dual-GPU and 4% for quad-GPU setups for a reasonably large matrix.
  • With lookahead enabled and a reasonably large matrix size, the performance loss of HPL compared to DGEMM is in the order of 10%.
  • The loss when going from a single node to a multi-node MPI run is usually between 5% and 10%.

In the HPL-GPU repository (/setup directory), there are a couple of reference configuration files optimized for several systems we have set up. The older examples use the legacy configuration; the new examples for the ASUS ESC8000 server use the new generic configuration.


6) Other rarely used options


6a) Additional dynamic parameter adaptation


It is possible that the optimal HPL parameters (the standard parameters like nbmin and nbdiv) change during the run. In particular on AMD CPU systems, you might want to use parameters which require less memory bandwidth at the beginning, but parameters which result in a faster factorization towards the end of the run. Make.Generic.Options contains an example of how to change these parameters dynamically.

6b) Disabling lookahead


On AMD CPU systems it might be beneficial to turn off the lookahead features (level 1 and 2) during the run to save bandwidth. On Intel CPU systems this has never been necessary so far. The relevant parameters are -DHPL_DISABLE_LOOKAHEAD=[n] and -DHPL_LOOKAHEAD2_TURNOFF=[n]. They must be tuned in the same way as the offload parameters in section 3a). However, there is one difference: instead of the metrics above, you have to minimize the wall time of the total iteration, shown in a line which looks like: Timer ITERATION (22) CPU Time 70.39200 Wall Time 4.84046.
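
A hypothetical HPL-GPU.conf fragment (the matrix-size thresholds are placeholders that must be determined experimentally as just described):

HPL_DISABLE_LOOKAHEAD: 20000       // threshold matrix size at which lookahead is switched off (placeholder)
HPL_LOOKAHEAD2_TURNOFF: 30000      // threshold for switching off lookahead level 2 (placeholder)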
