HPL Compile Time Options
-DHPL_COPY_L
Force the copy of the panel L before the broadcast
-DHPL_NO_MPI_DATATYPE
Do not use custom MPI types
-DHPL_DETAILED_TIMING
Enable detailed timers
-DHPL_DETAILED2_TIMING
Enable more detailed timers for the lookahead pipeline
-DHPL_CALL_CALDGEMM
Use CALDGEMM for DGEMM calls
-DTRACE_CALLS
Function-level tracing for calls that might be relevant for optimization (implies HPL_GPU_NOT_QUIET)
-DUSE_ORIGINAL_LASWP
Use the original LASWP implementation
-DTRACE_LASWP
Dump data for LASWPs
-DHPL_FASTINIT
Fast initialization of input matrices for tuning runs
-DHPL_FASTVERIFY
Use the fast-initialization random number generator for verification
-DHPL_PAGELOCKED_MEM
Allocate the memory page-locked
-DHPL_HUGE_TABLES
Allocate the memory using huge page tables
-DHPL_GPU_TIMING
Force display of CALDGEMM timing data without HPL_GPU_NOT_QUIET
-DHPL_GPU_NOT_QUIET
Do not set the quiet parameter for CALDGEMM (will also display timing)
-DHPL_GPU_PERFORMANCE_WARNINGS
Print performance warnings for suboptimal CALDGEMM execution
-DHPL_SEND_U_PADDING
Transmit the padding of the U matrix; unsafe if transferring only parts of U
-DHPL_GPU_VERIFY
Verify the results of CALDGEMM calls
-DCALDGEMM_TEST
Activate test debug code
-DHPL_PRINT_INTERMEDIATE
Print intermediate performance results
-DHPL_PRINT_AVG_MATRIX_SIZE
Show how much memory the matrix uses per node on average
-DHPL_PRINT_THROTTLING_NODES=<GPU CLOCK>
Set the GPU clock so CALDGEMM knows which reference clock to compare against (for detecting throttling nodes)
-DHPL_NO_MPI_THREAD_CHECK
HPL will not check whether the MPI library has sufficient threading capabilities, but will just call MPI_Init
-DHPL_START_PERCENTAGE=<float>
Approximate percentage of the runtime at which to start the factorization, achieved by skipping columns in the matrix. Since one cannot easily start at an arbitrary N, this is not exact.
-DHPL_END_N=<int>
Abort the HPL run after factorizing n columns
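Both options are handy for shortened tuning runs. A minimal sketch, assuming the usual HPL Make.<arch> / CCFLAGS build setup (the values are illustrative only):

  CCFLAGS += -DHPL_START_PERCENTAGE=50.0 -DHPL_END_N=20000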
-DHPL_NO_HACKED_LIB
Do not use the hacked ATI lib
-DHPL_HAVE_PREFETCHW
Assume the CPU supports the prefetchw instruction (available on AMD CPUs), which makes some prefetches more efficient
-DHPL_NO_MPI_LIB
Build without an MPI library; only single-node runs are possible
-DHPL_GPU_MAX_NB
Set the maximum NB for GPU HPL (default: 1024)
-DHPL_SLOW_CPU
Use special code paths optimized for slow CPUs and a GPU
-DHPL_FAST_GPU
Similar to HPL_SLOW_CPU, but better suited for a medium CPU and a fast GPU
-DHPL_RESTRICT_CPUS=
Restrict the CPU threads used for factorization with lookahead; set to NO (0), YES (1), DYNAMIC (2), or DYNAMIC WITH CUT (3).
-DHPL_RESTRICT_CALLBACK(matrix_n)
Restrict CPU threads as above, based on the return value of this macro
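A sketch of how both variants might be set in Make.<arch>, assuming the callback returns the same 0-3 codes as HPL_RESTRICT_CPUS (the threshold is illustrative):

  CCFLAGS += -DHPL_RESTRICT_CPUS=2
  # Alternatively, via the callback macro:
  CCFLAGS += '-DHPL_RESTRICT_CALLBACK(matrix_n)=((matrix_n) > 50000 ? 2 : 0)'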
-DHPL_MULTI_GPU
Use multiple GPUs
-DHPL_GPU_MAPPING="{x,y,z}"
Map GPU 0 to core x, GPU 1 to core y, and so on
-DHPL_GPU_POSTPROCESS_MAPPING="C{x,y,z}"
Map the postprocessing thread of GPU 0 to core x, of GPU 1 to core y, and so on
-DHPL_GPU_ALLOC_MAPPING="{x,y,z}"
Same as above for allocation
-DHPL_GPU_DMA_MAPPING="{x,y,z}"
Same as above for DMA
-DHPL_GPU_EXCLUDE_CORES="{x,y}"
Exclude CPU cores x,y from processing completely
-DHPL_GPU_DEVICE_IDS="{x,y}"
Device IDs of GPUs to use
-DHPL_GPU_PIN_MAIN=i
CPU core for main thread
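A sketch of these mapping options in Make.<arch> (core and device numbers are illustrative; the brace lists need quoting so the shell does not expand them):

  CCFLAGS += -DHPL_MULTI_GPU
  CCFLAGS += '-DHPL_GPU_MAPPING={0,4}' '-DHPL_GPU_DEVICE_IDS={0,1}'
  CCFLAGS += -DHPL_GPU_PIN_MAIN=2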
-DHPL_DISABLE_LOOKAHEAD=n
Disable the lookahead algorithm as soon as the global trailing matrix size (dimension n) hits n
-DHPL_LOOKAHEAD2_TURNOFF=n
Same for lookahead 2
-DHPL_HALF_BLOCKING=n
Value of n at which only half the blocking size is used
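For example, these thresholds might be combined as follows (the values are purely illustrative, not tuned):

  CCFLAGS += -DHPL_DISABLE_LOOKAHEAD=8192 -DHPL_LOOKAHEAD2_TURNOFF=16384
  CCFLAGS += -DHPL_HALF_BLOCKING=4096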
-DHPL_USE_ALL_CORES_FOR_LASWP
Use all cores for LASWP; lookahead 2 will not work well this way
-DHPL_GPU_THREADSAVE_DRIVER
Assume the GPU driver to be thread-safe
-DHPL_GPU_GLOBAL_DRIVER_MUTEX
The opposite: use one global mutex to protect all AMD driver calls
-DHPL_CUSTOM_PARAMETER_CHANGE
Custom code executed every iteration; good for changing factorization parameters
-DHPL_CUSTOM_PARAMETER_CHANGE_CALDGEMM
As above, but executed before starting the CALDGEMM DGEMM, so it can be used to alter CALDGEMM settings at runtime
-DHPL_GPU_EXTRA_CALDGEMM_OPTIONS
Extra source code that sets CALDGEMM options, e.g. "cal_info.OutputThreads = 2;"
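Since the define expands to source code, it must be quoted as a single argument when set in Make.<arch>. A minimal sketch using the example from above:

  CCFLAGS += '-DHPL_GPU_EXTRA_CALDGEMM_OPTIONS=cal_info.OutputThreads = 2;'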
-DHPL_CALDGEMM_CBLAS_WRAPPER
Wrap CBLAS calls through CALDGEMM, reducing the number of threads for small BLAS operations for better performance (only for OpenMP-based BLAS libraries)
-DHPL_INTERLEAVE_MEMORY
Interleave memory between NUMA nodes
-DHPL_INTERLEAVE_C
As above, but do not change the memory policy; instead, only interleave the C matrix
-DHPL_REGISTER_MEMORY
Register memory for fast GPU access
-DHPL_EMULATE_MULTINODE
Report a multi-node run to CALDGEMM and provide a fake function for the broadcast callback, to simulate the loss due to communication overhead
-DHPL_PAUSE=n
Sleep for n seconds during the iterations (the duration is gradually decreased during the run). Useful if the hardware overheats. Timers are stopped during this time. Clearly not valid for official results.
-DHPL_ALTERNATE_LOOKAHEAD=n
Use the CALDGEMM AlternateLookahead feature with setting n
-DHPL_CALDGEMM_BACKEND=x
Use the CALDGEMM backend x (default: cal)
-DHPL_LOOKAHEAD_2B
Enable lookahead mode 2b
-DHPL_LOOKAHEAD_2B_FIXED_STEPSIZE=n
Use a fixed starting step size for lookahead 2b instead of doing an MPI_Allreduce to determine the minimum step size
-DHPL_LOOKAHEAD_2B_MULTIPLIER=n
Multiplier for the step size in each iteration (default: 3)
-DHPL_GPU_TEMPERATURE_THRESHOLD=temp
GPU temperature threshold at which the run is stopped
-DHPL_MPI_WRAPPERS
Use MPI wrappers to limit the maximum MPI message size. The message size can be controlled via HPL_MAX_MPI_SEND_SIZE and HPL_MAX_MPI_BCAST_SIZE
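A sketch of a possible combination (the size values, and their unit, assumed here to be bytes, are illustrative assumptions):

  CCFLAGS += -DHPL_MPI_WRAPPERS
  CCFLAGS += -DHPL_MAX_MPI_SEND_SIZE=16777216 -DHPL_MAX_MPI_BCAST_SIZE=16777216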
-DHPL_COPYL_DURING_FACT
Perform copyL (if necessary) during factorization to allow for multithreading
-DHPL_ASYNC_DLATCPY
Perform DLATCPY asynchronously during DGEMM execution
-DHPL_MPI_AFFINITY
Set affinity for threads created during MPI init (hopefully all MPI threads); syntax as for the GPU mapping
-DHPL_EXCLUDE_FROM_LASWP=...
Exclude cores from LASWP; syntax as for the GPU mapping
-DHPL_DURATION_FIND_HELPER
HPL outputs a Linux timestamp before and after the actual calculation. In addition, it adds an idle pause of 10 seconds before and after the run; this helps to find the exact duration in a power log etc.
-DHPL_CALDGEMM_ASYNC_FACT_DGEMM=m
Use the async CALDGEMM queue for DGEMM during factorization as soon as the local trailing matrix size (dimension m) is below m. (Note that this corresponds to matrix_n in CALDGEMM due to the column-/row-major switch.)
-DHPL_CALDGEMM_ASYNC_FACT_FIRST
Always use CALDGEMM DTRSM and DGEMM in the first, synchronous iteration.
-DHPL_CALDGEMM_ASYNC_DTRSM_DGEMM
Use the async CALDGEMM queue for the large DTRSM in the update step, with DTRTRI / DGEMM emulation
-DHPL_CALDGEMM_ASYNC_DTRSM=m
Use the async CALDGEMM queue for the large DTRSM in the update step (not in the factorization) without DTRTRI / DGEMM emulation (should make the above obsolete); takes the same setting m as HPL_CALDGEMM_ASYNC_FACT_DGEMM
-DHPL_CALDGEMM_ASYNC_FACT_DTRSM=m
Same as above, but during the factorization
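For instance, the async queue thresholds might be set together like this (the threshold values are illustrative):

  CCFLAGS += -DHPL_CALDGEMM_ASYNC_FACT_DGEMM=2048 -DHPL_CALDGEMM_ASYNC_FACT_DTRSM=2048
  CCFLAGS += -DHPL_CALDGEMM_ASYNC_DTRSM=2048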
-DHPL_WARMUP
Do a warmup run before the actual factorization to make sure all initialization is done
-DHPL_PRINT_CONFIG
Print the config options at the start of the run
-DHPL_CPUFREQ
Add an option to change the CPU frequency
-DHPL_CALDGEMM_PARAM
Command line options passed to CALDGEMM in dgemm_bench format
-DHPL_GPU_RUNTIME_CONFIG
Read the runtime config file "HPL-GPU.conf" when the run starts. This can be used to set / override compile-time settings
-DHPL_NUM_LASWP_CORES
Number of CPU cores to use for LASWP
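As a closing illustration, a hypothetical Make.<arch> fragment combining several of the options above for a GPU tuning run (all flags are taken from this page, but the combination and values are illustrative, not tuned recommendations):

  CCFLAGS += -DHPL_CALL_CALDGEMM -DHPL_MULTI_GPU -DHPL_PRINT_CONFIG
  CCFLAGS += -DHPL_FASTINIT -DHPL_FASTVERIFY -DHPL_PRINT_INTERMEDIATE
  CCFLAGS += -DHPL_PAGELOCKED_MEM -DHPL_REGISTER_MEMORY -DHPL_INTERLEAVE_MEMORY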