HPL Compile Time Options

David Rohr edited this page Aug 21, 2015 · 2 revisions
  • -DHPL_COPY_L

Force a copy of panel L before the broadcast

  • -DHPL_NO_MPI_DATATYPE

Do not use custom MPI types

  • -DHPL_DETAILED_TIMING

Enable detailed timers

  • -DHPL_DETAILED2_TIMING

Enable more detailed timers for the lookahead pipeline

  • -DHPL_CALL_CALDGEMM

Use caldgemm for DGEMM calls

  • -DTRACE_CALLS

function level tracing for calls that might be relevant for optimization (implies HPL_GPU_NOT_QUIET)

  • -DUSE_ORIGINAL_LASWP

use original laswp implementation

  • -DTRACE_LASWP

Dump data for LASWPs

  • -DHPL_FASTINIT

Fast initialization of input matrices for tuning runs

  • -DHPL_FASTVERIFY

Use Fast initialization random number generator for verification

  • -DHPL_PAGELOCKED_MEM

Allocate the memory pagelocked

  • -DHPL_HUGE_TABLES

Allocate the memory using huge tables

  • -DHPL_GPU_TIMING

Force Display of CALDGEMM Timing Data without HPL_GPU_NOT_QUIET

  • -DHPL_GPU_NOT_QUIET

Do not set quiet parameter for CALDGEMM (Will also display timing)

  • -DHPL_GPU_PERFORMANCE_WARNINGS

Print performance warnings for suboptimal CALDGEMM execution

  • -DHPL_SEND_U_PADDING

Transmit the padding of the U matrix; unsafe if transferring only parts of U

  • -DHPL_GPU_VERIFY

Verify result of caldgemm calls

  • -DCALDGEMM_TEST

Activate Test Debug Code

  • -DHPL_PRINT_INTERMEDIATE

print intermediate performance results

  • -DHPL_PRINT_AVG_MATRIX_SIZE

show how much memory the matrix uses per node on average

  • -DHPL_PRINT_THROTTLING_NODES=<GPU CLOCK>

set the gpu clock so caldgemm knows which reference to compare with

  • -DHPL_NO_MPI_THREAD_CHECK

HPL will not check whether the MPI library has sufficient threading capabilities, but will just call MPI_Init

  • -DHPL_START_PERCENTAGE=<float>

Approximate percentage of the runtime at which to start the factorization, by skipping columns of the matrix. Because the run cannot easily start at an arbitrary N, this is not exact.

  • -DHPL_END_N=<int>

Abort HPL run after factorizing n columns

  • -DHPL_NO_HACKED_LIB

Do not use the hacked ATI lib

  • -DHPL_HAVE_PREFETCHW

AMD CPUs have a prefetchw instruction which makes some prefetches more efficient.

  • -DHPL_NO_MPI_LIB

Build without an MPI library; only single-node runs are possible.

  • -DHPL_GPU_MAX_NB

Set max NB for GPU HPL (default 1024)

  • -DHPL_SLOW_CPU

Use special code paths optimized for slow CPUs and a GPU

  • -DHPL_FAST_GPU

Similar to slowCPU, better suited for medium CPU and fast GPU

  • -DHPL_RESTRICT_CPUS=

Restrict CPU threads used for factorization with lookahead, set to NO (0), YES (1), DYNAMIC (2), or DYNAMIC WITH CUT (3).

  • -DHPL_RESTRICT_CALLBACK(matrix_n)

Restrict CPU threads as above by the return value of the macro
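
The relation between the two options above can be sketched as follows. This is only an illustration: the enum names, the helper functions, and the example callback with its threshold of 30000 are hypothetical, and only the four mode values (0 to 3) come from the description of HPL_RESTRICT_CPUS.

```c
#include <assert.h>

/* Mode values as listed for HPL_RESTRICT_CPUS; the enum names are illustrative. */
enum restrict_mode {
    RESTRICT_NO = 0,
    RESTRICT_YES = 1,
    RESTRICT_DYNAMIC = 2,
    RESTRICT_DYNAMIC_WITH_CUT = 3
};

/* Hypothetical callback, normally supplied on the compiler command line.
   The threshold 30000 is an arbitrary example value. */
#ifndef HPL_RESTRICT_CALLBACK
#define HPL_RESTRICT_CALLBACK(matrix_n) \
    ((matrix_n) > 30000 ? RESTRICT_DYNAMIC : RESTRICT_NO)
#endif

/* Evaluate the callback macro for a given matrix size and validate the result. */
static int restrict_mode_for(int matrix_n)
{
    int mode = HPL_RESTRICT_CALLBACK(matrix_n);
    assert(mode >= RESTRICT_NO && mode <= RESTRICT_DYNAMIC_WITH_CUT);
    return mode;
}

/* Human-readable name of a mode, for example for logging. */
static const char *restrict_mode_name(int mode)
{
    static const char *const names[] = {
        "NO", "YES", "DYNAMIC", "DYNAMIC WITH CUT"
    };
    assert(mode >= 0 && mode <= 3);
    return names[mode];
}
```

With the example callback, `restrict_mode_for(50000)` selects DYNAMIC and `restrict_mode_for(1000)` selects NO. Whether HPL-GPU evaluates the callback once or in every iteration is not stated here.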

  • -DHPL_MULTI_GPU

Use multiple GPUs

  • -DHPL_GPU_MAPPING="{x,y,z}"

Map gpu 0 to core x, 1 to y and so on
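
A brace-list macro of this kind can be used directly as an array initializer. The sketch below assumes that is how the value is consumed; the array name, the helper functions, and the default cores 0, 8, 16 are made up for illustration. When the flag is passed through a shell, the quotes are typically stripped (and prevent brace expansion), so the preprocessor sees `{x,y,z}`.

```c
#include <assert.h>
#include <stddef.h>

/* Example default; in a real build the value would come from
   -DHPL_GPU_MAPPING="{x,y,z}" on the compiler command line. */
#ifndef HPL_GPU_MAPPING
#define HPL_GPU_MAPPING {0, 8, 16}
#endif

/* The macro expands to a brace list, so it can initialize an array. */
static const int gpu_mapping[] = HPL_GPU_MAPPING;

/* Number of GPUs for which a core is given. */
static size_t gpu_mapping_count(void)
{
    return sizeof(gpu_mapping) / sizeof(gpu_mapping[0]);
}

/* CPU core assigned to the given GPU index. */
static int core_for_gpu(size_t gpu)
{
    assert(gpu < gpu_mapping_count());
    return gpu_mapping[gpu];
}
```

With the example default, GPU 0 maps to core 0, GPU 1 to core 8, and GPU 2 to core 16.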

  • -DHPL_GPU_POSTPROCESS_MAPPING="C{x,y,z}"

Map gpu postprocess 0 to core x, 1 to y and so on

  • -DHPL_GPU_ALLOC_MAPPING="{x,y,z}"

Same as above for allocation

  • -DHPL_GPU_DMA_MAPPING="{x,y,z}"

Same as above for DMA

  • -DHPL_GPU_EXCLUDE_CORES="{x,y}"

Exclude CPU cores x,y from processing completely

  • -DHPL_GPU_DEVICE_IDS="{x,y}"

Device IDs of GPUs to use

  • -DHPL_GPU_PIN_MAIN=i

CPU core for main thread

  • -DHPL_DISABLE_LOOKAHEAD=n

Disable lookahead algorithm as soon as global trailing matrix size (dim n) hits n

  • -DHPL_LOOKAHEAD2_TURNOFF=n

Same for lookahead2

  • -DHPL_HALF_BLOCKING=n

n at which only half the blocking size is used
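
The size thresholds above can be read as simple comparisons against the shrinking trailing matrix. The following sketch assumes that reading: lookahead stays on while the trailing size is above HPL_DISABLE_LOOKAHEAD, and the blocking size is halved once the trailing size is at or below HPL_HALF_BLOCKING. The function names and the example values 6000 and 3000 are hypothetical.

```c
#include <assert.h>

/* Example values; real values would be passed with -D at compile time. */
#ifndef HPL_DISABLE_LOOKAHEAD
#define HPL_DISABLE_LOOKAHEAD 6000
#endif
#ifndef HPL_HALF_BLOCKING
#define HPL_HALF_BLOCKING 3000
#endif

/* Assumption: lookahead is used only while the global trailing
   matrix size is still larger than the threshold. */
static int lookahead_enabled(int trailing_n)
{
    return trailing_n > HPL_DISABLE_LOOKAHEAD;
}

/* Assumption: once the trailing size reaches the threshold,
   half of the configured blocking size nb is used. */
static int effective_nb(int trailing_n, int nb)
{
    assert(nb > 0);
    return trailing_n <= HPL_HALF_BLOCKING ? nb / 2 : nb;
}
```

With these example values, a trailing size of 10000 still uses lookahead and the full NB, while a trailing size of 2000 uses neither lookahead nor the full NB.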

  • -DHPL_USE_ALL_CORES_FOR_LASWP

Use all cores for LASWP, lookahead 2 will not work well this way

  • -DHPL_GPU_THREADSAVE_DRIVER

Assume the GPU driver to be thread-safe

  • -DHPL_GPU_GLOBAL_DRIVER_MUTEX

The opposite: use one global mutex to protect all AMD driver calls

  • -DHPL_CUSTOM_PARAMETER_CHANGE

Custom code executed every iteration, good for changing factorization parameters

  • -DHPL_CUSTOM_PARAMETER_CHANGE_CALDGEMM

As above, but executed before starting the caldgemm dgemm, so it can be used to alter caldgemm settings at runtime

  • -DHPL_GPU_EXTRA_CALDGEMM_OPTIONS

Extra source code that defines caldgemm options e.g. "cal_info.OutputThreads = 2;"
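
A macro that carries source code is usually expanded inside the function that fills in the configuration. The sketch below shows that pattern under stated assumptions: `struct example_caldgemm_config` and `configure_caldgemm` are hypothetical stand-ins, not the real caldgemm types. Only the statement `cal_info.OutputThreads = 2;` is taken from the description above.

```c
#include <assert.h>

/* Hypothetical stand-in for the caldgemm configuration structure. */
struct example_caldgemm_config {
    int OutputThreads;
};

/* Example default; in a real build the statement(s) would be passed as
   -DHPL_GPU_EXTRA_CALDGEMM_OPTIONS="cal_info.OutputThreads = 2;" */
#ifndef HPL_GPU_EXTRA_CALDGEMM_OPTIONS
#define HPL_GPU_EXTRA_CALDGEMM_OPTIONS cal_info.OutputThreads = 2;
#endif

/* Fill in defaults, then let the macro override individual fields. */
static struct example_caldgemm_config configure_caldgemm(void)
{
    struct example_caldgemm_config cal_info = { 1 };  /* default: 1 thread */
    HPL_GPU_EXTRA_CALDGEMM_OPTIONS                    /* expands to statements */
    assert(cal_info.OutputThreads > 0);
    return cal_info;
}
```

Because the macro expands to plain statements, the variable name used in it (`cal_info` here) has to match the name used at the expansion site.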

  • -DHPL_CALDGEMM_CBLAS_WRAPPER

Wrap CBLAS calls through CALDGEMM reducing the number of threads for small BLAS operations for better performance (only for OpenMP based BLAS libraries)

  • -DHPL_INTERLEAVE_MEMORY

Interleave memory between NUMA nodes

  • -DHPL_INTERLEAVE_C

As above, but do not change memory policy. Instead only interleave the C matrix

  • -DHPL_REGISTER_MEMORY

Register Memory for fast GPU access

  • -DHPL_EMULATE_MULTINODE

Report a multi node run to caldgemm and provide a fake function for the broadcast callback to simulate the loss due to communication overhead

  • -DHPL_PAUSE=n

Sleep during the iterations for n seconds (the duration is gradually decreased during the run). Useful if the hardware overheats. Timers are stopped during this time. Clearly not valid for official results.

  • -DHPL_ALTERNATE_LOOKAHEAD=n

Use caldgemm AlternateLookahead feature with setting n

  • -DHPL_CALDGEMM_BACKEND=x

Use the CALDGEMM backend x, default: cal

  • -DHPL_LOOKAHEAD_2B

Enable lookahead mode 2b

  • -DHPL_LOOKAHEAD_2B_FIXED_STEPSIZE=n

Use a fixed starting stepsize for lookahead 2b instead of doing an mpi_allreduce to determine the minimum stepsize

  • -DHPL_LOOKAHEAD_2B_MULTIPLIER=n

Multiplier of stepsize in each iteration (default 3)
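
As a rough illustration of how a fixed starting stepsize and a multiplier interact, the sketch below assumes the stepsize is simply multiplied by the multiplier once per iteration. The function name and the example starting value 4 are hypothetical; the default multiplier 3 is the one stated above.

```c
#include <assert.h>

/* Example starting stepsize; a real build would pass its own value. */
#ifndef HPL_LOOKAHEAD_2B_FIXED_STEPSIZE
#define HPL_LOOKAHEAD_2B_FIXED_STEPSIZE 4
#endif
/* Default multiplier as documented. */
#ifndef HPL_LOOKAHEAD_2B_MULTIPLIER
#define HPL_LOOKAHEAD_2B_MULTIPLIER 3
#endif

/* Assumption: stepsize after `iteration` multiplications is
   start * multiplier^iteration. */
static long lookahead2b_stepsize(int iteration)
{
    long step = HPL_LOOKAHEAD_2B_FIXED_STEPSIZE;
    assert(iteration >= 0);
    for (int i = 0; i < iteration; i++)
        step *= HPL_LOOKAHEAD_2B_MULTIPLIER;
    return step;
}
```

With these values the stepsize grows geometrically: 4, 12, 36, 108, and so on.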

  • -DHPL_GPU_TEMPERATURE_THRESHOLD=temp

Temperature threshold where the run is stopped

  • -DHPL_MPI_WRAPPERS

Use MPI Wrappers to limit Max MPI Message size. Message size can be controlled via HPL_MAX_MPI_SEND_SIZE and HPL_MAX_MPI_BCAST_SIZE
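
Limiting the message size amounts to splitting one large transfer into several smaller ones. The sketch below shows that idea without calling MPI at all: `send_fn`, `send_in_chunks`, and `count_bytes` are hypothetical names, the limit of 1024 bytes is an example value, and treating HPL_MAX_MPI_SEND_SIZE as a compile-time macro is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Example limit in bytes; assumed here to be a compile-time macro. */
#ifndef HPL_MAX_MPI_SEND_SIZE
#define HPL_MAX_MPI_SEND_SIZE 1024
#endif

/* Hypothetical transport callback standing in for the real send call. */
typedef void (*send_fn)(const char *buf, size_t len, void *ctx);

/* Send `total` bytes in pieces no larger than the limit.
   Returns the number of send calls that were issued. */
static size_t send_in_chunks(const char *buf, size_t total,
                             send_fn send, void *ctx)
{
    size_t sent = 0, calls = 0;
    assert(send != NULL);
    while (sent < total) {
        size_t len = total - sent;
        if (len > (size_t)HPL_MAX_MPI_SEND_SIZE)
            len = (size_t)HPL_MAX_MPI_SEND_SIZE;
        send(buf + sent, len, ctx);
        sent += len;
        calls++;
    }
    return calls;
}

/* Example transport: only adds up the number of bytes it was given. */
static void count_bytes(const char *buf, size_t len, void *ctx)
{
    (void)buf;
    *(size_t *)ctx += len;
}
```

With a limit of 1024 bytes, a 2500-byte buffer is sent as three pieces of 1024, 1024, and 452 bytes. A real wrapper would call the MPI send or broadcast routine in place of `count_bytes`.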

  • -DHPL_COPYL_DURING_FACT

Perform copyL (if necessary) during factorization to allow for multithreading

  • -DHPL_ASYNC_DLATCPY

Perform DLATCPY asynchronously during DGEMM execution

  • -DHPL_MPI_AFFINITY

Set affinity for threads created during MPI init (hopefully all mpi threads) (syntax like gpu mapping)

  • -DHPL_EXCLUDE_FROM_LASWP=...

Exclude cores from laswp, syntax as for GPU mapping

  • -DHPL_DURATION_FIND_HELPER

HPL outputs a Linux timestamp before and after the actual calculation. In addition, it adds an idle pause of 10 seconds before and after the run, which helps to find the exact duration in a power log etc.

  • -DHPL_CALDGEMM_ASYNC_FACT_DGEMM=m

Use async CALDGEMM queue for DGEMM during factorization as soon as local trailing matrix size (dim m) is below m (Consider this corresponds to matrix_n in caldgemm due to col/row major switch).

  • -DHPL_CALDGEMM_ASYNC_FACT_FIRST

Always use CALDGEMM DTRSM and DGEMM in the first, synchronous iteration.

  • -DHPL_CALDGEMM_ASYNC_DTRSM_DGEMM

Use async CALDGEMM queue for large DTRSM in update step with DTRTRI / DGEMM emulation

  • -DHPL_CALDGEMM_ASYNC_DTRSM=m

Use async CALDGEMM queue for large DTRSM in update step (not in factorization) without DTRTRI / DGEMM emulation (should make the above obsolete), has same setting m as HPL_CALDGEMM_ASYNC_FACT_DGEMM

  • -DHPL_CALDGEMM_ASYNC_FACT_DTRSM=m

Same as above during factorization

  • -DHPL_WARMUP

Do a warmup run before the actual factorization to make sure all initialization is done

  • -DHPL_PRINT_CONFIG

Print Config Options at start of run

  • -DHPL_CPUFREQ

Add option to change cpu frequency

  • -DHPL_CALDGEMM_PARAM

Command line options passed to caldgemm in dgemm_bench format

  • -DHPL_GPU_RUNTIME_CONFIG

Read runtime config file "HPL-GPU.conf" when the run starts. This can be used to set / override compile time settings

  • -DHPL_NUM_LASWP_CORES

Number of CPU cores to use for LASWP