HPL Compile Time Options
-DHPL_COPY_L
Force the copy of the panel L before the broadcast
-DHPL_NO_MPI_DATATYPE
Do not use custom MPI types
-DHPL_DETAILED_TIMING
Enable detailed timers
-DHPL_DETAILED2_TIMING
Enable more detailed timers for the lookahead pipeline
-DHPL_CALL_CALDGEMM
Use CALDGEMM for DGEMM calls
-DTRACE_CALLS
Function-level tracing for calls that might be relevant for optimization (implies HPL_GPU_NOT_QUIET)
-DUSE_ORIGINAL_LASWP
Use the original LASWP implementation
-DTRACE_LASWP
Dump data for LASWPs
-DHPL_FASTINIT
Fast initialization of input matrices for tuning runs
-DHPL_FASTVERIFY
Use the fast-initialization random number generator for verification
-DHPL_PAGELOCKED_MEM
Allocate the memory page-locked
-DHPL_HUGE_TABLES
Allocate the memory using huge page tables
-DHPL_GPU_TIMING
Force display of CALDGEMM timing data without HPL_GPU_NOT_QUIET
-DHPL_GPU_NOT_QUIET
Do not set the quiet parameter for CALDGEMM (will also display timing)
-DHPL_GPU_PERFORMANCE_WARNINGS
Print performance warnings for suboptimal CALDGEMM execution
-DHPL_SEND_U_PADDING
Transmit the padding of the U matrix; unsafe if transferring only parts of U
-DHPL_GPU_VERIFY
Verify the results of CALDGEMM calls
-DCALDGEMM_TEST
Activate test debug code
-DHPL_PRINT_INTERMEDIATE
Print intermediate performance results
-DHPL_PRINT_AVG_MATRIX_SIZE
Show how much memory the matrix uses per node on average
-DHPL_PRINT_THROTTLING_NODES=<GPU CLOCK>
Set the GPU clock so CALDGEMM knows which reference clock to compare against (for detecting throttling nodes)
-DHPL_NO_MPI_THREAD_CHECK
HPL will not check whether the MPI library has sufficient threading capabilities, but will just call MPI_Init
-DHPL_START_PERCENTAGE=<float>
Approximate percentage of the runtime at which to start the factorization, achieved by skipping columns in the matrix. Since one cannot easily start at an arbitrary N, this is not exact.
-DHPL_END_N=<int>
Abort the HPL run after factorizing n columns
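Both options are handy for shortened tuning runs. A minimal sketch, assuming the usual HPL Make.<arch> / CCFLAGS build setup (the values are illustrative only):

  CCFLAGS += -DHPL_START_PERCENTAGE=50.0 -DHPL_END_N=20000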
-DHPL_NO_HACKED_LIB
Do not use the hacked ATI lib
-DHPL_HAVE_PREFETCHW
Assume the CPU supports the prefetchw instruction (available on AMD CPUs), which makes some prefetches more efficient
-DHPL_NO_MPI_LIB
Build without an MPI library; only single-node runs are possible
-DHPL_GPU_MAX_NB
Set the maximum NB for GPU HPL (default: 1024)
-DHPL_SLOW_CPU
Use special code paths optimized for slow CPUs and a GPU
-DHPL_FAST_GPU
Similar to HPL_SLOW_CPU, but better suited for a medium CPU and a fast GPU
-DHPL_RESTRICT_CPUS=
Restrict the CPU threads used for factorization with lookahead; set to NO (0), YES (1), DYNAMIC (2), or DYNAMIC WITH CUT (3).
-DHPL_RESTRICT_CALLBACK(matrix_n)
Restrict CPU threads as above, based on the return value of this macro
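A sketch of how both variants might be set in Make.<arch>, assuming the callback returns the same 0-3 codes as HPL_RESTRICT_CPUS (the threshold is illustrative):

  CCFLAGS += -DHPL_RESTRICT_CPUS=2
  # Alternatively, via the callback macro:
  CCFLAGS += '-DHPL_RESTRICT_CALLBACK(matrix_n)=((matrix_n) > 50000 ? 2 : 0)'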
-DHPL_MULTI_GPU
Use multiple GPUs
-DHPL_GPU_MAPPING="{x,y,z}"
Map GPU 0 to core x, GPU 1 to core y, and so on
-DHPL_GPU_POSTPROCESS_MAPPING="C{x,y,z}"
Map the postprocessing thread of GPU 0 to core x, of GPU 1 to core y, and so on
-DHPL_GPU_ALLOC_MAPPING="{x,y,z}"
Same as above for allocation
-DHPL_GPU_DMA_MAPPING="{x,y,z}"
Same as above for DMA
-DHPL_GPU_EXCLUDE_CORES="{x,y}"
Exclude CPU cores x,y from processing completely
-DHPL_GPU_DEVICE_IDS="{x,y}"
Device IDs of GPUs to use
-DHPL_GPU_PIN_MAIN=i
CPU core for main thread
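A sketch of these mapping options in Make.<arch> (core and device numbers are illustrative; the brace lists need quoting so the shell does not expand them):

  CCFLAGS += -DHPL_MULTI_GPU
  CCFLAGS += '-DHPL_GPU_MAPPING={0,4}' '-DHPL_GPU_DEVICE_IDS={0,1}'
  CCFLAGS += -DHPL_GPU_PIN_MAIN=2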
-DHPL_DISABLE_LOOKAHEAD=n
Disable the lookahead algorithm as soon as the global trailing matrix size (dimension n) hits n
-DHPL_LOOKAHEAD2_TURNOFF=n
Same for lookahead 2
-DHPL_HALF_BLOCKING=n
Value of n at which only half the blocking size is used
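For example, these thresholds might be combined as follows (the values are purely illustrative, not tuned):

  CCFLAGS += -DHPL_DISABLE_LOOKAHEAD=8192 -DHPL_LOOKAHEAD2_TURNOFF=16384
  CCFLAGS += -DHPL_HALF_BLOCKING=4096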
-DHPL_USE_ALL_CORES_FOR_LASWP
Use all cores for LASWP; lookahead 2 will not work well this way
-DHPL_GPU_THREADSAVE_DRIVER
Assume the GPU driver to be thread-safe
-DHPL_GPU_GLOBAL_DRIVER_MUTEX
The opposite: use one global mutex to protect all AMD driver calls
-DHPL_CUSTOM_PARAMETER_CHANGE
Custom code executed every iteration; good for changing factorization parameters
-DHPL_CUSTOM_PARAMETER_CHANGE_CALDGEMM
As above, but executed before starting the CALDGEMM DGEMM, so it can be used to alter CALDGEMM settings at runtime
-DHPL_GPU_EXTRA_CALDGEMM_OPTIONS
Extra source code that sets CALDGEMM options, e.g. "cal_info.OutputThreads = 2;"
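Since the define expands to source code, it must be quoted as a single argument when set in Make.<arch>. A minimal sketch using the example from above:

  CCFLAGS += '-DHPL_GPU_EXTRA_CALDGEMM_OPTIONS=cal_info.OutputThreads = 2;'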
-DHPL_CALDGEMM_CBLAS_WRAPPER
Wrap CBLAS calls through CALDGEMM, reducing the number of threads for small BLAS operations for better performance (only for OpenMP-based BLAS libraries)
-DHPL_INTERLEAVE_MEMORY
Interleave memory between NUMA nodes
-DHPL_INTERLEAVE_C
As above, but do not change the memory policy; instead, only interleave the C matrix
-DHPL_REGISTER_MEMORY
Register memory for fast GPU access
-DHPL_EMULATE_MULTINODE
Report a multi-node run to CALDGEMM and provide a fake function for the broadcast callback, to simulate the loss due to communication overhead
-DHPL_PAUSE=n
Sleep for n seconds during the iterations (the duration is gradually decreased during the run). Useful if the hardware overheats. Timers are stopped during this time. Clearly not valid for official results.
-DHPL_ALTERNATE_LOOKAHEAD=n
Use the CALDGEMM AlternateLookahead feature with setting n
-DHPL_CALDGEMM_BACKEND=x
Use the CALDGEMM backend x (default: cal)
-DHPL_LOOKAHEAD_2B
Enable lookahead mode 2b
-DHPL_LOOKAHEAD_2B_FIXED_STEPSIZE=n
Use a fixed starting step size for lookahead 2b instead of doing an MPI_Allreduce to determine the minimum step size
-DHPL_LOOKAHEAD_2B_MULTIPLIER=n
Multiplier for the step size in each iteration (default: 3)
-DHPL_GPU_TEMPERATURE_THRESHOLD=temp
GPU temperature threshold at which the run is stopped
-DHPL_MPI_WRAPPERS
Use MPI wrappers to limit the maximum MPI message size. The message size can be controlled via HPL_MAX_MPI_SEND_SIZE and HPL_MAX_MPI_BCAST_SIZE
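A sketch of a possible combination (the size values, and their unit, assumed here to be bytes, are illustrative assumptions):

  CCFLAGS += -DHPL_MPI_WRAPPERS
  CCFLAGS += -DHPL_MAX_MPI_SEND_SIZE=16777216 -DHPL_MAX_MPI_BCAST_SIZE=16777216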
-DHPL_COPYL_DURING_FACT
Perform copyL (if necessary) during factorization to allow for multithreading
-DHPL_ASYNC_DLATCPY
Perform DLATCPY asynchronously during DGEMM execution
-DHPL_MPI_AFFINITY
Set affinity for threads created during MPI init (hopefully all MPI threads); syntax as for the GPU mapping
-DHPL_EXCLUDE_FROM_LASWP=...
Exclude cores from LASWP; syntax as for the GPU mapping
-DHPL_DURATION_FIND_HELPER
HPL outputs a Linux timestamp before and after the actual calculation. In addition, it adds an idle pause of 10 seconds before and after the run; this helps to find the exact duration in a power log etc.
-DHPL_CALDGEMM_ASYNC_FACT_DGEMM=m
Use the async CALDGEMM queue for DGEMM during factorization as soon as the local trailing matrix size (dimension m) is below m. (Note that this corresponds to matrix_n in CALDGEMM due to the column-/row-major switch.)
-DHPL_CALDGEMM_ASYNC_FACT_FIRST
Always use CALDGEMM DTRSM and DGEMM in the first, synchronous iteration.
-DHPL_CALDGEMM_ASYNC_DTRSM_DGEMM
Use the async CALDGEMM queue for the large DTRSM in the update step, with DTRTRI / DGEMM emulation
-DHPL_CALDGEMM_ASYNC_DTRSM=m
Use the async CALDGEMM queue for the large DTRSM in the update step (not in the factorization) without DTRTRI / DGEMM emulation (should make the above obsolete); takes the same setting m as HPL_CALDGEMM_ASYNC_FACT_DGEMM
-DHPL_CALDGEMM_ASYNC_FACT_DTRSM=m
Same as above, but during the factorization
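For instance, the async queue thresholds might be set together like this (the threshold values are illustrative):

  CCFLAGS += -DHPL_CALDGEMM_ASYNC_FACT_DGEMM=2048 -DHPL_CALDGEMM_ASYNC_FACT_DTRSM=2048
  CCFLAGS += -DHPL_CALDGEMM_ASYNC_DTRSM=2048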
-DHPL_WARMUP
Do a warmup run before the actual factorization to make sure all initialization is done
-DHPL_PRINT_CONFIG
Print the config options at the start of the run
-DHPL_CPUFREQ
Add an option to change the CPU frequency
-DHPL_CALDGEMM_PARAM
Command line options passed to CALDGEMM in dgemm_bench format
-DHPL_GPU_RUNTIME_CONFIG
Read the runtime config file "HPL-GPU.conf" when the run starts. This can be used to set / override compile-time settings
-DHPL_NUM_LASWP_CORES
Number of CPU cores to use for LASWP
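As a closing illustration, a hypothetical Make.<arch> fragment combining several of the options above for a GPU tuning run (all flags are taken from this page, but the combination and values are illustrative, not tuned recommendations):

  CCFLAGS += -DHPL_CALL_CALDGEMM -DHPL_MULTI_GPU -DHPL_PRINT_CONFIG
  CCFLAGS += -DHPL_FASTINIT -DHPL_FASTVERIFY -DHPL_PRINT_INTERMEDIATE
  CCFLAGS += -DHPL_PAGELOCKED_MEM -DHPL_REGISTER_MEMORY -DHPL_INTERLEAVE_MEMORY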