Throughput benchmark

Ye Luo edited this page Aug 21, 2020 · 2 revisions

Disclaimer

No benchmark is perfect. Every benchmark targets a specific need and must be well defined before results can be compared. The CORAL-2 benchmark is very different from the throughput benchmark I describe here.

Throughput benchmark

The number of samples generated in a given time is the key quantity for the throughput benchmark. Samples spent on equilibration are not part of the measurement.

Figure of merit

FOM is defined as workload divided by elapsed time.

Weak scaling

QMCPACK spends little time in MPI communication, so the full machine FOM = number of nodes x single-node FOM x MPI efficiency. We have always seen MPI efficiency above 95% in the past.

Problem size

QMCPACK workload depends on the problem size N, the number of electrons. The B-spline SPO evaluation and the two- and three-body Jastrow factors scale as O(N^2), but the Slater determinants scale as O(N^3). For this reason, it is simplest to compare two FOMs measured at the same problem size. When the problem size is large, the O(N^3) term dominates the cost, so we take O(N^3) as the workload for simplicity.

A simple formula

FOM = N^3 x Nwalker x DMC steps / wall clock time
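
The formula can be written as a small helper, shown here as a minimal sketch. The function name `throughput_fom`, its argument names, and the input values are hypothetical and are not part of QMCPACK. The inputs are made-up round numbers chosen only to make the arithmetic easy to follow.

```python
def throughput_fom(n_electrons, n_walkers, dmc_steps, wall_seconds):
    """Return FOM = N^3 x Nwalker x DMC steps / wall clock time.

    wall_seconds should cover only the measured DMC steps,
    not the equilibration phase.
    """
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return n_electrons**3 * n_walkers * dmc_steps / wall_seconds


# Hypothetical measurement: 1000 electrons, 8 walkers, 50 DMC steps in 200 s.
fom = throughput_fom(n_electrons=1000, n_walkers=8, dmc_steps=50, wall_seconds=200.0)
print(f"{fom:.3e}")  # prints 2.000e+09
```

Because the workload term is N^3, two FOM values from this helper are only directly comparable when they were measured at the same problem size.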

For example, running the 256-atom NiO problem (N = 3072 electrons) on Titan with 14 walkers per GPU takes 19.65 seconds per DMC step. The full machine FOM is then 18000 nodes x 0.95 x 3072^3 x 14 / 19.65 = 3.53 x 10^14. This run used the CUDA code without the delayed update algorithm.
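
The Titan arithmetic can be checked with a few lines of Python. This is only a sketch that replays the numbers quoted above. The variable names are illustrative, and it assumes one GPU per node, as the multiplication by the node count implies.

```python
# Numbers quoted in the NiO example above.
n_electrons = 3072        # problem size N for the 256 atom NiO cell
walkers_per_gpu = 14
seconds_per_step = 19.65  # wall clock time for one DMC step
nodes = 18000
mpi_efficiency = 0.95

# Single-node FOM for one DMC step (assumes one GPU per node).
node_fom = n_electrons**3 * walkers_per_gpu * 1 / seconds_per_step

# Weak scaling: full machine FOM = nodes x single-node FOM x MPI efficiency.
machine_fom = nodes * mpi_efficiency * node_fom

print(f"single-node FOM:  {node_fom:.3e}")
print(f"full machine FOM: {machine_fom:.3e}")
```

The full machine value agrees with the 3.53 x 10^14 quoted above to three significant figures.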