Benchmark Ampere Altra Developer Platform - 128 core 2.8 GHz ARM64 #17

geerlingguy · 2023-09-11T20:39:41Z

I am upgrading my system from the 96-core CPU to the 128-core M128-28, which should hopefully boost the score a little further. We'll see whether efficiency gets better or worse with the extra 32 cores.

Previous discussion: #10

geerlingguy · 2023-09-12T15:56:42Z

Initial result:

root@ampere-ubuntu:/opt/hpl-2.3/bin/Altramax_oracleblis# mpirun --allow-run-as-root -np 128 --bind-to core --map-by core ./xhpl
================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :  100000 
NB     :     256 
PMAP   : Row-major process mapping
P      :       8 
Q      :      16 
PFACT  :   Right 
NBMIN  :       4 
NDIV   :       2 
RFACT  :   Crout 
BCAST  :  1ringM 
DEPTH  :       1 
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4      100000   256     8    16             596.08             1.1185e+03
HPL_pdgesv() start time Tue Sep 12 15:31:28 2023

HPL_pdgesv() end time   Tue Sep 12 15:41:24 2023

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   1.62873305e-03 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================

It should be more, though... See: AmpereComputing/HPL-on-Ampere-Altra#11
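As a sanity check on the number above, the Gflops figure can be recomputed from N and the wall time. This is a minimal sketch, assuming the usual LU operation count of 2/3·N³ + 3/2·N² (the N² term is negligible at this size); the function name `hpl_gflops` is made up for illustration and is not part of HPL.

```python
def hpl_gflops(n: int, seconds: float) -> float:
    """Estimate Gflops for an HPL run of order n that took `seconds`.

    Assumes the operation count 2/3*n^3 + 3/2*n^2 for solving the
    dense linear system via LU factorization.
    """
    flops = (2.0 / 3.0) * n**3 + 1.5 * n**2
    return flops / seconds / 1e9


# Values copied from the result line above: N = 100000, Time = 596.08 s.
reported = 1.1185e3
computed = hpl_gflops(100_000, 596.08)
print(f"computed {computed:.1f} Gflops, reported {reported:.1f} Gflops")
```

The recomputed value should land within a small fraction of a percent of the reported 1.1185e+03 Gflops; the printed time is rounded, so an exact match is not expected.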

geerlingguy · 2023-10-09T22:14:29Z

With 384GB of RAM, I got:

root@ampere-ubuntu:/opt/hpl-2.3/bin/Altramax_oracleblis# mpirun --allow-run-as-root -np 128 --bind-to core --map-by core ./xhpl
================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :  200000 
NB     :     256 
PMAP   : Row-major process mapping
P      :       8 
Q      :      16 
PFACT  :   Right 
NBMIN  :       4 
NDIV   :       2 
RFACT  :   Crout 
BCAST  :  1ringM 
DEPTH  :       1 
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4      200000   256     8    16            4214.60             1.2655e+03
HPL_pdgesv() start time Fri Sep 15 18:56:44 2023

HPL_pdgesv() end time   Fri Sep 15 20:06:59 2023

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   1.13196075e-02 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================
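To see why the larger N needs the extra RAM, here is a rough sketch of the matrix footprint. It assumes only that HPL stores an N×N matrix of 8-byte doubles, ignores HPL's smaller work buffers and the OS, and treats "384GB" as decimal gigabytes; the helper name `matrix_gb` is hypothetical.

```python
BYTES_PER_DOUBLE = 8


def matrix_gb(n: int) -> float:
    """Size in decimal GB of an n x n matrix of double-precision values."""
    return BYTES_PER_DOUBLE * n * n / 1e9


ram_gb = 384  # installed RAM stated for this run (assumed decimal GB here)
for n in (100_000, 200_000):
    size = matrix_gb(n)
    print(f"N={n}: {size:.0f} GB ({size / ram_gb:.0%} of {ram_gb} GB)")
```

By this estimate the N = 200000 matrix alone takes about 320 GB, four times the roughly 80 GB needed at N = 100000, since the footprint grows with N².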

geerlingguy · 2023-10-10T14:40:33Z

After a bit of discussion in AmpereComputing/HPL-on-Ampere-Altra#11, we determined it is a memory bandwidth issue; the 128-core CPU just can't get more memory bandwidth than the 96-core CPU, and that's the bottleneck.

Throwing a little more memory at it (which allowed a larger N) did bump the score, but to get much beyond 1.2 Tflops, we'd need a system with 8 memory channels (like one of the server boards). I'm happy with the score we got, though :)
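For a quick comparison of the two runs in this thread, the arithmetic can be sketched as below. The Gflops values are copied from the HPL output above; the helper names are made up for illustration, and per-core throughput is simply the total divided by the 128 MPI ranks.

```python
CORES = 128  # one MPI rank per core (mpirun -np 128)

# Gflops reported by HPL in the two runs above.
run_n100k = 1118.5  # N = 100000
run_n200k = 1265.5  # N = 200000, 384GB of RAM


def pct_gain(old: float, new: float) -> float:
    """Percentage improvement of `new` over `old`."""
    return (new - old) / old * 100.0


def per_core(gflops: float, cores: int = CORES) -> float:
    """Average Gflops contributed by each core."""
    return gflops / cores


print(f"gain: {pct_gain(run_n100k, run_n200k):.1f}%")
print(f"per core: {per_core(run_n100k):.2f} -> {per_core(run_n200k):.2f} Gflops")
```

This only compares the two HPL scores; it says nothing about power efficiency, since no power measurements appear in this thread.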

geerlingguy mentioned this issue Oct 9, 2023

Benchmark Adlink Ampere Altra Dev Kit - 64-core 2.2 GHz #19

Closed

geerlingguy closed this as completed Oct 10, 2023

geerlingguy mentioned this issue Dec 8, 2024

Benchmark 128-core System76 Thelio Astra #44

Closed
