Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Benchmark Ampere Altra Developer Platform - 128 core 2.8 GHz ARM64 #17

Closed
geerlingguy opened this issue Sep 11, 2023 · 3 comments
Closed

Comments

@geerlingguy
Copy link
Owner

geerlingguy commented Sep 11, 2023

I am upgrading my system from 96 to 128 core M128-28, which should hopefully boost the score a little further. We'll see if efficiency is better or worse with the extra 32 cores.

Previous discussion: #10

@geerlingguy
Copy link
Owner Author

Initial result:

root@ampere-ubuntu:/opt/hpl-2.3/bin/Altramax_oracleblis# mpirun --allow-run-as-root -np 128 --bind-to core --map-by core ./xhpl
================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :  100000 
NB     :     256 
PMAP   : Row-major process mapping
P      :       8 
Q      :      16 
PFACT  :   Right 
NBMIN  :       4 
NDIV   :       2 
RFACT  :   Crout 
BCAST  :  1ringM 
DEPTH  :       1 
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4      100000   256     8    16             596.08             1.1185e+03
HPL_pdgesv() start time Tue Sep 12 15:31:28 2023

HPL_pdgesv() end time   Tue Sep 12 15:41:24 2023

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   1.62873305e-03 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================

It should be more, though... See: AmpereComputing/HPL-on-Ampere-Altra#11

@geerlingguy
Copy link
Owner Author

With 384GB of RAM, I got:

root@ampere-ubuntu:/opt/hpl-2.3/bin/Altramax_oracleblis# mpirun --allow-run-as-root -np 128 --bind-to core --map-by core ./xhpl
================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :  200000 
NB     :     256 
PMAP   : Row-major process mapping
P      :       8 
Q      :      16 
PFACT  :   Right 
NBMIN  :       4 
NDIV   :       2 
RFACT  :   Crout 
BCAST  :  1ringM 
DEPTH  :       1 
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4      200000   256     8    16            4214.60             1.2655e+03
HPL_pdgesv() start time Fri Sep 15 18:56:44 2023

HPL_pdgesv() end time   Fri Sep 15 20:06:59 2023

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   1.13196075e-02 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================

@geerlingguy
Copy link
Owner Author

After a bit of discussion in AmpereComputing/HPL-on-Ampere-Altra#11, we determined it is a memory bandwidth issue; the 128-core CPU just can't get more memory bandwidth than the 96-core CPU, and that's the bottleneck.

Throwing a little more memory at it did bump the score, but to get beyond 1.2 Tflops, we need a system with 8 memory channels (like one of the server boards). I'm happy with the score we got, though :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant