Benchmark 128-core System76 Thelio Astra #44

Closed
geerlingguy opened this issue Oct 22, 2024 · 18 comments

Comments

@geerlingguy
Owner

The Thelio Astra has an M128-30 Ampere Altra Max CPU, and the configuration I was sent includes 512 GB of ECC DDR4-3200 RAM. See: geerlingguy/sbc-reviews#53

@geerlingguy
Owner Author

On my first run, it compiled and started the benchmark. A few minutes in, after the system had been drawing 430W or so continuously (my UPS beeped a bit as I passed its 600W threshold), I saw power draw drop to 38W, and the system seemed to be locked up. Even a reboot from OpenBMC didn't restore it; it stayed stuck in the power-off state even when I tried powering it on via the BMC.

I had to manually power cycle the machine using the power button.

I'm also wondering... I never heard the fans spin up at all, they just stayed in their idle RPM AFAICT—maybe the fan curve or the fan control on the little breakout adapter isn't running correctly? I'll ask System76 if that could be the case.

@geerlingguy
Owner Author

geerlingguy commented Oct 23, 2024

btop, hilariously, is displaying the CPU temp in thousands of degrees C:

Screenshot 2024-10-23 at 11 38 31 AM

I'm going to monitor temps with sensors on a 2nd benchmark run, maybe the fan curve needs fixing.

@geerlingguy
Owner Author

It looks like cooling is the issue — I'll contact System76 and ask about it.

image

SoC temps reached 95°C with SoC power pegged around 250W, then hovered between 95 and 98°C. I also hit a lockup during the 'Background Blur' benchmark in Geekbench 6, and I'm guessing that was thermal too.

@bexcran

bexcran commented Oct 23, 2024

> btop, hilariously, is displaying the CPU temp in thousands of degrees C:

Press 'o' for options, press '2' for the cpu tab and scroll down to 'Cpu sensor'.
Press the left/right arrows to select 'apm_xgene/SoC Temperature' instead of 'apm_xgene/IO power'.

@geerlingguy
Owner Author

Ha! didn't even think of that. btop's been pretty reliable in picking the right metric on other platforms, this was the first time I tried it on Ampere. Will tuck that away when I get the Astra back — System76 is going to exchange systems as the one I was sent likely had shipping damage (one fan sounded horrible, and one of the CPU fans had dislodged from the CPU cooler, and was rattling around inside the cooling duct, likely banging into the motherboard).

@geerlingguy
Owner Author

geerlingguy commented Dec 7, 2024

Just got the system back. This time things were intact, but the system still got warmer and warmer until it passed 90°C and locked up. I saw DIMM and SoC overheat errors in OpenBMC, and it would hard lock, requiring a physical power button hold to shut down, or an 'immediate' poweroff in the BMC (an immediate reset wouldn't work).

The SoC gets back down to 35°C pretty quickly, as the idle fan PWM seems to be fine (something like a silent 50% duty cycle).

According to this commit, the hedt fan curve should be used:

https://github.com/pop-os/system76-power/blob/b0edf27a664ed11058e0bb52b4e8b74f79577a2d/src/fan.rs#L259-L269

But maybe system76-power isn't running on this install? The fan never ramps up at all, no matter what the SoC temp. If I run stress-ng -n 128 indefinitely, it only gets up into the 70°C range. But HPL burns through like 430W of total system power, and that eventually leads to the thermal limits.

@bexcran

bexcran commented Dec 7, 2024

On my Thelio Astra:

system76@system76-ALTRAD8UD-1L2T:~$ ps ax | grep system76-power
   2772 ?        Ss     0:00 /usr/bin/system76-power daemon
  83889 pts/3    S+     0:00 grep --color=auto system76-power
system76@system76-ALTRAD8UD-1L2T:~$ system76-power profile
Power Profile: Balanced
system76@system76-ALTRAD8UD-1L2T:~$ sensors
apm_xgene-isa-0000
Adapter: ISA adapter
SoC Temperature:  +45.0°C  
CPU power:        36.96 W  
IO power:          8.05 W  

nvme-pci-20100
Adapter: PCI adapter
Composite:    +54.9°C  (low  = -20.1°C, high = +83.8°C)
                       (crit = +88.8°C)
Sensor 2:     +60.9°C  

system76_thelio_io-hid-3-9
Adapter: HID adapter
CPU Fan:     1035 RPM
Intake Fan:   660 RPM
GPU Fan:        0 RPM
Aux Fan:        0 RPM

If I set Ps=1, Qs=128 and run the HPL benchmark from this repo, the SoC temperature quickly rises to 60°C, then climbs to 71°C over the course of 10-15 minutes, and is now reporting 72°C. I'm concerned that if I were to leave it overnight it might continue rising.

system76@system76-ALTRAD8UD-1L2T:~/src/top500-benchmark$ sensors
apm_xgene-isa-0000
Adapter: ISA adapter
SoC Temperature:  +72.0°C  
CPU power:       179.48 W  
IO power:         13.07 W  

nvme-pci-20100
Adapter: PCI adapter
Composite:    +54.9°C  (low  = -20.1°C, high = +83.8°C)
                       (crit = +88.8°C)
Sensor 2:     +60.9°C  

system76_thelio_io-hid-3-9
Adapter: HID adapter
CPU Fan:     1410 RPM
Intake Fan:  1020 RPM
GPU Fan:        0 RPM
Aux Fan:        0 RPM

Edit: I'm now occasionally seeing sensors reporting 73C.
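
For logging these readings over a long run, here is a minimal sketch that parses the text output of `sensors` into a dictionary. The regular expression and the `parse_sensors` name are assumptions made for illustration, and it only handles the `°C`, `W` and `RPM` lines seen in this thread.

```python
import re

# Matches lines such as "SoC Temperature:  +72.0°C" or "CPU Fan:     1410 RPM".
READING = re.compile(r"^([^:(]+):\s+\+?(-?\d+(?:\.\d+)?)\s*(°C|RPM|W)")


def parse_sensors(text):
    """Return {chip: {label: (value, unit)}} from `sensors` text output."""
    chips = {}
    chip = None
    for line in text.splitlines():
        if not line.strip():
            chip = None          # a blank line ends the current chip block
            continue
        if chip is None:
            chip = line.strip()  # first line of a block is the chip name
            chips[chip] = {}
            continue
        match = READING.match(line)
        if match:                # "Adapter:" and "(crit = ...)" lines are skipped
            label, value, unit = match.groups()
            chips[chip][label.strip()] = (float(value), unit)
    return chips


# Sample copied from the output above (trimmed).
SAMPLE = """\
apm_xgene-isa-0000
Adapter: ISA adapter
SoC Temperature:  +72.0°C
CPU power:       179.48 W
IO power:         13.07 W

system76_thelio_io-hid-3-9
Adapter: HID adapter
CPU Fan:     1410 RPM
Intake Fan:  1020 RPM
"""

readings = parse_sensors(SAMPLE)
print(readings["apm_xgene-isa-0000"]["SoC Temperature"])
print(readings["system76_thelio_io-hid-3-9"]["CPU Fan"])
```

On a live system the input text could come from running `sensors` through `subprocess.run(..., capture_output=True, text=True)`.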

@bexcran

bexcran commented Dec 7, 2024

@geerlingguy Could you post the output of the sensors command please? Also, if you're not familiar with the command you can watch the values change via watch -n 2 sensors.

@geerlingguy
Owner Author

@bexcran - my sensors output had no fan information whatsoever, just temps... and I'm not sure if my system's running the system76-power utility at all... I'll have to check when I boot it back up later (not sure if I can get to it this weekend, might be Monday!).

On my system the temperature rose pretty quickly from 35°C to 90°C (in the course of 10 minutes or so), and I never heard the fans move any faster, even listening closely at the exhaust fan (the air was nice and toasty though!).

@geerlingguy
Owner Author

Ah, I left it plugged in so I could boot up remotely, yay!

system76@thelio-astra:~$ sensors
bnxt_en-pci-20301
Adapter: PCI adapter
temp1:        +51.0°C  

nvme-pci-20100
Adapter: PCI adapter
Composite:    +27.9°C  (low  = -20.1°C, high = +83.8°C)
                       (crit = +88.8°C)
Sensor 2:     +54.9°C  

apm_xgene-isa-0000
Adapter: ISA adapter
SoC Temperature:  +31.0°C  
CPU power:        10.72 W  
IO power:         13.03 W  

bnxt_en-pci-20300
Adapter: PCI adapter
temp1:        +51.0°C 

Their power daemon is not running:

system76@thelio-astra:~$ ps ax | grep system76-power
   5702 pts/0    S+     0:00 grep --color=auto system76-power
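
Note that the only match above is the `grep` command itself. A minimal sketch of the same check in a script, assuming `ps ax` output in the column layout shown here (the `daemon_pids` helper is a hypothetical name):

```python
def daemon_pids(ps_output, needle="system76-power daemon"):
    """Return PIDs from `ps ax` text whose command contains `needle`."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 4)  # PID, TTY, STAT, TIME, COMMAND
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        command = fields[4]
        # Skip the grep process, which also mentions the daemon's name.
        if needle in command and not command.startswith("grep"):
            pids.append(int(fields[0]))
    return pids


NOT_RUNNING = "   5702 pts/0    S+     0:00 grep --color=auto system76-power\n"
RUNNING = (
    "  12737 ?        Ss     0:00 /usr/bin/system76-power daemon\n"
    "  18538 pts/0    S+     0:00 grep --color=auto system76-power\n"
)

print(daemon_pids(NOT_RUNNING))  # []
print(daemon_pids(RUNNING))      # [12737]
```

An empty list means the daemon is not running, so nothing is adjusting the fan speed from the OS side.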

Are there any instructions for setting it up on Ubuntu?

And separately, since it seems the daemon must be running for the fans to respond to SoC temps: is there any way to guarantee that will work when running other OSes on the box, which might not be supported by system76-power? (Or is there any way in BMC/BIOS to force fans to max or something?) (Sorry for the barrage of questions... just figured out that the fans weren't plugged into the motherboard yesterday, heh.)

@geerlingguy
Owner Author

I found System76 Driver (Install), which tells me to install the driver with:

sudo apt-add-repository -y ppa:system76-dev/stable
sudo apt-get update
sudo apt install system76-driver

That succeeded, and now:

system76@thelio-astra:~$ ps ax | grep system76-power
  12737 ?        Ss     0:00 /usr/bin/system76-power daemon
  18538 pts/0    S+     0:00 grep --color=auto system76-power

I didn't see any fan speeds in sensors, and nothing more was found with sensors-detect, so I'm rebooting to see if fan speeds become visible...

@geerlingguy
Owner Author

Yay! I see fans in sensors!

system76@thelio-astra:~$ sensors
system76_thelio_io-hid-3-2
Adapter: HID adapter
CPU Fan:      480 RPM
Intake Fan:     0 RPM
GPU Fan:        0 RPM
Aux Fan:        0 RPM

bnxt_en-pci-20301
Adapter: PCI adapter
temp1:        +58.0°C  

nvme-pci-20100
Adapter: PCI adapter
Composite:    +36.9°C  (low  = -20.1°C, high = +83.8°C)
                       (crit = +88.8°C)
Sensor 2:     +54.9°C  

apm_xgene-isa-0000
Adapter: ISA adapter
SoC Temperature:  +35.0°C  
CPU power:         9.96 W  
IO power:         14.03 W  

bnxt_en-pci-20300
Adapter: PCI adapter
temp1:        +58.0°C  

I will run this benchmark again and see how temps go now that the daemon is running.

I still don't see any sensor data in OpenBMC though—I'm guessing the system76-driver only applies to booted Linux, but doesn't report back to IPMI...

Screenshot 2024-12-07 at 5 24 29 PM

@geerlingguy
Owner Author

geerlingguy commented Dec 7, 2024

With the daemon running, temps are more controlled: the SoC settles between 75 and 80°C, and fan speeds ramp to between 1500 and 1800 RPM (intake around 1100-1200 RPM), while the CPU is burning 200-210W.

astra-benchmark-temps

I'm using the default OpenBLAS installation, not sure if it's picking the right profile for ampere or just generic arm64.

@geerlingguy
Owner Author

geerlingguy commented Dec 8, 2024

Using the defaults: 1,147.2 Gflops at 415W, for 2.76 Gflops/W (415W is total system power draw; CPU + IO was using around 225W)
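
The efficiency figure is just the HPL result divided by total wall power. A small sketch of that arithmetic, using the numbers from this run (the function name is only illustrative):

```python
def gflops_per_watt(gflops, watts):
    """Efficiency as HPL Gflops divided by total system power in watts."""
    return gflops / watts


efficiency = gflops_per_watt(1147.2, 415)
print(round(efficiency, 2))  # 2.76
```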

I may see about running the Ampere optimized Blis: https://github.com/AmpereComputing/HPL-on-Ampere-Altra . As a point of comparison, the M128-28X got 1.265 Tflops, so I'm guessing this will get at least 1.3 or 1.4—possibly more since we have two more memory channels on this system...

================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :  202181
NB     :     256
PMAP   : Row-major process mapping
P      :       8
Q      :      16
PFACT  :   Right
NBMIN  :       4
NDIV   :       2
RFACT  :   Crout
BCAST  :  1ringM
DEPTH  :       1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4      202181   256     8    16            4802.94             1.1472e+03
HPL_pdgesv() start time Sat Dec  7 17:26:26 2024

HPL_pdgesv() end time   Sat Dec  7 18:46:29 2024

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   1.57845109e-03 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================
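
As a sanity check on the report above, the Gflops figure can be reproduced from N and the wall time. This is a sketch that assumes HPL's usual operation count of (2/3)N^3 + (3/2)N^2 floating-point operations; the `hpl_gflops` name is made up for this example.

```python
def hpl_gflops(n, seconds):
    """Gflops from matrix order n and solve time, assuming
    (2/3)*n**3 + (3/2)*n**2 floating-point operations."""
    flops = (2.0 / 3.0) * n**3 + 1.5 * n**2
    return flops / seconds / 1e9


# N and Time taken from the result line above.
rate = hpl_gflops(202181, 4802.94)
print(f"{rate:.1f} Gflops")
```

The result agrees with the reported 1.1472e+03 to within rounding.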

@geerlingguy
Owner Author

Re-testing with these instructions, the system certainly runs hotter, reaching the high 80s °C...

Screenshot 2024-12-09 at 4 37 24 PM Screenshot 2024-12-09 at 4 37 36 PM

The fan curve (currently hedt) might need to be a little more aggressive to keep SoC temps below 80°C. It would also be nice to have a cooldown time, so that once the fan ramps up it stays high for a few seconds instead of quickly ramping back down.
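
To illustrate the cooldown idea, here is a minimal sketch of a piecewise-linear fan curve with a hold time before the duty cycle is allowed to drop. The curve points, the 10 second hold, and the class and function names are all hypothetical; they are not taken from system76-power's hedt curve.

```python
from bisect import bisect_right

# Hypothetical (temperature in °C, duty in %) points, NOT the real hedt curve.
CURVE = [(50.0, 40.0), (65.0, 60.0), (75.0, 80.0), (85.0, 100.0)]


def curve_duty(temp_c, curve=CURVE):
    """Linearly interpolate a fan duty (%) for temp_c from the curve points."""
    if temp_c <= curve[0][0]:
        return curve[0][1]
    if temp_c >= curve[-1][0]:
        return curve[-1][1]
    i = bisect_right([t for t, _ in curve], temp_c)
    (t0, d0), (t1, d1) = curve[i - 1], curve[i]
    return d0 + (d1 - d0) * (temp_c - t0) / (t1 - t0)


class HeldFan:
    """Raise duty immediately, but only lower it after hold_s seconds."""

    def __init__(self, hold_s=10.0):
        self.hold_s = hold_s
        self.duty = 0.0
        self.last_rise = float("-inf")

    def update(self, temp_c, now_s):
        target = curve_duty(temp_c)
        if target >= self.duty:
            self.duty = target
            self.last_rise = now_s   # restart the hold timer
        elif now_s - self.last_rise >= self.hold_s:
            self.duty = target       # hold expired, allow the ramp down
        return self.duty


fan = HeldFan(hold_s=10.0)
print(fan.update(80.0, now_s=0.0))   # ramps up right away
print(fan.update(60.0, now_s=5.0))   # still held at the higher duty
print(fan.update(60.0, now_s=12.0))  # hold elapsed, follows the curve down
```

Making the curve "more aggressive" in this sketch would simply mean shifting the points toward lower temperatures.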

@bexcran

bexcran commented Dec 9, 2024

@geerlingguy I think that's why OpenBMC uses a PID control loop for its fan controller - so that it keeps them running faster than needed for a few seconds as the temperature drops.

https://github.com/openbmc/phosphor-pid-control/blob/master/tuning.md:

The openBMC PID control daemon, swampd (phosphor-pid-control) requires the user to specify the sensors and PID coefficients.

This page seems to have a good description of the process: https://www.west-cs.com/products/l2/pid-temperature-controller/

PID temperature control is a loop control feature found on most process controllers to improve the accuracy of the process. PID temperature controllers work using a formula to calculate the difference between the desired temperature setpoint and current process temperature, then predicts how much power to use in subsequent process cycles to ensure the process temperature remains as close to the setpoint as possible by eliminating the impact of process environment changes.

PID temperature controllers differ from On/Off temperature controllers where 100% power is applied until the setpoint is reached, at which point the power is cut to 0% until the process temperature again falls below the setpoint. This leads to regular overshoots and lag which can affect the overall quality of the product.

Temperature controllers with PID are more effective at dealing with process disturbances, which can be something as seemingly innocuous as opening an oven door, but the change in temperature can then have an impact on the quality of the final product. If the PID temperature controller is tuned properly it will compensate for the disturbance and bring the process temperature back to the setpoint, but reduce power as temperature approaches the setpoint so that it doesn’t overshoot and risk damaging the product with too much heat.
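
For a concrete picture, below is a minimal sketch of a PID loop driving a fan duty cycle from SoC temperature. It is not the phosphor-pid-control implementation; the gains, the 75°C setpoint, the 30% minimum duty, and the `FanPID` name are all assumptions chosen for illustration. Because more fan lowers the temperature, the error is measured temperature minus setpoint.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FanPID:
    setpoint_c: float            # target SoC temperature (assumed value)
    kp: float = 4.0              # hypothetical gains, not tuned for any hardware
    ki: float = 0.5
    kd: float = 1.0
    min_duty: float = 30.0       # idle duty in percent (assumption)
    max_duty: float = 100.0
    _integral: float = 0.0
    _prev_error: Optional[float] = None

    def update(self, temp_c: float, dt_s: float) -> float:
        """Return a fan duty (%) for the latest temperature sample."""
        error = temp_c - self.setpoint_c
        if self._prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self._prev_error) / dt_s
        self._prev_error = error

        integral = self._integral + error * dt_s
        raw = (self.min_duty + self.kp * error
               + self.ki * integral + self.kd * derivative)
        duty = min(max(raw, self.min_duty), self.max_duty)
        if raw == duty:
            # Simple anti-windup: only accumulate while the output is not clamped.
            self._integral = integral
        return duty


pid = FanPID(setpoint_c=75.0)
for temp in (60.0, 85.0, 95.0):
    print(temp, pid.update(temp, dt_s=2.0))
```

The integral term carries recent history forward, which is what lets a tuned loop respond more smoothly than a plain temperature-to-duty lookup.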

@geerlingguy
Owner Author

geerlingguy commented Dec 10, 2024

Result using the Ampere-optimized BLIS and HPL options:

1,652.4 Gflops (1.65 Tflops) at 440W, for 3.76 Gflops/W

================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :  250000 
NB     :     256 
PMAP   : Row-major process mapping
P      :       8 
Q      :      16 
PFACT  :   Right 
NBMIN  :       4 
NDIV   :       2 
RFACT  :   Crout 
BCAST  :  1ringM 
DEPTH  :       1 
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4      250000   256     8    16            6304.05             1.6524e+03
HPL_pdgesv() start time Mon Dec  9 16:35:04 2024

HPL_pdgesv() end time   Mon Dec  9 18:20:08 2024

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   1.42735119e-03 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================
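
The larger N in this run uses much more of the installed RAM. A rough sketch of the footprint, assuming the dominant cost is the N x N matrix of 8-byte doubles and ignoring HPL's smaller work buffers (the helper name is hypothetical):

```python
BYTES_PER_DOUBLE = 8


def hpl_matrix_gib(n):
    """Approximate size in GiB of an n x n double-precision matrix."""
    return n * n * BYTES_PER_DOUBLE / 2**30


for n in (202181, 250000):
    print(n, round(hpl_matrix_gib(n), 1), "GiB")
```

If the 512 GB of RAM is treated as 512 GiB, N=250000 fills roughly 91% of it, compared with about 59% for N=202181.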
Screenshot 2024-12-09 at 6 45 02 PM

@geerlingguy
Owner Author

Happy to see this system get better than the expected result of 1597 Gflops according to the HPL-on-Ampere repo :)
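
For reference, a quick sketch of the relative gains, using only figures quoted in this thread (the `percent_gain` helper is illustrative):

```python
def percent_gain(new, old):
    """Relative improvement of new over old, in percent."""
    return (new / old - 1.0) * 100.0


blis, openblas, expected = 1652.4, 1147.2, 1597.0
print(f"{percent_gain(blis, expected):.1f}% over the expected result")   # 3.5%
print(f"{percent_gain(blis, openblas):.1f}% over the default OpenBLAS run")  # 44.0%
```

The two runs used different problem sizes (N=250000 versus N=202181), so the second figure is not a pure library-to-library comparison.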
