System76 Thelio Astra Workstation #53

Open
geerlingguy opened this issue Oct 22, 2024 · 18 comments
@geerlingguy

geerlingguy commented Oct 22, 2024

Basic information

  • Board URL (official): https://system76.com/desktops/thelio-astra
  • Board purchased from: (Provided by System76 for review)
  • Board purchase date: October 22, 2024
  • Board specs (as tested): Ampere Altra Max M128-30, 512 GB ECC DDR4-3200, Nvidia A400
  • Board price (as tested): $3,299 (base), $6,812 (as tested)

Docs / Issues

See also: @bexcran's Ampere Systems Wiki.

Linux/system information

# output of `screenfetch`
                          ./+o+-       jgeerling@thelio-astra
                  yyyyy- -yyyyyy+      OS: Ubuntu 24.04 noble
               ://+//////-yyyyyyo      Kernel: aarch64 Linux 6.8.0-47-generic
           .++ .:/++++++/-.+sss/`      Uptime: 3m
         .:++o:  /++++++++/:--:/-      Packages: 1872
        o:+o+:++.`..```.-/oo+++++/     Shell: bash 5.2.21
       .:+o:+o/.          `+sssoo+/    Disk: 15G / 101G (16%)
  .++/+:+oo+o:`             /sssooo.   CPU: ARM Neoverse-N1 @ 128x 3GHz
 /+++//+:`oo+o               /::--:.   GPU: ASPEED Technology, Inc. ASPEED Graphics Family (rev 41)
NVIDIA Corporation Device 25b2 (rev a1)
 \+/+o+++`o++o               ++////.   RAM: 13063MiB / 514300MiB
  .++.o+++oo+:`             /dddhhh.  
       .+.o+oo:.          `oddhhhh+   
        \+.++o+o``-````.:ohdhhhhh+    
         `:o+++ `ohhhhhhhhyo++os:     
           .o:`.syhhhhhhh/.oo++o`     
               /osyyyyyyo++ooo+++/    
                   ````` +oo+++o\:    
                          `oo++.  

# output of `uname -a`
jgeerling@thelio-astra:~$ uname -a
Linux thelio-astra 6.8.0-47-generic #47-Ubuntu SMP PREEMPT_DYNAMIC Fri Sep 27 22:03:50 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

Benchmark results

CPU

Power

  • Idle power draw (at wall): 103 W (7 W system shutdown, BMC running)
  • Maximum simulated power draw (stress-ng --matrix 0): 357 W
  • During Geekbench multicore benchmark: TODO W
  • During top500 HPL benchmark: 440 W

Disk

1TB PCIe Gen 4 NVMe (KINGSTON SKC3000S1024G - KC3000)

| Benchmark | Result |
| --- | --- |
| iozone 4K random read | 85.59 MB/s |
| iozone 4K random write | 267.75 MB/s |
| iozone 1M random read | 3279.35 MB/s |
| iozone 1M random write | 5082.95 MB/s |
| iozone 1M sequential read | 3274.56 MB/s |
| iozone 1M sequential write | 5041.55 MB/s |

wget https://raw.githubusercontent.com/geerlingguy/pi-cluster/master/benchmarks/disk-benchmark.sh
chmod +x disk-benchmark.sh
sudo MOUNT_PATH=/ TEST_SIZE=1g ./disk-benchmark.sh

Network

iperf3 results (10 GbE connection):

  • iperf3 -c $SERVER_IP: 9.38 Gbps
  • iperf3 -c $SERVER_IP --reverse: 9.41 Gbps
  • iperf3 -c $SERVER_IP --bidir: 9.36 Gbps up, 9.40 Gbps down

The SFP cages on the motherboard support 25 GbE; I have not had a chance to plug this into the 25 GbE switch in my rack yet...
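
For repeat runs, the receiver-side rate can be pulled out of iperf3's text output with a small filter. This is a minimal sketch: `iperf3_receiver_rate` is a hypothetical helper name, and it assumes the usual summary line that ends in `<rate> <unit>/sec ... receiver`.

```shell
# Hypothetical helper: print the receiver-side rate from iperf3 text output.
# Assumes summary lines whose last field is "receiver" and whose two
# preceding fields are the rate and its unit (e.g. "9.38 Gbits/sec").
iperf3_receiver_rate() {
  awk '$NF == "receiver" { print $(NF-2), $(NF-1) }'
}
```

Usage would look like `iperf3 -c "$SERVER_IP" | iperf3_receiver_rate`; with `--bidir` there is more than one receiver line, so the filter prints one rate per line.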

GPU

glmark2-es2-wayland

=======================================================
    glmark2 2023.01
=======================================================
    OpenGL Information
    GL_VENDOR:      NVIDIA Corporation
    GL_RENDERER:    NVIDIA RTX A400/PCIe
    GL_VERSION:     OpenGL ES 3.2 NVIDIA 550.120
    Surface Config: buf=32 r=8 g=8 b=8 a=8 depth=24 stencil=0 samples=0
    Surface Size:   800x600 windowed
=======================================================
[build] use-vbo=false: FPS: 1880 FrameTime: 0.532 ms
[build] use-vbo=true: FPS: 2028 FrameTime: 0.493 ms
[texture] texture-filter=nearest: FPS: 1954 FrameTime: 0.512 ms
[texture] texture-filter=linear: FPS: 1974 FrameTime: 0.507 ms
[texture] texture-filter=mipmap: FPS: 1897 FrameTime: 0.527 ms
[shading] shading=gouraud: FPS: 1664 FrameTime: 0.601 ms
[shading] shading=blinn-phong-inf: FPS: 1634 FrameTime: 0.612 ms
[shading] shading=phong: FPS: 1626 FrameTime: 0.615 ms
[shading] shading=cel: FPS: 1674 FrameTime: 0.597 ms
[bump] bump-render=high-poly: FPS: 1673 FrameTime: 0.598 ms
[bump] bump-render=normals: FPS: 2147 FrameTime: 0.466 ms
[bump] bump-render=height: FPS: 2081 FrameTime: 0.481 ms
[effect2d] kernel=0,1,0;1,-4,1;0,1,0;: FPS: 1793 FrameTime: 0.558 ms
[effect2d] kernel=1,1,1,1,1;1,1,1,1,1;1,1,1,1,1;: FPS: 1736 FrameTime: 0.576 ms
[pulsar] light=false:quads=5:texture=false: FPS: 1959 FrameTime: 0.511 ms
[desktop] blur-radius=5:effect=blur:passes=1:separable=true:windows=4: FPS: 1298 FrameTime: 0.771 ms
[desktop] effect=shadow:windows=4: FPS: 1578 FrameTime: 0.634 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 896 FrameTime: 1.117 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=subdata: FPS: 1008 FrameTime: 0.993 ms
[buffer] columns=200:interleave=true:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 1184 FrameTime: 0.845 ms
[ideas] speed=duration: FPS: 1704 FrameTime: 0.587 ms
[jellyfish] <default>: FPS: 1618 FrameTime: 0.618 ms
[terrain] <default>: FPS: 530 FrameTime: 1.889 ms
[shadow] <default>: FPS: 1808 FrameTime: 0.553 ms
[refract] <default>: FPS: 1061 FrameTime: 0.943 ms
Error: Failed to add fragment shader from file None:
Error:   0(24) : error C7101: Macro HIGHP_OR_DEFAULT redefined
Error: 
[conditionals] fragment-steps=0:vertex-steps=0: Set up failed
Error: Failed to add fragment shader from file None:
Error:   0(24) : error C7101: Macro HIGHP_OR_DEFAULT redefined
Error: 
[conditionals] fragment-steps=5:vertex-steps=0: Set up failed
Error: Failed to add fragment shader from file None:
Error:   0(24) : error C7101: Macro HIGHP_OR_DEFAULT redefined
Error: 
[conditionals] fragment-steps=0:vertex-steps=5: Set up failed
Error: Failed to add fragment shader from file None:
Error:   0(31) : error C7101: Macro HIGHP_OR_DEFAULT redefined
Error: 
[function] fragment-complexity=low:fragment-steps=5: Set up failed
Error: Failed to add fragment shader from file None:
Error:   0(32) : error C7101: Macro HIGHP_OR_DEFAULT redefined
Error: 
[function] fragment-complexity=medium:fragment-steps=5: Set up failed
Error: Failed to add fragment shader from file None:
Error:   0(25) : error C7101: Macro HIGHP_OR_DEFAULT redefined
Error: 
[loop] fragment-loop=false:fragment-steps=5:vertex-steps=5: Set up failed
Error: Failed to add fragment shader from file None:
Error:   0(25) : error C7101: Macro HIGHP_OR_DEFAULT redefined
Error: 
[loop] fragment-steps=5:fragment-uniform=false:vertex-steps=5: Set up failed
Error: Failed to add fragment shader from file None:
Error:   0(25) : error C7101: Macro HIGHP_OR_DEFAULT redefined
Error: 
[loop] fragment-steps=5:fragment-uniform=true:vertex-steps=5: Set up failed
=======================================================
                                  glmark2 Score: 1615 
=======================================================
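
To compare runs like the one above, a small filter can count how many scenes reported an FPS figure and how many failed to set up. This is only a sketch: `glmark2_summary` is a hypothetical helper name, and it assumes the log format shown above (one `[scene]` line ending in either `FPS: ...` or `Set up failed`).

```shell
# Hypothetical helper: summarise glmark2 output read from stdin.
# Assumes each scene line starts with "[name]" and contains either
# "FPS: <n>" or "Set up failed", as in the log above.
glmark2_summary() {
  awk '
    /^\[/ && /Set up failed/ { failed++ }
    /^\[/ && /FPS: [0-9]+/   { ok++ }
    END { printf "ok=%d failed=%d\n", ok, failed }
  '
}
```

Piping a saved copy of the log above through `glmark2_summary` should report 25 scenes with an FPS result and 8 failed scenes (the `conditionals`, `function`, and `loop` groups).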

GravityMark

GravityMark scores (1600x900, 200,000 asteroids, OpenGL):

https://gravitymark.tellusim.com/report/?id=0125a17841303c3abab85af7c6f69315428ddb74

Vulkan wouldn't run as I would get Bus error (core dumped).

TODO: See this issue for discussion about a full suite of standardized GPU benchmarks.

Ollama

ollama LLM model inference results:

| Device | CPU/GPU | Model | Speed | Power (Peak) |
| --- | --- | --- | --- | --- |
| Nvidia A400 | GPU | llama3.2:3b | 35.51 Tokens/s | 167 W |
| Nvidia A400 | CPU/GPU | llama3.2:8b | 2.79 Tokens/s | 190 W |
| Nvidia A400 | CPU/GPU | llama2:13b | 7.93 Tokens/s | 223 W |

For full results (including testing other GPUs...), see issue: geerlingguy/ollama-benchmark#5

Memory

tinymembench results:

tinymembench v0.4.10 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be          ==
==         copied per second (adding together read and writen           ==
==         bytes would have provided twice higher numbers)              ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the   ==
==         destination (source -> L1 cache, L1 cache -> destination)    ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in    ==
==         brackets                                                     ==
==========================================================================

 C copy backwards                                     :  12612.3 MB/s
 C copy backwards (32 byte blocks)                    :  12457.3 MB/s
 C copy backwards (64 byte blocks)                    :  12471.6 MB/s
 C copy                                               :  12644.2 MB/s
 C copy prefetched (32 bytes step)                    :  13293.4 MB/s
 C copy prefetched (64 bytes step)                    :  13292.2 MB/s
 C 2-pass copy                                        :   8224.0 MB/s
 C 2-pass copy prefetched (32 bytes step)             :  10425.1 MB/s
 C 2-pass copy prefetched (64 bytes step)             :  10882.2 MB/s
 C fill                                               :  46942.6 MB/s
 C fill (shuffle within 16 byte blocks)               :  46940.3 MB/s
 C fill (shuffle within 32 byte blocks)               :  46940.6 MB/s
 C fill (shuffle within 64 byte blocks)               :  46884.5 MB/s
 NEON 64x2 COPY                                       :  13373.7 MB/s
 NEON 64x2x4 COPY                                     :  13385.7 MB/s
 NEON 64x1x4_x2 COPY                                  :  13315.9 MB/s
 NEON 64x2 COPY prefetch x2                           :  15927.2 MB/s
 NEON 64x2x4 COPY prefetch x1                         :  16140.0 MB/s
 NEON 64x2 COPY prefetch x1                           :  15967.5 MB/s
 NEON 64x2x4 COPY prefetch x1                         :  16139.9 MB/s
 ---
 standard memcpy                                      :  13327.2 MB/s
 standard memset                                      :  47863.5 MB/s
 ---
 NEON LDP/STP copy                                    :  13366.5 MB/s
 NEON LDP/STP copy pldl2strm (32 bytes step)          :  14947.3 MB/s
 NEON LDP/STP copy pldl2strm (64 bytes step)          :  15072.0 MB/s (0.1%)
 NEON LDP/STP copy pldl1keep (32 bytes step)          :  15284.9 MB/s
 NEON LDP/STP copy pldl1keep (64 bytes step)          :  15246.9 MB/s
 NEON LD1/ST1 copy                                    :  13421.6 MB/s
 NEON STP fill                                        :  47871.6 MB/s
 NEON STNP fill                                       :  47864.8 MB/s
 ARM LDP/STP copy                                     :  13426.0 MB/s
 ARM STP fill                                         :  47841.2 MB/s
 ARM STNP fill                                        :  47813.0 MB/s

==========================================================================
== Memory latency test                                                  ==
==                                                                      ==
== Average time is measured for random memory accesses in the buffers   ==
== of different sizes. The larger is the buffer, the more significant   ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM      ==
== accesses. For extremely large buffer sizes we are expecting to see   ==
== page table walk with several requests to SDRAM for almost every      ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest).                                         ==
==                                                                      ==
== Note 1: All the numbers are representing extra time, which needs to  ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if    ==
==         the memory subsystem can't handle multiple outstanding       ==
==         requests, dual random read has the same timings as two       ==
==         single reads performed one after another.                    ==
==========================================================================

block size : single random read / dual random read, [MADV_NOHUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns 
      2048 :    0.0 ns          /     0.0 ns 
      4096 :    0.0 ns          /     0.0 ns 
      8192 :    0.0 ns          /     0.0 ns 
     16384 :    0.0 ns          /     0.0 ns 
     32768 :    0.0 ns          /     0.0 ns 
     65536 :    0.0 ns          /     0.0 ns 
    131072 :    1.2 ns          /     1.7 ns 
    262144 :    2.2 ns          /     2.7 ns 
    524288 :    3.2 ns          /     3.6 ns 
   1048576 :    4.9 ns          /     6.3 ns 
   2097152 :   17.5 ns          /    24.7 ns 
   4194304 :   24.1 ns          /    30.6 ns 
   8388608 :   29.4 ns          /    35.0 ns 
  16777216 :   39.1 ns          /    49.1 ns 
  33554432 :   67.0 ns          /    87.6 ns 
  67108864 :   81.0 ns          /    99.2 ns 

block size : single random read / dual random read, [MADV_HUGEPAGE]
      1024 :    0.0 ns          /     0.0 ns 
      2048 :    0.0 ns          /     0.0 ns 
      4096 :    0.0 ns          /     0.0 ns 
      8192 :    0.0 ns          /     0.0 ns 
     16384 :    0.0 ns          /     0.0 ns 
     32768 :    0.0 ns          /     0.0 ns 
     65536 :    0.0 ns          /     0.0 ns 
    131072 :    1.2 ns          /     1.7 ns 
    262144 :    1.8 ns          /     2.2 ns 
    524288 :    2.1 ns          /     2.4 ns 
   1048576 :    2.6 ns          /     2.8 ns 
   2097152 :   15.8 ns          /    23.1 ns 
   4194304 :   22.9 ns          /    28.9 ns 
   8388608 :   27.0 ns          /    30.8 ns 
  16777216 :   27.9 ns          /    31.9 ns 
  33554432 :   60.6 ns          /    80.2 ns 
  67108864 :   73.2 ns          /    90.4 ns 

sbc-bench results

Run sbc-bench and paste a link to the results here:

wget https://raw.githubusercontent.com/ThomasKaiser/sbc-bench/master/sbc-bench.sh
sudo /bin/bash ./sbc-bench.sh -r

Phoronix Test Suite

Results from pi-general-benchmark.sh:

  • pts/encode-mp3: 9.339 sec
  • pts/x264 4K: 29.01 fps
  • pts/x264 1080p: 71.35 fps
  • pts/phpbench: 533767
  • pts/build-linux-kernel (defconfig): 94.042 sec

@geerlingguy

Phoronix has a review up with some preliminary benchmarks: System76 Thelio Astra Reviewed: High-End ARM64 Developer Desktop.

@geerlingguy

Note: I was originally going to have most of the benchmarking done already... but I had to be silly and try out a bunch of other GPUs. I then ran into the fun parade of Nvidia proprietary vs Ubuntu included vs Nouveau drivers, and totally borked my Ubuntu install...

Couple that with Arm64 needing specific card hardware support for video output pre-OS boot, and I got it nice and mangled. Need to reinstall Ubuntu and start fresh again with the A400 they included ;)

@joespeed

@geerlingguy

geerlingguy commented Oct 23, 2024

Ubuntu reinstalled. I downloaded Ubuntu 24.04.1 Server for arm64 and installed it through OpenBMC's remote KVM (the Serial-over-LAN (SOL) console would also work; it handles a text/console-based install surprisingly well). I am now running through Ampere's guide for setting up an Nvidia graphics-accelerated Linux desktop environment:

# 1 - Disable PCIe ASPM:
sudo nano /etc/default/grub

# Add pcie_aspm=off to kernel parameters
GRUB_CMDLINE_LINUX_DEFAULT="pcie_aspm=off"

sudo update-grub

# 2 - Install Ubuntu's desktop environment
sudo apt install -y ubuntu-desktop

# 3 - Install Nvidia graphics drivers
sudo ubuntu-drivers list --gpgpu  # list available drivers
sudo ubuntu-drivers install nvidia-driver-550  # install a driver (note: this was preinstalled)

# 4 - Blacklist nouveau
sudo nano /etc/modprobe.d/blacklist-nouveau.conf

# Put this inside and save it before rebooting
blacklist nouveau
options nouveau modeset=0

# 5 - Reboot the system
sudo reboot
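
After the reboot, the result of the steps above can be checked with a couple of small helpers. This is a minimal sketch, not part of Ampere's guide: `cmdline_has_param` and `module_loaded` are hypothetical names, and the second helper assumes `lsmod`-style text (a header line followed by one module per line).

```shell
# Hypothetical helpers for checking the result of the steps above.

# Succeeds if the kernel command line contains the given parameter.
# $1 = parameter, $2 = command line text (defaults to /proc/cmdline).
cmdline_has_param() {
  local param=$1
  local cmdline=${2:-$(cat /proc/cmdline)}
  case " $cmdline " in
    *" $param "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Succeeds if the named module appears in lsmod-style text on stdin.
# The first line is skipped because lsmod prints a header there.
module_loaded() {
  awk -v m="$1" 'NR > 1 && $1 == m { found = 1 } END { exit !found }'
}
```

For example, `cmdline_has_param pcie_aspm=off && echo "ASPM disabled"` checks step 1, and `lsmod | module_loaded nouveau && echo "nouveau is still loaded"` checks whether the blacklist from step 4 took effect.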

Also noting timings here, since it can be a bit disconcerting how long boot processes take compared to something like a Raspberry Pi, Mac, or typical consumer PC—this is server-grade hardware, with server-grade boot times:

  • 00:00: Press power button
  • 01:00: Motherboard BIOS splash screen (ASRock Rack logo and boot commands displayed on screen)
  • 01:15: EFI stub messages appear, Linux boot begins (VGA output just shows stub messages this entire period)
  • 04:00: Ubuntu desktop environment appears (normal boot conditions)
  • Under abnormal boot conditions, like when nouveau is active and it is erroring out:
    • 05:00: OpenBMC seems to go away at this point, but VGA output remains
    • 06:00: VGA output goes blank, system reachable over SSH
    • 06:30: Mouse cursor appears over blank screen on VGA and graphics card DisplayPort outputs

It seems like the desktop rendering doesn't work out of the box with Ubuntu's default install... interestingly, I had the exact same issue on my old 2013 MacBook Air after attempting an Ubuntu 24.04 install (it worked on 22.04). So maybe if the drivers aren't perfect OOTB, it does this non-rendered desktop environment thing? Is there a regression in the nouveau drivers?

@geerlingguy

It seems like it could be a nouveau issue, after looking in dmesg logs:

[   21.727151] nouveau 0004:01:00.0: Adding to iommu group 28
[   22.774703] i2c_designware APMC0D0F:00: controller timed out
... [message repeats] ...
[   54.294665] watchdog: BUG: soft lockup - CPU#122 stuck for 26s! [jbd2/dm-0-8:1631]
[   54.302232] Modules linked in: nouveau(+) snd_hda_intel(+) snd_intel_dspcfg snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_seq_midi snd_seq_midi_event drm_gpuvm snd_rawmidi drm_exec gpu_sched nls_iso8859_1 drm_ttm_helper snd_seq acpi_ipmi ttm ipmi_ssif(+) snd_seq_device irdma drm_display_helper snd_timer i40e cec snd rc_core soundcore ib_uverbs ast ib_core ipmi_devintf ipmi_msghandler arm_dmc620_pmu xgene_hwmon arm_cmn acpiphp_ampere_altra cppc_cpufreq arm_dsu_pmu acpi_tad joydev input_leds apple_mfi_fastcharge dm_multipath efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 cdc_ether usbnet hid_apple hid_logitech_hidpp hid_logitech_dj hid_generic usbhid hid uas usb_storage crct10dif_ce polyval_ce polyval_generic ghash_ce sm4 sha2_ce sha256_arm64 ice ixgbe nvme sha1_ce igb nvme_core xhci_pci xfrm_algo i2c_algo_bit mdio xhci_pci_renesas nvme_auth gnss aes_neon_bs
[   54.302392]  aes_neon_blk aes_ce_blk aes_ce_cipher
[   54.302402] CPU: 122 PID: 1631 Comm: jbd2/dm-0-8 Not tainted 6.8.0-47-generic #47-Ubuntu
[   54.302408] Hardware name: System76 Thelio Astra/Thelio Astra, BIOS 3.02 08/20/2024
[   54.302412] pstate: 00400009 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[   54.302418] pc : queued_spin_lock_slowpath+0x90/0x4f0
[   54.302430] lr : _raw_spin_lock+0x84/0xb8
[   54.302438] sp : ffff80008e98b8b0
[   54.302440] x29: ffff80008e98b8b0 x28: 0000000000000000 x27: ffff80008e98bcd8
[   54.302449] x26: ffff07ff9d751740 x25: ffffc7ec57f40000 x24: ffffc7ec57f40000
[   54.302457] x23: ffffc7ec5d78cc84 x22: 0000000000000000 x21: ffff07ff9d751748
[   54.302464] x20: 0000000000000000 x19: ffff07ff9d751748 x18: ffff80008e871080
[   54.302472] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[   54.302479] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[   54.302486] x11: 0000000000000000 x10: 0000000000000000 x9 : ffffc7ec5e59047c
[   54.302493] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000
[   54.302499] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
[   54.302506] x2 : 0000000000000001 x1 : 0000000000000001 x0 : 0000000000000001
[   54.302513] Call trace:
[   54.302516]  queued_spin_lock_slowpath+0x90/0x4f0
[   54.302524]  _raw_spin_lock+0x84/0xb8
[   54.302530]  nvme_queue_rqs+0xe4/0x2f8 [nvme]
[   54.302552]  blk_mq_flush_plug_list.part.0+0x16c/0x1c0
[   54.302561]  blk_add_rq_to_plug+0x1ac/0x2a0
[   54.302567]  blk_mq_submit_bio+0x530/0x6e0
[   54.302573]  __submit_bio+0x100/0x220
[   54.302580]  __submit_bio_noacct+0x68/0x1f0
[   54.302586]  submit_bio_noacct_nocheck+0x1e8/0x230
[   54.302591]  submit_bio_noacct+0x134/0x658
[   54.302597]  submit_bio+0xc0/0x190
[   54.302602]  submit_bh_wbc+0x150/0x220
[   54.302609]  submit_bh+0x20/0x50
[   54.302615]  jbd2_journal_commit_transaction+0x4d0/0x1708
[   54.302622]  kjournald2+0xc8/0x298
[   54.302627]  kthread+0xf8/0x110
[   54.302632]  ret_from_fork+0x10/0x20
[   54.326691] i2c_designware APMC0D0F:00: controller timed out
... [message repeats] ...
... [lots more of the spinlock messages too] ...
[  379.890828] nouveau 0004:01:00.0: gr: DATA_ERROR 0000009c [] ch 1 [00ffb40000 gnome-shell[4279]] subc 0 class c797 mthd 0d78 data 00000004
[  379.893910] ast 0003:02:00.0: swiotlb buffer is full (sz: 8388608 bytes), total 32768 (slots), used 0 (slots)
[  379.903289] nouveau 0004:01:00.0: gr: DATA_ERROR 0000009c [] ch 1 [00ffb40000 gnome-shell[4279]] subc 0 class c797 mthd 0d78 data 00000004
[  379.915735] nouveau 0004:01:00.0: gr: DATA_ERROR 0000009c [] ch 1 [00ffb40000 gnome-shell[4279]] subc 0 class c797 mthd 17e0 data 00000018
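
Since these messages repeat many times, a quick tally makes it easier to compare boots with and without nouveau. The following is a sketch under the assumption that the message text matches the excerpt above; `dmesg_summary` is a hypothetical helper name.

```shell
# Hypothetical helper: summarise kernel log text read from stdin.
# Counts the three message types seen in the excerpt above.
dmesg_summary() {
  awk '
    /watchdog: BUG: soft lockup/            { lockups++ }
    /i2c_designware .*controller timed out/ { i2c++ }
    /nouveau .*DATA_ERROR/                  { gr++ }
    END {
      printf "soft_lockups=%d i2c_timeouts=%d nouveau_data_errors=%d\n",
             lockups, i2c, gr
    }
  '
}
```

It can be fed from the live log with `sudo dmesg | dmesg_summary`, or from a saved copy of the log.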

@geerlingguy

geerlingguy commented Oct 23, 2024

After blacklisting the nouveau driver and rebooting (the ubuntu 550 nvidia driver was preinstalled), I am now getting display output over VGA (and in the BMC KVM).

However, it seems to not be using any GPU acceleration... glmark2 shows LLVM rendering.

I unplugged VGA output to my monitor, and plugged in DisplayPort to port 1 on the A400. Now once it hits Checkpoint 92, I see the BIOS screen on the VGA output / BMC KVM, and the displayport screen goes from 'no signal' to blank... but then it stalls out. Giving it another few minutes to see if something's just delaying boot.

Only bug I've found somewhat related is this one, but it's about Linux boot not seeing a CPU sometimes... in my case, it seems like the machine stalls at Checkpoint 92.

[Edit: And after waiting another 3 minutes or so, it looks like the whole system rebooted—it's going through DRAM checks and all the Checkpoints again now... stuck again at Checkpoint 92.]

[Edit 2: And if I unplug the display from the DP connector on the Nvidia A400, and reboot with only VGA plugged in, it reliably gets past Checkpoint AD into Linux system boot, and completes startup.]

@geerlingguy

Going to pause my testing on the workstation for the time being—it looks like there are two main issues I'm hitting:

  1. The fans (at least the CPU coolers) remain at idle and don't spin up when needed, and this leads to lockups under high load over long periods. See geerlingguy/top500-benchmark#44 (comment)
  2. I can't get external display output to work through the Nvidia card with either Nvidia's driver install or the Ubuntu driver install. The system doesn't get past 'Checkpoint 92'.

@geerlingguy

Just a note: with this configuration (a lot of sticks of ECC RAM), it idles around 100 W, and at system shutdown it uses about 7 W of power to run the BMC/IPMI.

@bexcran

bexcran commented Oct 23, 2024

Some more detailed information about the boot process:

  • 00:00: Press power button
  • 00:10: SCP completes booting. Powers on the host ARMv8-A CPU.
  • 00:25: TF-A finishes DRAM initialization and training.
  • 00:35: TF-A finishes copying UEFI from SPI-NOR into DRAM.
  • 01:00: Motherboard BIOS splash screen (ASRock Rack logo and boot commands displayed on screen)
  • ...

Also, if you ssh to BMC ports 2200, 2201, 2202 (using the same login as the BMC) you can see the SOL consoles for the host, SCP (PMPro and SMPro) and the secure TF-A console.
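
As a rough sketch of the above, a helper can build the `ssh` command for one of those console ports. `sol_ssh_cmd`, `BMC_USER`, and `BMC_HOST` are hypothetical names introduced for illustration, and the fallback values are placeholders; the comment does not say which port maps to which console, so the helper does not guess.

```shell
# Hypothetical helper: print the ssh command for one of the BMC console ports.
# BMC_USER and BMC_HOST are placeholder variables, not names from the BMC.
sol_ssh_cmd() {
  local port=$1
  case "$port" in
    2200|2201|2202) ;;
    *) echo "unexpected console port: $port" >&2; return 1 ;;
  esac
  printf 'ssh -p %s %s@%s\n' "$port" "${BMC_USER:-bmc-user}" "${BMC_HOST:-bmc-host}"
}
```

With `BMC_USER` and `BMC_HOST` set to the BMC login and address, `$(sol_ssh_cmd 2200)` could be run directly to open that console.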

@geerlingguy

geerlingguy commented Dec 8, 2024

I ran into overheating / thermal shutdown issues (RAM and SoC were both throwing temperature errors after 10-20 minutes of a heavy workload).

The fix was to install System76's 'driver' package, which includes a system76-power daemon that runs in the background and adjusts fan speeds on the system. I followed the System76 Driver (Install) documentation:

sudo apt-add-repository -y ppa:system76-dev/stable
sudo apt-get update
sudo apt install system76-driver

That succeeded, and now:

system76@thelio-astra:~$ ps ax | grep system76-power
  12737 ?        Ss     0:00 /usr/bin/system76-power daemon
  18538 pts/0    S+     0:00 grep --color=auto system76-power
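
The same check can be scripted so that it does not match the `grep` process itself, as happens in the output above. This is a minimal sketch: `system76_power_running` is a hypothetical helper name, and it assumes `ps ax` column order (PID, TTY, STAT, TIME, COMMAND).

```shell
# Hypothetical helper: succeed if `ps ax`-style text on stdin contains
# the system76-power daemon. Matching on the command column avoids
# counting the grep process itself, as seen in the output above.
system76_power_running() {
  awk '$5 ~ /(^|\/)system76-power$/ && $6 == "daemon" { found = 1 }
       END { exit !found }'
}
```

For example: `ps ax | system76_power_running && echo "fan daemon is running"`.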

I'd like to know if there's a way to set a default fan curve (or something similar) that doesn't require the daemon to be running. I'd also like to see if there's a way to get sensor data into OpenBMC.

For more, see: geerlingguy/top500-benchmark#44 (comment) (and the following comments).

@geerlingguy

@bexcran - FYI I am not able to run Geekbench 6 (not that it gives great data for this platform... but still)—it bails out each time during the multi core 'Background Blur' test, which I believe uses NEON and is the main benchmark that seems to give a lot of Arm CPUs a hard time (my overclocked Pi 5's always barfed at that point!).

I was wondering if you might be able to give it a run and see what happens, if it might be able to run on another system?

An interesting aside: it seems like the Primate Labs support site has been practically silent for a year or so now. I have a few questions out about these things, but nothing's gotten a response :(

@geerlingguy

I also tried running GravityMark (with Vulkan) and glmark2-es2-wayland, and neither completed. I would eventually get a (core dumped). I didn't see anything relevant in dmesg, just a bunch of messages every few minutes about apparmor denying snap from doing stuff:

[ 1442.033754] audit: type=1400 audit(1733860961.332:179): apparmor="DENIED" operation="capable" class="cap" profile="/usr/lib/snapd/snap-confine" pid=10601 comm="snap-confine" capability=12  capname="net_admin"

I am trying GravityMark with OpenGL instead of Vulkan now...

@geerlingguy

geerlingguy commented Dec 10, 2024

GravityMark OpenGL does seem to work:

M:      0 us: ../data.zip: 313 files
E:    386 us: Cache::open(): can't seek to the header
Invalid argument
M:   3.30 ms: Temporal antialiasing
M:   3.33 ms: Fullscreen mode
M:   3.34 ms: Render Statistics
M:  29.35 ms: Build Date: Sep 22 2024
M:  29.41 ms: Build Info: version=20240515; linux; arm64; release; vk=1; gl=45; gles=32; cu=1; fusion
M:  29.42 ms: Build Version: 1.88
M:  54.55 ms: Name: System76 Thelio Astra Thelio Astra
M:  54.60 ms: System: Ubuntu 24.04.1 LTS
M:  54.64 ms: Kernel: Linux 6.8.0-49-generic aarch64
M:  54.65 ms: Memory: 502.25 GB
M:  54.67 ms: Uptime: 27 m 22 s
M:  54.68 ms: CPU: arm64
M:  54.70 ms: GPU: NVIDIA RTX A400
M:  54.71 ms: Device: VEN_10DE&DEV_25B2&SUBSYS_187910DE
M:  54.74 ms: Version: 550.120
M:  54.75 ms: Memory: 4.00 GB
M:  54.76 ms: Screens: 1
M:  56.46 ms: Desktop: 2944x1080 1.0
M:  56.51 ms: Screen 0: 1024x768 1920 0 VGA-1
M:  56.52 ms: Screen 1: 1920x1080 0 0 DP-3
M:  56.53 ms: Set fullscreen mode on 0 screen
M:  56.78 ms: Creating 1024x768 OpenGL Window
M: 146.31 ms: Render Size: 1024x768
M: 146.46 ms: Using Fetch Mode
M: 321.14 ms: Device: NVIDIA RTX A400/PCIe
M: 321.18 ms: Vendor: NVIDIA Corporation
M: 321.19 ms: Version: 4.5.0 NVIDIA 550.120
M: 321.20 ms: Video Memory: 4.00 GB
M: 321.21 ms: Max Uniform Size: 64.00 KB
M: 321.22 ms: Max Storage Size: 2.00 GB
M: 321.24 ms: Creating SceneManager
M:   3.924 s: Creating RenderManager
M:   8.969 s: Rasterization Mode
M:   8.969 s: Creating Scene
M:  11.369 s: Creating 200,000 Asteroids
M:  11.591 s: Updating Scene
M:  13.721 s: GravityMark 1.88 OpenGL is Ready in 13.7 s
M:  13.721 s: Starting 1024x768 OpenGL Benchmark
M:  13.721 s: Count: 1
M:  13.752 s: Resizing 1024x768 frame
M:  3:00.819: Benchmark Finished
M:  3:00.819: NVIDIA RTX A400/PCIe
M:  3:00.819: API: OpenGL
M:  3:00.819: Platform: Linux
M:  3:00.819: Resolution: 1024x768
M:  3:00.819: Antialiasing: Temporal
M:  3:00.819: Asteroids: 200,000
M:  3:00.819: Score: 9125
M:  3:00.819: Time: 167.1 s
M:  3:00.819: FPS: 54.6
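
To record results from several runs, the final score and FPS can be extracted from a saved copy of the log. This is a sketch that assumes the `Score: <n>` and `FPS: <n>` result lines shown above; `gravitymark_result` is a hypothetical helper name.

```shell
# Hypothetical helper: pull the final score and FPS out of a GravityMark log
# read from stdin. Assumes the "Score: <n>" and "FPS: <n>" result lines
# shown above, with the value as the last field on the line.
gravitymark_result() {
  awk '
    / Score: [0-9]/ { score = $NF }
    / FPS: [0-9]/   { fps = $NF }
    END { printf "score=%s fps=%s\n", score, fps }
  '
}
```

Fed the log above, this prints `score=9125 fps=54.6`.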

Running again on the monitor so I could get some output, I got this result:

https://gravitymark.tellusim.com/report/?id=0125a17841303c3abab85af7c6f69315428ddb74

@geerlingguy

Ollama seems to install and detect the Nvidia GPU successfully:

system76@thelio-astra:~/Downloads$ curl -fsSL https://ollama.com/install.sh | sh
>>> Installing ollama to /usr/local
>>> Downloading Linux arm64 bundle
######################################################################## 100.0%
>>> Creating ollama user...
>>> Adding ollama user to render group...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> Enabling and starting ollama service...
Created symlink /etc/systemd/system/default.target.wants/ollama.service → /etc/systemd/system/ollama.service.
>>> NVIDIA GPU installed.

I downloaded llama3.2:3b, ran it, and it was definitely GPU-accelerated, giving me good speed—will bench it and put those results above.
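
For the benchmark numbers, one approach is to run the model with `--verbose` and pick out the generation speed from the timing statistics. This is a minimal sketch, assuming the statistics include a line of the form `eval rate: <n> tokens/s`; `ollama_eval_rate` is a hypothetical helper name.

```shell
# Hypothetical helper: print the generation speed (tokens/s) from the
# statistics that `ollama run --verbose` prints, read from stdin.
# The "prompt eval rate:" line is deliberately not matched.
ollama_eval_rate() {
  awk '/^eval rate:/ { print $3 }'
}
```

A run might look like `ollama run llama3.2:3b "Why is the sky blue?" --verbose 2>&1 | ollama_eval_rate`; the `2>&1` is there in case the statistics are written to stderr rather than stdout.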

@geerlingguy

In my testing of Ollama on the Thelio Astra, it works out of the box with any Nvidia GPU via CUDA, but AMD GPUs don't work due to ROCm not yet supporting arm64, and Ollama not supporting Vulkan; see geerlingguy/ollama-benchmark#5

But I could successfully run llama.cpp on my AMD Pro W7700, as well as an Nvidia RTX A4000 that was a drop-in upgrade.

@geerlingguy

Blender provides arm64 builds for Windows and macOS, but sadly, not Linux!

https://builder.blender.org/download/daily/

I am trying the Ubuntu repo version, I'm guessing it won't support Nvidia GPU, but we'll see...

@geerlingguy

I also tried setting up Box86/Box64/Steam using Ampere's Steam guide, but ran into ptitSeb/box86#1015 — I might need to revamp their guide entirely. It would be nice if Pi-Apps would work out of the box on Ampere machines... maybe worth pursuing.

@geerlingguy

geerlingguy commented Dec 11, 2024

Windows 11 install is going off without a hitch—based on AmpereComputing/Windows-11-On-Ampere#6 (comment) (the UUPdump download). And when it booted, I still had my display plugged into the A4000's DisplayPort connection—and Windows 11 seems to have identified that as an output. There's no acceleration, and the resolution is like 480p stretched, but I'm surprised it's giving output through the Nvidia GPU, and not only through VGA!

System76 shipped this system back with this $25 TPM 2.0 module installed, and that seems to be enough to get the Installer happy without hacking the registry.
