
FW16 Freeze then Reboot (FTR) S5_RESET_STATUS = 0x08000800 <- Sync Flood. #41


Open
1 task
jcdutton opened this issue Feb 2, 2025 · 60 comments
Labels
bug Something isn't working Laptop 16 AMD Ryzen 7040 Framework Laptop 16 (AMD Ryzen™ 7040 Series)

Comments

@jcdutton

jcdutton commented Feb 2, 2025

Device Information

System Model or SKU

[ ] Framework Laptop 16 (AMD Ryzen™ 7040 Series)
No dGPU.

BIOS VERSION

3.0.5

Windows:
N/A

Linux:
Open a terminal and run the following command
sudo dmidecode --string bios-version
03.05

DIY Edition information

Memory: Manufacture and SKU
Kingston Fury Impact: Part Number: KF556S40-32
2x making 64GB total.
Storage: Manufacture and SKU
Model Number: WD_BLACK SN850X 1000GB
Firmware Version: 620361WD

Port/Peripheral information

  1. USB-C card, nothing plugged in.
  2. Empty
  3. Empty
  4. Empty
  5. USB-C card, FW16 PSU plugged in.
  6. USB-A card, nothing plugged in.

Standalone Operation

Are you running your mainboard as a standalone device. Is standalone mode enabled in the BIOS?

  • No

Describe the bug

S5_RESET_STATUS = 0x08000800 <- Sync Flood.

Occasionally, about once a month I get a random crash/freeze then about 20 seconds later a reboot.
There are some details in this community thread:
https://community.frame.work/t/frwk16-random-crash-then-reboots/62411/31

This issue is only for "Freeze then Reboot" issues. Not "Freeze then power off".

Steps To Reproduce

Steps to reproduce the behavior:

  1. Start from a powered off laptop.
  2. Power on laptop
  3. Wait a random amount of time. Play videos, netflix, youtube etc.
  4. System freezes for about 20 seconds and then reboots itself.

Note: I generally have the power plugged in most of the time. For all the FTR I have seen, the power was plugged in at the time. The PSU used is the FW provided one that comes with the FW16.
Note: The FW16 was not under high load. The CPU fans were not audibly running, i.e. I could not hear them above the Netflix / YouTube video playing.

Expected behavior

It should not randomly freeze then reboot. (FTR)

Screenshots

N/A

Operating System (please complete the following information):

  • OS/Distribution: Linux/Ubuntu
  • Version: 24.04
  • Linux Kernel Version: uname -a 6.12.7 <- Mainline compiled kernel.

Additional context

Add any other context about the problem here.

@jcdutton
Author

jcdutton commented Feb 2, 2025

There has been an interest in understanding the port80 codes.
On the FW16 AMD, these are 32bit values, and not 8bit values.
I have a version of the EC firmware that dumps the last 4096 PORT80 codes, and this is enough to survive a reboot of the CPU. During the CPU reboot, the EC does not reset, so the PORT80 codes should be retained.

I attach some files, that contain the PORT80 codes output during a normal reboot and a normal cold-boot. They have been sorted and de-duped to shorten the list of codes.

It would be helpful if FW support could identify what each code means.
When the problem next occurs, I will be capturing the PORT80 codes and posting them here.

port80-reboot-de-dupe1.txt
port80-cold-boot-de-dupe1.txt

@jcdutton
Author

jcdutton commented Feb 2, 2025

People are seeing 3 different scenarios:

  1. an instant power off. Needing power button to switch on.
  2. a system pause/blank screen, for about 20 seconds, then a reboot.

In both cases, no logs or crash dumps to help diagnose the problem.

  3. the screen goes blank, but does not power off and does not reboot.

All three are difficult to reproduce.
I see (2) about once a month.
If people are seeing (3), that is just a gpu driver bug, so gather crash traces and report to amd.
If people see (1), please raise a different Issue, and don't mix it with this one.

This issue is only for "Freeze then Reboot" issues. Not "Instant power off".

@jcdutton
Author

jcdutton commented Feb 2, 2025

As an aid to people trying to understand the EC console output.
E.g. [66390.711600 HC 0x0115 err 1]
#define EC_CMD_PD_GET_LOG_ENTRY 0x0115
So this is the Host trying to get the log entries from the EC firmware, but on the FW16, this does not appear to be implemented, thus returning an err 1.

Please see attached file for the definition of each HC (Host command to EC)

ec_commands.txt
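
A minimal sketch of how such console lines could be picked apart, assuming they always follow the `[timestamp HC 0xNNNN err N]` shape shown above. The function name `parse_hc_line` and the one-entry lookup table are hypothetical; the table would need to be filled in from ec_commands.txt.

```python
import re

# Shape of the EC console line quoted above: [66390.711600 HC 0x0115 err 1]
HC_LINE = re.compile(r"\[(\d+\.\d+) HC (0x[0-9a-fA-F]+) err (\d+)\]")

# Only the one command named in this comment; extend from ec_commands.txt.
KNOWN_COMMANDS = {0x0115: "EC_CMD_PD_GET_LOG_ENTRY"}


def parse_hc_line(line):
    """Return a dict describing one failed host command, or None if no match."""
    match = HC_LINE.search(line)
    if match is None:
        return None
    timestamp, command, err = match.groups()
    command_id = int(command, 16)
    return {
        "time": float(timestamp),
        "command": command_id,
        "name": KNOWN_COMMANDS.get(command_id, "unknown"),
        "err": int(err),
    }
```

For the example line above, `parse_hc_line("[66390.711600 HC 0x0115 err 1]")` reports the name `EC_CMD_PD_GET_LOG_ENTRY` and `err` 1.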

@jcdutton
Author

jcdutton commented Feb 3, 2025

When comparing multiple port80 traces, e.g. from repeating a "reboot" and then comparing the sequences of port80 values, it appears that the capture of port80 values by the EC is not perfect. There are bit errors and missed values.
I.e. sometimes a value is absent from the sequence compared to a different reboot.
Also, sometimes some bits are set in the 32bit value, and other times they are not set. This would hint at bit errors being introduced.
e.g.
ef008888
vs
00008888
vs
b0008888
vs
000b8888
The "ef" has bits set when maybe they should not have been set.
The "b0" has bits set when maybe they should not have been set.
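
One way to make the comparison explicit is to XOR two captures of what should be the same code and list the bit positions that differ. This is only an illustrative sketch; `differing_bits` is a hypothetical helper name, and treating 00008888 as the reference value is an assumption.

```python
def differing_bits(a, b):
    """Return the bit positions (31..0) where two 32-bit port80 values differ."""
    diff = (a ^ b) & 0xFFFFFFFF
    return [bit for bit in range(31, -1, -1) if (diff >> bit) & 1]


# The captures listed above, compared against 00008888.
reference = 0x00008888
for captured in (0xEF008888, 0xB0008888, 0x000B8888):
    print(f"{captured:08x}: extra/missing bits {differing_bits(captured, reference)}")
```

In all three cases the low 16 bits (8888) agree and only upper bits differ.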

I will see if there is some bug in the EC code or something else causing the problem.

I have now confirmed there is a problem.
I can write predictable values to port80 from user-space, and some of the output is lost.
The EC does actually read all values correctly if a usleep(2000) is used between each outl().
The EC does actually read "a value" for each value written to port80 if a reasonable usleep(100) is made between each outl(), but the value is sometimes corrupted.
So, it appears the corruption happens if the outl() calls are made too quickly, without a delay between writes.

Further investigation:
I don't have the EC datasheet. I would welcome it if someone could send me one.
The EC receives 32bit port 80 values.
The EC only has a hardware buffer of four 32bit values. (In some special cases it can handle five 32bit values.)
The EC splits the 32bit values into index/byte pairs, meaning that to read one 32bit value one actually has to read 8 bytes on the EC receiving side.
The EC is slow to respond to interrupts, so if one writes more than 4 values quickly, the port80 hardware buffer overflows and one loses port80 values.
The EC code for receiving port 80 values is here:
File: host_subs_npcx.c
static void host_port80_isr(const void *arg)
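
The overflow behaviour described above can be illustrated with a toy model: a 4-entry FIFO filled by the host at a fixed interval and drained by the EC one value at a time. This is a sketch only. The 500 µs service time used in the usage note is an invented number, not a measured EC figure, and the model covers lost values only, not corrupted ones.

```python
from collections import deque


def simulate(num_writes, write_interval_us, service_us, depth=4):
    """Toy model of the port80 hardware buffer.

    The host writes value i at time i * write_interval_us. The EC needs
    service_us (hypothetical) to read one value out of the FIFO.
    Returns (received, dropped) as lists of value indices.
    """
    fifo = deque()
    received, dropped = [], []
    ec_done_at = None  # time at which the head of the FIFO has been read

    for i in range(num_writes):
        now = i * write_interval_us
        # Let the EC drain everything it had time to read before this write.
        while fifo and ec_done_at <= now:
            received.append(fifo.popleft())
            ec_done_at += service_us
        if len(fifo) < depth:
            if not fifo:
                ec_done_at = now + service_us  # EC starts on this value now
            fifo.append(i)
        else:
            dropped.append(i)  # buffer full: the write is lost

    received.extend(fifo)  # whatever is still queued is read eventually
    return received, dropped
```

With the invented 500 µs service time, `simulate(100, 2000, 500)` loses nothing, while `simulate(100, 1, 500)` keeps only the first four values and drops the rest, which matches the general shape of the behaviour reported above.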

Side effects:
While experimenting, when the EC is busy doing something, it can slow the entire FW laptop down (to a crawling pace).
Various parts of the EC comms stop working, e.g. detecting power plug/unplug.
I have not diagnosed why EC problems would slow the entire PC down, but will look into it later.
I did not believe the EC could slow the main CPU down until I saw it myself. So it is a possible new source of something to look at if users report slowness. Particularly of interest to Real Time Audio users.


@jcdutton
Author

jcdutton commented Feb 4, 2025

During a normal boot (I.e. not a FTR):
Some of the port80 codes are interesting:
Here it is using lots of Port80 writes to say:
"Entering Vbios Boot Load%r"
Also, with what looks like two devices trying to write to port80 at the same time.
Notice the "%" instead of the "e"

00000145 | 5f535452 | _STR
00000146 | 0000000a | ....
00000147 | 00000045 | ...E
00000148 | 0000006e | ...n
00000149 | 00000074 | ...t
0000014a | 00000065 | ...e
0000014b | 00000072 | ...r
0000014c | 00000069 | ...i
0000014d | 0000006e | ...n
0000014e | 00000067 | ...g
0000014f | 00000020 | ...
00000150 | 00000056 | ...V
00000151 | 00000062 | ...b
00000152 | 00000069 | ...i
00000153 | 0000006f | ...o
00000154 | 00000073 | ...s
00000155 | 00000020 | ...
00000156 | 00000042 | ...B
00000157 | 0000006f | ...o
00000158 | 0000006f | ...o
00000159 | 00000074 | ...t
0000015a | 00000020 | ...
0000015b | 0000004c | ...L
0000015c | 0000006f | ...o
0000015d | 00000061 | ...a
0000015e | 00000064 | ...d
0000015f | 00000065 | ...e
00000160 | e825fe00 | .%..
00000161 | 00000072 | ...r
00000162 | 00000020 | ...
00000163 | 0000002e | ....
00000164 | 0000002e | ....
00000165 | 0000002e | ....
00000166 | 0000000a | ....
00000167 | 5f454e44 | _END
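
The ASCII column in the trace above can be reproduced with a short script. This is a sketch under two assumptions: the four bytes are shown most significant first (which is what makes 5f535452 read as _STR), and any value with upper bits set between the _STR and _END markers is a stray write rather than part of the text. The function names are hypothetical.

```python
START = 0x5F535452  # "_STR"
END = 0x5F454E44    # "_END"


def to_ascii(value):
    """Render a 32-bit port80 value the way the third column above does."""
    raw = value.to_bytes(4, "big")
    return "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in raw)


def reconstruct(values):
    """Join the single-byte values found between the _STR and _END markers."""
    chars = []
    inside = False
    for value in values:
        if value == START:
            inside = True
        elif value == END:
            break
        elif inside and value <= 0xFF:
            chars.append(chr(value))  # values with upper bits set are skipped
    return "".join(chars)


trace = [
    0x5F535452, 0x0A, 0x45, 0x6E, 0x74, 0x65, 0x72, 0x69, 0x6E, 0x67, 0x20,
    0x56, 0x62, 0x69, 0x6F, 0x73, 0x20, 0x42, 0x6F, 0x6F, 0x74, 0x20,
    0x4C, 0x6F, 0x61, 0x64, 0x65, 0xE825FE00, 0x72, 0x20, 0x2E, 0x2E, 0x2E,
    0x0A, 0x5F454E44,
]
print(repr(reconstruct(trace)))
```

Under those assumptions the text between the markers reads "Entering Vbios Boot Loader ...", with e825fe00 sitting between the "e" and the "r".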

@kiram9 kiram9 added bug Something isn't working Laptop 16 AMD Ryzen 7040 Framework Laptop 16 (AMD Ryzen™ 7040 Series) labels Feb 4, 2025
@jcdutton
Author

jcdutton commented Feb 7, 2025

The following file contains the Port80 codes during a reboot of the Linux operating system normally. I.e. User clicked on "reboot". The purpose of this is to

  1. Share them
  2. Maybe discover what some of them mean.
  3. There are some comments in the spreadsheet, of the values I know already.
  4. Once I see a FTR, I can then compare that bad one with this one.
  5. I have created some new Port80 codes, and they are commented in the spreadsheet.
    For example, I have modified my Linux kernel so that it outputs Port80 0xaaaa0001 while booting at the BIOS to OS transition point, i.e. start_kernel()
    Also Port80 0xaaaa0003 while rebooting at the OS to BIOS transition point.
    Also Port 80 0xbadbadXX when the EC controller sees a hardware overflow when receiving Port80 values.
    In this way, it is easier to detect whether Port80 values have been lost or not.

port80-reboot10.ods

@jcdutton
Copy link
Author

jcdutton commented Feb 9, 2025

The changes I have made to the EC and ectool that work with a FW16 AMD are here:
https://github.com/jcdutton/ectool/commits/main/
https://github.com/jcdutton/EmbeddedController/commits/fwk-lotus-azalea-19573
https://github.com/jcdutton/zephyr/commits/lotus-zephyr/

To get longer port80 history:
src/platform/ec/include/config.h:#define CONFIG_PORT80_HISTORY_LEN 4096

I have found that 4096 is about the largest value you can give it, otherwise the EC runs out of RAM and behaves badly.
A complete reboot uses about 3500 port80 records, so 4096 is adequate for our needs.
I have only ever tested writing my modified EC firmware to the RW section of the Flash, I have not overwritten the RO section of the Flash, so if anything fails in my modified EC firmware code, it just reboots back to the original FW provided RO section.
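
To illustrate why 4096 entries are enough for a reboot of about 3500 records, here is a sketch that assumes the EC history behaves like a ring buffer keeping only the most recent CONFIG_PORT80_HISTORY_LEN values. The 4-bytes-per-record figure in the comment is also an assumption; the real EC structure may be larger.

```python
from collections import deque

HISTORY_LEN = 4096  # mirrors CONFIG_PORT80_HISTORY_LEN above


def keep_last(values, history_len=HISTORY_LEN):
    """Return what a ring buffer of history_len entries would still hold."""
    return list(deque(values, maxlen=history_len))


# Assumption: one 32-bit value per record, i.e. 4096 * 4 = 16384 bytes of RAM.
reboot = keep_last(range(3500))       # a full reboot fits with room to spare
long_run = keep_last(range(5000))     # older entries fall off the front
print(len(reboot), len(long_run), long_run[0])
```

So a capture has to be dumped before more than about 4096 new codes have been written, otherwise the oldest part of the sequence is gone.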

@jcdutton
Author

My updates to the ECTool.efi reflash program, to make it slightly safer to use.
It defaults to doing nothing unless the command line arguments are exactly right.
https://github.com/jcdutton/FrameworkHacksPkg/commits/main/

I only ever write to the RW section, never to the RO section, so that if something goes wrong, it falls back to the FW provided EC firmware.

@sinatosk

sinatosk commented Feb 11, 2025

Hi,

I'm using

Framework 16 AMD Ryzen 7 7840HS with Radeon 780M ( no dGPU )
RAM/Memory 32GB ( 2x16GB ) Framework
NVME 2280: Western Digital SN850X 2TB - Firmware 620361WD
NVME 2230: Western Digital SN770M 2TB - Firmware 731120WD
BIOS 3.05
Gentoo Linux 2.17 ( Linux 6.14-rc2 mainline realtime, compiled by clang 19.1.7 march and mtune set to znver4)
KDE Plasma 6.2.5 Wayland

  1. USB-C - Framework 180W Power adapter
  2. USB-C - External display ( it has a DisplayPort USB-C adapter that plugs into Framework's USB-C module )
  3. Framework's audio module
  4. USB-C - External storage or mouse ( mouse has a USB-A to USB-C adapter that plugs into Framework's USB-C module )
  5. USB-C - generally don't use
  6. USB-C - I don't use because of Linux "sudo lsusb -v" fails #5

I originally thought this issue was with amdgpu/drm as I don't use stable kernels anymore ( not that I intend to continue but I do it to grab amdgpu/drm issues ASAP ), since kernel 6.12-rc1 I've only used mainline ( Linus Torvalds branch )

This FTR is something I've been experiencing for some time, and since BIOS 3.05 it has increased ( I believe it happened 3 times when using BIOS 3.04 but I just assumed it was amdgpu/drm at the time )

what @jcdutton says here, point 2 is the one I'm experiencing

a system pause/blank screen, for about 20 seconds, then a reboot.

The FTR issue for me triggers more frequently ( from multiple times a day down to 1-3 days ) when the Framework 16 display panel refresh rate is at 165HZ ( with Panel Self Refresh/Panel Replay enabled ), also using an external display ( 3840x2160@60HZ ) and in DC power mode ( Framework battery )

I've also noticed FTR happens ( maybe coincidence ) when watching a video ( youtube on firefox for example ) on primary or secondary display but then again the issues happens when I'm just programming ( RustRover mostly ) or browsing the web but it's random like others have said and when in AC power mode ( Framework 180W power adapter ), it could be a week or so before it FTR.

This is all while the power profile ( set via KDE power devil ) is set to powersave whether I'm in AC or DC power mode.

With the display panel refresh rate is set to 60HZ, it still FTR's but not as frequently ( that's how it feels ).

For the past week-ish though with power profile set to balanced ( sometimes performance when compiling software ) and using AC power mode ( Framework 180W power adapter ), display panel refresh rate set to 165HZ, it hasn't FTR yet.

From my perspective, it looks like it has something to do with running in powersave, and it's worse when in DC power mode

Since day one I've had the charge limit on the battery set to 80% and most of the time I'm using powersave.

What I'm using to change the power profiles ( GUI component ) with KDE, I think it's called power devil that's communicating with power profile daemon

Because I set the power profile via KDE power devil and not directly in the pseudo/proc files, the profile names differ between KDE and AMD:

KDE -> AMD

powersave -> powersave
balanced -> balanced_powersave ( if using DC )
balanced -> balanced_performance ( if using AC )
performance -> performance
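
The mapping above, written out as a small lookup. This is only a restatement of the table in this comment: the profile names are copied as reported here, and `amd_profile` is a hypothetical helper, not part of KDE or power profile daemon.

```python
def amd_profile(kde_profile, on_ac):
    """Translate a KDE power profile to the AMD-side name listed above."""
    if kde_profile == "balanced":
        # Balanced is the only profile that depends on the power source.
        return "balanced_performance" if on_ac else "balanced_powersave"
    if kde_profile in ("powersave", "performance"):
        return kde_profile
    raise ValueError(f"unknown KDE profile: {kde_profile!r}")
```

For example, `amd_profile("balanced", on_ac=False)` gives `balanced_powersave`, the DC case above.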

also ( may not have anything to do with it ) when using power profile powersave, CPB ( Core Performance Boost ) is turned off and the iGPU performance level is set to low, in other power profiles CPB is enabled and iGPU performance level set to auto

I've only just learnt of this issue here today but been experiencing this FTR since last year

edit1: I forgot to add. Just like others, after a reboot I check the kernel logs for warnings/errors and there are none

edit2: power profile daemon also adjusts AMD's ABM ( Adaptive Backlight Management ) if you don't have the kernel parameter amdgpu.abmlevel set. Mine is set as amdgpu.abmlevel=2

@PureKrome

@sinatosk just to confirm, you DON'T HAVE the AMD RX7700S in the extension port thingy? This is all occurring on your iGPU?

@sinatosk

@sinatosk just to confirm, you DON'T HAVE the AMD RX7700S in the extension port thingy? This is all occurring on your iGPU?

correct

@jcdutton
Author

Another observation.
One can force an OS crash using "echo c >/proc/sysrq-trigger".
On my system (Ubuntu) that causes the system to freeze the screen and halt the system. No reboot happens (I waited 120 seconds and no reboot).
FTR is doing freeze, wait about 20-30 seconds, then reboots.
So, the FTR is not following the configured Linux crash configuration for me.
I can only assume from this that the FTR is not OS related and does the whole Freeze and reboot without the OS knowing about it, otherwise, on my system, it should not have rebooted.

@jcdutton
Author

@sinatosk
Thank you for the information. I will try setting my system to powersave, 165Hz display in the hope that I will see the FTR more often.
Up to now, I have been on "balanced" power mode.

@sinatosk

sinatosk commented Feb 11, 2025

Up to now, I have been on "balanced" power mode.

try it in DC power mode ( battery ) too please

@jcdutton
Author

jcdutton commented Feb 12, 2025

I have some more info that might help track this down. The quote is from someone at AMD:
Quote starts----
Another thing that is really useful is that there is a register in FCH
called S5_RESET_STATUS. If you can get the value from it when this
fails it can point you at where the issue is. I don't think this patch
landed, but you can see if it works for your system to print the info.

https://marc.info/?l=linux-i2c&m=168089982408414

There should be some public documentation on interpreting
S5_RESET_STATUS somewhere, but it's slipping my mind where it is.
---Quote ends.

So, if anyone wishes to compile their own kernels and help with this problem, it's well worth applying the above patch.

Normal values:
Cold Power off / Power on:
S5_RESET_STATUS = 0x00200800
Warm reboot:
S5_RESET_STATUS = 0x00080800
Keeping your finger on the power button for 10 seconds to force the power off:
S5_RESET_STATUS = 0x00000800
Suspend:
Nothing output.

So, we are looking for values that are different than those three values.
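
A sketch of how the interesting values could be filtered out of saved kernel logs, assuming the patched kernel prints lines containing `S5_RESET_STATUS = 0x...` in the form the values are quoted in here. The function name and the sample lines are illustrative only.

```python
import re

# The three "normal" values listed above.
NORMAL_VALUES = {
    0x00200800: "cold power off / power on",
    0x00080800: "warm reboot",
    0x00000800: "power button held to force power off",
}

STATUS = re.compile(r"S5_RESET_STATUS = (0x[0-9a-fA-F]+)")


def unusual_values(log_lines):
    """Return every logged S5_RESET_STATUS value that is not a normal one."""
    found = []
    for line in log_lines:
        match = STATUS.search(line)
        if match is None:
            continue
        value = int(match.group(1), 16)
        if value not in NORMAL_VALUES:
            found.append(value)
    return found


sample = [
    "kernel: S5_RESET_STATUS = 0x00080800",
    "kernel: S5_RESET_STATUS = 0x00200800",
    "kernel: S5_RESET_STATUS = 0x08000800",
]
print([f"0x{v:08x}" for v in unusual_values(sample)])
```

Only the last sample line is reported, since the first two match the warm reboot and cold boot values.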

@sinatosk

sinatosk commented Feb 12, 2025

I'll try that on Saturday, busy tomorrow and Friday

@sydneymeyer

I have a 13" FW AMD and a longstanding issue with similar symptoms (20s freeze -> reboot, no logs), although mostly after resume from suspend.

I now have 6.12.13 compiled with the proposed patch and after a reboot ( and an unrelated hard reset in between :), i find:

Feb 13 00:05:18 fw kernel: S5_RESET_STATUS = 0x00080800
-- Boot dc7cb0856fbb40cfa5171be88a9547ba --
Feb 13 00:07:57 fw kernel: S5_RESET_STATUS = 0x00200800
-- Boot 3106ceac8bb34769bc02d093bcc8dcf4 --
-- Boot 495ba047b880414f87d5461dbeffa2bb --
-- Boot bb72a980dcb24cf883187281825ecaf6 --
Feb 13 00:15:43 fw kernel: S5_RESET_STATUS = 0x00080800

Would this (the output after the next freeze/reboot cycle, of course) be of any help, even though it is from a FW 13 AMD?

@jcdutton
Author

jcdutton commented Feb 13, 2025

Some BIOS Port codes can be found here:
They are from an older CPU, but some might be appropriate.
According to wiki here the AMD 7840HS CPU is a phoenix CPU.:
https://en.wikipedia.org/wiki/List_of_AMD_Ryzen_processors.
There do not appear to be any BIOS codes for phoenix, but the picasso ones might be similar.
The BIOS Post codes are here:
https://github.com/jcdutton/amd_firmware_binaries/blob/main/picasso/PSP/AblPostCode.h.txt
It covers 32bit Port codes from the PSP ABL with prefix: 0xEA00XXXX where XXXX is the code listed in the link above.
Some further codes are here, using the 0xEA00XXXX prefix:
https://github.com/jcdutton/amd_blobs/blob/main/picasso/PSP/bl_errorcodes_public.h.txt

For example, IPL POST codes are prefixed with 0xee00...., and ABL POST codes are prefixed with 0xea00....
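
A small sketch of sorting captured port80 values by the prefixes mentioned in this thread: 0xEA00XXXX and 0xEE00XXXX from the comment above, plus the 0xaaaaXXXX and 0xbadbadXX markers I added myself in my modified kernel and EC firmware. Matching on exactly the top 16 bits is an assumption about how strict the prefixes are, and `classify` is a hypothetical name.

```python
def classify(value):
    """Give a rough origin for one 32-bit port80 value."""
    top16 = (value >> 16) & 0xFFFF
    if top16 == 0xEA00:
        return "ABL POST code"
    if top16 == 0xEE00:
        return "IPL POST code"
    if top16 == 0xAAAA:
        return "custom Linux kernel marker"
    if (value >> 8) == 0xBADBAD:
        return "EC port80 overflow marker"
    return "unknown"
```

For instance, `classify(0xaaaa0001)` returns the custom kernel marker label, while a value with no matching prefix comes back as "unknown" and still needs looking up by hand.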


@sydneymeyer

FWIW, perhaps this can help narrow down the cause or help others being plagued by this issue.

I have experienced this (FTR) issue every 3-5 days since I received the laptop (Dec 23). Around September 2024, I contacted FW support, who suggested performing a "mainboard reset" using the instructions below.

I suppose this "mainboard reset" resets some kind of NVRAM on the board/EC (obviously, I don't understand all this low level stuff well enough). Since this reset (around 4-5 months ago), I have had around 3-4 FTR cycles (which could well be a whole different issue), but the regular, every-few-days FTR issue is apparently/hopefully gone.

Considering this, it may take a while for this to happen again here, but if it does, i'll remember and post the value in this issue.

According to wiki here the AMD 7840HS CPU is a phoenix CPU.

It's a 7840u here. I'd consider this to be a different layout of the same architecture.

Quote Instructions from Framework Support:

Please perform a full mainboard reset. This will clear up any states that are saved. Kindly follow the instructions below and see how it goes.

Plug in the system to AC.
Remove the Input Cover.
Press the chassis open switch in the center of the Mainboard 10 times, you must press it slowly, so press for 2 seconds. Release, wait for the red blink on the Mainboard LEDs. repeat.
Press the power button to boot the system
BIOS settings will be reset to defaults.

@YummYume

@sydneymeyer So do you suggest people experiencing this issue to try this as well ? It will only reset the BIOS settings ?

@sydneymeyer

@YummYume This is what Framework Support suggests people experiencing this issue do, and it had the effect for me that I described above.

It will only reset the BIOS settings ?

Well, it resets the BIOS to factory defaults, e.g. if you have set up Secure Boot with your own keys, you might have to re-enroll them. Perhaps you also might have to restore the UEFI-Boot Entries, can't remember. But other than that, yes.

@YummYume

@sydneymeyer Thanks, I'll probably try doing that tomorrow and see if it fixes anything. It's weird because I haven't touched my BIOS much anyway other than changing the max charging power for the battery.

@jcdutton
Author

jcdutton commented Feb 16, 2025

Just a comment. This issue ticket is aimed at finding the cause of the problem. So hiding it with a main board reset etc. is not the aim of this issue ticket. I.e. we are looking for a root cause.
So, we are currently looking for people who can reproduce the problem.

@sinatosk

Hi, didn't get a chance to do it yesterday as I said I would. I'll apply the patch tomorrow ( after linux 6.14-rc3 is released )

I'll then try to trigger the issue ( if possible ) over the next week or two

@sinatosk

sinatosk commented Feb 19, 2025

Just had a screen freeze. It didn't go black, but it also didn't restart, so I held the power button to force a reboot

code is 0x00000800 but @jcdutton already showed that

took 3 days for that to happen - linux 6.14-rc3, DC power mode ( battery ), external monitor enabled, framework display at 165Hz, power profile set to powersave, CPB off and iGPU low

edit1: still continuing to test this issue

@jcdutton
Author

But, just to be clear. We now only have an error sample size of 1.
I would very much like more reports than just one.

@jcdutton
Author

@sydneymeyer
Did you keep note of which cards were in what slots at the time of the FTR, and also what you were doing at the time?
Anything that helps us reach an easier way for the FW team to see the problem for themselves will be helpful.

@sydneymeyer

The issue occurs exclusively after resume from suspend. The laptop is mostly connected to a USB-C powered dock with HDMI, Ethernet and an Apple Magic Trackpad 2 attached. FTR cycles have occurred while connected to a dock (connected before suspend, while sleeping and when resuming), after disconnecting the dock while sleeping and resuming standalone, and completely standalone (i.e. suspend/resume cycles without attached dock).

Next boot, the last logged line in the system logs is usually from suspending the laptop (via systemd), nothing else. "ectool panicinfo" gives "no panic data".

OS is NixOS with respective stable kernel. Programs open are usually sway (vulkan), Firefox (w/ vaapi enabled), Thunderbird, mpv (vulkan), foot, vim, cmus with pipewire, networkd, iwd.. More or less.

Unfortunately, i have no way of reproducing the issue or have recognized a pattern other than what's described above.

  • Framework 13 AMD 7840U DIY
  • 2x16 GB DDR5-5600 Crucial Memory (CT2K16G56C46S5)
  • 2TB WD SN850X (FW 620361WD)
  • Slot 1 USB-C (back left)
  • Slot 2 USB-A (front left)
  • Slot 3 USB-C (back right)
  • Slot 4 USB-A (front right)
  • Original Framework 13 USB-C charger
  • BIOS 3.05
  • Secure Boot enabled
  • TPM not used
  • 60% Battery charge limit
  • "Gaming Mode" enabled (4GB VRAM alloc.)

@jcdutton
Author

jcdutton commented Feb 22, 2025

I just had a FTR. 2025-02-22
FW16 AMD, BIOS 3.05. I was not doing much at the time, so nothing obvious caused this. I was not playing a video at the time. I was not touching the keyboard at the time, so I did not notice it had frozen until it did the reboot.
No crash dump, No EC reboot/reset. No pstore dmesg.txt.

S5_RESET_STATUS = 0x08000800 <- Sync Flood.

Here is the port80 output:
port80-FTR.ods

Looking at the port80 output. It looks to me that the FTR happened here:
00018b86 eed50000 ....

and then after the reboot, the boot up entered / started the Linux kernel here:
000196f6 aaaa0001 (Linux: start_kernel())

Note: the aaaa0001 is added by me, so not a port80 code from the BIOS.
So, the useful port80 content is between those points.

I have added a few aaaaXXXX port80 codes written during the kernel panic and emergency reboot parts of the Linux kernel, and none of those aaaaXXXX port80 codes are present in the output, so this did not reboot due to a panic or anything like that.

So, we now have reports from 2 people with the same S5_RESET_STATUS = 0x08000800.

@jcdutton jcdutton changed the title FW16 Freeze then Reboot (FTR) FW16 Freeze then Reboot (FTR) S5_RESET_STATUS = 0x08000800 <- Sync Flood. Feb 24, 2025
@sydneymeyer

Mar 02 15:45:20 fw kernel: S5_RESET_STATUS = 0x08000800, like usual, i.e. while resuming from suspend, no logs at all.

@sydneymeyer

@jcdutton Do you need anything else from me or the machine in this state, because i'm considering resetting and selling the laptop. I'm sick of all the unresolved issues this computer has.

@helpimnotdrowning

I was linked here from the forums (https://community.frame.work/t/sudden-reboot-on-wake-with-no-logs/65434 ) and I'm experiencing a similar issue with my Framework 13 / R5 7640U. I applied the S5_RESET_STATUS patch (with light modification), and got a seemingly-new status code 0x00800800 that hasn't been mentioned here before. Is there anything more I can do on my end?

@jcdutton
Author

jcdutton commented Mar 3, 2025

I have found a message that contains something helpful.
https://lore.kernel.org/all/80dbe1de-c71c-4556-817d-3f06e67f38ba@amd.com/

In that message, there is a link to this URL:
https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/55901_B1_pub_053.zip

In that zip file there are some PDFs.
In the PDF:
55901_B1_pub_5.pdf
There is an explanation regarding each bit in the S5_RESET_STATUS
Extract included here:
PMx000000C0 (FCH::PM::S5_RESET_STATUS)
Reset: 0000_0800h.
_aliasHOSTLEGACY; PMx000000C0; PM=FED8_0300h
Bits Description
31: sw_sync_flood_flag. Read-write,Read,Write-1-to-clear. Reset: 0. PMxC0[31] will be set if sw_sync_flood (PMx88[10]) trigger reset. Write 1 to clear. Bit[31] and Bit[28:16] except bit[20] will be cleared by Last reset event except the associated bit will be set.
30: sdp_parity_err. Read-write,Read,Write-1-to-clear. Reset: 0.
Description: When there is 'parity error', Sync_Flood reset will occur and PMxC0[27] will be set if enabled, in order to distinguish 'parity error' and 'CPU sync flood', PMxC0[30] will be set when there is SDP parity Error, thus software can distinguish 'parity error' and 'CPU sync flood'.
This bit will not be cleared by other reset event, software need write 1 to clear. SDP parity error will not clear any status bit in this register.
29: mp1_wdtout. Read-write,Read,Write-1-to-clear. Reset: 0. This bit will be set to 1 when MP1_Watchdog timer time out (this indicates there was a failed warm reset handshake between SMU and FCH). This bit will not be cleared by other reset event, software need write 1 to clear. MP1_Watchdog timer time out will not clear any status bit in this register.
28: Reserved.
27: sync_flood. Read-write,Read,Write-1-to-clear. Reset: 0. system reset was caused by a SYNC_FLOOD event which was due to an UE error( when PMx74[18]=1). Write 1 to clear. Bit[31] and Bit[28:16] except bit[20] will be cleared by Last reset event except the associated bit will be set.
26: remoteresetfromasf. Read-write,Read,Write-1-to-clear. Reset: 0. system reset was caused by a remote RESET command from ASF. Write 1 to clear. Bit[31] and Bit[28:16] except bit[20] will be cleared by Last reset event except the associated bit will be set.
25: watchdogissuereset. Read-write,Read,Write-1-to-clear. Reset: 0. system reset was caused by MICROSOFT WatchDog Timer. Write 1 to clear. Bit[31] and Bit[28:16] except bit[20] will be cleared by Last reset event except the associated bit will be set.
24: failbootrst. Read-write,Read,Write-1-to-clear. Reset: 0. system reset was caused by AMD Fail boot timer. Write 1 to clear. Bit[31] and Bit[28:16] except bit[20] will be cleared by Last reset event except the associated bit will be set.
23: shutdown_msg. Read-write,Read,Write-1-to-clear. Reset: 0. system reset was caused by a SHUTDOWN command from CPU (when PMx08[20]=1 and PMx74[17]=1). Write 1 to clear. Bit[31] and Bit[28:16] except bit[20] will be cleared by Last reset event except the associated bit will be set.
22: kb_reset. Read-write,Read,Write-1-to-clear. Reset: 0. system reset was caused by assertion of KB_RST_L.. Write 1 to clear. Bit[31] and Bit[28:16] except bit[20] will be cleared by Last reset event except the associated bit will be set.
21: sleepreset. Read-write,Read,Write-1-to-clear. Reset: 0. Reset status from Sleep state (Power saving mode, S3, 4, or 5) transition. Write 1 to clear. Bit[31] and Bit[28:16] except bit[20] will be cleared by Last reset event except the associated bit will be set.
20: do_k8_full_reset. Read-write,Read,Write-1-to-clear. Reset: 0. Description: system reset was caused by CF9 = 0x0E. Write 1 to clear.
[Note] Write CF9=0xE will set this bit=1, but write CF9=0xE will generate SLpRst later which will set bit[21]=SleepReset. In order to keep this bit =1, this bit will not be cleared by hardware, software need to write 1 to clear this bit. Bit[31] and Bit[28:16] except bit[20] will be cleared by Last reset event except the associated bit will be set.
19: do_k8_reset. Read-write,Read,Write-1-to-clear. Reset: 0. system reset was caused by CF9 = 0x06. Write 1 to clear. Bit[31] and Bit[28:16] except bit[20] will be cleared by Last reset event except the associated bit will be set.
18: do_k8_init. Read-write,Read,Write-1-to-clear. Reset: 0. system reset was caused by CF9 = 0x04. Write 1 to clear. Bit[31] and Bit[28:16] except bit[20] will be cleared by Last reset event except the associated bit will be set.
17: soft_pcirst. Read-write,Read,Write-1-to-clear. Reset: 0. system reset was caused by writing to PMIO 0xC4[0] (PciReset). Write 1 to clear. Bit[31] and Bit[28:16] except bit[20] will be cleared by Last reset event except the associated bit will be set.
16: usrrstb. Read-write,Read,Write-1-to-clear. Reset: 0. Last reset was caused by BP_SYS_RST_L assertion. Write 1 to clear. Bit[31] and Bit[28:16] except bit[20] will be cleared by Last reset event except the associated bit will be set.
15:14: pmeturnofftime. Read-write. Reset: 0h.
Description:
00: 1ms
01: 2ms
10: 4ms
11: 8ms

13:10: Reserved.
9: intthermaltrip. Read-write,Read,Write-1-to-clear. Reset: 0. system was shut down due to an internal ThermalTrip event. Write 1 to clear
8:5: Reserved.
4: remotepowerdownfromasf. Read-write,Read,Write-1-to-clear. Reset: 0. SOC has received a remote Power Off command from ASF. Write 1 to clear.
3: Reserved.
2: shutdown. Read-write,Read,Write-1-to-clear. Reset: 0. system was shut down due to ShutDown event (SHUTDOWN# pin). Write 1 to clear.
1: pwrbtn4second. Read-write,Read,Write-1-to-clear. Reset: 0. system was shut down due to 4s PwrButton event. Write 1 to clear.
0: thermaltrip. Read-write,Read,Write-1-to-clear. Reset: 0. system was shut down due to BP_THERMTRIP_L assertion. Write 1 to clear.
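
The bit table above can be turned into a small decoder. This is a sketch: the bit names are copied from the extract, `decode_s5_reset_status` is a hypothetical helper name, and any set bit that the extract marks as reserved is simply reported as `reserved_bitN`.

```python
# Flag bits of FCH::PM::S5_RESET_STATUS, as listed in the extract above.
S5_RESET_STATUS_BITS = {
    31: "sw_sync_flood_flag", 30: "sdp_parity_err", 29: "mp1_wdtout",
    27: "sync_flood", 26: "remoteresetfromasf", 25: "watchdogissuereset",
    24: "failbootrst", 23: "shutdown_msg", 22: "kb_reset", 21: "sleepreset",
    20: "do_k8_full_reset", 19: "do_k8_reset", 18: "do_k8_init",
    17: "soft_pcirst", 16: "usrrstb", 9: "intthermaltrip",
    4: "remotepowerdownfromasf", 2: "shutdown", 1: "pwrbtn4second",
    0: "thermaltrip",
}

# Bits 15:14 are the pmeturnofftime field, not flags.
PME_TURN_OFF_TIME_MS = {0b00: 1, 0b01: 2, 0b10: 4, 0b11: 8}


def decode_s5_reset_status(value):
    """Return the names of all flag bits set in value, highest bit first."""
    names = []
    for bit in range(31, -1, -1):
        if bit in (15, 14):
            continue  # handled separately as a 2-bit field
        if value & (1 << bit):
            names.append(S5_RESET_STATUS_BITS.get(bit, f"reserved_bit{bit}"))
    return names


def pme_turn_off_time_ms(value):
    """Decode the pmeturnofftime field (bits 15:14) into milliseconds."""
    return PME_TURN_OFF_TIME_MS[(value >> 14) & 0b11]


for status in (0x08000800, 0x00800800, 0x00200800, 0x00080800):
    print(f"0x{status:08x}: {decode_s5_reset_status(status)}")
```

Run against the values seen in this thread, 0x08000800 decodes to sync_flood and 0x00800800 to shutdown_msg, each together with reserved bit 11, which is also set in the register's reset value 0000_0800h.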

@jcdutton
Author

jcdutton commented Mar 3, 2025

So, S5_RESET_STATUS = 0x00800800 means:
Bit 11 set. - Reserved
Bit 23 set - 23: shutdown_msg. Read-write,Read,Write-1-to-clear. Reset: 0. system reset was caused by a SHUTDOWN command from CPU (when PMx08[20]=1 and PMx74[17]=1). Write 1 to clear. Bit[31] and Bit[28:16] except bit[20] will be cleared by Last reset event except the associated bit will be set.

So, this implies the CPU caused the shutdown.
My guess, therefore, is that it's due to a software bug rather than anything more serious.
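
The bit-by-bit reading above can be sketched in Python. This is only an illustration: the bit names come from the register description quoted in this thread, bit 27 is labelled `sync_flood` on the assumption that it matches the "Sync Flood" in the issue title, and `decode_s5_reset_status` is a made-up helper name.

```python
# Sketch only: bit names as quoted in this thread; anything else is reported
# as "reserved or not listed here".
S5_RESET_STATUS_BITS = {
    27: "sync_flood",  # assumption: named after the issue title
    23: "shutdown_msg",
    19: "do_k8_reset",
    18: "do_k8_init",
    17: "soft_pcirst",
    16: "usrrstb",
    9: "intthermaltrip",
    4: "remotepowerdownfromasf",
    2: "shutdown",
    1: "pwrbtn4second",
    0: "thermaltrip",
}


def decode_s5_reset_status(value):
    """List the set bits of a 32-bit S5_RESET_STATUS value, highest bit first."""
    described = []
    for bit in range(31, -1, -1):
        if value & (1 << bit):
            name = S5_RESET_STATUS_BITS.get(bit, "reserved or not listed here")
            described.append(f"bit {bit}: {name}")
    return described


for line in decode_s5_reset_status(0x00800800):
    print(line)
```

Feeding it 0x08000800 instead reports bit 27 and bit 11, which is the value from the issue title.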

@jcdutton
Author

jcdutton commented Mar 3, 2025

Also from the PDF document linked above:

Shutdown (message from CPU):
Triple faults in CPU will cause an internal SHUTDOWN message
broadcasted. FCH will generate a reset to S0 logic; configurable to warm or cold reset.

SYNC_FLOOD (message)
Internal data fabric logic detects an error (eg. parity error) and broadcasts an internal SYNC_FLOOD message. FCH will generate a reset to S0 logic; configurable to warm or cold reset.

@sydneymeyer

configurable to warm or cold reset.

Does this (warm reset) mean the system, i.e. the EC/Kernel/CPU (obviously, i don't get it), can/could recover from this happening?

@jcdutton
Author

jcdutton commented Mar 5, 2025

Sometimes, this S5_RESET_STATUS patch might work better:

S5_RESET_STATUS2.txt

@jcdutton
Author

jcdutton commented Mar 5, 2025

This S5_RESET_STATUS patch should apply to the 6.13.5 kernel:

S5_RESET_STATUS-6.13.5.txt

@helpimnotdrowning

configurable to warm or cold reset.

Does this (warm reset) mean the system, i.e. the EC/Kernel/CPU (obviously, i don't get it), can/could recover from this happening?

It probably refers to https://en.wikipedia.org/wiki/Reboot#Cold_versus_warm_reboot , where the difference seems to be whether the system performs a complete POST sequence or just a small part of it. Doesn't sound recoverable, unfortunately.

@CDRXavier

CDRXavier commented Mar 5, 2025

Sync flood is analogous to SERR# (System Error) on a PCI bus.

This seems vaguely familiar ...
Sleep/Hibernate issue with Windows 11

Since I am running primarily on Windows, extensive Linux testing is unlikely, but I have a Linux install (which coincidentally also seems to have sleep issues). If there is anything I can do to help with either of those issues, I would gladly help.
Since we know this issue happens when entering/exiting sleep, targeted rapid testing might be possible. However, as noted in my thread as well, the issue seems inconsistent.

Is there a way for me to tap into the bus? There is an EC header, and I have logic analyzers. I can solder wires to onboard headers and monitor it externally.

Might be a case where a tiny second laptop like my GPD would come in handy.

@jcdutton
Author

jcdutton commented Mar 5, 2025

@CDRXavier
Although the symptoms are similar when sleeping, the cause is very different.

  1. Random FTR not related to sleep results in a 0x08000800 <- Sync Flood <- More hardware / BIOS related. I have been told by AMD that we, users, cannot do anything further with this. It needs special BIOS code and board analysis equipment to diagnose it further.
  2. Reboots after sleep appear to be different, resulting in a 0x00800800 <- Shutdown msg -- Triple Fault - More software related.
    These ones need debugging the kernel code to track down the triple fault.
    Note, triple faults are notoriously difficult to track down.

We are only focusing on (1) in this ISSUE ticket.
We would need a new ISSUE ticket for discussing (2)
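
To help decide which of the two cases a given report falls under, here is a rough Python sketch that pulls the value out of a pasted log line and sorts it into (1) or (2). It assumes the line looks like the `S5_RESET_STATUS = 0x...` text people have been posting in this thread, and that bit 27 is the sync flood bit and bit 23 the shutdown message bit; `classify_reset` is a made-up helper name.

```python
import re

SYNC_FLOOD = 1 << 27    # assumption: the bit behind 0x08000800 in the issue title
SHUTDOWN_MSG = 1 << 23  # shutdown_msg, as decoded earlier in this thread

# Assumed log format: "S5_RESET_STATUS = 0x08000800" (whitespace is flexible).
_STATUS_RE = re.compile(r"S5_RESET_STATUS\s*=\s*(0x[0-9a-fA-F]+)")


def classify_reset(text):
    """Return which case from the list above a pasted log line belongs to."""
    match = _STATUS_RE.search(text)
    if match is None:
        return "no S5_RESET_STATUS value found"
    value = int(match.group(1), 16)
    if value & SYNC_FLOOD:
        return "(1) sync flood: report in this issue"
    if value & SHUTDOWN_MSG:
        return "(2) shutdown message / triple fault: report in a separate issue"
    return "neither sync flood nor shutdown message"


print(classify_reset("S5_RESET_STATUS = 0x08000800"))
print(classify_reset("S5_RESET_STATUS = 0x00800800"))
```

If both bits were ever set at once, this sketch reports (1) first; that ordering is a choice made here, not something stated in the register description.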

@PureKrome

PureKrome commented Mar 6, 2025

We are only focusing on (1) in this ISSUE ticket.
We would need a new ISSUE ticket for discussing (2)

@jcdutton Would you like me to make a new ticket for (2)? That was the reason I created the conversation in the support forums, which has had a lot of discussion in it. I feel like (2) is a more common pain point for everyone.

EDIT: To Be Honest, i thought this ticket was all about (2) really

@WillNilges

@PureKrome I would join you in that ticket. I am almost certain that my issue is (2), but since installing 6.13.5 with the recommended patch yesterday I haven't been able to repro. I've seen it a lot between 6.12 and 6.13.4. Anyone seen it on 6.13.5 yet?

In case I haven't mentioned it yet I have a FW 13 Ryzen 7 7840U with these cards:

  1. USB-C
  2. USB-A
  3. USB-C
  4. HDMI

@jcdutton
Author

jcdutton commented Mar 7, 2025

Hi All,
I am the original poster of this ISSUE.

  1. Random FTR not related to sleep results in a 0x08000800 <- Sync Flood <- More hardware / BIOS related. I have been told by AMD that we, users, cannot do anything further with this. It needs special BIOS code and board analysis equipment to diagnose it further.
  2. Reboots after sleep appear to be different, resulting in a 0x00800800 <- Shutdown msg -- Triple Fault - More software related.
    These ones need debugging the kernel code to track down the triple fault.
    Note, triple faults are notoriously difficult to track down.

We are only focusing on (1) in this ISSUE ticket.
I have only seen (1) on my FW16. I have not seen (2).

Please do not report here if you have (2). Report it in a new issue.

@PureKrome

PureKrome commented Mar 8, 2025

Please do not report here if you have (2). Report it in a new issue.

@jcdutton That issue has been reported here in the official FW Community Forums.

Do I need to also create a ticket here in GH?

(also here's another link in the FW CF for another potentially related one)

@WillNilges

@PureKrome @jcdutton I've done it: #50

Can we work on centralizing folks around this issue? Please spread the word. I would really like to hear back from a Framework engineer because I cannot use my laptop because of this.

@sydneymeyer

This S5_RESET_STATUS patch should apply to kernel 6.13.5 kernel:

S5_RESET_STATUS-6.13.5.txt

FWIW, i'm now running 6.14-rc6 with this patch.

@sinatosk

Framework 16 AMD Ryzen 7 7840HS using Radeon 780M
RAM/Memory: 32GB ( 2x16GB ) Framework
NVME 2280: Western Digital SN850X 2TB - Firmware 620361WD
NVME 2230: Western Digital SN770M 2TB - Firmware 731120WD
BIOS: 3.05
Gentoo Linux 2.17 ( Linux 6.14.0 mainline realtime, compiled by clang 20.1.1 march and mtune set to znver4)
KDE Plasma 6.2.5 Wayland

I use KDE Plasma powerdevil to change the power profiles

Framework port usage:

  1. USB-C - Framework 180W Power adapter
  2. USB-C - External display ( it has a DisplayPort USB-C adapter that plugs into Framework's USB-C module )
  3. Framework's audio module
  4. USB-C - External storage or mouse ( mouse has a USB-A to USB-C adapter that plugs into Framework's USB-C module )
  5. USB-C - generally don't use
  6. USB-C - I don't use because of Linux "sudo lsusb -v" fails #5

I just had a FTR

S5_RESET_STATUS = 0x08000800

  • With AC adapter ( Framework ) plugged in
  • compiled and installed Linux 6.14.0
  • rebooted into 6.14.0
  • unplugged AC adapter
  • started KDE Plasma
  • power profile set to performance
  • opened Firefox
  • power profile set to power-save
  • started watching a video
  • some minutes ( ~46 ) later, it had an FTR

No external display was in use, none was plugged in, and the main display ( Framework ) refresh rate was set to 165Hz with ABM ( Active Backlight Management ) set to 2 via kernel command line parameter

  • power profile set to power-save
  • this sets powerprofiledaemon to power/power-save
  • this sets the CPU to power/power-save
  • also sets CPB ( Core Performance Boost ) to "off"
  • and sets iGPU to "low"

The last FTR I had was with kernel 6.14-rc1 or rc2

@sydneymeyer

Should we continue to report every instance of this (S5_RESET_STATUS = 0x08000800) happening, or is there anything else we can do regarding this issue?

To be frank, i don't see Framework taking any interest in this at all (except for what appears to be "managing the community"), and i have no desire to compile the kernel from scratch every time i do updates, forever, until Framework feels like it... or perhaps not. Who would know.

Besides, there appears to be nothing new to report, always the same, already reported many times, circumstances.

@jcdutton
Author

jcdutton commented Apr 8, 2025

@sydneymeyer
I cannot speak for FW, but for the AMD mainboards we have two different FTR and one FTH.
If there is a fault in the hardware causing these problems, I think I would want to root cause them, so that the next mainboard I designed fixed the problem.

I think it is fair to say that the problem is not reproducible on demand. Its occurrence is apparently random.
The more reports we get, the more likely we will see a pattern and be able to reproduce the problem more predictably.

I have been asked by FW support to do:
"can you do a power drain of your battery, then remove RAM, SSD, ECs and also removal of the battery? Then put them all again and boot, see if issue or behavior changes."

You might wish to try it also, to see if it helps or not.
I have not tried it yet, but I think I will do it at the same time as replacing my PTM thermal pad once it arrives from FW.
