FW16 EC slows the CPU down #43

Open
1 task
jcdutton opened this issue Feb 5, 2025 · 11 comments
Labels
Laptop 16 AMD Ryzen 7040 Framework Laptop 16 (AMD Ryzen™ 7040 Series)

Comments

@jcdutton

jcdutton commented Feb 5, 2025

Device Information

System Model or SKU

[ ] Framework Laptop 16 (AMD Ryzen™ 7040 Series)
No dGPU.

BIOS VERSION

3.0.5

Windows:
N/A

Linux:
Open a terminal and run the following command
sudo dmidecode --string bios-version
03.05

DIY Edition information

Memory: Manufacturer and SKU
Kingston Fury Impact: Part Number: KF556S40-32
2x making 64GB total.
Storage: Manufacturer and SKU
Model Number: WD_BLACK SN850X 1000GB
Firmware Version: 620361WD

Port/Peripheral information

  1. USB-C card, nothing plugged in.
  2. Empty
  3. Empty
  4. Empty
  5. USB-C card, FW16 PSU plugged in.
  6. USB-A card, nothing plugged in.

Standalone Operation

Are you running your mainboard as a standalone device. Is standalone mode enabled in the BIOS?

  • No

Describe the bug

While working on #41, I discovered that the EC (embedded controller) can slow the laptop's CPU down.
This happens when the EC is busy doing something.
This issue is a placeholder for investigating the link between the CPU and the EC: what is causing the slowdown, and how can we fix it so that the EC cannot slow the CPU down unless that is intended?
If anyone can suggest where in the Linux kernel to start looking for the link, that would be helpful.
Some user applications, such as real-time audio and digital audio workstations (DAWs), would benefit, as they rely on low latency for audio input, real-time processing and output.

Steps To Reproduce

Steps to reproduce the behavior:

  1. TBD

Expected behavior

The EC should not slow the CPU down unless that is intended, e.g. when changing "Power Modes" from "Performance" to "Power Saver".

Screenshots

N/A

Operating System (please complete the following information):

  • OS/Distribution: Linux/Ubuntu
  • Version: 24.04
  • Linux Kernel Version: 6.12.7 (from uname -a; mainline compiled kernel)

Additional context

Add any other context about the problem here.

@kiram9 kiram9 added the Laptop 16 AMD Ryzen 7040 Framework Laptop 16 (AMD Ryzen™ 7040 Series) label Feb 12, 2025
@JohnAZoidberg
Member

From #41 (comment)

While experimenting, when the EC is busy doing something, it can slow the entire FW laptop down. (to a crawling pace)
Various parts of the EC comms stops working. E.g. detecting power plug/unplug.
I have not diagnosed why EC problems would slow the entire PC down, but will look into it later.
I did not believe the EC could slow the main CPU down until I saw it myself. So it is a possible new source of something to look at if users report slowness. Particularly of interest to Real Time Audio users.

What is it that you are seeing? How do you know the EC is busy doing "something"? Are you looking at the EC console?
What is it doing/printing when you see the slow down?
I assume "to a crawling pace" means that navigating the GUI is slow?

@jcdutton
Author

I have an EC CCD plugged in, so I can see the EC's console port output.
A number of things can make the EC busy, e.g. turning up the debug output.
I either see too much output on the console port, or the EC hangs and is unresponsive to console port input.
When this happens, it also tends to slow the CPU down.
The slowdown is observed as:

  1. Very slow to boot up.
  2. Very slow to log in, etc.

Disabling the EC debug output speeds everything on the CPU up again.

I expect that there is some I/O blocking going on in the Linux kernel, and that rewriting the kernel code a bit might resolve the problem.

@ngraham20

ngraham20 commented Feb 22, 2025

Hey, I've noticed similar symptoms when running a different program (a game, Albion Online; I also use fw-fanctrl, which uses the EC to apply fan curves). For me, it only occurs in the "balanced" and "performance" profiles, and switching to "power saver" suddenly makes the system recover. Can you replicate this? Maybe they're the same issue with different triggers?

@jcdutton
Author

jcdutton commented Mar 9, 2025

I think I have determined the cause.
It is due to ACPI and transaction requests with the EC in ./drivers/acpi/ec.c
The transaction requests come in from multiple CPU cores. So one can quite quickly have all the CPU cores waiting on EC transactions to complete.

The kernel code in ./drivers/acpi/ec.c appears overly complex for what it actually needs to do, and I think that complexity has made it difficult to optimize.
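
The contention described above can be illustrated with a toy model. This is not the kernel's code: it assumes that a single lock serializes access to the EC, and the names `simulate`, `ec_lock` and `transaction_s` are made up for the illustration. Several threads stand in for CPU cores that all request one EC transaction at the same moment.

```python
import threading
import time


def simulate(workers: int, transaction_s: float) -> list[float]:
    """Return, sorted, how long each worker waited for one serialized transaction."""
    ec_lock = threading.Lock()            # assumption: one lock guards the EC
    waits = [0.0] * workers
    start_gate = threading.Barrier(workers)

    def worker(index: int) -> None:
        start_gate.wait()                 # all workers issue their request together
        t0 = time.monotonic()
        with ec_lock:                     # only one transaction can be in flight
            time.sleep(transaction_s)     # pretend the EC needs this long to answer
        waits[index] = time.monotonic() - t0

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sorted(waits)


if __name__ == "__main__":
    for wait in simulate(workers=8, transaction_s=0.01):
        print(f"{wait * 1000:.1f} ms")
```

In this model the last worker waits roughly `workers * transaction_s`, so a slow or busy EC multiplies into a stall that every requesting core sees.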

@JohnAZoidberg
Member

I think I have determined the cause.
It is due to ACPI and transaction requests with the EC in ./drivers/acpi/ec.c
The transaction requests come in from multiple CPU cores. So one can quite quickly have all the CPU cores waiting on EC transactions to complete.

Could you please elaborate how you came to this conclusion?

That would be an interesting find.
The code in this driver is generic, so it's unrelated to the cros_ec driver.

@jcdutton
Author

jcdutton commented Mar 9, 2025

ACPI talks to the EC using I/O ports: command = I/O port 0x66, data = I/O port 0x62.
cros_ec uses an alternative method for writing ectool commands to the EC.
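
For anyone reading raw values off these ports, here is a small sketch that decodes the status byte read back from the command port. The bit layout follows the ACPI specification's embedded controller interface as I understand it, so please check it against the spec. The function name `decode_ec_status` is hypothetical, and nothing here touches the hardware.

```python
EC_COMMAND_PORT = 0x66   # from the comment above; reads return the status byte
EC_DATA_PORT = 0x62      # from the comment above

# Bit positions per the ACPI EC interface (bits 2 and 7 are not used here).
EC_STATUS_BITS = {
    0: "OBF",      # output buffer full: data is waiting for the host to read
    1: "IBF",      # input buffer full: the EC has not consumed the last write yet
    3: "CMD",      # the last byte written was a command rather than data
    4: "BURST",    # the EC is in burst mode
    5: "SCI_EVT",  # an SCI event is pending
    6: "SMI_EVT",  # an SMI event is pending
}


def decode_ec_status(status: int) -> list[str]:
    """Return the names of the status flags that are set, lowest bit first."""
    return [name for bit, name in EC_STATUS_BITS.items() if status & (1 << bit)]


if __name__ == "__main__":
    print(decode_ec_status(0x02))
```

A host has to wait for IBF to clear before it writes the next byte, so a busy EC that is slow to clear IBF is one place where the waiting could come from.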

From what I can see, the slowdown happens even when no ectool / HCI commands are being sent to the EC.
But looking at the EC console (using an EC CCD), ACPI commands are being received.

Interestingly, there appears to be some sort of work queue between ACPI and writing to the EC.
The work queue is 16 entries long, and if ACPI writes an entry when the queue is full, the request appears to be silently dropped. I don't know how ACPI behaves when requests are simply dropped.
As an example, plugging the PSU into the FW16 laptop generates 42 ACPI requests to the EC.
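
To put numbers on that, here is a minimal sketch of a bounded queue that silently drops entries when full. The capacity of 16 and the burst of 42 come from the observations above. The function `burst` and its `service_every` parameter (how often the consumer removes an entry during the burst) are assumptions for illustration, since the real drain rate is not known.

```python
from collections import deque


def burst(requests: int, capacity: int = 16, service_every: int = 0) -> tuple[int, int]:
    """Return (accepted, dropped) for a burst of requests into a bounded queue.

    service_every=N removes one queued entry after every N-th arrival;
    0 means nothing is consumed while the burst is arriving.
    """
    queue: deque[int] = deque()
    accepted = dropped = 0
    for n in range(1, requests + 1):
        if len(queue) < capacity:
            queue.append(n)
            accepted += 1
        else:
            dropped += 1                  # silently discarded, no error reported
        if service_every and n % service_every == 0 and queue:
            queue.popleft()               # the consumer handles one entry
    return accepted, dropped


if __name__ == "__main__":
    print(burst(42))                      # nothing drained during the burst
    print(burst(42, service_every=2))     # one entry drained per two arrivals
```

If nothing is drained while the 42 requests arrive, 16 are accepted and 26 are lost. Any realistic drain rate reduces the loss, but not to zero unless the consumer keeps up with the burst.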

@jcdutton
Author

jcdutton commented Mar 9, 2025

One way to observe the EC slowing down the CPU is by doing:

  1. Open a terminal window on Linux.
  2. Press the "h" key and keep your finger down. (it can be any key really)
  3. You should see the "h" being repeated at a fairly constant rate.
  4. Unplug the PSU from the FW16 laptop. You will see pauses in the "h" output.
  5. Plug the PSU into the FW16 laptop. You will see pauses in the "h" output again.

I think these pauses are being caused by ACPI to EC comms blocking all the CPU cores.
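
If holding a key down is too subjective, a rough alternative is to let a process wake up at a fixed interval and report the gaps that are much longer than expected. This is only a sketch: it measures scheduling delay for one userspace process, which is an assumed proxy for the pauses seen in the terminal, and the names `sample` and `find_stalls` and the thresholds are made up.

```python
import time


def find_stalls(timestamps: list[float], threshold_s: float) -> list[tuple[float, float]]:
    """Return (start, duration) for every gap between samples above threshold_s."""
    stalls = []
    for previous, current in zip(timestamps, timestamps[1:]):
        gap = current - previous
        if gap > threshold_s:
            stalls.append((previous, gap))
    return stalls


def sample(duration_s: float, interval_s: float = 0.005) -> list[float]:
    """Wake up every interval_s and record when each wake-up actually happened."""
    stamps = []
    end = time.monotonic() + duration_s
    while time.monotonic() < end:
        stamps.append(time.monotonic())
        time.sleep(interval_s)
    return stamps


if __name__ == "__main__":
    stamps = sample(duration_s=30.0)      # plug and unplug the PSU in this window
    for start, gap in find_stalls(stamps, threshold_s=0.05):
        print(f"stall of {gap * 1000:.0f} ms at t={start - stamps[0]:.2f} s")
```

Running it once without touching the PSU gives a baseline to compare against.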

@ngraham20

One way to observe the EC slowing down the CPU is by doing:

  1. Open a terminal window on Linux.
  2. Press the "h" key and keep your finger down. (it can be any key really)
  3. You should see the "h" being repeated at a fairly constant rate.
  4. Unplug the PSU from the FW16 laptop. You will see pauses in the "h" output.
  5. Plug the PSU into the FW16 laptop. You will see pauses in the "h" output again.

I think these pauses are being caused by ACPI to EC comms blocking all the CPU cores.

Actually, unplugging my laptop while following these steps caused KDE to not notice I'd unplugged it at all.

(Image) The laptop is currently unplugged

@ngraham20

However, after a reboot, I am unable to reproduce, so maybe that was unrelated?

@RokeJulianLockhart

RokeJulianLockhart commented Apr 6, 2025

@sinatosk

sinatosk commented Apr 11, 2025

Framework 16 AMD Ryzen 7 7840HS using Radeon 780M
RAM/Memory: 32GB ( 2x16GB ) Framework
NVME 2280: Western Digital SN850X 2TB - Firmware 620361WD
NVME 2230: Western Digital SN770M 2TB - Firmware 731120WD
BIOS: 3.05
Linux firmware: linux-firmware d864697f
Gentoo Linux 2.17 ( Linux 6.15-rc1 mainline realtime, compiled by clang 20.1.2 march and mtune set to znver4)
KDE Plasma 6.2.5 Wayland

I've also tried this on Linux 6.12.22, 6.13.10, 6.14.1 and now 6.15-rc1

@jcdutton

I'm not sure where to report this, but I thought of you because you mentioned the EC slowing down the CPU.

I've recently received my PTM7958 from Framework and things are looking good on average.

Using amdgpu_top ( 0.10.4 ) to watch the temperatures and power/TDP, one thing I noticed ( ~2 weeks before the PTM7958 ) was the throttle status flags showing

Throttle Status: [PROCHOT_CPU, PROCHOT_GPU]

and I thought this was due to the processor package overheating because of the liquid thermal issue, but after applying the PTM7958 I'm still seeing this, which is weird because the temperatures are well within the safe range.

I rebooted and the flags were gone ( as some others mentioned ). I did some tests again ( PTM7958 related ) and thought "hmm, wonder what frequencies are like on battery", so I unplugged the Framework power adapter, and the PROCHOT_CPU and PROCHOT_GPU flags showed again. I'm like "huh?"

I plugged the Framework power adapter back in and then another flag ( EDC_CPU ) showed for 1 second while the PROCHOT_CPU and PROCHOT_GPU flags were still showing.

I went away from my FW16 and it auto-suspended in that time while amdgpu_top was still running. When it later resumed, the flags were gone.

So I tried this numerous times in each power profile ( power save, balanced and performance ) via KDE PowerDevil:

  • Unplug the Framework power adapter, the PROCHOT_CPU and PROCHOT_GPU flags show
  • Suspend and then resume, flags are gone
  • Plug in the Framework power adapter and the EDC_CPU flag shows for 1 second ( it's probably less, but I have the amdgpu_top refresh period set to 1 second ) while PROCHOT_CPU and PROCHOT_GPU are still showing
  • Suspend and resume, flags are gone
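
To log the flags across the steps above rather than watching them by eye, a small parser for the status line could help. This is a sketch that assumes the line looks exactly like the `Throttle Status: [...]` text quoted earlier. The function names are hypothetical, and it does not rely on any particular amdgpu_top output mode.

```python
import re


def parse_throttle_status(line: str) -> set[str]:
    """Extract flag names from a line such as 'Throttle Status: [PROCHOT_CPU, PROCHOT_GPU]'."""
    match = re.search(r"Throttle Status:\s*\[(.*?)\]", line)
    if not match:
        return set()
    return {flag.strip() for flag in match.group(1).split(",") if flag.strip()}


def flag_changes(before: set[str], after: set[str]) -> tuple[set[str], set[str]]:
    """Return (newly set, newly cleared) flags between two observations."""
    return after - before, before - after


if __name__ == "__main__":
    on_ac = parse_throttle_status("Throttle Status: []")
    on_battery = parse_throttle_status("Throttle Status: [PROCHOT_CPU, PROCHOT_GPU]")
    newly_set, newly_cleared = flag_changes(on_ac, on_battery)
    print("set:", sorted(newly_set), "cleared:", sorted(newly_cleared))
```

Recording one observation per step (unplug, suspend/resume, plug in) would show which transition sets or clears each flag.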

However, the EDC_CPU flag only shows when I plug in ( never when I unplug ) the Framework power adapter while the CPU is idle and the power profile is set to "power save".

Again, "power save" in KDE PowerDevil ( which communicates with the power profiles daemon ) in my case also turns off CPB ( Core Performance Boost ) and sets the iGPU performance level to "low"; in "balanced" or "performance", CPB is on and the iGPU performance level is higher.

There was one instance ( after the PTM7958 was applied, but it had happened numerous times before the PTM7958 and, again, I thought it was the liquid thermal issue ) where the flags were showing and I suspended the system overnight. After resuming the next morning, regardless of the power profile, the processor frequency would not go above ~3.8GHz and the temperatures ranged from 70 to 75 degrees Celsius. After a reboot ( suspend/resume didn't fix this ), frequencies were good again, and I've not been able to trigger this again.

Since applying the PTM7958, I've never seen the processor temperature go above 97.8 degrees Celsius ( the average is 90.8-97.8, depending on ambient temperature ); before, with liquid thermal, it was hitting 100, sometimes 101, many times.

So it seems that changing the power state from AC to DC or DC to AC sets these flags, and after a suspend/resume or reboot the flags are gone. I'm guessing something is not being configured correctly in the firmware/BIOS/EC before or after AC/DC/suspend/resume/reboot?

And with those flags set, it's possible the firmware/BIOS/EC is doing something elsewhere which may eventually lead to the FTR issue ( #41 )?

For me, the FTR issue occurs at a low frequency ( weeks apart ).

edit: maybe it's a bug in amdgpu_top; that only just occurred to me, but there are no open issues, and I've been using this software on multiple AMD GPUs for some years. When looking at the frequencies, I use btop and htop too and see similar numbers ( btop just shows an average of all the cores ).

edit 2: it's just done it again. On battery ( down to 59% ), I plugged in and the flags showed; I suspended and resumed, and the flags still showed and the processor frequency wouldn't go above 3.8GHz; I suspended and resumed again, and the flags were gone but the frequency still wouldn't go above 3.8GHz. This is while I'm in performance mode with temps no higher than 81 degrees Celsius ( weather quite warm today ). I'm now wondering if this frequency issue is to do with the battery discharging. I'll find out later and, if so, maybe it's another issue ( BIOS or OS? )


6 participants