[APU2] IO_PAGE_FAULTS on writes by ath10k_pci #1134
Comments
@ma-neumann thank you for the report.
Where is it coming from? |
@pietrushnic I have compiled the tag
Thank you so much for testing. Can you confirm that on v0.9.0? We would at least know if this is a regression or a known bug. Fixing IOMMU is not easy because we don't have a comprehensive test suite covering various hardware, but I hope we can satisfy your case without breaking others. |
This is a known problem already from the traditional PC Engines firmware. |
@miczyg1 It looks related to me too, but the symptoms differ, don't you think? |
Why? It was also caused by ath10k_pci according to comments. |
Somehow I speculate the And I suspect you are also implying that this is probably no regression. I will test |
Unfortunately, it does not seem to be a regression (at least not from
Yet, I just noticed that I had missed so far that the Linux kernel raises an exception when it initializes the IOMMU. The following happens right at the beginning when the kernel starts (on
Seems like the kernel is stuck for about 20 seconds at first, and then raises an exception somewhere at the end of function I do not understand this IOMMU code in the kernel, but from the code it looks like it's somehow about remapping interrupts (given the By trial and error I have switched the AMD's IOMMU remapping mode from Now, the kernel does not get stuck anymore and it seems to successfully initialize the IOMMU. Unfortunately, the original symptom -- the By another round of trial and error I have put the IOMMU into pass-through mode (whatever it means) using the kernel option [1] https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/noble/tree/drivers/iommu/amd/init.c?h=master-next--2024.09.30-1--auto#n2859 |
It all sounds suspicious. I'm not an IOMMU expert, but pass-through means the device can access memory directly without IOMMU translating virtual addresses to physical addresses. It stops complaining because the previous I wonder what @krystian-hebel and @andyhhp think about that. |
Yeah, The google groups link isn't quite correct. Flags of 0x0070 do translate to PE, RW, PR, but that means the device is trying to write to a region marked read-only in the IOMMU. All the addresses seem to be quite close together. Does 0xced536d0 fall in any region described in /proc/iomem ?
I'm not aware of any extra configuration the firmware would need to do to set up VAPIC, but I wouldn't rule it out either. Either way, I think that's a red herring and unrelated to IO_PAGE_FAULTs. |
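For readers who want to decode the `flags` value themselves, here is a minimal sketch. The bit positions and names in the table are an assumption based on the IO_PAGE_FAULT event log layout in AMD's IOMMU specification, and `decode_flags` is a hypothetical helper, not a kernel API. The only thing the thread itself confirms is that 0x0070 corresponds to PE, RW and PR.

```python
# Assumed bit layout of the IO_PAGE_FAULT "flags" field as printed by Linux.
# The names follow AMD's IOMMU specification; verify against the spec
# revision that matches your hardware before relying on them.
IO_PAGE_FAULT_FLAG_BITS = {
    0: "GN",  # guest / nested
    1: "NX",  # no execute
    2: "US",  # user / supervisor
    3: "I",   # interrupt request
    4: "PR",  # translation present
    5: "RW",  # write access
    6: "PE",  # permission violation
    7: "RZ",  # reserved bit set
    8: "TR",  # translation request
}


def decode_flags(flags):
    """Return the names of all flag bits set in `flags`, lowest bit first."""
    return [
        name
        for bit, name in sorted(IO_PAGE_FAULT_FLAG_BITS.items())
        if flags & (1 << bit)
    ]


print(decode_flags(0x0070))  # ['PR', 'RW', 'PE']
```

Read together, PR + RW + PE matches the interpretation above: the page is present, the access was a write, and the write was denied.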
@andyhhp Good morning, thank you very much. The IOMMU is back on.
These three seem to be hits of "System RAM" regions and a "RAM buffer" region.
I guess they should have used the regions which had been allocated to them, i.e. each WLE uses its |
The region ce2e4000-cfffffff looks like cbmem (coreboot memory). But I don't understand why and what a device would like to write there. Getting the logs from cbmem would be great: https://docs.dasharo.com/common-coreboot-docs/dumping_logs/#cbmem-utility |
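Checking which `/proc/iomem` entry a faulting address belongs to can be scripted. The sketch below parses text in the `/proc/iomem` format and lists every region containing a given address. `SAMPLE_IOMEM` is a hypothetical excerpt: only the `ce2e4000-cfffffff` range and the address 0xced536d0 come from this thread, while the labels and the other ranges are made up for illustration. On a real system the file usually has to be read as root, since unprivileged reads typically show zeroed addresses.

```python
import re

# Hypothetical excerpt in /proc/iomem format. Only the ce2e4000-cfffffff
# range is taken from the thread; labels and other ranges are invented.
SAMPLE_IOMEM = """\
00000000-00000fff : Reserved
00100000-ce2e3fff : System RAM
  01000000-01ffffff : Kernel code
ce2e4000-cfffffff : (label unknown)
d0000000-f7ffffff : PCI Bus 0000:00
"""

LINE_RE = re.compile(r"^\s*([0-9a-fA-F]+)-([0-9a-fA-F]+) : (.*)$")


def regions_containing(addr, iomem_text):
    """Return (start, end, label) for every region that contains `addr`."""
    hits = []
    for line in iomem_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        start = int(match.group(1), 16)
        end = int(match.group(2), 16)  # /proc/iomem ranges are inclusive
        if start <= addr <= end:
            hits.append((start, end, match.group(3).strip()))
    return hits


for start, end, label in regions_containing(0xCED536D0, SAMPLE_IOMEM):
    print(f"{start:08x}-{end:08x} : {label}")
```

To run it against a live system, pass the contents of `/proc/iomem` instead of `SAMPLE_IOMEM`.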
The code seems to try to disable guest VAPIC logging. However, according to BKDG, the Guest VAPIC should not be supported in the SOC (the GASup bit should be 0 in IOMMU Extended Feature). But, the guest VAPIC log registers are described in BKDG 🤔 |
I was surprised that VAPIC was seemingly active in APU2; it feels too old to have support. But, it's Fam16h Model 0x30, and I recall there being prototype support there, which was formally supported in Zen1 which was the following architecture. I agree that the BKDG seems confused on whether vAPIC should be visible or not. I think it's quite likely that there's support in silicon which the AMD BIOS clobbers. |
@miczyg1 Please see
Also see
Finally see output from |
Taking one of your addresses at random:
So the DMA is hitting the SMM range. /proc/iomem says Why isn't that marked as reserved in the E820 ? |
There are two problems here:
Well, there is bad logic in the EDK2 UEFI Payload for determining the TOLUD. We have an ugly hack that worked for Intel (as we had released firmware only for Intel-based boards) and simply read the TOLUD from the host bridge. It goes without saying that on AMD it doesn't work :) So the TOLUD is assumed to be on the MMIO boundary (0xd0000000 in this case) instead of 0xcff00000, and the memory map ends up not reserving the 0xcff00000-0xd0000000 RAM... Working on a fix.
I have found out that AVIC is not supported on this HW. The CPUID 0x8000000A EDX indicates no support for AVIC. That means the IOMMU guest AVIC support should not be exposed at all. I have already prepared a fix for it by hiding the guest AVIC capability in the IOMMU and disabling the feature. So far it works and no more WARNs are visible in dmesg. |
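The AVIC check mentioned above comes down to testing one bit. The sketch below assumes AVIC is reported in bit 13 of EDX for CPUID leaf 0x8000000A, as documented in AMD's programmer's manuals. `has_avic` is a hypothetical helper and the EDX values are illustrative, not readings from an APU2. Python cannot execute CPUID directly, so the register value has to come from a tool that can. On Linux, the `avic` flag in `/proc/cpuinfo` should carry the same information.

```python
# Assumption: CPUID Fn8000_000A EDX bit 13 reports AVIC support.
AVIC_BIT = 13


def has_avic(edx):
    """Return True if the given CPUID 0x8000000A EDX value advertises AVIC."""
    return bool((edx >> AVIC_BIT) & 1)


# Illustrative register values only, not measured on real hardware.
edx_without_avic = 0x0000_0001
edx_with_avic = edx_without_avic | (1 << AVIC_BIT)

print(has_avic(edx_without_avic))  # False
print(has_avic(edx_with_avic))     # True
```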
Do I need that also for Dasharo (coreboot+SeaBIOS), or is this UEFI-specific? |
I would happily test the new build 😀 |
I would be happy to introduce a fix to the upcoming 24.08.00.01. I'm unsure about testing, although it would be great to have automated verification of this issue to avoid regression. Still, maybe IOMMU verification could be extended as part of additional effort since I think this is quite a lot of effort. |
🤔 I guess I cannot assign issue to two milestones. |
Ok, created separate issue for tracking Dasharo (coreboot+SeaBIOS) fix integration. |
By the way: at least, disabling the IOMMU has fixed my WiFi issues I mentioned originally. |
There was a dirty hack for Intel platforms that read TOLUD register to determine the boundary between MMIO and DRAM. It caused problems on AMD platforms such as apu2, which does not have TOLUD register. As a result, regions which held reserved memory were incorrectly reported as RAM buffers or RAM itself and the OS allocated DMA there. It could be observed with many IO_PAGE_FAULTs occurring in the OS. See: Dasharo/dasharo-issues#1134 Signed-off-by: Michał Żygowski <michal.zygowski@3mdeb.com>
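As a worked example of the boundary error described in this commit message, the sketch below computes how much RAM was left unreserved, using the two addresses quoted earlier in the thread (0xd0000000 assumed, 0xcff00000 correct). `unreserved_gap` is a hypothetical helper written for illustration; it is not code from the fix.

```python
ASSUMED_TOP = 0xD0000000  # MMIO boundary that was wrongly treated as TOLUD
ACTUAL_TOP = 0xCFF00000   # boundary the thread says should have been used


def unreserved_gap(actual_top, assumed_top):
    """Return (start, end_exclusive, size) of the range left unreserved."""
    if assumed_top < actual_top:
        raise ValueError("assumed boundary lies below the actual one")
    return actual_top, assumed_top, assumed_top - actual_top


start, end, size = unreserved_gap(ACTUAL_TOP, ASSUMED_TOP)
print(f"{start:#x}-{end:#x}: {size // 1024} KiB left unreserved")
```

With these values the gap is 0x100000 bytes, i.e. 1 MiB directly below the MMIO boundary.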
It looks like I fixed it (at least Linux does not complain any more when tested with DTS). I will check a common distro to see if there are still any problems. I have also added a few extra fixes for the lack of PCI INT on the IOMMU and for the IOAPIC init, and reworked the ACPI IVRS generation. I will have to check a system running Xen with PCI passthrough to be sure everything is all right. The correct memory map reporting is also done here (apart from coreboot): |
Component
Dasharo firmware
Device
PC Engines APU2
Dasharo version
pcengines_apu2_v0.9.1-rc1
Dasharo Tools Suite version
No response
Test case ID
No response
Brief summary
Linux kernel reports IO_PAGE_FAULTS on writes by ath10k_pci
How reproducible
Hi Dasharo team,
I am running Dasharo's v0.9.1-rc1 on a PC Engines APU2D4 with Ubuntu 24.04 LTS (currently kernel 6.8.0-48-generic). The APU is equipped with two Compex WLE900VX cards, thus using the ath10k_pci driver (plus currently linux-firmware 20240318.git3b128b60-0ubuntu2.4). Both are in AP mode (using hostapd), one on 2.4 GHz and one on 5 GHz. ath10k_pci reports the WLEs' Qualcomm chips properly (qca988x hw2.0 target, see also [1]) and loads the latest firmware properly (firmware ver 10.2.4-1.0-00047, see also [2]).
The WLE on 5 GHz reports the following IO_PAGE_FAULTs:
As far as I understand, according to AMD's IOMMU specifications, flags=0x0070 indicates that the WLE lacked permission when trying to write to the reported addresses (see also [3]).
These IO_PAGE_FAULTs happen every now and then; so far they seem sporadic to me. In general I experience quite good WiFi performance, but sometimes I experience weird/significant delays; maybe the issue is related.
The issue in [4] might be related. Yet, disabling Dasharo's performance boost option for the APU in the BIOS did not change the issue. And the workaround of emulating the IOMMU hardware in software (kernel parameter iommu=soft) does not seem to be an option to me for performance reasons (have not tested it).
I'd appreciate any ideas. Thank you very much.
[1] https://compex.com.sg/shop/wifi-module/802-11ac-wave-1/wle900vx-wifi5-11ac-qca9880-qca9890/
[2] https://git.codelinaro.org/clo/ath-firmware/ath10k-firmware/-/tree/main/QCA988X/hw2.0/10.2.4-1.0?ref_type=heads
[3] https://groups.google.com/g/linux-ntb/c/vvnbizy8d_8/m/tZMqnJH9AwAJ
[4] pcengines/apu2-documentation#240
How to reproduce
n/a
Expected behavior
n/a
Actual behavior
n/a
Screenshots
No response
Additional context
No response
Solutions you've tried
No response