This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
[SOLVED] No temp readings on Epyc 9274F #479
Comments
@garceri Hello,
For all these outputs, please use Markdown formatting. |
I think Genoa and Zen4/Hawk Point (#84) are missing the temperature for the same reason: no thermal register known. |
Meanwhile, I can fix the Vcore voltage if you provide the CLI output requested above. |
Can you please pull |
Which temperature are we reading when the system is idling?
|
I let the server idle for ten minutes. The temperature readings reported by CoreFreq seem to be off. This is what lm-sensors reports:
and finally, this is what CoreFreq reports:
|
The first generation needed a temperature offset. Perhaps it is the same with Genoa. Can you compile my SMU tool, zencli?
cc zencli.c -o zencli
As root, you can then peek the thermal registers I know of:
## since Zen gen1
zencli smu 0x59800
## per CCD
zencli smu 0x59954
zencli smu 0x59958
zencli smu 0x5995C
zencli smu 0x59960
zencli smu 0x59964
zencli smu 0x59968
zencli smu 0x5996C
## Family 19h APU
zencli smu 0x59B08
By default, CoreFreq shows the relative frequency. |
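For reference, a raw value peeked from 0x59800 can be turned into a temperature with a small shell helper. This is only a sketch under assumptions: the bit layout (temperature in bits 31:21 in steps of 0.125 °C, and a range-select bit 19 that shifts the scale by 49 °C) is the one the Linux k10temp driver uses for Zen, and nothing in this thread confirms that Genoa follows it. The function name `tctl_millic` is made up for this example.

```shell
# Decode a raw Tctl register value (e.g. peeked at 0x59800) into millidegrees Celsius.
# Assumption: k10temp-style Zen layout; not confirmed for Genoa in this thread.
tctl_millic() {
    raw=$(( $1 ))                               # accepts decimal or 0x-prefixed hex
    temp=$(( (($raw >> 21) & 0x7FF) * 125 ))    # bits 31:21, 0.125 degC per step
    if [ $(( $raw & 0x80000 )) -ne 0 ]; then    # bit 19: range select
        temp=$(( $temp - 49000 ))               # shifted scale starts at -49 degC
    fi
    echo "$temp"
}
```

For instance, `tctl_millic 0x64080000` prints 51000, i.e. 51.000 °C with the range-select bit set.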
The MHz reported are relative to what? Here are the zencli outputs you requested:
|
Can you also dump the CCD range?
## per CCD
zencli smu 0x59954
zencli smu 0x59958
zencli smu 0x5995C
zencli smu 0x59960
zencli smu 0x59964
zencli smu 0x59968
zencli smu 0x5996C |
I see: https://elixir.bootlin.com/linux/latest/source/drivers/hwmon/k10temp.c#L475 So the SMU address is 0x59B00 + (CCD number * 4). |
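The address rule above can be sketched as a couple of shell helpers. `ccd_temp_addr` and `ccd_peek_commands` are hypothetical names; the second one only prints `zencli smu` command lines in the form already used in this thread, it does not run them. The decoding in `ccd_temp_millic` (valid flag in bit 11, 11-bit value in 0.125 °C steps, minus 49 °C) is taken from the Linux k10temp driver and is an assumption as far as Genoa is concerned.

```shell
# Base of the per-CCD temperature registers, as derived in the comment above.
CCD_TEMP_BASE=0x59B00

# Print the SMU address of the temperature register for CCD number $1.
ccd_temp_addr() {
    printf '0x%X\n' $(( $CCD_TEMP_BASE + $1 * 4 ))
}

# Print (not execute) the zencli commands for the first $1 CCDs.
ccd_peek_commands() {
    i=0
    while [ "$i" -lt "$1" ]; do
        echo "zencli smu $(ccd_temp_addr "$i")"
        i=$(( $i + 1 ))
    done
}

# Decode a raw CCD register value into millidegrees Celsius.
# Assumption: k10temp layout. Returns 1 and prints nothing if the valid bit is clear.
ccd_temp_millic() {
    raw=$(( $1 ))
    [ $(( $raw & 0x800 )) -ne 0 ] || return 1
    echo $(( ($raw & 0x7FF) * 125 - 49000 ))
}
```

For example, `ccd_temp_addr 7` prints 0x59B1C, and `ccd_temp_millic 0xAD8` prints 42000 (42.000 °C).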
Fortunately, the thermal function was already written for Raphael. |
It may be a wrong register address: Line 208 in 87344fa
and replace the code with this one:
#define SMU_AMD_THM_TCTL_CCD_REGISTER_F19H_61H \
    (SMU_AMD_THM_TCTL_REGISTER_F17H + 0x300)
Next, please rebuild, unload, reload, and test the temperature.
EDIT: For the DIMM geometry, can you also peek these addresses?
## Prior to Zen4
./zencli smu 0x50030
./zencli smu 0x50034
./zencli smu 0x50038
./zencli smu 0x5003C
## Since Zen4
./zencli smu 0x50040
./zencli smu 0x50044
./zencli smu 0x50048
./zencli smu 0x5004C |
I pulled and recompiled the commit you mentioned (7c926af from develop) and it seems to have improved a bit, but there are still some cores whose temperatures are somehow reported as 0 C.
Here are the zencli peeks you requested:
|
This seems to be the real deal. I modified the file as suggested, with
|
Thank you very much for all your tests. New DIMMs of 24 or 48 GB are not so easy to debug. I have no specified, known bit values to decode them with. |
The temperature fix is made available in |
Looks good after recompiling the latest |
Thank you for your confirmation. In addition to the DIMM geometry, some miscellaneous issues remain:
|
Anything I can do or any information I can provide, just let me know! |
Yes, please. I have various requests in the next comments. |
A value above zero may occur when the system is under high load. |
Here you go:
|
Either:
Or:
FYI, to compute the boosted ratio, use the hex value returned above in place of 0x12345678 in:
echo $(( ((0x12345678 >> 17) & 0xff) >> 2 )) |
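The same arithmetic can be wrapped in a small function so that the hex value is passed as an argument instead of being edited into the expression. The name `boost_ratio` is hypothetical; the shifts and mask are exactly those of the one-liner above, and no claim is made here about what the register field means on Zen4/EPYC.

```shell
# Extract the ratio field from a raw register value given as 0x-prefixed hex.
# Same arithmetic as the one-liner above: take 8 bits starting at bit 17,
# then drop the low 2 bits (integer division by 4).
boost_ratio() {
    echo $(( (($1 >> 17) & 0xff) >> 2 ))
}
```

With the placeholder value from the one-liner, `boost_ratio 0x12345678` prints 6.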
|
The register is specified here: Line 2205 in 49e4e2b
But it was known as such for Zen2. Either the address, the bit fields, or both have changed for Zen4/EPYC. According to the 9274F, I'm looking for a ratio of |
Commit b465f42 in branch |
Exactly what I was expecting. Thank you. |
In your PCI list, I was expecting the IOMMU:
lspci -nn|grep -i iommu
00:00.2 IOMMU [0806]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse IOMMU [1022:1481]
If enabled in BIOS, yours should be
As the closest references, I found this Dell and table; could HP have a different mapping? |
I'd have to check if IOMMU is enabled in BIOS, I can't reboot the server right now, I'll check the BIOS setting as soon as I can find a time window to reboot it. |
Do you have other software, even made for Windows, which can read the geometry of your 48 GB DIMM ? |
Here is the manufacturer spec sheet: Does
|
Thanks. They give some hints.
I'm searching for a more detailed datasheet, like the JEDEC specs. |
P-States
You can reprogram P-States from the UI but also using the driver parameters. For example, your EPYC default P-States were discovered as below.
Now suppose you want to alter these two enabled P-States to respectively
insmod build/corefreqk.ko Register_ClockSource=1 \
Register_Governor=1 Register_CPU_Idle=1 Register_CPU_Freq=1 \
Turbo_Ratio_Unlock=1 TurboBoost_Enable="0,1" \
Ratio_Boost="-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,12,30"
Remarks
I hope you won't find this method too complicated. |
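Typing the long `Ratio_Boost` value by hand is error-prone, so a helper that builds it may be handy. The sketch below is not part of CoreFreq: `ratio_boost_list` is a hypothetical name, and it only reproduces the shape of the example above (a run of -1 entries followed by the ratios to set). What -1 means to the driver is not spelled out in this part of the thread, so treat it as a placeholder here.

```shell
# Build a comma-separated Ratio_Boost value:
# $1 entries of -1, followed by the remaining arguments in order.
ratio_boost_list() {
    count=$1
    shift
    list=""
    i=0
    while [ "$i" -lt "$count" ]; do
        list="${list}-1,"
        i=$(( $i + 1 ))
    done
    for ratio in "$@"; do
        list="${list}${ratio},"
    done
    echo "${list%,}"    # strip the trailing comma
}
```

For example, `ratio_boost_list 3 12 30` prints -1,-1,-1,12,30; the insmod example above uses a longer run of -1 entries before 12,30.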
It will be some time until I can test it, the server is running production workloads and I'll need to take it offline to test this. |
Actions are transferred to the todo list.
To refresh the EPYC 9274F Wiki, can you however pull the latest from
corefreq-cli -s -n -m -n -z -n -B -n -M -n -k -n -C 1 -n -i 1 -n -V 1 -n -W 1 -n -R |
Here ya go:
|
Commit dc6280f fixes the spacing in the Power Monitoring bottom area, especially when a three-digit power value is measured. Format:
CPU Freq(MHz) Accumulator Min Energy(J) Max Min Power(W) Max
000 3798.08 000000000000411306 0.00 6.28 8.87 0.00 6.28 8.87
001 3682.77 000000000000496376 0.00 7.57 8.20 0.00 7.57 8.20
002 3850.84 000000000000417714 0.00 6.37 9.53 0.00 6.37 9.53
003 3830.85 000000000000496311 0.00 7.57 8.59 0.00 7.57 8.59
004 3860.38 000000000000437088 0.00 6.67 8.64 0.00 6.67 8.64
005 3826.94 000000000000448859 0.00 6.85 8.19 0.00 6.85 8.19
006 3951.14 000000000000407283 0.00 6.21 8.10 0.00 6.21 8.10
007 3820.92 000000000000512015 0.00 7.81 8.20 0.00 7.81 8.20
008 3838.89 000000000000453518 0.00 6.92 7.74 0.00 6.92 7.74
009 3807.62 000000000000450171 0.00 6.87 8.38 0.00 6.87 8.38
010 3969.38 000000000000451074 0.00 6.88 7.74 0.00 6.88 7.74
011 4025.41 000000000000339855 0.00 5.19 7.88 0.00 5.19 7.88
012 3892.19 000000000000417007 0.00 6.36 7.74 0.00 6.36 7.74
013 3905.71 000000000000386342 0.00 5.90 7.68 0.00 5.90 7.68
014 3726.84 000000000000527824 0.00 8.05 9.04 0.00 8.05 9.04
015 3908.34 000000000000494423 0.00 7.54 8.21 0.00 7.54 8.21
016 3984.11 000000000000000000 0.00 0.00 0.00 0.00 0.00 0.00
017 3749.12 000000000000000000 0.00 0.00 0.00 0.00 0.00 0.00
018 3783.17 000000000000000000 0.00 0.00 0.00 0.00 0.00 0.00
019 3755.73 000000000000000000 0.00 0.00 0.00 0.00 0.00 0.00
020 3774.03 000000000000000000 0.00 0.00 0.00 0.00 0.00 0.00
021 3810.57 000000000000000000 0.00 0.00 0.00 0.00 0.00 0.00
022 4013.79 000000000000000000 0.00 0.00 0.00 0.00 0.00 0.00
023 3953.22 000000000000000000 0.00 0.00 0.00 0.00 0.00 0.00
024 3711.33 000000000000000000 0.00 0.00 0.00 0.00 0.00 0.00
025 4015.93 000000000000000000 0.00 0.00 0.00 0.00 0.00 0.00
026 3878.68 000000000000000000 0.00 0.00 0.00 0.00 0.00 0.00
027 3880.01 000000000000000000 0.00 0.00 0.00 0.00 0.00 0.00
028 3820.70 000000000000000000 0.00 0.00 0.00 0.00 0.00 0.00
029 3956.70 000000000000000000 0.00 0.00 0.00 0.00 0.00 0.00
030 3721.36 000000000000000000 0.00 0.00 0.00 0.00 0.00 0.00
031 3987.81 000000000000000000 0.00 0.00 0.00 0.00 0.00 0.00
Energy(J) Package[0] Cores Uncore Memory
17.68 145.5 145.50 0.02 109.1 126.85 10.50 17.1 18.67 0.00 0.0 0.00
Power(W)
17.68 145.5 145.50 0.02 109.1 126.85 10.50 17.1 18.67 0.00 0.0 0.00
Can you please show me yours? |
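In the sample output above, the current Energy(J) value of each core appears to be the Accumulator count divided by 65536 (2^16): for CPU 000, 411306 / 65536 ≈ 6.28. That unit is inferred from the printed numbers only, not stated in the thread, so the helper below (hypothetical name `acc_to_joules`) is just a way to cross-check a dump.

```shell
# Convert a raw energy accumulator count to joules.
# Assumption: one count = 2^-16 J, inferred from the sample table above.
acc_to_joules() {
    awk -v acc="$1" 'BEGIN { printf "%.2f\n", acc / 65536 }'
}
```

For example, `acc_to_joules 411306` prints 6.28 and `acc_to_joules 527824` prints 8.05, matching the rows for CPU 000 and CPU 014.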
Compiled with that commit:
|
Thank you. Version |
It looks like some bits have architecturally been fixed since kernel I'm now reading the Is it still the case on Genoa? |
Hello,
In this AMD HSMP source code, I'm reading an address exception for Zen family 1A.
Can you edit and replace the value of Line 222 in 3a25d36
with this value:
#define SMU_HSMP_CMD 0x3b10934
I would also like to put the Line 7892 in 3a25d36
as below:
switch (PUBLIC(RO(Proc))->ArchID) {
case AMD_Zen4_PHX2:
case AMD_Zen4_HWK:
case AMD_Zen4_PHX:
case AMD_Zen4_RPL:
case AMD_Zen3Plus_RMB:
case AMD_Zen3_VMR:
case AMD_Zen2_MTS:
    Core_AMD_SMN_Read(XtraCOF,
                      SMU_AMD_F17H_MATISSE_COF,
                      PRIVATE(OF(Zen)).Device.DF);
    break;
case AMD_Zen4_Bergamo:
case AMD_EPYC_Rome_CPK:
case AMD_Zen4_Genoa:
    Core_AMD_SMN_Read(XtraCOF,
                      SMU_AMD_F17H_ZEN2_MCM_COF,
                      PRIVATE(OF(Zen)).Device.DF);
    break;
}
Rebuild and reload CoreFreq.
Finally, post the output of
Thank you for helping. |
Will do as soon as I can, pleased to help! |
Here you go, sorry for the delay:
|
Thanks.
|
|
I can't get any temperature readings under Ubuntu 22.04 running kernel 6.5