Temperatures for cores not showing #1284

Open
AnOpenSauceDev opened this issue Aug 20, 2023 · 15 comments · May be fixed by #1352
Labels
bug 🐛 Something isn't working Linux 🐧 Linux related issues support request This is not a code issue but merely a support request. Please use the mailing list or IRC instead.

Comments

@AnOpenSauceDev

When using htop via SSH on my Ubuntu server, I notice that even if I enable "Also show CPU temperature" (libsensors5 is installed), no temperature reading appears. I'm unsure whether this is related to my core count (40 threads total), but no matter what I do, nothing shows up alongside the usage reading.

@BenBE BenBE added question ❔ Further information is requested support request This is not a code issue but merely a support request. Please use the mailing list or IRC instead. Linux 🐧 Linux related issues labels Aug 22, 2023
@BenBE
Member

BenBE commented Aug 22, 2023

What does the full output for sensors -u look like?

@AnOpenSauceDev
Author

AnOpenSauceDev commented Aug 22, 2023

Log File

$ sensors -u

coretemp-isa-0001
Adapter: ISA adapter
Package id 1:
  temp1_input: 38.000
  temp1_max: 85.000
  temp1_crit: 95.000
  temp1_crit_alarm: 0.000
Core 0:
  temp2_input: 31.000
  temp2_max: 85.000
  temp2_crit: 95.000
  temp2_crit_alarm: 0.000
Core 1:
  temp3_input: 33.000
  temp3_max: 85.000
  temp3_crit: 95.000
  temp3_crit_alarm: 0.000
Core 2:
  temp4_input: 31.000
  temp4_max: 85.000
  temp4_crit: 95.000
  temp4_crit_alarm: 0.000
Core 3:
  temp5_input: 33.000
  temp5_max: 85.000
  temp5_crit: 95.000
  temp5_crit_alarm: 0.000
Core 4:
  temp6_input: 33.000
  temp6_max: 85.000
  temp6_crit: 95.000
  temp6_crit_alarm: 0.000
Core 8:
  temp7_input: 35.000
  temp7_max: 85.000
  temp7_crit: 95.000
  temp7_crit_alarm: 0.000
Core 9:
  temp8_input: 31.000
  temp8_max: 85.000
  temp8_crit: 95.000
  temp8_crit_alarm: 0.000
Core 10:
  temp9_input: 34.000
  temp9_max: 85.000
  temp9_crit: 95.000
  temp9_crit_alarm: 0.000
Core 11:
  temp10_input: 35.000
  temp10_max: 85.000
  temp10_crit: 95.000
  temp10_crit_alarm: 0.000
Core 12:
  temp11_input: 33.000
  temp11_max: 85.000
  temp11_crit: 95.000
  temp11_crit_alarm: 0.000

power_meter-acpi-0
Adapter: ACPI interface
power1:
  power1_average: 136.000
  power1_average_interval: 0.001

coretemp-isa-0000
Adapter: ISA adapter
Package id 0:
  temp1_input: 38.000
  temp1_max: 85.000
  temp1_crit: 95.000
  temp1_crit_alarm: 0.000
Core 0:
  temp2_input: 31.000
  temp2_max: 85.000
  temp2_crit: 95.000
  temp2_crit_alarm: 0.000
Core 1:
  temp3_input: 34.000
  temp3_max: 85.000
  temp3_crit: 95.000
  temp3_crit_alarm: 0.000
Core 2:
  temp4_input: 34.000
  temp4_max: 85.000
  temp4_crit: 95.000
  temp4_crit_alarm: 0.000
Core 3:
  temp5_input: 32.000
  temp5_max: 85.000
  temp5_crit: 95.000
  temp5_crit_alarm: 0.000
Core 4:
  temp6_input: 37.000
  temp6_max: 85.000
  temp6_crit: 95.000
  temp6_crit_alarm: 0.000
Core 8:
  temp7_input: 37.000
  temp7_max: 85.000
  temp7_crit: 95.000
  temp7_crit_alarm: 0.000
Core 9:
  temp8_input: 36.000
  temp8_max: 85.000
  temp8_crit: 95.000
  temp8_crit_alarm: 0.000
Core 10:
  temp9_input: 38.000
  temp9_max: 85.000
  temp9_crit: 95.000
  temp9_crit_alarm: 0.000
Core 11:
  temp10_input: 37.000
  temp10_max: 85.000
  temp10_crit: 95.000
  temp10_crit_alarm: 0.000
Core 12:
  temp11_input: 35.000
  temp11_max: 85.000
  temp11_crit: 95.000
  temp11_crit_alarm: 0.000

i350bb-pci-0100
Adapter: PCI adapter
loc1:
  temp1_input: 57.000
  temp1_max: 120.000
  temp1_crit: 110.000

@BenBE
Member

BenBE commented Aug 23, 2023

Do you know how these sensors are distributed amongst the cores? If I count correctly and assume temperature 0 of each coretemp block to be the overall package temperature I see 24 sensors, which would amount to 48 cores.

@cgzones @fasterit Can you two take a look at this?

@AnOpenSauceDev
Author

AnOpenSauceDev commented Aug 23, 2023

My server has 2x E5-2680 v2, which should only have two threads per core, so I'm assuming there should be only 20 sensors. Oddly enough, btop detects all sensors fine, which makes me think it could be an htop issue.

@BenBE BenBE added bug 🐛 Something isn't working and removed question ❔ Further information is requested labels Aug 23, 2023
@SergeyKharenko

My server has the same problem:
Motherboard: Supermicro X10 DRi-T
CPU: Dual E5-2698V3
Here is the terminal:
image

In sensors -u, the temperature of each core is correct:
image

Htop is one of my favorite programs. I would appreciate it if the problem were fixed!

@AnOpenSauceDev
Author

My problem is that no temperatures show at all, even though I still get valid sensors readings.

@BenBE
Member

BenBE commented Sep 12, 2023

@Kharlenkow :

In sensors -u, the temperature of each core is correct:

Please provide the output as plain text. While images are fine for pointing out UI issues or conveying what the display looks like, they usually aren't very accessible or easy to process further. Also, your screenshot is missing the interesting part of the sensors -u output.

Htop is one of my favorite programs.

Glad to hear.

I would appreciate it if the problem were fixed!

Will have to see if we find a solution to properly process the available information and correlate it with our internal view of the system. This is not the first report regarding CPU sensor stuff – and likely not the last. That stuff is strange at times.

@AnOpenSauceDev Can you provide the full contents of /proc/cpuinfo? It looks kinda strange that core IDs aren't contiguous in the sensors -u output.

Also, if you want to help a bit with investigations: Can you try to establish some kind of mapping of physical cores to the temperature sensor cores by putting some load on individual CPU threads (affinity binding) and checking which temperature follows the load? TIA.
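The load test suggested above could be driven by a small program like the following. This is only a minimal sketch, assuming Linux with glibc; `pin_to_cpu` and `burn` are hypothetical helper names chosen for illustration and are not part of htop. It pins the calling thread to one logical CPU with `sched_setaffinity` and then spins so that the matching temperature reading can be watched in a second terminal.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // needed for sched_setaffinity() and the CPU_* macros
#endif
#include <sched.h>
#include <chrono>

// Pin the calling thread to one logical CPU (hypothetical helper).
// Returns true on success, false for an out-of-range index or a kernel refusal.
bool pin_to_cpu(int cpu) {
    if (cpu < 0 || cpu >= CPU_SETSIZE)
        return false;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}

// Busy-loop for about `seconds` of wall-clock time so the pinned core warms up.
// Returns the number of loop iterations, only to keep the loop observable.
unsigned long burn(int seconds) {
    volatile unsigned long spins = 0;
    const auto end = std::chrono::steady_clock::now() + std::chrono::seconds(seconds);
    while (std::chrono::steady_clock::now() < end)
        spins = spins + 1;
    return spins;
}
```

Calling `pin_to_cpu(3)` followed by `burn(30)` while repeatedly running sensors should make one Core entry rise above its neighbours, which gives one data point for the mapping.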

@Kharlenkow In case you have a different CPU, having the same information (cpuinfo, sensors reading, physical<-->sensors mapping) available would be nice.

@AnOpenSauceDev
Author

@AnOpenSauceDev Can you provide the full contents of /proc/cpuinfo? It looks kinda strange that core IDs aren't contiguous in the sensors -u output.

cpuinfo.txt

It might take a while to benchmark every core, but so far nothing seems off.

@BenBE
Member

BenBE commented Sep 13, 2023

Thank you for that info. Seems this strange core ID counting is in the CPU info as well. At least makes things consistent. :)

@SergeyKharenko

@BenBE

Please provide the output as plain text. While images are fine for pointing out UI issues or conveying what the display looks like, they usually aren't very accessible or easy to process further. Also, your screenshot is missing the interesting part of the sensors -u output.

Thank you for paying attention to my feedback! Here is the entire output:

coretemp-isa-0001
Adapter: ISA adapter
Package id 1:
  temp1_input: 41.000
  temp1_max: 80.000
  temp1_crit: 98.000
  temp1_crit_alarm: 0.000
Core 0:
  temp2_input: 33.000
  temp2_max: 80.000
  temp2_crit: 98.000
  temp2_crit_alarm: 0.000
Core 1:
  temp3_input: 33.000
  temp3_max: 80.000
  temp3_crit: 98.000
  temp3_crit_alarm: 0.000
Core 2:
  temp4_input: 32.000
  temp4_max: 80.000
  temp4_crit: 98.000
  temp4_crit_alarm: 0.000
Core 3:
  temp5_input: 34.000
  temp5_max: 80.000
  temp5_crit: 98.000
  temp5_crit_alarm: 0.000
Core 4:
  temp6_input: 34.000
  temp6_max: 80.000
  temp6_crit: 98.000
  temp6_crit_alarm: 0.000
Core 5:
  temp7_input: 34.000
  temp7_max: 80.000
  temp7_crit: 98.000
  temp7_crit_alarm: 0.000
Core 6:
  temp8_input: 33.000
  temp8_max: 80.000
  temp8_crit: 98.000
  temp8_crit_alarm: 0.000
Core 7:
  temp9_input: 32.000
  temp9_max: 80.000
  temp9_crit: 98.000
  temp9_crit_alarm: 0.000
Core 8:
  temp10_input: 35.000
  temp10_max: 80.000
  temp10_crit: 98.000
  temp10_crit_alarm: 0.000
Core 9:
  temp11_input: 34.000
  temp11_max: 80.000
  temp11_crit: 98.000
  temp11_crit_alarm: 0.000
Core 10:
  temp12_input: 31.000
  temp12_max: 80.000
  temp12_crit: 98.000
  temp12_crit_alarm: 0.000
Core 11:
  temp13_input: 35.000
  temp13_max: 80.000
  temp13_crit: 98.000
  temp13_crit_alarm: 0.000
Core 12:
  temp14_input: 31.000
  temp14_max: 80.000
  temp14_crit: 98.000
  temp14_crit_alarm: 0.000
Core 13:
  temp15_input: 35.000
  temp15_max: 80.000
  temp15_crit: 98.000
  temp15_crit_alarm: 0.000
Core 14:
  temp16_input: 33.000
  temp16_max: 80.000
  temp16_crit: 98.000
  temp16_crit_alarm: 0.000
Core 15:
  temp17_input: 34.000
  temp17_max: 80.000
  temp17_crit: 98.000
  temp17_crit_alarm: 0.000

power_meter-acpi-0
Adapter: ACPI interface
power1:
  power1_average: 4294967.295
  power1_average_interval: 1.000

coretemp-isa-0000
Adapter: ISA adapter
Package id 0:
  temp1_input: 41.000
  temp1_max: 80.000
  temp1_crit: 98.000
  temp1_crit_alarm: 0.000
Core 0:
  temp2_input: 34.000
  temp2_max: 80.000
  temp2_crit: 98.000
  temp2_crit_alarm: 0.000
Core 1:
  temp3_input: 33.000
  temp3_max: 80.000
  temp3_crit: 98.000
  temp3_crit_alarm: 0.000
Core 2:
  temp4_input: 35.000
  temp4_max: 80.000
  temp4_crit: 98.000
  temp4_crit_alarm: 0.000
Core 3:
  temp5_input: 33.000
  temp5_max: 80.000
  temp5_crit: 98.000
  temp5_crit_alarm: 0.000
Core 4:
  temp6_input: 36.000
  temp6_max: 80.000
  temp6_crit: 98.000
  temp6_crit_alarm: 0.000
Core 5:
  temp7_input: 34.000
  temp7_max: 80.000
  temp7_crit: 98.000
  temp7_crit_alarm: 0.000
Core 6:
  temp8_input: 32.000
  temp8_max: 80.000
  temp8_crit: 98.000
  temp8_crit_alarm: 0.000
Core 7:
  temp9_input: 31.000
  temp9_max: 80.000
  temp9_crit: 98.000
  temp9_crit_alarm: 0.000
Core 8:
  temp10_input: 34.000
  temp10_max: 80.000
  temp10_crit: 98.000
  temp10_crit_alarm: 0.000
Core 9:
  temp11_input: 33.000
  temp11_max: 80.000
  temp11_crit: 98.000
  temp11_crit_alarm: 0.000
Core 10:
  temp12_input: 33.000
  temp12_max: 80.000
  temp12_crit: 98.000
  temp12_crit_alarm: 0.000
Core 11:
  temp13_input: 34.000
  temp13_max: 80.000
  temp13_crit: 98.000
  temp13_crit_alarm: 0.000
Core 12:
  temp14_input: 36.000
  temp14_max: 80.000
  temp14_crit: 98.000
  temp14_crit_alarm: 0.000
Core 13:
  temp15_input: 35.000
  temp15_max: 80.000
  temp15_crit: 98.000
  temp15_crit_alarm: 0.000
Core 14:
  temp16_input: 33.000
  temp16_max: 80.000
  temp16_crit: 98.000
  temp16_crit_alarm: 0.000
Core 15:
  temp17_input: 32.000
  temp17_max: 80.000
  temp17_crit: 98.000
  temp17_crit_alarm: 0.000

I am absolutely sure the two CPUs are identical because I personally installed them in the sockets, unless I was cheated by the seller.
My /proc/cpuinfo file is here: cpuinfo.txt

@BenBE
Member

BenBE commented Sep 13, 2023

Thank you for the quick feedback.

I did some study of the documentation of the coretemp stuff and it seems the main issue in htop comes down to how the sensors are mapped onto the actual CPU cores. This will likely take a bit of work, as currently the information related to the cpuinfo (and thus core layout) is not kept for correlation in the libsensors code.

Also, the libsensors code assumes the core IDs to be contiguous, which is clearly not the case in the example by @AnOpenSauceDev. A second issue arises from the multiple coretemp instances that appear when several CPUs are present in the system. Both issues can be resolved by properly mapping the core IDs of the coretemp instances to the physical CPU cores available from /proc/cpuinfo.

@cgzones Can you please take a look at refactoring the libsensors code? Would be nice if we could implement some proper mapping of sensors to their physical cores.

The heuristic could still remain similar to what it is now, being all cores inherit Tctrl, Tdie followed by Tccd{X}, with only parts of the information cleared out, if multiple readings are available on the same core (e.g. acpitz + coretemp). If acpitz gives temperatures for cores not covered by coretemp, those should still keep the acpitz readings.
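To make the mapping idea concrete, here is a minimal sketch of how the topology could be recovered from text in /proc/cpuinfo format. It is not htop code: `group_by_core` and `CoreKey` are hypothetical names, and the sketch assumes each processor block carries the "processor", "physical id" and "core id" fields and ends with a blank line. Keying by the pair (physical id, core id) means gaps such as Core 4 followed by Core 8 need no special handling.

```cpp
#include <cstdlib>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Key: (physical id, core id) as reported by /proc/cpuinfo.
using CoreKey = std::pair<int, int>;

// Group logical CPUs ("processor") by the physical core they belong to.
// Hypothetical helper for illustration only.
std::map<CoreKey, std::vector<int>> group_by_core(const std::string& cpuinfo) {
    std::map<CoreKey, std::vector<int>> cores;
    std::istringstream in(cpuinfo);
    std::string line;
    int processor = -1, package = -1, core = -1;

    // Store the block collected so far, then reset for the next one.
    auto flush = [&]() {
        if (processor >= 0 && package >= 0 && core >= 0)
            cores[CoreKey(package, core)].push_back(processor);
        processor = package = core = -1;
    };

    while (std::getline(in, line)) {
        if (line.empty()) {  // a blank line ends one processor block
            flush();
            continue;
        }
        const auto colon = line.find(':');
        if (colon == std::string::npos)
            continue;
        std::string key = line.substr(0, colon);
        key.erase(key.find_last_not_of(" \t") + 1);  // trim padding before ':'
        const int value = std::atoi(line.c_str() + colon + 1);
        if (key == "processor") processor = value;
        else if (key == "physical id") package = value;
        else if (key == "core id") core = value;
    }
    flush();  // the last block may not be followed by a blank line
    return cores;
}
```

With such a table, a reading labelled "Core 8" on the chip for package 1 could be assigned to every logical CPU stored under the key (1, 8), provided the sensor labels really follow the core id values.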


@SergeyKharenko

SergeyKharenko commented Sep 18, 2023

Thanks again for your attention!!!

CAUSES:

I studied the code in linux/LibSensors.c and wrote a simple test.

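#include <iostream>
#include <sensors/sensors.h>

using namespace std;

int main() {
    // Load the default libsensors configuration.
    if (sensors_init(NULL) != 0) {
        cerr << "sensors_init failed" << endl;
        return 1;
    }
    int n = 0;
    // Enumerate every detected chip; n is the iteration state.
    for (const sensors_chip_name* chip = sensors_get_detected_chips(NULL, &n); chip; chip = sensors_get_detected_chips(NULL, &n)) {
        cout << "SENSOR:" << chip->prefix << endl;
        int m = 0;
        // List the feature names (temp1, temp2, ...) of this chip.
        for (const sensors_feature* feature = sensors_get_features(chip, &m); feature; feature = sensors_get_features(chip, &m)) {
            cout << "    name " << feature->name << endl;
        }
    }
    sensors_cleanup();
    return 0;
}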

Here is the output (from another dual-socket server, with dual Xeon E5-2643 V3, which has the same problem):

SENSOR:coretemp
    name temp1
    name temp2
    name temp3
    name temp4
    name temp5
    name temp6
    name temp7
SENSOR:amdgpu
    name in0
    name fan1
    name temp1
    name power1
SENSOR:i350bb
    name temp1
SENSOR:nvme
    name temp1
SENSOR:coretemp
    name temp1
    name temp2
    name temp3
    name temp4
    name temp5
    name temp6
    name temp7
SENSOR:power_meter
    name power1

There are two SENSORs named coretemp, and they expose exactly the same feature names. However, line 185 of linux/LibSensors.c reads:
unsigned long int tempID = strtoul(feature->name + strlen("temp"), NULL, 10);
tempID is taken from the number that follows temp. The bug appears when updating the temperatures of the second CPU, because its tempID is wrong. For example, temp1 of the second CPU should be stored at index 6 of the CPU temperature array. As a result, the temperature values of the first CPU are overwritten by those of the second.
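The collision can be reproduced without libsensors. The sketch below is illustrative only: the `Reading` struct and both indexing functions are hypothetical stand-ins, and the chip addresses 0 and 1 are assumed to correspond to coretemp-isa-0000 and coretemp-isa-0001. Only `temp_id` mirrors the parsing from the line quoted above.

```cpp
#include <cstdlib>
#include <cstring>
#include <map>
#include <string>
#include <utility>
#include <vector>

// One reading; a stand-in for what the chip/feature loop would collect.
struct Reading {
    int chipAddr;         // assumed: 0 for coretemp-isa-0000, 1 for coretemp-isa-0001
    std::string feature;  // "temp1", "temp2", ...
    double celsius;
};

// Same parsing as the quoted line: the number after the "temp" prefix.
unsigned long temp_id(const std::string& feature) {
    return std::strtoul(feature.c_str() + std::strlen("temp"), nullptr, 10);
}

// Indexing by tempID alone: a later chip overwrites an earlier one.
std::map<unsigned long, double> by_temp_id(const std::vector<Reading>& readings) {
    std::map<unsigned long, double> out;
    for (const auto& r : readings)
        out[temp_id(r.feature)] = r.celsius;
    return out;
}

// Indexing by (chip address, tempID): every reading keeps its own slot.
std::map<std::pair<int, unsigned long>, double>
by_chip_and_id(const std::vector<Reading>& readings) {
    std::map<std::pair<int, unsigned long>, double> out;
    for (const auto& r : readings)
        out[std::make_pair(r.chipAddr, temp_id(r.feature))] = r.celsius;
    return out;
}
```

Feeding both functions two readings named temp2 from different chips leaves one entry in the first map and two in the second, which is the overwrite described above.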

SOLUTION

step 0:
In Machine.h, add a field CPUsockets to the Machine structure.

step 1:
In linux/LinuxMachine.c, add a function 'LinuxMachine_updateCPUsockets' that gets the value of CPUsockets by reading
/sys/devices/system/node/has_cpu
On my machine, for example, the output is '0-1'. I guess that on a single-socket system it might be '0-0'.

step 2:
In linux/LibSensors.c, add bias=existingCPUs/CPUsockets and int current_CPUsocket=0 at the beginning of the function LibSensors_getCPUTemperatures. Then change line 185 to:
unsigned long int tempID = strtoul(feature->name + strlen("temp"), NULL, 10)+bias*current_CPUsocket;
Don't forget to add current_CPUsocket++ after line 211!

step 3:
Change lines 256-262 to update the temperatures socket by socket. (I suggest reading the file /sys/devices/system/cpu/smt/active to detect SMT/HT.)
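Step 1 needs a parser for the range syntax found in has_cpu. The following is a minimal sketch, not the proposed htop function: `parse_id_list` is a hypothetical helper that expands strings such as "0-1" or "0-9,20-29" into individual IDs. As far as I know, the kernel prints a single ID as plain "0" rather than "0-0"; the sketch accepts both forms.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Expand a kernel list-format string ("0-1", "0", "0-9,20-29") into IDs.
// Hypothetical helper; std::stoi throws on text that is not numeric.
std::vector<int> parse_id_list(const std::string& text) {
    std::vector<int> ids;
    std::istringstream in(text);
    std::string part;
    while (std::getline(in, part, ',')) {
        // Trim surrounding whitespace, including the trailing newline sysfs adds.
        const auto begin = part.find_first_not_of(" \t\r\n");
        if (begin == std::string::npos)
            continue;
        part = part.substr(begin, part.find_last_not_of(" \t\r\n") - begin + 1);

        const auto dash = part.find('-');
        const int first = std::stoi(part.substr(0, dash));
        const int last = (dash == std::string::npos)
                             ? first
                             : std::stoi(part.substr(dash + 1));
        for (int id = first; id <= last; ++id)
            ids.push_back(id);
    }
    return ids;
}
```

The socket count from step 1 would then be `parse_id_list(contents).size()`, where `contents` is whatever was read from the file. Treating each node as one socket is itself an assumption that may not hold on every system.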

Since I have been a little busy at work recently, I have not implemented this in the original project yet (I am so sorry, TOT).
I hope my suggestions will be adopted!

@BenBE
Member

BenBE commented Sep 18, 2023

That's still incomplete, because your solution does not properly track which instance of coretemp is associated with which physical CPU. Overall it's not as simple as laid out, because you need to track the topology, which is currently unimplemented.

@SergeyKharenko

SergeyKharenko commented Feb 1, 2024

Over the past two days, I have studied the source code of the hwmon subsystem and lm-sensors, and tested it with numactl (which can force a task to run on a certain CPU core).

First of all, the way lm-sensors numbers the tempX features within each hwmon group corresponds exactly to the sequential increase of the core ID (at least on my three machines). I found no out-of-order correspondence like the one you described.

In addition, for multi-socket motherboards, I tested and verified the one-to-one correspondence between the CPU socket ID and nodeX in the system directory (at least on dual-socket motherboards; I don't have boards with four or more sockets to test). The addr field of the sensors_chip_name structure in lm-sensors also indicates the actual CPU socket ID. The nodeX sub-folders are located under /sys/devices/system/node/. Each of them contains a file named cpulist that lists the core IDs in the system that are bound to that physical CPU. That completely solves the problem of assigning system CPU core IDs to CPU socket IDs.

To sum up, we could first use the cpulist files to deduce which CPU socket a core belongs to, based on the suffix X of cpuX in /proc/stat, and then associate each hwmon group with the corresponding CPU socket through the addr field of sensors_chip_name in lm-sensors. Within a single hwmon group, call the lm-sensors API and read the temperature of each core in order.
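The first two parts of that plan amount to two lookups, sketched below under stated assumptions. `socket_of_cpu` and `temps_for_cpu` are hypothetical names; the node lists are assumed to be already expanded from each nodeX/cpulist file, and the chip address is assumed to equal the socket ID as reported above. One node per socket is also an assumption that would need checking on systems configured to expose several NUMA nodes per package. Matching individual cores within a chip is not covered here.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Return the node (assumed to be the socket) whose CPU list contains `cpu`,
// or -1 if no list does. nodeCpus[n] holds the IDs from nodeN/cpulist.
int socket_of_cpu(int cpu, const std::vector<std::vector<int>>& nodeCpus) {
    for (std::size_t node = 0; node < nodeCpus.size(); ++node)
        for (int id : nodeCpus[node])
            if (id == cpu)
                return static_cast<int>(node);
    return -1;
}

// Select the readings of the chip whose address matches the CPU's socket.
// chipTemps maps a chip address to its tempX values in feature order.
// Returns nullptr when the CPU or the chip is unknown.
const std::vector<double>* temps_for_cpu(
        int cpu,
        const std::vector<std::vector<int>>& nodeCpus,
        const std::map<int, std::vector<double>>& chipTemps) {
    const int socket = socket_of_cpu(cpu, nodeCpus);
    const auto it = chipTemps.find(socket);
    return it == chipTemps.end() ? nullptr : &it->second;
}
```

Returning a pointer rather than a copy keeps the lookup cheap when it runs for every logical CPU on each refresh. The caller is expected to handle the nullptr case by leaving that CPU without a reading.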

Hope my suggestions would be adopted!

@SergeyKharenko

@BenBE @AnOpenSauceDev I browsed the pull request list and found that #1352 already does what I want.

I also tested her fork using the same method. As shown below, the problem is solved, and each core ID is correctly matched with its temperature. I hope this PR will be accepted!

image

@BenBE BenBE linked a pull request Feb 2, 2024 that will close this issue