NovaCustom V560TNE bricks most likely due to overheating #1008

Closed
filipleple opened this issue Aug 19, 2024 · 15 comments
Assignees
Labels
bug Something isn't working firmware needs review novacustom_v56_mtl NovaCustom V56 Series

Comments

@filipleple
Member

Component

Dasharo firmware

Device

NovaCustom V56 14th Gen

Dasharo version

v0.9.1-rc2

Dasharo Tools Suite version

--

Test case ID

--

Brief summary

NovaCustom V560TNE bricks, most likely due to overheating, usually when rebooting after flashing.

How reproducible

90%

How to reproduce

Run any regression suite.

Expected behavior

The suite should pass

Actual behavior

After a few tests and flashes, the laptop refuses to boot and becomes very hot. It boots again after a cooldown period of roughly 5-10 minutes.

Screenshots

--

Additional context

So far we can reliably reproduce this on one unit only.

Solutions you've tried

Re-applying thermal paste

@philipandag

I might have found a clue here. The THR002.001 case sets the temperature threshold to 70.0 °C, but it fails because the CPU is reported at 80.0 °C. After the test, the device did not turn back on. Maybe the device, for some reason, does not reboot after the threshold is set and only reboots after the stress test has run, which leaves it unable to boot until the CPU temperature drops?

------------------------------------------------------------------------------
THR002.001 Try to enter a threshold value within the limits and ve... ....
Checking if stress-ng is installed...

Package stress-ng is installed
THR002.001 Try to enter a threshold value within the limits and ve... | FAIL |
'80.0 < 73' should be true.
------------------------------------------------------------------------------
Dasharo-Compatibility.Cpu-Throttling                                  | FAIL |
3 tests, 0 passed, 1 failed, 2 skipped
==============================================================================

(... skips)

==============================================================================
DDET001.001 USB Stack disable :: Test disabling the USB stack         | FAIL |
OSError: Socket is closed
------------------------------------------------------------------------------
DDET002.001 USB Stack enable :: Test enabling the USB stack           | FAIL |
OSError: Socket is closed
------------------------------------------------------------------------------
DDET003.001 Usb Devices Detected In Firmware Warmboot :: Test if U... ..^CSecond signal will force exit.
DDET003.001 Usb Devices Detected In Firmware Warmboot :: Test if U... | FAIL |
Execution terminated by signal

(exited at this moment because device does not boot)
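
The failure message `'80.0 < 73' should be true` suggests the test compares the measured CPU temperature against the configured 70.0 threshold plus a small margin. Below is a minimal Python sketch of that kind of check. The 3-degree margin is inferred from the `73` in the log, and the function name is hypothetical; this is not the actual test suite code.

```python
def temperature_within_threshold(
    measured_c: float, threshold_c: float, margin_c: float = 3.0
) -> bool:
    """Return True when the measured temperature is below threshold + margin.

    margin_c=3.0 is an assumption inferred from the '80.0 < 73' log line.
    """
    return measured_c < threshold_c + margin_c


# Values from the log above: threshold 70.0, reported CPU temperature 80.0.
print(temperature_within_threshold(80.0, 70.0))  # False, consistent with the FAIL
print(temperature_within_threshold(72.5, 70.0))  # True
```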

@philipandag

After running dasharo-security/wifi-bluetooth-switch.robot the laptop got "bricked" again. (The suite failed because flashing via SSH is still interrupted and running tests via DCU is not possible right now.)
I pressed Fn+1 and, to my surprise, the fans started blowing at maximum speed. I left the laptop like that without touching anything else: no powering down, no disconnecting the battery, no other key presses. After a couple of minutes, systemd messages about the GPU started appearing, and after another couple of minutes the laptop rebooted and worked properly.

image
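
To put a number on how long such a recovery takes, one option is to poll the device until it answers again. The following Python sketch is only an illustration: it assumes the laptop exposes SSH on TCP port 22 once it has booted, and the function names and the address in the usage note are hypothetical.

```python
import socket
import time
from typing import Callable, Optional


def port_is_open(host: str, port: int = 22, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def wait_for_recovery(
    probe: Callable[[], bool],
    max_wait_s: float = 900.0,
    interval_s: float = 10.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll probe() until it returns True.

    Returns the elapsed seconds on success, or None if max_wait_s passes first.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= max_wait_s:
            return None
        sleep(interval_s)
```

For example, `wait_for_recovery(lambda: port_is_open("192.0.2.10"))` would report how many seconds passed before the device at that placeholder address became reachable again.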

@philipandag

I would like to point out that pressing Fn+1 a second time to DISABLE the high performance mode causes one of the fans to slow down immediately, while the second one slows down gradually. This is easy to verify by ear. Maybe it is somehow related.

@wessel-novacustom

I have also seen bricks after suspending and after reboots. In the cases where I saw this, ME was enabled, but a custom BIOS boot splash logo was implemented.

@mkopec
Member

mkopec commented Aug 23, 2024

Something is definitely overheating, especially on the RTX 4070 models. Powering off and waiting a couple of minutes usually works.

@mkopec
Member

mkopec commented Aug 23, 2024

Based on POST codes, I think it hangs in EDK2. We need to build coreboot with EC logging enabled, EDK2 in debug mode with serial redirection enabled, and the EC with the parallel debugger enabled, then check in the logs where it hangs.

@wessel-novacustom

We have found that the V560TNE bricks when suspending with the default kernel of Ubuntu 24.04 LTS, fully updated.

The same happens after trying to integrate a custom boot logo with DTS (RC) on the V560TND.

This issue should be top priority.

@wessel-novacustom

I can confirm that the laptop that didn't turn on after suspending could be turned on again once it was cooled down.

@philipandag

philipandag commented Sep 4, 2024

The suspend issues are fixed by upgrading the kernel to 6.9. Changing the boot logo worked fine today on our V560TNE with v0.9.1-rc4 and kernel 6.9.

https://docs.dasharo.com/unified/clevo/post-install/#linux

On Gen 14 (Meteor Lake), it's recommended to install the Ubuntu mainline kernel, which is a newer version than the default Ubuntu kernel. This version contains additional fixes for newer hardware which helps with power management and suspend on Gen 14 laptops.
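
A quick way to check whether a given kernel release string meets the 6.9 version mentioned in this thread is sketched below in Python. The helper name is hypothetical, and the parsing assumes release strings that begin with `major.minor`, as Ubuntu kernel releases do.

```python
import re
from typing import Tuple


def kernel_at_least(release: str, minimum: Tuple[int, int] = (6, 9)) -> bool:
    """Compare the leading major.minor of a kernel release string to minimum."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    version = (int(match.group(1)), int(match.group(2)))
    # Tuple comparison handles 6.10 > 6.9 correctly, unlike string comparison.
    return version >= minimum


# Example release strings (illustrative, not taken from the affected units).
print(kernel_at_least("6.8.0-41-generic"))
print(kernel_at_least("6.9.3-060903-generic"))
```

On a running system, the string returned by Python's `platform.release()` (or by `uname -r`) can be passed to the helper.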

@macpijan
Contributor

macpijan commented Sep 4, 2024

@filipleple @philipandag But do you still see bricks when using Linux 6.9, just not after suspend?

@philipandag

Yes, although they are much rarer.

@mkopec
Member

mkopec commented Sep 10, 2024

Very likely to be fixed by Dasharo/ec@3786c8c.

The ME_WE pin was floating, and in some conditions (depending on temperature, but also possibly other factors) it would be sampled high instead of low, which in turn caused ME to enter FDOPSS state. When in FDOPSS, sending the End-of-post HECI command would fail, and coreboot would refuse to boot, because booting without sending EOP is considered insecure.

The pin was configured as an input because we took the GPIO config from the previous firmware and missed this error during review.
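
The failure chain described in this comment can be summarised as a small decision model. This Python sketch is purely illustrative: all names are hypothetical, and it mirrors the description above rather than any coreboot, EC, or ME code.

```python
from enum import Enum


class BootResult(Enum):
    BOOTS = "boots normally"
    REFUSES_TO_BOOT = "coreboot refuses to boot (EOP not sent)"


def boot_outcome(me_we_sampled_high: bool) -> BootResult:
    """Model the chain: ME_WE level -> FDOPSS -> EOP result -> boot decision."""
    # Per the comment: ME_WE sampled high puts the ME into the FDOPSS state.
    me_in_fdopss = me_we_sampled_high
    # In FDOPSS, sending the End-of-post HECI command fails.
    eop_sent = not me_in_fdopss
    # Booting without EOP is considered insecure, so boot stops in that case.
    return BootResult.BOOTS if eop_sent else BootResult.REFUSES_TO_BOOT


# A floating input may read either level, so both outcomes are possible.
for level in (False, True):
    print(level, boot_outcome(level).value)
```

In this model, a pin that is always read low can only reach the `BOOTS` outcome, which is consistent with the bricks disappearing once the pin no longer floats.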

@philipandag

I have been testing v0.9.1-rc5 on the V540TND since yesterday and no bricks have occurred. I tried quick reboots, consecutive reflashes, and stressing the hardware to make it hot. It seems that the V540TND, at least, doesn't have this issue with the newest rc5.

@wessel-novacustom

> I am testing the v0.9.1-rc5 on V540TND since yesterday and no bricks happened. Tried quick reboots, consecutive reflashes and stressing the hardware to make it hot. It seems that, at least the V540TND, doesn't have this issue with the newest rc5.

Same here for rc5 on the V560TND and V560TNE.

@mkopec
Member

mkopec commented Sep 12, 2024

In that case, I believe we can close this issue.

@mkopec closed this as completed Sep 12, 2024