Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Thermal Runaway Protection false positive ignoring the configuration values #19335

Closed
Kurolox opened this issue Sep 10, 2020 · 13 comments
Closed

Comments

@Kurolox
Copy link

Kurolox commented Sep 10, 2020

Bug Description

The thermal runaway protection gives false positives even when the hysteresis and period values configured in the firmware are never met.

I believe there's something wrong regarding how the thermal runaway protection is calculated. Just as an example, the bed temperature was supposed to change from 85C to 80C. I've got a thermal runaway protection error when the bed temperature hit 78C, halting my print. This is nowhere near close the default values given in the firmware.

After some experimenting with bed temperature variations and thermal runaway settings, I believe that the calculation for the thermal runaway protection trigger isn't updating the new target temperature after an M140 command, and it's still using the previous one, causing a thermal runaway error when it shouldn't have.

I think it's important to add that just issuing two M140 commands with different temperatures aren't triggering this issue. I've provided below a .gcode file that triggers this bug, but I haven't been able to create a smaller .gcode file to reproduce this issue.

I've also added a video below displaying the issue at hand.

My Configurations

config.zip

Steps to Reproduce

  1. Attempt to use the provided .gcode file

Expected behavior: The file should print without issues

Actual behavior: A false positive thermal runaway happens outside of the configured firmware parameters

Additional Information

I've added the particular gcode I was running that caused this thermal runaway false positive. It happens about a minute after the M140 S80 command is issued at the first layer. It may help when reproducing this issue.

OUT_CFFFP_tempTower.zip

I've also recorded a video displaying the issue. The printer is using the firmware attached to this issue. It can be seen that a thermal runaway happens when the bed temperature reaches 77C, which is within the allowed hysteresis value configured in the printer, and a thermal runaway shouldn't have happened.

@AnHardt
Copy link
Member

AnHardt commented Sep 10, 2020

You are right with:

After some experimenting with bed temperature variations and thermal runaway settings, I believe that the calculation for the thermal runaway protection trigger isn't updating the new target temperature after an M140 command, and it's still using the previous one, causing a thermal runaway error when it shouldn't have.

The runaway test ignores the target temperature completely. It just cares about if the heaters temperature is dropping enough when cooling down or rising enough when heating. If that is to slow or the wrong direction it bites.
This is not a one time test when changing target temperature, it is repeated.

Because the temperature drops fast when hot and slower at lower temperatures (the difference to the environments temperature is the real reason for that) the first tests may succeed but later ones fail.

With a well insulated bed like yours seems to be (or are the PID factors not aggressive enough?) shrink the temperature span to the minimum your bed temperature noise level allows and increase the time until it works in all cases (as long nothing is defect).

As always - don't expect any default value to be valid for your machine. We have configurations to adjust parameters to your machine. In the example configuration files we have values that worked on other machines. They may be totally different from what your machine may need.

@Kurolox
Copy link
Author

Kurolox commented Sep 10, 2020

Yeah. I was actually experimenting with the logging that was commented out of the thermal_runaway_protection function, and noticed that running an M140 command doesn't trigger any thermal runnaway checks (since it's out of the hotend loop, I assume. It's the only time the runaway function is called in the whole code). That would explain why it only happened in the middle of a print and couldn't reproduce it myself manually.

I wasn't using the default values. I did increase them, but maybe I wasn't aggresive enough when doing so. I only bumped the hysteresis temperature by a single degree and the interval by 50% (from 40s to 60s).

The runaway test ignores the target temperature completely.

Is this actually right? The thermal_runaway_protection function does use the target temperature for some calculations, mostly related to the timer.

Anyways. While you're probably right that I just didn't configure the thermal runaway settings aggresively enough, I'm still going to play around with the logging and see if I notice anything weird in the thermal runaway calculation.

@AnHardt
Copy link
Member

AnHardt commented Sep 10, 2020

I just didn't configure the thermal runaway settings aggresively enough,

You want the test to be less aggressive - not more. For that shrink the temperature span and/or increase the time.

@Kurolox
Copy link
Author

Kurolox commented Sep 10, 2020

Yeah, sorry about my choice of words. What I meant to say is that I didn't reduce it aggresively enough.

@Kurolox
Copy link
Author

Kurolox commented Sep 10, 2020

I've been tinkering with this for a while, and no matter what values I plug into the THERMAL_PROTECTION_PERIOD and THERMAL_PROTECTION_HYSTERESIS configuration, a thermal runaway will always happen at the same part of the print, after the same amount of time (independent of the THERMAL_PROTECTION_PERIOD value).

I've got to the point where I've just plugged the following values without any difference in the result.

#define THERMAL_PROTECTION_PERIOD 9999
#define THERMAL_PROTECTION_HYSTERESIS 999

This doesn't look like a configuration issue, @AnHardt. I'll keep looking at it and try to get any useful information.

@AnHardt
Copy link
Member

AnHardt commented Sep 10, 2020

After having a look into the current code:
Sorry confused it again with the code producing "Heating Faild".

And yes. I'm assuming there is an error in the "Thermal Runaway" code.

@Kurolox
Copy link
Author

Kurolox commented Sep 10, 2020

I'll check it myself, so I'd like to be assigned this issue if possible.

@AnHardt
Copy link
Member

AnHardt commented Sep 10, 2020

Try it.

@Kurolox
Copy link
Author

Kurolox commented Sep 10, 2020

After some time and thanks to @p3p, the issue was that I wasn't using the correct settings in the configuration file.

THERMAL_PROTECTION_PERIOD and THERMAL_PROTECTION_HYSTERESIS only affect the hotend, so I should've modified THERMAL_PROTECTION_BED_PERIOD and THERMAL_PROTECTION_BED_HYSTERESIS.

@Kurolox Kurolox closed this as completed Sep 10, 2020
@AnHardt
Copy link
Member

AnHardt commented Sep 10, 2020

OMG.

Task: "Detect if the heater system is able to hold the target temperature. Especially detect if the heater is stuck completely ON or OFF."

Strategy: "Once the heater reached a window (target temperature +- THERMAL_PROTECTION_HYSTERESIS ) around the target temperature start the tests. Alarm if the window is left for longer than THERMAL_PROTECTION_PERIOD. Stop the tests if a new target temperature is set - then the counterpart watching the temperature in/de-crease is kicking in.

Code (2.0.x code for bugfix seems to be the same):
We will handle this with a state-machine with four states.
enum TRState : char { TRInactive, TRFirstHeating, TRStable, TRRunaway };

The algorithm is called with.
static void thermal_runaway_protection(tr_state_machine_t &state, const float &current, const float &target, const heater_ind_t heater_id, const uint16_t period_seconds, const uint16_t hysteresis_degc);
The different state machines for the heaters are initialized with TRInactive and time 0
https://github.com/MarlinFirmware/Marlin/blob/2.0.x/Marlin/src/module/temperature.cpp#L1937-L1945
For detecting target temperature changes we store the old temperatures in
https://github.com/MarlinFirmware/Marlin/blob/2.0.x/Marlin/src/module/temperature.cpp#L1949
and initialize with '0'. We reserve HOTENDS -1 places for the hotends plus 2 for bed and chamber.

https://github.com/MarlinFirmware/Marlin/blob/2.0.x/Marlin/src/module/temperature.cpp#L1964
tries to find the index into above array but will fail for H_CHAMBER (-2) what gets the same index as the bed.

https://github.com/MarlinFirmware/Marlin/blob/2.0.x/Marlin/src/module/temperature.cpp#L1977-L1982
On target temperatur change this switches the state to TRFirstHeating except if the target is 0 - then to TRInactive.

Then follows the switch to the states.

https://github.com/MarlinFirmware/Marlin/blob/2.0.x/Marlin/src/module/temperature.cpp#L1988-L1991
switches from TRFirstHeating to TRStable when crossing the target temperature from below. If coming from above TRStable is entered immediately. Better would be a test if current is within target+-range.

In case TRStabe
https://github.com/MarlinFirmware/Marlin/blob/2.0.x/Marlin/src/module/temperature.cpp#L2012-L2015
we have about the same problem. The timer is restarted if current is above the lower limit but also if target is above the upper limit - what should not.

If current is not within the limits and period_seconds is over go to TRRunaway state.
https://github.com/MarlinFirmware/Marlin/blob/2.0.x/Marlin/src/module/temperature.cpp#L2016-L2016

Errors:
The code clearly does not do what i expect it to do.
Calculating the same index for the bed and the chamber for sure will cause problems if thermal protection for the bed and for the chamber is on at the same time.
The other two problems should cancle out each other but degrade the ability to detect stuck ON heaters. Only heaters, not heating enough, are detected. That's not the dangerous part of the story.

In your case, coming from above, TRStable is entered immediately, but there then the timer is always restarted and should not run out. Moreover - above the lower margin the timer is never tested.

@Kurolox
I don't understand where your problem is coming from.

@thinkyhead
Please have a look on that.

@Kurolox
Copy link
Author

Kurolox commented Sep 10, 2020

My problem was that I was using the wrong settings in the configuration and I was changing the thermal protection parameters for the hotend instead than the heated bed. However, at least it seems like you found some legitimate issues out of this.

@AnHardt
Copy link
Member

AnHardt commented Sep 10, 2020

#if ENABLED(THERMAL_PROTECTION_BED)
if (degBed() > BED_MAXTEMP)
_temp_error(H_BED, PSTR(MSG_T_THERMAL_RUNAWAY), GET_TEXT(MSG_THERMAL_RUNAWAY));
#endif

What the fuc* is this?
A max-temp check should not be conditional. Even not conditional on THERMAL_PROTECTION_BED.
But worst is, it raises a MSG_THERMAL_RUNAWAY instead of a MAXTEMP_ERROR. That makes debugging completely impossible.

Same problem for hotend and chamber heaters.

Nowadays any tried improvement seems to introduce at least two new bugs.
Time to leave Marlin. It's wasted time. :-( 😠

@github-actions
Copy link

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked and limited conversation to collaborators Nov 10, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants