[thermalctld] Enlarge startretries value to avoid thermalctld not able to restart during regression test #5633

Junchao-Mellanox · 2020-10-15T01:08:30Z

- Why I did it

Found error logs in syslog:

Oct 10 12:41:09.148655 arc-switch1029 INFO pmon#supervisord 2020-10-10 12:41:07,689 INFO gave up: thermalctld entered FATAL state, too many start retries too quickly

The issue is related to the "startsecs" configuration of thermalctld in /etc/supervisor/conf.d/supervisord.conf. The current configuration sets "startsecs" to 10, which means the thermalctld process must stay running for at least 10 seconds for the start to count as successful. If it exits sooner, supervisord treats the start as a failure even if the exit code is expected, and after too many such failures it gives up and no longer restarts the process.

See the official document for "startsecs" at http://supervisord.org/configuration.html:

startsecs

The total number of seconds which the program needs to stay running after a startup to consider the start successful. If the program does not stay up for this many seconds after it has started, even if it exits with an “expected” exit code (see exitcodes), the startup will be considered a failure. Set to 0 to indicate that the program needn’t stay running for any particular amount of time.
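To make the interaction between "startsecs" and the retry limit concrete, here is a minimal Python sketch of the bookkeeping described above. It is a simplified model, not supervisord's own code: the function name and the returned state strings are illustrative, and the delays supervisord inserts between retries are ignored.

```python
def final_state(run_durations, startsecs, startretries):
    """Simplified model of how a supervisor classifies a series of starts.

    run_durations: seconds each successive start stayed up before exiting.
    Returns the state reached by the last start: "RUNNING", "BACKOFF",
    or "FATAL" (illustrative names).
    """
    failed_starts = 0
    state = "STOPPED"
    for ran_for in run_durations:
        if ran_for >= startsecs:
            # The start counts as successful; the failure counter resets.
            failed_starts = 0
            state = "RUNNING"
        else:
            # Exited too quickly: a failed start, whatever the exit code.
            failed_starts += 1
            if failed_starts > startretries:
                return "FATAL"  # the supervisor gives up
            state = "BACKOFF"
    return state


# A test that kills the daemon about 1 second after each start:
print(final_state([1, 1, 1, 1], startsecs=10, startretries=3))  # FATAL
print(final_state([1, 1, 1, 1], startsecs=0, startretries=3))   # RUNNING
```

In this model a start only stops counting against the retry budget once the process has stayed up for "startsecs" seconds, which is why rapid kill-and-restart cycles exhaust the budget.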

- How I did it

The fix is to change the "startsecs" configuration from 10 to 0

- How to verify it

Manual test

- Which release branch to backport (provide reason below if selected)

201811
201911
202006

- Description for the changelog

jleveque · 2020-10-15T01:14:04Z

@Junchao-Mellanox: Usually we set startsecs=0 for one-shot processes, like scripts, which are expected to exit in under a second. However, thermalctld is a daemon and is meant to run indefinitely. Is this the proper fix? Under what conditions is thermalctld expected to exit quickly?

Junchao-Mellanox · 2020-10-15T01:16:55Z

Hi Joe, there is a test case which loads an invalid thermal control configuration and verifies that the thermal control daemon won't crash. The test restarts the daemon after the check, so it can start and kill the daemon within a very short time.

jleveque · 2020-10-15T01:20:04Z

I see. Thanks for the info. I'm not sure we should change this simply to satisfy a specific test case. It sounds like it might be more proper to modify the test case?

Junchao-Mellanox · 2020-10-15T01:28:57Z

We could add a delay in the test case, like "time.sleep(10)", but I'm not sure that is a good solution. IMO, thermalctld is designed to help protect the system from overheating, so maybe we could set it to always restart. Any suggestions?
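As a rough illustration of the delay option, the sketch below polls until the daemon has been up long enough instead of sleeping for a fixed time. The helper name and the `get_uptime` callable are hypothetical and are not part of any existing test library; how uptime is obtained on the switch is left to the caller.

```python
import time


def wait_for_min_uptime(get_uptime, min_uptime, poll_interval=1.0,
                        timeout=60.0, sleep=time.sleep):
    """Block until get_uptime() reports at least min_uptime seconds.

    get_uptime is a hypothetical callable returning how long the daemon
    has been running, or None if it is not running yet.
    Returns True on success, False if roughly `timeout` seconds pass first.
    """
    waited = 0.0
    while waited <= timeout:
        uptime = get_uptime()
        if uptime is not None and uptime >= min_uptime:
            return True
        sleep(poll_interval)
        waited += poll_interval
    return False
```

A test could call this with `min_uptime` set to the configured "startsecs" value before killing the daemon, so that every start is counted as successful.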

jleveque · 2020-10-15T01:33:44Z

I think this is a valid concern. thermalctld is a critical process. I think it makes sense to be persistent at trying to get it running. I'm not opposed to setting it to restart always. @sujinmkang: Do you see any issues with configuring thermalctld to restart always?

Junchao-Mellanox · 2020-10-15T08:39:18Z

retest broadcom please

liat-grozovik · 2020-10-15T15:13:24Z

retest mellanox please

Junchao-Mellanox · 2020-10-16T00:50:03Z

retest mellanox please

sujinmkang · 2020-10-22T16:56:44Z

@Junchao-Mellanox and @jleveque As we discussed in the email thread, if the thermal control configuration is critical for thermalctld to run, then I think the right behavior is for thermalctld to crash, restart several times, and then stop restarting, rather than restart continuously. If thermalctld can run some minimum checks without the configuration, it should log warning/error messages periodically but keep running without crashing.
@Junchao-Mellanox Can you share the exact test steps and the expected outputs/states?

Junchao-Mellanox · 2020-10-23T01:01:21Z

Hi @sujinmkang, the test case is like this:

1. Copy an invalid configuration file to the switch
2. Kill thermalctld to make it load the invalid one
3. Make sure that thermalctld doesn't crash after reading the invalid configuration
4. Restore the configuration to a normal one
5. Kill thermalctld to make it load the normal one

So thermalctld itself doesn't crash; it is the test case that kills it manually.
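The five steps above can be sketched in Python roughly as follows. This is only an outline under stated assumptions: `dut` is a hypothetical device-under-test object, and its `copy_config`, `kill_thermalctld` and `thermalctld_running` methods are made-up names, not the actual test fixtures.

```python
def check_invalid_config_handling(dut, invalid_cfg, valid_cfg):
    """Run the five steps against a hypothetical device-under-test object.

    dut is assumed to provide copy_config(path), kill_thermalctld()
    and thermalctld_running(); these names are illustrative only.
    """
    try:
        dut.copy_config(invalid_cfg)   # step 1: install the invalid config
        dut.kill_thermalctld()         # step 2: supervisord restarts the daemon
        # step 3: the daemon must survive reading the invalid config
        assert dut.thermalctld_running(), "thermalctld crashed on invalid config"
    finally:
        dut.copy_config(valid_cfg)     # step 4: restore the normal config
        dut.kill_thermalctld()         # step 5: reload with the normal config
```

The two kills in steps 2 and 5 can happen only a few seconds apart, which is how the daemon ends up exiting before it has run for "startsecs" seconds.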

jleveque · 2020-10-23T17:57:51Z

Retest baseimage please

jleveque · 2020-10-23T17:57:59Z

Retest mellanox please

jleveque · 2020-10-23T17:58:21Z

@Junchao-Mellanox: Can you please update the PR title and description to match the new change?

Junchao-Mellanox · 2020-10-26T01:02:55Z

retest mellanox please

keboliu · 2020-10-29T11:03:55Z

@sujinmkang would you please help to review.

keboliu · 2020-10-31T02:04:18Z

@abdosi would you please help to cherry-pick?

…e to restart during regression test (#5633) Increase startretries value from default of 10 to 50 to prevent supervisor from placing thermalctld in FATAL state during regression testing. Also ensures supervisord tries hard to get thermalctld running in production, as thermalctld is critical to prevent device from overheating.
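For illustration, a quick way to check the resulting setting is to parse the program section with Python's standard configparser. The configuration excerpt below is hypothetical: only startretries=50 comes from this change and startsecs=10 from the description above, while the command path and the other options are assumptions.

```python
import configparser

# Hypothetical excerpt; the command path and autostart/autorestart
# values are assumptions made for this example.
SAMPLE_CONF = """
[program:thermalctld]
command=/usr/bin/thermalctld
autostart=false
autorestart=true
startsecs=10
startretries=50
"""


def start_retries(conf_text, program):
    """Return the configured startretries for a program, or None if unset."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    section = "program:" + program
    if not parser.has_section(section):
        return None
    return parser.getint(section, "startretries", fallback=None)


print(start_retries(SAMPLE_CONF, "thermalctld"))  # 50
```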

Fix issue: restart thermalctld too quick cause supervisord never restart it again

6bb58eb

Junchao-Mellanox mentioned this pull request Oct 15, 2020

Fix issue: restart thermalctld too quick cause supervisord never restart it again Junchao-Mellanox/sonic-buildimage#29 (closed)

keboliu added Request for 201911 Branch Bug 🐛 labels Oct 22, 2020

Change startretries to 50 for thermalctld

882a551

Junchao-Mellanox changed the title ~~Fix issue: restart thermalctld too quick cause supervisord never restart it again~~ Fix issue: enlarge startretries value to avoid thermalctld not able to restart during regression test Oct 26, 2020

jleveque approved these changes Oct 26, 2020

jleveque requested a review from sujinmkang October 26, 2020 21:55

jleveque changed the title ~~Fix issue: enlarge startretries value to avoid thermalctld not able to restart during regression test~~ [thermalctld] Enlarge startretries value to avoid thermalctld not able to restart during regression test Oct 30, 2020

jleveque merged commit 781188f into sonic-net:master Oct 30, 2020

abdosi added the Included in 201911 Branch label Nov 3, 2020

Junchao-Mellanox deleted the fix_thermalctld_restart branch December 15, 2020 01:42
