[config] Call 'systemctl reset-failed' before 'systemctl restart' when restarting services #607

jleveque · 2019-08-14T19:07:43Z

This ensures that the process will get restarted, even if systemd has previously placed the service in the "failed" state, as systemctl reset-failed <service> will take the service out of the "failed" state.

Also, no longer print the command names as they are run; instead print a message stating that we are restarting each service.

The commands that currently call this function are config load_minigraph and config reload.

This should address sonic-net/sonic-buildimage#3244
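
A minimal sketch of the ordering described above, assuming a hypothetical helper named restart_services and an injectable run callable (neither is taken from config/main.py). It only illustrates that each service has its "failed" state cleared immediately before it is restarted:

```python
import subprocess


def restart_services(services, run=subprocess.check_call):
    """Reset the failed state of each service, then restart it.

    `run` is a hypothetical injection point so the command sequence can be
    inspected without invoking systemctl; by default commands are executed.
    """
    for service in services:
        # Print a friendly message instead of echoing the raw commands
        print("Restarting {} ...".format(service))
        # Clear a possible "failed" state so the restart is not refused
        run(["systemctl", "reset-failed", service])
        run(["systemctl", "restart", service])
```

Passing a list's append method as run records the commands instead of executing them, which is convenient for a dry run.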


nikos-github · 2019-08-14T20:13:06Z

config/main.py

+            # We first run "systemctl reset-failed" to ensure that we
+            # can restart the service, even if it has entered a failed state
+            click.echo("Restarting {} ...".format(service))
+            run_command("systemctl reset-failed {}".format(service))


I assume the exit code returned by reset-failed when the service is not in a failed state doesn't trigger the exception. Looks good to me otherwise.

Calling systemctl reset-failed on a service not in a failed state will still return 0. There should be no issues.

Great. Thank you for fixing this.

jleveque · 2019-08-14T22:27:02Z

Retest this please

jleveque · 2019-08-14T23:47:16Z

I'm now contemplating changing this as follows:

Instead of calling systemctl reset-failed <service> on each service in the list, just call systemctl reset-failed once to clear the failed state of ALL failed services. This will also ensure all dependent services are removed from the failed state.

Thoughts, anyone?

Edit: I made the change in commit #ab28455

avi-milner · 2019-08-15T15:00:11Z

@jleveque I have tested this and it seems to fix the problem: I performed multiple config load_minigraph runs without seeing any issue.


nikos-github · 2019-08-15T15:13:29Z

config/main.py

+    # We first run "systemctl reset-failed" to remove the "failed"
+    # status from all services before we attempt to restart them
+    click.echo("Resetting all failed services ...")
+    run_command("systemctl reset-failed")


The concern I have is that this could possibly reset more services than the ones on the list we are meant to handle. Not sure if this is correct or desirable.

So maybe change the approach to reset the counter per service?

Yes I believe that approach would be better.

jleveque · 2019-08-15T18:11:22Z

Retest this please

avi-milner · 2019-08-15T19:49:41Z

Retest this please
Hi, the reviewer is asking you to use your first commit, which resets the counter per service rather than globally.
I didn't see that you reverted the second commit.

jleveque · 2019-08-15T20:22:41Z

@avi-milner: Are you asking that I revert back to the original solution? What about dependent services? What if they (syncd, teamd, etc.) are in the failed state? I don't believe they will get restarted.

nikos-github · 2019-08-15T20:59:31Z

@avi-milner: Are you asking that I revert back to the original solution? What about dependent services? What if they (syncd, teamd, etc.) are in the failed state? I don't believe they will get restarted.

Are syncd and teamd controlled by systemd? Either way, if there are dependent services, then we need to have a list of them and reset-failed those as well if we feel we should. A blind reset-failed for all services inside a sonic device is not the right thing to do.

jleveque · 2019-08-15T21:06:24Z

@nikos-github: Yes. All Docker containers are controlled by their own service. Multiple services are dependent upon SwSS (syncd, teamd, dhcp_relay, radv, snmp), so when SwSS restarts it will also restart its dependent services.

@avi-milner: What do you see wrong with removing all services from the failed list upon reloading config? Reloading config is pretty much like wiping the slate clean. Why not also reset the failed status of all services at that time?

avi-milner · 2019-08-15T22:05:42Z

I was just referring you to Nikos's change request.

jleveque · 2019-08-15T22:08:17Z

I see. So you're good with this change?

avi-milner · 2019-08-15T22:09:03Z

Yes.

nikos-github · 2019-08-15T23:19:34Z

@jleveque As I mentioned earlier, this is not the right approach. Resetting all failed services in the sonic device through the use of systemctl reset-failed is not correct. The services need to be explicitly specified. Not all services are under sonic's control in the switch and therefore no such assumption should be made.

jleveque · 2019-08-15T23:32:40Z

Sorry for the confusion earlier. I seem to have gotten your responses mixed up.

@nikos-github: So you believe resetting all failed services upon reloading the configuration is not an assumption we can make? Like I said earlier, reloading configuration pretty much "wipes the slate clean"; at which point we would want to try to restart all services, even those which may have failed (which may have been due to bad configuration). Can you provide an example of a service which you would not want to remove from the failed list when reloading configuration?

nikos-github · 2019-08-15T23:56:11Z

@jleveque This really depends on how the user has configured their Linux system. We should try to restart all failed sonic services, or at least the ones that are controlled by or dependent on config reload, not all failed services in the system in general. When sonic configuration is reloaded, it should confine itself to sonic services. systemctl reset-failed will reset ALL failed services in the system, whether they are sonic related or not.

jleveque · 2019-08-16T05:41:56Z

@nikos-github: Please review my latest iteration. I now obtain a list of failed services and only reset each one if it is a service we will be attempting to restart (or a dependency thereof) by explicitly checking against a list.
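
The filtering idea can be sketched roughly as follows. This is not the code from the PR: the function name select_units_to_reset is hypothetical, and the input is assumed to be text shaped like the output of systemctl --failed --no-legend, with the unit name in the first column (optionally preceded by a bullet marker).

```python
def select_units_to_reset(failed_output, managed_services):
    """Return the failed units that belong to the managed set.

    failed_output: text assumed to list one failed unit per line, unit
                   name first (an assumption about the listing format).
    managed_services: names of services we intend to restart, plus their
                      dependents, with or without the ".service" suffix.
    """
    managed = {
        name if name.endswith(".service") else name + ".service"
        for name in managed_services
    }
    selected = []
    for line in failed_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        # Some systemd versions prefix failed units with a bullet marker
        if fields[0] in ("\u25cf", "*") and len(fields) > 1:
            unit = fields[1]
        else:
            unit = fields[0]
        # Only units we manage are reset; anything else is left untouched
        if unit in managed:
            selected.append(unit)
    return selected
```

Each returned unit could then be passed to systemctl reset-failed individually, so unrelated failed services on the system keep their state.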

jleveque · 2019-08-16T23:45:36Z

Retest this please

jleveque · 2019-08-16T23:47:08Z

@avi-milner: Can you test the current iteration of this PR to ensure it still fixes your problem?

jleveque · 2019-08-19T17:06:33Z

Retest this please

jleveque · 2019-08-19T23:21:00Z

Retest this please

jleveque · 2019-08-20T01:15:32Z

Retest this please

avi-milner · 2019-08-20T12:25:48Z

@avi-milner: Can you test the current iteration of this PR to ensure it still fixes your problem?

it works fine for me

avi-milner

Hi @jleveque,
It seems that the fix is not working as we expected.
When we call config reload / config load / config load_minigraph, this code only resets the failed counter for services that are already in the failed state. The config load scenarios still fail when they are run 3 times within the 20-minute time window.

Can you please change it to always reset the failed counter for the config load commands?
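
To illustrate the concern, here is a toy model of a per-service start limit. It is not systemd's implementation; the class name StartLimiter is hypothetical, and the burst of 3 within a 20-minute window is taken from the numbers in the comment above. The point is that the counter grows with every accepted start, so a reset that is skipped for services not currently in the failed state does not prevent the limit from being hit:

```python
class StartLimiter:
    """Toy model of a start rate limit: at most `burst` starts per `interval`."""

    def __init__(self, burst=3, interval=1200.0):
        self.burst = burst          # allowed starts inside one window
        self.interval = interval    # window length in seconds (20 minutes)
        self.starts = []            # timestamps of accepted starts

    def try_start(self, now):
        """Return True if a start at time `now` is accepted."""
        # Drop starts that have fallen out of the window
        self.starts = [t for t in self.starts if now - t < self.interval]
        if len(self.starts) >= self.burst:
            return False
        self.starts.append(now)
        return True

    def reset(self):
        """Clear the counter, as an unconditional reset would."""
        self.starts = []
```

In this model a fourth start inside the window is refused unless reset() is called first, regardless of whether the service ever failed.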

avi-milner · 2019-08-25T11:44:04Z

config/main.py

@@ -454,6 +494,9 @@ def load_minigraph():
    if os.path.isfile(db_migrator) and os.access(db_migrator, os.X_OK):
        run_command(db_migrator + ' -o set_version')

+    # We first run "systemctl reset-failed" to remove the "failed"
+    # status from all services before we attempt to restart them
+    _reset_failed_services()


This would only reset services that are already in the failed state.

avi-milner · 2019-08-25T11:44:15Z

config/main.py

@@ -398,6 +435,9 @@ def reload(filename, yes, load_sysinfo):
    if os.path.isfile(db_migrator) and os.access(db_migrator, os.X_OK):
        run_command(db_migrator + ' -o migrate')

+    # We first run "systemctl reset-failed" to remove the "failed"
+    # status from all services before we attempt to restart them
+    _reset_failed_services()


This would only reset services that are already in the failed state.

[config] Call 'systemctl reset-failed' before 'systemctl restart' when restarting services

caa1d36

jleveque added the Enhancement label Aug 14, 2019

jleveque self-assigned this Aug 14, 2019

jleveque mentioned this pull request Aug 14, 2019

Multiple restart of swss during config load fails to start swss sonic-net/sonic-buildimage#3244

Closed

jleveque added Request for 201811 Branch Request for 201911 Branch labels Aug 14, 2019

nikos-github approved these changes Aug 14, 2019


Call 'systemctl reset-failed' globally instead of per-process

ab28455

avi-milner approved these changes Aug 15, 2019


nikos-github suggested changes Aug 15, 2019


Only reset a failed service if it is one of the services we will restart

35b461f

nikos-github approved these changes Aug 16, 2019


jleveque merged commit abd1dbc into sonic-net:master Aug 20, 2019

jleveque deleted the reset_failed_before_restart branch August 20, 2019 18:00

avi-milner reviewed Aug 25, 2019


avi-milner mentioned this pull request Aug 25, 2019

Multiple restart of swss during config load fails to start swss still after fix for that #616

Closed

yxieca pushed a commit that referenced this pull request Aug 26, 2019

[config] Call 'systemctl reset-failed' before 'systemctl restart' when restarting services (#607)

4f72e14

yxieca added the Included in 201811 Branch label Aug 29, 2019

abdosi added the Included in 201911 Branch label Feb 4, 2020

vadymhlushko-mlnx mentioned this pull request Jan 23, 2023

[submodule] Advance sonic-utilities pointer sonic-net/sonic-buildimage#13482

Closed

8 tasks
