-
Notifications
You must be signed in to change notification settings - Fork 689
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[config] Call 'systemctl reset-failed' before 'systemctl restart' when restarting services #607
[config] Call 'systemctl reset-failed' before 'systemctl restart' when restarting services #607
Conversation
…n restarting services
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume the error code returned by reset-failed if the services is not in failed state, doesn't
config/main.py
Outdated
# We first run "systemctl reset-failed" to ensure that we | ||
# can restart the service, even if it has entered a failed state | ||
click.echo("Restarting {} ...".format(service)) | ||
run_command("systemctl reset-failed {}".format(service)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume the error code returned by reset-failed if the services is not in failed state, doesn't trigger the exception. Looks good to me otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling systemctl reset-failed
on a service not in a failed state will still return 0. There should be no issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great. Thank you for fixing this.
Retest this please |
I'm now contemplating changing this as follows:
Thoughts, anyone? Edit: I made the change in commit #ab28455 |
@jleveque i have tested this and it seems to fix the problem: peformed multiple config load_minigraph without any isse seen |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i have tested this and it seems to fix the problem: peformed multiple config load_minigraph without any isse seen
config/main.py
Outdated
# We first run "systemctl reset-failed" to remove the "failed" | ||
# status from all services before we attempt to restart them | ||
click.echo("Resetting all failed services ...") | ||
run_command("systemctl reset-failed") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The concern I have is that this could possibly reset more services than the ones on the list we are meant to handle. Not sure if this is correct or desirable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so maybe change approach to reset counter per service ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I believe that approach would be better.
Retest this please |
|
@avi-milner: Are you asking that I revert back to the original solution? What about dependent services? What if they (synd, teamd, etc.) are in the failed state? I don't believe they will get restarted. |
Are syncd and teamd systemd controlled? In either case, if there are dependent services, then we need to have a list of them and reset-failed those as well if we feel we should. A blind reset-failed for all services inside a sonic device, is not the right thing to do. |
@nikos-github: Yes. All Docker containers are controlled by their own service. Multiple services are dependent upon SwSS (syncd, teamd, dhcp_relay, radv, snmp), so when SwSS restarts it will also restart its dependent services. @avi-milner: What do you see wrong with removing all services from the failed list upon reloading config? Reloading config is pretty much like wiping the slate clean. Why not also reset the failed status of all services at that time? |
i was just referring you to nikos change request
נשלח מ-Workspace ONE Boxer
בתאריך 15 באוג׳ 2019 23:59, Nikos <notifications@github.com> כתב:
External Email
…________________________________
@avi-milner<https://github.com/avi-milner>: Are you asking that I revert back to the original solution? What about dependent services? What if they (synd, teamd, etc.) are in the failed state? I don't believe they will get restarted.
Are syncd and teamd systemd controlled?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<#607>, or mute the thread<https://github.com/notifications/unsubscribe-auth/AM434QVRCAYQFCS2IA6VRXTQEW7TRANCNFSM4ILYAFTQ>.
|
I see. So you're good with this change? |
yes
נשלח מ-Workspace ONE Boxer
בתאריך 16 באוג׳ 2019 01:08, Joe LeVeque <notifications@github.com> כתב:
External Email
…________________________________
I see. So you're good with this change?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<#607>, or mute the thread<https://github.com/notifications/unsubscribe-auth/AM434QUS63GCV3ZBRPLOJPLQEXHVLANCNFSM4ILYAFTQ>.
|
@jleveque As I mentioned earlier, this is not the right approach. Resetting all failed services in the sonic device through the use of |
Sorry for the confusion earlier. I seem to have gotten your responses mixed up. @nikos-github: So you believe resetting all failed services upon reloading the configuration is not an assumption we can make? Like I said earlier, reloading configuration pretty much "wipes the slate clean"; at which point we would want to try to restart all services, even those which may have failed (which may have been due to bad configuration). Can you provide an example of a service which you would not want to remove from the failed list when reloading configuration? |
@jleveque This really depends on how the user has configured their Linux system. We should try to restart all failed sonic services or at least the ones controlled or are dependent by config reload. Not |
@nikos-github: Please review my latest iteration. I now obtain a list of failed services and only restart each one if it is a service we will be attempting to restart (or a dependency thereof) by explicitly checking against a list. |
Retest this please |
@avi-milner: Can you test the current iteration of this PR to ensure it still fixes your problem? |
Retest this please |
Retest this please |
1 similar comment
Retest this please |
it works fine for me |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hi @jleveque ,
it seems that the fix is not working as we expected,
when we call to config_reload / config_load/ config_load_minigraph
from this code you only reset failed counter for services that are already in failed state, this still causes the config load scenarios to fail after running them within the time window of 20 minutes, 3 times
can you please fix to always reset failed counter for config load commands ?
@@ -454,6 +494,9 @@ def load_minigraph(): | |||
if os.path.isfile(db_migrator) and os.access(db_migrator, os.X_OK): | |||
run_command(db_migrator + ' -o set_version') | |||
|
|||
# We first run "systemctl reset-failed" to remove the "failed" | |||
# status from all services before we attempt to restart them | |||
_reset_failed_services() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this would only remove services that are already in failed status
@@ -398,6 +435,9 @@ def reload(filename, yes, load_sysinfo): | |||
if os.path.isfile(db_migrator) and os.access(db_migrator, os.X_OK): | |||
run_command(db_migrator + ' -o migrate') | |||
|
|||
# We first run "systemctl reset-failed" to remove the "failed" | |||
# status from all services before we attempt to restart them | |||
_reset_failed_services() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this would only remove services that are already in failed status
This ensures that the process will get restarted, even if systemd has previously placed the service in the "failed" state, as
systemctl reset-failed <service>
will take the service out of the "failed" state.Also, no longer print the command names as they are run; instead print a message stating that we are restarting each service.
The commands which currently call this function are
config load_minigraph
andconfig reload
This should address sonic-net/sonic-buildimage#3244