[201811] Check platform reboot cause to see if any reset happened during fast/warm-reboot #8912
if [[ "$BOOT_TYPE" == "fast" ]] && [[ -d /host/fast-reboot ]]; then
if [[ -f /host/reboot-cause/previous-reboot-cause.json ]]; then
REG_BOOT_TYPE="fast*"
CAUSE_NO_AVAIL="\"N/A\""
Please use white space consistently.
if [[ $REBOOT_CAUSE =~ $REG_BOOT_TYPE ]]; then
if [[ "${EXTRA_CAUSE}" != "${CAUSE_NO_AVAIL}" ]]; then
# Delete the FAST_REBOOT|system db setting
$SONIC_DB_CLI STATE_DB DEL "FAST_REBOOT|system" &>/dev/null
This delete could be too late: syncd might have read it and proceeded with fast reboot recovery.
@yxieca With the platform api approach to determine the hardware reboot-cause, it's hard to get the actual hardware reboot-cause before syncd or swss starts. I think it's better to use a platform script to determine the hardware reboot. What do you think?
The point of this change is to determine, before syncd/swss starts, that the reboot cause is not a fast/warm reboot, so that we don't try to start the system with fast/warm recovery.
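For illustration, the check under discussion could be sketched in Python roughly as below. This is not the code in the PR: it assumes the previous-reboot-cause JSON carries a `cause` field (as read by `jq '.cause'` in the diff) plus a second field with the hardware cause, and the key name `extra_cause` and the function name are hypothetical. A plain substring test on "fast" stands in for the shell regex match.

```python
import json

# Placeholder value meaning "no hardware reboot cause was reported".
CAUSE_NOT_AVAILABLE = "N/A"


def cold_reset_during_fast_reboot(reboot_cause_json, extra_cause_key="extra_cause"):
    """Return True when a fast reboot was requested but hardware reports a reset.

    reboot_cause_json is the text of the previous reboot cause file.
    extra_cause_key is a hypothetical key name for the hardware cause.
    """
    record = json.loads(reboot_cause_json)
    cause = str(record.get("cause", ""))
    extra_cause = record.get(extra_cause_key, CAUSE_NOT_AVAILABLE)
    # Fast reboot recorded in software, yet the platform saw something else.
    return "fast" in cause and extra_cause != CAUSE_NOT_AVAILABLE
```

When this returns True, the caller would clear the fast-reboot marker so that syncd/swss come up with a cold start, and that has to happen before those services read the marker.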
As comments.
else:
    previous_reboot_cause = software_reboot_cause

# Current time
reboot_cause_gen_time = str(datetime.datetime.now().strftime('%Y_%m_%d_%H_%M_%S'))
Import Error!
Oct 17 19:26:27 sonic-s6100-01 process-reboot-cause[5168]: Starting up...
Oct 17 19:26:27 sonic-s6100-01 process-reboot-cause[5168]: No reboot cause found from /proc/cmdline
Oct 17 19:26:27 sonic-s6100-01 process-reboot-cause[5168]: No reboot cause found from platform api
Oct 17 19:26:27 sonic-s6100-01 process-reboot-cause[5168]: Reboot cause file /host/reboot-cause/reboot-cause.txt not found
Oct 17 19:26:27 sonic-s6100-01 process-reboot-cause[5168]: Traceback (most recent call last):
Oct 17 19:26:27 sonic-s6100-01 process-reboot-cause[5168]: File "/usr/bin/process-reboot-cause", line 255, in <module>
Oct 17 19:26:27 sonic-s6100-01 process-reboot-cause[5168]: main()
Oct 17 19:26:27 sonic-s6100-01 process-reboot-cause[5168]: File "/usr/bin/process-reboot-cause", line 221, in main
Oct 17 19:26:27 sonic-s6100-01 process-reboot-cause[5168]: reboot_cause_gen_time = str(datetime.datetime.now().strftime('%Y_%m_%d_%H_%M_%S'))
Oct 17 19:26:27 sonic-s6100-01 process-reboot-cause[5168]: NameError: global name 'datetime' is not defined
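The NameError above goes away once the module is imported at the top of the script. A minimal sketch of the timestamp step with the import in place (the helper name `format_reboot_cause_gen_time` is hypothetical and only wraps the line from the diff so it can be exercised on its own):

```python
import datetime


def format_reboot_cause_gen_time(now=None):
    """Return the timestamp string used in the reboot cause file name.

    now is optional so that a fixed datetime can be passed in for testing;
    by default the current local time is used, as in the diff.
    """
    if now is None:
        now = datetime.datetime.now()
    # strftime already returns a str, so no extra str() call is needed.
    return now.strftime('%Y_%m_%d_%H_%M_%S')
```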
reboot_cause_dict = get_reboot_cause_dict(previous_reboot_cause, additional_reboot_info, reboot_cause_gen_time)

# Create reboot-cause-#time#.json under history directory
REBOOT_CAUSE_HISTORY_FILE_JSON = os.path.join(REBOOT_CAUSE_HISTORY_DIR, "reboot-cause-{}.json".format(reboot_cause_gen_time))
REBOOT_CAUSE_HISTORY_DIR is not defined.
if [[ -f /host/reboot-cause/previous-reboot-cause.json ]]; then
REG_BOOT_TYPE="fast*"
CAUSE_NO_AVAIL="\"N/A\""
REBOOT_CAUSE="$(cat /host/reboot-cause/previous-reboot-cause.json | jq '.cause')"
In testing, I observed that process-reboot-cause runs later than swss.
determine-reboot-cause is also running later than swss.
@santhosh-kt If that is the case, should we go with the script approach to check the reboot-cause? I think it's too hard to fix the process starting order with platform api dependency. To understand the reboot-cause, we can keep the determine-reboot-cause changes with this PR.
@sujinmkang: I have already tested this with a dedicated script (_is_software_reboot() in track_reboot_reason.sh, part of the commit at https://github.com/Azure/sonic-buildimage/pull/8024/files). It is called inside the preStartAction() script and is able to identify the CPU reset cases.
# Write the previous reboot cause to REBOOT_CAUSE_HISTORY_FILE_JSON as a JSON format
with open(REBOOT_CAUSE_HISTORY_FILE_JSON, "w") as reboot_cause_history_file:
    json.dump(reboot_cause_dict, reboot_cause_history_file)
import error for json
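A sketch of the history-file write with the missing pieces filled in: `json` and `os` are imported, and the history directory is passed in explicitly instead of relying on an undefined REBOOT_CAUSE_HISTORY_DIR. The function name and the directory handling are assumptions made for illustration, and the sketch is written for Python 3 (`exist_ok` is not available in Python 2).

```python
import json
import os


def write_reboot_cause_history(history_dir, reboot_cause_dict, reboot_cause_gen_time):
    """Write reboot_cause_dict to reboot-cause-<time>.json under history_dir.

    Returns the path of the file that was written.
    """
    # Create the history directory on first use (Python 3 only).
    os.makedirs(history_dir, exist_ok=True)
    history_file = os.path.join(
        history_dir, "reboot-cause-{}.json".format(reboot_cause_gen_time))
    with open(history_file, "w") as reboot_cause_history_file:
        json.dump(reboot_cause_dict, reboot_cause_history_file)
    return history_file
```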
previous_reboot_cause = software_reboot_cause

# Current time
reboot_cause_gen_time = str(datetime.datetime.now().strftime('%Y_%m_%d_%H_%M_%S'))
import error for 'datetime'
…ause file which is output of determine-reboot-cause
…he reboot cause is available before docker starts
Tested the changes. The database docker is able to read the 2.0 API reboot cause correctly. Under fast-boot:
Under fast-boot that went through a CPU cold reset:
[201811] Check platform reboot cause to see if any reset happened during fast/warm-reboot
Why I did it
To recover syncd and swss from any cold reset during fast/warm-reboot
How I did it
Check platform reboot-cause to see if any cold reset happens for fast-reboot power up
How to verify it
Manual test
Which release branch to backport (provide reason below if selected)
Description for the changelog