[201811] Check platform reboot cause to see if any reset happened during fast/warm-reboot #8912

Merged 8 commits into sonic-net:201811 on Dec 1, 2021

sujinmkang
Collaborator

Why I did it

To recover syncd and swss from any cold reset during fast/warm-reboot

How I did it

Check the platform reboot cause to detect whether a cold reset happened during fast-reboot power-up.

How to verify it

Manual test

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106

if [[ "$BOOT_TYPE" == "fast" ]] && [[ -d /host/fast-reboot ]]; then
if [[ -f /host/reboot-cause/previous-reboot-cause.json ]]; then
REG_BOOT_TYPE="fast*"
CAUSE_NO_AVAIL="\"N/A\""
Contributor

Please use white space consistently.

if [[ $REBOOT_CAUSE =~ $REG_BOOT_TYPE ]]; then
if [[ "${EXTRA_CAUSE}" != "${CAUSE_NO_AVAIL}" ]]; then
# Delete the FAST_REBOOT|system db setting
$SONIC_DB_CLI STATE_DB DEL "FAST_REBOOT|system" &>/dev/null
Contributor

This delete could be too late: syncd might have read it and proceeded with fast reboot recovery.

Collaborator Author

@yxieca With the platform API approach to determining the hardware reboot cause, it's hard to get the actual hardware reboot cause before syncd or swss starts. I think it's better to use a platform script to detect a hardware reboot. What do you think?

Contributor

The point of this change is to determine, before syncd/swss starts, that the reboot cause is not a fast/warm reboot, so that we don't try to start the system with fast/warm recovery.
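
As a rough illustration of the check being discussed, here is a minimal Python sketch of the decision made in the shell hunk above: a fast reboot was requested, but the recorded reboot cause carries extra hardware information, so fast recovery should be abandoned. The function name `should_cancel_fast_recovery` and the `comment` key used for the extra cause are assumptions for illustration only; the thread shows nothing more than `.cause` being read from previous-reboot-cause.json.

```python
import json
import re

# Pattern copied from the hunk (REG_BOOT_TYPE="fast*"). As a regex it matches
# "fas" followed by zero or more "t" characters anywhere in the string.
FAST_BOOT_PATTERN = re.compile(r"fast*")
CAUSE_NOT_AVAILABLE = "N/A"


def should_cancel_fast_recovery(reboot_cause_json, extra_key="comment"):
    """Return True when the cause says "fast" but extra hardware info is present.

    `extra_key` is a hypothetical field name; the real source of EXTRA_CAUSE
    is not shown in this thread.
    """
    record = json.loads(reboot_cause_json)
    cause = record.get("cause", "")
    extra = record.get(extra_key, CAUSE_NOT_AVAILABLE)
    requested_fast = FAST_BOOT_PATTERN.search(cause) is not None
    return requested_fast and extra != CAUSE_NOT_AVAILABLE
```

Unlike `jq '.cause'`, which prints the JSON string with its surrounding quotes (hence the escaped quotes in CAUSE_NO_AVAIL), `json.loads` returns the bare string, so the sketch compares against plain `N/A`. As the review comments point out, a check like this only helps if it runs before syncd and swss read the FAST_REBOOT flag.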

Contributor

yxieca left a comment

See the inline comments.

    else:
        previous_reboot_cause = software_reboot_cause

    # Current time
    reboot_cause_gen_time = str(datetime.datetime.now().strftime('%Y_%m_%d_%H_%M_%S'))
Contributor

Import Error!

Oct 17 19:26:27 sonic-s6100-01 process-reboot-cause[5168]: Starting up...
Oct 17 19:26:27 sonic-s6100-01 process-reboot-cause[5168]: No reboot cause found from /proc/cmdline
Oct 17 19:26:27 sonic-s6100-01 process-reboot-cause[5168]: No reboot cause found from platform api
Oct 17 19:26:27 sonic-s6100-01 process-reboot-cause[5168]: Reboot cause file /host/reboot-cause/reboot-cause.txt not found
Oct 17 19:26:27 sonic-s6100-01 process-reboot-cause[5168]: Traceback (most recent call last):
Oct 17 19:26:27 sonic-s6100-01 process-reboot-cause[5168]: File "/usr/bin/process-reboot-cause", line 255, in <module>
Oct 17 19:26:27 sonic-s6100-01 process-reboot-cause[5168]: main()
Oct 17 19:26:27 sonic-s6100-01 process-reboot-cause[5168]: File "/usr/bin/process-reboot-cause", line 221, in main
Oct 17 19:26:27 sonic-s6100-01 process-reboot-cause[5168]: reboot_cause_gen_time = str(datetime.datetime.now().strftime('%Y_%m_%d_%H_%M_%S'))
Oct 17 19:26:27 sonic-s6100-01 process-reboot-cause[5168]: NameError: global name 'datetime' is not defined

reboot_cause_dict = get_reboot_cause_dict(previous_reboot_cause, additional_reboot_info, reboot_cause_gen_time)

# Create reboot-cause-#time#.json under history directory
REBOOT_CAUSE_HISTORY_FILE_JSON = os.path.join(REBOOT_CAUSE_HISTORY_DIR, "reboot-cause-{}.json".format(reboot_cause_gen_time))
Contributor

REBOOT_CAUSE_HISTORY_DIR is not defined.

if [[ -f /host/reboot-cause/previous-reboot-cause.json ]]; then
REG_BOOT_TYPE="fast*"
CAUSE_NO_AVAIL="\"N/A\""
REBOOT_CAUSE="$(cat /host/reboot-cause/previous-reboot-cause.json | jq '.cause')"
Contributor

During testing, I observed that process-reboot-cause runs later than swss.

Contributor

determine-reboot-cause is also running later than swss.

Collaborator Author

@santhosh-kt If that is the case, should we go with the script approach to check the reboot cause? I think it's too hard to fix the process start order given the platform API dependency. To help understand the reboot cause, we can keep the determine-reboot-cause changes in this PR.

Contributor

@sujinmkang: I already tested this with a dedicated script (part of the commit https://github.com/Azure/sonic-buildimage/pull/8024/files, _is_software_reboot() in track_reboot_reason.sh) that is called inside the preStartAction() script, and it is able to identify the CPU reset cases.


# Write the previous reboot cause to REBOOT_CAUSE_HISTORY_FILE_JSON as a JSON format
with open(REBOOT_CAUSE_HISTORY_FILE_JSON, "w") as reboot_cause_history_file:
    json.dump(reboot_cause_dict, reboot_cause_history_file)
Contributor

import error for json

previous_reboot_cause = software_reboot_cause

# Current time
reboot_cause_gen_time = str(datetime.datetime.now().strftime('%Y_%m_%d_%H_%M_%S'))
Contributor

import error for 'datetime'
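
The errors reported in this thread (`datetime`, `json`, and `REBOOT_CAUSE_HISTORY_DIR`) all come from names used before they were imported or defined. A minimal Python 3 sketch of the history-file step with those names in place might look like the following. The directory value, the body of `get_reboot_cause_dict`, its field names, and the wrapper `write_reboot_cause_history` are assumptions for illustration; only the helper's name and argument order come from the hunks under review.

```python
import datetime
import json
import os
import tempfile

# Hypothetical location for this sketch; the real history directory is not
# shown in this thread.
REBOOT_CAUSE_HISTORY_DIR = os.path.join(tempfile.gettempdir(), "reboot-cause-history-sketch")


def get_reboot_cause_dict(previous_reboot_cause, additional_reboot_info, reboot_cause_gen_time):
    """Stand-in for the PR's helper of the same name; field names are assumed."""
    return {
        "cause": previous_reboot_cause,
        "comment": additional_reboot_info,
        "gen_time": reboot_cause_gen_time,
    }


def write_reboot_cause_history(previous_reboot_cause, additional_reboot_info="N/A"):
    """Write one timestamped JSON record and return the path of the new file."""
    # Current time, formatted as in the hunk under review
    reboot_cause_gen_time = datetime.datetime.now().strftime('%Y_%m_%d_%H_%M_%S')
    reboot_cause_dict = get_reboot_cause_dict(
        previous_reboot_cause, additional_reboot_info, reboot_cause_gen_time)

    # Make sure the directory exists before opening a file inside it
    os.makedirs(REBOOT_CAUSE_HISTORY_DIR, exist_ok=True)
    history_file = os.path.join(
        REBOOT_CAUSE_HISTORY_DIR, "reboot-cause-{}.json".format(reboot_cause_gen_time))
    with open(history_file, "w") as reboot_cause_history_file:
        json.dump(reboot_cause_dict, reboot_cause_history_file)
    return history_file
```

The traceback wording ("global name 'datetime' is not defined") suggests the script on this branch runs under Python 2, where `os.makedirs` has no `exist_ok` argument, so the directory creation would need a different guard there.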

@santhosh-kt
Contributor

  1. Unlike the "process-reboot-cause" script, the "determine-reboot-cause" script and "determine-reboot-cause.service" are not copied to the fsroot: the cp commands are missing from files/build_templates/sonic_debian_extension.j2, and the service enable command is missing from the same j2 file.
  2. There is still a race condition between postAction in database.sh and the determine-reboot-cause script.
```
Nov 01 11:50:54 sonic mcelog[531]: failed to prefill DIMM database from DMI data
Nov 01 11:51:15 sonic rc.local[538]: (Reading database ... 25874 files and directories currently installed.)
Nov 01 11:51:23 sonic database.sh[1830]: Creating new database container
Nov 01 11:51:23 sonic database.sh[1830]: e81575b859fb0f3d6d0a94881c51698b6daf59c0edb18e297fc79d4d58078086
Nov 01 11:51:23 sonic database.sh[1830]: database
Nov 01 11:51:24 sonic database.sh[1830]: Could not connect to Redis at /var/run/redis/redis.sock: No such file or directory
Nov 01 11:51:25 sonic database.sh[1830]: database BOOT_TYPE = fast <--- debug log
Nov 01 11:51:25 sonic database.sh[1830]: /usr/bin/database.sh: line 60: jq: command not found
Nov 01 11:51:25 sonic database.sh[1830]: cat: write error: Broken pipe   <--- /host/reboot-cause directory was not created at that time.
Nov 01 11:51:25 sonic database.sh[1830]: /usr/bin/database.sh: line 61: jq: command not found
Nov 01 11:51:25 sonic root[2062]: WARMBOOT_FINALIZER : Wait for database to become ready...
Nov 01 11:51:25 sonic database.sh[1830]: cat: write error: Broken pipe
Nov 01 11:51:25 sonic database.sh[1830]: database REBOOT_CAUSE = <--- debug log(No root cause due to folder missing)
Nov 01 11:51:25 sonic database.sh[1830]: OK
Nov 01 11:51:36 sonic-s6100-01 root[3156]: Flushing APP, ASIC, COUNTER, CONFIG, and partial STATE databases ...
root@sonic-s6100-01:/host#
```

determine-reboot-cause logs:
```
root@sonic-s6100-01:/host# journalctl -a | grep determine
Nov 01 11:51:21 sonic determine-reboot-cause[1837]: Starting up...
Nov 01 11:51:21 sonic determine-reboot-cause[1837]: No reboot cause found from /proc/cmdline
Nov 01 11:51:21 sonic determine-reboot-cause[1837]: No reboot cause found from platform api
Nov 01 11:51:21 sonic determine-reboot-cause[1837]: Reboot cause file /host/reboot-cause/reboot-cause.txt not found
root@sonic-s6100-01:/host#
```
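
To make the ordering problem in item 2 concrete: a reader that depends on a file produced by another service can poll for it with a bounded timeout instead of assuming it already exists. The sketch below shows that general pattern only. It is not what this PR implements, and the function name `wait_for_file` and its default values are hypothetical.

```python
import os
import time


def wait_for_file(path, timeout_s=10.0, poll_interval_s=0.2):
    """Poll until `path` exists as a regular file or the timeout elapses.

    Returns True if the file appeared in time, False otherwise.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if os.path.isfile(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)
```

A caller would treat a False result as "reboot cause unknown" and choose a safe default. Polling adds boot delay, which matters for fast-reboot, so declaring an explicit start order between the services may be preferable where that is possible.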

@santhosh-kt
Contributor

Tested the changes. The database docker is able to read the 2.0 API reboot cause clearly.

Under fast-boot:

Nov 22 14:56:03 sonic database.sh[1838]: Creating new database container
Nov 22 14:56:03 sonic database.sh[1838]: 7272ae05a317f2e989d3fb4512f7e9bf1faf6ccd738861b850bdd3574322c40b
Nov 22 14:56:03 sonic database.sh[1838]: database preStartAction() BOOT_TYPE = fast
Nov 22 14:56:03 sonic database.sh[1838]: database
Nov 22 14:56:03 sonic database.sh[1838]: Could not connect to Redis at /var/run/redis/redis.sock: No such file or directory
Nov 22 14:56:04 sonic database.sh[1838]: database postStartAction() BOOT_TYPE = fast
Nov 22 14:56:05 sonic database.sh[1838]: OK
Nov 22 14:56:05 sonic database.sh[1838]: OK
root@sonic-s6100-01:~#

Under fast-boot that went through a CPU cold reset:

Nov 22 14:20:40 sonic mcelog[495]: failed to prefill DIMM database from DMI data
Nov 22 14:21:01 sonic rc.local[544]: (Reading database ... 25874 files and directories currently installed.)
Nov 22 14:21:09 sonic database.sh[1840]: Creating new database container
Nov 22 14:21:09 sonic database.sh[1840]: 3528792659c85e7b28ca786b221d2245ab5b78b151b441ee8d0c73655706f90c
Nov 22 14:21:09 sonic database.sh[1840]: database preStartAction() BOOT_TYPE = cold
Nov 22 14:21:09 sonic database.sh[1840]: database
Nov 22 14:21:09 sonic database.sh[1840]: Could not connect to Redis at /var/run/redis/redis.sock: No such file or directory
Nov 22 14:21:11 sonic database.sh[1840]: OK
Nov 22 14:21:11 sonic root[2070]: WARMBOOT_FINALIZER : Wait for database to become ready...
Nov 22 14:21:22 sonic-s6100-01 root[3206]: Flushing APP, ASIC, COUNTER, CONFIG, and partial STATE databases ...
root@sonic-s6100-01:~#

sujinmkang requested a review from yxieca November 22, 2021 21:31
sujinmkang merged commit a80319e into sonic-net:201811 Dec 1, 2021
yxieca added a commit that referenced this pull request Feb 24, 2022
…ened during fast/warm-reboot (#8912)"

This reverts commit a80319e.
yxieca added a commit that referenced this pull request Feb 24, 2022
…ened during fast/warm-reboot (#8912)" (#10076)

This reverts commit a80319e.