Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check platform reboot cause to see if any reset happened during fast/warm-reboot #7920

Closed
wants to merge 8 commits into from

Conversation

sujinmkang
Copy link
Collaborator

Why I did it

Check platform reboot cause to see if any reset happened during fast/warm-reboot
While starting dockers with boot_type, it's only checking the /proc/cmdline.
If any unexpected Platform reset happens during fast/warm-reboot, it should be detected as platform reboot
and "FAST_REBOOT|system" should be set in that case.

How I did it

Add the platform reboot cause parsing when determining the boot_type

How to verify it

Check the logic with reboot
Convert the device with fast-reload and see if the boot_type is detected correctly

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012

Description for the changelog

A picture of a cute animal (not mandatory but encouraged)

@sujinmkang sujinmkang requested a review from lguohan as a code owner June 19, 2021 08:40
@jleveque
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@jleveque
Copy link
Contributor

/azp run

function isPlatformReset()
{
output=$(python3 -c "import sonic_platform.platform; p = sonic_platform.platform.Platform(); c = p.get_chassis(); hw_rc_major, hw_rc_minor = c.get_reboot_cause(); print(hw_rc_major)" 2>/dev/null)
echo "${output}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should echo more context, rather than just the reboot cause?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sujinmkang while testing in 201811 image, This line(66) causing changes in getBootType() function.
(ie) BOOT_TYPE = "reboot cause" value instead of "cold", "warm", "fastfast", etc

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing this line (echo "${output}") resolved the issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I commented about it yesterday but it said pending status.

Comment on lines 67 to 70
if [[ "${output}" != "Non-Hardware" ]]; then
return 0
else
return 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a comment here to explain this logic?

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@sujinmkang
Copy link
Collaborator Author

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@santhosh-kt
Copy link
Contributor

Uploaded UT test logs in #8024 for your reference.
(OS9 to SONIC - cold reset.txt)
(OS9 to SONIC - fastboot.txt)

@@ -60,6 +60,16 @@ function getMountPoint()
echo $1 | python -c "import sys, json, os; mnts = [x for x in json.load(sys.stdin)[0]['Mounts'] if x['Destination'] == '/usr/share/sonic/hwsku']; print '' if len(mnts) == 0 else os.path.abspath(mnts[0]['Source'])" 2>/dev/null
}

function isPlatformReset()
{
output=$(python3 -c "import sonic_platform.platform; p = sonic_platform.platform.Platform(); c = p.get_chassis(); hw_rc_major, hw_rc_minor = c.get_reboot_cause(); print(hw_rc_major)" 2>/dev/null)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sujinmkang : On checking the logs double time, found out one of the process(database.sh) is mentioning it as cold reset despite fast-boot happened. I suspect this is because either determine-reboot-cause or process-reboot-cause function is not up until that time?
Will there be any impact if database is still considering the BOOT_TYPE as cold?

sonic database.sh[680]: BOOT_TYPE  we received cold
sonic lldp.sh[2314]: BOOT_TYPE  we received fast
sonic pmon.sh[2308]: BOOT_TYPE  we received fast
sonic-s6100-01 bgp.sh[2886]: BOOT_TYPE  we received fast
sonic-s6100-01 swss.sh[2887]: BOOT_TYPE  we received fast
sonic-s6100-01 teamd.sh[3655]: BOOT_TYPE  we received fast
sonic-s6100-01 snmp.sh[3658]: BOOT_TYPE  we received fast
sonic-s6100-01 radv.sh[3651]: BOOT_TYPE  we received fast
sonic-s6100-01 syncd.sh[3656]: BOOT_TYPE  we received fast
sonic-s6100-01 dhcp_relay.sh[4441]: BOOT_TYPE  we received fast

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@santhosh-kt we're not using determine-reboot-cause or process-reboot-cause for this determination but only using platform api. Probably does the database always consider the boot_type as cold even without this change? Since we're sending the same boot_type to all container starting script. Right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry! I misunderstood. Except database.sh script all other daemon scripts are called lately if I understood.

So all those scripts(except database.sh) could have easily interpreted the boot type since get_reboot_cause() is showing correct value as track_reboot_reason.sh script got executed before these scripts.

But for database.sh, this get_reboot_cause() is showing wrong value as it runs even before s6100_platform.sh itself.
My question is, will wrong BOOT_TYPE has any impact in database.sh script?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@santhosh-kt can we detect the platform driver got initialized or not? Or should we check the exception from platform api call to see the platform driver initialization status? Any other suggestion?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@santhosh-kt I checked the database container start code, I don't think the boot_type is be used for database container starts.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yxieca I don't think the database container is starting differently based on the boot-type. do you have any comments or concern?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Database container start code has a special handling during warm reboot: https://github.com/Azure/sonic-buildimage/blob/master/files/build_templates/docker_image_ctl.j2#L87

Add additional reboot cause information to track the unexpected reboot cases
@sujinmkang sujinmkang closed this Sep 17, 2021
@sujinmkang sujinmkang reopened this Sep 17, 2021
Since platform driver is not ready before database container, we cannot
determine if there is any platform reset happens when database starts.
So Delete the setting before starting swss if any platform reset
happened, so that the swss/orchagent can perform loading phy firmware
downloading before it hits the phy connection timeout.
@sujinmkang
Copy link
Collaborator Author

@yxieca Since the platform driver is not ready by database container starts and we are sure the platform driver is ready before swss starting, I changed the code to Delete "FAST_REBOOT|system" key when swss starts instead of checking the platform reset in database starts. What do you think this change?

@yxieca
Copy link
Contributor

yxieca commented Sep 20, 2021

@yxieca Since the platform driver is not ready by database container starts and we are sure the platform driver is ready before swss starting, I changed the code to Delete "FAST_REBOOT|system" key when swss starts instead of checking the platform reset in database starts. What do you think this change?

Is this question related to this PR? I didn't find the change matches this question?

@sujinmkang
Copy link
Collaborator Author

@santhosh-kt Can you please verify this new PR with the scenario?
@yxieca Can you please review the logic to detect the platform reset/reboot cause during fast-reload/reboot case?

@santhosh-kt
Copy link
Contributor

Hi @sujinmkang : Following changes can be tried out in master image but as I mentioned earlier, latest master raw image will expand upto 3GB and it will cause disk space issue while conversion.
I see to try the same in 201811 but we don't have determine-reboot-cause changes as well as no support on previous-reboot-cause.json.

@sujinmkang
Copy link
Collaborator Author

sujinmkang commented Oct 6, 2021

@santhosh-kt I created a new PR for 201811 : #8912

@sujinmkang
Copy link
Collaborator Author

@santhosh-kt Can you please give me some more information about the 3G image issue? Is the issue still existing without my PR or does it happen with my PR?

@santhosh-kt
Copy link
Contributor

@sujinmkang This issue is seen even before your changes. Below commit increased the raw disk size since docker file system got increased in size as well as "no disk space left" build error was seen when sonic-broadcom.raw build was generated with 2GB as RAW SIZE LIMIT.

cbe948e#diff-a681207dd6c606f49b8ab351137a2c0238f02908652615a58b73cf74865e55b1

if [[ $REBOOT_CAUSE =~ $REG_BOOT_TYPE ]]; then
if [[ "${EXTRA_CAUSE}" != "${CAUSE_NO_AVAIL}" ]]; then
# Delete the FAST_REBOOT|system db setting
$SONIC_DB_CLI STATE_DB DEL "FAST_REBOOT|system" &>/dev/null
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can you guarantee that syncd didn't take fast-reboot approach? syncd could have read this value before swss deletes it.

if [[ "$BOOT_TYPE" == "fast" ]] && [[ -d /host/fast-reboot ]]; then
if [[ -f /host/reboot-cause/previous-reboot-cause.json ]]; then
REG_BOOT_TYPE="fast*"
CAUSE_NO_AVAIL="\"N/A\""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use white space instead of tab?

@sujinmkang
Copy link
Collaborator Author

Closing this PR since we don't need this PR for master branch

@sujinmkang sujinmkang closed this Jan 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants