-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Check platform reboot cause to see if any reset happened during fast/warm-reboot #7920
Conversation
fast/warm-reboot
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
function isPlatformReset() | ||
{ | ||
output=$(python3 -c "import sonic_platform.platform; p = sonic_platform.platform.Platform(); c = p.get_chassis(); hw_rc_major, hw_rc_minor = c.get_reboot_cause(); print(hw_rc_major)" 2>/dev/null) | ||
echo "${output}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this should echo more context, rather than just the reboot cause?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sujinmkang while testing in 201811 image, This line(66) causing changes in getBootType() function.
(ie) BOOT_TYPE = "reboot cause" value instead of "cold", "warm", "fastfast", etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing this line (echo "${output}") resolved the issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I commented about it yesterday but it said pending status.
if [[ "${output}" != "Non-Hardware" ]]; then | ||
return 0 | ||
else | ||
return 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add a comment here to explain this logic?
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
Uploaded UT test logs in #8024 for your reference. |
@@ -60,6 +60,16 @@ function getMountPoint() | |||
echo $1 | python -c "import sys, json, os; mnts = [x for x in json.load(sys.stdin)[0]['Mounts'] if x['Destination'] == '/usr/share/sonic/hwsku']; print '' if len(mnts) == 0 else os.path.abspath(mnts[0]['Source'])" 2>/dev/null | |||
} | |||
|
|||
function isPlatformReset() | |||
{ | |||
output=$(python3 -c "import sonic_platform.platform; p = sonic_platform.platform.Platform(); c = p.get_chassis(); hw_rc_major, hw_rc_minor = c.get_reboot_cause(); print(hw_rc_major)" 2>/dev/null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sujinmkang : On checking the logs double time, found out one of the process(database.sh) is mentioning it as cold reset despite fast-boot happened. I suspect this is because either determine-reboot-cause or process-reboot-cause function is not up until that time?
Will there be any impact if database is still considering the BOOT_TYPE as cold?
sonic database.sh[680]: BOOT_TYPE we received cold
sonic lldp.sh[2314]: BOOT_TYPE we received fast
sonic pmon.sh[2308]: BOOT_TYPE we received fast
sonic-s6100-01 bgp.sh[2886]: BOOT_TYPE we received fast
sonic-s6100-01 swss.sh[2887]: BOOT_TYPE we received fast
sonic-s6100-01 teamd.sh[3655]: BOOT_TYPE we received fast
sonic-s6100-01 snmp.sh[3658]: BOOT_TYPE we received fast
sonic-s6100-01 radv.sh[3651]: BOOT_TYPE we received fast
sonic-s6100-01 syncd.sh[3656]: BOOT_TYPE we received fast
sonic-s6100-01 dhcp_relay.sh[4441]: BOOT_TYPE we received fast
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@santhosh-kt we're not using determine-reboot-cause or process-reboot-cause for this determination but only using platform api. Probably does the database always consider the boot_type as cold even without this change? Since we're sending the same boot_type to all container starting script. Right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry! I misunderstood. Except database.sh script all other daemon scripts are called lately if I understood.
So all those scripts(except database.sh) could have easily interpreted the boot type since get_reboot_cause() is showing correct value as track_reboot_reason.sh script got executed before these scripts.
But for database.sh, this get_reboot_cause() is showing wrong value as it runs even before s6100_platform.sh itself.
My question is, will wrong BOOT_TYPE has any impact in database.sh script?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@santhosh-kt can we detect the platform driver got initialized or not? Or should we check the exception from platform api call to see the platform driver initialization status? Any other suggestion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@santhosh-kt I checked the database container start code, I don't think the boot_type is be used for database container starts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yxieca I don't think the database container is starting differently based on the boot-type. do you have any comments or concern?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Database container start code has a special handling during warm reboot: https://github.com/Azure/sonic-buildimage/blob/master/files/build_templates/docker_image_ctl.j2#L87
Add additional reboot cause information to track the unexpected reboot cases
Since platform driver is not ready before database container, we cannot determine if there is any platform reset happens when database starts. So Delete the setting before starting swss if any platform reset happened, so that the swss/orchagent can perform loading phy firmware downloading before it hits the phy connection timeout.
@yxieca Since the platform driver is not ready by database container starts and we are sure the platform driver is ready before swss starting, I changed the code to Delete "FAST_REBOOT|system" key when swss starts instead of checking the platform reset in database starts. What do you think this change? |
Is this question related to this PR? I didn't find the change matches this question? |
@santhosh-kt Can you please verify this new PR with the scenario? |
Hi @sujinmkang : Following changes can be tried out in master image but as I mentioned earlier, latest master raw image will expand upto 3GB and it will cause disk space issue while conversion. |
@santhosh-kt I created a new PR for 201811 : #8912 |
@santhosh-kt Can you please give me some more information about the 3G image issue? Is the issue still existing without my PR or does it happen with my PR? |
@sujinmkang This issue is seen even before your changes. Below commit increased the raw disk size since docker file system got increased in size as well as "no disk space left" build error was seen when sonic-broadcom.raw build was generated with 2GB as RAW SIZE LIMIT. cbe948e#diff-a681207dd6c606f49b8ab351137a2c0238f02908652615a58b73cf74865e55b1 |
if [[ $REBOOT_CAUSE =~ $REG_BOOT_TYPE ]]; then | ||
if [[ "${EXTRA_CAUSE}" != "${CAUSE_NO_AVAIL}" ]]; then | ||
# Delete the FAST_REBOOT|system db setting | ||
$SONIC_DB_CLI STATE_DB DEL "FAST_REBOOT|system" &>/dev/null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can you guarantee that syncd didn't take fast-reboot approach? syncd could have read this value before swss deletes it.
if [[ "$BOOT_TYPE" == "fast" ]] && [[ -d /host/fast-reboot ]]; then | ||
if [[ -f /host/reboot-cause/previous-reboot-cause.json ]]; then | ||
REG_BOOT_TYPE="fast*" | ||
CAUSE_NO_AVAIL="\"N/A\"" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you use white space instead of tab?
Closing this PR since we don't need this PR for master branch |
Why I did it
Check platform reboot cause to see if any reset happened during fast/warm-reboot
While starting dockers with boot_type, it's only checking the /proc/cmdline.
If any unexpected Platform reset happens during fast/warm-reboot, it should be detected as platform reboot
and "FAST_REBOOT|system" should be set in that case.
How I did it
Add the platform reboot cause parsing when determining the boot_type
How to verify it
Check the logic with reboot
Convert the device with fast-reload and see if the boot_type is detected correctly
Which release branch to backport (provide reason below if selected)
Description for the changelog
A picture of a cute animal (not mandatory but encouraged)