[sysctl] Increase hung_task_timeout_secs to 300 #6312

Merged Dec 30, 2020 (1 commit)

daall (Contributor) commented Dec 29, 2020

Depending on the performance characteristics of a given hardware platform, it's possible to exceed the default 120-second kernel hung-task timeout (hung_task_timeout_secs) during I/O-intensive operations like image installation. This can cause a kernel panic like the following:

kernel:[ 852.441781] Kernel panic - not syncing: hung_task: blocked tasks

If this happens during image installation, the install can become corrupted and leave the device in an unreachable state that requires a power cycle to resolve. The risk grows as image size continues to increase, so we need to raise the timeout to 300 seconds to avoid kernel panics on devices with lower disk throughput.

Signed-off-by: Danny Allen <daall@microsoft.com>
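
For readers who want to check a device, here is a minimal sketch for inspecting the hung-task settings discussed above. It assumes the kernel exposes `hung_task_timeout_secs` and `hung_task_panic` under `/proc/sys/kernel` (they are present only when hung-task detection is built in). The function names `read_hung_task_settings` and `sysctl_line` are hypothetical helpers for illustration and are not part of this PR.

```python
from pathlib import Path


def read_hung_task_settings(proc_sys_kernel="/proc/sys/kernel"):
    """Return the hung-task sysctl values as ints, or None if a file is absent.

    The directory is a parameter so the function can be pointed at a
    fixture directory instead of the live /proc tree.
    """
    base = Path(proc_sys_kernel)
    settings = {}
    for name in ("hung_task_timeout_secs", "hung_task_panic"):
        path = base / name
        # Each sysctl file holds a single integer followed by a newline.
        settings[name] = int(path.read_text().strip()) if path.exists() else None
    return settings


def sysctl_line(key, value):
    """Format a kernel.* setting the way a sysctl configuration file expects it."""
    return f"kernel.{key} = {value}"


if __name__ == "__main__":
    print(read_hung_task_settings())
    # The value proposed in this PR, rendered as a sysctl configuration line.
    print(sysctl_line("hung_task_timeout_secs", 300))
```

A blocked task leads to a panic rather than only a warning when `hung_task_panic` is nonzero, so both values are worth checking together.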

lguohan (Collaborator) commented Dec 29, 2020

I feel we should not change the default value here. Instead, we should nice the installer. If the kernel is in such a bad state, user-space applications might hang as well, and the hardware watchdog might time out and reboot the box.

lguohan (Collaborator) left a review

See comments above.

yxieca (Contributor) commented Dec 29, 2020

> I feel we should not change the default value here. Instead, we should nice the installer. If the kernel is in such a bad state, user-space applications might hang as well, and the hardware watchdog might time out and reboot the box.

I think the issue is that the kernel takes too long to flush I/O to the hard drive. I am not sure that nicing the user application would help, unless we throttle the amount of data being written to the hard drive. Even then, because there is a cache in between, we don't know when the kernel will flush, or how much data it will flush, to the hard drive. So I think we don't have control over this issue in user space.
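
To make the throttling idea mentioned above concrete, here is a minimal sketch of a copy loop that calls `fsync` at intervals, so that only a bounded amount of unflushed data from this file accumulates in the page cache. This is an illustration only: the function name `copy_with_periodic_fsync` and its parameters are hypothetical, it is not what the installer does, and this thread does not establish whether it would avoid the hung-task panic on the affected platforms.

```python
import os


def copy_with_periodic_fsync(src_path, dst_path,
                             chunk_size=1 << 20, sync_every=64 << 20):
    """Copy src_path to dst_path, forcing a flush every sync_every bytes.

    Returns the total number of bytes written.
    """
    total = 0
    written_since_sync = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            total += len(chunk)
            written_since_sync += len(chunk)
            if written_since_sync >= sync_every:
                # Push Python's buffer to the kernel, then ask the kernel
                # to write this file's dirty pages to the device.
                dst.flush()
                os.fsync(dst.fileno())
                written_since_sync = 0
        # Final flush so nothing is left only in the cache on return.
        dst.flush()
        os.fsync(dst.fileno())
    return total
```

The trade-off is throughput: more frequent syncs keep less data pending but slow the copy, and other writers on the system can still fill the cache independently.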

lguohan (Collaborator) left a review

Can you add more context for this change, so that later, when we trace back to this PR, we know exactly why we made it?

daall (Contributor, Author) commented Dec 30, 2020

retest vsimage please

lguohan merged commit a64994e into sonic-net:master on Dec 30, 2020