[sysctl] Increase hung_task_timeout_secs to 300 #6312
Depending on the performance characteristics of a given hardware platform, I/O-intensive operations like image installation can exceed the default 120-second kernel hung-task timeout. This risk grows as image sizes continue to increase, so we need to raise the timeout to avoid kernel panics on devices with lower disk throughput.
Signed-off-by: Danny Allen <daall@microsoft.com>
I feel we should not change the default value here. Instead, we should nice the installer. If the kernel is in such a bad state, user-space applications might hang as well, and the hardware watchdog might time out and reboot the box.
as comments
I think the issue is that the kernel takes too long to flush I/O to the hard drive. I am not sure that nicing the user application would help, unless we also throttle the amount of data written to the hard drive. Even then, because there is a cache in between, we don't know when the kernel will flush or how much data it will write. So I think we don't have control over this issue in user space.
Can you add more context for this change, so that when we later trace back to this PR, we know exactly why we made it?
retest vsimage please
Depending on the performance characteristics of a given hardware platform, I/O-intensive operations like image installation can exceed the default 120-second kernel hung-task timeout. This can cause a kernel panic like so:
kernel:[ 852.441781] Kernel panic - not syncing: hung_task: blocked tasks
If this happens during image installation, the install can become corrupted and leave the device in an unreachable state that requires a power cycle to resolve. This risk grows as image sizes continue to increase, so we need to raise the timeout to avoid kernel panics on devices with lower disk throughput.
Signed-off-by: Danny Allen <daall@microsoft.com>
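For anyone tracing back to this change later, here is a minimal sketch of how one might check the effective timeout on a running device before starting an image install. It is not part of this PR. It assumes the standard procfs location `/proc/sys/kernel/hung_task_timeout_secs`; the helper names `read_hung_task_timeout` and `timeout_is_at_least` are hypothetical and exist only for illustration.

```python
from pathlib import Path

# Standard procfs location of the sysctl discussed in this PR.
HUNG_TASK_TIMEOUT = Path("/proc/sys/kernel/hung_task_timeout_secs")


def read_hung_task_timeout(path=HUNG_TASK_TIMEOUT):
    """Return the hung-task timeout in seconds, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        # The file is missing (hung-task detection not built in), unreadable,
        # or does not contain an integer.
        return None


def timeout_is_at_least(required_secs, path=HUNG_TASK_TIMEOUT):
    """Check whether the configured timeout is at least `required_secs`.

    A value of 0 means the hung-task check is disabled, so it is treated
    as satisfying any requirement.
    """
    value = read_hung_task_timeout(path)
    if value is None:
        return False
    return value == 0 or value >= required_secs


if __name__ == "__main__":
    print("hung_task_timeout_secs:", read_hung_task_timeout())
    print("at least 300s:", timeout_is_at_least(300))
```

A helper like this only reports the value. Whether a long flush actually panics the box also depends on `kernel.hung_task_panic` being enabled, which the panic message above suggests is the case on the affected devices.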