Fix docker auto restart issue #21377
Conversation
/azp run Azure.sonic-buildimage
Azure Pipelines successfully started running 1 pipeline(s).
/azp run Azure.sonic-buildimage
Azure Pipelines successfully started running 1 pipeline(s).
Just curious about the PR description: you mentioned "some race condition". How bad is the bug? Is the chance to repro high or low? Please update the PR description.
How can we reproduce this race condition issue? If we have steps to reproduce, can we add a sonic-mgmt testcase? Do other containers need this change?
For instance, if we kill the supervisor-proc-exit-listener first and then kill the telemetry process, the telemetry container will not restart.
Added more comments in the description. The race issue shows up in the mgmt test case https://github.com/sonic-net/sonic-mgmt/blob/a354b1e5d665bfc2fd5d9a4b3b22fc3fa2f50592/tests/autorestart/test_container_autorestart.py#L309, where "docker exec container_name kill -SIGKILL -1" is sent to every container. When a container receives the SIGKILL signal to terminate its processes, the order in which the processes are actually terminated can depend on scheduling and resource availability within the container. Processes inside a container are scheduled by the operating system or container runtime, and the order in which they are scheduled to run can affect the order of termination; which process is terminated first varies with process priorities, resource availability, and the scheduling algorithm used. As a result, if supervisor-proc-exit-listener is killed before a critical process, the container auto-restart will not be launched as expected unless autorestart=unexpected is configured for the listener.
@FengPan-Frank If this is a generic issue within all containers, will we make this change in all containers?
In general LGTM. Is there a consistent repro for the issue and verification? Say running test_container_autorestart.py X times? (I wonder what a practical value of X is; I imagine maybe 10 if you don't have that many critical processes.) Then we can verify the same flake does not happen again.
Only some containers have this issue, not all containers.
For a consistent repro, we can kill supervisor-proc-exit-listener first; then a critical process exit will not restart the docker container. I can add one additional test step for this into test_container_autorestart.py.
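A minimal manual repro sketch along those lines, using the telemetry container mentioned earlier in the thread (this assumes pkill is available inside the container and that the critical process is named telemetry; adjust the names for the container actually under test):

    # Kill the exit listener first, so it can no longer observe process exits
    docker exec telemetry pkill -9 -f supervisor-proc-exit-listener
    # Now kill a critical process; with the listener gone, its exit is never
    # acted on, so the container keeps running instead of being restarted
    docker exec telemetry pkill -9 -x telemetry
    # Buggy behavior: the container still shows as Up, i.e. no auto-restart
    docker ps --filter name=telemetry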
Cherry-pick PR to 202411: #21426
Why I did it
If a critical process crashes or is killed, the bmp docker container will not be auto-restarted.
Work item tracking
How I did it
/usr/bin/supervisor-proc-exit-listener is in charge of monitoring critical processes and publishing events, so it should be auto-restarted in any case. Otherwise there can be problems if supervisor-proc-exit-listener crashes, or in test cases like
"docker exec bmp kill -SIGKILL -1", where critical processes may not be handled correctly due to a race condition (it depends on whether supervisor-proc-exit-listener is the last one to be killed).
When a container receives the SIGKILL signal to terminate its processes, the order in which the processes are actually terminated can depend on the scheduling and resource availability within the container.
Scheduling: Within a container, processes are scheduled by the operating system or container runtime. The order in which the processes are scheduled to run can impact the order of termination. The scheduler determines which process gets executed first, and this can vary depending on factors such as process priorities, resource availability, and the scheduling algorithm used.
Resource Availability: Containers share resources such as CPU, memory, and disk I/O. When a SIGKILL signal is sent to all processes, the available resources might be limited or constrained. The order in which processes get terminated can be affected by resource contention. If resources are heavily utilized, some processes might be prioritized for termination over others due to resource constraints.
As a result, if supervisor-proc-exit-listener is killed before a critical process, the container auto-restart will not be launched as expected.
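For reference, the listener is declared in the container's supervisord.conf, and the kind of setting discussed above would look roughly like the sketch below. The section name, command arguments, and event list here are assumptions for illustration only and are not the exact diff of this PR:

    [eventlistener:supervisor-proc-exit-listener]
    command=/usr/bin/supervisor-proc-exit-listener --container-name bmp
    events=PROCESS_STATE_EXITED
    autostart=true
    ; relaunch the listener if it dies unexpectedly (e.g. on SIGKILL),
    ; so it keeps monitoring critical processes and can trigger the
    ; container auto-restart path
    autorestart=unexpected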
How to verify it
Which release branch to backport (provide reason below if selected)
Tested branch (Please provide the tested image version)
Description for the changelog
Link to config_db schema for YANG module changes
A picture of a cute animal (not mandatory but encouraged)