
Fix docker auto restart issue #21377

Merged (3 commits into sonic-net:master, Jan 14, 2025)

Conversation

@FengPan-Frank (Contributor) commented Jan 10, 2025

Why I did it

If a critical process crashes or is killed, the bmp docker container will not be auto-restarted.

Work item tracking
  • Microsoft ADO (number only): 30807821

How I did it

/usr/bin/supervisor-proc-exit-listener is in charge of monitoring critical processes and publishing their exit events, so it should itself be auto-restarted in any case. Otherwise there can be issues if supervisor-proc-exit-listener crashes, or in test cases such as
"docker exec bmp kill -SIGKILL -1", where critical processes may not be handled correctly because of a race condition (it depends on whether supervisor-proc-exit-listener happens to be the last process killed).

When a container receives the SIGKILL signal to terminate its processes, the order in which the processes are actually terminated can depend on the scheduling and resource availability within the container.

  • Scheduling: Within a container, processes are scheduled by the operating system or container runtime. The order in which the processes are scheduled to run can impact the order of termination. The scheduler determines which process gets executed first, and this can vary depending on factors such as process priorities, resource availability, and the scheduling algorithm used.
  • Resource Availability: Containers share resources such as CPU, memory, and disk I/O. When a SIGKILL signal is sent to all processes, the available resources might be limited or constrained. The order in which processes get terminated can be affected by resource contention. If resources are heavily utilized, some processes might be prioritized for termination over others due to resource constraints.

As a result, if supervisor-proc-exit-listener is killed before a critical process, container auto-restart will not be triggered as expected.
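In practical terms, the fix is about how the listener itself is supervised. Below is a minimal sketch of the kind of supervisord eventlistener section this implies for the bmp container; the section name, command arguments, event list, and option values are assumptions for illustration, not a verbatim copy of the actual change:

```ini
; Sketch of an eventlistener entry in the bmp container's supervisord config.
; The key line is autorestart=true: supervisord re-spawns the listener
; whenever it dies, so a watcher for critical processes is always present.
[eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name bmp
events=PROCESS_STATE_EXITED
autostart=true
autorestart=true
```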

How to verify it

(verification screenshot attached in the original PR)

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211
  • 202305

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@mssonicbld (Collaborator)

/azp run Azure.sonic-buildimage


Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld (Collaborator)

/azp run Azure.sonic-buildimage


Azure Pipelines successfully started running 1 pipeline(s).

@qiluo-msft (Collaborator)

Just curious about the PR description: you mentioned "some race condition". How bad is the bug? High chance or low chance to repro? Please update the PR description.

@zbud-msft (Contributor)

How can we reproduce this race condition issue? If we have steps to reproduce, can we add a sonic-mgmt test case? Do other containers need this change?

@ganglyu (Contributor) commented Jan 11, 2025

How can we reproduce this race condition issue? If we have steps to reproduce, can we add a sonic-mgmt test case? Do other containers need this change?

For instance, if we kill the supervisor-proc-exit-listener first and then kill the telemetry process, the telemetry container will not restart.
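As a rough shell sketch of that manual reproduction (the pkill patterns are placeholders; the exact supervised program names depend on the container image):

```sh
# Kill the exit listener first, so nothing is left to report critical
# process exits.
docker exec telemetry pkill -9 -f supervisor-proc-exit-listener

# Then kill a critical process while the listener is down.
docker exec telemetry pkill -9 -x telemetry

# Without the fix, the exit event is never acted on and the telemetry
# container is not auto-restarted.
docker ps --filter name=telemetry
```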

@FengPan-Frank (Contributor, Author)

Just curious about the PR description: you mentioned "some race condition". How bad is the bug? High chance or low chance to repro? Please update the PR description.

Added more details to the description.

The race issue is in the sonic-mgmt test case https://github.com/sonic-net/sonic-mgmt/blob/a354b1e5d665bfc2fd5d9a4b3b22fc3fa2f50592/tests/autorestart/test_container_autorestart.py#L309, where "docker exec container_name kill -SIGKILL -1" is sent to every container.

When a container receives the SIGKILL signal to terminate its processes, the order in which the processes are actually terminated can depend on the scheduling and resource availability within the container.

  • Scheduling: Within a container, processes are scheduled by the operating system or container runtime. The order in which the processes are scheduled to run can impact the order of termination. The scheduler determines which process gets executed first, and this can vary depending on factors such as process priorities, resource availability, and the scheduling algorithm used.
  • Resource Availability: Containers share resources such as CPU, memory, and disk I/O. When a SIGKILL signal is sent to all processes, the available resources might be limited or constrained. The order in which processes get terminated can be affected by resource contention. If resources are heavily utilized, some processes might be prioritized for termination over others due to resource constraints.

As a result, if supervisor-proc-exit-listener is killed before a critical process, container auto-restart will not be triggered as expected unless we configure autorestart=unexpected for the listener there.
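For context, whether a container is restarted at all when supervisord exits is governed by the SONiC per-feature auto-restart setting on the host. A quick way to check and enable it (the feature name bmp is assumed to match the container):

```sh
# Show the auto-restart setting of every feature/container.
show feature autorestart

# Enable auto-restart for the bmp feature so the container is restarted
# when supervisord shuts down after a critical process dies.
sudo config feature autorestart bmp enabled
```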

@zbud-msft (Contributor)

@FengPan-Frank If this is a generic issue within all containers, will we make this change in all containers?

@hdwhdw left a comment

In general LGTM. Is there a consistent repro for the issue and verification? Say, running test_container_autorestart.py X times? (I wonder what the practical value of X is; I imagine maybe 10 if you don't have that many critical processes.) Then we can verify the same flake does not happen again.

@FengPan-Frank (Contributor, Author)

@FengPan-Frank If this is a generic issue within all containers, will we make this change in all containers?

Only some containers have this issue, not all of them.

@FengPan-Frank (Contributor, Author)

In general LGTM. Is there a consistent repro for the issue and verification? Say, running test_container_autorestart.py X times? (I wonder what the practical value of X is; I imagine maybe 10 if you don't have that many critical processes.) Then we can verify the same flake does not happen again.

For a consistent repro, we can kill supervisor-proc-exit-listener first; then a critical process exit will not restart the docker container. I can add an additional test step for this to test_container_autorestart.py.
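A sketch of what that extra step could check, expressed as the shell commands the test would effectively run (the container name, critical process, and wait time are placeholders; the real test would derive them from the feature and critical-process tables):

```sh
CONTAINER=bmp
CRITICAL_PROC=openbmpd   # placeholder for one of the container's critical processes
BEFORE=$(docker inspect -f '{{.State.StartedAt}}' "$CONTAINER")

# Kill the exit listener first, then a critical process while it is down.
docker exec "$CONTAINER" pkill -9 -f supervisor-proc-exit-listener
docker exec "$CONTAINER" pkill -9 -x "$CRITICAL_PROC"

# With the fix, supervisord re-spawns the listener, the critical process
# exit is still handled, and the container gets restarted; a changed
# StartedAt timestamp indicates the restart happened.
sleep 60
AFTER=$(docker inspect -f '{{.State.StartedAt}}' "$CONTAINER")
[ "$BEFORE" != "$AFTER" ] && echo "container restarted" || echo "container NOT restarted"
```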

@qiluo-msft merged commit 7a21cab into sonic-net:master on Jan 14, 2025
20 checks passed
@mssonicbld (Collaborator)

Cherry-pick PR to 202411: #21426

VladimirKuk pushed a commit to Marvell-switching/sonic-buildimage that referenced this pull request Jan 21, 2025