
Fix docker auto restart issue #21377

Merged (3 commits into sonic-net:master, Jan 14, 2025)

Conversation

@FengPan-Frank (Contributor) commented Jan 10, 2025

Why I did it

If a critical process crashes or is killed, the bmp docker container will not be auto-restarted.

Work item tracking
  • Microsoft ADO (number only): 30807821

How I did it

/usr/bin/supervisor-proc-exit-listener is in charge of monitoring critical processes and publishing their exit events, so it should itself be auto-restarted in any case. Otherwise there can be issues if supervisor-proc-exit-listener crashes, or in test cases such as
"docker exec bmp kill -SIGKILL -1", where critical processes may not be handled correctly because of a race condition (it depends on whether supervisor-proc-exit-listener happens to be the last process killed).

When a container receives the SIGKILL signal to terminate its processes, the order in which the processes are actually terminated can depend on the scheduling and resource availability within the container.

  • Scheduling: Within a container, processes are scheduled by the operating system or container runtime. The order in which the processes are scheduled to run can impact the order of termination. The scheduler determines which process gets executed first, and this can vary depending on factors such as process priorities, resource availability, and the scheduling algorithm used.
  • Resource Availability: Containers share resources such as CPU, memory, and disk I/O. When a SIGKILL signal is sent to all processes, the available resources might be limited or constrained. The order in which processes get terminated can be affected by resource contention. If resources are heavily utilized, some processes might be prioritized for termination over others due to resource constraints.

As a result, if supervisor-proc-exit-listener is killed before a critical process, container auto-restart will not be triggered as expected.
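In practical terms, the fix is about how the listener itself is supervised. Below is a minimal sketch of the kind of supervisord eventlistener section this implies for the bmp container; the section name, command arguments, event list, and option values are assumptions for illustration, not a verbatim copy of the actual change:

```ini
; Sketch of an eventlistener entry in the bmp container's supervisord config.
; The key line is autorestart=true: supervisord re-spawns the listener
; whenever it dies, so a watcher for critical processes is always present.
[eventlistener:supervisor-proc-exit-listener]
command=/usr/bin/supervisor-proc-exit-listener --container-name bmp
events=PROCESS_STATE_EXITED
autostart=true
autorestart=true
```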

How to verify it

(verification screenshot attached in the original PR)

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211
  • 202305

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@mssonicbld (Collaborator)

/azp run Azure.sonic-buildimage


Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld (Collaborator)

/azp run Azure.sonic-buildimage


Azure Pipelines successfully started running 1 pipeline(s).

@qiluo-msft (Collaborator)

Just curious about the PR description: you mentioned "some race condition". How bad is the bug? High chance or low chance to repro? Please update the PR description.

@zbud-msft (Contributor)

How can we reproduce this race condition issue? If we have steps to reproduce, can we add a sonic-mgmt test case? Do other containers need this change?

@ganglyu (Contributor) commented Jan 11, 2025

How can we reproduce this race condition issue? If we have steps to reproduce, can we add a sonic-mgmt test case? Do other containers need this change?

For instance, if we kill the supervisor-proc-exit-listener first and then kill the telemetry process, the telemetry container will not restart.
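As a rough shell sketch of that manual reproduction (the pkill patterns are placeholders; the exact supervised program names depend on the container image):

```sh
# Kill the exit listener first, so nothing is left to report critical
# process exits.
docker exec telemetry pkill -9 -f supervisor-proc-exit-listener

# Then kill a critical process while the listener is down.
docker exec telemetry pkill -9 -x telemetry

# Without the fix, the exit event is never acted on and the telemetry
# container is not auto-restarted.
docker ps --filter name=telemetry
```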

@FengPan-Frank (Contributor, Author)

Just curious about the PR description: you mentioned "some race condition". How bad is the bug? High chance or low chance to repro? Please update the PR description.

Added more details to the description.

The race issue is in the sonic-mgmt test case https://github.com/sonic-net/sonic-mgmt/blob/a354b1e5d665bfc2fd5d9a4b3b22fc3fa2f50592/tests/autorestart/test_container_autorestart.py#L309, where "docker exec container_name kill -SIGKILL -1" is sent to every container.

When a container receives the SIGKILL signal to terminate its processes, the order in which the processes are actually terminated can depend on the scheduling and resource availability within the container.

  • Scheduling: Within a container, processes are scheduled by the operating system or container runtime. The order in which the processes are scheduled to run can impact the order of termination. The scheduler determines which process gets executed first, and this can vary depending on factors such as process priorities, resource availability, and the scheduling algorithm used.
  • Resource Availability: Containers share resources such as CPU, memory, and disk I/O. When a SIGKILL signal is sent to all processes, the available resources might be limited or constrained. The order in which processes get terminated can be affected by resource contention. If resources are heavily utilized, some processes might be prioritized for termination over others due to resource constraints.

As a result, if supervisor-proc-exit-listener is killed before a critical process, container auto-restart will not be triggered as expected unless we configure autorestart=unexpected for the listener there.
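For context, whether a container is restarted at all when supervisord exits is governed by the SONiC per-feature auto-restart setting on the host. A quick way to check and enable it (the feature name bmp is assumed to match the container):

```sh
# Show the auto-restart setting of every feature/container.
show feature autorestart

# Enable auto-restart for the bmp feature so the container is restarted
# when supervisord shuts down after a critical process dies.
sudo config feature autorestart bmp enabled
```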

@zbud-msft (Contributor)

@FengPan-Frank If this is a generic issue within all containers, will we make this change in all containers?

@hdwhdw left a comment

In general LGTM. Is there a consistent repro for the issue and verification? Say, running test_container_autorestart.py X times? (I wonder what the practical value of X is; I imagine maybe 10 if you don't have that many critical processes.) Then we can verify the same flake does not happen again.

@FengPan-Frank (Contributor, Author)

@FengPan-Frank If this is a generic issue within all containers, will we make this change in all containers?

Only some containers have this issue, not all of them.

@FengPan-Frank (Contributor, Author)

In general LGTM. Is there a consistent repro for the issue and verification? Say, running test_container_autorestart.py X times? (I wonder what the practical value of X is; I imagine maybe 10 if you don't have that many critical processes.) Then we can verify the same flake does not happen again.

For a consistent repro, we can kill supervisor-proc-exit-listener first; then a critical process exit will not restart the docker container. I can add an additional test step for this to test_container_autorestart.py.
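A sketch of what that extra step could check, expressed as the shell commands the test would effectively run (the container name, critical process, and wait time are placeholders; the real test would derive them from the feature and critical-process tables):

```sh
CONTAINER=bmp
CRITICAL_PROC=openbmpd   # placeholder for one of the container's critical processes
BEFORE=$(docker inspect -f '{{.State.StartedAt}}' "$CONTAINER")

# Kill the exit listener first, then a critical process while it is down.
docker exec "$CONTAINER" pkill -9 -f supervisor-proc-exit-listener
docker exec "$CONTAINER" pkill -9 -x "$CRITICAL_PROC"

# With the fix, supervisord re-spawns the listener, the critical process
# exit is still handled, and the container gets restarted; a changed
# StartedAt timestamp indicates the restart happened.
sleep 60
AFTER=$(docker inspect -f '{{.State.StartedAt}}' "$CONTAINER")
[ "$BEFORE" != "$AFTER" ] && echo "container restarted" || echo "container NOT restarted"
```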

@qiluo-msft merged commit 7a21cab into sonic-net:master on Jan 14, 2025
20 checks passed
@mssonicbld (Collaborator)

Cherry-pick PR to 202411: #21426

VladimirKuk pushed a commit to Marvell-switching/sonic-buildimage that referenced this pull request Jan 21, 2025