[Monit] Monitoring the running status of containers. #6251

Merged
merged 18 commits into from
Jan 8, 2021

yozhao101
Contributor

- Why I did it
This PR monitors the running status of each container. The auto-restart feature is currently enabled: if a critical process exits unexpectedly, its container is restarted. If a container is restarted 3 times within 20 minutes, it will not run again until the failed state is cleared manually with the command sudo systemctl reset-failed <container_name>.
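
The restart limit described above can be illustrated with a small sliding-window check. This is only a sketch of the rule as stated in this PR (3 restarts within 20 minutes), not systemd's actual implementation. The function name `start_limit_hit` and its parameters are hypothetical.

```python
from datetime import datetime, timedelta


def start_limit_hit(restart_times, burst=3, interval=timedelta(minutes=20)):
    """Return True if `burst` restarts fall within any window of `interval`.

    `restart_times` is an iterable of datetime objects, one per restart.
    The defaults mirror the numbers quoted in the PR description.
    """
    times = sorted(restart_times)
    # Compare each restart with the one `burst - 1` positions later:
    # if they are close enough, `burst` restarts share one window.
    for i in range(len(times) - burst + 1):
        if times[i + burst - 1] - times[i] <= interval:
            return True
    return False
```

Once the limit is hit, the container stays down until someone runs sudo systemctl reset-failed <container_name>, which is the situation this PR is meant to surface.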

- How I did it
Monit runs a script periodically. The script generates the list of containers that are expected to be running and compares it with the containers that are currently running. If any container that is expected to run is not running, an alert message is written to syslog.
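
The comparison described above can be sketched as follows. This is a minimal illustration, not the container_checker script merged in this PR. The function names and the alert wording are hypothetical, and how the expected list is built is left to the caller because the description does not specify it.

```python
import subprocess
import syslog


def get_running_containers():
    """Return the names of running containers, as reported by `docker ps`."""
    output = subprocess.check_output(
        ["docker", "ps", "--format", "{{.Names}}"], text=True
    )
    return {name for name in output.splitlines() if name}


def find_missing_containers(expected, running):
    """Return the expected containers that are not running, sorted by name."""
    return sorted(set(expected) - set(running))


def report(expected, running):
    """Write one alert line to syslog if any expected container is missing."""
    missing = find_missing_containers(expected, running)
    if missing:
        # Hypothetical wording; the real script's message may differ.
        message = "Expected containers not running: " + ", ".join(missing)
        syslog.syslog(syslog.LOG_ERR, message)
    return missing
```

Keeping the set difference in its own function makes the core logic testable without a Docker daemon, for example `find_missing_containers(["bgp", "swss"], {"swss"})` returns `["bgp"]`.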

- How to verify it
I tested this feature on two lab devices: str-a7050-acs-3, which has a single ASIC, and str2-n3164-acs-3, which is multi-ASIC. I first stopped a container manually with the command sudo systemctl stop <container_name>, then checked whether an alert message appeared in syslog.

- Which release branch to backport (provide reason below if selected)

  • [ ] 201811
  • [ ] 201911
  • [x] 202006

- Description for the changelog

- A picture of a cute animal (not mandatory but encouraged)

write an alerting message into syslog if a container which was expected
to run but was not running.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Contributor

@jleveque jleveque left a comment

See the inline comments.

Also, I suggest renaming the monitoring_containers script to container_checker.

files/image_config/monit/monitoring_containers (three review threads, outdated and resolved)
'container_checker'.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
message into syslog.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
files/image_config/monit/container_checker (two review threads, outdated and resolved)
containers.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
run but running.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Contributor

@abdosi abdosi left a comment

LGTM

@lguohan lguohan merged commit 04cd1d6 into sonic-net:master Jan 8, 2021
lguohan pushed a commit that referenced this pull request Jan 9, 2021
yozhao101 added a commit to sonic-net/sonic-mgmt that referenced this pull request Jan 27, 2021
What is the motivation for this PR?
This PR tests the container checker feature, which was introduced in sonic-net/sonic-buildimage#6251.

The container_checker script is run periodically by Monit and monitors the running status of each container. The auto-restart feature is currently enabled: if a critical process exits unexpectedly, its container is restarted. If a container is restarted 3 times within 20 minutes, it will not run again until the failed state is cleared manually with the command sudo systemctl reset-failed <container_name>.

How did you do it?
This pytest script tests container_checker in the following steps:

1. Stop the containers explicitly.
2. Check whether the names of the stopped containers appear in the Monit alert message.
3. Restart the stopped containers.
4. Post-check that all critical processes are running and BGP sessions are established.
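
The check that stopped container names appear in the alert message can be sketched as a small helper. This is an illustration under assumptions, not code from the sonic-mgmt test. The helper name `containers_missing_from_alert` and the sample alert text in the usage note are made up.

```python
def containers_missing_from_alert(stopped_containers, alert_message):
    """Return stopped container names that the alert message fails to mention.

    The message is split into whole tokens (commas treated as separators)
    so that a name such as "bgp" does not match inside a longer word.
    """
    mentioned = set(alert_message.replace(",", " ").split())
    return [name for name in stopped_containers if name not in mentioned]
```

A test could then assert that the returned list is empty, for example with an alert such as "Expected containers not running: bgp, swss" and stopped containers bgp and swss.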
How did you verify/test it?
I tested this pytest script on a virtual testbed.

Any platform specific information?
N/A

Supported testbed topology if it's a new test case?
N/A
yozhao101 added a commit to sonic-net/sonic-mgmt that referenced this pull request Mar 19, 2021
Signed-off-by: Yong Zhao <yozhao@microsoft.com>

Description of PR
Summary:
This PR tests the container checker feature, which was introduced in sonic-net/sonic-buildimage#6251.

Fixes # (issue)

Type of change
[ ] Bug fix
[ ] Testbed and Framework (new/improvement)
[x] Test case (new/improvement)

Approach

What is the motivation for this PR?
This PR tests the container checker feature, which was introduced in sonic-net/sonic-buildimage#6251.

The container_checker script is run periodically by Monit and monitors the running status of each container. The auto-restart feature is currently enabled: if a critical process exits unexpectedly, its container is restarted. If a container is restarted 3 times within 20 minutes, it will not run again until the failed state is cleared manually with the command sudo systemctl reset-failed <container_name>.

How did you do it?
This pytest script tests container_checker in the following steps:

1. Stop the containers explicitly.
2. Check whether the names of the stopped containers appear in the Monit alert message.
3. Restart the containers with config_reload(...).
4. Post-check that all critical processes are running and BGP sessions are established.

How did you verify/test it?
I tested the PR on a physical testbed (str-dx010-acs-1) running an image built from the public master branch.
vmittal-msft pushed a commit to vmittal-msft/sonic-mgmt that referenced this pull request Sep 28, 2021
5 participants