-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Monit] Repeat restarting culprit container if Monit can't reset its counter #10288
[Monit] Repeat restarting culprit container if Monit can't reset its counter #10288
Conversation
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
@@ -85,6 +85,8 @@ def check_memory_usage(container_name, threshold_value): | |||
if mem_usage_bytes > threshold_value: | |||
print("[{}]: Memory usage ({} Bytes) is larger than the threshold ({} Bytes)!" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The message text from statement print(...)
will be appended to the end of Monit alerting message in syslog. Please review the following example:
Jan 2 02:34:56.029949 DSM07-0101-0613-10BT0 ERR monit[496]: 'container_memory_telemetry' status failed (3) -- [telemetry]: Memory usage (489475276.8 Bytes) is larger than the threshold (419430400 Bytes)!
@@ -2,4 +2,4 @@ | |||
## Monit configuration for telemetry container | |||
############################################################################### | |||
check program container_memory_telemetry with path "/usr/bin/memory_checker telemetry 419430400" | |||
if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry" | |||
if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry" repeat every 2 cycles |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
repeat every 2 cycles
at here means if Monit failed to reset its counter and memory usage of telemetry is still larger than the threshold (at here 400MB
) for 2 minutes
, then telemetry container will be restarted.
I selected the number 2
since memory usage of telemetry possibly increased very quickly from MB to around GB within 2 minutes and telemetry should be restarted ASAP.
Other numbers can be selected as well and do you have any suggestion?
@@ -2,4 +2,4 @@ | |||
## Monit configuration for telemetry container | |||
############################################################################### | |||
check program container_memory_telemetry with path "/usr/bin/memory_checker telemetry 419430400" | |||
if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry" | |||
if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry" repeat every 2 cycles |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we discussed at last week's meeting, I am working on the PR of pytest testcase to make sure the old code will fail while new one will succeed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pytest PR is submitted for review: sonic-net/sonic-mgmt#5492.
@@ -2,4 +2,4 @@ | |||
## Monit configuration for telemetry container | |||
############################################################################### | |||
check program container_memory_telemetry with path "/usr/bin/memory_checker telemetry 419430400" | |||
if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry" | |||
if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry" repeat every 2 cycles |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will double check whether this new syntax existed on 201811, 201911 and 202012 images or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In 201811, 201911 and 202012 images, Monit has the same version 5.20.0
. I checked this syntax of Monit with 20191130.83 image on lab device and it did work.
…10288) Signed-off-by: Yong Zhao <yozhao@microsoft.com> Why I did it This PR aims to fix the Monit issue which shows Monit can't reset its counter when monitoring memory usage of telemetry container. Specifically the Monit configuration file related to monitoring memory usage of telemetry container is as following: check program container_memory_telemetry with path "/usr/bin/memory_checker telemetry 419430400" if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry" If memory usage of telemetry container is larger than 400MB for 10 times within 20 cycles (minutes), then it will be restarted. Recently we observed, after telemetry container was restarted, its memory usage continuously increased from 400MB to 11GB within 1 hour, but it was not restarted anymore during this 1 hour sliding window. The reason is Monit can't reset its counter to count again and Monit can reset its counter if and only if the status of monitored service was changed from Status failed to Status ok. However, during this 1 hour sliding window, the status of monitored service was not changed from Status failed to Status ok. Currently for each service monitored by Monit, there will be an entry showing the monitoring status, monitoring mode etc. For example, the following output from command sudo monit status shows the status of monitored service to monitor memory usage of telemetry: Program 'container_memory_telemetry' status Status ok monitoring status Monitored monitoring mode active on reboot start last exit value 0 last output - data collected Sat, 19 Mar 2022 19:56:26 Every 1 minute, Monit will run the script to check the memory usage of telemetry and update the counter if memory usage is larger than 400MB. If Monit checked the counter and found memory usage of telemetry is larger than 400MB for 10 times within 20 minutes, then telemetry container was restarted. Following is an example status of monitored service: Program 'container_memory_telemetry' status Status failed monitoring status Monitored monitoring mode active on reboot start last exit value 0 last output - data collected Tue, 01 Feb 2022 22:52:55 After telemetry container was restarted. we found memory usage of telemetry increased rapidly from around 100MB to more than 400MB during 1 minute and status of monitored service did not have a chance to be changed from Status failed to Status ok. How I did it In order to provide a workaround for this issue, Monit recently introduced another syntax format repeat every <n> cycles related to exec. This new syntax format will enable Monit repeat executing the background script if the error persists for a given number of cycles. How to verify it I verified this change on lab device str-s6000-acs-12. Another pytest PR (sonic-net/sonic-mgmt#5492) is submitted in sonic-mgmt repo for review.
…w syntax as a workaround (#5492) What is the motivation for this PR? This PR adds two new test cases to test a Monit issue fixed by the PR (sonic-net/sonic-buildimage#10288) in SONiC image repo. How did you do it? Specifically the 'stress' utility is leveraged to increase memory usage of a container continuously, then Test whether that container can be restarted by the script ran by Monit. Test whether that container can be restarted by the script ran by Moni; If that container was restarted, then test the script ran by Monit was unable to restart the container anymore due to Monit failed to reset its internal counter. Test whether that container can be restarted by the script ran by Moni; If that container was restarted, then test the script ran by Monit was able to restart the container with the help of new Monit syntax although Monit failed to reset its internal counter. How did you verify/test it? I tested manually against the DuT str-s6000-0n-5 with 20201231.54 and 20191130.83 images. and following is testing results with 20201231.54 image.
…w syntax as a workaround (#5492) What is the motivation for this PR? This PR adds two new test cases to test a Monit issue fixed by the PR (sonic-net/sonic-buildimage#10288) in SONiC image repo. How did you do it? Specifically the 'stress' utility is leveraged to increase memory usage of a container continuously, then Test whether that container can be restarted by the script ran by Monit. Test whether that container can be restarted by the script ran by Moni; If that container was restarted, then test the script ran by Monit was unable to restart the container anymore due to Monit failed to reset its internal counter. Test whether that container can be restarted by the script ran by Moni; If that container was restarted, then test the script ran by Monit was able to restart the container with the help of new Monit syntax although Monit failed to reset its internal counter. How did you verify/test it? I tested manually against the DuT str-s6000-0n-5 with 20201231.54 and 20191130.83 images. and following is testing results with 20201231.54 image.
Related work items: #49, #58, #107, sonic-net#247, sonic-net#249, sonic-net#277, sonic-net#593, sonic-net#597, sonic-net#1035, sonic-net#2130, sonic-net#2150, sonic-net#2165, sonic-net#2169, sonic-net#2178, sonic-net#2179, sonic-net#2187, sonic-net#2188, sonic-net#2191, sonic-net#2195, sonic-net#2197, sonic-net#2198, sonic-net#2200, sonic-net#2202, sonic-net#2206, sonic-net#2209, sonic-net#2211, sonic-net#2216, sonic-net#7909, sonic-net#8927, sonic-net#9681, sonic-net#9733, sonic-net#9746, sonic-net#9850, sonic-net#9967, sonic-net#10104, sonic-net#10152, sonic-net#10168, sonic-net#10228, sonic-net#10266, sonic-net#10288, sonic-net#10294, sonic-net#10313, sonic-net#10394, sonic-net#10403, sonic-net#10404, sonic-net#10421, sonic-net#10431, sonic-net#10437, sonic-net#10445, sonic-net#10457, sonic-net#10458, sonic-net#10465, sonic-net#10467, sonic-net#10469, sonic-net#10470, sonic-net#10474, sonic-net#10477, sonic-net#10478, sonic-net#10482, sonic-net#10485, sonic-net#10488, sonic-net#10489, sonic-net#10492, sonic-net#10494, sonic-net#10498, sonic-net#10501, sonic-net#10509, sonic-net#10512, sonic-net#10514, sonic-net#10516, sonic-net#10517, sonic-net#10523, sonic-net#10525, sonic-net#10531, sonic-net#10532, sonic-net#10538, sonic-net#10555, sonic-net#10557, sonic-net#10559, sonic-net#10561, sonic-net#10565, sonic-net#10572, sonic-net#10574, sonic-net#10576, sonic-net#10578, sonic-net#10581, sonic-net#10585, sonic-net#10587, sonic-net#10599, sonic-net#10607, sonic-net#10611, sonic-net#10616, sonic-net#10618, sonic-net#10619, sonic-net#10623, sonic-net#10624, sonic-net#10633, sonic-net#10646, sonic-net#10655, sonic-net#10660, sonic-net#10664, sonic-net#10680, sonic-net#10683
Signed-off-by: Yong Zhao yozhao@microsoft.com
Why I did it
This PR aims to fix the Monit issue which shows Monit can't reset its counter when monitoring memory usage of telemetry container.
Specifically the Monit configuration file related to monitoring memory usage of telemetry container is as following:
If memory usage of telemetry container is larger than 400MB for 10 times within 20 cycles (minutes), then it will be restarted.
Recently we observed, after telemetry container was restarted, its memory usage continuously increased from 400MB to 11GB within 1 hour, but it was not restarted anymore during this 1 hour sliding window.
The reason is Monit can't reset its counter to count again and Monit can reset its counter if and only if the status of monitored service was changed from
Status failed
toStatus ok
. However, during this 1 hour sliding window, the status of monitored service was not changed fromStatus failed
toStatus ok
.Currently for each service monitored by Monit, there will be an entry showing the monitoring status, monitoring mode etc. For example, the following output from command
sudo monit status
shows the status of monitored service to monitor memory usage of telemetry:Every 1 minute, Monit will run the script to check the memory usage of telemetry and update the counter if memory usage is larger than 400MB. If Monit checked the counter and found memory usage of telemetry is larger than 400MB for 10 times
within 20 minutes, then telemetry container was restarted. Following is an example status of monitored service:
After telemetry container was restarted. we found memory usage of telemetry increased rapidly from around 100MB to more than 400MB during 1 minute and status of monitored service did not have a chance to be changed from
Status failed
toStatus ok
.How I did it
In order to provide a workaround for this issue, Monit recently introduced another syntax format
repeat every <n> cycles
related toexec
. This new syntax format will enable Monit repeat executing the background script if the error persists for a given number of cycles.How to verify it
I verified this change on lab device
str-s6000-acs-12
. Another pytest PR (sonic-net/sonic-mgmt#5492) is submitted in sonic-mgmt repo for review.Which release branch to backport (provide reason below if selected)
Description for the changelog
Link to config_db schema for YANG module changes
A picture of a cute animal (not mandatory but encouraged)