Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[techsupport] Handle minor fixes of TS Lock and update auto-TS #2114

Merged
merged 7 commits into from
Apr 3, 2022
Merged
Show file tree
Hide file tree
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 12 additions & 3 deletions scripts/coredump_gen_handler.py
Original file line number Diff line number Diff line change
Expand Up @@ -110,13 +110,22 @@ def parse_ts_dump_name(self, ts_stdout):
syslog.syslog(syslog.LOG_ERR, "stdout of the 'show techsupport' cmd doesn't have the dump name")
return ""

def invoke_ts_cmd(self, since_cfg):
since_cfg = "'" + since_cfg + "'"
def invoke_ts_cmd(self, since_cfg, num_retry=0):
cmd_opts = ["show", "techsupport", "--silent", "--since", since_cfg]
cmd = " ".join(cmd_opts)
rc, stdout, stderr = subprocess_exec(cmd_opts, env=ENV_VAR)
if rc:
if rc == EXT_LOCKFAIL:
syslog.syslog(syslog.LOG_NOTICE, "Another instance of techsupport running, aborting this. stderr: {}".format(stderr))
return ""
elif rc == EXT_RETRY:
if num_retry <= MAX_RETRY_LIMIT:
print(num_retry)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be a gap of few seconds before next retry?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not required, EXT_RETRY happening is less likely and response should be quick in order to grab the lock. so i don't think a gap is required.
I'll remove the print statement though

return self.invoke_ts_cmd(since_cfg, num_retry+1)
else:
syslog.syslog(syslog.LOG_ERR, "MAX_RETRY_LIMIT for show techsupport invocation exceeded, stderr: {}".format(stderr))
elif rc != EXT_SUCCESS:
syslog.syslog(syslog.LOG_ERR, "show techsupport failed with exit code {}, stderr: {}".format(rc, stderr))
Copy link
Contributor

@qiluo-msft qiluo-msft Mar 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

syslog

Do you want to exit immediately after the error logging? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, i'll add it

# Parse the dump name
new_dump = self.parse_ts_dump_name(stdout)
if not new_dump:
syslog.syslog(syslog.LOG_ERR, "{} was run, but no techsupport dump is found".format(cmd))
Expand Down
12 changes: 8 additions & 4 deletions scripts/generate_dump
Original file line number Diff line number Diff line change
Expand Up @@ -63,6 +63,10 @@ handle_exit()
ECODE=$?
echo "Removing lock. Exit: $ECODE" >&2
$RM $V -rf ${LOCKDIR}
# Echo the filename as the last statement if the generation suceeds
Copy link
Contributor

@qiluo-msft qiluo-msft Mar 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suceeds

typo #Closed

Copy link
Contributor

@qiluo-msft qiluo-msft Mar 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Echo

Isn't it already supported? I am seeing

mkdir: created directory '/var/dump/sonic_dump_vlab-01_20220331_204737/log'
sonic_dump_vlab-01_20220331_204737/log/techsupport_time_info.Gg18UE4AVH
removed '/var/dump/sonic_dump_vlab-01_20220331_204737/log/techsupport_time_info.Gg18UE4AVH'
removed directory '/var/dump/sonic_dump_vlab-01_20220331_204737/core'
removed directory '/var/dump/sonic_dump_vlab-01_20220331_204737/log'
removed directory '/var/dump/sonic_dump_vlab-01_20220331_204737'
/var/dump/sonic_dump_vlab-01_20220331_204737.tar:         5.5% -- replaced with /var/dump/sonic_dump_vlab-01_20220331_204737.tar.gz
/var/dump/sonic_dump_vlab-01_20220331_204737.tar.gz
``` #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the addition of lock code, no. handle_exit is the last trap that runs before exiting, and it does print a few statements and thus the issue

if [[ -f $TARFILE && ($ECODE == $EXT_SUCCESS || $ECODE == $RETURN_CODE) ]]; then
echo $TARFILE
fi
}

handle_signal()
Expand Down Expand Up @@ -1360,8 +1364,6 @@ main() {

# Invoke the TechSupport Cleanup Hook
setsid python3 /usr/local/bin/techsupport_cleanup.py ${TARFILE} &> /tmp/techsupport_cleanup.log &

echo ${TARFILE}

if ! $SAVE_STDERR
then
Expand Down Expand Up @@ -1494,7 +1496,6 @@ while getopts ":xnvhzas:t:r:d" opt; do
;;
v)
# echo commands about to be run to stderr
set -v
V="-v"
;;
n)
Expand Down Expand Up @@ -1547,14 +1548,17 @@ fi
## Attempt Locking
##

if mkdir "${LOCKDIR}" &>/dev/null; then
if $MKDIR "${LOCKDIR}" &>/dev/null; then
trap 'handle_exit' EXIT
echo "$$" > "${PIDFILE}"
# This handler will exit the script upon receiving these interrupts
# Trap configured on EXIT will be triggered by the exit from handle_signal function
trap 'handle_signal' SIGINT SIGHUP SIGQUIT SIGTERM
echo "Lock succesfully accquired and installed signal handlers"
# Proceed with the actual code
if [[ ! -z "${V}" ]]; then
set -v
fi
main
else
# lock failed, check if the other PID is alive
Expand Down
36 changes: 34 additions & 2 deletions tests/coredump_gen_handler_test.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,10 +3,12 @@
import sys
import pyfakefs
import unittest
import signal
from pyfakefs.fake_filesystem_unittest import Patcher
from swsscommon import swsscommon
from utilities_common.general import load_module_from_source
from utilities_common.db import Db
from utilities_common.auto_techsupport_helper import EXT_RETRY
from .mock_tables import dbconnector

sys.path.append("scripts")
Expand All @@ -18,6 +20,9 @@
/tmp/saisdkdump
"""

def signal_handler(signum, frame):
raise Exception("Timed out!")

def set_auto_ts_cfg(redis_mock, state="disabled",
rate_limit_interval="0",
max_core_size="0.0",
Expand Down Expand Up @@ -264,7 +269,7 @@ def test_since_argument(self):
def mock_cmd(cmd, env):
ts_dump = "/var/dump/sonic_dump_random3.tar.gz"
cmd_str = " ".join(cmd)
if "--since '4 days ago'" in cmd_str:
if "--since 4 days ago" in cmd_str:
patcher.fs.create_file(ts_dump)
return 0, AUTO_TS_STDOUT + ts_dump, ""
elif "date --date=4 days ago" in cmd_str:
Expand Down Expand Up @@ -330,7 +335,7 @@ def test_invalid_since_argument(self):
def mock_cmd(cmd, env):
ts_dump = "/var/dump/sonic_dump_random3.tar.gz"
cmd_str = " ".join(cmd)
if "--since '2 days ago'" in cmd_str:
if "--since 2 days ago" in cmd_str:
patcher.fs.create_file(ts_dump)
print(AUTO_TS_STDOUT + ts_dump)
return 0, AUTO_TS_STDOUT + ts_dump, ""
Expand Down Expand Up @@ -396,3 +401,30 @@ def mock_cmd(cmd, env):
assert "orchagent.12345.123.core.gz" in current_fs
assert "lldpmgrd.12345.22.core.gz" in current_fs
assert "python3.12345.21.core.gz" in current_fs

def test_max_retry_ts_failure(self):
"""
Scenario: TS subprocess is continously returning EXT_RETRY
Make sure auto-ts is not exceeding the limit
"""
db_wrap = Db()
redis_mock = db_wrap.db
set_auto_ts_cfg(redis_mock, state="enabled")
set_feature_table_cfg(redis_mock, state="enabled")
with Patcher() as patcher:
def mock_cmd(cmd, env):
return EXT_RETRY, "", ""

cdump_mod.subprocess_exec = mock_cmd
patcher.fs.create_file("/var/core/orchagent.12345.123.core.gz")
cls = cdump_mod.CriticalProcCoreDumpHandle("orchagent.12345.123.core.gz", "swss", redis_mock)

signal.signal(signal.SIGALRM, signal_handler)
signal.alarm(5) # 5 seconds
try:
cls.handle_core_dump_creation_event()
except Exception:
assert False, "Method should not time out"
finally:
signal.alarm(0)

10 changes: 8 additions & 2 deletions utilities_common/auto_techsupport_helper.py
Original file line number Diff line number Diff line change
Expand Up @@ -11,8 +11,9 @@
"CORE_DUMP_DIR", "CORE_DUMP_PTRN", "TS_DIR", "TS_PTRN",
"CFG_DB", "AUTO_TS", "CFG_STATE", "CFG_MAX_TS", "COOLOFF",
"CFG_CORE_USAGE", "CFG_SINCE", "FEATURE", "STATE_DB",
"TS_MAP", "CORE_DUMP", "TIMESTAMP", "CONTAINER",
"TIME_BUF", "SINCE_DEFAULT", "TS_PTRN_GLOB"
"TS_MAP", "CORE_DUMP", "TIMESTAMP", "CONTAINER", "TIME_BUF",
"SINCE_DEFAULT", "TS_PTRN_GLOB", "EXT_LOCKFAIL", "EXT_RETRY",
"EXT_SUCCESS", "MAX_RETRY_LIMIT"
] + [ # Methods
"verify_recent_file_creation",
"get_ts_dumps",
Expand Down Expand Up @@ -60,6 +61,11 @@
TIME_BUF = 20
SINCE_DEFAULT = "2 days ago"

# Techsupport Exit Codes
EXT_LOCKFAIL = 2
EXT_RETRY = 4
EXT_SUCCESS = 0
MAX_RETRY_LIMIT = 2
Copy link
Collaborator

@dgsudharsan dgsudharsan Mar 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is MAX_RETRY_LIMIT configurable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EXT_RETRY happening more than one time for a single process is even more unlikely and thus a MAX_RETRY_LIMIT need not be configurable.


# Helper methods
def subprocess_exec(cmd, env=None):
Expand Down