[teamd] Fix tlm_teamd miss killed issue during stopping container #8007

Open: wants to merge 1 commit into base: master

kuanyu99
Contributor

Why I did it

The tlm_teamd process is killed by mistake when the container stop script tries to kill the teamd process.

The syslog looks like this:

Jun 22 09:11:37.064720 as8000-3 INFO teamd#supervisord 2021-06-22 09:11:37,064 INFO exited: tlm_teamd (terminated by SIGUSR1; not expected)
Jun 22 09:11:38.071958 as8000-3 INFO teamd#supervisord 2021-06-22 09:11:38,066 INFO reaped unknown pid 26 (exit status 0)
Jun 22 09:11:38.071958 as8000-3 INFO teamd#supervisord 2021-06-22 09:11:38,066 INFO reaped unknown pid 34 (exit status 0)
Jun 22 09:11:38.071958 as8000-3 INFO teamd#supervisord 2021-06-22 09:11:38,067 INFO reaped unknown pid 42 (exit status 0)
Jun 22 09:11:38.071958 as8000-3 INFO teamd#supervisord 2021-06-22 09:11:38,067 INFO reaped unknown pid 50 (exit status 0)
Jun 22 09:11:38.073458 as8000-3 INFO teamd#/supervisor-proc-exit-listener: Process 'tlm_teamd' exited unexpectedly. Terminating supervisor 'teamd'
Jun 22 09:11:38.074293 as8000-3 INFO teamd#supervisord 2021-06-22 09:11:38,073 WARN received SIGTERM indicating exit request
Jun 22 09:11:38.074496 as8000-3 INFO teamd#supervisord 2021-06-22 09:11:38,074 INFO waiting for supervisor-proc-exit-listener, rsyslogd, teammgrd, teamsyncd to die
Jun 22 09:11:40.077351 as8000-3 NOTICE teamd#teamsyncd: :- main: Received SIGTERM Exiting
Jun 22 09:11:41.068213 as8000-3 INFO teamd#supervisord 2021-06-22 09:11:41,067 INFO stopped: teamsyncd (exit status 0)
Jun 22 09:11:42.069232 as8000-3 NOTICE teamd#teammgrd: :- cleanTeamProcesses: Cleaning up LAGs during shutdown...

How I did it

Add the "-x" option to the pkill command, so that it only signals processes whose name exactly matches the pattern.
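
As a rough illustration of why the unqualified pattern also hits tlm_teamd, the sketch below models pkill's name matching in Python. It assumes the pattern is the literal string teamd and approximates pkill's extended-regex matching with Python's re module. The helper pkill_matches and the process list are hypothetical and are not part of the stop script.

```python
import re


def pkill_matches(pattern, process_names, exact=False):
    """Return the names a pkill-style pattern would select.

    Without exact, the pattern may match anywhere in the name (default pkill).
    With exact, the whole name must match (pkill -x).
    """
    matcher = re.fullmatch if exact else re.search
    return [name for name in process_names if matcher(pattern, name)]


# Illustrative process names for the teamd container (assumed, not read from a live system)
processes = ["teamd", "tlm_teamd", "teammgrd", "teamsyncd", "rsyslogd"]

print(pkill_matches("teamd", processes))              # ['teamd', 'tlm_teamd']
print(pkill_matches("teamd", processes, exact=True))  # ['teamd']
```

teammgrd and teamsyncd are not selected in either mode because neither name contains the string teamd; only tlm_teamd is caught by the substring match.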

How to verify it

  1. Stop the teamd container with sudo systemctl stop teamd
  2. Check the syslog and make sure tlm_teamd does not exit unexpectedly

After the fix, the syslog looks like this:

Jun 23 09:03:56.259763 as5835-54x INFO systemd[1]: Stopping TEAMD container...
Jun 23 09:03:56.265594 as5835-54x NOTICE admin: Stopping teamd service...
Jun 23 09:03:56.482925 as5835-54x INFO pmon#/xcvrd: Got SFP inserted event
Jun 23 09:03:56.482925 as5835-54x INFO pmon#/xcvrd: receive plug in and update port sfp status table.
Jun 23 09:03:56.545098 as5835-54x NOTICE admin: Warm boot flag: teamd true.
Jun 23 09:03:56.549725 as5835-54x NOTICE admin: Fast boot flag: teamd false.
Jun 23 09:03:57.091157 as5835-54x DEBUG /container: container_stop: BEGIN
Jun 23 09:03:57.092032 as5835-54x DEBUG /container: read_data: config:True feature:teamd fields:[('set_owner', 'local'), ('no_fallback_to_local', False)] val:['local', False]
Jun 23 09:03:57.092765 as5835-54x DEBUG /container: read_data: config:False feature:teamd fields:[('current_owner', 'none'), ('remote_state', 'none'), ('container_id', '')] val:['none', 'none', '']
Jun 23 09:03:57.093845 as5835-54x DEBUG /container: container_stop: teamd: set_owner:local current_owner:none remote_state:none docker_id:teamd
Jun 23 09:03:57.160800 as5835-54x ERR teamd#tlm_teamd: :- get_dump: Can't get dump for LAG 'PortChannel0002'. Skipping
Jun 23 09:03:57.160981 as5835-54x ERR teamd#tlm_teamd: :- get_dump: Can't get dump for LAG 'PortChannel0003'. Skipping
Jun 23 09:03:57.161062 as5835-54x ERR teamd#tlm_teamd: :- get_dump: Can't get dump for LAG 'PortChannel0001'. Skipping
Jun 23 09:03:57.161139 as5835-54x ERR teamd#tlm_teamd: :- get_dump: Can't get dump for LAG 'PortChannel0004'. Skipping
Jun 23 09:03:57.165264 as5835-54x NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel0003' has been removed.
Jun 23 09:03:57.165361 as5835-54x NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel0001' has been removed.
Jun 23 09:03:57.165439 as5835-54x NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel0004' has been removed.
Jun 23 09:03:57.165515 as5835-54x NOTICE teamd#tlm_teamd: :- remove_lag: The LAG 'PortChannel0002' has been removed.
Jun 23 09:03:57.165607 as5835-54x INFO teamd#supervisord 2021-06-23 09:03:57,164 INFO reaped unknown pid 26 (exit status 0)
Jun 23 09:03:57.167833 as5835-54x INFO teamd#supervisord 2021-06-23 09:03:57,165 INFO reaped unknown pid 34 (exit status 0)
Jun 23 09:03:57.167926 as5835-54x INFO teamd#supervisord 2021-06-23 09:03:57,165 INFO reaped unknown pid 42 (exit status 0)
Jun 23 09:03:57.168014 as5835-54x INFO teamd#supervisord 2021-06-23 09:03:57,165 INFO reaped unknown pid 50 (exit status 0)
Jun 23 09:03:58.168208 as5835-54x INFO teamd#supervisord 2021-06-23 09:03:58,167 WARN received SIGTERM indicating exit request
Jun 23 09:03:58.168995 as5835-54x INFO teamd#supervisord 2021-06-23 09:03:58,168 INFO waiting for supervisor-proc-exit-listener, rsyslogd, teammgrd, teamsyncd, tlm_teamd to die
Jun 23 09:03:59.169782 as5835-54x NOTICE teamd#tlm_teamd: :- main: Exiting
Jun 23 09:03:59.441711 as5835-54x INFO pmon#/xcvrd: Got SFP inserted event
Jun 23 09:03:59.441931 as5835-54x INFO pmon#/xcvrd: receive plug in and update port sfp status table.
Jun 23 09:03:59.804165 as5835-54x INFO teamd#supervisord 2021-06-23 09:03:59,803 INFO stopped: tlm_teamd (exit status 0)
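
Step 2 of the verification can be approximated with a small scan of the captured syslog. This is a minimal sketch and not part of the PR: find_unexpected_exits and the two sample lists are hypothetical names, and the sample lines are copied from the logs above.

```python
import re

# supervisord reports a signal-induced death of tlm_teamd in this form
UNEXPECTED_EXIT = re.compile(
    r"exited: tlm_teamd \(terminated by SIG\w+; not expected\)"
)


def find_unexpected_exits(lines):
    """Return the syslog lines that report an unexpected tlm_teamd exit."""
    return [line for line in lines if UNEXPECTED_EXIT.search(line)]


before_fix = [
    "Jun 22 09:11:37.064720 as8000-3 INFO teamd#supervisord 2021-06-22 09:11:37,064 "
    "INFO exited: tlm_teamd (terminated by SIGUSR1; not expected)",
]
after_fix = [
    "Jun 23 09:03:59.804165 as5835-54x INFO teamd#supervisord 2021-06-23 09:03:59,803 "
    "INFO stopped: tlm_teamd (exit status 0)",
]

print(len(find_unexpected_exits(before_fix)))  # 1
print(len(find_unexpected_exits(after_fix)))   # 0
```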

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012

Description for the changelog

  • Because pkill signals every process whose name contains the pattern,
    tlm_teamd is also killed when the script only intends to kill teamd.
  • Add the '-x' option so that the process name must match exactly.

A picture of a cute animal (not mandatory but encouraged)

@kuanyu99 kuanyu99 requested a review from lguohan as a code owner June 29, 2021 02:43
@kuanyu99 kuanyu99 force-pushed the ky_fix-teamd-misskill branch from c3957f9 to a80584a October 1, 2021 02:42

kuanyu99 commented Oct 1, 2021

@lguohan Could you help me restart the tests? I checked the report and I think the failure is not related to my change.

kuanyu99 commented Oct 6, 2021

/azpw run

@mssonicbld
Collaborator

/AzurePipelines run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).
