
[BUG] Minion sends return events to all masters #62834

Closed

frebib opened this issue Oct 7, 2022 · 11 comments · Fixed by #65053
Labels
Bug (broken, incorrect, or confusing behavior) · Core (relates to code central or existential to Salt) · Phosphorus v3005.0 (Release code name and version) · Regression (The issue is a bug that breaks functionality known to work in previous releases.)

Comments

@frebib
Contributor

frebib commented Oct 7, 2022

Description
Since 3005, minions seem to send job return data to all connected masters, not just the one master that initiated the job. Before #61468, the minion would only send return data to the master whose connected minion process spawned the job.

The increased number of job returns caused inode exhaustion on the global master (see below), as far more jobs were being returned and written to the job cache. The rate of returns directly correlates with the number of connected 3005 minions. We noticed because the inode exhaustion caused the master workers to lock up, producing timeouts in the minion logs that spiked in our log collection.

Here you can see that the rate of job returns coincides with the number of minions running 3005. The blackouts are presumably the master being hosed due to inode exhaustion.
[images: job return rate vs. number of minions running 3005]

Setup
We have a large setup with thousands of minions. They're split into clusters of "a few" to "hundreds" of nodes, each of which has one or more masters. All minions are also connected to a single "global" master that we use for ad-hoc jobs. Before upgrading minions to 3005, the global master would only see returns for jobs it issued; 3005 minions now return jobs spawned from the other masters too.

Steps to Reproduce the behavior
Connect one minion to two masters, spawn a job from one master, and observe in the master logs that the return is received on both masters.
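For reference, a minimal two-master setup for reproducing this looks roughly like the following (hostnames are placeholders):

# /etc/salt/minion on the test minion
master:
  - master-a
  - master-b

Run something like salt 'test-minion' test.ping from master-a only, and watch the event bus on both masters (for example with salt-run state.event pretty=True); on 3005 the job return also shows up on master-b.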

Expected behavior
This is where it gets a bit questionable. One could argue that this behaviour is intentional or even desirable. It's a change in behaviour from before 3005, but possibly a good one. The problem is that masters are now more likely to be DoS'ed by many events being returned at once.

Versions Report
3005 and 3005.1 both exhibit this behaviour; 3004 and 2019.2 do not.

frebib added the Bug and needs-triage labels on Oct 7, 2022
@frebib
Contributor Author

frebib commented Oct 7, 2022

For now we've had to revert minions back to 2019.2 as we can't operate like this. I'm going to try reverting #61468 to see if that resolves the issue.

Edit: Reverting that PR does indeed fix this behaviour. I'm not happy about that though.
I wonder if the logic could be re-worked to still use the persistent IPC socket, but target the return to a specific master only.
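Something along these lines, purely as a sketch of the idea rather than Salt's actual code (the function names and the __master_id field are made up):

# Sketch: tag the return event with the master that published the job,
# and have each per-master handler forward only its own returns.

def fire_return_event(event_bus, load, originating_master):
    load["__master_id"] = originating_master  # hypothetical field
    event_bus.fire_event(load, "__master_req_channel_payload")

def handle_return_event(load, this_master, req_channel):
    source = load.pop("__master_id", None)
    if source is not None and source != this_master:
        return  # return belongs to another master, drop it
    req_channel.send(load)

The return would still travel over the persistent IPC event socket, but only the channel belonging to the originating master would actually send it upstream.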

OrangeDog added the Regression, Core, and Phosphorus v3005.0 labels on Oct 7, 2022
@Jsollvander

Piggybacking on this bug report because this particular commit is causing problems for me as well on 3005 (I'm on FreeBSD, if that matters). For me it's causing all jobs to "hang" forever on the minions. For example, my scheduled state.apply jobs:

[DEBUG   ] schedule.handle_func: Removing /var/cache/salt/minion/proc/20221007081653333986
[DEBUG   ] Minion of 'master' is handling event tag '__schedule_return'
[INFO    ] Returning information for job: req
[DEBUG   ] SaltEvent PUB socket URI: /var/run/salt/minion/minion_event_5305b5d6d0_pub.ipc
[DEBUG   ] SaltEvent PULL socket URI: /var/run/salt/minion/minion_event_5305b5d6d0_pull.ipc
[DEBUG   ] Sending event: tag = __master_req_channel_payload; data = {'cmd': '_return', 'id': 'minion', 'fun': 'state.apply', 'fun_args': [], 'schedule':
 'job_state_apply', 'jid': 'req', 'pid': 98706, 'return': {'file_|-/usr/home_|-/usr/home_|-directory': {'name': '/usr/home', 'changes': {}, 'result': True, 'comment': 'The.....

Then nothing for hours.

While debugging this on my own, I ran the minion under ktrace and saw this stack trace when sending a SIGTERM while it was stuck in this state:

 32734 python3.9 GIO   fd 2 wrote 633 bytes
       "Future <salt.ext.tornado.concurrent.Future object at 0x80838fdf0> exception was never retrieved: Traceback (most recent call last):
          File "/usr/local/lib/python3.9/site-packages/salt/ext/tornado/gen.py", line 309, in wrapper
            yielded = next(result)
          File "/usr/local/lib/python3.9/site-packages/salt/minion.py", line 2920, in handle_event
            self._return_pub(data, ret_cmd="_return", sync=False)
          File "/usr/local/lib/python3.9/site-packages/salt/minion.py", line 2262, in _return_pub
            log.trace("ret_val = %s", ret_val)  # pylint: disable=no-member
        UnboundLocalError: local variable 'ret_val' referenced before assignment
       "

At first I thought this was due to an async timeout problem, but forcing sync=True didn't help. Then I stumbled on this bug report.

Reverting #61468 fixes the problem for me.

frebib added a commit to frebib/salt that referenced this issue Oct 7, 2022
Since [1], minions send all job returns to all masters, instead of just
the master that spawned the job. The upstream change in behaviour
overloads our global masters, making them unusable, so this aims to
revert to the previous behaviour whilst maintaining the single-channel
return improvements also introduced in [1].

[1]: saltstack#61468

Upstream-bug: saltstack#62834
Signed-off-by: Joe Groocock <jgroocock@cloudflare.com>
@asomers
Contributor

asomers commented Oct 7, 2022

cc @devkits

@jfindlay
Contributor

jfindlay commented Oct 7, 2022

Expected behavior

This is where it gets a bit questionable. One could argue that this behaviour is intentional or even desirable. It's a change in behaviour from before 3005, but possibly a good one. The problem is that masters are now more likely to be DoS'ed by many events being returned at once.

I think the expectation of behavior depends on the setup. If you have a minion config with

master:
- master-A
- master-B

and both masters have equivalent loads/roles, then you might want broadcast returns, I guess for redundancy? The case might be stronger if you're using master_type, master_shuffle etc.

However, in the case where master-B connects to more minions than master-A, the asymmetry makes this 'feature' look more like a 'bug'. If both behaviors are desired, it should be controlled by a minion config option.
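For example, something like the following, where the option name is purely hypothetical and does not exist in Salt:

# hypothetical minion option, not an existing Salt setting
return_to_all_masters: False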

@llua
Contributor

llua commented Mar 15, 2023

> (quoting @Jsollvander's comment above)

I am also experiencing this exact problem.

anilsil added this to the Sulfur v3006.3 milestone on May 4, 2023
@scott-sturdivant

> (quoting @Jsollvander and @llua above)

Same here, though on Amazon Linux 2 and with a single master. The minions drop off (manage.status shows them as down), and when they are manually killed they produce the same stack trace as above.

@eliasp
Contributor

eliasp commented May 24, 2023

Seeing this on 3005.1 (Master + Minion) on Ubuntu 22.04.
I can reproduce it with a single Master and just 1 Minion connected to it.

At some point it stops returning job data, and when trying to restart the service, it ends up with this:

May 24 09:44:49 smaster01 systemd[1]: Stopping The Salt Minion...
May 24 09:44:49 smaster01 salt-3005.1[1206994]: [WARNING ] Minion received a SIGTERM. Exiting.
May 24 09:44:49 smaster01 salt-3005.1[1206994]: The Salt Minion is shutdown. Minion received a SIGTERM. Exited.
May 24 09:44:49 smaster01 salt-3005.1[1206994]: The minion failed to return the job information for job req. This is often due to the master being shut down or overloaded. If the master is running, consider increasing the worker_threads value.
May 24 09:44:49 smaster01 salt-3005.1[1206994]: Future <salt.ext.tornado.concurrent.Future object at 0x7f3c540f5b50> exception was never retrieved: Traceback (most recent call last):
May 24 09:44:49 smaster01 salt-3005.1[1206994]:   File "salt/ext/tornado/gen.py", line 309, in wrapper
May 24 09:44:49 smaster01 salt-3005.1[1206994]:     yielded = next(result)
May 24 09:44:49 smaster01 salt-3005.1[1206994]:   File "salt/minion.py", line 2921, in handle_event
May 24 09:44:49 smaster01 salt-3005.1[1206994]:     self._return_pub(data, ret_cmd="_return", sync=False)
May 24 09:44:49 smaster01 salt-3005.1[1206994]:   File "salt/minion.py", line 2263, in _return_pub
May 24 09:44:49 smaster01 salt-3005.1[1206994]:     log.trace("ret_val = %s", ret_val)  # pylint: disable=no-member
May 24 09:44:49 smaster01 salt-3005.1[1206994]: UnboundLocalError: local variable 'ret_val' referenced before assignment
May 24 09:46:19 smaster01 systemd[1]: salt-minion.service: State 'stop-sigterm' timed out. Killing.

@mburring

mburring commented Jul 5, 2023

^C[TRACE   ] Waiting to kill process manager children
[DEBUG   ] Closing IPCMessageClient instance
[DEBUG   ] Closing IPCMessageSubscriber instance
[WARNING ] Minion received a SIGINT. Exiting.
[INFO    ] Shutting down the Salt Minion
[TRACE   ] Processing <function DaemonMixIn._mixin_before_exit at 0x7ff0140f6b00>
[TRACE   ] Processing <function LogLevelMixIn.__shutdown_logging at 0x7ff0140f6560>
The Salt Minion is shutdown. Minion received a SIGINT. Exited.
The minion failed to return the job information for job req. This is often due to the master being shut down or overloaded. If the master is running, consider increasing the worker_threads value.
Future <salt.ext.tornado.concurrent.Future object at 0x7ff009ac8910> exception was never retrieved: Traceback (most recent call last):
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 309, in wrapper
    yielded = next(result)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 2921, in handle_event
    self._return_pub(data, ret_cmd="_return", sync=False)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 2263, in _return_pub
    log.trace("ret_val = %s", ret_val)  # pylint: disable=no-member
UnboundLocalError: local variable 'ret_val' referenced before assignment

Same issue, with random 3005 & 3006 minions & masters. To debug, I left some minions in foreground mode with trace logging, and a few will lock up after running a scheduled highstate. I need to ^C to get the minion to dump the exception, otherwise it stays permanently locked and is identified as a zombie process. Only running one master per minion.
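(For reference, the foreground setup with trace logging is just

salt-minion -l trace

and the lockup then shows up as the minion going silent after the scheduled highstate.)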

@llua
Contributor

llua commented Aug 24, 2023

For the record: frebib@4f194e8 applies cleanly to 3006.x and I've been using it without the issue of scheduled highstates stalling indefinitely.
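(For anyone else wanting to try it: GitHub serves any commit as a plain patch, so applying it to an installed tree is roughly

curl -L https://github.com/frebib/salt/commit/4f194e8.patch | patch -p1 -d /opt/saltstack/salt/lib/python3.10/site-packages

where the target directory is only an example and depends on how Salt is installed.)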

@frebib
Contributor Author

frebib commented Aug 24, 2023

@llua The only reason I've not PR'ed that patch is that it breaks deltaproxy (???).
We've had #61468 reverted ever since it was released, because the Salt master becomes entirely unusable without the revert.

@whytewolf
Collaborator

> (quoting @frebib's comment above)

@frebib I think you were on the right track. I commented on your CR there about how to get around deltaproxy, if that was the only thing holding you back. If you can get that into a PR, even with a check for deltaproxy, it would be greatly appreciated. Even the deltaproxy failures might help; it would be good to know what is causing those.
