[BUG] Minion sends return events to all masters #62834
Comments
For now we've had to revert minions back to 2019.2 as we can't operate like this. I'm going to try reverting #61468 to see if that resolves the issue. Edit: Reverting that PR does indeed fix this behaviour. I'm not happy about that though.
Piggybacking on this bug report because this particular commit is causing problems for me as well on 3005 (I'm on FreeBSD, if that matters). For me it's causing all jobs to "hang" forever on the minions. For example, my scheduled state.apply jobs:
Then nothing for hours. While I was debugging this on my own I was running the minion with
First I thought this was due to an async timeout problem, but in the end reverting #61468 is what fixes the problem for me.
Since [1], minions send all job returns to all masters, instead of just the master that spawned the job. The upstream change in behaviour overloads our global masters, making them unusable, so this aims to revert to the previous behaviour whilst maintaining the single-channel return improvements also introduced in [1].

[1]: saltstack#61468

Upstream-bug: saltstack#62834
Signed-off-by: Joe Groocock <jgroocock@cloudflare.com>
cc @devkits
I think the expected behaviour depends on the setup. If you have a minion config with
master:
  - master-A
  - master-B
and both masters have equivalent loads/roles, then you might want broadcast returns, I guess for redundancy. The case might be stronger if you're using […]. However, in the case where […]
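To make the two setups being contrasted concrete, here is a minimal sketch of the minion configs involved. The master names are placeholders, and the failover variant uses the standard `master_type` / `master_alive_interval` minion options as an assumption about what the stripped part of the comment referred to:

```yaml
# Redundant multi-master (the setup described above): the minion keeps a
# connection to every listed master and, since 3005 / #61468, sends job
# returns to all of them.
master:
  - master-A
  - master-B

# Failover multi-master: the minion only talks to one master at a time,
# so broadcast returns are less of a concern there.
#master_type: failover
#master_alive_interval: 30
```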
I am also experiencing this exact problem.
Same here, though on Amazon Linux 2 and with a single master. The minions will drop off (manage.status shows them as down) and, when they are manually killed, they have the same stack trace as above.
Seeing this on 3005.1 (master + minion) on Ubuntu 22.04. At some point it stops returning job data, and when trying to restart the service it ends up with this:
Same issue, with random 3005 & 3006 minions & masters. To debug, I left some minions in foreground mode with trace logging, and a few will lock up after running a scheduled highstate. I need to ^C to get it to dump the exception, otherwise it stays permanently locked and shows up as a zombie process. Only running one master per minion.
For the record: frebib@4f194e8 applies cleanly to 3006.x and I've been using it without the issue of scheduled highstates stalling indefinitely.
@frebib I think you were on the right track. I commented on your CR there about how to get around deltaproxy, if that was the only thing holding you back. If you can get that into a PR, even if it checks for deltaproxy, it would be greatly appreciated. Even with the deltaproxy failures, it might help to know what is causing those.
Description
Since 3005, minions seem to send job return data to all connected masters, and not just the one master that initiated the job. Before #61468, the minion would only send return data to the one master connected to the minion process that spawned the job.
The increased number of job returns caused inode exhaustion on the global master (see below), as more jobs were being returned and stored in the job cache. The growth correlates directly with the number of connected 3005 minions. We noticed because the inode exhaustion caused the master workers to lock up, which produced timeouts in the minion logs that spiked in our log collection.
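Not a fix for the regression itself, but for anyone hitting the same job-cache pressure: growth of the master's local job cache is bounded by standard master config options. A sketch only, with illustrative values rather than recommendations:

```yaml
# /etc/salt/master -- sketch, values are illustrative
# Prune cached job returns sooner (hours to keep job data, default 24).
keep_jobs: 6

# Or skip storing minion returns in the local job cache altogether;
# note this makes past jobs unavailable to the jobs runner.
#job_cache: False
```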
Here you can see that the rate of job returns coincides with the number of minions running 3005. The blackouts are presumably the master being hosed due to inode exhaustion.
Setup
We have a large setup with thousands of minions. They're split up into clusters of "a few" to "hundreds" of nodes, each of which has one or more masters. All minions are also connected to a single "global" master that we use for ad-hoc jobs. Before upgrading minions to 3005, the global master would only see returns for jobs it issued; 3005 minions now send it returns for jobs spawned by the other masters too.
Steps to Reproduce the behavior
Connect one minion to two masters, spawn a job from one master, and observe that the return is received on both masters in the master logs.
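A minimal minion config for reproducing this; the minion ID, master names, and the test command are placeholders rather than anything from the original report:

```yaml
# /etc/salt/minion -- minimal repro sketch, names are placeholders
id: repro-minion
master:
  - master-A
  - master-B

# Then, from master-A only:
#   salt 'repro-minion' test.ping
# On a 3005+ minion the return shows up in the logs / job cache of both
# master-A and master-B; on a 3004 minion it only reaches master-A.
```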
Expected behavior
This is where it gets a bit questionable. One could argue that this behaviour is intentional, or even desirable. It's a change in behaviour from before 3005, but possibly a good one. The problem is that masters are more likely to be DoS'ed by many returns arriving at once.
Versions Report
3005 and 3005.1 both exhibit this behaviour; 3004 and 2019.2 do not.