[BUG] Minion sends return events to all masters #62834
Comments
For now we've had to revert minions back to 2019.2 as we can't operate like this. I'm going to try reverting #61468 to see if that resolves the issue. Edit: Reverting that PR does indeed fix this behaviour. I'm not happy about that though.
Piggybacking on this bug report because this particular commit is causing problems for me as well on 3005 (I'm on FreeBSD, if that matters). For me it's causing all jobs to "hang" forever on the minions. For example, my scheduled state.apply jobs:
Then nothing for hours. While I was debugging this on my own I was running the minion with
First I thought this was due to an async timeout problem, but in the end reverting #61468 is what fixes the problem for me.
Since [1], minions send all job returns to all masters, instead of just the master that spawned the job. The upstream change in behaviour overloads our global masters, making them unusable, so this aims to revert to the previous behaviour whilst maintaining the single-channel return improvements also introduced in [1].

[1]: saltstack#61468

Upstream-bug: saltstack#62834
Signed-off-by: Joe Groocock <jgroocock@cloudflare.com>
cc @devkits
I think the expected behaviour depends on the setup. If you have a minion config with
master:
  - master-A
  - master-B
and both masters have equivalent loads/roles, then you might want broadcast returns, I guess for redundancy. The case might be stronger if you're using […]. However, in the case where […]
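To make the two setups being contrasted concrete, here is a minimal sketch of the minion configs involved. The master names are placeholders, and the failover variant uses the standard `master_type` / `master_alive_interval` minion options as an assumption about what the stripped part of the comment referred to:

```yaml
# Redundant multi-master (the setup described above): the minion keeps a
# connection to every listed master and, since 3005 / #61468, sends job
# returns to all of them.
master:
  - master-A
  - master-B

# Failover multi-master: the minion only talks to one master at a time,
# so broadcast returns are less of a concern there.
#master_type: failover
#master_alive_interval: 30
```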
I am also experiencing this exact problem.
Same here, though on Amazon Linux 2 and with a single master. The minions will drop off (manage.status shows them as down) and, when they are manually killed, they have the same stack trace as above.
Seeing this on 3005.1 (master + minion) on Ubuntu 22.04. At some point it stops returning job data, and when trying to restart the service it ends up with this:
Same issue, with random 3005 & 3006 minions & masters. To debug, I left some minions in foreground mode with trace logging, and a few will lock up after running a scheduled highstate. I need to ^C to get it to dump the exception, otherwise it stays permanently locked and shows up as a zombie process. Only running one master per minion.
For the record: frebib@4f194e8 applies cleanly to 3006.x and I've been using it without the issue of scheduled highstates stalling indefinitely.
@frebib I think you were on the right track. I commented on your CR there about how to get around deltaproxy, if that was the only thing holding you back. If you can get that into a PR, even if it checks for deltaproxy, it would be greatly appreciated. Even with the deltaproxy failures, it might help to know what is causing those.
Description
Since 3005, minions seem to send job return data to all connected masters, and not just the one master that initiated the job. Before #61468, the minion would only send return data to the one master connected to the minion process that spawned the job.
The increased number of job returns caused inode exhaustion on the global master (see below), as more jobs were being returned and stored in the job cache. The growth correlates directly with the number of connected 3005 minions. We noticed because the inode exhaustion caused the master workers to lock up, which produced timeouts in the minion logs that spiked in our log collection.
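Not a fix for the regression itself, but for anyone hitting the same job-cache pressure: growth of the master's local job cache is bounded by standard master config options. A sketch only, with illustrative values rather than recommendations:

```yaml
# /etc/salt/master -- sketch, values are illustrative
# Prune cached job returns sooner (hours to keep job data, default 24).
keep_jobs: 6

# Or skip storing minion returns in the local job cache altogether;
# note this makes past jobs unavailable to the jobs runner.
#job_cache: False
```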
Here you can see that the rate of job returns coincides with the number of minions running 3005. The blackouts are presumably the master being hosed due to inode exhaustion.
Setup
We have a large setup with thousands of minions. They're split up into clusters of "a few" to "hundreds" of nodes, each of which has one or more masters. All minions are also connected to a single "global" master that we use for ad-hoc jobs. Before upgrading minions to 3005, the global master would only see returns for jobs it issued; 3005 minions now send it returns for jobs spawned by the other masters too.
Steps to Reproduce the behavior
Connect one minion to two masters, spawn a job from one master, and observe that the return is received on both masters in the master logs.
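A minimal minion config for reproducing this; the minion ID, master names, and the test command are placeholders rather than anything from the original report:

```yaml
# /etc/salt/minion -- minimal repro sketch, names are placeholders
id: repro-minion
master:
  - master-A
  - master-B

# Then, from master-A only:
#   salt 'repro-minion' test.ping
# On a 3005+ minion the return shows up in the logs / job cache of both
# master-A and master-B; on a 3004 minion it only reaches master-A.
```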
Expected behavior
This is where it gets a bit questionable. One could argue that this behaviour is intentional, or even desirable. It's a change in behaviour from before 3005, but possibly a good one. The problem is that masters are more likely to be DoS'ed by many returns arriving at once.
Versions Report
3005 and 3005.1 both exhibit this behaviour; 3004 and 2019.2 do not.