
Windows minion re-authenticates for every job #35890

Closed
szjur opened this issue Aug 30, 2016 · 19 comments
Labels
Bug (broken, incorrect, or confusing behavior) · Confirmed (Salt engineer has confirmed bug/feature - often including a MCVE) · severity-medium (3rd level, incorrect or bad functionality, confusing and lacks a work around) · Windows

Comments

szjur commented Aug 30, 2016

Description of Issue/Question

I'm not sure whether this is the same issue as #30411. I observed that Windows minions (2016.3.2) seem to authenticate for every job they process. Linux minions (on RHEL 6) of the same version work fine.

So after creating a job for a Windows minion, I can see salt/auth events on the master's event bus and in the syndic log I can see:

2016-08-30 11:23:19,142 [salt.master                                          ][INFO    ][30931] User root Published command grains.items with jid 20160830112319124713
2016-08-30 11:23:21,019 [salt.transport.mixins.auth                           ][INFO    ][30913] Authentication request from myminion.mydomain.com
2016-08-30 11:23:21,019 [salt.transport.mixins.auth                           ][INFO    ][30913] Authentication accepted from myminion.mydomain.com
2016-08-30 11:23:21,141 [salt.utils.job                                       ][INFO    ][30925] Got return from myminion.mydomain.com for job 20160830112319124713

Then in the minion's log in debug mode there is:

2016-08-30 11:23:18,738 [salt.minion      ][INFO    ][5720] User root Executing command grains.items with jid 20160830112319124713
2016-08-30 11:23:18,739 [salt.minion      ][DEBUG   ][5720] Command details {'tgt_type': 'list', 'jid': '20160830112319124713', 'tgt': ['myminion.mydomain.com'], 'ret': '', 'user': 'root', 'arg': [[]], 'fun': 'grains.items', 'master_id': 'mymaster.mydomain.com'}
2016-08-30 11:23:20 [salt.minion    ][INFO    ] Starting a new job with PID 5896
2016-08-30 11:23:20 [salt.utils.lazy][DEBUG   ] LazyLoaded grains.items
2016-08-30 11:23:20 [salt.utils.lazy][DEBUG   ] LazyLoaded direct_call.get
2016-08-30 11:23:20 [salt.minion    ][DEBUG   ] Minion return retry timer set to 6 seconds (randomized)
2016-08-30 11:23:20 [salt.minion    ][INFO    ] Returning information for job: 20160830112319124713
2016-08-30 11:23:20 [salt.transport.zeromq][DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('c:\salt\conf\pki\minion', 'myminion.mydomain.com', 'tcp://1.2.3.4:4506', 'aes')
2016-08-30 11:23:20 [salt.crypt     ][DEBUG   ] Initializing new AsyncAuth for ('c:\salt\conf\pki\minion', 'myminion.mydomain.com', 'tcp://1.2.3.4:4506')
2016-08-30 11:23:20 [salt.transport.zeromq][DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('c:\salt\conf\pki\minion', 'myminion.mydomain.com', 'tcp://1.2.3.4:4506', 'clear')
2016-08-30 11:23:20 [salt.crypt     ][DEBUG   ] Decrypting the current master AES key
2016-08-30 11:23:20 [salt.crypt     ][DEBUG   ] Loaded minion key: c:\salt\conf\pki\minion\minion.pem

That obviously can cause massive load on the syndics.

Setup

2 masters, 2 syndics, zeromq filtering enabled. Some Windows minions, some Linux ones.

Steps to Reproduce Issue

Execute any command on the Windows minion and observe the event bus on the master and /var/log/salt/master for minion authentication messages. Compare the same for a Linux minion.

Versions Report

2016.3.2

cachedout (Contributor)

Does this by chance change if you set multiprocessing: False in the minion config?

szjur (Author) commented Aug 30, 2016

Indeed it does, thanks a lot! Are there any caveats to that, or a valid reason that it defaults to True despite causing problems?
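
If I understand Windows multiprocessing right, that would explain it: Windows has no fork(), so every job runs in a spawned process that starts from a fresh interpreter with empty module state, and anything cached per process (such as negotiated credentials) is gone. A minimal plain-Python sketch (not Salt code; the cache name is made up):

```python
import multiprocessing
import os

# Stand-in for a per-process credential cache (hypothetical name).
CACHE = {}

def authenticate():
    CACHE["aes_key"] = "negotiated-with-master"  # normally done once

def run_job():
    # Under fork (the Linux default) the child inherits CACHE and finds
    # the key. Under spawn (the only start method on Windows) this module
    # is re-imported in a fresh interpreter, CACHE starts empty, and the
    # child has to authenticate again -- once per job.
    if "aes_key" not in CACHE:
        authenticate()
    print(os.getpid(), CACHE)

if __name__ == "__main__":
    authenticate()
    p = multiprocessing.Process(target=run_job)
    p.start()
    p.join()
```

With multiprocessing: False each job runs as a thread inside the one minion process, so the cache would survive between jobs, which would be why the re-authentication stops.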

szjur (Author) commented Aug 30, 2016

I have already found out that "multiprocessing: False" is almost fatal with saltutil.signal_job. When I send signal 15 to a job, it seems to kill part of the minion, which amazingly recovers, but you never get a result from that saltutil.signal_job and the minion behaves as if it had been restarted. I can see a possible reason for that: "multiprocessing: False" makes each job run in a thread, which means the same PID.

Is there a way to stop the minion from re-authenticating with the default "multiprocessing: True"? Or maybe a way to make saltutil.signal_job send a signal to a thread when multiprocessing is False? I'm not sure whether that is possible on Windows. Linux seems to have a pthread_kill() function.
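
For reference, a quick probe of what Python itself offers (plain Python, nothing Salt-specific):

```python
import signal

# pthread_kill() is exposed by Python on POSIX only; on Windows the
# attribute does not exist at all, so a threaded minion has no native
# per-thread signalling primitive there.
print(hasattr(signal, "pthread_kill"))  # True on Linux, False on Windows

# Even where it exists it would not help much: CPython always runs
# Python-level signal handlers in the main thread, regardless of which
# thread the signal was delivered to, so it cannot cleanly terminate a
# single job thread.
```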

Ch3LL (Contributor) commented Sep 1, 2016

ping @cachedout or @twangboy, can either of you guys answer @szjur's questions about multiprocessing and saltutil.signal_job? I'm a little fuzzy in this area. Thanks

Ch3LL added the Pending-Discussion label Sep 1, 2016
Ch3LL added this to the Blocked milestone Sep 1, 2016
twangboy (Contributor)

Turning multiprocessing off destroys job handling on Windows for some reason.

As far as acceptance wait time.... @UtahDave

damon-atkins (Contributor) commented Oct 21, 2016

> Linux seems to have a pthread_kill() function

My understanding is that threading in Python is not true parallel execution: the GIL time-shares the interpreter, so you cannot have more than one line of Python code running at the same time. However, if you call an external program, Python continues while the external program runs. Variables are 100% shared between Python threads, so any code that sets a global var leaves it around for the life of the process (unlike multiprocessing).
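
A tiny demo of that sharing (illustrative only, not Salt code):

```python
import threading

# One interpreter, one address space: globals set by a job thread
# persist for the life of the process.
STATE = {}

def job():
    STATE["loaded"] = "cmdmod"  # still there after the thread exits

t = threading.Thread(target=job)
t.start()
t.join()
print(STATE)  # {'loaded': 'cmdmod'} -- visible to every later job
```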

What I have noticed is that once the main salt-minion process imports modules and sets memory/variables, these stay allocated forever, e.g. cmdmod.

So "multiprocessing: False" would be a large change unless Salt-Minion still forks at least once for the execution of each incoming request. Include being hit by memory leaks?

I believe IronPython and Jython have proper threading.

tsaridas (Contributor)

@twangboy @Ch3LL can we set this as a bug please? It should be possible to kill jobs in either case (multiprocessing off/on) on Windows, as it is on Linux.

twangboy added the Bug label Jan 12, 2017
stale bot commented Aug 24, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.

stale bot added the stale label Aug 24, 2018
twangboy (Contributor)

I think this is still a problem

stale bot commented Aug 27, 2018

Thank you for updating this issue. It is no longer marked as stale.

stale bot removed the stale label Aug 27, 2018
zbarr commented Jul 12, 2019

This is very much still a problem. I currently have a process that collects grains from all hosts by running "grains.items" on "*", and during this we get a flood of auth requests from every Windows minion. This makes the master so busy that it can no longer handle my API requests and everything falls apart. If this problem has not been fixed in development, is there a good workaround? I will probably just switch to multiprocessing: False in the meantime, because I do not often kill jobs from the salt master, though I do occasionally.
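
In the meantime I may also throttle the fan-out with batching, so that only a slice of minions executes (and, on Windows, re-authenticates) at any one time. A sketch using Salt's LocalClient from the master (the 10% figure is an arbitrary choice):

```python
import salt.client

local = salt.client.LocalClient()  # must run on the master

# cmd_batch yields returns batch by batch instead of broadcasting the
# job to every minion at once, which spreads out the auth storm.
for ret in local.cmd_batch("*", "grains.items", batch="10%"):
    print(ret)
```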

zbarr commented Jul 15, 2019

@twangboy any idea on this? My master cannot currently handle the auth flood it gets every time I execute a command on a large number of Windows minions. I cannot find anywhere in the code that would tell Windows minions to behave differently for auth. I even upgraded to the latest minions and they still have the issue.

Here is an excerpt from my garbage logs:

2019-07-15 14:02:51,858 [salt.utils.lazy  :104 ][DEBUG   ][22928] LazyLoaded test.ping
2019-07-15 14:02:51,858 [salt.minion      :35  ][TRACE   ][22928] Executors list ['direct_call']
2019-07-15 14:02:51,859 [salt.loaded.int.module.test:124 ][DEBUG   ][22928] test.ping received for minion 'ch1cwwzbarr001'
2019-07-15 14:02:51,859 [salt.minion      :767 ][DEBUG   ][22928] Minion return retry timer set to 5 seconds (randomized)
2019-07-15 14:02:51,859 [salt.minion      :1941][INFO    ][22928] Returning information for job: 20190715140246445150
2019-07-15 14:02:51,859 [salt.minion      :35  ][TRACE   ][22928] Return data: {'jid': '20190715140246445150', 'return': True, 'fun': 'test.ping', 'retcode': 0, 'success': True, 'fun_args': []}
2019-07-15 14:02:51,942 [git.cmd          :722 ][DEBUG   ][22928] Popen(['git', 'version'], cwd=c:\salt\bin, universal_newlines=False, shell=None)
2019-07-15 14:02:52,003 [git.cmd          :722 ][DEBUG   ][22928] Popen(['git', 'version'], cwd=c:\salt\bin, universal_newlines=False, shell=None)
2019-07-15 14:02:52,073 [salt.transport.zeromq:132 ][DEBUG   ][22928] Initializing new AsyncZeroMQReqChannel for ('c:\\salt\\conf\\pki\\minion', 'ch1cwwzbarr001', 'tcp://10.0.65.22:4506', 'aes')
2019-07-15 14:02:52,073 [salt.crypt       :463 ][DEBUG   ][22928] Initializing new AsyncAuth for ('c:\\salt\\conf\\pki\\minion', 'ch1cwwzbarr001', 'tcp://10.0.65.22:4506')
2019-07-15 14:02:52,076 [salt.transport.zeromq:203 ][DEBUG   ][22928] Connecting the Minion to the Master URI (for the return server): tcp://10.0.65.22:4506
2019-07-15 14:02:52,084 [salt.transport.zeromq:1178][DEBUG   ][22928] Trying to connect to: tcp://10.0.65.22:4506
2019-07-15 14:02:52,084 [salt.transport.zeromq:35  ][TRACE   ][22928] Inserted key into loop_instance_map id 78520144 for key ('c:\\salt\\conf\\pki\\minion', 'ch1cwwzbarr001', 'tcp://10.0.65.22:4506', 'aes') and process 22928
2019-07-15 14:02:52,084 [salt.transport.zeromq:132 ][DEBUG   ][22928] Initializing new AsyncZeroMQReqChannel for ('c:\\salt\\conf\\pki\\minion', 'ch1cwwzbarr001', 'tcp://10.0.65.22:4506', 'clear')
2019-07-15 14:02:52,084 [salt.transport.zeromq:203 ][DEBUG   ][22928] Connecting the Minion to the Master URI (for the return server): tcp://10.0.65.22:4506
2019-07-15 14:02:52,092 [salt.transport.zeromq:1178][DEBUG   ][22928] Trying to connect to: tcp://10.0.65.22:4506
2019-07-15 14:02:52,092 [salt.transport.zeromq:35  ][TRACE   ][22928] Inserted key into loop_instance_map id 78520144 for key ('c:\\salt\\conf\\pki\\minion', 'ch1cwwzbarr001', 'tcp://10.0.65.22:4506', 'clear') and process 22928
2019-07-15 14:02:52,093 [salt.crypt       :207 ][DEBUG   ][22928] salt.crypt.get_rsa_pub_key: Loading public key
2019-07-15 14:02:52,106 [salt.crypt       :868 ][DEBUG   ][22928] Decrypting the current master AES key
2019-07-15 14:02:52,107 [salt.crypt       :199 ][DEBUG   ][22928] salt.crypt.get_rsa_key: Loading private key
2019-07-15 14:02:52,107 [salt.crypt       :179 ][DEBUG   ][22928] salt.crypt._get_key_with_evict: Loading private key
2019-07-15 14:02:52,108 [salt.crypt       :797 ][DEBUG   ][22928] Loaded minion key: c:\salt\conf\pki\minion\minion.pem
2019-07-15 14:02:52,148 [salt.crypt       :207 ][DEBUG   ][22928] salt.crypt.get_rsa_pub_key: Loading public key

zbarr commented Jul 15, 2019

Additionally, it should be noted that setting "multiprocessing: False" does not have an effect.

stale bot commented Jan 8, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.

stale bot added the stale label Jan 8, 2020
waynew added the Confirmed label Jan 8, 2020
stale bot commented Jan 8, 2020

Thank you for updating this issue. It is no longer marked as stale.

stale bot removed the stale label Jan 8, 2020
sagetherage (Contributor)

I have this on a meeting agenda to discuss and attempt to move it forward; I will update the ticket after this week with details.

sagetherage self-assigned this Jan 29, 2020
zbarr commented Jan 29, 2020

Sounds good - thanks!

sagetherage removed their assignment Feb 7, 2020
sagetherage removed the Pending-Discussion label Feb 7, 2020
sagetherage removed this from the Blocked milestone Feb 7, 2020
sagetherage (Contributor)

@cmcmarrow, assigning you this bug; can you please add priority and severity labels?

sagetherage added the severity-medium label Aug 28, 2020
twangboy self-assigned this Sep 18, 2020
sagetherage assigned xeacott and unassigned cmcmarrow May 7, 2021
twangboy modified the milestones: Approved, Sulphur v3006.0 Feb 8, 2022
anilsil removed this from the Chlorine v3007.0 milestone May 11, 2023
twangboy (Contributor)

I don't see auth messages in the minion debug log anymore on 3006.2. I believe this is fixed.
Fixed by #64944
