
salt exiting prematurely #6881

Closed
WillPlatnick opened this issue Aug 24, 2013 · 19 comments
Labels
Bug: broken, incorrect, or confusing behavior
cannot-reproduce: cannot be replicated with info/context provided
severity-medium: 3rd level, incorrect or bad functionality, confusing and lacks a work around

Comments

@WillPlatnick
Contributor

Starting in 0.16.3, I've started having issues running highstates via the command salt 'hostname' state.highstate when there's a lot of work to be done. The highstate runs, but salt terminates and returns to the command line after only 1-3 minutes with an exit code of 0. If I look up all the jobs, I can see the highstate runs and salt does a saltutil.find_job a couple of times before quitting prematurely; if I look up the highstate jid, it ran fine. This is not happening every single time, but in my tests today it was happening the majority of the time.
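
For reference, I'm looking the jobs up on the master with the jobs runner, along these lines (the jid here is just a placeholder):

salt-run jobs.list_jobs
salt-run jobs.lookup_jid 20130824123456789012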

Example Tests:
Test #1 - provision fresh VM with salt-cloud, highstate returned as it should after 3m1s
Test #2 - provision fresh VM with salt-cloud, salt exits with no data returned after 1m52s

This is the versions output for the salt master and minion:
Salt: 0.16.3
Python: 2.6.6 (r266:84292, Dec 26 2010, 22:31:48)
Jinja2: 2.5.5
M2Crypto: 0.20.1
msgpack-python: 0.1.10
msgpack-pure: Not Installed
pycrypto: 2.1.0
PyYAML: 3.10
PyZMQ: 13.1.0
ZMQ: 3.2.3

@terminalmage
Contributor

@TempSpace We'll need more information to get to the bottom of this. Premature exits with no return data can be caused by tracebacks on the minion, so the minion log (default location: /var/log/salt/minion) would be a good place to look for tracebacks. If you could run tail -f on that log file while you run your highstate, that may help you find any tracebacks that are occurring. If you see any, please post them here.
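
Concretely, something like this in two terminals (the log path is the default location, and the target is a placeholder):

# on the minion, follow the log while the job runs
tail -f /var/log/salt/minion

# on the master, kick off the highstate
salt 'hostname' state.highstate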

@WillPlatnick
Contributor Author

I'll keep an eye out for them. If a traceback happens, is there logic for the job to re-execute itself? I ask because when I went back and looked up the jid of the highstate, it always finished with all True states.

@WillPlatnick
Contributor Author

OK, I just replicated the issue and there is nothing about any tracebacks in the log. Every line just says INFO, no warnings, errors, tracebacks or anything. What should I do next?

@basepi
Contributor

basepi commented Aug 26, 2013

I think the answer here is that there's something keeping the minion from responding to the find_job query in time for some reason. I think we need to make the timeout on that command configurable, and try a longer timeout value.
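
For reference, that check-in is roughly the same as running something like the following against the minion, where <jid> stands in for the running job's id:

salt 'hostname' saltutil.find_job <jid>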

@WillPlatnick
Contributor Author

Is there any more information I can provide on this one? It has become a huge issue for us.

@basepi
Contributor

basepi commented Sep 19, 2013

Have you tried running these commands with a higher timeout? (-t 300, for example, would not check in with the minions for 5 minutes.) I wonder if repeated calls to the minion to see if it's still running its job are the culprit. If the minion gets busy enough during the highstate that it misses one of those messages, or doesn't reply quickly enough, I could see this happening.
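
In other words, something like this (the hostname is a placeholder):

salt -t 300 'hostname' state.highstate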

@basepi
Contributor

basepi commented Sep 19, 2013

I've also created a new issue for making the secondary timeout (how long salt waits after checking in with the minion) configurable. If we bump that number up it should make this much more robust: #7354
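
Once that lands, the idea is you'd bump it in the master config, along these lines (the option name here is tentative until #7354 is resolved, so check the docs for your release):

# /etc/salt/master -- tentative option name, verify against your release
gather_job_timeout: 30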

@WillPlatnick
Contributor Author

I have. The higher timeout makes no difference at all unfortunately. I had a timeout of 5 minutes in the examples above. It failed at 1 minute and 52 seconds in my 2nd test regardless of the timeout.

@basepi
Contributor

basepi commented Sep 23, 2013

Wait, so your salt command didn't even wait for the whole timeout? Did it return anything or just exit?

@WillPlatnick
Contributor Author

Correct, it didn't wait for the whole timeout. It returned absolutely nothing; it just exited with a status code of 0. Running the minion in debug mode shows nothing but INFO lines, no errors, no tracebacks. If I run the same command again, it will almost always give me the expected output of a highstate.

@basepi
Contributor

basepi commented Sep 25, 2013

OK, here's a question: how are you targeting your minions? Are you using the compound matcher or similar?

Except the changes there didn't make it into 0.16.3 or 0.16.4; they're in 0.17... hrm...
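
(By compound matcher I mean targeting along these lines, mixing grain and glob matchers; the expression is only an example:)

salt -C 'G@os:Ubuntu and web*' state.highstate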

@WillPlatnick
Contributor Author

I use different kinds. The examples above used a specific target: salt 'machinename' state.highstate. When I run states manually, I usually use glob targeting: salt 'wplatnick*' state.highstate. It happens with both.

@WillPlatnick
Contributor Author

@basepi Any further thoughts on this one?

@UtahDave
Contributor

@TempSpace does this only happen when using the salt-cloud execution module within salt?

@WillPlatnick
Contributor Author

No, it happens via salt on non-salt-cloud machines as well.

@basepi
Contributor

basepi commented Oct 21, 2013

The weird thing is that as far as I know, this is only happening to you. I wonder if there's some sort of firewall issue or similar that is causing these issues.

@basepi
Contributor

basepi commented Oct 21, 2013

After one of these exits prematurely, are the minions still reachable by the master? Can you continue to ping them?
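
For example (the target is a placeholder):

salt 'hostname' test.ping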

@cachedout
Contributor

@WillPlatnick There have been a number of changes to timeouts in recent weeks. Have you by chance been able to test any of the release candidates of 2014.1? This could very well be resolved.

@cachedout
Contributor

Hi again @WillPlatnick. Given the release of 2014.1, the fact that we can't reproduce this, that no other users have reported it, and that we haven't heard back from you, I'm going to close this issue. If it's still affecting you on recent code, we'll certainly re-open. Please don't hesitate to comment here if we're closing an issue that shouldn't be. Thanks!
