
salt exiting prematurely #6881

Closed
WillPlatnick opened this issue Aug 24, 2013 · 19 comments
Labels
Bug: broken, incorrect, or confusing behavior
cannot-reproduce: cannot be replicated with info/context provided
severity-medium: 3rd level, incorrect or bad functionality, confusing and lacks a work around

Comments

@WillPlatnick
Contributor

Starting in 0.16.3, I've started having issues running highstates via the command salt 'hostname' state.highstate when there's a lot of work to be done. The highstate runs, but salt terminates and returns to the command line after only 1-3 minutes with an exit code of 0. If I look up all the jobs, I can see the highstate runs and salt does a saltutil.find_job a couple of times before quitting prematurely; if I look up the highstate jid, it ran fine. This is not happening every single time, but in my tests today it was happening the majority of the time.
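
For reference, I'm looking the jobs up on the master with the jobs runner, along these lines (the jid here is just a placeholder):

salt-run jobs.list_jobs
salt-run jobs.lookup_jid 20130824123456789012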

Example Tests:
Test #1 - provision fresh VM with salt-cloud, highstate returned as it should after 3m1s
Test #2 - provision fresh VM with salt-cloud, salt exits with no data returned after 1m52s

This is the versions output for the salt master and minion:
Salt: 0.16.3
Python: 2.6.6 (r266:84292, Dec 26 2010, 22:31:48)
Jinja2: 2.5.5
M2Crypto: 0.20.1
msgpack-python: 0.1.10
msgpack-pure: Not Installed
pycrypto: 2.1.0
PyYAML: 3.10
PyZMQ: 13.1.0
ZMQ: 3.2.3

@terminalmage
Contributor

@TempSpace We'll need more information to get to the bottom of this. Premature exits with no return data can be caused by tracebacks on the minion, so the minion log (default location: /var/log/salt/minion) would be a good place to look for tracebacks. If you could run tail -f on that log file while you run your highstate, that may help you find any tracebacks that are occurring. If you see any, please post them here.
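
Concretely, something like this in two terminals (the log path is the default location, and the target is a placeholder):

# on the minion, follow the log while the job runs
tail -f /var/log/salt/minion

# on the master, kick off the highstate
salt 'hostname' state.highstate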

@WillPlatnick
Contributor Author

I'll keep an eye out for them. If a traceback happens, is there logic for the job to re-execute itself? I ask because when I went back and looked up the jid of the highstate, it always finished with all True states.

@WillPlatnick
Contributor Author

OK, I just replicated the issue and there is nothing about any tracebacks in the log. Every line just says INFO, no warnings, errors, tracebacks or anything. What should I do next?

@basepi
Contributor

basepi commented Aug 26, 2013

I think the answer here is that there's something keeping the minion from responding to the find_job query in time for some reason. I think we need to make the timeout on that command configurable, and try a longer timeout value.
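
For reference, that check-in is roughly the same as running something like the following against the minion, where <jid> stands in for the running job's id:

salt 'hostname' saltutil.find_job <jid>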

@WillPlatnick
Contributor Author

Is there any more information I can provide on this one? It has become a huge issue for us.

@basepi
Contributor

basepi commented Sep 19, 2013

Have you tried running these commands with a higher timeout? (-t 300, for example, would not check in with the minions for 5 minutes.) I wonder if repeated calls to the minion to see if it's still running its job are the culprit. If the minion gets busy enough during the highstate that it misses one of those messages, or doesn't reply quickly enough, I could see this happening.
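
In other words, something like this (the hostname is a placeholder):

salt -t 300 'hostname' state.highstate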

@basepi
Contributor

basepi commented Sep 19, 2013

I've also created a new issue for making the secondary timeout (how long salt waits after checking in with the minion) configurable. If we bump that number up it should make this much more robust: #7354
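
Once that lands, the idea is you'd bump it in the master config, along these lines (the option name here is tentative until #7354 is resolved, so check the docs for your release):

# /etc/salt/master -- tentative option name, verify against your release
gather_job_timeout: 30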

@WillPlatnick
Contributor Author

I have. The higher timeout makes no difference at all unfortunately. I had a timeout of 5 minutes in the examples above. It failed at 1 minute and 52 seconds in my 2nd test regardless of the timeout.

@basepi
Contributor

basepi commented Sep 23, 2013

Wait, so your salt command didn't even wait for the whole timeout? Did it return anything or just exit?

@WillPlatnick
Contributor Author

Correct, it didn't wait for the whole timeout. It returned absolutely nothing; it just exited with a status code of 0. Running the minion in debug mode shows nothing but INFO lines, no errors, no tracebacks. If I run the same command again, it will almost always give me the expected output of a highstate.

@basepi
Contributor

basepi commented Sep 25, 2013

OK, here's a question: how are you targeting your minions? Are you using the compound matcher or similar?

Except the changes there didn't make it into 0.16.3 or 0.16.4; they're in 0.17... hrm...
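
(By compound matcher I mean targeting along these lines, mixing grain and glob matchers; the expression is only an example:)

salt -C 'G@os:Ubuntu and web*' state.highstate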

@WillPlatnick
Contributor Author

I use different kinds. The examples above used a specific target: salt 'machinename' state.highstate. When I run states manually, I usually use glob targeting: salt 'wplatnick*' state.highstate. It happens with both.

@WillPlatnick
Contributor Author

@basepi Any further thoughts on this one?

@UtahDave
Contributor

@TempSpace does this only happen when using the salt-cloud execution module within salt?

@WillPlatnick
Contributor Author

No, it happens via salt on non-salt-cloud machines as well.

@basepi
Contributor

basepi commented Oct 21, 2013

The weird thing is that as far as I know, this is only happening to you. I wonder if there's some sort of firewall issue or similar that is causing these issues.

@basepi
Contributor

basepi commented Oct 21, 2013

After one of these exits prematurely, are the minions still reachable by the master? Can you continue to ping them?
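
For example (the target is a placeholder):

salt 'hostname' test.ping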

@cachedout
Contributor

@WillPlatnick There have been a number of changes to timeouts in recent weeks. Have you by chance been able to test any of the release candidates of 2014.1? This could very well be resolved.

@cachedout
Contributor

Hi again @WillPlatnick. Given the release of 2014.1, the fact that we can't reproduce this, that no other users have reported it, and that we haven't heard back from you, I'm going to close this issue. If it's still affecting you on recent code, we'll certainly re-open. Please don't hesitate to comment here if we're closing an issue that shouldn't be. Thanks!
