Playbook failures with logs stopping after 5 hours #622
Here's the output of a job that ran a 5-hour sleep command in an Ansible playbook. Ansible playbook:
Fails at 5 hours.
We are working on it.

In the current implementation of Kubernetes (specifically the kubelet) there is a hard-coded time limit for connections. At a high level, the approach to solving this problem is to use Kubernetes

We ran into some “gotchas” during the implementation of the fix:

Pod logs: long lines are corrupted when using timestamps=true. Fixed in kubernetes/kubernetes#113481.

We also needed to fix something in ansible-runner so that we don't get duplicated timestamps on the last message containing the zipdata: ansible/ansible-runner#1161.

Here's our receptor PR, still WIP. We just finished the green path test; now we need to make sure that if we are deployed on an old kube we preserve the previous behavior and don't barf.
Closing as a dupe of the other issues mentioned above.
Fixed in ansible/receptor#683
ISSUE TYPE
SUMMARY
After 5 hours of a playbook running, AWX will stop providing me logs, will report the job as a failure, and provide me no error message.
I've been searching for the past few days for various timeouts in SSH, TCP, or the Kubernetes configuration for the AWX operator, but I can't find anything. Wireshark does not show any TCP FIN or TCP RST packets until after AWX has reported the playbook as a failure. I've tried setting the following client SSH options
But that doesn't seem to help either. And Kubernetes marks the pod as killed after 5 hours.
ENVIRONMENT
STEPS TO REPRODUCE
I believe one can reproduce this with a really long playbook, maybe one with a single step that lasts 5 hours, like a
sleep 18000
though I have not tried this.
EXPECTED RESULTS
Playbook continues executing, or reports an error.
ACTUAL RESULTS
The playbook stops mid-log and is marked as failed, and the Kubernetes pod is killed with
ADDITIONAL INFORMATION
My setup uses an external PostgreSQL instance. The playbook itself mostly does backup-related tasks such as executing pg_dumpall, which takes a long time because the database is quite large.
AWX-OPERATOR LOGS
None right now. If I can reproduce with a long sleep statement, then I'll share that, since I don't want to expose anything from my company.
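A long-sleep reproduction of the kind described under STEPS TO REPRODUCE could look roughly like the playbook below. This is only a minimal sketch, not a playbook from my environment: the play name, task name, and the choice of localhost as the target are assumptions made for illustration.

```yaml
# Hypothetical reproduction playbook (names and target host are assumptions).
- name: Reproduce job failure after roughly 5 hours
  hosts: localhost
  gather_facts: false
  tasks:
    # A single long-running step: 18000 seconds = 5 hours.
    - name: Sleep for 5 hours
      ansible.builtin.command: sleep 18000
```

Running this as an AWX job template should keep one task open for the full 5 hours, which keeps company-specific tasks out of any logs that get shared.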
update: job_5544.txt
Are there timeouts I'm not aware of? Or is there a place where I can get more insight into what the "automation-job" pods are actually doing? I've tried SSH-ing into the pod and executing the last command that AWX reports, and that seems to work just fine.