AWX stops gathering job output if kubernetes starts a new log #11338
Comments
This seems similar/related? kubernetes/kubernetes#28369 |
We recently saw this in an AKS deployment. I've verified a workaround by setting |
#10366 (comment) @dbanttari this might fix your issue. Note the below modification that is more generic and doesn't rely on the
|
Hello, |
I'm not sure how we would fix this without completely rearchitecting things to not rely on the kubernetes logging system. We don't have any plans to do this right now. |
Hello @shanemcd, |
I feel like running this on Kubernetes 1.20+ should be declared strictly unsupported and broken.
From #11511:
|
I agree this is a containerd/k8s issue. But it is also best practice to "limit" logs so they don't fill up filesystems. Still, relying on the Kubernetes logging system does not seem reliable. Are the following options valid? |
If kubernetes/kubernetes#59902 were implemented, wouldn't that help? I can also confirm it is an issue for us too. For some problematic playbooks with lots of output and multiple hosts, the only solution we found was to create a playbook that creates a pod which runs the playbook directly on the k8s cluster, so outside of AWX :-( In our case, increasing the log size is not an option, because we have other workloads on the cluster and increasing the log file size would drastically increase disk usage. Thanks |
Hi, same problem in k3s (RHEL 8.4); this is a real problem for us too. Fixed with the kubelet --container-log-max-size option. |
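For anyone landing here, a minimal sketch of that kubelet workaround, assuming a kubeadm-style install where the kubelet reads extra flags from /etc/default/kubelet (the 50Mi/5 values are examples, not recommendations; the file path and mechanism vary by distribution):

```sh
# Sketch only: merge with any existing KUBELET_EXTRA_ARGS rather than overwriting.
# The equivalent KubeletConfiguration fields are containerLogMaxSize and containerLogMaxFiles.
echo 'KUBELET_EXTRA_ARGS="--container-log-max-size=50Mi --container-log-max-files=5"' \
  | sudo tee /etc/default/kubelet
sudo systemctl restart kubelet
```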
Using AKS and Terraform we were able to get enough headroom by specifying:
The default, I believe, was 10MB. |
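If you are not using Terraform, a rough sketch of the same idea via the Azure CLI's custom node configuration (flag and JSON key names come from Azure's custom node configuration feature and should be verified against current AKS docs; resource names and sizes below are placeholders):

```sh
# Placeholder resource-group/cluster/pool names; the kubelet settings are applied
# via a custom kubelet config file when the node pool is created.
cat <<'EOF' > kubeletconfig.json
{
  "containerLogMaxSizeMB": 50,
  "containerLogMaxFiles": 5
}
EOF
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name jobpool \
  --kubelet-config ./kubeletconfig.json
```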
Hey, I tried your suggestion but included --container-log-max-files=3 before I read that 5 is the default value... And even though the daemonset pods failed to start with CrashLoopBackOff and I deleted them a couple of minutes later, somehow the kubelet config file now looks like this:
Funnily enough, everything works exactly as before and, as far as I can tell, nothing was removed; only the two parameters got added a gazillion times... Anyway, I would like to fix this mess and restore the default kubelet config file if possible. Could anyone suggest how to fix this, please? |
Luckily for us, I've also set up ARA, so when the log file rotates, ARA still has the complete job data. |
Any update on the resolution of this problem, @shanemcd? Please. |
Also see some good quality research in the ansible-runner issue ansible/ansible-runner#998 This is particularly a problem for inventory updates that produce large inventories, because the archive of the artifacts necessarily has to be large, and the way that is sent requires precise byte alignment or else it won't work. Trying to think of where / how a patch might be developed... a change to awx-operator might just increase the size that it can handle but not fundamentally address the issue. If I could reproduce it, maybe I could get the syntax of what the controller node reads, and figure out how to identify these breaks in the ansible-runner code. |
For me, the same behavior happens when the automation-job pod finishes and is destroyed. |
@AlanCoding to force a reproducer, you could try setting |
Any update about the resolution? |
Hi - trying to apply the mentioned workaround for k3s too, but currently I'm unsure where and how to apply the container-log-max-size configuration. Could you please give me a hint where and how you deployed that? Thanks, Andreas |
@andreasbourges
$ cat /etc/systemd/system/k3s.service
...
ExecStart=/usr/local/bin/k3s \
    server \
        '--write-kubeconfig-mode' \
        '644' \
        '--kubelet-arg' \ 👈👈👈
        'container-log-max-files=4' \ 👈👈👈
        '--kubelet-arg' \ 👈👈👈
        'container-log-max-size=50Mi' \ 👈👈👈
$ sudo systemctl daemon-reload
$ sudo systemctl restart k3s

Or re-install K3s using the install script with the specific arguments:
$ curl -sfL https://get.k3s.io | sh -s - --write-kubeconfig-mode 644 --kubelet-arg "container-log-max-files=4" --kubelet-arg "container-log-max-size=50Mi" |
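Another way to pass the same kubelet arguments on K3s, assuming a version recent enough to read /etc/rancher/k3s/config.yaml (values are just examples; merge with any existing config rather than overwriting):

```sh
# Sketch: set the kubelet args via the K3s config file instead of editing the systemd unit.
sudo mkdir -p /etc/rancher/k3s
cat <<'EOF' | sudo tee /etc/rancher/k3s/config.yaml
write-kubeconfig-mode: "644"
kubelet-arg:
  - "container-log-max-files=4"
  - "container-log-max-size=50Mi"
EOF
sudo systemctl restart k3s
```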
Thanks a lot - this saves me a lot of time! Highly appreciated! |
Hi! I have been reading all your valuable information and tests for the last few days, because we are facing the same issue. I just wanted to let you know that, in our case, changing the log configuration didn't solve it, so I'm afraid we have hit the wall of the "4h maximum connection time" mentioned in this other issue: |
Hi,
thanks for sharing your experience - I adjusted the amount of data processed in the playbook (way too much information was gathered via an HTTP request to Autobot) *and* I adjusted the container log size. What I can tell is that opening the output log from the aborted tasks still triggers uwsgi consuming GBs of memory. We're hoping that adjusting memory consumption will prevent the tasks from failing and thus avoid the uwsgi problem. Well, let's see.
Thanks,
Andreas
Edit by Gundalow to tidy up formatting
|
Just want to say that increasing the size allowed before log rotation does not solve the problem. It just decreases the likelihood that your run will hit or trigger a log rotation that causes the error. It's not a solution to this problem in any way. |
Hi Braden,
I totally agree - this is a non-deterministic workaround, but in combination with the reduction of the generated logs we're running fine so far - not a single error since the change (before, we had a 50:50 chance). And in the absence of any alternatives, we are happy to have this one. Do you know of any other ways to solve this?
Thanks,
Andreas
|
fixed in ansible/receptor#683 |
Not for me — I seem to still get only the first chunk of logs. Running AWX receptor version 1.4.1 against OpenShift 3.11's (I know, I know) Kubernetes 1.11.0+d4cacc0. It is worth noting that the
Indeed. I seem to remember that at some point, AWX used to rely on RabbitMQ for this task? |
... You see, umm, we haven't applied `-t awx` for a long time, because [that would break logging in AWX](ansible/awx#11338 (comment)) 🤦
This issue seems finally fixed for me with Kubernetes 1.29. |
As seen in ansible/awx#11338 and ansible/receptor#446 - Force `RECEPTOR_KUBE_SUPPORT_RECONNECT` as per ansible/receptor#683 - Pump up timeouts thereof
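For reference, a rough sketch of enabling that receptor reconnect support through the AWX custom resource, assuming the awx-operator's ee_extra_env field and an AWX instance named awx in the awx namespace (names are placeholders; adjust to your deployment):

```sh
# Placeholder CR name/namespace; ee_extra_env adds env vars to the awx-ee
# container, which is where receptor streams the job pod logs from.
cat <<'EOF' > reconnect-patch.yaml
spec:
  ee_extra_env: |
    - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
      value: enabled
EOF
kubectl patch awx awx -n awx --type merge --patch-file reconnect-patch.yaml
```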
Please confirm the following
Summary
If a very large job is run, sometimes Kubernetes will start a new log, but AWX's reporting of job progress stalls. Eventually the job ends as an Error, even though the job itself continues to completion in the pod.
AWX version
19.4.0
Installation method
kubernetes
Modifications
yes
Ansible version
2.11.3.post0
Operating system
Azure Kubernetes Service 1.21.2
Web browser
Chrome
Steps to reproduce
We can tail the log of the pod in real time using kubectl
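For example, a tail along these lines (pod name and namespace are placeholders for the automation job pod created for the run):

```sh
# Placeholder pod name/namespace; substitute the automation-job pod spawned for your job.
kubectl logs -f -n awx automation-job-1234-abcde
```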
*** this window stalls until the job is complete and the pod exits ***
*** opened a new window and it resumes: ***
Note that this example is discontiguous, which implies that this may happen multiple times during a pod's life
Expected results
Expected AWX to continue reporting the progress of the job and to note its proper completion status (in this case, "Failed")
Actual results
The last item presented by AWX was the last entry in the log before it started anew:
Additional information
Ansible pod image is based on awx-ee:0.6.0 but adds things like helm, openssl, azure cli, zabbix-api, and other libraries that are used by various tasks
Installation was done via AWX Operator 0.14.0