AWX stops gathering job output if kubernetes starts a new log #11338

Closed
dbanttari opened this issue Nov 11, 2021 · 29 comments

@dbanttari

dbanttari commented Nov 11, 2021

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I am not entitled to status updates or other assurances.

Summary

If a very large job is run, Kubernetes will sometimes start a new log file, but AWX's reporting of job progress stalls. Eventually the job ends as an Error, even though the job itself continues to completion in the pod.

AWX version

19.4.0

Installation method

kubernetes

Modifications

yes

Ansible version

2.11.3.post0

Operating system

Azure Kubernetes Service 1.21.2

Web browser

Chrome

Steps to reproduce

We can tail the log of the pod in real time using kubectl:

->k logs -f -n awx automation-job-29506-2qlhn
...
...
...
{"uuid": "021508bc-6424-4b63-a394-a42f0aaeab36", "counter": 42648, "stdout": "", "start_line": 23274, "end_line": 23274, "runner_ident": "29506", "event": "runner_on_start", "job_id": 29506, "pid": 19, "created": "2021-11-11T19:51:19.974410", "parent_uuid": "261db21c-7d3e-ca0c-65e3-000000042eae", "event_data": {"playbook": "site.yml", "playbook_uuid": "3cc1df7c-23bc-4641-8262-1eaf9a73ea2f", "play": "tag_role_edapp:tag_role_eda:tag_role_edweb:tag_role_edw:tag_role_rhb*:tag_role_rwb1*:tag_role_vhb*:tag_role_vtxa*:tag_role_vtxw*:tag_role_vweb:tag_role_vws1*:tag_role_vws2*:tag_role_vwsa:tag_role_wvd1", "play_uuid": "261db21c-7d3e-ca0c-65e3-00000000031f", "play_pattern": "tag_role_edapp:tag_role_eda:tag_role_edweb:tag_role_edw:tag_role_rhb*:tag_role_rwb1*:tag_role_vhb*:tag_role_vtxa*:tag_role_vtxw*:tag_role_vweb:tag_role_vws1*:tag_role_vws2*:tag_role_vwsa:tag_role_wvd1", "task": "Create component in udeploy", "task_uuid": "261db21c-7d3e-ca0c-65e3-000000042eae", "task_action": "uri", "task_args": "", "task_path": "/runner/project/roles/udeploy-agent/tasks/add_component.yml:26", "role": "udeploy-agent", "host": "**********", "uuid": "021508bc-6424-4b63-a394-a42f0aaeab36"}}

*** this window stalls until the job is complete and the pod exits ***

*** opened a new window and it resumes: ***

->k logs -f -n awx automation-job-29506-2qlhn
{"uuid": "7193d6f6-1f7e-48c6-8f4d-b764817d31cb", "counter": 56862, "stdout": "\u001b[0;32mok: [dvuc-rhb1b-vm0]\u001b[0m", "start_line": 32869, "end_line": 32870, "runner_ident": "29506", "event": "runner_on_ok", "job_id": 29506, "pid": 19, "created": "2021-11-11T20:03:21.402116", "parent_uuid": "261db21c-7d3e-ca0c-65e3-0000000505be", "event_data": {"playbook": "site.yml", "playbook_uuid": "3cc1df7c-23bc-4641-8262-1eaf9a73ea2f", "play": "tag_role_rhb*:tag_role_rwb1*:tag_role_vtxa*:tag_role_vtxw*:tag_role_vweb:tag_role_vws1*:tag_role_vws2*:tag_role_vwsa", "play_uuid": "261db21c-7d3e-ca0c-65e3-0000000003c3", "play_pattern": "tag_role_rhb*:tag_role_rwb1*:tag_role_vtxa*:tag_role_vtxw*:tag_role_vweb:tag_role_vws1*:tag_role_vws2*:tag_role_vwsa", "task": "Make renamed executable for easier troubleshooting in multi-instance installations", "task_uuid": "261db21c-7d3e-ca0c-65e3-0000000505be", "task_action": "ansible.windows.win_copy", "task_args": "", "task_path": "/runner/project/roles/tomcat/tasks/install-tomcat.yml:20", "role": "tomcat", "host": "***********", "remote_addr": "**********", "res": {"src": "**********", "dest": "**********", "checksum": null, "size": 131072, "original_basename": "tomcat9.exe", "changed": false, "operation": "file_copy", "_ansible_no_log": false}, "start": "2021-11-11T20:03:20.256071", "end": "2021-11-11T20:03:21.401776", "duration": 1.145705, "event_loop": null, "uuid": "7193d6f6-1f7e-48c6-8f4d-b764817d31cb"}}
...
...
...
{"status": "failed", "runner_ident": "29506"}
{"zipfile": 85344}
*** zipfile contents here ***
{"eof": true}

Note that this example is discontiguous, which implies that this may happen multiple times during a pod's lifetime.

Expected results

Expected AWX to continue reporting the progress of the job and to note its proper completion status (in this case, "Failed").

Actual results

The last item presented by AWX was the last entry in the log before it started anew:

[Screenshot: Screen Shot 2021-11-11 at 2 10 28 PM]

Additional information

The Ansible pod image is based on awx-ee:0.6.0 but adds things like helm, openssl, the Azure CLI, zabbix-api, and other libraries used by various tasks.

Installation was done via AWX Operator 0.14.0

@dbanttari
Author

This seems similar/related? kubernetes/kubernetes#28369

@shanemcd
Member

We recently saw this in an AKS deployment. I've verified a workaround: set the kubelet --container-log-max-size flag to a higher value, or use containerLogMaxSize in a kubelet config file.
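
For reference, a minimal sketch of the config-file variant, assuming you can hand the kubelet a KubeletConfiguration file (the 500Mi / 5 values are only examples):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: 500Mi
containerLogMaxFiles: 5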

@chrismeyersfsu
Member

#10366 (comment) @dbanttari this might fix your issue. Note that the modification below is more generic and doesn't rely on max-pods being set to 30. Can you try it and report back?

if ! $(grep q container-log-max-size /etc/default/kubelet); then sed -E -i 's/max-pods=([0-9]*)/max-pods=\1 --container-log-max-size=500Mi/g' /etc/default/kubelet; systemctl restart kubelet; fi
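(Note: grep q above is presumably a typo for grep -q, and the $( ) wrapper executes grep's output instead of testing its exit code, so the guard as written never takes effect, which would explain the duplicated flags reported further down. A corrected sketch, assuming the kubelet flags live in /etc/default/kubelet:)

# append the flag only if it is not already present, then restart the kubelet
if ! grep -q container-log-max-size /etc/default/kubelet; then
  sed -E -i 's/max-pods=([0-9]+)/max-pods=\1 --container-log-max-size=500Mi/' /etc/default/kubelet
  systemctl restart kubelet
fi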

@Tioborto
Contributor

Hello,
We are seeing the same issue with our AWX instance (same version).
We are on a GKE cluster, so we don't have control over the kubelet configuration.
Do you plan to fix this soon?

@shanemcd
Member

I'm not sure how we would fix this without completely rearchitecting things to not rely on the kubernetes logging system. We don't have any plans to do this right now.

@Tioborto
Contributor

Hello @shanemcd,
This is a real problem for us. We have users with jobs in "ERROR" status while we have the full logs in our external tool.
Why is this not a top priority? This is a regression from previous versions.

@bschaeffer

I'm not sure how we would fix this without completely rearchitecting things to not rely on the kubernetes logging system. We don't have any plans to do this right now.

I feel like running this on Kubernetes 1.20+ should be called out as strictly unsupported and broken.

@bschaeffer

From #11511:

FWIW, this does seem to be related to the container runtime. In GKE, we recently upgraded to 1.20, which defaults to containerd, at the same time we saw this error. When we rolled back to 1.19, it was also broken, but we realised it was still using containerd://1.4.8.

When we switched the runtime to docker://19.3.15, it actually fixed our problem.

So things are now in a working state, but it will be EOL for us in about 5 months.

@MalfuncEddie

I'm not sure how we would fix this without completely rearchitecting things to not rely on the kubernetes logging system. We don't have any plans to do this right now.

I agree this is a containerd/k8s issue, but it is also best practice to limit logs so they don't fill up filesystems.

Still, relying on the Kubernetes logging system does not seem reliable.

Are the following options valid?
Move to OpenShift?
Is it possible to move the execution nodes to Podman (and does Podman have this issue)?

@oliverf1

oliverf1 commented Feb 1, 2022

I'm not sure how we would fix this without completely rearchitecting things to not rely on the kubernetes logging system. We don't have any plans to do this right now.

If this kubernetes/kubernetes#59902 was implemented, would not that help?

I can also confirm it is an issue for us too. For some problematic playbooks with lots of output and many hosts, the only solution we found was to create a playbook that spins up a pod which runs the target playbook directly on the k8s cluster, outside of AWX :-( In our case, increasing the log size is not an option: we have other workloads on the cluster, and a larger log file size would drastically increase disk usage.

Thanks
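
(For illustration only, a rough sketch of the out-of-band workaround described above; the image name, namespace, and playbook/inventory paths are hypothetical and would need to match your environment:)

apiVersion: v1
kind: Pod
metadata:
  name: manual-playbook-run     # hypothetical name
  namespace: awx
spec:
  restartPolicy: Never
  containers:
    - name: runner
      image: quay.io/ansible/awx-ee:latest   # assumed execution environment image
      # assumes the project and inventory are baked into the image or mounted via volumes
      command: ["ansible-playbook", "-i", "/runner/inventory/hosts", "/runner/project/site.yml"]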

@chris93111
Contributor

chris93111 commented Feb 24, 2022

Hi, same problem on k3s; this is a real problem for us as well.

Rhel 8.4
Version: 0.1.0
RuntimeName: containerd
RuntimeVersion: v1.5.8-k3s1
RuntimeApiVersion: v1alpha2
k3s 1.22

Fixed with kubelet --container-log-max-size

@jogleasonjr
Contributor

Using AKS and Terraform we were able to get enough headroom by specifying:

resource "azurerm_kubernetes_cluster_node_pool" "pool" { 
  ...

  kubelet_config {
    container_log_max_line    = 5
    container_log_max_size_mb = 500
  }
  ...
}

The default, I believe, was 10MB.

@def1nity

#10366 (comment) @dbanttari this might fix your issue. Note the below modification that is more generic and doesn't rely on the max-pods being set to 30. Can you try it and report back?

if ! $(grep q container-log-max-size /etc/default/kubelet); then sed -E -i 's/max-pods=([0-9]*)/max-pods=\1 --container-log-max-size=500Mi/g' /etc/default/kubelet; systemctl restart kubelet; fi

Hey, I tried your suggestion but included --container-log-max-files=3 before I read that 5 is the default value... And even though the daemonset pods failed to start with CrashLoopBackOff and I deleted them a couple of minutes later, somehow the kubelet config file now looks like this:

KUBELET_FLAGS=--address=0.0.0.0 --anonymous-auth=false --authentication-token-webhook=true --authorization-mode=Webhook --azure-container-registry-config=/etc/kubernetes/azure.json --cgroups-per-qos=true --client-ca-file=/etc/kubernetes/certs/ca.crt --cloud-config=/etc/kubernetes/azure.json --cloud-provider=azure --cluster-dns=10.0.0.0 --cluster-domain=cluster.local --enforce-node-allocatable=pods --event-qps=0 --eviction-hard=memory.available<750Mi,nodefs.available<10%,nodefs.inodesFree<5% --feature-gates=RotateKubeletServerCertificate=true --image-gc-high-threshold=85 --image-gc-low-threshold=80 --image-pull-progress-deadline=30m --keep-terminated-pod-volumes=false --kube-reserved=cpu=100m,memory=1843Mi --kubeconfig=/var/lib/kubelet/kubeconfig --max-pods=35 --container-log-max-size=500Mi --container-log-max-files=3 [the same two flags repeated dozens more times] --network-plugin=cni --node-status-update-frequency=10s --non-masquerade-cidr=0.0.0.0/0 --pod-infra-container-image=mcr.microsoft.com/oss/kubernetes/pause:3.5 --pod-manifest-path=/etc/kubernetes/manifests --pod-max-pids=-1 --protect-kernel-defaults=true --read-only-port=0 --resolv-conf=/run/systemd/resolve/resolv.conf --rotate-certificates=false --streaming-connection-idle-timeout=4h --tls-cert-file=/etc/kubernetes/certs/kubeletserver.crt --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_GCM_SHA256 --tls-private-key-file=/etc/kubernetes/certs/kubeletserver.key
KUBELET_REGISTER_SCHEDULABLE=true
NETWORK_POLICY=azure

KUBELET_NODE_LABELS=kubernetes.azure.com/role=agent,agentpool=nodepoold,kubernetes.azure.com/agentpool=nodepoold,storageprofile=managed,storagetier=Premium_LRS,kubernetes.azure.com/storageprofile=managed,kubernetes.azure.com/storagetier=Premium_LRS,kubernetes.azure.com/os-sku=Ubuntu,kubernetes.azure.com/cluster=MC_rg_cluster_westeurope,kubernetes.azure.com/mode=system,kubernetes.azure.com/node-image-version=AKSUbuntu-1804gen2containerd-2021.09.28,nodepool=defaultnodepoolnew

Funnily enough, everything works exactly as before and, as far as I can tell, nothing was removed; the two parameters just got added a gazillion times... Anyway, I would like to fix this mess and restore the default kubelet config file if possible.

Could anyone suggest how to fix this please?

@ThomasADavis

Lucky for us, I've also set up ARA, so when the log file rotates, ARA still has the complete job data.

@Tioborto
Contributor

@shanemcd, any update on the resolution of this problem, please?

@AlanCoding
Member

Also see some good quality research in the ansible-runner issue ansible/ansible-runner#998

This is particularly a problem for inventory updates that produce large inventories, because the archive of the artifacts necessarily has to be large, and the way that is sent requires precise byte alignment or else it won't work.

Trying to think of where / how a patch might be developed... a change to awx-operator might just increase the size that it can handle but not fundamentally address the issue. If I could reproduce it, maybe I could get the syntax of what the controller node reads, and figure out how to identify these breaks in the ansible-runner code.

@cerberek

For me, the same behavior happens when the automation-job pod finishes and is destroyed.
In the pod logs I could see that everything completed correctly, but the AWX job ends in Error with truncated log output.

@kdelee
Member

kdelee commented Jul 27, 2022

@AlanCoding to force a reproducer, you could try setting kubelet --container-log-max-size to a very small number
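
(For example, on a throwaway K3s node something like the following should force a rotation after only a few kilobytes of job output; the 10Ki / 2 values are purely illustrative:)

curl -sfL https://get.k3s.io | sh -s - --kubelet-arg "container-log-max-size=10Ki" --kubelet-arg "container-log-max-files=2"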

@ethemcemozkan
Contributor

Any update on a resolution?

@andreasbourges

Hi same problem in k3s, this is a real problem for us also

Rhel 8.4 Version: 0.1.0 RuntimeName: containerd RuntimeVersion: v1.5.8-k3s1 RuntimeApiVersion: v1alpha2 k3s 1.22

Fixed with kubelet --container-log-max-size

Hi - I'm trying to apply the mentioned workaround for k3s too, but I'm currently unsure where and how to apply the container-log-max-size configuration. Could you please give me a hint where and how you deployed it?

Thanks,

Andreas

@kurokobo
Contributor

@andreasbourges
Modify /etc/systemd/system/k3s.service and restart K3s,

$ cat /etc/systemd/system/k3s.service
...
ExecStart=/usr/local/bin/k3s \
    server \
        '--write-kubeconfig-mode' \
        '644' \
        '--kubelet-arg' \     👈👈👈
        'container-log-max-files=4' \     👈👈👈
        '--kubelet-arg' \     👈👈👈
        'container-log-max-size=50Mi' \     👈👈👈

$ sudo systemctl daemon-reload
$ sudo systemctl restart k3s

Or re-install K3s using script with specific arguments,

$ curl -sfL https://get.k3s.io | sh -s - --write-kubeconfig-mode 644 --kubelet-arg "container-log-max-files=4" --kubelet-arg "container-log-max-size=50Mi"

@andreasbourges

(quoting kurokobo's K3s instructions above)

Thanks a lot - this saves me a lot of time!

Highly appreciated!

@anasaizg

Hi! I have been reading all your valuable information and tests over the last few days, because we are facing the same issue. I just wanted to let you know that, in our case, changing the log configuration didn't solve it, so I'm afraid we have hit the wall of the "4h maximum connection time" discussed in this other issue:

#11805 (comment)

@andreasbourges

andreasbourges commented Sep 27, 2022 via email

@bschaeffer

Just want to say that increasing the size before log rotation does not solve the problem. It just decreases the likelihood that your run will hit or trigger a log rotation that causes the error. It's not a solution to this problem in any way.

@andreasbourges

andreasbourges commented Oct 12, 2022 via email

@TheRealHaoLiu
Member

fixed in ansible/receptor#683
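
(For anyone applying this through awx-operator: a hedged sketch of enabling the receptor reconnect support via the RECEPTOR_KUBE_SUPPORT_RECONNECT setting mentioned further down, assuming the operator's ee_extra_env field accepts it:)

spec:
  ee_extra_env: |
    - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
      value: enabled     # assumption: 'enabled' forces the reconnect code path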

@domq

domq commented Jul 5, 2023

fixed in ansible/receptor#683

Not for me — I seem to still get only the first chunk of logs. Running AWX receptor version 1.4.1 against OpenShift 3.11's (I know, I know) Kubernetes 1.11.0+d4cacc0.

It is worth noting that the /tmp/receptor/awx-0/*/stdout files do not end with (or contain) an { "eof": true } marker in this case, indicating that Receptor at least understands that it has work remaining.

I'm not sure how we would fix this without completely rearchitecting things to not rely on the kubernetes logging system.

Indeed. I seem to remember that at some point, AWX used to rely on RabbitMQ for this task?

domq pushed a commit to epfl-si/wp-ops that referenced this issue Dec 12, 2023
... You see, umm, we haven't applied `-t awx` for a long time, because [that would break logging in AWX](ansible/awx#11338 (comment)) 🤦
@Klaas-
Contributor

Klaas- commented Jan 9, 2024

this issue seems finally fixed for me with kubernetes 1.29

domq pushed a commit to epfl-si/wp-ops that referenced this issue Sep 27, 2024
As seen in ansible/awx#11338 and ansible/receptor#446

- Force `RECEPTOR_KUBE_SUPPORT_RECONNECT` as per ansible/receptor#683
- Pump up timeouts thereof