AWX RuntimeError resulting in loss of worker task logs and true outcome #9961

duntonr · 2021-04-20T18:30:27Z

ISSUE TYPE

Bug Report

SUMMARY

AWX UI reporting stalls eventually errors out with unhelpful

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1397, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 2957, in run
    return self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3008, in _run_internal
    raise RuntimeError(detail)

In reality, though, the job has actually continued and either completed successfully OR failed due to a normal (helpful) job/task/play error. This is found by looking at the spawned worker container logs.

The UI error IS raised at the same time as when the JOB completes.

Downloading logs from the UI results in the INCOMPLETE log set, eg not the apparent zip file the worker container uploads back

AWX reports the job as a failure in the job list, even though it succeeded from the actual container logs. This is VERY confusing.

ENVIRONMENT

AWX version: 19.0.0
AWX install method: AWX Operator 0.8.0 via microk8s 1.4.4
Ansible version: 2.11.0b4
Operating System: Ubuntu 20.04
Web Browser: Brave 1.23.71 / Chromium 90.0.4430.72

STEPS TO REPRODUCE

Use standard AWX EE 0.1.1 or derived custom EE container for a long-running/verbose job
Track job progress from both AWX UI as well as container logs from k8s (dashboard, kubectl, etc)
After some point, the UI stops tracking
Continue to monitor the container output of the job
Compare results from job container output to AWX UI (note you need to adjust the cleanup timing in AWX or be on top of tailing / dumping the container logs as the container is cleaned up once the job exits

EXPECTED RESULTS

AWX UI Job output continues to stay synced with actual job output and status match

ACTUAL RESULTS

AWX UI logging falls out of sync from execution and reports job failed with .../tasks.py error, despite actual job outcome. If there was an actual Job error, this is not displayed (so could not tell WHY a job failed without worker pod logs).

ADDITIONAL INFORMATION

After continuously tailing the worker pod logs, I was able to see this:

...
{"uuid": "cfed869a-3ecb-4919-9301-5479b816bec1", "counter": 10218, "stdout": "\r\nPLAY RECAP *********************************************************************\r\n\u001b[0;33mconsul01.internal.therack.io\u001b[0m : \u001b[0;32mok=109 \u001b[0m \u001b[0;33mchanged=39  \u001b[0m unreachable=0    failed=0    \u001b[0;36mskipped=37  \u001b[0m rescued=0    \u001b[1;35mignored=4   \u001b[0m\r\n\u001b[0;33mconsul02.internal.therack.io\u001b[0m : \u001b[0;32mok=97  \u001b[0m \u001b[0;33mchanged=27  \u001b[0m unreachable=0    failed=0    \u001b[0;36mskipped=49  \u001b[0m rescued=0    \u001b[1;35mignored=1   \u001b[0m\r\n\u001b[0;33mconsul03.internal.therack.io\u001b[0m : \u001b[0;32mok=97  \u001b[0m \u001b[0;33mchanged=27  \u001b[0m unreachable=0    failed=0    \u001b[0;36mskipped=49  \u001b[0m rescued=0    \u001b[1;35mignored=1   \u001b[0m\r\n\u001b[0;33mmaas.internal.therack.io\u001b[0m   : \u001b[0;32mok=63  \u001b[0m \u001b[0;33mchanged=21  \u001b[0m unreachable=0    failed=0    \u001b[0;36mskipped=22  \u001b[0m rescued=0    \u001b[1;35mignored=3   \u001b[0m\r\n\u001b[0;33mmarathon.internal.therack.io\u001b[0m : \u001b[0;32mok=64  \u001b[0m \u001b[0;33mchanged=22  \u001b[0m unreachable=0    failed=0    \u001b[0;36mskipped=21  \u001b[0m rescued=0    \u001b[1;35mignored=3   \u001b[0m\r\n", "start_line": 30927, "end_line": 30935, "runner_ident": "209", "event": "playbook_on_stats", "job_id": 209, "pid": 20, "created": "2021-04-20T17:51:16.379887", "parent_uuid": "87632333-0e50-47fe-8754-3fe63b1b5216", "event_data": {"playbook": "playbooks/controlDeck/common/configureServices.yml", "playbook_uuid": "87632333-0e50-47fe-8754-3fe63b1b5216", "changed": {"consul01.internal.therack.io": 39, "consul02.internal.therack.io": 27, "consul03.internal.therack.io": 27, "marathon.internal.therack.io": 22, "maas.internal.therack.io": 21}, "dark": {}, "failures": {}, "ignored": {"consul01.internal.therack.io": 4, "consul02.internal.therack.io": 1, "consul03.internal.therack.io": 1, "marathon.internal.therack.io": 3, "maas.internal.therack.io": 3}, "ok": {"consul01.internal.therack.io": 109, "consul02.internal.therack.io": 97, "consul03.internal.therack.io": 97, "marathon.internal.therack.io": 64, "maas.internal.therack.io": 63}, "processed": {"consul01.internal.therack.io": 1, "consul02.internal.therack.io": 1, "consul03.internal.therack.io": 1, "marathon.internal.therack.io": 1, "maas.internal.therack.io": 1}, "rescued": {}, "skipped": {"consul01.internal.therack.io": 37, "consul02.internal.therack.io": 49, "consul03.internal.therack.io": 49, "marathon.internal.therack.io": 21, "maas.internal.therack.io": 22}, "artifact_data": {}, "uuid": "cfed869a-3ecb-4919-9301-5479b816bec1"}}
{"status": "successful", "runner_ident": "209"}
{"zipfile": 675955}
P)h>@6aWAK2mrv0lu{6R!x41^000*a000L7003ieZEaz0WUW?PZ=....<more binary dump, I'm assuming its a zip file>

As you can see, job 209 finished successfully.

However

I also inspected the container logs for the redis, awx-web, awx-task, and awx-ee in the AWX pod but didn't see anything immediately apparently around the time AWX UI stopped tracking

The text was updated successfully, but these errors were encountered:

shanemcd · 2021-04-20T18:37:29Z

Interesting. Some questions:

Does this happen consistently?
What can you tell us about the workload itself?
- How many tasks?
- Are you producing a lot of stdout?
- Are you making heavy use of ansible facts?

duntonr · 2021-04-20T18:55:50Z

Hi @shanemcd

Does this happen consistently?
- So far, it seems intermittent or at least is not triggered if the job naturally hits an error earlier in the run. I have a limited amount of data so far (decided to rebuild my AWX environment from scratch due to issues upgrading from pre-18) and am working through configuring (and improving) my playbooks and templates along the way.
Workload
- It is a rather heavy set of plays. It does things like configure partitions/fs/exports, install docker, gusterfs, consul, nomad, vault, do system updates, etc. This is across 5 hosts.
- The UI seemed to only capture 23 plays and 445 tasks. Based on the pod log, its more like 80 plays
- It's currently sent to verbose (level 3) output and if I download the incomplete file from UI, its ~2.5M. When downloading the worker pod logs, it was getting close to 10mb (but the formatting is different there)
- It does make use of facts throughout the plays, eg to determine if a host is a gusterfs server or client, nomad server or client, etc

When running with AWX 17, via local Docker Compose, I did not have these issues for a very similar amount of tasks/stdio

I have some additional templates/playbooks to bring over, including a larger one that will run over ~36 hosts and will keep an eye out there for the same behavior

shanemcd · 2021-04-20T18:59:49Z

@duntonr Thanks for the information. We did indeed make some pretty radical changes to the architecture of AWX in version 18, so that explains why you weren't seeing this issue before.

I will spend some time tomorrow trying to reproduce this, or at very least improve the error handling.

shanemcd · 2021-04-20T19:03:43Z

@duntonr One more question: is this error the only thing you see in the stdout? Or was there output before the traceback?

duntonr · 2021-04-20T19:07:45Z

Thanks @shanemcd !

For what its worth, I came across #9917 and https://groups.google.com/g/awx-project/c/MACNtPrGpV8 as those are the only google hits for File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1397, in run

Separately, I had similar issues as reported in 9917 but was able to eventually get a custom EE container builtt and working (to include Galaxy collections) by using the workarounds mentioned there. I needed to use your suggested --build-arg ANSIBLE_RUNNER_IMAGE=quay.io/ansible/awx-ee:0.1.1

If though, I tried the (very new) --build-arg ANSIBLE_RUNNER_IMAGE=quay.io/ansible/awx-ee:0.2.0, I would get

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1397, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 2957, in run
    return self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3008, in _run_internal
    raise RuntimeError(detail)
RuntimeError: Pod Running

I would get that same error if I tried the out-of-the-box 0.2.0 quay image as well. I can report the error over there, but mentioning it here in case it helps shed any light on the AWX Runtime vs worker pod relationships

duntonr · 2021-04-20T19:29:27Z

@duntonr One more question: is this error the only thing you see in the stdout? Or was there output before the traceback?

That's an interesting thing as well... when the job starts, things are normal, eg the log starts with

Enter passphrase for /runner/artifacts/209/ssh_key_data: 
Identity added: /runner/artifacts/209/ssh_key_data (duntonr@gmail.com)
ansible-playbook [core 2.11.0b4] 
...

I can keep scrolling though the logs via AWX UI (double down carrots) until it stalls. At that point, the flashy green indicator keeps flashing to show the job is still running but no new logs.

When the job does end, the indicator will turn red, but again, no new logs/errors/etc at the bottom of the logging window

If I then refresh the Job Output screen, that's when the Traceback error shows up, at the TOP of the log

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1397, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 2957, in run
    return self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3008, in _run_internal
    raise RuntimeError(detail)
RuntimeError: Finished
Enter passphrase for /runner/artifacts/209/ssh_key_data: 
Identity added: /runner/artifacts/209/ssh_key_data (duntonr@gmail.com)
ansible-playbook [core 2.11.0b4] 
...

duntonr · 2021-04-22T17:55:53Z

Just by way of update, I have been able to run a different playbook across different "stacks" of 4 hosts/stack WITHOUT issue so far.

I've made 5 runs of a template that consists of:

4 hosts
4 plays
~528 tasks
~ 37minutes
L3 logging, resulting in download size of ~71KB log txt file when finished
uses facts

This playbook/template is a slightly lighter version of the one I was having issues with. That said it's mostly the same software and actually imports a lot of the same task files....just these are worker nodes, vs the control/server hosts where the issue was.

This is a home lab, so there aren't any crazy vlaning/security issues running between the two (same subnet, etc) host groups. The only major difference I could think of is the problematic playbook runs against x86-64 hosts, whereas this "worker" playbook runs against arm64v8 hosts. Should not make a difference for this it seems, but figured I would mention it.

Once this rollout is finished, I will run a playbook that touches all ~35 hosts and see whats happens

DrackThor · 2021-05-04T09:13:50Z

@duntonr I'm experiencing quite the same issues - I often get this output:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1397, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 2957, in run
    return self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3008, in _run_internal
    raise RuntimeError(detail)
RuntimeError: Finished

I've recently got my custom EE working on my test cluster (#9917 ), where I also dont get this error.
On my second cluster however I keep getting this RuntimeError quite frequently.

ENVIRONMENT
AWX version: 19.0.0
AWX install method: AWX Operator 0.8.0 via kubeadm 1.19.10
Ansible version - awx-ee: core 2.12.0.dev0
Operating System: Ubuntu 18.04
Web Browser: Chromium 90.0.4430.93

nlvw · 2021-05-04T23:39:51Z

I was receiving the same error, and behavior in 19.0.0. In 19.1.0 I no longer receive the error but the behavior remains. I cannot run any job with more then 3 hosts without the job failing mysteriously with cutoff output and no errors.

DrackThor · 2021-05-06T07:40:42Z

I was receiving the same error, and behavior in 19.0.0. In 19.1.0 I no longer receive the error but the behavior remains. I cannot run any job with more then 3 hosts without the job failing mysteriously with cutoff output and no errors.

I was able to get my EE running like I mentioned in #9917, maybe this also helps you 😄

nlvw · 2021-05-06T14:50:40Z

@DrackThor I'm using the builtin EE and not a custom one. I've read through the issue you mentioned but failed to see how it address the problem mentioned here.

duntonr · 2021-05-07T14:55:41Z

Thanks @DrackThor . I'm using a custom EE too but the issue seems to happen with a custom EE (with Galaxy stuff installed in the image) or the standard EE (with Galaxy stuff cobbled right into the repo that hold my playbooks)

I did run a larger set of plays... 28 hosts, ~ 3.5hrs run time, etc... issue still occurs

It's somewhat interesting that error is injected at the TOP of the output in awx.

Also, the error seems to occur quicker on my play against the 5 x86 hosts vs the play against 28 arm64v8 hosts

shanemcd · 2021-05-07T15:25:09Z

@duntonr I tried to improve the error handling in 19.1.0. Can you try again and paste the new error?

shanemcd · 2021-05-12T21:56:20Z

I think we've finally gotten to the bottom of this. Testing patch at ansible/receptor#319

For context, see: - ansible#319 - ansible/awx#9961

rysolv-bot · 2021-05-14T03:53:08Z

travelingladybug has contributed $50.00 to this issue on Rysolv.

The total bounty is now $50.00. Solve this issue on Rysolv to earn this bounty.

duntonr · 2021-05-18T21:44:54Z

@shanemcd - Sorry for the delay but 19.1 did NOT solve the issue. The same issue/behavior remains with:

Operator 0.9
AWX 19.1
Custom awx-ee image, based off quay.io/ansible/awx-ee:0.2.0 (just using Dockerfile FROM, not using ansible-builder anymore)

I was excited by ansible/receptor#319, until I read that issues updates :( . It does "feel" kinda like a lock or contention type issue though

shanemcd · 2021-06-01T12:48:12Z

In terms of what it means to test this: we shouldn't see this error anymore.

This was fixed in the latest alpha release of Receptor, which is going out in AWX 19.2.0 sometime today or tomorrow.

kdelee · 2021-06-28T14:44:42Z

should be fixed by 1ed170f

shanemcd · 2021-06-28T14:45:55Z

To be clear, the unhelpful RuntimeError: Pod Running error should be fixed by the aforementioned commit.

kazigk · 2021-07-01T13:30:46Z

1ed170f didn't fix it for me. I'm currently running:

AWX Operator 0.12.0
AWX 19.2.2 (which includes said commit)
AWX EE 0.5.0

Update:
I tried once again on the same playbook and it finished without any error. 115 hosts, 2500 lines of logs (stdout).
Sadly I can't get logs from the previous run as pod is already deleted.
I will post an update if I get any issues in near future.

kazigk · 2021-07-02T10:43:59Z

I'm still facing the issue, but in my case I'm not getting any error nor play recap in automation-job pod logs. The job finishes like it should, but the output is incomplete, so AWX can't even mark hosts as failed, which is very unhelpful. Running the same job on smaller part of inventory (or just splicing it) does solve the issue, but it's less readable (a few different stdout logs to look at isn't ideal).

Maybe I didn't understand what the commit mentioned above should fix or I'm encountering a different issue that just fits the description?

daenney · 2021-08-13T13:19:12Z

I'm observing this issue too. We initially had a run that ended with the following traceback and a bunch of log output truncated.

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1397, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 2957, in run
    return self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3008, in _run_internal
    raise RuntimeError(detail)
RuntimeError: Finished

It was really slow, so we ran it again with bigger instances powering it, and now we've ended up with:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1397, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 2957, in run
    return self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3008, in _run_internal
    raise RuntimeError(detail)
RuntimeError: Pod Running

I was following the logs with kubectl, and it ends with:

rpc error: code = Unknown desc = Error: No such container: 51b1416e42f635758859cfd2890a9f1021e48d56ff5f2006835c52bcb8483fd8

The logs right before it where just normal play output. Like the container just terminated.

Doing the same job in smaller batches seems to do the trick for us too.

daenney · 2021-08-13T13:42:29Z

We've also had a case where running with a larger batch it makes it to the end of the run, but seems to crash when flushing the recap.

The last of the logs in the pod show:

{"status": "failed", "runner_ident": "29178"}
{"zipfile": 28832}
P)h>@6a [..] some output that looks like it's straight out of The Matrix [..] 3SpWb400{"eof": true}

The traceback was:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1397, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 2957, in run
    return self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3008, in _run_internal
    raise RuntimeError(detail)
RuntimeError: Finished

davilmb · 2021-08-23T14:58:45Z

Hi @shanemcd and everyone!

I tested the fix mentioned in version 19.2.0, but without success, the same happened when upgrading to 19.2.2 and the new one release 19.3.0.

For the tests I used the awx-ee default (0.5.0) and also the 0.6.0 even the latest tag, always the same problem, after 4 hours of execution the Job is marked with "Error", even if it's terminate successfully in container log.

As an example below:

Note: I also ran the test on a new installation with version 18.0.0 (where there was no report of the problem), but I had the same problem and this didn't happen in the version that I was using previously (14.1.0).

Is there any palliative measure that we can use in this cases ?

Awaiting further comments regarding this issue.

nlvw · 2021-08-29T19:56:40Z

I've only found 2 solutions to this problem. Either have your templates split your job using a high number of slices or revert back to an earlier version of AWX before all of this Kubernetes only refactor stuff began. I really hope this issue gets more attention as it has made the product unusable.

If you deploy AWX 16 (before the incomplete UI redesign for stability) you can update ansible within the docker container to 10.x relatively safely. This will give you the git collection downloads from the requirements.yml file. The steps are relatively simple just anter a bash shell in the ansible container, yum uninstall ansible, and then pip install the desired ansible version.

nicovs · 2021-10-15T07:30:50Z

See: #10366 (comment)

kladiv · 2021-10-17T08:04:58Z

Hi @nicovs ,
i tested the suggested changes (in #10366 comment) in our infrastructure (k3s) but the job still failing after 4:02 hours.

AWX 19.3.0
AWX EE (latest) built-in installed by AWX installation

Job is a Workflow Job, and one playbook runs a task like below:

  raw: "ps -ef | grep -w /opt/xensource/sm/LVMoISCSISR | grep -v grep | grep -wq vdi_delete"
  register: quiesce_ps
  failed_when: false
  until: quiesce_ps.rc == 1
  retries: "{{ quiesce_wait_max_retries }}"
  delay: "{{ quiesce_wait_retries_delay }}"
  become: no
  delegate_to: "{{ xen_pool_master_inventory_hostname }}"

or like below (i tried a change to check if got same error):

  shell: >-
    RC=0;
    while [ $RC -eq 0 ]; do
      sleep 60;
      ps -ef | grep -w /opt/xensource/sm/LVMoISCSISR | grep -v grep | grep -wq vdi_delete;
      RC=$?;
    done
  register: quiesce_status
  async: 10800 # 3 hrs
  poll: 60 # 1 min
  become: no
  delegate_to: "{{ xen_pool_master_inventory_hostname }}"

Both the until/retries/delay task and async/poll task fails the Job without any error after about 4hrs. Every time it runs, it fails after 4 hrs.

Another playbook task (it makes XenServer big VM export via command module) fails the Job after about 14hrs without any error:

Below the logs i see in /var/log/pods/default_awx-777854cdfb-z2bs4_b24bc0a5-ca74-4561-89ed-378ddbed4d08/awx-task/1.log

2021-10-17T07:43:48.929344109+08:00 stderr F 2021-10-16 23:43:48,928 INFO     [ba8cf785179d40f19847119a0a9dfef2] awx.main.commands.run_callback_receiver Event processing is finished for Job 1090, sending notifications
2021-10-17T07:43:48.929927511+08:00 stderr F 2021-10-16 23:43:48,928 INFO     [ba8cf785179d40f19847119a0a9dfef2] awx.main.commands.run_callback_receiver Event processing is finished for Job 1090, sending notifications
2021-10-17T07:43:49.42528382+08:00 stderr F 2021-10-16 23:43:49,409 DEBUG    [ba8cf785179d40f19847119a0a9dfef2] awx.main.tasks job 1090 (running) finished running, producing 389 events.
2021-10-17T07:43:49.425323136+08:00 stderr F 2021-10-16 23:43:49,421 DEBUG    [ba8cf785179d40f19847119a0a9dfef2] awx.analytics.job_lifecycle job-1090 post run
2021-10-17T07:43:49.612110438+08:00 stderr F 2021-10-16 23:43:49,611 DEBUG    [ba8cf785179d40f19847119a0a9dfef2] awx.analytics.job_lifecycle job-1090 finalize run
2021-10-17T07:43:49.640764078+08:00 stderr F 2021-10-16 23:43:49,639 DEBUG    [ba8cf785179d40f19847119a0a9dfef2] awx.main.dispatch task d2b391fd-10bc-488d-9269-c4d6c18fbfed starting awx.main.tasks.update_inventory_computed_fields(*[2])
2021-10-17T07:43:49.683846539+08:00 stderr F 2021-10-16 23:43:49,683 DEBUG    [ba8cf785179d40f19847119a0a9dfef2] awx.main.models.inventory Going to update inventory computed fields, pk=2
2021-10-17T07:43:49.714765297+08:00 stderr F 2021-10-16 23:43:49,714 DEBUG    [ba8cf785179d40f19847119a0a9dfef2] awx.main.models.inventory Finished updating inventory computed fields, pk=2, in 0.029 seconds
2021-10-17T07:43:50.735156488+08:00 stderr F 2021-10-16 23:43:50,733 WARNING  [ba8cf785179d40f19847119a0a9dfef2] awx.main.dispatch job 1090 (error) encountered an error (rc=None), please see task stdout for details.
2021-10-17T07:43:50.736988429+08:00 stderr F 2021-10-16 23:43:50,736 DEBUG    [ba8cf785179d40f19847119a0a9dfef2] awx.main.dispatch task 2b73b673-6d40-4c65-b1a1-7588fd1a9c5b starting awx.main.tasks.handle_work_error(*['2b73b673-6d40-4c65-b1a1-7588fd1a9c5b'])
2021-10-17T07:43:50.738166086+08:00 stderr F 2021-10-16 23:43:50,737 DEBUG    [ba8cf785179d40f19847119a0a9dfef2] awx.main.tasks Executing error task id 2b73b673-6d40-4c65-b1a1-7588fd1a9c5b, subtasks: [{'type': 'job', 'id': 1090}]
2021-10-17T07:43:50.775480607+08:00 stderr F 2021-10-16 23:43:50,770 DEBUG    [ba8cf785179d40f19847119a0a9dfef2] awx.main.dispatch task 2b73b673-6d40-4c65-b1a1-7588fd1a9c5b starting awx.main.tasks.handle_work_success(*[])

mw-0 · 2021-11-11T16:29:27Z

@smullenrga have you tried the same but change your log sizes. They are 16k default.

smullenrga · 2021-11-11T18:29:14Z

@mw-0 I'm looking into the options for log sizes, I'm simply a consumer of our k8s cluster so will possibly have to work with the cluster admins to make any global changes. Will update once I've had a chance to change log settings. Thanks for the reply.

smullenrga · 2021-11-15T14:55:15Z

Cleaning up/deleting several earlier comments.

We're running Kubernetes and using the GELF driver to log to an ELK stack (configured via /etc/docker/daemon.json). AWX changed somewhere between 15 (worked fine in our environment) and 19.4 to create separate STDOUT lines for each event. In 19.4, very long STDOUT lines (>16K) are being generated which are being split up and then improperly reassembled somewhere. Per moby/moby#22982 a 16K limit / split was put on STDOUT lines in Docker 1.13.

In our environment, after upgrading from AWX 15 to AWX 19.4, AWX Jobs break when a STDOUT line >16K is encountered because whatever is re-assembling these docker-split long lines is failing to put a carriage return on the reassembled line that goes back to AWX. As a result, you get the 16+K JSON object rebuilt as expected however, you also end up with the following JSON even log line appended to the end of the prior long line and the JSON parsing breaks.

As a result of the failed log parsing, the jobs are marked failed regardless of true status.

shanemcd · 2021-11-19T00:45:50Z

Can y'all take a look at #11338 (comment) ? Wondering if this is the same thing.

daenney · 2021-11-19T11:19:53Z

@shanemcd Sure looks like it from my point of view. Unfortunately tweaking kubelet args isn't always an option (it's possible but very annoying with managed node pools etc. in certain clouds) so it would be nice to find a way around it. Ideally whatever is tailing/streaming the log needs to handle file rotation transparently.

smullenrga · 2021-11-19T14:12:44Z

@shanemcd This does not look like my issue, perhaps I should open a new issue. I get all of the logs from kubectl, and I get the full JSON output at the top of the job output after the job's errored out. My overall log is only like 700K and that's only because I'm intentionally generating event data's over 16K (which happens naturally with a windows gather facts when the event body contais all of the host vars on at least some of our systems). My issue really seems to be about the log lines being split over 16K and whatever's reassembling them not putting a carriage return on the line. In the "stack trace" at the top of the failed job in AWX, I see the full event items in JSON format, each on its own line UNLESS the event is over 16K, as soon as it crosses 16K for the event JSON object, the next event's JSON object is appended to the end of the 16+K line and that's the point at which AWX lists it as failed and the pretty/formatted output stops.

bschaeffer · 2022-01-18T14:21:16Z

From #11511:

FWIW, this does seem to be related to container runtime. In GKE, we recently upgraded to 1.20 which defaults to containerd at the same time we saw this error. When we rolled back to 1.19, it was also broken, but we realised it was still using containerd://1.4.8.

When we switched the runtime to docker://19.3.15, it actually fixed our problem.

So things are now in a working state, but it will be EOL for us in about 5 months.

kladiv · 2022-01-18T14:48:34Z

Hi, interesting. Containerd is the default CRI in most Kubernetes distributions so i hope this issue will be solved soon in AWX side 'cause it could affect all users in the next future. Thank you Best, Claudio Il giorno mar 18 gen 2022 alle ore 15:21 Braden Schaeffer < ***@***.***> ha scritto:

…

From #11511 <#11511>: FWIW, this does seem to be related to container runtime. In GKE, we recently upgraded to 1.20 which defaults to containerd at the same time we saw this error. When we rolled back to 1.19, it was also broken, but we realised it was still using containerd://1.4.8. When we switched the runtime to docker://19.3.15, it actually fixed our problem. So things are now in a working state, but it will be EOL for us in about 5 months. — Reply to this email directly, view it on GitHub <#9961 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AANYNFVYP2RTERBNKRW3NELUWVZPDANCNFSM43IX6QDQ> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>. You are receiving this because you commented.Message ID: ***@***.***>

-- Ciao, Claudio mail @ ***@***.***

smullenrga · 2022-01-18T18:49:58Z

I think I've found my root cause - the GELF log driver in docker is not updated to handle the 16K log limit. Our enterprise cluster that AWX is on is configured to use the GELF driver to send logs to ELK and fails as noted above.

AWX on Docker Desktop with default configs works fine, no failure.

As soon as I change Docker Desktop to the GELF driver, I get the same failure.

Looking at the docker (moby) source code for the default jsonfilelog driver, it's updated to read PLogMetaData (which contains the partial log message info) and concatenate lines as needed.

The fluentd driver reads the metadata and passes it on to the log recipient.

The GELF driver has no processing of metadata or line concatenation logic from what I can see and therefore passes the bad partial messages through without any metadata needed for reassembly.

I don't know if AWX is written to handle the docker split log line metadata / reassembly itself or if it is expecting to receive the log lines already reassembled. I'm working on testing the fluentd driver to see if it breaks AWX as well. As far as I can tell, using the jsonfilelog log driver in docker will fix my issue but results in the problem of not being able to send logs to our logging systems as I'm required to do.

shanemcd · 2022-01-31T17:07:21Z

Going to close in favor of #11338 since it pinpoints the underlying issue.

awxbot added the type:bug label Apr 20, 2021

shanemcd added component:api priority:high labels Apr 20, 2021

shanemcd added a commit to shanemcd/receptor that referenced this issue May 13, 2021

Ensure that status lockfile is closed before trying to release work

513920a

For context, see: - ansible#319 - ansible/awx#9961

shanemcd mentioned this issue May 13, 2021

Ensure that status lockfile is closed before trying to release work ansible/receptor#320

Closed

shanemcd added a commit to shanemcd/receptor that referenced this issue May 13, 2021

Ensure that status lockfile is closed before trying to release work

71fab47

For context, see: - ansible#319 - ansible/awx#9961

shanemcd added a commit to shanemcd/receptor that referenced this issue May 13, 2021

Ensure that status lockfile is closed before trying to release work

a3134f8

For context, see: - ansible#319 - ansible/awx#9961

shanemcd mentioned this issue May 15, 2021

Add missing locks ansible/receptor#321

Closed

This was referenced May 21, 2021

AWX and Execution Environment not synchronized #10211

Closed

Retry when errors occur while releasing work ansible/receptor#328

Merged

shanemcd added state:needs_test and removed state:needs_devel labels Jun 1, 2021

shanemcd mentioned this issue Jul 23, 2021

Missing output of executed job #10669

Closed

3 tasks

shanemcd mentioned this issue Aug 20, 2021

Job goes into error state, but finishes successfully in the background #10889

Closed

3 tasks

kzinas-adv mentioned this issue Sep 16, 2021

Long running jobs: job output stops non-deterministically #11087

Closed

3 tasks

nlvw mentioned this issue Sep 28, 2021

suppressing long job output like selective stdout callback #5789

Open

kdelee removed the state:needs_test label Sep 30, 2021

shanemcd mentioned this issue Jan 7, 2022

automation-job completes successfully but the ops-awx-task and UI show an error and stop producing output #11511

Closed

3 tasks

shanemcd removed the priority:high label Jan 26, 2022

shanemcd closed this as completed Jan 31, 2022

kladiv mentioned this issue Apr 28, 2022

Playbooks terminated unexpectedly after 4 hours #11805

Closed

6 tasks

wtmthethird mentioned this issue Aug 31, 2022

502 Bad Gateway Errors in AWX GUI browsing Jobs, Templates #12644

Closed

9 tasks

AWX RuntimeError resulting in loss of worker task logs and true outcome #9961

AWX RuntimeError resulting in loss of worker task logs and true outcome #9961

Comments

duntonr commented Apr 20, 2021 • edited Loading

ISSUE TYPE

SUMMARY

ENVIRONMENT

STEPS TO REPRODUCE

EXPECTED RESULTS

ACTUAL RESULTS

ADDITIONAL INFORMATION

shanemcd commented Apr 20, 2021

duntonr commented Apr 20, 2021

shanemcd commented Apr 20, 2021

shanemcd commented Apr 20, 2021

duntonr commented Apr 20, 2021 • edited Loading

duntonr commented Apr 20, 2021

duntonr commented Apr 22, 2021

DrackThor commented May 4, 2021 • edited Loading

nlvw commented May 4, 2021 • edited Loading

DrackThor commented May 6, 2021

nlvw commented May 6, 2021

duntonr commented May 7, 2021

shanemcd commented May 7, 2021

shanemcd commented May 12, 2021 • edited Loading

rysolv-bot commented May 14, 2021

duntonr commented May 18, 2021 • edited Loading

shanemcd commented Jun 1, 2021

kdelee commented Jun 28, 2021 • edited by shanemcd Loading

shanemcd commented Jun 28, 2021

kazigk commented Jul 1, 2021 • edited Loading

kazigk commented Jul 2, 2021

daenney commented Aug 13, 2021 • edited Loading

daenney commented Aug 13, 2021 • edited Loading

davilmb commented Aug 23, 2021 • edited Loading

nlvw commented Aug 29, 2021 • edited Loading

nicovs commented Oct 15, 2021

kladiv commented Oct 17, 2021

mw-0 commented Nov 11, 2021 • edited Loading

smullenrga commented Nov 11, 2021

smullenrga commented Nov 15, 2021 • edited Loading

shanemcd commented Nov 19, 2021

daenney commented Nov 19, 2021

smullenrga commented Nov 19, 2021 • edited Loading

bschaeffer commented Jan 18, 2022

kladiv commented Jan 18, 2022 via email

smullenrga commented Jan 18, 2022

shanemcd commented Jan 31, 2022

duntonr commented Apr 20, 2021 •

edited

Loading

duntonr commented Apr 20, 2021 •

edited

Loading

DrackThor commented May 4, 2021 •

edited

Loading

nlvw commented May 4, 2021 •

edited

Loading

shanemcd commented May 12, 2021 •

edited

Loading

duntonr commented May 18, 2021 •

edited

Loading

kdelee commented Jun 28, 2021 •

edited by shanemcd

Loading

kazigk commented Jul 1, 2021 •

edited

Loading

daenney commented Aug 13, 2021 •

edited

Loading

daenney commented Aug 13, 2021 •

edited

Loading

davilmb commented Aug 23, 2021 •

edited

Loading

nlvw commented Aug 29, 2021 •

edited

Loading

mw-0 commented Nov 11, 2021 •

edited

Loading

smullenrga commented Nov 15, 2021 •

edited

Loading

smullenrga commented Nov 19, 2021 •

edited

Loading