Job failed due to log size limit was reached and can't retrieve the failed job output. #13680

Lee-Kwang · 2023-03-13T05:39:29Z

Please confirm the following

I agree to follow this project's code of conduct.
I have checked the current issues for duplicates.
I understand that AWX is open source software provided for free and that I might not receive a timely response.

Bug Summary

I started a job on a container group in an OpenShift cluster, and it ended as failed when the log size limit was reached.
Retrieving the failed job's output then takes ages or fails outright,
and viewing the job template gives a 'Something went wrong' error. The error clears once the failed job is deleted.

Behaviour observed:

I started a job on the container group in the OpenShift cluster against an inventory with some remote hosts or localhost.
The playbook printed many lines of debug messages until the log size limit was reached.
The job got stuck for minutes and then ended as failed.
I tried every option of RECEPTOR_KUBE_SUPPORT_RECONNECT; the results were the same.

Retrieving the job output took ages or ended with 'Something went wrong'.
When trying to view the job template, I often got a 'Something went wrong' error.
The Jobs view often showed 'Something went wrong' as well.
After deleting the failed job, I could view the job template details and the Jobs view worked again.

AWX version

21.5.1

Select the relevant components

Installation method

kubernetes

Modifications

no

Ansible version

core 2.13.8

Operating system

Red Hat Enterprise Linux release 9.1 (Plow) UBI

Web browser

Chrome

Steps to reproduce

Create a job template with a playbook that prints tens of thousands of debug messages.
Create an inventory with some remote hosts or localhost.
Start the job on the container group in the remote cluster against the above inventory.

Expected results

The job ends successfully

Actual results

The job failed. I can't view the job output, and the failed job blocks access to the job template.

Additional information

No response

fosterseth · 2023-03-15T15:45:02Z

I have a feeling you may be running into #12961

that PR landed in awx 21.11.0

what happened was that when jobs ended in a failed state, AWX would attempt to gather the entire output of the job pod and stick it into the job's result_traceback field. However, that output could be MASSIVE (all of the stdout) and was breaking things.

in 21.11.0+ it will cap the output to the last 1000 bytes or so. I bet the job template detail page was loading the last run job and the uwsgi process was just dying while trying to load that job.
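
For illustration only, capping output to its last 1000 bytes or so could look roughly like the sketch below. This is not the code from the PR: the function name `cap_tail` and the exact 1000-byte default are assumptions made for the example, and AWX's real implementation may differ.

```python
def cap_tail(text: str, limit: int = 1000) -> str:
    """Return at most the last `limit` bytes of `text` (measured as UTF-8).

    Hypothetical helper for illustration; not taken from AWX.
    """
    if limit <= 0:
        return ""
    data = text.encode("utf-8")
    if len(data) <= limit:
        # Already small enough, keep it unchanged.
        return text
    # Keep only the tail. errors="ignore" drops a multi-byte character
    # that was split at the cut point instead of raising an error.
    return data[-limit:].decode("utf-8", errors="ignore")


if __name__ == "__main__":
    huge_stdout = "debug line\n" * 100_000  # stands in for a massive pod log
    capped = cap_tail(huge_stdout)
    print(len(huge_stdout), len(capped))
```

Storing only the tail keeps the most recent output, which is usually where the failure shows up, while keeping the stored field small enough to load quickly.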

This won't explain why those jobs failed in the first place (you mention a log rotation limit problem, which can be addressed via k8s configuration). However, on 21.11.0+ it shouldn't break the UI when these failures happen.

github-actions bot added component:api component:ui needs_triage type:bug community labels Mar 13, 2023

mabashian removed the component:ui label May 24, 2023
