Job failed due to log size limit was reached and can't retrieve the failed job output. #13680

Open
5 of 9 tasks
Lee-Kwang opened this issue Mar 13, 2023 · 1 comment

Comments

@Lee-Kwang

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.

Bug Summary

I started a job on a container group in an OpenShift cluster, and it ended as failed once the log size limit was reached.
Retrieving the failed job's output then takes a very long time or fails outright,
and viewing the job template shows a 'Something went wrong' error. The error clears once the failed job is deleted.

Behaviour observed:

I started a job on the container group in the OpenShift cluster against an inventory with some remote hosts or localhost.
The playbook printed many lines of debug messages until the log size limit was reached.
The job then hung for several minutes and ended as failed.
I tried every option of RECEPTOR_KUBE_SUPPORT_RECONNECT; the results were the same.

Retrieving the job output took a very long time or ended with 'Something went wrong'.
When trying to view the job template, I often got a 'Something went wrong' error.
The Jobs view often showed 'Something went wrong' as well.
After deleting the failed job, I could view the job template details and the Jobs view worked again.

AWX version

21.5.1

Select the relevant components

  • UI
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

kubernetes

Modifications

no

Ansible version

core 2.13.8

Operating system

Red Hat Enterprise Linux release 9.1 (Plow) UBI

Web browser

Chrome

Steps to reproduce

Create a job template with a playbook that prints tens of thousands of debug messages.
Create an inventory with some remote hosts or localhost.
Start the job on the container group in the remote cluster against that inventory.
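
For the first step, a playbook of this kind could be generated with a small script. This is only a hypothetical sketch, not the reporter's actual playbook: the function name `build_noisy_playbook`, the default of 20000 messages, and the message text are all assumptions chosen for illustration.

```python
def build_noisy_playbook(message_count: int = 20000) -> str:
    """Return playbook YAML that prints `message_count` debug lines.

    Hypothetical reproduction helper; the count and message text are
    assumptions, not values taken from the original report.
    """
    return (
        "---\n"
        "- name: Emit many debug messages\n"
        "  hosts: all\n"
        "  gather_facts: false\n"
        "  tasks:\n"
        "    - name: Print one debug line per loop item\n"
        "      ansible.builtin.debug:\n"
        '        msg: "debug line {{ item }}"\n'
        # Doubled braces in the f-string produce literal Jinja2 delimiters.
        f'      loop: "{{{{ range(0, {message_count}) | list }}}}"\n'
    )


if __name__ == "__main__":
    # Write the playbook to a file that a job template's project can use.
    with open("noisy_debug.yml", "w", encoding="utf-8") as handle:
        handle.write(build_noisy_playbook())
```

Raising `message_count` should make it easier to hit whatever log size limit the cluster enforces.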

Expected results

The job ends successfully

Actual results

The job failed. The job output can't be viewed. The failed job blocks access to the job template.

Additional information

No response

@fosterseth
Member

fosterseth commented Mar 15, 2023

I have a feeling you may be running into #12961.

That PR landed in AWX 21.11.0.

What happened was that when a job ended in a failed state, AWX would attempt to gather the entire output of the job pod and store it in the job's result_traceback field. However, that output could be MASSIVE (all of the stdout) and was breaking things.

In 21.11.0+ it caps the output to the last 1000 bytes or so. I bet the job template detail page was loading the last-run job and the uwsgi process was just dying while trying to load that job.
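
As a rough illustration of that kind of cap (a minimal sketch only, not AWX's actual code; the function name `cap_traceback` and the exact 1000-byte default are assumptions based on the "1000 bytes or so" figure above):

```python
def cap_traceback(output: str, max_bytes: int = 1000) -> str:
    """Keep at most the last `max_bytes` bytes of `output`.

    Illustrative only: the name and default limit are assumptions,
    not taken from the AWX source.
    """
    if max_bytes <= 0:
        return ""
    raw = output.encode("utf-8")
    if len(raw) <= max_bytes:
        return output
    # Slice from the end; errors="ignore" drops a multi-byte character
    # that the cut may have split at the start of the tail.
    return raw[-max_bytes:].decode("utf-8", errors="ignore")
```

Storing only the tail keeps the part of the output closest to the failure while bounding the size of the field that the API and UI have to load.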

This won't explain why those jobs failed in the first place (you mention a log rotation limit problem, which can be addressed via k8s configuration). However, on 21.11.0+ these failures should no longer break the UI when they happen.
