
Task was marked as running but was not present in the job queue, so it has been marked as failed. #14277

Open

deep7861 opened this issue Jul 24, 2023 · 3 comments

@deep7861

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.
  • I am NOT reporting a (potential) security vulnerability. (These should be emailed to security@ansible.com instead.)

Bug Summary

One of our jobs consistently fails with this error:
Task was marked as running but was not present in the job queue, so it has been marked as failed.

[screenshot: AWX job output showing the error above]

We haven't been able to identify any resource crunch on the k8s cluster, and the AWX pods aren't running out of resources either.

AWX version

21.3.0

Select the relevant components

  • UI
  • UI (tech preview)
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

kubernetes

Modifications

no

Ansible version

No response

Operating system

No response

Web browser

No response

Steps to reproduce

Our setup:
AKS 1.23.8
AWX Operator: 0.24.0
AWX: 21.3.0

This job is connecting to ~30 linux VMs (inventory hosts) and from each VM, contacting ~100 network devices to get output of 3 commands.
The output is being stored in a dictionary per inventory host.

The job runs fine with fewer network devices (up to about 90), but it always fails at 100.

As the error message suggests, the issue does not appear to be with the network, device access, or anything on the target side.
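
For reference, the fan-out described above roughly follows this pattern. This is a minimal sketch; the group name, device list variable, command, and fact names are illustrative assumptions, not the actual playbook:

```yaml
# Minimal sketch of the fan-out described above. Group, variable, and
# command names are illustrative assumptions, not the actual playbook.
- hosts: jump_vms                     # ~30 inventory hosts
  gather_facts: false
  tasks:
    - name: Run a show command against each network device behind this VM
      ansible.builtin.command: ssh {{ item }} "show version"
      loop: "{{ network_devices }}"   # ~100 devices per inventory host
      register: device_output
      changed_when: false

    - name: Store the output in a dictionary keyed by device
      ansible.builtin.set_fact:
        device_results: >-
          {{ dict(device_output.results | map(attribute='item')
             | zip(device_output.results | map(attribute='stdout'))) }}
```

Note that registering ~100 results per host across ~30 hosts produces a large volume of job output, which is relevant to the log-size discussion below.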

Expected results

The play runs smoothly and the job finishes as expected.

Actual results

Job fails with error message:
Task was marked as running but was not present in the job queue, so it has been marked as failed.

Additional information

No response

@fosterseth
Copy link
Member

@deep7861 you may be running into the k8s max container log issue. How you change this max log size varies depending on your k8s cluster type, but here is a thread that explains it a bit: #11338 (comment)
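
On a cluster where you control the kubelet configuration directly, the rotation limits look roughly like this (the field names are the upstream KubeletConfiguration ones; on AKS the equivalent knobs are exposed through the node pool's custom kubelet configuration, and the values here are illustrative):

```yaml
# Sketch: raising kubelet's container log rotation limits.
# Values are illustrative; the upstream defaults are
# containerLogMaxSize: 10Mi and containerLogMaxFiles: 5.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: 100Mi
containerLogMaxFiles: 5
```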

The other thing to look into is the receptor reconnect option: ansible/receptor#683 (comment)
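
Per that thread, the reconnect support is controlled by the RECEPTor_KUBE_SUPPORT_RECONNECT environment variable on the execution environment; a rough sketch of enabling it through the AWX operator spec follows (this assumes a receptor/operator version that ships the feature from ansible/receptor#683):

```yaml
# Sketch: enabling receptor's kube API reconnect support via the AWX
# operator spec. Assumes a receptor/operator version with the feature
# from ansible/receptor#683.
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
spec:
  ee_extra_env: |
    - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
      value: enabled
```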

@deep7861
Author

deep7861 commented Aug 8, 2023

@fosterseth Thank you for looking into this issue.
While trying to track down the log size relation, I happened to notice some strange behavior.
In some of the posts you mentioned, I saw a suggestion to check the 'result_traceback' value from /api/v2/jobs/job_id for the failed job.
Now, when I try to do that, the page doesn't load. Here is what I get:
[screenshot: /api/v2/jobs/job_id page failing to load]

When I try to look up that job in the regular AWX UI, it fails as well:
[screenshot: AWX UI error when opening the same job]

When this error occurs, I see the following log from the web container:

```
2023/08/08 15:28:49 [error] 33#33: *189 upstream prematurely closed connection while reading response header from upstream, client: 10.244.7.25, server: _, req…
10.244.7.25 - - [08/Aug/2023:15:28:49 +0000] "GET /api/v2/unified_jobs/?name__icontains=ine_lm&not__launch_type=sync&order_by=-finished&page=1&page_size=20 HTT…
DAMN ! worker 5 (pid: 38) died, killed by signal 9 :( trying respawn ...
Respawned uWSGI worker 5 (new pid: 70)
mounting awx.wsgi:application on /
WSGI app 0 (mountpoint='/') ready in 1 seconds on interpreter 0x7636d0 pid: 70 (default app)
```

Do we know why this is happening?

@bpedersen2
Contributor

#9594
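
For what it's worth, a uWSGI worker "killed by signal 9" is typically the kernel OOM killer. If the web container is hitting its memory limit, the operator exposes that limit as web_resource_requirements; the sketch below uses assumed placeholder sizes, not tuned recommendations or a confirmed fix:

```yaml
# Sketch: raising the web container's memory limit via the AWX operator
# spec, on the assumption that the SIGKILL came from the OOM killer.
# The sizes below are placeholders, not tuned recommendations.
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
spec:
  web_resource_requirements:
    requests:
      memory: 2Gi
    limits:
      memory: 4Gi
```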
