Task was marked as running in Tower but was not present in the job queue, so it has been marked as failed #9594
Comments
Hey @bdoublet91, Would you give this a shot on a newer version of AWX? Generally this type of issue means a playbook unexpectedly exited during execution, but it could be a bug in AWX itself that's been addressed since 15.0.1 was released. |
I will be able to do the test next week with awx 17.0.1 I think. |
Some news, |
We seem to have the error "Task was marked as running in Tower but was not present in the job queue, so it has been marked as failed." This happens as soon as we run multiple jobs at the same time; with a single job it never occurs. We use AWX 19.2.2. I could only find this in the supervisor.log after it happened: 2021-07-22 19:48:15,401 INFO waiting for awx-dispatcher to stop |
I am seeing this as well on K8s, AWX version 19.2.2. This occurs when we have multiple jobs launched at the same time like bdoublet91. |
I checked many logs but couldn't find it. It's as if something is reset on certain events, and this also cancels / kills the jobs |
Check NTP for all your nodes. I faced a similar issue, and NTP on one of the server nodes for Ansible Tower was out of sync. Re-syncing them fixed the issue. |
I just have 1 box with 3 docker containers, unlikely there is a time problem? |
Found this around the timestamp of the job failure:
awx-uwsgi stdout | running "exec: supervisorctl restart tower-processes:awx-dispatcher tower-processes:awx-receiver" (accepting1)... Why would it do this? Seems exactly at the moment we experience our lost task. |
+1, also seeing this |
It's really severe on our side, we can hardly use AWX at this moment, as soon as we launch multiple jobs we get in trouble :( |
Unfortunately the upgrade to 19.3.0 didn't fix it |
This isn't particularly scientific, but it appears that this was mitigated for me in some capacity by increasing the memory limit for the awx-task container to 4Gi. I'll do some more testing once I get my kubernetes cluster beefed up a little bit to handle the extra requirements. |
We now have 16 GB of RAM and 6 CPU cores just for this AWX thing, and our issues still aren't resolved. We still get task failed :( |
@HOSTED-POWER Are those the requests/limits for the awx-task container or the specs of the system it runs on? |
Those are the specs of the host machine; the containers don't have any limits configured. |
In a situation like that I'd consider defining max worker count when that setting is defined is just the value * 5, so it may take some tweaking and will likely result in a small throughput loss, but it may end up resolving the problem. |
I don't fully understand, we have 16 GB ram on the host machine. The container has no limits. Where is the problem coming from exactly? I tried this on the container: cat /proc/meminfo Looks like it reads like 16 Gb, so where is this problem coming from? :) (Just double checking to make sure it applies) |
I'm not certain this will be the cause for everyone seeing this error, I can only say that it was what solved it for me. I would encourage you to read my extended description in #11036, but here it is fitted to your scenario: The awx-task container determines the maximum number of workers it can start simultaneously to handle the lifecycle of EE containers running jobs in the queue based on the available system memory. In your case it's reading 16GB as the system memory budgeted just for queue workers, which calculates to a max of 85 workers ((16GB + 1) * 5). Each worker takes ~150-200MB of memory upon initialization. Obviously, given the chance to have enough jobs, the workers alone would consume between 12,750MB and 17,000MB of memory. But that's not budgeting the likely much higher memory consumption of each individual EE pod being spun up, so the 16GB will be fully consumed long before you hit 85 workers, assuming that the EE pods run on the same system. Once all available memory has been used, these workers start dying off and their queued jobs are lost, presumably because there isn't enough memory to process re-queuing them. By defining
extra_settings:
  - setting: SYSTEM_TASK_ABS_MEM
    value: "3"
Now I'm limited to 15 concurrent jobs in AWX and I see no more errors. |
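The arithmetic in the comment above can be checked with a small Python sketch. It only reproduces the figures quoted in this thread: the (memory + 1) * 5 rule for detected memory, the value * 5 rule when SYSTEM_TASK_ABS_MEM is set, and the 150-200MB per worker estimate. The function names are hypothetical and nothing here is read from AWX's source code.

```python
# Sketch of the worker arithmetic quoted in the thread (not AWX source code).
# Function names are hypothetical; the constants come from the comments above.

WORKERS_PER_UNIT = 5          # multiplier quoted in the thread
WORKER_MEM_MB = (150, 200)    # per-worker memory estimate quoted in the thread


def max_workers_detected(system_mem_gb: int) -> int:
    """Max workers when AWX reads the system memory itself: (GB + 1) * 5."""
    return (system_mem_gb + 1) * WORKERS_PER_UNIT


def max_workers_with_override(abs_mem_value: int) -> int:
    """Max workers when SYSTEM_TASK_ABS_MEM is set: value * 5."""
    return abs_mem_value * WORKERS_PER_UNIT


def worker_memory_range_mb(workers: int) -> tuple[int, int]:
    """Lower and upper memory estimate (MB) for the workers alone."""
    low, high = WORKER_MEM_MB
    return workers * low, workers * high


if __name__ == "__main__":
    detected = max_workers_detected(16)
    print("16GB host:", detected, "workers,",
          worker_memory_range_mb(detected), "MB for workers alone")
    print('SYSTEM_TASK_ABS_MEM = "3":', max_workers_with_override(3), "workers")
```

With a 16GB host the sketch gives the 85 workers and the 12,750MB to 17,000MB range mentioned above, and the override of 3 gives the 15 concurrent jobs. Memory used by the EE pods themselves is not modelled.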
Ok that would make sense, just one thing I find strange, we never used more than 5 GB of this 16 GB of RAM ... (we log statistics of the host machine) So I don't think we are in the case of exhausting resources? Or is something else possible? |
It depends on the other pods in the cluster as well. If they have memory requests that are higher than their actual memory usage, they will prevent any other pod from accessing that memory without actually "using" it. If you're easily able to recreate the situation where the error described here happens, I would suggest running a |
Also, it may be silly but I'd verify that there aren't any default limits on pods in your namespace that might be applying to your pods without you realizing - would be visible in |
Thx, but we run them in Docker at this moment, and no parameter is showing signs of overload as far as I can see. In which .py file can I set SYSTEM_TASK_ABS_MEM? Otherwise I would try it inside the container as a test |
I have it in /etc/tower/settings.py. |
Ok I set this:
not sure how this would help ... :) It's still quite a strange bug |
This doesn't make any difference; it must be something (totally) different. We keep getting tens of failed tasks every day, it's hell and unusable :( |
Well, we migrated to Kubernetes only to see the same problems keep happening. We just see "Error" now, without any extra information. |
Facing the same issue with 19.2.1 running in Docker, not cool with a 4-hour workflow :D |
The worker is killed for no reason (that I'm aware of)... Like if
|
Just happened again with a job that ran for 1 hour and 52 minutes... part of a workflow.
|
The same behavior here on OpenShift. When running multiple jobs at the same time, every single Pod is created and the Jobs are running. Suddenly (mostly when starting one more job), all Pods terminate unexpectedly. Operator: 0.15.0. No memory issues on the OpenShift Pods.
|
Hi, any update? I have the same error with 100 jobs at the same time in AWX 20.0. I have tried the new params SYSTEM_TASK_ABS_MEM = 3 and SYSTEM_TASK_ABS_CPU = "3000mi" and the result is the same. We are trying to migrate from AWX 15.0 to 20 and we didn't have this problem before (15 in Docker, 20 in Kubernetes). We also noticed that under heavy load on the platform the playbooks are distributed unevenly |
I have the same issue when running jobs from outside AWX with awx-cli in a container.
The AWX log keeps repeating:
|
Updated our AWX from version |
@chris93111 and @HOSTED-POWER: The calculation is based on CPU and memory. The CPU figure comes from the value set in SYSTEM_TASK_ABS_CPU, and the memory figure comes from the value set in SYSTEM_TASK_ABS_MEM. The min and max forks are calculated from these. So if we set the values in awx_task pod as: The min and max forks would be 4 (1 CPU * 4 = 4) to 40 (4 * 10 = 40). This will mirror your configuration for 1 instance: From my experimentation, this does not mean that you can run 40 jobs continuously. I can run a max of 10-20 jobs, and after that I start hitting this issue. The max number of jobs you can run also depends on how much CPU/memory you have in your cluster, so if an AWX EE pod has a CPU request of 1 and a memory request of 1Gi, while the underlying K8s VM cluster has a max of 12 CPUs, you can run at most 12 jobs (maybe fewer if you factor in control plane pods). The exact number of jobs running in the system can be viewed inside the awx task pod's container (kubectl exec -it (pod) -- bash). The AWX community has provided a utility, awx-manage. You can run awx-manage graph_jobs. This is a graphical representation of how many jobs are in the running, pending, and waiting states, and of the capacity. This utility has been very helpful for me in determining the capacity calculation. Hope this helps! |
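The fork and cluster arithmetic from the comment above can be sketched in Python as follows. This is a minimal illustration only: the multipliers 4 and 10 and the example inputs (1 CPU, a memory value of 4, a 12-CPU cluster, a 1-CPU request per EE pod) are inferred from the numbers quoted in that comment, the function names are hypothetical, and AWX's real capacity code may differ.

```python
# Sketch of the fork/capacity arithmetic quoted in the comment above.
# Multipliers and function names are assumptions taken from the thread,
# not from AWX's implementation.

FORKS_PER_CPU = 4        # "1 CPU * 4 = 4" in the comment
FORKS_PER_MEM_UNIT = 10  # "4 * 10 = 40" in the comment


def fork_range(abs_cpu: int, abs_mem: int) -> tuple[int, int]:
    """Return (min_forks, max_forks) as described in the comment."""
    return abs_cpu * FORKS_PER_CPU, abs_mem * FORKS_PER_MEM_UNIT


def cluster_job_ceiling(cluster_cpus: int, pod_cpu_request: int) -> int:
    """Upper bound on concurrent EE pods from CPU requests alone.

    Control plane pods and memory requests are ignored here, so the
    real ceiling is likely lower.
    """
    return cluster_cpus // pod_cpu_request


if __name__ == "__main__":
    print("fork range:", fork_range(abs_cpu=1, abs_mem=4))
    print("job ceiling:", cluster_job_ceiling(cluster_cpus=12, pod_cpu_request=1))
```

The smaller of the fork range and the cluster ceiling is the practical limit to watch; the commenter reports hitting failures at 10-20 jobs, well under the computed maximum of 40 forks.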
Still seeing this issue on AWX 21.4.0 version. AWX is installed on K8s cluster
|
Any update on this issue? Please suggest. |
Hi there, I also applied a modified version of that suggestion, using 6Gi of RAM and 2000m of CPU.
The interesting part is that I verified via Grafana that my pods never reached that limit, in either CPU or RAM. This issue also appeared in situations where the AWX API told me that 80% capacity was available. |
AWX version 21.5.0 Still having this issue. |
I have tested this issue multiple times. This happens when AWX system is overwhelmed with the amount of jobs sent in. This can be addressed to an extent by changing the SYSTEM_TASK_ABS_MEM, SYSTEM_TASK_ABS_CPU parameters. I was able to achieve running close to 200 concurrent jobs at the same time by adjusting these parameters; and running 2 AWX replicas and giving guaranteed effort for AWX task container in K8s cluster. While digging deep into the issue, what I observed was:
|
@mani3887 I have the same issue on version 21.3.0 (deployed on k8s), but we aren't running many concurrent jobs. For us this issue is easily reproducible when we connect to a certain number of Linux machines that in turn run and register commands on ~100 network devices and store the outputs in a per-inventory-host dictionary. Where do I change these values? I'm willing to experiment and see if it helps. If someone from the AWX team is still looking at this issue, it is easily reproducible and I'm happy to share any outputs/data that may help solve the mysteries behind it. |
ISSUE TYPE
SUMMARY
Sometimes scheduled jobs show as failed in the AWX job status with no logs, even though the playbook ran on the server.
Here the job is a system update, such as apt update && apt upgrade.
At other times the same scheduled jobs succeed with logs and everything is fine.
I get the following explanation
ENVIRONMENT
STEPS TO REPRODUCE
Don't really know
EXPECTED RESULTS
All jobs succeed, with logs
ACTUAL RESULTS
Random failed jobs with no logs
I don't use the latest version of AWX because it has too many bugs that make AWX painful to use, so I don't know if the latest version resolves this issue.
If you want more information or tests, feel free to ask. Thank you