502 Bad Gateway Errors in AWX GUI browsing Jobs, Templates #12644
Comments
+1 We are facing the same problem; I was just about to create a new issue when I saw this one. I've been digging for a few hours, and the TL;DR is that I traced the problem to jobs with a lot of output causing the uWSGI process to crash. Increasing the memory for the awx-web container to 2 GB fixed the job overview not loading correctly, but we would need to increase it further to be able to load the job output. We saw the uWSGI process get killed with a signal 9, which indicates an OOM scenario.
As I stated at the start, increasing the request/limit for the awx-web container fixed the job overview and job details pages, but the job output still won't load. I'm hesitant to increase the memory even further; that just feels like masking the underlying problem.
This got us the exact offending query; would you be able to get us that query? @fust also wondering if the same issue can be replicated by hitting the /api/v2/jobs/<id>/job_events/children_summary/ endpoint directly.
Doing some local testing, I can see the uwsgi process spike up to 1.9 GB of memory usage. So if you have a large number of events for a job, and multiple users (or even a single user refreshing the job output page) trying to view the stdout for that job, you can easily breach 2 GB for the entire container.
@fust @2and3makes23 do you get the 502 if you append [...]? This tricks the browser into avoiding loading all events at once for that job.
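For anyone who wants to check a suspect job outside the UI, here is a minimal sketch (mine, not from this thread) that hits the two API endpoints discussed above with small requests; the base URL, token, and job ID are placeholder assumptions.

```python
# Hedged sketch: probe a job's event volume via the AWX REST API instead of the UI.
# AWX_URL, TOKEN and JOB_ID are placeholders, not values from this issue.
import requests

AWX_URL = "https://awx.example.com"
TOKEN = "REPLACE_ME"   # an AWX OAuth2 token with read access
JOB_ID = 12345         # the job whose output page triggers the 502
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# children_summary should be cheap; it summarizes events instead of returning them all.
summary = requests.get(
    f"{AWX_URL}/api/v2/jobs/{JOB_ID}/job_events/children_summary/",
    headers=HEADERS, timeout=30,
)
print("children_summary:", summary.status_code, summary.text[:200])

# Fetch events one small page at a time rather than letting the UI pull everything.
events = requests.get(
    f"{AWX_URL}/api/v2/jobs/{JOB_ID}/job_events/",
    headers=HEADERS, params={"page_size": 10}, timeout=30,
)
print("job_events:", events.status_code,
      events.json().get("count") if events.ok else "no JSON body")
```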
We're seeing this same behavior with 21.2.0 as well, and also only on jobs that enter an "ERROR" state. Other jobs, even with large outputs/tasks, display correctly, albeit slowly. However, a user viewing the output of a job that is in an "ERROR" state crashes the awx-p-web container; I've seen 5 GB+ be requested before it OOMs. We're running in GKE, and we're using Cloud SQL. One oddity that does come up in the PostgreSQL logs (we have not turned on logging for all queries, though) is that there appears to be a DB query for jobs in an ERROR state, but we can't see corresponding DB queries for jobs that are in a successful or failed state. I do not know whether this is relevant, but it is rather confusing that we only see queries against the database when trying to view an ERROR'd job. The query being run is:
Running the query on the database produces an ~11 MB file. Also, we can see the query we run directly against the database in the logs as well, supporting the earlier question of why only the ERROR'd jobs show up as queries against the DB. When we run this same query against an identical playbook run from a week prior that did not enter an ERROR'd state, we see a 5.6 KB file instead of 11 MB. That is a striking difference in and of itself. We also tried to search through the code to find where this query comes from, and could not locate a "LIMIT 21"; we also can't find any inner joins except those against the RBAC tables. Lastly, hitting /api/v2/jobs/<id>/job_events/children_summary/ does nothing for this job ID. The raw data looks like this:
You just tested this as I was typing, but I want to point out that our issues surround jobs that enter an "ERROR" state; for successful jobs we see memory spikes, but nothing crashes. (A second lastly, since you commented!) Thanks for looking into this.
I had to search for a bit to find the query; we haven't seen any of these issues as of late. The relevant part from the PostgreSQL log is this:
As you can see, the uWSGI process crashes during retrieval of the data, leaving the PostgreSQL server with an unexpected EOF. As I said, we haven't seen these issues lately, which reinforces the theory that it is related to failed jobs.
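If you can query PostgreSQL directly, here is a minimal sketch (my own suggestion, not something posted in this thread) for spotting which unified-job rows are abnormally large; the table and column names follow AWX's Django models (main_unifiedjob, result_traceback), so verify them against your schema and credentials first.

```python
# Hedged sketch: list the largest unified job rows, i.e. the jobs whose retrieval
# is most likely to blow up uwsgi. All connection parameters are placeholders.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="awx", user="awx", password="REPLACE_ME")
try:
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, status,
                   pg_column_size(t) AS row_bytes,
                   octet_length(coalesce(result_traceback, '')) AS traceback_bytes
            FROM main_unifiedjob AS t
            ORDER BY row_bytes DESC
            LIMIT 10
            """
        )
        for job_id, status, row_bytes, tb_bytes in cur.fetchall():
            print(job_id, status, f"{row_bytes} B row", f"{tb_bytes} B traceback")
finally:
    conn.close()
```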
We will try that as soon as we hit another error-state job.
@fosterseth
We are running AWX version
@fosterseth Like you, I tried querying
Update: In the afternoon, the above query for both jobs worked while viewing the respective job details. And I tried your idea with
For what it's worth, it looks like at least some of our jobs that are entering an ERROR'd state are a result of this: #9961 -> #11338. And then, when we try to view the log output, that's when the AWX web container starts to run away with all the memory and crashes. So that might be one way to replicate this.
@wtmthethird At least for us, it's unlikely that's the case. This makes it very hard to do any debugging on the uWSGI application, which is further complicated by the fact that it only happens in our production environment.
@fust you could trigger a failed job by calling receptorctl work release on the work unit ID.
You could also probably just do a [...]; maybe wait until a lot of stdout is produced before running that command. Maybe open the browser dev tools, go to the network tab, and navigate to the job detail page for that job. Which API request yields the 502?
@wtmthethird wondering what the contents of the output is -- my guess is the result_traceback field.
@wtmthethird 's hint about the db tipped me off -- the SQL they posted was essentially just a unified job record from the DB. The only field that could be arbitrarily big is result_traceback.
@fust and @wtmthethird also mentioned issue #11338 as a source of this problem, so I started there. I got a reproducer by doing the following: start minikube and configure the container runtime's log options along these lines:
{
"log-driver": "json-file",
"log-opts": {
"max-size": "15k",
"max-file": "3",
"labels": "production_status",
"env": "os,customer"
}
}
This limits the size of the stdout of containers to 15 KB. Run a job whose stdout exceeds that limit, then navigate to the job's API endpoint. What you'll find is that the result_traceback field is enormous.
Hi,
...we hit a similar issue that might have the same root cause. It was initially reported on Google Groups (see https://groups.google.com/g/awx-project/c/BTQ51PblMaI for the full thread), and I was asked to supply some output from the AWX deployment in this GitHub issue.
-> What does "/api/v2/jobs/<id>/job_events/children_summary" return for the failed job?
Result: {
-> What does that same endpoint return for the same job, but when successful?
Result: {"detail":"Nicht gefunden."} ("Not found.")
-> Do you get 502 errors when viewing the job detail page in the UI? Open the browser dev tools > network tab. When you go to the job output page in the UI but don't scroll down, do any requests give a 502 or take a long time to respond?
Result: No, no 502s and a reasonable amount of time to respond (5 s). But if I pick up the scrollbar with my mouse and move it halfway down, I can see a lot of requests being sent to the backend, and then there is a 502. Additionally, I can see that the response size is between 1 and 7 MB, which seems a bit large for loading the events. Here's a screenshot, after I moved the scrollbar down:
Once more, thanks for looking into this!
Best regards,
Andreas
@andreasbourges thanks for this detailed information! My current running hypothesis is that uwsgi blows up because the result_traceback on the failed/error job is enormous. Can you verify that that job has a result_traceback containing tons of text?
Hmmm... how would I verify this?
@andreasbourges sorry, should have said: this is a field on the API endpoint of the job, so just navigate to /api/v2/jobs/<id>/ and look at result_traceback.
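As a rough sketch of checking that from the API (placeholder URL, token, and job ID, not values from this issue):

```python
# Hedged sketch: fetch the job record and report how large result_traceback is,
# without rendering it in the UI. Placeholders throughout.
import requests

AWX_URL = "https://awx.example.com"
TOKEN = "REPLACE_ME"
JOB_ID = 12345   # the failed/error job

resp = requests.get(
    f"{AWX_URL}/api/v2/jobs/{JOB_ID}/",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=60,
)
resp.raise_for_status()
traceback_text = resp.json().get("result_traceback") or ""
size_mib = len(traceback_text.encode("utf-8")) / (1024 * 1024)
print(f"result_traceback: {len(traceback_text)} chars, ~{size_mib:.1f} MiB")
```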
Hi,
...your assumption seems to be correct: the result_traceback has a size of 15 MB (but is this enough to make uwsgi go nuts?). But what makes the job fail without further notice? Is it the sheer size of the data? (It looks like I receive the whole device configuration from Nautobot for *each* of the ~400 devices, which might be the problem.)
Will have a look at this.
Thanks,
Andreas
@andreasbourges the underlying issue seems to be container stdout log rotation, as described in #11338. You need to make sure your environment has the container max log size set to a value that is sufficiently large; it should be larger than the expected stdout of your Ansible job.
Just to let you know: reducing the data received from the inventory and tuning container log rotation made our jobs run stably again! Thanks for your support!
We are attempting to address this issue here: ansible/receptor#683
Please confirm the following
Bug Summary
When browsing Jobs or other job-related lists via the UI or API, nginx produces 502 Bad Gateway errors when certain jobs are part of the result set shown to the user.
Those jobs have in common that they ended in an error state.
Log entries that might be of interest:
AWX version
21.3.0
Select the relevant components
Installation method
openshift
Modifications
no
Ansible version
2.12.5.post0
Operating system
Windows 10
Web browser
Firefox, Chrome, Safari, Edge
Steps to reproduce
Some of our erroneous jobs might have failed because of #11805, but there might be other causes, like OpenShift lifecycle processes.
At first we thought those jobs were corrupted within our DB, but sometimes we are able to view (at least some of) them in the browser without getting a 502 error, so that seems to be out of the equation.
When limiting elements per page to 1, it is possible to switch pages (without error) until you hit a faulty one.
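As a rough illustration of that page-by-page approach (the base URL and token are placeholder assumptions, not part of the original report), the following sketch walks the jobs list one record per page and stops at the first 502:

```python
# Hedged sketch: walk /api/v2/jobs/ with page_size=1 to find which job's record
# makes the proxy return a 502. AWX_URL and TOKEN are placeholders.
import requests

AWX_URL = "https://awx.example.com"
TOKEN = "REPLACE_ME"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

url = f"{AWX_URL}/api/v2/jobs/?page_size=1&order_by=-created"
while url:
    resp = requests.get(url, headers=HEADERS, timeout=60)
    if resp.status_code == 502:
        print(f"502 from {url} -- the job on this page is a likely culprit")
        break
    resp.raise_for_status()
    data = resp.json()
    for job in data.get("results", []):
        print(f"ok: job {job['id']} status={job['status']}")
    next_page = data.get("next")
    if not next_page:
        break
    url = next_page if next_page.startswith("http") else f"{AWX_URL}{next_page}"
```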
Expected results
No errors when viewing jobs or job related elements in UI/via API
Actual results
502 Bad Gateway
Additional information
Any hint on how to troubleshoot this is much appreciated