Truncated or oversized response headers received from daemon process 'ckan' #1375
Comments
@FuhuXia is seeing this on the inventory sandbox and I am seeing it on inventory-1d (but not inventory-2d yet).
On inventory-1d I'm only seeing this with Apache. If I run with paster, CKAN returns the right response.
Since this seems related to Apache and/or mod_wsgi, it reminds me of this under-documented gotcha: #1383 (comment). I'm going to try to configure mod_wsgi to use the system Python instead of our custom version and see if that helps. It might be worth doing #1383 just to get off of the system mod_wsgi/Apache dependency.
I didn't really get a chance to dig into this. For next steps, I would try to install the CKAN virtualenv manually on sandbox, install gunicorn, update the WSGI script, and update the Apache config similar to pycsw. If that seems stable, I would go ahead with the implementation of #1383, cribbing off of datagov-deploy-pycsw.
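A rough sketch of that experiment; the virtualenv path, port, and WSGI entry point are assumptions, and the real pattern to copy is datagov-deploy-pycsw:

```
# Sketch only: paths, port, and the wsgi module/callable are assumptions.
sudo /usr/lib/ckan/bin/pip install gunicorn

# Run CKAN under gunicorn with multiple worker processes and no threads,
# assuming a wsgi.py that exposes an `application` callable:
/usr/lib/ckan/bin/gunicorn --workers 4 --bind 127.0.0.1:8080 wsgi:application

# Apache would then reverse-proxy to gunicorn instead of using mod_wsgi, e.g.:
#   ProxyPass / http://127.0.0.1:8080/
```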
Found a way to bypass this error on the inventory bionic sandbox. I guess it has something to do with how certain buggy versions of psycopg2 work with WSGI. The solution is using … To put it into the playbook, we can use …
It turns out we always run …
Ah, we reverted to psycopg2==2.7.7 as a workaround for #961 just before this issue appeared. That makes sense.
So, since GSA/inventory-app#23 is not yet merged, should this still be "In Progress"?
I just deployed catalog-app to the bionic sandbox, which is running …
Bumping psycopg2 to 2.8 seems to cause another issue: #1068. The explanation is that a Python dependency is not thread-safe, which causes the Postgres QueuePool to run out. I tried a solution from Server Fault: use the mod_wsgi Python package instead of the one that comes with Ubuntu, and use the psycopg2-binary package from pip instead of psycopg2. It worked well. Will try to get it into the Ansible playbook.
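For reference, a minimal sketch of that workaround, assuming the CKAN virtualenv lives at /usr/lib/ckan (the path is an assumption):

```
# Build prerequisites for compiling mod_wsgi against Apache (Ubuntu):
sudo apt-get install -y apache2-dev

# Install the pip-packaged mod_wsgi and the psycopg2 binary wheel into the
# CKAN virtualenv (path assumed):
sudo /usr/lib/ckan/bin/pip uninstall -y psycopg2
sudo /usr/lib/ckan/bin/pip install psycopg2-binary mod_wsgi

# mod_wsgi-express prints the LoadModule/WSGIPythonHome lines Apache should
# load instead of Ubuntu's libapache2-mod-wsgi module:
/usr/lib/ckan/bin/mod_wsgi-express module-config | sudo tee /etc/apache2/mods-available/wsgi.load
sudo a2enmod wsgi && sudo systemctl restart apache2
```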
Great find @FuhuXia! The thread-safety issue makes a lot of sense. I think this bolsters the idea of using gunicorn with processes instead of threads. As a quick fix, we could try to use a single thread with multiple processes: https://github.com/GSA/datagov-deploy/blob/7141d2c65af46122cfcfdacc2a307f10787bd3e7/ansible/roles/software/ckan/catalog/www/templates/etc/apache2/sites-enabled/ckan.readonly.conf.j2#L24. I think we have the memory capacity to allow for this.
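A sketch of that quick fix; the process group name and counts are assumptions, and the real change belongs in the ckan.readonly.conf.j2 template linked above:

```
# In the rendered Apache site config, the daemon directive would read
# something like:
#   WSGIDaemonProcess ckan display-name=ckan processes=6 threads=1
# After editing the template/config, validate and reload Apache:
sudo apachectl configtest && sudo systemctl reload apache2
```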
... heh, now I feel like I'm just repeating what's in the SO post, but switching to mpm_prefork, which doesn't use threads at all, is also low effort.
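A sketch of that switch on Ubuntu, assuming nothing else in use requires a threaded MPM:

```
# Disable the threaded MPMs (-f because MPMs are treated as essential
# modules) and enable the process-per-request prefork MPM:
sudo a2dismod -f mpm_event mpm_worker
sudo a2enmod mpm_prefork
sudo systemctl restart apache2
```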
@FuhuXia and I discussed this a bit. psycopg2 posts show a similar issue when using the binary wheel distributed with psycopg2; however, we should be building from source, so I think it's an unrelated issue. Since then, I was looking through the mod_wsgi docs and realized that mpm_prefork doesn't have an impact on mod_wsgi when configured as a daemon, which is how we're using WSGI, so perhaps setting WSGI's … I verified we are in fact building psycopg2 from source.
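To confirm which MPM is actually loaded (daemon-mode mod_wsgi manages its own processes and threads regardless of the MPM), something like:

```
apache2ctl -M | grep -i mpm
```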
I tried these methods, all with no success:

- processes=6 threads=1
- processes=6 threads=1 and psycopg2-binary
- processes=6 threads=1 and virtualenv mod-wsgi (also updating /etc/apache2/mods-enabled/wsgi.load to point to the virtualenv)

I tried each method with a reboot in between and a 5+ minute wait to let the connections drain. After starting apache2, I started a loop to curl the dataset endpoint (sketched below). Within a minute, sometimes more, sometimes less, the curl loop would hang and no more output would appear in the log. After 30 seconds, I'd kill the curl loop and then stop Apache. Once apache2 killed off its processes, you'd see the familiar error.
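A hypothetical reconstruction of the test loop (the exact URL and curl options from the original comment were not preserved):

```
while true; do
  date
  # -m 30 caps each request at 30s so a hang shows up as a timeout
  curl -s -o /dev/null -m 30 -w '%{http_code}\n' 'http://localhost/dataset'
  sleep 1
done
```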
I don't think this is particularly relevant, but even with threads=1, we're still seeing about four threads per process. Not sure where these are coming from. (The second column in the output was the PID.)
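One way to see the per-process thread counts (not necessarily the command behind the table in the original comment):

```
# nlwp = number of lightweight processes (threads) per PID
ps -C apache2 -o pid,nlwp,args
```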
So I was thinking about the psycopg2 post mentioned earlier, and although I don't think it's related, I was thinking about how it might be possible for our dependencies to be built against the wrong library, or an old one. Python was also built a long time ago, when these instances were first provisioned. Maybe as an act of desperation, I tried recompiling Python (deleting the existing install so common.yml would rebuild it). So far, this actually seems like it's working, but I don't really understand why. The Python version is the same (2.7.16), as well as the compile options.
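A sketch of that step; the install prefix and the ansible-playbook invocation are assumptions about this environment:

```
# Delete the custom Python build so common.yml re-compiles and reinstalls it:
sudo rm -rf /usr/local/lib/python2.7 /usr/local/bin/python2.7
ansible-playbook common.yml --limit inventory-staging
python2.7 -V   # expect 2.7.16 again after the rebuild
```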
That would explain why we never saw it in sandbox or local dev.
Mostly we just want to re-compile the Python version on production inventory, as we think it's related to #1375 (comment).
inventory-1p started hanging around 11pm PST, so that didn't work. We chatted today and decided to focus on reproducing this on staging. I realized that whatever triggers this could be specific to a type of request or controller. I went back to look at the logs and noticed some requests were timing out.
I grabbed a couple hundred URLs from production logs to reproduce on staging; these trigger hangs:
The last one can be simplified to … I was able to verify locally that this did crash CKAN.
From the above crash, I ran into this and a psycopg2 2.8.x fix for CKAN (feeling silly that we didn't check CKAN issues related to psycopg2 before). I patched CKAN on the datagov-inventory/psycopg2-2.8 branch. That resolved the error, but now I'm running into a new crash, also seemingly psycopg2-related. This leads me to lean towards focusing on getting inventory-app onto CKAN 2.8, where I suspect these issues have already been fixed, rather than continuing down the path of try/search/fix.
How to reproduce
curl localhost/api/action/status_show
Expected behavior
200 JSON response with app status
Actual behavior
500 Internal server error. Logs contain "Truncated or oversized response headers received from daemon process"