
Truncated or oversized response headers received from daemon process 'ckan' #1375

Closed · adborden opened this issue Feb 13, 2020 · 20 comments
Labels: bug (Software defect or bug), component/catalog (Related to catalog component playbooks/roles), component/inventory (Inventory playbooks/roles)

@adborden
Contributor

[Thu Feb 13 23:30:47.840275 2020] [wsgi:error] [pid 11319:tid 140180173760256] [client 127.0.0.1:52666] Truncated or oversized response headers received from daemon process 'ckan': /etc/ckan/apache.wsgi

How to reproduce

  1. curl localhost/api/action/status_show

Expected behavior

200 JSON response with app status

Actual behavior

500 Internal server error. Logs contain "Truncated or oversized response headers received from daemon process"

@adborden adborden added bug Software defect or bug component/inventory Inventory playbooks/roles labels Feb 13, 2020
@adborden
Contributor Author

@FuhuXia is seeing this on the inventory sandbox and I am seeing it on inventory-1d (but not inventory-2d yet).

@adborden
Contributor Author

On inventory-1d I'm only seeing this with apache. If I run with paster, ckan returns the right response.
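For reference, the paster run here is just something along these lines, run from the CKAN virtualenv (the config path is a guess, not the actual inventory path):

/usr/lib/ckan/bin/paster serve /etc/ckan/production.ini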

@adborden
Contributor Author

Since this seems related to apache and/or mod_wsgi, it reminds me of this under-documented gotcha #1383 (comment)

I'm going to try configuring mod_wsgi to use the system Python instead of our custom version and see if that helps. It might be worth doing #1383 just to get off the system mod_wsgi/apache dependency.

@adborden
Contributor Author

I didn't really get a chance to dig into this. For next steps, I would try to install the CKAN virtualenv manually on sandbox, install gunicorn, update the wsgi script, and update the apache config similar to pycsw. If that seems stable, I would go ahead with the implementation of #1383, cribbing off of datagov-deploy-pycsw.
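A minimal sketch of that manual test, assuming the CKAN virtualenv at /usr/lib/ckan and a small wsgi.py wrapper in /etc/ckan exposing an application callable (the paths and module name are assumptions, not the current inventory layout):

sudo su -l
umask 022
/usr/lib/ckan/bin/pip install gunicorn
# serve CKAN on a local port; Apache would proxy here instead of running mod_wsgi
/usr/lib/ckan/bin/gunicorn --chdir /etc/ckan --workers 4 --bind 127.0.0.1:8080 wsgi:application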

@FuhuXia
Member

FuhuXia commented Feb 26, 2020

Found a way to bypass this error on the inventory bionic sandbox. I guess it has something to do with how certain buggy versions of psycopg2 work with mod_wsgi.

The solution is to use psycopg2==2.7.7 for DB init, then change it to psycopg2==2.8.4 before Apache starts. Hopefully after the next ckan/psycopg2 upgrade we won't need this hack anymore.

To put it into the playbook, we can keep psycopg2==2.8.4 in the freeze file but temporarily change it to psycopg2==2.7.7 before the CKAN DB init task, then apply the freeze file again afterwards.
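From the command line, the swap would look roughly like this (the virtualenv path matches the one used elsewhere in this thread; the paster config path is an assumption):

/usr/lib/ckan/bin/pip install psycopg2==2.7.7   # downgrade before DB init
/usr/lib/ckan/bin/paster --plugin=ckan db init -c /etc/ckan/production.ini
/usr/lib/ckan/bin/pip install psycopg2==2.8.4   # restore the frozen version before starting apache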

@FuhuXia
Member

FuhuXia commented Mar 3, 2020

It turns out we always run db init manually, because we have a never tag on the db-init.yml playbook. We just need to remember to downgrade psycopg2 to 2.7.7 when running the db init command and revert back to the freeze version afterwards.
https://github.com/GSA/datagov-deploy/blob/94487cb80f97a94abb593d107f11ed27d3f7a757/ansible/roles/software/ckan/inventory/tasks/main.yml#L6-L7

@adborden
Contributor Author

Ah, we reverted to psycopg2==2.7.7 as a workaround for #961 just before this issue appeared. That makes sense.

@adborden
Contributor Author

So, since GSA/inventory-app#23 is not yet merged, should this still be In Progress?

@adborden adborden added the component/catalog Related to catalog component playbooks/roles label Mar 12, 2020
@adborden
Contributor Author

I just deployed catalog-app to the bionic sandbox, which is running psycopg2==2.7.3.2, and I'm seeing this issue there too. I'm going to bump psycopg2 to 2.8.

@FuhuXia
Member

FuhuXia commented Mar 25, 2020

Bumping psycopg2 to 2.8 seems to cause another issue, #1068. The explanation is that a Python dependency is not thread safe, which causes the Postgres QueuePool to run out of connections.

Tried a solution from Server Fault: use the mod_wsgi package from pip instead of the one that comes with Ubuntu, and use the psycopg2-binary package from pip instead of psycopg2. It worked well. Will try to get it into the Ansible playbook.

@adborden
Contributor Author

Great find @FuhuXia! The thread-safety issue makes a lot of sense. I think this bolsters the idea of using gunicorn with processes instead of threads. As a quick fix, we could try to use a single thread with multiple processes https://github.com/GSA/datagov-deploy/blob/7141d2c65af46122cfcfdacc2a307f10787bd3e7/ansible/roles/software/ckan/catalog/www/templates/etc/apache2/sites-enabled/ckan.readonly.conf.j2#L24

I think we have memory capacity to allow for this.
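As a sketch of that quick fix (directive values are illustrative, not the current vhost config):

# In /etc/apache2/sites-enabled/ckan.conf, change the daemon directive to
# something like: WSGIDaemonProcess ckan display-name=ckan processes=6 threads=1
sudo apachectl configtest
sudo systemctl restart apache2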

@adborden
Contributor Author

... heh, now I feel like I'm just repeating what's in the SO post, but switching to mpm_prefork, which doesn't use threads at all, is also low effort.

@adborden
Contributor Author

@FuhuXia and I discussed this a bit. psycopg2 issue reports show a similar problem when using the binary wheel distributed with psycopg2. However, we should be building from source, so I think it's an unrelated issue.

Since then, I was looking through the mod_wsgi docs and realized that mpm_prefork doesn't have an impact on mod_wsgi when configured as a daemon, which is how we're using WSGI, so perhaps setting WSGI's threads=1 is our best option.

I verified we are in fact building psycopg2 from source.

@adborden
Contributor Author

I tried these methods, all with no success. I tried each method with a reboot in between and a 5+ minute wait to let the connections drain. After starting apache2, I started a loop to curl the dataset endpoint:

while true; do curl localhost/dataset; sleep 1; done

Within a minute, sometimes more, sometimes less, the curl loop would hang and no more output would appear in the log, indicating the hang. After 30 seconds, I'd kill the curl loop and then stop Apache. Once apache2 killed off its processes, you'd see the familiar Truncated or oversized response headers received from daemon process 'ckan': /etc/ckan/apache.wsgi in the logs. Still no stack trace.

processes=6 threads=1
Update /etc/apache2/sites-enabled/ckan.conf

processes=6 threads=1 and psycopg2-binary

sudo su -l
umask 022
/usr/lib/ckan/bin/pip uninstall psycopg2
/usr/lib/ckan/bin/pip install psycopg2-binary

processes=6 threads=1 and virtualenv mod-wsgi

sudo su -l
umask 022
/usr/lib/ckan/bin/pip install mod-wsgi

And update /etc/apache2/mods-enabled/wsgi.load to point to the virtualenv .so.
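For the record, the wsgi.load change would look roughly like this, using mod_wsgi-express from the virtualenv to print the correct LoadModule line rather than hard-coding the .so path (which varies by Python build):

# prints the LoadModule/WSGIPythonHome lines for the virtualenv-built module
/usr/lib/ckan/bin/mod_wsgi-express module-config
# copy its LoadModule line into /etc/apache2/mods-enabled/wsgi.load, then restart
sudo systemctl restart apache2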

@adborden
Contributor Author

I don't think this is particularly relevant, but even with threads=1, we're still seeing about four threads per process. Not sure where these are coming from.

The second column is the PID.

Every 2.0s: ps -eLf | grep [c]kan                                                                 inventory-1p.prod-ocsit.bsp.gsa.gov: Thu Mar 26 20:17:29 2020

www-data  3033  3029  3033  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3033  3029  3050  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3033  3029  3051  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3033  3029  3052  1    5 20:14 ?        00:00:02 ckan              -k start
www-data  3033  3029  3281  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3034  3029  3034  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3034  3029  3043  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3034  3029  3044  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3034  3029  3045  1    5 20:14 ?        00:00:02 ckan              -k start
www-data  3034  3029  3159  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3035  3029  3035  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3035  3029  3082  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3035  3029  3083  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3035  3029  3084  1    5 20:14 ?        00:00:02 ckan              -k start
www-data  3035  3029  3211  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3036  3029  3036  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3036  3029  3085  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3036  3029  3086  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3036  3029  3087  1    5 20:14 ?        00:00:02 ckan              -k start
www-data  3036  3029  3329  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3037  3029  3037  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3037  3029  3046  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3037  3029  3047  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3037  3029  3049  1    5 20:14 ?        00:00:02 ckan              -k start
www-data  3037  3029  3357  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3038  3029  3038  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3038  3029  3088  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3038  3029  3089  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3038  3029  3090  1    5 20:14 ?        00:00:02 ckan              -k start
www-data  3038  3029  3175  0    5 20:14 ?        00:00:00 ckan              -k start

@adborden
Contributor Author

adborden commented Mar 26, 2020

So I was thinking about the psycopg2 post mentioned earlier, and although I don't think it's related, it made me wonder whether our dependencies might have been built against the wrong library, or an old one. Python was also built a long time ago, when these instances were first provisioned. Maybe as an act of desperation, I tried recompiling Python (deleting the existing install so common.yml would rebuild it).

So far this actually seems to be working, but I don't really understand why. The Python version is the same (2.7.16), as are the compile options (config.status --config '--prefix' '/usr/local/lib/python2.7.16' '--enable-ipv6' '--enable-unicode=ucs4' '--enable-shared').
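The recompile itself was roughly this (the playbook path and host pattern are illustrative, not the exact invocation):

sudo rm -rf /usr/local/lib/python2.7.16   # delete the existing install
ansible-playbook ansible/common.yml --limit inventory   # common.yml rebuilds Python when the prefix is missing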

@FuhuXia
Member

FuhuXia commented Mar 26, 2020

That would explain why we never saw it in sandbox or local dev.

adborden added a commit that referenced this issue Mar 26, 2020
Mostly we just want to re-compile the Python version on production inventory as we think it's related to #1375 (comment)
@adborden
Contributor Author

inventory-1p started hanging around 11pm PST, so that didn't work.

We chatted today and decided to focus on reproducing this on staging. I realized that whatever triggers this could be specific to a type of request or controller. I went back to look at the logs and noticed some requests were timing out.

[Fri Mar 27 22:46:47.342153 2020] [wsgi:error] [pid 14023:tid 140285466208000] [client 72.52.197.100:2452] Timeout when reading response headers from daemon process 'ckan': /etc/ckan/apache.wsgi, referer: https://inventory.data.gov/api/action/datastore_search?resource_id=102fd9bd-4737-401b-b88f-5c5b0fab94ec&filters=%7B%22MZIP09%22:%2219380%22%7D

I grabbed a couple hundred URLs from production logs to reproduce on staging; these trigger hangs:

  • /organization/mcc-gov/unredacted.json
  • /api/action/datastore_search_sql?sql=SELECT%20%22MZIP09%22%20as%20mzip09,%22MSTREE09%22%20as%20mstree09%20,%22SCHNAM09%22%20as%20schnam09,%22NCESSCH%22%20as%20ncessch,%22MSTATE09%22%20as%20mstate09%20,%22MCITY09%22%20as%20mcity09%20FROM%20%22102fd9bd-4737-401b-b88f-5c5b0fab94ec%22%20where%20%22MZIP09%22='16001'

The last one can be simplified to
curl 'localhost/api/action/datastore_search_sql?sql=SELECT+1'

I was able to verify locally that this did crash CKAN.

@adborden
Contributor Author

adborden commented Mar 28, 2020

From the above crash, I ran into this and a psycopg2 2.8.x fix for CKAN (feeling silly that we didn't check CKAN issues related to psycopg2 before). I patched CKAN on the datagov-inventory/psycopg2-2.8 branch. That resolved the error, but now I'm running into a new crash, also seemingly psycopg2 related.

This is leading me to lean towards focusing on getting inventory-app on CKAN 2.8, where I suspect these issues have already been fixed, rather than continuing down the path of try/search/fix.

@mogul mogul added this to the Sprint 20200402 milestone Apr 10, 2020
@mogul mogul closed this as completed Apr 10, 2020