
Truncated or oversized response headers received from daemon process 'ckan' #1375

Closed · adborden opened this issue Feb 13, 2020 · 20 comments
Labels: bug (Software defect or bug), component/catalog (Related to catalog component playbooks/roles), component/inventory (Inventory playbooks/roles)

@adborden
Contributor

[Thu Feb 13 23:30:47.840275 2020] [wsgi:error] [pid 11319:tid 140180173760256] [client 127.0.0.1:52666] Truncated or oversized response headers received from daemon process 'ckan': /etc/ckan/apache.wsgi

How to reproduce

  1. curl localhost/api/action/status_show

Expected behavior

200 JSON response with app status

Actual behavior

500 Internal server error. Logs contain "Truncated or oversized response headers received from daemon process"

@adborden adborden added bug Software defect or bug component/inventory Inventory playbooks/roles labels Feb 13, 2020
@adborden
Contributor Author

@FuhuXia is seeing this on the inventory sandbox and I am seeing it on inventory-1d (but not inventory-2d yet).

@adborden
Contributor Author

On inventory-1d I'm only seeing this with apache. If I run with paster, ckan returns the right response.
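For reference, the paster run here is just something along these lines, run from the CKAN virtualenv (the config path is a guess, not the actual inventory path):

/usr/lib/ckan/bin/paster serve /etc/ckan/production.ini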

@adborden
Contributor Author

Since this seems related to apache and/or mod_wsgi, it reminds me of this under-documented gotcha #1383 (comment)

I'm going to try configuring mod_wsgi to use the system Python instead of our custom version and see if that helps. It might be worth doing #1383 just to get off the system mod_wsgi/apache dependency.

@adborden
Contributor Author

I didn't really get a chance to dig into this. For next steps, I would try to install the CKAN virtualenv manually on sandbox, install gunicorn, update the wsgi script, and update the apache config similar to pycsw. If that seems stable, I would go ahead with the implementation of #1383, cribbing off of datagov-deploy-pycsw.
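A minimal sketch of that manual test, assuming the CKAN virtualenv at /usr/lib/ckan and a small wsgi.py wrapper in /etc/ckan exposing an application callable (the paths and module name are assumptions, not the current inventory layout):

sudo su -l
umask 022
/usr/lib/ckan/bin/pip install gunicorn
# serve CKAN on a local port; Apache would proxy here instead of running mod_wsgi
/usr/lib/ckan/bin/gunicorn --chdir /etc/ckan --workers 4 --bind 127.0.0.1:8080 wsgi:application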

@FuhuXia
Member

FuhuXia commented Feb 26, 2020

Found a way to bypass this error on the inventory bionic sandbox. I guess it has something to do with how certain buggy versions of psycopg2 work with mod_wsgi.

The solution is to use psycopg2==2.7.7 for DB init, then change it to psycopg2==2.8.4 before Apache starts. Hopefully after the next ckan/psycopg2 upgrade we won't need this hack anymore.

To put it into the playbook, we can keep psycopg2==2.8.4 in the freeze file but temporarily change it to psycopg2==2.7.7 before the CKAN DB init task, then apply the freeze file again afterwards.
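From the command line, the swap would look roughly like this (the virtualenv path matches the one used elsewhere in this thread; the paster config path is an assumption):

/usr/lib/ckan/bin/pip install psycopg2==2.7.7   # downgrade before DB init
/usr/lib/ckan/bin/paster --plugin=ckan db init -c /etc/ckan/production.ini
/usr/lib/ckan/bin/pip install psycopg2==2.8.4   # restore the frozen version before starting apache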

@FuhuXia
Member

FuhuXia commented Mar 3, 2020

It turns out we always run db init manually, because we have a never tag on the db-init.yml playbook. We just need to remember to downgrade psycopg2 to 2.7.7 when running the db init command and revert back to the freeze version afterwards.
https://github.com/GSA/datagov-deploy/blob/94487cb80f97a94abb593d107f11ed27d3f7a757/ansible/roles/software/ckan/inventory/tasks/main.yml#L6-L7

@adborden
Contributor Author

Ah, we reverted to psycopg2==2.7.7 as a workaround for #961 just before this issue appeared. That makes sense.

@adborden
Contributor Author

So, since GSA/inventory-app#23 is not yet merged, should this still be In Progress?

@adborden adborden added the component/catalog Related to catalog component playbooks/roles label Mar 12, 2020
@adborden
Contributor Author

I just deployed catalog-app to the bionic sandbox, which is running psycopg2==2.7.3.2, and I'm seeing this issue there too. I'm going to bump psycopg2 to 2.8.

@FuhuXia
Member

FuhuXia commented Mar 25, 2020

Bumping psycopg2 to 2.8 seems to cause another issue, #1068. The explanation is that a Python dependency is not thread safe, which causes the Postgres QueuePool to run out of connections.

Tried a solution from Server Fault: use the mod_wsgi package from pip instead of the one that comes with Ubuntu, and use the psycopg2-binary package from pip instead of psycopg2. It worked well. Will try to get it into the Ansible playbook.

@adborden
Contributor Author

Great find @FuhuXia! The thread-safety issue makes a lot of sense. I think this bolsters the idea of using gunicorn with processes instead of threads. As a quick fix, we could try to use a single thread with multiple processes https://github.com/GSA/datagov-deploy/blob/7141d2c65af46122cfcfdacc2a307f10787bd3e7/ansible/roles/software/ckan/catalog/www/templates/etc/apache2/sites-enabled/ckan.readonly.conf.j2#L24

I think we have memory capacity to allow for this.
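As a sketch of that quick fix (directive values are illustrative, not the current vhost config):

# In /etc/apache2/sites-enabled/ckan.conf, change the daemon directive to
# something like: WSGIDaemonProcess ckan display-name=ckan processes=6 threads=1
sudo apachectl configtest
sudo systemctl restart apache2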

@adborden
Contributor Author

... heh, now I feel like I'm just repeating what's in the SO post, but switching to mpm_prefork, which doesn't use threads at all, is also low effort.

@adborden
Contributor Author

@FuhuXia and I discussed this a bit. psycopg2 issue reports show a similar problem when using the binary wheel distributed with psycopg2. However, we should be building from source, so I think it's an unrelated issue.

Since then, I was looking through the mod_wsgi docs and realized that mpm_prefork doesn't have an impact on mod_wsgi when configured as a daemon, which is how we're using WSGI, so perhaps setting WSGI's threads=1 is our best option.

I verified we are in fact building psycopg2 from source.

@adborden
Contributor Author

I tried these methods, all with no success. I tried each method with a reboot in between and a 5+ minute wait to let the connections drain. After starting apache2, I started a loop to curl the dataset endpoint:

while true; do curl localhost/dataset; sleep 1; done

Within a minute, sometimes more, sometimes less, the curl loop would hang and no more output would appear in the log, indicating the hang. After 30 seconds, I'd kill the curl loop and then stop Apache. Once apache2 killed off its processes, you'd see the familiar Truncated or oversized response headers received from daemon process 'ckan': /etc/ckan/apache.wsgi in the logs. Still no stack trace.

processes=6 threads=1
Update /etc/apache2/sites-enabled/ckan.conf

processes=6 threads=1 and psycopg2-binary

sudo su -l
umask 022
/usr/lib/ckan/bin/pip uninstall psycopg2
/usr/lib/ckan/bin/pip install psycopg2-binary

processes=6 threads=1 and virtualenv mod-wsgi

sudo su -l
umask 022
/usr/lib/ckan/bin/pip install mod-wsgi

And update /etc/apache2/mods-enabled/wsgi.load to point to the virtualenv .so.
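For the record, the wsgi.load change would look roughly like this, using mod_wsgi-express from the virtualenv to print the correct LoadModule line rather than hard-coding the .so path (which varies by Python build):

# prints the LoadModule/WSGIPythonHome lines for the virtualenv-built module
/usr/lib/ckan/bin/mod_wsgi-express module-config
# copy its LoadModule line into /etc/apache2/mods-enabled/wsgi.load, then restart
sudo systemctl restart apache2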

@adborden
Contributor Author

I don't think this is particularly relevant, but even with threads=1, we're still seeing about four threads per process. Not sure where these are coming from.

The second column is the PID.

Every 2.0s: ps -eLf | grep [c]kan                                                                 inventory-1p.prod-ocsit.bsp.gsa.gov: Thu Mar 26 20:17:29 2020

www-data  3033  3029  3033  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3033  3029  3050  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3033  3029  3051  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3033  3029  3052  1    5 20:14 ?        00:00:02 ckan              -k start
www-data  3033  3029  3281  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3034  3029  3034  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3034  3029  3043  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3034  3029  3044  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3034  3029  3045  1    5 20:14 ?        00:00:02 ckan              -k start
www-data  3034  3029  3159  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3035  3029  3035  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3035  3029  3082  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3035  3029  3083  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3035  3029  3084  1    5 20:14 ?        00:00:02 ckan              -k start
www-data  3035  3029  3211  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3036  3029  3036  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3036  3029  3085  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3036  3029  3086  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3036  3029  3087  1    5 20:14 ?        00:00:02 ckan              -k start
www-data  3036  3029  3329  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3037  3029  3037  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3037  3029  3046  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3037  3029  3047  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3037  3029  3049  1    5 20:14 ?        00:00:02 ckan              -k start
www-data  3037  3029  3357  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3038  3029  3038  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3038  3029  3088  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3038  3029  3089  0    5 20:14 ?        00:00:00 ckan              -k start
www-data  3038  3029  3090  1    5 20:14 ?        00:00:02 ckan              -k start
www-data  3038  3029  3175  0    5 20:14 ?        00:00:00 ckan              -k start

@adborden
Contributor Author

adborden commented Mar 26, 2020

So I was thinking about the psycopg2 post mentioned earlier, and although I don't think it's related, it made me wonder whether our dependencies might have been built against the wrong library, or an old one. Python was also built a long time ago, when these instances were first provisioned. Maybe as an act of desperation, I tried recompiling Python (deleting the existing install so common.yml would rebuild it).

So far this actually seems to be working, but I don't really understand why. The Python version is the same (2.7.16), as are the compile options (config.status --config '--prefix' '/usr/local/lib/python2.7.16' '--enable-ipv6' '--enable-unicode=ucs4' '--enable-shared').
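The recompile itself was roughly this (the playbook path and host pattern are illustrative, not the exact invocation):

sudo rm -rf /usr/local/lib/python2.7.16   # delete the existing install
ansible-playbook ansible/common.yml --limit inventory   # common.yml rebuilds Python when the prefix is missing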

@FuhuXia
Member

FuhuXia commented Mar 26, 2020

That would explain why we never saw it in sandbox or local dev.

adborden added a commit that referenced this issue Mar 26, 2020
Mostly we just want to re-compile the Python version on production inventory as we think it's related to #1375 (comment)
@adborden
Contributor Author

inventory-1p started hanging around 11pm PST, so that didn't work.

We chatted today and decided to focus on reproducing this on staging. I realized that whatever triggers this could be specific to a type of request or controller. I went back to look at the logs and noticed some requests were timing out.

[Fri Mar 27 22:46:47.342153 2020] [wsgi:error] [pid 14023:tid 140285466208000] [client 72.52.197.100:2452] Timeout when reading response headers from daemon process 'ckan': /etc/ckan/apache.wsgi, referer: https://inventory.data.gov/api/action/datastore_search?resource_id=102fd9bd-4737-401b-b88f-5c5b0fab94ec&filters=%7B%22MZIP09%22:%2219380%22%7D

I grabbed a couple hundred URLs from production logs to reproduce on staging; these trigger hangs:

  • /organization/mcc-gov/unredacted.json
  • /api/action/datastore_search_sql?sql=SELECT%20%22MZIP09%22%20as%20mzip09,%22MSTREE09%22%20as%20mstree09%20,%22SCHNAM09%22%20as%20schnam09,%22NCESSCH%22%20as%20ncessch,%22MSTATE09%22%20as%20mstate09%20,%22MCITY09%22%20as%20mcity09%20FROM%20%22102fd9bd-4737-401b-b88f-5c5b0fab94ec%22%20where%20%22MZIP09%22='16001'

The last one can be simplified to
curl 'localhost/api/action/datastore_search_sql?sql=SELECT+1'

I was able to verify locally that this did crash CKAN.

@adborden
Contributor Author

adborden commented Mar 28, 2020

From the above crash, I ran into this and a psycopg2 2.8.x fix for CKAN (feeling silly that we didn't check CKAN issues related to psycopg2 before). I patched CKAN on the datagov-inventory/psycopg2-2.8 branch. That resolved the error, but now I'm running into a new crash, also seemingly psycopg2 related.

This is leading me to lean towards focusing on getting inventory-app on CKAN 2.8, where I suspect these issues have already been fixed, rather than continuing down the path of try/search/fix.

@mogul mogul added this to the Sprint 20200402 milestone Apr 10, 2020
@mogul mogul closed this as completed Apr 10, 2020