Daphne process killed due to out of memory error #293

Closed
vaisaghvt opened this issue Dec 16, 2019 · 9 comments

Comments

@vaisaghvt

vaisaghvt commented Dec 16, 2019

The issue

We recently moved to Django 2.x and Channels 2.2 from a system that had been running relatively stably on Channels 1.x for almost 3 years. Our tests did not show any issues when we ran Django Channels on our test servers, and even when we load-tested with frequent requests the errors were not unexpected: some dropped connections and a few 502s, and as soon as the load was cut off things would work fine again.

This changed drastically when we deployed to our prod environment. Our Daphne process gets SIGKILLed every few minutes, and as a result our WebSocket connections are extremely unstable and require frequent reconnects. Our setup has an AWS ALB sitting in front of NGINX, which directs all non-WebSocket traffic to a gunicorn process (which is pretty stable) and all WebSocket traffic to a Daphne process.

We tried Uvicorn in between as well; Uvicorn didn't get SIGKILLed, but it was just as unresponsive.

Our ELB logs show a steady stream of 502 errors.

On investigating the logs, it seems Daphne is being killed because of an out-of-memory error when the container hits its memory limit.

We currently have about 1 GB allotted to the ECS task, and the Channels capacity is set as:


CHANNEL_LAYERS = {
    "default": {
        "BACKEND": "channels_redis.core.RedisChannelLayer",
        "CONFIG": {
            "hosts": [(REDIS_HOST, int(REDIS_PORT))],
            "capacity": 10000
        }
    },
}

Could you advise whether there is something obviously wrong with our configuration? Is the capacity impractical, or is the memory far too low for Daphne?
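For reference, "capacity" in channels_redis is a per-channel limit: once that many messages are queued on a single channel in Redis, further sends raise ChannelFull. A minimal sketch of that behaviour, using a made-up channel name and assuming the layer configured above:

from asgiref.sync import async_to_sync
from channels.exceptions import ChannelFull
from channels.layers import get_channel_layer

channel_layer = get_channel_layer()

try:
    # Queue one message on a single channel; with "capacity": 10000 this only
    # fails once 10000 messages are already waiting on that channel in Redis.
    async_to_sync(channel_layer.send)(
        "example.channel.name",  # hypothetical channel name
        {"type": "websocket.send", "text": "hello"},
    )
except ChannelFull:
    # The backlog lives in Redis rather than inside the Daphne process itself.
    pass

So whether 10000 is practical depends mostly on how quickly consumers drain their channels, not directly on Daphne's own memory.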

Requested background details

  • Your OS and runtime environment, and browser if applicable

Running in a Docker container as an ECS task on the Amazon Linux AMI. The container is built from python:3.7-alpine3.10 as the base image.

  • A pip freeze output showing your package versions
aioredis==1.3.1
amqp==2.5.2
appdirs==1.4.3
argon2-cffi==16.3.0
asgiref==3.2.3
async-timeout==3.0.1
attrs==19.3.0
autobahn==19.11.1
Automat==0.8.0
awscli==1.16.303
billiard==3.6.1.0
bleach==3.0.2
blis==0.2.4
boto3==1.9.101
botocore==1.12.253
cached-property==1.5.1
celery==4.3.0
certifi==2019.11.28
cffi==1.13.2
channels==2.2.0
channels-redis==2.4.0
chardet==3.0.4
Click==7.0
colorama==0.4.1
ConcurrentLogHandler==0.9.1
constantly==15.1.0
coverage==4.5.3
cryptography==2.8
cymem==2.0.3
Cython==0.29.14
daphne==2.4.0
defusedxml==0.6.0
Django==2.2.8
django-activity-stream==0.8.0
django-allauth==0.40.0
django-allauth-2fa==0.6
django-appconf==1.0.3
django-avatar==4.1.0
django-celery-beat==1.5.0
django-celery-results==1.1.2
django-cors-middleware==1.3.1
django-debug-toolbar==1.9.1
django-extensions==2.0.7
django-guardian==1.4.9
django-jsonfield==1.2.0
django-jsonfield-compat==0.4.4
django-oauth-toolkit==1.2.0
django-otp==0.7.4
django-otp-twilio==0.5.1
django-prometheus==1.0.15
django-redis==4.10.0
django-storages==1.7.2
django-test-plus==1.1.1
django-timezone-field==4.0
django-webpack-loader==0.6.0
djangorestframework==3.10.3
docutils==0.15.2
en-core-web-md==2.1.0
en-core-web-sm==2.1.0
factory-boy==2.11.1
Faker==3.0.0
gunicorn==20.0.4
h11==0.8.1
hiredis==1.0.1
httptools==0.0.13
hunspell==0.5.0
hyperlink==19.0.0
idna==2.8
incremental==17.5.0
isodate==0.6.0
itsdangerous==0.24
jieba==0.39
jmespath==0.9.4
joblib==0.14.0
jsonschema==2.6.0
kombu==4.6.0
lxml==4.3.3
meld3==2.0.0
msgpack==0.6.1
msgpack-python==0.5.4
murmurhash==1.0.2
numpy==1.17.2
oauthlib==3.1.0
pandas==0.25.3
Pillow==5.4.1
plac==0.9.6
preshed==2.0.1
prometheus-client==0.7.1
psycopg2==2.7.3
PuLP==1.6.9
pyasn1==0.4.8
pyasn1-modules==0.2.7
pycparser==2.19
PyHamcrest==1.9.0
PyJWT==1.7.1
pyOpenSSL==19.1.0
pyparsing==2.4.5
PySocks==1.7.1
python-crontab==2.4.0
python-dateutil==2.8.0
python3-openid==3.1.0
pytz==2019.3
PyYAML==5.1.2
qrcode==6.1
raven==6.10.0
redis==3.3.8
reportlab==3.4.0
requests==2.22.0
requests-file==1.4.3
requests-oauthlib==1.3.0
requests-toolbelt==0.9.1
rsa==3.4.2
s3transfer==0.2.1
scikit-learn==0.22
scipy==1.3.1
sentry-sdk==0.13.2
service-identity==18.1.0
six==1.13.0
slackclient==1.2.1
spacy==2.1.0
spacy-hunspell==0.1.0
sqlparse==0.3.0
srsly==0.2.0
statistics==1.0.3.5
supervisor==4.0.4
text-unidecode==1.3
thinc==7.0.8
tldextract==2.2.0
tqdm==4.40.0
twilio==6.30.0
Twisted==19.10.0
txaio==18.8.1
urllib3==1.25.7
uvicorn==0.10.8
uvloop==0.14.0
vine==1.3.0
wasabi==0.4.2
webencodings==0.5.1
websocket-client==0.56.0
websockets==8.1
xlrd==1.2.0
yara-python==3.10.0
zeep==2.5.0
zh-model==0.0.0
zope.interface==4.7.1

  • What you expected to happen vs. what actually happened

I didn't expect the process to keep getting SIGKILLed. Earlier versions of Channels did not have this issue.

  • How you're running Channels (runserver? daphne/runworker? Nginx/Apache in front?)

I'm running Channels via Daphne under supervisor, behind Nginx, which itself sits behind a load balancer.

  • Console logs and full tracebacks of any errors

Probably relevant logs:

[208557.248169] Task in /ecs/63f414dd-5286-4cf8-a911-8f161bbd09c4/0721d1d3babfc0a1f64a10b6793ca243720aad49f733875cbf9d075dcc467e90 killed as a result of limit of /ecs/63f414dd-5286-4cf8-a911-8f161bbd09c4/0721d1d3babfc0a1f64a10b6793ca243720aad49f733875cbf9d075dcc467e90
[208557.269384] memory: usage 1048572kB, limit 1048576kB, failcnt 2277610
[208557.274009] memory+swap: usage 1048572kB, limit 2097152kB, failcnt 0
[208557.278596] kmem: usage 9772kB, limit 9007199254740988kB, failcnt 0
[208557.283176] Memory cgroup stats for /ecs/63f414dd-5286-4cf8-a911-8f161bbd09c4/0721d1d3babfc0a1f64a10b6793ca243720aad49f733875cbf9d075dcc467e90: cache:532KB rss:1037704KB rss_huge:0KB shmem:28KB mapped_file:300KB dirty:0KB writeback:0KB swap:0KB inactive_anon:28KB active_anon:1037704KB inactive_file:356KB active_file:164KB unevictable:0KB
[208557.308074] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
[208557.316418] [28485] 0 28485 407 30 4 3 0 0 secrets_entrypo
[208557.324641] [28533] 0 28533 406 31 5 3 0 0 run_microservic
[208557.332463] [29518] 0 29518 5723 3634 15 3 0 0 supervisord
[208557.340081] [29520] 102 29520 251 24 4 3 0 0 chronyd
[208557.347886] [29521] 0 29521 5628 3571 15 3 0 0 gunicorn
[208557.355402] [29522] 0 29522 2007 259 8 3 0 0 nginx
[208557.362926] [29525] 100 29525 2188 368 7 3 0 0 nginx
[208557.370375] [29526] 100 29526 2121 366 7 3 0 0 nginx
[208557.377736] [29531] 0 29531 91045 49348 155 3 0 0 gunicorn
[208557.385270] [29532] 0 29532 87644 46745 148 3 0 0 gunicorn
[208557.392925] [ 2642] 0 2642 194931 157096 357 5 0 0 daphne
[208557.400421] Memory cgroup out of memory: Kill process 2642 (daphne) score 582 or sacrifice child
[208557.407897] Killed process 2642 (daphne) total-vm:779724kB, anon-rss:628384kB, file-rss:0kB, shmem-rss:0kB


@JohnDoee
Contributor

Check out django/channels#1181 (comment) - the thread pool doesn't reuse threads until it has created all of them, which is a bit of a weird design IMO.

From the Channels README:

By default, the number of threads is set to "the number of CPUs * 5", so you may see up to this number of threads. If you want to change it, set the ASGI_THREADS environment variable to the maximum number you wish to allow.

So if you have 16 cores, that's 80 threads. If each request can use up to 150 MB of RAM, that's 12 GB tied up, because Python doesn't really release memory back to the OS (or share it between threads?).
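A rough back-of-the-envelope version of that math, where the per-request figure is just the assumption above and the sizing mirrors the README's "CPUs * 5" rule:

import multiprocessing
import os

# ASGI_THREADS wins if set; otherwise fall back to the documented CPUs * 5.
threads = int(os.environ.get("ASGI_THREADS", multiprocessing.cpu_count() * 5))

per_request_mb = 150  # assumed worst-case resident memory per sync request
print(f"{threads} threads -> worst case roughly {threads * per_request_mb} MB")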

@vaisaghvt
Author

@JohnDoee thanks for the super quick response. Where do I set this ASGI_THREADS? Is it a setting in the channel layers, or should it be set on the container?

@JohnDoee
Contributor

It's an environment variable; with Docker it'd be something like

-e ASGI_THREADS=8

@vaisaghvt
Author

Hi @JohnDoee, I feel like I'm abusing your responsiveness, but if I set ASGI_THREADS=1 then my WebSockets stop working entirely. "WebSocket is closed before the connection is established." is all I get. Is that normal?

@JohnDoee
Contributor

I've never tried to set it that low, so I wouldn't know for sure. It's not the behaviour I'd expect given my knowledge of ASGI and Channels, though.

Only sync code should be thrown into a thread (e.g. database operations).
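To illustrate that split, here is a minimal consumer sketch (myapp.models.Notification is a hypothetical model, not something from this thread): only the ORM query is handed to the thread pool that ASGI_THREADS sizes; the WebSocket handling itself stays on the event loop.

from channels.db import database_sync_to_async
from channels.generic.websocket import AsyncJsonWebsocketConsumer

from myapp.models import Notification  # hypothetical model


class NotificationConsumer(AsyncJsonWebsocketConsumer):
    async def connect(self):
        await self.accept()

    async def receive_json(self, content):
        # The synchronous ORM call runs in a worker thread via SyncToAsync;
        # everything else in this consumer stays on the event loop.
        unread = await database_sync_to_async(self._unread_count)()
        await self.send_json({"unread": unread})

    def _unread_count(self):
        return Notification.objects.filter(read=False).count()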

@vaisaghvt
Author

@JohnDoee thanks a lot! That helps. So ASGI threads are what the sync tasks get offloaded to.

So I'm seeing behaviour where my reconnecting WebSocket fails to establish a connection with the above error for a few tries and then succeeds on the nth try. Sometimes it works on the first try, but after keeping the server running for a couple of hours it now almost always takes 5-10 tries to establish a connection. We initially thought it had something to do with the memory issue above, but the ASGI_THREADS setting seems to have fixed that.

Do you have any advice on what other parameters I should play around with? Should I be looking at the handshake timeout?

@JohnDoee
Contributor

WebSocket handshakes must be accepted by your application, so a timeout like that would mean that part isn't happening, i.e. the request isn't reaching anything that can accept or deny the handshake. That's about the best I can tell you with the information you've provided.

A small side note: one change from Channels 1 to Channels 2 is the meaning of "channel layer" - it's only necessary if you have inter-application communication, e.g. group chats. If you don't, then you should probably just remove it.
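To make both points concrete, here is a sketch of the kind of consumer that actually needs a channel layer (the group name and message type are made up): the handshake only completes once accept() runs, and the layer is only touched for the cross-consumer group fan-out.

from channels.generic.websocket import AsyncWebsocketConsumer


class ChatConsumer(AsyncWebsocketConsumer):
    group_name = "room-lobby"  # hypothetical group name

    async def connect(self):
        await self.channel_layer.group_add(self.group_name, self.channel_name)
        # accept() is what completes the WebSocket handshake; if it never runs,
        # the client sees "WebSocket is closed before the connection is established".
        await self.accept()

    async def disconnect(self, code):
        await self.channel_layer.group_discard(self.group_name, self.channel_name)

    async def receive(self, text_data=None, bytes_data=None):
        # Fan the message out to every consumer in the group via the layer.
        await self.channel_layer.group_send(
            self.group_name, {"type": "chat.message", "text": text_data}
        )

    async def chat_message(self, event):
        # Handler for the "chat.message" type sent through the layer above.
        await self.send(text_data=event["text"])

A consumer that only ever replies to its own client never calls the layer, which is why CHANNEL_LAYERS can simply be dropped from settings in that case.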

@vaisaghvt
Author

Hi @JohnDoee, thanks for all your help! I understand the handshake isn't related, so I'm closing this thread.

@caesar4321

Did you get the issue resolved?
