Daphne process killed due to out of memory error #293

Closed
vaisaghvt opened this issue Dec 16, 2019 · 9 comments

Comments

@vaisaghvt

vaisaghvt commented Dec 16, 2019

The issue

We recently moved to Django 2.x and Channels 2.2 from a system that had been running relatively stably on Channels 1.x for almost 3 years. Our tests did not show any issues when we ran Django Channels on our test servers, and even when we load-tested with frequent requests the errors were not unexpected: some dropped connections and a few 502s, and as soon as the load was cut off things would work fine again.

This changed drastically when we deployed to our prod environment. Our Daphne process gets SIGKILLed every few minutes, and as a result our WebSocket connections are extremely unstable and require frequent reconnects. Our setup has an AWS ALB sitting in front of NGINX, which directs all non-WebSocket traffic to a gunicorn process (which is pretty stable) and all WebSocket traffic to a Daphne process.

We tried Uvicorn in between as well; Uvicorn didn't get SIGKILLed, but it was just as unresponsive.

Our ELB logs show a steady stream of 502 errors.

On investigating the logs, it seems Daphne is being killed because of an out-of-memory error when the container hits its memory limit.

We currently have about 1 GB allotted to the ECS task, and the Channels capacity is set as:


CHANNEL_LAYERS = {
    "default": {
        "BACKEND": "channels_redis.core.RedisChannelLayer",
        "CONFIG": {
            "hosts": [(REDIS_HOST, int(REDIS_PORT))],
            "capacity": 10000
        }
    },
}

Could you advise whether there is something obviously wrong with our configuration? Is the capacity impractical, or is the memory far too low for Daphne?
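For reference, "capacity" in channels_redis is a per-channel limit: once that many messages are queued on a single channel in Redis, further sends raise ChannelFull. A minimal sketch of that behaviour, using a made-up channel name and assuming the layer configured above:

from asgiref.sync import async_to_sync
from channels.exceptions import ChannelFull
from channels.layers import get_channel_layer

channel_layer = get_channel_layer()

try:
    # Queue one message on a single channel; with "capacity": 10000 this only
    # fails once 10000 messages are already waiting on that channel in Redis.
    async_to_sync(channel_layer.send)(
        "example.channel.name",  # hypothetical channel name
        {"type": "websocket.send", "text": "hello"},
    )
except ChannelFull:
    # The backlog lives in Redis rather than inside the Daphne process itself.
    pass

So whether 10000 is practical depends mostly on how quickly consumers drain their channels, not directly on Daphne's own memory.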

Requested background details

  • Your OS and runtime environment, and browser if applicable

Running in a Docker container as an ECS task on the Amazon Linux AMI. The container is built from python:3.7-alpine3.10 as the base image.

  • A pip freeze output showing your package versions
aioredis==1.3.1
amqp==2.5.2
appdirs==1.4.3
argon2-cffi==16.3.0
asgiref==3.2.3
async-timeout==3.0.1
attrs==19.3.0
autobahn==19.11.1
Automat==0.8.0
awscli==1.16.303
billiard==3.6.1.0
bleach==3.0.2
blis==0.2.4
boto3==1.9.101
botocore==1.12.253
cached-property==1.5.1
celery==4.3.0
certifi==2019.11.28
cffi==1.13.2
channels==2.2.0
channels-redis==2.4.0
chardet==3.0.4
Click==7.0
colorama==0.4.1
ConcurrentLogHandler==0.9.1
constantly==15.1.0
coverage==4.5.3
cryptography==2.8
cymem==2.0.3
Cython==0.29.14
daphne==2.4.0
defusedxml==0.6.0
Django==2.2.8
django-activity-stream==0.8.0
django-allauth==0.40.0
django-allauth-2fa==0.6
django-appconf==1.0.3
django-avatar==4.1.0
django-celery-beat==1.5.0
django-celery-results==1.1.2
django-cors-middleware==1.3.1
django-debug-toolbar==1.9.1
django-extensions==2.0.7
django-guardian==1.4.9
django-jsonfield==1.2.0
django-jsonfield-compat==0.4.4
django-oauth-toolkit==1.2.0
django-otp==0.7.4
django-otp-twilio==0.5.1
django-prometheus==1.0.15
django-redis==4.10.0
django-storages==1.7.2
django-test-plus==1.1.1
django-timezone-field==4.0
django-webpack-loader==0.6.0
djangorestframework==3.10.3
docutils==0.15.2
en-core-web-md==2.1.0
en-core-web-sm==2.1.0
factory-boy==2.11.1
Faker==3.0.0
gunicorn==20.0.4
h11==0.8.1
hiredis==1.0.1
httptools==0.0.13
hunspell==0.5.0
hyperlink==19.0.0
idna==2.8
incremental==17.5.0
isodate==0.6.0
itsdangerous==0.24
jieba==0.39
jmespath==0.9.4
joblib==0.14.0
jsonschema==2.6.0
kombu==4.6.0
lxml==4.3.3
meld3==2.0.0
msgpack==0.6.1
msgpack-python==0.5.4
murmurhash==1.0.2
numpy==1.17.2
oauthlib==3.1.0
pandas==0.25.3
Pillow==5.4.1
plac==0.9.6
preshed==2.0.1
prometheus-client==0.7.1
psycopg2==2.7.3
PuLP==1.6.9
pyasn1==0.4.8
pyasn1-modules==0.2.7
pycparser==2.19
PyHamcrest==1.9.0
PyJWT==1.7.1
pyOpenSSL==19.1.0
pyparsing==2.4.5
PySocks==1.7.1
python-crontab==2.4.0
python-dateutil==2.8.0
python3-openid==3.1.0
pytz==2019.3
PyYAML==5.1.2
qrcode==6.1
raven==6.10.0
redis==3.3.8
reportlab==3.4.0
requests==2.22.0
requests-file==1.4.3
requests-oauthlib==1.3.0
requests-toolbelt==0.9.1
rsa==3.4.2
s3transfer==0.2.1
scikit-learn==0.22
scipy==1.3.1
sentry-sdk==0.13.2
service-identity==18.1.0
six==1.13.0
slackclient==1.2.1
spacy==2.1.0
spacy-hunspell==0.1.0
sqlparse==0.3.0
srsly==0.2.0
statistics==1.0.3.5
supervisor==4.0.4
text-unidecode==1.3
thinc==7.0.8
tldextract==2.2.0
tqdm==4.40.0
twilio==6.30.0
Twisted==19.10.0
txaio==18.8.1
urllib3==1.25.7
uvicorn==0.10.8
uvloop==0.14.0
vine==1.3.0
wasabi==0.4.2
webencodings==0.5.1
websocket-client==0.56.0
websockets==8.1
xlrd==1.2.0
yara-python==3.10.0
zeep==2.5.0
zh-model==0.0.0
zope.interface==4.7.1

  • What you expected to happen vs. what actually happened

I didn't expect the process to keep getting SIGKILLed. Earlier versions of Channels did not have this issue.

  • How you're running Channels (runserver? daphne/runworker? Nginx/Apache in front?)

I'm running Channels via Daphne under supervisor, behind Nginx, which itself sits behind a load balancer.

  • Console logs and full tracebacks of any errors

Probably relevant logs:

[208557.248169] Task in /ecs/63f414dd-5286-4cf8-a911-8f161bbd09c4/0721d1d3babfc0a1f64a10b6793ca243720aad49f733875cbf9d075dcc467e90 killed as a result of limit of /ecs/63f414dd-5286-4cf8-a911-8f161bbd09c4/0721d1d3babfc0a1f64a10b6793ca243720aad49f733875cbf9d075dcc467e90
[208557.269384] memory: usage 1048572kB, limit 1048576kB, failcnt 2277610
[208557.274009] memory+swap: usage 1048572kB, limit 2097152kB, failcnt 0
[208557.278596] kmem: usage 9772kB, limit 9007199254740988kB, failcnt 0
[208557.283176] Memory cgroup stats for /ecs/63f414dd-5286-4cf8-a911-8f161bbd09c4/0721d1d3babfc0a1f64a10b6793ca243720aad49f733875cbf9d075dcc467e90: cache:532KB rss:1037704KB rss_huge:0KB shmem:28KB mapped_file:300KB dirty:0KB writeback:0KB swap:0KB inactive_anon:28KB active_anon:1037704KB inactive_file:356KB active_file:164KB unevictable:0KB
[208557.308074] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
[208557.316418] [28485] 0 28485 407 30 4 3 0 0 secrets_entrypo
[208557.324641] [28533] 0 28533 406 31 5 3 0 0 run_microservic
[208557.332463] [29518] 0 29518 5723 3634 15 3 0 0 supervisord
[208557.340081] [29520] 102 29520 251 24 4 3 0 0 chronyd
[208557.347886] [29521] 0 29521 5628 3571 15 3 0 0 gunicorn
[208557.355402] [29522] 0 29522 2007 259 8 3 0 0 nginx
[208557.362926] [29525] 100 29525 2188 368 7 3 0 0 nginx
[208557.370375] [29526] 100 29526 2121 366 7 3 0 0 nginx
[208557.377736] [29531] 0 29531 91045 49348 155 3 0 0 gunicorn
[208557.385270] [29532] 0 29532 87644 46745 148 3 0 0 gunicorn
[208557.392925] [ 2642] 0 2642 194931 157096 357 5 0 0 daphne
[208557.400421] Memory cgroup out of memory: Kill process 2642 (daphne) score 582 or sacrifice child
[208557.407897] Killed process 2642 (daphne) total-vm:779724kB, anon-rss:628384kB, file-rss:0kB, shmem-rss:0kB


@JohnDoee
Contributor

Check out django/channels#1181 (comment) - the thread pool doesn't reuse threads until it has created all of them, which is a bit of a weird design IMO.

From the Channels README:

By default, the number of threads is set to "the number of CPUs * 5", so you may see up to this number of threads. If you want to change it, set the ASGI_THREADS environment variable to the maximum number you wish to allow.

So if you have 16 cores, that's 80 threads. If each request can use up to 150 MB of RAM, that's 12 GB tied up, because Python doesn't really release memory back to the OS (or share it between threads?).
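A rough back-of-the-envelope version of that math, where the per-request figure is just the assumption above and the sizing mirrors the README's "CPUs * 5" rule:

import multiprocessing
import os

# ASGI_THREADS wins if set; otherwise fall back to the documented CPUs * 5.
threads = int(os.environ.get("ASGI_THREADS", multiprocessing.cpu_count() * 5))

per_request_mb = 150  # assumed worst-case resident memory per sync request
print(f"{threads} threads -> worst case roughly {threads * per_request_mb} MB")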

@vaisaghvt
Author

@JohnDoee thanks for the super quick response. Where do I set this ASGI_THREADS? Is it a setting in the channel layers, or should it be set on the container?

@JohnDoee
Contributor

It's an environment variable; with Docker it'd be something like

-e ASGI_THREADS=8

@vaisaghvt
Author

Hi @JohnDoee, I feel like I'm abusing your responsiveness, but if I set ASGI_THREADS=1 then my WebSockets stop working entirely. "WebSocket is closed before the connection is established." is all I get. Is that normal?

@JohnDoee
Contributor

I've never tried to set it that low, so I wouldn't know for sure. It's not the behaviour I'd expect given my knowledge of ASGI and Channels, though.

Only sync code should be thrown into a thread (e.g. database operations).
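To illustrate that split, here is a minimal consumer sketch (myapp.models.Notification is a hypothetical model, not something from this thread): only the ORM query is handed to the thread pool that ASGI_THREADS sizes; the WebSocket handling itself stays on the event loop.

from channels.db import database_sync_to_async
from channels.generic.websocket import AsyncJsonWebsocketConsumer

from myapp.models import Notification  # hypothetical model


class NotificationConsumer(AsyncJsonWebsocketConsumer):
    async def connect(self):
        await self.accept()

    async def receive_json(self, content):
        # The synchronous ORM call runs in a worker thread via SyncToAsync;
        # everything else in this consumer stays on the event loop.
        unread = await database_sync_to_async(self._unread_count)()
        await self.send_json({"unread": unread})

    def _unread_count(self):
        return Notification.objects.filter(read=False).count()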

@vaisaghvt
Author

@JohnDoee thanks a lot! That helps. So ASGI threads are what the sync tasks get offloaded to.

So I'm seeing behaviour where my reconnecting WebSocket fails to establish a connection with the above error for a few tries and then succeeds on the nth try. Sometimes it works on the first try, but after keeping the server running for a couple of hours it now almost always takes 5-10 tries to establish a connection. We initially thought it had something to do with the memory issue above, but the ASGI_THREADS setting seems to have fixed that.

Do you have any advice on what other parameters I should play around with? Should I be looking at the handshake timeout?

@JohnDoee
Contributor

WebSocket handshakes must be accepted by your application, so a timeout like that would mean that part isn't happening, i.e. the request isn't reaching anything that can accept or deny the handshake. That's about the best I can tell you with the information you've provided.

A small side note: one change from Channels 1 to Channels 2 is the meaning of "channel layer" - it's only necessary if you have inter-application communication, e.g. group chats. If you don't, then you should probably just remove it.
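To make both points concrete, here is a sketch of the kind of consumer that actually needs a channel layer (the group name and message type are made up): the handshake only completes once accept() runs, and the layer is only touched for the cross-consumer group fan-out.

from channels.generic.websocket import AsyncWebsocketConsumer


class ChatConsumer(AsyncWebsocketConsumer):
    group_name = "room-lobby"  # hypothetical group name

    async def connect(self):
        await self.channel_layer.group_add(self.group_name, self.channel_name)
        # accept() is what completes the WebSocket handshake; if it never runs,
        # the client sees "WebSocket is closed before the connection is established".
        await self.accept()

    async def disconnect(self, code):
        await self.channel_layer.group_discard(self.group_name, self.channel_name)

    async def receive(self, text_data=None, bytes_data=None):
        # Fan the message out to every consumer in the group via the layer.
        await self.channel_layer.group_send(
            self.group_name, {"type": "chat.message", "text": text_data}
        )

    async def chat_message(self, event):
        # Handler for the "chat.message" type sent through the layer above.
        await self.send(text_data=event["text"])

A consumer that only ever replies to its own client never calls the layer, which is why CHANNEL_LAYERS can simply be dropped from settings in that case.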

@vaisaghvt
Author

Hi @JohnDoee, thanks for all your help! I understand the handshake isn't related, so I'm closing this thread.

@caesar4321

Did you get the issue resolved?
