Slow memory leak in RedisChannelLayer.receive_buffer due to lack of cleanup/expiry #212
I'm attempting to work around this issue in the AWX project itself at the moment by adding some code which tracks disconnections and prevents the Redis channel layer from continuing to handle messages for that specific channel. It's worth mentioning, however, that this workaround probably isn't perfect, and I'm guessing it still sometimes leaks a little bit of data. In my opinion, what's really needed is some mechanism for channels_redis to clean this data structure up proactively after messages have expired. Aside from a change in the Redis channel layer to add housekeeping to this data structure, can you think of anything else I could do here?
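For illustration, here's a rough sketch of that kind of workaround (hypothetical; not the actual AWX patch), assuming channels_redis ~3.0 internals, where the layer keeps an in-memory `receive_buffer` dict keyed by channel name:

```python
from channels.generic.websocket import AsyncJsonWebsocketConsumer


class LeakGuardConsumer(AsyncJsonWebsocketConsumer):
    async def disconnect(self, code):
        # The base class has already discarded this channel from its groups by
        # the time disconnect() runs; here we additionally drop whatever is
        # still buffered in memory for this channel, so the per-process
        # receive_buffer entry does not linger after the client is gone.
        layer = self.channel_layer
        if hasattr(layer, "receive_buffer"):
            layer.receive_buffer.pop(self.channel_name, None)
        await super().disconnect(code)
```

Popping the buffer entry only drops what has already been pulled into this process; messages still queued in Redis for a group the channel never left can repopulate it, which is why the proactive housekeeping asked about above still matters.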
Hi @ryanpetrello. First up, can you update to 3.0.1? There were changes in 4c88088 related to expiration.
@carltongibson I can give that a go, but I think the issue here isn't that keys aren't getting expired in Redis; it's that the in-memory receive_buffer data structure is never cleaned up.
I'll spend some time in the next day or so trying to make a simple reproducer for this outside of Ansible AWX with 3.0.1.
Hi @ryanpetrello, yes, I realised that reading in more depth. If you can pin down a minimal case that would be awesome. Thanks.
I'm able to reproduce this in a simple Django/Channels project (outside of AWX). Here's a repo with instructions for reproducing it: https://github.com/ryanpetrello/channels-redis-leak/
@ryanpetrello Super. Thanks! If you fancy, take a look at django/channels#165, which does a rewrite here. I haven't had bandwidth at all to look at that still, but your input would be good, and it may inspire something at worst.
To be honest, @carltongibson, now that I'm reading it, this issue sounds like a duplicate of #384. I'm unsure if it's worth tracking there or here. I'll give #165 a whirl and see if it fixes the memory leak for me, though.
No, let's keep this open. Best to triangulate in these slow burners.
@carltongibson I tried the patch at #165 and it also leaks for me. @davidfstr noticed the same thing: the suggested rewrite also leaks. I do agree that the patch you linked to is simpler, but it still leaks. I think the only way around this issue is to implement some form of TTL and expiration for in-memory buffers. I'm currently tinkering with this idea.
Ok, super. Thanks for giving it a run. If you have input here, that would be great. My attention is entirely on updating Channels for ASGI 3's single callables and documenting use with Django 3.1, so I do need a bit of input to resolve this any time soon.
@carltongibson here's my attempt at this:
@davidfstr I believe I'm encountering a duplicate of the issue you reported at #384. Did you ever come up with a solution other than periodically restarting Daphne?
respect the capacity setting so that the receive_buffer does not grow without bounds see: django#212
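For context, here's a minimal sketch of the capacity-bounding idea the commit message above describes; the actual channels_redis change may differ in detail:

```python
import asyncio
import collections


class BoundedQueue(asyncio.Queue):
    """Queue that drops its oldest item instead of raising QueueFull."""

    def put_nowait(self, item):
        if self.full():
            # Evict the oldest buffered message so a channel nobody reads
            # from can never grow the in-memory buffer past its capacity.
            self.get_nowait()
        return super().put_nowait(item)


# Usage sketch: the channel layer could build its per-channel buffers with
# something like the following, where `capacity` stands in for the layer's
# configured per-channel capacity setting.
capacity = 100
receive_buffer = collections.defaultdict(lambda: BoundedQueue(capacity))
```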
Hey @carltongibson, I've been running a patch like this one (ansible/awx#7950) in production for almost a week now, and I'm no longer seeing the memory leak outlined in my reproducer.
I've been following this thread, as I'm also seeing memory leak issues. We are using channels + channels_redis + uvicorn. One thing I wanted to add to the discussion is why we are seeing this receive_buffer growth in the first place. I'm actually a bit confused how @ryanpetrello can recreate this simply by opening a ws connection and then disconnecting, based on my read of the code. In our application, this is what we observe: an error raised from https://github.com/encode/uvicorn/blob/master/uvicorn/protocols/websockets/websockets_impl.py#L157 triggers a climb in memory usage.
I also verified the hypothesis that memory grows when a channel is never removed from a group on disconnect, by overriding the `websocket_disconnect` handler so that it skips the `group_discard` call:

```python
async def websocket_disconnect(self, message):
    """
    Called when a WebSocket connection is closed. Base level so you don't
    need to call super() all the time.
    """
    logger.info(f"WEBSOCKET DISCONNECT CALLED for {self.channel_name}")
    # Don't remove the channel from the group
    # try:
    #     for group in self.groups:
    #         await self.channel_layer.group_discard(group, self.channel_name)
    # except AttributeError:
    #     raise Exception("BACKEND is unconfigured or doesn't support groups")
    await self.disconnect(message["code"])
    raise StopConsumer()
```

So overall, my understanding of what's happening is:

1. A cancellation error in our ASGI app during the ws handshake (of unknown reason) closes the websocket connection and throws an ASGI error in uvicorn.
2. This disconnection doesn't propagate to our Django consumer, so we never remove the channel from our group.
3. WS messages therefore keep getting sent to the channel that has been disconnected, which ultimately populates its receive buffer and grows it indefinitely.

I think this workaround can serve as a patch to avoid indefinite memory leakage, but I'm curious if there are any ideas about the root cause, and if there would be any way to ensure that a ws connection drop can always propagate to channels so we can remove that channel from the group.
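One mitigation worth noting (an assumption on my part, not something proposed in the thread): channels_redis has a `group_expiry` option, so stale group memberships in Redis eventually lapse even when `group_discard` is never called. It doesn't fix the in-process buffer growth by itself, but it bounds how long a dead channel keeps receiving group traffic. A settings sketch with an illustrative value:

```python
# Django settings sketch; the group_expiry value here is illustrative.
CHANNEL_LAYERS = {
    "default": {
        "BACKEND": "channels_redis.core.RedisChannelLayer",
        "CONFIG": {
            "hosts": [("localhost", 6379)],
            # Group memberships expire after an hour instead of the default
            # 86400 seconds, limiting how long messages are fanned out to
            # channels whose consumers silently went away.
            "group_expiry": 3600,
        },
    },
}
```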
Hey @peteryin21, thanks for the detailed reply. For what it's worth, I'm not using uvicorn (this project runs Daphne). I agree with your overall understanding; there's a comment thread on a related issue in channels that has similar/related thoughts. I don't think this patch is a perfect solution, but it's allowed me to get memory growth under control and not constantly restart Daphne. I'm wondering if it might also help for you.
Thanks for the report and the fix! 🥇
Which version of channels_redis fixes the memory leak?
3.1, according to the tweet linked above
I'm observing what looks like a slow memory leak in channels-redis in a project I maintain that uses Daphne (Ansible AWX): https://github.com/ansible/awx/blob/devel/awx/main/consumers.py#L114

This looks quite similar to a recently reported (and resolved) memory leak: django/channels#1181

After noticing a slow memory leak in Daphne at high volumes of (large) messages, I decided to add some instrumentation to my routing code with `tracemalloc` to try to spot the leak, and what I found was that we were leaking the actual messages somewhere.
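For illustration, a minimal sketch of the kind of `tracemalloc` instrumentation described (hypothetical, not the exact code used in AWX):

```python
import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames of traceback per allocation
baseline = tracemalloc.take_snapshot()


def report_top_growth(limit=10):
    """Print the call sites whose allocations grew the most since baseline."""
    snapshot = tracemalloc.take_snapshot()
    for stat in snapshot.compare_to(baseline, "lineno")[:limit]:
        print(stat)
```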
I spent some time reading over https://github.com/django/channels_redis/blob/master/channels_redis/core.py and was immediately suspicious of `self.receive_buffer`. In my reproduction, here's what I'm seeing: `self.receive_buffer[channel]` for the prior browser client is still growing, and these messages never end up being freed. I can open subsequent browser sessions, establish new websocket connections, and, using an interactive debugger, see that the channel layer's `receive_buffer` is keeping around stale messages for old channels which will never be delivered (or freed).
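A hypothetical version of that interactive inspection (assuming channels_redis ~3.0 internals, where `receive_buffer` maps channel names to per-channel asyncio queues):

```python
from channels.layers import get_channel_layer

layer = get_channel_layer()
for channel, queue in layer.receive_buffer.items():
    # Channels whose clients have disconnected keep a non-zero, growing size.
    print(channel, queue.qsize())
```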
I think the root of this issue is that the `self.receive_buffer` data structure doesn't have any form of internal cleanup or expiry (which is supported by this code comment): channels_redis/channels_redis/core.py, line 535 in 90129a6.

In practice, having only a send / receive interface might make this a hard/expensive problem to solve (tracking and managing in-memory expirations). One thing that stands out to me about the Redis channel layer is that it does do expiration in Redis, but once messages are loaded into memory, there doesn't seem to be any sort of expiration logic for the `receive_buffer` data structure (and I'm able to see it grow in an unbounded way fairly easily in my reproduction).