DeflatePerMessage VS memory consumption #1862

Closed
Kludex opened this issue Feb 3, 2023 Discussed in #1850 · 15 comments

@Kludex
Member

Kludex commented Feb 3, 2023

Discussed in #1850

Originally posted by M1ha-Shvn January 27, 2023
Hi.
I'm developing a server that serves websocket connections using FastAPI.
I've noticed that creating several thousand simultaneous websocket connections leads to high memory usage (4 GB per 5-8k websocket connections in my case). I started debugging it with tracemalloc and found that the largest share of this memory is consumed by the websockets deflate extension in this line.
After that I dug into the websockets deflate mechanics and found that it can be tuned to achieve lower memory consumption using a custom ServerPerMessageDeflateFactory. I tried searching for it in the FastAPI => Starlette => Uvicorn code and it led me here.
The source of the memory leak:

  1. Uvicorn creates a separate PerMessageDeflate instance for each websocket connection (using ServerPerMessageDeflateFactory). From my point of view this is disadvantageous behaviour: it would be much better if the instance were created not per websocket connection but per combination of connection parameters (like a Singleton pattern, but one instance per combination of parameters, something like lru_cache).
  2. The only setting Uvicorn exposes for tuning deflate is disabling it entirely with --ws-per-message-deflate, which is not flexible enough for different cases.
@Kludex
Member Author

Kludex commented Feb 3, 2023

@M1ha-Shvn I still don't have a clear view on how to solve point 1 that you mention above. Would you like to show me in a PR?

@Kludex
Member Author

Kludex commented Feb 3, 2023

Also, do you have an MRE for me to confirm the issue?

@M1ha-Shvn

M1ha-Shvn commented Feb 3, 2023

I'll try making an MRE and a PR a bit later; no time today. What I can give you quickly is the following tracemalloc log. It was made under these conditions:

  1. --workers = 16
  2. a tracemalloc snapshot is dumped by a task created with asyncio.create_task every 10 minutes, so I can compare snapshots (a sketch of this is at the end of this comment)
  3. Here is one of them. At this moment ~3000 active websockets are open (I've built a tester based on the websockets library to open them; traffic to the sockets is generated from a production server, something like 200 messages per second, but each websocket receives only a small share of this traffic)
  4. The next trace showed that the only thing that grew significantly in size was self.encoder = zlib.compressobj (250+ MB for ~5000 simultaneous sockets)
Line stat
/home/ubuntu2/fastapi-rts/venv/lib/python3.8/site-packages/websockets/extensions/permessage_deflate.py:64: size=148 MiB, count=4040, average=37.4 KiB
/home/ubuntu2/fastapi-rts/core/models.py:292: size=69.8 MiB, count=40733, average=1797 B
/home/ubuntu2/fastapi-rts/venv/lib/python3.8/site-packages/websockets/extensions/permessage_deflate.py:61: size=4148 KiB, count=1734, average=2449 B
/home/ubuntu2/fastapi-rts/venv/lib/python3.8/site-packages/websockets/datastructures.py:122: size=1691 KiB, count=28868, average=60 B
/usr/lib/python3.8/json/decoder.py:353: size=1283 KiB, count=13772, average=95 B
/home/ubuntu2/fastapi-rts/core/models.py:809: size=1280 KiB, count=1, average=1280 KiB
/home/ubuntu2/fastapi-rts/core/models.py:790: size=861 KiB, count=12244, average=72 B
/home/ubuntu2/fastapi-rts/venv/lib/python3.8/site-packages/uvicorn/protocols/websockets/websockets_impl.py:173: size=832 KiB, count=16619, average=51 B
/home/ubuntu2/fastapi-rts/core/models.py:789: size=730 KiB, count=18876, average=40 B
/home/ubuntu2/fastapi-rts/venv/lib/python3.8/site-packages/sentry_sdk/scope.py:299: size=680 KiB, count=1329, average=524 B


Traceback stat

4040 memory blocks: 151178.6 KiB
  File "/home/ubuntu2/fastapi-rts/venv/lib/python3.8/site-packages/websockets/extensions/permessage_deflate.py", line 64
    self.encoder = zlib.compressobj(

40733 memory blocks: 71484.1 KiB
  File "/home/ubuntu2/fastapi-rts/core/models.py", line 292
    res += pickle.dumps(self.data, fix_imports=False)

1734 memory blocks: 4147.6 KiB
  File "/home/ubuntu2/fastapi-rts/venv/lib/python3.8/site-packages/websockets/extensions/permessage_deflate.py", line 61
    self.decoder = zlib.decompressobj(wbits=-self.remote_max_window_bits)

28868 memory blocks: 1691.0 KiB
  File "/home/ubuntu2/fastapi-rts/venv/lib/python3.8/site-packages/websockets/datastructures.py", line 122
    self._dict.setdefault(key.lower(), []).append(value)

13772 memory blocks: 1283.5 KiB
  File "/usr/lib/python3.8/json/decoder.py", line 353
    obj, end = self.scan_once(s, idx)

1 memory blocks: 1280.0 KiB
  File "/home/ubuntu2/fastapi-rts/core/models.py", line 809
    self._cache[channel_name] = self._cache.pop(channel_name)

12244 memory blocks: 861.3 KiB
  File "/home/ubuntu2/fastapi-rts/core/models.py", line 790
    messages=UniqueSortedList((compressed_msg,)))

16619 memory blocks: 832.2 KiB
  File "/home/ubuntu2/fastapi-rts/venv/lib/python3.8/site-packages/uvicorn/protocols/websockets/websockets_impl.py", line 173
    (name.encode("ascii"), value.encode("ascii"))

18876 memory blocks: 729.8 KiB
  File "/home/ubuntu2/fastapi-rts/core/models.py", line 789
    self._cache[channel_name] = CacheElement(last_update=msg.created.timestamp(),

1329 memory blocks: 679.8 KiB
  File "/home/ubuntu2/fastapi-rts/venv/lib/python3.8/site-packages/sentry_sdk/scope.py", line 299
    self._breadcrumbs = deque()  # type: Deque[Breadcrumb]

As you can see, self.encoder = zlib.compressobj( allocates ~150 MB per worker => ~1 GB per app.
Here is the graph of memory committed by the server where Uvicorn runs:
[image: graph of memory committed by the server]
At the very beginning the app was rebooted. After that I started creating sockets with a 0.05 s delay between each socket.
When no sockets are being created, the graph looks like this:
[image: graph of memory committed with no sockets created]

I've also tested Gunicorn with the Uvicorn worker class, with the same memory-leak effect.
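
For reference, a minimal sketch of the periodic snapshot dump described in point 2 above; the file names, interval and startup wiring are illustrative, not the exact code I run:

```python
import asyncio
import time
import tracemalloc


async def dump_snapshots_forever(interval: float = 600.0) -> None:
    # Periodically persist tracemalloc snapshots so two of them can be
    # diffed later with Snapshot.compare_to().
    while True:
        snapshot = tracemalloc.take_snapshot()
        snapshot.dump(f"/tmp/tracemalloc-{int(time.time())}.snap")
        await asyncio.sleep(interval)


# Somewhere during application startup (e.g. a FastAPI startup handler):
# tracemalloc.start(25)  # keep up to 25 frames per traceback
# asyncio.create_task(dump_snapshots_forever())
```

Snapshots dumped this way can be reloaded with tracemalloc.Snapshot.load() and turned into the per-line statistics shown above with snapshot.statistics("lineno").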

@M1ha-Shvn

M1ha-Shvn commented Feb 3, 2023

@M1ha-Shvn I still don't have a clear view on how to solve point 1 that you mention above. Would you like to show me in a PR?

What I suggest is one of the following:

  1. Subclass ServerPerMessageDeflateFactory so that it returns a single PerMessageDeflate instance every time it is called, instead of a new instance on every call. This instance should be stored somewhere between calls, for instance in a class attribute of the ServerPerMessageDeflateFactory subclass. Then use this subclass as the extension instead of ServerPerMessageDeflateFactory (a rough sketch is below this list).
  2. Optionally add some tuning parameters, described here, and pass them to ServerPerMessageDeflateFactory when the instance is created.
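
A rough, untested sketch of suggestion 1 (the class name is made up, and it deliberately ignores the fact that different clients may negotiate different parameters):

```python
from typing import Optional

from websockets.extensions.permessage_deflate import (
    PerMessageDeflate,
    ServerPerMessageDeflateFactory,
)


class SharedServerPerMessageDeflateFactory(ServerPerMessageDeflateFactory):
    """Hand every connection the same PerMessageDeflate instance."""

    _shared: Optional[PerMessageDeflate] = None

    def process_request_params(self, params, accepted_extensions):
        # Let the parent negotiate the response params as usual, but cache
        # the first extension object it builds and reuse it afterwards.
        response_params, extension = super().process_request_params(
            params, accepted_extensions
        )
        if SharedServerPerMessageDeflateFactory._shared is None:
            SharedServerPerMessageDeflateFactory._shared = extension
        return response_params, SharedServerPerMessageDeflateFactory._shared
```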

@Kludex
Member Author

Kludex commented Feb 6, 2023

Thanks!

@M1ha-Shvn

M1ha-Shvn commented Feb 6, 2023

I haven't tested this PR; I made it quickly to show the general idea:
#1864

@Kludex
Member Author

Kludex commented Feb 6, 2023

@aaugustin I'm sorry to bother, but when you have time, would you mind giving me your input here to understand if we are going in the right direction? 🙏

@aaugustin
Contributor

If you're using context takeover (i.e. you don't set no_context_takeover=True), then you cannot share PerMessageDeflate, because the context is connection-local: it depends on previously exchanged messages.

If you aren't using context takeover, then you don't have a memory usage problem.

@aaugustin
Contributor

Apart from that, is there a knob to configure max_window_bits? With a very high number of connections, you would probably benefit from lowering it.

See https://websockets.readthedocs.io/en/stable/topics/compression.html#compression-settings for defaults.
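
For illustration, when using the websockets API directly (Uvicorn doesn't expose these knobs today, which is what this issue asks for), the factory from that guide can be configured along these lines; the exact values are only examples:

```python
from websockets.extensions.permessage_deflate import ServerPerMessageDeflateFactory

# Smaller windows, a lower memLevel and disabled context takeover trade
# compression ratio for per-connection memory.
factory = ServerPerMessageDeflateFactory(
    server_max_window_bits=11,
    client_max_window_bits=11,
    server_no_context_takeover=True,
    client_no_context_takeover=True,
    compress_settings={"memLevel": 4},
)

# Passed to the server with: websockets.serve(handler, host, port, extensions=[factory])
```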

@M1ha-Shvn

Hi, thanks for your answer.

If you're using context takeover (i.e. you don't set no_context_takeover=True), then you cannot share PerMessageDeflate, because context is connection-local. It depends on previously exchanged messages.

First of all, I understand that I've changed this behavior. But what I don't understand is why the deflate object is created per connection. Why can't the deflate context be shared between all connections? From my point of view, a typical websocket server has lots of client connections and sends very similar (particularly JSON) messages to all clients, and therefore over all connections. So it would be worth using the same deflate object with a single context and increasing its capacity, so it could compress messages better. Of course, this depends on the connection settings set by request headers, but there is only a small number of combinations of these parameters, so a limited number of objects could be created and reused.

If you aren't using context takeover, then you don't have a memory usage problem.

I'm not so sure about that. Creating an object in Python has a memory cost. Even if I don't use context takeover, PerMessageDeflate objects would still be created and consume memory for each connection.
P.S. There is also currently no way in Uvicorn to tune PerMessageDeflateFactory or PerMessageDeflate, only to disable it.

Apart from that, is there a knob to configure max_window_bits? With a very high number of connections, probably you would benefit from lowering it.

See https://websockets.readthedocs.io/en/stable/topics/compression.html#compression-settings for defaults.

Yes, adding deflate tuning settings to the Uvicorn settings was one of my proposals. Though it would reduce the problem for me, it would not solve it: I'd just have a higher connection limit while still consuming a lot of memory per connection.

@aaugustin
Contributor

Kludex asked for my input; I gave it. If you don't trust me, run your own experiments and reach your own conclusions.

In your experiments, don't stop at opening connections; exchange a significant number of different messages on each connection, in both directions, and make sure they make it through the compress / decompress cycle correctly.
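
A minimal sketch of such an experiment with the websockets client; the URL, the message shape and the assumption that the endpoint echoes messages back are all placeholders:

```python
import asyncio
import json

import websockets


async def exercise(url: str, messages: int = 1000) -> None:
    # Send varied payloads and read the responses, so the compressor and
    # decompressor on both sides of each connection actually do work.
    async with websockets.connect(url) as ws:
        for i in range(messages):
            await ws.send(json.dumps({"seq": i, "payload": "x" * (i % 512)}))
            received = json.loads(await ws.recv())
            assert received["seq"] == i  # round-trip survived compression


async def main() -> None:
    # Many concurrent connections, as in the original report.
    await asyncio.gather(*(exercise("ws://localhost:8000/ws") for _ in range(1000)))


asyncio.run(main())
```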

@aaugustin
Contributor

aaugustin commented Feb 7, 2023

In case it helps:

  • The context is synchronized between both ends of the connection and its content depends on messages that are exchanged.
  • When context takeover is disabled, compressor / decompressor objects will be garbage collected as soon as compression / decompression is done.

@Kludex
Member Author

Kludex commented Feb 7, 2023

Thanks @aaugustin! I really appreciate it. 🙏

@Kludex
Member Author

Kludex commented Mar 9, 2023

In your experiments, don't stop at opening connections; exchange a significant number of different messages on each connection, in both directions, and make sure they make it through the compress / decompress cycle correctly.

Did you try this, @M1ha-Shvn?

encode locked and limited conversation to collaborators Mar 13, 2023
Kludex converted this issue into discussion #1900 Mar 13, 2023

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
