DeflatePerMessage VS memory consumption #1862

Closed
Kludex opened this issue Feb 3, 2023 Discussed in #1850 · 15 comments

@Kludex
Member

Kludex commented Feb 3, 2023

Discussed in #1850

Originally posted by M1ha-Shvn January 27, 2023
Hi.
I'm developing a server that serves websocket connections using FastAPI.
I've noticed that creating several thousand simultaneous websocket connections leads to high memory usage (4 GB per 5-8k websocket connections in my case). I started debugging it with tracemalloc and found that the largest share of this memory is consumed by the websockets deflate extension in this line.
After that I dug into the websockets deflate mechanics and found that it can be tuned to achieve lower memory consumption using a custom ServerPerMessageDeflateFactory. I tried searching for it in the FastAPI => Starlette => Uvicorn code and it led me here.
The source of the memory leak:

  1. Uvicorn creates a separate PerMessageDeflate instance for each websocket connection (using ServerPerMessageDeflateFactory). From my point of view this is disadvantageous behaviour: it would be much better if the instance were created not per websocket connection but per combination of connection parameters (like a Singleton pattern, but one instance per combination of parameters, something like lru_cache).
  2. The only setting Uvicorn exposes for tuning deflate is disabling it entirely with --ws-per-message-deflate, which is not flexible enough for different cases.
@Kludex
Member Author

Kludex commented Feb 3, 2023

@M1ha-Shvn I still don't have a clear view on how to solve point 1 that you mention above. Would you like to show me in a PR?

@Kludex
Member Author

Kludex commented Feb 3, 2023

Also, do you have an MRE for me to confirm the issue?

@M1ha-Shvn

M1ha-Shvn commented Feb 3, 2023

I'll try making an MRE and a PR a bit later; no time today. What I can give you quickly is the following tracemalloc log. It was made under these conditions:

  1. --workers = 16
  2. a tracemalloc snapshot is dumped by a task created with asyncio.create_task every 10 minutes, so I can compare snapshots (a sketch of this is at the end of this comment)
  3. Here is one of them. At this moment ~3000 active websockets are open (I've built a tester based on the websockets library to open them; traffic to the sockets is generated from a production server, something like 200 messages per second, but each websocket receives only a small share of this traffic)
  4. The next trace showed that the only thing that grew significantly in size was self.encoder = zlib.compressobj (250+ MB for ~5000 simultaneous sockets)
Line stat
/home/ubuntu2/fastapi-rts/venv/lib/python3.8/site-packages/websockets/extensions/permessage_deflate.py:64: size=148 MiB, count=4040, average=37.4 KiB
/home/ubuntu2/fastapi-rts/core/models.py:292: size=69.8 MiB, count=40733, average=1797 B
/home/ubuntu2/fastapi-rts/venv/lib/python3.8/site-packages/websockets/extensions/permessage_deflate.py:61: size=4148 KiB, count=1734, average=2449 B
/home/ubuntu2/fastapi-rts/venv/lib/python3.8/site-packages/websockets/datastructures.py:122: size=1691 KiB, count=28868, average=60 B
/usr/lib/python3.8/json/decoder.py:353: size=1283 KiB, count=13772, average=95 B
/home/ubuntu2/fastapi-rts/core/models.py:809: size=1280 KiB, count=1, average=1280 KiB
/home/ubuntu2/fastapi-rts/core/models.py:790: size=861 KiB, count=12244, average=72 B
/home/ubuntu2/fastapi-rts/venv/lib/python3.8/site-packages/uvicorn/protocols/websockets/websockets_impl.py:173: size=832 KiB, count=16619, average=51 B
/home/ubuntu2/fastapi-rts/core/models.py:789: size=730 KiB, count=18876, average=40 B
/home/ubuntu2/fastapi-rts/venv/lib/python3.8/site-packages/sentry_sdk/scope.py:299: size=680 KiB, count=1329, average=524 B


Traceback stat

4040 memory blocks: 151178.6 KiB
  File "/home/ubuntu2/fastapi-rts/venv/lib/python3.8/site-packages/websockets/extensions/permessage_deflate.py", line 64
    self.encoder = zlib.compressobj(

40733 memory blocks: 71484.1 KiB
  File "/home/ubuntu2/fastapi-rts/core/models.py", line 292
    res += pickle.dumps(self.data, fix_imports=False)

1734 memory blocks: 4147.6 KiB
  File "/home/ubuntu2/fastapi-rts/venv/lib/python3.8/site-packages/websockets/extensions/permessage_deflate.py", line 61
    self.decoder = zlib.decompressobj(wbits=-self.remote_max_window_bits)

28868 memory blocks: 1691.0 KiB
  File "/home/ubuntu2/fastapi-rts/venv/lib/python3.8/site-packages/websockets/datastructures.py", line 122
    self._dict.setdefault(key.lower(), []).append(value)

13772 memory blocks: 1283.5 KiB
  File "/usr/lib/python3.8/json/decoder.py", line 353
    obj, end = self.scan_once(s, idx)

1 memory blocks: 1280.0 KiB
  File "/home/ubuntu2/fastapi-rts/core/models.py", line 809
    self._cache[channel_name] = self._cache.pop(channel_name)

12244 memory blocks: 861.3 KiB
  File "/home/ubuntu2/fastapi-rts/core/models.py", line 790
    messages=UniqueSortedList((compressed_msg,)))

16619 memory blocks: 832.2 KiB
  File "/home/ubuntu2/fastapi-rts/venv/lib/python3.8/site-packages/uvicorn/protocols/websockets/websockets_impl.py", line 173
    (name.encode("ascii"), value.encode("ascii"))

18876 memory blocks: 729.8 KiB
  File "/home/ubuntu2/fastapi-rts/core/models.py", line 789
    self._cache[channel_name] = CacheElement(last_update=msg.created.timestamp(),

1329 memory blocks: 679.8 KiB
  File "/home/ubuntu2/fastapi-rts/venv/lib/python3.8/site-packages/sentry_sdk/scope.py", line 299
    self._breadcrumbs = deque()  # type: Deque[Breadcrumb]

As you can see, self.encoder = zlib.compressobj( allocates ~150 MB per worker => ~1 GB per app.
Here is the graph of memory committed by the server where Uvicorn runs:
[image: graph of memory committed by the server]
At the very beginning the app was rebooted. After that I started creating sockets with a 0.05 s delay between each socket.
When no sockets are being created, the graph looks like this:
[image: graph of memory committed with no sockets created]

I've also tested Gunicorn with the Uvicorn worker class, with the same memory-leak effect.
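
For reference, a minimal sketch of the periodic snapshot dump described in point 2 above; the file names, interval and startup wiring are illustrative, not the exact code I run:

```python
import asyncio
import time
import tracemalloc


async def dump_snapshots_forever(interval: float = 600.0) -> None:
    # Periodically persist tracemalloc snapshots so two of them can be
    # diffed later with Snapshot.compare_to().
    while True:
        snapshot = tracemalloc.take_snapshot()
        snapshot.dump(f"/tmp/tracemalloc-{int(time.time())}.snap")
        await asyncio.sleep(interval)


# Somewhere during application startup (e.g. a FastAPI startup handler):
# tracemalloc.start(25)  # keep up to 25 frames per traceback
# asyncio.create_task(dump_snapshots_forever())
```

Snapshots dumped this way can be reloaded with tracemalloc.Snapshot.load() and turned into the per-line statistics shown above with snapshot.statistics("lineno").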

@M1ha-Shvn

M1ha-Shvn commented Feb 3, 2023

@M1ha-Shvn I still don't have a clear view on how to solve point 1 that you mention above. Would you like to show me in a PR?

What I suggest is one of the following:

  1. Subclass ServerPerMessageDeflateFactory so that it returns a single PerMessageDeflate instance every time it is called, instead of a new instance on every call. This instance should be stored somewhere between calls, for instance in a class attribute of the ServerPerMessageDeflateFactory subclass. Then use this subclass as the extension instead of ServerPerMessageDeflateFactory (a rough sketch is below this list).
  2. Optionally add some tuning parameters, described here, and pass them to ServerPerMessageDeflateFactory when the instance is created.
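
A rough, untested sketch of suggestion 1 (the class name is made up, and it deliberately ignores the fact that different clients may negotiate different parameters):

```python
from typing import Optional

from websockets.extensions.permessage_deflate import (
    PerMessageDeflate,
    ServerPerMessageDeflateFactory,
)


class SharedServerPerMessageDeflateFactory(ServerPerMessageDeflateFactory):
    """Hand every connection the same PerMessageDeflate instance."""

    _shared: Optional[PerMessageDeflate] = None

    def process_request_params(self, params, accepted_extensions):
        # Let the parent negotiate the response params as usual, but cache
        # the first extension object it builds and reuse it afterwards.
        response_params, extension = super().process_request_params(
            params, accepted_extensions
        )
        if SharedServerPerMessageDeflateFactory._shared is None:
            SharedServerPerMessageDeflateFactory._shared = extension
        return response_params, SharedServerPerMessageDeflateFactory._shared
```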

@Kludex
Member Author

Kludex commented Feb 6, 2023

Thanks!

@M1ha-Shvn

M1ha-Shvn commented Feb 6, 2023

I haven't tested this PR; I made it quickly to show the general idea:
#1864

@Kludex
Member Author

Kludex commented Feb 6, 2023

@aaugustin I'm sorry to bother, but when you have time, would you mind giving me your input here to understand if we are going in the right direction? 🙏

@aaugustin
Contributor

If you're using context takeover (i.e. you don't set no_context_takeover=True), then you cannot share PerMessageDeflate, because the context is connection-local: it depends on previously exchanged messages.

If you aren't using context takeover, then you don't have a memory usage problem.

@aaugustin
Contributor

Apart from that, is there a knob to configure max_window_bits? With a very high number of connections, you would probably benefit from lowering it.

See https://websockets.readthedocs.io/en/stable/topics/compression.html#compression-settings for defaults.
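
For illustration, when using the websockets API directly (Uvicorn doesn't expose these knobs today, which is what this issue asks for), the factory from that guide can be configured along these lines; the exact values are only examples:

```python
from websockets.extensions.permessage_deflate import ServerPerMessageDeflateFactory

# Smaller windows, a lower memLevel and disabled context takeover trade
# compression ratio for per-connection memory.
factory = ServerPerMessageDeflateFactory(
    server_max_window_bits=11,
    client_max_window_bits=11,
    server_no_context_takeover=True,
    client_no_context_takeover=True,
    compress_settings={"memLevel": 4},
)

# Passed to the server with: websockets.serve(handler, host, port, extensions=[factory])
```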

@M1ha-Shvn

Hi, thanks for your answer.

If you're using context takeover (i.e. you don't set no_context_takeover=True), then you cannot share PerMessageDeflate, because context is connection-local. It depends on previously exchanged messages.

First of all, I understand that I've changed this behavior. But what I don't understand is why the deflate object is created per connection. Why can't the deflate context be shared between all connections? From my point of view, a typical websocket server has lots of client connections and sends very similar (particularly JSON) messages to all clients, and therefore over all connections. So it would be worth using the same deflate object with a single context and increasing its capacity, so it could compress messages better. Of course, this depends on the connection settings set by request headers, but there is only a small number of combinations of these parameters, so a limited number of objects could be created and reused.

If you aren't using context takeover, then you don't have a memory usage problem.

I'm not so sure about that. Creating an object in Python has a memory cost. Even if I don't use context takeover, PerMessageDeflate objects would still be created and consume memory for each connection.
P.S. There is also currently no way in Uvicorn to tune PerMessageDeflateFactory or PerMessageDeflate, only to disable it.

Apart from that, is there a knob to configure max_window_bits? With a very high number of connections, probably you would benefit from lowering it.

See https://websockets.readthedocs.io/en/stable/topics/compression.html#compression-settings for defaults.

Yes, adding deflate tuning settings to the Uvicorn settings was one of my proposals. Though it would reduce the problem for me, it would not solve it: I'd just have a higher connection limit while still consuming a lot of memory per connection.

@aaugustin
Contributor

Kludex asked for my input; I gave it. If you don't trust me, run your own experiments and reach your own conclusions.

In your experiments, don't stop at opening connections; exchange a significant number of different messages on each connection, in both directions, and make sure they make it through the compress / decompress cycle correctly.
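
A minimal sketch of such an experiment with the websockets client; the URL, the message shape and the assumption that the endpoint echoes messages back are all placeholders:

```python
import asyncio
import json

import websockets


async def exercise(url: str, messages: int = 1000) -> None:
    # Send varied payloads and read the responses, so the compressor and
    # decompressor on both sides of each connection actually do work.
    async with websockets.connect(url) as ws:
        for i in range(messages):
            await ws.send(json.dumps({"seq": i, "payload": "x" * (i % 512)}))
            received = json.loads(await ws.recv())
            assert received["seq"] == i  # round-trip survived compression


async def main() -> None:
    # Many concurrent connections, as in the original report.
    await asyncio.gather(*(exercise("ws://localhost:8000/ws") for _ in range(1000)))


asyncio.run(main())
```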

@aaugustin
Contributor

aaugustin commented Feb 7, 2023

In case it helps:

  • The context is synchronized between both ends of the connection and its content depends on messages that are exchanged.
  • When context takeover is disabled, compressor / decompressor objects will be garbage collected as soon as compression / decompression is done.

@Kludex
Member Author

Kludex commented Feb 7, 2023

Thanks @aaugustin! I really appreciate it. 🙏

@Kludex
Member Author

Kludex commented Mar 9, 2023

In your experiments, don't stop at opening connections; exchange a significant number of different messages on each connection, in both directions, and make sure they make it through the compress / decompress cycle correctly.

Did you try this, @M1ha-Shvn?

encode locked and limited conversation to collaborators Mar 13, 2023
Kludex converted this issue into discussion #1900 Mar 13, 2023

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
