The current Zstandard manual states the following:
Block level API (deprecated)
If the normal API doesn't provide the functionality you need, please open a GitHub issue.
So here I am.
Context
In low-latency real-time applications such as games, it is necessary to send messages at a high frequency to a large number of clients.
Suppose the following scenario:
You are a regular player hosting a 64-player game (no dedicated server),
using a client (many) - host (1) topology.
In this sort of setup, the single host sends packets to 63 clients at an approximate rate of 50/60/100 Hz.
A single 'message' sent every tick may look something like this:
IP + UDP Header: 28 bytes
Network Library Header: 2 bytes
Payload: ~2016 bytes [before compression]
Assumes an average of 32 bytes per player, with data for 63 players (all players except the recipient): 63 × 32 = 2016 bytes.
A message like this needs to be sent to each of the 63 clients, resulting in 6300 messages per second at a 100 Hz tick rate.
Problem Statement
Important
When working at this scale while targeting regular user machines and connections, every bit of effort must be put into efficiency.
The payload size in particular is especially sensitive. In practice, the MTU is around 1440-1460 bytes for most IPv4 endpoints I interact with.
What this means is that if the complete packet (IP + UDP + Lib + Payload) exceeds ~1440 bytes, it must be
split into 2 packets. (Note that the example above is already 28 + 2 + 2016 = 2046 bytes before compression, so compression has to bring it under that limit to fit in a single packet.)
When that happens, the host will have to incur:
~32 bytes of overhead per client (think 30 for the second header and 2 for the two extra fragmentation headers).
A doubled probability of dropped or delayed packets, since a second packet now has to be sent for each state update, and therefore possible 'jitter' in game.
This overhead can add up quickly.
In this scenario, this ~32 byte overhead alone increases the bandwidth requirement by roughly 1.5 Mbit/s (about 30 bytes × 6300 messages/s × 8 bits).
This alone is the difference between (for example):
Supporting 63 vs 64 players.
Experiencing packet loss or not while also streaming AV1 video of your current session.
Note
Big companies can afford to run dedicated servers and swallow these inefficiencies; however,
open source library/code/mod authors like me, just doing free stuff for the community, don't have these sorts of resources.
All infrastructure is hosted out of pocket at a loss.
Goals
Tip
The goal is to avoid fragmentation at all costs.
To avoid fragmentation, the packet size must be minimized as much as possible; that includes manually taking control of block compression/decompression.
The difference between a 1440 and a 1445 byte payload (minimal frame header + block header) can mean the difference between sending 1 and 2 packets, so it is crucial to save every byte possible.
The following ideas come to mind (focusing on zstd integration here).
Using Custom Block Ranges and Dictionaries
Note
This requires some context, so apologies in advance.
Payload size can be reduced by taking manual control of the block creation process, i.e. controlling which slice of data goes into which block.
Different sections of data can have different byte distributions, which in turn affects
the efficiency of Huffman coding (and friends).
For this example, suppose we have an array of the following structure:
f32 x;
f32 y;
f32 z;
It is possible to maximize the compression ratio by first reordering the data into structure-of-arrays (SoA) order:
x0, x1, y0, y1, z0, z1
And then reordering the bytes to group each corresponding byte together (byte 0 of every float, then byte 1, and so on):
This increases the likelihood of repeated data, as many floats will have the same matching first and/or second byte.
And lastly, doing a bit of delta encoding (subtracting each byte from the one before it). For example:
x0 = 1327.784 = 44 A5 F9 18
x1 = 1320.432 = 44 A5 0D D4
Byte-grouped: (44 44) (A5 A5) (F9 0D) (18 D4)
After delta encoding, the matching bytes become zeros:
(44 00) (A5 00) ..
This increases the number of repeated zero bytes and skews the byte distribution towards zero, which benefits Huffman coding.
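To make the transform concrete, here is a minimal sketch in C of the shuffle + delta step described above, assuming the x/y/z components have already been split into separate arrays (the helper name and layout are illustrative, not part of any existing API):

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative helper: transpose an array of 4-byte floats into byte planes
 * (byte 0 of every value, then byte 1, ...) and delta-encode each plane.
 * With position data, the leading bytes repeat often, so the output ends up
 * dominated by zero bytes, which compresses well. */
static void shuffle_delta_f32(const float *src, size_t count, uint8_t *dst)
{
    const uint8_t *bytes = (const uint8_t *)src;
    for (size_t plane = 0; plane < 4; plane++) {
        uint8_t prev = 0;
        for (size_t i = 0; i < count; i++) {
            uint8_t cur = bytes[i * 4 + plane];
            dst[plane * count + i] = (uint8_t)(cur - prev); /* delta vs previous value's byte */
            prev = cur;
        }
    }
}
```

The inverse (prefix-sum each plane, then transpose back) restores the original bytes exactly.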
I would like to be able to isolate the float data inside a single block, because its byte distribution
will differ from that of, for example, manually bit-packed data.
Likewise, I would also want to apply a custom dictionary to each manually created block,
since each dictionary would be tuned for the very specific data in that block.
The dictionary would also have greater relevance than, say, whatever bit-packed data happens to sit in the previous sliding window.
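With the current public API, the closest approximation I can see is compressing each manually chosen slice as its own frame with its own dictionary; a rough sketch (dictionary training and buffer management omitted):

```c
#include <zstd.h>

/* Sketch: compress one data slice with a slice-specific dictionary.
 * Each call produces a complete zstd frame, not a bare block, so every
 * slice pays the frame overhead on top of the block header. */
static size_t compress_slice(ZSTD_CCtx *cctx, const ZSTD_CDict *cdict,
                             void *dst, size_t dstCapacity,
                             const void *src, size_t srcSize)
{
    return ZSTD_compress_usingCDict(cctx, dst, dstCapacity, src, srcSize, cdict);
}
```

That per-slice frame overhead is exactly the kind of cost this request is trying to avoid.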
Custom Block Header
Because the MTU is ~1450 bytes and the maximum message length is nowhere near the maximum zstd block size,
it's possible to shave bytes from the block headers.
Here I trimmed the block header by 1 byte (from 3 bytes to 2), which still allows a maximum block size of 8192 bytes:
greater than the MTU and easily larger than any packet will ever be.
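A sketch of what such a trimmed 2-byte block header could look like (a hypothetical layout, not anything zstd defines): 13 bits of size (≈8 KiB), 2 bits of block type, and 1 "last block" bit.

```c
#include <stdint.h>

/* Hypothetical 2-byte block header: 13-bit size, 2-bit type, 1-bit "last block".
 * zstd's own block header is 3 bytes; this trades maximum block size (~8 KiB,
 * still well above the MTU) for one saved byte per block. */
static inline uint16_t pack_block_header(uint16_t size, uint8_t type, uint8_t last)
{
    return (uint16_t)((size & 0x1FFFu) | ((uint16_t)(type & 0x3u) << 13) | ((uint16_t)(last & 0x1u) << 15));
}

static inline void unpack_block_header(uint16_t hdr, uint16_t *size, uint8_t *type, uint8_t *last)
{
    *size = hdr & 0x1FFFu;
    *type = (uint8_t)((hdr >> 13) & 0x3u);
    *last = (uint8_t)(hdr >> 15);
}
```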
Splitting Block Headers
Tip
Even better, it's possible to rearrange the data such that additional bytes can be saved.
Suppose I have a fixed number of blocks (2). I would then split the block header information across 2 locations:
Unused/Trimmed Space in Network Library Header
Start of Payload
I can store 2 bits in the Network Library Header to indicate whether each block is compressed or not.
At the start of the payload I can then store the remaining block header fields for the 2 blocks.
This saves yet another byte (the library-header half of this split is sketched below).
Note
For efficient packet handling, the compressed payload must be byte aligned.
The concern here is more that the lack of a raw/block compression API forces us to use the standard zstd block format.
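As a sketch of the library-header half of that split, assuming the 2-byte network library header has 2 spare bits (the bit positions are purely illustrative):

```c
#include <stdint.h>

/* Hypothetical: reuse 2 spare bits of the 2-byte network library header as
 * per-block "is compressed" flags for the fixed count of 2 blocks per message. */
#define LIB_HDR_BLOCK0_COMPRESSED (1u << 14)
#define LIB_HDR_BLOCK1_COMPRESSED (1u << 15)

static inline uint16_t set_block_flags(uint16_t libHeader, int block0Compressed, int block1Compressed)
{
    if (block0Compressed) libHeader = (uint16_t)(libHeader | LIB_HDR_BLOCK0_COMPRESSED);
    if (block1Compressed) libHeader = (uint16_t)(libHeader | LIB_HDR_BLOCK1_COMPRESSED);
    return libHeader;
}
```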
All in All
I'm not sure how it would be possible to get the maximum out of Zstandard without a block compression API,
especially when, in a use case like this, a difference of a few bytes around the MTU can be the barrier between
a bad and a very good experience.
Note
These are estimated savings.
Given the earlier example with the 2016 byte payload, the ability to compress blocks provides the
following potential and likely savings:
~32 bytes from not exceeding the MTU (if the other savings are sufficient to avoid fragmentation).
~100-150 bytes from controlling the blocks manually, i.e. splitting into blocks by hand and using a custom dictionary per block.
This is a pure estimate; I don't have enough real data to test with, but I believe it would be in this range.
2 bytes (6 → 4) from storing the 2 block headers in a custom way.
2 bytes from not including a zstd frame header.
In total, somewhere around 150 bytes per message, or ~7.5 Mbit/s of upload bandwidth when translated onto the earlier example (150 bytes × 6300 messages/s × 8 bits).
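For comparison, this is roughly as far as I can see the current 'normal' advanced API going in trimming frame overhead; a sketch (the magicless format lives behind ZSTD_STATIC_LINKING_ONLY, so treat it as subject to change):

```c
#define ZSTD_STATIC_LINKING_ONLY  /* for ZSTD_c_format / ZSTD_f_zstd1_magicless */
#include <zstd.h>

/* Sketch: strip what the existing API allows from the frame: the 4-byte magic
 * number, checksum, dictionary ID and content size. A ~2-byte frame header and
 * the 3-byte block header(s) still remain, which is what this issue is about.
 * Return values of the setParameter calls are ignored for brevity. */
static size_t compress_minimal_frame(ZSTD_CCtx *cctx,
                                     void *dst, size_t dstCapacity,
                                     const void *src, size_t srcSize)
{
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_format, ZSTD_f_zstd1_magicless);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_contentSizeFlag, 0);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_checksumFlag, 0);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_dictIDFlag, 0);
    return ZSTD_compress2(cctx, dst, dstCapacity, src, srcSize);
}
```

The decompressor would have to be configured for the magicless format as well (ZSTD_d_format).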
Describe the solution you'd like
N/A
The zstd manual requests opening an issue if the required functionality is not present.
In this case the 'requirements' are:
Custom byte ranges for starting/ending blocks
Dictionary per block
Omitting frame header
Custom block header/metadata encoding.
The existing (deprecated) block-level APIs make these optimisations possible; on the other hand, I'm not sure the suggested alternative of using the 'normal' API covers this.
The main concern lies in fragmenting vs. not fragmenting: doubling the probability of transmission failure is a very unwanted effect.
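For completeness, the deprecated block-level entry points in the static-linking-only section look like roughly what is being asked for; a minimal sketch of how I would expect to use them, assuming they keep working (error handling trimmed):

```c
#define ZSTD_STATIC_LINKING_ONLY  /* deprecated block-level API */
#include <zstd.h>

/* Sketch: compress one raw block with a per-block dictionary and no frame
 * around it. ZSTD_compressBlock() returns the compressed size, or 0 when the
 * data is not compressible (in which case the raw bytes would be sent as-is
 * and flagged "not compressed" in the custom header). */
static size_t compress_raw_block(ZSTD_CCtx *cctx,
                                 const void *dict, size_t dictSize, int level,
                                 void *dst, size_t dstCapacity,
                                 const void *src, size_t srcSize)
{
    size_t const initErr = ZSTD_compressBegin_usingDict(cctx, dict, dictSize, level);
    if (ZSTD_isError(initErr)) return initErr;
    return ZSTD_compressBlock(cctx, dst, dstCapacity, src, srcSize);
}
```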