Compression slows down network comms #7655
Minimal benchmark for lz4.

On my beastly AMD Ryzen 3950X:
On AWS EC2 m6i.large:
```python
import time

import numpy

from distributed.protocol import deserialize_bytes, serialize_bytelist

x = numpy.random.random(128 * 2**20 // 8)  # 128 MiB of float64 noise (uncompressible)
y = x.copy().reshape(-1)
y[::2] = 0  # zero every other element -> compressible
y = y.reshape(x.shape)

x_frames = serialize_bytelist(x)
y_frames = serialize_bytelist(y)
x_bytes = bytearray(b"".join(x_frames))
y_bytes = bytearray(b"".join(y_frames))

print("buffer (MiB)", x.nbytes / 2**20)
assert len(x_frames) == len(y_frames)
print("n. frames", len(x_frames))
print("serialized uncompressible (MiB)", len(x_bytes) / 2**20)
print("serialized compressible (MiB)", len(y_bytes) / 2**20)


def bench(func, data, size):
    """Mean throughput of func(data) in MiB/s over N runs"""
    N = 20
    t0 = time.perf_counter()
    for _ in range(N):
        func(data)
    t1 = time.perf_counter()
    elapsed = (t1 - t0) / N
    return size / elapsed / 2**20


print("serialize uncompressible (MiB/s)", bench(serialize_bytelist, x, x.nbytes))
print("deserialize uncompressible (MiB/s)", bench(deserialize_bytes, x_bytes, x.nbytes))
print("serialize compressible (MiB/s)", bench(serialize_bytelist, y, x.nbytes))
print("deserialize compressible (MiB/s)", bench(deserialize_bytes, y_bytes, x.nbytes))
```
With blosc:
A shameless plug for cramjam's lz4 block, which is slightly faster, especially at larger sizes, and supports single-allocation de/compression if the output size is already known - albeit I guess that'd mostly be relevant for decompression here. fastparquet has been using it for a while now.
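A quick illustration of that single-allocation path, using the same cramjam calls exercised in the benchmarks later in this thread (a sketch; sizes are arbitrary):

```python
import cramjam

raw = b"0123456789" * 100_000
compressed = bytes(cramjam.lz4.compress_block(raw))

# The output size is known up front, so decompress straight into one
# pre-sized buffer instead of letting the library allocate:
out = bytearray(len(raw))
cramjam.lz4.decompress_block_into(compressed, output=out)
assert out == raw
```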
Context for those who are not intimately familiar with how we do compression: this is done in distributed/protocol/compression.py (lines 154 to 197 at 0a88b76) and works through a couple of steps.
So the question "how fast does our compression algo need to be?" depends on this cutoff. The breakeven point can be calculated roughly by

```
# no compression time >= time for compressed bytes + compression + decompression
size / network >= new_size / network + size / compression + new_size / compression
new_size = compression_ratio * size
...
compression_ratio <= (compression - network) / (compression + network)
```

which, given a bandwidth of 1 GiBit/s (125 MiB/s), gives a cutoff of about:
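For illustration, plugging the formula into Python with an assumed combined de/compression throughput of 500 MiB/s (a made-up number; substitute the measured figures from the benchmarks above):

```python
network = 125      # MiB/s, i.e. 1 GiBit/s
compression = 500  # MiB/s, assumed de/compression throughput
cutoff = (compression - network) / (compression + network)
print(f"compression only pays off below a ratio of {cutoff:.2f}")  # 0.60
```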
so... indeed, this looks like lz4 is a loss on performance, even on paper, unless we significantly reduce this threshold. This does not even account for us potentially throwing away the compressed bytes again because the sampling looked good but the real data didn't (not sure how likely this is; I guess the other way round is more likely).
A realistic use case for both is a chunk that starts with an area full of zeros or another constant and later becomes random-ish (sample is compressible, total isn't), or the other way around (sample is uncompressible, total is).
I'm just curious how often this happens, since this represents the absolute worst case. I'm not suggesting to drop or change this sampling.
FWIW we currently sample five random sections in order to avoid situations like this.
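An illustrative sketch of that multi-sample idea (not the exact code in distributed):

```python
import random

import lz4.block

def sample_compresses_well(data, n_samples=5, sample_size=2_000, ratio=0.9):
    """Compress several random slices, so that a single atypical region
    (e.g. a run of zeros at the start) doesn't mislead the heuristic."""
    slices = []
    for _ in range(n_samples):
        start = random.randint(0, max(0, len(data) - sample_size))
        slices.append(bytes(data[start : start + sample_size]))
    sample = b"".join(slices)
    return len(lz4.block.compress(sample)) < ratio * len(sample)
```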
In general though, if we learned that "compression is bad" I'd be very happy to have all of that functionality go away. Simple is good.

Kudos, also, for finding this. I'm excited by the potential optimization.
That's impressive. FWIW, I could not make cramjam.snappy's decompress_into work (note the failing assert below):

```python
# IPython session; %timeit is an IPython magic
import numpy
import lz4.block
import snappy
import cramjam
import blosc
import blosc2

x = numpy.random.random(64 * 2**20 // 8)  # 64 MiB of float64 noise
x[::2] = 0  # zero every other element -> compressible
b = x.tobytes()

print("=== lz4 ===")
c = lz4.block.compress(b)
print(len(c) / len(b))
assert lz4.block.decompress(c) == b
%timeit lz4.block.compress(b)
%timeit lz4.block.decompress(c)

print("=== snappy ===")
c = snappy.compress(b)
print(len(c) / len(b))
assert snappy.decompress(c) == b
%timeit snappy.compress(b)
%timeit snappy.decompress(c)

print("=== cramjam.lz4 ===")
c = cramjam.lz4.compress_block(b)
print(len(c) / len(b))
assert bytes(cramjam.lz4.decompress_block(c)) == b
d = bytearray(len(b))
cramjam.lz4.decompress_block_into(c, output=d)
assert d == b
%timeit cramjam.lz4.compress_block(b)
%timeit cramjam.lz4.decompress_block(c)
%timeit cramjam.lz4.decompress_block_into(c, output=d)

print("=== cramjam.snappy ===")
c = cramjam.snappy.compress(b)
print(len(c) / len(b))
assert bytes(cramjam.snappy.decompress(c)) == b
d = bytearray(len(b))
cramjam.snappy.decompress_into(c, output=d)
# assert d == b  # Fails
%timeit cramjam.snappy.compress(b)
%timeit cramjam.snappy.decompress(c)
# %timeit cramjam.snappy.decompress_into(c, output=d)

print("=== blosc ===")
c = blosc.compress(b, typesize=8)
print(len(c) / len(b))
assert blosc.decompress(c) == b
%timeit blosc.compress(b, typesize=8)
%timeit blosc.decompress(c)
# %timeit blosc.decompress_ptr(c, id(d) + ???)

print("=== blosc2 ===")
c = blosc2.compress(b, typesize=8)
print(len(c) / len(b))
assert blosc2.decompress(c) == b
d = bytearray(len(b))
blosc2.decompress(c, dst=d)
assert d == b
%timeit blosc2.compress(b, typesize=8)
%timeit blosc2.decompress(c)
%timeit blosc2.decompress(c, dst=d)
```
It's also worth noting that microbenchmarks and back-of-the-envelope calculations will not necessarily decide this for us. CPUs on workers are typically busy with other things (at least assuming some computation and network transfer are overlapping), so a realistic compression rate is possibly much slower than what we're measuring here.
That's because the standard output of cramjam is a file-like buffer: when that goes into another de/compression, it gets read to the end, and subsequent calls behave just like an exhausted io.BytesIO and read zero bytes - leaving zero bytes to de/compress. Here is a slightly modified version of the cramjam benchmark, which materializes the buffer with bytes(...) first. Also note the raw-format snappy variants (compress_raw/decompress_raw):

```python
# IPython session; %timeit is an IPython magic
import numpy
import lz4.block
import snappy
import cramjam
import blosc
import blosc2

x = numpy.random.random(64 * 2**20 // 8)
x[::2] = 0
b = x.tobytes()

print("=== lz4 ===")
c = lz4.block.compress(b)
print(len(c) / len(b))
assert lz4.block.decompress(c) == b
%timeit lz4.block.compress(b)
%timeit lz4.block.decompress(c)

print("=== cramjam.lz4 ===")
c = bytes(cramjam.lz4.compress_block(b))  # materialize the buffer
print(len(c) / len(b))
assert bytes(cramjam.lz4.decompress_block(c)) == b
d = bytearray(len(b))
cramjam.lz4.decompress_block_into(c, output=d)
assert d == b
%timeit cramjam.lz4.compress_block(b)
%timeit cramjam.lz4.decompress_block(c)
%timeit cramjam.lz4.decompress_block_into(c, output=d)

print("=== snappy ===")
c = snappy.compress(b)
print(len(c) / len(b))
assert snappy.decompress(c) == b
%timeit snappy.compress(b)
%timeit snappy.decompress(c)

print("=== cramjam.snappy raw ===")
c = bytes(cramjam.snappy.compress_raw(b))
print(len(c) / len(b))
assert bytes(cramjam.snappy.decompress_raw(c)) == b
d = bytearray(len(b))
cramjam.snappy.decompress_raw_into(c, output=d)
assert d == b
%timeit cramjam.snappy.compress_raw(b)
%timeit cramjam.snappy.decompress_raw(c)
%timeit cramjam.snappy.decompress_raw_into(c, output=d)

print("=== cramjam.snappy ===")
c = bytes(cramjam.snappy.compress(b))
print(len(c) / len(b))
assert bytes(cramjam.snappy.decompress(c)) == b
d = bytearray(len(b))
cramjam.snappy.decompress_into(c, output=d)
assert d == b
%timeit cramjam.snappy.compress(b)
%timeit cramjam.snappy.decompress(c)
%timeit cramjam.snappy.decompress_into(c, output=d)

print("=== blosc ===")
c = blosc.compress(b, typesize=8)
print(len(c) / len(b))
assert blosc.decompress(c) == b
%timeit blosc.compress(b, typesize=8)
%timeit blosc.decompress(c)
# %timeit blosc.decompress_ptr(c, id(d) + ???)

print("=== blosc2 ===")
c = blosc2.compress(b, typesize=8)
print(len(c) / len(b))
assert blosc2.decompress(c) == b
d = bytearray(len(b))
blosc2.decompress(c, dst=d)
assert d == b
%timeit blosc2.compress(b, typesize=8)
%timeit blosc2.decompress(c)
%timeit blosc2.decompress(c, dst=d)
```

Also, I don't think it'd be absurd to add C-Blosc2 bindings to cramjam.
I'm not entirely convinced about using cramjam, or even keeping compression at all (for network). What worries me the most is the CPU load of user tasks (and of the server itself, considering this is offloaded). When I perform the above measurements with load on my CPU, I see severe degradation of performance: when all my CPUs are under load (while true; i += 1), compression rate drops by about 2.5x while decompression rate drops by about 10x. These drops very likely depend severely on hardware and architecture, but given the benchmarks so far, the only case where keeping network compression is beneficial is a slow network, OR if we lower the threshold compression ratio significantly, from 90% down to something like 50%, just to ensure we're not losing on this deal (not even talking about additional memory requirements here). If we need to lower it by that much, I wonder how valuable this still is, or whether we should not just rip it out (and everything belonging to it, e.g. comm handshakes).
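For anyone who wants to reproduce this, here is a rough way to generate the load while re-running the timings above (an assumed setup, not necessarily the exact procedure behind the quoted numbers):

```python
import multiprocessing
import time

import lz4.block
import numpy

def burn():
    # Saturate one core, mimicking a busy user task (while true; i += 1)
    i = 0
    while True:
        i += 1

if __name__ == "__main__":
    hogs = [multiprocessing.Process(target=burn, daemon=True)
            for _ in range(multiprocessing.cpu_count())]
    for p in hogs:
        p.start()

    x = numpy.random.random(64 * 2**20 // 8)
    x[::2] = 0
    b = x.tobytes()

    t0 = time.perf_counter()
    c = lz4.block.compress(b)
    print("compress under load (MiB/s):",
          len(b) / (time.perf_counter() - t0) / 2**20)

    for p in hogs:
        p.terminate()
```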
My gut reaction is to change the default compression to None. I hear @crusaderky's concerns about client-scheduler communications, but I think that a 2x difference there won't be that big of a deal (it's either very good or very bad today, so 2x doesn't necessarily meaningfully change things).
That's because cramjam today doesn't release the GIL - a known issue that @milesgranger is looking into and which is obviously a showstopper to the adoption in dask.
From an async conversation with @fjetter: while cramjam and blosc2 would be faster than lz4 (at least in decompression), the problem remains that you need to manually install them - as optional dependencies, nothing will pull them in. So a substantial amount of users will continue having lz4 installed for whatever reason, but not our preferred libraries. So we converged in favour of having a switch for scheduler/worker<->worker comms and a separate one for client<->scheduler/worker.

I briefly fiddled with the idea of having compression settings that depend on the IP of your peer:

```yaml
compression:
  localhost: False  # 127.0.0.x
  lan: False        # 10.x.x.x and 192.168.x.x
  internet: auto    # everything else
```

The biggest benefit would be that it would automatically speed up client comms on LocalClusters without any need for configuration. However, it would completely fail clients connecting through a VPN into a corporate network, as their perceived IP address would be 10.x. So I think I will go for:

```yaml
compression:
  remote-client: auto  # client->scheduler and client->worker on LAN or internet
  remote-worker: False # worker->scheduler and worker->worker on LAN or internet
  localhost: False     # any actor connecting to another actor on 127.0.0.x
```

which covers the LocalCluster use case nicely.
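A hypothetical sketch of how a comm layer could resolve those keys per peer (names are illustrative, not distributed's actual API):

```python
import ipaddress

def resolve_compression(peer_ip: str, peer_is_client: bool, settings: dict):
    """Pick which of the proposed settings applies to a given peer."""
    if ipaddress.ip_address(peer_ip).is_loopback:  # 127.0.0.x
        return settings["localhost"]
    return settings["remote-client" if peer_is_client else "remote-worker"]

settings = {"remote-client": "auto", "remote-worker": False, "localhost": False}
assert resolve_compression("127.0.0.1", peer_is_client=True, settings=settings) is False
assert resolve_compression("10.2.3.4", peer_is_client=True, settings=settings) == "auto"
assert resolve_compression("10.2.3.4", peer_is_client=False, settings=settings) is False
```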
State of the art
Before dask either sends data over the network or writes it to disk, it tries compressing a small sample of it (10 kB) with lz4 and, if the sample compresses to 90% of its original size or better, it compresses the whole thing (implementation: maybe_compress).

For spill/unspill, compression/decompression blocks the event loop.
For network comms, compression/decompression runs on a thread pool with a single worker (offload) while the GIL is released (this was fixed very recently for outbound comms; before, it blocked the event loop: #7593).
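A minimal sketch of that heuristic (the real maybe_compress samples several random sections of the payload and supports more codecs than lz4):

```python
import lz4.block

def maybe_compress_sketch(data: bytes, sample_size: int = 10_000, ratio: float = 0.9):
    """Return (codec, payload): compressed only if a sample suggests it's worth it."""
    if len(data) <= sample_size:
        return None, data  # tiny payloads are sent as-is
    sample = data[:sample_size]
    if len(lz4.block.compress(sample)) > ratio * len(sample):
        return None, data  # sample didn't compress well: give up early
    compressed = lz4.block.compress(data)
    if len(compressed) > ratio * len(data):
        return None, data  # full payload didn't compress after all: discard it
    return "lz4", compressed
```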
Performance

Until now we did not have hard data on how beneficial compression is - just an academic expectation that "disk slow, network slow, cpu fast".
In coiled-runtime, I've just added tests (coiled/benchmarks#714) that demonstrate different performance when data is compressible vs. when it is uncompressible.
coiled-runtime benchmarks run on coiled, which in turn means that all workers run on AWS EC2.
Test results are very bad.
I'm seeing that:

- tests dominated by network transfers (test_array.py::*) are up to 70% slower when the workers spend time compressing and decompressing data, compared to when maybe_compress gives up after it fails to compress the 10 kB sample
- tests dominated by spilling (test_spill.py::test_spilling) get a substantial speedup from compression
- tests that mix network transfers and spilling (test_spill.py::test_dot_product_spill) don't show benefits from compression, probably because the two effects cancel each other out.

Whoops. Disk slow, cpu fast, network faster.
Tests were performed downstream of #7593.
In the above chart, each test ran in 3 different configurations, which differ in how the original data is generated:

- uncompressible
- compressible: this data compresses to 42% of its original size, at a speed of 570 MiB/s (measured on lz4 4.0)
- dummy: this was to rule out that the slowdown between compressible and uncompressible was caused by the extra layer in the graph or by the introduction of cloudpickle in the serialization
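For a concrete sense of these configurations, the compressible variant can be generated with the same interleaved-zeros trick used in the benchmarks earlier in this thread (a sketch; the actual test code lives in coiled/benchmarks#714):

```python
import numpy

rng = numpy.random.default_rng()

# uncompressible: pure float64 noise; lz4 finds nothing to exploit
uncompressible = rng.random(2**20)

# compressible: the same noise with every other element zeroed, giving
# long runs of zero bytes (cf. the 42% ratio quoted above)
compressible = rng.random(2**20)
compressible[::2] = 0
```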
Possible solutions
First of all, we need to verify that the difference is actually caused by pure waiting time for decompression and not something else (e.g. GIL).
Reinstate blosc + fix #7433
blosc was removed in #5269 over concerns on maintainability.
However, it is 5x faster than lz4 and can work around #7433 (with additional work on top of just reverting #5269).
#7433 is a known major cause of slowness, besides the raw throughput of the C algorithm.
This is my preferred choice.
Increase number of offload threads
At the moment, all compression/decompression is pipelined onto a single offload thread. Increasing this number could be beneficial - if user functions alone are insufficient to saturate the CPU.
Thread safety of the various C libraries used for compression/decompression would need to be thoroughly verified.
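A sketch of what a configurable pool could look like (hypothetical names; distributed's actual offload utility is a single-worker ThreadPoolExecutor):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# More than one worker would let several comms de/compress in parallel,
# provided the C libraries release the GIL and are thread-safe.
_offload_executor = ThreadPoolExecutor(max_workers=4, thread_name_prefix="offload")

async def offload(fn, *args):
    """Run a CPU-bound function without blocking the event loop."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_offload_executor, fn, *args)
```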
Completely disable compression by default
I don't think this is a good idea because (1) it would harm spilling and (2) it would severely harm client<->worker comms in all configurations where the client-worker bandwidth is much more limited than the worker-worker one - such as in Coiled. From customer feedback, we know that many dask users can't read/write on cloud storage and are forced to upload/download everything from/to their laptop with scatter/gather.

Disable compression in network, keep it in spilling
Small code change needed. Same concerns as above.
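Note that individual users can already opt out today through dask's configuration, e.g.:

```python
import dask

# Disable compression for network comms in this process
# (the default is "auto", which prefers lz4 when it is installed)
dask.config.set({"distributed.comm.compression": None})
```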
Disable compression in worker-worker and scheduler-worker comms, keep it in client-worker comms and spilling
Larger and cumbersome code change needed. Very ugly IMHO.
Leave everything as it is in core dask; disable compression in coiled
This would make sense if we thought non-Coiled users had much lower inter-worker bandwidth than on Coiled, on average.
I don't think this is a good idea, as I don't have any evidence supporting this statement.
CC @fjetter @mrocklin @gjoseph92 @hendrikmakait