Description
It appears that the size of the compressed output generated by SnappyOutputStream increased between versions 1.1.1.1 and 1.1.1.2. To see this, I ran a microbenchmark which serializes 1000 integers using Java serialization, compresses the result using a SnappyOutputStream, and reports the size of the compressed output.
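For readers who want to reproduce the measurement without opening the gist, here is a minimal sketch of that kind of benchmark. It is not the gist's code: the class name `CompressedSizeBenchmark`, the `Wrapper` interface, and the choice to write each integer with `writeObject` are assumptions made for illustration. So that it runs with only the JDK, `main` uses `java.util.zip.DeflaterOutputStream` as a stand-in compressor; the numbers it prints are therefore not comparable to the Snappy results.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.util.zip.DeflaterOutputStream;

public class CompressedSizeBenchmark {

    /** Hypothetical helper: wraps the byte sink in a compressing stream. */
    public interface Wrapper {
        OutputStream wrap(OutputStream sink) throws IOException;
    }

    /**
     * Serializes the integers 0..999 with Java serialization through the
     * stream returned by the wrapper and returns the number of bytes that
     * reached the underlying sink.
     */
    public static int compressedSize(Wrapper wrapper) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // Closing the ObjectOutputStream also closes (and flushes) the wrapper.
        try (ObjectOutputStream out = new ObjectOutputStream(wrapper.wrap(sink))) {
            for (int i = 0; i < 1000; i++) {
                // Assumption: each value is written as a boxed Integer object.
                out.writeObject(Integer.valueOf(i));
            }
        }
        return sink.size();
    }

    public static void main(String[] args) throws IOException {
        // Identity wrapper: size of the serialized data with no compression.
        System.out.println("uncompressed " + compressedSize(s -> s));
        // Stand-in compressor from the JDK, used here only so the sketch runs as-is.
        System.out.println("deflate      " + compressedSize(DeflaterOutputStream::new));
    }
}
```

With snappy-java on the classpath, passing `s -> new org.xerial.snappy.SnappyOutputStream(s)` as the wrapper should give the figure this issue is about; running that against each published version is what a comparison like the one below requires.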
You can find the full source of my benchmark at https://gist.github.com/JoshRosen/f2b568662c3c6011df08. I've included a script that runs my benchmark against all recently-published snappy-java versions. Here are the results:
1.1.1.6 489
1.1.1.5 489
1.1.1.4
1.1.1.3 489
1.1.1.2 489
1.1.1.1 386
1.1.1 386
1.1.1-M4 386
1.1.1-M3 386
1.1.1-M2 386
1.1.1-M1 386
1.1.0.1 386
1.1.0 386
1.1.0-M4 386
1.1.0-M3 386
1.1.0-M2 386
1.1.0-M1 386
1.0.x
1.0.5.4 386
1.0.5.3 386
1.0.5.2 386
1.0.5.1 386
1.0.5 386
1.0.5-M4 386
1.0.5-M3 386
1.0.5-M2 386
1.0.5-M1 386
Based on this, it looks like the compressed size grew from 386 to 489 (roughly 27% larger) between 1.1.1.1 and 1.1.1.2. When I compare the commits between these versions (1.1.1...1.1.1.2), it looks like the only change was #82.
This result might be workload-dependent, so it may be worth checking it against other benchmarks. I discovered this issue while investigating https://issues.apache.org/jira/browse/SPARK-5081, a Spark bug in which the size of shuffle data increased across Spark versions.