Increase in SnappyOutputStream output size after #82 #100

@JoshRosen

It appears that the size of the compressed output generated by SnappyOutputStream increased between versions 1.1.1.1 and 1.1.1.2. To see this, I ran a microbenchmark which serializes 1000 integers using Java serialization, compresses the result using a SnappyOutputStream, and reports the serialized size.
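For illustration, here is a minimal sketch of this kind of measurement. It is not the code from the gist: the class name `CompressedSizeBenchmark`, the helper `compressedSize`, and the choice to write 1000 boxed `Integer` values with `writeObject` are all assumptions. So that it runs with only the JDK, the compressing stream is passed in as a parameter and the demo uses `DeflaterOutputStream` as a stand-in. With snappy-java on the classpath, passing `org.xerial.snappy.SnappyOutputStream::new` instead should measure the stream discussed here.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.function.Function;
import java.util.zip.DeflaterOutputStream;

public class CompressedSizeBenchmark {

    // Serializes `count` integers through the stream produced by `wrap`
    // and returns the number of bytes that reached the underlying buffer.
    // Assumption: the integers are written as boxed Integer objects.
    static int compressedSize(Function<OutputStream, OutputStream> wrap, int count) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(wrap.apply(sink))) {
            for (int i = 0; i < count; i++) {
                out.writeObject(Integer.valueOf(i));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Closing the ObjectOutputStream also closes (and flushes) the
        // wrapped compressing stream, so the size is final at this point.
        return sink.size();
    }

    public static void main(String[] args) {
        // Identity wrapper: size of the serialized data without compression.
        System.out.println("uncompressed " + compressedSize(s -> s, 1000));
        // JDK stand-in for the compressor; swap in SnappyOutputStream::new
        // when snappy-java is available.
        System.out.println("deflate      " + compressedSize(DeflaterOutputStream::new, 1000));
    }
}
```

Running the same class against different snappy-java versions (by changing the dependency version) and comparing the printed sizes would give a table like the one below.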

You can find the full source of my benchmark at https://gist.github.com/JoshRosen/f2b568662c3c6011df08. I've included a script that runs my benchmark against all recently-published snappy-java versions. Here are the results:

version     size
1.1.1.6     489
1.1.1.5     489
1.1.1.4
1.1.1.3     489
1.1.1.2     489
1.1.1.1     386
1.1.1       386
1.1.1-M4    386
1.1.1-M3    386
1.1.1-M2    386
1.1.1-M1    386
1.1.0.1     386
1.1.0       386
1.1.0-M4    386
1.1.0-M3    386
1.1.0-M2    386
1.1.0-M1    386
1.0.x
1.0.5.4     386
1.0.5.3     386
1.0.5.2     386
1.0.5.1     386
1.0.5       386
1.0.5-M4    386
1.0.5-M3    386
1.0.5-M2    386
1.0.5-M1    386

Based on this, it looks like the compressed output grew from 386 to 489 (roughly 27% larger) between 1.1.1.1 and 1.1.1.2. When I compare the commits between these versions (1.1.1...1.1.1.2), it looks like the only change was #82.

This result might be workload-dependent, so it may be worth investigating this with other benchmarks. I discovered this issue while investigating https://issues.apache.org/jira/browse/SPARK-5081, a Spark bug in which the size of shuffle data increased across Spark versions.
