-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LUCENE-8739: custom codec providing Zstandard compression/decompression #439
base: main
Are you sure you want to change the base?
Conversation
…on with dictionary
/** Stored field format used by plugaable codec */ | ||
public class Lucene90CustomCompressionStoredFieldsFormat extends StoredFieldsFormat { | ||
// chunk size for zstandard | ||
private static final int ZSTD_BLOCK_LENGTH = 10 * 8 * 1024; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use bigger blocks, like for BEST_COMPRESSION.
private static final int ZSTD_BLOCK_LENGTH = 10 * 8 * 1024; | |
private static final int ZSTD_BLOCK_LENGTH = 10 * 48 * 1024; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to 48KB in PR
private static final int NUM_SUB_BLOCKS = 10; | ||
private final int level; | ||
public static final int defaultLevel = 3; | ||
private static final int DICT_SIZE_FACTOR = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's do 6 like for BEST_COMPRESSION?
private static final int DICT_SIZE_FACTOR = 2; | |
private static final int DICT_SIZE_FACTOR = 6; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to 6 in PR
switch (mode) { | ||
case ZSTD: | ||
return new Lucene90CompressingStoredFieldsFormat( | ||
"CustomCompressionStoredFieldsZSTD", ZSTD_MODE_NO_DICT, ZSTD_BLOCK_LENGTH, 1024, 10); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
4096 docs per chunk at most, like for BEST_COMPRESSION
"CustomCompressionStoredFieldsZSTD", ZSTD_MODE_NO_DICT, ZSTD_BLOCK_LENGTH, 1024, 10); | |
"CustomCompressionStoredFieldsZSTD", ZSTD_MODE_NO_DICT, ZSTD_BLOCK_LENGTH, 4096, 10); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to 4096 in PR
|
||
case ZSTD_DICT: | ||
return new Lucene90CompressingStoredFieldsFormat( | ||
"CustomCompressionStoredFieldsZSTD", ZSTD_MODE_DICT, ZSTD_BLOCK_LENGTH, 1024, 10); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"CustomCompressionStoredFieldsZSTD", ZSTD_MODE_DICT, ZSTD_BLOCK_LENGTH, 1024, 10); | |
"CustomCompressionStoredFieldsZSTD", ZSTD_MODE_DICT, ZSTD_BLOCK_LENGTH, 4096, 10); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed to 4096
|
||
final Lucene90CustomCompressionCodec.Mode mode; | ||
|
||
private static int compressionLevel; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not make it static?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed static
moduleImplementation project(':lucene:core') | ||
moduleImplementation 'com.github.luben:zstd-jni:1.5.0-4' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just minor suggestion:
current version is 1.5.1-1
could bring another few percent compression speed
https://github.com/facebook/zstd/releases/tag/v1.5.1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Tobias,
I have tested 1.5.0-X and the data looks to be very promising with the current version. I have not tested 1.5.1-X JNI version but https://github.com/facebook/zstd/releases/tag/v1.5.1 looks good and can be picked up in the future since Zstd is very dynamic and we may see a lot of improvements over the current version!
Thanks!!
Is this PR going to be merged? What release is this planned for? |
See discussion on LUCENE-8739, this PR is unlikely going to get merged. |
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution! |
|
||
// dictionary compression first | ||
doCompress(bytes, off, dictLength, cctx, out); | ||
cctx.loadDict(new ZstdDictCompress(bytes, off, dictLength, this.compressionLevel)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should close this ZstdDictCompress object, or it will cause a memory leak. I have used your code in Elasticsearch and found it to be useful, but this is a minor issue that needs to be fixed.
// decompress dictionary first | ||
doDecompress(in, dctx, bytes, dictLength); | ||
|
||
dctx.loadDict(new ZstdDictDecompress(bytes.bytes, 0, dictLength)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should close this ZstdDictDecompress object too.
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution! |
Description
#9784
Lucene currently supports LZ4 and Zlib compression/decompression for StoredFieldsFormat. We propose Zstandard (https://facebook.github.io/zstd/) compression/decompression for StoredFieldsFormat for the following reasons:
(https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/).
Solution
Tests
Checklist
Please review the following and check all that apply:
main
branch../gradlew check
.