Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ORC-XXX: Support orc.compression.zstd.workers #1756

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

dongjoon-hyun
Copy link
Member

What changes were proposed in this pull request?

Why are the changes needed?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

@github-actions github-actions bot added the JAVA label Jan 17, 2024
@dongjoon-hyun
Copy link
Member Author

Evaluating if this has any benefits in ORC.

@cxzl25
Copy link
Contributor

cxzl25 commented Jan 17, 2024

Thanks @dongjoon-hyun for doing this, this is also what I want to introduce this configuration after zstd-jni merge, like Spark and Parquet also have similar configurations.

@dongjoon-hyun
Copy link
Member Author

Ya, indeed.

BTW, it seems that there is no perf gain with this so far. Interesting.

@cxzl25
Copy link
Contributor

cxzl25 commented Jan 19, 2024

it seems that there is no perf gain with this so far

Based on the product environment verification of this PR, I tested orc.compression.zstd.workers 0, 6, 15, and 16, and there seems to be no difference.

Although Paruqet also provides options for the number of zstd workers.

https://github.com/apache/parquet-mr/blob/c82d5b471a558124b03e67759038661a046f5938/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/codec/ZstandardCodec.java#L53-L54

https://facebook.github.io/zstd/zstd_manual.html

    ZSTD_c_nbWorkers=400,    /* Select how many threads will be spawned to compress in parallel.
                              * When nbWorkers >= 1, triggers asynchronous mode when invoking ZSTD_compressStream*() :
                              * ZSTD_compressStream*() consumes input and flush output if possible, but immediately gives back control to caller,
                              * while compression is performed in parallel, within worker thread(s).
                              * (note : a strong exception to this rule is when first invocation of ZSTD_compressStream2() sets ZSTD_e_end :
                              *  in which case, ZSTD_compressStream2() delegates to ZSTD_compress2(), which is always a blocking call).
                              * More workers improve speed, but also increase memory usage.
                              * Default value is `0`, aka "single-threaded mode" : no worker is spawned,
                              * compression is performed inside Caller's thread, and all invocations are blocking */

@dongjoon-hyun
Copy link
Member Author

Thank you for double-check. Ya, it seems that our implementation has some limitations or bug.

Apache Spark also has the ZStandardCodec implementation based on this zstd-jni and it shows 30% or 40% improvement in the micro-bencharmk.

I'm still digging this because I believe this should be a part of Apache ORC 2.0.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants