Avoid abuse dictionary encoding in parquet writer #1105

Closed
Tracked by #924
ShiKaiWi opened this issue Jul 25, 2023 · 0 comments
Labels
feature New feature or request

Describe This Problem

Currently, the parquet writer uses its default configuration to generate SSTs, so dictionary encoding is applied to byte columns (including string columns) by default. Only when the dictionary page exceeds 1MB does the parquet writer fall back to plain encoding without the dictionary.

However, dictionary encoding may not be very efficient for high-cardinality columns.
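To make the concern concrete, here is a rough back-of-the-envelope sketch in Rust. It compares an estimated plain-encoded size (a 4-byte length prefix plus the bytes of each value) with an estimated dictionary-encoded size (one dictionary entry per distinct value plus bit-packed indices). The function names are hypothetical, and the model deliberately ignores page headers, RLE runs, and compression, so treat the numbers as an illustration rather than what the writer actually produces.

```rust
use std::collections::HashSet;

/// Estimated size of a byte-array column under plain encoding:
/// every value costs a 4-byte length prefix plus its bytes.
fn plain_size(values: &[&str]) -> usize {
    values.iter().map(|v| 4 + v.len()).sum()
}

/// Number of bits needed to address `n` dictionary entries (at least 1).
fn bit_width(n: usize) -> usize {
    if n <= 1 {
        1
    } else {
        (usize::BITS - (n - 1).leading_zeros()) as usize
    }
}

/// Estimated size under dictionary encoding (simplified model):
/// the dictionary page stores each distinct value once in plain form,
/// and each row stores a bit-packed index into the dictionary.
fn dictionary_size(values: &[&str]) -> usize {
    let distinct: HashSet<&str> = values.iter().copied().collect();
    let dict_page: usize = distinct.iter().map(|v| 4 + v.len()).sum();
    let index_bytes = (values.len() * bit_width(distinct.len()) + 7) / 8;
    dict_page + index_bytes
}
```

Under this model, 2000 rows alternating between two 6-byte values cost about 20000 bytes plain versus 270 bytes with a dictionary, whereas 1000 fully distinct 7-byte values cost 11000 bytes plain versus 12250 bytes with a dictionary, because the dictionary repeats every value and the indices are pure overhead.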

Proposal

Maybe we should decide whether to use dictionary encoding based on the column's cardinality. However, the cardinality threshold for enabling dictionary encoding should be benchmarked and tested.
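A minimal sketch of what such a per-column decision could look like, assuming a sample of the column's values is available before writing. The function name `should_use_dictionary` and the `MAX_DISTINCT_RATIO` value are hypothetical placeholders; the real threshold would have to come from the benchmarks mentioned above.

```rust
use std::collections::HashSet;

/// Hypothetical threshold: ratio of distinct values to total values above
/// which dictionary encoding is skipped. Placeholder only, not benchmarked.
const MAX_DISTINCT_RATIO: f64 = 0.5;

/// Decide from a sample of column values whether dictionary encoding
/// looks worthwhile. An empty sample keeps the writer's default (enabled).
fn should_use_dictionary(sample: &[&[u8]], max_distinct_ratio: f64) -> bool {
    if sample.is_empty() {
        return true;
    }
    let distinct: HashSet<&[u8]> = sample.iter().copied().collect();
    let ratio = distinct.len() as f64 / sample.len() as f64;
    ratio <= max_distinct_ratio
}

// The result could then be applied per column when building the writer
// configuration, for example through a per-column dictionary switch in the
// writer properties (assumed here; the exact wiring depends on the writer).
```

Calling `should_use_dictionary(&sample, MAX_DISTINCT_RATIO)` for each bytes column would keep the dictionary for columns with many repeated values and disable it for columns where most values are unique.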

Additional Context

No response
