
Enabling Compression Level configuration of ZSTD and ZSTD (No Dict) #7555

Closed
sarthakaggarwal97 opened this issue May 14, 2023 · 10 comments · Fixed by #8312
Labels: enhancement (Enhancement or improvement to existing feature or request), v2.9.0 (Issues and PRs related to version v2.9.0)

Comments

@sarthakaggarwal97
Contributor

sarthakaggarwal97 commented May 14, 2023

Is your feature request related to a problem? Please describe.
In the present implementation, zstd and zstd_no_dict use the default compression level of 6, as mentioned here, here, and here. According to the documentation, these algorithms support compression levels in the range 1-22. Theoretically, configurable compression levels can help customers tune the trade-off between storage and throughput according to their requirements.
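The storage/throughput trade-off described above can be illustrated with Python's standard-library `zlib` (levels 1-9) as a stand-in, since zstd bindings are not in the standard library; zstd behaves analogously over its 1-22 range. This is a minimal sketch of the general level trade-off, not OpenSearch's actual codec code.

```python
# Illustrative only: zlib as an analogy for the zstd compression-level
# trade-off. Lower levels favor throughput, higher levels favor storage.
import zlib

data = b"".join(str(i).encode() for i in range(5000)) * 4

fast = zlib.compress(data, level=1)   # favors indexing throughput
small = zlib.compress(data, level=9)  # favors store size

# Both levels decompress to identical original bytes; only the
# compressed size and the CPU cost of compressing differ.
assert zlib.decompress(fast) == data
assert zlib.decompress(small) == data
assert len(small) <= len(fast)
print(len(data), len(fast), len(small))
```

The same shape of trade-off is what the benchmarks later in this thread quantify for zstd levels 3 and 6.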

Describe the solution you'd like
Introduce a configurable compression level as an index setting for zstd and zstd_no_dict.
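A hypothetical shape for such a setting is sketched below. The setting name `index.codec.compression_level` and the request body are illustrative assumptions only; the actual setting name and bounds were decided in the linked PR.

```json
PUT /my-index
{
  "settings": {
    "index": {
      "codec": "zstd_no_dict",
      "codec.compression_level": 3
    }
  }
}
```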

Additional context
#3354

cc: @mgodwan @backslasht @shwetathareja

@sarthakaggarwal97 sarthakaggarwal97 added the enhancement and untriaged labels on May 14, 2023
@shwetathareja
Member

@sarthakaggarwal97 OpenSearch users can use some of the existing benchmarks for reference when choosing a compression level. It would be great if you could also share some of your benchmark runs across the different levels.

@dblock
Member

dblock commented May 16, 2023

We've been collaborating with the Intel team and benchmarking compression levels in additional real-world scenarios with @backslasht. Stay tuned @shwetathareja!

@dblock
Member

dblock commented May 17, 2023

Related, in #7475 we are proposing to make the block size configurable as well.

@sarthakaggarwal97
Contributor Author

sarthakaggarwal97 commented Jul 4, 2023

Currently, we have set the default compression level for zstd and zstd_no_dict to level 6, here.

In the experiments, I'm observing an improvement of roughly 5-6% in average indexing throughput with level 3 as the default compression level.

Benchmarks

NYC Taxis Dataset

*(benchmark chart)*

HTTP Logs Dataset

*(benchmark chart)*

I think we should switch to level 3 as the default compression level. Moreover, zstd itself internally uses level 3 as its default, as mentioned here.

cc: @mgodwan @shwetathareja @backslasht @dblock @reta

@mgodwan
Member

mgodwan commented Jul 4, 2023

Thanks for sharing these numbers.
Given that level 3 is the default level in the zstd implementation as well, and yields better throughput in the majority of cases, it should be alright to proceed with it.

Is there any comparison of storage used as well across these to get insights around the trade-off?

@sarthakaggarwal97
Contributor Author

sarthakaggarwal97 commented Jul 4, 2023

> Thanks for sharing these numbers. Given level 3 is the default level used in zstd implementation as well and yields better results in terms of throughput in majority of the cases, it should be alright to proceed with the same.
>
> Is there any comparison of storage used as well across these to get insights around the trade-off?

@mgodwan Adding the store-size comparison alongside indexing throughput. We see roughly a 4% increase in store size with level 3, using level 6 as the baseline.
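For clarity, the percentage deltas quoted in this thread (the 5-6% throughput gain and the ~4% store-size increase) come from comparing each level-3 measurement against the level-6 baseline. The sketch below shows that computation; the sample numbers are hypothetical stand-ins, not values from the actual benchmark runs.

```python
# Hypothetical helper illustrating how the relative deltas are derived
# from a baseline (level 6) and candidate (level 3) measurement.
def pct_change(baseline: float, candidate: float) -> float:
    """Relative change of candidate vs. baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0

# Made-up example: mean indexing throughput (docs/s), level 6 vs level 3
print(round(pct_change(100_000, 105_500), 1))  # -> 5.5 (throughput gain)
# Made-up example: store size (GB), level 6 vs level 3
print(round(pct_change(50.0, 52.0), 1))        # -> 4.0 (storage increase)
```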

NYC Taxis

*(benchmark chart)*

HTTP Logs

*(benchmark chart)*

@backslasht
Contributor

Thanks @sarthakaggarwal97. This seems like a safe bet, as the gain in indexing throughput is proportional to the increase in storage size.

Have we run the tests for a sufficiently long time? I am wondering whether it will also increase background merge time, since segments might be larger with compression level 3.

@sarthakaggarwal97
Contributor Author

@backslasht yes, I observed background merges happening during the runs. The segments produced with level 3 should not be large enough to start affecting background merge time compared to level 6. Level 3 looks like a good sweet spot.
I will raise a PR to change the default compression level to 3.

@dblock
Member

dblock commented Jul 6, 2023

Can this be closed with #8471?

@backslasht
Contributor

@dblock - I think this will require #8312 to be closed.
