
Add parquet integration tests for explicitly smaller page sizes, page pruning #4131

Merged (1 commit) Nov 8, 2022

Conversation

@alamb (Contributor) commented Nov 7, 2022

Which issue does this PR close?

Draft as it builds on #4062

Closes #4087

Rationale for this change

Some of the parquet filter pushdown features do not really help unless there are multiple data pages. I wanted to cover these with tests.

As it turns out I found (rediscovered) a gap #3833 while writing these tests 🎉

What changes are included in this PR?

  1. Add a max page row count to the parquet filter integration tests
  2. Add tests that page index pruning is working as expected

Are there any user-facing changes?

No

@github-actions github-actions bot added the core Core DataFusion crate label Nov 7, 2022
@alamb alamb marked this pull request as draft November 7, 2022 18:38
@alamb alamb force-pushed the alamb/test_with_smaller_pages branch from 438eac0 to c572688 on November 7, 2022 19:49

#[cfg(not(target_family = "windows"))]
#[tokio::test]
async fn single_file_small_data_pages() {
alamb (author):
Here is the new test -- I verified manually that the layout was good (it typically produces 6 data pages for each column chunk)

Member:

👍 cool!

// page 5: DLE:RLE RLE:RLE VLE:RLE_DICTIONARY ST:[min: 1970-01-01T00:00:00.000000000, max: 1970-01-01T00:00:00.005330944, num_nulls not defined] CRC:[none] SZ:12601 VC:7739
TestCase::new(&test_parquet_file)
.with_name("selective")
// predicagte is chosen carefully to prune pages
alamb (author) suggested:

Suggested change:
- // predicagte is chosen carefully to prune pages
+ // predicate is chosen carefully to prune pages

alamb (author):

Done in #4130

c9e676d

pub fn try_new(
path: PathBuf,
props: WriterProperties,
alamb (author):

This change passes the writer props in directly, removing a level of indirection in how the file is created and making the code easier to read.

@alamb alamb marked this pull request as ready for review November 7, 2022 19:53
@alamb alamb requested a review from Ted-Jiang November 7, 2022 19:53

if let Some(s) = opt.page_size {
props_builder = props_builder
.set_data_pagesize_limit(s)
.set_write_batch_size(s);
Member:

This is looking good! 👍 Once, when I tried to create pages with a row count less than DEFAULT_WRITE_BATCH_SIZE, forgetting to set write_batch_size puzzled me for a while 😂

Member:

Maybe we can add this info in arrow-rs 🤔

alamb (author):

> Maybe we can add this info in arrow-rs 🤔

@Ted-Jiang I am not sure what you mean. Do you mean improving the documentation here?

https://docs.rs/parquet/latest/parquet/file/properties/struct.WriterPropertiesBuilder.html#method.set_write_batch_size

Ted-Jiang (Member) commented Nov 8, 2022:

@alamb I mean documenting it here: https://docs.rs/parquet/latest/parquet/file/properties/struct.WriterPropertiesBuilder.html#method.set_data_page_row_count_limit

If a user only sets set_data_page_row_count_limit to 100 (without also setting set_write_batch_size to 100), the limit is less than the default DEFAULT_WRITE_BATCH_SIZE (1024), and I think the writer will still cut one page per 1024 rows 🤔.

Member:

I will test it.

Member:

Test results: first I set this in single_file_small_data_pages:

 let props = WriterProperties::builder()
        .set_data_page_row_count_limit(100)
        .build();

Then I ran parquet-tools column-index -c service /Users/yangjiang/data_8311.parquet:

offset index for column service:
                          offset   compressed size       first row index
page-0                        29                48                     0
page-1                        77                48                  1024
page-2                       125                48                  2048
page-3                       173                48                  3072
page-4                       221                48                  4096
page-5                       269                48                  5120
page-6                       317                48                  6144
page-7                       365                48                  7168
page-8                       413                48                  8192
page-9                       461                48                  9216
page-10                      509                48                 10240
page-11                      557                48                 11264
page-12                      605                48                 12288
page-13                      653                48                 13312
page-14                      701                48                 14336
page-15                      749                48                 15360
page-16                      797                48                 16384
page-17                      845                48                 17408
page-18                      893                48                 18432
page-19                      941                48

Ted-Jiang (Member) commented Nov 8, 2022:

So the row count limit only takes effect when write_batch_size is no larger than data_page_row_count_limit 😂
I think that is because arrow-rs uses a vectored writer that writes one batch at a time 🤔

Ted-Jiang (Member) commented Nov 8, 2022:

Oh, it's already documented 😭, I should have read the docs more carefully 😩:

> Note: this is a best effort limit based on the write batch size

alamb (author):

👍 I filed apache/arrow-rs#3068 to try and make the comments somewhat clearer

@alamb alamb merged commit 175adbd into apache:master Nov 8, 2022

ursabot commented Nov 8, 2022

Benchmark runs are scheduled for baseline = 47f0e5a and contender = 175adbd. 175adbd is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@alamb alamb deleted the alamb/test_with_smaller_pages branch November 8, 2022 12:14
Successfully merging this pull request may close these issues.

Add additional testing to parquet predicate pushdown integration tests
3 participants