Add parquet integration tests for explicitly smaller page sizes, page pruning #4131
Conversation
Force-pushed from 438eac0 to c572688
#[cfg(not(target_family = "windows"))]
#[tokio::test]
async fn single_file_small_data_pages() {
Here is the new test -- I verified manually the layout was good (typically makes 6 data pages for each column chunk)
👍 cool!
// page 5: DLE:RLE RLE:RLE VLE:RLE_DICTIONARY ST:[min: 1970-01-01T00:00:00.000000000, max: 1970-01-01T00:00:00.005330944, num_nulls not defined] CRC:[none] SZ:12601 VC:7739
TestCase::new(&test_parquet_file)
    .with_name("selective")
    // predicagte is chosen carefully to prune pages
Suggested change:
- // predicagte is chosen carefully to prune pages
+ // predicate is chosen carefully to prune pages
pub fn try_new(
    path: PathBuf,
    props: WriterProperties,
This change passes the writer props in directly, removing a level of indirection in how the file is created and making the code easier to read.
if let Some(s) = opt.page_size {
    props_builder = props_builder
        .set_data_pagesize_limit(s)
        .set_write_batch_size(s);
This is looking good! 👍 Once I tried to create pages with fewer rows than DEFAULT_WRITE_BATCH_SIZE, forgetting to set write_batch_size puzzled me for a while 😂
Maybe we can add this info in arrow-rs 🤔
> Maybe we can add this info in arrow-rs 🤔

@Ted-Jiang I am not sure what you mean. Do you mean improving the documentation here?
@alamb
I mean the setting here: https://docs.rs/parquet/latest/parquet/file/properties/struct.WriterPropertiesBuilder.html#method.set_data_page_row_count_limit
If a user only sets set_data_page_row_count_limit to 100 (without also setting write_batch_size to 100), the limit is less than the default DEFAULT_WRITE_BATCH_SIZE (1024), and I think it will still cut one page per 1024 rows 🤔.
I will test it.
The test shows:
First I set this in single_file_small_data_pages:
let props = WriterProperties::builder()
    .set_data_page_row_count_limit(100)
    .build();
Then I ran parquet-tools column-index -c service /Users/yangjiang/data_8311.parquet:
offset index for column service:
offset compressed size first row index
page-0 29 48 0
page-1 77 48 1024
page-2 125 48 2048
page-3 173 48 3072
page-4 221 48 4096
page-5 269 48 5120
page-6 317 48 6144
page-7 365 48 7168
page-8 413 48 8192
page-9 461 48 9216
page-10 509 48 10240
page-11 557 48 11264
page-12 605 48 12288
page-13 653 48 13312
page-14 701 48 14336
page-15 749 48 15360
page-16 797 48 16384
page-17 845 48 17408
page-18 893 48 18432
page-19 941 48
So data_page_row_count_limit only takes effect when it is less than write_batch_size 😂
I think this is because arrow-rs is a vectored writer and writes one batch at a time 🤔
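The observed behavior can be sketched with a small std-only simulation (a hypothetical model for illustration, not the actual arrow-rs code): if the writer only checks the row-count limit after flushing each write batch, then a limit of 100 with the default batch size of 1024 still produces one page per 1024 rows, matching the first-row indices in the offset index above.

```rust
// Hypothetical model of batch-granularity page cutting: the row-count limit
// is only consulted after each write batch, never mid-batch.
fn page_first_row_indices(
    total_rows: usize,
    write_batch_size: usize,
    row_count_limit: usize,
) -> Vec<usize> {
    let mut starts = Vec::new();
    let mut page_start = 0usize;
    let mut written = 0usize;
    while written < total_rows {
        // Write one batch of rows.
        let batch = write_batch_size.min(total_rows - written);
        written += batch;
        // The limit is only checked here, at batch granularity.
        if written - page_start >= row_count_limit {
            starts.push(page_start);
            page_start = written;
        }
    }
    if page_start < total_rows {
        starts.push(page_start);
    }
    starts
}

fn main() {
    // limit 100 < batch 1024: pages are still cut every 1024 rows,
    // like the offset index printed by parquet-tools above.
    let starts = page_first_row_indices(20480, 1024, 100);
    assert_eq!(&starts[..3], &[0, 1024, 2048]);
    assert_eq!(starts.len(), 20);

    // Lowering write_batch_size to the limit makes the limit effective.
    let starts = page_first_row_indices(500, 100, 100);
    assert_eq!(starts, vec![0, 100, 200, 300, 400]);
}
```

Under this model the page boundaries land on multiples of write_batch_size whenever the row-count limit is smaller than the batch size, which is consistent with the "best effort" wording in the builder documentation.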
Oh, it's already documented 😭, I should read the docs more carefully 😩
> Note: this is a best effort limit based on the write batch size
👍 I filed apache/arrow-rs#3068 to try and make the comments somewhat clearer
Benchmark runs are scheduled for baseline = 47f0e5a and contender = 175adbd. 175adbd is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?
Draft as it builds on #4062. Closes #4087.
Rationale for this change
Some of the parquet filter pushdown features do not really help unless there are multiple data pages, so I wanted to cover these with tests.
As it turns out, I found (rediscovered) a gap, #3833, while writing these tests 🎉
What changes are included in this PR?
Are there any user-facing changes?
No