Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: add more tests for statistics reading #10592

Merged
merged 2 commits into from
May 21, 2024

Conversation

NGA-TRAN
Copy link
Contributor

@NGA-TRAN NGA-TRAN commented May 20, 2024

Which issue does this PR close?

More tests for #10453

Rationale for this change

Add tests for parquet statistics reading

What changes are included in this PR?

Tests and bug tickets linked to appropriate tests

Are these changes tested?

Yes

Are there any user-facing changes?

No

@NGA-TRAN NGA-TRAN marked this pull request as ready for review May 21, 2024 16:15
@NGA-TRAN
Copy link
Contributor Author

@alamb : This PR is ready for review

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @NGA-TRAN good PR
Nice test description, this is important

What I still cannot understand is this a regression test for the bug we missed earlier?

@NGA-TRAN
Copy link
Contributor Author

@comphead

What I still cannot understand is this a regression test for the bug we missed earlier?

I am working on new arrow statistics API #10453 and @alamb suggested me to add more coverage tests as well as moving tests for private functions in datafusion/core/src/datasource/physical_plan/parquet/statistics.rs to this file. Sorry that I am not aware of available/related bug tickets

@alamb
Copy link
Contributor

alamb commented May 21, 2024

What I still cannot understand is this a regression test for the bug we missed earlier?

My strong suspicion is that the bugs @NGA-TRAN is finding would manifest themselves as potentially incorrect results when reading parquet files predicates on these types of columns.

It may also simply manifest as not being able to prune row groups based on such predicates

I haven't worked to make a reproducer as it would likely require creating parquet files with multiple row groups with carefully chosen data patterns

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @NGA-TRAN -- I think this is excellent test coverage

let row_per_group = 5;
// This creates a parquet file of 1 column "decimal_col" with decimal data type and precicion 9, scale 2
// file has 3 record batches, each has 5 rows. They will be saved into 3 row groups
let reader = parquet_file_many_columns(Scenario::Decimal, row_per_group).await;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another important thing to test with decimals is different precision / scales -- maybe we can do this as a different PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb : can you be more specific? In the test below, I had to make sure they have the same precision & scale. What else do we have to test here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking smaller precisions -- I can't remember but I vaguely remember that spark stores different scale decimals with different underlying datatypes or something

"frontend five",
"backend one",
"backend eight",
])), // Shuld be BinaryArray
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@alamb
Copy link
Contributor

alamb commented May 21, 2024

Since this PR is just tests, merging it in

@alamb alamb merged commit 96e0ee6 into apache:main May 21, 2024
24 checks passed
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
* test: add more tests for statistics reading

* Link bug tickets to the tests and run fmt
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants