-
Notifications
You must be signed in to change notification settings - Fork 1.5k
PARQUET-355: Add Statistics Test for Parquet Columns #255
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-355: Add Statistics Test for Parquet Columns #255
Conversation
|
A brief outline of the Code: (1) There is one test function that is generates random data using a random block / page size. (2) The (3) The tests each use (4) On test, each page is read from the file. We use a (5) The value-generator originally used to write the values is then queried for what it believes ought to the the correct statistics for the entries on this page. These values are then compared against what is actually read from the file. If all of the statistics match the anticipated values, then the test passes. |
|
@sircodesalotOfTheRound, thanks for working on this! I like the overall approach of using randomly-generated data, writing an entire file, and validating the pages individually. I had a few comments about how you're doing those tasks, but overall it is a great start. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should use a constant value as seed so that we can have reproducible tests with the same random values. If we use a dynamic value, then a bug might appear randomly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, I just put my comments on the commit rather than the pull request by accident. More comments from me here: sircodesalotOfTheRound@e05447e
e05447e to
afc43f6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the validation should call updateStats. Otherwise, you're delegating to the implementation you're trying to test. Instead, the test should be that the min value is less-than or equal-to each value in the page. Similarly, you should count nulls and validate that the number matches up after iterating through the page values. That way we are checking the meaning of the stats values independent of the implementation.
I think I mentioned this before, but it was probably overlooked in a sea of comments.
c770a62 to
6cf6aef
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be UnsupportedOperationException. NotImplementedException comes from sun.reflect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok. C# Convention, thanks.
6cf6aef to
7d0b4fe
Compare
This changes the test implementation to use ColumnReaderImpl rather than reimplementing parts of it. That change enables this to test dictionary-encoded pages and this commit introduces contexts to test them. This also cleans up a few minor issues that were artifacts of the tests changing quite a bit, like the random value generators expecting a PrimitiveTypeName that wasn't used.
PARQUET-355: Use ColumnReaderImpl.
|
@sircodesalotOfTheRound, can you update the title of this issue to "PARQUET-355: Add..." so that we can merge it? Thanks! |
|
Okay, updated the title. Thanks! |
In response to PARQUET-251 created an integration test that generates random values and compares the statistics against the values read from a parquet file. There are two tools classes `DataGenerationContext` and `RandomValueGenerators` which are located in the same package as the unit test. I'm sure there is a better place to put these, but I leave that to your discretion. Thanks Reuben Author: Reuben Kuhnert <sircodesalot@gmail.com> Author: Ryan Blue <blue@apache.org> Closes apache#255 from sircodesalotOfTheRound/stats-validation and squashes the following commits: 680e96a [Reuben Kuhnert] Merge pull request #1 from rdblue/PARQUET-355-stats-validation-tests 9f0033f [Ryan Blue] PARQUET-355: Use ColumnReaderImpl. 7d0b4fe [Reuben Kuhnert] PARQUET-355: Add Statistics Validation Test
In response to PARQUET-251 created an integration test that generates random values and compares the statistics against the values read from a parquet file. There are two tools classes `DataGenerationContext` and `RandomValueGenerators` which are located in the same package as the unit test. I'm sure there is a better place to put these, but I leave that to your discretion. Thanks Reuben Author: Reuben Kuhnert <sircodesalot@gmail.com> Author: Ryan Blue <blue@apache.org> Closes apache#255 from sircodesalotOfTheRound/stats-validation and squashes the following commits: 680e96a [Reuben Kuhnert] Merge pull request #1 from rdblue/PARQUET-355-stats-validation-tests 9f0033f [Ryan Blue] PARQUET-355: Use ColumnReaderImpl. 7d0b4fe [Reuben Kuhnert] PARQUET-355: Add Statistics Validation Test
In response to PARQUET-251 created an integration test that generates random values and compares the statistics against the values read from a parquet file.
There are two tools classes
DataGenerationContextandRandomValueGeneratorswhich are located in the same package as the unit test. I'm sure there is a better place to put these, but I leave that to your discretion.Thanks
Reuben