@sircodesalotOfTheRound
Contributor

In response to PARQUET-251, I created an integration test that generates random values and compares the statistics against the values read from a Parquet file.

There are two utility classes, DataGenerationContext and RandomValueGenerators, which are located in the same package as the unit test. I'm sure there is a better place to put these, but I leave that to your discretion.

Thanks
Reuben

@sircodesalotOfTheRound
Contributor Author

A brief outline of the code:

(1) There is one test function that generates random data using a random block / page size.

(2) The DataGenerationContext and the nested WriteContext callback classes use inversion of control (IoC) to generate the file writer, write the data, and then finally test the code.

(3) The tests each use RandomValueGenerators, which generates a random list of values; these values are then written to the file.

(4) On test, each page is read from the file. We use a PagingValidator class which reads the statistics from each of the pages.

(5) The value generator originally used to write the values is then queried for what it believes ought to be the correct statistics for the entries on this page. These values are then compared against what is actually read from the file.

If all of the statistics match the anticipated values, then the test passes.
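
To make step (5) of the outline concrete, here is a minimal sketch of a generator that remembers what it produced and can later report the expected min and max for a range of entries. The class name `RecordingIntGenerator` and its methods are hypothetical and are not the classes in this pull request; the sketch only illustrates the idea of asking the generator what the statistics ought to be.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration; not the RandomValueGenerators class from this PR.
class RecordingIntGenerator {
  private final Random random;
  // Every value handed out, in order, so expected statistics can be derived later.
  private final List<Integer> written = new ArrayList<>();

  RecordingIntGenerator(long seed) {
    this.random = new Random(seed);
  }

  // Produces the next value and records it.
  int next() {
    int value = random.nextInt(1000);
    written.add(value);
    return value;
  }

  // Expected {min, max} for the entries in [from, to), e.g. the entries of one page.
  // Assumes the range is non-empty.
  int[] expectedMinMax(int from, int to) {
    int min = Integer.MAX_VALUE;
    int max = Integer.MIN_VALUE;
    for (int value : written.subList(from, to)) {
      min = Math.min(min, value);
      max = Math.max(max, value);
    }
    return new int[] {min, max};
  }
}
```

A test would call `next()` for every value it writes, then compare `expectedMinMax(from, to)` against the statistics read back for the page that holds those entries.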

@rdblue
Contributor

rdblue commented Aug 11, 2015

@sircodesalotOfTheRound, thanks for working on this! I like the overall approach of using randomly-generated data, writing an entire file, and validating the pages individually. I had a few comments about how you're doing those tasks, but overall it is a great start. Thanks!

I think we should use a constant value as seed so that we can have reproducible tests with the same random values. If we use a dynamic value, then a bug might appear randomly.
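
As a rough illustration of the constant-seed suggestion: `java.util.Random` produces the same sequence for the same seed, so a fixed seed makes a failing run repeatable. The class name `SeededValues` and the seed value below are arbitrary assumptions for the sketch, not code from this pull request.

```java
import java.util.Random;

// Hypothetical helper, for illustration only.
class SeededValues {
  // Arbitrary fixed seed; any constant works as long as it does not change between runs.
  static final long SEED = 309L;

  // Generates `count` ints from a generator created with the given seed.
  static int[] generate(long seed, int count) {
    Random random = new Random(seed);
    int[] values = new int[count];
    for (int i = 0; i < count; i++) {
      values[i] = random.nextInt();
    }
    return values;
  }
}
```

Two calls to `SeededValues.generate(SeededValues.SEED, n)` return identical arrays, which is what lets a bug found once be reproduced on the next run.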

Contributor

I agree. I accidentally put my comments on the commit rather than the pull request. More comments from me here: sircodesalotOfTheRound@e05447e

Contributor

I don't think the validation should call updateStats. Otherwise, you're delegating to the implementation you're trying to test. Instead, the test should check that the min value is less than or equal to each value in the page. Similarly, you should count nulls while iterating through the page values and validate that the count matches the stats. That way we are checking the meaning of the stats values independent of the implementation.

I think I mentioned this before, but it was probably overlooked in a sea of comments.
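
A minimal sketch of that kind of implementation-independent check, assuming a simplified stand-in for the page statistics. `PageStatsSnapshot` and `StatsConsistency` are hypothetical names, not Parquet's `Statistics` class or anything in this PR, and the max check is added by analogy with the min check described above.

```java
import java.util.List;

// Hypothetical stand-in for statistics read from one page; not Parquet's Statistics API.
class PageStatsSnapshot {
  final long min;
  final long max;
  final long nullCount;

  PageStatsSnapshot(long min, long max, long nullCount) {
    this.min = min;
    this.max = max;
    this.nullCount = nullCount;
  }
}

class StatsConsistency {
  // Returns true when the stats are consistent with the page values.
  // A null list element represents a null entry in the page.
  static boolean isConsistent(PageStatsSnapshot stats, List<Long> pageValues) {
    long nulls = 0;
    for (Long value : pageValues) {
      if (value == null) {
        nulls++;
        continue;
      }
      // min must not exceed any value, and max must not be below any value.
      if (stats.min > value || stats.max < value) {
        return false;
      }
    }
    // The null count is recomputed here rather than taken from the writer.
    return nulls == stats.nullCount;
  }
}
```

Nothing here calls into the code that produced the statistics, so a bug in that code cannot hide itself from the check.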

@sircodesalotOfTheRound force-pushed the stats-validation branch 2 times, most recently from c770a62 to 6cf6aef, August 21, 2015 17:41
Contributor

This should be UnsupportedOperationException. NotImplementedException comes from sun.reflect.

Contributor Author

Ah ok. C# Convention, thanks.
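
For reference, a small sketch of the Java idiom discussed in this thread: `UnsupportedOperationException` lives in `java.lang`, so it needs no import and does not depend on internal `sun.*` packages. The `SketchGenerator` and `ConstantSketchGenerator` classes are hypothetical and only illustrate where such an exception might be thrown.

```java
// Hypothetical generator base class, for illustration only.
abstract class SketchGenerator<T> {
  abstract T nextValue();

  // Default for generators that do not support this operation.
  T minimumValue() {
    throw new UnsupportedOperationException(
        "minimumValue is not supported by " + getClass().getSimpleName());
  }
}

// A generator that leaves minimumValue() unimplemented.
class ConstantSketchGenerator extends SketchGenerator<Integer> {
  @Override
  Integer nextValue() {
    return 42;
  }
}
```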

rdblue and others added 2 commits September 11, 2015 10:09
This changes the test implementation to use ColumnReaderImpl rather than
reimplementing parts of it. That change enables this to test
dictionary-encoded pages and this commit introduces contexts to test
them. This also cleans up a few minor issues that were artifacts of the
tests changing quite a bit, like the random value generators expecting a
PrimitiveTypeName that wasn't used.
@rdblue
Contributor

rdblue commented Sep 17, 2015

@sircodesalotOfTheRound, can you update the title of this issue to "PARQUET-355: Add..." so that we can merge it? Thanks!

@sircodesalotOfTheRound sircodesalotOfTheRound changed the title Add Statistics Test for Parquet Columns PARQUET-251: Add Statistics Test for Parquet Columns Sep 17, 2015
@sircodesalotOfTheRound sircodesalotOfTheRound changed the title PARQUET-251: Add Statistics Test for Parquet Columns PARQUET-355: Add Statistics Test for Parquet Columns Sep 17, 2015
@sircodesalotOfTheRound
Contributor Author

Okay, updated the title. Thanks!

@asfgit asfgit closed this in c381968 Sep 18, 2015
rdblue pushed a commit to rdblue/parquet-mr that referenced this pull request Jul 13, 2016
In response to PARQUET-251 created an integration test that generates random values and compares the statistics against the values read from a parquet file.

There are two tools classes `DataGenerationContext` and `RandomValueGenerators` which are located in the same package as the unit test. I'm sure there is a better place to put these, but I leave that to your discretion.

Thanks
Reuben

Author: Reuben Kuhnert <sircodesalot@gmail.com>
Author: Ryan Blue <blue@apache.org>

Closes apache#255 from sircodesalotOfTheRound/stats-validation and squashes the following commits:

680e96a [Reuben Kuhnert] Merge pull request #1 from rdblue/PARQUET-355-stats-validation-tests
9f0033f [Ryan Blue] PARQUET-355: Use ColumnReaderImpl.
7d0b4fe [Reuben Kuhnert] PARQUET-355: Add Statistics Validation Test
rdblue pushed a commit to rdblue/parquet-mr that referenced this pull request Jan 6, 2017