
Conversation

@onlynone

Copied from: https://github.com/Parquet/parquet-mr/pull/413

However, as tomwhite mentioned, there might be a better way to do this.

I had also written this:

`current++;` still doesn't seem correct even when currentValue != null. Imagine a block with 100 records, but only the record at position 50 matches our filter. In this case, the first time nextKeyValue() is called, it will call recordReader.read(), which will successfully find the record at pos 50, but current will just be incremented to 1.
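The miscount is easy to reproduce in isolation. Here is a minimal standalone sketch (the class and names are hypothetical, not parquet-mr code) of the scenario above: the filtered reader consumes 51 rows internally to find the match at position 50, but a single `current++` reports progress as 1 out of 100.

```java
// Hypothetical model of the miscount: not the real parquet-mr classes.
public class ProgressMiscount {
  static final int TOTAL = 100;
  static final int MATCH_POS = 50;

  // models FilteredRecordReader.read(): consumes rows internally until a
  // match is found, returning null once the block is exhausted
  public static Integer read(int[] cursor) {
    while (cursor[0] < TOTAL) {
      int pos = cursor[0]++;
      if (pos == MATCH_POS) {
        return pos; // the only matching record in the block
      }
    }
    return null;
  }

  public static void main(String[] args) {
    int[] cursor = {0};
    int current = 0;
    Integer value = read(cursor); // internally consumed 51 rows...
    current++;                    // ...but current only advances to 1
    System.out.println("value=" + value + ", current=" + current
        + ", rowsActuallyConsumed=" + cursor[0]);
  }
}
```

Running this prints `value=50, current=1, rowsActuallyConsumed=51`, which is exactly the gap between `current` and real progress described above.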

dsy88 referenced this pull request in dsy88/incubator-parquet-mr Jun 20, 2014
NULL tuples causes NPE when writing
@tomwhite
Member

@onlynone, I agree with you; however, I think the fix is still functionally correct. That's what I meant about ensuring getProgress() is correct - although since it is only used to give a rough measure of MR progress, this change doesn't break applications, it just underestimates progress.

Having said that, here's another fix that correctly updates current: https://github.com/tomwhite/parquet-mr/compare/pr-413-change-filtering

@rdblue
Contributor

rdblue commented Jun 26, 2014

I think that Tom's fix is correct and that's a reasonable work-around for right now. But I'd rather get rid of the recursive call because that will increase the stack for each filtered record. Here's a version that just loops until the internal reader starts returning non-null records again. It also checks to make sure the total isn't going past the currently loaded limit so that there aren't conditions where it would loop infinitely.

      try {
        checkRead();
        currentValue = recordReader.read();
        current++;
        // only happens with FilteredRecordReader at end of block
        while (currentValue == null && current < total && current <= totalCountLoadedSoFar) {
          checkRead();
          currentValue = recordReader.read();
          current++;
        }
        if (DEBUG) LOG.debug("read value: " + currentValue);
      } catch (RuntimeException e) {
        throw new ParquetDecodingException(format("Can not read value at %d in block %d in file %s", current, currentBlock, file), e);
      }

Like you said, a real fix needs to correctly keep track of the records that are filtered out. How about adding a count accessor to parquet.io.RecordReader? That would be a quick fix, but I'd rather see a better contract with the record reader that strictly defines behavior when it runs out of records and maybe keeps track internally. Iterator is good inspiration.
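The count-accessor idea can be sketched as follows. This is a hypothetical illustration, not the parquet.io.RecordReader API: names like `rowsConsumed()` and the `Iterator`-style contract are assumptions, but they show how a caller could set `current = reader.rowsConsumed()` instead of blindly incrementing.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.IntPredicate;

// Hypothetical Iterator-inspired reader contract with a count accessor:
// the reader tracks every row it consumes, matched or filtered out.
public class FilteringReader implements Iterator<Integer> {
  private final int[] rows;
  private final IntPredicate filter;
  private int pos = 0;       // rows consumed so far, including filtered ones
  private Integer lookahead; // next matching record, or null if exhausted

  public FilteringReader(int[] rows, IntPredicate filter) {
    this.rows = rows;
    this.filter = filter;
    advance();
  }

  private void advance() {
    lookahead = null;
    while (pos < rows.length) {
      int candidate = rows[pos++];
      if (filter.test(candidate)) {
        lookahead = candidate;
        return;
      }
    }
  }

  // the proposed count accessor: rows read from the block so far
  public long rowsConsumed() {
    return pos;
  }

  @Override
  public boolean hasNext() {
    return lookahead != null;
  }

  @Override
  public Integer next() {
    if (lookahead == null) throw new NoSuchElementException();
    Integer result = lookahead;
    advance();
    return result;
  }

  public static void main(String[] args) {
    int[] block = {10, 20, 30, 40, 50};
    FilteringReader reader = new FilteringReader(block, v -> v == 40);
    while (reader.hasNext()) {
      System.out.println("matched " + reader.next()
          + ", rowsConsumed=" + reader.rowsConsumed());
    }
  }
}
```

With this contract, the caller never guesses how many rows were skipped: after each `next()`, `rowsConsumed()` is the correct value for `current`, so getProgress() stays accurate even when most records are filtered out.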

@tomwhite
Member

Thanks for the review @rdblue. I agree that the minimal fix is the way to go to get this fixed in the short term; for one thing, changing (Filtered)RecordReader causes the semantic versioning plugin to complain.

I've updated the minimal fix to avoid the recursive call as you suggested. See https://github.com/apache/incubator-parquet-mr/pull/9. It's slightly different from your code since we need to account for the case where there are no further non-null records - i.e. the while loop needs to return false in that case. I've added a test for that case, and also for the case where only the last block has a record that matches the filter.
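That "no further non-null records" case can be modeled in isolation. The sketch below is a simplified, self-contained stand-in (stubbed `read()`, assumed field names, not the actual ParquetRecordReader): it loops past nulls from filtered-out records, but returns false once the reader is exhausted instead of handing back a null value.

```java
// Simplified model of the non-recursive fix; null entries in `reads`
// stand for records that the filter rejected.
public class NextKeyValueModel {
  private final Integer[] reads;
  private final long total;
  private int readPos = 0;
  private long current = 0;
  Integer currentValue;

  public NextKeyValueModel(long total, Integer... reads) {
    this.total = total;
    this.reads = reads;
  }

  private Integer read() { // stands in for recordReader.read()
    return readPos < reads.length ? reads[readPos++] : null;
  }

  public boolean nextKeyValue() {
    currentValue = read();
    current++;
    // only happens with FilteredRecordReader at end of block
    while (currentValue == null && current < total) {
      currentValue = read();
      current++;
    }
    return currentValue != null; // false: no further matching records
  }

  public static void main(String[] args) {
    // block of 4 records where only the second matches the filter
    NextKeyValueModel m = new NextKeyValueModel(4, null, 7, null, null);
    System.out.println(m.nextKeyValue() + " value=" + m.currentValue);
    System.out.println(m.nextKeyValue()); // remaining rows all filtered out
  }
}
```

The first call skips the leading null and returns true with value 7; the second call drains the remaining filtered rows and returns false, which is the behavior the extra test covers.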

@julienledem
Member

LGTM.
Could you open a Parquet JIRA and prefix the name of the PR with its ID, as described in the following link?
https://github.com/apache/incubator-parquet-mr/pull/8/files?short_path=6a33714#diff-6a3371457528722a734f3c51d9238c13

@rdblue
Contributor

rdblue commented Jun 27, 2014

Thanks for taking a look, Julien. I've opened PARQUET-9 for this.

asfgit referenced this pull request Jul 16, 2014
Update of the minimal fix discussed in https://github.com/apache/incubator-parquet-mr/pull/1, with the recursive call changed to a loop.

Author: Tom White <tom@cloudera.com>
Author: Steven Willis <swillis@compete.com>

Closes #9 from tomwhite/filtering-records-across-multiple-blocks and squashes the following commits:

afb08a4 [Tom White] Minimal fix
9e723ee [Steven Willis] Test for filtering records across multiple blocks
@julienledem
Member

@rdblue
Contributor

rdblue commented Jul 21, 2014

@julienledem: yes. I think Tom had to create a new pull request because he couldn't push review changes to this one.

@julienledem
Member

This is fixed:
apache/incubator-parquet-mr@2d8ebdb
@onlynone could you close this pull request?

@onlynone
Author

Thanks Guys!

@onlynone onlynone closed this Jul 31, 2014
@julienledem
Member

Thank you @onlynone !

rdblue referenced this pull request in rdblue/parquet-mr Aug 11, 2014
parthchandra pushed a commit to parthchandra/incubator-parquet-mr that referenced this pull request Sep 4, 2014
merge apache/incubator-parquet-mr
abayer referenced this pull request in cloudera/parquet-mr Oct 13, 2014
abayer referenced this pull request in cloudera/parquet-mr Dec 4, 2014
asfgit pushed a commit that referenced this pull request Jan 27, 2015
…ng/binary column trunk are null

In case of all nulls in a binary column, the statistics object read from the file metadata is empty, and should return true for the all-nulls check for the column. Even if the column has no values, it can be ignored.

The other way is to fix this behaviour in the writer, but is that what we want?

Author: Yash Datta <Yash.Datta@guavus.com>
Author: Alex Levenson <alexlevenson@twitter.com>
Author: Yash Datta <saucam@gmail.com>

Closes #99 from saucam/npe and squashes the following commits:

5138e44 [Yash Datta] PARQUET-136: Remove unreachable block
b17cd38 [Yash Datta] Revert "PARQUET-161: Trigger tests"
82209e6 [Yash Datta] PARQUET-161: Trigger tests
aab2f81 [Yash Datta] PARQUET-161: Review comments for the test case
2217ee2 [Yash Datta] PARQUET-161: Add a test case for checking the correct statistics info is recorded in case of all nulls in a column
c2f8d6f [Yash Datta] PARQUET-161: Fix the write path to write statistics object in case of only nulls in the column
97bb517 [Yash Datta] Revert "revert TestStatisticsFilter.java"
a06f0d0 [Yash Datta] Merge pull request #1 from isnotinvain/alexlevenson/PARQUET-161-136
b1001eb [Alex Levenson] Fix statistics isEmpty, handle more edge cases in statistics filter
0c88be0 [Alex Levenson] revert TestStatisticsFilter.java
1ac9192 [Yash Datta] PARQUET-136: Its better to not filter chunks for which empty statistics object is returned. Empty statistics can be read in case of 1. pre-statistics files, 2. files written from current writer that has a bug, as it does not write the statistics if column has all nulls
e5e924e [Yash Datta] Revert "PARQUET-136: In case of all nulls in a binary column, statistics object read from file metadata is empty, and should return true for all nulls check for the column"
8cc5106 [Yash Datta] Revert "PARQUET-136: fix hasNulls to cater to the case where all values are nulls"
c7c126f [Yash Datta] PARQUET-136: fix hasNulls to cater to the case where all values are nulls
974a22b [Yash Datta] PARQUET-136: In case of all nulls in a binary column, statistics object read from file metadata is empty, and should return true for all nulls check for the column
dongche pushed a commit to dongche/incubator-parquet-mr that referenced this pull request Feb 3, 2015
asfgit pushed a commit that referenced this pull request Jun 17, 2015
...thod

Author: Alex Levenson <alexlevenson@twitter.com>
Author: Konstantin Shaposhnikov <Konstantin.Shaposhnikov@sc.com>
Author: kostya-sh <kostya-sh@users.noreply.github.com>

Closes #171 from kostya-sh/PARQUET-246 and squashes the following commits:

75950c5 [kostya-sh] Merge pull request #1 from isnotinvain/PR-171
a718309 [Konstantin Shaposhnikov] Merge remote-tracking branch 'refs/remotes/origin/master' into PARQUET-246
0367588 [Alex Levenson] Add regression test for PR-171
94e8fda [Alex Levenson] Merge branch 'master' into PR-171
0a9ac9f [Konstantin Shaposhnikov] [PARQUET-246] bugfix: reset all DeltaByteArrayWriter state in reset() method
asfgit pushed a commit that referenced this pull request Sep 18, 2015
In response to PARQUET-251, created an integration test that generates random values and compares the statistics against the values read from a Parquet file.

There are two utility classes, `DataGenerationContext` and `RandomValueGenerators`, which are located in the same package as the unit test. I'm sure there is a better place to put these, but I leave that to your discretion.

Thanks
Reuben

Author: Reuben Kuhnert <sircodesalot@gmail.com>
Author: Ryan Blue <blue@apache.org>

Closes #255 from sircodesalotOfTheRound/stats-validation and squashes the following commits:

680e96a [Reuben Kuhnert] Merge pull request #1 from rdblue/PARQUET-355-stats-validation-tests
9f0033f [Ryan Blue] PARQUET-355: Use ColumnReaderImpl.
7d0b4fe [Reuben Kuhnert] PARQUET-355: Add Statistics Validation Test
asfgit pushed a commit that referenced this pull request Jan 19, 2017
gszadovszky pushed a commit to gszadovszky/parquet-mr that referenced this pull request Dec 13, 2018
shangxinli added a commit to shangxinli/parquet-mr that referenced this pull request Oct 30, 2020
shangxinli added a commit to shangxinli/parquet-mr that referenced this pull request Nov 1, 2020
LantaoJin pushed a commit to LantaoJin/parquet-mr that referenced this pull request Jun 15, 2021