Filtering records across multiple blocks #1
Conversation
NULL tuples cause NPE when writing
@onlynone, I agree with you; however, I think the fix is still functionally correct. That's what I meant about ensuring `current` is updated correctly. Having said that, here's another fix that correctly updates `current`.
I think that Tom's fix is correct and a reasonable work-around for right now. But I'd rather get rid of the recursive call, because that will increase the stack depth for each filtered record. Here's a version that just loops until the internal reader starts returning non-null records again. It also checks that the total isn't going past the currently loaded limit, so there aren't conditions where it would loop infinitely.

```java
try {
  checkRead();
  currentValue = recordReader.read();
  current ++;
  // only happens with FilteredRecordReader at end of block
  while (currentValue == null && current < total && current <= totalCountLoadedSoFar) {
    checkRead();
    currentValue = recordReader.read();
    current ++;
  }
  if (DEBUG) LOG.debug("read value: " + currentValue);
} catch (RuntimeException e) {
  throw new ParquetDecodingException(format("Can not read value at %d in block %d in file %s", current, currentBlock, file), e);
}
```

Like you said, a real fix needs to correctly keep track of the records that are filtered out. How about adding a count accessor to `parquet.io.RecordReader`? That would be a quick fix, but I'd rather see a better contract with the record reader that strictly defines its behavior when it runs out of records and maybe keeps track internally. `Iterator` is a good inspiration.
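For illustration, a hypothetical sketch of what such an `Iterator`-inspired contract could look like; `CountingRecordReader` and `recordsConsumed` are made-up names, not actual parquet-mr API:

```java
// Hypothetical contract sketch, not parquet-mr code: the reader itself
// tracks how many underlying records it has consumed, including those
// dropped by the filter, so callers never have to infer that count from
// the number of read() calls that returned non-null.
public abstract class CountingRecordReader<T> {

  /** Underlying records consumed so far, whether or not they passed the filter. */
  protected long recordsConsumed = 0;

  /**
   * Returns the next record that passes the filter, or null when the
   * currently loaded rows are exhausted. A single call may consume any
   * number of underlying records; implementations must count each one
   * in recordsConsumed.
   */
  public abstract T read();

  /** The count accessor: lets callers advance their position by the real amount. */
  public long recordsConsumed() {
    return recordsConsumed;
  }
}
```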
Thanks for the review @rdblue. I agree that the minimal fix is the way to go to get this fixed in the short term; for one thing, changing (Filtered)RecordReader causes the semantic versioning plugin to complain. I've updated the minimal fix to avoid the recursive call as you suggested. See https://github.com/apache/incubator-parquet-mr/pull/9. It's slightly different to your code, since we need to take account of the case where there are no further non-null records, i.e. the while loop needs to return false for that case. I've added a test for that case, and also for the case where only the last block has a record that matches the filter.
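For illustration, a minimal sketch of that shape, reusing the field names from rdblue's snippet above and assuming `checkRead()` advances to the next block when the current one is exhausted; this is a paraphrase, not the actual patch in pull request #9:

```java
// Sketch only: loop until a record passes the filter, and report false
// once no further non-null record exists (instead of looping forever).
public boolean nextKeyValue() throws IOException {
  currentValue = null;
  // a null from read() only happens with FilteredRecordReader at the end of a block
  while (currentValue == null && current < total) {
    checkRead();                        // assumed to load the next block as needed
    currentValue = recordReader.read();
    current++;
  }
  return currentValue != null;          // false: no further matching records
}
```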
LGTM. |
Thanks for taking a look, Julien. I've opened PARQUET-9 for this. |
Update of the minimal fix discussed in https://github.com/apache/incubator-parquet-mr/pull/1, with the recursive call changed to a loop.

Author: Tom White <tom@cloudera.com>
Author: Steven Willis <swillis@compete.com>

Closes #9 from tomwhite/filtering-records-across-multiple-blocks and squashes the following commits:

afb08a4 [Tom White] Minimal fix
9e723ee [Steven Willis] Test for filtering records across multiple blocks
Was this included in https://github.com/apache/incubator-parquet-mr/pull/9?
@julienledem: yes. I think Tom had to create a new pull request because he couldn't push review changes to this one. |
This is fixed.
Thanks, guys!
Thank you @onlynone!
…ng/binary column trunk are null

In the case of all nulls in a binary column, the statistics object read from the file metadata is empty, and should return true for the all-nulls check on the column. Even if the column has no values, it can be ignored. The other way is to fix this behaviour in the writer, but is that what we want?

Author: Yash Datta <Yash.Datta@guavus.com>
Author: Alex Levenson <alexlevenson@twitter.com>
Author: Yash Datta <saucam@gmail.com>

Closes #99 from saucam/npe and squashes the following commits:

5138e44 [Yash Datta] PARQUET-136: Remove unreachable block
b17cd38 [Yash Datta] Revert "PARQUET-161: Trigger tests"
82209e6 [Yash Datta] PARQUET-161: Trigger tests
aab2f81 [Yash Datta] PARQUET-161: Review comments for the test case
2217ee2 [Yash Datta] PARQUET-161: Add a test case for checking the correct statistics info is recorded in case of all nulls in a column
c2f8d6f [Yash Datta] PARQUET-161: Fix the write path to write statistics object in case of only nulls in the column
97bb517 [Yash Datta] Revert "revert TestStatisticsFilter.java"
a06f0d0 [Yash Datta] Merge pull request #1 from isnotinvain/alexlevenson/PARQUET-161-136
b1001eb [Alex Levenson] Fix statistics isEmpty, handle more edge cases in statistics filter
0c88be0 [Alex Levenson] revert TestStatisticsFilter.java
1ac9192 [Yash Datta] PARQUET-136: Its better to not filter chunks for which empty statistics object is returned. Empty statistics can be read in case of 1. pre-statistics files, 2. files written from current writer that has a bug, as it does not write the statistics if column has all nulls
e5e924e [Yash Datta] Revert "PARQUET-136: In case of all nulls in a binary column, statistics object read from file metadata is empty, and should return true for all nulls check for the column"
8cc5106 [Yash Datta] Revert "PARQUET-136: fix hasNulls to cater to the case where all values are nulls"
c7c126f [Yash Datta] PARQUET-136: fix hasNulls to cater to the case where all values are nulls
974a22b [Yash Datta] PARQUET-136: In case of all nulls in a binary column, statistics object read from file metadata is empty, and should return true for all nulls check for the column
...thod

Author: Alex Levenson <alexlevenson@twitter.com>
Author: Konstantin Shaposhnikov <Konstantin.Shaposhnikov@sc.com>
Author: kostya-sh <kostya-sh@users.noreply.github.com>

Closes #171 from kostya-sh/PARQUET-246 and squashes the following commits:

75950c5 [kostya-sh] Merge pull request #1 from isnotinvain/PR-171
a718309 [Konstantin Shaposhnikov] Merge remote-tracking branch 'refs/remotes/origin/master' into PARQUET-246
0367588 [Alex Levenson] Add regression test for PR-171
94e8fda [Alex Levenson] Merge branch 'master' into PR-171
0a9ac9f [Konstantin Shaposhnikov] [PARQUET-246] bugfix: reset all DeltaByteArrayWriter state in reset() method
In response to PARQUET-251, created an integration test that generates random values and compares the statistics against the values read from a Parquet file. There are two tools classes, `DataGenerationContext` and `RandomValueGenerators`, which are located in the same package as the unit test. I'm sure there is a better place to put these, but I leave that to your discretion. Thanks, Reuben

Author: Reuben Kuhnert <sircodesalot@gmail.com>
Author: Ryan Blue <blue@apache.org>

Closes #255 from sircodesalotOfTheRound/stats-validation and squashes the following commits:

680e96a [Reuben Kuhnert] Merge pull request #1 from rdblue/PARQUET-355-stats-validation-tests
9f0033f [Ryan Blue] PARQUET-355: Use ColumnReaderImpl.
7d0b4fe [Reuben Kuhnert] PARQUET-355: Add Statistics Validation Test
Copied from: https://github.com/Parquet/parquet-mr/pull/413
However, as tomwhite mentioned, there might be a better way to do this.
I had also written this: `current ++;` still doesn't seem correct even when `currentValue != null`. Imagine a block with 100 records, but only the record at position 50 matches our filter. In this case, the first time `nextKeyValue()` is called, it will call `recordReader.read()`, which will successfully find the record at pos 50, but `current` will just be incremented to 1.
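To make that mismatch concrete, here is a small, self-contained illustration; `MockFilteredReader`, `BLOCK_SIZE`, and `MATCH_POSITION` are hypothetical stand-ins, not parquet-mr code:

```java
// Hypothetical demo of the counting mismatch described above.
// MockFilteredReader stands in for FilteredRecordReader: one read() call
// skips every non-matching record, so a caller that increments its own
// counter once per call drifts away from the reader's real position.
final class CountingMismatchDemo {
  static final int BLOCK_SIZE = 100;
  static final int MATCH_POSITION = 50;

  static class MockFilteredReader {
    int underlyingPosition = 0; // records actually consumed from the block

    Integer read() {
      while (underlyingPosition < BLOCK_SIZE) {
        int pos = underlyingPosition++;
        if (pos == MATCH_POSITION) {
          return pos; // the only record that passes the filter
        }
      }
      return null; // end of block: everything else was filtered out
    }
  }

  public static void main(String[] args) {
    MockFilteredReader reader = new MockFilteredReader();
    int current = 0; // incremented once per read() call, as in the buggy code
    Integer value = reader.read();
    current++;
    // Prints: value=50 current=1 underlying=51
    // `current` claims one record was consumed; the reader consumed 51.
    System.out.println("value=" + value
        + " current=" + current
        + " underlying=" + reader.underlyingPosition);
  }
}
```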