
ensure null-counts are written for all-null columns #307

Merged: 1 commit into apache:master on May 25, 2021

Conversation

crepererum (Contributor):

Which issue does this PR close?

Fixes #306.

Rationale for this change

All-NULL columns are somewhat special because they don't have min/max values. However, this shouldn't stop us from correctly calculating the null counts.

What changes are included in this PR?

Correctly populate null-counts from pages to columns.

Are there any user-facing changes?

Yes, parquet files now have a correct null count.
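
As a hedged illustration of that user-facing change (assuming 4.x-era arrow/parquet APIs; the file name and column are made up, and the reader calls mirror the ones used in the tests below), a roundtrip like this now reports the right null count for an all-NULL column:

use std::{fs::File, sync::Arc};
use arrow::array::{ArrayRef, UInt64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::file::reader::{FileReader, SerializedFileReader};

fn roundtrip_all_null() -> Result<(), Box<dyn std::error::Error>> {
    // one nullable u64 column containing nothing but NULLs
    let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::UInt64, true)]));
    let data: Vec<Option<u64>> = vec![None; 4];
    let values: ArrayRef = Arc::new(UInt64Array::from(data));
    let batch = RecordBatch::try_new(schema.clone(), vec![values])?;

    // write the batch out through the arrow writer
    let mut writer = ArrowWriter::try_new(File::create("all_null.parquet")?, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;

    // the column-chunk statistics now carry the null count
    let reader = SerializedFileReader::new(File::open("all_null.parquet")?)?;
    let stats = reader.metadata().row_group(0).column(0).statistics().unwrap();
    assert_eq!(stats.null_count(), 4);
    Ok(())
}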

@@ -1560,4 +1560,21 @@ mod tests {
         panic!("Statistics::Int64 missing")
     }
 }

+#[test]
crepererum (Contributor, Author):

I am not entirely happy with this test, since it requires an Arrow data structure to actually test a parquet part. It would probably be nice to have a more "pure" test as well; however, I wasn't able to formulate an easy one.

alamb (Contributor):

I would say this looks more like an "integration" test rather than a "unit" test (in the sense that it connects some higher-level APIs and ensures the output is reasonable). I personally don't see any problems with this approach.

To remove the dependency on Arrow you would probably have to use the SerializedFileWriter API directly: https://docs.rs/parquet/4.0.0/parquet/file/writer/trait.FileWriter.html

But I think that would end up being quite a bit more code.
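
For reference, a rough sketch of what such an Arrow-free test could look like against that API (the schema, file name, and row count are assumptions, not the PR's actual test):

use std::{fs::File, sync::Arc};
use parquet::{
    column::writer::ColumnWriter,
    file::{
        properties::WriterProperties,
        writer::{FileWriter, RowGroupWriter, SerializedFileWriter},
    },
    schema::parser::parse_message_type,
};

fn write_all_null() -> Result<(), Box<dyn std::error::Error>> {
    // single optional column; writing only zero definition levels => all NULL
    let schema = Arc::new(parse_message_type("message test { optional int64 v; }")?);
    let props = Arc::new(WriterProperties::builder().build());
    let mut writer =
        SerializedFileWriter::new(File::create("all_null.parquet")?, schema, props)?;

    let mut row_group = writer.next_row_group()?;
    if let Some(mut col) = row_group.next_column()? {
        if let ColumnWriter::Int64ColumnWriter(ref mut typed) = col {
            let def_levels = vec![0i16; 4]; // four NULLs, no values
            typed.write_batch(&[], Some(&def_levels), None)?;
        }
        row_group.close_column(col)?;
    }
    writer.close_row_group(row_group)?;
    writer.close()?;
    Ok(())
}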

crepererum (Contributor, Author):

I am pretty sure the integration test failure is unrelated:

 /arrow/ci/scripts/cpp_build.sh: line 37:  3698 Segmentation fault 

alamb (Contributor) commented May 18, 2021:

I have restarted the integration tests on this PR.

@alamb added the bug and parquet (Changes to the parquet crate) labels on May 18, 2021.
alamb (Contributor) left a review:

This looks good to me -- thank you @crepererum

@sunchao or @nevi-me do you have any comments or thoughts?


@@ -607,9 +607,11 @@ impl<T: DataType> ColumnWriterImpl<T> {
     let max_def_level = self.descr.max_def_level();
     let max_rep_level = self.descr.max_rep_level();

+    // always update column NULL count, no matter if page stats are used
+    self.num_column_nulls += self.num_page_nulls;
sunchao (Member):
Hmm, just curious: why don't we need to update self.num_page_nulls accordingly? It is updated as follows:

            for &level in levels {
                if level == self.descr.max_def_level() {
                    values_to_write += 1;
                } else if calculate_page_stats {
                    self.num_page_nulls += 1
                }
            }

crepererum (Contributor, Author):

There are two variables here: num_page_nulls (which IS updated) and num_column_nulls (which, prior to this PR, was only updated if min/max values were set, which cannot happen for all-NULL columns). The latter (num_column_nulls) is the one used to write the stats.
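
For readers skimming the thread, a minimal sketch of the accounting described here (the field names mirror the real ones; everything else is illustrative, not the actual parquet-rs code):

// Illustrative sketch, not the real ColumnWriterImpl.
struct ColumnWriterSketch {
    num_page_nulls: u64,   // nulls observed while buffering the current page
    num_column_nulls: u64, // running total across pages; written to the stats
}

impl ColumnWriterSketch {
    // Nulls are detected via definition levels while buffering values.
    fn buffer_levels(&mut self, def_levels: &[i16], max_def_level: i16) {
        for &level in def_levels {
            if level != max_def_level {
                self.num_page_nulls += 1; // this slot is NULL
            }
        }
    }

    // The fix: fold page nulls into the column total unconditionally,
    // even when no min/max values exist (the all-NULL case).
    fn add_data_page(&mut self) {
        self.num_column_nulls += self.num_page_nulls;
        self.num_page_nulls = 0;
    }
}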

sunchao (Member):

"which IS updated"

(Sorry for the slow response.) This is the part I don't quite understand: add_data_page is called in two places, flush_data_pages and write_mini_batch. I can see this handles the first case, but I don't see how it applies to the second case.

crepererum (Contributor, Author):

I'm not sure I follow:

  • Is your point that self.num_column_nulls += self.num_page_nulls; should only be triggered during flush_data_pages, not during write_mini_batch?
  • Or is the question under which conditions write_mini_batch calls add_data_page? For that: as far as I can tell it is NOT called during the test, because should_add_data_page decides not to (likely due to the small test case).

sunchao (Member):

I meant the second: write_mini_batch will call add_data_page if the encoded data page size exceeds the limit. In that case it seems num_page_nulls, and subsequently num_column_nulls, will not be set.
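
Continuing the illustrative sketch from above, the two call sites under discussion look roughly like this (again a sketch, not the real ColumnWriterImpl code):

// Illustrative only: both paths end in add_data_page, so the unconditional
// `num_column_nulls += num_page_nulls` there covers them equally.
impl ColumnWriterSketch {
    // Call site 1: write_mini_batch flushes a page once it grows big enough
    // (the real check is should_add_data_page).
    fn write_mini_batch(&mut self, encoded_size: usize, max_page_size: usize) {
        if encoded_size >= max_page_size {
            self.add_data_page();
        }
    }

    // Call site 2: closing the column chunk flushes any remaining buffered page.
    fn flush_data_pages(&mut self) {
        self.add_data_page();
    }
}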

crepererum (Contributor, Author):

I've just checked:

#[test]
fn statistics_null_counts_big_mixed() {
    // check that null-count statistics work for larger data sizes
    let len = 1_000_000u64;
    let data: Vec<_> = (0..len).map(|x| if x % 2 == 0 { Some(x) } else { None }).collect();
    let values = Arc::new(UInt64Array::from(data));
    let file = one_column_roundtrip("null_counts_big_mixed", values, true);

    // check statistics are valid
    let reader = SerializedFileReader::new(file).unwrap();
    let metadata = reader.metadata();
    assert_eq!(metadata.num_row_groups(), 1);
    let row_group = metadata.row_group(0);
    assert_eq!(row_group.num_columns(), 1);
    let column = row_group.column(0);
    let stats = column.statistics().unwrap();
    assert_eq!(stats.null_count(), len / 2);
}

This triggers the add_data_page-via-write_mini_batch code path (with a small patch to the roundtrip test code to set a higher batch size). It works (with this PR), and both values are set. TBH that's what I expected, since I don't really understand your reasoning behind "In this case it seems num_page_nulls and subsequently num_column_nulls will not be set.".

sunchao (Member):

Oops, you're right. I misread calculate_page_stats in the code. My apologies.

sunchao (Member) left a review:

LGTM. Thanks @crepererum!

@sunchao merged commit 94a82cd into apache:master on May 25, 2021.
alamb pushed a commit that referenced this pull request on Jun 4, 2021.
alamb added a commit that referenced this pull request on Jun 5, 2021.

Fixes #306.

Co-authored-by: Marco Neumann <marco@crepererum.net>