Fix handling of nested leaf columns in parallel parquet writer #8923
Conversation
@@ -408,7 +408,7 @@ config_namespace! {
     /// parquet files by serializing them in parallel. Each column
     /// in each row group in each output file are serialized in parallel
     /// leveraging a maximum possible core count of n_files*n_row_groups*n_columns.
-    pub allow_single_file_parallelism: bool, default = false
+    pub allow_single_file_parallelism: bool, default = true
We could choose to leave this feature disabled by default if we wish to merge this before the next release. That would give us more time to ensure there are no other issues and improve test coverage further.
We are about to release 35.0.0 which would not include this change. See voting thread here: https://lists.apache.org/thread/onbs8l0w5s7693fchpyvwwgh61gf1jf8
Thus I think it would actually be best to turn this default to true so the code gets maximum "bake time" from people working off main before a release.
==> This PR looks good to me
I think it would also be good to include tests that write lists of structs and structs of lists. Perhaps like:
❯ select [struct('foo', 1), struct('bar', 2)];
+-----------------------------------------------------------------------+
| make_array(struct(Utf8("foo"),Int64(1)),struct(Utf8("bar"),Int64(2))) |
+-----------------------------------------------------------------------+
| [{c0: foo, c1: 1}, {c0: bar, c1: 2}] |
+-----------------------------------------------------------------------+
1 row in set. Query took 0.002 seconds.
❯ select struct('foo', [1,2,3], struct('bar', [2,3,4]));
+-----------------------------------------------------------------------------------------------------------------------+
| struct(Utf8("foo"),make_array(Int64(1),Int64(2),Int64(3)),struct(Utf8("bar"),make_array(Int64(2),Int64(3),Int64(4)))) |
+-----------------------------------------------------------------------------------------------------------------------+
| {c0: foo, c1: [1, 2, 3], c2: {c0: bar, c1: [2, 3, 4]}} |
+-----------------------------------------------------------------------------------------------------------------------+
1 row in set. Query took 0.003 seconds.
Thank you @devinjdangelo -- this looks great to me.
I have a suggestion for some additional tests, which we can add to this PR or as a follow-on PR.
.map_err(|_| {
    DataFusionError::Internal("Unable to send array to writer!".into())
})?;
next_channel += 1;
Is the idea that more than one array can be sent per field?
Per top-level field, yes. The "Field" referenced in the outer loop covers only top-level, non-nested fields. The compute_leaves function does the recursive iteration of all nested fields for each top-level field.
There is actually an independent channel / parallel serializer for every nested (not only top-level) field. So, we must advance the channel we are sending to on every iteration of compute_leaves, not just on every iteration of a new top-level field.
Thanks again @devinjdangelo
Which issue does this PR close?
Closes #8851
Closes #8853
Rationale for this change
See issues above. Parallel parquet writer causes various errors/panics when used with nested columns.
What changes are included in this PR?
I identified the issue in this function which is supposed to send the appropriate arrow arrays to the correct column serialization workers:
https://github.com/apache/arrow-datafusion/blob/95e739cb605307d3337c54ef3f0ab8c72cca5717/datafusion/core/src/datasource/file_format/parquet.rs#L883-L902
The outer loop iterates over the "col_array_channels". This works when there are no nested columns (i.e. the inner loop only ever iterates once), but it is incorrect when there are nested columns.
The varying errors reported are explained by this bug, since a few different things can go wrong depending on where the channel index falls out of sync.
This PR fixes this function so that it properly sends nested columns to the correct column serializer.
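The corrected dispatch can be sketched as follows. This is a minimal, self-contained model using std mpsc channels rather than DataFusion's actual types; the schema and names are hypothetical. The key point is that one channel exists per leaf column serializer, and next_channel advances on every leaf, not on every top-level field:

```rust
use std::sync::mpsc;

fn main() {
    // Pretend schema: top-level field "a" has 1 leaf column, struct "s" has 2.
    let leaves_per_field = [("a", 1usize), ("s", 2usize)];
    let total_leaves: usize = leaves_per_field.iter().map(|(_, n)| *n).sum();

    // One channel per leaf column serializer (3 channels, not 2).
    let (senders, receivers): (Vec<_>, Vec<_>) =
        (0..total_leaves).map(|_| mpsc::channel::<String>()).unzip();

    let mut next_channel = 0;
    for (name, n_leaves) in leaves_per_field {
        // In the real code, compute_leaves yields one array per leaf here.
        for leaf in 0..n_leaves {
            senders[next_channel]
                .send(format!("{name}[leaf {leaf}]"))
                .expect("Unable to send array to writer!");
            next_channel += 1; // the fix: advance per leaf, not per field
        }
    }

    for (i, rx) in receivers.iter().enumerate() {
        println!("channel {i} got {}", rx.recv().unwrap());
    }
}
```

With the old per-field advancement, the second leaf of "s" would have been sent to a channel already holding the first leaf, leaving the third channel starved, which matches the hangs and mismatched-schema panics reported in the linked issues.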
Are these changes tested?
Yes, copy.slt now includes tests with various column types at various levels of nesting with structs and arrays
Are there any user-facing changes?
No