fix: pruning by bloom filters for dictionary columns #13768
Conversation
},
// Bloom filter pruning is performed only for Utf8 dictionary types since
// pruning predicate is not created for Dictionary(Numeric/Binary) types
ScalarValue::Dictionary(_, inner) => match inner.as_ref() {
it isn't clear to me that it is impossible to use `Dictionary(Int8, Int64)` or something similar to encode Int64 values, but I think this is the most common example
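For reference, nothing in the `ScalarValue::Dictionary` variant restricts the inner value to strings, so a dictionary-encoded integer literal is representable; a minimal sketch (values chosen only for illustration):

```rust
use arrow::datatypes::DataType;
use datafusion_common::ScalarValue;

fn main() {
    // A dictionary-encoded Int64 literal: Int8 keys, Int64 values. The variant
    // itself places no restriction on the inner type.
    let dict_int = ScalarValue::Dictionary(
        Box::new(DataType::Int8),
        Box::new(ScalarValue::Int64(Some(42))),
    );
    assert_eq!(
        dict_int.data_type(),
        DataType::Dictionary(Box::new(DataType::Int8), Box::new(DataType::Int64))
    );
}
```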
Indeed, after rechecking I've found that it's only a matter of casting the literal to the exact column type. I'll update the filter check function and add additional tests for more data types.
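A rough sketch of that idea, assuming the arrow `cast` kernel and `ScalarValue` round-tripping (the helper name is illustrative, not a function in this PR): once a plain `Utf8` literal is cast to the column's `Dictionary(Int32, Utf8)` type, the pruning code sees a `ScalarValue::Dictionary` it can unwrap.

```rust
use arrow::compute::cast;
use arrow::datatypes::DataType;
use datafusion_common::{Result, ScalarValue};

/// Illustrative helper: cast a literal to the exact column type by
/// round-tripping through a one-element array, e.g. Utf8 -> Dictionary(Int32, Utf8).
fn cast_literal_to_column_type(
    literal: &ScalarValue,
    column_type: &DataType,
) -> Result<ScalarValue> {
    let array = cast(&literal.to_array()?, column_type)?;
    ScalarValue::try_from_array(&array, 0)
}
```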
It seems like we maybe don't even need to check the inner type explicitly (it would be checked by `BloomFilterStatistics::check_scalar` as well). However, I think this is better than what is on main today, and if it is important we can add support for other types.
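In that case the dictionary arm could simply unwrap and re-dispatch on the inner value; a hedged sketch of the general idea (the helper below is illustrative, standing in for the recursive call into `BloomFilterStatistics::check_scalar`):

```rust
use datafusion_common::ScalarValue;

/// Illustrative only: peel any dictionary wrapping so the per-type bloom
/// filter check only ever sees the underlying value.
fn unwrap_dictionary(value: &ScalarValue) -> &ScalarValue {
    match value {
        ScalarValue::Dictionary(_, inner) => unwrap_dictionary(inner),
        other => other,
    }
}
```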
    _ => true,
}
})
.map(|value| BloomFilterStatistics::check_scalar(sbbf, value, parquet_type))
👍
// Bloom filter pruning is performed only for Utf8 dictionary types since
// pruning predicate is not created for Dictionary(Numeric/Binary) types
ScalarValue::Dictionary(_, inner) => match inner.as_ref() {
    ScalarValue::Utf8(_) | ScalarValue::LargeUtf8(_) => {
Did you also mean to check `ScalarValue::{Binary,LargeBinary}` here as well?
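If binary dictionaries were to be covered as well, the arm would presumably just widen; a sketch of the eligibility check under that assumption (not necessarily what the final code does):

```rust
use datafusion_common::ScalarValue;

/// Sketch: inner dictionary values that would be eligible for bloom filter
/// pruning if binary types were included alongside the string types.
fn prunable_inner(inner: &ScalarValue) -> bool {
    matches!(
        inner,
        ScalarValue::Utf8(_)
            | ScalarValue::LargeUtf8(_)
            | ScalarValue::Binary(_)
            | ScalarValue::LargeBinary(_)
    )
}
```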
Is this true? So Dictionary columns are incompatible with predicate pruning based on stats as well?
Looks very nice to me, thank you @korowa!
I also double checked that the newly added tests fail without the corresponding code change in this PR
failures:
---- parquet::row_group_pruning::test_bloom_filter_dict stdout ----
Planning sql SELECT * FROM t WHERE utf8 = 'h'
Input:
+------+------------+--------+--------------+
| utf8 | large_utf8 | binary | large_binary |
+------+------------+--------+--------------+
| a | a | 61 | 61 |
| b | b | 62 | 62 |
| c | c | 63 | 63 |
| d | d | 64 | 64 |
| e | e | 65 | 65 |
| f | f | 66 | 66 |
| g | g | 67 | 67 |
| h | h | 68 | 68 |
| i | i | 69 | 69 |
| j | j | 6a | 6a |
+------+------------+--------+--------------+
Query:
SELECT * FROM t WHERE utf8 = 'h'
Output:
+------+------------+--------+--------------+
| utf8 | large_utf8 | binary | large_binary |
+------+------------+--------+--------------+
| h | h | 68 | 68 |
+------+------------+--------+--------------+
Metrics:
num_predicate_creation_errors=0, time_elapsed_opening{partition=0}=13.856ms, time_elapsed_scanning_until_data{partition=0}=170.75µs, time_elapsed_scanning_total{partition=0}=222.541µs, time_elapsed_processing{partition=0}=13.786957ms, file_open_errors{partition=0}=0, file_scan_errors{partition=0}=0, start_timestamp{partition=0}=2024-12-16 15:44:59.575102 UTC, end_timestamp{partition=0}=2024-12-16 15:44:59.589198 UTC, elapsed_compute{partition=0}=NOT RECORDED, output_rows{partition=0}=5, predicate_evaluation_errors{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning6O2mvF.parquet}=0, row_groups_matched_bloom_filter{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning6O2mvF.parquet}=1, row_groups_pruned_bloom_filter{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning6O2mvF.parquet}=0, row_groups_matched_statistics{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning6O2mvF.parquet}=1, row_groups_pruned_statistics{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning6O2mvF.parquet}=1, bytes_scanned{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning6O2mvF.parquet}=0, pushdown_rows_pruned{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning6O2mvF.parquet}=0, pushdown_rows_matched{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning6O2mvF.parquet}=0, row_pushdown_eval_time{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning6O2mvF.parquet}=NOT RECORDED, statistics_eval_time{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning6O2mvF.parquet}=84.792µs, bloom_filter_eval_time{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning6O2mvF.parquet}=13.36025ms, page_index_rows_pruned{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning6O2mvF.parquet}=0, page_index_rows_matched{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning6O2mvF.parquet}=5, page_index_eval_time{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning6O2mvF.parquet}=86.5µs, metadata_load_time{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning6O2mvF.parquet}=295.625µs, predicate_evaluation_errors{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning6O2mvF.parquet}=0, row_groups_matched_bloom_filter{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning6O2mvF.parquet}=0, row_groups_pruned_bloom_filter{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning6O2mvF.parquet}=0, row_groups_matched_statistics{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning6O2mvF.parquet}=0, row_groups_pruned_statistics{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning6O2mvF.parquet}=0, bytes_scanned{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning6O2mvF.parquet}=1049141, pushdown_rows_pruned{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning6O2mvF.parquet}=0, pushdown_rows_matched{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning6O2mvF.parquet}=0, row_pushdown_eval_time{partition=0, 
filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning6O2mvF.parquet}=NOT RECORDED, statistics_eval_time{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning6O2mvF.parquet}=NOT RECORDED, bloom_filter_eval_time{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning6O2mvF.parquet}=NOT RECORDED, page_index_rows_pruned{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning6O2mvF.parquet}=0, page_index_rows_matched{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning6O2mvF.parquet}=0, page_index_eval_time{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning6O2mvF.parquet}=NOT RECORDED, metadata_load_time{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning6O2mvF.parquet}=NOT RECORDED
Planning sql SELECT * FROM t WHERE utf8 = 'ab'
Input:
+------+------------+--------+--------------+
| utf8 | large_utf8 | binary | large_binary |
+------+------------+--------+--------------+
| a | a | 61 | 61 |
| b | b | 62 | 62 |
| c | c | 63 | 63 |
| d | d | 64 | 64 |
| e | e | 65 | 65 |
| f | f | 66 | 66 |
| g | g | 67 | 67 |
| h | h | 68 | 68 |
| i | i | 69 | 69 |
| j | j | 6a | 6a |
+------+------------+--------+--------------+
Query:
SELECT * FROM t WHERE utf8 = 'ab'
Output:
++
++
Metrics:
num_predicate_creation_errors=0, time_elapsed_opening{partition=0}=13.949417ms, time_elapsed_scanning_until_data{partition=0}=186.875µs, time_elapsed_scanning_total{partition=0}=236.416µs, time_elapsed_processing{partition=0}=13.85075ms, file_open_errors{partition=0}=0, file_scan_errors{partition=0}=0, start_timestamp{partition=0}=2024-12-16 15:44:59.709470 UTC, end_timestamp{partition=0}=2024-12-16 15:44:59.723669 UTC, elapsed_compute{partition=0}=NOT RECORDED, output_rows{partition=0}=5, predicate_evaluation_errors{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning3gshjh.parquet}=0, row_groups_matched_bloom_filter{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning3gshjh.parquet}=1, row_groups_pruned_bloom_filter{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning3gshjh.parquet}=0, row_groups_matched_statistics{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning3gshjh.parquet}=1, row_groups_pruned_statistics{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning3gshjh.parquet}=1, bytes_scanned{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning3gshjh.parquet}=0, pushdown_rows_pruned{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning3gshjh.parquet}=0, pushdown_rows_matched{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning3gshjh.parquet}=0, row_pushdown_eval_time{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning3gshjh.parquet}=NOT RECORDED, statistics_eval_time{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning3gshjh.parquet}=76.041µs, bloom_filter_eval_time{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning3gshjh.parquet}=13.426708ms, page_index_rows_pruned{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning3gshjh.parquet}=0, page_index_rows_matched{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning3gshjh.parquet}=5, page_index_eval_time{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning3gshjh.parquet}=84µs, metadata_load_time{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning3gshjh.parquet}=333.291µs, predicate_evaluation_errors{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning3gshjh.parquet}=0, row_groups_matched_bloom_filter{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning3gshjh.parquet}=0, row_groups_pruned_bloom_filter{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning3gshjh.parquet}=0, row_groups_matched_statistics{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning3gshjh.parquet}=0, row_groups_pruned_statistics{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning3gshjh.parquet}=0, bytes_scanned{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning3gshjh.parquet}=1049141, pushdown_rows_pruned{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning3gshjh.parquet}=0, pushdown_rows_matched{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning3gshjh.parquet}=0, row_pushdown_eval_time{partition=0, 
filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning3gshjh.parquet}=NOT RECORDED, statistics_eval_time{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning3gshjh.parquet}=NOT RECORDED, bloom_filter_eval_time{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning3gshjh.parquet}=NOT RECORDED, page_index_rows_pruned{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning3gshjh.parquet}=0, page_index_rows_matched{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning3gshjh.parquet}=0, page_index_eval_time{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning3gshjh.parquet}=NOT RECORDED, metadata_load_time{partition=0, filename=var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/parquet_pruning3gshjh.parquet}=NOT RECORDED
thread 'parquet::row_group_pruning::test_bloom_filter_dict' panicked at datafusion/core/tests/parquet/row_group_pruning.rs:124:9:
assertion `left == right` failed: mismatched row_groups_matched_bloom_filter
left: Some(1)
right: Some(0)
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
failures:
parquet::row_group_pruning::test_bloom_filter_dict
test result: FAILED. 171 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 2.40s
No, it's not true, it was my mistake -- it works fine if the literal is explicitly cast to the column type.
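For illustration, that workaround can be expressed at the logical-expression level by casting the literal to the column's dictionary type before comparing (assuming the `cast`/`col`/`lit` helpers from the DataFusion prelude; the column name is hypothetical):

```rust
use arrow::datatypes::DataType;
use datafusion::prelude::{cast, col, lit, Expr};

/// Illustrative: compare a dictionary-encoded string column against a literal
/// explicitly cast to the column's exact type, so stats-based pruning applies.
fn dict_eq_expr() -> Expr {
    let dict_type =
        DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8));
    col("utf8").eq(cast(lit("h"), dict_type))
}
```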
Force-pushed from 6a63160 to a82f94e
I've added more tests along with support for more types, and updated the PR description. There are two issues I'm going to debug and create tickets for or fix after getting a better understanding of what's wrong there (they don't affect the current implementation):
Force-pushed from a82f94e to 5c9d66e
Thanks again @korowa
},
// Bloom filter pruning is performed only for Utf8 dictionary types since
// pruning predicate is not created for Dictionary(Numeric/Binary) types
ScalarValue::Dictionary(_, inner) => match inner.as_ref() {
It seems like we maybe don't even need to check the inner type explicitly (it would be checked by `BloomFilterStatistics::check_scalar` as well). However, I think this is better than what is on main today, and if it is important we can add support for other types.
Thank you all for working on this. DataFusion is restoring my belief in open source one interaction at a time.
datafusion/functions-nested/src/set_ops.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Update set_ops.rs --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Update release instructions for 44.0.0 (#13959) * Update release instructions for 44.0.0 * update macros and order * add functions-table * Add datafusion python 43.1.0 blog post to doc. (#13974) * Include license and notice files in more crates (#13985) * Extract postgres container from sqllogictest, update datafusion-testing pin (#13971) * Add support for sqlite test files to sqllogictest * Removed workaround for bug that was fixed. * Refactor sqllogictest to extract postgres functionality into a separate file. Removed dependency on once_cell in favour of LazyLock. * Add missing license header. * Update rstest requirement from 0.23.0 to 0.24.0 (#13977) Updates the requirements on [rstest](https://github.com/la10736/rstest) to permit the latest version. - [Release notes](https://github.com/la10736/rstest/releases) - [Changelog](https://github.com/la10736/rstest/blob/master/CHANGELOG.md) - [Commits](https://github.com/la10736/rstest/compare/v0.23.0...v0.23.0) --- updated-dependencies: - dependency-name: rstest dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Move hash collision test to run only when merging to main. (#13973) * Update itertools requirement from 0.13 to 0.14 (#13965) * Update itertools requirement from 0.13 to 0.14 Updates the requirements on [itertools](https://github.com/rust-itertools/itertools) to permit the latest version. - [Changelog](https://github.com/rust-itertools/itertools/blob/master/CHANGELOG.md) - [Commits](https://github.com/rust-itertools/itertools/compare/v0.13.0...v0.13.0) --- updated-dependencies: - dependency-name: itertools dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> * Fix build * Simplify * Update CLI lock --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: jonahgao <jonahgao@msn.com> * Change trigger, rename `hash_collision.yml` to `extended.yml` and add comments (#13988) * Rename hash_collision.yml to extended.yml and add comments * Adjust schedule, add comments * Update job, rerun * doc-gen: migrate scalar functions (string) documentation 2/4 (#13925) * doc-gen: migrate scalar functions (string) documentation 2/4 * doc-gen: update function docs * doc: fix related udf order for upper function in documentation * Update datafusion/functions/src/string/concat_ws.rs * Update datafusion/functions/src/string/concat_ws.rs * Update datafusion/functions/src/string/concat_ws.rs * doc-gen: update function docs --------- Co-authored-by: Cheng-Yuan-Lai <a186235@g,ail.com> Co-authored-by: Oleks V <comphead@users.noreply.github.com> * Update substrait requirement from 0.50 to 0.51 (#13978) Updates the requirements on [substrait](https://github.com/substrait-io/substrait-rs) to permit the latest version. 
- [Release notes](https://github.com/substrait-io/substrait-rs/releases) - [Changelog](https://github.com/substrait-io/substrait-rs/blob/main/CHANGELOG.md) - [Commits](https://github.com/substrait-io/substrait-rs/compare/v0.50.0...v0.51.0) --- updated-dependencies: - dependency-name: substrait dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Update release README for datafusion-cli publishing (#13982) * Enhance LastValueAccumulator logic and add SQL logic tests for last_value function (#13980) - Updated LastValueAccumulator to include requirement satisfaction check before updating the last value. - Added SQL logic tests to verify the behavior of the last_value function with merge batches and ensure correct aggregation in various scenarios. * Improve deserialize_to_struct example (#13958) * Cleanup deserialize_to_struct example * prettier * Apply suggestions from code review Co-authored-by: Jonah Gao <jonahgao@msn.com> --------- Co-authored-by: Jonah Gao <jonahgao@msn.com> * Update docs (#14002) * Optimize CASE expression for "expr or expr" usage. (#13953) * Apply optimization for ExprOrExpr. * Implement optimization similar to existing code. * Add sqllogictest. * feat(substrait): introduce consume_rel and consume_expression (#13963) * feat(substrait): introduce consume_rel and consume_expression Route calls to from_substrait_rel and from_substrait_rex through the SubstraitConsumer in order to allow users to provide their own behaviour * feat(substrait): consume nulls of user-defined types * docs(substrait): consume_rel and consume_expression docstrings * Consolidate csv_opener.rs and json_opener.rs into a single example (#… (#13981) * Consolidate csv_opener.rs and json_opener.rs into a single example (#13955) * Update datafusion-examples/examples/csv_json_opener.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Update datafusion-examples/README.md Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Apply code formatting with cargo fmt --------- Co-authored-by: Sergey Zhukov <szhukov@aligntech.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * FIX : Incorrect NULL handling in BETWEEN expression (#14007) * submodule update * FIX : Incorrect NULL handling in BETWEEN expression * Revert "submodule update" This reverts commit 72431aadeaf33a27775a88c41931572a0b66bae3. 
* fix incorrect unit test * move sqllogictest to expr * feat(substrait): modular substrait producer (#13931) * feat(substrait): modular substrait producer * refactor(substrait): simplify col_ref_offset handling in producer * refactor(substrait): remove column offset tracking from producer * docs(substrait): document SubstraitProducer * refactor: minor cleanup * feature: remove unused SubstraitPlanningState BREAKING CHANGE: SubstraitPlanningState is no longer available * refactor: cargo fmt * refactor(substrait): consume_ -> handle_ * refactor(substrait): expand match blocks * refactor: DefaultSubstraitProducer only needs serializer_registry * refactor: remove unnecessary warning suppression * fix(substrait): route expr conversion through handle_expr * cargo fmt * fix: Avoid re-wrapping planning errors Err(DataFusionError::Plan) for use in plan_datafusion_err (#14000) * fix: unwrapping Err(DataFusionError::Plan) for use in plan_datafusion_err * test: add tests for error formatting during planning * feat: support `RightAnti` for `SortMergeJoin` (#13680) * feat: support `RightAnti` for `SortMergeJoin` * feat: preserve session id when using cxt.enable_url_table() (#14004) * Return error message during planning when inserting into a MemTable with zero partitions. (#14011) * Minor: Rewrite LogicalPlan::max_rows for Join and Union, made it easier to understand (#14012) * Refactor max_rows for join plan, made it easier to understand * Simplified max_rows for Union * Chore: update wasm-supported crates, add tests (#14005) * Chore: update wasm-supported crates * format * Use workspace rust-version for all workspace crates (#14009) * [Minor] refactor: make ArraySort public for broader access (#14006) * refactor: make ArraySort public for broader access Changes the visibility of the ArraySort struct fromsuper to public. allows broader access to the struct, enabling its use in other modules and promoting better code reuse. * clippy and docs --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Update sqllogictest requirement from =0.24.0 to =0.26.0 (#14017) * Update sqllogictest requirement from =0.24.0 to =0.26.0 Updates the requirements on [sqllogictest](https://github.com/risinglightdb/sqllogictest-rs) to permit the latest version. - [Release notes](https://github.com/risinglightdb/sqllogictest-rs/releases) - [Changelog](https://github.com/risinglightdb/sqllogictest-rs/blob/main/CHANGELOG.md) - [Commits](https://github.com/risinglightdb/sqllogictest-rs/compare/v0.24.0...v0.26.0) --- updated-dependencies: - dependency-name: sqllogictest dependency-type: direct:production ... 
Signed-off-by: dependabot[bot] <support@github.com> * remove version pin and note --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Eduard Karacharov <eduard.karacharov@gmail.com> * `url` dependancy update (#14019) * `url` dependancy update * `url` version update for datafusion-cli * Minor: Improve zero partition check when inserting into `MemTable` (#14024) * Improve zero partition check when inserting into `MemTable` * update err msg * refactor: make structs public and implement Default trait (#14030) * Minor: Remove redundant implementation of `StringArrayType` (#14023) * Minor: Remove redundant implementation of StringArrayType Signed-off-by: Tai Le Manh <manhtai.lmt@gmail.com> * Deprecate rather than remove StringArrayType --------- Signed-off-by: Tai Le Manh <manhtai.lmt@gmail.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Added references to IDE documentation for dev containers along with a small note about why one may choose to do development using a dev container. (#14014) * Use partial aggregation schema for spilling to avoid column mismatch in GroupedHashAggregateStream (#13995) * Refactor spill handling in GroupedHashAggregateStream to use partial aggregate schema * Implement aggregate functions with spill handling in tests * Add tests for aggregate functions with and without spill handling * Move test related imports into mod test * Rename spill pool test functions for clarity and consistency * Refactor aggregate function imports to use fully qualified paths * Remove outdated comments regarding input batch schema for spilling in GroupedHashAggregateStream * Update aggregate test to use AVG instead of MAX * assert spill count * Refactor partial aggregate schema creation to use create_schema function * Refactor partial aggregation schema creation and remove redundant function * Remove unused import of Schema from arrow::datatypes in row_hash.rs * move spill pool testing for aggregate functions to physical-plan/src/aggregates * Use Arc::clone for schema references in aggregate functions * Encapsulate fields of `EquivalenceProperties` (#14040) * Encapsulate fields of `EquivalenceGroup` (#14039) * Fix error on `array_distinct` when input is empty #13810 (#14034) * fix * add test * oops --------- Co-authored-by: Cyprien Huet <chuet@palantir.com> * Update petgraph requirement from 0.6.2 to 0.7.1 (#14045) * Update petgraph requirement from 0.6.2 to 0.7.1 Updates the requirements on [petgraph](https://github.com/petgraph/petgraph) to permit the latest version. - [Changelog](https://github.com/petgraph/petgraph/blob/master/RELEASES.rst) - [Commits](https://github.com/petgraph/petgraph/compare/petgraph@v0.6.2...petgraph@v0.7.1) --- updated-dependencies: - dependency-name: petgraph dependency-type: direct:production ... 
Signed-off-by: dependabot[bot] <support@github.com> * Update datafusion-cli/Cargo.lock --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Encapsulate fields of `OrderingEquivalenceClass` (make field non pub) (#14037) * Complete encapsulatug `OrderingEquivalenceClass` (make fields non pub) * fix doc * Fix: ensure that compression type is also taken into consideration during ListingTableConfig infer_options (#14021) * chore: add test to verify that schema is inferred as expected * chore: add comment to method as suggested * chore: restructure to avoid need to clone * chore: fix flaw in rewrite * feat(optimizer): Enable filter pushdown on window functions (#14026) * feat(optimizer): Enable filter pushdown on window functions Ensures selections can be pushed past window functions similarly to what is already done with aggregations, when possible. * fix: Add missing dependency * minor(optimizer): Use 'datafusion-functions-window' as a dev dependency * docs(optimizer): Add example to filter pushdown on LogicalPlan::Window * Unparsing optimized (> 2 inputs) unions (#14031) * tests and optimizer in testing queries * unparse optimized unions * format Cargo.toml * format Cargo.toml * revert test * rewrite test to avoid cyclic dep * remove old test * cleanup * comments and error handling * handle union with lt 2 inputs * Minor: Document output schema of LogicalPlan::Aggregate and LogicalPlan::Window (#14047) * Simplify error handling in case.rs (#13990) (#14033) * Simplify error handling in case.rs (#13990) * Fix issues causing GitHub checks to fail * Update datafusion/physical-expr/src/expressions/case.rs Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> --------- Co-authored-by: Sergey Zhukov <szhukov@aligntech.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * feat: add `AsyncCatalogProvider` helpers for asynchronous catalogs (#13800) * Add asynchronous catalog traits to help users that have asynchronous catalogs * Apply clippy suggestions * Address PR reviews * Remove allow_unused exceptions * Update remote catalog example to demonstrate new helper structs * Move schema_name / catalog_name parameters into resolve f…
Which issue does this PR close?
Closes #13574.
Rationale for this change
Currently, row group pruning by Bloom filters is unaware of Dictionary-typed scalar values (the literal part of a predicate) when checking whether values are contained in the filter. This PR adds limited support for them -- Bloom filter pruning can now handle the majority of Dictionary scalars (with the exception of the Decimal type).
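For context, a minimal sketch of the kind of scalar the pruning code has to handle; the `ScalarValue::Dictionary` shape is DataFusion's, while the column type and literal value below are made up for illustration:

```rust
use datafusion::arrow::datatypes::DataType;
use datafusion::scalar::ScalarValue;

fn main() {
    // A predicate such as `dict_col = 'foo'` against a Dictionary(Int32, Utf8)
    // column can reach the pruning code with its literal cast to the column
    // type, i.e. wrapped in a Dictionary scalar like this one.
    let literal = ScalarValue::Dictionary(
        Box::new(DataType::Int32),
        Box::new(ScalarValue::Utf8(Some("foo".to_string()))),
    );
    println!("{literal}");
}
```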
What changes are included in this PR?
The part of `BloomFilterStatistics` responsible for checking a scalar value against the Sbbf has been moved into a helper function, with additional support for `Dictionary(...)` scalars and the `LargeUtf8`/`LargeBinary` types.
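A rough sketch of the idea behind that helper (the function name and surrounding API here are simplified assumptions, not the exact code in this PR): the Dictionary wrapper is peeled off and the inner value is what gets checked against the row group's Sbbf.

```rust
use datafusion::arrow::datatypes::DataType;
use datafusion::scalar::ScalarValue;

/// Illustrative only: recursively unwrap a Dictionary scalar so that the
/// inner value (Utf8, Int64, LargeBinary, ...) can be probed against a row
/// group's bloom filter; the real helper also hashes the value according to
/// the column's physical Parquet type.
fn unwrap_dictionary(value: &ScalarValue) -> &ScalarValue {
    match value {
        // Only the inner value matters for bloom filter membership; the
        // dictionary key type is irrelevant here.
        ScalarValue::Dictionary(_key_type, inner) => unwrap_dictionary(inner),
        other => other,
    }
}

fn main() {
    let dict = ScalarValue::Dictionary(
        Box::new(DataType::Int32),
        Box::new(ScalarValue::Utf8(Some("needle".to_string()))),
    );
    assert_eq!(
        unwrap_dictionary(&dict),
        &ScalarValue::Utf8(Some("needle".to_string()))
    );
}
```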
Are these changes tested?
There are tests for each supported data type covering both the pruning case and the matching case (the latter to ensure there is no excessive pruning).
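Purely for illustration (not the actual test added here), this is the kind of end-to-end query that benefits: row groups whose bloom filter does not contain the literal can now be skipped even when the filtered column is dictionary encoded. The path, table, and column names are hypothetical, and the Parquet file is assumed to have been written with column bloom filters enabled.

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // "data.parquet" is a placeholder; its `tag` column is assumed to be
    // Dictionary(Int32, Utf8) and to have bloom filters in each row group.
    ctx.register_parquet("t", "data.parquet", ParquetReadOptions::default())
        .await?;
    // The literal is unwrapped from its Dictionary scalar and probed against
    // each row group's bloom filter, so row groups that cannot contain
    // 'absent-value' are pruned before being read.
    let df = ctx.sql("SELECT * FROM t WHERE tag = 'absent-value'").await?;
    df.show().await?;
    Ok(())
}
```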
Are there any user-facing changes?
No