zhuqi-lucas

Merge our internal changes and fix conflicts for DF branch 50.

andygrove and others added 30 commits November 4, 2024 19:40
* Initial commit

* Fix formatting

* Add across partitions check

* Add new test case

Add a new test case

* Fix buggy test
…#13909) (apache#13934)

* Set utf8view as return type when input type is the same

* Verify that the returned type from call to scalar function matches the return type specified in the return_type function

* Match return type to utf8view

Co-authored-by: Tim Saucer <timsaucer@gmail.com>
* fix: fetch is missed in the EnforceSorting

* fix conflict

* resolve comments from alamb

* update
…e#14415) (apache#14453)

* chore: Fixed CI

* chore

* chore: Fixed clippy

* chore

Co-authored-by: Alex Huang <huangweijun1001@gmail.com>
* Test for string / numeric coercion

* fix tests

* Update tests

* Add tests to stringview

* add numeric coercion
zhuqi-lucas marked this pull request as ready for review September 17, 2025 05:56
/// (reading) If true, parquet reader will read columns of `Utf8/Utf8Large` with `Utf8View`,
/// and `Binary/BinaryLarge` with `BinaryView`.
- pub schema_force_view_types: bool, default = true
+ pub schema_force_view_types: bool, default = false
Author

Default to utf8.

// Whether to allow truncated rows when parsing.
// By default this is set to false and will error if the CSV rows have different lengths.
// When set to true then it will allow records with less than the expected number of columns
pub truncated_rows: Option<bool>, default = None
Author

Our CSV `truncated_rows` support will be included in DF 51.0.0, but it is not part of DF 50.0.0.
apache#17465

DecoderDeserializer::new(CsvDecoder::new(decoder))
}

fn csv_deserializer_with_truncated(
Author

Our CSV `truncated_rows` support will be included in DF 51.0.0, but it is not part of DF 50.0.0.
apache#17465

/// By default this is set to false and will error if the CSV rows have different lengths.
/// When set to true then it will allow records with less than the expected number of columns and fill the missing columns with nulls.
/// If the record’s schema is not nullable, then it will still return an error.
pub truncated_rows: bool,
Author

Same as above.
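
For readers unfamiliar with the option, the behavior described in the doc comment above can be sketched roughly as follows. This is a standalone illustration, not the CSV decoder code from this PR: `pad_truncated_row` and its parameters are hypothetical names, and the handling of records with too many columns is an assumption of the sketch.

```rust
/// Hypothetical helper (not DataFusion's API): pads a parsed CSV record to
/// the schema width, filling missing trailing columns with `None` (null).
/// `nullable` holds one flag per schema column.
fn pad_truncated_row(
    fields: &[&str],
    nullable: &[bool],
    truncated_rows: bool,
) -> Result<Vec<Option<String>>, String> {
    let expected = nullable.len();
    // Assumption of this sketch: extra columns are always rejected.
    if fields.len() > expected {
        return Err(format!("expected {} columns, found {}", expected, fields.len()));
    }
    // Short records are an error unless truncated rows are allowed.
    if fields.len() < expected && !truncated_rows {
        return Err(format!("expected {} columns, found {}", expected, fields.len()));
    }
    let mut row: Vec<Option<String>> =
        fields.iter().map(|f| Some(f.to_string())).collect();
    // Fill each missing column with null, but only if the schema allows it.
    for (idx, is_nullable) in nullable.iter().enumerate().skip(fields.len()) {
        if !*is_nullable {
            return Err(format!("column {} is missing and not nullable", idx));
        }
        row.push(None);
    }
    Ok(row)
}
```

With `truncated_rows = true`, a two-field record against a three-column schema whose last column is nullable comes back with a trailing `None`; with the flag off, or with a non-nullable missing column, the call returns an error.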

assert_eq!(
string_truncation_stats.max_value,
- Precision::Inexact(ScalarValue::Utf8View(Some("b".repeat(63) + "c")))
+ Precision::Inexact(Utf8(Some("b".repeat(63) + "c")))
Author

We default to utf8.

.await
.await?;
let mut id_annotator = NodeIdAnnotator::new();
annotate_node_id_for_execution_plan(&physical_plan, &mut id_annotator)
Author

Our internal node_id support.

- let filter = col("date_string_col").eq(lit(ScalarValue::new_utf8view("01/01/09")));
+ // xudong: use new_utf8, because schema_force_view_types was changed to false now.
+ // qi: when schema_force_view_types setting to true, we should change back to utf8view
+ let filter = col("date_string_col").eq(lit(ScalarValue::new_utf8("01/01/09")));
Author

Default to utf8.

self.sink.metrics()
}

fn with_node_id(
Author

Internal node_id support.

}
}

fn with_node_id(
Author

Internal node_id support.

/// Infers new predicates by substituting equalities.
/// For example, with predicates `t2.b = 3` and `t1.b > t2.b`,
/// we can infer `t1.b > 3`.
fn infer_predicates_from_equalities(predicates: Vec<Expr>) -> Result<Vec<Expr>> {
Author

Related to:
apache#15906

Should we reopen our upstream PR?

Collaborator

Will do later
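
To make the example in the doc comment concrete (`t2.b = 3` and `t1.b > t2.b` giving `t1.b > 3`), here is a minimal sketch of the substitution idea over a toy predicate type. It does not use DataFusion's `Expr`; `Operand`, `Op`, `Predicate`, `col`, and `infer_from_equalities` are all hypothetical names for illustration, and the real function may handle many more cases.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Operand {
    Column(String),
    Literal(i64),
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Op {
    Eq,
    Gt,
}

#[derive(Debug, Clone, PartialEq)]
struct Predicate {
    left: Operand,
    op: Op,
    right: Operand,
}

/// Convenience constructor for a column operand.
fn col(name: &str) -> Operand {
    Operand::Column(name.to_string())
}

/// Toy version of equality substitution: for every `column = literal`
/// predicate, replace that column in the non-equality predicates and
/// return the newly inferred predicates.
fn infer_from_equalities(predicates: &[Predicate]) -> Vec<Predicate> {
    // Collect known `column = literal` facts (either operand order).
    let constants: Vec<(&String, i64)> = predicates
        .iter()
        .filter_map(|p| match (&p.left, p.op, &p.right) {
            (Operand::Column(c), Op::Eq, Operand::Literal(v)) => Some((c, *v)),
            (Operand::Literal(v), Op::Eq, Operand::Column(c)) => Some((c, *v)),
            _ => None,
        })
        .collect();

    // Replace a column by its known constant, if there is one.
    let substitute = |operand: &Operand| -> Operand {
        match operand {
            Operand::Column(c) => constants
                .iter()
                .find(|(name, _)| *name == c)
                .map(|(_, v)| Operand::Literal(*v))
                .unwrap_or_else(|| operand.clone()),
            Operand::Literal(v) => Operand::Literal(*v),
        }
    };

    let mut inferred = Vec::new();
    for p in predicates {
        if p.op == Op::Eq {
            continue; // the sketch only rewrites comparisons
        }
        let candidate = Predicate {
            left: substitute(&p.left),
            op: p.op,
            right: substitute(&p.right),
        };
        let has_column = matches!(candidate.left, Operand::Column(_))
            || matches!(candidate.right, Operand::Column(_));
        // Keep only genuinely new predicates that still mention a column.
        if candidate != *p
            && has_column
            && !predicates.contains(&candidate)
            && !inferred.contains(&candidate)
        {
            inferred.push(candidate);
        }
    }
    inferred
}
```

The sketch returns only the inferred predicates and leaves the originals untouched, which keeps it easy to see what was derived.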

}

- context.update_plan_from_children()
+ Ok((context.update_plan_from_children()?, fetch))
Author

Related to our internal fetch support for enforce_distribution.

data,
children,
},
mut fetch,
Author

Related to our internal fetch support for enforce_distribution.

// If `fetch` was not consumed, it means that there was `SortPreservingMergeExec` with fetch before
// It was removed by `remove_dist_changing_operators`
// and we need to add it back.
if fetch.is_some() {
Author

Related to our internal fetch support for enforce_distribution.
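
A rough sketch of the idea in the code comment above: when distribution-changing operators are stripped, remember any `fetch` that sat on a removed sort-preserving merge, and put it back if nothing consumed it. This is a toy model, not the `enforce_distribution` code: `Plan`, `strip_dist_changing_operators`, and `restore_fetch` are hypothetical names, and keeping the smallest limit when several are found is an assumption of the sketch.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan(String),
    Repartition(Box<Plan>),
    SortPreservingMerge { input: Box<Plan>, fetch: Option<usize> },
}

/// Strips distribution-changing operators from the top of the plan and
/// returns the remaining plan together with any `fetch` that was attached
/// to a removed `SortPreservingMerge`.
fn strip_dist_changing_operators(mut plan: Plan) -> (Plan, Option<usize>) {
    let mut fetch: Option<usize> = None;
    loop {
        plan = match plan {
            Plan::Repartition(input) => *input,
            Plan::SortPreservingMerge { input, fetch: removed } => {
                // Assumption: if several limits are seen, keep the smallest.
                fetch = match (fetch, removed) {
                    (Some(a), Some(b)) => Some(a.min(b)),
                    (a, b) => a.or(b),
                };
                *input
            }
            other => return (other, fetch),
        };
    }
}

/// If the `fetch` was not consumed elsewhere, add the merge back on top
/// so the limit is not silently lost.
fn restore_fetch(plan: Plan, fetch: Option<usize>) -> Plan {
    match fetch {
        Some(_) => Plan::SortPreservingMerge { input: Box::new(plan), fetch },
        None => plan,
    }
}
```

Returning the `fetch` alongside the plan mirrors the tuple return type visible in the hunks of this PR, so the caller decides whether the limit still needs to be re-applied.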

use datafusion_common::DataFusionError;

// Util for traversing ExecutionPlan tree and annotating node_id
pub struct NodeIdAnnotator {
Author

Our internal support for node_id.
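
For context, annotating a plan tree with node ids can be pictured with the sketch below. It is not the internal `NodeIdAnnotator`: `PlanNode` and `IdAnnotator` are hypothetical stand-ins, and the post-order numbering (children before parents) is an arbitrary choice made for this illustration.

```rust
/// Toy plan node; the real code works on `ExecutionPlan` trees.
#[derive(Debug)]
struct PlanNode {
    name: String,
    node_id: Option<usize>,
    children: Vec<PlanNode>,
}

impl PlanNode {
    fn new(name: &str, children: Vec<PlanNode>) -> Self {
        Self { name: name.to_string(), node_id: None, children }
    }
}

/// Hands out increasing ids while walking the tree.
struct IdAnnotator {
    next_id: usize,
}

impl IdAnnotator {
    fn new() -> Self {
        Self { next_id: 0 }
    }

    /// Assigns ids in post-order: every child is numbered before its parent.
    fn annotate(&mut self, node: &mut PlanNode) {
        for child in &mut node.children {
            self.annotate(child);
        }
        node.node_id = Some(self.next_id);
        self.next_id += 1;
    }
}
```

For a root with two leaf children, the leaves receive ids 0 and 1 and the root receives id 2.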

/// Execution plan for values list based relation (produces constant rows)
#[deprecated(
since = "45.0.0",
note = "Use `MemorySourceConfig::try_new_as_values` instead"
Author

It seems we still use the deprecated API, so I can try to upgrade those cases in a follow-up PR.

Collaborator

yes

fn remove_dist_changing_operators(
mut distribution_context: DistributionContext,
- ) -> Result<DistributionContext> {
+ ) -> Result<(
Author

Internal fetch support.

zhuqi-lucas merged commit ba0e3a0 into branch-50 Sep 17, 2025
59 checks passed