Merged
149 commits
9707a8a
bump version and generate changelog
andygrove Nov 5, 2024
88f58bf
bump version and generate changelog
andygrove Nov 5, 2024
2d5364e
Downgrade tonic
matthewmturner Dec 23, 2024
2c35f17
[bug]: Fix wrong order by removal from plan (#13497)
akurmustafa Nov 24, 2024
608ee58
Correct return type for initcap scalar function with utf8view (#13909…
alamb Dec 28, 2024
3cc3fca
Update CHANGELOG
alamb Dec 28, 2024
5383d30
enforce_distribution: fix for limits getting lost
Max-Meldrum Dec 30, 2024
13f6aca
set default-features=false for datafusion in proto crate
Max-Meldrum Jan 7, 2025
d357c7a
Adding node_id patch to our fork
emgeee Sep 11, 2024
cbd3dbc
Changes to make streaming work
ameyc May 2, 2024
deecef1
only output node_id in display if it exists
Max-Meldrum Dec 11, 2024
57bf8d6
include projection in FilterExec::with_node_id
Max-Meldrum Jan 7, 2025
c431f0f
add missing with_fetch calls to with_node_id method
Max-Meldrum Jan 7, 2025
fa581d0
rework SortExec::with_node_id to not drop preserve_partitioning
Max-Meldrum Jan 8, 2025
555ef6b
set schema_force_view_types to false in ParquetOptions
Max-Meldrum Jan 9, 2025
0e3c9e0
Revert "enforce_distribution: fix for limits getting lost"
suremarc Jan 14, 2025
a4153bf
update sqllogictests after disabling view types
suremarc Jan 14, 2025
8ae4a95
fix fetch missed in EnforceDistribution
xudong963 Jan 15, 2025
1ae2702
fix enforcesorting missing fetch
xudong963 Jan 17, 2025
38f39f5
fix more fetch missing in enforcesorting
xudong963 Jan 17, 2025
f7740af
fix: fetch is missed in the EnforceSorting (#14192)
xudong963 Jan 20, 2025
22473d9
fix remaining test issues regarding with_node_id
Max-Meldrum Jan 23, 2025
f0f6e81
use new_utf8 instead of new_utf8view in page_pruning test as we have …
Max-Meldrum Jan 23, 2025
f3e7004
Expose more components from sqllogictest (#14249)
xudong963 Jan 23, 2025
c976a89
Extract useful methods from sqllogictest bin (#14267)
xudong963 Jan 25, 2025
ffff7a1
expose df sqllogictest error
xudong963 Jan 27, 2025
63bad11
update sqllogictest
xudong963 Jan 27, 2025
e3ea7d1
chore: Upgrade to `arrow`/`parquet` `54.1.0` and fix clippy/ci (#144…
alamb Feb 3, 2025
8f10fdf
Fix join type coercion (#14387) (#14454)
alamb Feb 3, 2025
755b26a
Support `Utf8View` to `numeric` coercion (#14377) (#14455)
alamb Feb 3, 2025
9d287bd
Update REGEXP_MATCH scalar function to support Utf8View (#14449) (#14…
alamb Feb 3, 2025
6146600
Fix regression list Type Coercion List with inner type struct which h…
alamb Feb 3, 2025
26058ac
Update changelog (#14460)
alamb Feb 3, 2025
6e1e0d1
fix datafusion-cli
xudong963 Feb 6, 2025
af26638
missing fetch after removing SPM
xudong963 Feb 10, 2025
d290676
Merge remote-tracking branch 'upstream/branch-44' into branch-44-toni…
xudong963 Feb 10, 2025
d518b51
update cargo toml
xudong963 Feb 10, 2025
e5431f1
make new_group_values public
xudong963 Feb 10, 2025
c103d08
cherry-pick upstream/14569
xudong963 Feb 12, 2025
e9fb062
fix EnforceDistribution
xudong963 Feb 24, 2025
51d0dea
Merge remote-tracking branch 'upstream/branch-45' into branch-44-toni…
xudong963 Feb 24, 2025
ee7b658
Merge remote-tracking branch 'upstream/branch-45'(with our fixes)
xudong963 Feb 24, 2025
3766da9
downgrade tonic
xudong963 Feb 24, 2025
2b5cec2
cherry-pick upstream/14569
xudong963 Feb 24, 2025
08b3ce0
public more parquet components
xudong963 Feb 28, 2025
8b3cd7b
Do not swap with projection when file is partitioned (#14956) (#14964)
alamb Mar 2, 2025
76d833a
Improve documentation for `DataSourceExec`, `FileScanConfig`, `DataSo…
alamb Mar 2, 2025
b494e97
Deprecate `Expr::Wildcard` (#14959) (#14976)
xudong963 Mar 3, 2025
65c8560
[branch-46] Update changelog for backports to 46.0.0 (#14977)
xudong963 Mar 3, 2025
ec4862f
Add note about upgrade guide into the release notes (#14979)
alamb Mar 3, 2025
d5ca830
Fix verification script and extended tests due to `rustup` changes (#…
alamb Mar 4, 2025
1c92803
upgrade tonic
xudong963 Mar 13, 2025
112e9eb
Update ring to v0.17.13 (#15063) (#15228)
alamb Mar 14, 2025
0877c99
Fix broken `serde` feature (#15124) (#15227)
alamb Mar 14, 2025
048a125
[branch-46] Fix wasm32 build on version 46 (#15229)
alamb Mar 14, 2025
68f2903
Update version to 46.0.1, add CHANGELOG (#15243)
xudong963 Mar 15, 2025
b8699d9
Merge remote-tracking branch 'upstream/branch-46' into branch-46-stream
xudong963 Mar 20, 2025
2e5b5e2
fix with_node_id and clippy
xudong963 Mar 20, 2025
3be582f
Fix invalid schema for unions in ViewTables (#15135)
Friede80 Mar 16, 2025
a28f2cd
Fix enforce_distribution and enforce_sorting missing fetch
xudong963 Apr 14, 2025
e443304
Final release note touchups (#15740)
alamb Apr 16, 2025
d0b0211
Merge remote-tracking branch 'upstream/branch-47' into branch-47-stream
xudong963 Apr 21, 2025
fe4a4ca
Upgrade DF47
xudong963 Apr 21, 2025
dfb339d
Fix: fetch is missing in plan_with_order_breaking_variants method
xudong963 Apr 23, 2025
656092e
Add fast path for optimize_projection (#15746)
xudong963 Apr 18, 2025
d2b8c15
Improve `simplify_expressions` rule (#15735)
xudong963 Apr 19, 2025
2d1062f
Speed up `optimize_projection` (#15787)
xudong963 Apr 23, 2025
738816d
Support inferring new predicates to push down
xudong963 Apr 24, 2025
d029200
Fix: `build_predicate_expression` method doesn't process `false` expr…
xudong963 May 12, 2025
378ce3b
Revert use file schema in parquet pruning (#16086)
adriangb May 21, 2025
c76c1f0
fix: [branch-48] Revert "Improve performance of constant aggregate wi…
andygrove Jun 6, 2025
b5dfdbe
feat: add metadata to literal expressions (#16170) (#16315)
andygrove Jun 7, 2025
33a32d4
[branch-48] Update CHANGELOG for latest 48.0.0 release (#16314)
alamb Jun 7, 2025
a13a6fe
Simplify filter predicates
xudong963 Jun 10, 2025
88c42dc
Merge remote-tracking branch 'upstream/branch-48' into branch-48-stream
xudong963 Jun 24, 2025
e5e5c48
Upgrade DF48
xudong963 Jun 24, 2025
6851d8e
Add the missing equivalence info for filter pushdown
liamzwbao Jul 4, 2025
054d193
48.0.1
xudong963 Jul 12, 2025
1ded6ef
[branch-49] Update version to `49.0.0`, add changelog (#16822)
alamb Jul 19, 2025
273d37a
chore: use `equals_datatype` for `BinaryExpr` (#16813) (#16847)
comphead Jul 22, 2025
afb9099
[branch-49] Final Changelog Tweaks (#16852)
alamb Jul 22, 2025
45dd3f9
Merge remote-tracking branch 'upstream/branch-49' into branch-49
xudong963 Aug 4, 2025
e4dd102
branch 49
xudong963 Aug 4, 2025
9cfb9cd
remove warning from every file open (#16968) (#17059)
mbutrovich Aug 6, 2025
f6ec4c3
#16994 Ensure CooperativeExec#maintains_input_order returns a Vec of …
pepijnve Aug 7, 2025
c7fbb3f
Add ExecutionPlan::reset_state (#17028) (#17096)
adriangb Aug 8, 2025
ee28aa7
[branch-49] Backport #17129 to branch 49 (#17143)
AdamGS Aug 12, 2025
52e4ef8
Pass the input schema to stats_projection for ProjectionExpr (#17123)…
alamb Aug 13, 2025
f05b128
[branch-49] fix: string_agg not respecting ORDER BY (#17058)
nuno-faria Aug 14, 2025
d1a6e9a
[branch-49] Update version to 49.0.1 and add changelog (#17175)
alamb Aug 14, 2025
374fcec
cherry-pick inlist fix (#17254)
haohuaijin Aug 20, 2025
930608a
fix check license header
xudong963 Aug 21, 2025
66ae588
fix cargo check: cargo check --profile ci --workspace --all-targets -…
xudong963 Aug 21, 2025
292641c
fix cargo example
xudong963 Aug 21, 2025
a6068c2
FFI_RecordBatchStream was causing a memory leak (#17190) (#17270)
timsaucer Aug 21, 2025
0d04475
fix: align `array_has` null buffer for scalar (#17272) (#17274)
comphead Aug 21, 2025
f43df3f
[branch-49] Prepare `49.0.2` version and changelog (#17277)
alamb Aug 21, 2025
25058de
fix cargo check --profile ci --no-default-features -p datafusion-proto
xudong963 Aug 22, 2025
c46f7a9
fix cargo doc
xudong963 Aug 22, 2025
deaf2e2
fix ut:custom_sources_cases::statistics::sql_limit(with_node_id of Co…
xudong963 Aug 22, 2025
f1b1bd8
fix ut: test_no_pushdown_through_aggregates & test_plan_with_order_pr…
xudong963 Aug 22, 2025
7dd5e6e
fix format
xudong963 Aug 22, 2025
2eca4c0
fix roundtrip_test
xudong963 Aug 22, 2025
8baa05d
schema_force_view_types to true
xudong963 Aug 25, 2025
9b2fbbb
use utf8view
xudong963 Aug 25, 2025
63c2ebc
schema_force_view_types to false(try true after df49)
xudong963 Aug 25, 2025
ed718c0
fix page_index_filter_one_col and remove an example of proto
xudong963 Aug 25, 2025
0bb16fa
fix configs.md
xudong963 Aug 25, 2025
09ff8f7
fix clippy
xudong963 Aug 25, 2025
1545f2d
update configs.md
xudong963 Aug 25, 2025
ca5b0fb
fix flaky test limit.test
xudong963 Aug 25, 2025
d8c3e03
Simplify predicates in `PushDownFilter` optimizer rule (#16362)
xudong963 Jun 25, 2025
2099882
Fix intermittent SQL logic test failure in limit.slt by adding ORDER …
kosiew Jun 6, 2025
ff8418c
fix limit.rs
xudong963 Aug 25, 2025
2c7836a
fix tpch q19
xudong963 Aug 25, 2025
9191f39
public GroupValues & new_group_values
xudong963 Aug 25, 2025
d358db4
fix clippy
xudong963 Aug 25, 2025
6e71350
Merge pull request #8 from polygon-io/branch-48-stream-fix
xudong963 Aug 26, 2025
c6b8211
Merge remote-tracking branch 'upstream/branch-49' into branch_49_fix
zhuqi-lucas Sep 3, 2025
5a99099
Merge remote-tracking branch 'origin/branch-48-stream' into branch_49…
zhuqi-lucas Sep 3, 2025
cefa63a
fix fetch with new order lex
zhuqi-lucas Sep 3, 2025
1f47d46
fix fetch add back with new lex order
zhuqi-lucas Sep 3, 2025
95aadb9
fix clippy
zhuqi-lucas Sep 3, 2025
70a3c94
fix clippy
zhuqi-lucas Sep 3, 2025
a93e81e
add order needed
zhuqi-lucas Sep 3, 2025
6a3d4f8
fix
zhuqi-lucas Sep 3, 2025
91e2904
fix auth check and port upstream fix: https://github.com/apache/dataf…
zhuqi-lucas Sep 3, 2025
b571c3b
Support csv truncate for datafusion
zhuqi-lucas Sep 4, 2025
1a2f8dc
Addressed in latest PR
zhuqi-lucas Sep 4, 2025
be0276d
Merge pull request #9 from polygon-io/branch_49_fix
xudong963 Sep 4, 2025
63c54ea
add generated field to proto
zhuqi-lucas Sep 4, 2025
b7f9828
generate proto
zhuqi-lucas Sep 4, 2025
d0b757b
add proto message and generated.
zhuqi-lucas Sep 4, 2025
5b9219d
fix
zhuqi-lucas Sep 4, 2025
5aa43e5
fix clippy
zhuqi-lucas Sep 4, 2025
5918ef8
Merge pull request #10 from polygon-io/support_csv_truncate
zhuqi-lucas Sep 5, 2025
cae4095
X-1035 Part-2: support csv scan to read truncated rows
zhuqi-lucas Sep 5, 2025
86c8754
fix CI
zhuqi-lucas Sep 5, 2025
253e49c
add csvfmt with
zhuqi-lucas Sep 5, 2025
194d952
Merge pull request #11 from polygon-io/support_csv_truncate_for_read
xudong963 Sep 5, 2025
ca5d44b
Merge remote-tracking branch 'origin/branch-48-stream' into branch-49
zhuqi-lucas Sep 7, 2025
faca92d
Merge pull request #12 from polygon-io/branch-49-support-csv-truncate
zhuqi-lucas Sep 8, 2025
a0fc642
Support csv truncated rows in datafusion (#17465)
zhuqi-lucas Sep 9, 2025
ca8cd34
Merge remote-tracking branch 'origin/branch-49' into branch-50
zhuqi-lucas Sep 16, 2025
e3c2493
Merge branch 'branch-50' into branch-50-upgrade
zhuqi-lucas Sep 16, 2025
8588da4
fix clippy
zhuqi-lucas Sep 16, 2025
238d58b
fix test and fmt
zhuqi-lucas Sep 16, 2025
e16c24f
fix proto test
zhuqi-lucas Sep 17, 2025
acd9ddf
remove unused file
zhuqi-lucas Sep 17, 2025
2 changes: 1 addition & 1 deletion .github/workflows/audit.yml
@@ -44,4 +44,4 @@ jobs:
- name: Run audit check
# Ignored until https://github.com/apache/datafusion/issues/15571
# ignored py03 warning until arrow 55 upgrade
-        run: cargo audit --ignore RUSTSEC-2024-0370 --ignore RUSTSEC-2025-0020
+        run: cargo audit --ignore RUSTSEC-2024-0370 --ignore RUSTSEC-2025-0020 --ignore RUSTSEC-2025-0047
15 changes: 14 additions & 1 deletion datafusion/common/src/config.rs
@@ -539,7 +539,7 @@ config_namespace! {

/// (reading) If true, parquet reader will read columns of `Utf8/Utf8Large` with `Utf8View`,
/// and `Binary/BinaryLarge` with `BinaryView`.
-    pub schema_force_view_types: bool, default = true
+    pub schema_force_view_types: bool, default = false
Author comment: Default to utf8.

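For context: this default can still be overridden per session. A minimal sketch, assuming upstream DataFusion's standard config API (`SessionConfig::set_bool` and the SQL `SET` command; neither is part of this diff):

    use datafusion::prelude::*;

    // Sketch: re-enable view types for one session even though this fork
    // defaults `schema_force_view_types` to false.
    let config = SessionConfig::new()
        .set_bool("datafusion.execution.parquet.schema_force_view_types", true);
    let ctx = SessionContext::new_with_config(config);
    // Equivalent SQL: SET datafusion.execution.parquet.schema_force_view_types = true;
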
/// (reading) If true, parquet reader will read columns of
/// `Binary/LargeBinary` with `Utf8`, and `BinaryView` with `Utf8View`.
@@ -2521,6 +2521,10 @@ config_namespace! {
// The input regex for Nulls when loading CSVs.
pub null_regex: Option<String>, default = None
pub comment: Option<u8>, default = None
// Whether to allow truncated rows when parsing.
// By default this is set to false, and parsing will error if the CSV rows have different lengths.
// When set to true, records with fewer than the expected number of columns are allowed.
pub truncated_rows: Option<bool>, default = None
Author comment: Our csv truncated_rows support will be included in DF 51.0.0, but not in DF 50.0.0 (apache#17465).

}
}

@@ -2613,6 +2617,15 @@ impl CsvOptions {
self
}

/// Whether to allow truncated rows when parsing.
/// By default this is false, and parsing will error if the CSV rows have different lengths.
/// When set to true, records with fewer than the expected number of columns are allowed,
/// and the missing columns are filled with nulls. If the record's schema is not nullable,
/// an error is still returned.
pub fn with_truncated_rows(mut self, allow: bool) -> Self {
self.truncated_rows = Some(allow);
self
}

/// The delimiter character.
pub fn delimiter(&self) -> u8 {
self.delimiter
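Putting the two config.rs hunks together, a minimal usage sketch of the new builder (it mirrors the `infer_schema_with_truncated_rows_true` test later in this diff; the import paths are our assumption):

    use datafusion::datasource::file_format::csv::CsvFormat;
    use datafusion::common::config::CsvOptions;

    // Sketch: opt into truncated-row handling at the format level so schema
    // inference pads short rows with nulls instead of erroring.
    let csv_options = CsvOptions::default().with_truncated_rows(true);
    let csv_format = CsvFormat::default()
        .with_has_header(true)
        .with_options(csv_options);
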
179 changes: 178 additions & 1 deletion datafusion/core/src/datasource/file_format/csv.rs
@@ -48,7 +48,7 @@ mod tests {
use datafusion_physical_plan::{collect, ExecutionPlan};

use arrow::array::{
-        BooleanArray, Float64Array, Int32Array, RecordBatch, StringArray,
+        Array, BooleanArray, Float64Array, Int32Array, RecordBatch, StringArray,
};
use arrow::compute::concat_batches;
use arrow::csv::ReaderBuilder;
@@ -1256,4 +1256,181 @@ mod tests {
.build_decoder();
DecoderDeserializer::new(CsvDecoder::new(decoder))
}

fn csv_deserializer_with_truncated(
Author comment: Our csv truncated_rows support will be included in DF 51.0.0, but not in DF 50.0.0 (apache#17465).

batch_size: usize,
schema: &Arc<Schema>,
) -> impl BatchDeserializer<Bytes> {
// using Arrow's ReaderBuilder and enabling truncated_rows
let decoder = ReaderBuilder::new(schema.clone())
.with_batch_size(batch_size)
.with_truncated_rows(true) // <- enable runtime truncated_rows
.build_decoder();
DecoderDeserializer::new(CsvDecoder::new(decoder))
}

#[tokio::test]
async fn infer_schema_with_truncated_rows_true() -> Result<()> {
let session_ctx = SessionContext::new();
let state = session_ctx.state();

// CSV: header has 3 columns, but first data row has only 2 columns, second row has 3
let csv_data = Bytes::from("a,b,c\n1,2\n3,4,5\n");
let variable_object_store = Arc::new(VariableStream::new(csv_data, 1));
let object_meta = ObjectMeta {
location: Path::parse("/")?,
last_modified: DateTime::default(),
size: u64::MAX,
e_tag: None,
version: None,
};

// Construct CsvFormat and enable truncated_rows via CsvOptions
let csv_options = CsvOptions::default().with_truncated_rows(true);
let csv_format = CsvFormat::default()
.with_has_header(true)
.with_options(csv_options)
.with_schema_infer_max_rec(10);

let inferred_schema = csv_format
.infer_schema(
&state,
&(variable_object_store.clone() as Arc<dyn ObjectStore>),
&[object_meta],
)
.await?;

// header has 3 columns; inferred schema should also have 3
assert_eq!(inferred_schema.fields().len(), 3);

// inferred columns should be nullable
for f in inferred_schema.fields() {
assert!(f.is_nullable());
}

Ok(())
}

#[test]
fn test_decoder_truncated_rows_runtime() -> Result<()> {
// Synchronous test: Decoder API used here is synchronous
let schema = csv_schema(); // helper already defined in file

// Construct a decoder that enables truncated_rows at runtime
let mut deserializer = csv_deserializer_with_truncated(10, &schema);

// Provide two rows: first row complete, second row missing last column
let input = Bytes::from("0,0.0,true,0-string\n1,1.0,true\n");
deserializer.digest(input);

// Finish and collect output
deserializer.finish();

let output = deserializer.next()?;
match output {
DeserializerOutput::RecordBatch(batch) => {
// ensure at least two rows present
assert!(batch.num_rows() >= 2);
// column 4 (index 3) should be a StringArray where second row is NULL
let col4 = batch
.column(3)
.as_any()
.downcast_ref::<StringArray>()
.expect("column 4 should be StringArray");

// first row present, second row should be null
assert!(!col4.is_null(0));
assert!(col4.is_null(1));
}
other => panic!("expected RecordBatch but got {other:?}"),
}
Ok(())
}

#[tokio::test]
async fn infer_schema_truncated_rows_false_error() -> Result<()> {
let session_ctx = SessionContext::new();
let state = session_ctx.state();

// CSV: header has 4 cols, first data row has 3 cols -> truncated at end
let csv_data = Bytes::from("id,a,b,c\n1,foo,bar\n2,foo,bar,baz\n");
let variable_object_store = Arc::new(VariableStream::new(csv_data, 1));
let object_meta = ObjectMeta {
location: Path::parse("/")?,
last_modified: DateTime::default(),
size: u64::MAX,
e_tag: None,
version: None,
};

// CsvFormat without enabling truncated_rows (default behavior = false)
let csv_format = CsvFormat::default()
.with_has_header(true)
.with_schema_infer_max_rec(10);

let res = csv_format
.infer_schema(
&state,
&(variable_object_store.clone() as Arc<dyn ObjectStore>),
&[object_meta],
)
.await;

// Expect an error due to unequal lengths / incorrect number of fields
assert!(
res.is_err(),
"expected infer_schema to error on truncated rows when disabled"
);

// Optional: check message contains indicative text (two known possibilities)
if let Err(err) = res {
let msg = format!("{err}");
assert!(
msg.contains("Encountered unequal lengths")
|| msg.contains("incorrect number of fields"),
"unexpected error message: {msg}",
);
}

Ok(())
}

#[tokio::test]
async fn test_read_csv_truncated_rows_via_tempfile() -> Result<()> {
use std::io::Write;

// create a SessionContext
let ctx = SessionContext::new();

// Create a temp file with a .csv suffix so the reader accepts it
let mut tmp = tempfile::Builder::new().suffix(".csv").tempfile()?; // ensures path ends with .csv
// CSV has header "a,b,c". First data row is truncated (only "1,2"), second row is complete.
write!(tmp, "a,b,c\n1,2\n3,4,5\n")?;
let path = tmp.path().to_str().unwrap().to_string();

// Build CsvReadOptions: header present, enable truncated_rows.
// (Use the builder method your crate exposes; here it is `truncated_rows(true)`.)
let options = CsvReadOptions::default().truncated_rows(true);

println!("options: {}, path: {path}", options.truncated_rows);

// Call the API under test
let df = ctx.read_csv(&path, options).await?;

// Collect the results and combine batches so we can inspect columns
let batches = df.collect().await?;
let combined = concat_batches(&batches[0].schema(), &batches)?;

// Column 'c' is the 3rd column (index 2). The first data row was truncated -> should be NULL.
let col_c = combined.column(2);
assert!(
col_c.is_null(0),
"expected first row column 'c' to be NULL due to truncated row"
);

// Also ensure we read at least one row
assert!(combined.num_rows() >= 2);

Ok(())
}
}
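The new flag ultimately delegates to arrow-csv's `ReaderBuilder::with_truncated_rows`, as the decoder helper above shows. A self-contained sketch of that Arrow-level behavior (the schema and data are illustrative):

    use std::io::Cursor;
    use std::sync::Arc;
    use arrow::array::Array;
    use arrow::csv::ReaderBuilder;
    use arrow::datatypes::{DataType, Field, Schema};

    // Row 0 is truncated ("1,2"); with_truncated_rows(true) pads column "c"
    // with NULL instead of returning an unequal-lengths error.
    let schema = Arc::new(Schema::new(vec![
        Field::new("a", DataType::Int64, true),
        Field::new("b", DataType::Int64, true),
        Field::new("c", DataType::Int64, true),
    ]));
    let mut reader = ReaderBuilder::new(schema)
        .with_truncated_rows(true)
        .build(Cursor::new("1,2\n3,4,5\n"))?;
    let batch = reader.next().unwrap()?;
    assert!(batch.column(2).is_null(0)); // "c" is NULL in the truncated row
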
18 changes: 17 additions & 1 deletion datafusion/core/src/datasource/file_format/options.rs
@@ -91,6 +91,11 @@ pub struct CsvReadOptions<'a> {
pub file_sort_order: Vec<Vec<SortExpr>>,
/// Optional regex to match null values
pub null_regex: Option<String>,
/// Whether to allow truncated rows when parsing.
/// By default this is false, and parsing will error if the CSV rows have different lengths.
/// When set to true, records with fewer than the expected number of columns are allowed,
/// and the missing columns are filled with nulls. If the record's schema is not nullable,
/// an error is still returned.
pub truncated_rows: bool,
Author comment: Same as above.

}

impl Default for CsvReadOptions<'_> {
@@ -117,6 +122,7 @@ impl<'a> CsvReadOptions<'a> {
file_sort_order: vec![],
comment: None,
null_regex: None,
truncated_rows: false,
}
}

@@ -223,6 +229,15 @@ impl<'a> CsvReadOptions<'a> {
self.null_regex = null_regex;
self
}

/// Configure whether to allow truncated rows when parsing.
/// By default this is false, and parsing will error if the CSV rows have different lengths.
/// When set to true, records with fewer than the expected number of columns are allowed,
/// and the missing columns are filled with nulls. If the record's schema is not nullable,
/// an error is still returned.
pub fn truncated_rows(mut self, truncated_rows: bool) -> Self {
self.truncated_rows = truncated_rows;
self
}
}

/// Options that control the reading of Parquet files.
Expand Down Expand Up @@ -558,7 +573,8 @@ impl ReadOptions<'_> for CsvReadOptions<'_> {
.with_newlines_in_values(self.newlines_in_values)
.with_schema_infer_max_rec(self.schema_infer_max_records)
.with_file_compression_type(self.file_compression_type.to_owned())
-            .with_null_regex(self.null_regex.clone());
+            .with_null_regex(self.null_regex.clone())
+            .with_truncated_rows(self.truncated_rows);

ListingOptions::new(Arc::new(file_format))
.with_file_extension(self.file_extension)
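A call-site sketch tying the options.rs changes together (it mirrors the tempfile test earlier in this diff; `data.csv` is illustrative, and the snippet assumes an async context):

    use datafusion::prelude::*;

    // Sketch: read a CSV whose rows may be shorter than the header; missing
    // trailing columns come back as NULLs (the columns must be nullable).
    let ctx = SessionContext::new();
    let options = CsvReadOptions::new().truncated_rows(true);
    let df = ctx.read_csv("data.csv", options).await?;
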
4 changes: 2 additions & 2 deletions datafusion/core/src/datasource/file_format/parquet.rs
@@ -581,11 +581,11 @@ mod tests {
assert_eq!(string_truncation_stats.null_count, Precision::Exact(2));
assert_eq!(
string_truncation_stats.max_value,
-            Precision::Inexact(ScalarValue::Utf8View(Some("b".repeat(63) + "c")))
+            Precision::Inexact(Utf8(Some("b".repeat(63) + "c")))
Author comment: We default to utf8.

);
assert_eq!(
string_truncation_stats.min_value,
-            Precision::Inexact(ScalarValue::Utf8View(Some("a".repeat(64))))
+            Precision::Inexact(Utf8(Some("a".repeat(64))))
);

Ok(())
10 changes: 8 additions & 2 deletions datafusion/core/src/execution/session_state.rs
@@ -67,6 +67,9 @@ use datafusion_physical_expr::create_physical_expr;
use datafusion_physical_expr_common::physical_expr::PhysicalExpr;
use datafusion_physical_optimizer::optimizer::PhysicalOptimizer;
use datafusion_physical_optimizer::PhysicalOptimizerRule;
use datafusion_physical_plan::node_id::{
annotate_node_id_for_execution_plan, NodeIdAnnotator,
};
use datafusion_physical_plan::ExecutionPlan;
use datafusion_session::Session;
use datafusion_sql::parser::{DFParserBuilder, Statement};
@@ -647,9 +650,12 @@ impl SessionState {
logical_plan: &LogicalPlan,
) -> datafusion_common::Result<Arc<dyn ExecutionPlan>> {
let logical_plan = self.optimize(logical_plan)?;
-        self.query_planner
+        let physical_plan = self
+            .query_planner
             .create_physical_plan(&logical_plan, self)
-            .await
+            .await?;
+        let mut id_annotator = NodeIdAnnotator::new();
+        annotate_node_id_for_execution_plan(&physical_plan, &mut id_annotator)
Author comment: Our internal node_id support.

}
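
The `NodeIdAnnotator` pass itself is not shown in this diff. The toy model below is only our guess at its shape (sequential ids stored as `Option<usize>`, matching the "only output node_id in display if it exists" commit), not the fork's actual implementation, which walks `Arc<dyn ExecutionPlan>` and uses per-operator `with_node_id` methods:

    // Hypothetical, self-contained stand-in for the node-id pass.
    struct PlanNode {
        name: &'static str,
        node_id: Option<usize>,
        children: Vec<PlanNode>,
    }

    struct NodeIdAnnotator {
        next_id: usize,
    }

    impl NodeIdAnnotator {
        fn annotate(&mut self, node: &mut PlanNode) {
            // Children first (post-order), so leaves get the lowest ids;
            // the real pass's traversal order is an assumption here.
            for child in &mut node.children {
                self.annotate(child);
            }
            node.node_id = Some(self.next_id);
            self.next_id += 1;
        }
    }

    fn main() {
        let scan = PlanNode { name: "DataSourceExec", node_id: None, children: vec![] };
        let mut root = PlanNode { name: "FilterExec", node_id: None, children: vec![scan] };
        NodeIdAnnotator { next_id: 0 }.annotate(&mut root);
        assert_eq!(root.node_id, Some(1)); // root annotated after its child
        println!("{} -> {:?}", root.name, root.node_id);
    }
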

/// Create a [`PhysicalExpr`] from an [`Expr`] after applying type
4 changes: 3 additions & 1 deletion datafusion/core/tests/parquet/page_pruning.rs
@@ -165,7 +165,9 @@ async fn page_index_filter_one_col() {

// 5.create filter date_string_col == "01/01/09"`;
// Note this test doesn't apply type coercion so the literal must match the actual view type
let filter = col("date_string_col").eq(lit(ScalarValue::new_utf8view("01/01/09")));
// xudong: use new_utf8, because schema_force_view_types was changed to false now.
// qi: when schema_force_view_types setting to true, we should change back to utf8view
let filter = col("date_string_col").eq(lit(ScalarValue::new_utf8("01/01/09")));
Author comment: Default to utf8.

let batches = get_filter_results(&state, filter.clone(), false).await;
assert_eq!(batches[0].num_rows(), 14);

@@ -3615,18 +3615,19 @@ fn test_replace_order_preserving_variants_with_fetch() -> Result<()> {
);

// Apply the function
-    let result = replace_order_preserving_variants(dist_context)?;
+    let result = replace_order_preserving_variants(dist_context, false)?;

// Verify the plan was transformed to CoalescePartitionsExec
result
.0
.plan
.as_any()
.downcast_ref::<CoalescePartitionsExec>()
.expect("Expected CoalescePartitionsExec");

// Verify fetch was preserved
assert_eq!(
-        result.plan.fetch(),
+        result.0.plan.fetch(),
Some(5),
"Fetch value was not preserved after transformation"
);