Update to arrow `30.0.1` #4818

tustvold · 2023-01-04T10:08:29Z

Which issue does this PR close?

Closes #.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

tustvold · 2023-01-04T10:09:27Z

datafusion/physical-expr/src/expressions/nullif.rs

@@ -17,7 +17,7 @@

 use arrow::array::Array;
 use arrow::compute::eq_dyn;
-use arrow::compute::nullif::nullif;
+use arrow_select::nullif::nullif;


Upstream fix - apache/arrow-rs#3451

Longer term I hope to move datafusion away from the arrow dependency, and so this isn't really the end of the world

tustvold · 2023-01-04T11:43:00Z

datafusion/core/src/physical_plan/file_format/csv.rs

- crate::assert_batches_eq!(expected, &[batch]);
-
+ let err = it.next().await.unwrap().unwrap_err().to_string();
+ assert_eq!(err, "Csv error: incorrect number of fields, expected 14 got 13");


The new arrow CSV reader made this an error - apache/arrow-rs#3365 (comment)

tustvold · 2023-01-04T13:12:54Z

Practically speaking this is blocked on calebzulawski/target-features#1, which makes working on this crate completely unworkable

tustvold · 2023-01-04T20:41:04Z

#4821 contains the necessary changes to make the benchmarks pass

alamb · 2023-01-12T09:35:40Z

Now that arrow 30.0.1 is released, this PR can be unblocked, right?

tustvold · 2023-01-12T09:42:16Z

I'm working on it 😄 There be shenanigans

tustvold · 2023-01-12T10:21:26Z

datafusion/physical-expr/src/expressions/nullif.rs

@@ -17,7 +17,7 @@

 use arrow::array::Array;
 use arrow::compute::eq_dyn;
-use arrow::compute::nullif::nullif;
+use arrow::compute::kernels::nullif::nullif;


apache/arrow-rs#3515

alamb

Other than the CSV change, this PR looks good to me. Thank you @tustvold

alamb · 2023-01-12T21:47:15Z

datafusion/core/src/physical_plan/file_format/csv.rs

-
- crate::assert_batches_eq!(expected, &[batch]);
-
+ let err = it.next().await.unwrap().unwrap_err().to_string();


What happened here? Why is this erroring now?

The CSV reader got pickier: it now consistently errors on schema mismatch
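To make the stricter behaviour concrete, here is a minimal sketch of a reader that rejects any row whose field count differs from the schema. It is an illustration only: `parse_strict` and `FieldCountError` are hypothetical names, the comma splitting ignores quoting, and none of this is the arrow CSV reader's actual implementation or error type.

```rust
/// Hypothetical error type for this sketch; not arrow's actual error type.
#[derive(Debug, PartialEq)]
struct FieldCountError {
    /// 1-based line number of the offending row.
    line: usize,
    /// Number of columns in the schema.
    expected: usize,
    /// Number of fields actually found in the row.
    got: usize,
}

/// Split each line on commas (no quote handling) and reject any row whose
/// field count differs from the number of columns in the schema.
fn parse_strict(input: &str, expected: usize) -> Result<Vec<Vec<String>>, FieldCountError> {
    let mut rows = Vec::new();
    for (idx, line) in input.lines().enumerate() {
        let fields: Vec<String> = line.split(',').map(|f| f.to_string()).collect();
        if fields.len() != expected {
            return Err(FieldCountError {
                line: idx + 1,
                expected,
                got: fields.len(),
            });
        }
        rows.push(fields);
    }
    Ok(rows)
}
```

With a three-column schema, the input `"1,2,3\n4,5"` fails on line 2 instead of silently producing a short row, which is the shape of failure the updated test asserts on.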

I guess I am wondering if this test should then be updated:

- Removed entirely (testing the csv reader error doesn't sound useful)
- Updated to use a non broken CSV reader

I can try and look at it more carefully later today

> Updated to use a non broken CSV reader

The CSV reader behaviour is correct and expected. Whether the test is still providing value, I'm not sure; I thought removing it would be more controversial 😅

So I looked at this test more carefully. I think it is supposed to demonstrate that we can read from a CSV file whose schema is a subset of the schema in the plan, with the missing columns padded with nulls

It appears to have been added by @thinkharderdev in 7bec762 last year.

I think we should update the code so that it continues to pass.

I can try and look at it later this week if no one else has a chance

Yeah, that logic has never been correct. It incorrectly assumes that the CSV reader will pad nulls, which it never has except in the case of a prefix. Unlike Parquet or JSON, CSV columns aren't named, so there isn't a way to perform this.

Consider a schema containing columns a, b, and c. If the file actually had schema a,c, previously this would interpret column c as b 😱. Now it errors, as it should.
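The a, b, c case can be sketched in a few lines. The helper below is hypothetical (`assign_by_position` is not a DataFusion or arrow function); it only models what happens when fields are matched to schema columns purely by position, with missing trailing columns padded with nulls.

```rust
/// Assign the fields of one CSV row to schema columns purely by position,
/// padding any missing trailing columns with `None` (null).
/// Hypothetical helper for illustration only.
fn assign_by_position<'a>(
    schema: &[&'a str],
    row: &[&'a str],
) -> Vec<(&'a str, Option<&'a str>)> {
    schema
        .iter()
        .enumerate()
        // `row.get(i)` is `None` once the row runs out of fields.
        .map(|(i, name)| (*name, row.get(i).copied()))
        .collect()
}
```

For a file whose columns are a,c, the row `["a1", "c1"]` comes back as `a = a1`, `b = c1`, `c = null`: the value of c lands under b. Only when the file's columns are a prefix of the schema (a,b) does positional padding give the right answer, which is why erroring on a field-count mismatch is the safer behaviour.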

I would advise we merge this as is, and potentially file a follow up ticket to investigate this further. I think the mistake is in this test, but I'm not 100% sure

I've filed a ticket showing this being broken on master - #4918

Ticket link: #4919

alamb

Will make a small follow on PR to add a link to the tracking ticket

alamb · 2023-01-15T15:40:34Z

Thank you @tustvold

ursabot · 2023-01-15T15:50:31Z

Benchmark runs are scheduled for baseline = f376270 and contender = 2801c8c. 2801c8c is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Update to arrow 30

b65cef7

github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions sql SQL Planner labels Jan 4, 2023

tustvold commented Jan 4, 2023

tustvold added 2 commits January 4, 2023 10:32

Update flight

ad27e1b

CSV Error

85cf443

tustvold commented Jan 4, 2023

Format

3b33110

tustvold marked this pull request as draft January 4, 2023 13:13

alamb mentioned this pull request Jan 6, 2023

Release DataFusion 15.0.0 #4468

Closed

5 tasks

tustvold added 2 commits January 9, 2023 15:48

Update arrow 30

9c61533

Merge remote-tracking branch 'upstream/master' into update-arrow-30

9d3714b

Tweaks

97d55b6

tustvold commented Jan 12, 2023

tustvold marked this pull request as ready for review January 12, 2023 11:11

alamb reviewed Jan 12, 2023

alamb changed the title ~~Update to arrow 30~~ Update to arrow 30.0.1 Jan 12, 2023

alamb changed the title ~~Update to arrow 30.0.1~~ Update to arrow 30.1.0 Jan 12, 2023

tustvold changed the title ~~Update to arrow 30.1.0~~ Update to arrow 30.0.1 Jan 15, 2023

alamb mentioned this pull request Jan 15, 2023

Incorrect Schema Adaption for CSV #4918

Open

alamb approved these changes Jan 15, 2023

alamb merged commit 2801c8c into apache:master Jan 15, 2023
