Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update to arrow 30.0.1 #4818

Merged
merged 7 commits into from
Jan 15, 2023
Merged

Update to arrow 30.0.1 #4818

merged 7 commits into from
Jan 15, 2023

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented Jan 4, 2023

Which issue does this PR close?

Closes #.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions sql SQL Planner labels Jan 4, 2023
@@ -17,7 +17,7 @@

use arrow::array::Array;
use arrow::compute::eq_dyn;
use arrow::compute::nullif::nullif;
use arrow_select::nullif::nullif;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upstream fix - apache/arrow-rs#3451

Longer term I hope to move datafusion away from the arrow dependency, and so this isn't really the end of the world

crate::assert_batches_eq!(expected, &[batch]);

let err = it.next().await.unwrap().unwrap_err().to_string();
assert_eq!(err, "Csv error: incorrect number of fields, expected 14 got 13");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new arrow CSV reader made this an error - apache/arrow-rs#3365 (comment)

@tustvold
Copy link
Contributor Author

tustvold commented Jan 4, 2023

Practically speaking this is blocked on calebzulawski/target-features#1 it makes working on this crate completely unworkable

@tustvold tustvold marked this pull request as draft January 4, 2023 13:13
@tustvold
Copy link
Contributor Author

tustvold commented Jan 4, 2023

#4821 contains the necessary changes to make the benchmarks pass

@alamb alamb mentioned this pull request Jan 6, 2023
5 tasks
@alamb
Copy link
Contributor

alamb commented Jan 12, 2023

Now that arrow 30.0.1 is released, this PR can be unblocked, right?

@tustvold
Copy link
Contributor Author

I'm working on it 😄 There be shenanigans

@@ -17,7 +17,7 @@

use arrow::array::Array;
use arrow::compute::eq_dyn;
use arrow::compute::nullif::nullif;
use arrow::compute::kernels::nullif::nullif;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tustvold tustvold marked this pull request as ready for review January 12, 2023 11:11
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than the CSV change, this PR looks good to me. Thank you @tustvold


crate::assert_batches_eq!(expected, &[batch]);

let err = it.next().await.unwrap().unwrap_err().to_string();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happened here? Why is this erroring now?

Copy link
Contributor Author

@tustvold tustvold Jan 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CSV reader got more picky, it now consistently errors on schema mismatch

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I am wondering if this test should then be updated:

  1. Removed entirely (testing the csv reader error doesn't sound useful)
  2. Updated to use a non broken CSV reader

I can try and look at it more carefully later today

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to use a non broken CSV reader

The CSV reader behaviour is correct and expected, whether the test is still providing value, I'm not sure - I thought removing it would be more controversial 😅

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I looked at this test more carefully -- I think it is supposed to be demonstrating that we can read from a CSV file where the schema in the file is a subset of the schema in the plan and the columns are supposed to get padded with nulls

It appears to have been added by @thinkharderdev in 7bec762 last year.

I think we should update the code so that it continues to pass.

I can try and look at it later this week if no one else has a chance

Copy link
Contributor Author

@tustvold tustvold Jan 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that logic has never been correct, it incorrectly assumes that the CSV reader will pad nulls which it hasn't ever except for in the case of a prefix. Unlike parquet or JSON columns aren't named, and so there isn't a way to perform this.

Consider a schema containing columns a, b, and c, if the file actually had schema a,c previously this would interpret column c as b 😱. Now it errors as it should.

I would advise we merge this as is, and potentially file a follow up ticket to investigate this further. I think the mistake is in this test, but I'm not 100% sure

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've filed a ticket showing this being broken on master - #4918

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ticket link: #4919

@alamb alamb changed the title Update to arrow 30 Update to arrow 30.0.1 Jan 12, 2023
@alamb alamb changed the title Update to arrow 30.0.1 Update to arrow 30.1.0 Jan 12, 2023
@tustvold tustvold changed the title Update to arrow 30.1.0 Update to arrow 30.0.1 Jan 15, 2023
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will make a small follow on PR to add a link to the tracking ticket

@alamb
Copy link
Contributor

alamb commented Jan 15, 2023

Thank you @tustvold

@alamb alamb merged commit 2801c8c into apache:master Jan 15, 2023
@ursabot
Copy link

ursabot commented Jan 15, 2023

Benchmark runs are scheduled for baseline = f376270 and contender = 2801c8c. 2801c8c is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions sql SQL Planner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants