Update to arrow 30.0.1
#4818
Conversation
```diff
@@ -17,7 +17,7 @@
 use arrow::array::Array;
 use arrow::compute::eq_dyn;
-use arrow::compute::nullif::nullif;
+use arrow_select::nullif::nullif;
```
Upstream fix - apache/arrow-rs#3451
Longer term I hope to move datafusion away from the arrow
dependency, and so this isn't really the end of the world
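For readers unfamiliar with the kernel being re-exported here: `nullif` takes an array and a boolean mask and nulls out every slot where the mask is true. A minimal sketch of that semantics on plain `Option`s (this is an illustration, not the actual arrow-rs API):

```rust
// Sketch of nullif semantics: a value becomes null wherever the boolean
// condition is true; null or false conditions leave the value untouched.
fn nullif_sketch(values: &[Option<i64>], cond: &[Option<bool>]) -> Vec<Option<i64>> {
    values
        .iter()
        .zip(cond)
        .map(|(v, c)| match c {
            Some(true) => None, // condition true -> null out the value
            _ => *v,            // false or null condition -> keep the value
        })
        .collect()
}

fn main() {
    let values = vec![Some(1), Some(2), None, Some(4)];
    let cond = vec![Some(false), Some(true), Some(false), None];
    assert_eq!(nullif_sketch(&values, &cond), vec![Some(1), None, None, Some(4)]);
    println!("ok");
}
```

The real kernel operates on Arrow arrays with validity bitmaps rather than `Option` slices, but the per-slot rule is the same.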
```rust
crate::assert_batches_eq!(expected, &[batch]);

let err = it.next().await.unwrap().unwrap_err().to_string();
assert_eq!(err, "Csv error: incorrect number of fields, expected 14 got 13");
```
The new arrow CSV reader made this an error - apache/arrow-rs#3365 (comment)
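The shape of the new behaviour: each record's field count is checked against the schema width, and a mismatch is an error rather than being silently padded. A hedged sketch of that check (mimicking the error message quoted in the test above, not the actual arrow-rs reader internals):

```rust
// Sketch of a strict CSV record check: every record must have exactly as
// many fields as the schema expects, otherwise parsing fails.
fn parse_record(line: &str, expected_fields: usize) -> Result<Vec<String>, String> {
    let fields: Vec<String> = line.split(',').map(str::to_string).collect();
    if fields.len() != expected_fields {
        return Err(format!(
            "Csv error: incorrect number of fields, expected {} got {}",
            expected_fields,
            fields.len()
        ));
    }
    Ok(fields)
}

fn main() {
    assert!(parse_record("a,b,c", 3).is_ok());
    let err = parse_record("a,b,c", 4).unwrap_err();
    assert_eq!(err, "Csv error: incorrect number of fields, expected 4 got 3");
    println!("ok");
}
```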
Practically speaking this is blocked on calebzulawski/target-features#1; it makes working on this crate completely unworkable.
#4821 contains the necessary changes to make the benchmarks pass.
Now that arrow 30.0.1 is released, this PR can be unblocked, right?
I'm working on it 😄 There be shenanigans
```diff
@@ -17,7 +17,7 @@
 use arrow::array::Array;
 use arrow::compute::eq_dyn;
-use arrow::compute::nullif::nullif;
+use arrow::compute::kernels::nullif::nullif;
```
Other than the CSV change, this PR looks good to me. Thank you @tustvold
```rust
crate::assert_batches_eq!(expected, &[batch]);

let err = it.next().await.unwrap().unwrap_err().to_string();
```
What happened here? Why is this erroring now?
The CSV reader got stricter; it now consistently errors on a schema mismatch.
I guess I am wondering if this test should then be:
- Removed entirely (testing the CSV reader error doesn't sound useful)
- Updated to use a non-broken CSV reader
I can try and look at it more carefully later today
> Updated to use a non-broken CSV reader

The CSV reader behaviour is correct and expected. Whether the test is still providing value, I'm not sure; I thought removing it would be more controversial 😅
So I looked at this test more carefully -- I think it is supposed to demonstrate that we can read from a CSV file whose schema is a subset of the schema in the plan, with the missing columns padded with nulls.
It appears to have been added by @thinkharderdev in 7bec762 last year.
I think we should update the code so that it continues to pass.
I can try and look at it later this week if no one else has a chance
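What the test seems to expect can be sketched in a few lines: when the file's columns are a prefix of the plan schema, the missing trailing columns get filled with nulls. This is an illustration of the expected padding behaviour, not DataFusion or arrow-rs code:

```rust
// Sketch of prefix padding: if a record has fewer columns than the plan
// schema, the missing trailing columns are filled with null.
fn pad_to_schema(record: Vec<Option<String>>, schema_width: usize) -> Vec<Option<String>> {
    let mut out = record;
    while out.len() < schema_width {
        out.push(None); // pad missing trailing columns with null
    }
    out
}

fn main() {
    let record = vec![Some("1".to_string()), Some("x".to_string())];
    let padded = pad_to_schema(record, 4);
    assert_eq!(padded, vec![Some("1".to_string()), Some("x".to_string()), None, None]);
    println!("ok");
}
```

Note that this only works when the file's columns really are a prefix of the schema, which is exactly the caveat raised in the next comment.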
Yeah, that logic has never been correct: it assumes the CSV reader will pad nulls, which it never has except when the file's columns are a prefix of the schema. Unlike Parquet or JSON, CSV columns aren't named, so there isn't a way to perform this matching.
Consider a schema containing columns a, b, and c: if the file actually had schema a, c, previously this would interpret column c as b 😱. Now it errors, as it should.
I would advise we merge this as is, and potentially file a follow-up ticket to investigate further. I think the mistake is in this test, but I'm not 100% sure.
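The hazard described above comes from purely positional mapping. A small self-contained sketch (hypothetical helper, not real reader code) of how a file holding columns (a, c) read under schema (a, b, c) silently misattributes c's data to b:

```rust
// Sketch of positional column mapping: schema names are paired with record
// fields by index only, so a missing middle column shifts everything after it.
fn map_positionally<'a>(schema: &[&'a str], record: &[&'a str]) -> Vec<(&'a str, Option<&'a str>)> {
    schema
        .iter()
        .enumerate()
        .map(|(i, name)| (*name, record.get(i).copied()))
        .collect()
}

fn main() {
    let schema = ["a", "b", "c"];
    let record = ["1", "true"]; // the file really holds columns a and c
    let mapped = map_positionally(&schema, &record);
    // c's value "true" lands in column b, and c looks null
    assert_eq!(mapped, vec![("a", Some("1")), ("b", Some("true")), ("c", None)]);
    println!("ok");
}
```

Erroring on the field-count mismatch, as the new reader does, is the only safe option here.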
I've filed a ticket showing this being broken on master - #4918
Ticket link: #4919
Will make a small follow-on PR to add a link to the tracking ticket.
Thank you @tustvold
Benchmark runs are scheduled for baseline = f376270 and contender = 2801c8c. 2801c8c is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?
Closes #.
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?