Upgrade to object_store 0.9.0 and arrow 50.0.0 #8758
Conversation
Force-pushed from 1be2b9a to 7e7a57a
Thank you for this
@@ -40,12 +40,12 @@ async fn describe() -> Result<()> {
 "+------------+-------------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-----------------+------------+-------------------------+--------------------+-------------------+",
I believe these precision changes relate to apache/arrow-rs#5100
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a better result to me (less numeric instability).
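As a standalone illustration (not the arrow-rs change itself), fixed-precision formatting hides digits that shortest round-trip formatting preserves:

```rust
fn main() {
    let x: f64 = 1.0 / 3.0;
    // Fixed precision truncates what is displayed...
    println!("{:.6}", x); // 0.333333
    // ...while Rust's default Display prints the shortest string that
    // round-trips to the same f64, which can surface more digits.
    println!("{}", x); // 0.3333333333333333
}
```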
@@ -194,7 +194,7 @@ worth noting that using the settings in the `[profile.release]` section will sig

 ```toml
 [dependencies]
-datafusion = { version = "22.0" , features = ["simd"]}
+datafusion = { version = "22.0" }
@@ -1746,6 +1746,7 @@ mod tests {
 }

 #[tokio::test]
+#[ignore]
This is failing with a memory-exhausted error. I don't believe this is an inherent issue with the arrow release; rather, the test is very sensitive, so I don't think it should block the arrow release.
We should figure out how to update the test to avoid the error, though -- @kazuyukitanimura do you have any thoughts on how to do so?
We can update the `max_memory` of `new_spill_ctx(2, 1500)` in `check_aggregates`, as long as we understand why we need more memory. It looks like the limit was actually reduced from 2500 to 1500 in #7587.
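For reference, a minimal sketch of what such a helper might look like, assuming DataFusion's `RuntimeConfig::with_memory_limit`; the real `new_spill_ctx` in the test suite may be shaped differently:

```rust
use std::sync::Arc;

use datafusion::execution::runtime_env::{RuntimeConfig, RuntimeEnv};
use datafusion::prelude::{SessionConfig, SessionContext};

// Hypothetical stand-in for the helper discussed above.
fn new_spill_ctx(partitions: usize, max_memory: usize) -> SessionContext {
    // Cap the memory pool so the aggregation is forced to spill.
    let runtime = RuntimeEnv::new(RuntimeConfig::new().with_memory_limit(max_memory, 1.0))
        .expect("valid runtime config");
    let config = SessionConfig::new().with_target_partitions(partitions);
    SessionContext::new_with_config_rt(config, Arc::new(runtime))
}
```

Bumping the second argument (e.g. back toward 2500) would relax the limit for this test.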
We changed the way aggregates are computed, and this might have impacted buffer sizing in some way; it is hard to know for sure without investing a lot of time. If it isn't a major change I wouldn't be overly concerned about it. These sorts of tests are always extremely fragile.
It could even be that the aggregates are now much faster and therefore we end up buffering more, I don't know 😅
I don't think it is a major change personally
I looked through this PR and I agree there is nothing that looks like it should block the arrow-rs / arrow 50 release apache/arrow-rs#5234
Thank you @tustvold
@@ -43,7 +43,7 @@ async fn csv_query_custom_udf_with_cast() -> Result<()> {
 "+------------------------------------------+",
 "| AVG(custom_sqrt(aggregate_test_100.c11)) |",
 "+------------------------------------------+",
-"| 0.6584408483418833 |",
+"| 0.6584408483418835 |",
This differs only in the last decimal place, so I think it is related to floating point stability, and this change is fine.
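An exaggerated standalone sketch of why accumulation order perturbs the low digits:

```rust
fn main() {
    // Floating point addition is not associative, so the order in which
    // values are reduced can change the result in the low digits.
    let xs = [0.1_f64, 0.2, 0.3, 1e16, -1e16];
    let forward: f64 = xs.iter().sum();
    let backward: f64 = xs.iter().rev().sum();
    println!("forward:  {forward}"); // 0 (the 0.6 partial sum is absorbed by 1e16)
    println!("backward: {backward}"); // 0.6
}
```

A parallel AVG that merges per-partition sums in a different order can legitimately shift the last decimal place like this.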
@@ -451,10 +451,6 @@ Dml: op=[Insert Into] table=[test_decimal]
 "INSERT INTO test_decimal (nonexistent, price) VALUES (1, 2), (4, 5)",
 "Schema error: No field named nonexistent. Valid fields are id, price."
 )]
-#[case::type_mismatch(
-"INSERT INTO test_decimal SELECT '2022-01-01', to_timestamp('2022-01-01T12:00:00')",
-"Error during planning: Cannot automatically convert Timestamp(Nanosecond, None) to Decimal128(10, 2)"
👍
@@ -29,7 +29,6 @@ rust-version = "1.70"
 [features]
 ci = []
 default = ["mimalloc"]
-simd = ["datafusion/simd"]
arrow 50 removed the manual SIMD implementation and now relies on auto vectorization - apache/arrow-rs#5184
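Not part of this PR, but for anyone who relied on the removed feature: one common way to help LLVM auto-vectorize is to target the host CPU, e.g. via a project-local `.cargo/config.toml` (illustrative, not required by DataFusion):

```toml
# Opt the whole build into host-specific codegen so LLVM can
# auto-vectorize the arrow kernels for the local CPU.
[build]
rustflags = ["-C", "target-cpu=native"]
```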
@@ -340,13 +340,10 @@ mod tests {
 let session_token = "fake_session_token";
 let location = "s3://bucket/path/file.parquet";

-// Missing region
+// Missing region, use object_store defaults
object_store now defaults to us-east-1 (apache/arrow-rs#5244)
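A minimal sketch against object_store 0.9 showing the behavior the test relies on; the bucket and credentials are the placeholders from the test above:

```rust
use object_store::aws::AmazonS3Builder;

fn main() -> object_store::Result<()> {
    // No region is supplied, so the builder now falls back to the
    // object_store default (us-east-1) instead of returning an error.
    let store = AmazonS3Builder::new()
        .with_bucket_name("bucket")
        .with_access_key_id("fake_access_key_id")
        .with_secret_access_key("fake_secret_access_key")
        .build()?;
    println!("{store}");
    Ok(())
}
```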
@@ -138,7 +138,7 @@ physical_plan
 SortPreservingMergeExec: [column1@0 ASC NULLS LAST]
 --CoalesceBatchesExec: target_batch_size=8192
 ----FilterExec: column1@0 != 42
-------ParquetExec: file_groups={4 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/1.parquet:0..197], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:0..201], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:201..403], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/1.parquet:197..394]]}, projection=[column1], output_ordering=[column1@0 ASC NULLS LAST], predicate=column1@0 != 42, pruning_predicate=column1_min@0 != 42 OR 42 != column1_max@1, required_guarantees=[column1 not in (42)]
+------ParquetExec: file_groups={4 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/1.parquet:0..202], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:0..207], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:207..414], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/1.parquet:202..405]]}, projection=[column1], output_ordering=[column1@0 ASC NULLS LAST], predicate=column1@0 != 42, pruning_predicate=column1_min@0 != 42 OR 42 != column1_max@1, required_guarantees=[column1 not in (42)]
The parquet file appears to be slightly larger, and thus the offsets are now slightly different. This can happen because, for example, the metadata written changed: instead of "arrow-rs 49.0.0" it may now say "arrow-rs 50.0.0".
I think it is also because we now write the column sort order information
Perhaps this PR apache/arrow-rs#5110
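Either theory can be checked by dumping the footer with the parquet crate; a sketch, with the file path assumed from the test scratch directory:

```rust
use std::fs::File;

use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> parquet::errors::Result<()> {
    let file = File::open("1.parquet").expect("open a file written by the new release");
    let reader = SerializedFileReader::new(file)?;
    let meta = reader.metadata().file_metadata();
    // The writer string changes between releases, and column orders are
    // now recorded too; both can grow the footer by a few bytes.
    println!("created_by:    {:?}", meta.created_by());
    println!("column_orders: {:?}", meta.column_orders());
    Ok(())
}
```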
Which issue does this PR close?
Closes #.
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?