Upgrade to object_store 0.9.0 and arrow 50.0.0 #8758
Conversation
Force-pushed from 1be2b9a to 7e7a57a
Thank you for this
@@ -40,12 +40,12 @@ async fn describe() -> Result<()> {
 "+------------+-------------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-----------------+------------+-------------------------+--------------------+-------------------+",
I believe these precision changes relate to apache/arrow-rs#5100
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a better result to me (less numeric instability).
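As a standalone illustration (not the arrow-rs change itself), fixed-precision formatting hides digits that shortest round-trip formatting preserves:

```rust
fn main() {
    let x: f64 = 1.0 / 3.0;
    // Fixed precision truncates what is displayed...
    println!("{:.6}", x); // 0.333333
    // ...while Rust's default Display prints the shortest string that
    // round-trips to the same f64, which can surface more digits.
    println!("{}", x); // 0.3333333333333333
}
```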
@@ -194,7 +194,7 @@ worth noting that using the settings in the `[profile.release]` section will sig

 ```toml
 [dependencies]
-datafusion = { version = "22.0" , features = ["simd"]}
+datafusion = { version = "22.0" }
@@ -1746,6 +1746,7 @@ mod tests {
 }

 #[tokio::test]
+#[ignore]
This is failing with a memory-exhausted error. I don't believe this is an inherent issue with the arrow release; rather, the test is very sensitive, so I don't think it should block the arrow release.
We should figure out how to update the test to avoid the error, though -- @kazuyukitanimura do you have any thoughts on how to do so?
We can update the `max_memory` of `new_spill_ctx(2, 1500)` in `check_aggregates`, as long as we understand why we need more memory. It looks like the limit was actually reduced from 2500 to 1500 in #7587.
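For reference, a minimal sketch of what such a helper might look like, assuming DataFusion's `RuntimeConfig::with_memory_limit`; the real `new_spill_ctx` in the test suite may be shaped differently:

```rust
use std::sync::Arc;

use datafusion::execution::runtime_env::{RuntimeConfig, RuntimeEnv};
use datafusion::prelude::{SessionConfig, SessionContext};

// Hypothetical stand-in for the helper discussed above.
fn new_spill_ctx(partitions: usize, max_memory: usize) -> SessionContext {
    // Cap the memory pool so the aggregation is forced to spill.
    let runtime = RuntimeEnv::new(RuntimeConfig::new().with_memory_limit(max_memory, 1.0))
        .expect("valid runtime config");
    let config = SessionConfig::new().with_target_partitions(partitions);
    SessionContext::new_with_config_rt(config, Arc::new(runtime))
}
```

Bumping the second argument (e.g. back toward 2500) would relax the limit for this test.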
We changed the way aggregates are computed, and this might have impacted buffer sizing in some way; it is hard to know for sure without investing a lot of time. If it isn't a major change I wouldn't be overly concerned about it. These sorts of tests are always extremely fragile.
It could even be that the aggregates are now much faster and therefore we end up buffering more, I don't know 😅
I don't think it is a major change personally
I looked through this PR and I agree there is nothing that looks like it should block the arrow-rs / arrow 50 release apache/arrow-rs#5234
Thank you @tustvold
@@ -43,7 +43,7 @@ async fn csv_query_custom_udf_with_cast() -> Result<()> {
 "+------------------------------------------+",
 "| AVG(custom_sqrt(aggregate_test_100.c11)) |",
 "+------------------------------------------+",
-"| 0.6584408483418833 |",
+"| 0.6584408483418835 |",
This differs only in the last decimal place, so I think it is related to floating point stability, and this change is fine.
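An exaggerated standalone sketch of why accumulation order perturbs the low digits:

```rust
fn main() {
    // Floating point addition is not associative, so the order in which
    // values are reduced can change the result in the low digits.
    let xs = [0.1_f64, 0.2, 0.3, 1e16, -1e16];
    let forward: f64 = xs.iter().sum();
    let backward: f64 = xs.iter().rev().sum();
    println!("forward:  {forward}"); // 0 (the 0.6 partial sum is absorbed by 1e16)
    println!("backward: {backward}"); // 0.6
}
```

A parallel AVG that merges per-partition sums in a different order can legitimately shift the last decimal place like this.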
@@ -451,10 +451,6 @@ Dml: op=[Insert Into] table=[test_decimal]
 "INSERT INTO test_decimal (nonexistent, price) VALUES (1, 2), (4, 5)",
 "Schema error: No field named nonexistent. Valid fields are id, price."
 )]
-#[case::type_mismatch(
-"INSERT INTO test_decimal SELECT '2022-01-01', to_timestamp('2022-01-01T12:00:00')",
-"Error during planning: Cannot automatically convert Timestamp(Nanosecond, None) to Decimal128(10, 2)"
👍
@@ -29,7 +29,6 @@ rust-version = "1.70"
 [features]
 ci = []
 default = ["mimalloc"]
-simd = ["datafusion/simd"]
arrow 50 removed the manual SIMD implementation and now relies on auto vectorization - apache/arrow-rs#5184
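Not part of this PR, but for anyone who relied on the removed feature: one common way to help LLVM auto-vectorize is to target the host CPU, e.g. via a project-local `.cargo/config.toml` (illustrative, not required by DataFusion):

```toml
# Opt the whole build into host-specific codegen so LLVM can
# auto-vectorize the arrow kernels for the local CPU.
[build]
rustflags = ["-C", "target-cpu=native"]
```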
@@ -340,13 +340,10 @@ mod tests {
 let session_token = "fake_session_token";
 let location = "s3://bucket/path/file.parquet";

-// Missing region
+// Missing region, use object_store defaults
object_store now defaults to us-east-1 (apache/arrow-rs#5244)
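A minimal sketch against object_store 0.9 showing the behavior the test relies on; the bucket and credentials are the placeholders from the test above:

```rust
use object_store::aws::AmazonS3Builder;

fn main() -> object_store::Result<()> {
    // No region is supplied, so the builder now falls back to the
    // object_store default (us-east-1) instead of returning an error.
    let store = AmazonS3Builder::new()
        .with_bucket_name("bucket")
        .with_access_key_id("fake_access_key_id")
        .with_secret_access_key("fake_secret_access_key")
        .build()?;
    println!("{store}");
    Ok(())
}
```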
@@ -138,7 +138,7 @@ physical_plan
 SortPreservingMergeExec: [column1@0 ASC NULLS LAST]
 --CoalesceBatchesExec: target_batch_size=8192
 ----FilterExec: column1@0 != 42
-------ParquetExec: file_groups={4 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/1.parquet:0..197], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:0..201], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:201..403], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/1.parquet:197..394]]}, projection=[column1], output_ordering=[column1@0 ASC NULLS LAST], predicate=column1@0 != 42, pruning_predicate=column1_min@0 != 42 OR 42 != column1_max@1, required_guarantees=[column1 not in (42)]
+------ParquetExec: file_groups={4 groups: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/1.parquet:0..202], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:0..207], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/2.parquet:207..414], [WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/repartition_scan/parquet_table/1.parquet:202..405]]}, projection=[column1], output_ordering=[column1@0 ASC NULLS LAST], predicate=column1@0 != 42, pruning_predicate=column1_min@0 != 42 OR 42 != column1_max@1, required_guarantees=[column1 not in (42)]
The parquet file appears to be slightly larger, and thus the offsets are now slightly different. This can happen because, for example, the metadata written changed: instead of "arrow-rs 49.0.0" it may now say "arrow-rs 50.0.0".
I think it is also because we now write the column sort order information
Perhaps this PR apache/arrow-rs#5110
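Either theory can be checked by dumping the footer with the parquet crate; a sketch, with the file path assumed from the test scratch directory:

```rust
use std::fs::File;

use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> parquet::errors::Result<()> {
    let file = File::open("1.parquet").expect("open a file written by the new release");
    let reader = SerializedFileReader::new(file)?;
    let meta = reader.metadata().file_metadata();
    // The writer string changes between releases, and column orders are
    // now recorded too; both can grow the footer by a few bytes.
    println!("created_by:    {:?}", meta.created_by());
    println!("column_orders: {:?}", meta.column_orders());
    Ok(())
}
```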
Which issue does this PR close?
Closes #.
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?