
[SPARK-51747][SQL] Data source cached plan should respect options #50538


Closed
wants to merge 7 commits into from

Conversation

asl3
Contributor

@asl3 asl3 commented Apr 8, 2025

What changes were proposed in this pull request?

A cached data source plan should respect read options, such as the CSV delimiter. Before this change, DataSourceStrategy cached the first plan built for a table and reused it for later reads, ignoring any options that had changed. With this change, a new plan is returned when the options differ.
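To make the difference concrete, here is a minimal sketch in plain Scala with no Spark dependency. `Plan`, `TableKeyedCache`, and `OptionAwareCache` are hypothetical names chosen for illustration; they are not the classes touched by this PR, and the real lookup in DataSourceStrategy may be structured differently.

```scala
import scala.collection.mutable

object PlanCacheSketch {
  // Hypothetical stand-in for a resolved read plan; not a Spark class.
  final case class Plan(table: String, options: Map[String, String])

  // Former behavior: the first plan built for a table is reused for every
  // later read, whatever options that read passes.
  final class TableKeyedCache {
    private val cache = mutable.Map.empty[String, Plan]

    def lookup(table: String, options: Map[String, String]): Plan =
      cache.getOrElseUpdate(table, Plan(table, options))
  }

  // Behavior described in this PR: the cached plan is reused only when its
  // options match; otherwise a new plan is returned. Leaving the cached
  // entry untouched on a mismatch is an assumption of this sketch.
  final class OptionAwareCache {
    private val cache = mutable.Map.empty[String, Plan]

    def lookup(table: String, options: Map[String, String]): Plan =
      cache.get(table) match {
        case Some(plan) if plan.options == options => plan
        case Some(_) => Plan(table, options)
        case None =>
          val plan = Plan(table, options)
          cache.update(table, plan)
          plan
      }
  }
}
```

With `TableKeyedCache`, a second lookup for `t` that passes `'delimiter' = ';'` still gets the plan built without options, which is the bug reported here. `OptionAwareCache` returns a plan that carries the delimiter.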

Why are the changes needed?

For example:

spark.sql("CREATE TABLE t(a string, b string) USING CSV")
spark.sql("INSERT INTO TABLE t VALUES ('a;b', 'c')")

spark.sql("SELECT * FROM t").show()
spark.sql("SELECT * FROM t WITH ('delimiter' = ';')").show()

Expected output:

+----+----+
|col1|col2|
+----+----+
| a;b|   c|
+----+----+

+----+----+
|col1|col2|
+----+----+
|   a| b,c|
+----+----+ 

Output before this PR:

+----+----+
|col1|col2|
+----+----+
| a;b|   c|
+----+----+

+----+----+
|col1|col2|
+----+----+
| a;b|   c|
+----+----+ 

The PR is needed to get the expected result.
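The second expected table follows from how the stored row is split. The sketch below assumes the inserted row is written to the CSV file as the single line `a;b,c`, and uses a naive split that ignores quoting and escaping. `DelimiterSketch` and `parse` are illustrative names, not Spark's CSV parser.

```scala
object DelimiterSketch {
  // Naive field split; good enough for a row without quotes or escapes.
  def parse(line: String, delimiter: Char): List[String] =
    line.split(delimiter).toList

  def main(args: Array[String]): Unit = {
    val stored = "a;b,c" // assumed on-disk form of the row ('a;b', 'c')

    println(parse(stored, ',')) // List(a;b, c)
    println(parse(stored, ';')) // List(a, b,c)
  }
}
```

Reading with the default comma gives `a;b` and `c`, while reading with `;` gives `a` and `b,c`, matching the two expected tables.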

Does this PR introduce any user-facing change?

Yes. It corrects the caching behavior of DataSourceStrategy.

How was this patch tested?

Added a test in DDLSuite.scala.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Apr 8, 2025
@asl3 asl3 requested a review from gengliangwang April 9, 2025 18:52
Contributor

@szehon-ho szehon-ho left a comment


Thanks @asl3 !

@gengliangwang
Member

Thanks, merging to master/4.0

gengliangwang added a commit that referenced this pull request Apr 10, 2025
Closes #50538 from asl3/asl3/datasourcestrategycacheoptions.

Lead-authored-by: Amanda Liu <amanda.liu@databricks.com>
Co-authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit d2a864f)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
gengliangwang pushed a commit that referenced this pull request Apr 15, 2025
…ion guide

### What changes were proposed in this pull request?

Follow-up to #50538.

Add a SQL legacy conf to enable/disable the change to allow users to restore the previous behavior. Also add a migration guide note.

### Why are the changes needed?

The original PR changes the behavior of reading from a file data source table with options. The flag gives users a way to restore the former behavior if desired.
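As a rough illustration of what such a flag controls, the sketch below gates the choice of options on a boolean. The conf key is not named in this thread, so `legacyIgnoreOptions` and `Settings` are hypothetical placeholders, and the real implementation may combine table and query options differently.

```scala
object LegacyFlagSketch {
  // Hypothetical settings holder; `legacyIgnoreOptions` is an illustrative
  // name, not the real Spark conf key.
  final case class Settings(legacyIgnoreOptions: Boolean)

  // Options that the returned plan ends up using.
  def effectiveOptions(
      cachedOptions: Map[String, String],
      requestedOptions: Map[String, String],
      settings: Settings): Map[String, String] =
    if (settings.legacyIgnoreOptions) cachedOptions // former behavior
    else requestedOptions                           // behavior after #50538
}
```

With the flag on, a read that passes `'delimiter' = ';'` keeps whatever options the cached plan was built with. With the flag off, the requested options are used.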

### Does this PR introduce _any_ user-facing change?

No (original PR was a user-facing change, but this PR simply adds a config).

### How was this patch tested?

Added test for the config

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #50571 from asl3/asl3/filedatasourcecache-docsconf.

Authored-by: Amanda Liu <amanda.liu@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
gengliangwang pushed a commit that referenced this pull request Apr 15, 2025
(cherry picked from commit 3998186)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
vladimirg-db pushed a commit to vladimirg-db/spark that referenced this pull request Apr 15, 2025