
Conversation

@allisonwang-db
Contributor

What changes were proposed in this pull request?

This PR prevents the registration of a Python data source if its name conflicts with either a built-in data source or a loadable custom Java/Scala data source.

Why are the changes needed?

To improve usability. Currently, users can register a Python data source whose name already exists, but they cannot actually use it:

spark.dataSource.registerPython("json", MyDataSource)  # OK

spark.read.format("json").load()
[FOUND_MULTIPLE_DATA_SOURCES] Detected multiple data sources with the name 'json'. Please check the data source isn't simultaneously registered and located in the classpath. SQLSTATE: 42710

Does this PR introduce any user-facing change?

No

How was this patch tested?

New unit tests

Was this patch authored or co-authored using generative AI tooling?

No

Member

I feel like we should scope this into runtime data sources (which can be overwritten) and static data sources (which cannot). This change should be fine for now, though.

Member

Another concern is what happens when the name conflicts with statically registered sources.

Contributor Author

The rule I am considering is:

  • If the name of any user-defined data source conflicts with a built-in data source or an existing Java/Scala data source that can be loaded from the classpath, an exception will be thrown.
  • Otherwise, data sources can be overridden.

For static Python data sources, should we allow them to be overridden?

Member

That makes sense in a way. But technically we can also overwrite Java/Scala data sources once we allow runtime registration, and my concern is that it might be somewhat inconsistent and difficult for end users to understand how overwriting works. One option is to always allow overwriting, or to disallow it entirely.

Member

So I was thinking that statically registered ones cannot be overwritten, but I'm not sure which way is better.

Contributor Author

I agree we should keep it consistent by blocking all statically registered data sources (we can change this based on user feedback in the future).
Currently, it appears we can't differentiate between a statically registered Python data source and a dynamic one. Perhaps we could add a flag in the dataSourceBuilders to indicate whether it's static or dynamic.
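
One way to model such a flag, sketched in plain Python (the `DataSourceBuilder` dataclass and `register` function here are hypothetical illustrations, not the actual `dataSourceBuilders` structure in Spark):

```python
# Hypothetical sketch: tag each registered builder as static or dynamic,
# and only allow dynamic (runtime-registered) builders to be overwritten.
from dataclasses import dataclass
from typing import Callable

@dataclass
class DataSourceBuilder:
    name: str
    builder: Callable
    is_static: bool  # statically registered builders cannot be overwritten

builders: dict[str, DataSourceBuilder] = {}


def register(name: str, builder: Callable, *, static: bool = False) -> None:
    existing = builders.get(name)
    if existing is not None and existing.is_static:
        raise ValueError(
            f"Cannot overwrite statically registered data source '{name}'")
    builders[name] = DataSourceBuilder(name, builder, static)
```

This keeps the behavior the thread converged on: static registrations are blocked from being overridden, while dynamic ones can be replaced freely.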

Member

I am fine with this for now, but we should probably make sure that change happens before Spark 4.0.

Contributor Author

Sounds good! Will add a TODO here.

@allisonwang-db force-pushed the spark-46522-check-name branch from 634dbd0 to 03e1c59 on January 3, 2024 at 11:21
@HyukjinKwon
Member

Merged to master.
