
Conversation

@allisonwang-db
Contributor

What changes were proposed in this pull request?

This PR prevents the registration of a Python data source if its name conflicts with either a built-in data source or a loadable custom Java/Scala data source.

Why are the changes needed?

To improve usability. Currently, users can register a Python data source whose name already exists, but they cannot actually use it:

spark.dataSource.registerPython("json", MyDataSource)  # OK

spark.read.format("json").load()
[FOUND_MULTIPLE_DATA_SOURCES] Detected multiple data sources with the name 'json'. Please check the data source isn't simultaneously registered and located in the classpath. SQLSTATE: 42710

Does this PR introduce any user-facing change?

No

How was this patch tested?

New unit tests

Was this patch authored or co-authored using generative AI tooling?

No

Member

I feel like we should scope this into runtime data sources (which can be overwritten) and static data sources (which cannot). This change should be fine for now, though.

Member

Another concern is what happens when the name conflicts with statically registered sources.

Contributor Author

The rule I am considering is:

  • If the name of any user-defined data source conflicts with a built-in data source or an existing Java/Scala data source that can be loaded from the classpath, an exception will be thrown.
  • Otherwise, data sources can be overridden.

For static Python data sources, should we allow them to be overridden?

Member

That makes sense in a way. But technically we can also overwrite Java/Scala data sources once we allow runtime registration, and my concern is that it might be somewhat inconsistent and difficult for end users to understand how overwriting works. One option is to always allow overwriting, or to disallow it entirely.

Member

So I was thinking that statically registered ones cannot be overwritten, but I'm not sure which way is better.

Contributor Author

I agree we should keep it consistent by blocking all statically registered data sources (we can change this based on user feedback in the future).
Currently, it appears we can't differentiate between a statically registered Python data source and a dynamic one. Perhaps we could add a flag in the dataSourceBuilders to indicate whether it's static or dynamic.
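
One way to model such a flag, sketched in plain Python (the `DataSourceBuilder` dataclass and `register` function here are hypothetical illustrations, not the actual `dataSourceBuilders` structure in Spark):

```python
# Hypothetical sketch: tag each registered builder as static or dynamic,
# and only allow dynamic (runtime-registered) builders to be overwritten.
from dataclasses import dataclass
from typing import Callable

@dataclass
class DataSourceBuilder:
    name: str
    builder: Callable
    is_static: bool  # statically registered builders cannot be overwritten

builders: dict[str, DataSourceBuilder] = {}


def register(name: str, builder: Callable, *, static: bool = False) -> None:
    existing = builders.get(name)
    if existing is not None and existing.is_static:
        raise ValueError(
            f"Cannot overwrite statically registered data source '{name}'")
    builders[name] = DataSourceBuilder(name, builder, static)
```

This keeps the behavior the thread converged on: static registrations are blocked from being overridden, while dynamic ones can be replaced freely.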

Member

I am fine with this for now, but we should probably make sure that change happens before Spark 4.0.

Contributor Author

Sounds good! Will add a TODO here.

@allisonwang-db force-pushed the spark-46522-check-name branch from 634dbd0 to 03e1c59 on January 3, 2024 at 11:21
@HyukjinKwon
Member

Merged to master.
