[SPARK-46522][PYTHON] Block Python data source registration with name conflicts #44507
Conversation
I feel like we should probably scope this into runtime datasources (can be overwritten), and static datasources (cannot be overwritten). This change should probably be fine for now though.
Another concern is what happens when the name conflicts with statically registered sources.
The rule I am considering is:
- If the name of any user-defined data source conflicts with a built-in data source or an existing Java/Scala data source that can be loaded from the classpath, an exception will be thrown.
- Otherwise, data sources can be overridden.
For static Python data sources, should we allow them to be overridden?
That makes sense in a way. But technically we can also overwrite Java/Scala data sources once we allow runtime registration, and my concern is that it might be somewhat inconsistent and difficult for end users to understand how overwriting works. One option is to always allow overwriting, or to disallow it entirely.
So I was thinking that statically registered ones cannot be overwritten, but I'm not sure which way is better.
I agree we should keep it consistent by blocking all statically registered data sources (we can change this based on user feedback in the future).
Currently, it appears we can't differentiate between a statically registered Python data source and a dynamic one. Perhaps we could add a flag in the dataSourceBuilders to indicate whether it's static or dynamic.
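The flag suggested above could look something like the following plain-Python sketch. The `RegisteredSource` dataclass and `is_static` field are hypothetical illustrations of the idea, not Spark's actual `dataSourceBuilders` structure:

```python
# Hypothetical sketch: tag each registered builder as static or dynamic so
# that statically registered sources cannot be overridden. Names are
# illustrative and do not match Spark's internals.
from dataclasses import dataclass
from typing import Callable

@dataclass
class RegisteredSource:
    builder: Callable
    is_static: bool  # True for statically registered sources

builders = {}

def register(name, builder, is_static=False):
    existing = builders.get(name)
    if existing is not None and existing.is_static:
        raise ValueError(
            f"Cannot override statically registered data source '{name}'")
    builders[name] = RegisteredSource(builder, is_static)

register("static_src", object, is_static=True)
register("dynamic_src", object)
register("dynamic_src", object)  # fine: dynamic sources can be overridden
```

With this shape, the registration path only needs one extra check to enforce the "static sources are blocked" rule consistently.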
I am fine with this for now, but we should probably make sure that change happens before Spark 4.0.
Sounds good! Will add a TODO here.
sql/core/src/test/scala/org/apache/spark/sql/execution/python/PythonDataSourceSuite.scala
sql/core/src/main/scala/org/apache/spark/sql/DataSourceRegistration.scala
Merged to master.
What changes were proposed in this pull request?
This PR prevents the registration of a Python data source if its name conflicts with either a built-in data source or a loadable custom Java/Scala data source.
Why are the changes needed?
To improve usability. For example, currently, users can register a data source whose name already exists, but they are then unable to use it.
Does this PR introduce any user-facing change?
No
How was this patch tested?
New unit tests
Was this patch authored or co-authored using generative AI tooling?
No