[SPARK-45917][PYTHON][SQL] Automatic registration of Python Data Source on startup #44504
Conversation
dongjoon-hyun left a comment:
+1, LGTM.
Let me actually add the test cases here as well while I am at it.
Resolved review thread on sql/core/src/main/scala/org/apache/spark/sql/execution/python/UserDefinedPythonDataSource.scala (outdated).
```scala
val py4jPath = Paths.get(
  sparkHome, "python", "lib", PythonUtils.PY4J_ZIP_NAME).toAbsolutePath
```
Do we need the Py4J path? The Python functions are not supposed to use Py4J, are they?
We don't use it directly, but we need it when we import PySpark, e.g., https://github.com/apache/spark/blob/master/python/pyspark/__init__.py#L53 -> https://github.com/apache/spark/blob/master/python/pyspark/conf.py#L23, which happens when we import a Python Data Source.
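In other words, merely importing `pyspark` pulls in `py4j`, so the zip must be importable even if the Data Source itself never touches the JVM. A minimal sketch of why the path is needed (the exact zip file name is illustrative and varies by Spark version):

```python
import os
import sys

# The bundled Py4J zip has to be on sys.path before `import pyspark` works;
# the zip name below is illustrative.
spark_home = os.environ["SPARK_HOME"]
sys.path.insert(
    0, os.path.join(spark_home, "python", "lib", "py4j-0.10.9.7-src.zip"))

import pyspark  # pyspark/__init__.py imports pyspark.conf, which imports py4j
```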
Resolved review thread on sql/core/src/test/scala/org/apache/spark/sql/execution/python/PythonDataSourceSuite.scala (outdated).
Merged to master.
@HyukjinKwon https://github.com/apache/spark/pull/44530/files
Thanks, let me fix it up together at #44519.
…available Data Sources

### What changes were proposed in this pull request?

This PR is a sort of followup of #44504 but addresses a separate issue. It proposes to check:

- if a Python executable exists when looking up available Python Data Sources;
- if the PySpark source and Py4J files exist,

for the case where users don't have them on their machine (and don't use PySpark).

### Why are the changes needed?

On some OSes such as Windows, or in minimized Docker containers, there is no Python installed, and the lookup will simply fail even when users only want to use Scala. We should check for the Python executable and skip the lookup if it does not exist. A sketch of such a guard follows below.

### Does this PR introduce _any_ user-facing change?

No, because the main change has not been released yet.

### How was this patch tested?

Manually tested.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44519 from HyukjinKwon/SPARK-46530.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
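A minimal sketch of that guard, written in Python for illustration (the actual check in the PR lives on the Scala side; the function and executable names here are hypothetical):

```python
import shutil


def lookup_python_data_sources():
    """Hypothetical guard mirroring the behavior described above."""
    # Skip the lookup entirely when no Python executable is available,
    # e.g. on Windows hosts or in minimized Docker images (Scala-only use).
    if shutil.which("python3") is None:
        return []
    # Otherwise proceed with the normal lookup of pyspark_* packages.
    ...
```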
…file separator to correctly check PySpark library existence

### What changes were proposed in this pull request?

This PR is a followup of #44519 that fixes a mistake in how the paths were separated. It should use `Files.pathSeparator`.

### Why are the changes needed?

It works in testing mode, but otherwise it does not work in production mode.

### Does this PR introduce _any_ user-facing change?

No, because the main change has not been released.

### How was this patch tested?

Manually, as described in "How was this patch tested?" at #44504.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44590 from HyukjinKwon/SPARK-46530-followup.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
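For illustration, the two separators that are easy to confuse here, sketched in Python (the fix itself is in Scala; the paths are made up):

```python
import os

# os.sep     separates components *within* a single path ("/" vs "\\")
# os.pathsep separates *multiple* paths in a search list (":" vs ";")
pyspark_init = os.sep.join(["python", "pyspark", "__init__.py"])
search_list = os.pathsep.join(["/opt/spark/python", "/opt/spark/python/lib"])
print(pyspark_init, search_list)
```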
### What changes were proposed in this pull request?

This PR proposes to add support for automatic registration of Python Data Sources.
End user perspective:
```bash
# Assume that `customsource` defines its short name as `custom`
pip install pyspark_customsource
```

Users can then directly use the Python Data Source:
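A minimal sketch of that direct usage, assuming the illustrative `custom` short name above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The installed package's DefaultSource was registered automatically on
# startup, so its short name works like any built-in format.
df = spark.read.format("custom").load()
df.show()
```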
Developer perspective:
The packages should follow the structure below:

- The package name has the `pyspark_` prefix.
- `pyspark_*.DefaultSource` has to be defined, inheriting `pyspark.sql.datasource.DataSource`.

For example, `__init__.py`:
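A minimal sketch of such an `__init__.py` (the reader logic, schema, and the `custom` short name are illustrative, not from this PR):

```python
# pyspark_customsource/__init__.py
from pyspark.sql.datasource import DataSource, DataSourceReader


class CustomReader(DataSourceReader):
    def read(self, partition):
        # Yield rows as tuples matching the declared schema.
        yield (0, "example")


class DefaultSource(DataSource):
    """Discovered automatically because the package name starts with
    `pyspark_` and exposes a `DefaultSource` inheriting `DataSource`."""

    @classmethod
    def name(cls):
        return "custom"  # short name users pass to spark.read.format(...)

    def schema(self):
        return "id INT, value STRING"

    def reader(self, schema):
        return CustomReader()
```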
### Why are the changes needed?
This allows developers to release and maintain their third-party Python Data Sources separately (e.g., on PyPI), and end users can install a Python Data Source without doing anything more than:

```bash
pip install pyspark_their_source
```

### Does this PR introduce _any_ user-facing change?
Yes, this allows users to run

```bash
pip install pyspark_custom_source
```

and have it automatically registered as an available Data Source in Spark.

### How was this patch tested?
Unit tests were added.

Also tested manually, roughly as sketched below:
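A minimal sketch of such a manual test, assuming the illustrative `pyspark_customsource` package from above has been pip-installed:

```python
from pyspark.sql import SparkSession

# Verify the source resolves with no explicit registration step.
spark = SparkSession.builder.getOrCreate()
assert spark.read.format("custom").load().count() >= 0
```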
### Was this patch authored or co-authored using generative AI tooling?
No.