@HyukjinKwon HyukjinKwon commented Dec 27, 2023

What changes were proposed in this pull request?

This PR proposes to add the support of automatic Python Data Source registration.

End user perspective:

# Assume that `customsource` defines the short name `custom`
pip install pyspark_customsource

Users can then directly use the Python Data Source:

df = spark.read.format("custom").load()

Developer perspective:

The package should follow the structure below:

  • The package name should start with the pyspark_ prefix
  • A class pyspark_*.DefaultSource has to be defined that inherits pyspark.sql.datasource.DataSource

For example:

pyspark_customsource
├── __init__.py
 ...

__init__.py:

from pyspark.sql.datasource import DataSource

class DefaultSource(DataSource):
    pass
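
For illustration, here is a minimal sketch of how such prefix-based discovery could work. discover_data_sources is a hypothetical helper, not Spark's actual lookup code; it assumes only the pyspark_ prefix and DefaultSource conventions described above:

# Hypothetical sketch: scan installed top-level packages for the pyspark_
# prefix and collect every DefaultSource class that inherits DataSource.
import importlib
import pkgutil

from pyspark.sql.datasource import DataSource

def discover_data_sources():
    sources = []
    for module_info in pkgutil.iter_modules():
        if not module_info.name.startswith("pyspark_"):
            continue
        module = importlib.import_module(module_info.name)
        cls = getattr(module, "DefaultSource", None)
        if isinstance(cls, type) and issubclass(cls, DataSource):
            sources.append(cls)
    return sources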

Why are the changes needed?

This allows developers to release and maintain their third-party Python Data Sources separately (e.g., on PyPI), and end users can install a Python Data Source without doing anything other than pip install pyspark_their_source.

Does this PR introduce any user-facing change?

Yes, this allows users to pip install pyspark_custom_source and have it automatically registered as a Data Source available in Spark.

How was this patch tested?

Unit tests were added.

Also manually tested as below:

rm -fr pyspark_mysource
mkdir pyspark_mysource
cd pyspark_mysource
echo '
from pyspark.sql.datasource import DataSource, DataSourceReader, InputPartition

class TestDataSourceReader(DataSourceReader):
    def __init__(self, options):
        self.options = options
    def partitions(self):
        return [InputPartition(i) for i in range(3)]
    def read(self, partition):
        yield partition.value, str(partition.value)


class DefaultSource(DataSource):
    @classmethod
    def name(cls):
        # Short name used with spark.read.format(...)
        return "mysource"
    def schema(self):
        return "x INT, y STRING"
    def reader(self, schema) -> "DataSourceReader":
        return TestDataSourceReader(self.options)
' > __init__.py
cd ..
./bin/pyspark
spark.read.format("mysource").load().show()
+---+---+
|  x|  y|
+---+---+
|  0|  0|
|  1|  1|
|  2|  2|
+---+---+

Was this patch authored or co-authored using generative AI tooling?

No.

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.

@HyukjinKwon HyukjinKwon marked this pull request as draft December 27, 2023 07:03
@HyukjinKwon

Let me actually add the test cases here together while I am here.

Comment on lines +170 to +171
val py4jPath = Paths.get(
sparkHome, "python", "lib", PythonUtils.PY4J_ZIP_NAME).toAbsolutePath
Do we need the Py4J path? The Python functions are not supposed to use Py4J, are they?


@HyukjinKwon

Merged to master.


panbingkun commented Dec 29, 2023

@HyukjinKwon
I have reverted #44504 (CommitID: 229a4eaf547e5c263c749bd53f7f9a89f4a9bea9).
Based on the current run results, the Run Spark on Kubernetes Integration test failure in GitHub Actions is related to this PR.

https://github.com/apache/spark/pull/44530/files
https://github.com/panbingkun/spark/actions/runs/7353125339/job/20018716583

@HyukjinKwon

Thanks, let me fix it up together in #44519.

zhengruifeng pushed a commit that referenced this pull request Dec 30, 2023
…ailable Data Sources

### What changes were proposed in this pull request?

This PR is a sort of followup of #44504 but addresses a separate issue. This PR proposes to check:
- if the Python executable exists when looking up available Python Data Sources.
- if the PySpark source and Py4J files exist, for the case where users don't have them on their machine (and don't use PySpark).

### Why are the changes needed?

For some OSes, such as Windows, or for minimized Docker containers, there is no Python installed, and the lookup would simply fail even when users want to use Scala only. We should check for the Python executable and skip the lookup if it does not exist.

### Does this PR introduce _any_ user-facing change?

No because the main change has not been released out yet.

### How was this patch tested?

Manually tested.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44519 from HyukjinKwon/SPARK-46530.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
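
The existence check described in the commit above can be illustrated with shutil.which; this is only a sketch of the idea (python_available is a hypothetical name), since the real check in #44519 runs on the JVM side:

# Illustrative sketch, not Spark's actual code.
import shutil

def python_available(executable: str = "python3") -> bool:
    # shutil.which returns None when the executable is not on PATH.
    return shutil.which(executable) is not None

if not python_available():
    print("No Python executable found; skipping Python Data Source lookup.")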
HyukjinKwon added a commit that referenced this pull request Jan 4, 2024
…file separator to correctly check PySpark library existence

### What changes were proposed in this pull request?

This PR is a followup of #44519 that fixes a mistake in separating the paths. It should use `File.pathSeparator`.

### Why are the changes needed?

The current code works in testing mode, but it doesn't work in production mode.

### Does this PR introduce _any_ user-facing change?

No, because the main change has not been released.

### How was this patch tested?

Manually as described in "How was this patch tested?" at #44504.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44590 from HyukjinKwon/SPARK-46530-followup.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
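
The separator mix-up fixed above has a direct Python analogue: os.sep separates components within a single path, while os.pathsep (the counterpart of the JVM's File.pathSeparator) separates entries in a path list such as PYTHONPATH. A short illustrative sketch, with hypothetical paths:

import os

# Hypothetical PySpark locations, for illustration only.
pyspark_paths = ["/opt/spark/python", "/opt/spark/python/lib/py4j-src.zip"]

# Join independent search-path entries with os.pathsep (':' or ';'),
# not os.sep, which would produce one bogus file path.
os.environ["PYTHONPATH"] = os.pathsep.join(pyspark_paths)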
@HyukjinKwon HyukjinKwon deleted the SPARK-45917 branch January 15, 2024 00:47