@HyukjinKwon HyukjinKwon commented Dec 27, 2023

What changes were proposed in this pull request?

This PR proposes to add the support of automatic Python Data Source registration.

End user perspective:

# Assume that `customsource` defines the short name `custom`
pip install pyspark_customsource

Users can then directly use the Python Data Source:

df = spark.read.format("custom").load()

Developer perspective:

The package should follow the structure below:

  • The package name should start with the pyspark_ prefix
  • A class pyspark_*.DefaultSource has to be defined that inherits pyspark.sql.datasource.DataSource

For example:

pyspark_customsource
├── __init__.py
 ...

__init__.py:

from pyspark.sql.datasource import DataSource

class DefaultSource(DataSource):
    pass
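
For illustration, here is a minimal sketch of how such prefix-based discovery could work. discover_data_sources is a hypothetical helper, not Spark's actual lookup code; it assumes only the pyspark_ prefix and DefaultSource conventions described above:

# Hypothetical sketch: scan installed top-level packages for the pyspark_
# prefix and collect every DefaultSource class that inherits DataSource.
import importlib
import pkgutil

from pyspark.sql.datasource import DataSource

def discover_data_sources():
    sources = []
    for module_info in pkgutil.iter_modules():
        if not module_info.name.startswith("pyspark_"):
            continue
        module = importlib.import_module(module_info.name)
        cls = getattr(module, "DefaultSource", None)
        if isinstance(cls, type) and issubclass(cls, DataSource):
            sources.append(cls)
    return sources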

Why are the changes needed?

This allows developers to release and maintain their third-party Python Data Sources separately (e.g., on PyPI), and end users can install a Python Data Source without doing anything other than pip install pyspark_their_source.

Does this PR introduce any user-facing change?

Yes, this allows users to pip install pyspark_custom_source and have it automatically registered as a Data Source available in Spark.

How was this patch tested?

Unit tests were added.

Also manually tested as below:

rm -fr pyspark_mysource
mkdir pyspark_mysource
cd pyspark_mysource
echo '
from pyspark.sql.datasource import DataSource, DataSourceReader, InputPartition

class TestDataSourceReader(DataSourceReader):
    def __init__(self, options):
        self.options = options
    def partitions(self):
        return [InputPartition(i) for i in range(3)]
    def read(self, partition):
        yield partition.value, str(partition.value)


class DefaultSource(DataSource):
    @classmethod
    def name(cls):
        # Short name used with spark.read.format(...)
        return "mysource"
    def schema(self):
        return "x INT, y STRING"
    def reader(self, schema) -> "DataSourceReader":
        return TestDataSourceReader(self.options)
' > __init__.py
cd ..
./bin/pyspark
spark.read.format("mysource").load().show()
+---+---+
|  x|  y|
+---+---+
|  0|  0|
|  1|  1|
|  2|  2|
+---+---+

Was this patch authored or co-authored using generative AI tooling?

No.

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.

@HyukjinKwon HyukjinKwon marked this pull request as draft December 27, 2023 07:03
@HyukjinKwon

Let me actually add the test cases here together while I am here.

Comment on lines +170 to +171
val py4jPath = Paths.get(
sparkHome, "python", "lib", PythonUtils.PY4J_ZIP_NAME).toAbsolutePath
Do we need the Py4J path? The Python functions are not supposed to use Py4J, are they?


@HyukjinKwon

Merged to master.


panbingkun commented Dec 29, 2023

@HyukjinKwon
I have reverted #44504 (CommitID: 229a4eaf547e5c263c749bd53f7f9a89f4a9bea9).
Based on the current run results, the Run Spark on Kubernetes Integration test failure in GitHub Actions is related to this PR.

https://github.com/apache/spark/pull/44530/files
https://github.com/panbingkun/spark/actions/runs/7353125339/job/20018716583

@HyukjinKwon

Thanks, let me fix it up together in #44519.

zhengruifeng pushed a commit that referenced this pull request Dec 30, 2023
…ailable Data Sources

### What changes were proposed in this pull request?

This PR is a sort of followup of #44504 but addresses a separate issue. This PR proposes to check:
- if the Python executable exists when looking up available Python Data Sources.
- if the PySpark source and Py4J files exist, for the case where users don't have them on their machine (and don't use PySpark).

### Why are the changes needed?

For some OSes, such as Windows, or for minimized Docker containers, there is no Python installed, and the lookup would simply fail even when users want to use Scala only. We should check for the Python executable and skip the lookup if it does not exist.

### Does this PR introduce _any_ user-facing change?

No because the main change has not been released out yet.

### How was this patch tested?

Manually tested.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44519 from HyukjinKwon/SPARK-46530.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
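
The existence check described in the commit above can be illustrated with shutil.which; this is only a sketch of the idea (python_available is a hypothetical name), since the real check in #44519 runs on the JVM side:

# Illustrative sketch, not Spark's actual code.
import shutil

def python_available(executable: str = "python3") -> bool:
    # shutil.which returns None when the executable is not on PATH.
    return shutil.which(executable) is not None

if not python_available():
    print("No Python executable found; skipping Python Data Source lookup.")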
HyukjinKwon added a commit that referenced this pull request Jan 4, 2024
…file separator to correctly check PySpark library existence

### What changes were proposed in this pull request?

This PR is a followup of #44519 that fixes a mistake in separating the paths. It should use `File.pathSeparator`.

### Why are the changes needed?

The current code works in testing mode, but it doesn't work in production mode.

### Does this PR introduce _any_ user-facing change?

No, because the main change has not been released.

### How was this patch tested?

Manually as described in "How was this patch tested?" at #44504.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44590 from HyukjinKwon/SPARK-46530-followup.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
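
The separator mix-up fixed above has a direct Python analogue: os.sep separates components within a single path, while os.pathsep (the counterpart of the JVM's File.pathSeparator) separates entries in a path list such as PYTHONPATH. A short illustrative sketch, with hypothetical paths:

import os

# Hypothetical PySpark locations, for illustration only.
pyspark_paths = ["/opt/spark/python", "/opt/spark/python/lib/py4j-src.zip"]

# Join independent search-path entries with os.pathsep (':' or ';'),
# not os.sep, which would produce one bogus file path.
os.environ["PYTHONPATH"] = os.pathsep.join(pyspark_paths)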
@HyukjinKwon HyukjinKwon deleted the SPARK-45917 branch January 15, 2024 00:47