[SPARK-46765][PYTHON][CONNECT] Make `shuffle` specify the datatype of `seed` #44793

zhengruifeng · 2024-01-19T03:38:56Z

What changes were proposed in this pull request?

Make `shuffle` in the Python Connect client explicitly specify the datatype of `seed` as Long, instead of relying on type inference.

Why are the changes needed?

The `shuffle` function may fail with an extremely low probability (~2e-10):

`shuffle` requires a Long-typed seed. It is an unregistered function, and the Long value is extracted in the Planner.

In the Scala client, `SparkClassUtils.random.nextLong` guarantees the type,
while in Python, `lit(random.randint(0, sys.maxsize))` may return a Literal Integer instead of a Literal Long when the random value happens to fit in a 32-bit integer.
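To make the ~2e-10 figure concrete, here is a minimal sketch in plain Python. It assumes the literal type is picked purely by integer width (32-bit values become Integer, wider values become Long); `inferred_literal_type` is a hypothetical stand-in written for illustration, not PySpark's actual code.

```python
import sys
from fractions import Fraction

INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def inferred_literal_type(value: int) -> str:
    """Hypothetical model of width-based type inference for a Python int."""
    if INT32_MIN <= value <= INT32_MAX:
        return "integer"
    if INT64_MIN <= value <= INT64_MAX:
        return "long"
    raise ValueError(f"integer {value} does not fit in 64 bits")


# random.randint(0, sys.maxsize) draws uniformly from sys.maxsize + 1 values,
# of which INT32_MAX + 1 would be inferred as "integer" under this model.
failure_probability = Fraction(INT32_MAX + 1, sys.maxsize + 1)

print(inferred_literal_type(12345))            # small seed -> "integer"
print(inferred_literal_type(123456789000000))  # large seed -> "long"
print(float(failure_probability))
```

On a 64-bit Python build, where `sys.maxsize` is 2^63 - 1, the ratio is 2^31 / 2^63 = 2^-32, roughly 2.3e-10, which matches the estimate above.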

In [26]: from pyspark.sql import functions as sf

In [27]: df = spark.createDataFrame([([1, 20, 3, 5],)], ['data'])

In [28]: df.select(sf.shuffle(df.data)).show()
+-------------+
|shuffle(data)|
+-------------+
|[1, 3, 5, 20]|
+-------------+


In [29]: df.select(sf.call_udf("shuffle", df.data, sf.lit(123456789000000))).show()
+-------------+
|shuffle(data)|
+-------------+
|[20, 1, 5, 3]|
+-------------+


In [30]: df.select(sf.call_udf("shuffle", df.data, sf.lit(12345))).show()
...
SparkConnectGrpcException: (org.apache.spark.sql.connect.common.InvalidPlanInput) seed should be a literal long, but got 12345

Another case is `uuid`, but it is not supported in Python due to namespace conflicts.
I did not find other similar cases.
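The idea behind the fix can be sketched with a toy model in plain Python. Everything here is hypothetical and only illustrates the approach: `Literal`, `infer_literal`, `long_literal`, and `planner_extract_seed` are illustration names, not PySpark or Spark Connect APIs. The sketch assumes the planner rejects any seed literal that is not tagged as long, as the error message in the session above suggests.

```python
from dataclasses import dataclass

INT32_MAX = 2**31 - 1


@dataclass(frozen=True)
class Literal:
    """Toy literal: a non-negative integer value tagged with a type name."""
    value: int
    data_type: str  # "integer" or "long"


def infer_literal(value: int) -> Literal:
    # Assumed pre-fix behavior: the type follows the width of the value.
    return Literal(value, "integer" if value <= INT32_MAX else "long")


def long_literal(value: int) -> Literal:
    # Assumed post-fix behavior: the type is pinned to long explicitly.
    return Literal(value, "long")


def planner_extract_seed(seed: Literal) -> int:
    # Toy version of the planner-side check on the seed argument.
    if seed.data_type != "long":
        raise ValueError(f"seed should be a literal long, but got {seed.value}")
    return seed.value


print(planner_extract_seed(long_literal(12345)))  # accepted: type is pinned

try:
    planner_extract_seed(infer_literal(12345))    # rejected: inferred integer
except ValueError as err:
    print(err)
```

Pinning the type at construction time removes the dependence on the random value's magnitude, so a small seed no longer changes how the literal is typed.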

Does this PR introduce any user-facing change?

no

How was this patch tested?

Manually checked.

Was this patch authored or co-authored using generative AI tooling?

no

zhengruifeng · 2024-01-19T03:42:06Z

ci: https://github.com/zhengruifeng/spark/actions/runs/7578671965/job/20641673985

zhengruifeng · 2024-01-19T06:12:18Z

thanks, merged to master

github-actions bot added SQL PYTHON CONNECT labels Jan 19, 2024

zhengruifeng requested a review from HyukjinKwon January 19, 2024 03:40

HyukjinKwon approved these changes Jan 19, 2024

zhengruifeng closed this in 36da27f Jan 19, 2024

zhengruifeng deleted the py_shuffle_long branch January 19, 2024 06:12
