@zhengruifeng zhengruifeng commented Jan 19, 2024

What changes were proposed in this pull request?

Make shuffle explicitly specify the data type of its seed.

Why are the changes needed?

The shuffle function may fail with an extremely low probability (~2e-10):

shuffle requires a Long seed in an unregistered function call, and this Long value is extracted in the Planner.

In the Scala client, SparkClassUtils.random.nextLong guarantees the type,
while in Python, lit(random.randint(0, sys.maxsize)) may return a Literal Integer instead of a Literal Long when the random value happens to fit in 32 bits.
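The ~2e-10 figure can be checked with a small standalone sketch. It does not use PySpark: `inferred_literal_type` is a hypothetical stand-in that assumes the literal type is inferred from the value's range (32-bit values become Integer, larger ones become Long), and it assumes a 64-bit Python where `sys.maxsize` is 2**63 - 1.

```python
import sys

INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1


def inferred_literal_type(value: int) -> str:
    """Hypothetical stand-in for value-based literal type inference.

    Assumption: ints that fit in 32 bits become "integer", others "long".
    """
    return "integer" if INT32_MIN <= value <= INT32_MAX else "long"


# Seeds are drawn uniformly from [0, sys.maxsize]; only those in
# [0, INT32_MAX] would be inferred as Integer instead of Long.
p_int32 = (INT32_MAX + 1) / (sys.maxsize + 1)

print(inferred_literal_type(12345))            # small seed
print(inferred_literal_type(123456789000000))  # large seed
print(p_int32)
```

Under these assumptions the two seeds from the session below land on different inferred types, which matches the observed behavior: the large seed works, the small one is rejected by the Planner.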

In [26]: from pyspark.sql import functions as sf

In [27]: df = spark.createDataFrame([([1, 20, 3, 5],)], ['data'])

In [28]: df.select(sf.shuffle(df.data)).show()
+-------------+
|shuffle(data)|
+-------------+
|[1, 3, 5, 20]|
+-------------+


In [29]: df.select(sf.call_udf("shuffle", df.data, sf.lit(123456789000000))).show()
+-------------+
|shuffle(data)|
+-------------+
|[20, 1, 5, 3]|
+-------------+


In [30]: df.select(sf.call_udf("shuffle", df.data, sf.lit(12345))).show()
...
SparkConnectGrpcException: (org.apache.spark.sql.connect.common.InvalidPlanInput) seed should be a literal long, but got 12345

Another case is uuid, but it is not supported in the Python client due to namespace conflicts.
I did not find any other similar cases.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Manually checked.

Was this patch authored or co-authored using generative AI tooling?

No.

nit
@zhengruifeng

Thanks, merged to master.

@zhengruifeng zhengruifeng deleted the py_shuffle_long branch January 19, 2024 06:12