
[SPARK-49552][PYTHON] Add DataFrame API support for new 'randstr' and 'uniform' SQL functions #48143

Closed (wants to merge 29 commits)

Conversation
Conversation

dtenedor (Contributor) commented Sep 18, 2024

What changes were proposed in this pull request?

In #48004 we added new SQL functions `randstr` and `uniform`. This PR adds DataFrame API support for them.

For example, in Scala:

sql("create table t(col int not null) using csv")
sql("insert into t values (0)")
val df = sql("select col from t")
df.select(randstr(lit(5), lit(0)).alias("x")).select(length(col("x")))
> 5

df.select(uniform(lit(10), lit(20), lit(0)).alias("x")).selectExpr("x > 5")
> true
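And the same calls through the new PySpark API (a minimal sketch; assumes a Spark build that includes this change and an existing `spark` session):

```
from pyspark.sql.functions import col, length, lit, randstr, uniform

df = spark.sql("select 0 as col")

# randstr(length, seed) produces a random string of the given length.
df.select(randstr(lit(5), lit(0)).alias("x")).select(length(col("x"))).show()
# 5

# uniform(min, max, seed) produces a random number between min and max.
df.select(uniform(lit(10), lit(20), lit(0)).alias("x")).selectExpr("x > 5").show()
# true
```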

Why are the changes needed?

This improves DataFrame parity with the SQL API.

Does this PR introduce _any_ user-facing change?

Yes, see above.

How was this patch tested?

This PR adds unit test coverage.

Was this patch authored or co-authored using generative AI tooling?

No.

@dtenedor dtenedor changed the title [WIP][SPARK-49552][Python] Add DataFrame API support for new 'randstr' and 'uniform' SQL functions [SPARK-49552][Python] Add DataFrame API support for new 'randstr' and 'uniform' SQL functions Sep 18, 2024
@dtenedor dtenedor marked this pull request as ready for review September 18, 2024 15:59
dtenedor (Contributor, Author) commented:

cc @HyukjinKwon @MaxGekk here is the DataFrame support for the new randstr and uniform functions :)

@HyukjinKwon HyukjinKwon changed the title [SPARK-49552][Python] Add DataFrame API support for new 'randstr' and 'uniform' SQL functions [SPARK-49552][PYTHON] Add DataFrame API support for new 'randstr' and 'uniform' SQL functions Sep 19, 2024
dtenedor (Contributor, Author) left a comment:

Thanks @zhengruifeng for your review! Responded to your comments, please take another look.

```
+------+
| ceV0P|
+------+
```

Member commented:

nit: we normally don't include an empty line at the end of the docstring

dtenedor (Contributor, Author) replied:

Sounds good, this is done.

```
+------+
|     7|
+------+
```

Member commented:

ditto

dtenedor (Contributor, Author) replied:

Sounds good, this is done.

zhengruifeng (Contributor) left a comment:

```
) -> Column:
    if seed is None:
        return _invoke_function_over_columns(
            "uniform", min, max, lit(random.randint(0, sys.maxsize))
```
zhengruifeng (Contributor) commented:

_invoke_function_over_columns requires arguments to be columns or column names.

Suggested change:

```
-            "uniform", min, max, lit(random.randint(0, sys.maxsize))
+            "uniform", lit(min), lit(max), lit(random.randint(0, sys.maxsize))
```

dtenedor (Contributor, Author) replied:

Thanks, this is done.

"uniform", min, max, lit(random.randint(0, sys.maxsize))
)
else:
return _invoke_function_over_columns("uniform", min, max, seed)
zhengruifeng (Contributor) commented:

Suggested change:

```
-        return _invoke_function_over_columns("uniform", min, max, seed)
+        return _invoke_function_over_columns("uniform", lit(min), lit(max), lit(seed))
```

dtenedor (Contributor, Author) replied:

Thanks, this is done.
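Combining both suggestions, the helper ends up shaped roughly like this (a sketch as it would read inside python/pyspark/sql/connect/functions/builtin.py, where `lit` and the pyspark-internal `_invoke_function_over_columns` are already in scope; the parameter annotations are assumptions):

```
import random
import sys
from typing import Optional, Union


def uniform(
    min: Union[Column, int, float],
    max: Union[Column, int, float],
    seed: Optional[Union[Column, int]] = None,
) -> Column:
    # lit() accepts literals as well as Columns, so wrapping every argument
    # satisfies _invoke_function_over_columns, which expects columns or names.
    if seed is None:
        # No caller-supplied seed: draw one now so the plan is deterministic.
        return _invoke_function_over_columns(
            "uniform", lit(min), lit(max), lit(random.randint(0, sys.maxsize))
        )
    else:
        return _invoke_function_over_columns("uniform", lit(min), lit(max), lit(seed))
```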

```
@@ -2578,6 +2594,16 @@ def regexp_like(str: "ColumnOrName", regexp: "ColumnOrName") -> Column:
regexp_like.__doc__ = pysparkfuncs.regexp_like.__doc__


def randstr(length: Union[Column, int], seed: Optional[Union[Column, int]] = None) -> Column:
    if seed is None:
        return _invoke_function_over_columns("randstr", length, lit(random.randint(0, sys.maxsize)))
```
zhengruifeng (Contributor) commented:

Suggested change:

```
-        return _invoke_function_over_columns("randstr", length, lit(random.randint(0, sys.maxsize)))
+        return _invoke_function_over_columns("randstr", lit(length), lit(random.randint(0, sys.maxsize)))
```

dtenedor (Contributor, Author) replied:

Thanks, this is done.

```
    if seed is None:
        return _invoke_function_over_columns("randstr", length, lit(random.randint(0, sys.maxsize)))
    else:
        return _invoke_function_over_columns("randstr", length, seed)
```
zhengruifeng (Contributor) commented:

Suggested change:

```
-        return _invoke_function_over_columns("randstr", length, seed)
+        return _invoke_function_over_columns("randstr", lit(length), lit(seed))
```

dtenedor (Contributor, Author) replied:

Thanks, this is done.
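randstr follows the same pattern after both suggestions (same sketch caveats as above; the signature is the one visible in the diff):

```
def randstr(length: Union[Column, int], seed: Optional[Union[Column, int]] = None) -> Column:
    if seed is None:
        # Fix a concrete seed up front, mirroring the uniform helper above.
        return _invoke_function_over_columns(
            "randstr", lit(length), lit(random.randint(0, sys.maxsize))
        )
    else:
        return _invoke_function_over_columns("randstr", lit(length), lit(seed))
```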

```
    +--------------------+
    """
    length = _enum_to_value(length)
    length = lit(length) if isinstance(length, int) else length
```
zhengruifeng (Contributor) commented:

nit: `lit` accepts both literals and Columns

dtenedor (Contributor, Author) replied:

Thanks, updated.
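In other words, the isinstance guard can collapse into a single call (a sketch; `lit` returns an existing Column unchanged):

```
    length = _enum_to_value(length)
    length = lit(length)  # handles both int literals and Columns
```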

dtenedor (Contributor, Author) left a comment:

Thanks @zhengruifeng for your reviews! Responded to your comments, hopefully the linter passes now.


zhengruifeng (Contributor) commented:

thanks, merged to master

attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
[SPARK-49552][PYTHON] Add DataFrame API support for new 'randstr' and 'uniform' SQL functions

Closes apache#48143 from dtenedor/dataframes-uniform-randstr.

Authored-by: Daniel Tenedorio <daniel.tenedorio@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024
[SPARK-49552][PYTHON] Add DataFrame API support for new 'randstr' and 'uniform' SQL functions
HyukjinKwon pushed a commit that referenced this pull request Oct 22, 2024
…inistic in Scala Client

### What changes were proposed in this pull request?
Make 'randstr' and 'uniform' deterministic in Scala Client

### Why are the changes needed?
We need to explicitly set the seed in connect clients, to avoid making the output dataframe non-deterministic (see 14ba4fc)

When reviewing #48143, I requested the author to set the seed in the Python client.
But at that time, I was not aware that the Spark Connect Scala Client was reusing the same `functions.scala` under `org.apache.spark.sql`. (There were two different files before.)

So the two functions may cause non-deterministic issues like:
```
scala> val df = spark.range(10).select(randstr(lit(10)).as("r"))
Using Spark's default log4j profile: org/apache/spark/log4j2-pattern-layout-defaults.properties
df: org.apache.spark.sql.package.DataFrame = [r: string]

scala> df.show()
+----------+
|         r|
+----------+
|5bhIk72PJa|
|tuhC50Di38|
|PxwfWzdT3X|
|sWkmSyWboh|
|uZMS4htmM0|
|YMxMwY5wdQ|
|JDaWSiBwDD|
|C7KQ20WE7t|
|IwSSqWOObg|
|jDF2Ndfy8q|
+----------+

scala> df.show()
+----------+
|         r|
+----------+
|fpnnoLJbOA|
|qerIKpYPif|
|PvliXYIALD|
|xK3fosAvOp|
|WK12kfkPXq|
|2UcdyAEbNm|
|HEkl4rMtV1|
|PCaH4YJuYo|
|JuuXEHSp5i|
|jSLjl8ug8S|
+----------+
```

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
after this fix:
```
scala> val df = spark.range(10).select(randstr(lit(10)).as("r"))
df: org.apache.spark.sql.package.DataFrame = [r: string]

scala> df.show()
+----------+
|         r|
+----------+
|Gri9B9X8zI|
|gfhpGD8PcV|
|FDaXofTzlN|
|p7ciOScWpu|
|QZiEbF5q7c|
|9IhRoXmTUM|
|TeSEG1EKSN|
|B7nLw5iedL|
|uFZo1WPLPT|
|46E2LVCxxl|
+----------+

scala> df.show()
+----------+
|         r|
+----------+
|Gri9B9X8zI|
|gfhpGD8PcV|
|FDaXofTzlN|
|p7ciOScWpu|
|QZiEbF5q7c|
|9IhRoXmTUM|
|TeSEG1EKSN|
|B7nLw5iedL|
|uFZo1WPLPT|
|46E2LVCxxl|
+----------+
```

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #48558 from zhengruifeng/sql_rand_str_seed.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
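For intuition, the same seeding pattern expressed through the PySpark API this PR adds (a minimal sketch, assuming a build that includes both changes): the seed is drawn once at plan-construction time, so every re-execution of the plan agrees.

```
import random
import sys

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, randstr

spark = SparkSession.builder.getOrCreate()

# The client draws one concrete seed while building the plan; every action on
# df (show, collect, ...) then re-runs the plan with that same literal seed.
seed = lit(random.randint(0, sys.maxsize))
df = spark.range(10).select(randstr(lit(10), seed).alias("r"))

df.show()  # ten random-looking strings
df.show()  # the same ten strings again, because the seed is baked into the plan
```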
ericm-db pushed a commit to ericm-db/spark that referenced this pull request Oct 22, 2024
…inistic in Scala Client