[SPARK-49505][SQL] Create new SQL functions "randstr" and "uniform" to generate random strings or numbers within ranges #48004

dtenedor · 2024-09-05T22:46:11Z

What changes were proposed in this pull request?

This PR introduces two new SQL functions "randstr" and "uniform" to generate random strings or numbers within ranges.

The "randstr" function returns a string of the specified length whose characters are chosen uniformly at random from the following pool of characters: 0-9, a-z, A-Z. The random seed is optional. The string length must be a constant two-byte or four-byte integer (SMALLINT or INT, respectively).
The "uniform" function returns a random value drawn independently and identically distributed (i.i.d.) from the specified range of numbers. The random seed is optional. The provided numbers specifying the minimum and maximum values of the range must be constant. If both of these numbers are integers, then the result will also be an integer. Otherwise, if one or both of these are floating-point numbers, then the result will also be a floating-point number.
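To make the two descriptions above concrete, here is a minimal Python sketch of the described semantics. It is not Spark's implementation: the names `randstr_sketch`, `uniform_sketch`, and `POOL` are hypothetical, the generator is Python's `random.Random`, and treating the integer range as inclusive at both ends is an assumption that the description does not confirm. Outputs will not match the Spark examples.

```python
import random
import string

# Character pool named in the description: 0-9, a-z, A-Z (62 characters).
POOL = string.digits + string.ascii_lowercase + string.ascii_uppercase


def randstr_sketch(length, seed=None):
    """Return `length` characters drawn uniformly at random from POOL."""
    rng = random.Random(seed)  # seed=None gives an unseeded generator
    return "".join(rng.choice(POOL) for _ in range(length))


def uniform_sketch(low, high, seed=None):
    """Return an int if both bounds are ints, otherwise a float."""
    rng = random.Random(seed)
    if isinstance(low, int) and isinstance(high, int):
        # Integer bounds give an integer result.
        # Inclusive upper bound is an assumption of this sketch.
        return rng.randint(low, high)
    # At least one floating-point bound gives a floating-point result.
    return rng.uniform(float(low), float(high))
```

Calling `randstr_sketch(5)` returns a five-character alphanumeric string, and `uniform_sketch(10, 20.0)` returns a float because one bound is floating-point.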

For example:

SELECT randstr(5);
> ceV0P

SELECT randstr(10, 0) FROM VALUES (0), (1), (2) tab(col);
> ceV0PXaR2I
  fYxVfArnv7
  iSIv0VT2XL

SELECT uniform(10, 20.0F);
> 17.604954

SELECT uniform(10, 20, 0) FROM VALUES (0), (1), (2) tab(col);
> 15
  16
  17
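
In the seeded examples above, a single seed still yields a different value on each row. One way to read this is that the seed fixes a reproducible sequence rather than a single value. The Python sketch below is a toy model of that reading only; `seeded_column` is a hypothetical helper, and how Spark actually derives per-row values from the seed is not described in this PR text.

```python
import random


def seeded_column(num_rows, low, high, seed):
    """Produce one integer per row from a single seeded generator."""
    rng = random.Random(seed)
    return [rng.randint(low, high) for _ in range(num_rows)]


# Two runs with the same seed produce the same per-row sequence.
first_run = seeded_column(3, 10, 20, seed=0)
second_run = seeded_column(3, 10, 20, seed=0)
print(first_run == second_run)  # True: same seed, same sequence
```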

Why are the changes needed?

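This improves the SQL functionality of Apache Spark and improves its parity with other systems:

* https://clickhouse.com/docs/en/sql-reference/functions/random-functions#randuniform
* https://docs.snowflake.com/en/sql-reference/functions/uniform
* https://www.microfocus.com/documentation/silk-test/21.0.2/en/silktestclassic-help-en/STCLASSIC-8BFE8661-RANDSTRFUNCTION-REF.html
* https://docs.snowflake.com/en/sql-reference/functions/randstr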

Does this PR introduce any user-facing change?

Yes, see above.

How was this patch tested?

This PR adds golden file based test coverage.

Was this patch authored or co-authored using generative AI tooling?

Not this time.

dtenedor · 2024-09-05T22:47:32Z

cc @HyukjinKwon @MaxGekk here are a couple more :)

dtenedor

Thanks @MaxGekk for your reviews. I have responded to your comments; please take another look.

MaxGekk

Test the non-codegen implementation.

dtenedor

Thanks again @MaxGekk for the thorough reviews; they are helping the new functions cover all corner cases as carefully as possible.

MaxGekk

LGTM in general. @dtenedor Could you fix the test failure:

[info] - SPARK-49505: Test the RANDSTR and UNIFORM SQL functions without codegen *** FAILED *** (1 millisecond)
[info]   Exception evaluating randstr(10, 0) (ExpressionEvalHelper.scala:257)
...
[info]   Cause: java.lang.ClassCastException: class java.lang.Long cannot be cast to class java.lang.Integer (java.lang.Long and java.lang.Integer are in module java.base of loader 'bootstrap')
[info]   at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:99)
[info]   at org.apache.spark.sql.catalyst.expressions.RandStr.evalInternal(randomExpressions.scala:377)

MaxGekk

> and improves its parity with other systems.

@dtenedor Could you provide a few links to the systems with which you are aiming for parity?

dtenedor · 2024-09-14T00:19:58Z

> @dtenedor Could you provide a few links to the systems with which you are aiming for parity?

dtenedor · 2024-09-14T00:24:10Z

> LGTM in general. @dtenedor Could you fix the test failure

👍 this is done

dtenedor

Thanks again @MaxGekk for your reviews

MaxGekk

Waiting for CI.

@dtenedor Please update the PR description (examples and so on) to reflect your recent changes.

MaxGekk · 2024-09-17T07:11:49Z

@dtenedor Could you fix the test failure, please? It seems to be related to your changes.

[info]   Cause: org.scalatest.exceptions.TestFailedException: Function 'randstr', Expression class 'org.apache.spark.sql.catalyst.expressions.RandStr' "[ceV]" did not equal "[8i7]"

MaxGekk · 2024-09-17T12:46:55Z

+1, LGTM. Merging to master.
Thank you, @dtenedor and @HyukjinKwon for review.

dtenedor · 2024-09-17T13:23:23Z

Hooray
thanks again for the reviews everyone

… 'uniform' SQL functions

### What changes were proposed in this pull request?

In #48004 we added new SQL functions `randstr` and `uniform`. This PR adds DataFrame API support for them. For example, in Scala:

```
sql("create table t(col int not null) using csv")
sql("insert into t values (0)")
val df = sql("select col from t")
df.select(randstr(lit(5), lit(0)).alias("x")).select(length(col("x")))
> 5

df.select(uniform(lit(10), lit(20), lit(0)).alias("x")).selectExpr("x > 5")
> true
```

### Why are the changes needed?

This improves DataFrame parity with the SQL API.

### Does this PR introduce _any_ user-facing change?

Yes, see above.

### How was this patch tested?

This PR adds unit test coverage.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #48143 from dtenedor/dataframes-uniform-randstr.

Authored-by: Daniel Tenedorio <daniel.tenedorio@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>

…o generate random strings or numbers within ranges

### What changes were proposed in this pull request?

This PR introduces two new SQL functions "randstr" and "uniform" to generate random strings or numbers within ranges.

* The "randstr" function returns a string of the specified length whose characters are chosen uniformly at random from the following pool of characters: 0-9, a-z, A-Z. The random seed is optional. The string length must be a constant two-byte or four-byte integer (SMALLINT or INT, respectively).
* The "uniform" function returns a random value with independent and identically distributed values with the specified range of numbers. The random seed is optional. The provided numbers specifying the minimum and maximum values of the range must be constant. If both of these numbers are integers, then the result will also be an integer. Otherwise if one or both of these are floating-point numbers, then the result will also be a floating-point number.

For example:

```
SELECT randstr(5);
> ceV0P

SELECT randstr(10, 0) FROM VALUES (0), (1), (2) tab(col);
> ceV0PXaR2I
  fYxVfArnv7
  iSIv0VT2XL

SELECT uniform(10, 20.0F);
> 17.604954

SELECT uniform(10, 20, 0) FROM VALUES (0), (1), (2) tab(col);
> 15
  16
  17
```

### Why are the changes needed?

This improves the SQL functionality of Apache Spark and improves its parity with other systems:

* https://clickhouse.com/docs/en/sql-reference/functions/random-functions#randuniform
* https://docs.snowflake.com/en/sql-reference/functions/uniform
* https://www.microfocus.com/documentation/silk-test/21.0.2/en/silktestclassic-help-en/STCLASSIC-8BFE8661-RANDSTRFUNCTION-REF.html
* https://docs.snowflake.com/en/sql-reference/functions/randstr

### Does this PR introduce _any_ user-facing change?

Yes, see above.

### How was this patch tested?

This PR adds golden file based test coverage.

### Was this patch authored or co-authored using generative AI tooling?

Not this time.

Closes apache#48004 from dtenedor/uniform-randstr-functions.

Authored-by: Daniel Tenedorio <daniel.tenedorio@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>

commit

d9223e5

commit uniform expression commit commit commit

github-actions bot added the SQL label Sep 5, 2024

dtenedor commented Sep 5, 2024

HyukjinKwon reviewed Sep 6, 2024

MaxGekk requested changes Sep 6, 2024

dtenedor added 2 commits September 9, 2024 10:53

respond to code review comments

c0a2551

respond to code review comments

f6ffde0

dtenedor requested review from MaxGekk and HyukjinKwon September 9, 2024 20:19

dtenedor added 6 commits September 9, 2024 13:26

respond to code review comments

c22fef2

fix test

c78d8f0

fix test

5b6194e

fix test

da390f9

commit

836ec8e

commit

commit

4b41b34

MaxGekk reviewed Sep 10, 2024

MaxGekk reviewed Sep 11, 2024

respond to code review comments

54a1a2e

dtenedor commented Sep 11, 2024

dtenedor requested a review from MaxGekk September 11, 2024 18:29

dtenedor added 3 commits September 11, 2024 15:26

fix function description

c372075

fix test

515b046

fix test

3504124

MaxGekk reviewed Sep 12, 2024

dtenedor requested a review from MaxGekk September 12, 2024 22:04

respond to code review comments

0683318

respond to code review comments respond to code review comments respond to code review comments

dtenedor force-pushed the uniform-randstr-functions branch from c3ff40b to 0683318 Compare September 13, 2024 03:18

MaxGekk requested changes Sep 13, 2024

respond to code review comments

7ec25a8

dtenedor commented Sep 13, 2024

dtenedor requested a review from MaxGekk September 13, 2024 18:23

MaxGekk reviewed Sep 13, 2024

fix RandomSuite

1f5e866

dtenedor requested a review from MaxGekk September 14, 2024 00:24

MaxGekk requested changes Sep 14, 2024

respond to code review comments

8b64f33

dtenedor commented Sep 16, 2024

dtenedor requested a review from MaxGekk September 16, 2024 14:25

MaxGekk approved these changes Sep 16, 2024

fix test

6881f9b

MaxGekk closed this in 6393afa Sep 17, 2024

pxLi mentioned this pull request Sep 18, 2024

[BUG] spark400 build failed do not conform to class UnaryExprMeta's type parameter NVIDIA/spark-rapids#11479

Closed

dtenedor mentioned this pull request Sep 18, 2024

[SPARK-49552][PYTHON] Add DataFrame API support for new 'randstr' and 'uniform' SQL functions #48143

Closed

abellina mentioned this pull request Sep 23, 2024

Use UnaryLike instead of UnaryExpression [databricks] NVIDIA/spark-rapids#11490

Merged
