[SPARK-49505][SQL] Create new SQL functions "randstr" and "uniform" to generate random strings or numbers within ranges #48004
cc @HyukjinKwon @MaxGekk here are a couple more :)
Thanks @MaxGekk for your reviews. I have responded to your comments; please take another look.
respond to code review comments
c3ff40b to 0683318
Test the non-codegen implementation.
Thanks again @MaxGekk for the thorough reviews; they are helping the new functions cover all corner cases as carefully as possible.
LGTM in general. @dtenedor Could you fix the test failure:
[info] - SPARK-49505: Test the RANDSTR and UNIFORM SQL functions without codegen *** FAILED *** (1 millisecond)
[info] Exception evaluating randstr(10, 0) (ExpressionEvalHelper.scala:257)
...
[info] Cause: java.lang.ClassCastException: class java.lang.Long cannot be cast to class java.lang.Integer (java.lang.Long and java.lang.Integer are in module java.base of loader 'bootstrap')
[info] at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:99)
[info] at org.apache.spark.sql.catalyst.expressions.RandStr.evalInternal(randomExpressions.scala:377)
> and improves its parity with other systems.

@dtenedor Could you provide a few links to the systems with which you are aiming for parity?

👍 this is done
Thanks again @MaxGekk for your reviews
Waiting for CI.
@dtenedor Please update the PR description (examples and so on) to reflect your recent changes.
@dtenedor Could you please fix the test failure? It seems to be related to your changes.
+1, LGTM. Merging to master.
Hooray
… 'uniform' SQL functions

### What changes were proposed in this pull request?

In #48004 we added new SQL functions `randstr` and `uniform`. This PR adds DataFrame API support for them.

For example, in Scala:

```
sql("create table t(col int not null) using csv")
sql("insert into t values (0)")
val df = sql("select col from t")
df.select(randstr(lit(5), lit(0)).alias("x")).select(length(col("x")))
> 5

df.select(uniform(lit(10), lit(20), lit(0)).alias("x")).selectExpr("x > 5")
> true
```

### Why are the changes needed?

This improves DataFrame parity with the SQL API.

### Does this PR introduce _any_ user-facing change?

Yes, see above.

### How was this patch tested?

This PR adds unit test coverage.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #48143 from dtenedor/dataframes-uniform-randstr.

Authored-by: Daniel Tenedorio <daniel.tenedorio@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
…o generate random strings or numbers within ranges

### What changes were proposed in this pull request?

This PR introduces two new SQL functions "randstr" and "uniform" to generate random strings or numbers within ranges.

* The "randstr" function returns a string of the specified length whose characters are chosen uniformly at random from the following pool of characters: 0-9, a-z, A-Z. The random seed is optional. The string length must be a constant two-byte or four-byte integer (SMALLINT or INT, respectively).
* The "uniform" function returns a random value with independent and identically distributed values with the specified range of numbers. The random seed is optional. The provided numbers specifying the minimum and maximum values of the range must be constant. If both of these numbers are integers, then the result will also be an integer. Otherwise if one or both of these are floating-point numbers, then the result will also be a floating-point number.

For example:

```
SELECT randstr(5);
> ceV0P

SELECT randstr(10, 0) FROM VALUES (0), (1), (2) tab(col);
> ceV0PXaR2I
  fYxVfArnv7
  iSIv0VT2XL

SELECT uniform(10, 20.0F);
> 17.604954

SELECT uniform(10, 20, 0) FROM VALUES (0), (1), (2) tab(col);
> 15
  16
  17
```

### Why are the changes needed?

This improves the SQL functionality of Apache Spark and improves its parity with other systems:

* https://clickhouse.com/docs/en/sql-reference/functions/random-functions#randuniform
* https://docs.snowflake.com/en/sql-reference/functions/uniform
* https://www.microfocus.com/documentation/silk-test/21.0.2/en/silktestclassic-help-en/STCLASSIC-8BFE8661-RANDSTRFUNCTION-REF.html
* https://docs.snowflake.com/en/sql-reference/functions/randstr

### Does this PR introduce _any_ user-facing change?

Yes, see above.

### How was this patch tested?

This PR adds golden file based test coverage.

### Was this patch authored or co-authored using generative AI tooling?

Not this time.

Closes apache#48004 from dtenedor/uniform-randstr-functions.

Authored-by: Daniel Tenedorio <daniel.tenedorio@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
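
The semantics described in the commit message above can be illustrated with a minimal Python sketch. It is not Spark's implementation: it relies on Python's `random` module, so it will not reproduce the sample outputs shown in the SQL examples. The helper names `make_rng`, `randstr`, and `uniform` are hypothetical stand-ins, and both the inclusive integer bounds and the rejection of negative lengths are assumptions of this sketch rather than documented Spark behavior.

```python
import random
import string

# Character pool named in the PR description: 0-9, a-z, A-Z (62 characters).
POOL = string.digits + string.ascii_lowercase + string.ascii_uppercase


def make_rng(seed=None):
    """Return a generator; the seed is optional, as in the PR description."""
    return random.Random(seed)


def randstr(length, rng):
    """Return `length` characters drawn uniformly at random from POOL."""
    if isinstance(length, bool) or not isinstance(length, int):
        raise TypeError("length must be an integer")
    if length < 0:
        # Assumption of this sketch: negative lengths are rejected.
        raise ValueError("length must be non-negative")
    return "".join(rng.choice(POOL) for _ in range(length))


def uniform(low, high, rng):
    """Return an int if both bounds are ints, otherwise a float."""
    if isinstance(low, int) and isinstance(high, int):
        lo, hi = sorted((low, high))
        # Assumption of this sketch: both integer bounds are inclusive.
        return rng.randint(lo, hi)
    return rng.uniform(float(low), float(high))


# One seeded generator advanced once per row, loosely mirroring
# `SELECT randstr(10, 0) FROM VALUES (0), (1), (2) tab(col)`.
rng = make_rng(0)
rows = [randstr(10, rng) for _ in range(3)]
```

Passing the generator explicitly keeps the seed handling in one place: creating two generators with the same seed yields the same sequence of values, while reusing one generator across rows yields a fresh value per row.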