[SPARK-49505][SQL] Create new SQL functions "randstr" and "uniform" to generate random strings or numbers within ranges #48004

Closed
wants to merge 18 commits into from

Conversation

dtenedor
Contributor

@dtenedor dtenedor commented Sep 5, 2024

What changes were proposed in this pull request?

This PR introduces two new SQL functions "randstr" and "uniform" to generate random strings or numbers within ranges.

  • The "randstr" function returns a string of the specified length whose characters are chosen uniformly at random from the following pool of characters: 0-9, a-z, A-Z. The random seed is optional. The string length must be a constant two-byte or four-byte integer (SMALLINT or INT, respectively).
  • The "uniform" function returns a random value within the specified range of numbers; the values generated across rows are independent and identically distributed. The random seed is optional. The numbers specifying the minimum and maximum values of the range must be constant. If both of these numbers are integers, the result is also an integer. Otherwise, if one or both of them are floating-point numbers, the result is also a floating-point number.
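
The "randstr" behaviour described above can be modelled outside Spark. The following is a minimal sketch in plain Python (not PySpark), using a hypothetical helper named `randstr_model`. It relies on Python's `random.Random`, so it will not reproduce the strings Spark returns for a given seed; it only illustrates the 62-character pool, the fixed output length, and the effect of the optional seed. Rejecting negative lengths is an assumption of this sketch, not something the PR description states.

```python
import random
import string

# Character pool named in the PR description: 0-9, a-z, A-Z (62 characters).
POOL = string.digits + string.ascii_lowercase + string.ascii_uppercase


def randstr_model(length, seed=None):
    """Hypothetical model of randstr: `length` characters drawn uniformly from POOL."""
    # Assumption of this sketch: negative lengths are rejected.
    if not isinstance(length, int) or length < 0:
        raise ValueError("length must be a non-negative integer")
    # seed=None gives a nondeterministic generator, like omitting the seed in SQL.
    rng = random.Random(seed)
    return "".join(rng.choice(POOL) for _ in range(length))
```

Calling `randstr_model(10, 0)` twice returns the same string, whereas `randstr_model(10)` will usually differ between calls.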

For example:

SELECT randstr(5);
> ceV0P

SELECT randstr(10, 0) FROM VALUES (0), (1), (2) tab(col);
> ceV0PXaR2I
  fYxVfArnv7
  iSIv0VT2XL

SELECT uniform(10, 20.0F);
> 17.604954

SELECT uniform(10, 20, 0) FROM VALUES (0), (1), (2) tab(col);
> 15
  16
  17
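
The result-type rule for "uniform" (integer bounds give an integer, any floating-point bound gives a floating-point number) can be sketched in the same way. This is a plain-Python model with a hypothetical name, `uniform_model`; it does not reproduce Spark's generator or its output for a given seed, and treating the upper bound as inclusive is an assumption of the sketch.

```python
import random


def uniform_model(lo, hi, seed=None):
    """Hypothetical model of uniform: int bounds give an int, otherwise a float."""
    rng = random.Random(seed)
    if isinstance(lo, int) and isinstance(hi, int):
        # Both bounds are integers, so the result is an integer.
        # Assumption of this sketch: the upper bound is inclusive.
        return rng.randint(lo, hi)
    # At least one bound is floating-point, so the result is floating-point.
    return rng.uniform(float(lo), float(hi))
```

With these definitions, `uniform_model(10, 20, 0)` returns an `int` between 10 and 20, while `uniform_model(10, 20.0)` returns a `float` in the same range.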

Why are the changes needed?

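This improves the SQL functionality of Apache Spark and improves its parity with other systems:

  • https://clickhouse.com/docs/en/sql-reference/functions/random-functions#randuniform
  • https://docs.snowflake.com/en/sql-reference/functions/uniform
  • https://www.microfocus.com/documentation/silk-test/21.0.2/en/silktestclassic-help-en/STCLASSIC-8BFE8661-RANDSTRFUNCTION-REF.html
  • https://docs.snowflake.com/en/sql-reference/functions/randstr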

Does this PR introduce any user-facing change?

Yes, see above.

How was this patch tested?

This PR adds golden file based test coverage.

Was this patch authored or co-authored using generative AI tooling?

Not this time.

commit

uniform expression

commit

commit

commit
@github-actions github-actions bot added the SQL label Sep 5, 2024
@dtenedor
Contributor Author

dtenedor commented Sep 5, 2024

cc @HyukjinKwon @MaxGekk here are a couple more :)

Contributor Author

@dtenedor dtenedor left a comment

Thanks @MaxGekk for your reviews, responded to your comments, please take another look.

respond to code review comments

respond to code review comments

respond to code review comments
Member

@MaxGekk MaxGekk left a comment

Test the non-codegen implementation.

Contributor Author

@dtenedor dtenedor left a comment

Thanks again @MaxGekk for the thorough reviews; they are helping the new functions cover all corner cases as carefully as possible.

Member

@MaxGekk MaxGekk left a comment

LGTM in general. @dtenedor Could you fix the test failure:

[info] - SPARK-49505: Test the RANDSTR and UNIFORM SQL functions without codegen *** FAILED *** (1 millisecond)
[info]   Exception evaluating randstr(10, 0) (ExpressionEvalHelper.scala:257)
...
[info]   Cause: java.lang.ClassCastException: class java.lang.Long cannot be cast to class java.lang.Integer (java.lang.Long and java.lang.Integer are in module java.base of loader 'bootstrap')
[info]   at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:99)
[info]   at org.apache.spark.sql.catalyst.expressions.RandStr.evalInternal(randomExpressions.scala:377)

Member

@MaxGekk MaxGekk left a comment

and improves its parity with other systems.

@dtenedor Could you provide a few links to the systems with which you are aiming to reach parity?

@dtenedor
Contributor Author

LGTM in general. @dtenedor Could you fix the test failure

👍 this is done

Contributor Author

@dtenedor dtenedor left a comment

Thanks again @MaxGekk for your reviews

Member

@MaxGekk MaxGekk left a comment

Waiting for CI.

@dtenedor Please update the PR description (examples and so on) to reflect your recent changes.

@MaxGekk
Member

MaxGekk commented Sep 17, 2024

@dtenedor Could you fix the test failure, please? It seems to be related to your changes.

[info]   Cause: org.scalatest.exceptions.TestFailedException: Function 'randstr', Expression class 'org.apache.spark.sql.catalyst.expressions.RandStr' "[ceV]" did not equal "[8i7]"

@MaxGekk
Member

MaxGekk commented Sep 17, 2024

+1, LGTM. Merging to master.
Thank you, @dtenedor and @HyukjinKwon for review.

@MaxGekk MaxGekk closed this in 6393afa Sep 17, 2024
@dtenedor
Contributor Author

Hooray, thanks again for the reviews, everyone!

zhengruifeng pushed a commit that referenced this pull request Sep 25, 2024
… 'uniform' SQL functions

### What changes were proposed in this pull request?

In #48004 we added new SQL functions `randstr` and `uniform`. This PR adds DataFrame API support for them.

For example, in Scala:

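```
// Assumes a spark-shell session, where `sql` is already in scope.
import org.apache.spark.sql.functions._  // col, length, lit, randstr, uniform

sql("create table t(col int not null) using csv")
sql("insert into t values (0)")
val df = sql("select col from t")
// randstr(length, seed): the generated string has the requested length.
df.select(randstr(lit(5), lit(0)).alias("x")).select(length(col("x")))
> 5

// uniform(min, max, seed): the value lies between 10 and 20, so it exceeds 5.
df.select(uniform(lit(10), lit(20), lit(0)).alias("x")).selectExpr("x > 5")
> true
```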

### Why are the changes needed?

This improves DataFrame parity with the SQL API.

### Does this PR introduce _any_ user-facing change?

Yes, see above.

### How was this patch tested?

This PR adds unit test coverage.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #48143 from dtenedor/dataframes-uniform-randstr.

Authored-by: Daniel Tenedorio <daniel.tenedorio@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
…o generate random strings or numbers within ranges

### What changes were proposed in this pull request?

This PR introduces two new SQL functions "randstr" and "uniform" to generate random strings or numbers within ranges.

* The "randstr" function returns a string of the specified length whose characters are chosen uniformly at random from the following pool of characters: 0-9, a-z, A-Z. The random seed is optional. The string length must be a constant two-byte or four-byte integer (SMALLINT or INT, respectively).
* The "uniform" function returns a random value within the specified range of numbers; the values generated across rows are independent and identically distributed. The random seed is optional. The numbers specifying the minimum and maximum values of the range must be constant. If both of these numbers are integers, the result is also an integer. Otherwise, if one or both of them are floating-point numbers, the result is also a floating-point number.

For example:

```
SELECT randstr(5);
> ceV0P

SELECT randstr(10, 0) FROM VALUES (0), (1), (2) tab(col);
> ceV0PXaR2I
  fYxVfArnv7
  iSIv0VT2XL

SELECT uniform(10, 20.0F);
> 17.604954

SELECT uniform(10, 20, 0) FROM VALUES (0), (1), (2) tab(col);
> 15
  16
  17
```

### Why are the changes needed?

This improves the SQL functionality of Apache Spark and improves its parity with other systems:
* https://clickhouse.com/docs/en/sql-reference/functions/random-functions#randuniform
* https://docs.snowflake.com/en/sql-reference/functions/uniform
* https://www.microfocus.com/documentation/silk-test/21.0.2/en/silktestclassic-help-en/STCLASSIC-8BFE8661-RANDSTRFUNCTION-REF.html
* https://docs.snowflake.com/en/sql-reference/functions/randstr

### Does this PR introduce _any_ user-facing change?

Yes, see above.

### How was this patch tested?

This PR adds golden file based test coverage.

### Was this patch authored or co-authored using generative AI tooling?

Not this time.

Closes apache#48004 from dtenedor/uniform-randstr-functions.

Authored-by: Daniel Tenedorio <daniel.tenedorio@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
… 'uniform' SQL functions

himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024
…o generate random strings or numbers within ranges

himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024
… 'uniform' SQL functions

3 participants