Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ScalarUDF: Remove supports_zero_argument and avoid creating null array for empty args #10193

Merged

Conversation

jayzhan211
Copy link
Contributor

@jayzhan211 jayzhan211 commented Apr 23, 2024

Which issue does this PR close?

Issue from https://github.com/apache/datafusion/pull/10098/files/a422b3ab10519efea3ec671fef944a1dcf8aaf96#r1573855672

Closes #10205 .
Closes #10247

Rationale for this change

What changes are included in this PR?

  1. Remove supports_zero_argument
  2. Introduce support_randomness
  3. Introduce invoke_no_args

Are these changes tested?

Are there any user-facing changes?

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@github-actions github-actions bot added the physical-expr Physical Expressions label Apr 23, 2024
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@github-actions github-actions bot added the core Core DataFusion crate label Apr 23, 2024
@jayzhan211 jayzhan211 marked this pull request as ready for review April 23, 2024 09:53
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking very nice

let len = if args.is_empty() {
1
} else {
return exec_err!("Expect random function to take no param");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be more consistent with the other changes in this PR:

Suggested change
return exec_err!("Expect random function to take no param");
return exec_err!("Expect {} function to take no parameters", self.name());

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this before, and I noticed that random is used in test so I need to move it out as an independent function, but adding name as another argument looks so weird, so I keep it like this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can even return ScalarValue::F64 instead of Array. Let me fix these too

let len = if args.is_empty() {
1
} else {
return internal_err!("Invalid argument type");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be good to have the test match the rest of the functions for consistency

Suggested change
return internal_err!("Invalid argument type");
return exec_err!("Expect function to take no parameters");

return exec_err!("Expect pi function to take no param");
}
if !args.is_empty() {
return exec_err!("Expect {} function to take no param", self.name());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return exec_err!("Expect {} function to take no param", self.name());
return exec_err!("Expect {} function to take no parameters", self.name());

let len = if args.is_empty() {
1
} else {
return exec_err!("Expect {} function to take no param", self.name());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return exec_err!("Expect {} function to take no param", self.name());
return exec_err!("Expect {} function to take no parameters", self.name());

@jayzhan211 jayzhan211 marked this pull request as draft April 23, 2024 12:07
@jayzhan211 jayzhan211 changed the title Avoid creating null array for empty args ScalarUDF: Remove supports_zero_argument and avoid creating null array for empty args Apr 23, 2024

#[test]
fn test_random_expression() {
let args = vec![ColumnarValue::Array(Arc::new(NullArray::new(1)))];
let array = random(&args)
let array = random(&[])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this test, since it is already covered

SELECT
random() BETWEEN 0.0 AND 1.0,
random() = random()

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@github-actions github-actions bot added logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt) labels Apr 23, 2024
@@ -403,123 +398,6 @@ async fn test_user_defined_functions_with_alias() -> Result<()> {
Ok(())
}

#[derive(Debug)]
pub struct RandomUDF {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have RandomFunc, they are the same so remove it.

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@github-actions github-actions bot removed the sqllogictest SQL Logic Tests (.slt) label Apr 23, 2024
}
let array = Float64Array::from_value(std::f64::consts::PI, 1);
Ok(ColumnarValue::Array(Arc::new(array)))
fn invoke(&self, _args: &[ColumnarValue]) -> Result<ColumnarValue> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think signature check is enough, so just ignore args

@jayzhan211 jayzhan211 marked this pull request as ready for review April 23, 2024 13:44
}

#[tokio::test]
async fn deregister_udf() -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is just moved , the test remains

@@ -615,6 +493,22 @@ async fn test_user_defined_functions_cast_to_i64() -> Result<()> {
Ok(())
}

#[tokio::test]
async fn deregister_udf() -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment on lines 325 to 329
/// # Zero Argument Functions
/// If the function has zero parameters (e.g. `now()`) it will be passed a
/// single element slice which is a a null array to indicate the batch's row
/// count (so the function can know the resulting array size).
///
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please leave a note about what is required to implement Zero Argument Functions? I think the expectation is that the output is a single ColumnarValue::Scalar, rather than an Array


assert_eq!(floats.len(), 1);
assert!(0.0 <= floats.value(0) && floats.value(0) < 1.0);
fn invoke(&self, _args: &[ColumnarValue]) -> Result<ColumnarValue> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the idea here is that expectation is that rand is invoked once per row rather than once per batch. And the only way it knew how many rows to make is to get a null array in 🤔

For example, when I run datafusion-cli from this PR to call random() the same value is returned for each row:

> create table foo as values (1), (2), (3), (4), (5);
0 row(s) fetched.
Elapsed 0.018 seconds.

> select column1, random() from foo;
+---------+--------------------+
| column1 | random()           |
+---------+--------------------+
| 1       | 0.9594375709000513 |
| 2       | 0.9594375709000513 |
| 3       | 0.9594375709000513 |
| 4       | 0.9594375709000513 |
| 5       | 0.9594375709000513 |
+---------+--------------------+
5 row(s) fetched.
Elapsed 0.012 seconds.

But I expect that each row has a different value for random()

However, since none of the tests failed, clearly we have a gap in test coverage 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch! Let me think about how to design it, I would prefer something like support_random to specialize random() case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about adding a invoke_no_args(num_rows: usize) method to the ScalarUDFImpl -- with a default implementation that returns "not implemented" error

That might make it clear what was happening and would provide clear semantics about what to do in this case 🤔

Copy link
Contributor Author

@jayzhan211 jayzhan211 Apr 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't see the message then, it is also a good idea.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we test the uuid function? i think the uuid has the same attribute like random
cc @jayzhan211

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have test like

query II
SELECT octet_length(uuid()), length(uuid())
----
36 36

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have test like

query II
SELECT octet_length(uuid()), length(uuid())
----
36 36

I means the uuid function is also invoked once per row rather than once per batch like random() function mentioned by alamb.

I test the function in the spark

spark-sql> desc test;
col1    int     NULL
Time taken: 0.065 seconds, Fetched 1 row(s)


spark-sql> select * from test;
1
2



spark-sql> select *,uuid() from test;
1       6b04b66c-2e6c-4925-8b18-a9d51d5ed80a
2       3e0be0c2-9ff2-422a-8f3f-5cdb6551264b

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

file a issue for this: #10247

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@jayzhan211 jayzhan211 marked this pull request as ready for review April 24, 2024 03:05
@jayzhan211 jayzhan211 requested a review from alamb April 24, 2024 03:05
@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Apr 24, 2024 via email

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@jayzhan211 jayzhan211 marked this pull request as ready for review April 26, 2024 09:46
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jayzhan211 -- I think this now looks good. I retested with datafusion-cli on this PR and they make different values for each row.

DataFusion CLI v37.1.0
> create table t as values (1), (2);
0 row(s) fetched.
Elapsed 0.023 seconds.

> select random(), uuid() from t;
+--------------------+--------------------------------------+
| random()           | uuid()                               |
+--------------------+--------------------------------------+
| 0.7899418736375414 | 0e35ed7f-a059-43c1-9fe2-52afaf38c1d7 |
| 0.3470811911542333 | 42be02cc-cf5a-4736-b768-536cff69eb9c |
+--------------------+--------------------------------------+
2 row(s) fetched.
Elapsed 0.014 seconds.

I am going to add some additional tests to ensure this case is covered

@alamb
Copy link
Contributor

alamb commented Apr 26, 2024

Here is a PR with tests that validate we get the right answer for multiple rows: #10247

Thanks again @jayzhan211 for cleaning this up and @liukun4515 for your reviews

@jayzhan211 jayzhan211 merged commit c9bd291 into apache:main Apr 26, 2024
24 checks passed
@jayzhan211
Copy link
Contributor Author

Thanks @alamb and @liukun4515

@alamb alamb added the api change Changes the API exposed to users of the crate label Apr 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Physical Expressions
Projects
None yet
3 participants