Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add external function #12

Merged
merged 9 commits into from
Aug 7, 2023
Merged

Add external function #12

merged 9 commits into from
Aug 7, 2023

Conversation

TennyZhuang
Copy link
Contributor

Signed-off-by: TennyZhuang zty0826@gmail.com

Signed-off-by: TennyZhuang <zty0826@gmail.com>
Signed-off-by: TennyZhuang <zty0826@gmail.com>
Signed-off-by: TennyZhuang <zty0826@gmail.com>
@CLAassistant
Copy link

CLAassistant commented Nov 6, 2022

CLA assistant check
All committers have signed the CLA.

@TennyZhuang
Copy link
Contributor Author

@skyzh Can you ignore the HackMD bot in CLA checker?

@TennyZhuang TennyZhuang marked this pull request as ready for review November 6, 2022 09:56
Signed-off-by: TennyZhuang <zty0826@gmail.com>
Copy link
Contributor

@BowenXiao1999 BowenXiao1999 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Can you also provide more context why the interface should be async?
  2. How to understand cross-chunk concurrency? Do you mean if udf1 and udf2 both need chunk A, then they shared read this chunk?

rfcs/0012-external-function.md Outdated Show resolved Hide resolved
rfcs/0012-external-function.md Outdated Show resolved Hide resolved
@BugenZhao
Copy link
Member

2. How to understand cross-chunk concurrency?

Take Project as an example, if the UDF is I/O bounded, then we may want to process the input chunks in a pipeline manner by buffering them to be concurrent.

@skyzh
Copy link

skyzh commented Nov 7, 2022

Do we allow external functions in places other than projection? (e.g., filter?)

TennyZhuang and others added 2 commits November 7, 2022 13:29
Co-authored-by: Bugen Zhao <i@bugenzhao.com>
Co-authored-by: Bugen Zhao <i@bugenzhao.com>
@TennyZhuang
Copy link
Contributor Author

Do we allow external functions in places other than projection? (e.g., filter?)

Yes, I hope everywhere.

@skyzh
Copy link

skyzh commented Nov 7, 2022

Before introducing external functions to the system, we should also consider deterministic problems.

Considering the following query:

select * from (select udf(x) as x from table) where x > 10;

Suppose we get the following events from table:

INSERT 1
DELETE 1

If udf is not deterministic, suppose it gets 1 -> 11 for the first message, and 1 -> 10 for the second message.

Many things will go wrong. The filter will emit:

INSERT 1

which ignores the second delete.

One feasible solution is to have a special kind of projection called MaterializeProjection, where we materialize data by their pk, so that UDF will only be computed once at the first time when the generated event goes into the system.

@skyzh
Copy link

skyzh commented Nov 7, 2022

That is to say, MaterializeProjection is a cache for UDF, whereas given one input we will always produce the same output for a given UDF.

@TennyZhuang
Copy link
Contributor Author

TennyZhuang commented Nov 7, 2022

Before introducing external functions to the system, we should also consider deterministic problems.

Considering the following query:

select * from (select udf(x) as x from table) where x > 10;

Suppose we get the following events from table:

INSERT 1
DELETE 1

If udf is not deterministic, suppose it gets 1 -> 11 for the first message, and 1 -> 10 for the second message.

Many things will go wrong. The filter will emit:

INSERT 1

which ignores the second delete.

One feasible solution is to have a special kind of projection called MaterializeProjection, where we materialize data by their pk, so that UDF will only be computed once at the first time when the generated event goes into the system.

Good catch! I have no idea how to resolve that, how about only allowing all UDFs on the append-only stream?

@TennyZhuang
Copy link
Contributor Author

TennyZhuang commented Nov 7, 2022

  • Can you also provide more context why the interface should be async?

There may be RPC calls during the evaluation.

@skyzh
Copy link

skyzh commented Nov 7, 2022

Append-only stream should be fine. But this means that people can only do UDF before aggregation / TopN, which seems to be a common case in SQL. Not sure about the use case.

@TennyZhuang
Copy link
Contributor Author

TennyZhuang commented Nov 7, 2022

A basic idea:

Users must specify their UDFs are deterministic explicitly, otherwise, the UDFs can only be applied on an append-only stream or be materialized implicitly.

I will add this to the RFC.

Signed-off-by: TennyZhuang <zty0826@gmail.com>
Signed-off-by: TennyZhuang <zty0826@gmail.com>
Signed-off-by: TennyZhuang <zty0826@gmail.com>
@wangrunji0408
Copy link

Can we merge this RFC? as most of its proposals have been implemented now.

@TennyZhuang
Copy link
Contributor Author

Can we merge this RFC? as most of its proposals have been implemented now.

You can give an approval.

@TennyZhuang TennyZhuang merged commit 6d07087 into main Aug 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants