Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: materialize the result to handle non-deterministic functions #24

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

st1page
Copy link
Contributor

@st1page st1page commented Nov 21, 2022

No description provided.

@st1page st1page requested a review from TennyZhuang November 21, 2022 13:31
@st1page st1page requested review from skyzh and BugenZhao November 21, 2022 13:32
rfcs/0024-nondeterministic-functions.md Outdated Show resolved Hide resolved
rfcs/0024-nondeterministic-functions.md Show resolved Hide resolved
rfcs/0024-nondeterministic-functions.md Outdated Show resolved Hide resolved
rfcs/0024-nondeterministic-functions.md Outdated Show resolved Hide resolved
@jon-chuang
Copy link

Any interaction of UDF with watermark?

@st1page
Copy link
Contributor Author

st1page commented Mar 29, 2023

Any interaction of UDF with watermark?

I think we should give user way to config on the UDF such as

  • is deterministic, false by default
  • is monotonically increasing, false by default

@BugenZhao
Copy link
Member

This proposal inspires me that, if we're going to fix a bug for our built-in expression in a version update, it also breaks the determinism, even if the expression itself is mathematically pure. 😄 If we're for the sake of serious consistency and determinism, we should even version-tag the built-in expressions and keep the buggy one when fixing, in this case.

Copy link
Member

@xxchan xxchan Mar 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just came up with a problem: IIRC currently UDF is just an expression, which can be used in other places beside Project, e.g., Filter. Should we make them occur only in Project? Besides the determinism problem, there are also optimizations like risingwavelabs/risingwave#8703

cc @wangrunji0408 @TennyZhuang

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Firstly, I need to clarify that the "UDF" and "non-deterministic expression" are independent concepts. PROC_TIME() and RAND() are built-in expressions but they are non-deterministic. A user-defined function can also be deterministic if a user makes a contract with us.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And materialized Project is just used for non-deterministic expression on an "updatable stream". We can use a normal project on the append-only stream for non-deterministic expression too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally agree to refuse to add a new operator for UDF or async expression but not for non-deterministic expression. In another word, if we did add a "UDFProject", now we maybe have to accept StreamProject, StreamAsyncProject, StreamMaterializeProject, StreamAsyncMaterializeProject 😅 🥵

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, after thinking a while I feel that although conceptually different, practically they cannot be so cleanly separated 😅:

Since we want to "always materializing UDF" (except for rare cases and used with caution), MaterializeProject would become the main executor for UDFs. Optimizations for UDF in Project like risingwavelabs/risingwave#8703 would be in vain. On the other hand, do you think we should optimize MaterializeProject for async execution?

Copy link
Member

@xxchan xxchan Mar 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I'm wrong again😇😇😇. I forgot append-only stream. But the points might still apply

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think all the optimization for async execution should be done behind the expression framework 🤔 Well there also are some executor-level optimization. For example, some users might not care about the order of the records on the append-only stream and we might reorder them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all the optimization for async execution should be done behind the expression framework

Is that enough? If so we can only tweak the performance for expr on one chunk, but not amoung chunks, like buffered in https://github.com/risingwavelabs/risingwave/pull/8703/files

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, actually I don’t mind adding many different executor variants in the future, just like the TopN variants 😅

Let’s conclude the discussion.


## Design

We will introduce a `MaterializeProject`, it will materialize some **partial** columns of result with the stream key of the input as the primiary key. When an `Insert` operation comes, it will compute the result and materialize some columns. when `Update` or `Delete` comes, it will lookup its state to replace the old value of the operation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Find a problem: not all expressions can be extracted to a Project easily, e.g., in FILTER clause. 🤡

This is mentioned in #12 as a reason against a dedicated operator for UDF.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants