Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion: Force specifying watermark for sources #6750

Closed
fuyufjh opened this issue Dec 6, 2022 · 11 comments
Closed

Discussion: Force specifying watermark for sources #6750

fuyufjh opened this issue Dec 6, 2022 · 11 comments
Assignees
Labels
type/feature user-facing-changes Contains changes that are visible to users
Milestone

Comments

@fuyufjh
Copy link
Member

fuyufjh commented Dec 6, 2022

Background

In our current design, Watermark is introduced with a TVF (table-value function) e.g.

WATERMARK(orders, order_time, order_time - INTERVAL '1' MINUTE)

and then

CREATE SINK sales_per_hour AS
  SELECT customer_id, window_end as order_hour, SUM(price) as sum_price
  FROM TUMBLE(
    WATERMARK(orders, order_time, order_time - INTERVAL '1' MINUTE), 
    order_time, 
    INTERVAL '1' HOUR
  )
  GROUP BY window_end, customer_id EMIT ON WINDOW CLOSE;

See RFC#2 The WatermarkFilter and StreamSort operator for details.

Proposal

According to RFC#4: Unify the materialized source and table, in the future, we will only have 2 kinds of objects in databases: table and source. I think in most cases,

  • sources must need a watermark (otherwise the internal state cannot be cleaned)
  • while tables don't need one (because users are supposed to clean/delete some data if they want)

So, shall we enforce this by forcing users to define a watermark on creating a source? For example,

CREATE SOURCE Orders (
    user BIGINT,
    product STRING,
    order_time TIMESTAMP(3),
    WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
) WITH ( ... );

Referred to the Flink document.

Then, users don't have to define the Watermark when writing queries

CREATE SINK sales_per_hour AS
  SELECT customer_id, window_end as order_hour, SUM(price) as sum_price
  FROM TUMBLE(orders, order_time, INTERVAL '1' HOUR)
  GROUP BY window_end, customer_id EMIT ON WINDOW CLOSE;

Note that this proposal is all about syntax rather than implementation.

Alternatively, users can also choose to use processing time as the time column. In this case, processing time has a implicit perfect watermark i.e. events are perfectly ordered and watermark latency is 0.

See also #7209.

CREATE SOURCE user_actions (
  user_name STRING,
  data STRING,
  user_action_time AS PROCTIME() -- declare an additional field as a processing time attribute
) WITH (
  ...
);

Users must choose one from above two when creating a source, otherwise an error will be displayed to guide him/her.

Benefits

The first and foremost reason is that the original watermark syntax is not compulsory, so users will be likely to ignore it. By enforcing it on defining sources, users would be easier when writing queries. No unfamiliar syntax anymore.

Another benefit of avoiding watermark syntax is - Strictly speaking, watermark is not a SQL function because all SQL functions operate data itself, while watermark operates on some kind of metadata. This obscure semantics cause it wired to execute watermark in batch queries.

@github-actions github-actions bot added this to the release-0.1.15 milestone Dec 6, 2022
@fuyufjh fuyufjh changed the title Discussion: Force specifying watermark (as well as event-time column) for sources? Discussion: Force specifying watermark for sources Dec 6, 2022
@liurenjie1024
Copy link
Contributor

liurenjie1024 commented Dec 6, 2022

The following question is: do we still allow to specify wartermark in tvf format?

@TennyZhuang
Copy link
Contributor

TennyZhuang commented Dec 6, 2022

Agree with the Flink part.

@fuyufjh
Copy link
Member Author

fuyufjh commented Dec 6, 2022

The following question is: do we still allow to specify wartermark in tvf format?

I want to remove it eventually.

@liurenjie1024
Copy link
Contributor

The following question is: do we still allow to specify wartermark in tvf format?

I want to remove it eventually.

I'm ok with this change, so we still will emit watermark even if there is no need for state cleaning? For example, a simple filter query. I think it's ok to me if it has no much effect on performance.

@hzxa21
Copy link
Collaborator

hzxa21 commented Dec 6, 2022

shall we enforce this by forcing users to define a watermark on creating a source?

By enforcing, do you mean CREATE SOURCE will fail if user doesn't specify watermark? What happen to the source without a timestamp column?

@fuyufjh
Copy link
Member Author

fuyufjh commented Dec 7, 2022

shall we enforce this by forcing users to define a watermark on creating a source?

By enforcing, do you mean CREATE SOURCE will fail if user doesn't specify watermark? What happen to the source without a timestamp column?

Yes. I want to prohibit that case.

@jon-chuang
Copy link
Contributor

jon-chuang commented Dec 9, 2022

shall we enforce this by forcing users to define a watermark on creating a source?

By enforcing, do you mean CREATE SOURCE will fail if user doesn't specify watermark? What happen to the source without a timestamp column?

Yes. I want to prohibit that case.

Or, could we ask users to define a watermark on the PROC_TIME in that case?

I also thought maybe we can have default based on PROC_TIME, but I think its better if user has to always explicitly define it.

@fuyufjh
Copy link
Member Author

fuyufjh commented Dec 9, 2022

Or, could we ask users to define a watermark on the PROC_TIME in that case?

Agree. In fact, Flink enforces this by only allowing watermark column to be used in window functions.

@fuyufjh
Copy link
Member Author

fuyufjh commented Dec 19, 2022

This proposal can be divided into two parts

  1. Define watermark on source
  2. Forced to define watermark

I think 1 is not controversial and we could start working on it now.

@fuyufjh
Copy link
Member Author

fuyufjh commented Jan 6, 2023

Updated. Processing time should be an alternative.

@fuyufjh fuyufjh added the user-facing-changes Contains changes that are visible to users label Jan 6, 2023
@fuyufjh fuyufjh assigned yuhao-su and unassigned fuyufjh Jan 30, 2023
@fuyufjh
Copy link
Member Author

fuyufjh commented Jan 30, 2023

We had agreed on this in the last discussion. Let's do it later on @yuhao-su

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/feature user-facing-changes Contains changes that are visible to users
Projects
None yet
Development

No branches or pull requests

6 participants