Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add kafka backfill frontend #15602

Merged
merged 34 commits into from
Apr 2, 2024
Merged

feat: add kafka backfill frontend #15602

merged 34 commits into from
Apr 2, 2024

Conversation

xxchan
Copy link
Member

@xxchan xxchan commented Mar 11, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Frontend/meta part of #14172 risingwavelabs/rfcs#72

Tracking issue #16003

How it works

  • Introduced a session variable enable_shared_source to gate the feature
    • TODO: the switch might be problematic. e.g., create source with it disabled, and then create mv with it enabled.

planner:

  • introduced Logical/StreamSourceBackfill. They are created when planning MV on source.
    When to_stream_prost, it becomes merge -> backfill
  • When CREATE SOURCE, create stream job (with one single node LogicalSource)
  • TODO: add some planner tests

fragmenter:

  • added new fragment type SourceBackfill

meta/source manager:

  • When Backfill fragment is created (and when split change), allocate its splits according to its upstream Source fragment.

meta/build actor graph:

  • do the job, similar to other special fragments

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

@xxchan xxchan marked this pull request as ready for review March 11, 2024 08:13
@xxchan xxchan mentioned this pull request Mar 11, 2024
12 tasks
@xxchan xxchan force-pushed the xxchan/backfill-fe branch from 228157d to b26f03c Compare March 11, 2024 08:38
@xxchan xxchan force-pushed the xxchan/backfill-fe branch from b26f03c to 2285c1f Compare March 12, 2024 09:32
@xxchan xxchan changed the base branch from main to 03-12-fix_deny_create_MV_on_shared_CDC_source March 12, 2024 09:32
Copy link
Member Author

xxchan commented Mar 12, 2024

Copy link
Member

@fuyufjh fuyufjh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

proto/catalog.proto Outdated Show resolved Hide resolved
// Only used when `has_streaming_job` is `true`.
// If `false`, `requires_singleton` will be set in the stream plan.
bool is_distributed = 15;
reserved "cdc_source_job"; // deprecated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deprecating cdc_source_job and adding has_streaming_job will break backward compability of existing CDC jobs, right?

Why not renaming cdc_source_job to has_streaming_job?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

has_streaming_job is misleading, too. I think we should mention source to indicate what it controls happens when create source, eg. source_has_streaming_job

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing the field name could be breaking under SQL meta backend. 😐 Not sure why it's not linted.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not renaming cdc_source_job to has_streaming_job?

This is indeed the change I made. Number 13 is renamed.

Changing the field name could be breaking under SQL meta backend.

Indeed. 😕 Need to come up with another plan.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you already have one? Keeping it or introducing another one dedicated to this case both LGTM

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New plan:

  • keep cdc_source_job. New & old CDC sources will both set this field. This field is only used to set require_singleton
  • Change is_shared to optional (explicit presence). Old CDC sources will have this field set to None. New CDC sources will set to Some(true).
    • added method is_shared_compatibile to test cdc_source_job first. TBH, this is a little awkward. Better ideas are welcome.
    • Strictly speaking, it's not necessary to use optional, (cdc_source_job, is_shared) = (true,false) can also work. But I feel this can make it (just slightly) less error-prone.

Old plan:

  • Rename cdc_source_job to is_shared. This is more elegant, but breaks JSON compatibility for SQL meta backend.
  • Added is_distributed to set require_singleton in fragmenter for CDC source.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename cdc_source_job to is_shared. This is more elegant, but breaks JSON compatibility for SQL meta backend.

Maybe off-topic... I think storing protobuf into JSON sounds bad.

image

https://protobuf.dev/programming-guides/dos-donts/#text-format-interchange

src/meta/src/stream/source_manager.rs Show resolved Hide resolved
// Only used when `has_streaming_job` is `true`.
// If `false`, `requires_singleton` will be set in the stream plan.
bool is_distributed = 15;
reserved "cdc_source_job"; // deprecated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

has_streaming_job is misleading, too. I think we should mention source to indicate what it controls happens when create source, eg. source_has_streaming_job

Copy link
Contributor

Had an offline discussion with @xxchan, if we enable the useable source, there will be real actors running and keep fetching data even if no subsquent MVs.

Should have an optim to pause the source in this case to save resources, it is ok to impl it in next pr.

Copy link
Member

@yezizp2012 yezizp2012 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest of meta part LGTM. Can we run single node test in this PR? I'm not sure if it could fail it.

src/meta/src/manager/metadata.rs Outdated Show resolved Hide resolved
src/meta/src/rpc/ddl_controller.rs Outdated Show resolved Hide resolved
src/meta/src/stream/stream_manager.rs Outdated Show resolved Hide resolved
Makefile.toml Show resolved Hide resolved
// Only used when `has_streaming_job` is `true`.
// If `false`, `requires_singleton` will be set in the stream plan.
bool is_distributed = 15;
reserved "cdc_source_job"; // deprecated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing the field name could be breaking under SQL meta backend. 😐 Not sure why it's not linted.

Base automatically changed from 03-12-fix_deny_create_MV_on_shared_CDC_source to main March 15, 2024 07:20
@@ -294,6 +294,7 @@ fn build_fragment(
}

NodeBody::StreamCdcScan(_) => {
// XXX: Should we use a different flag for CDC scan?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depending on whether we find the need to distinguish between them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we do need to distinguish them. See the awkward code in tracking_progress_actor_ids. But I'm worried about compatibility so don't want to change it.

src/meta/src/stream/stream_graph/actor.rs Outdated Show resolved Hide resolved
@xxchan xxchan force-pushed the xxchan/backfill-fe branch from eb64efc to 17e3779 Compare March 28, 2024 10:07
@xxchan xxchan mentioned this pull request Mar 28, 2024
21 tasks
@xxchan
Copy link
Member Author

xxchan commented Apr 1, 2024

Changes since last review (It should be OK to review commits since last review)

Copy link
Member

@yezizp2012 yezizp2012 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meta part LGTM!

Copy link
Member

@fuyufjh fuyufjh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment on lines +56 to +57
// XXX: do we need to include partition and offset cols here? It's needed by Backfill's input, but maybe not output?
// But the source's "schema" contains the hidden columns.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is by design, so that

  1. users can optionally use these columns in their batch query.
  2. users can optionally use these columns in their streaming query if they specify include in source DDL.

@@ -1313,12 +1313,20 @@ pub async fn handle_create_source(
let sql_pk_names = bind_sql_pk_names(&stmt.columns, &stmt.constraints)?;

let create_cdc_source_job = with_properties.is_backfillable_cdc_connector();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw you renamed is_backfillable_cdc_connector to is_shared_cdc_connector but reverted. 😄 Personally, I will +1 for this renaming to reduce our terminology.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I reverted it by mistake when resolving conflicts 🤡

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, it does have a reason for not using "shared". #15635 (comment)

Maybe we can call it "shareable" instead of "backfillable"

src/frontend/src/optimizer/plan_node/logical_source.rs Outdated Show resolved Hide resolved
// - Direct CDC sources (mysql & postgresql). Backwards compat note: For old CDC job it's `None`; For new CDC job it's `Some(true)`.
// - MQ sources (Kafka, Pulsar, Kinesis, etc.)
//
// **Should also test `cdc_source_job` for backwards compatibility. Use `is_shared_compatible()` instead of this field directly.**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks error-prone to me. I would rather just reuse cdc_source_job even without renaming.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😄 Sure that's acceptable to me

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had better reuse and rename 😄

@xxchan xxchan added this pull request to the merge queue Apr 2, 2024
Merged via the queue into main with commit 2afbb6f Apr 2, 2024
30 of 32 checks passed
@xxchan xxchan deleted the xxchan/backfill-fe branch April 2, 2024 08:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants