Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(connector): support ingest Citus distributed table #8988

Merged
merged 10 commits into from
Apr 17, 2023

Conversation

StrikeW
Copy link
Contributor

@StrikeW StrikeW commented Apr 4, 2023

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Support citus-cdc connector for the CREATE TABLE statement, which allows users to ingest a distributed table stored in Citus to RisingWave. For example, we can ingest the github_events table via the following SQL (don't forget to specify a primary key for the table):

CREATE TABLE github_events_rw
(
    event_id bigint,
    event_type text,
    event_public boolean,
    repo_id bigint,
    payload jsonb,
    repo jsonb,
    user_id bigint,
    org jsonb,
    created_at timestamp,
    PRIMARY KEY (event_id, user_id)
) WITH (
    connector = 'citus-cdc',
    hostname = '127.0.0.1',
    port = '5432',
    username = 'dbz',
    password = '123456',
    database.servers = '172.31.29.245:5432,172.31.31.177:5432',
    database.name = 'postgres',
    schema.name = 'public',
    table.name = 'github_events',
    slot.name = 'github_events_dbz_slot1',
);

Users need to specify the host and port for Citus worker nodes by database.servers in the WITH clause.

Prerequisites

  1. Create a superuser role in the Citus cluster
# create role on coordinator node
create role dbz Superuser LOGIN;

# create role on worker nodes
SELECT run_command_on_workers($cmd$
  create role dbz Superuser LOGIN;
$cmd$);
  1. Set wal_level to logical on each worker node (e.g. modify the postgresql.conf)
  2. Set the replica identity to FULL for the table you want to ingest
# execute on coordinator node
ALTER TABLE github_events REPLICA IDENTITY FULL;

Limitations

  • The connector requires a Postgres superuser role
  • Only distributed table is supported
  • Cannot detect newly added worker nodes (can be improved in future PR)

Checklist For Contributors

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • I have demonstrated that backward compatibility is not broken by breaking changes and created issues to track deprecated features to be removed in the future. (Please refer to the issue)
  • All checks passed in ./risedev check (or alias, ./risedev c)

Checklist For Reviewers

  • I have requested macro/micro-benchmarks as this PR can affect performance substantially, and the results are shown.

Documentation

Click here for Documentation

Types of user-facing changes

Please keep the types that apply to your changes, and remove the others.

  • Connector (sources & sinks)

Release note

Support citus-cdc connector for the CREATE TABLE statement, which allows users to ingest a distributed table stored in Citus to RisingWave.

@tabVersion
Copy link
Contributor

@fuyufjh I am a little skeptical about this approach. Can we try to deliver the poc on this branch rather than main branch? We can merge the pr after they accept this.

@codecov
Copy link

codecov bot commented Apr 4, 2023

Codecov Report

Merging #8988 (9d60c43) into main (ff8de5c) will decrease coverage by 0.02%.
The diff coverage is 29.90%.

@@            Coverage Diff             @@
##             main    #8988      +/-   ##
==========================================
- Coverage   70.83%   70.81%   -0.02%     
==========================================
  Files        1201     1201              
  Lines      200454   200533      +79     
==========================================
+ Hits       141992   142013      +21     
- Misses      58462    58520      +58     
Flag Coverage Δ
rust 70.81% <29.90%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/connector/src/source/base.rs 65.26% <0.00%> (-2.33%) ⬇️
src/connector/src/source/cdc/enumerator/mod.rs 0.00% <0.00%> (ø)
src/connector/src/source/cdc/mod.rs 20.00% <0.00%> (+5.71%) ⬆️
src/connector/src/source/cdc/source/reader.rs 0.00% <0.00%> (ø)
src/connector/src/source/cdc/split.rs 58.33% <33.33%> (-10.09%) ⬇️
...c/connector/src/parser/maxwell/simd_json_parser.rs 82.35% <80.00%> (-0.18%) ⬇️
src/connector/src/parser/common.rs 71.69% <85.00%> (+1.48%) ⬆️
.../connector/src/parser/debezium/simd_json_parser.rs 91.26% <100.00%> (+0.08%) ⬆️
src/connector/src/parser/json_parser.rs 95.73% <100.00%> (+0.02%) ⬆️
src/frontend/src/handler/create_source.rs 49.49% <100.00%> (+0.08%) ⬆️

... and 6 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@fuyufjh
Copy link
Member

fuyufjh commented Apr 11, 2023

@fuyufjh I am a little skeptical about this approach. Can we try to deliver the poc on this branch rather than main branch? We can merge the pr after they accept this.

I am 50-50 on this.

Except the refactoring around PostgresValidator (which I think is good), this PR doesn't look very invasive to me. The newly-introduced server_addr in Split also LGTM.

Let the author @StrikeW make this decision :)

Copy link
Member

@fuyufjh fuyufjh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM

}

break;
case MYSQL:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious: why MySQL doesn't need a "Validator"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will but not that urgent. Just because I want to implement the feature ASAP. I will refactor the validation part when I have bandwidth.

@StrikeW
Copy link
Contributor Author

StrikeW commented Apr 11, 2023

Performance results with 1 coordinator 2 workers setup, ingest a TPCH lineitem table with 10M rows.
The result conforms to our bench result of the Postgres connector, that is each split can reach 2w rows/sec throughput.
image

@StrikeW
Copy link
Contributor Author

StrikeW commented Apr 12, 2023

@fuyufjh I am a little skeptical about this approach. Can we try to deliver the poc on this branch rather than main branch? We can merge the pr after they accept this.

Could you elaborate more about your concerns? @tabVersion

@github-actions github-actions bot added the user-facing-changes Contains changes that are visible to users label Apr 13, 2023
@StrikeW StrikeW enabled auto-merge April 17, 2023 08:23
@StrikeW StrikeW added this pull request to the merge queue Apr 17, 2023
Merged via the queue into main with commit 9151f80 Apr 17, 2023
@StrikeW StrikeW deleted the siyuan/citus-connector branch April 17, 2023 09:00
@CharlieSYH CharlieSYH added the 📖✓ Covered or will be covered in the user docs. label Apr 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
experimental type/feature user-facing-changes Contains changes that are visible to users 📖✓ Covered or will be covered in the user docs.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants