Tracking: enable sink_decouple by default for all types of sink #17095

Closed
23 of 31 tasks
hzxa21 opened this issue Jun 4, 2024 · 9 comments

@hzxa21
Collaborator

hzxa21 commented Jun 4, 2024

Currently (as of v1.9), sink_decouple is enabled by default for the following sinks:

  • MQ sinks with an append-only stream (including Kafka, Pulsar, Kinesis, Google Pub/Sub, NATS, MQTT)
  • ClickHouse with an append-only stream
  • Iceberg/data lake with commit_checkpoint_interval set > 1

For all other sinks, users must explicitly set the sink_decouple session variable before creating the sink to enable this feature (see the sketch after the list below). Given that external sink errors and unavailability are unavoidable and outside RisingWave's control, we should enable the sink checkpoint decoupling feature by default to provide a better user experience. We can roll this out gradually with the following schedule:

  • Create an alert when log store lag is large and increasing
  • Enable sink_decouple by default for upsert streams in v1.10
    • MQ (kafka, pulsar, kinesis, google pubsub, nats, mqtt)
    • Clickhouse
  • Enable sink_decouple by default for all other sinks in v1.11
  • Support async log truncation in eligible sinks (currently only the Kafka/Pulsar sinks support it; for other sinks, enabling sink decouple causes unnecessary writes into the log store even on the happy path)
    • Kinesis
    • Google pubsub
    • Nats
    • MQTT
    • Clickhouse
    • BigQuery
    • Doris/Starrocks
    • Elasticsearch
    • Redis
    • Snowflake
    • Cassandra
    • JDBC
    • HTTP
    • MongoDB
    • DynamoDB
    • Data lake (Iceberg/Delta Lake) (not eligible when commit_checkpoint_interval is set to 1)
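
For context, this is roughly the manual workflow the proposal above removes. A minimal sketch, assuming a JDBC sink into a local Postgres; the connector options are illustrative placeholders, not a recommended configuration:

```sql
-- Opt in for the current session before creating the sink; without this,
-- sinks other than the ones listed above run without checkpoint decoupling.
SET sink_decouple = true;

-- Hypothetical sink; connection options are placeholders.
CREATE SINK my_jdbc_sink FROM my_materialized_view
WITH (
    connector = 'jdbc',
    jdbc.url = 'jdbc:postgresql://localhost:5432/mydb?user=postgres&password=postgres',
    table.name = 'target_table',
    type = 'upsert',
    primary_key = 'id'
);
```
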
@github-actions github-actions bot added this to the release-1.10 milestone Jun 4, 2024
@hzxa21
Collaborator Author

hzxa21 commented Jun 4, 2024

Note that pre-existing sinks will not be affected. In other words, if a sink was created with sink decouple disabled, sink decouple will not be enabled for it after a version upgrade.

@fuyufjh
Member

fuyufjh commented Jul 2, 2024

Please remember to update our Grafana Dashboard after rolling out sink_decouple to all sinks. Currently it's a bit confusing to both us and customers.

@xxhZs
Contributor

xxhZs commented Aug 15, 2024

Sinks that currently support default sink_decouple = true:

  1. Chunk-level truncate (async):
    all MQ sinks (only MQTT cannot support async, but since it already had sink_decouple turned on before, it stays on); Redis; JDBC
    BigQuery: feat(sink): support async for bigquery sink #17488
    Elasticsearch/OpenSearch: refactor(sink): refactor es and opensearch to support async #17746
    MongoDB + DynamoDB: feat(sink): support async for mongodb dynamodb #17645
  2. commit_checkpoint_interval > 1 with default sink_decouple = true:
    Delta Lake, Iceberg, StarRocks, ClickHouse

Sinks that currently do not support sink_decouple = true:
  Snowflake, Doris: could support commit_checkpoint_interval, but this has not been implemented since there is no user demand
  Cassandra, HTTP: Java interface
  SQL Server: barrier commit

@fuyufjh
Member

fuyufjh commented Aug 16, 2024

Thanks @xxhZs for summarizing the status quo.

After some thought, I'd like to propose enabling sink_decouple for all sinks (again). Reasons:

  1. First of all, my motivation is to provide a consistent experience for users. We need to clearly answer the question: how does RisingWave behave when the sink goes wrong? The answer is completely different for sink_decouple=true versus sink_decouple=false, which confuses users a lot.
  2. It increases the stability of clusters, which is exactly the initial motivation for the log store. The performance concern is less important here because the bottleneck of a sink is very likely to be something else; especially for the currently unsupported ones (Snowflake, Cassandra), I bet the bottleneck is not the log store :)
  3. It also reduces the mental burden on our engineers. For example, we can rely on the retry logic provided by the log store, so that the SinkExecutor can return Err safely. The fewer branches, the more robust the code.

@fuyufjh
Member

fuyufjh commented Aug 16, 2024

2. commit_checkpoint_interval > 1 with default sink_decouple = true:
    Delta Lake, Iceberg, StarRocks, ClickHouse

For the data lakes (Delta Lake and Iceberg), I think they are meant to be written at low frequency, so perhaps we should set a higher default value of commit_checkpoint_interval for them, such as 10 secs (see the sketch below).

For data warehouses, there is indeed some additional runtime cost, but I'd like to trade this off for better stability.
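
As a sketch of what that could look like for an Iceberg sink (option list abbreviated; the paths and names below are hypothetical, and assuming the default ~1s barrier interval, an interval of 10 checkpoints is roughly 10 seconds per commit):

```sql
-- Hypothetical Iceberg sink: commit roughly every 10 checkpoints instead of every one.
CREATE SINK iceberg_sink FROM my_materialized_view
WITH (
    connector = 'iceberg',
    type = 'append-only',
    warehouse.path = 's3a://my-bucket/warehouse',
    database.name = 'demo_db',
    table.name = 'demo_table',
    -- commit to the table every 10 checkpoints rather than on every checkpoint
    commit_checkpoint_interval = 10
);
```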

@pjpringlenom

For sink decouple with upsert: does this result in the conflation of multiple updates to the same key within a commit_checkpoint_interval, i.e. a reduced number of updates being sent per key?

@xxhZs
Contributor

xxhZs commented Sep 1, 2024

For sink decouple with upsert: does this result in the conflation of multiple updates to the same key within a commit_checkpoint_interval, i.e. a reduced number of updates being sent per key?

Hello, the answer is no. Sink decouple simply replays the sink values; it does not compact updates to the same key.

@fuyufjh
Member

fuyufjh commented Sep 5, 2024

For sink decouple with upsert: does this result in the conflation of multiple updates to the same key within a commit_checkpoint_interval, i.e. a reduced number of updates being sent per key?

Good point. If there are many duplicate keys in the output data of each commit_checkpoint_interval, deduplicating here could help.

Currently, sink_decouple is implemented by a very simple structure called the "log store". As its name implies, it just records input events one by one without doing any optimization such as compacting events under the same PK. We have not yet encountered cases where deduplicating here would help a lot, but if anyone has met such a case, we may consider applying this optimization.
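
To make the replay behavior concrete, here is a hypothetical trace (the table, sink, and Kafka options are made-up placeholders): three updates to the same primary key arriving within one checkpoint are all written to the log store and replayed to the downstream system in order, rather than being collapsed into the final value.

```sql
CREATE TABLE t (id INT PRIMARY KEY, v INT);

-- Hypothetical upsert sink; broker and topic are placeholders.
CREATE SINK t_sink FROM t
WITH (
    connector = 'kafka',
    properties.bootstrap.server = 'localhost:9092',
    topic = 'demo'
) FORMAT UPSERT ENCODE JSON;

-- If these land within one checkpoint, the sink still receives the full sequence
-- (1, 10), (1, 20), (1, 30); the log store does not compact them to (1, 30).
INSERT INTO t VALUES (1, 10);
UPDATE t SET v = 20 WHERE id = 1;
UPDATE t SET v = 30 WHERE id = 1;
```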

@xxhZs
Contributor

xxhZs commented Sep 26, 2024

In #18182, default_sink_decouple = true is set for all sinks.
