
refactor: impl ack and migrate to durable consumer for Nats #18873

Merged: 6 commits merged into main from tab/nats-offset on Oct 16, 2024

Conversation


@tabVersion tabVersion commented Oct 11, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

As the title says: ack messages when a barrier (checkpoint) completes.

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

Add a required parameter for the NATS connector, consumer.durable_name; RisingWave will either create a new durable consumer with that name or continue reading from an existing one.

Note: we discourage reusing the same consumer.durable_name across streaming jobs, as it will cause data loss.

After this PR, RisingWave no longer keeps track of NATS offsets; the broker is responsible for offset management. We therefore accept a semantic regression from exactly-once to at-least-once.
If users want to avoid data loss, they should set consumer.ack_policy to all or explicit.
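
For illustration, here is a minimal sketch of how the consumer.ack_policy value could be mapped onto the JetStream ack-policy enum from the async_nats crate; the function name is hypothetical and not the connector's actual code.

```rust
use async_nats::jetstream::consumer::AckPolicy;

/// Map the `consumer.ack_policy` WITH-clause value onto a JetStream ack policy.
/// `explicit` acks every message individually, `all` acks cumulatively up to
/// the last delivered message, and `none` disables application-level acks.
fn parse_ack_policy(raw: &str) -> Result<AckPolicy, String> {
    match raw.to_lowercase().as_str() {
        "none" => Ok(AckPolicy::None),
        "all" => Ok(AckPolicy::All),
        "explicit" => Ok(AckPolicy::Explicit),
        other => Err(format!("unsupported consumer.ack_policy: {other}")),
    }
}
```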


Why abandon managing offsets for the durable consumer to achieve exactly-once?

  • The get_or_create_consumer API requires that the provided config stay identical to the one used when the consumer was created. It does not encourage reading from a specific offset; if we really wanted that, we would have to build a new consumer each time (see the sketch below).
    • Given the complaints about too many consumer groups on Kafka, I'd rather give up the previous implementation.
  • In this setup, managing offsets ourselves does not scale. Imagine multiple executors, each fetching a batch from the subject: after recovery, which offset should each of them start from?
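
A minimal sketch of attaching to (or creating) such a durable pull consumer with the async_nats crate; the function name, stream name, and parameters below are illustrative, not the exact code in this PR.

```rust
use async_nats::jetstream::{
    self,
    consumer::{pull, AckPolicy, PullConsumer},
};

/// Attach to (or create) a durable pull consumer on the broker.
/// `durable_name` and `ack_policy` would come from the source's WITH clause
/// (`consumer.durable_name`, `consumer.ack_policy`).
async fn attach_durable_consumer(
    nats_url: &str,
    stream_name: &str,
    durable_name: &str,
    ack_policy: AckPolicy,
) -> anyhow::Result<PullConsumer> {
    let client = async_nats::connect(nats_url).await?;
    let context = jetstream::new(client);
    let stream = context.get_stream(stream_name).await?;

    // get_or_create_consumer is idempotent: if a consumer with this durable
    // name already exists, the supplied config must match the original one,
    // and reading resumes from wherever the broker's consumer state points.
    let consumer = stream
        .get_or_create_consumer(
            durable_name,
            pull::Config {
                durable_name: Some(durable_name.to_string()),
                ack_policy,
                ..Default::default()
            },
        )
        .await?;

    Ok(consumer)
}
```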

@github-actions github-actions bot added the type/fix Bug fix label Oct 11, 2024
@tabVersion tabVersion marked this pull request as ready for review October 11, 2024 10:36
@graphite-app graphite-app bot requested a review from a team October 11, 2024 10:37
@tabVersion tabVersion requested review from fuyufjh and removed request for a team October 11, 2024 10:37
@xxchan (Member) left a comment

After browsing the doc about the model, I don't get why we need to ack for NATS Jetstream.

https://docs.nats.io/using-nats/developer/develop_jetstream/consumers#delivery-reliability

The "AckPolicy" is defined by the Consumer, not the Stream, and it means application-level acknowledgements, e.g. for failures in business logic.

We can just always use none, with no need to ack.

@tabVersion (Contributor, Author)

After browsing the doc about the model, I don't get why we need to ack for NATS Jetstream.

docs.nats.io/using-nats/developer/develop_jetstream/consumers#delivery-reliability

The "AckPolicy" is defined by the Consumer, not the Stream, and it means application-level acknowledgements, e.g. for failures in business logic.

Thanks for the fast review. This PR is part of the NATS JetStream refactor; let me address your concern in a tracking issue.

@tabVersion (Contributor, Author)

@xxchan refer to #18876 for details

@@ -30,9 +36,12 @@ impl From<NatsMessage> for SourceMessage {
key: None,
payload: Some(message.payload),
// For nats jetstream, use sequence id as offset
// DEPRECATED: no longer use sequence id as offset, let nats broker handle failover
Member:

👍

Comment on lines +160 to +171
match ack_policy {
    JetStreamAckPolicy::None => (),
    JetStreamAckPolicy::Explicit => {
        for reply_subject in reply_subjects {
            ack(context, reply_subject).await;
        }
    }
    JetStreamAckPolicy::All => {
        if let Some(reply_subject) = reply_subjects.last() {
            ack(context, reply_subject.clone()).await;
        }
    }
Member:

Just curious, the subjects like "$JS.ACK.test_stream_1.l2vxD20k.1.3.4.1728547619594368340.0" already contain the offset information, right?

I want to make sure the logic is correct in this situation:
NATS sends messages m1, m2, m3, m4.
RW receives m1, m2, (checkpoint 1), m3, m4.
When we complete checkpoint 1 and send the batch ack, that means acking m2's reply subject. On the NATS side, this will ack only m1 and m2, and m3 and m4 remain un-acked, right?

@tabVersion (Contributor, Author):

Just curious, the subjects like "$JS.ACK.test_stream_1.l2vxD20k.1.3.4.1728547619594368340.0" already contain the offset information, right?

yes

I want to make sure the logic is correct in this situation:
NATS sends messages m1, m2, m3, m4.
RW receives m1, m2, (checkpoint 1), m3, m4.
When we complete checkpoint 1 and send the batch ack, that means acking m2's reply subject. On the NATS side, this will ack only m1 and m2, and m3 and m4 remain un-acked, right?

If ack_policy is ack_all (batch ack), then at checkpoint 1 we only ack msg2, which acknowledges both msg1 and msg2. Msg3 and msg4 are not acked.
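
The PR's ack helper takes a JetStream context; as a minimal standalone sketch of the same idea with the core async_nats client (the function name and signature here are illustrative, not the PR's exact code):

```rust
use async_nats::Client;
use bytes::Bytes;

/// Acknowledge everything up to and including the message whose ack subject
/// is `reply_subject`. With AckPolicy::All, acking the latest delivered
/// message implicitly acknowledges all earlier messages on that consumer.
async fn ack_up_to(client: &Client, reply_subject: String) -> anyhow::Result<()> {
    // A JetStream ack is a core NATS publish to the message's
    // "$JS.ACK.<stream>.<consumer>...." reply subject; an empty payload
    // counts as a plain +ACK.
    client.publish(reply_subject, Bytes::new()).await?;
    Ok(())
}
```

In the m1..m4 example above, only m2's reply subject is acked at checkpoint 1; the broker then treats m1 and m2 as consumed, while m3 and m4 stay pending.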

@yufansong (Member) left a comment

lgtm


gitguardian bot commented Oct 14, 2024

⚠️ GitGuardian has uncovered 3 secrets following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secrets in your pull request:
  • GitGuardian id 9425213 (Triggered): Generic Password in commit a6ffcd6, e2e_test/source/tvf/postgres_query.slt
  • GitGuardian id 9425213 (Triggered): Generic Password in commit a6ffcd6, e2e_test/source/tvf/postgres_query.slt
  • GitGuardian id 9425213 (Triggered): Generic Password in commit 02f28b8, ci/scripts/e2e-source-test.sh
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secrets safely. Learn the best practices here.
  3. Revoke and rotate these secrets.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.


//
// DEPRECATED: no longer use sequence id as offset, let nats broker handle failover
// use reply_subject as the offset for ack; we just check the persisted state to tell whether this is the first run
offset: message.reply_subject.unwrap_or_default(),
@tabVersion (Contributor, Author):

@yufansong Unfortunately, the offset column has to be kept as an additional column, and I don't think it is worth making an exception for NATS. We use the offset here to store the reply subject.

Member:

Got it. SGTM

Comment on lines +76 to +78
// We have a record for this NATS split, containing the last seen offset (seq id) or reply subject.
// We no longer use the seq id as the start position;
// instead, we let the reader resume from the durable consumer on the broker.
Member:

I understand these comments, but you didn't make any changes to the implementation code (L58 ~ L72). Why and how does it work?

@tabVersion (Contributor, Author):

We use the get_or_create_consumer API when building the stream consumer (see the API ref and the sketch after the rationale list above). If we attach to an existing durable consumer, the provided config must be the same as the one the consumer was created with. So the config here should never change; it always aligns with the one derived from the WITH clause.

Follow-up changes are in #18895.

@@ -61,6 +62,8 @@ def_anyhow_newtype! {
async_nats::jetstream::consumer::pull::MessagesError => "Nats error",
async_nats::jetstream::context::CreateStreamError => "Nats error",
async_nats::jetstream::stream::ConsumerError => "Nats error",
NatsJetStreamError => "Nats error",
Member:

All five of these errors map to just "Nats error". Consider including more context?

@fuyufjh (Member) left a comment

LGTM

@xxchan (Member) left a comment

Since the main motivation is performance, should we first do a quick demo & benchmark to confirm this direction is correct?

Another thing: I still don't understand why we need to make the ack policy customizable. It should be decided by the application (i.e., us).

@tabVersion (Contributor, Author)

Since the main motivation is performance, should we first do a quick demo & benchmark to confirm this direction is correct?

Don't get me wrong: the motivation for the refactor is making the NATS JetStream consumer parallelizable. Acking is unavoidable when syncing consumption progress among workers.


xxchan commented Oct 14, 2024

Don't get me wrong, the motivation for the refactor is making Nats JetStream consumer parallelizable.

I mean exactly that: we could test whether "consumer group"-like usage actually brings performance improvements. Otherwise, perhaps something like manually sharding subjects into sub-subjects is inevitable.

e.g., Chrisss93/flink-connector-nats says:

Specifying multiple consumers with non-overlapping subject-filters allows different portions of the stream to be read and processed in parallel (somewhat like a Kafka/Pulsar topic with multiple partitions).

(I do not mean that what they are doing is correct either.)

Google Pub/Sub mentions explicitly that they offer "per-message parallelism", but I feel lost when browsing NATS's docs about what the best practice is. Therefore, IMO testing is better.


xxchan commented Oct 14, 2024

On the other hand, cumulative ack (AckAll)'s overhead should be much smaller than individual acks.

@tabVersion (Contributor, Author)

On the other hand, cumulative ack (AckAll)'s overhead should be much smaller than individual acks.

#[serde(rename = "consumer.ack_policy")]
#[serde_as(as = "Option<DisplayFromStr>")]
pub ack_policy: Option<String>,

we are offering the flexibility to users.


xxchan commented Oct 14, 2024

we are offering the flexibility to users.

I asked this exact question above:

Another thing: I still don't understand why we need to make the ack policy customizable. It should be decided by the application (i.e., us).

I don't get what benefits this flexibility brings to users.


stdrc commented Oct 14, 2024

Since the main motivation is performance, should we first do a quick demo & benchmark to confirm this direction is correct?

Agree. IIUC, changing "ack all" + "sequence number as offset" to per-message ack is a regression in terms of "exactly-once" semantics, so it's reasonable to prove there will be a performance benefit. And if we make it customizable, then when a user specifies ack_all it will indeed be a regression of the "exactly-once" semantics.


xxchan commented Oct 14, 2024

changing "ack all" + "sequence number as offset" to per-message ack is a regression in terms of the "exactly-once" semantics.

Previously we were not doing "ack all", but rather no ack at all (regardless of the ack policy).

And regarding exactly-once and performance, #18876 has more discussion.

@yufansong (Member)

Since the main motivation is performance, should we first do a quick demo & benchmark to confirm this direction is correct?

Agree. IIUC, changing "ack all" + "sequence number as offset" to per-message ack is a regression in terms of "exactly-once" semantics, so it's reasonable to prove there will be a performance benefit. And if we make it customizable, then when a user specifies ack_all it will indeed be a regression of the "exactly-once" semantics.

Previously, we used no ack to keep the exactly-once semantics, but that has performance problems.

@tabVersion tabVersion changed the title from "fix: ack messages on Nats JetStream" to "refactor: impl ack and migrate to durable consumer for Nats" on Oct 15, 2024
@tabVersion tabVersion enabled auto-merge October 15, 2024 08:00
@tabVersion tabVersion added this pull request to the merge queue Oct 15, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 15, 2024
@tabVersion tabVersion added this pull request to the merge queue Oct 16, 2024
Merged via the queue into main with commit d997ebe Oct 16, 2024
28 of 30 checks passed
@tabVersion tabVersion deleted the tab/nats-offset branch October 16, 2024 03:45