Bug: consumption of connector source will hang forever if encounters an error #7192
Comments
+1. The upper-layer streaming logic has no way to handle the connector error, so we expect the connector itself to retry as much as possible in the case of temporary service unavailability. For those unrecoverable errors, we catch the error. So we may also need to report it as a user error? cc @jon-chuang @fuyufjh
I believe that in this case, it's not an unrecoverable error. Is this accurate? @tabVersion
Perhaps this particular case should actually not be handled as an error, and it is unclear whether we should even report it as a user error. Temporarily failing to consume the Kafka partition should be expected in normal operation.
EDIT: I guess this is an artificial error and we should not focus too much on it.
It also seems that we need to classify errors into recoverable and unrecoverable errors. Question: should this be defined in the connector, or in the stream layer?
+1 for this. The connector itself should not propagate the error to the stream layer; rather, it should retry on its own and, once it gives up retrying, emit an error that the stream actor will decide what to do with. I think we should use bounded exponential retry (default: up to a few hours?) to give the user/SRE time to respond (perhaps by restarting the source), and report the user error in the meantime. A minimal sketch of such a retry helper is below.
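To make this concrete, here is a minimal sketch of a deadline-bounded exponential retry helper. The function name, the backoff schedule, the 300-second cap, and the `eprintln!` reporting are illustrative assumptions rather than RisingWave's actual API:

```rust
use std::time::{Duration, Instant};
use tokio::time::sleep;

/// Retry `op` with exponential backoff until it succeeds or `deadline`
/// (e.g. a few hours) has elapsed, then surface the last error.
async fn retry_until_deadline<T, E, F, Fut>(mut op: F, deadline: Duration) -> Result<T, E>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
    E: std::fmt::Display,
{
    let started = Instant::now();
    let mut backoff = Duration::from_millis(100);
    loop {
        match op().await {
            Ok(v) => return Ok(v),
            Err(e) if started.elapsed() < deadline => {
                // In a real connector this would be reported as a user-visible
                // error so the user/SRE can react before the deadline expires.
                eprintln!("connector error: {e}; retrying in {backoff:?}");
                sleep(backoff).await;
                backoff = (backoff * 2).min(Duration::from_secs(300));
            }
            // Deadline exceeded: emit the error and let the stream actor
            // decide what to do with it.
            Err(e) => return Err(e),
        }
    }
}
```

The connector's fetch loop could wrap each poll (or each consumer re-creation) in such a helper, so the stream only ever sees the final error after the retry budget is exhausted.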
~~Currently, due to `try_stream`, every executor should first propagate its msg errors into the stream. This will eventually be intercepted by `impl StreamConsumer for DispatchExecutor`, the error from `StreamExecutorResult` will be propagated into `BarrierStream`, and `Actor::run_consumer` will terminate, triggering `context.lock_barrier_manager().notify_failure(actor_id, err);`~~
Edit: yes.
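To illustrate the propagation path described above, here is a small self-contained sketch using the `async-stream` crate's `try_stream!` macro (RisingWave uses the `futures-async-stream` attribute form instead, but the propagation behavior is the same): once the stream yields an `Err`, it terminates, and the consumer decides what to do with the failure.

```rust
use async_stream::try_stream;
use futures::{pin_mut, Stream, StreamExt};

// Hypothetical connector data stream: yields a couple of messages, then an
// error terminates the stream for good.
fn data_stream() -> impl Stream<Item = Result<u64, String>> {
    try_stream! {
        for i in 0..3u64 {
            if i == 2 {
                let failed: Result<(), String> = Err("kafka broker unavailable".into());
                failed?; // `?` yields the Err and ends the stream
            }
            yield i;
        }
    }
}

#[tokio::main]
async fn main() {
    let stream = data_stream();
    pin_mut!(stream);
    while let Some(msg) = stream.next().await {
        match msg {
            Ok(v) => println!("message {v}"),
            // After this point the stream is finished; in RisingWave the actor
            // terminates and notifies the barrier manager of the failure.
            Err(e) => println!("stream terminated: {e}"),
        }
    }
}
```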
After discussion with @tabVersion, we are in agreement on the following:
Describe the bug
If the connector source throws an error up to the source executor, the consumption of the source will stop forever:
risingwave/src/stream/src/executor/source/reader.rs, lines 61 to 62 at commit 6f39f43
But the barrier messages can still pass to downstream executors:
risingwave/src/stream/src/executor/source/reader.rs, lines 90 to 95 at commit 98ae6ce
I am not sure whether this behavior is by design. A minimal sketch of the observed behavior is given below.
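For illustration only, here is a self-contained sketch of the symptom (this is not the actual `reader.rs` code; the message contents and the use of `futures::stream::select` are assumptions): once the data side of the merged stream ends due to an error, only barriers keep flowing, so downstream executors stay alive while no data is ever consumed again.

```rust
use futures::stream::{self, StreamExt};

#[tokio::main]
async fn main() {
    // Hypothetical barrier stream: epochs keep arriving. In the real system
    // this side is unbounded, so the actor never terminates.
    let barriers = stream::iter(1u64..=3).map(|epoch| format!("barrier at epoch {epoch}"));

    // Hypothetical connector data stream: one chunk, then an error ends it.
    let data = stream::iter([Ok("chunk 0"), Err("kafka consumer failed")])
        .take_while(|msg| futures::future::ready(msg.is_ok()))
        .map(|msg| format!("data {}", msg.unwrap()));

    // After "data chunk 0", the merged stream only yields barriers: downstream
    // executors still receive barriers, but source consumption has stopped.
    let mut merged = stream::select(barriers, data);
    while let Some(msg) = merged.next().await {
        println!("{msg}");
    }
}
```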
To Reproduce
Manually construct an error in the `KafkaSplitReader` and propagate it to the upper layer. Then start the cluster with the `full` config and run the tpch-bench script to ingest data.
Expected behavior
IIUC, when executors encounter an error in other scenarios, the error is reported to Meta via the barrier collection mechanism (#6319). Then Meta enters the recovery process to recover the whole streaming graph.
But in the scenario of a connector source failure, the upstream system may become available again after a while, or the failure may be truly unrecoverable. So it may be wasteful to let Meta blindly recover the cluster:
risingwave/src/meta/src/barrier/mod.rs, lines 772 to 780 at commit 1ba6981
Candidate solutions:
1. Employ a bounded retry strategy for connector sources. When a connector source encounters an error, we try our best to recover the consumer client; for example, we can drop the current consumer and create a new one to try to resume consumption. If we fail to recover the consumer, we can hang up the connector source stream and prompt users to drop the source and troubleshoot the upstream system. (A sketch of this approach follows the list.)
2. Hang up the connector source stream forever, as the current implementation does, and prompt users to drop the source and troubleshoot the upstream system. For example, we can surface an error to users when they query the downstream MVs of the broken source.
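Here is a sketch of candidate solution 1 under stated assumptions: the `SplitReader` struct below is a hypothetical stand-in (the real reader trait in RisingWave has a different shape), and the rebuild limit and 5-second pause are placeholders. On each reader error, the broken consumer is dropped and rebuilt from the connector properties; the error is only surfaced once the retry budget runs out.

```rust
use std::time::Duration;
use tokio::time::sleep;

/// Placeholder for connection info (broker address, topic, start offsets, ...).
#[derive(Clone)]
struct ConnectorProperties;

/// Hypothetical split reader; stands in for a real Kafka consumer wrapper.
struct SplitReader;

impl SplitReader {
    async fn new(_props: &ConnectorProperties) -> anyhow::Result<Self> {
        Ok(SplitReader) // a real implementation would create the consumer here
    }

    /// Ok(Some(msg)) for data, Ok(None) when the upstream finishes,
    /// Err(..) when the consumer breaks.
    async fn next(&mut self) -> anyhow::Result<Option<Vec<u8>>> {
        Ok(None)
    }
}

/// On a reader error, drop the consumer, build a fresh one, and only give up
/// (surfacing the error to the stream) after `max_rebuilds` attempts.
async fn consume_with_rebuild(
    props: ConnectorProperties,
    max_rebuilds: usize,
) -> anyhow::Result<()> {
    let mut rebuilds = 0;
    let mut reader = SplitReader::new(&props).await?;
    loop {
        match reader.next().await {
            Ok(Some(_msg)) => { /* forward the message downstream */ }
            Ok(None) => return Ok(()), // upstream finished cleanly
            Err(e) if rebuilds < max_rebuilds => {
                rebuilds += 1;
                eprintln!("reader failed ({e}); rebuilding consumer ({rebuilds}/{max_rebuilds})");
                sleep(Duration::from_secs(5)).await;
                reader = SplitReader::new(&props).await?; // drop + recreate
            }
            // Out of retries: report the error so users can drop the source and
            // troubleshoot the upstream system, instead of hanging forever.
            Err(e) => return Err(e),
        }
    }
}
```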
Additional context
No response