TiCDC kafka sink retry is too short. #9504
Comments
/severity major
The retry should be enough to allow for:
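Example of delays:

package main

import (
	"context"
	"errors"
	"fmt"
	"math"
	"time"

	"github.com/pingcap/tiflow/pkg/retry"
)

func main() {
	// Each entry is one configuration to try: the base delay passed to
	// retry.WithBackoffBaseDelay and the limit passed to retry.WithMaxTries.
	retryconfs := []struct {
		delay   int64
		retries uint64
	}{
		{20, 3},
		{20, 10},
		{20000, 10},
		{20000000, 10},
		{20000000000, 10},
		{20000000000000, 10},
		{20000000000000000, 10},
		{math.MaxInt64, 10},
		{math.MaxInt64, 100},
		{math.MaxInt64, 200},
		{math.MaxInt64, 300},
	}

	// start is taken once, so each "Total time" is cumulative since program start.
	start := time.Now()
	for _, retryconf := range retryconfs {
		fmt.Printf("Trying with delay of %d and retries set to %d\n", retryconf.delay, retryconf.retries)
		// The operation always fails, so every configured try is used up.
		// The returned error is intentionally ignored; only the timing matters.
		_ = retry.Do(context.Background(), func() error {
			print(".") // one dot per attempt
			return errors.New("uhoh")
		}, retry.WithBackoffBaseDelay(retryconf.delay), retry.WithMaxTries(retryconf.retries))
		fmt.Printf("\nTotal time: %s\n\n", time.Since(start))
	}
}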
I would suggest to:
Note that there is a typo in
/remove-label may-affects-5.2
The same as #9481
There is a retry mechanism inside the Sarama admin client. We can adjust its configuration to let the admin client retry multiple times if the error is retryable. For unretryable errors, such as broken pipe or EOF, we can throw the error to the upper level, so that the whole Kafka sink can be reconstructed and a new connection established with the Kafka cluster if the cluster is available.
What did you do?
A failure in the Kafka sink results in a fatal error because the retry time is too short.
What did you expect to see?
Proper retry on Kafka connection failures.
What did you see instead?
Looks like `(*saramaAdminClient) queryClusterWithRetry()` is using `retry.WithBackoffBaseDelay()` with a delay/retry of `defaultRetryBackoff`/`defaultRetryMaxTries`, which are set to 20 and 3. However, as the argument for `WithBackoffBaseDelay()` is in ms, this isn't long enough to do a proper backoff.
tiflow/pkg/sink/kafka/admin.go, lines 45 to 46 in a2f5a74
tiflow/pkg/sink/kafka/admin.go, line 109 in a2f5a74
tiflow/pkg/retry/options.go, line 55 in a2f5a74
The result is a TiCDC changefeed that's dead in the water.
Versions of the cluster
TiCDC v7.1.1
This is a possible regression of #8225 and #8223