
TiCDC kafka sink retry is too short. #9504

Closed
dveeden opened this issue Aug 4, 2023 · 7 comments · Fixed by #9511
Labels
affects-6.5 affects-7.1 area/ticdc Issues or PRs related to TiCDC. severity/major type/bug The issue is confirmed as a bug.

Comments

@dveeden
Contributor

dveeden commented Aug 4, 2023

What did you do?

Failure on the kafka sink results in fatal error due to too short retry time.

What did you expect to see?

Proper retry on kafka connection failures

What did you see instead?

[2023/08/04 06:39:28.761 +00:00] [WARN] [admin.go:100] ["query kafka cluster meta failed, retry it"] [namespace=default] [changefeed=foobar-2023-08-03] [error=EOF]
[2023/08/04 06:39:28.796 +00:00] [WARN] [admin.go:100] ["query kafka cluster meta failed, retry it"] [namespace=default] [changefeed=foobar-2023-08-03] [error="write tcp 4.3.2.1:44712->4.3.2.2:9093: write: broken pipe"]
[2023/08/04 06:39:28.897 +00:00] [WARN] [admin.go:100] ["query kafka cluster meta failed, retry it"] [namespace=default] [changefeed=foobar-2023-08-03] [error="write tcp 4.3.2.1:44712->4.3.2.2:9093: write: broken pipe"]
[2023/08/04 06:39:28.897 +00:00] [WARN] [kafka_manager.go:167] ["Kafka admin client describe topics failed"] [namespace=default] [changefeed=foobar-2023-08-03] [error="[CDC:ErrReachMaxTry]reach maximum try: 3, error: write tcp 4.3.2.1:44712->4.3.2.2:9093: write: broken pipe: write tcp 4.3.2.1:44712->4.3.2.2:9093: write: broken pipe"] [errorVerbose="[CDC:ErrReachMaxTry]reach maximum try: 3, error: write tcp 4.3.2.1:44712->4.3.2.2:9093: write: broken pipe: write tcp 4.3.2.1:44712->4.3.2.2:9093: write: broken pipe\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20221009092201-b66cddb77c32/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/errors@v0.11.5-0.20221009092201-b66cddb77c32/normalize.go:164\ngithub.com/pingcap/tiflow/pkg/retry.run\n\tgithub.com/pingcap/tiflow/pkg/retry/retry_with_opt.go:69\ngithub.com/pingcap/tiflow/pkg/retry.Do\n\tgithub.com/pingcap/tiflow/pkg/retry/retry_with_opt.go:34\ngithub.com/pingcap/tiflow/pkg/sink/kafka.(*saramaAdminClient).queryClusterWithRetry\n\tgithub.com/pingcap/tiflow/pkg/sink/kafka/admin.go:92\ngithub.com/pingcap/tiflow/pkg/sink/kafka.(*saramaAdminClient).GetTopicsMeta\n\tgithub.com/pingcap/tiflow/pkg/sink/kafka/admin.go:251\ngithub.com/pingcap/tiflow/cdc/sink/dmlsink/mq/manager.(*kafkaTopicManager).getMetadataOfTopics\n\tgithub.com/pingcap/tiflow/cdc/sink/dmlsink/mq/manager/kafka_manager.go:165\ngithub.com/pingcap/tiflow/cdc/sink/dmlsink/mq/manager.(*kafkaTopicManager).backgroundRefreshMeta\n\tgithub.com/pingcap/tiflow/cdc/sink/dmlsink/mq/manager/kafka_manager.go:106\nruntime.goexit\n\truntime/asm_arm64.s:1172"] [duration=139.603489ms]

Looks like (*saramaAdminClient) queryClusterWithRetry() is using retry.WithBackoffBaseDelay() with a delay/retry of defaultRetryBackoff/defaultRetryMaxTries, which are set to 20 and 3. However, as the argument for WithBackoffBaseDelay() is in ms, this isn't long enough to do a proper backoff.

defaultRetryBackoff = 20
defaultRetryMaxTries = 3

}, retry.WithBackoffBaseDelay(defaultRetryBackoff), retry.WithMaxTries(defaultRetryMaxTries))

func WithBackoffBaseDelay(delayInMs int64) Option {
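As a rough back-of-envelope sketch (assuming the delay simply doubles from the 20 ms base between attempts; the actual backoff curve in pkg/retry may differ), three tries only ever sleep on the order of tens of milliseconds in total:

package main

import (
	"fmt"
	"time"
)

func main() {
	// Illustrative estimate only: assume the delay doubles from the 20 ms
	// base between attempts (the real curve in pkg/retry may differ).
	base := 20 * time.Millisecond
	tries := 3

	total := time.Duration(0)
	delay := base
	for i := 0; i < tries-1; i++ { // sleeps happen between tries, not after the last one
		total += delay
		delay *= 2
	}
	fmt.Println(total) // 60ms -- nowhere near enough to ride out a broker restart
}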

The result is a TiCDC changefeed that's dead in the water.

Versions of the cluster

TiCDC v7.1.1

This is a possible regression of #8225 and #8223

@dveeden dveeden added area/ticdc Issues or PRs related to TiCDC. type/bug The issue is confirmed as a bug. labels Aug 4, 2023
@dveeden
Contributor Author

dveeden commented Aug 4, 2023

/severity major

@dveeden
Contributor Author

dveeden commented Aug 4, 2023

The retry should be enough to allow for:

  • a network switch being restarted
  • a kafka server restart

@dveeden
Contributor Author

dveeden commented Aug 4, 2023

Example of delays:

package main

import (
	"context"
	"errors"
	"fmt"
	"math"
	"time"

	"github.com/pingcap/tiflow/pkg/retry"
)

func main() {
	retryconfs := []struct {
		delay   int64
		retries uint64
	}{
		{20, 3},
		{20, 10},
		{20000, 10},
		{20000000, 10},
		{20000000000, 10},
		{20000000000000, 10},
		{20000000000000000, 10},
		{math.MaxInt64, 10},
		{math.MaxInt64, 100},
		{math.MaxInt64, 200},
		{math.MaxInt64, 300},
	}

	start := time.Now()
	for _, retryconf := range retryconfs {
		fmt.Printf("Trying with delay of %d and retries set to %d\n", retryconf.delay, retryconf.retries)
		retry.Do(context.Background(), func() error {
			print(".")
			return errors.New("uhoh")
		}, retry.WithBackoffBaseDelay(retryconf.delay), retry.WithMaxTries(retryconf.retries))
		fmt.Printf("\nTotal time: %s\n\n", time.Since(start))
	}
}
Trying with delay of 20 and retries set to 3
...
Total time: 81.656453ms

Trying with delay of 20 and retries set to 10
..........
Total time: 910.131335ms

Trying with delay of 20000 and retries set to 10
..........
Total time: 1.812777722s

Trying with delay of 20000000 and retries set to 10
..........
Total time: 2.715434142s

Trying with delay of 20000000000 and retries set to 10
..........
Total time: 3.617932324s

Trying with delay of 20000000000000 and retries set to 10
..........
Total time: 4.520525005s

Trying with delay of 20000000000000000 and retries set to 10
..........
Total time: 5.423173451s

Trying with delay of 9223372036854775807 and retries set to 10
..........
Total time: 6.325754579s

Trying with delay of 9223372036854775807 and retries set to 100
....................................................................................................
Total time: 16.253248642s

Trying with delay of 9223372036854775807 and retries set to 200
........................................................................................................................................................................................................
Total time: 36.213455742s

Trying with delay of 9223372036854775807 and retries set to 300
............................................................................................................................................................................................................................................................................................................
Total time: 1m6.194981098s

@dveeden
Contributor Author

dveeden commented Aug 4, 2023

I would suggest:

  • Increase the default delay/retry to a combination that would retry for at least 1 minute
  • Make the retry/delay configurable as part of the sink uri
  • Consider retrying forever
  • Consider other retry options (e.g. WithTotalRetryDuratoin())

Note that there is a typo in WithTotalRetryDuratoin.
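For illustration, here is a minimal sketch of the first suggestion, reusing the retry.Do helper from the example above. The 500 ms base delay and 30 tries are placeholder values, not a tested recommendation:

package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	"github.com/pingcap/tiflow/pkg/retry"
)

// Illustrative values only: retry for roughly a minute instead of ~60 ms.
const (
	suggestedRetryBackoffMs = 500 // base delay for WithBackoffBaseDelay, in ms
	suggestedRetryMaxTries  = 30
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	err := retry.Do(ctx, func() error {
		// stand-in for the real metadata query against the Kafka cluster
		return errors.New("kafka not reachable yet")
	},
		retry.WithBackoffBaseDelay(suggestedRetryBackoffMs),
		retry.WithMaxTries(suggestedRetryMaxTries),
	)
	fmt.Println(err)
}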

@dveeden
Contributor Author

dveeden commented Aug 4, 2023

/remove-label may-affects-5.2
/remove-label may-affects-5.3
/remove-label may-affects-5.4
/remove-label may-affects-6.1
/remove-label may-affects-6.5

@3AceShowHand
Contributor

The same as #9481.

@3AceShowHand
Contributor

There is a retry mechanism inside the Sarama admin client; we can adjust its configuration so the admin client retries multiple times when the error is retryable. For non-retryable errors, such as broken pipe or EOF, we can propagate the error to the upper level, so that the whole Kafka sink can be reconstructed and a new connection established with the Kafka cluster if the cluster is available.
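Sketch of what tuning Sarama's built-in retry could look like (assumes the Shopify/sarama client; the broker address, version, and retry values are placeholders):

package main

import (
	"log"
	"time"

	"github.com/Shopify/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	cfg.Version = sarama.V2_0_0_0 // admin APIs need a sufficiently new protocol version
	// Let Sarama itself retry retryable metadata/admin requests.
	cfg.Metadata.Retry.Max = 10
	cfg.Metadata.Retry.Backoff = 500 * time.Millisecond
	cfg.Admin.Retry.Max = 10
	cfg.Admin.Retry.Backoff = 500 * time.Millisecond

	admin, err := sarama.NewClusterAdmin([]string{"kafka:9093"}, cfg)
	if err != nil {
		// Non-retryable failures (broken pipe, EOF, ...) surface here, so the
		// caller can rebuild the sink and reconnect once the cluster is back.
		log.Fatalf("create admin client: %v", err)
	}
	defer admin.Close()

	topics, err := admin.ListTopics()
	if err != nil {
		log.Fatalf("list topics: %v", err)
	}
	log.Printf("found %d topics", len(topics))
}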
