
Support timeout when fetching metadata #1359

Merged: 1 commit into IBM:master on May 8, 2019

Conversation

@slaunay (Contributor) commented Apr 18, 2019

Versions

Sarama Version: v1.22.0-6-g2a49b70 (2a49b70)
Kafka Version: 1.0.2
Go Version: 1.12

Configuration

What configuration values are you using for Sarama and Kafka?

Default configuration, see unit tests.

Problem Description

Fetching metadata from an unreachable cluster of 2 brokers using the default configuration (30s DialTimeout, 3 retries, 250ms backoff) can take more than 4 minutes before failing with ErrOutOfBrokers.

The following logs can be seen when running the provided unit test against master with unresponsive seed brokers (TCP read timeouts on 127.0.0.1:63168 and 127.0.0.1:63169 simulate an unreachable cluster; a TCP dial timeout would behave the same way):

DEBUG=true go test -v -run=TestClientMetadataTimeout/timeout=10m0s
[sarama] 2019/04/18 09:38:00 Using random seed: 1555605480041448000
=== RUN   TestClientMetadataTimeout
=== RUN   TestClientMetadataTimeout/timeout=10m0s
[sarama] 2019/04/18 09:38:05 Initializing new client
[sarama] 2019/04/18 09:38:05 client/metadata fetching metadata for all topics from broker 127.0.0.1:63166
[sarama] 2019/04/18 09:38:05 Connected to broker at 127.0.0.1:63166 (unregistered)
[sarama] 2019/04/18 09:38:05 Successfully initialized new client
[sarama] 2019/04/18 09:38:05 Closed connection to broker 127.0.0.1:63166
[sarama] 2019/04/18 09:38:05 client/metadata fetching metadata for all topics from broker 127.0.0.1:63168
[sarama] 2019/04/18 09:38:05 Connected to broker at 127.0.0.1:63168 (unregistered)
[sarama] 2019/04/18 09:38:35 client/metadata got error from broker -1 while fetching metadata: read tcp 127.0.0.1:63170->127.0.0.1:63168: i/o timeout
[sarama] 2019/04/18 09:38:35 Closed connection to broker 127.0.0.1:63168
[sarama] 2019/04/18 09:38:35 client/metadata fetching metadata for all topics from broker 127.0.0.1:63169
[sarama] 2019/04/18 09:38:35 Connected to broker at 127.0.0.1:63169 (unregistered)
[sarama] 2019/04/18 09:39:05 client/metadata got error from broker -1 while fetching metadata: read tcp 127.0.0.1:63184->127.0.0.1:63169: i/o timeout
[sarama] 2019/04/18 09:39:05 Closed connection to broker 127.0.0.1:63169
[sarama] 2019/04/18 09:39:05 client/metadata no available broker to send metadata request to
[sarama] 2019/04/18 09:39:05 client/brokers resurrecting 2 dead seed brokers
[sarama] 2019/04/18 09:39:05 client/metadata retrying after 250ms... (3 attempts remaining)
[sarama] 2019/04/18 09:39:05 client/metadata fetching metadata for all topics from broker 127.0.0.1:63168
[sarama] 2019/04/18 09:39:05 Connected to broker at 127.0.0.1:63168 (unregistered)
[sarama] 2019/04/18 09:39:35 client/metadata got error from broker -1 while fetching metadata: read tcp 127.0.0.1:63187->127.0.0.1:63168: i/o timeout
[sarama] 2019/04/18 09:39:35 Closed connection to broker 127.0.0.1:63168
[sarama] 2019/04/18 09:39:35 client/metadata fetching metadata for all topics from broker 127.0.0.1:63169
[sarama] 2019/04/18 09:40:05 client/metadata got error from broker -1 while fetching metadata: read tcp 127.0.0.1:63193->127.0.0.1:63169: i/o timeout
[sarama] 2019/04/18 09:40:05 Closed connection to broker 127.0.0.1:63169
[sarama] 2019/04/18 09:40:05 client/metadata no available broker to send metadata request to
[sarama] 2019/04/18 09:40:05 client/brokers resurrecting 2 dead seed brokers
[sarama] 2019/04/18 09:40:05 client/metadata retrying after 250ms... (2 attempts remaining)
[sarama] 2019/04/18 09:40:05 client/metadata fetching metadata for all topics from broker 127.0.0.1:63168
[sarama] 2019/04/18 09:40:05 Connected to broker at 127.0.0.1:63168 (unregistered)
[sarama] 2019/04/18 09:40:35 client/metadata got error from broker -1 while fetching metadata: read tcp 127.0.0.1:63196->127.0.0.1:63168: i/o timeout
[sarama] 2019/04/18 09:40:35 Closed connection to broker 127.0.0.1:63168
[sarama] 2019/04/18 09:40:35 client/metadata fetching metadata for all topics from broker 127.0.0.1:63169
[sarama] 2019/04/18 09:40:35 Connected to broker at 127.0.0.1:63169 (unregistered)
[sarama] 2019/04/18 09:41:05 client/metadata got error from broker -1 while fetching metadata: read tcp 127.0.0.1:63204->127.0.0.1:63169: i/o timeout
[sarama] 2019/04/18 09:41:05 Closed connection to broker 127.0.0.1:63169
[sarama] 2019/04/18 09:41:05 client/metadata no available broker to send metadata request to
[sarama] 2019/04/18 09:41:05 client/brokers resurrecting 2 dead seed brokers
[sarama] 2019/04/18 09:41:05 client/metadata retrying after 250ms... (1 attempts remaining)
[sarama] 2019/04/18 09:41:05 client/metadata fetching metadata for all topics from broker 127.0.0.1:63168
[sarama] 2019/04/18 09:41:35 client/metadata got error from broker -1 while fetching metadata: read tcp 127.0.0.1:63211->127.0.0.1:63168: i/o timeout
[sarama] 2019/04/18 09:41:35 Closed connection to broker 127.0.0.1:63168
[sarama] 2019/04/18 09:41:35 client/metadata fetching metadata for all topics from broker 127.0.0.1:63169
[sarama] 2019/04/18 09:42:05 client/metadata got error from broker -1 while fetching metadata: read tcp 127.0.0.1:63230->127.0.0.1:63169: i/o timeout
[sarama] 2019/04/18 09:42:05 Closed connection to broker 127.0.0.1:63169
[sarama] 2019/04/18 09:42:05 client/metadata no available broker to send metadata request to
[sarama] 2019/04/18 09:42:05 client/brokers resurrecting 2 dead seed brokers
[sarama] 2019/04/18 09:42:05 Closing Client
    --- PASS: TestClientMetadataTimeout/timeout=10m0s (240.80s)
        client_test.go:670: Got err: kafka: client has run out of available brokers to talk to (Is your cluster reachable?) after waiting for: 4m0.792878004s

This time grows in proportion to the number of brokers and retries, per the following formula:

Net.[Dial|Read]Timeout * BrokerCount * (Metadata.Retry.Max + 1) + Metadata.Retry.Backoff * Metadata.Retry.Max
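
With the defaults (30s read timeout, 2 brokers, Metadata.Retry.Max = 3, 250ms backoff) this works out to 30s × 2 × (3 + 1) + 0.25s × 3 = 240.75s, which matches the 4m0.79s observed in the test output above.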

Solution

The proposed solution is to add a new configuration option Metadata.Timeout to fail faster.

Changes

  • add a Metadata.Timeout configuration option to fail faster (disabled by default for backward compatibility)
  • stop retrying the metadata query against the next broker once that timeout has expired, and return an error instead (a minimal sketch of this bounded retry loop follows this list)
  • add unit tests that validate the behaviour when the option is set to a non-zero value
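
For illustration, here is a minimal sketch of how such a deadline can bound the retry loop. This is not the actual Sarama implementation; MetadataConfig, errOutOfBrokers, and fetchMetadataFrom are illustrative stand-ins:

package metadata

import (
	"errors"
	"time"
)

// Illustrative stand-in for the real Sarama configuration types.
type MetadataConfig struct {
	Timeout time.Duration // 0 disables the deadline (backward-compatible default)
	Retry   struct {
		Max     int
		Backoff time.Duration
	}
}

var errOutOfBrokers = errors.New("kafka: client has run out of available brokers to talk to")

// refreshMetadata walks the broker list up to Retry.Max+1 times, but gives up
// early once the optional Timeout deadline has passed instead of moving on to
// the next broker. fetchMetadataFrom is a hypothetical helper that queries one broker.
func refreshMetadata(brokers []string, cfg MetadataConfig, fetchMetadataFrom func(addr string) error) error {
	var deadline time.Time
	if cfg.Timeout > 0 {
		deadline = time.Now().Add(cfg.Timeout)
	}
	for attempt := 0; attempt <= cfg.Retry.Max; attempt++ {
		for _, addr := range brokers {
			if !deadline.IsZero() && time.Now().After(deadline) {
				return errOutOfBrokers // deadline expired: fail fast
			}
			if fetchMetadataFrom(addr) == nil {
				return nil // metadata fetched successfully
			}
		}
		if attempt < cfg.Retry.Max {
			time.Sleep(cfg.Retry.Backoff)
		}
	}
	return errOutOfBrokers
}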

Testing done

See the provided unit tests, which show ErrOutOfBrokers being returned much faster.
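
For illustration, a hedged sketch of how such a test can measure the time to failure (the real tests live in client_test.go; the test name, addresses, and setup here are illustrative and assume listeners that accept TCP connections but never respond):

package sarama_test

import (
	"testing"
	"time"

	"github.com/IBM/sarama"
)

func TestMetadataFailsFast(t *testing.T) {
	config := sarama.NewConfig()
	config.Metadata.Timeout = 10 * time.Second // new option under test

	start := time.Now()
	// Both seed brokers accept connections but never answer, so the
	// client should give up once the metadata deadline has passed.
	client, err := sarama.NewClient([]string{"127.0.0.1:63168", "127.0.0.1:63169"}, config)
	if err != sarama.ErrOutOfBrokers {
		if client != nil {
			client.Close()
		}
		t.Fatalf("expected ErrOutOfBrokers, got: %v", err)
	}
	t.Logf("Got err: %v after waiting for: %v", err, time.Since(start))
}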

We have been using this approach successfully in production (based on a different Sarama release) to fail over a producer from one Kafka cluster to another.
When one cluster becomes unreachable over the WAN, we can fail over to another region in less than 30 seconds using the following configuration:

config.Net.DialTimeout = 10 * time.Second
config.Net.ReadTimeout = 15 * time.Second
config.Net.WriteTimeout = 10 * time.Second
// Use a single retry for querying metadata rather than 3
config.Metadata.Retry.Max = 1
// Try to connect to at most two unreachable brokers
config.Metadata.Timeout = 15 * time.Second
// Disable producer retries
config.Producer.Retry.Max = 0
// Also disable the backoff, as we could otherwise sleep for 100ms for every
// queued message that needs to be retried when a cluster is not reachable...
config.Producer.Retry.Backoff = 0
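
As a hedged illustration of that failover pattern (the newProducer and connectWithFailover helpers and the two broker lists are hypothetical; the import path reflects the repository's current home):

package failover

import (
	"time"

	"github.com/IBM/sarama"
)

// newProducer builds a fail-fast SyncProducer for one cluster using the
// configuration shown above.
func newProducer(brokers []string) (sarama.SyncProducer, error) {
	config := sarama.NewConfig()
	config.Net.DialTimeout = 10 * time.Second
	config.Net.ReadTimeout = 15 * time.Second
	config.Net.WriteTimeout = 10 * time.Second
	config.Metadata.Retry.Max = 1
	config.Metadata.Timeout = 15 * time.Second
	config.Producer.Retry.Max = 0
	config.Producer.Retry.Backoff = 0
	config.Producer.Return.Successes = true // required by SyncProducer
	return sarama.NewSyncProducer(brokers, config)
}

// connectWithFailover tries the primary cluster first; because metadata
// fetching now fails in roughly 15s rather than minutes, falling back to
// the secondary region stays within the 30s budget mentioned above.
func connectWithFailover(primary, secondary []string) (sarama.SyncProducer, error) {
	if producer, err := newProducer(primary); err == nil {
		return producer, nil
	}
	return newProducer(secondary)
}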

@ghost added the cla-needed label Apr 18, 2019
@slaunay (Contributor, Author) commented Apr 18, 2019

I just signed the CLA.

@ghost removed the cla-needed label Apr 18, 2019
@slaunay force-pushed the enhancement/metadata-refresh-fail-fast branch from 070038f to fe1b184 on April 23, 2019, with the commit message:

Fetching metadata from an unreachable cluster of 2 brokers can take more
than 3 minutes using the default configuration (30s DialTimeout, 3 retries,
250ms backoff) before failing with ErrOutOfBrokers.

Adding config.Metadata.Timeout configuration option to fail faster.
@varun06 (Contributor) commented May 3, 2019

@slaunay code looks good and is straightforward. I just want to make sure that this is a desirable feature to add. @bai, can you please add some folks to review who might have opinions on this?

@bai requested a review from sam-obeid on May 3, 2019
@mieczkowski commented:

This is a nice feature (and one I was missing), so +1

@sam-obeid left a comment

This totally makes sense! We don't want to break backward compatibility, so disabling it by default is a good choice.

This raised a valid concern regarding the long time spent querying metadata when one or more brokers are unreachable! We should look into failing faster even without this new configuration option!

Thank you for your contribution @slaunay
Verified that tests are green 👍

@bai merged commit a5ecebc into IBM:master on May 8, 2019
@slaunay (Contributor, Author) commented May 8, 2019

Thanks for the review; failing fast (time-bound) vs. tolerating faults (e.g. trying all brokers) is often tricky to get right.

@slaunay deleted the enhancement/metadata-refresh-fail-fast branch on May 8, 2019