
[receiver/kafkametricsreceiver] do not crash collector on startup when kafka is unavailable #8817

Merged
merged 5 commits into open-telemetry:main from the kafkametricsreceiver_startup branch
Apr 21, 2022

Conversation

mwear
Member

@mwear mwear commented Mar 23, 2022

Description:
This PR updates the kafkametricsreceiver so that it will not crash the collector during startup if kafka is not available. It pushes the sarama client setup into the Scrape method. If kafka is unavailable, Scrape returns an error and tries to set up the client on subsequent scrapes until it succeeds. This is part of a larger discussion (#8816) on how scrape-based receivers should behave when the service they monitor is not available. I'm open to any and all suggestions on how this should be handled, and hope to use this as a starting point for best practices that can be applied to other scrape-based receivers.
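
For illustration, here is a minimal sketch of the resulting pattern in one of the scrapers (the consumer scraper). The names s.client, s.config.Brokers, s.saramaConfig, and newSaramaClient come from the diff discussed later in this conversation; the body below is an approximation of the approach, not the exact merged code.

import (
	"context"
	"fmt"

	"go.opentelemetry.io/collector/pdata/pmetric"
)

// The client is created lazily in scrape instead of in Start, so an
// unreachable kafka cluster can no longer fail collector startup.
func (s *consumerScraper) scrape(_ context.Context) (pmetric.Metrics, error) {
	if s.client == nil {
		client, err := newSaramaClient(s.config.Brokers, s.saramaConfig)
		if err != nil {
			// Returning the error lets scraperhelper log it and retry the
			// client setup on the next scrape interval.
			return pmetric.Metrics{}, fmt.Errorf("failed to create client in consumer scraper: %w", err)
		}
		s.client = client
	}
	// ... the existing metric collection against s.client is unchanged ...
	return pmetric.NewMetrics(), nil
}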

Link to tracking Issue:
#8816, #8349

Testing:
Unit tests were updated for the changed behavior. I also did an end-to-end test where I started the collector without a running kafka service. The collector starts and logs errors when the kafkametricsreceiver fails to scrape. I verified that when I start kafka, the receiver successfully connects and starts producing metrics. See the abbreviated log output below:

2022-03-23T11:27:09.410-0700	info	builder/receivers_builder.go:68	Receiver is starting...	{"kind": "receiver", "name": "kafkametrics"}
2022-03-23T11:27:09.410-0700	info	zapgrpc/zapgrpc.go:174	[core] Subchannel picks a new address "ingest.staging.lightstep.com:443" to connect	{"grpc_log": true}
2022-03-23T11:27:09.410-0700	info	builder/receivers_builder.go:73	Receiver started.	{"kind": "receiver", "name": "kafkametrics"}

...

2022-03-23T11:32:11.731-0700	error	scraperhelper/scrapercontroller.go:198	Error scraping metrics	{"kind": "receiver", "name": "kafkametrics", "error": "failed to create client in consumer scraper: kafka: client has run out of available brokers to talk to (Is your cluster reachable?)", "scraper": "consumers"}
go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).scrapeMetricsAndReport
	go.opentelemetry.io/collector@v0.47.1-0.20220321233732-3cec6d3d98d9/receiver/scraperhelper/scrapercontroller.go:198
go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).startScraping.func1
	go.opentelemetry.io/collector@v0.47.1-0.20220321233732-3cec6d3d98d9/receiver/scraperhelper/scrapercontroller.go:173

...

2022-03-23T11:33:09.617-0700	DEBUG	loggingexporter/logging_exporter.go:64	ResourceMetrics #0
Resource SchemaURL:
InstrumentationLibraryMetrics #0
InstrumentationLibraryMetrics SchemaURL:
InstrumentationLibrary otelcol/kafkametrics
Metric #0
Descriptor:
     -> Name: kafka.brokers
     -> Description:
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2022-03-23 18:33:09.436189 +0000 UTC
Value: 1
ResourceMetrics #1
Resource SchemaURL:
InstrumentationLibraryMetrics #0
InstrumentationLibraryMetrics SchemaURL:
InstrumentationLibrary otelcol/kafkametrics

@mwear mwear requested a review from a team March 23, 2022 18:45
@mwear mwear requested a review from dmitryax as a code owner March 23, 2022 18:45
@mwear mwear changed the title kafkametricsreceiver initialize client in scrape [receiver/kafkametricsreceiver] do not crash collector on startup when kafka is unavailable Mar 23, 2022
@github-actions
Contributor

github-actions bot commented Apr 7, 2022

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Apr 7, 2022
@mwear mwear force-pushed the kafkametricsreceiver_startup branch from 5570505 to aa83a11 Compare April 11, 2022 21:06
@mwear
Member Author

mwear commented Apr 11, 2022

I'm wondering if we can consider reviewing this PR as-is. There is some work in progress around health reporting that we can take advantage of when it's finished, but this PR stands on its own as an improvement. With these changes, the collector will not crash if kafka is not available during startup, and it will try to reconnect on each scrape until successful.

@github-actions github-actions bot removed the Stale label Apr 12, 2022
Contributor

@codeboten codeboten left a comment


Makes sense to me; this is still an improvement, even if it's not applicable to all receivers just yet. Please resolve the conflicts and I'll review shortly.

@mwear mwear force-pushed the kafkametricsreceiver_startup branch from aa83a11 to e86dd09 Compare April 14, 2022 16:20
Contributor

@codeboten codeboten left a comment


I think this is an improvement to the current behaviour: scrapers failing to talk to their destinations shouldn't crash the collector. Please address the changelog comment. @dmitryax PTAL

CHANGELOG.md (outdated review thread, resolved)
Comment on lines 75 to 69
if err := s.setupClient(); err != nil {
return pmetric.Metrics{}, err
}
Member

@dmitryax dmitryax left a comment

This way of reusing code seems a bit confusing to me. It sounds like we set up a client on each scrape. If we remove the function, it becomes much cleaner IMO

Suggested change
-	if err := s.setupClient(); err != nil {
-		return pmetric.Metrics{}, err
-	}
+	if s.client == nil {
+		client, err := newSaramaClient(s.config.Brokers, s.saramaConfig)
+		if err != nil {
+			return pmetric.Metrics{}, fmt.Errorf("failed to create client in consumer scraper: %w", err)
+		}
+		s.client = client
+	}

Another approach to avoid copying this code in all the scrapers would be to use a wrapper for the scrape function; each scraper creation would then become:

scraperhelper.NewScraper(
		s.Name(),
		withClientSetup(s.scrape),
		scraperhelper.WithShutdown(s.shutdown),
	)
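
For context, a sketch of what such a wrapper could look like. This is an assumption rather than code from this PR: it presumes scraperhelper's ScrapeFunc type (func(context.Context) (pmetric.Metrics, error)) and a hypothetical ensureClient method on each scraper; the one-argument call shown above could instead be a closure over the scraper.

// Hypothetical helper, not part of the merged change.
type clientInitializer interface {
	// ensureClient creates the sarama client if it has not been created yet.
	ensureClient() error
}

// withClientSetup wraps a scrape function so the client is created (or
// re-attempted) before scraping; on failure the error is returned and the
// next scrape retries.
func withClientSetup(s clientInitializer, scrape scraperhelper.ScrapeFunc) scraperhelper.ScrapeFunc {
	return func(ctx context.Context) (pmetric.Metrics, error) {
		if err := s.ensureClient(); err != nil {
			return pmetric.Metrics{}, err
		}
		return scrape(ctx)
	}
}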

Member Author

@mwear mwear left a comment

I liked both of your suggestions, but went with the first (inlining) as it makes the code slightly easier to follow.
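
Because the client is now created lazily, shutdown also has to tolerate a client that was never created (the "guard against nil client in shutdown" commit listed further down). A minimal sketch of that guard, assuming sarama's Client interface with its Closed and Close methods:

func (s *consumerScraper) shutdown(context.Context) error {
	// The client is nil if no scrape ever managed to connect to kafka.
	if s.client != nil && !s.client.Closed() {
		return s.client.Close()
	}
	return nil
}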

Contributor

@codeboten codeboten left a comment


The change looks good to me. The suggestions proposed by @dmitryax make the PR cleaner as well. 👍

@mwear mwear force-pushed the kafkametricsreceiver_startup branch from d414813 to b3b30d2 Compare April 21, 2022 17:40
@mwear mwear force-pushed the kafkametricsreceiver_startup branch from b3b30d2 to ef0e7ed Compare April 21, 2022 18:14
@codeboten codeboten merged commit f5f05d5 into open-telemetry:main Apr 21, 2022
djaglowski pushed a commit to djaglowski/opentelemetry-collector-contrib that referenced this pull request Apr 22, 2022
…n kafka is unavailable (open-telemetry#8817)

* kafkametricsreceiver initialize client in scrape

* add changelog entry

* more descriptive changelog entry

* kafkametricsreceiver: guard against nil client in shutdown

* kafkametricsreceiver: inline client creation in scrape