
[receiver/kafkametricsreceiver] do not crash collector on startup when kafka is unavailable #8817

Merged
merged 5 commits into open-telemetry:main from the kafkametricsreceiver_startup branch
Apr 21, 2022

Conversation

mwear
Member

@mwear mwear commented Mar 23, 2022

Description:
This PR updates the kafkametricsreceiver so that it will not crash the collector during startup if kafka is not available. It pushes the sarama client setup into the Scrape method. If kafka is unavailable, Scrape returns an error and tries to set up the client on subsequent scrapes until it succeeds. This is part of a larger discussion (#8816) on how scrape-based receivers should behave when the service they monitor is not available. I'm open to any and all suggestions on how this should be handled, and hope to use this as a starting point for best practices that can be applied to other scrape-based receivers.
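
For illustration, here is a minimal sketch of the resulting pattern in one of the scrapers (the consumer scraper). The names s.client, s.config.Brokers, s.saramaConfig, and newSaramaClient come from the diff discussed later in this conversation; the body below is an approximation of the approach, not the exact merged code.

import (
	"context"
	"fmt"

	"go.opentelemetry.io/collector/pdata/pmetric"
)

// The client is created lazily in scrape instead of in Start, so an
// unreachable kafka cluster can no longer fail collector startup.
func (s *consumerScraper) scrape(_ context.Context) (pmetric.Metrics, error) {
	if s.client == nil {
		client, err := newSaramaClient(s.config.Brokers, s.saramaConfig)
		if err != nil {
			// Returning the error lets scraperhelper log it and retry the
			// client setup on the next scrape interval.
			return pmetric.Metrics{}, fmt.Errorf("failed to create client in consumer scraper: %w", err)
		}
		s.client = client
	}
	// ... the existing metric collection against s.client is unchanged ...
	return pmetric.NewMetrics(), nil
}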

Link to tracking Issue:
#8816, #8349

Testing:
Unit tests were updated for the changed behavior. I also did an end-to-end test where I started the collector without a running kafka service. The collector starts and logs errors when the kafkametricsreceiver fails to scrape. I verified that when I start kafka, the receiver successfully connects and starts producing metrics. See the abbreviated log output below:

2022-03-23T11:27:09.410-0700	info	builder/receivers_builder.go:68	Receiver is starting...	{"kind": "receiver", "name": "kafkametrics"}
2022-03-23T11:27:09.410-0700	info	zapgrpc/zapgrpc.go:174	[core] Subchannel picks a new address "ingest.staging.lightstep.com:443" to connect	{"grpc_log": true}
2022-03-23T11:27:09.410-0700	info	builder/receivers_builder.go:73	Receiver started.	{"kind": "receiver", "name": "kafkametrics"}

...

2022-03-23T11:32:11.731-0700	error	scraperhelper/scrapercontroller.go:198	Error scraping metrics	{"kind": "receiver", "name": "kafkametrics", "error": "failed to create client in consumer scraper: kafka: client has run out of available brokers to talk to (Is your cluster reachable?)", "scraper": "consumers"}
go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).scrapeMetricsAndReport
	go.opentelemetry.io/collector@v0.47.1-0.20220321233732-3cec6d3d98d9/receiver/scraperhelper/scrapercontroller.go:198
go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).startScraping.func1
	go.opentelemetry.io/collector@v0.47.1-0.20220321233732-3cec6d3d98d9/receiver/scraperhelper/scrapercontroller.go:173

...

2022-03-23T11:33:09.617-0700	DEBUG	loggingexporter/logging_exporter.go:64	ResourceMetrics #0
Resource SchemaURL:
InstrumentationLibraryMetrics #0
InstrumentationLibraryMetrics SchemaURL:
InstrumentationLibrary otelcol/kafkametrics
Metric #0
Descriptor:
     -> Name: kafka.brokers
     -> Description:
     -> Unit:
     -> DataType: Gauge
NumberDataPoints #0
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2022-03-23 18:33:09.436189 +0000 UTC
Value: 1
ResourceMetrics #1
Resource SchemaURL:
InstrumentationLibraryMetrics #0
InstrumentationLibraryMetrics SchemaURL:
InstrumentationLibrary otelcol/kafkametrics

@mwear mwear requested a review from a team March 23, 2022 18:45
@mwear mwear requested a review from dmitryax as a code owner March 23, 2022 18:45
@mwear mwear changed the title kafkametricsreceiver initialize client in scrape [receiver/kafkametricsreceiver] do not crash collector on startup when kafka is unavailable Mar 23, 2022
@github-actions
Contributor

github-actions bot commented Apr 7, 2022

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Apr 7, 2022
@mwear mwear force-pushed the kafkametricsreceiver_startup branch from 5570505 to aa83a11 Compare April 11, 2022 21:06
@mwear
Member Author

mwear commented Apr 11, 2022

I'm wondering if we can consider reviewing this PR as-is. There is some work in progress around health reporting that we can take advantage of when it's finished, but this PR stands on its own as an improvement. With these changes, the collector will not crash if kafka is not available during startup, and it will try to reconnect on each scrape until successful.

@github-actions github-actions bot removed the Stale label Apr 12, 2022
Contributor

@codeboten codeboten left a comment


Makes sense to me; this is still an improvement, even if it's not applicable to all receivers just yet. Please resolve the conflicts and I'll review shortly.

@mwear mwear force-pushed the kafkametricsreceiver_startup branch from aa83a11 to e86dd09 Compare April 14, 2022 16:20
Contributor

@codeboten codeboten left a comment


I think this is an improvement to the current behaviour: scrapers failing to talk to their destinations shouldn't crash the collector. Please address the changelog comment. @dmitryax PTAL

CHANGELOG.md (outdated review thread, resolved)
Comment on lines 75 to 69
if err := s.setupClient(); err != nil {
return pmetric.Metrics{}, err
}
Member

@dmitryax dmitryax left a comment

This way of reusing code seems a bit confusing to me. It sounds like we set up a client on each scrape. If we remove the function, it becomes much cleaner IMO

Suggested change
-	if err := s.setupClient(); err != nil {
-		return pmetric.Metrics{}, err
-	}
+	if s.client == nil {
+		client, err := newSaramaClient(s.config.Brokers, s.saramaConfig)
+		if err != nil {
+			return pmetric.Metrics{}, fmt.Errorf("failed to create client in consumer scraper: %w", err)
+		}
+		s.client = client
+	}

Another approach to avoid copying this code in all the scrapers would be to use a wrapper for the scrape function; each scraper creation would then become:

scraperhelper.NewScraper(
		s.Name(),
		withClientSetup(s.scrape),
		scraperhelper.WithShutdown(s.shutdown),
	)
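
For context, a sketch of what such a wrapper could look like. This is an assumption rather than code from this PR: it presumes scraperhelper's ScrapeFunc type (func(context.Context) (pmetric.Metrics, error)) and a hypothetical ensureClient method on each scraper; the one-argument call shown above could instead be a closure over the scraper.

// Hypothetical helper, not part of the merged change.
type clientInitializer interface {
	// ensureClient creates the sarama client if it has not been created yet.
	ensureClient() error
}

// withClientSetup wraps a scrape function so the client is created (or
// re-attempted) before scraping; on failure the error is returned and the
// next scrape retries.
func withClientSetup(s clientInitializer, scrape scraperhelper.ScrapeFunc) scraperhelper.ScrapeFunc {
	return func(ctx context.Context) (pmetric.Metrics, error) {
		if err := s.ensureClient(); err != nil {
			return pmetric.Metrics{}, err
		}
		return scrape(ctx)
	}
}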

Member Author

@mwear mwear left a comment

I liked both of your suggestions, but went with the first (inlining) as it makes the code slightly easier to follow.
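
Because the client is now created lazily, shutdown also has to tolerate a client that was never created (the "guard against nil client in shutdown" commit listed further down). A minimal sketch of that guard, assuming sarama's Client interface with its Closed and Close methods:

func (s *consumerScraper) shutdown(context.Context) error {
	// The client is nil if no scrape ever managed to connect to kafka.
	if s.client != nil && !s.client.Closed() {
		return s.client.Close()
	}
	return nil
}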

Contributor

@codeboten codeboten left a comment


The change looks good to me. The suggestions proposed by @dmitryax make the PR cleaner as well. 👍

@mwear mwear force-pushed the kafkametricsreceiver_startup branch from d414813 to b3b30d2 Compare April 21, 2022 17:40
@mwear mwear force-pushed the kafkametricsreceiver_startup branch from b3b30d2 to ef0e7ed Compare April 21, 2022 18:14
@codeboten codeboten merged commit f5f05d5 into open-telemetry:main Apr 21, 2022
djaglowski pushed a commit to djaglowski/opentelemetry-collector-contrib that referenced this pull request Apr 22, 2022
…n kafka is unavailable (open-telemetry#8817)

* kafkametricsreceiver initialize client in scrape

* add changelog entry

* more descriptive changelog entry

* kafkametricsreceiver: guard against nil client in shutdown

* kafkametricsreceiver: inline client creation in scrape