hostmetrics receiver logs spurious errors when it races with process termination #30434

Closed
ringerc opened this issue Jan 10, 2024 · 5 comments

ringerc commented Jan 10, 2024

Describe the bug

If a process terminates midway through a hostmetrics process scrape, the scraper will log a spurious error like

{"level":"error","ts":1704887224.2511945,"caller":"scraperhelper/scrapercontroller.go:200","msg":"Error scraping metrics","kind":"receiver","name":"hostmetrics","data_type":"metrics","error":"error reading cpu times for process \"postgres\" (pid 1965300): open /host/proc/1965300/stat: no such file or directory; error reading memory info for process \"postgres\" (pid 1965300): open /host/proc/1965300/statm: no such file or directory","scraper":"process","stacktrace":"go.opentelemetry.io/collector/receiver/scraperhelper.(*controller).scrapeMetricsAndReport\n\tgo.opentelemetry.io/collector/receiver@v0.90.1/scraperhelper/scrapercontroller.go:200\ngo.opentelemetry.io/collector/receiver/scraperhelper.(*controller).startScraping.func1\n\tgo.opentelemetry.io/collector/receiver@v0.90.1/scraperhelper/scrapercontroller.go:176"}

because /proc/1965300 existed when it listed /proc, but had vanished by the time it tried to read its contents.

This error is expected and can be safely silenced. It would make sense to stat the directory on I/O error and, if the stat returns ENOENT, suppress the error; or simply ignore ENOENT for the per-process subdirectories entirely, since a process vanishing mid-scrape is the only likely cause of this error.
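For reference, a minimal sketch of the suggested suppression, not the scraper's actual code (the helper name readProcStat and the logging are hypothetical); the point is that a per-process read error matching fs.ErrNotExist can be dropped rather than reported:

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"log"
	"os"
)

// readProcStat stands in for the per-process reads the scraper performs
// (cpu times from /proc/<pid>/stat, memory info from /proc/<pid>/statm, ...).
func readProcStat(pid int) ([]byte, error) {
	return os.ReadFile(fmt.Sprintf("/proc/%d/stat", pid))
}

// scrapePID reads one process's stats and treats "process vanished" as a
// non-error: the read fails with an error wrapping ENOENT, which matches
// fs.ErrNotExist via errors.Is.
func scrapePID(pid int) error {
	if _, err := readProcStat(pid); err != nil {
		if errors.Is(err, fs.ErrNotExist) {
			// The process exited between listing /proc and reading its
			// entries; skip it instead of reporting a scrape error.
			log.Printf("pid %d vanished during scrape, skipping", pid)
			return nil
		}
		return err
	}
	return nil
}

func main() {
	// PID 1 normally exists; an absurdly large PID normally does not.
	fmt.Println(scrapePID(1), scrapePID(1<<30))
}
```

In the real scraper the per-process reads go through a library layer that may wrap these errors differently, so the actual fix would need to match on however the underlying ENOENT surfaces there.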

Steps to reproduce

Run a workload that creates and terminates lots of processes, while running hostmetrics process receiver.
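For example, a trivial process-churn generator along these lines (a sketch, not from the original report; any command that exits quickly will do) should eventually trigger the race when run alongside the collector:

```go
// A minimal churn workload: spawn short-lived processes in a tight loop so
// /proc entries appear and vanish between the scraper's directory listing
// and its follow-up reads. Assumes /bin/true exists on the host.
package main

import "os/exec"

func main() {
	for {
		_ = exec.Command("/bin/true").Run()
	}
}
```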

What did you expect to see?

No error-level logs.

What did you see instead?

Error-level logs about failures to read procfs entries for processes that vanished during the scrape.

What version did you use?

v0.91.0

What config did you use?

A generic sample config with the hostmetrics receiver enabled.
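The exact config is not included in the report. A minimal sketch of such a config, with the process scraper enabled and root_path inferred from the /host/proc/... paths in the error log (interval and exporter are placeholders), might look like:

```yaml
receivers:
  hostmetrics:
    collection_interval: 10s
    # root_path is inferred from the /host/proc/... paths in the error log;
    # it points the scrapers at the host procfs mounted into the container.
    root_path: /host
    scrapers:
      process:

exporters:
  debug:

service:
  pipelines:
    metrics:
      receivers: [hostmetrics]
      exporters: [debug]
```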

Environment

Generic k8s (kind)

@ringerc ringerc added the bug Something isn't working label Jan 10, 2024
@mx-psi mx-psi transferred this issue from open-telemetry/opentelemetry-collector Jan 11, 2024

Pinging code owners for receiver/hostmetrics: @dmitryax @braydonk. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@crobert-1 crobert-1 added the needs triage New item requiring triage label Jan 11, 2024

atoulme commented Jan 12, 2024

Thanks for the report @ringerc - would you like to try and offer a fix for this issue?

@atoulme atoulme removed the needs triage New item requiring triage label Jan 12, 2024

ringerc commented Jan 15, 2024

@atoulme I'd be happy to give it a go, though it'll be a while before I can queue it up. In the meantime, hopefully others will see this and at least know what the errors are from and why they appear.


This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Mar 18, 2024

This issue has been closed as inactive because it has been stale for 120 days with no activity.

@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) May 17, 2024