Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[metricbeat] - Allow metricsets to report their status via v2 protocol #40025

Merged
merged 35 commits into from
Jul 24, 2024

Conversation

VihasMakwana
Copy link
Contributor

@VihasMakwana VihasMakwana commented Jun 26, 2024

Proposed commit message

Add StatusReporter for letting individual metricsets to report their statuses.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Related issues

Use cases

Screenshots

Logs

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Jun 26, 2024
@VihasMakwana VihasMakwana added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Jun 26, 2024
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Jun 26, 2024
Copy link
Contributor

mergify bot commented Jun 26, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @VihasMakwana? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

@VihasMakwana VihasMakwana force-pushed the add-metricbeat-status-reporter branch from 6c10a9f to e37ef86 Compare June 26, 2024 13:42
@VihasMakwana VihasMakwana changed the title [DRAFT] - Allow metrcisets to report their status via v2 protocol [metricbeat] - Allow metricsets to report their status via v2 protocol Jun 26, 2024
@VihasMakwana VihasMakwana marked this pull request as ready for review June 26, 2024 16:03
@VihasMakwana VihasMakwana requested a review from a team as a code owner June 26, 2024 16:03
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@VihasMakwana VihasMakwana requested a review from cmacknz June 26, 2024 16:10
@VihasMakwana VihasMakwana requested a review from a team as a code owner June 26, 2024 16:14
Copy link
Contributor

@faec faec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, but I don't see any actual invocations of SetStatusReporter, which seems to be required for the other changes to have an effect. Am I missing something, or is that coming in a later PR?

metricbeat/mb/mb.go Outdated Show resolved Hide resolved
@VihasMakwana
Copy link
Contributor Author

VihasMakwana commented Jul 15, 2024

This looks good, but I don't see any actual invocations of SetStatusReporter, which seems to be required for the other changes to have an effect. Am I missing something, or is that coming in a later PR?

@Fae, This part was introduced by #39209 and it's done at

if config.StatusReporter != nil {
if runnerWithStatus, ok := runner.(status.WithStatusReporter); ok {
runnerWithStatus.SetStatusReporter(config.StatusReporter)
}
}
in a generic way for all the runners.

@VihasMakwana VihasMakwana requested a review from faec July 15, 2024 20:50
Copy link
Contributor

@faec faec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying, looks good to me!

@VihasMakwana
Copy link
Contributor Author

Thanks I wasn't following that. I think it would be a more obvious if we set state based on the return value of Fetch (certainly it would be for me). I think it also has more obvious behavior with respect to when the status will be cleared - it will be cleared on the first call to Fetch that doesn't return an error.

@cmacknz Let me know how it looks to you now and if you have any more comments regarding this.


@faec Thanks for reviewing!

@cmacknz
Copy link
Member

cmacknz commented Jul 16, 2024

The changes themselves LGTM, but I am not getting the result I expected when I test this manually. What I did:

  1. Create an ubuntu VM with multipass and an 8.15.0-SNAPSHOT ESS deployment on staging.elastic.co.
  2. Build agentbeat from this branch with SNAPSHOT=true PLATFORMS=linux/arm64 PACKAGES=targz mage package
  3. Transfer agentbeat into the VM: multipass transfer ./build/distributions/agentbeat-8.16.0-SNAPSHOT-linux-arm64.tar.gz $VM_NAME:/home/ubuntu
  4. Shell into the VM with multipass shell
  5. Download latest 8.15.0 BC agent: curl -LO https://staging.elastic.co/8.15.0-a47a2107/downloads/beats/elastic-agent/elastic-agent-8.15.0-linux-arm64.tar.gz
  6. Extract the agent: tar xvfz elastic-agent-8.15.0-linux-arm64.tar.gz
  7. Overwrite the packaged agentbeat with the one from this branch: cp ./agentbeat-8.16.0-SNAPSHOT-linux-arm64/agentbeat ./elastic-agent-8.15.0-linux-arm64/data/elastic-agent-475d24/components/
  8. Install the agent as unprivileged using the default Fleet policy to collect system metrics: ./elastic-agent install --url=X --enrollment-token=Y --unprivileged -f
  9. Create a simple script and run it as root to guarantee there is some PID running under root that we should fail to get metrics from.
~$ cat script.sh
while true;
do
        echo "Hello!";
        sleep 1;
done
~$ sudo ./script.sh
  1. Observe agent remains healthy but the logs show privilege related errors, many of which are only visible at debug level.
{"log.level":"debug","@timestamp":"2024-07-16T21:04:30.148Z","message":"Non-fatal error fetching PID metrics for 30358, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30358/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"log.logger":"processes","log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:04:40.143Z","message":"Non-fatal error fetching PID metrics for 30357, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30357/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"log.logger":"processes","log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:05:50.143Z","message":"Non-fatal error fetching PID metrics for 30358, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30358/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"service.name":"metricbeat","ecs.version":"1.6.0","log.logger":"processes","log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:05:50.143Z","message":"Error fetching PID info for 30642, skipping: GetInfoForPid: no such process","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"service.name":"metricbeat","ecs.version":"1.6.0","log.logger":"processes","log.origin":{"file.line":198,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidIter"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:06:10.055Z","message":"Non-fatal error fetching PID metrics for 30356, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30356/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"log.logger":"processes","log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:06:10.056Z","message":"Non-fatal error fetching PID metrics for 30357, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30357/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"service.name":"metricbeat","ecs.version":"1.6.0","log.logger":"processes","log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:06:10.056Z","message":"Non-fatal error fetching PID metrics for 30358, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30358/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"ecs.version":"1.6.0","log.logger":"processes","log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:06:10.056Z","message":"Non-fatal error fetching PID metrics for 30649, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30649/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"log.logger":"processes","log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:06:10.056Z","message":"Non-fatal error fetching PID metrics for 30650, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30650/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"service.name":"metricbeat","ecs.version":"1.6.0","log.logger":"processes","log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:06:10.056Z","message":"Non-fatal error fetching PID metrics for 30651, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30651/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"log.logger":"processes","log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:06:10.056Z","message":"Non-fatal error fetching PID metrics for 30652, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30652/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"log.logger":"processes","log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:06:10.056Z","message":"Non-fatal error fetching PID metrics for 30670, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30670/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"service.name":"metricbeat","ecs.version":"1.6.0","log.logger":"processes","ecs.version":"1.6.0"}

I've attached diagnostics from the VM. The logs indicating the error is non-fatal might be what is catching us here, I think this has changed since the issue to implement this was created. Perhaps you were initially right hooking into the error in the event, which might catch these since they are non-fatal, but I still think looking at events makes it harder to see when we will go back to healthy. Regardless, this isn't actually catching the degraded state we want.

unpriv-metricbeat-diagnostics.zip

I also wonder if we could get a test to catch this, assuming I verified this correctly.

@VihasMakwana
Copy link
Contributor Author

@cmacknz I see.

I think this is due to the fact that elastic-agent-system-metrics only logs the error and doesn't return them to the caller when monitoring set of processes. Example https://github.com/elastic/elastic-agent-system-metrics/blob/a9ecc37f3de53d18a85ad182718010650aad774b/metric/system/process/process.go#L263-L269

We can make use of multierr to capture the errors encountered for set of processes and pass it to the caller.

Copy link
Contributor

mergify bot commented Jul 17, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b add-metricbeat-status-reporter upstream/add-metricbeat-status-reporter
git merge upstream/main
git push upstream add-metricbeat-status-reporter

Copy link
Contributor

mergify bot commented Jul 23, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b add-metricbeat-status-reporter upstream/add-metricbeat-status-reporter
git merge upstream/main
git push upstream add-metricbeat-status-reporter

@@ -253,14 +255,20 @@ func (msw *metricSetWrapper) fetch(ctx context.Context, reporter reporter) {
err := fetcher.Fetch(reporter.V2())
if err != nil {
reporter.V2().Error(err)
msw.module.UpdateStatus(status.Degraded, fmt.Sprintf("Error fetching data for metricset %s.%s: %s", msw.module.Name(), msw.MetricSet.Name(), err))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use %v for the err

}
case mb.ReportingMetricSetV2WithContext:
reporter.StartFetchTimer()
err := fetcher.Fetch(ctx, reporter.V2())
if err != nil {
reporter.V2().Error(err)
msw.module.UpdateStatus(status.Degraded, fmt.Sprintf("Error fetching data for metricset %s.%s: %s", msw.module.Name(), msw.MetricSet.Name(), err))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use %v for the err

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pkoutsovasilis I didn't see this.
I'll raise another PR for this.
Sorry from my end!

Copy link
Contributor

@pkoutsovasilis pkoutsovasilis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some nits, but looks good to me

@VihasMakwana
Copy link
Contributor Author

@cmacknz Screenshot of error event on Kibana:

Screenshot 2024-07-24 at 3 00 33 PM

@VihasMakwana
Copy link
Contributor Author

Merging this as of now.
#40298 will take care of the non-fatal errors.

@VihasMakwana VihasMakwana merged commit c86330a into elastic:main Jul 24, 2024
68 checks passed
mergify bot pushed a commit that referenced this pull request Jul 24, 2024
#40025)

* fix: initial commit

* tests: add integration test cases

* fix: expand testing scenarios

* fix: add comments

* fix: move integration tests to system/process

* cleanup

* fix: ci

* fix: ci and typos

* chore: update changelog

* fix: add helper

* fix: remove extra space

* fix: ci

* fix: move integration tests to x-pack

* fix: add null check

* fix: ci

* fix: remove unused code

* fix: lint

* fix: lint and imports

* fix: ci windows

* inting for windows

* fix lint linux

* fix: go imports

* fix: switch to the generic way

* chore: make error descriptive

* fix: move status report after fetch

* fix: typo

* fix: remove nolint

---------

Co-authored-by: Pierre HILBERT <pierre.hilbert@elastic.co>
(cherry picked from commit c86330a)
Copy link
Contributor

mergify bot commented Jul 31, 2024

⚠️ The sha of the head commit of this PR conflicts with #40400. Mergify cannot evaluate rules on this PR. ⚠️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-8.15 Automated backport to the 8.15 branch with mergify Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team
Projects
None yet
8 participants