@mergify mergify bot commented Jun 27, 2025

What does this PR do?

This PR changes how the Elastic Agent launches and supervises its internal OpenTelemetry (OTEL) collector:

  • Refactors the OTEL manager to run the collector as a supervised subprocess rather than embedding it directly.
    - Adds a new edot-supervised sub-command to the Elastic Agent, which starts the OTEL collector under the supervisor's control. Following review comments, this sub-command was consolidated with the existing otel command.
  • Enables support for the healthcheckv2 extension in the OTEL configuration to enhance lifecycle and health signal handling from collector components.
  • Improves test coverage and reliability by refactoring the otelmanager test logic to use time-based event checks.

This work lays the foundation for better fault isolation and process recovery, and aligns the collector with the Elastic Agent's existing process invocation model.

Why is it important?

Running the collector as a supervised subprocess provides improved fault isolation; a crash in the collector won't directly affect the main agent process.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

No immediate user impact is expected. The functionality remains the same as when running the collector directly inside the agent process. The new supervision model is internal and gated behind the edot-supervised command.

How to test this PR locally

mage unitTest

Related issues

N/A


This is an automatic backport of pull request #8248 done by [Mergify](https://mergify.com).

* feat: enable support for healthcheckv2 extension

* feat: add edot-supervised agent sub-command

* feat: refactor otel manager to invoke collector as subprocess

* fix: staticcheck QF1003 in unmarshalLevel of log_writer

* doc: add changelog fragment

* fix: add a random UUID as name of the healthcheck extension to avoid conflicts

* feat: consolidate otel and otel-supervised subcommands

* fix: use http.DefaultClient directly

* fix: add require failure message

* fix: rename otelSetSupervised to otelSetSupervisedFlagName and improve description

* fix: improve documentation

* fix: extract otel settings preparation in separate function

* fix: allocate healthCheckExtensionID in idiomatic way

* fix: update NOTICE files

* fix: exclude extensions from otel to beats status processing

* fix: always emit statuses

* fix: emit statuses only if there is a change in the event and subcomponents status/error

* fix: denoise code

* feat: reintroduce running collector embedded

* fix: update NOTICE.txt

* fix: replace runtime with execution

* fix: clean up commented code from TestCompareAggregateStatuses

* fix: removed changelog

* fix: exclude extensions from getOtelRuntimePipelineStatuses

* fix: pass elastic-agent logging level to supervised collector

* fix: couple embedded collector context with parent one

* fix: increase interval and max failed attempts of healthcheck v2 polling

* fix: make exceeding failed attempts a recoverable error and don't give up

* feat: add recovery support for supervised edot

* feat: rework health check fail to connect threshold

* feat: add license headers in recovery_backoff.go and recovery_noop.go

* fix: handle races in otel manager tests

* feat: support resetting to initial backoff interval for recoveryBackoff

* fix: correct comments

* fix: make recovery backoff unit tests more robust on OS with lower time resolution

* fix: format code after resolving conflicts

(cherry picked from commit c56581d)

# Conflicts:
#	internal/pkg/otel/manager/manager.go
#	magefile.go
@mergify mergify bot requested a review from a team as a code owner June 27, 2025 14:50
@mergify mergify bot added backport conflicts There is a conflict in the backported pull request labels Jun 27, 2025
@mergify mergify bot requested review from pchila and ycombinator and removed request for a team June 27, 2025 14:50
mergify bot commented Jun 27, 2025

Cherry-pick of c56581d has failed:

On branch mergify/bp/8.19/pr-8248
Your branch is up to date with 'origin/8.19'.

You are currently cherry-picking commit c56581d44.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	modified:   NOTICE-fips.txt
	modified:   NOTICE.txt
	modified:   go.mod
	modified:   go.sum
	modified:   internal/pkg/agent/application/application.go
	modified:   internal/pkg/agent/application/coordinator/coordinator_test.go
	modified:   internal/pkg/agent/cmd/otel.go
	modified:   internal/pkg/agent/cmd/otel_flags.go
	new file:   internal/pkg/agent/cmd/otel_test.go
	modified:   internal/pkg/otel/README.md
	new file:   internal/pkg/otel/agentprovider/buffer_provider.go
	new file:   internal/pkg/otel/agentprovider/buffer_provider_test.go
	modified:   internal/pkg/otel/agentprovider/provider.go
	modified:   internal/pkg/otel/agentprovider/provider_test.go
	new file:   internal/pkg/otel/agentprovider/scheme.go
	modified:   internal/pkg/otel/components.go
	new file:   internal/pkg/otel/manager/common.go
	new file:   internal/pkg/otel/manager/common_test.go
	new file:   internal/pkg/otel/manager/execution.go
	new file:   internal/pkg/otel/manager/execution_embedded.go
	new file:   internal/pkg/otel/manager/execution_subprocess.go
	modified:   internal/pkg/otel/manager/extension.go
	new file:   internal/pkg/otel/manager/healthcheck.go
	new file:   internal/pkg/otel/manager/healthcheck_test.go
	modified:   internal/pkg/otel/manager/manager_test.go
	new file:   internal/pkg/otel/manager/recovery_backoff.go
	new file:   internal/pkg/otel/manager/recovery_backoff_test.go
	new file:   internal/pkg/otel/manager/recovery_noop.go
	new file:   internal/pkg/otel/manager/testing/testing.go
	modified:   internal/pkg/otel/run.go
	modified:   internal/pkg/otel/translate/status.go
	modified:   pkg/component/runtime/log_writer.go
	modified:   pkg/core/process/process.go
	modified:   testing/integration/ess/otel_test.go

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   internal/pkg/otel/manager/manager.go
	both modified:   magefile.go

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

@mergify mergify bot added conflicts There is a conflict in the backported pull request backport labels Jun 27, 2025
@github-actions github-actions bot added enhancement New feature or request Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team skip-changelog labels Jun 27, 2025
@elasticmachine

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@pkoutsovasilis pkoutsovasilis removed the conflicts There is a conflict in the backported pull request label Jun 27, 2025
@elasticmachine

💛 Build succeeded, but was flaky

cc @pkoutsovasilis

@pkoutsovasilis pkoutsovasilis merged commit cf97ae1 into 8.19 Jun 27, 2025
20 checks passed
@pkoutsovasilis pkoutsovasilis deleted the mergify/bp/8.19/pr-8248 branch June 27, 2025 16:47