[component] Status Aggregation in Core #10058

mwear · 2024-04-30T22:24:13Z

Functions to handle status aggregation were added to core when status reporting was originally implemented. These functions were initially adequate while I was working on an alternative health check extension, but as more requirements surfaced, I had to reimplement alternative logic to handle all of the use cases. One PR specifically introduces the alternative aggregation logic, and another PR shows how everything works together.

The goal of status aggregation is to take a collection of status events and combine them into a single event that represents the most relevant status for the collection as a whole. One assumption we made in the original aggregation logic was that PermanentErrors should take precedence over RecoverableErrors. While this is still the default aggregation used in the health check extension, users have the option to include or ignore permanent and/or recoverable errors independently. If a user ignores PermanentErrors but includes RecoverableErrors, we want RecoverableErrors to take priority over PermanentErrors during aggregation. To handle this in the extension, I implemented a function that takes an ErrorPriority and returns a suitable aggregation function.
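As a rough illustration of the idea of a function that takes an ErrorPriority and returns an aggregation function, here is a minimal sketch. It is not the extension's code: the `Status` type, the `PriorityPermanent` and `PriorityRecoverable` constants, and `NewAggregateFunc` are hypothetical names chosen for this example, and the sketch only looks at the two error kinds discussed above.

```go
package main

// Status is a hypothetical, simplified stand-in for a component status.
type Status int

const (
	StatusOK Status = iota
	StatusRecoverableError
	StatusPermanentError
)

// ErrorPriority selects which error kind wins when both are present.
// The constant names are assumptions made for this sketch.
type ErrorPriority int

const (
	PriorityPermanent ErrorPriority = iota
	PriorityRecoverable
)

// AggregateFunc reduces a collection of statuses to a single status.
type AggregateFunc func(statuses []Status) Status

// NewAggregateFunc returns an aggregation function whose error ordering
// depends on the given priority.
func NewAggregateFunc(p ErrorPriority) AggregateFunc {
	first, second := StatusPermanentError, StatusRecoverableError
	if p == PriorityRecoverable {
		first, second = second, first
	}
	return func(statuses []Status) Status {
		// Record which statuses occur at least once.
		seen := make(map[Status]bool, len(statuses))
		for _, s := range statuses {
			seen[s] = true
		}
		// Return the highest-priority error that is present.
		for _, candidate := range []Status{first, second} {
			if seen[candidate] {
				return candidate
			}
		}
		// No errors in the collection.
		return StatusOK
	}
}
```

With this shape, the caller picks the ordering once, when building the aggregator, and the returned function can be reused for every collection of statuses.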

We can divide component statuses into two categories: lifecycle statuses (e.g. Starting, Stopping, Stopped) and runtime statuses (e.g. RecoverableError, PermanentError, FatalError). In the original aggregation logic, error statuses take precedence over lifecycle statuses. In the health check extension I had to change the aggregation logic to give precedence to lifecycle events in order to distinguish between a collector that is starting up or shutting down (a 503 status) and a collector in an error state (a 500).
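To make the lifecycle-over-runtime ordering concrete, the sketch below aggregates a set of statuses with lifecycle statuses checked first and then maps the result to an HTTP code. Everything here is an assumption for illustration: the `Status` type, the `aggregate` and `httpCode` helpers, the exact precedence order within each category, and the 200 returned for a healthy collector are not taken from the extension. FatalError is omitted for brevity.

```go
package main

import "net/http"

// Status is a hypothetical, simplified stand-in for a component status.
type Status int

const (
	StatusOK Status = iota
	StatusStarting
	StatusStopping
	StatusStopped
	StatusRecoverableError
	StatusPermanentError
)

// precedence lists statuses from highest to lowest priority.
// Lifecycle statuses come before runtime error statuses.
var precedence = []Status{
	StatusStarting, StatusStopping, StatusStopped,
	StatusPermanentError, StatusRecoverableError,
}

// aggregate returns the highest-precedence status present,
// or StatusOK if none of the listed statuses occur.
func aggregate(statuses []Status) Status {
	seen := make(map[Status]bool, len(statuses))
	for _, s := range statuses {
		seen[s] = true
	}
	for _, candidate := range precedence {
		if seen[candidate] {
			return candidate
		}
	}
	return StatusOK
}

// httpCode maps an aggregated status to an HTTP response code:
// 503 while starting up or shutting down, 500 for errors.
func httpCode(s Status) int {
	switch s {
	case StatusStarting, StatusStopping, StatusStopped:
		return http.StatusServiceUnavailable
	case StatusRecoverableError, StatusPermanentError:
		return http.StatusInternalServerError
	default:
		return http.StatusOK
	}
}
```

If errors were checked first instead, a collector with one failing component would report a 500 during startup and shutdown, and the two situations could no longer be told apart from the response code.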

There is a tangentially related issue with PermanentErrors and the underlying finite state machine that governs transitions between statuses. Currently, a PermanentError is a final state. That is, once a component enters this state, no further transitions are allowed. In light of the work I did on the alternative health check extension, I believe we should allow a transition from PermanentError to Stopping to consistently prioritize lifecycle events for components. This transition also makes sense from a practical perspective. A component in a PermanentError state is one that has been started and is running, although in a likely degraded state. The collector will call shutdown on the component (when the collector is shutting down) and we should allow the status to reflect that.
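The proposed change can be pictured as a single extra edge in a transition table. The sketch below is illustrative only: the `Status` type, `transitionTable`, `newTransitionTable`, and the specific set of allowed transitions are assumptions for this example and are not copied from the collector's state machine. The one point it is meant to show is PermanentError going from having no outgoing transitions to allowing Stopping.

```go
package main

// Status is a hypothetical, simplified stand-in for a component status.
type Status int

const (
	StatusStarting Status = iota
	StatusOK
	StatusRecoverableError
	StatusPermanentError
	StatusStopping
	StatusStopped
)

// transitionTable maps each status to the statuses it may move to.
type transitionTable map[Status][]Status

// newTransitionTable builds an illustrative table. When
// permanentErrorCanStop is false, PermanentError is a final state.
func newTransitionTable(permanentErrorCanStop bool) transitionTable {
	t := transitionTable{
		StatusStarting:         {StatusOK, StatusRecoverableError, StatusPermanentError, StatusStopping},
		StatusOK:               {StatusRecoverableError, StatusPermanentError, StatusStopping},
		StatusRecoverableError: {StatusOK, StatusPermanentError, StatusStopping},
		StatusPermanentError:   {},
		StatusStopping:         {StatusStopped},
		StatusStopped:          {},
	}
	if permanentErrorCanStop {
		// The proposed change: let shutdown be reflected in the status.
		t[StatusPermanentError] = []Status{StatusStopping}
	}
	return t
}

// allowed reports whether moving from one status to another is permitted.
func (t transitionTable) allowed(from, to Status) bool {
	for _, next := range t[from] {
		if next == to {
			return true
		}
	}
	return false
}
```

In this sketch, a component in PermanentError still cannot return to OK; only the shutdown path is opened.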

To summarize, the aggregation logic as it exists in core was not usable for the health check extension and needed to be reimplemented. The health check extension needed the ability to prioritize RecoverableErrors over PermanentErrors in some, but not all, cases. It also needed to prioritize lifecycle statuses over runtime statuses. A change that was not made, but should be made, is allowing a component to transition from a PermanentError state to Stopping (for consistency). As we work towards component 1.0, we should decide whether the aggregation logic we have in core should be replaced by what was implemented for the health check extension, left as is, or removed completely.

TylerHelmuth · 2024-05-02T20:50:49Z

> The collector will call shutdown on the component (when the collector is shutting down) and we should allow the status to reflect that.

Yes, this seems totally valid. As part of manually handling the permanent error, a graceful shutdown could occur, and that transition should be recorded.

TylerHelmuth · 2024-05-02T20:52:54Z

One argument for leaving it in core is to try to help components aggregate statuses in a consistent manner so that the meaning of a Status is consistent between components. But seeing as how the very first component to use statuses instantly needed its own aggregation logic, part of me thinks it is an individual component concern.

mwear · 2024-05-02T23:35:37Z

Once we are happy with the aggregation for the health check extension, I think we can and should promote it to core, or some other shared library. We may have been overzealous in starting with it in core, but the intent was, as you point out, to help aggregate status consistently between consumers.

#### Description

Adds an RFC for component status reporting. The main goal is to define what component status reporting is, our current implementation, and how such a system interacts with a 1.0 component. When merged, the following issues will be unblocked:

- #9823
- #10058
- #9957
- #9324
- #6506

---------

Co-authored-by: Matthew Wear <matthew.wear@gmail.com>
Co-authored-by: Pablo Baeyens <pbaeyens31+github@gmail.com>

TylerHelmuth · 2024-08-22T22:22:20Z

With #10413 merged, I am removing this from the component 1.0 milestone and Collector V1 project.

… to Stopping (#10958)

#### Description

In #10058 I mentioned:

> There is a tangentially related issue with PermanentErrors and the underlying finite state machine that governs transitions between statuses. Currently, a PermanentError is a final state. That is, once a component enters this state, no further transitions are allowed. In light of the work I did on the alternative health check extension, I believe we should allow a transition from PermanentError to Stopping to consistently prioritize lifecycle events for components. This transition also makes sense from a practical perspective. A component in a PermanentError state is one that has been started and is running, although in a likely degraded state. The collector will call shutdown on the component (when the collector is shutting down) and we should allow the status to reflect that.

This PR makes the suggested change and updates the documentation to reflect that. As this is an internal change, I have not included a changelog. Also note, we can close #10058 after this as we've already removed status aggregation from core during the recent component status refactor.

#### Link to tracking issue

Fixes #10058

#### Testing

units

#### Documentation

Updated docs/component-status.md and associated diagram.

Co-authored-by: Tyler Helmuth <12352919+TylerHelmuth@users.noreply.github.com>
Co-authored-by: Antoine Toulme <atoulme@splunk.com>

TylerHelmuth added the area:component label May 2, 2024

TylerHelmuth added this to the go.opentelemetry.io/collector/component 1.0 milestone May 30, 2024

github-project-automation bot added this to Collector: v1 May 30, 2024

github-project-automation bot moved this to Todo in Collector: v1 May 30, 2024

TylerHelmuth mentioned this issue Jun 17, 2024

[component] Component status reporting rfc #10413

Merged

mwear mentioned this issue Jul 29, 2024

[component] Move component status reporting public API to new componentstatus module #10730

Merged

TylerHelmuth removed this from Collector: v1 Aug 22, 2024

TylerHelmuth removed this from the go.opentelemetry.io/collector/component 1.0 milestone Aug 22, 2024

TylerHelmuth added area:componentstatus and removed area:component labels Aug 22, 2024

mwear mentioned this issue Aug 23, 2024

[service/internal] Allow components to transition from PermanentError to Stopping #10958

Merged

mx-psi closed this as completed in #10958 Dec 17, 2024
