Add /liveness endpoint to elastic-agent #4499

Merged
merged 32 commits into main from liveness-endpoint
Apr 15, 2024
32 commits
2755f0f
add liveness endpoint
fearful-symmetry Mar 28, 2024
11ad640
format files
fearful-symmetry Mar 28, 2024
1c78dbb
add changelog
fearful-symmetry Mar 28, 2024
bdc35c6
linter...
fearful-symmetry Mar 28, 2024
5590642
linter, again
fearful-symmetry Mar 28, 2024
0eab59a
refactor helpers so we don't move the component type around
fearful-symmetry Mar 29, 2024
fb6a31d
finish up reloader, HTTP handling
fearful-symmetry Apr 4, 2024
1dd8c9d
cleanup
fearful-symmetry Apr 4, 2024
240c25b
Merge remote-tracking branch 'upstream/main' into liveness-endpoint
fearful-symmetry Apr 4, 2024
4eeff6b
linter
fearful-symmetry Apr 4, 2024
3eee7aa
notice
fearful-symmetry Apr 4, 2024
82e338d
remove tests based on changed behavior
fearful-symmetry Apr 4, 2024
673e9e9
Merge remote-tracking branch 'origin/liveness-endpoint' into liveness…
fearful-symmetry Apr 4, 2024
9166208
add integration test
fearful-symmetry Apr 5, 2024
0e53358
format
fearful-symmetry Apr 5, 2024
3c1fba3
format, again
fearful-symmetry Apr 5, 2024
f85692b
code cleanup
fearful-symmetry Apr 9, 2024
e3fe848
add heartbeat check
fearful-symmetry Apr 10, 2024
6b548b7
Merge remote-tracking branch 'upstream/main' into liveness-endpoint
fearful-symmetry Apr 10, 2024
6f83308
fix tests
fearful-symmetry Apr 10, 2024
b1dfcb0
improve comments
fearful-symmetry Apr 10, 2024
ad6808b
refactor to simplify http handler
fearful-symmetry Apr 11, 2024
dff3e0d
linter
fearful-symmetry Apr 11, 2024
98370bd
revert some changes from older handling implementation
fearful-symmetry Apr 11, 2024
bfd1291
linter
fearful-symmetry Apr 11, 2024
a24a7e8
fix linter and loop mem usage
fearful-symmetry Apr 11, 2024
f675b3a
update docs, fix test
fearful-symmetry Apr 12, 2024
85189bf
improve docs
fearful-symmetry Apr 12, 2024
336b73d
simplify switch statement
fearful-symmetry Apr 12, 2024
1b65f37
update docs, tests, provide a 'coordinator' mode
fearful-symmetry Apr 12, 2024
6e46f01
change default behavior, rename failon opts
fearful-symmetry Apr 12, 2024
3e64845
cleanup, rename a few things, short-out coordinator check
fearful-symmetry Apr 15, 2024
4 changes: 4 additions & 0 deletions _meta/config/common.p2.yml.tmpl
@@ -66,6 +66,10 @@ inputs:
# # The name of the output to use for monitoring data.
# use_output: monitoring
# # exposes agent metrics using http, by default sockets and named pipes are used
# # Also exposes a /liveness endpoint that will return an HTTP code depending on agent status:
# # 200: Agent is healthy
# # 500: A component or unit is in a failed state
# # 503: The agent coordinator is unresponsive
# http:
# # enables http endpoint
# enabled: false
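
For context, a minimal sketch of how an external monitor could poll this endpoint once the monitoring http section is enabled; the address and port below are assumptions for illustration, not values taken from this PR:

// liveness_probe.go: a sketch of polling the agent's /liveness endpoint.
// The address is an assumption; use whatever host/port the monitoring http
// section of the agent configuration actually exposes.
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{Timeout: 15 * time.Second}

	resp, err := client.Get("http://localhost:6791/liveness") // assumed address
	if err != nil {
		fmt.Println("agent unreachable:", err)
		return
	}
	defer resp.Body.Close()

	switch resp.StatusCode {
	case http.StatusOK: // 200: agent is healthy
		fmt.Println("healthy")
	case http.StatusInternalServerError: // 500: a component or unit is in a failed state
		fmt.Println("component or unit failure")
	case http.StatusServiceUnavailable: // 503: the agent coordinator is unresponsive
		fmt.Println("coordinator unresponsive")
	default:
		fmt.Printf("unexpected status: %d\n", resp.StatusCode)
	}
}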
6 changes: 5 additions & 1 deletion _meta/config/common.reference.p2.yml.tmpl
@@ -144,7 +144,11 @@ inputs:
# pprof.enabled: false
# # The name of the output to use for monitoring data.
# use_output: monitoring
# # exposes agent metrics using http, by default sockets and named pipes are used
# # Exposes agent metrics using http, by default sockets and named pipes are used.
# # Also exposes a /liveness endpoint that will return an HTTP code depending on agent status:
# # 200: Agent is healthy
# # 500: A component or unit is in a failed state
# # 503: The agent coordinator is unresponsive
# http:
# # enables http endpoint
# enabled: false
32 changes: 32 additions & 0 deletions changelog/fragments/1711653910-add-liveness-endpoint.yaml
@@ -0,0 +1,32 @@
# Kind can be one of:
# - breaking-change: a change to previously-documented behavior
# - deprecation: functionality that is being removed in a later release
# - bug-fix: fixes a problem in a previous version
# - enhancement: extends functionality but does not break or fix existing behavior
# - feature: new functionality
# - known-issue: problems that we are aware of in a given version
# - security: impacts on the security of a product or a user’s deployment.
# - upgrade: important information for someone upgrading from a prior version
# - other: does not fit into any of the other categories
kind: feature

# Change summary; a 80ish characters long description of the change.
summary: add-liveness-endpoint

# Long description; in case the summary is not enough to describe the change
# this field accommodate a description without length limits.
# NOTE: This field will be rendered only for breaking-change and known-issue kinds at the moment.
description: Adds a liveness endpoint for k8s monitoring

# Affected component; usually one of "elastic-agent", "fleet-server", "filebeat", "metricbeat", "auditbeat", "all", etc.
component: monitoring

# PR URL; optional; the PR number that added the changeset.
# If not present is automatically filled by the tooling finding the PR where this changelog fragment has been added.
# NOTE: the tooling supports backports, so it's able to fill the original PR number instead of the backport PR number.
# Please provide it if you are adding a fragment for a different PR.
#pr: https://github.com/owner/repo/1234

# Issue URL; optional; the GitHub issue related to this changeset (either closes or is part of).
# If not present is automatically filled by the tooling with the issue linked to the PR number.
#issue: https://github.com/owner/repo/1234
6 changes: 5 additions & 1 deletion elastic-agent.reference.yml
@@ -150,7 +150,11 @@ inputs:
# pprof.enabled: false
# # The name of the output to use for monitoring data.
# use_output: monitoring
# # exposes agent metrics using http, by default sockets and named pipes are used
# # Exposes agent metrics using http, by default sockets and named pipes are used.
# # Also exposes a /liveness endpoint that will return an HTTP code depending on agent status:
# # 200: Agent is healthy
# # 500: A component or unit is in a failed state
# # 503: The agent coordinator is unresponsive
# http:
# # enables http endpoint
# enabled: false
4 changes: 4 additions & 0 deletions elastic-agent.yml
@@ -72,6 +72,10 @@ inputs:
# # The name of the output to use for monitoring data.
# use_output: monitoring
# # exposes agent metrics using http, by default sockets and named pipes are used
# # Also exposes a /liveness endpoint that will return an HTTP code depending on agent status:
# # 200: Agent is healthy
# # 500: A component or unit is in a failed state
# # 503: The agent coordinator is unresponsive
# http:
# # enables http endpoint
# enabled: false
24 changes: 24 additions & 0 deletions internal/pkg/agent/application/coordinator/coordinator.go
@@ -279,6 +279,11 @@ type Coordinator struct {

// mx sync.RWMutex
// protection protection.Config

// a sync channel that other components can use to check if the main coordinator
// loop in runLoopIteration() is active and listening.
// Should only be interacted with via CoordinatorActive() or runLoopIteration()
heartbeatChan chan struct{}
}

// The channels Coordinator reads to receive updates from the various managers.
@@ -372,6 +377,7 @@ func New(logger *logger.Logger, cfg *configuration.Configuration, logLevel logp.
logLevelCh: make(chan logp.Level),
overrideStateChan: make(chan *coordinatorOverrideState),
upgradeDetailsChan: make(chan *details.Details),
heartbeatChan: make(chan struct{}),
}
// Setup communication channels for any non-nil components. This pattern
// lets us transparently accept nil managers / simulated events during
@@ -412,6 +418,22 @@ func (c *Coordinator) State() State {
return c.stateBroadcaster.Get()
}

// CoordinatorActive is a blocking method that waits for a channel response
// from the coordinator loop. This can be used as a basic health check,
// as we'll time out and return false if the coordinator run loop doesn't
// respond to our channel.
func (c *Coordinator) CoordinatorActive(timeout time.Duration) bool {
ctx, cancel := context.WithTimeout(context.Background(), timeout)
defer cancel()

select {
case <-c.heartbeatChan:
return true
case <-ctx.Done():
return false
}
}

func (c *Coordinator) RegisterMonitoringServer(s configReloader) {
c.monitoringServerReloader = s
}
@@ -977,6 +999,8 @@ func (c *Coordinator) runLoopIteration(ctx context.Context) {
case upgradeDetails := <-c.upgradeDetailsChan:
c.setUpgradeDetails(upgradeDetails)

case c.heartbeatChan <- struct{}{}:
Contributor:

question on this. This looks like we will put something on the channel as soon as possible, so if the coordinator gets blocked after that, you would read from the heartbeatChan and think that it was up, because in the past it had been able to write to the channel. On your next read it would fail. Is that correct?

If it is, I'd rather see something that records a timestamp every time runLoopIteration is called, then we can check to see if that timestamp is within our timeout window.
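
For illustration only, a rough sketch of the timestamp-based alternative described here (type and method names are hypothetical; this is not what the PR merged):

// Hypothetical sketch: runLoopIteration would stamp the clock on every pass,
// and the health check would ask whether that stamp is recent enough.
package coordinator

import (
	"sync"
	"time"
)

type loopHeartbeat struct {
	mu       sync.Mutex
	lastSeen time.Time
}

// beat would be called at the top of every runLoopIteration.
func (h *loopHeartbeat) beat() {
	h.mu.Lock()
	h.lastSeen = time.Now()
	h.mu.Unlock()
}

// alive reports whether the loop ran within the allowed window,
// e.g. twice the check-in interval as suggested later in this thread.
func (h *loopHeartbeat) alive(window time.Duration) bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	return time.Since(h.lastSeen) <= window
}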

Contributor Author:

so if the coordinator gets blocked after that, you would read from the heartbeatChan and think that it was up, because in the past it had been able to write to the channel. On your next read it would fail. Is that correct?

So, if I understand you correctly, yes. We could end up in a state where the coordinator blocks right after a heartbeat call. Because the liveness endpoint is meant to be repeated on some kind of regular period, I'm not too worried about that.

I specifically didn't go with some kind of timestamp mechanism on @faec's advice, since she was worried about the added complexity/edge cases of a time comparison, compared to a simple signal like this.

Contributor:

I think this just tells us that the runLoopIteration function can run and the select can hit the heartbeat case. It doesn't tell us if the coordinator can/has processed any of the other cases.

I was thinking to be "alive", we want to know that one of the case statements besides heartbeat statement has run. We would probably need to bound that comparison with the check-in Interval. Or to put another way, the runLoopIteration function should happen at least every check-in Interval (2x is probably safer) and it might happen more frequently.

Contributor Author:

Yeah, I sort of agree, this only really works as a basic heartbeat.

My concern is that "has the coordinator done anything else?" would be a bit flaky, as we can't really guarantee what state other sub-components will be in at any given time. @faec can comment more, but for that to work we might need to refactor part of the coordinator to allow for a more sophisticated health check.


case componentState := <-c.managerChans.runtimeManagerUpdate:
// New component change reported by the runtime manager via
// Coordinator.watchRuntimeComponents(), merge it with the
@@ -14,6 +14,7 @@ import (
"context"
"errors"
"fmt"
"net"
"testing"
"time"

Expand Down Expand Up @@ -570,7 +571,7 @@ func TestCoordinatorPolicyChangeUpdatesMonitorReloader(t *testing.T) {
}

monitoringServer := &fakeMonitoringServer{}
newServerFn := func() (reload.ServerController, error) {
newServerFn := func(*monitoringCfg.MonitoringConfig) (reload.ServerController, error) {
return monitoringServer, nil
}
monitoringReloader := reload.NewServerReloader(newServerFn, logger, monitoringCfg.DefaultConfig())
Expand Down Expand Up @@ -1054,3 +1055,7 @@ func (fs *fakeMonitoringServer) Reset() {
fs.stopTriggered = false
fs.startTriggered = false
}

func (fs *fakeMonitoringServer) Addr() net.Addr {
return nil
}
12 changes: 11 additions & 1 deletion internal/pkg/agent/application/monitoring/handler.go
@@ -8,6 +8,9 @@ import (
"encoding/json"
"fmt"
"net/http"
"time"

"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator"
)

const errTypeUnexpected = "UNEXPECTED"
@@ -16,6 +19,13 @@ type apiError interface {
Status() int
}

// CoordinatorState is used by the HTTP handlers that take a coordinator object.
// This interface exists to help make testing easier.
type CoordinatorState interface {
State() coordinator.State
CoordinatorActive(timeout time.Duration) bool
}

func createHandler(fn func(w http.ResponseWriter, r *http.Request) error) *apiHandler {
return &apiHandler{
innerFn: fn,
@@ -30,7 +40,7 @@ type apiHandler struct {
func (h *apiHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
err := h.innerFn(w, r)
if err != nil {
switch e := err.(type) { // nolint:errorlint // Will need refactor.
switch e := err.(type) { //nolint:errorlint // Will need refactor.
case apiError:
w.WriteHeader(e.Status())
default:
40 changes: 40 additions & 0 deletions internal/pkg/agent/application/monitoring/liveness.go
@@ -0,0 +1,40 @@
// Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
// or more contributor license agreements. Licensed under the Elastic License;
// you may not use this file except in compliance with the Elastic License.

package monitoring

import (
"fmt"
"net/http"
"time"

"github.com/elastic/elastic-agent-client/v7/pkg/client"
)

func livenessHandler(coord CoordinatorState) func(http.ResponseWriter, *http.Request) error {
return func(w http.ResponseWriter, r *http.Request) error {
w.Header().Set("Content-Type", "application/json; charset=utf-8")

state := coord.State()
isUp := coord.CoordinatorActive(time.Second * 10)

failConfig, err := handleFormValues(r)
if err != nil {
return fmt.Errorf("error handling form values: %w", err)
}

unhealthyComponent := false
for _, comp := range state.Components {
if (failConfig.Failed && comp.State.State == client.UnitStateFailed) || (failConfig.Degraded && comp.State.State == client.UnitStateDegraded) {
unhealthyComponent = true
}
}
if !isUp {
w.WriteHeader(http.StatusServiceUnavailable)
} else if unhealthyComponent {
w.WriteHeader(http.StatusInternalServerError)
}
return nil
}
}
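
To close the loop, a hedged sketch of how livenessHandler might be exercised against a fake CoordinatorState in a test. fakeCoordState is invented for illustration, and the sketch assumes handleFormValues tolerates a request with no query parameters:

package monitoring

import (
	"net/http"
	"net/http/httptest"
	"testing"
	"time"

	"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator"
)

// fakeCoordState is a hypothetical test double satisfying CoordinatorState.
type fakeCoordState struct {
	active bool
	state  coordinator.State
}

func (f fakeCoordState) State() coordinator.State             { return f.state }
func (f fakeCoordState) CoordinatorActive(time.Duration) bool { return f.active }

func TestLivenessUnresponsiveCoordinator(t *testing.T) {
	handler := livenessHandler(fakeCoordState{active: false})

	// No query parameters; assumes handleFormValues accepts an empty form.
	req := httptest.NewRequest(http.MethodGet, "/liveness", nil)
	rec := httptest.NewRecorder()
	if err := handler(rec, req); err != nil {
		t.Fatal(err)
	}
	// An unresponsive coordinator should map to 503.
	if rec.Code != http.StatusServiceUnavailable {
		t.Fatalf("expected 503, got %d", rec.Code)
	}
}

Since createHandler wraps such a function in an apiHandler that implements http.Handler, registering the result on the monitoring server's mux at /liveness is presumably all the wiring the endpoint needs.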