Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a /-/healthy endpoint for monitoring component health #2197

Merged
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,8 @@ Main (unreleased)

- Add `otelcol.receiver.influxdb` to convert influx metric into OTEL. (@EHSchmitt4395)

- Add a new `/-/healthy` endpoint which returns HTTP 500 if one or more components are unhealthy. (@ptodev)

### Enhancements

- Add second metrics sample to the support bundle to provide delta information (@dehaansa)
Expand Down
2 changes: 1 addition & 1 deletion docs/sources/reference/_index.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
---
canonical: https://grafana.com/docs/alloy/latest/reference/
description: The reference-level documentaiton for Grafana Aloy
ptodev marked this conversation as resolved.
Show resolved Hide resolved
description: The reference-level documentation for Grafana Alloy
menuTitle: Reference
title: Grafana Alloy Reference
weight: 600
Expand Down
78 changes: 78 additions & 0 deletions docs/sources/reference/http/_index.md
ptodev marked this conversation as resolved.
Show resolved Hide resolved
Original file line number Diff line number Diff line change
@@ -0,0 +1,78 @@
---
canonical: https://grafana.com/docs/alloy/latest/reference/http/
description: Learn about HTTP endpoints exposed by Grafana Alloy
title: The Grafana Alloy HTTP endpoints
menuTitle: HTTP endpoints
weight: 700
---

# The {{% param "FULL_PRODUCT_NAME" %}} HTTP endpoints
ptodev marked this conversation as resolved.
Show resolved Hide resolved

{{< param "FULL_PRODUCT_NAME" >}} has several default HTTP endpoints that are available by default regardless of which components you have configured.
You can use these HTTP endpoints to monitor, health check, and troubleshoot {{< param "PRODUCT_NAME" >}}.

The HTTP server which exposes them is configured via the [http block](../config-blocks/http)
and the `--server.` [command line arguments](../cli/run).
For example, if you set the `--server.http.listen-addr` command line argument to `127.0.0.1:12345`,
you can query the `127.0.0.1:12345/metrics` endpoint to see the internal metrics of {{< param "PRODUCT_NAME" >}}.

### /metrics

The `/metrics` endpoint returns the internal metrics of {{< param "PRODUCT_NAME" >}} in the Prometheus exposition format.

### /-/ready

An {{< param "PRODUCT_NAME" >}} instance is ready once it has loaded its initial configuration.
If the instance is ready, the `/-/ready` endpoint returns `HTTP 200 OK` and the message `Alloy is ready.`
Otherwise, if the instance is not ready, the `/-/ready` endpoint returns `HTTP 503 Service Unavailable` and the message `Alloy is not ready.`

### /-/healthy

When all {{< param "PRODUCT_NAME" >}} components are working correctly, all components are considered healthy.
If all components are healthy, the `/-/healthy` endpoint returns `HTTP 200 OK` and the message `All Alloy components are healthy.`.
Otherwise, if any of the components are not working correctly, the `/-/healthy` endpoint returns `HTTP 500 Internal Server Error` and an error message.
You can also monitor component health through the {{< param "PRODUCT_NAME" >}} [UI](../../troubleshoot/debug#alloy-ui).

```shell
$ curl localhost:12345/-/healthy
All Alloy components are healthy.
```

```shell
$ curl localhost:12345/-/healthy
unhealthy components: math.add
```

{{< admonition type="note" >}}
The `/-/healthy` endpoint isn't suitable for a [Kubernetes liveness probe][k8s-liveness].

An {{< param "PRODUCT_NAME" >}} instance that reports as unhealthy should not necessarily be restarted.
For example, a component may be unhealthy due to an invalid configuration or an unavailable external resource.
In this case, restarting {{< param "PRODUCT_NAME" >}} would not fix the problem.
A restart may make it worse, because it would could stop the flow of telemetry in healthy pipelines.

[k8s-liveness]: https://kubernetes.io/docs/concepts/configuration/liveness-readiness-startup-probes/
{{< /admonition >}}

### /-/reload

The `/-/reload` endpoint reloads the {{< param "PRODUCT_NAME" >}} configuration file.
If the configuration file can't be reloaded, the `/-/reload` endpoint returns `HTTP 400 Bad Request` and an error message.

```shell
$ curl localhost:12345/-/reload
config reloaded
```

```shell
$ curl localhost:12345/-/reload
error during the initial load: /Users/user1/Desktop/git.alloy:13:1: Failed to build component: loading custom component controller: custom component config not found in the registry, namespace: "math", componentName: "add"
```

### /-/support

The `/-/support` endpoint returns a [support bundle](../../troubleshoot/support_bundle) that contains information about your {{< param "PRODUCT_NAME" >}} instance. You can use this information as a baseline when debugging an issue.

### /debug/pprof

The `/debug/pprof` endpoint returns a pprof Go [profile](../../troubleshoot/profile) that you can use to visualize and analyze profiling data.
26 changes: 26 additions & 0 deletions internal/service/http/http.go
Original file line number Diff line number Diff line change
Expand Up @@ -188,6 +188,32 @@ func (s *Service) Run(ctx context.Context, host service.Host) error {
otelmux.WithTracerProvider(s.tracer),
))

// The implementation for "/-/healthy" is inspired by
// the "/components" web API endpoint in /internal/web/api/api.go
r.HandleFunc("/-/healthy", func(w http.ResponseWriter, r *http.Request) {
components, err := host.ListComponents("", component.InfoOptions{
thampiotr marked this conversation as resolved.
Show resolved Hide resolved
GetHealth: true,
})
if err != nil {
http.Error(w, err.Error(), http.StatusInternalServerError)
return
}

unhealthyComponents := []string{}
for _, c := range components {
if c.Health.Health == component.HealthTypeUnhealthy {
unhealthyComponents = append(unhealthyComponents, c.ComponentName)
}
}
if len(unhealthyComponents) > 0 {
http.Error(w, "unhealthy components: "+strings.Join(unhealthyComponents, ", "), http.StatusInternalServerError)
return
}

fmt.Fprintln(w, "All Alloy components are healthy.")
w.WriteHeader(http.StatusOK)
})

r.Handle(
"/metrics",
promhttp.HandlerFor(s.gatherer, promhttp.HandlerOpts{}),
Expand Down
82 changes: 77 additions & 5 deletions internal/service/http/http_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,7 @@ package http
import (
"context"
"fmt"
"io"
"net/http"
"testing"

Expand Down Expand Up @@ -43,6 +44,26 @@ func TestHTTP(t *testing.T) {
require.NoError(t, err)
defer resp.Body.Close()

buf, err := io.ReadAll(resp.Body)
require.Equal(t, "Alloy is ready.\n", string(buf))

require.Equal(t, http.StatusOK, resp.StatusCode)
})

util.Eventually(t, func(t require.TestingT) {
ptodev marked this conversation as resolved.
Show resolved Hide resolved
cli, err := config.NewClientFromConfig(config.HTTPClientConfig{}, "test")
require.NoError(t, err)

req, err := http.NewRequest(http.MethodGet, fmt.Sprintf("http://%s/-/healthy", env.ListenAddr()), nil)
require.NoError(t, err)

resp, err := cli.Do(req)
require.NoError(t, err)
defer resp.Body.Close()

buf, err := io.ReadAll(resp.Body)
require.Equal(t, "All Alloy components are healthy.\n", string(buf))

require.Equal(t, http.StatusOK, resp.StatusCode)
})
}
Expand Down Expand Up @@ -157,9 +178,53 @@ func Test_Toggle_TLS(t *testing.T) {
}
}

func TestUnhealthy(t *testing.T) {
ctx := componenttest.TestContext(t)

env, err := newTestEnvironment(t)
require.NoError(t, err)

env.components = []*component.Info{
{
ID: component.ID{
ModuleID: "",
LocalID: "testCompId",
},
Label: "testCompLabel",
ComponentName: "testCompName",
Health: component.Health{
Health: component.HealthTypeUnhealthy,
},
},
}
require.NoError(t, env.ApplyConfig(""))

go func() {
require.NoError(t, env.Run(ctx))
}()

util.Eventually(t, func(t require.TestingT) {
cli, err := config.NewClientFromConfig(config.HTTPClientConfig{}, "test")
require.NoError(t, err)

req, err := http.NewRequest(http.MethodGet, fmt.Sprintf("http://%s/-/healthy", env.ListenAddr()), nil)
require.NoError(t, err)

resp, err := cli.Do(req)
require.NoError(t, err)
defer resp.Body.Close()

buf, err := io.ReadAll(resp.Body)
require.Equal(t, "unhealthy components: testCompName\n", string(buf))

require.Equal(t, http.StatusInternalServerError, resp.StatusCode)
})
}

type testEnvironment struct {
svc *Service
addr string
svc *Service
addr string
components []*component.Info
}

func newTestEnvironment(t *testing.T) (*testEnvironment, error) {
Expand Down Expand Up @@ -196,20 +261,27 @@ func (env *testEnvironment) ApplyConfig(config string) error {
}

func (env *testEnvironment) Run(ctx context.Context) error {
return env.svc.Run(ctx, fakeHost{})
return env.svc.Run(ctx, fakeHost{
components: env.components,
})
}

func (env *testEnvironment) ListenAddr() string { return env.addr }

type fakeHost struct{}
type fakeHost struct {
components []*component.Info
}

var _ service.Host = (fakeHost{})

func (fakeHost) GetComponent(id component.ID, opts component.InfoOptions) (*component.Info, error) {
return nil, fmt.Errorf("no such component %s", id)
}

func (fakeHost) ListComponents(moduleID string, opts component.InfoOptions) ([]*component.Info, error) {
func (f fakeHost) ListComponents(moduleID string, opts component.InfoOptions) ([]*component.Info, error) {
if f.components != nil {
return f.components, nil
}
if moduleID == "" {
return nil, nil
}
Expand Down
Loading