Beat readiness probe #3197

sebgl · 2020-06-08T08:44:32Z

We probably want to introduce a readiness probe for Beats.
It's a bit surprising right now to see filebeat "ready" while Elasticsearch is unavailable.

It looks like we could execute a filebeat test output command. To investigate.


david-kow · 2020-06-08T09:57:58Z

What should "ready" indicate, though? If a Beat can start taking in logs/metrics, I'd consider it ready even if the output itself is not ready. I'd think that's what the output's own readiness (ES, for instance) is for.

anyasabo · 2020-06-08T14:17:58Z

For filebeat, filebeat test output is at least what the helm chart uses:
https://github.com/elastic/helm-charts/blob/master/filebeat/values.yaml#L72

anyasabo · 2020-07-20T20:19:13Z

https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#when-should-you-use-a-readiness-probe

If you'd like to start sending traffic to a Pod only when a probe succeeds, specify a readiness probe. In this case, the readiness probe might be the same as the liveness probe, but the existence of the readiness probe in the spec means that the Pod will start without receiving any traffic and only start receiving traffic after the probe starts succeeding. If your Container needs to work on loading large data, configuration files, or migrations during startup, specify a readiness probe.

If you want your Container to be able to take itself down for maintenance, you can specify a readiness probe that checks an endpoint specific to readiness that is different from the liveness probe.

The main reason I can think of for defining a readiness probe is if you were using Beats to monitor your other Beats. In that case I think you would want to know if the Beat was up but the output was down (and so it should be ready even if the output is down).

"Is the output responding" seems more of a question of health in the beats status. I'm not sure there's a good way for ECK to retrieve that though. We currently define beat health as:

const (
	// BeatRedHealth means that the health is neither yellow nor green.
	BeatRedHealth BeatHealth = "red"

	// BeatYellowHealth means that:
	// 1) at least one Pod is Ready, and
	// 2) association is not configured, or configured and established
	BeatYellowHealth BeatHealth = "yellow"

	// BeatGreenHealth means that:
	// 1) all Pods are Ready, and
	// 2) association is not configured, or configured and established
	BeatGreenHealth BeatHealth = "green"
)
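Read literally, the comments above describe a small decision function. The sketch below is one way to write it, not ECK's actual implementation: `healthInput` and `calculateHealth` are hypothetical names, and treating zero ready Pods as red is an assumption. Note that output reachability is not an input at all.

```go
package main

import "fmt"

type BeatHealth string

const (
	BeatRedHealth    BeatHealth = "red"
	BeatYellowHealth BeatHealth = "yellow"
	BeatGreenHealth  BeatHealth = "green"
)

// healthInput is a hypothetical bundle of the facts the comments refer to.
type healthInput struct {
	readyPods              int
	desiredPods            int
	associationConfigured  bool
	associationEstablished bool
}

// calculateHealth follows the comments: the association must be either not
// configured or established, and then Pod readiness picks yellow or green.
func calculateHealth(in healthInput) BeatHealth {
	associationOK := !in.associationConfigured || in.associationEstablished
	switch {
	case !associationOK || in.readyPods == 0:
		// Assumption: no ready Pods means red, since yellow needs at least one.
		return BeatRedHealth
	case in.readyPods >= in.desiredPods:
		return BeatGreenHealth
	default:
		return BeatYellowHealth
	}
}

func main() {
	fmt.Println(calculateHealth(healthInput{readyPods: 2, desiredPods: 3}))
}
```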

david-kow · 2020-07-21T07:25:20Z

In that case I think you would want to know if the beat was up but the output was down (and so it should be ready even if the output is down).

I'm not sure I'm getting what do you mean here. If we have:

ES    <----    Metricbeat    --(monitoring)-->    Filebeat    --(shipping logs for)-->    Pod

Then we can have the following (main) failure cases:

Pod is down - Metricbeat and Filebeat are ready
Filebeat is down - Metricbeat reports that Filebeat is down, but Metricbeat itself is ready
ES is down - Metricbeat can't output, but it's running (and caches the data) so it's ready
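The three cases can be written down as a small table. This is only a sketch of the argument, with hypothetical names (`state`, `metricbeatReady`, `filebeatReady`): readiness of a Beat depends solely on whether that Beat is running, not on what it monitors or where it sends data.

```go
package main

import "fmt"

// state records which components are up in the chain
// ES <- Metricbeat -> Filebeat -> Pod.
type state struct {
	podUp, filebeatUp, metricbeatUp, esUp bool
}

// Readiness looks only at the Beat itself; upstream and downstream
// components are deliberately ignored.
func metricbeatReady(s state) bool { return s.metricbeatUp }
func filebeatReady(s state) bool   { return s.filebeatUp }

func main() {
	cases := []struct {
		name string
		s    state
	}{
		{"Pod is down", state{podUp: false, filebeatUp: true, metricbeatUp: true, esUp: true}},
		{"Filebeat is down", state{podUp: true, filebeatUp: false, metricbeatUp: true, esUp: true}},
		{"ES is down", state{podUp: true, filebeatUp: true, metricbeatUp: true, esUp: false}},
	}
	for _, c := range cases {
		fmt.Printf("%-17s metricbeat ready=%t, filebeat ready=%t\n",
			c.name, metricbeatReady(c.s), filebeatReady(c.s))
	}
}
```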

For "Is the output responding" I agree it's difficult; I think we would only know from the logs that there is an issue.

anyasabo · 2020-07-21T12:59:38Z

I'm not sure I'm getting what do you mean here.

Because I did a poor job of explaining it :D What I meant was that I think we want to leave it as is, for the reasons you described in your comment. If we want to do anything, it would be exposing the output status in the Beats CR, but I'm not sure we can do that simply (maybe the Beats state/status endpoint exposes the info?).

pebrc · 2020-08-10T15:32:40Z

We should probably close this in favour of another issue that will update the status of the Beats resource with some information about the output status.

Just as an aside, because filebeat test output was mentioned: it returns an error despite a working configuration, because of a DNS check it performs:

[root@gke-pebrc-dev-cluster-default-pool-0ce0f2c1-nl52 filebeat]# filebeat test output
elasticsearch: http://elasticsearch:9200...
  parse url... OK
  connection...
    parse host... OK
    dns lookup... ERROR lookup elasticsearch on 10.73.16.10:53: no such host

sebgl added the >enhancement label Jun 8, 2020

pebrc added the :feature/Beats label Jun 8, 2020

david-kow mentioned this issue Jun 12, 2020

Add KibanaRef to Beats and support setup.kibana #3211

Merged

david-kow removed the :beats label Jul 28, 2020
