Support "cluster" scope in Metricbeat elasticsearch module #18547

ycombinator · 2020-05-14T20:17:15Z

What does this PR do?

This PR introduces a new scope setting for the elasticsearch Metricbeat module. This setting can take one of two values:

node (default): indicates that each item in the hosts list points to a distinct Elasticsearch node in a cluster, or
cluster: indicates that each item in the hosts lists points to a single endpoint for a distinct Elasticsearch cluster (e.g. a load-balancing proxy fronting the cluster).

Why is it important?

Sometimes it may not be possible for Metricbeat to reach individual Elasticsearch nodes. It might only have access to a single endpoint that fronts the entire Elasticsearch cluster.

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have made corresponding change to the default configuration files
I have added tests that prove my fix is effective or that my feature works
I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Manual testing

Testing the new scope: cluster functionality introduced in this PR requires an Elasticsearch cluster with multiple nodes. The easiest way is probably to spin up an Elastic Cloud cluster.

Enable the elasticsearch-xpack module.

./metricbeat modules enable elasticsearch-xpack

Configure the module (./modules.d/elasticsearch-xpack.yml) like so:
```
- module: elasticsearch
  xpack.enabled: true
  period: 10s
  hosts:
    - XXXXX
  scope: cluster
  username: XXXXX
  password: XXXXX
```
Obviously, replace the XXXXX placeholders with your own.

The key is that there should be exactly one item under hosts, pointing to single endpoint for your cluster. This could be the Elasticsearch endpoint obtained from Elastic Cloud or the address of a single node in your on-prem/local Elasticsearch cluster.
Configure Metricbeat to send the collected stats to an Elasticsearch cluster. This will act as your Monitoring Cluster. It could be the same cluster you're using to collect stats from (as configured in your elasticsearch-xpack module configuration above) or it could be an entirely separate cluster.
```
metricbeat.config.modules:
path: ${path.config}/modules.d/*.yml
reload.enabled: false

output.elasticsearch:
  enabled: true
  hosts:
    - XXXXXX
  username: XXXXXX
  password: XXXXXX
```
Start Metricbeat.
```
./metricbeat -e
```
Let Metricbeat run for ~30 seconds. Make sure there are no errors in the Metricbeat logs.
Perform the following query against your Monitoring Cluster.
```
POST .monitoring-es-7-mb-*/_search
{
  "size": 0,
  "aggs": {
    "by_cluster_uuid": {
      "terms": {
        "field": "cluster_uuid"
      }
    },
    "cluster_stats": {
      "filter": {
        "term": {
          "type": "cluster_stats"
        }
      },
      "aggs": {
        "by_period": {
          "date_histogram": {
            "field": "timestamp",
            "interval": "10s"
          }
        }
      }
    },
    "node_stats": {
      "filter": {
        "term": {
          "type": "node_stats"
        }
      },
      "aggs": {
        "by_period": {
          "date_histogram": {
            "field": "timestamp",
            "interval": "10s"
          }
        }
      }
    }
  }
}
```
This query checks 3 things:
1. The aggs.by_cluster_uuid aggregation checks that we are only seeing data for a single cluster and that all documents contain that single cluster UUID.
  - Verify that aggregations.by_cluster_uuid.buckets only contains a single bucket, for the single cluster UUID of the Elasticsearch cluster you are monitoring.
  - Verify that aggregations.by_cluster_uuid.buckets[0].doc_count is the same as the hits.total.value.
2. The aggs.cluster_stats aggregation checks that type: cluster_stats documents are only indexed once every collection period (10 seconds) and that there is at most one document per collection period. We are using type: cluster_stats here as an example; the same should be true for any types other than type: node_stats.
  - Verify that aggregations.cluster_stats.by_period.buckets have several buckets, each corresponding to a time period. Each buckets should be 10 seconds "wide". Within each bucket, verify that doc_count is <= 1.
3. The aggs.node_stats aggregation checks that type: node_stats documents are only indexed once every collection period (10 seconds) and that there are at most N documents per collection period, where N is the number of nodes in the cluster you are monitoring.
  - Verify that aggregations.node_stats.by_period.buckets have several buckets, each corresponding to a time period. Each buckets should be 10 seconds "wide". Within each bucket, verify that doc_count is <= N, where N is the number of nodes in the cluster you are monitoring.

Related issues

Resolves Metricbeat elasticsearch module should be able to collect from a single cluster endpoint #18539.

elasticmachine · 2020-05-14T20:41:50Z

💚 Build Succeeded

Expand to view the summary

Build stats

Build Cause: [Pull request Support "cluster" scope in Metricbeat elasticsearch module #18547 updated]
Start Time: 2020-05-18T09:15:01.442+0000
Duration: 73 min 14 sec (4394428)

Test stats 🧪

Test	Results
Failed	0
Passed	3975
Skipped	935
Total	4910

elasticmachine · 2020-05-15T06:38:35Z

Pinging @elastic/integrations (Team:Integrations)

exekias · 2020-05-15T08:20:24Z

metricbeat/module/elasticsearch/ccr/ccr.go

+	// If we're talking to a set of ES nodes directly, only collect stats from the master node so
+	// we don't collect the same stats from every node and end up duplicating them.
+	if m.HostsMode == elasticsearch.HostsModeNode {
+		isMaster, err := elasticsearch.IsMaster(m.HTTP, m.GetServiceURI())
+		if err != nil {
+			return errors.Wrap(err, "error determining if connected Elasticsearch node is master")
+		}

-	// Not master, no event sent
-	if !isMaster {
-		m.Logger().Debug("trying to fetch ccr stats from a non-master node")
-		return nil
+		// Not master, no event sent
+		if !isMaster {
+			m.Logger().Debug("trying to fetch ccr stats from a non-master node")
+			return nil
+		}
 	}


this logic repeats across the board, I wonder if it should go into a helper function

Refactored in 348c9b9.

exekias · 2020-05-15T08:21:16Z

metricbeat/module/elasticsearch/node_stats/data_xpack.go

-	// it will provid the data for multiple nodes. This will mean the detection of the
-	// master node will not be accurate anymore as often in these cases a proxy is in front
-	// of ES and it's not know if the request will be routed to the same node as before.
+	// TODO: call GET _nodes/_master?filter_path=nodes.*.name to figure out the ID of the master node


I guess this TODO is done by GetMasterNodeID

Removed in 34cd849.

exekias · 2020-05-15T08:22:42Z

metricbeat/metricbeat.reference.yml

+  # a distinct Elasticsearch cluster, e.g. a load-balancer fronting the cluster.
+  # Set to node (default) to treat each item in the hosts list as an individual
+  # node in the Elasticsearch cluster.
+  #hosts_mode: node


I wonder if mode is enough, in other places we have used scope too. Anyway I don't have a strong opinion here

I was reserving mode for this change: #9424 (comment).

Using scope here instead of hosts_mode sounds good to me.

Changed in f2a5618.

exekias

This is looking great Shaunak! thank you for working on this

ycombinator · 2020-05-26T20:08:33Z

I tested the changes in this PR for heap usage w.r.t # of nodes in the ES cluster being monitored. There was no significant difference in heap usage, no matter how many nodes were in the the ES cluster being monitored. Details below.

Setup

Spun up an Elastic Cloud deployment, initially with just 1 node in 1 zone. Pre-allocated 64GB RAM since that would be necessary later to scale up the number of nodes.
Modified the elasticsearch module code to only load the index metricset.
Enabled the elasticsearch-xpack module and configured it like so:

- module: elasticsearch
  xpack.enabled: true
  period: 1s
  hosts:
  - https://XXXXXX.us-central1.gcp.cloud.es.io:9243
  username: elastic
  password: XXXXXXXXXX
  scope: cluster

Configured Metricbeat like so:

metricbeat.config.modules:
  path: ${path.config}/modules.d/*.yml
  reload.enabled: false

output.console.enabled: true

Ran Metricbeat with the HTTP profiling endpoint enabled.

./metricbeat -c metricbeat.test.yml --httpprof localhost:9000 -e

Collected Metricbeat heap usage every 200ms.

while true; do curl -s 'http://localhost:9000/debug/vars' | jq '.["beat.memstats.memory_alloc"]'; sleep 0.2; done

Stopped collecting heap usage and Metricbeat. Recorded usage in Google sheet (see below).
Incremented number of nodes in ES cluster by 1, repeated above steps.

Results

https://docs.google.com/spreadsheets/d/1Boi0uw846OSY3vnGqC604Dj8Lh28lD3G9OsKFgPGz8I/edit?usp=sharing

elasticmachine · 2020-05-26T20:09:56Z

Pinging @elastic/stack-monitoring (Stack monitoring)

Co-authored-by: DeDe Morton <dede.morton@elastic.co>

…20413) * Adding configuration for hosts_mode * Only perform master check in HostsModeNode * Only ask the node if it's the master node if we're in HostsModeNode * Unpack host_mode string into enum * Adding some specific TODOs in node_stats code * Updating x-pack/metricbeat reference config * Set correct service URI * Get master node ID * Adding CHANGELOG entry * Rename hosts_mode => scope * Removing stale TODO comment * Adding docs * Refactoring common code into helper method * Do not set service URI up front * Updating documentation per review * Remove comments from doc examples * Adding configuration for hosts_mode * Set correct service URI * Adding CHANGELOG entry * Rename hosts_mode => scope * Do not set service URI up front * Update metricbeat/docs/modules/elasticsearch.asciidoc Co-authored-by: DeDe Morton <dede.morton@elastic.co> * Update metricbeat/module/elasticsearch/_meta/docs.asciidoc Co-authored-by: DeDe Morton <dede.morton@elastic.co> * Update reference config * Cleaning up CHANGELOG * Updating generated files Co-authored-by: DeDe Morton <dede.morton@elastic.co> Co-authored-by: DeDe Morton <dede.morton@elastic.co>

…ne-2.0 * upstream/master: [docs] Promote ingest management to beta (elastic#20295) Upgrade elasticsearch client library used in tests (elastic#20405) Disable logging when pulling on python integration tests (elastic#20397) Remove pillow from testing requirements.txt (elastic#20407) [Filebeat][ATP Module]Setting user agent field required by the API (elastic#20440) [Ingest Manager] Send datastreams fields (elastic#20402) Add event.ingested to all Filebeat modules (elastic#20386) [Elastic Agent] Fix agent control socket path to always be less than 107 characters (elastic#20426) Improve cgroup_regex docs with examples (elastic#20425) Makes `metrics` config option required in app_insights (elastic#20406) Ensure install scripts only install if needed (elastic#20349) Update container name for the azure filesets (elastic#19899) Group same timestamp metrics values in app_insights metricset (elastic#20403) add_process_metadata processor adds container id even if process metadata not accessible (elastic#19767) Support "cluster" scope in Metricbeat elasticsearch module (elastic#18547) [Filebeat][SophosXG Module] Renaming module and fileset (elastic#20396) Update Suricata dashboards (elastic#20394) [Elastic Agent] Improve version, restart, enroll CLI commands (elastic#20359) Prepare home directories for docker images in a different stage (elastic#20356)

…allation * upstream/master: (23 commits) [docs] Promote ingest management to beta (elastic#20295) Upgrade elasticsearch client library used in tests (elastic#20405) Disable logging when pulling on python integration tests (elastic#20397) Remove pillow from testing requirements.txt (elastic#20407) [Filebeat][ATP Module]Setting user agent field required by the API (elastic#20440) [Ingest Manager] Send datastreams fields (elastic#20402) Add event.ingested to all Filebeat modules (elastic#20386) [Elastic Agent] Fix agent control socket path to always be less than 107 characters (elastic#20426) Improve cgroup_regex docs with examples (elastic#20425) Makes `metrics` config option required in app_insights (elastic#20406) Ensure install scripts only install if needed (elastic#20349) Update container name for the azure filesets (elastic#19899) Group same timestamp metrics values in app_insights metricset (elastic#20403) add_process_metadata processor adds container id even if process metadata not accessible (elastic#19767) Support "cluster" scope in Metricbeat elasticsearch module (elastic#18547) [Filebeat][SophosXG Module] Renaming module and fileset (elastic#20396) Update Suricata dashboards (elastic#20394) [Elastic Agent] Improve version, restart, enroll CLI commands (elastic#20359) Prepare home directories for docker images in a different stage (elastic#20356) New multiline mode in Filebeat: while_pattern (elastic#19662) ...

…8547) * Adding configuration for hosts_mode * Only perform master check in HostsModeNode * Only ask the node if it's the master node if we're in HostsModeNode * Unpack host_mode string into enum * Adding some specific TODOs in node_stats code * Updating x-pack/metricbeat reference config * Set correct service URI * Get master node ID * Adding CHANGELOG entry * Rename hosts_mode => scope * Removing stale TODO comment * Adding docs * Refactoring common code into helper method * Do not set service URI up front * Updating documentation per review * Remove comments from doc examples * Adding configuration for hosts_mode * Set correct service URI * Adding CHANGELOG entry * Rename hosts_mode => scope * Do not set service URI up front * Update metricbeat/docs/modules/elasticsearch.asciidoc Co-authored-by: DeDe Morton <dede.morton@elastic.co> * Update metricbeat/module/elasticsearch/_meta/docs.asciidoc Co-authored-by: DeDe Morton <dede.morton@elastic.co> * Update reference config * Cleaning up CHANGELOG * Updating generated files Co-authored-by: DeDe Morton <dede.morton@elastic.co>

botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label May 14, 2020

andresrc added the Team:Integrations Label for the Integrations team label May 15, 2020

botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label May 15, 2020

andresrc added [zube]: Inbox [zube]: In Review and removed [zube]: Inbox [zube]: In Review labels May 15, 2020

exekias reviewed May 15, 2020

View reviewed changes

ycombinator requested a review from dedemorton May 15, 2020 11:36

ycombinator marked this pull request as ready for review May 26, 2020 20:09

ycombinator requested a review from exekias May 26, 2020 20:09

ycombinator added the Feature:Stack Monitoring label May 26, 2020

ycombinator added enhancement v7.9.0 v8.0.0 needs_backport PR is waiting to be backported to other branches. labels May 26, 2020

ycombinator changed the title ~~Support "cluster" mode in Metricbeat elasticsearch module~~ Support "cluster" scope in Metricbeat elasticsearch module May 26, 2020

exekias approved these changes May 27, 2020

View reviewed changes

ycombinator mentioned this pull request May 27, 2020

Update ES monitoring with Metricbeat doc with new scope setting elastic/elasticsearch#57217

Closed

ycombinator added the test-plan Add this PR to be manual test plan label May 27, 2020

ycombinator and others added 15 commits August 3, 2020 09:38

Adding docs

b763633

Refactoring common code into helper method

e80a7f4

Do not set service URI up front

b493312

Updating documentation per review

2a89782

Remove comments from doc examples

6ebfeae

Adding configuration for hosts_mode

3c84bb3

Set correct service URI

8e46133

Adding CHANGELOG entry

de0defd

Rename hosts_mode => scope

8f005dd

Do not set service URI up front

01a51f2

Update metricbeat/docs/modules/elasticsearch.asciidoc

baa6da8

Co-authored-by: DeDe Morton <dede.morton@elastic.co>

Update metricbeat/module/elasticsearch/_meta/docs.asciidoc

9569e6d

Co-authored-by: DeDe Morton <dede.morton@elastic.co>

Update reference config

54f6c3f

Cleaning up CHANGELOG

d0343a6

Updating generated files

2f507a9

ycombinator merged commit aaeead0 into elastic:master Aug 3, 2020

ycombinator deleted the mb-es-cluster-mode branch August 3, 2020 19:25

zube bot added [zube]: Done and removed [zube]: In Review labels Aug 3, 2020

ycombinator mentioned this pull request Aug 3, 2020

Cherry-pick to 7.x: Support "cluster" scope in Metricbeat elasticsearch module (#18547) #20413

Merged

ycombinator removed needs_backport PR is waiting to be backported to other branches. Team:Automation Label for the Observability productivity team labels Aug 3, 2020

ycombinator mentioned this pull request Aug 10, 2020

Clarrify metricbeat must fetch data from all nodes #20441

Closed

6 tasks

andresrc added the test-plan-added This PR has been added to the test plan label Oct 3, 2020

zube bot removed the [zube]: Done label Nov 2, 2020

lcawl mentioned this pull request Dec 24, 2020

[DOCS] Re-add Stack Monitoring section in elasticsearch module #23286

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support "cluster" scope in Metricbeat elasticsearch module #18547

Support "cluster" scope in Metricbeat elasticsearch module #18547

ycombinator commented May 14, 2020 •

edited

Loading

elasticmachine commented May 14, 2020 •

edited

Loading

Build stats

Test stats 🧪

elasticmachine commented May 15, 2020

exekias May 15, 2020

ycombinator May 15, 2020

exekias May 15, 2020

ycombinator May 15, 2020

exekias May 15, 2020

ycombinator May 15, 2020 •

edited

Loading

ycombinator May 15, 2020

exekias left a comment

ycombinator commented May 26, 2020

elasticmachine commented May 26, 2020

Support "cluster" scope in Metricbeat elasticsearch module #18547

Support "cluster" scope in Metricbeat elasticsearch module #18547

Conversation

ycombinator commented May 14, 2020 • edited Loading

What does this PR do?

Why is it important?

Checklist

Manual testing

Related issues

elasticmachine commented May 14, 2020 • edited Loading

💚 Build Succeeded

Build stats

Test stats 🧪

elasticmachine commented May 15, 2020

exekias May 15, 2020

Choose a reason for hiding this comment

ycombinator May 15, 2020

Choose a reason for hiding this comment

exekias May 15, 2020

Choose a reason for hiding this comment

ycombinator May 15, 2020

Choose a reason for hiding this comment

exekias May 15, 2020

Choose a reason for hiding this comment

ycombinator May 15, 2020 • edited Loading

Choose a reason for hiding this comment

ycombinator May 15, 2020

Choose a reason for hiding this comment

exekias left a comment

Choose a reason for hiding this comment

ycombinator commented May 26, 2020

Setup

Results

elasticmachine commented May 26, 2020

ycombinator commented May 14, 2020 •

edited

Loading

elasticmachine commented May 14, 2020 •

edited

Loading

ycombinator May 15, 2020 •

edited

Loading