Task/use etcd metrics endpoint #11280

odacremolbap · 2019-03-16T19:17:37Z

Add etcd metrics endpoint for Etcd V3 as a new metricset

odacremolbap · 2019-03-16T19:19:35Z

I need to wrap my head around the unit tests and docs.
Opening the PR now to get some help while I read up on how other modules handle them.

ruflin · 2019-03-18T09:10:26Z

@odacremolbap Happy to help, let me know where :-)

odacremolbap · 2019-03-19T12:52:20Z

Received some out-of-band feedback on metricset naming:

Etcd is a single binary that exposes both the V2 and V3 APIs.
Clients choose which one to use; since V3 was released, all clients are expected to use V3.
Each version (V2/V3) keeps most endpoints and storage separate. Data saved through one version cannot be read through the other.
Monitoring is also kept separate for each version: V2 uses the current Beats metricsets (self, leader, store), while V3 exposes a Prometheus-formatted metrics endpoint. Metrics read for each version are specific to that version, although I have some doubts regarding memory and disk.
It is not easy to discover which version is being used, since all endpoints are up.

From a user perspective:

An admin knows which version is being used; it is expected to be V3.
Although it is possible to use V2 and V3 at the same time, that's an anti-pattern to be avoided.

At this moment, we are exposing

self
store
disk
metrics

The first three are for V2; the last one is for V3.
We need to find a way to make clear which one to use.

jsoriano

Thanks for taking this! I have added some comments, mainly about fields and default configs.

metricbeat/docs/modules/etcd/metrics.asciidoc

metricbeat/metricbeat.reference.yml

jsoriano · 2019-03-19T13:30:13Z

metricbeat/module/etcd/metrics/_meta/data.json

+                        "1.024": 6,
+                        "2.048": 6,
+                        "4.096": 6,
+                        "8.192": 6


Having dots in field names is problematic, as they are going to be stored as objects in Elasticsearch. Please scale them to milliseconds (take a look at this conversation and this PR).
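To make the concern concrete, here is a minimal Go sketch. It assumes a simplified model in which a dotted key is split on dots into nested objects, and shows how scaling the bucket bounds from seconds to milliseconds yields dot-free keys. `expandDotted` and `scaleBound` are hypothetical helpers written for illustration; they are not part of Beats or Elasticsearch.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
	"strings"
)

// expandDotted is a hypothetical helper: it models, in a simplified way, a
// dotted key being treated as a path of nested objects.
func expandDotted(key string, value interface{}) map[string]interface{} {
	parts := strings.Split(key, ".")
	root := map[string]interface{}{}
	node := root
	for _, p := range parts[:len(parts)-1] {
		child := map[string]interface{}{}
		node[p] = child
		node = child
	}
	node[parts[len(parts)-1]] = value
	return root
}

// scaleBound is a hypothetical helper: it multiplies a bucket upper bound
// (given as a decimal string) by factor and formats the rounded result
// without a fractional part. Bounds finer than 1/factor lose precision.
func scaleBound(bound string, factor float64) (string, error) {
	v, err := strconv.ParseFloat(bound, 64)
	if err != nil {
		return "", fmt.Errorf("parsing bucket bound %q: %w", bound, err)
	}
	return strconv.FormatFloat(math.Round(v*factor), 'f', -1, 64), nil
}

func main() {
	// Bucket bounds in seconds, as in the data.json excerpt above.
	for _, bound := range []string{"1.024", "2.048", "4.096", "8.192"} {
		ms, err := scaleBound(bound, 1000)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%q nests as %v; scaled key %q stays flat as %v\n",
			bound, expandDotted(bound, 6), ms, expandDotted(ms, 6))
	}
}
```

With second-based bounds, a key such as "1.024" turns into an object "1" containing a field "024", whereas the millisecond key "1024" remains a single field.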

jsoriano · 2019-03-19T13:32:20Z

metricbeat/module/etcd/metrics/metrics.go

+			// Disk
+			"etcd_mvcc_db_total_size_in_bytes":          prometheus.Metric("disk.mvcc_db_total_size_in_bytes"),
+			"etcd_disk_wal_fsync_duration_seconds":      prometheus.Metric("disk.wal_fsync_duration_seconds"),
+			"etcd_disk_backend_commit_duration_seconds": prometheus.Metric("disk.backend_commit_duration_seconds"),


To scale it to milliseconds it'd be something like

Suggested change

"etcd_disk_backend_commit_duration_seconds": prometheus.Metric("disk.backend_commit_duration_seconds"),

"etcd_disk_backend_commit_duration_seconds": prometheus.Metric("disk.backend_commit_duration_seconds", prometheus.OpMultiplyBuckets(1000)),

jsoriano · 2019-03-19T13:34:30Z

metricbeat/module/etcd/metrics/_meta/fields.yml

+      description: >
+        Write ahead logs latency sum
+
+    - name: disk.backend_commit_duration_seconds.bucket


I think a wildcard is needed here (and the same for other histograms)

Suggested change

- name: disk.backend_commit_duration_seconds.bucket

- name: disk.backend_commit_duration_seconds.bucket.*
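The intent of the wildcard can be illustrated with Go's `path.Match`, used here purely as an analogy. How Beats turns `fields.yml` wildcards into Elasticsearch mappings is not shown in this thread, and the bucket field name below (a millisecond-scaled key) is an assumption made for the example.

```go
package main

import (
	"fmt"
	"path"
)

// covers reports whether a field name matches a pattern, using shell-style
// matching from the standard library. A malformed pattern counts as no match.
func covers(pattern, field string) bool {
	matched, err := path.Match(pattern, field)
	return err == nil && matched
}

func main() {
	// Hypothetical field produced for one histogram bucket.
	field := "disk.backend_commit_duration_seconds.bucket.1024"

	for _, pattern := range []string{
		"disk.backend_commit_duration_seconds.bucket",   // without wildcard
		"disk.backend_commit_duration_seconds.bucket.*", // with wildcard
	} {
		fmt.Printf("%s covers %s: %v\n", pattern, field, covers(pattern, field))
	}
}
```

Only the pattern ending in `.*` covers the per-bucket fields, which is the reason for suggesting it for every histogram.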

jsoriano · 2019-03-19T13:54:14Z

metricbeat/module/etcd/metrics/metrics_integration_test.go

+	if err := mbtest.WriteEventsReporterV2(f, t, ""); err != nil {
+		t.Fatal("write", err)
+	}
+}


WriteEventsReporterV2 already checks for errors and non-empty events, so this method should be just:

func TestData(t *testing.T) {
	compose.EnsureUp(t, "etcd")

	f := mbtest.NewReportingMetricSetV2(t, getConfig())
	if err := mbtest.WriteEventsReporterV2(f, t, ""); err != nil {
		t.Fatal("write", err)
	}
}

metricbeat/module/etcd/metrics/metrics_integration_test.go

metricbeat/module/etcd/store/_meta/docs.asciidoc

jsoriano · 2019-03-19T14:11:03Z

metricbeat/module/etcd/metrics/_meta/testdata/metrics.plain

+grpc_server_handled_total{grpc_code="Internal",grpc_method="UserGrantRole",grpc_service="etcdserverpb.Auth",grpc_type="unary"} 0
+grpc_server_handled_total{grpc_code="Internal",grpc_method="UserList",grpc_service="etcdserverpb.Auth",grpc_type="unary"} 0
+grpc_server_handled_total{grpc_code="Internal",grpc_method="UserRevokeRole",grpc_service="etcdserverpb.Auth",grpc_type="unary"} 0
+grpc_server_handled_total{grp


It'd probably be nice to expose some of these per-method metrics, but let's leave that for future changes.

metricbeat/module/etcd/metrics/_meta/fields.yml

jsoriano · 2019-03-19T14:18:29Z

Regarding v2 vs v3 storage: I think we should continue enabling the old metricsets by default, and at some point also enable this new one so that everything is covered by default.

I saw that some operations on v2 storage also affect the metrics exposed by the new endpoint. Maybe at some point the new endpoint will cover both storages; I'm not sure about their plans on this. If that is the case, maybe we can deprecate the old metricsets at some point, but not before Metricbeat 8.

ruflin · 2019-03-20T08:03:27Z

I think there are 2 kinds of users. The admin who deploys Metricbeat should know whether v2 or v3 is in use. From a data consumer's perspective, it should not matter which one is used. My understanding so far is that v3 is mostly a superset of v2, so if a user upgrades from v2 to v3, the resulting events should still look the same but have more data inside.

The above assumes that we like the current data structure. If we don't, I'm also OK with introducing a new, better data structure for v3.

odacremolbap · 2019-03-20T09:58:51Z

Regarding module/metricsets:

current schema is

module etcd
metricsets store, self, leader, metrics

The main thing to improve is letting users know which etcd version matches which metrics. The current V2 metricsets (store, self, leader) are GA and in use; IMHO we should keep them as is for now. The metrics metricset is V3. Its name follows the endpoint from which the metrics are retrieved, which is not descriptive from the user's POV.

Choices:

1. keep the metricsets as they are
2. change metrics to something like metricsV3
3. refactor the V2 metricsets so they become storeV2, selfV2, leaderV2, while also keeping the current names for backwards compatibility
4. create etcdV3 as a new module separate from the current `etcd`

My preference would be option 2.
Comments, @ruflin @jsoriano?

ruflin · 2019-03-20T10:22:46Z

As mentioned before, I prefer not to mention V3 in the final doc, as it should not matter to the consumer, so just metrics should be fine. What I don't like about the metrics prefix is that it's a bit meaningless, as everything here is metrics.

One other idea, triggered by your comment that it's a huge migration from A to B and that it's unlikely both will be used at the same time: what happens if we don't use the metrics prefix and put all the metrics directly under etcd? Will it conflict with v2? It's not something we typically do for metricsets, but if I understand this change to Prometheus correctly, this will be the only endpoint available and it's unlikely that more endpoints will be added in v3? It's just that more metrics will be added? Based on your example event we would have etcd.disk.* etc., and none of these seem to conflict with the current ones?

odacremolbap · 2019-03-20T11:03:59Z

Generally speaking, adding a version to metric names is not a good idea. I agree with you: there is no point in creating metricsets for fooV1 and a new set for fooV2. The problem with etcd is that it sort of bundles 2 products in 1 binary.

As an example, here is how you use the client for V2 for any etcd release:

etcdctl get myKey

and here is how you use V3

ETCDCTL_API=3 etcdctl get myKey

Admins, devs, users, anyone dealing with etcd must know beforehand if they are targeting V2 or V3. Internally the command will be redirected to a different set of functions based on the environment variable.
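The dispatch described above can be pictured with a short Go sketch. It is not etcdctl's actual code: `selectAPI` is a hypothetical function that only mirrors the behavior stated in this comment, where the V2 API is used unless `ETCDCTL_API=3` is set.

```go
package main

import (
	"fmt"
	"os"
)

// selectAPI is a hypothetical helper that picks the API version from the
// value of the ETCDCTL_API environment variable. Following the comment
// above, anything other than "3" falls back to V2.
func selectAPI(envValue string) string {
	if envValue == "3" {
		return "v3"
	}
	return "v2"
}

func main() {
	// The same command line ends up on a different code path depending on
	// the environment, so the caller has to know which API is targeted.
	fmt.Println("using API", selectAPI(os.Getenv("ETCDCTL_API")))
}
```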

So, if I understood correctly, the proposal would be to keep the current V2 metricsets

etcd.store
etcd.self
etcd.leader

and add the new V3 metrics at that same level, without a V3 reference

etcd.memory
etcd.network
etcd.server
etcd.disk

I'm OK with that, since it avoids the extremely generic term metrics. I'm just wondering if we can come up with something better to highlight that whenever you use V3 you should be using the new ones, just like etcdctl requires the ETCDCTL_API=3 environment variable.

We might not add any V3 reference to the metrics, but as an etcd user in my past life, I would be confused to see etcd.server and etcd.self without knowing which one is pre-V3 and which one is post-V3.

If you feel that our best move is to set those names and clarify in the docs, I'm OK with that. I guess there is no perfect solution, since this problem comes from upstream etcd.

odacremolbap · 2019-03-21T08:45:21Z

As discussed out of band with @ruflin

We will add a field to the metricsets that indicates the API version used when retrieving the metrics.
All V2 metrics will have apiVersion: 2
All V3 metrics will have apiVersion: 3

Although users will still need to go to the docs to check which metricset applies to which apiVersion, this solution has a number of advantages:

if some V2 metrics are still of use to V3 users, they will be around
V2 and V3 will still be distinguishable and available for filtering in ES
current V2 users won't be affected
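A rough Go sketch of the idea, using plain maps instead of Beats event types: every event carries the API version it was collected with, and consumers filter on it. The `event` type, both helpers, and the sample field values are hypothetical; the field name follows the proposal in this comment.

```go
package main

import "fmt"

// event is a simplified stand-in for a metricset event.
type event map[string]interface{}

// withAPIVersion returns a copy of fields with the apiVersion field added,
// leaving the input untouched.
func withAPIVersion(fields event, version string) event {
	out := make(event, len(fields)+1)
	for k, v := range fields {
		out[k] = v
	}
	out["apiVersion"] = version
	return out
}

// filterByAPIVersion keeps only the events tagged with the given version.
func filterByAPIVersion(events []event, version string) []event {
	var matched []event
	for _, e := range events {
		if e["apiVersion"] == version {
			matched = append(matched, e)
		}
	}
	return matched
}

func main() {
	events := []event{
		withAPIVersion(event{"metricset": "store"}, "2"),   // V2 metricset
		withAPIVersion(event{"metricset": "metrics"}, "3"), // V3 metricset
	}
	for _, e := range filterByAPIVersion(events, "3") {
		fmt.Println(e)
	}
}
```

Existing V2 events keep their shape and only gain one field, which is why current V2 users are not affected.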

@jsoriano feedback?

jsoriano · 2019-03-21T09:57:19Z

OK to add an apiVersion field, but take into account that the "v3" metrics also contain metrics for the v2 store and endpoint.

odacremolbap · 2019-03-22T12:12:17Z

@jsoriano @ruflin
I've pushed changes that:

add apiVersion to both V2 and V3 etcd metrics
change the namespace for V3 metrics so that their placement is consistent with the V2 metrics

A Namespace field needed to be added to prometheus.MetricsMapping in order for this to work.

I haven't included the updated JSON until we sort out why the agent group field is missing.
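The role of the new Namespace field can be sketched with simplified stand-in types. `metricsMapping` and `event` below are hypothetical reductions of `prometheus.MetricsMapping` and `mb.Event`, and the `"etcd"` namespace value is an assumption based on the earlier suggestion to place the V3 metrics directly under etcd.

```go
package main

import "fmt"

// metricsMapping is a simplified, hypothetical stand-in for
// prometheus.MetricsMapping, reduced to the two fields discussed here.
type metricsMapping struct {
	Namespace   string
	ExtraFields map[string]string
}

// event is a simplified, hypothetical stand-in for mb.Event.
type event struct {
	Namespace       string
	MetricSetFields map[string]interface{}
}

// buildEvent copies the collected fields, adds the mapping's extra fields,
// and carries the mapping's namespace over to the event.
func buildEvent(m metricsMapping, fields map[string]interface{}) event {
	out := make(map[string]interface{}, len(fields)+len(m.ExtraFields))
	for k, v := range fields {
		out[k] = v
	}
	for k, v := range m.ExtraFields {
		out[k] = v
	}
	return event{Namespace: m.Namespace, MetricSetFields: out}
}

func main() {
	mapping := metricsMapping{
		Namespace:   "etcd", // assumed value, see the lead-in
		ExtraFields: map[string]string{"apiVersion": "3"},
	}
	ev := buildEvent(mapping, map[string]interface{}{"server.has_leader": 1})
	fmt.Printf("namespace=%s fields=%v\n", ev.Namespace, ev.MetricSetFields)
}
```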

jsoriano

This is looking good

jsoriano · 2019-03-25T09:23:35Z

metricbeat/module/etcd/metrics/metrics.go

+			"etcd_network_client_grpc_sent_bytes_total":     prometheus.Metric("network.client_grpc_sent.bytes"),
+			"etcd_network_client_grpc_received_bytes_total": prometheus.Metric("network.client_grpc_received.bytes"),
+		},
+		ExtraFields: map[string]string{"apiVersion": "3"},


Please don't use camel case for field names.

Suggested change

ExtraFields: map[string]string{"apiVersion": "3"},

ExtraFields: map[string]string{"api_version": "3"},

jsoriano · 2019-03-25T09:25:35Z

metricbeat/helper/prometheus/prometheus.go

+		r.Event(mb.Event{
+			MetricSetFields: event,
+			Namespace:       mapping.Namespace,
+		})


Nice. I think this should be moved to its own PR, with a note in the developers changelog.

ruflin · 2019-03-25

+1 on having a separate PR. I think @sayden will also be happy to see this.

odacremolbap · 2019-03-25

Thanks,
created #11423
pushed #11424

jsoriano

LGTM. Just one more thing I've just thought about: could you add this metricset to the list of ones tested here?

ruflin · 2019-03-27T07:08:02Z

metricbeat/module/etcd/metrics/metrics_integration_test.go

+}
+
+func TestData(t *testing.T) {
+	compose.EnsureUp(t, "etcd")


As we already have the new test setup for this, we don't need this method I think.

odacremolbap added 8 commits March 6, 2019 22:31

etcd metrics - has_leader metric

d72e96e

add all standard etcd metrics

372c32d

add recommended metrics for etcd

3088b3a

add boilerplate comments

dd2e3ee

Merge branch 'master' into task/use-etcd-metrics-endpoint

62c1289

Merge branch 'master' into task/use-etcd-metrics-endpoint

850545c

unit test WIP

9d7fa51

Merge branch 'master' into task/use-etcd-metrics-endpoint

95348e8

odacremolbap added the in progress, Metricbeat, and Team:Integrations labels Mar 16, 2019

odacremolbap requested a review from jsoriano March 16, 2019 19:17

odacremolbap requested review from a team as code owners March 16, 2019 19:17

ruflin assigned odacremolbap Mar 18, 2019

alvarolobato added the [zube]: In Progress label Mar 18, 2019

odacremolbap added 2 commits March 19, 2019 10:58

add test that allows for automated test data generation

d091993

add docs

a8bad5a

jsoriano reviewed Mar 19, 2019

odacremolbap added 6 commits March 19, 2019 19:21

set etcd metrics metricset to beta

7d8e607

merge master + resolve conflict

5f8e873

avoid dot naming at metricsets buckets

a44bd2a

revert to etcd GA metrics as default

5c1500b

remove redundant check from etcd event test

ebc8158

remove non useful tests

244de8a

add refactored units

985cf0b

odacremolbap requested a review from jsoriano March 20, 2019 14:47

odacremolbap added 2 commits March 21, 2019 22:45

add apiVersion for V2 etcd metrics

cdf9730

add namespace for etcd v3 metrics

980106c

odacremolbap added 2 commits March 22, 2019 15:54

add test data for etcd metrics

e85ed22

add updated docs

84fa7e7

jsoriano reviewed Mar 25, 2019

update data.json for all etcd modules (removing agent)

ecae2ee

odacremolbap mentioned this pull request Mar 25, 2019

Add Namespaces for Prometheus helper mappings #11423

Closed

odacremolbap added 2 commits March 25, 2019 16:51

merge master, solve conflicts

4340666

properly structure fields

77bacce

jsoriano reviewed Mar 25, 2019

odacremolbap requested review from jsoriano and ruflin March 25, 2019 22:28

add Etcd V3 metrics MetricSet to python tests

f2f3642

jsoriano approved these changes Mar 25, 2019

odacremolbap merged commit bf8ebaf into elastic:master Mar 26, 2019

odacremolbap deleted the task/use-etcd-metrics-endpoint branch March 26, 2019 12:07

zube bot added [zube]: Done and removed [zube]: In Progress labels Mar 26, 2019

ruflin reviewed Mar 27, 2019

exekias removed the [zube]: Done label Apr 8, 2019

odacremolbap mentioned this pull request Apr 30, 2019

Add Metricbeat ETCD overview dashboard #10591

Closed

sayden mentioned this pull request Nov 18, 2021

ETCD Metricbeat module needs polishing and grooming elastic/integrations#487

Closed
