
Add usage report into Loki. #5361

Merged 18 commits into grafana:main on Feb 10, 2022

Conversation

@cyriltovena (Contributor) commented Feb 10, 2022

What this PR does / why we need it:

This PR adds usage reporting to Loki, sending reports to grafana.com.

It adds a new module that is designed to never fail: when running, the module tries to reach a consensus on the cluster's unique ID and then sends a report from every running component every hour. The cluster ID is used server side to aggregate the stats from all components.

How does the consensus work?

Ingesters are the leaders in the consensus, meaning they are the only components that can actually store the unique ID in the object store. They use both the Loki kv store and the object store to persist the ID across restarts.

Each ingester proceeds as follows (both the leader and follower flows are sketched in code below):

  • Check if the cluster ID exists in the kv store.
    • If it does, verify that it also exists in the object store and reconcile if needed.
  • Check if the cluster ID exists in the object store and, if so, reconcile the kv store.
  • If neither store holds an ID, try to CAS a new cluster ID into the kv store; the winner then stores that cluster ID in the object store.
  • Finally, use that cluster ID to send reports.

Other components (followers) simply retry indefinitely to fetch the cluster ID from the object store; once they have it, they start sending reports with that ID.

If there are repeated failures to unmarshal the cluster ID, any component can decide to nuke it.

What happens if we change to a new object store?

Since we also store the cluster ID in the kv store, an ingester will notice that it is missing from the new object store and reconcile it there.
This means you only end up with a new cluster ID if you nuke the object store AND the kv store at the same time, which we consider a rare case.

What stats are we sending?

Full disclosure: we are not sending any confidential data, only information about:

  • What object store is being used?
  • What's the scale of the data being ingested?
  • How fast are we ingesting and flushing?
  • How fast are queries in that cluster?
  • Version, CPU count, and memory size.

See the JSON below.

This is a report from a single binary; if you are running multiple components, some stats may be missing from one component or another.

JSON report:

```json
{
	"clusterID": "f06b33a4-be8a-45d5-a8f9-9667f003b700",
	"createdAt": "2022-02-09T08:32:10.26395+01:00",
	"interval": "2022-02-10T08:36:10.26395+01:00",
	"target": "all",
	"version": {
		"version": "",
		"revision": "",
		"branch": "",
		"buildUser": "",
		"buildDate": "",
		"goVersion": "go1.17.2"
	},
	"os": "darwin",
	"arch": "amd64",
	"edition": "oss",
	"metrics": {
		"ingester_flushed_chunks_age_seconds": {
			"stddev": 0,
			"stdvar": 0,
			"avg": 32857.973619,
			"count": 1,
			"min": 32857.973619,
			"max": 32857.973619
		},
		"num_cpu": 16,
		"distributor_replication_factor": 1,
		"ingester_streams_count": 1,
		"query_metric_bytes_per_second": {
			"avg": 86512.48688046652,
			"count": 1715,
			"min": 0,
			"max": 7001745,
			"stddev": 578305.5424162439,
			"stdvar": 334437300389.34607
		},
		"query_metric_lines_per_second": {
			"min": 0,
			"max": 308201,
			"stddev": 25884.49007341756,
			"stdvar": 670006826.3608522,
			"avg": 3873.5586005830855,
			"count": 1715
		},
		"ingester_active_tenants": 1,
		"ingester_target_size_bytes": 1572864,
		"memstats": {
			"sys": 70534152,
			"heap_alloc": 33771944,
			"num_gc": 101,
			"gc_cpu_fraction": 0.00025775059945585605,
			"alloc": 33771944,
			"total_alloc": 1515006248,
			"heap_inuse": 41517056,
			"stack_inuse": 3997696,
			"pause_total_ns": 19223528
		},
		"compactor_retention_enabled": "false",
		"distributor_bytes_received": {
			"total": 30968,
			"rate": 516.1260609192866
		},
		"ingester_flushed_chunks": {
			"total": 0,
			"rate": 0
		},
		"query_log_bytes_per_second": {
			"stddev": 663299.4385104065,
			"stdvar": 439966145128.22064,
			"avg": 101709.73578717193,
			"count": 2744,
			"min": 0,
			"max": 7778734
		},
		"store_object_type": "filesystem",
		"ingester_flushed_chunks_lines": {
			"avg": 594,
			"count": 1,
			"min": 594,
			"max": 594,
			"stddev": 0,
			"stdvar": 0
		},
		"ingester_wal": "enabled",
		"ingester_chunk_created": {
			"total": 0,
			"rate": 0
		},
		"ingester_compression": "gzip",
		"ingester_flushed_chunks_lifespan_seconds": {
			"stdvar": 0,
			"avg": 9.126944444444444,
			"count": 1,
			"min": 9.126944444444444,
			"max": 9.126944444444444,
			"stddev": 0
		},
		"ingester_flushed_chunks_utilization": {
			"avg": 0.0017712910970052083,
			"count": 1,
			"min": 0.0017712910970052083,
			"max": 0.0017712910970052083,
			"stddev": 0,
			"stdvar": 0
		},
		"num_goroutine": 258,
		"distributor_lines_received": {
			"total": 3871,
			"rate": 64.51575570097039
		},
		"compactor_default_retention": "31d",
		"store_schema": "v11",
		"query_log_lines_per_second": {
			"count": 2744,
			"min": 0,
			"max": 315413,
			"stddev": 27780.167284388925,
			"stdvar": 771737694.3486327,
			"avg": 4281.011297376088
		},
		"store_index_type": "boltdb-shipper",
		"ingester_flushed_chunks_bytes": {
			"min": 3008,
			"max": 3008,
			"stddev": 0,
			"stdvar": 0,
			"avg": 3008,
			"count": 1
		}
	}
}
```

Special notes for your reviewer:

Found a bug in dskit and had to re-vendor a fix; see grafana/dskit#132.

Fixes #5062

Checklist

  • Documentation added
  • Tests updated
  • Add an entry in the CHANGELOG.md about the changes.

@cyriltovena requested a review from a team as a code owner February 10, 2022 08:32
@cyriltovena requested a review from DanCech February 10, 2022 08:32
@jeschkies (Contributor) left a comment


Could you document the option to disable reports? I think we should be transparent on this.

```go
// sendReport sends the report to the stats server
func sendReport(ctx context.Context, seed *ClusterSeed, interval time.Time) error {
	report := buildReport(seed, interval)
	out, err := jsoniter.MarshalIndent(report, "", " ")
```
Inline comment on the diff:

I thought it's gonna be Prometheus metrics. What's the reason for a custom API and store?

@cyriltovena (Author) replied:

It's very hard to read a Prometheus metric, and I needed more stat types: counters, min/max, strings!

@dannykopping (Contributor) left a comment

This is pretty goddamn awesome @cyriltovena!
I'd love to add some usage stats around recording/alerting rules, but we can do this later

Inline review threads (resolved): pkg/storage/store.go, pkg/usagestats/stats.go, pkg/usagestats/reporter.go
cyriltovena and others added 4 commits February 10, 2022 10:36
@cyriltovena (Author) commented:

The new dskit brought some linter issues with it.

@kavirajk (Contributor) left a comment

Looks super cool 🎉

@cyriltovena (Author) commented:

I'll follow up with documentation on what we collect.

@dannykopping merged commit bbaef79 into grafana:main on Feb 10, 2022
dannykopping added a commit that referenced this pull request Feb 10, 2022
* Adds leader election process

Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>

* fluke

Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>

* fixes the kv typecheck

* wire up the http client

* Hooking into loki services, hit a bug

* Add stats variable.

* re-vendor dskit and improve to never fail service

* Instrument Loki with the package

* Add changelog entry

Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>

* Fixes compactor test

Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>

* Add configuration documentation

Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>

* Update pkg/usagestats/reporter.go

Co-authored-by: Danny Kopping <dannykopping@gmail.com>

* Add boundary check

Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>

* Add log for success report.

Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>

* lint

Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>

* Update pkg/usagestats/reporter.go

Co-authored-by: Danny Kopping <dannykopping@gmail.com>

Co-authored-by: Danny Kopping <dannykopping@gmail.com>
Successfully merging this pull request may close these issues:

  • Add usage reporting capability for Loki to (optionally) send usage stats to Grafana Labs (#5062)

4 participants