Add usage report into Loki. #5361
Conversation
Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>
Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>
Could you document the option to disable reports? I think we should be transparent on this.
// sendReport sends the report to the stats server
func sendReport(ctx context.Context, seed *ClusterSeed, interval time.Time) error {
	report := buildReport(seed, interval)
	out, err := jsoniter.MarshalIndent(report, "", " ")
I thought it was going to be Prometheus metrics. What's the reason for a custom API and store?
It's very hard to read a Prometheus metric, and I needed more stat types like counters, min/max, and strings!
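To illustrate (the field names and values here are hypothetical, not the actual `pkg/usagestats` types), a report shaped like this can carry counters, min/max aggregates, and plain strings in one document, which is awkward to express as Prometheus metrics:

```go
package example

import "time"

// Report is an illustrative shape only, not the actual pkg/usagestats type.
type Report struct {
	ClusterID string                 `json:"clusterID"`
	Interval  time.Time              `json:"interval"`
	Metrics   map[string]interface{} `json:"metrics"`
}

func exampleReport() Report {
	return Report{
		ClusterID: "hypothetical-cluster-id",
		Interval:  time.Now(),
		Metrics: map[string]interface{}{
			"lines_ingested_total": int64(123456),                                  // counter
			"chunk_size_bytes":     map[string]float64{"min": 512, "max": 1048576}, // min/max
			"storage_backend":      "s3",                                           // string
		},
	}
}
```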
Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>
Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>
This is pretty goddamn awesome @cyriltovena!
I'd love to add some usage stats around recording/alerting rules, but we can do this later
Co-authored-by: Danny Kopping <dannykopping@gmail.com>
Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>
Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>
The new DSKit brought some linter issues with it.
Looks super cool 🎉
Co-authored-by: Danny Kopping <dannykopping@gmail.com>
I'll follow up with documentation on what we collect.
* Adds leader election process
* fluke
* fixes the kv typecheck
* wire up the http client
* Hooking into loki services, hit a bug
* Add stats variable.
* re-vendor dskit and improve to never fail service
* Intrument Loki with the package
* Add changelog entry
* Fixes compactor test
* Add configuration documentation
* Update pkg/usagestats/reporter.go
* Add boundary check
* Add log for success report.
* lint
* Update pkg/usagestats/reporter.go

Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>
Co-authored-by: Danny Kopping <dannykopping@gmail.com>
What this PR does / why we need it:
This PR adds usage reporting (sent to grafana.com) to Loki.
It adds a new module that can never fail: when running, the module tries to reach a consensus on the cluster's unique ID and then sends a report from every running component every hour. The cluster ID is used to aggregate data across all components on the server side.
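As a rough sketch of that behaviour: the names `ClusterSeed` and `sendReport` appear in the PR's diff, but everything else below is an assumption for illustration, not the actual Loki implementation.

```go
package example

import (
	"context"
	"log"
	"time"
)

// ClusterSeed stands in for the shared cluster identity from the PR;
// the fields shown here are assumptions for the sketch.
type ClusterSeed struct {
	UID       string
	CreatedAt time.Time
}

// runReporter sketches the "never fail" reporting loop: every component waits
// on its ticker and sends a report each interval, only logging on error.
func runReporter(ctx context.Context, seed *ClusterSeed, every time.Duration,
	sendReport func(context.Context, *ClusterSeed, time.Time) error) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case now := <-ticker.C:
			// Errors are swallowed after logging so usage reporting can never
			// take the component down.
			if err := sendReport(ctx, seed, now); err != nil {
				log.Println("failed to send usage report:", err)
			}
		}
	}
}
```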
How does the consensus work?
Ingesters are the leaders in the consensus, meaning they are the only components that can actually store the unique ID in the object store. They do this using the Loki kv store, with the object store persisting the data across restarts.
Each ingester will do as follows:
Other components (followers) simply retry indefinitely to fetch the cluster ID from the object store, and once they have it they start sending reports with that ID.
If there are many failures trying to unmarshal the cluster ID, any component can decide to nuke it.
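Roughly, the leader/follower flow described above might look like the sketch below; the interface, function names, and retry interval are assumptions for illustration, not Loki's actual usagestats code.

```go
package example

import (
	"context"
	"errors"
	"time"
)

// seedStore abstracts the object store holding the cluster seed; this is an
// illustrative interface, not Loki's actual client.
type seedStore interface {
	GetSeed(ctx context.Context) (string, error) // returns the cluster ID
	PutSeed(ctx context.Context, id string) error
}

var errSeedNotFound = errors.New("cluster seed not found")

// waitForSeed sketches the flow described above: ingesters (leaders) create
// the seed if it is missing, every other component (follower) retries until
// the seed appears in the object store.
func waitForSeed(ctx context.Context, store seedStore, isLeader bool, newID func() string) (string, error) {
	for {
		id, err := store.GetSeed(ctx)
		if err == nil {
			return id, nil
		}
		if isLeader && errors.Is(err, errSeedNotFound) {
			// In the PR the Loki kv store (CAS) arbitrates which ingester wins;
			// here we simply attempt the write and loop to re-read the result.
			_ = store.PutSeed(ctx, newID())
		}
		select {
		case <-ctx.Done():
			return "", ctx.Err()
		case <-time.After(time.Minute): // retry interval is an assumption
		}
	}
}
```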
What happens if we change to a new object store?
Since we also store the cluster ID in the kv store, an ingester will notice that it is missing from the new object store and will try to reconcile.
This means that if you nuke the object store AND the kv store at the same time, you'll end up with a new cluster ID, but we consider this case to be rare.
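A minimal sketch of that reconciliation, reusing the illustrative `seedStore` interface and `errSeedNotFound` from the previous sketch (again an assumption, not Loki's code):

```go
package example

import (
	"context"
	"errors"
)

// reconcileSeed: if the kv store still knows the cluster ID but the newly
// configured object store does not, an ingester writes it back so the same
// ID survives the storage change.
func reconcileSeed(ctx context.Context, kv, obj seedStore) error {
	id, err := kv.GetSeed(ctx)
	if err != nil {
		return err // nothing to reconcile without the kv copy
	}
	_, err = obj.GetSeed(ctx)
	if errors.Is(err, errSeedNotFound) {
		// Object store was wiped or replaced: persist the known ID again.
		return obj.PutSeed(ctx, id)
	}
	return err // nil if the seed is already there
}
```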
What stats are we sending?
Full disclosure: we're not sending any confidential data, only information about:
See the collapsed `json report` example below.
Special notes for your reviewer:
Found a bug in DSKit and had to re-vendor a fix; see grafana/dskit#132
Fixes #5062
Checklist
Add an entry in the CHANGELOG.md about the changes.