
Monitoring with Prometheus #2068

Closed
platan opened this issue Sep 10, 2018 · 18 comments
Labels
operations Hosting, monitoring, and reliability for the production badge servers

Comments

@platan
Member

platan commented Sep 10, 2018

I would like to propose a monitoring solution for Shields based on Prometheus and Grafana.

Prometheus is an open-source monitoring and alerting system (overview). This page compares Prometheus with alternatives.

How does Prometheus work? An application exposes its metrics as plain text over HTTP. Prometheus then periodically pulls (scrapes) these metrics and lets you display graphs based on them. So it works differently from Graphite, where the application pushes data to Graphite.

Grafana can be configured to use Prometheus as a data source for graphs.

https://github.com/siimon/prom-client is a good Node.js client we can use. It collects the recommended default metrics and can also be used to define custom metrics.
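A minimal sketch (assuming an Express-style app for illustration, not Shields' actual server wiring) of what exposing prom-client's default metrics for Prometheus to scrape looks like:

```
// Sketch only: expose prom-client default metrics over HTTP for Prometheus to scrape.
const express = require('express')
const client = require('prom-client')

// Collect the recommended default Node.js metrics (event loop lag, heap usage, GC, ...).
client.collectDefaultMetrics()

const app = express()

// Prometheus periodically pulls this endpoint (the pull model described above).
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType)
  res.end(await client.register.metrics())
})

app.listen(8080)
```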

What do we have now?

What do we have to do?

  • discuss this idea!
  • decide whether we want the /metrics resource to be available to everyone or limited to certain IPs/users
  • add custom metrics (I've already started with some basic custom metrics)
  • decide where Prometheus/Grafana should be hosted (it's ok for me to host it at https://metrics.shields.platan.space for some time)
  • decide who should have access to Grafana (available roles: Viewer, Editor, Admin)
  • decide whether we want to use Prometheus/Grafana as an alerting system
  • document how to set up a Prometheus/Grafana instance from scratch

* I'm not a monitoring systems expert. I just have some experience with Grafana, Graphite and Prometheus :-).

(Screenshot: prom-client default metrics dashboard at metrics.shields.platan.space, captured 2018-09-10)

@chris48s
Member

Looks like a good idea. Having access to this kind of data would be great for us. It would make currently difficult tasks easier, like diagnosing out-of-memory errors or tuning the LRU cache size, and give us more of an understanding of where performance bottlenecks might be.

This needs feedback from @paulmelnikow and/or @espadrine to take forward, but to throw in my 2c, I would be in favour of making the metrics completely public as long as:

  • No sensitive data is exposed which might compromise our security or that of our users
  • There is no (or very small) performance penalty on the shields servers if lots of people view the metrics

What do you think about that - are there other things to consider which I have not mentioned?

@platan
Member Author

platan commented Sep 11, 2018

Good points @chris48s!

  • No sensitive data is exposed which might compromise our security or that of our users

I do not see such a threat now.

  • There is no (or very small) performance penalty on the shields servers if lots of people view the metrics

I will check the impact of reading metrics via the /metrics endpoint. If there is a noticeable performance penalty we can cache responses or limit access to them (e.g. restrict access to the Prometheus server's IP).
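A hypothetical sketch of such an IP allow-list (ALLOWED_METRICS_IPS and metricsHandler are illustrative names, not existing Shields settings):

```
// Hypothetical Express-style middleware limiting /metrics to allow-listed
// addresses, e.g. the Prometheus server's IP.
const allowedIps = (process.env.ALLOWED_METRICS_IPS || '127.0.0.1').split(',')

function metricsIpFilter(req, res, next) {
  if (!allowedIps.includes(req.ip)) {
    return res.status(403).end()
  }
  next()
}

// Usage: register it in front of the /metrics handler, e.g.
// app.get('/metrics', metricsIpFilter, metricsHandler)
```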

@paulmelnikow
Member

This sounds great! Thanks so much for your work. System monitoring eases one of our current pain points with the servers.

I like how this solution assembles off-the-shelf tools to solve the problem. It's a smart approach which gives us a lot of bang for the buck.

These options seem solid, and though I haven't done any kind of comparison shopping, I think this is a great place to start. If it turns out down the line something else suits our needs better, we can change tack. (I have some familiarity with these tools, but haven't used Prometheus before and am definitely not an expert.)

decide where Prometheus/Grafana should be hosted (it's ok for me to host it at https://metrics.shields.platan.space for some time)

Thanks for offering to host! I think it's awesome for you to host it for the time being.

I'd want to make a plan for having it not depend on any one person, though. One idea is Grafana Labs, which offers cloud Grafana + Prometheus. I wonder if they would give us a donation.

  • decide whether we want the /metrics resource to be available to everyone or limited to certain IPs/users

I'm inclined to put this behind a secret key and/or limit IPs. Not because we don't want to share the data, but because I don't know much about prom-client so would rather be on the conservative side with the raw endpoint.

  • decide who should have access to Grafana (available roles: Viewer, Editor, Admin)

Perhaps we could start with the maintainers having access, and then publish a public dashboard once we have a good grip on how the tool works and what exactly we're sharing?

One thing to keep in mind is that we have a "no tracking" promise. So we do need to make sure whatever we're monitoring doesn't amount to tracking. I don't think that will be hard, but let's not forget to think about it.

@platan
Member Author

platan commented Oct 31, 2018

I'd want to make a plan for having it not depend on any one person, though. One idea is Grafana Labs, which offers cloud Grafana + Prometheus. I wonder if they would give us a donation.

I wrote to them in April 2018, but I didn't get a response. I will forward you the email I wrote to them.

  • decide whether we want the /metrics resource to be available to everyone or limited to certain IPs/users

I'm inclined to put this behind a secret key and/or limit IPs. Not because we don't want to share the data, but because I don't know much about prom-client so would rather be on the conservative side with the raw endpoint.

So I will add an IP limit for this resource in my PR.

  • decide who should have access to Grafana (available roles: Viewer, Editor, Admin)

Perhaps we could start with the maintainers having access, and then publish a public dashboard once we have a good grip on how the tool works and what exactly we're sharing?

OK, we can start with this approach. Since the dashboard won't be public at the beginning, we can discuss details on Discord.

One thing to keep in mind is that we have a "no tracking" promise. So we do need to make sure whatever we're monitoring doesn't amount to tracking. I don't think that will be hard, but let's not forget to think about it.

👍 I have had this promise in mind from the beginning - I would like to aggregate only performance data.

@platan
Member Author

platan commented Oct 31, 2018

@chris48s A few weeks ago I wrote that I would check the impact of reading the /metrics resource. My first thought was that this would be easy, since we have metrics ;-) and we can compare the behavior of two instances (https://shields-metrics1.now.sh/metrics, https://shields-metrics2.now.sh/metrics) - one receiving requests to /metrics and the other not. But these instances do not handle any requests for badges, so it would be difficult to assess the impact of requests to the metrics endpoint. I started looking for a way to generate some load on these instances. One option is to collect thousands of img.shields.io URLs from README files on GitHub. Unfortunately I haven't managed to gather these URLs and prepare this test yet. As Paul said, we can start with IP-limited access to the /metrics resource.

@chris48s
Member

If we're going to limit access to a small group, let's ignore that issue for now.

@paulmelnikow paulmelnikow added the operations Hosting, monitoring, and reliability for the production badge servers label Nov 6, 2018
@paulmelnikow
Member

After discussion today, we decided to leave the metrics public.

@platan
Member Author

platan commented Jan 9, 2019

Metrics are available here: https://metrics.shields.io

What's next?

  • discuss this idea!
  • decide whether we want the /metrics resource to be available to everyone or limited to certain IPs/users
  • add custom metrics (I've already started with some basic custom metrics)

I would like to start with some really basic usage metrics - a counter of badge invocations over time - for badges like the GitHub license badge. Prometheus has labels, so we can use several dimensions to describe a metric:
shields_badge_request_total{category="downloads", service="NPM"}
Does this keep the "no tracking" promise? (A sketch of such a counter follows at the end of this comment.)

  • decide where Prometheus/Grafana should be hosted (it's ok for me to host it at https://metrics.shields.platan.space for some time)
  • decide who should have access to Grafana (available roles: Viewer, Editor, Admin)
  • decide whether we want to use Prometheus/Grafana as an alerting system
  • document how to set up a Prometheus/Grafana instance from scratch

I will document this.
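A hypothetical sketch of such a counter with prom-client (the metric and label names just mirror the example above and are not final):

```
// Hypothetical labelled counter producing the metric sketched above.
// Only aggregate data (badge category and service) is recorded - no user data.
const client = require('prom-client')

const badgeRequestCounter = new client.Counter({
  name: 'shields_badge_request_total',
  help: 'Total number of badge requests',
  labelNames: ['category', 'service'],
})

// Incremented once per rendered badge, e.g. from the request handler:
badgeRequestCounter.inc({ category: 'downloads', service: 'NPM' })
```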

@paulmelnikow
Member

Request counts are on my mind too.

How would we do this reliably using Prometheus? The way it's built seems better at tracking current state than activity over time.

Is there a pattern we can follow?

@platan
Member Author

platan commented Jan 9, 2019

How would we do this reliably using Prometheus? The way it's built seems better at tracking current state than activity over time.
Is there a pattern we can follow?

  1. First you have to find something to measure (e.g. the amount of time it takes to run a method, the number of method invocations, the number of open files).
  2. Choose an appropriate metric type https://prometheus.io/docs/concepts/metric_types/ (all of them are implemented in prom-client https://github.com/siimon/prom-client#counter) and measure your values in your app.
    "Counters go up, and reset when the process restarts."
    "Gauges are similar to Counters but Gauges value can be decreased."
    "Summaries calculate percentiles of observed values. The default percentiles are: 0.01, 0.05, 0.5, 0.9, 0.95, 0.99, 0.999."
  3. Then Prometheus pulls metrics from your app via the /metrics resource and stores them as time-series data.

The application exposes metrics reflecting its current state, but Prometheus stores time-series data - activity over time.

I wrote that we could start with a metric showing the number of invocations. It would be even better to start with a metric showing the amount of time it takes to process a badge request.
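A sketch of what that could look like with a prom-client Histogram (the metric name, buckets and the handleBadgeRequest wrapper are assumptions for illustration, not existing Shields code):

```
// Hypothetical sketch: time badge request processing with a Histogram.
const client = require('prom-client')

const badgeRequestDuration = new client.Histogram({
  name: 'shields_badge_request_duration_seconds',
  help: 'Time spent processing a badge request',
  labelNames: ['service'],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
})

async function handleBadgeRequest(service, render) {
  // startTimer() returns a function that records the elapsed time in seconds.
  const end = badgeRequestDuration.startTimer({ service })
  try {
    return await render()
  } finally {
    end()
  }
}
```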

@paulmelnikow
Member

Time it takes to process a badge request sounds good! We could even do that by badge type, which seems even more helpful.

There are two challenges with using Prometheus to do overall request counts:

  1. Can we see how many requests something has gotten within a week, a day, or other time interval? We'd have to compare the counter at the beginning and at the end of the interval. It doesn't sound like that would be convenient.
  2. Counter totals are reset with each new process, so we'd have to persist the totals on the servers and somehow re-initialize them. It's not very PaaS-friendly.

The issue prometheus/prometheus#2473 gets at what I meant. There may be a way we can hack it, though it seems like there might be a better tool to consider for that kind of analytics.

@paulmelnikow
Member

This post talks about using Prometheus' rate() function to visualize a counter. Instead of graphing the counter's value, we'd graph the rate of change of the counter:

Sometimes we restart or re-deploy our Sanic application, we may ask what happens when the process restarts and the counter is reset to 0? This is a common case, luckily the rate() function in Prometheus will automatically handle this for us. So it is okay if the Sanic application process is restarted and the value is resetted to zero, nothing bad will happen.

Does that make sense to you @platan? I'm game to give that a try!

@platan
Member Author

platan commented Feb 25, 2019

I agree with you, the rate() function should do the job.

1. Can we see how many requests something has gotten within a week, a day, or other time interval? We'd have to compare the counter at the beginning and at the end of the interval. It doesn't sound like that would be convenient.

The increase() function can be used in this case.

2. Counter totals are reset with each new process, so we'd have to persist the totals on the servers and somehow re-initialize them. It's not very PaaS-friendly.

The issue prometheus/prometheus#2473 gets at what I meant. There may be a way we can hack it though it seems like for that kind of analytics there might be a better tool we could consider.

Prometheus is not a long-term storage solution (https://dev.to/mhausenblas/revisiting-promcon-2018-panel-on-prometheus-long-term-storage-5f1p). Do we need data from a longer period of time?

@paulmelnikow
Member

Ah, nice, it seems like some combination of rate() and increase() accomplishes this.

Do we need data from longer period of time?

Longer than what? 😀

I think long-term data would be nice, and would make for interesting data to dig through, but probably isn't essential. A month seems like it could be adequate as a starting point.
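For example (sketch only - the metric name follows the counter discussed above and the Prometheus host is a placeholder): rate() gives a per-second request rate over a window and increase() an approximate total for an interval, and both tolerate counter resets. The queries can be typed into Grafana or run against Prometheus' HTTP API:

```
// Sketch: run rate()/increase() queries against the Prometheus HTTP API.
// Assumes Node 18+ (built-in fetch); prometheus.example.com is a placeholder.
const queries = [
  // Per-second request rate, averaged over the last 5 minutes.
  'rate(shields_badge_request_total[5m])',
  // Approximate number of requests over the last day.
  'increase(shields_badge_request_total[1d])',
]

async function run() {
  for (const query of queries) {
    const url =
      'http://prometheus.example.com:9090/api/v1/query?query=' +
      encodeURIComponent(query)
    const response = await fetch(url)
    const body = await response.json()
    console.log(query, JSON.stringify(body.data.result))
  }
}

run()
```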

@paulmelnikow
Member

Perhaps rather than update our legacy analytics for #1848, we could migrate them to Prometheus instead.

The other thing we're tracking is which template is being used. I'd also like to track which logos are being used (and what proportion of requests has a logo).

One possibility would be to use additional labels on the service-request badge. However I worry that will take up more memory on the server and bandwidth in the metric requests. If we have each logo and each template that increases the number of possible in-memory incrementers from one per service to ~10 per service. I'm not too familiar with how prom-client is implemented, but it's safe to assume it keeps all these labeled numbers in memory!

Since we don't really care how these metrics correlate with which service, would it be better to create separate metrics for these other two dimensions?
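A hypothetical sketch of the separate-metrics option - independent counters per dimension, so the number of label combinations grows additively (services + templates + logos) rather than multiplicatively (the metric names are illustrative):

```
// Hypothetical counters tracking template and logo usage independently of
// the per-service counter; metric names are illustrative.
const client = require('prom-client')

const templateRequests = new client.Counter({
  name: 'template_requests_total',
  help: 'Badge requests by template',
  labelNames: ['template'],
})

const logoRequests = new client.Counter({
  name: 'logo_requests_total',
  help: 'Badge requests by logo',
  labelNames: ['logo'],
})

// In the badge request handler (values are illustrative):
templateRequests.inc({ template: 'flat' })
logoRequests.inc({ logo: 'github' })
```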

paulmelnikow added a commit that referenced this issue Feb 27, 2019
This picks up #2068 by adding per-badge stats as discussed in #966.

It ensures every service has a unique `name` property. By default this comes from the class name, and is overridden in all the various places where the class names are duplicated. (Some of those don't seem that useful, like the various download interval services, though those need to be refactored down into a single service anyway.) Tests enforce the names are unique. These are the names used by the service-test runner, so it's a good idea to make them unique anyway. (It was sort of strange before that you had to specify `nuget` instead of e.g. `resharper`.)

I've added validation to `deprecatedService` and `redirector`, and required that every `route` has a `base`, even if it's an empty string.

The name is used to generate unique metric labels, generating metrics like these:

```
service_requests_total{category="activity",family="eclipse-marketplace",service="eclipse_marketplace_update"} 2
service_requests_total{category="activity",family="npm",service="npm_collaborators"} 3
service_requests_total{category="activity",family="steam",service="steam_file_release_date"} 2
service_requests_total{category="analysis",family="ansible",service="ansible_galaxy_content_quality_score"} 2
service_requests_total{category="analysis",family="cii-best-practices",service="cii_best_practices_service"} 4
service_requests_total{category="analysis",family="cocoapods",service="cocoapods_docs"} 2
service_requests_total{category="analysis",family="codacy",service="codacy_grade"} 3
service_requests_total{category="analysis",family="coverity",service="coverity_scan"} 2
service_requests_total{category="analysis",family="coverity",service="deprecated_coverity_ondemand"} 2
service_requests_total{category="analysis",family="dependabot",service="dependabot_semver_compatibility"} 3
service_requests_total{category="analysis",family="lgtm",service="lgtm_alerts"} 2
service_requests_total{category="analysis",family="lgtm",service="lgtm_grade"} 3
service_requests_total{category="analysis",family="snyk",service="snyk_vulnerability_git_hub"} 4
service_requests_total{category="analysis",family="snyk",service="snyk_vulnerability_npm"} 5
service_requests_total{category="analysis",family="symfony",service="sensiolabs_i_redirector"} 1
service_requests_total{category="analysis",family="symfony",service="symfony_insight_grade"} 1
service_requests_total{category="build",family="appveyor",service="app_veyor_ci"} 3
service_requests_total{category="build",family="appveyor",service="app_veyor_tests"} 6
service_requests_total{category="build",family="azure-devops",service="azure_dev_ops_build"} 6
service_requests_total{category="build",family="azure-devops",service="azure_dev_ops_release"} 5
service_requests_total{category="build",family="azure-devops",service="azure_dev_ops_tests"} 6
service_requests_total{category="build",family="azure-devops",service="vso_build_redirector"} 2
service_requests_total{category="build",family="azure-devops",service="vso_release_redirector"} 1
service_requests_total{category="build",family="bitbucket",service="bitbucket_pipelines"} 5
service_requests_total{category="build",family="circleci",service="circle_ci"} 5
```

This is predicated on being able to use Prometheus's [`rate()`](https://prometheus.io/docs/prometheus/latest/querying/functions/#rate) function to visualize a counter's rate of change, as mentioned at #2068 (comment). Otherwise the stats will be disrupted every time a server restarts.

The metrics only appear on new-style services.
paulmelnikow added a commit that referenced this issue Mar 8, 2019
We're getting good results from #3093, so there's no reason to keep maintaining this code.

Ref #1848 #2068
@paulmelnikow
Member

Seems like this can be closed now! We've got the analytics working well via #3093. Let's open a new issue for any follow-on work.

@platan
Member Author

platan commented May 4, 2019

https://github.com/platan/metrics-shields-io-config is an Ansible playbook which can be used to configure the monitoring for Shields.io (https://metrics.shields.io) and to create a new monitoring instance from scratch.

@paulmelnikow
Member
