
Monitoring with Prometheus #2068

Closed
platan opened this issue Sep 10, 2018 · 18 comments
Labels
operations Hosting, monitoring, and reliability for the production badge servers

Comments

@platan
Member

platan commented Sep 10, 2018

I would like to propose a monitoring solution for Shields based on Prometheus and Grafana.

Prometheus is an open-source monitoring and alerting system (overview). This page compares Prometheus with alternatives.

How does Prometheus work? An application exposes its metrics as plain text over HTTP. Prometheus then periodically pulls (scrapes) these metrics and lets you display graphs based on them. So it works differently from Graphite, where the application pushes data to Graphite.

Grafana can be configured to use Prometheus as a data source for graphs.

https://github.com/siimon/prom-client is a good Node.js client we can use. It collects the recommended default metrics and can also be used to define custom metrics.
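A minimal sketch (assuming an Express-style app for illustration, not Shields' actual server wiring) of what exposing prom-client's default metrics for Prometheus to scrape looks like:

```
// Sketch only: expose prom-client default metrics over HTTP for Prometheus to scrape.
const express = require('express')
const client = require('prom-client')

// Collect the recommended default Node.js metrics (event loop lag, heap usage, GC, ...).
client.collectDefaultMetrics()

const app = express()

// Prometheus periodically pulls this endpoint (the pull model described above).
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType)
  res.end(await client.register.metrics())
})

app.listen(8080)
```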

What do we have now?

What do we have to do?

  • discuss this idea!
  • decide whether we want the /metrics resource to be available to everyone or limited to certain IPs/users
  • add custom metrics (I've already started with some basic custom metrics)
  • decide where Prometheus/Grafana should be hosted (it's ok for me to host it at https://metrics.shields.platan.space for some time)
  • decide who should have access to Grafana (available roles: Viewer, Editor, Admin)
  • decide whether we want to use Prometheus/Grafana as an alerting system
  • document how to set up a Prometheus/Grafana instance from scratch

* I'm not a monitoring systems expert. I just have some experience with Grafana, Graphite and Prometheus :-).

(Screenshot: prom-client default metrics dashboard at metrics.shields.platan.space, captured 2018-09-10)

@chris48s
Member

Looks like a good idea. Having access to this kind of data would be great for us. It would make currently difficult tasks easier, like diagnosing out-of-memory errors or tuning the LRU cache size, and give us more of an understanding of where performance bottlenecks might be.

This needs feedback from @paulmelnikow and/or @espadrine to take forward, but to throw in my 2c, I would be in favour of making the metrics completely public as long as:

  • No sensitive data is exposed which might compromise our security or that of our users
  • There is no (or very small) performance penalty on the shields servers if lots of people view the metrics

What do you think about that - are there other things to consider which I have not mentioned?

@platan
Member Author

platan commented Sep 11, 2018

Good points @chris48s!

  • No sensitive data is exposed which might compromise our security or that of our users

I do not see such a threat now.

  • There is no (or very small) performance penalty on the shields servers if lots of people view the metrics

I will check the impact of reading metrics via the /metrics endpoint. If there is a noticeable performance penalty we can cache responses or limit access to them (e.g. restrict access to the Prometheus server's IP).
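A hypothetical sketch of such an IP allow-list (ALLOWED_METRICS_IPS and metricsHandler are illustrative names, not existing Shields settings):

```
// Hypothetical Express-style middleware limiting /metrics to allow-listed
// addresses, e.g. the Prometheus server's IP.
const allowedIps = (process.env.ALLOWED_METRICS_IPS || '127.0.0.1').split(',')

function metricsIpFilter(req, res, next) {
  if (!allowedIps.includes(req.ip)) {
    return res.status(403).end()
  }
  next()
}

// Usage: register it in front of the /metrics handler, e.g.
// app.get('/metrics', metricsIpFilter, metricsHandler)
```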

@paulmelnikow
Member

This sounds great! Thanks so much for your work. System monitoring eases one of our current pain points with the servers.

I like how this solution assembles off-the-shelf tools to solve the problem. It's a smart approach which gives us a lot of bang for the buck.

These options seem solid, and though I haven't done any kind of comparison shopping, I think this is a great place to start. If it turns out down the line something else suits our needs better, we can change tack. (I have some familiarity with these tools, but haven't used Prometheus before and am definitely not an expert.)

decide where Prometheus/Grafana should be hosted (it's ok for me to host it at https://metrics.shields.platan.space for some time)

Thanks for offering to host! I think it's awesome for you to host it for the time being.

I'd want to make a plan for having it not depend on any one person, though. One idea is Grafana Labs, which offers cloud Grafana + Prometheus. I wonder if they would give us a donation.

  • decide whether we want the /metrics resource to be available to everyone or limited to certain IPs/users

I'm inclined to put this behind a secret key and/or limit IPs. Not because we don't want to share the data, but because I don't know much about prom-client so would rather be on the conservative side with the raw endpoint.

  • decide who should have access to Grafana (available roles: Viewer, Editor, Admin)

Perhaps we could start with the maintainers having access, and then publish a public dashboard once we have a good grip on how the tool works and what exactly we're sharing?

One thing to keep in mind is that we have a "no tracking" promise. So we do need to make sure whatever we're monitoring doesn't amount to tracking. I don't think that will be hard, but let's not forget to think about it.

@platan
Member Author

platan commented Oct 31, 2018

I'd want to make a plan for having it not depend on any one person, though. One idea is Grafana Labs, which offers cloud Grafana + Prometheus. I wonder if they would give us a donation.

I wrote to them in April 2018, but I didn't get a response. I will forward you the email I wrote to them.

  • decide whether we want the /metrics resource to be available to everyone or limited to certain IPs/users

I'm inclined to put this behind a secret key and/or limit IPs. Not because we don't want to share the data, but because I don't know much about prom-client so would rather be on the conservative side with the raw endpoint.

So I will add an IP limit for this resource in my PR.

  • decide who should have access to Grafana (available roles: Viewer, Editor, Admin)

Perhaps we could start with the maintainers having access, and then publish a public dashboard once we have a good grip on how the tool works and what exactly we're sharing?

OK, we can start with this approach. Since the dashboard won't be public at the beginning, we can discuss details on Discord.

One thing to keep in mind is that we have a "no tracking" promise. So we do need to make sure whatever we're monitoring doesn't amount to tracking. I don't think that will be hard, but let's not forget to think about it.

👍 I have had this promise in mind from the beginning - I would like to aggregate only performance data.

@platan
Member Author

platan commented Oct 31, 2018

@chris48s A few weeks ago I wrote that I would check the impact of reading the /metrics resource. My first thought was that this would be easy, since we have metrics ;-) and we can compare the behavior of two instances (https://shields-metrics1.now.sh/metrics, https://shields-metrics2.now.sh/metrics) - one receiving requests to /metrics and the other not. But these instances do not handle any requests for badges, so it would be difficult to assess the impact of requests to the metrics endpoint. I started looking for a way to generate some load on these instances. One option is to collect thousands of img.shields.io URLs from README files on GitHub. Unfortunately I haven't managed to gather these URLs and prepare this test yet. As Paul said, we can start with IP-limited access to the /metrics resource.

@chris48s
Member

If we're going to limit access to a small group, let's ignore that issue for now.

@paulmelnikow paulmelnikow added the operations Hosting, monitoring, and reliability for the production badge servers label Nov 6, 2018
@paulmelnikow
Member

After discussion today, we decided to leave the metrics public.

@platan
Member Author

platan commented Jan 9, 2019

Metrics are available here: https://metrics.shields.io

What's next?

  • discuss this idea!
  • decide whether we want the /metrics resource to be available to everyone or limited to certain IPs/users
  • add custom metrics (I've already started with some basic custom metrics)

I would like to start with some really basic usage metrics - a counter of badge invocations over time - for badges like the GitHub license badge. Prometheus has labels, so we can use several dimensions to describe a metric:
shields_badge_request_total{category="downloads", service="NPM"}
Does this keep the "no tracking" promise? (A sketch of such a counter follows at the end of this comment.)

  • decide where Prometheus/Grafana should be hosted (it's ok for me to host it at https://metrics.shields.platan.space for some time)
  • decide who should have access to Grafana (available roles: Viewer, Editor, Admin)
  • decide whether we want to use Prometheus/Grafana as an alerting system
  • document how to set up a Prometheus/Grafana instance from scratch

I will document this.
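A hypothetical sketch of such a counter with prom-client (the metric and label names just mirror the example above and are not final):

```
// Hypothetical labelled counter producing the metric sketched above.
// Only aggregate data (badge category and service) is recorded - no user data.
const client = require('prom-client')

const badgeRequestCounter = new client.Counter({
  name: 'shields_badge_request_total',
  help: 'Total number of badge requests',
  labelNames: ['category', 'service'],
})

// Incremented once per rendered badge, e.g. from the request handler:
badgeRequestCounter.inc({ category: 'downloads', service: 'NPM' })
```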

@paulmelnikow
Member

Request counts are on my mind too.

How would we do this reliably using Prometheus? The way it's built seems better at tracking current state than activity over time.

Is there a pattern we can follow?

@platan
Member Author

platan commented Jan 9, 2019

How would we do this reliably using Prometheus? The way it's built seems better at tracking current state than activity over time.
Is there a pattern we can follow?

  1. First you have to find something to measure (e.g. the amount of time it takes to run a method, the number of method invocations, the number of open files).
  2. Choose an appropriate metric type https://prometheus.io/docs/concepts/metric_types/ (all of them are implemented in prom-client https://github.com/siimon/prom-client#counter) and measure your values in your app.
    "Counters go up, and reset when the process restarts."
    "Gauges are similar to Counters but Gauges value can be decreased."
    "Summaries calculate percentiles of observed values. The default percentiles are: 0.01, 0.05, 0.5, 0.9, 0.95, 0.99, 0.999."
  3. Then Prometheus pulls metrics from your app via the /metrics resource and stores them as time-series data.

The application exposes metrics reflecting its current state, but Prometheus stores time-series data - activity over time.

I wrote that we could start with a metric showing the number of invocations. It would be even better to start with a metric showing the amount of time it takes to process a badge request.
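A sketch of what that could look like with a prom-client Histogram (the metric name, buckets and the handleBadgeRequest wrapper are assumptions for illustration, not existing Shields code):

```
// Hypothetical sketch: time badge request processing with a Histogram.
const client = require('prom-client')

const badgeRequestDuration = new client.Histogram({
  name: 'shields_badge_request_duration_seconds',
  help: 'Time spent processing a badge request',
  labelNames: ['service'],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
})

async function handleBadgeRequest(service, render) {
  // startTimer() returns a function that records the elapsed time in seconds.
  const end = badgeRequestDuration.startTimer({ service })
  try {
    return await render()
  } finally {
    end()
  }
}
```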

@paulmelnikow
Member

Time it takes to process a badge request sounds good! We could even do that by badge type, which seems even more helpful.

There are two challenges with using Prometheus to do overall request counts:

  1. Can we see how many requests something has gotten within a week, a day, or other time interval? We'd have to compare the counter at the beginning and at the end of the interval. It doesn't sound like that would be convenient.
  2. Counter totals are reset with each new process, so we'd have to persist the totals on the servers and somehow re-initialize them. It's not very PaaS-friendly.

The issue prometheus/prometheus#2473 gets at what I meant. There may be a way we can hack it, though it seems like there might be a better tool to consider for that kind of analytics.

@paulmelnikow
Member

This post talks about using Prometheus' rate() function to visualize a counter. Instead of graphing the counter's value, we'd graph the rate of change of the counter:

Sometimes we restart or re-deploy our Sanic application, we may ask what happens when the process restarts and the counter is reset to 0? This is a common case, luckily the rate() function in Prometheus will automatically handle this for us. So it is okay if the Sanic application process is restarted and the value is resetted to zero, nothing bad will happen.

Does that make sense to you @platan? I'm game to give that a try!

@platan
Member Author

platan commented Feb 25, 2019

I agree with you, the rate() function should do the job.

1. Can we see how many requests something has gotten within a week, a day, or other time interval? We'd have to compare the counter at the beginning and at the end of the interval. It doesn't sound like that would be convenient.

The increase() function can be used in this case.

2. Counter totals are reset with each new process, so we'd have to persist the totals on the servers and somehow re-initialize them. It's not very PaaS-friendly.

The issue prometheus/prometheus#2473 gets at what I meant. There may be a way we can hack it though it seems like for that kind of analytics there might be a better tool we could consider.

Prometheus is not a long-term storage solution (https://dev.to/mhausenblas/revisiting-promcon-2018-panel-on-prometheus-long-term-storage-5f1p). Do we need data from a longer period of time?

@paulmelnikow
Member

Ah, nice, it seems like some combination of rate() and increase() accomplishes this.

Do we need data from longer period of time?

Longer than what? 😀

I think long-term data would be nice, and would make for interesting data to dig through, but probably isn't essential. A month seems like it could be adequate as a starting point.
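For example (sketch only - the metric name follows the counter discussed above and the Prometheus host is a placeholder): rate() gives a per-second request rate over a window and increase() an approximate total for an interval, and both tolerate counter resets. The queries can be typed into Grafana or run against Prometheus' HTTP API:

```
// Sketch: run rate()/increase() queries against the Prometheus HTTP API.
// Assumes Node 18+ (built-in fetch); prometheus.example.com is a placeholder.
const queries = [
  // Per-second request rate, averaged over the last 5 minutes.
  'rate(shields_badge_request_total[5m])',
  // Approximate number of requests over the last day.
  'increase(shields_badge_request_total[1d])',
]

async function run() {
  for (const query of queries) {
    const url =
      'http://prometheus.example.com:9090/api/v1/query?query=' +
      encodeURIComponent(query)
    const response = await fetch(url)
    const body = await response.json()
    console.log(query, JSON.stringify(body.data.result))
  }
}

run()
```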

@paulmelnikow
Member

Perhaps rather than update our legacy analytics for #1848, we could migrate them to Prometheus instead.

The other thing we're tracking is which template is being used. I'd also like to track which logos are being used (and what proportion of requests has a logo).

One possibility would be to use additional labels on the service-request badge. However I worry that will take up more memory on the server and bandwidth in the metric requests. If we have each logo and each template that increases the number of possible in-memory incrementers from one per service to ~10 per service. I'm not too familiar with how prom-client is implemented, but it's safe to assume it keeps all these labeled numbers in memory!

Since we don't really care how these metrics correlate with which service, would it be better to create separate metrics for these other two dimensions?
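A hypothetical sketch of the separate-metrics option - independent counters per dimension, so the number of label combinations grows additively (services + templates + logos) rather than multiplicatively (the metric names are illustrative):

```
// Hypothetical counters tracking template and logo usage independently of
// the per-service counter; metric names are illustrative.
const client = require('prom-client')

const templateRequests = new client.Counter({
  name: 'template_requests_total',
  help: 'Badge requests by template',
  labelNames: ['template'],
})

const logoRequests = new client.Counter({
  name: 'logo_requests_total',
  help: 'Badge requests by logo',
  labelNames: ['logo'],
})

// In the badge request handler (values are illustrative):
templateRequests.inc({ template: 'flat' })
logoRequests.inc({ logo: 'github' })
```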

paulmelnikow added a commit that referenced this issue Feb 27, 2019
This picks up #2068 by adding per-badge stats as discussed in #966.

It ensures every service has a unique `name` property. By default this comes from the class name, and is overridden in all the various places where the class names are duplicated. (Some of those don't seem that useful, like the various download interval services, though those need to be refactored down into a single service anyway.) Tests enforce the names are unique. These are the names used by the service-test runner, so it's a good idea to make them unique anyway. (It was sort of strange before that you had to specify `nuget` instead of e.g. `resharper`.)

I've added validation to `deprecatedService` and `redirector`, and required that every `route` has a `base`, even if it's an empty string.

The name is used to generate unique metric labels, generating metrics like these:

```
service_requests_total{category="activity",family="eclipse-marketplace",service="eclipse_marketplace_update"} 2
service_requests_total{category="activity",family="npm",service="npm_collaborators"} 3
service_requests_total{category="activity",family="steam",service="steam_file_release_date"} 2
service_requests_total{category="analysis",family="ansible",service="ansible_galaxy_content_quality_score"} 2
service_requests_total{category="analysis",family="cii-best-practices",service="cii_best_practices_service"} 4
service_requests_total{category="analysis",family="cocoapods",service="cocoapods_docs"} 2
service_requests_total{category="analysis",family="codacy",service="codacy_grade"} 3
service_requests_total{category="analysis",family="coverity",service="coverity_scan"} 2
service_requests_total{category="analysis",family="coverity",service="deprecated_coverity_ondemand"} 2
service_requests_total{category="analysis",family="dependabot",service="dependabot_semver_compatibility"} 3
service_requests_total{category="analysis",family="lgtm",service="lgtm_alerts"} 2
service_requests_total{category="analysis",family="lgtm",service="lgtm_grade"} 3
service_requests_total{category="analysis",family="snyk",service="snyk_vulnerability_git_hub"} 4
service_requests_total{category="analysis",family="snyk",service="snyk_vulnerability_npm"} 5
service_requests_total{category="analysis",family="symfony",service="sensiolabs_i_redirector"} 1
service_requests_total{category="analysis",family="symfony",service="symfony_insight_grade"} 1
service_requests_total{category="build",family="appveyor",service="app_veyor_ci"} 3
service_requests_total{category="build",family="appveyor",service="app_veyor_tests"} 6
service_requests_total{category="build",family="azure-devops",service="azure_dev_ops_build"} 6
service_requests_total{category="build",family="azure-devops",service="azure_dev_ops_release"} 5
service_requests_total{category="build",family="azure-devops",service="azure_dev_ops_tests"} 6
service_requests_total{category="build",family="azure-devops",service="vso_build_redirector"} 2
service_requests_total{category="build",family="azure-devops",service="vso_release_redirector"} 1
service_requests_total{category="build",family="bitbucket",service="bitbucket_pipelines"} 5
service_requests_total{category="build",family="circleci",service="circle_ci"} 5
```

This is predicated on being able to use Prometheus's [`rate()`](https://prometheus.io/docs/prometheus/latest/querying/functions/#rate) function to visualize a counter's rate of change, as mentioned at #2068 (comment). Otherwise the stats will be disrupted every time a server restarts.

The metrics only appear on new-style services.
paulmelnikow added a commit that referenced this issue Mar 8, 2019
We're getting good results from #3093, so there's no reason to keep maintaining this code.

Ref #1848 #2068
@paulmelnikow
Member

Seems like this can be closed now! We've got the analytics working well via #3093. Let's open a new issue for any follow-on work.

@platan
Member Author

platan commented May 4, 2019

https://github.com/platan/metrics-shields-io-config is an Ansible playbook which can be used to configure the monitoring for Shields.io (https://metrics.shields.io) and to create a new monitoring instance from scratch.

@paulmelnikow
Member
